linux-kernel.vger.kernel.org archive mirror
* [PATCH 6/6] x86, mce: Recognise machine check bank signature for data path error
  2011-12-13 19:05 [PATCH 0/6] x86, mce: machine check recovery for applications Tony Luck
@ 2011-12-08 22:49 ` Tony Luck
  2011-12-14 15:47   ` Borislav Petkov
  2011-12-12 21:06 ` [PATCH 4/6] x86, mce: Add mechanism to safely save information in MCE handler Tony Luck
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Tony Luck @ 2011-12-08 22:49 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Borislav Petkov, Huang, Ying, Hidetoshi Seto

The action-required data path signature is defined in table 15-19 of the SDM:

+-----------------------------------------------------------------------------+
| SRAR Error | Valid | OVER | UC | EN | MISCV | ADDRV | PCC | S | AR | MCACOD |
| Data Load  |     1 |    0 |  1 |  1 |     1 |     1 |   0 | 1 |  1 |  0x134 |
+-----------------------------------------------------------------------------+

Recognise this, and pass the MCE_AR_SEVERITY code back to do_machine_check().

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/mcheck/mce-severity.c |   14 +++++++++++++-
 1 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c
index 7395d5f..c4d8b24 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-severity.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c
@@ -54,6 +54,7 @@ static struct severity {
 #define  MASK(x, y)	.mask = x, .result = y
 #define MCI_UC_S (MCI_STATUS_UC|MCI_STATUS_S)
 #define MCI_UC_SAR (MCI_STATUS_UC|MCI_STATUS_S|MCI_STATUS_AR)
+#define	MCI_ADDR (MCI_STATUS_ADDRV|MCI_STATUS_MISCV)
 #define MCACOD 0xffff
 
 	MCESEV(
@@ -102,11 +103,22 @@ static struct severity {
 		SER, BITCLR(MCI_STATUS_S)
 		),
 
-	/* AR add known MCACODs here */
 	MCESEV(
 		PANIC, "Action required with lost events",
 		SER, BITSET(MCI_STATUS_OVER|MCI_UC_SAR)
 		),
+
+	/* known AR MCACODs: */
+	MCESEV(
+		KEEP, "HT thread notices Action required: data load error",
+		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|0x0134),
+		MCGMASK(MCG_STATUS_EIPV, 0)
+		),
+	MCESEV(
+		AR, "Action required: data load error",
+		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|0x0134),
+		USER
+		),
 	MCESEV(
 		PANIC, "Action required: unknown MCACOD",
 		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_SAR)
-- 
1.7.3.1
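
The mask/result rule added above can be tried outside the kernel. What follows is a minimal sketch, not the kernel code: the MCi_STATUS bit positions are restated here as an assumption (they follow the column order of the table in the commit message), and is_srar_data_load() is a hypothetical helper name. Like the rule in the patch, it does not test the Valid, EN or PCC bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed MCi_STATUS bit positions, in the order shown in the table. */
#define MCI_STATUS_VAL   (1ULL << 63)
#define MCI_STATUS_OVER  (1ULL << 62)
#define MCI_STATUS_UC    (1ULL << 61)
#define MCI_STATUS_EN    (1ULL << 60)
#define MCI_STATUS_MISCV (1ULL << 59)
#define MCI_STATUS_ADDRV (1ULL << 58)
#define MCI_STATUS_PCC   (1ULL << 57)
#define MCI_STATUS_S     (1ULL << 56)
#define MCI_STATUS_AR    (1ULL << 55)

/* Same composite masks as in the patch. */
#define MCI_UC_SAR (MCI_STATUS_UC | MCI_STATUS_S | MCI_STATUS_AR)
#define MCI_ADDR   (MCI_STATUS_ADDRV | MCI_STATUS_MISCV)
#define MCACOD     0xffffULL

/*
 * Hypothetical helper: true when the status word matches the
 * "Action required: data load error" rule, i.e. OVER clear, UC/S/AR set,
 * ADDRV/MISCV set and MCACOD equal to 0x0134.
 */
static bool is_srar_data_load(uint64_t status)
{
	const uint64_t mask = MCI_STATUS_OVER | MCI_UC_SAR | MCI_ADDR | MCACOD;
	const uint64_t want = MCI_UC_SAR | MCI_ADDR | 0x0134;

	return (status & mask) == want;
}
```

Because MCI_STATUS_OVER is in the mask but not in the expected result, a status word with the overflow bit set fails this rule and falls through to the "lost events" PANIC rule listed before it.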



* [PATCH 4/6] x86, mce: Add mechanism to safely save information in MCE handler
  2011-12-13 19:05 [PATCH 0/6] x86, mce: machine check recovery for applications Tony Luck
  2011-12-08 22:49 ` [PATCH 6/6] x86, mce: Recognise machine check bank signature for data path error Tony Luck
@ 2011-12-12 21:06 ` Tony Luck
  2011-12-14  7:52   ` Ingo Molnar
  2011-12-12 21:47 ` [PATCH 5/6] x86, mce: handle "action required" errors Tony Luck
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Tony Luck @ 2011-12-12 21:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Borislav Petkov, Huang, Ying, Hidetoshi Seto

Machine checks on Intel cpus interrupt execution on all cpus, regardless
of interrupt masking.  We have a need to save some data about the cause
of the machine check (physical address) in the machine check handler that
can be retrieved later to attempt recovery in a more flexible execution
state.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c |   51 ++++++++++++++++++++++++++++++++++++++
 1 files changed, 51 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 43f22c8..9b83b7d 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -887,6 +887,57 @@ static void mce_clear_state(unsigned long *toclear)
 }
 
 /*
+ * Need to save faulting physical address associated with a process
+ * in the machine check handler some place where we can grab it back
+ * later in mce_notify_process()
+ */
+#define	MAX_MCE_INFO	16
+struct mce_info {
+	atomic_t		inuse;
+	struct task_struct	*t;
+	__u64			paddr;
+} mce_info[MAX_MCE_INFO];
+
+static void mce_save_info(__u64 addr)
+{
+	int	i;
+
+	for (i = 0; i < MAX_MCE_INFO; i++)
+		if (atomic_cmpxchg(&mce_info[i].inuse, 0, 1) == 0) {
+			mce_info[i].t = current;
+			mce_info[i].paddr = addr;
+			return;
+		}
+
+	mce_panic("Too many concurrent recoverable errors", NULL, NULL);
+}
+
+static int mce_find_info(__u64 *paddr)
+{
+	int	i;
+
+	for (i = 0; i < MAX_MCE_INFO; i++)
+		if (atomic_read(&mce_info[i].inuse) &&
+		    mce_info[i].t == current) {
+			*paddr = mce_info[i].paddr;
+			return 1;
+		}
+	return 0;
+}
+
+static void mce_clear_info(void)
+{
+	int	i;
+
+	for (i = 0; i < MAX_MCE_INFO; i++)
+		if (atomic_read(&mce_info[i].inuse) &&
+		    mce_info[i].t == current) {
+			atomic_set(&mce_info[i].inuse, 0);
+			return;
+		}
+}
+
+/*
  * The actual machine check handler. This only handles real
  * exceptions when something got corrupted coming in through int 18.
  *
-- 
1.7.3.1
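
The claim/find/clear pattern in this patch can be sketched in user space with C11 atomics. This is an illustration under stated assumptions, not the kernel code: atomic_compare_exchange_strong() stands in for atomic_cmpxchg(), an integer owner id stands in for the task pointer, all slot_* names are hypothetical, and a full table returns -1 where the patch calls mce_panic().

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SLOTS 16

struct slot {
	atomic_int inuse;  /* 0 = free, 1 = claimed */
	int        owner;  /* stand-in for the task pointer */
	uint64_t   paddr;  /* saved physical address */
};

static struct slot slots[MAX_SLOTS];

/* Claim the first free slot without taking a lock; -1 when all are busy. */
static int slot_save(int owner, uint64_t paddr)
{
	for (size_t i = 0; i < MAX_SLOTS; i++) {
		struct slot *s = &slots[i];
		int expected = 0;

		/* Only one caller can flip inuse from 0 to 1 for a given slot. */
		if (atomic_compare_exchange_strong(&s->inuse, &expected, 1)) {
			s->owner = owner;
			s->paddr = paddr;
			return 0;
		}
	}
	return -1;
}

/* Look up the slot claimed by owner; returns 1 and fills *paddr on a hit. */
static int slot_find(int owner, uint64_t *paddr)
{
	for (size_t i = 0; i < MAX_SLOTS; i++) {
		struct slot *s = &slots[i];

		if (atomic_load(&s->inuse) && s->owner == owner) {
			*paddr = s->paddr;
			return 1;
		}
	}
	return 0;
}

/* Release the slot claimed by owner, if any. */
static void slot_clear(int owner)
{
	for (size_t i = 0; i < MAX_SLOTS; i++) {
		struct slot *s = &slots[i];

		if (atomic_load(&s->inuse) && s->owner == owner) {
			atomic_store(&s->inuse, 0);
			return;
		}
	}
}
```

The compare-and-exchange is what makes the claim safe without a lock, which matters in a context where sleeping or spinning on a lock is not an option.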



* [PATCH 5/6] x86, mce: handle "action required" errors
  2011-12-13 19:05 [PATCH 0/6] x86, mce: machine check recovery for applications Tony Luck
  2011-12-08 22:49 ` [PATCH 6/6] x86, mce: Recognise machine check bank signature for data path error Tony Luck
  2011-12-12 21:06 ` [PATCH 4/6] x86, mce: Add mechanism to safely save information in MCE handler Tony Luck
@ 2011-12-12 21:47 ` Tony Luck
  2011-12-14  9:28   ` Chen Gong
  2011-12-14 16:04   ` Borislav Petkov
  2011-12-13 17:24 ` [PATCH 1/6] HWPOISON: clean up memory_failure() vs. __memory_failure() Tony Luck
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 26+ messages in thread
From: Tony Luck @ 2011-12-12 21:47 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Borislav Petkov, Huang, Ying, Hidetoshi Seto

All non-urgent actions (reporting low severity errors and handling
"action-optional" errors) are now handled by a work queue. This
means that TIF_MCE_NOTIFY can be used to block execution for a
thread experiencing an "action-required" fault until we get all
cpus out of the machine check handler (and the thread that hit
the fault into mce_notify_process()).

We use the new mce_{save,find,clear}_info() API to get information
from do_machine_check() to mce_notify_process(), and then use the
newly improved memory_failure(..., MF_ACTION_REQUIRED) to handle
the error (possibly signalling the process).

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c |   64 ++++++++++++++++++++++---------------
 1 files changed, 38 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 9b83b7d..66e3bfb 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1044,12 +1044,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 			continue;
 		}
 
-		/*
-		 * Kill on action required.
-		 */
-		if (severity == MCE_AR_SEVERITY)
-			kill_it = 1;
-
 		mce_read_aux(&m, i);
 
 		/*
@@ -1070,6 +1064,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		}
 	}
 
+	m = *final;
+
 	if (!no_way_out)
 		mce_clear_state(toclear);
 
@@ -1088,7 +1084,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	 * support MCE broadcasting or it has been disabled.
 	 */
 	if (no_way_out && tolerant < 3)
-		mce_panic("Fatal machine check on current CPU", final, msg);
+		mce_panic("Fatal machine check on current CPU", &m, msg);
 
 	/*
 	 * If the error seems to be unrecoverable, something should be
@@ -1097,11 +1093,13 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	 * high, don't try to do anything at all.
 	 */
 
-	if (kill_it && tolerant < 3)
+	if (worst != MCE_AR_SEVERITY && kill_it && tolerant < 3)
 		force_sig(SIGBUS, current);
 
-	/* notify userspace ASAP */
-	set_thread_flag(TIF_MCE_NOTIFY);
+	if (worst == MCE_AR_SEVERITY) {
+		mce_save_info(m.addr);
+		set_thread_flag(TIF_MCE_NOTIFY);
+	}
 
 	if (worst > 0)
 		mce_report_event(regs);
@@ -1115,34 +1113,50 @@ EXPORT_SYMBOL_GPL(do_machine_check);
 #ifndef CONFIG_MEMORY_FAILURE
 int memory_failure(unsigned long pfn, int vector, int flags)
 {
-	printk(KERN_ERR "Action optional memory failure at %lx ignored\n", pfn);
+	if (flags & MF_ACTION_REQUIRED)
+		return -ENXIO; /* panic? */
+	else
+		printk(KERN_ERR "Action optional memory failure at %lx ignored\n", pfn);
 
 	return 0;
 }
 #endif
 
 /*
- * Called after mce notification in process context. This code
- * is allowed to sleep. Call the high level VM handler to process
- * any corrupted pages.
- * Assume that the work queue code only calls this one at a time
- * per CPU.
- * Note we don't disable preemption, so this code might run on the wrong
- * CPU. In this case the event is picked up by the scheduled work queue.
- * This is merely a fast path to expedite processing in some common
- * cases.
+ * Called in process context that was interrupted by MCE and marked with
+ * TIF_MCE_NOTIFY, just before returning to erroneous userland.
+ * This code is allowed to sleep.
+ * Attempt possible recovery such as calling the high level VM handler to
+ * process any corrupted pages, and kill/signal current process if required.
  */
 void mce_notify_process(void)
 {
+	__u64	paddr = paddr;
 	unsigned long pfn;
-	mce_notify_irq();
-	while (mce_ring_get(&pfn))
-		memory_failure(pfn, MCE_VECTOR, 0);
+
+	if (!mce_find_info(&paddr))
+		mce_panic("Lost address", NULL, NULL);
+	pfn = paddr >> PAGE_SHIFT;
+
+	clear_thread_flag(TIF_MCE_NOTIFY);
+
+	pr_err("Uncorrected hardware memory error in user-access at %llx\n",
+		 paddr);
+	if (memory_failure(pfn, MCE_VECTOR, MF_ACTION_REQUIRED) < 0) {
+		pr_err("Memory error not recovered\n");
+		force_sig(SIGBUS, current);
+	} else {
+		pr_err("Memory error recovered\n");
+		mce_clear_info();
+	}
 }
 
 static void mce_process_work(struct work_struct *dummy)
 {
-	mce_notify_process();
+	unsigned long pfn;
+
+	while (mce_ring_get(&pfn))
+		memory_failure(pfn, MCE_VECTOR, 0);
 }
 
 #ifdef CONFIG_X86_MCE_INTEL
@@ -1232,8 +1246,6 @@ int mce_notify_irq(void)
 	/* Not more than two messages every minute */
 	static DEFINE_RATELIMIT_STATE(ratelimit, 60*HZ, 2);
 
-	clear_thread_flag(TIF_MCE_NOTIFY);
-
 	if (test_and_clear_bit(0, &mce_need_notify)) {
 		/* wake processes polling /dev/mcelog */
 		wake_up_interruptible(&mce_chrdev_wait);
-- 
1.7.3.1
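
The decision made in the new mce_notify_process() reduces to two small steps: convert the saved physical address to a page frame number, then pick an outcome from the lookup result and the handler's return value. A minimal sketch, assuming 4 KiB pages (PAGE_SHIFT of 12); the enum and both function names are hypothetical and only model the control flow.

```c
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB pages */

enum notify_outcome {
	NOTIFY_PANIC,     /* no saved address was found for this task */
	NOTIFY_SIGBUS,    /* the memory failure handler reported an error */
	NOTIFY_RECOVERED  /* the handler succeeded; the saved slot is freed */
};

/* Page frame number containing the given physical address. */
static uint64_t paddr_to_pfn(uint64_t paddr)
{
	return paddr >> PAGE_SHIFT;
}

/*
 * have_info: non-zero if a saved address was found for the current task.
 * mf_ret:    return value of the memory failure handler (negative = error).
 */
static enum notify_outcome notify_decide(int have_info, int mf_ret)
{
	if (!have_info)
		return NOTIFY_PANIC;
	if (mf_ret < 0)
		return NOTIFY_SIGBUS;
	return NOTIFY_RECOVERED;
}
```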



* [PATCH 1/6] HWPOISON: clean up memory_failure() vs. __memory_failure()
  2011-12-13 19:05 [PATCH 0/6] x86, mce: machine check recovery for applications Tony Luck
                   ` (2 preceding siblings ...)
  2011-12-12 21:47 ` [PATCH 5/6] x86, mce: handle "action required" errors Tony Luck
@ 2011-12-13 17:24 ` Tony Luck
  2011-12-14  7:47   ` Ingo Molnar
  2011-12-13 17:27 ` [PATCH 2/6] HWPOISON: Add code to handle "action required" errors Tony Luck
  2011-12-13 17:48 ` [PATCH 3/6] x86, mce: create helper function to save addr/misc when needed Tony Luck
  5 siblings, 1 reply; 26+ messages in thread
From: Tony Luck @ 2011-12-13 17:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Borislav Petkov, Huang, Ying, Hidetoshi Seto

There is only one caller of memory_failure(); all other users call
__memory_failure() and pass in the flags argument explicitly. The
lone user of memory_failure() will soon need to pass flags too.

Add flags argument to the callsite in mce.c. Delete the old memory_failure()
function, and then rename __memory_failure() without the leading "__".

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c |    9 +++++--
 drivers/base/memory.c            |    2 +-
 include/linux/mm.h               |    3 +-
 mm/hwpoison-inject.c             |    4 +-
 mm/madvise.c                     |    2 +-
 mm/memory-failure.c              |   46 +++++++++++++++++--------------------
 6 files changed, 32 insertions(+), 34 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 2af127d..265139d 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1046,11 +1046,14 @@ out:
 }
 EXPORT_SYMBOL_GPL(do_machine_check);
 
-/* dummy to break dependency. actual code is in mm/memory-failure.c */
-void __attribute__((weak)) memory_failure(unsigned long pfn, int vector)
+#ifndef CONFIG_MEMORY_FAILURE
+int memory_failure(unsigned long pfn, int vector, int flags)
 {
 	printk(KERN_ERR "Action optional memory failure at %lx ignored\n", pfn);
+
+	return 0;
 }
+#endif
 
 /*
  * Called after mce notification in process context. This code
@@ -1068,7 +1071,7 @@ void mce_notify_process(void)
 	unsigned long pfn;
 	mce_notify_irq();
 	while (mce_ring_get(&pfn))
-		memory_failure(pfn, MCE_VECTOR);
+		memory_failure(pfn, MCE_VECTOR, 0);
 }
 
 static void mce_process_work(struct work_struct *dummy)
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 8272d92..9a92444 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -474,7 +474,7 @@ store_hard_offline_page(struct class *class,
 	if (strict_strtoull(buf, 0, &pfn) < 0)
 		return -EINVAL;
 	pfn >>= PAGE_SHIFT;
-	ret = __memory_failure(pfn, 0, 0);
+	ret = memory_failure(pfn, 0, 0);
 	return ret ? ret : count;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4baadd1..bcc5234 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1607,8 +1607,7 @@ void vmemmap_populate_print_last(void);
 enum mf_flags {
 	MF_COUNT_INCREASED = 1 << 0,
 };
-extern void memory_failure(unsigned long pfn, int trapno);
-extern int __memory_failure(unsigned long pfn, int trapno, int flags);
+extern int memory_failure(unsigned long pfn, int trapno, int flags);
 extern void memory_failure_queue(unsigned long pfn, int trapno, int flags);
 extern int unpoison_memory(unsigned long pfn);
 extern int sysctl_memory_failure_early_kill;
diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index c7fc7fd..cc448bb 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -45,7 +45,7 @@ static int hwpoison_inject(void *data, u64 val)
 	 * do a racy check with elevated page count, to make sure PG_hwpoison
 	 * will only be set for the targeted owner (or on a free page).
 	 * We temporarily take page lock for try_get_mem_cgroup_from_page().
-	 * __memory_failure() will redo the check reliably inside page lock.
+	 * memory_failure() will redo the check reliably inside page lock.
 	 */
 	lock_page(hpage);
 	err = hwpoison_filter(hpage);
@@ -55,7 +55,7 @@ static int hwpoison_inject(void *data, u64 val)
 
 inject:
 	printk(KERN_INFO "Injecting memory failure at pfn %lx\n", pfn);
-	return __memory_failure(pfn, 18, MF_COUNT_INCREASED);
+	return memory_failure(pfn, 18, MF_COUNT_INCREASED);
 }
 
 static int hwpoison_unpoison(void *data, u64 val)
diff --git a/mm/madvise.c b/mm/madvise.c
index 74bf193..f5ab745 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -251,7 +251,7 @@ static int madvise_hwpoison(int bhv, unsigned long start, unsigned long end)
 		printk(KERN_INFO "Injecting memory failure for page %lx at %lx\n",
 		       page_to_pfn(p), start);
 		/* Ignore return value for now */
-		__memory_failure(page_to_pfn(p), 0, MF_COUNT_INCREASED);
+		memory_failure(page_to_pfn(p), 0, MF_COUNT_INCREASED);
 	}
 	return ret;
 }
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 06d3479..ab259bb 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -984,7 +984,25 @@ static void clear_page_hwpoison_huge_page(struct page *hpage)
 		ClearPageHWPoison(hpage + i);
 }
 
-int __memory_failure(unsigned long pfn, int trapno, int flags)
+/**
+ * memory_failure - Handle memory failure of a page.
+ * @pfn: Page Number of the corrupted page
+ * @trapno: Trap number reported in the signal to user space.
+ * @flags: fine tune action taken
+ *
+ * This function is called by the low level machine check code
+ * of an architecture when it detects hardware memory corruption
+ * of a page. It tries its best to recover, which includes
+ * dropping pages, killing processes etc.
+ *
+ * The function is primarily of use for corruptions that
+ * happen outside the current execution context (e.g. when
+ * detected by a background scrubber)
+ *
+ * Must run in process context (e.g. a work queue) with interrupts
+ * enabled and no spinlocks held.
+ */
+int memory_failure(unsigned long pfn, int trapno, int flags)
 {
 	struct page_state *ps;
 	struct page *p;
@@ -1156,29 +1174,7 @@ out:
 	unlock_page(hpage);
 	return res;
 }
-EXPORT_SYMBOL_GPL(__memory_failure);
-
-/**
- * memory_failure - Handle memory failure of a page.
- * @pfn: Page Number of the corrupted page
- * @trapno: Trap number reported in the signal to user space.
- *
- * This function is called by the low level machine check code
- * of an architecture when it detects hardware memory corruption
- * of a page. It tries its best to recover, which includes
- * dropping pages, killing processes etc.
- *
- * The function is primarily of use for corruptions that
- * happen outside the current execution context (e.g. when
- * detected by a background scrubber)
- *
- * Must run in process context (e.g. a work queue) with interrupts
- * enabled and no spinlocks hold.
- */
-void memory_failure(unsigned long pfn, int trapno)
-{
-	__memory_failure(pfn, trapno, 0);
-}
+EXPORT_SYMBOL_GPL(memory_failure);
 
 #define MEMORY_FAILURE_FIFO_ORDER	4
 #define MEMORY_FAILURE_FIFO_SIZE	(1 << MEMORY_FAILURE_FIFO_ORDER)
@@ -1251,7 +1247,7 @@ static void memory_failure_work_func(struct work_struct *work)
 		spin_unlock_irqrestore(&mf_cpu->lock, proc_flags);
 		if (!gotten)
 			break;
-		__memory_failure(entry.pfn, entry.trapno, entry.flags);
+		memory_failure(entry.pfn, entry.trapno, entry.flags);
 	}
 }
 
-- 
1.7.3.1



* [PATCH 2/6] HWPOISON: Add code to handle "action required" errors.
  2011-12-13 19:05 [PATCH 0/6] x86, mce: machine check recovery for applications Tony Luck
                   ` (3 preceding siblings ...)
  2011-12-13 17:24 ` [PATCH 1/6] HWPOISON: clean up memory_failure() vs. __memory_failure() Tony Luck
@ 2011-12-13 17:27 ` Tony Luck
  2011-12-13 17:48 ` [PATCH 3/6] x86, mce: create helper function to save addr/misc when needed Tony Luck
  5 siblings, 0 replies; 26+ messages in thread
From: Tony Luck @ 2011-12-13 17:27 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Borislav Petkov, Huang, Ying, Hidetoshi Seto

Add new flag bit "MF_ACTION_REQUIRED" to be used by machine check
code to force a signal with si_code = BUS_MCEERR_AR in the case
where the error occurs in processor execution context. Pass the
flags argument along call chain:
	memory_failure()
	  hwpoison_user_mappings()
	    kill_procs()
	      kill_proc()

Drop the "_ao" suffix from kill_procs_ao() and kill_proc_ao() since
they can now handle "action required" as well as "action optional" errors.

Acked-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/mm.h  |    1 +
 mm/memory-failure.c |   50 +++++++++++++++++++++++++++++---------------------
 2 files changed, 30 insertions(+), 21 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bcc5234..bf169ca 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1606,6 +1606,7 @@ void vmemmap_populate_print_last(void);
 
 enum mf_flags {
 	MF_COUNT_INCREASED = 1 << 0,
+	MF_ACTION_REQUIRED = 1 << 1,
 };
 extern int memory_failure(unsigned long pfn, int trapno, int flags);
 extern void memory_failure_queue(unsigned long pfn, int trapno, int flags);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ab259bb..95fd307 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -187,33 +187,40 @@ int hwpoison_filter(struct page *p)
 EXPORT_SYMBOL_GPL(hwpoison_filter);
 
 /*
- * Send all the processes who have the page mapped an ``action optional''
- * signal.
+ * Send all the processes who have the page mapped a signal.
+ * ``action optional'' if they are not immediately affected by the error
+ * ``action required'' if error happened in current execution context
  */
-static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno,
-			unsigned long pfn, struct page *page)
+static int kill_proc(struct task_struct *t, unsigned long addr, int trapno,
+			unsigned long pfn, struct page *page, int flags)
 {
 	struct siginfo si;
 	int ret;
 
 	printk(KERN_ERR
-		"MCE %#lx: Killing %s:%d early due to hardware memory corruption\n",
+		"MCE %#lx: Killing %s:%d due to hardware memory corruption\n",
 		pfn, t->comm, t->pid);
 	si.si_signo = SIGBUS;
 	si.si_errno = 0;
-	si.si_code = BUS_MCEERR_AO;
 	si.si_addr = (void *)addr;
 #ifdef __ARCH_SI_TRAPNO
 	si.si_trapno = trapno;
 #endif
 	si.si_addr_lsb = compound_trans_order(compound_head(page)) + PAGE_SHIFT;
-	/*
-	 * Don't use force here, it's convenient if the signal
-	 * can be temporarily blocked.
-	 * This could cause a loop when the user sets SIGBUS
-	 * to SIG_IGN, but hopefully no one will do that?
-	 */
-	ret = send_sig_info(SIGBUS, &si, t);  /* synchronous? */
+
+	if ((flags & MF_ACTION_REQUIRED) && t == current) {
+		si.si_code = BUS_MCEERR_AR;
+		ret = force_sig_info(SIGBUS, &si, t);
+	} else {
+		/*
+		 * Don't use force here, it's convenient if the signal
+		 * can be temporarily blocked.
+		 * This could cause a loop when the user sets SIGBUS
+		 * to SIG_IGN, but hopefully no one will do that?
+		 */
+		si.si_code = BUS_MCEERR_AO;
+		ret = send_sig_info(SIGBUS, &si, t);  /* synchronous? */
+	}
 	if (ret < 0)
 		printk(KERN_INFO "MCE: Error sending signal to %s:%d: %d\n",
 		       t->comm, t->pid, ret);
@@ -338,8 +345,9 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
  * Also when FAIL is set do a force kill because something went
  * wrong earlier.
  */
-static void kill_procs_ao(struct list_head *to_kill, int doit, int trapno,
-			  int fail, struct page *page, unsigned long pfn)
+static void kill_procs(struct list_head *to_kill, int doit, int trapno,
+			  int fail, struct page *page, unsigned long pfn,
+			  int flags)
 {
 	struct to_kill *tk, *next;
 
@@ -363,8 +371,8 @@ static void kill_procs_ao(struct list_head *to_kill, int doit, int trapno,
 			 * check for that, but we need to tell the
 			 * process anyways.
 			 */
-			else if (kill_proc_ao(tk->tsk, tk->addr, trapno,
-					      pfn, page) < 0)
+			else if (kill_proc(tk->tsk, tk->addr, trapno,
+					      pfn, page, flags) < 0)
 				printk(KERN_ERR
 		"MCE %#lx: Cannot send advisory machine check signal to %s:%d\n",
 					pfn, tk->tsk->comm, tk->tsk->pid);
@@ -844,7 +852,7 @@ static int page_action(struct page_state *ps, struct page *p,
  * the pages and send SIGBUS to the processes if the data was dirty.
  */
 static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
-				  int trapno)
+				  int trapno, int flags)
 {
 	enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
 	struct address_space *mapping;
@@ -962,8 +970,8 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	 * use a more force-full uncatchable kill to prevent
 	 * any accesses to the poisoned memory.
 	 */
-	kill_procs_ao(&tokill, !!PageDirty(ppage), trapno,
-		      ret != SWAP_SUCCESS, p, pfn);
+	kill_procs(&tokill, !!PageDirty(ppage), trapno,
+		      ret != SWAP_SUCCESS, p, pfn, flags);
 
 	return ret;
 }
@@ -1148,7 +1156,7 @@ int memory_failure(unsigned long pfn, int trapno, int flags)
 	 * Now take care of user space mappings.
 	 * Abort on fail: __delete_from_page_cache() assumes unmapped page.
 	 */
-	if (hwpoison_user_mappings(p, pfn, trapno) != SWAP_SUCCESS) {
+	if (hwpoison_user_mappings(p, pfn, trapno, flags) != SWAP_SUCCESS) {
 		printk(KERN_ERR "MCE %#lx: cannot unmap page, give up\n", pfn);
 		res = -EBUSY;
 		goto out;
-- 
1.7.3.1



* [PATCH 3/6] x86, mce: create helper function to save addr/misc when needed
  2011-12-13 19:05 [PATCH 0/6] x86, mce: machine check recovery for applications Tony Luck
                   ` (4 preceding siblings ...)
  2011-12-13 17:27 ` [PATCH 2/6] HWPOISON: Add code to handle "action required" errors Tony Luck
@ 2011-12-13 17:48 ` Tony Luck
  2011-12-16  0:13   ` Hidetoshi Seto
  5 siblings, 1 reply; 26+ messages in thread
From: Tony Luck @ 2011-12-13 17:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Borislav Petkov, Huang, Ying, Hidetoshi Seto

The MCI_STATUS_MISCV and MCI_STATUS_ADDRV bits in the bank status
registers define whether the MISC and ADDR registers respectively
contain valid data - provide a helper function to check these bits
and read the registers when needed.

In addition, processors that support software error recovery (as
indicated by the MCG_SER_P bit in the MCG_CAP register) may include
some undefined bits in the ADDR register - mask these out.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c |   31 +++++++++++++++++++++++--------
 1 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 265139d..43f22c8 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -492,6 +492,27 @@ static void mce_report_event(struct pt_regs *regs)
 	irq_work_queue(&__get_cpu_var(mce_irq_work));
 }
 
+/*
+ * Read ADDR and MISC registers.
+ */
+static void mce_read_aux(struct mce *m, int i)
+{
+	if (m->status & MCI_STATUS_MISCV)
+		m->misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
+	if (m->status & MCI_STATUS_ADDRV) {
+		m->addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
+
+		/*
+		 * Mask the reported address by the reported granularity.
+		 */
+		if (mce_ser && (m->status & MCI_STATUS_MISCV)) {
+			u8 shift = m->misc & 0x3f;
+			m->addr >>= shift;
+			m->addr <<= shift;
+		}
+	}
+}
+
 DEFINE_PER_CPU(unsigned, mce_poll_count);
 
 /*
@@ -542,10 +563,7 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 		    (m.status & (mce_ser ? MCI_STATUS_S : MCI_STATUS_UC)))
 			continue;
 
-		if (m.status & MCI_STATUS_MISCV)
-			m.misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
-		if (m.status & MCI_STATUS_ADDRV)
-			m.addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
+		mce_read_aux(&m, i);
 
 		if (!(flags & MCP_TIMESTAMP))
 			m.tsc = 0;
@@ -981,10 +999,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		if (severity == MCE_AR_SEVERITY)
 			kill_it = 1;
 
-		if (m.status & MCI_STATUS_MISCV)
-			m.misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
-		if (m.status & MCI_STATUS_ADDRV)
-			m.addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
+		mce_read_aux(&m, i);
 
 		/*
 		 * Action optional error. Queue address for later processing.
-- 
1.7.3.1
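
The address masking in mce_read_aux() is a shift-right, shift-left pair that clears the low bits below the reported granularity. A standalone sketch of just that arithmetic, with a hypothetical function name; the low six bits of the MISC value are taken as the shift, as in the patch.

```c
#include <stdint.h>

/*
 * Clear the address bits below the granularity reported in the low
 * six bits of misc. Shifting right then left by the same amount
 * zeroes exactly those bits.
 */
static uint64_t mce_mask_addr(uint64_t addr, uint64_t misc)
{
	unsigned int shift = misc & 0x3f; /* at most 63, so the shift is defined */

	addr >>= shift;
	addr <<= shift;
	return addr;
}
```

For example, with a shift of 12 the address 0x12345678 becomes 0x12345000, and with a shift of 0 the address is returned unchanged.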



* [PATCH 0/6] x86, mce: machine check recovery for applications
@ 2011-12-13 19:05 Tony Luck
  2011-12-08 22:49 ` [PATCH 6/6] x86, mce: Recognise machine check bank signature for data path error Tony Luck
                   ` (5 more replies)
  0 siblings, 6 replies; 26+ messages in thread
From: Tony Luck @ 2011-12-13 19:05 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Borislav Petkov, Huang, Ying, Hidetoshi Seto

Yet another version ...

Some bits should look familiar (hopefully pieces that were not too controversial
from earlier versions). Other bits are all new (e.g. part 4/6 which sets up some
functions that can safely save away the physical address of the fault
in the machine check handler for later retrieval in a safer execution context).

Tony Luck (6):
  HWPOISON: clean up memory_failure() vs. __memory_failure()
  HWPOISON: Add code to handle "action required" errors.
  x86, mce: create helper function to save addr/misc when needed
  x86, mce: Add mechanism to safely save information in MCE handler
  x86, mce: handle "action required" errors
  x86, mce: Recognise machine check bank signature for data path error

 arch/x86/kernel/cpu/mcheck/mce-severity.c |   14 +++-
 arch/x86/kernel/cpu/mcheck/mce.c          |  153 ++++++++++++++++++++++-------
 drivers/base/memory.c                     |    2 +-
 include/linux/mm.h                        |    4 +-
 mm/hwpoison-inject.c                      |    4 +-
 mm/madvise.c                              |    2 +-
 mm/memory-failure.c                       |   96 ++++++++++---------
 7 files changed, 186 insertions(+), 89 deletions(-)

-- 
1.7.3.1



* Re: [PATCH 1/6] HWPOISON: clean up memory_failure() vs. __memory_failure()
  2011-12-13 17:24 ` [PATCH 1/6] HWPOISON: clean up memory_failure() vs. __memory_failure() Tony Luck
@ 2011-12-14  7:47   ` Ingo Molnar
  2011-12-14 16:07     ` Borislav Petkov
  0 siblings, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2011-12-14  7:47 UTC (permalink / raw)
  To: Tony Luck; +Cc: linux-kernel, Borislav Petkov, Huang, Ying, Hidetoshi Seto


* Tony Luck <tony.luck@intel.com> wrote:

> There is only one caller of memory_failure(); all other users call
> __memory_failure() and pass in the flags argument explicitly. The
> lone user of memory_failure() will soon need to pass flags too.
> 
> Add flags argument to the callsite in mce.c. Delete the old memory_failure()
> function, and then rename __memory_failure() without the leading "__".
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/kernel/cpu/mcheck/mce.c |    9 +++++--
>  drivers/base/memory.c            |    2 +-
>  include/linux/mm.h               |    3 +-
>  mm/hwpoison-inject.c             |    4 +-
>  mm/madvise.c                     |    2 +-
>  mm/memory-failure.c              |   46 +++++++++++++++++--------------------
>  6 files changed, 32 insertions(+), 34 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 2af127d..265139d 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -1046,11 +1046,14 @@ out:
>  }
>  EXPORT_SYMBOL_GPL(do_machine_check);
>  
> -/* dummy to break dependency. actual code is in mm/memory-failure.c */
> -void __attribute__((weak)) memory_failure(unsigned long pfn, int vector)
> +#ifndef CONFIG_MEMORY_FAILURE
> +int memory_failure(unsigned long pfn, int vector, int flags)
>  {
>  	printk(KERN_ERR "Action optional memory failure at %lx ignored\n", pfn);

Btw., while at it, could we phrase this message in a more 
obvious way to users, such as 'Non-fatal memory failure at %lx 
ignored'?

Thanks,

	Ingo


* Re: [PATCH 4/6] x86, mce: Add mechanism to safely save information in MCE handler
  2011-12-12 21:06 ` [PATCH 4/6] x86, mce: Add mechanism to safely save information in MCE handler Tony Luck
@ 2011-12-14  7:52   ` Ingo Molnar
  0 siblings, 0 replies; 26+ messages in thread
From: Ingo Molnar @ 2011-12-14  7:52 UTC (permalink / raw)
  To: Tony Luck; +Cc: linux-kernel, Borislav Petkov, Huang, Ying, Hidetoshi Seto


* Tony Luck <tony.luck@intel.com> wrote:

> Machine checks on Intel cpus interrupt execution on all cpus, regardless
> of interrupt masking.  We have a need to save some data about the cause
> of the machine check (physical address) in the machine check handler that
> can be retrieved later to attempt recovery in a more flexible execution
> state.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/kernel/cpu/mcheck/mce.c |   51 ++++++++++++++++++++++++++++++++++++++
>  1 files changed, 51 insertions(+), 0 deletions(-)

Just some cleanliness nits:

> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 43f22c8..9b83b7d 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -887,6 +887,57 @@ static void mce_clear_state(unsigned long *toclear)
>  }
>  
>  /*
> + * Need to save faulting physical address associated with a process
> + * in the machine check handler some place where we can grab it back
> + * later in mce_notify_process()
> + */
> +#define	MAX_MCE_INFO	16
> +struct mce_info {

Please separate non-bulk definitions by newlines.

> +	atomic_t		inuse;
> +	struct task_struct	*t;
> +	__u64			paddr;
> +} mce_info[MAX_MCE_INFO];
> +
> +static void mce_save_info(__u64 addr)
> +{
> +	int	i;

That tab looks weird. [There are repeat occurrences further below
as well.]

> +
> +	for (i = 0; i < MAX_MCE_INFO; i++)
> +		if (atomic_cmpxchg(&mce_info[i].inuse, 0, 1) == 0) {
> +			mce_info[i].t = current;
> +			mce_info[i].paddr = addr;
> +			return;
> +		}

We typically use curly braces for all multi-line statements - so 
two would be needed above.

> +
> +	mce_panic("Too many concurrent recoverable errors", NULL, NULL);
> +}
> +
> +static int mce_find_info(__u64 *paddr)
> +{
> +	int	i;
> +
> +	for (i = 0; i < MAX_MCE_INFO; i++)
> +		if (atomic_read(&mce_info[i].inuse) &&
> +		    mce_info[i].t == current) {
> +			*paddr = mce_info[i].paddr;
> +			return 1;
> +		}
> +	return 0;
> +}
> +
> +static void mce_clear_info(void)
> +{
> +	int	i;
> +
> +	for (i = 0; i < MAX_MCE_INFO; i++)
> +		if (atomic_read(&mce_info[i].inuse) &&
> +		    mce_info[i].t == current) {

The line break shows that the code has complexity troubles. Doing
this at the top of the loop body:

	struct mce_info *mi = mce_info + i;

would help make it shorter and more readable.
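
A minimal userspace sketch of how the two nits above (braces around the multi-line loop body, plus a per-iteration pointer) might look together. This is not the patch itself: C11 atomics stand in for atomic_cmpxchg()/atomic_read(), the caller passes a task pointer instead of using current, and mce_save_info() returns -1 where the patch calls mce_panic(). Those substitutions are assumptions made only so the fragment compiles outside the kernel.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MCE_INFO	16

struct mce_info {
	atomic_int	inuse;
	const void	*t;	/* stand-in for struct task_struct * */
	uint64_t	paddr;
};

static struct mce_info mce_info[MAX_MCE_INFO];

/* Claim a free slot; returns 0 on success, -1 if all slots are busy. */
static int mce_save_info(const void *task, uint64_t addr)
{
	int i;

	for (i = 0; i < MAX_MCE_INFO; i++) {
		struct mce_info *mi = mce_info + i;
		int expected = 0;

		/* Succeeds only if inuse was 0, mirroring cmpxchg(.., 0, 1) == 0 */
		if (atomic_compare_exchange_strong(&mi->inuse, &expected, 1)) {
			mi->t = task;
			mi->paddr = addr;
			return 0;
		}
	}
	return -1;
}

/* Look up the slot owned by 'task'; returns 1 and fills *paddr if found. */
static int mce_find_info(const void *task, uint64_t *paddr)
{
	int i;

	for (i = 0; i < MAX_MCE_INFO; i++) {
		struct mce_info *mi = mce_info + i;

		if (atomic_load(&mi->inuse) && mi->t == task) {
			*paddr = mi->paddr;
			return 1;
		}
	}
	return 0;
}
```

With the local pointer, the condition in the lookup fits on one line, which is the readability gain being asked for.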

Thanks,

	Ingo

* Re: [PATCH 5/6] x86, mce: handle "action required" errors
  2011-12-12 21:47 ` [PATCH 5/6] x86, mce: handle "action required" errors Tony Luck
@ 2011-12-14  9:28   ` Chen Gong
  2011-12-14 21:30     ` Tony Luck
  2011-12-14 16:04   ` Borislav Petkov
  1 sibling, 1 reply; 26+ messages in thread
From: Chen Gong @ 2011-12-14  9:28 UTC (permalink / raw)
  To: Tony Luck
  Cc: linux-kernel, Ingo Molnar, Borislav Petkov, Huang, Ying,
	Hidetoshi Seto

On 2011/12/13 5:47, Tony Luck wrote:
> All non-urgent actions (reporting low severity errors and handling
> "action-optional" errors) are now handled by a work queue. This
> means that TIF_MCE_NOTIFY can be used to block execution for a
> thread experiencing an "action-required" fault until we get all
> cpus out of the machine check handler (and the thread that hit
> the fault into mce_notify_process().
>
> We use the new mce_{save,find,clear}_info() API to get information
> from do_machine_check() to mce_notify_process(), and then use the
> newly improved memory_failure(..., MF_ACTION_REQUIRED) to handle
> the error (possibly signalling the process).
>
> Signed-off-by: Tony Luck<tony.luck@intel.com>
> ---
>   arch/x86/kernel/cpu/mcheck/mce.c |   64 ++++++++++++++++++++++---------------
>   1 files changed, 38 insertions(+), 26 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 9b83b7d..66e3bfb 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -1044,12 +1044,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>   			continue;
>   		}
>
> -		/*
> -		 * Kill on action required.
> -		 */
> -		if (severity == MCE_AR_SEVERITY)
> -			kill_it = 1;
> -
>   		mce_read_aux(&m, i);
>
>   		/*
> @@ -1070,6 +1064,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>   		}
>   	}
>
> +	m = *final;
> +
>   	if (!no_way_out)
>   		mce_clear_state(toclear);
>
> @@ -1088,7 +1084,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>   	 * support MCE broadcasting or it has been disabled.
>   	 */
>   	if (no_way_out&&  tolerant<  3)
> -		mce_panic("Fatal machine check on current CPU", final, msg);
> +		mce_panic("Fatal machine check on current CPU",&m, msg);
>
>   	/*
>   	 * If the error seems to be unrecoverable, something should be
> @@ -1097,11 +1093,13 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>   	 * high, don't try to do anything at all.
>   	 */
>
> -	if (kill_it&&  tolerant<  3)
> +	if (worst != MCE_AR_SEVERITY&&  kill_it&&  tolerant<  3)
>   		force_sig(SIGBUS, current);

I think a comment should be added here to clarify why the *AR* case is not
killed, such as: "for SRAR errors, such as DCU/IFU errors, on affected logical
processors, it is reasonable that RIPV is 0."

>
> -	/* notify userspace ASAP */
> -	set_thread_flag(TIF_MCE_NOTIFY);
> +	if (worst == MCE_AR_SEVERITY) {

how about adding one more condition check: mce_usable_address(&m) here?

> +		mce_save_info(m.addr);
> +		set_thread_flag(TIF_MCE_NOTIFY);

Here only SRAR errors are flagged with TIF_MCE_NOTIFY, which means only SRAR
errors are handled in do_notify_resume(), while SRAO errors will only be
handled in the mce_work work queue. If so, I think some related function
names should be updated too. Otherwise, they will confuse people who have
not touched this code before.

> +	}
>
>   	if (worst>  0)
>   		mce_report_event(regs);
> @@ -1115,34 +1113,50 @@ EXPORT_SYMBOL_GPL(do_machine_check);
>   #ifndef CONFIG_MEMORY_FAILURE
>   int memory_failure(unsigned long pfn, int vector, int flags)
>   {
> -	printk(KERN_ERR "Action optional memory failure at %lx ignored\n", pfn);
> +	if (flags&  MF_ACTION_REQUIRED)
> +		return -ENXIO; /* panic? */
> +	else
> +		printk(KERN_ERR "Action optional memory failure at %lx ignored\n", pfn);
>
>   	return 0;
>   }
>   #endif
>
>   /*
> - * Called after mce notification in process context. This code
> - * is allowed to sleep. Call the high level VM handler to process
> - * any corrupted pages.
> - * Assume that the work queue code only calls this one at a time
> - * per CPU.
> - * Note we don't disable preemption, so this code might run on the wrong
> - * CPU. In this case the event is picked up by the scheduled work queue.
> - * This is merely a fast path to expedite processing in some common
> - * cases.
> + * Called in process context that interrupted by MCE and marked with
> + * TIF_MCE_NOTFY, just before returning to errorneous userland.
> + * This code is allowed to sleep.
> + * Attempt possible recovery such as calling the high level VM handler to
> + * process any corrupted pages, and kill/signal current process if required.
>    */
>   void mce_notify_process(void)
>   {
> +	__u64	paddr = paddr;

you mean "__u64	paddr = 0;"?

>   	unsigned long pfn;
> -	mce_notify_irq();
> -	while (mce_ring_get(&pfn))
> -		memory_failure(pfn, MCE_VECTOR, 0);
> +
> +	if (!mce_find_info(&paddr))
> +		mce_panic("Lost address", NULL, NULL);
> +	pfn = paddr>>  PAGE_SHIFT;
> +
> +	clear_thread_flag(TIF_MCE_NOTIFY);
> +
> +	pr_err("Uncorrected hardware memory error in user-access at %llx",
> +		 paddr);
> +	if (memory_failure(pfn, MCE_VECTOR, MF_ACTION_REQUIRED)<  0) {
> +		pr_err("Memory error not recovered");
> +		force_sig(SIGBUS, current);
> +	} else {
> +		pr_err("Memory error recovered");
> +		mce_clear_info();
> +	}
>   }

Is it possible for more than one error to be triggered in the same process?
If so, maybe mce_find_info/mce_clear_info should be changed to loop over all
entries, because TIF_MCE_NOTIFY is cleared in the handler here.

Or is that impossible, because an overwritten error is already covered by the
following condition:

  	MCESEV(
  		PANIC, "Action required with lost events",
  		SER, BITSET(MCI_STATUS_OVER|MCI_UC_SAR)
  		),

>
>   static void mce_process_work(struct work_struct *dummy)
>   {
> -	mce_notify_process();
> +	unsigned long pfn;
> +
> +	while (mce_ring_get(&pfn))
> +		memory_failure(pfn, MCE_VECTOR, 0);
>   }
>
>   #ifdef CONFIG_X86_MCE_INTEL
> @@ -1232,8 +1246,6 @@ int mce_notify_irq(void)
>   	/* Not more than two messages every minute */
>   	static DEFINE_RATELIMIT_STATE(ratelimit, 60*HZ, 2);
>
> -	clear_thread_flag(TIF_MCE_NOTIFY);
> -
>   	if (test_and_clear_bit(0,&mce_need_notify)) {
>   		/* wake processes polling /dev/mcelog */
>   		wake_up_interruptible(&mce_chrdev_wait);


* Re: [PATCH 6/6] x86, mce: Recognise machine check bank signature for data path error
  2011-12-08 22:49 ` [PATCH 6/6] x86, mce: Recognise machine check bank signature for data path error Tony Luck
@ 2011-12-14 15:47   ` Borislav Petkov
  0 siblings, 0 replies; 26+ messages in thread
From: Borislav Petkov @ 2011-12-14 15:47 UTC (permalink / raw)
  To: Tony Luck; +Cc: linux-kernel, Ingo Molnar, Huang, Ying, Hidetoshi Seto

On Thu, Dec 08, 2011 at 02:49:09PM -0800, Tony Luck wrote:
> Action required data path signature is defined in table 15-19 of SDM:
> 
> +-----------------------------------------------------------------------------+
> | SRAR Error | Valid | OVER | UC | EN | MISCV | ADDRV | PCC | S | AR | MCACOD |
> | Data Load  |     1 |    0 |  1 |  1 |     1 |     1 |   0 | 1 |  1 |  0x134 |
> +-----------------------------------------------------------------------------+
> 
> Recognise this, and pass MCE_AR_SEVERITY code back to do_machine_check()
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/kernel/cpu/mcheck/mce-severity.c |   14 +++++++++++++-
>  1 files changed, 13 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c
> index 7395d5f..c4d8b24 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce-severity.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c
> @@ -54,6 +54,7 @@ static struct severity {
>  #define  MASK(x, y)	.mask = x, .result = y
>  #define MCI_UC_S (MCI_STATUS_UC|MCI_STATUS_S)
>  #define MCI_UC_SAR (MCI_STATUS_UC|MCI_STATUS_S|MCI_STATUS_AR)
> +#define	MCI_ADDR (MCI_STATUS_ADDRV|MCI_STATUS_MISCV)
>  #define MCACOD 0xffff
>  
>  	MCESEV(
> @@ -102,11 +103,22 @@ static struct severity {
>  		SER, BITCLR(MCI_STATUS_S)
>  		),
>  
> -	/* AR add known MCACODs here */
>  	MCESEV(
>  		PANIC, "Action required with lost events",
>  		SER, BITSET(MCI_STATUS_OVER|MCI_UC_SAR)
>  		),
> +
> +	/* known AR MCACODs: */
> +	MCESEV(
> +		KEEP, "HT thread notices Action required: data load error",
> +		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|0x0134),
> +		MCGMASK(MCG_STATUS_EIPV, 0)

Oh, this is the case where the core only "observed" the error, ok.

This is marked as MCE_KEEP_SEVERITY, which means that we're panicking
in case we lose the AR error on the affected CPU. Which should be
conservative enough...

ACK.

> +		),
> +	MCESEV(
> +		AR, "Action required: data load error",
> +		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|0x0134),
> +		USER
> +		),
>  	MCESEV(
>  		PANIC, "Action required: unknown MCACOD",
>  		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_SAR)
> -- 
> 1.7.3.1
> 
> 

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

* Re: [PATCH 5/6] x86, mce: handle "action required" errors
  2011-12-12 21:47 ` [PATCH 5/6] x86, mce: handle "action required" errors Tony Luck
  2011-12-14  9:28   ` Chen Gong
@ 2011-12-14 16:04   ` Borislav Petkov
  2011-12-14 19:05     ` Luck, Tony
  1 sibling, 1 reply; 26+ messages in thread
From: Borislav Petkov @ 2011-12-14 16:04 UTC (permalink / raw)
  To: Tony Luck; +Cc: linux-kernel, Ingo Molnar, Huang, Ying, Hidetoshi Seto

On Mon, Dec 12, 2011 at 01:47:45PM -0800, Tony Luck wrote:
[..]

> - * Called after mce notification in process context. This code
> - * is allowed to sleep. Call the high level VM handler to process
> - * any corrupted pages.
> - * Assume that the work queue code only calls this one at a time
> - * per CPU.
> - * Note we don't disable preemption, so this code might run on the wrong
> - * CPU. In this case the event is picked up by the scheduled work queue.
> - * This is merely a fast path to expedite processing in some common
> - * cases.
> + * Called in process context that interrupted by MCE and marked with
> + * TIF_MCE_NOTFY, just before returning to errorneous userland.
> + * This code is allowed to sleep.
> + * Attempt possible recovery such as calling the high level VM handler to
> + * process any corrupted pages, and kill/signal current process if required.
>   */
>  void mce_notify_process(void)
>  {
> +	__u64	paddr = paddr;
>  	unsigned long pfn;
> -	mce_notify_irq();
> -	while (mce_ring_get(&pfn))
> -		memory_failure(pfn, MCE_VECTOR, 0);
> +
> +	if (!mce_find_info(&paddr))
> +		mce_panic("Lost address", NULL, NULL);

Wouldn't it be good to return struct mce_info *mi here in addition to
&paddr...

> +	pfn = paddr >> PAGE_SHIFT;
> +
> +	clear_thread_flag(TIF_MCE_NOTIFY);
> +
> +	pr_err("Uncorrected hardware memory error in user-access at %llx",
> +		 paddr);
> +	if (memory_failure(pfn, MCE_VECTOR, MF_ACTION_REQUIRED) < 0) {
> +		pr_err("Memory error not recovered");
> +		force_sig(SIGBUS, current);
> +	} else {
> +		pr_err("Memory error recovered");
> +		mce_clear_info();

so that you don't need to iterate again over the mce_info array but do:

	mce_clear_info(mi);

?

This assumes, of course, that you have only one AR MCE per task, per
return to userspace. I guess this is fine for now.

> +	}
>  }
>  
>  static void mce_process_work(struct work_struct *dummy)
>  {
> -	mce_notify_process();
> +	unsigned long pfn;
> +
> +	while (mce_ring_get(&pfn))
> +		memory_failure(pfn, MCE_VECTOR, 0);
>  }
>  
>  #ifdef CONFIG_X86_MCE_INTEL
> @@ -1232,8 +1246,6 @@ int mce_notify_irq(void)
>  	/* Not more than two messages every minute */
>  	static DEFINE_RATELIMIT_STATE(ratelimit, 60*HZ, 2);
>  
> -	clear_thread_flag(TIF_MCE_NOTIFY);
> -
>  	if (test_and_clear_bit(0, &mce_need_notify)) {
>  		/* wake processes polling /dev/mcelog */
>  		wake_up_interruptible(&mce_chrdev_wait);
> -- 
> 1.7.3.1
> 

Thanks.

-- 
Regards/Gruss,
Boris.

* Re: [PATCH 1/6] HWPOISON: clean up memory_failure() vs. __memory_failure()
  2011-12-14  7:47   ` Ingo Molnar
@ 2011-12-14 16:07     ` Borislav Petkov
  2011-12-14 16:55       ` Ingo Molnar
  0 siblings, 1 reply; 26+ messages in thread
From: Borislav Petkov @ 2011-12-14 16:07 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Tony Luck, linux-kernel, Huang, Ying, Hidetoshi Seto

On Wed, Dec 14, 2011 at 08:47:49AM +0100, Ingo Molnar wrote:
> > -/* dummy to break dependency. actual code is in mm/memory-failure.c */
> > -void __attribute__((weak)) memory_failure(unsigned long pfn, int vector)
> > +#ifndef CONFIG_MEMORY_FAILURE
> > +int memory_failure(unsigned long pfn, int vector, int flags)
> >  {
> >  	printk(KERN_ERR "Action optional memory failure at %lx ignored\n", pfn);
> 
> Btw., while at it, could we phrase this message in a more 
> obvious way to users, such as 'Non-fatal memory failure at %lx 
> ignored'?

Yeah, that might not be as correct as we want it to be. AO means it
is an uncorrectable error, i.e. it would become fatal if we consumed
it, but it isn't that now because we just saw it passing by in the
cacheline...

Maybe "Fatal, unconsumed error ignored..."

-- 
Regards/Gruss,
Boris.

* Re: [PATCH 1/6] HWPOISON: clean up memory_failure() vs. __memory_failure()
  2011-12-14 16:07     ` Borislav Petkov
@ 2011-12-14 16:55       ` Ingo Molnar
  2011-12-14 17:21         ` Luck, Tony
  0 siblings, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2011-12-14 16:55 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Tony Luck, linux-kernel, Huang, Ying, Hidetoshi Seto


* Borislav Petkov <bp@amd64.org> wrote:

> On Wed, Dec 14, 2011 at 08:47:49AM +0100, Ingo Molnar wrote:
> > > -/* dummy to break dependency. actual code is in mm/memory-failure.c */
> > > -void __attribute__((weak)) memory_failure(unsigned long pfn, int vector)
> > > +#ifndef CONFIG_MEMORY_FAILURE
> > > +int memory_failure(unsigned long pfn, int vector, int flags)
> > >  {
> > >  	printk(KERN_ERR "Action optional memory failure at %lx ignored\n", pfn);
> > 
> > Btw., while at it, could we phrase this message in a more 
> > obvious way to users, such as 'Non-fatal memory failure at 
> > %lx ignored'?
> 
> Yeah, that might not be as correct as we want it to be. AO 
> means it is an uncorrectable error, i.e. it will become fatal 
> if we'd consumed it, but it isn't that now because we just saw 
> it passing by in the cacheline...
> 
> Maybe "Fatal, unconsumed error ignored..."

There's also the distinction that tells us which context is 
affected by an error: the currently executing task/mm, or some 
other one.

So you can keep the terminology, I guess, lacking a better 
alternative. I just wanted to point out that it's likely 
confusing to users.

Thanks,

	Ingo

* RE: [PATCH 1/6] HWPOISON: clean up memory_failure() vs. __memory_failure()
  2011-12-14 16:55       ` Ingo Molnar
@ 2011-12-14 17:21         ` Luck, Tony
  2011-12-15  6:44           ` Ingo Molnar
  0 siblings, 1 reply; 26+ messages in thread
From: Luck, Tony @ 2011-12-14 17:21 UTC (permalink / raw)
  To: Ingo Molnar, Borislav Petkov
  Cc: linux-kernel@vger.kernel.org, Huang, Ying, Hidetoshi Seto

> > >  	printk(KERN_ERR "Action optional memory failure at %lx ignored\n", pfn);
> > 
> > Btw., while at it, could we phrase this message in a more 
> > obvious way to users, such as 'Non-fatal memory failure at 
> > %lx ignored'?
> 
> Yeah, that might not be as correct as we want it to be. AO 
> means it is an uncorrectable error, i.e. it will become fatal 
> if we'd consumed it, but it isn't that now because we just saw 
> it passing by in the cacheline...
> 
> Maybe "Fatal, unconsumed error ignored..."

The overall meaning is "land mine seen but not stepped on yet"

I'll see if I can wordsmith the message to convey that.

-Tony

* RE: [PATCH 5/6] x86, mce: handle "action required" errors
  2011-12-14 16:04   ` Borislav Petkov
@ 2011-12-14 19:05     ` Luck, Tony
  0 siblings, 0 replies; 26+ messages in thread
From: Luck, Tony @ 2011-12-14 19:05 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel@vger.kernel.org, Ingo Molnar, Huang, Ying,
	Hidetoshi Seto

> > +	if (!mce_find_info(&paddr))
> > +		mce_panic("Lost address", NULL, NULL);
>
> Wouldn't it be good to return struct mce_info *mi here in addition to
> &paddr...

Great idea (actually "instead of" works better than "in addition to").

> so that you don't need to iterate again over the mce_info array but do:
>
>	mce_clear_info(mi);

Just coded it - looks much better. Will send new version soon with
this change, and Ingo's suggestions incorporated.
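
A rough userspace sketch of the shape being discussed, under stated assumptions: the lookup hands back the slot itself, so the later clear needs no second pass over the array. C11 atomics replace the kernel atomic_t helpers, and the task argument is a hypothetical stand-in for current. This is not the reworked patch.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MCE_INFO	16

struct mce_info {
	atomic_int	inuse;
	const void	*t;	/* stand-in for struct task_struct * */
	uint64_t	paddr;
};

static struct mce_info mce_info[MAX_MCE_INFO];

/* Return the in-use slot owned by 'task', or NULL if there is none. */
static struct mce_info *mce_find_info(const void *task)
{
	struct mce_info *mi;

	for (mi = mce_info; mi < mce_info + MAX_MCE_INFO; mi++) {
		if (atomic_load(&mi->inuse) && mi->t == task)
			return mi;
	}
	return NULL;
}

/* Release a slot previously returned by mce_find_info(); no search needed. */
static void mce_clear_info(struct mce_info *mi)
{
	atomic_store(&mi->inuse, 0);
}
```

The caller keeps the returned pointer, reads mi->paddr from it, and passes the same pointer to mce_clear_info() once recovery has been attempted.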

> This assumes, of course, that you have only one AR MCE per task, per
> return to userspace. I guess this is fine for now.

While we might have multiple memory references in flight at once, we'd
have to be really unlucky to hit multiple 2-bit errors at the same
time (unless there was some system level failure in the memory controller,
in which case we are not likely to be able to recover).

In current processor implementations, the recoverable errors are all
reported in just one machine check bank - so we can't actually process
more than one at a time.

-Tony

* Re: [PATCH 5/6] x86, mce: handle "action required" errors
  2011-12-14  9:28   ` Chen Gong
@ 2011-12-14 21:30     ` Tony Luck
  2011-12-15  2:56       ` Chen Gong
  0 siblings, 1 reply; 26+ messages in thread
From: Tony Luck @ 2011-12-14 21:30 UTC (permalink / raw)
  To: Chen Gong
  Cc: linux-kernel, Ingo Molnar, Borislav Petkov, Huang, Ying,
	Hidetoshi Seto

On Wed, Dec 14, 2011 at 1:28 AM, Chen Gong <gong.chen@linux.intel.com> wrote:
>> -       if (kill_it&&  tolerant<  3)
>>
>> +       if (worst != MCE_AR_SEVERITY&&  kill_it&&  tolerant<  3)
>>                force_sig(SIGBUS, current);
>
>
> I think here it should add more comments to clarify why not killing *AR*
> case.
> Such as: "for SRAR errors, such as DCU/IFU error, on affected logical
> processors, it is reasonable that RIPV is 0."

I'll look at this - the reason to not kill for AR is that we want to try
to recover first (e.g. page could be re-read from disk into a different
physical page). In some cases we can recover transparently to the
application.

>> -       /* notify userspace ASAP */
>> -       set_thread_flag(TIF_MCE_NOTIFY);
>> +       if (worst == MCE_AR_SEVERITY) {
>
>
> how about adding one more condition check: mce_usable_address(&m) here?

I don't think it is needed - the table lookup in mce_severity() will only set
MCE_AR_SEVERITY if the ADDRV and MISCV bits are set in MCi_STATUS.
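
As a rough illustration of why that holds, here is a standalone sketch of the mask/result comparison for the data load rule from patch 6/6. The bit positions are the MCi_STATUS layout from the SDM; is_ar_data_load() is a hypothetical helper name, and the MCG_STATUS, context (USER) and SER checks that the real table entries also carry are left out.

```c
#include <stdint.h>

#define MCI_STATUS_OVER		(1ULL << 62)
#define MCI_STATUS_UC		(1ULL << 61)
#define MCI_STATUS_MISCV	(1ULL << 59)
#define MCI_STATUS_ADDRV	(1ULL << 58)
#define MCI_STATUS_S		(1ULL << 56)
#define MCI_STATUS_AR		(1ULL << 55)

#define MCI_UC_SAR	(MCI_STATUS_UC | MCI_STATUS_S | MCI_STATUS_AR)
#define MCI_ADDR	(MCI_STATUS_ADDRV | MCI_STATUS_MISCV)
#define MCACOD		0xffffULL

/*
 * Hypothetical helper: true only when every masked bit has the value the
 * rule expects, i.e. OVER clear, UC/S/AR set, ADDRV/MISCV set, and
 * MCACOD equal to 0x0134.
 */
static int is_ar_data_load(uint64_t status)
{
	const uint64_t mask = MCI_STATUS_OVER | MCI_UC_SAR | MCI_ADDR | MCACOD;
	const uint64_t result = MCI_UC_SAR | MCI_ADDR | 0x0134;

	return (status & mask) == result;
}
```

Because ADDRV and MISCV are part of both the mask and the expected result, a status word with either bit clear cannot match this rule.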

>> +               mce_save_info(m.addr);
>> +               set_thread_flag(TIF_MCE_NOTIFY);
>
>
> Here only SRAR error are flagged with TIF_MCE_NOTIFY, which means only SRAR
> error is handled in the function do_notify_resume. If so, SRAO error will
> only be handled in work_queue mce_work. If so, I think some related function
> names should be updated too. Otherwise, it will confuse people not touching
> these codes before.

Agreed - the names of the functions and the actions they perform haven't been
kept up to date.

>>  void mce_notify_process(void)
>>  {
>> +       __u64   paddr = paddr;
>
>
> you mean "__u64 paddr = 0;"?

No. The "paddr = paddr" is a gcc'ism to silence a spurious "may be used
before set" warning.  But the point will be moot in the next version because
changes inspired by Boris' comments mean that this line goes away.

> Does there exist some possibility that in the same process there are more
> than
> one error triggered? If so, maybe mce_find_info/mce_clear_info should be
> changed
> to loop-style, because here TIF_MCE_NOTIFY is cleared in the handler.
>
> Or it is impossible because overwritten will be covered by following
> condition:

I think that in current cpus it isn't possible to have more than one
error reported at the same time per process.

-Tony

* Re: [PATCH 5/6] x86, mce: handle "action required" errors
  2011-12-14 21:30     ` Tony Luck
@ 2011-12-15  2:56       ` Chen Gong
  0 siblings, 0 replies; 26+ messages in thread
From: Chen Gong @ 2011-12-15  2:56 UTC (permalink / raw)
  To: Tony Luck
  Cc: linux-kernel, Ingo Molnar, Borislav Petkov, Huang, Ying,
	Hidetoshi Seto

On 2011/12/15 5:30, Tony Luck wrote:
> On Wed, Dec 14, 2011 at 1:28 AM, Chen Gong<gong.chen@linux.intel.com>  wrote:
>>> -       if (kill_it&&    tolerant<    3)
>>>
>>> +       if (worst != MCE_AR_SEVERITY&&    kill_it&&    tolerant<    3)
>>>                 force_sig(SIGBUS, current);
>>
>>
>> I think here it should add more comments to clarify why not killing *AR*
>> case.
>> Such as: "for SRAR errors, such as DCU/IFU error, on affected logical
>> processors, it is reasonable that RIPV is 0."
>
> I'll look at this - the reason to not kill for AR is that we want to
> try to recover
> first (e.g. page could be re-read from disk into a different physical page).
> In some cases we can recover transparently to the application.

Oh, yes, these are important reasons for not killing on *AR* events. But my
point is that in an environment that supports *AR*, "kill_it" should not be
set to true as it is below:
         if (!(m.mcgstatus & MCG_STATUS_RIPV))
                 kill_it = 1;

The reason is what I said before. But at that point the worst severity hasn't
been determined, so we have to wait until it is known.

anyway, it is an interesting coincidence, isn't it? :-)

* Re: [PATCH 1/6] HWPOISON: clean up memory_failure() vs. __memory_failure()
  2011-12-14 17:21         ` Luck, Tony
@ 2011-12-15  6:44           ` Ingo Molnar
  2011-12-15 18:05             ` Tony Luck
  0 siblings, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2011-12-15  6:44 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, linux-kernel@vger.kernel.org, Huang, Ying,
	Hidetoshi Seto


* Luck, Tony <tony.luck@intel.com> wrote:

> > > >  	printk(KERN_ERR "Action optional memory failure at %lx ignored\n", pfn);
> > > 
> > > Btw., while at it, could we phrase this message in a more 
> > > obvious way to users, such as 'Non-fatal memory failure at 
> > > %lx ignored'?
> > 
> > Yeah, that might not be as correct as we want it to be. AO 
> > means it is an uncorrectable error, i.e. it will become fatal 
> > if we'd consumed it, but it isn't that now because we just saw 
> > it passing by in the cacheline...
> > 
> > Maybe "Fatal, unconsumed error ignored..."
> 
> The overall meaning is "land mine seen but not stepped on yet"

Perfect message!

Thanks,

	Ingo

* Re: [PATCH 1/6] HWPOISON: clean up memory_failure() vs. __memory_failure()
  2011-12-15  6:44           ` Ingo Molnar
@ 2011-12-15 18:05             ` Tony Luck
  2011-12-15 18:09               ` Ingo Molnar
  0 siblings, 1 reply; 26+ messages in thread
From: Tony Luck @ 2011-12-15 18:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Borislav Petkov, linux-kernel@vger.kernel.org, Huang, Ying,
	Hidetoshi Seto

On Wed, Dec 14, 2011 at 10:44 PM, Ingo Molnar <mingo@elte.hu> wrote:
>> The overall meaning is "land mine seen but not stepped on yet"
>
> Perfect message!

How about:

Uncorrected memory error in page 0x%lx ignored
Rebuild kernel with CONFIG_MEMORY_FAILURE=y for smarter handling

-Tony

* Re: [PATCH 1/6] HWPOISON: clean up memory_failure() vs. __memory_failure()
  2011-12-15 18:05             ` Tony Luck
@ 2011-12-15 18:09               ` Ingo Molnar
  0 siblings, 0 replies; 26+ messages in thread
From: Ingo Molnar @ 2011-12-15 18:09 UTC (permalink / raw)
  To: Tony Luck
  Cc: Borislav Petkov, linux-kernel@vger.kernel.org, Huang, Ying,
	Hidetoshi Seto


* Tony Luck <tony.luck@intel.com> wrote:

> On Wed, Dec 14, 2011 at 10:44 PM, Ingo Molnar <mingo@elte.hu> wrote:
> >> The overall meaning is "land mine seen but not stepped on yet"
> >
> > Perfect message!
> 
> How about:
> 
> Uncorrected memory error in page 0x%lx ignored
> Rebuild kernel with CONFIG_MEMORY_FAILURE=y for smarter handling

Yeah, sounds good.

Thanks,

	Ingo

* [PATCH 5/6] x86, mce: handle "action required" errors
  2011-12-15 19:59 [PATCH 0/6] x86, mce: machine check recovery for applications [updated] Tony Luck
@ 2011-12-15 19:02 ` Tony Luck
  2011-12-16  0:14   ` Hidetoshi Seto
  0 siblings, 1 reply; 26+ messages in thread
From: Tony Luck @ 2011-12-15 19:02 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Borislav Petkov, Chen Gong, Huang, Ying,
	Hidetoshi Seto

All non-urgent actions (reporting low severity errors and handling
"action-optional" errors) are now handled by a work queue. This
means that TIF_MCE_NOTIFY can be used to block execution for a
thread experiencing an "action-required" fault until we get all
cpus out of the machine check handler (and the thread that hit
the fault into mce_notify_process()).

We use the new mce_{save,find,clear}_info() API to get information
from do_machine_check() to mce_notify_process(), and then use the
newly improved memory_failure(..., MF_ACTION_REQUIRED) to handle
the error (possibly signalling the process).

Update some comments to make the new code flows clearer.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c |   70 ++++++++++++++++++++++++--------------
 1 files changed, 44 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 6dfab72..2e10b73 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1036,12 +1036,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 			continue;
 		}

-		/*
-		 * Kill on action required.
-		 */
-		if (severity == MCE_AR_SEVERITY)
-			kill_it = 1;
-
 		mce_read_aux(&m, i);

 		/*
@@ -1062,6 +1056,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		}
 	}

+	m = *final;
+
 	if (!no_way_out)
 		mce_clear_state(toclear);

@@ -1080,7 +1076,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	 * support MCE broadcasting or it has been disabled.
 	 */
 	if (no_way_out && tolerant < 3)
-		mce_panic("Fatal machine check on current CPU", final, msg);
+		mce_panic("Fatal machine check on current CPU", &m, msg);

 	/*
 	 * If the error seems to be unrecoverable, something should be
@@ -1089,11 +1085,13 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	 * high, don't try to do anything at all.
 	 */

-	if (kill_it && tolerant < 3)
+	if (worst != MCE_AR_SEVERITY && kill_it && tolerant < 3)
 		force_sig(SIGBUS, current);

-	/* notify userspace ASAP */
-	set_thread_flag(TIF_MCE_NOTIFY);
+	if (worst == MCE_AR_SEVERITY) {
+		mce_save_info(m.addr);
+		set_thread_flag(TIF_MCE_NOTIFY);
+	}

 	if (worst > 0)
 		mce_report_event(regs);
@@ -1107,34 +1105,56 @@ EXPORT_SYMBOL_GPL(do_machine_check);
 #ifndef CONFIG_MEMORY_FAILURE
 int memory_failure(unsigned long pfn, int vector, int flags)
 {
-	printk(KERN_ERR "Action optional memory failure at %lx ignored\n", pfn);
+	if (flags & MF_ACTION_REQUIRED)
+		return -ENXIO; /* panic? */
+	else
+		printk(KERN_ERR "Action optional memory failure at %lx ignored\n", pfn);

 	return 0;
 }
 #endif

 /*
- * Called after mce notification in process context. This code
- * is allowed to sleep. Call the high level VM handler to process
- * any corrupted pages.
- * Assume that the work queue code only calls this one at a time
- * per CPU.
- * Note we don't disable preemption, so this code might run on the wrong
- * CPU. In this case the event is picked up by the scheduled work queue.
- * This is merely a fast path to expedite processing in some common
- * cases.
+ * Called in process context that was interrupted by MCE and marked with
+ * TIF_MCE_NOTIFY, just before returning to erroneous userland.
+ * This code is allowed to sleep.
+ * Attempt possible recovery such as calling the high level VM handler to
+ * process any corrupted pages, and kill/signal current process if required.
+ * Action required errors are handled here.
  */
 void mce_notify_process(void)
 {
 	unsigned long pfn;
-	mce_notify_irq();
-	while (mce_ring_get(&pfn))
-		memory_failure(pfn, MCE_VECTOR, 0);
+	struct mce_info *mi = mce_find_info();
+
+	if (!mi)
+		mce_panic("Lost address", NULL, NULL);
+	pfn = mi->paddr >> PAGE_SHIFT;
+
+	clear_thread_flag(TIF_MCE_NOTIFY);
+
+	pr_err("Uncorrected hardware memory error in user-access at %llx",
+		 mi->paddr);
+	if (memory_failure(pfn, MCE_VECTOR, MF_ACTION_REQUIRED) < 0) {
+		pr_err("Memory error not recovered");
+		force_sig(SIGBUS, current);
+	} else {
+		pr_err("Memory error recovered");
+	}
+	mce_clear_info(mi);
 }

+/*
+ * Action optional processing happens here (picking up
+ * from the list of faulting pages that do_machine_check()
+ * placed into the "ring").
+ */
 static void mce_process_work(struct work_struct *dummy)
 {
-	mce_notify_process();
+	unsigned long pfn;
+
+	while (mce_ring_get(&pfn))
+		memory_failure(pfn, MCE_VECTOR, 0);
 }

 #ifdef CONFIG_X86_MCE_INTEL
@@ -1224,8 +1244,6 @@ int mce_notify_irq(void)
 	/* Not more than two messages every minute */
 	static DEFINE_RATELIMIT_STATE(ratelimit, 60*HZ, 2);

-	clear_thread_flag(TIF_MCE_NOTIFY);
-
 	if (test_and_clear_bit(0, &mce_need_notify)) {
 		/* wake processes polling /dev/mcelog */
 		wake_up_interruptible(&mce_chrdev_wait);
--
1.7.3.1
---
 arch/x86/kernel/cpu/mcheck/mce.c |   71 ++++++++++++++++++++++++--------------
 1 files changed, 45 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 7d7303a..dfd20a6 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -982,7 +982,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	barrier();
 
 	/*
-	 * When no restart IP must always kill or panic.
+	 * When no restart IP might need to kill or panic.
+	 * Assume the worst for now, but if we find the
+	 * severity is MCE_AR_SEVERITY we have other options.
 	 */
 	if (!(m.mcgstatus & MCG_STATUS_RIPV))
 		kill_it = 1;
@@ -1036,12 +1038,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 			continue;
 		}
 
-		/*
-		 * Kill on action required.
-		 */
-		if (severity == MCE_AR_SEVERITY)
-			kill_it = 1;
-
 		mce_read_aux(&m, i);
 
 		/*
@@ -1062,6 +1058,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		}
 	}
 
+	m = *final;
+
 	if (!no_way_out)
 		mce_clear_state(toclear);
 
@@ -1080,7 +1078,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	 * support MCE broadcasting or it has been disabled.
 	 */
 	if (no_way_out && tolerant < 3)
-		mce_panic("Fatal machine check on current CPU", final, msg);
+		mce_panic("Fatal machine check on current CPU", &m, msg);
 
 	/*
 	 * If the error seems to be unrecoverable, something should be
@@ -1089,11 +1087,13 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	 * high, don't try to do anything at all.
 	 */
 
-	if (kill_it && tolerant < 3)
+	if (worst != MCE_AR_SEVERITY && kill_it && tolerant < 3)
 		force_sig(SIGBUS, current);
 
-	/* notify userspace ASAP */
-	set_thread_flag(TIF_MCE_NOTIFY);
+	if (worst == MCE_AR_SEVERITY) {
+		mce_save_info(m.addr);
+		set_thread_flag(TIF_MCE_NOTIFY);
+	}
 
 	if (worst > 0)
 		mce_report_event(regs);
@@ -1107,6 +1107,8 @@ EXPORT_SYMBOL_GPL(do_machine_check);
 #ifndef CONFIG_MEMORY_FAILURE
 int memory_failure(unsigned long pfn, int vector, int flags)
 {
+	if (flags & MF_ACTION_REQUIRED)
+		return -ENXIO; /* panic? */
 	printk(KERN_ERR "Uncorrected memory error in page 0x%lx ignored\n"
 		"Rebuild kernel with CONFIG_MEMORY_FAILURE=y for smarter handling\n", pfn);
 
@@ -1115,27 +1117,46 @@ int memory_failure(unsigned long pfn, int vector, int flags)
 #endif
 
 /*
- * Called after mce notification in process context. This code
- * is allowed to sleep. Call the high level VM handler to process
- * any corrupted pages.
- * Assume that the work queue code only calls this one at a time
- * per CPU.
- * Note we don't disable preemption, so this code might run on the wrong
- * CPU. In this case the event is picked up by the scheduled work queue.
- * This is merely a fast path to expedite processing in some common
- * cases.
+ * Called in process context that interrupted by MCE and marked with
+ * TIF_MCE_NOTFY, just before returning to errorneous userland.
+ * This code is allowed to sleep.
+ * Attempt possible recovery such as calling the high level VM handler to
+ * process any corrupted pages, and kill/signal current process if required.
+ * Action required errors are handled here.
  */
 void mce_notify_process(void)
 {
 	unsigned long pfn;
-	mce_notify_irq();
-	while (mce_ring_get(&pfn))
-		memory_failure(pfn, MCE_VECTOR, 0);
+	struct mce_info *mi = mce_find_info();
+
+	if (!mi)
+		mce_panic("Lost address", NULL, NULL);
+	pfn = mi->paddr >> PAGE_SHIFT;
+
+	clear_thread_flag(TIF_MCE_NOTIFY);
+
+	pr_err("Uncorrected hardware memory error in user-access at %llx",
+		 mi->paddr);
+	if (memory_failure(pfn, MCE_VECTOR, MF_ACTION_REQUIRED) < 0) {
+		pr_err("Memory error not recovered");
+		force_sig(SIGBUS, current);
+	} else {
+		pr_err("Memory error recovered");
+	}
+	mce_clear_info(mi);
 }
 
+/*
+ * Action optional processing happens here (picking up
+ * from the list of faulting pages that do_machine_check()
+ * placed into the "ring").
+ */
 static void mce_process_work(struct work_struct *dummy)
 {
-	mce_notify_process();
+	unsigned long pfn;
+
+	while (mce_ring_get(&pfn))
+		memory_failure(pfn, MCE_VECTOR, 0);
 }
 
 #ifdef CONFIG_X86_MCE_INTEL
@@ -1225,8 +1246,6 @@ int mce_notify_irq(void)
 	/* Not more than two messages every minute */
 	static DEFINE_RATELIMIT_STATE(ratelimit, 60*HZ, 2);
 
-	clear_thread_flag(TIF_MCE_NOTIFY);
-
 	if (test_and_clear_bit(0, &mce_need_notify)) {
 		/* wake processes polling /dev/mcelog */
 		wake_up_interruptible(&mce_chrdev_wait);
-- 
1.7.3.1



* Re: [PATCH 3/6] x86, mce: create helper function to save addr/misc when needed
  2011-12-13 17:48 ` [PATCH 3/6] x86, mce: create helper function to save addr/misc when needed Tony Luck
@ 2011-12-16  0:13   ` Hidetoshi Seto
  0 siblings, 0 replies; 26+ messages in thread
From: Hidetoshi Seto @ 2011-12-16  0:13 UTC (permalink / raw)
  To: Tony Luck; +Cc: linux-kernel, Ingo Molnar, Borislav Petkov, Huang, Ying

(2011/12/14 2:48), Tony Luck wrote:
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 265139d..43f22c8 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -492,6 +492,27 @@ static void mce_report_event(struct pt_regs *regs)
>  	irq_work_queue(&__get_cpu_var(mce_irq_work));
>  }
>  
> +/*
> + * Read ADDR and MISC registers.
> + */
> +static void mce_read_aux(struct mce *m, int i)
> +{
> +	if (m->status & MCI_STATUS_MISCV)
> +		m->misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
> +	if (m->status & MCI_STATUS_ADDRV) {
> +		m->addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
> +
> +		/*
> +		 * Mask the reported address by the reported granularity.
> +		 */
> +		if (mce_ser && (m->status & MCI_STATUS_MISCV)) {
> +			u8 shift = m->misc & 0x3f;

(nitpick) You can use:

#define MCI_MISC_ADDR_LSB(m)    ((m) & 0x3f)

> +			m->addr >>= shift;
> +			m->addr <<= shift;
> +		}
> +	}
> +}
> +
>  DEFINE_PER_CPU(unsigned, mce_poll_count);
>  
>  /*

Thanks,
H.Seto
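
A minimal userspace sketch of the masking step with the suggested macro
applied. mask_addr() is a hypothetical helper name used only for this
illustration; the shift/mask logic is taken from the quoted hunk.

```c
#include <assert.h>
#include <stdint.h>

/* Low six bits of MCi_MISC: least significant valid bit of the address. */
#define MCI_MISC_ADDR_LSB(m)    ((m) & 0x3f)

/*
 * Clear the address bits below the reported granularity.
 * mask_addr() is a hypothetical name; the kernel does this inline
 * in mce_read_aux().
 */
static uint64_t mask_addr(uint64_t addr, uint64_t misc)
{
	uint8_t shift = MCI_MISC_ADDR_LSB(misc);

	/* shift is at most 63, so both shifts are well defined on 64 bits */
	addr >>= shift;
	addr <<= shift;
	return addr;
}
```

With a made-up MISC value of 0x8c (low six bits = 12), an address such as
0x12345678abc is rounded down to a 4K boundary.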



* Re: [PATCH 5/6] x86, mce: handle "action required" errors
  2011-12-15 19:02 ` [PATCH 5/6] x86, mce: handle "action required" errors Tony Luck
@ 2011-12-16  0:14   ` Hidetoshi Seto
  2011-12-16  0:29     ` Tony Luck
  2011-12-16  0:51     ` Tony Luck
  0 siblings, 2 replies; 26+ messages in thread
From: Hidetoshi Seto @ 2011-12-16  0:14 UTC (permalink / raw)
  To: Tony Luck
  Cc: linux-kernel, Ingo Molnar, Borislav Petkov, Chen Gong,
	Huang, Ying

(2011/12/16 4:02), Tony Luck wrote:
> All non-urgent actions (reporting low severity errors and handling
> "action-optional" errors) are now handled by a work queue. This
> means that TIF_MCE_NOTIFY can be used to block execution for a
> thread experiencing an "action-required" fault until we get all
> cpus out of the machine check handler (and the thread that hit
> the fault into mce_notify_process()).
> 
> We use the new mce_{save,find,clear}_info() API to get information
> from do_machine_check() to mce_notify_process(), and then use the
> newly improved memory_failure(..., MF_ACTION_REQUIRED) to handle
> the error (possibly signalling the process).
> 
> Update some comments to make the new code flows clearer.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/kernel/cpu/mcheck/mce.c |   70 ++++++++++++++++++++++++--------------
>  1 files changed, 44 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 6dfab72..2e10b73 100644

(snip)

... 2 patches in a mail?

> --
> 1.7.3.1
> ---
>  arch/x86/kernel/cpu/mcheck/mce.c |   71 ++++++++++++++++++++++++--------------
>  1 files changed, 45 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 7d7303a..dfd20a6 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -982,7 +982,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  	barrier();
>  
>  	/*
> -	 * When no restart IP must always kill or panic.
> +	 * When no restart IP might need to kill or panic.
> +	 * Assume the worst for now, but if we find the
> +	 * severity is MCE_AR_SEVERITY we have other options.
>  	 */
>  	if (!(m.mcgstatus & MCG_STATUS_RIPV))
>  		kill_it = 1;
> @@ -1036,12 +1038,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  			continue;
>  		}
>  
> -		/*
> -		 * Kill on action required.
> -		 */
> -		if (severity == MCE_AR_SEVERITY)
> -			kill_it = 1;
> -
>  		mce_read_aux(&m, i);
>  
>  		/*
> @@ -1062,6 +1058,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  		}
>  	}
>  
> +	m = *final;
> +
>  	if (!no_way_out)
>  		mce_clear_state(toclear);
>  

Small change, but again, you should describe the reason why...

> @@ -1080,7 +1078,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  	 * support MCE broadcasting or it has been disabled.
>  	 */
>  	if (no_way_out && tolerant < 3)
> -		mce_panic("Fatal machine check on current CPU", final, msg);
> +		mce_panic("Fatal machine check on current CPU", &m, msg);
>  
>  	/*
>  	 * If the error seems to be unrecoverable, something should be
> @@ -1089,11 +1087,13 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  	 * high, don't try to do anything at all.
>  	 */
>  
> -	if (kill_it && tolerant < 3)
> +	if (worst != MCE_AR_SEVERITY && kill_it && tolerant < 3)
>  		force_sig(SIGBUS, current);
>  
> -	/* notify userspace ASAP */
> -	set_thread_flag(TIF_MCE_NOTIFY);
> +	if (worst == MCE_AR_SEVERITY) {
> +		mce_save_info(m.addr);
> +		set_thread_flag(TIF_MCE_NOTIFY);
> +	}

I know tolerant==3 is an insane option, but it is better to handle it
here too (or I would be happy if we could remove tolerant completely).

e.g.
	if (tolerant < 3) {
		if (no_way_out)
			mce_panic(...);
		if (worst == MCE_AR_SEVERITY) {
			/* schedule action before return to userland */
			mce_save_info(m.addr);
			set_thread_flag(TIF_MCE_NOTIFY);
		} else if (kill_it) {
			force_sig(SIGBUS, current);
		}
	}

>  
>  	if (worst > 0)
>  		mce_report_event(regs);
> @@ -1107,6 +1107,8 @@ EXPORT_SYMBOL_GPL(do_machine_check);
>  #ifndef CONFIG_MEMORY_FAILURE
>  int memory_failure(unsigned long pfn, int vector, int flags)
>  {
> +	if (flags & MF_ACTION_REQUIRED)
> +		return -ENXIO; /* panic? */
>  	printk(KERN_ERR "Uncorrected memory error in page 0x%lx ignored\n"
>  		"Rebuild kernel with CONFIG_MEMORY_FAILURE=y for smarter handling\n", pfn);
>  
> @@ -1115,27 +1117,46 @@ int memory_failure(unsigned long pfn, int vector, int flags)
>  #endif
>  
>  /*
> - * Called after mce notification in process context. This code
> - * is allowed to sleep. Call the high level VM handler to process
> - * any corrupted pages.
> - * Assume that the work queue code only calls this one at a time
> - * per CPU.
> - * Note we don't disable preemption, so this code might run on the wrong
> - * CPU. In this case the event is picked up by the scheduled work queue.
> - * This is merely a fast path to expedite processing in some common
> - * cases.
> + * Called in process context that interrupted by MCE and marked with
> + * TIF_MCE_NOTFY, just before returning to errorneous userland.

Spell checker suggests:	                      erroneous

> + * This code is allowed to sleep.
> + * Attempt possible recovery such as calling the high level VM handler to
> + * process any corrupted pages, and kill/signal current process if required.
> + * Action required errors are handled here.
>   */
>  void mce_notify_process(void)
>  {
>  	unsigned long pfn;
> -	mce_notify_irq();
> -	while (mce_ring_get(&pfn))
> -		memory_failure(pfn, MCE_VECTOR, 0);
> +	struct mce_info *mi = mce_find_info();
> +
> +	if (!mi)
> +		mce_panic("Lost address", NULL, NULL);

The message is too short, isn't it?

And if this case is another version of "Memory error not recovered"
located below, then why mce_panic() rather than force_sig()?

> +	pfn = mi->paddr >> PAGE_SHIFT;
> +
> +	clear_thread_flag(TIF_MCE_NOTIFY);
> +
> +	pr_err("Uncorrected hardware memory error in user-access at %llx",
> +		 mi->paddr);
> +	if (memory_failure(pfn, MCE_VECTOR, MF_ACTION_REQUIRED) < 0) {
> +		pr_err("Memory error not recovered");
> +		force_sig(SIGBUS, current);
> +	} else {
> +		pr_err("Memory error recovered");
> +	}
> +	mce_clear_info(mi);
>  }
>  
> +/*
> + * Action optional processing happens here (picking up
> + * from the list of faulting pages that do_machine_check()
> + * placed into the "ring").
> + */
>  static void mce_process_work(struct work_struct *dummy)
>  {
> -	mce_notify_process();
> +	unsigned long pfn;
> +
> +	while (mce_ring_get(&pfn))
> +		memory_failure(pfn, MCE_VECTOR, 0);
>  }
>  
>  #ifdef CONFIG_X86_MCE_INTEL
> @@ -1225,8 +1246,6 @@ int mce_notify_irq(void)
>  	/* Not more than two messages every minute */
>  	static DEFINE_RATELIMIT_STATE(ratelimit, 60*HZ, 2);
>  
> -	clear_thread_flag(TIF_MCE_NOTIFY);
> -
>  	if (test_and_clear_bit(0, &mce_need_notify)) {
>  		/* wake processes polling /dev/mcelog */
>  		wake_up_interruptible(&mce_chrdev_wait);

Thanks,
H.Seto



* Re: [PATCH 5/6] x86, mce: handle "action required" errors
  2011-12-16  0:14   ` Hidetoshi Seto
@ 2011-12-16  0:29     ` Tony Luck
  2011-12-16  0:51     ` Tony Luck
  1 sibling, 0 replies; 26+ messages in thread
From: Tony Luck @ 2011-12-16  0:29 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: linux-kernel, Ingo Molnar, Borislav Petkov, Chen Gong,
	Huang, Ying

2011/12/15 Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>:
> (snip)
>
> ... 2 patches in a mail?

Sort of ... I pulled the old version of the patch into the commit editor - to
just grab the description part, but I forgot to delete from "Signed-off-by"
to end of file ... so the "extra" patch is now part of my commit comment,
and the real patch starts further down.

Oops

-Tony


* Re: [PATCH 5/6] x86, mce: handle "action required" errors
  2011-12-16  0:14   ` Hidetoshi Seto
  2011-12-16  0:29     ` Tony Luck
@ 2011-12-16  0:51     ` Tony Luck
  1 sibling, 0 replies; 26+ messages in thread
From: Tony Luck @ 2011-12-16  0:51 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: linux-kernel, Ingo Molnar, Borislav Petkov, Chen Gong,
	Huang, Ying

2011/12/15 Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>:
>> +     m = *final;
>> +
>>       if (!no_way_out)
>>               mce_clear_state(toclear);
>>
>
> Small change, but again, you should describe the reason why...

Yes - this is subtle (mce_clear_state() will clear what *final points to, so
make a copy in the local variable "m"). It deserves a comment, so I'll add
one.
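
A small userspace sketch of the copy-before-clear pattern described above.
struct record, clear_record() and snapshot_then_clear() are hypothetical
stand-ins for illustration only, not the kernel's types or functions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct mce; only two fields are modelled. */
struct record {
	uint64_t status;
	uint64_t addr;
};

/* Stand-in for a cleanup step that wipes the record a pointer refers to. */
static void clear_record(struct record *r)
{
	memset(r, 0, sizeof(*r));
}

/*
 * Take a by-value snapshot first, then clear the original.
 * The returned copy stays usable after the storage behind `final`
 * has been wiped, which is the point of "m = *final;" in the patch.
 */
static struct record snapshot_then_clear(struct record *final)
{
	struct record m = *final;

	clear_record(final);
	return m;
}
```

Reading through the pointer after the clear would see zeroes, while the
local copy still carries the address needed later.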

> I know tolerant==3 is an insane option, but it is better to handle it
> here too (or I would be happy if we could remove tolerant completely).
>
> e.g.
>        if (tolerant < 3) {
>                if (no_way_out)
>                        mce_panic(...);
>                if (worst == MCE_AR_SEVERITY) {
>                        /* schedule action before return to userland */
>                        mce_save_info(m.addr);
>                        set_thread_flag(TIF_MCE_NOTIFY);
>                } else if (kill_it) {
>                        force_sig(SIGBUS, current);
>                }
>        }

Good point. But I don't see how "tolerant==3" and "AR" errors ever make sense
together. If we don't do something to fix the problem and just ignore it, then
we will take a new machine check when we re-execute the instruction (unless the
problem magically went away ... but I don't think that is likely). So a user
with tolerant=3 will loop taking the same machine check over and over, which
isn't likely to be what was wanted.
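
The ordering in the quoted suggestion can be sketched as a pure decision
function, which makes the tolerant==3 case easy to inspect. The enum and
decide() are hypothetical names for this illustration; action_required
stands for "worst == MCE_AR_SEVERITY".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome labels; the kernel code calls functions instead. */
enum mce_outcome {
	OUTCOME_NONE,      /* nothing done (includes tolerant == 3) */
	OUTCOME_PANIC,     /* would call mce_panic() */
	OUTCOME_DEFER_AR,  /* would call mce_save_info() and set TIF_MCE_NOTIFY */
	OUTCOME_SIGBUS,    /* would call force_sig(SIGBUS, current) */
};

/* Mirrors the nesting of the quoted snippet, not the posted patch. */
static enum mce_outcome decide(int tolerant, bool no_way_out,
			       bool action_required, bool kill_it)
{
	if (tolerant >= 3)
		return OUTCOME_NONE;
	if (no_way_out)
		return OUTCOME_PANIC;
	if (action_required)
		return OUTCOME_DEFER_AR;
	if (kill_it)
		return OUTCOME_SIGBUS;
	return OUTCOME_NONE;
}
```

With tolerant == 3 and an action-required error the function returns
OUTCOME_NONE, which is the "ignore it and re-execute" situation discussed
above.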

>> + * TIF_MCE_NOTFY, just before returning to errorneous userland.
>
> Spell checker suggests:                       erroneous

Will fix.

>> +     if (!mi)
>> +             mce_panic("Lost address", NULL, NULL);
>
> The message is too short, isn't it?

Yes - it's a "Can't happen" error case (if we are here, then we must have saved
the address when we set TIF_MCE_NOTIFY - so the only way to not find the
address is for someone else to have corrupted our mce_info[] array). Perhaps
I should change to BUG_ON()?
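
To make the "can't happen" argument concrete, here is a toy single-threaded
model of a save/find/clear table. Everything in it is an assumption made for
illustration: the slot layout, the table size, the integer "owner" key and
the function names are hypothetical, and it ignores the atomicity that code
running in machine check context needs. The real mce_{save,find,clear}_info()
are defined in patch 4/6.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INFO_SLOTS 4   /* hypothetical table size */

/* Hypothetical slot layout, not the kernel's struct mce_info. */
struct info {
	int in_use;
	int owner;       /* stand-in for the task that hit the fault */
	uint64_t paddr;
};

static struct info table[INFO_SLOTS];

/* Record paddr for owner; returns 0 on success, -1 if the table is full. */
static int save_info(int owner, uint64_t paddr)
{
	for (size_t i = 0; i < INFO_SLOTS; i++) {
		if (!table[i].in_use) {
			table[i].in_use = 1;
			table[i].owner = owner;
			table[i].paddr = paddr;
			return 0;
		}
	}
	return -1;
}

/* Return the slot saved for owner, or NULL if there is none. */
static struct info *find_info(int owner)
{
	for (size_t i = 0; i < INFO_SLOTS; i++)
		if (table[i].in_use && table[i].owner == owner)
			return &table[i];
	return NULL;
}

/* Release a slot once the error has been handled. */
static void clear_info(struct info *mi)
{
	mi->in_use = 0;
}
```

In this model a lookup for an owner that saved an entry can only fail if the
entry was cleared or overwritten in between, which corresponds to the
corruption case described above.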

> And if this case is another version of "Memory error not recovered"
> located below, then why mce_panic() rather than force_sig()?

The more I look at that "Memory error not recovered" code, the more
I think that it should be a panic (almost the same logic as for tolerant=3:
in this case force_sig would prevent us from running right back into the
machine check, but we did nothing to poison the page).

Thanks for looking at this.

-Tony


end of thread, other threads:[~2011-12-16  0:51 UTC | newest]

Thread overview: 26+ messages
2011-12-13 19:05 [PATCH 0/6] x86, mce: machine check recovery for applications Tony Luck
2011-12-08 22:49 ` [PATCH 6/6] x86, mce: Recognise machine check bank signature for data path error Tony Luck
2011-12-14 15:47   ` Borislav Petkov
2011-12-12 21:06 ` [PATCH 4/6] x86, mce: Add mechanism to safely save information in MCE handler Tony Luck
2011-12-14  7:52   ` Ingo Molnar
2011-12-12 21:47 ` [PATCH 5/6] x86, mce: handle "action required" errors Tony Luck
2011-12-14  9:28   ` Chen Gong
2011-12-14 21:30     ` Tony Luck
2011-12-15  2:56       ` Chen Gong
2011-12-14 16:04   ` Borislav Petkov
2011-12-14 19:05     ` Luck, Tony
2011-12-13 17:24 ` [PATCH 1/6] HWPOISON: clean up memory_failure() vs. __memory_failure() Tony Luck
2011-12-14  7:47   ` Ingo Molnar
2011-12-14 16:07     ` Borislav Petkov
2011-12-14 16:55       ` Ingo Molnar
2011-12-14 17:21         ` Luck, Tony
2011-12-15  6:44           ` Ingo Molnar
2011-12-15 18:05             ` Tony Luck
2011-12-15 18:09               ` Ingo Molnar
2011-12-13 17:27 ` [PATCH 2/6] HWPOISON: Add code to handle "action required" errors Tony Luck
2011-12-13 17:48 ` [PATCH 3/6] x86, mce: create helper function to save addr/misc when needed Tony Luck
2011-12-16  0:13   ` Hidetoshi Seto
  -- strict thread matches above, loose matches on Subject: below --
2011-12-15 19:59 [PATCH 0/6] x86, mce: machine check recovery for applications [updated] Tony Luck
2011-12-15 19:02 ` [PATCH 5/6] x86, mce: handle "action required" errors Tony Luck
2011-12-16  0:14   ` Hidetoshi Seto
2011-12-16  0:29     ` Tony Luck
2011-12-16  0:51     ` Tony Luck
