linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/3] RAS: Correctable Errors Collector thing
@ 2014-05-27 13:27 Borislav Petkov
  2014-05-27 13:27 ` [RFC PATCH 1/3] MCE, CE: Corrected errors collecting thing Borislav Petkov
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Borislav Petkov @ 2014-05-27 13:27 UTC (permalink / raw)
  To: LKML; +Cc: linux-edac, X86 ML, Tony Luck

From: Borislav Petkov <bp@suse.de>

Hi all,

this is something Tony and I have been working on behind the curtains
recently. Here it is in RFC form; it passes quick testing in KVM. Let
me send it out before I start hammering on it on a real machine.

More in-depth info about what it is and what it does is in patch 1/3.

As always, comments and suggestions are most welcome.

Thanks.

Borislav Petkov (3):
  MCE, CE: Corrected errors collecting thing
  MCE, CE: Wire in the CE collector
  MCE, CE: Add debugging glue

 arch/x86/include/asm/mce.h          |   4 +
 arch/x86/kernel/cpu/mcheck/Makefile |   2 +-
 arch/x86/kernel/cpu/mcheck/ce.c     | 305 ++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/mcheck/mce.c    |  93 ++++++++++-
 4 files changed, 395 insertions(+), 9 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/mcheck/ce.c

-- 
1.9.0


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC PATCH 1/3] MCE, CE: Corrected errors collecting thing
  2014-05-27 13:27 [RFC PATCH 0/3] RAS: Correctable Errors Collector thing Borislav Petkov
@ 2014-05-27 13:27 ` Borislav Petkov
  2014-05-27 13:27 ` [RFC PATCH 2/3] MCE, CE: Wire in the CE collector Borislav Petkov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Borislav Petkov @ 2014-05-27 13:27 UTC (permalink / raw)
  To: LKML; +Cc: linux-edac, X86 ML, Tony Luck

From: Borislav Petkov <bp@suse.de>

A simple data structure for collecting correctable errors.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/include/asm/mce.h          |   4 +
 arch/x86/kernel/cpu/mcheck/Makefile |   2 +-
 arch/x86/kernel/cpu/mcheck/ce.c     | 281 ++++++++++++++++++++++++++++++++++++
 3 files changed, 286 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/cpu/mcheck/ce.c

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 6e4ce2df87cf..b4581392df66 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -247,4 +247,8 @@ struct cper_sec_mem_err;
 extern void apei_mce_report_mem_error(int corrected,
 				      struct cper_sec_mem_err *mem_err);
 
+void __init ce_init(void);
+int ce_add_elem(u64 pfn);
+u64 ce_del_lru_elem(void);
+
 #endif /* _ASM_X86_MCE_H */
diff --git a/arch/x86/kernel/cpu/mcheck/Makefile b/arch/x86/kernel/cpu/mcheck/Makefile
index bb34b03af252..97f03043a674 100644
--- a/arch/x86/kernel/cpu/mcheck/Makefile
+++ b/arch/x86/kernel/cpu/mcheck/Makefile
@@ -1,4 +1,4 @@
-obj-y				=  mce.o mce-severity.o
+obj-y				=  mce.o mce-severity.o ce.o
 
 obj-$(CONFIG_X86_ANCIENT_MCE)	+= winchip.o p5.o
 obj-$(CONFIG_X86_MCE_INTEL)	+= mce_intel.o
diff --git a/arch/x86/kernel/cpu/mcheck/ce.c b/arch/x86/kernel/cpu/mcheck/ce.c
new file mode 100644
index 000000000000..3020415f27f4
--- /dev/null
+++ b/arch/x86/kernel/cpu/mcheck/ce.c
@@ -0,0 +1,281 @@
+#include <linux/mm.h>
+#include <linux/gfp.h>
+#include <linux/kernel.h>
+
+#include <asm/bug.h>
+
+/*
+ * RAS Correctable Errors Collector
+ *
+ * This is a simple gadget which collects correctable errors and counts their
+ * occurrence per physical page address.
+ *
+ * We've opted for possibly the simplest data structure to collect those - an
+ * array of the size of a memory page. It stores 512 u64's with the following
+ * structure:
+ *
+ * [63 ... PFN ... 12 | 11 ... generation ... 10 | 9 ... count ... 0]
+ *
+ * The generation field occupies the two bits above the count; it is set to 11b
+ * on every insertion. During the course of this entry's existence, it
+ * gets decremented during spring cleaning to 10b, then 01b and then 00b.
+ *
+ * This way we're employing the numeric ordering to make sure that newly
+ * inserted/touched elements have higher 12-bit counts (which we've
+ * manufactured) and thus iterating over the array initially won't kick out
+ * those last inserted elements.
+ *
+ * Spring cleaning is what we do when the number of newly inserted elements
+ * reaches CLEAN_ELEMS; it decays all elements in the page. If, after decay,
+ * an element gets inserted again, its generation is set back to 11b to make
+ * sure it has a higher numerical count than other, older elements and thus
+ * emulates an LRU-like behavior when deleting elements to free up space in
+ * the page.
+ *
+ * When an element reaches its max count of COUNT_MASK, we try to poison it by
+ * assuming that errors triggered COUNT_MASK times in a single page are
+ * excessive and that page shouldn't be used anymore.
+ *
+ * As to why we've chosen a page and move elements around with memmove: it is
+ * a very simple structure to handle and the max data movement is 4K, which on
+ * highly optimized modern CPUs is almost unnoticeable.
+ * We wanted to avoid the pointer traversal of more complex structures like a
+ * linked list or some sort of a balancing search tree.
+ *
+ * Deleting an element takes O(n) but since it is only a single page, it should
+ * be fast enough and it shouldn't happen all too often depending on error
+ * patterns.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) "RAS: " fmt
+
+/*
+ * We use DECAY_BITS of the PAGE_SHIFT bits for counting decay, i.e., how long
+ * elements have stayed in the array without being accessed again.
+ */
+#define DECAY_BITS		2
+#define DECAY_MASK		((1ULL << DECAY_BITS) - 1)
+#define MAX_ELEMS		(PAGE_SIZE / sizeof(u64))
+
+/*
+ * Threshold amount of inserted elements after which we start spring
+ * cleaning.
+ */
+#define CLEAN_ELEMS		(MAX_ELEMS >> DECAY_BITS)
+
+/* Bits which count the number of errors that happened in this 4K page. */
+#define COUNT_BITS		(PAGE_SHIFT - DECAY_BITS)
+#define COUNT_MASK		((1ULL << COUNT_BITS) - 1)
+#define FULL_COUNT_MASK		(PAGE_SIZE - 1)
+
+/*
+ * u64: [ 63 ... 12 | DECAY_BITS | COUNT_BITS ]
+ */
+
+#define PFN(e)			((e) >> PAGE_SHIFT)
+#define DECAY(e)		(((e) >> COUNT_BITS) & DECAY_MASK)
+#define COUNT(e)		((unsigned int)(e) & COUNT_MASK)
+#define FULL_COUNT(e)		((e) & (PAGE_SIZE - 1))
+
+static struct ce_array {
+	u64 *array;		/* container page */
+	unsigned n;		/* number of elements in the array */
+
+	unsigned decay_count;	/*
+				 * number of elements inserted since the last
+				 * spring cleaning.
+				 */
+} ce_arr;
+/* ^^^^^
+ * |
+ * | This variable is passed in internally from the API functions.
+ */
+
+static DEFINE_MUTEX(ce_lock);
+
+/*
+ * Decrement decay value. We're using DECAY_BITS bits to denote decay of an
+ * element in the array. On insertion and any access, it gets maxed out.
+ */
+static void do_spring_cleaning(struct ce_array *ca)
+{
+	int i;
+
+	for (i = 0; i < ca->n; i++) {
+		u8 decay = DECAY(ca->array[i]);
+
+		if (!decay)
+			continue;
+
+		decay--;
+
+		ca->array[i] &= ~(DECAY_MASK << COUNT_BITS);
+		ca->array[i] |= (decay << COUNT_BITS);
+	}
+	ca->decay_count = 0;
+}
+
+/*
+ * @to: index of the smallest element which is >= @pfn.
+ *
+ * Return the index of the pfn if found, otherwise negative value.
+ */
+static int __find_elem(struct ce_array *ca, u64 pfn, unsigned *to)
+{
+	u64 this_pfn;
+	int min = 0, max = ca->n;
+
+	while (min < max) {
+		int tmp = (max + min) >> 1;
+
+		this_pfn = PFN(ca->array[tmp]);
+
+		if (this_pfn < pfn)
+			min = tmp + 1;
+		else if (this_pfn > pfn)
+			max = tmp;
+		else {
+			min = tmp;
+			break;
+		}
+	}
+
+	if (to)
+		*to = min;
+
+	/* Guard against matching a stale slot past the last valid element
+	 * when @pfn is greater than all PFNs currently in the array. */
+	if (min < ca->n && PFN(ca->array[min]) == pfn)
+		return min;
+
+	return -ENOKEY;
+}
+
+static int find_elem(struct ce_array *ca, u64 pfn, unsigned *to)
+{
+	WARN_ON(!to);
+
+	if (!ca->n) {
+		*to = 0;
+		return -ENOKEY;
+	}
+	return __find_elem(ca, pfn, to);
+}
+
+static void __del_elem(struct ce_array *ca, int idx)
+{
+	/*
+	 * Save us a function call when deleting the last element.
+	 */
+	if (ca->n - (idx + 1))
+		memmove((void *)&ca->array[idx],
+			(void *)&ca->array[idx + 1],
+			(ca->n - (idx + 1)) * sizeof(u64));
+
+	ca->n--;
+}
+
+/*
+ * We return the 0th pfn in the error case under the assumption that it cannot
+ * be poisoned and excessive CEs in there are a serious deal anyway.
+ */
+u64 ce_del_lru_elem(void)
+{
+	unsigned int min = FULL_COUNT_MASK;
+	struct ce_array *ca = &ce_arr;
+	int i, min_idx = 0;
+	u64 pfn;
+
+	if (!ca->n)
+		return 0;
+
+	mutex_lock(&ce_lock);
+
+	for (i = 0; i < ca->n; i++) {
+		unsigned int this = FULL_COUNT(ca->array[i]);
+		if (min > this) {
+			min = this;
+			min_idx = i;
+		}
+	}
+
+	pfn = PFN(ca->array[min_idx]);
+
+	__del_elem(ca, min_idx);
+
+	mutex_unlock(&ce_lock);
+
+	return pfn;
+}
+
+int ce_add_elem(u64 pfn)
+{
+	struct ce_array *ca = &ce_arr;
+	unsigned to;
+	int count, ret = 0;
+
+	mutex_lock(&ce_lock);
+
+	if (ca->n == MAX_ELEMS) {
+		ret = -ENOSPC;
+		goto unlock;
+	}
+
+	ret = find_elem(ca, pfn, &to);
+	if (ret < 0) {
+		/*
+		 * Shift range [to-end] to make room for one more element.
+		 */
+		memmove((void *)&ca->array[to + 1],
+			(void *)&ca->array[to],
+			(ca->n - to) * sizeof(u64));
+
+		ca->array[to] = (pfn << PAGE_SHIFT) |
+				(DECAY_MASK << COUNT_BITS) | 1;
+
+		ca->decay_count++;
+		ca->n++;
+
+		if (ca->decay_count >= CLEAN_ELEMS)
+			do_spring_cleaning(ca);
+
+		ret = 0;
+
+		goto unlock;
+	}
+
+	count = COUNT(ca->array[to]);
+
+	if (count < COUNT_MASK) {
+		ca->array[to] |= (DECAY_MASK << COUNT_BITS);
+		ca->array[to]++;
+	} else {
+		u64 pfn = ca->array[to] >> PAGE_SHIFT;
+		/*
+		 * We have reached max count for this page, poison it.
+		 */
+		if (!memory_failure(pfn, MCE_VECTOR, 0))
+			pr_err("Poisoning pfn 0x%llx\n", pfn);
+
+		__del_elem(ca, to);
+
+		ret = 0;
+	}
+
+unlock:
+	mutex_unlock(&ce_lock);
+
+	return ret;
+}
+
+void __init ce_init(void)
+{
+	ce_arr.array = (void *)get_zeroed_page(GFP_KERNEL);
+	if (!ce_arr.array) {
+		pr_err("Error allocating CE array page!\n");
+		return;
+	}
+
+	pr_info("Correctable Errors collector initialized.\n");
+}
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC PATCH 2/3] MCE, CE: Wire in the CE collector
  2014-05-27 13:27 [RFC PATCH 0/3] RAS: Correctable Errors Collector thing Borislav Petkov
  2014-05-27 13:27 ` [RFC PATCH 1/3] MCE, CE: Corrected errors collecting thing Borislav Petkov
@ 2014-05-27 13:27 ` Borislav Petkov
  2014-05-27 17:48   ` Tony Luck
  2014-05-27 13:27 ` [RFC PATCH 3/3] MCE, CE: Add debugging glue Borislav Petkov
  2014-05-28  2:49 ` [RFC PATCH 0/3] RAS: Correctable Errors Collector thing Chen Yucong
  3 siblings, 1 reply; 10+ messages in thread
From: Borislav Petkov @ 2014-05-27 13:27 UTC (permalink / raw)
  To: LKML; +Cc: linux-edac, X86 ML, Tony Luck

From: Borislav Petkov <bp@suse.de>

Add the CE collector to the polling path which collects the correctable
errors. Collect only DRAM ECC errors for now.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/kernel/cpu/mcheck/mce.c | 62 +++++++++++++++++++++++++++++++++++-----
 1 file changed, 55 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 882e79bb0cb6..732fbe61416d 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -579,6 +579,46 @@ static void mce_read_aux(struct mce *m, int i)
 
 DEFINE_PER_CPU(unsigned, mce_poll_count);
 
+static bool is_a_memory_error(struct mce *m)
+{
+	struct cpuinfo_x86 *c = &boot_cpu_data;
+
+	if (c->x86_vendor == X86_VENDOR_AMD) {
+		/* ErrCodeExt[20:16] */
+		u8 xec = (m->status >> 16) & 0x1f;
+
+		return (xec == 0x0 || xec == 0x8);
+	} else if (c->x86_vendor == X86_VENDOR_INTEL)
+		return m->status & BIT(7);
+	else
+		return false;
+}
+
+static void __log_ce(struct mce *m, enum mcp_flags flags)
+{
+	int ret;
+
+	/*
+	 * Don't get the IP here because it's unlikely to have anything to do
+	 * with the actual error location.
+	 */
+	if ((flags & MCP_DONTLOG) || mca_cfg.dont_log_ce)
+		return;
+
+	if (is_a_memory_error(m)) {
+		ret = ce_add_elem(m->addr >> PAGE_SHIFT);
+		if (ret < 0) {
+			u64 pfn = ce_del_lru_elem();
+			if (pfn)
+				mce_ring_add(pfn);
+
+			ce_add_elem(m->addr >> PAGE_SHIFT);
+		}
+	} else
+		mce_log(m);
+}
+
+
 /*
  * Poll for corrected events or events that happened before reset.
  * Those are just logged through /dev/mcelog.
@@ -632,12 +672,8 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 
 		if (!(flags & MCP_TIMESTAMP))
 			m.tsc = 0;
-		/*
-		 * Don't get the IP here because it's unlikely to
-		 * have anything to do with the actual error location.
-		 */
-		if (!(flags & MCP_DONTLOG) && !mca_cfg.dont_log_ce)
-			mce_log(&m);
+
+		__log_ce(&m, flags);
 
 		/*
 		 * Clear state for this bank.
@@ -2530,5 +2566,17 @@ static int __init mcheck_debugfs_init(void)
 
 	return 0;
 }
-late_initcall(mcheck_debugfs_init);
+#else
+static int __init mcheck_debugfs_init(void) { return 0; }
 #endif
+
+static int __init mcheck_late_init(void)
+{
+	if (mcheck_debugfs_init())
+		pr_err("Error creating debugfs nodes!\n");
+
+	ce_init();
+
+	return 0;
+}
+late_initcall(mcheck_late_init);
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC PATCH 3/3] MCE, CE: Add debugging glue
  2014-05-27 13:27 [RFC PATCH 0/3] RAS: Correctable Errors Collector thing Borislav Petkov
  2014-05-27 13:27 ` [RFC PATCH 1/3] MCE, CE: Corrected errors collecting thing Borislav Petkov
  2014-05-27 13:27 ` [RFC PATCH 2/3] MCE, CE: Wire in the CE collector Borislav Petkov
@ 2014-05-27 13:27 ` Borislav Petkov
  2014-05-28  2:49 ` [RFC PATCH 0/3] RAS: Correctable Errors Collector thing Chen Yucong
  3 siblings, 0 replies; 10+ messages in thread
From: Borislav Petkov @ 2014-05-27 13:27 UTC (permalink / raw)
  To: LKML; +Cc: linux-edac, X86 ML, Tony Luck

From: Borislav Petkov <bp@suse.de>

For testing purposes only.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/kernel/cpu/mcheck/ce.c  | 24 ++++++++++++++++++++++++
 arch/x86/kernel/cpu/mcheck/mce.c | 31 ++++++++++++++++++++++++++++++-
 2 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mcheck/ce.c b/arch/x86/kernel/cpu/mcheck/ce.c
index 3020415f27f4..aa2fee325588 100644
--- a/arch/x86/kernel/cpu/mcheck/ce.c
+++ b/arch/x86/kernel/cpu/mcheck/ce.c
@@ -94,6 +94,25 @@ static struct ce_array {
 
 static DEFINE_MUTEX(ce_lock);
 
+static void dump_array(struct ce_array *ca)
+{
+	u64 prev = 0;
+	int i;
+
+	pr_info("{ n: %d\n", ca->n);
+	for (i = 0; i < ca->n; i++) {
+		u64 this = PFN(ca->array[i]);
+
+		pr_info(" %03d: [%016llu|%03llx]\n", i, this, FULL_COUNT(ca->array[i]));
+
+		WARN_ON(prev > this);
+
+		prev = this;
+	}
+
+	pr_info("}\n");
+}
+
 /*
  * Decrement decay value. We're using DECAY_BITS bits to denote decay of an
 * element in the array. On insertion and any access, it gets maxed out.
@@ -202,6 +221,9 @@ u64 ce_del_lru_elem(void)
 
 	pfn = PFN(ca->array[min_idx]);
 
+	pr_err("%s: Deleting ca[%d]: %016llu|%03x\n",
+	       __func__, min_idx, pfn, min);
+
 	__del_elem(ca, min_idx);
 
 	mutex_unlock(&ce_lock);
@@ -263,7 +285,9 @@ int ce_add_elem(u64 pfn)
 		ret = 0;
 	}
 
+
 unlock:
+	dump_array(ca);
 	mutex_unlock(&ce_lock);
 
 	return ret;
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 732fbe61416d..59b726d4e116 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -2552,9 +2552,34 @@ static int fake_panic_set(void *data, u64 val)
 DEFINE_SIMPLE_ATTRIBUTE(fake_panic_fops, fake_panic_get,
 			fake_panic_set, "%llu\n");
 
+static u64 cec_pfn;
+
+static int cec_pfn_get(void *data, u64 *val)
+{
+	*val = cec_pfn;
+	return 0;
+}
+
+static int cec_pfn_set(void *data, u64 val)
+{
+	int ret;
+
+	cec_pfn = val;
+
+	ret = ce_add_elem(val);
+	if (ret < 0) {
+		ce_del_lru_elem();
+		ret = ce_add_elem(val);
+	}
+
+	return ret;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(cec_pfn_ops, cec_pfn_get, cec_pfn_set, "0x%llx\n");
+
 static int __init mcheck_debugfs_init(void)
 {
-	struct dentry *dmce, *ffake_panic;
+	struct dentry *dmce, *ffake_panic, *cec_pfn;
 
 	dmce = mce_get_debugfs_dir();
 	if (!dmce)
@@ -2564,6 +2589,10 @@ static int __init mcheck_debugfs_init(void)
 	if (!ffake_panic)
 		return -ENOMEM;
 
+	cec_pfn = debugfs_create_file("cec_pfn", 0400, dmce, NULL, &cec_pfn_ops);
+	if (!cec_pfn)
+		return -ENOMEM;
+
 	return 0;
 }
 #else
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 2/3] MCE, CE: Wire in the CE collector
  2014-05-27 13:27 ` [RFC PATCH 2/3] MCE, CE: Wire in the CE collector Borislav Petkov
@ 2014-05-27 17:48   ` Tony Luck
  2014-05-27 18:48     ` Borislav Petkov
  0 siblings, 1 reply; 10+ messages in thread
From: Tony Luck @ 2014-05-27 17:48 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: LKML, linux-edac, X86 ML

> +       } else if (c->x86_vendor == X86_VENDOR_INTEL)
> +               return m->status & BIT(7);

Intel compound error codes aren't quite that simple.  You need to look
at the low 16 bits of "status" (the MCACOD) field and see which is the
most significant bit set (ignoring bit 12, the "filter" bit).  If the answer is
bit 7 - then this is a memory error. But you can't just blindly check bit
7 because if bit 8 is set, then this is a cache error, and if bit 11 is set,
then it is a bus/interconnect error - and either way bit 7 just gives more
detail on what cache/bus/interconnect error happened.  In hex the test
you want is:

                  return (m->status & 0xef80) == BIT(7);

[compound error codes make my head hurt too]

>                 if (!(flags & MCP_TIMESTAMP))
>                         m.tsc = 0;
> -               /*
> -                * Don't get the IP here because it's unlikely to
> -                * have anything to do with the actual error location.
> -                */
> -               if (!(flags & MCP_DONTLOG) && !mca_cfg.dont_log_ce)
> -                       mce_log(&m);
> +
> +               __log_ce(&m, flags);

I'm not happy with the total removal of mce_log() here.  Skipping
it and doing *some* filtering in the kernel is OK (goal of this patch
set). But you just cut EDAC and/or EXTLOG out of the reporting path
completely.  If the corrected error count for a page gets too high, your
new code will try to take the page offline ... but we won't have a report
from EDAC/EXTLOG telling us which DIMM that page belonged to.

Perhaps __log_ce() needs a return value to say whether it took action
and then:

                 if (__log_ce(&m, flags) && all-those-other-tests)
                             mce_log(&m);

-Tony

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 2/3] MCE, CE: Wire in the CE collector
  2014-05-27 17:48   ` Tony Luck
@ 2014-05-27 18:48     ` Borislav Petkov
  0 siblings, 0 replies; 10+ messages in thread
From: Borislav Petkov @ 2014-05-27 18:48 UTC (permalink / raw)
  To: Tony Luck; +Cc: LKML, linux-edac, X86 ML

On Tue, May 27, 2014 at 10:48:38AM -0700, Tony Luck wrote:
> > +       } else if (c->x86_vendor == X86_VENDOR_INTEL)
> > +               return m->status & BIT(7);
> 
> Intel compound error codes aren't quite that simple.  You need to look
> at the low 16 bits of "status" (the MCACOD) field and see which is the
> most significant bit set (ignoring bit 12, the "filter" bit).  If the answer is
> bit 7 - then this is a memory error. But you can't just blindly check bit
> 7 because if bit 8 is set, then this is a cache error, and if bit 11 is set,
> then it is a bus/interconnect error - and either way bit 7 just gives more
> detail on what cache/bus/interconnect error happened.  In hex the test
> you want is:
> 
>                   return (m->status & 0xef80) == BIT(7);

Thanks.

> [compound error codes make my head hurt too]
> 
> >                 if (!(flags & MCP_TIMESTAMP))
> >                         m.tsc = 0;
> > -               /*
> > -                * Don't get the IP here because it's unlikely to
> > -                * have anything to do with the actual error location.
> > -                */
> > -               if (!(flags & MCP_DONTLOG) && !mca_cfg.dont_log_ce)
> > -                       mce_log(&m);
> > +
> > +               __log_ce(&m, flags);
> 
> I'm not happy with the total removal of mce_log() here.  Skipping
> it and doing *some* filtering in the kernel is OK (goal of this patch
> set). But you just cut EDAC and/or EXTLOG out of the reporting path
> completely.  If the corrected error count for a page gets too high, your
> new code will try to take the page offline ... but we won't have a report
> from EDAC/EXTLOG telling us which DIMM that page belonged to.
> 
> Perhaps __log_ce() needs a return value to say whether it took action
> and then:
> 
>                  if (__log_ce(&m, flags) && all-those-other-tests)
>                              mce_log(&m);

Right, this still needs a decision on how the collector fits in the
whole picture - this patch was just the bare minimum to get the
discussion going.

The question is: do we want to tell userspace about every CE? Since
we're doing everything ourselves in the kernel, I'd probably not upset
it needlessly and notify it only when we've disabled a memory page.

Concerning EXTLOG, shouldn't that be firmware-first anyway? If so, we
won't have any control over it anyway.

Concerning EDAC, we would want it to go first anyway, in order to do
some decoding and *then* feed the final info to the collector.

In both cases, we probably should have EDAC/EXTLOG run first so that we
can have all the relevant error info, put into the collector and maybe
then mce_log a subset of those errors. Let me draw a bit:


						 A
[ machine_check_poll ] -----> [ EDAC/EXTLOG ] ---+----> [ CE Collector ] .-.>.-. [ mce_log ]
						 |
						 |
						 +----> [ mce_log ]

This is what the pipes could look like :-)

The split at A is to denote that only a subset of the errors should go
into the collector, i.e. ECC errors. After it, we want to decide what is
going to go to the log.

Maybe even add the ability to turn off the in-kernel collector if a
userspace agent runs. Though that would not be such a good idea, as the
kernel one should suffice for most if not all cases.

Anyway, this should get us going towards hammering out the piping.

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 0/3] RAS: Correctable Errors Collector thing
  2014-05-27 13:27 [RFC PATCH 0/3] RAS: Correctable Errors Collector thing Borislav Petkov
                   ` (2 preceding siblings ...)
  2014-05-27 13:27 ` [RFC PATCH 3/3] MCE, CE: Add debugging glue Borislav Petkov
@ 2014-05-28  2:49 ` Chen Yucong
  2014-05-28 16:53   ` Max Asbock
  3 siblings, 1 reply; 10+ messages in thread
From: Chen Yucong @ 2014-05-28  2:49 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: LKML, linux-edac, X86 ML, Tony Luck


> From: Borislav Petkov <bp@suse.de>
> 
> Hi all,
> 
> this is something Tony and I have been working on behind the curtains
> recently. Here it is in a RFC form, it passes quick testing in kvm. Let
> me send it out before I start hammering on it on a real machine.
> 
> More indepth info about what it is and what it does is in patch 1/3.
> 
> As always, comments and suggestions are most welcome.
> 
> Thanks.

What's the point of this patch set?
My understanding is that if there are some (COUNT_MASK) corrected DRAM
ECC errors for a specific page frame, we can believe that the page frame
is so ill that it should be isolated as soon as possible.

The question is: memory_failure cannot be used for isolating a page
frame which is being used by the kernel, because it just poisons the
page and the error is IGNORED. memory_failure is mostly used for
handling AR/AO type errors related to page frames which userspace tasks
are currently using.

Although the affected page frame is very ill, it is not dead and can
still work. However, memory_failure may kill userspace tasks,
especially for those page frames that are holding dynamic data rather
than file-backed (file/swap) data.

So I do not think that it is a good idea to directly use memory_failure
in this patch set.

thx!
cyc



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 0/3] RAS: Correctable Errors Collector thing
  2014-05-28  2:49 ` [RFC PATCH 0/3] RAS: Correctable Errors Collector thing Chen Yucong
@ 2014-05-28 16:53   ` Max Asbock
  2014-05-28 17:21     ` Luck, Tony
  2014-05-28 17:23     ` Borislav Petkov
  0 siblings, 2 replies; 10+ messages in thread
From: Max Asbock @ 2014-05-28 16:53 UTC (permalink / raw)
  To: Chen Yucong; +Cc: Borislav Petkov, LKML, linux-edac, X86 ML, Tony Luck

On Wed, 2014-05-28 at 10:49 +0800, Chen Yucong wrote:
> > From: Borislav Petkov <bp@suse.de>
> > 
> > Hi all,
> > 
> > this is something Tony and I have been working on behind the curtains
> > recently. Here it is in a RFC form, it passes quick testing in kvm. Let
> > me send it out before I start hammering on it on a real machine.
> > 
> > More indepth info about what it is and what it does is in patch 1/3.
> > 
> > As always, comments and suggestions are most welcome.
> > 
> > Thanks.
> 
> What's the point of this patch set?
> My understanding is that if there are some(COUNT_MASK) corrected DRAM
> ECC errors for a specific page frame, we can believe that the page frame
> is so ill that it should be isolated as soon as possible.
> 
> The question is: memory_failure can not be used for isolating the page
> frame which is being used by kernel, because it just poison the page and
> IGNORED. memory_failure is mostly used for handling AR/AO type errors
> related to the page frame which the userspace tasks are using now.
> 
> Although the relative page frame is very ill, it is not dead and can
> still work. However, memory_failure may kill the userspace tasks,
> especially for those page frames that are holding dynamic data rather
> than file-backed(file/swap) data.
> 
> So I do not think that it is a good idea to directly use memory_failure
> in this patch set. 
> 

I second that. You can't poison a page and potentially kill an
application just because an arbitrarily chosen number of corrected
errors has been exceeded. That would be an anti-RAS feature: less
reliability and availability.
A possible alternative would be to soft-offline the page. This is
currently done in APEI code when corrected memory error thresholds are
exceeded and reported by UEFI via a generic hardware error source
(GHES). 
The example is in ghes_handle_memory_failure() where we call
memory_failure_queue(pfn, 0, flags) with flags = MF_SOFT_OFFLINE

- Max


^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [RFC PATCH 0/3] RAS: Correctable Errors Collector thing
  2014-05-28 16:53   ` Max Asbock
@ 2014-05-28 17:21     ` Luck, Tony
  2014-05-28 17:23     ` Borislav Petkov
  1 sibling, 0 replies; 10+ messages in thread
From: Luck, Tony @ 2014-05-28 17:21 UTC (permalink / raw)
  To: masbock@linux.vnet.ibm.com, Chen Yucong
  Cc: Borislav Petkov, LKML, linux-edac, X86 ML

> A possible alternative would be to soft-offline the page. This is
> currently done in APEI code when corrected memory error thresholds are
> exceeded and reported by UEFI via a generic hardware error source
> (GHES). 

+1

This is what the existing mcelog(8) daemon does when it sees an excessive
number of corrected errors on a page (using /sys/devices/system/memory/soft_offline_page
as the user->kernel interface to get to this function).

-Tony

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 0/3] RAS: Correctable Errors Collector thing
  2014-05-28 16:53   ` Max Asbock
  2014-05-28 17:21     ` Luck, Tony
@ 2014-05-28 17:23     ` Borislav Petkov
  1 sibling, 0 replies; 10+ messages in thread
From: Borislav Petkov @ 2014-05-28 17:23 UTC (permalink / raw)
  To: Max Asbock; +Cc: Chen Yucong, LKML, linux-edac, X86 ML, Tony Luck

On Wed, May 28, 2014 at 09:53:35AM -0700, Max Asbock wrote:
> A possible alternative would be to soft-offline the page. This is
> currently done in APEI code when corrected memory error thresholds are
> exceeded and reported by UEFI via a generic hardware error source
> (GHES). 
> The example is in ghes_handle_memory_failure() where we call
> memory_failure_queue(pfn, 0, flags) with flags = MF_SOFT_OFFLINE

Ok, we can try that - it certainly makes sense. We're designing the
plumbing of this thing as we go so fitting it correctly into the whole
reporting/error handling chain is still open. It is RFC anyway so...

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2014-05-28 17:23 UTC | newest]

Thread overview: 10+ messages:
2014-05-27 13:27 [RFC PATCH 0/3] RAS: Correctable Errors Collector thing Borislav Petkov
2014-05-27 13:27 ` [RFC PATCH 1/3] MCE, CE: Corrected errors collecting thing Borislav Petkov
2014-05-27 13:27 ` [RFC PATCH 2/3] MCE, CE: Wire in the CE collector Borislav Petkov
2014-05-27 17:48   ` Tony Luck
2014-05-27 18:48     ` Borislav Petkov
2014-05-27 13:27 ` [RFC PATCH 3/3] MCE, CE: Add debugging glue Borislav Petkov
2014-05-28  2:49 ` [RFC PATCH 0/3] RAS: Correctable Errors Collector thing Chen Yucong
2014-05-28 16:53   ` Max Asbock
2014-05-28 17:21     ` Luck, Tony
2014-05-28 17:23     ` Borislav Petkov
