[PATCH EDAC 0/6] Improvements for ghes

* [PATCH EDAC 0/6] Improvements for ghes_edac
@ 2013-02-20 11:12 Mauro Carvalho Chehab
  2013-02-20 11:12 ` [PATCH EDAC 1/6] ghes_edac: remove GHES_PFX macro Mauro Carvalho Chehab
                   ` (5 more replies)
  0 siblings, 6 replies; 8+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-20 11:12 UTC (permalink / raw)
  Cc: Mauro Carvalho Chehab, Linux Edac Mailing List,
	Linux Kernel Mailing List

This patch series complements the previous one sent:
	http://comments.gmane.org/gmane.linux.kernel/1442178

It contains:

- a patch removing the GHES_PFX macro, as requested by Joe Perches;
- a patch adding copyright notices and a MAINTAINERS entry
  for the new driver;
- a patch, suggested by Borislav, moving the error description into a
  structure;
- 3 patches improving the error report for GHES-driven errors.

The patches were tested on a 4-core machine.

With the patches, a GHES error like this one:

{1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
{1}[Hardware Error]: APEI generic hardware error status
{1}[Hardware Error]: severity: 2, corrected
{1}[Hardware Error]: section: 0, severity: 2, corrected
{1}[Hardware Error]: flags: 0x01
{1}[Hardware Error]: primary
{1}[Hardware Error]: section_type: memory error
{1}[Hardware Error]: error_status: 0x0000000000000400
{1}[Hardware Error]: node: 3
{1}[Hardware Error]: card: 0
{1}[Hardware Error]: module: 1
{1}[Hardware Error]: device: 0
{1}[Hardware Error]: error_type: 18, unknown

was properly mapped to the EDAC printk engine as:

EDAC MC0: 1 CE reserved error (18) on unknown label (node:3 card:0 module:1 page:0x0 offset:0x0 grain:0 syndrome:0x0 - status(0x0000000000000400): Storage error in memory (DRAM))

and to the corresponding RAS trace event:

mc_event: 1 Corrected error: reserved error (18) on unknown label (mc:0 location:-1:-1:-1 address:0x00000000 grain:1 syndrome:0x00000000 APEI location: node:3 card:0 module:1 status(0x0000000000000400): Storage error in memory (DRAM))

Mauro Carvalho Chehab (6):
  ghes_edac: remove GHES_PFX macro
  ghes_edac: add a MAINTAINERS entry and copyrights
  edac: put all arguments for the raw error handling call into a struct
  ghes_edac: Make it compliant with UEFI spec 2.3.1
  edac: add support on ras_event for error type "Info"
  ghes_edac: Fix RAS tracing

 MAINTAINERS              |   7 ++
 drivers/edac/edac_core.h |  16 +--
 drivers/edac/edac_mc.c   | 126 ++++++++++--------------
 drivers/edac/ghes_edac.c | 249 +++++++++++++++++++++++++++++++++++++++++------
 include/linux/edac.h     |  71 ++++++++++++++
 include/ras/ras_event.h  |   4 +-
 6 files changed, 352 insertions(+), 121 deletions(-)

-- 
1.8.1.2



* [PATCH EDAC 1/6] ghes_edac: remove GHES_PFX macro
  2013-02-20 11:12 [PATCH EDAC 0/6] Improvements for ghes_edac Mauro Carvalho Chehab
@ 2013-02-20 11:12 ` Mauro Carvalho Chehab
  2013-02-20 11:12 ` [PATCH EDAC 2/6] ghes_edac: add a MAINTAINERS entry and copyrights Mauro Carvalho Chehab
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-20 11:12 UTC (permalink / raw)
  Cc: Mauro Carvalho Chehab, Linux Edac Mailing List,
	Linux Kernel Mailing List

As suggested by Joe:

On Fri, 15 Feb 2013 08:38:17 -0800
Joe Perches <joe@perches.com> wrote:

	Perhaps these should use
	#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
	and remove GHES_PFX from all the pr_<level>()'s?
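
The mechanism behind this suggestion is plain string-literal
concatenation: the logging macros wrap their format string in pr_fmt(),
so defining pr_fmt() once at the top of the file prefixes every message.
A minimal userspace sketch follows. It assumes a hand-defined
KBUILD_MODNAME (kbuild normally supplies it) and a hypothetical helper,
format_dimm_size_msg(), that uses snprintf() in place of pr_info():

```c
#include <stdio.h>

/* Stand-in for the module name that kbuild defines in a real kernel build. */
#define KBUILD_MODNAME "ghes_edac"

/* Same definition as in the patch; adjacent string literals are merged. */
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

/*
 * Hypothetical helper: formats one of the driver's messages into buf the
 * way pr_info() would, but writes to a buffer instead of the kernel log.
 */
int format_dimm_size_msg(char *buf, size_t size, int dimm)
{
	return snprintf(buf, size, pr_fmt("Can't get DIMM%i size\n"), dimm);
}
```

With this in place the call sites no longer need to spell out the prefix,
which is what lets the patch drop GHES_PFX from every pr_info().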

Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 drivers/edac/ghes_edac.c | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 94d5286..565c516 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -1,14 +1,16 @@
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include <acpi/ghes.h>
 #include <linux/edac.h>
 #include <linux/dmi.h>
 #include "edac_core.h"
 
-#define GHES_PFX   "ghes_edac: "
 #define GHES_EDAC_REVISION " Ver: 1.0.0"
 
 static DEFINE_MUTEX(ghes_edac_lock);
 static int ghes_edac_mc_num;
 
+
 /* Memory Device - Type 17 of SMBIOS spec */
 struct memdev_dmi_entry {
 	u8 type;
@@ -80,7 +82,7 @@ static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg)
 						       dimm_fill->count, 0, 0);
 
 		if (entry->size == 0xffff) {
-			pr_info(GHES_PFX "Can't get DIMM%i size\n",
+			pr_info("Can't get DIMM%i size\n",
 				dimm_fill->count);
 			dimm->nr_pages = MiB_TO_PAGES(32);/* Unknown */
 		} else if (entry->size == 0x7fff) {
@@ -228,7 +230,7 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
 	mutex_lock(&ghes_edac_lock);
 	mci = edac_mc_alloc(ghes_edac_mc_num, ARRAY_SIZE(layers), layers, 0);
 	if (!mci) {
-		pr_info(GHES_PFX "Can't allocate memory for EDAC data\n");
+		pr_info("Can't allocate memory for EDAC data\n");
 		mutex_unlock(&ghes_edac_lock);
 		return -ENOMEM;
 	}
@@ -246,17 +248,17 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
 
 	if (!ghes_edac_mc_num) {
 		if (!fake) {
-			pr_info(GHES_PFX "This EDAC driver relies on BIOS to enumerate memory and get error reports.\n");
-			pr_info(GHES_PFX "Unfortunately, not all BIOSes reflect the memory layout correctly.\n");
-			pr_info(GHES_PFX "So, the end result of using this driver varies from vendor to vendor.\n");
-			pr_info(GHES_PFX "If you find incorrect reports, please contact your hardware vendor\n");
-			pr_info(GHES_PFX "to correct its BIOS.\n");
-			pr_info(GHES_PFX "This system has %d DIMM sockets.\n",
+			pr_info("This EDAC driver relies on BIOS to enumerate memory and get error reports.\n");
+			pr_info("Unfortunately, not all BIOSes reflect the memory layout correctly.\n");
+			pr_info("So, the end result of using this driver varies from vendor to vendor.\n");
+			pr_info("If you find incorrect reports, please contact your hardware vendor\n");
+			pr_info("to correct its BIOS.\n");
+			pr_info("This system has %d DIMM sockets.\n",
 				num_dimm);
 		} else {
-			pr_info(GHES_PFX "This system has a very crappy BIOS: It doesn't even list the DIMMS.\n");
-			pr_info(GHES_PFX "Its SMBIOS info is wrong. It is doubtful that the error report would\n");
-			pr_info(GHES_PFX "work on such system. Use this driver with caution\n");
+			pr_info("This system has a very crappy BIOS: It doesn't even list the DIMMS.\n");
+			pr_info("Its SMBIOS info is wrong. It is doubtful that the error report would\n");
+			pr_info("work on such system. Use this driver with caution\n");
 		}
 	}
 
@@ -287,7 +289,7 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
 
 	rc = edac_mc_add_mc(mci);
 	if (rc < 0) {
-		pr_info(GHES_PFX "Can't register at EDAC core\n");
+		pr_info("Can't register at EDAC core\n");
 		edac_mc_free(mci);
 		mutex_unlock(&ghes_edac_lock);
 		return -ENODEV;
-- 
1.8.1.2



* [PATCH EDAC 2/6] ghes_edac: add a MAINTAINERS entry and copyrights
  2013-02-20 11:12 [PATCH EDAC 0/6] Improvements for ghes_edac Mauro Carvalho Chehab
  2013-02-20 11:12 ` [PATCH EDAC 1/6] ghes_edac: remove GHES_PFX macro Mauro Carvalho Chehab
@ 2013-02-20 11:12 ` Mauro Carvalho Chehab
  2013-02-20 11:12 ` [PATCH EDAC 3/6] edac: put all arguments for the raw error handling call into a struct Mauro Carvalho Chehab
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-20 11:12 UTC (permalink / raw)
  Cc: Mauro Carvalho Chehab, Linux Edac Mailing List,
	Linux Kernel Mailing List

Add the driver to the MAINTAINERS file and fill in the driver's
copyright information.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 MAINTAINERS              |  7 +++++++
 drivers/edac/ghes_edac.c | 11 +++++++++++
 2 files changed, 18 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 35a56bc..889644d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2803,6 +2803,13 @@ W:	bluesmoke.sourceforge.net
 S:	Maintained
 F:	drivers/edac/e7xxx_edac.c
 
+EDAC-GHES
+M:	Mauro Carvalho Chehab <mchehab@redhat.com>
+L:	linux-edac@vger.kernel.org
+W:	bluesmoke.sourceforge.net
+S:	Maintained
+F:	drivers/edac/ghes_edac.c
+
 EDAC-I82443BXGX
 M:	Tim Small <tim@buttersideup.com>
 L:	linux-edac@vger.kernel.org
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 565c516..ef54829 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -1,3 +1,14 @@
+/*
+ * GHES/EDAC Linux driver
+ *
+ * This file may be distributed under the terms of the GNU General Public
+ * License version 2.
+ *
+ * Copyright (c) 2013 by Mauro Carvalho Chehab <mchehab@redhat.com>
+ *
+ * Red Hat Inc. http://www.redhat.com
+ */
+
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
 #include <acpi/ghes.h>
-- 
1.8.1.2



* [PATCH EDAC 3/6] edac: put all arguments for the raw error handling call into a struct
  2013-02-20 11:12 [PATCH EDAC 0/6] Improvements for ghes_edac Mauro Carvalho Chehab
  2013-02-20 11:12 ` [PATCH EDAC 1/6] ghes_edac: remove GHES_PFX macro Mauro Carvalho Chehab
  2013-02-20 11:12 ` [PATCH EDAC 2/6] ghes_edac: add a MAINTAINERS entry and copyrights Mauro Carvalho Chehab
@ 2013-02-20 11:12 ` Mauro Carvalho Chehab
  2013-02-20 11:20   ` Borislav Petkov
  2013-02-20 11:12 ` [PATCH EDAC 4/6] ghes_edac: Make it compliant with UEFI spec 2.3.1 Mauro Carvalho Chehab
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 8+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-20 11:12 UTC (permalink / raw)
  Cc: Mauro Carvalho Chehab, Linux Edac Mailing List,
	Linux Kernel Mailing List

edac_raw_mc_handle_error() takes too many arguments; put them into a
structure and embed that structure in struct mem_ctl_info, so that its
space is allocated by edac_mc_alloc().

That greatly reduces stack usage and simplifies the raw API call.
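
The pattern can be sketched outside the kernel as follows. This is only
an illustration under stated assumptions: struct raw_error_desc and
struct mem_ctl are trimmed-down, hypothetical analogues of
struct edac_raw_error_desc and struct mem_ctl_info, LABEL_SIZE is an
arbitrary value, and the handler formats into a caller-supplied buffer
instead of calling the EDAC core:

```c
#include <stdio.h>
#include <string.h>

#define LOCATION_SIZE 80
#define LABEL_SIZE 64	/* arbitrary for this sketch */

/* Hypothetical, reduced analogue of struct edac_raw_error_desc. */
struct raw_error_desc {
	char location[LOCATION_SIZE];
	char label[LABEL_SIZE];
	long grain;
	unsigned short error_count;
	unsigned long page_frame_number;
	unsigned long offset_in_page;
	unsigned long syndrome;
	const char *msg;
};

/*
 * Hypothetical analogue of struct mem_ctl_info: the descriptor is a
 * member, so it is allocated together with the controller object and
 * the large strings never live on the stack.
 */
struct mem_ctl {
	int mc_idx;
	struct raw_error_desc error_desc;
};

/* The raw handler takes one descriptor pointer instead of many scalars. */
int raw_handle_error(const struct mem_ctl *mci,
		     const struct raw_error_desc *e,
		     char *out, size_t size)
{
	return snprintf(out, size,
			"MC%d: %u CE %s on %s (%s page:0x%lx offset:0x%lx grain:%ld syndrome:0x%lx)",
			mci->mc_idx, (unsigned int)e->error_count, e->msg,
			e->label, e->location, e->page_frame_number,
			e->offset_in_page, e->grain, e->syndrome);
}

/* Caller side: reset the preallocated descriptor, then fill what is known. */
int report_error(struct mem_ctl *mci, unsigned long page,
		 char *out, size_t size)
{
	struct raw_error_desc *e = &mci->error_desc;

	memset(e, 0, sizeof(*e));
	e->error_count = 1;
	e->page_frame_number = page;
	e->msg = "memory read error";
	strcpy(e->label, "unknown label");
	strcpy(e->location, "channel:0 slot:0");

	return raw_handle_error(mci, e, out, size);
}
```

One trade-off worth keeping in mind: with a single descriptor per
controller, two reports for the same controller cannot be built at the
same time unless the callers are serialized somehow.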

Tested with the sb_edac driver and MCE error injection; it worked as expected:

[  143.066100] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
[  143.086424] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
[  143.106570] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
[  143.126712] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 drivers/edac/edac_core.h |  16 +------
 drivers/edac/edac_mc.c   | 120 +++++++++++++++++++----------------------------
 drivers/edac/ghes_edac.c |  26 +++++-----
 include/linux/edac.h     |  56 ++++++++++++++++++++++
 4 files changed, 122 insertions(+), 96 deletions(-)

diff --git a/drivers/edac/edac_core.h b/drivers/edac/edac_core.h
index 9c5da11..3c2625e 100644
--- a/drivers/edac/edac_core.h
+++ b/drivers/edac/edac_core.h
@@ -455,20 +455,8 @@ extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
 				      unsigned long page);
 
 void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
-			  struct mem_ctl_info *mci,
-			  long grain,
-			  const u16 error_count,
-			  const int top_layer,
-			  const int mid_layer,
-			  const int low_layer,
-			  const unsigned long page_frame_number,
-			  const unsigned long offset_in_page,
-			  const unsigned long syndrome,
-			  const char *msg,
-			  const char *location,
-			  const char *label,
-			  const char *other_detail,
-			  const bool enable_per_layer_report);
+			      struct mem_ctl_info *mci,
+			      struct edac_raw_error_desc *e);
 
 void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			  struct mem_ctl_info *mci,
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 8fddf65..e436565 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -1066,78 +1066,49 @@ static void edac_ue_error(struct mem_ctl_info *mci,
 	edac_inc_ue_error(mci, enable_per_layer_report, pos, error_count);
 }
 
-#define OTHER_LABEL " or "
-
 /**
  * edac_raw_mc_handle_error - reports a memory event to userspace without doing
  *			      anything to discover the error location
  *
  * @type:		severity of the error (CE/UE/Fatal)
  * @mci:		a struct mem_ctl_info pointer
- * @grain:		error granularity
- * @error_count:	Number of errors of the same type
- * @top_layer:		Memory layer[0] position
- * @mid_layer:		Memory layer[1] position
- * @low_layer:		Memory layer[2] position
- * @page_frame_number:	mem page where the error occurred
- * @offset_in_page:	offset of the error inside the page
- * @syndrome:		ECC syndrome
- * @msg:		Message meaningful to the end users that
- *			explains the event\
- * @location:		location of the error, like "csrow:0 channel:1"
- * @label:		DIMM labels for the affected memory(ies)
- * @other_detail:	Technical details about the event that
- *			may help hardware manufacturers and
- *			EDAC developers to analyse the event
- * @enable_per_layer_report: should it increment per-layer error counts?
+ * @e:			error description
  *
  * This raw function is used internally by edac_mc_handle_error(). It should
  * only be called directly when the hardware error come directly from BIOS,
  * like in the case of APEI GHES driver.
  */
 void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
-			  struct mem_ctl_info *mci,
-			  long grain,
-			  const u16 error_count,
-			  const int top_layer,
-			  const int mid_layer,
-			  const int low_layer,
-			  const unsigned long page_frame_number,
-			  const unsigned long offset_in_page,
-			  const unsigned long syndrome,
-			  const char *msg,
-			  const char *location,
-			  const char *label,
-			  const char *other_detail,
-			  const bool enable_per_layer_report)
+			      struct mem_ctl_info *mci,
+			      struct edac_raw_error_desc *e)
 {
 	char detail[80];
 	u8 grain_bits;
-	int pos[EDAC_MAX_LAYERS] = { top_layer, mid_layer, low_layer };
+	int pos[EDAC_MAX_LAYERS] = { e->top_layer, e->mid_layer, e->low_layer };
 
 	/* Report the error via the trace interface */
-	grain_bits = fls_long(grain) + 1;
-	trace_mc_event(type, msg, label, error_count,
-		       mci->mc_idx, top_layer, mid_layer, low_layer,
-		       PAGES_TO_MiB(page_frame_number) | offset_in_page,
-		       grain_bits, syndrome, other_detail);
+	grain_bits = fls_long(e->grain) + 1;
+	trace_mc_event(type, e->msg, e->label, e->error_count,
+		       mci->mc_idx, e->top_layer, e->mid_layer, e->low_layer,
+		       PAGES_TO_MiB(e->page_frame_number) | e->offset_in_page,
+		       grain_bits, e->syndrome, e->other_detail);
 
 	/* Memory type dependent details about the error */
 	if (type == HW_EVENT_ERR_CORRECTED) {
 		snprintf(detail, sizeof(detail),
 			"page:0x%lx offset:0x%lx grain:%ld syndrome:0x%lx",
-			page_frame_number, offset_in_page,
-			grain, syndrome);
-		edac_ce_error(mci, error_count, pos, msg, location, label,
-			      detail, other_detail, enable_per_layer_report,
-			      page_frame_number, offset_in_page, grain);
+			e->page_frame_number, e->offset_in_page,
+			e->grain, e->syndrome);
+		edac_ce_error(mci, e->error_count, pos, e->msg, e->location, e->label,
+			      detail, e->other_detail, e->enable_per_layer_report,
+			      e->page_frame_number, e->offset_in_page, e->grain);
 	} else {
 		snprintf(detail, sizeof(detail),
 			"page:0x%lx offset:0x%lx grain:%ld",
-			page_frame_number, offset_in_page, grain);
+			e->page_frame_number, e->offset_in_page, e->grain);
 
-		edac_ue_error(mci, error_count, pos, msg, location, label,
-			      detail, other_detail, enable_per_layer_report);
+		edac_ue_error(mci, e->error_count, pos, e->msg, e->location, e->label,
+			      detail, e->other_detail, e->enable_per_layer_report);
 	}
 
 
@@ -1174,18 +1145,26 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			  const char *msg,
 			  const char *other_detail)
 {
-	/* FIXME: too much for stack: move it to some pre-alocated area */
-	char location[80];
-	char label[(EDAC_MC_LABEL_LEN + 1 + sizeof(OTHER_LABEL)) * mci->tot_dimms];
 	char *p;
 	int row = -1, chan = -1;
 	int pos[EDAC_MAX_LAYERS] = { top_layer, mid_layer, low_layer };
-	int i;
-	long grain;
-	bool enable_per_layer_report = false;
+	int i, n_labels = 0;
+	struct edac_raw_error_desc *e = &mci->error_desc;
 
 	edac_dbg(3, "MC%d\n", mci->mc_idx);
 
+	/* Fills the error report buffer */
+	memset(e, 0, sizeof (*e));
+	e->error_count = error_count;
+	e->top_layer = top_layer;
+	e->mid_layer = mid_layer;
+	e->low_layer = low_layer;
+	e->page_frame_number = page_frame_number;
+	e->offset_in_page = offset_in_page;
+	e->syndrome = syndrome;
+	e->msg = msg;
+	e->other_detail = other_detail;
+
 	/*
 	 * Check if the event report is consistent and if the memory
 	 * location is known. If it is known, enable_per_layer_report will be
@@ -1208,7 +1187,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			pos[i] = -1;
 		}
 		if (pos[i] >= 0)
-			enable_per_layer_report = true;
+			e->enable_per_layer_report = true;
 	}
 
 	/*
@@ -1222,8 +1201,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 	 * where each memory belongs to a separate channel within the same
 	 * branch.
 	 */
-	grain = 0;
-	p = label;
+	p = e->label;
 	*p = '\0';
 
 	for (i = 0; i < mci->tot_dimms; i++) {
@@ -1237,8 +1215,8 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			continue;
 
 		/* get the max grain, over the error match range */
-		if (dimm->grain > grain)
-			grain = dimm->grain;
+		if (dimm->grain > e->grain)
+			e->grain = dimm->grain;
 
 		/*
 		 * If the error is memory-controller wide, there's no need to
@@ -1246,8 +1224,13 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 		 * channel/memory controller/...  may be affected.
 		 * Also, don't show errors for empty DIMM slots.
 		 */
-		if (enable_per_layer_report && dimm->nr_pages) {
-			if (p != label) {
+		if (e->enable_per_layer_report && dimm->nr_pages) {
+			if (n_labels >= EDAC_MAX_LABELS) {
+				e->enable_per_layer_report = false;
+				break;
+			}
+			n_labels++;
+			if (p != e->label) {
 				strcpy(p, OTHER_LABEL);
 				p += strlen(OTHER_LABEL);
 			}
@@ -1274,12 +1257,12 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 		}
 	}
 
-	if (!enable_per_layer_report) {
-		strcpy(label, "any memory");
+	if (!e->enable_per_layer_report) {
+		strcpy(e->label, "any memory");
 	} else {
 		edac_dbg(4, "csrow/channel to increment: (%d,%d)\n", row, chan);
-		if (p == label)
-			strcpy(label, "unknown memory");
+		if (p == e->label)
+			strcpy(e->label, "unknown memory");
 		if (type == HW_EVENT_ERR_CORRECTED) {
 			if (row >= 0) {
 				mci->csrows[row]->ce_count += error_count;
@@ -1292,7 +1275,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 	}
 
 	/* Fill the RAM location data */
-	p = location;
+	p = e->location;
 
 	for (i = 0; i < mci->n_layers; i++) {
 		if (pos[i] < 0)
@@ -1302,14 +1285,9 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 			     edac_layer_name[mci->layers[i].type],
 			     pos[i]);
 	}
-	if (p > location)
+	if (p > e->location)
 		*(p - 1) = '\0';
 
-	edac_raw_mc_handle_error(type, mci, grain, error_count,
-				 top_layer, mid_layer, low_layer,
-				 page_frame_number, offset_in_page,
-				 syndrome,
-				 msg, location, label, other_detail,
-				 enable_per_layer_report);
+	edac_raw_mc_handle_error(type, mci, e);
 }
 EXPORT_SYMBOL_GPL(edac_mc_handle_error);
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index ef54829..9d7f797 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -175,15 +175,20 @@ static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg)
 void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 			        struct cper_sec_mem_err *mem_err)
 {
+	struct edac_raw_error_desc *e = &ghes->mci->error_desc;
 	enum hw_event_mc_err_type type;
-	unsigned long page = 0, offset = 0, grain = 0;
-	char location[80];
-	char *label = "unknown";
+
+	/* Cleans the error report buffer */
+	memset(e, 0, sizeof (*e));
+	e->error_count = 1;
+	e->msg = "APEI";
+	strcpy(e->label, "unknown");
+	e->other_detail = "";
 
 	if (mem_err->validation_bits & CPER_MEM_VALID_PHYSICAL_ADDRESS) {
-		page = mem_err->physical_addr >> PAGE_SHIFT;
-		offset = mem_err->physical_addr & ~PAGE_MASK;
-		grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
+		e->page_frame_number = mem_err->physical_addr >> PAGE_SHIFT;
+		e->offset_in_page = mem_err->physical_addr & ~PAGE_MASK;
+		e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
 	}
 
 	switch(sev) {
@@ -201,15 +206,14 @@ void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 		type = HW_EVENT_ERR_INFO;
 	}
 
-	sprintf(location,"node:%d card:%d module:%d bank:%d device:%d row: %d column:%d bit_pos:%d",
+	sprintf(e->location,
+		"node:%d card:%d module:%d bank:%d device:%d row: %d column:%d bit_pos:%d",
 		mem_err->node, mem_err->card, mem_err->module,
 		mem_err->bank, mem_err->device, mem_err->row, mem_err->column,
 		mem_err->bit_pos);
-	edac_dbg(3, "error at location %s\n", location);
+	edac_dbg(3, "error at location %s\n", e->location);
 
-	edac_raw_mc_handle_error(type, ghes->mci, grain, 1, 0, 0, 0,
-				 page, offset, 0,
-				 "APEI", location, label, "", 0);
+	edac_raw_mc_handle_error(type, ghes->mci, e);
 }
 EXPORT_SYMBOL_GPL(ghes_edac_report_mem_error);
 
diff --git a/include/linux/edac.h b/include/linux/edac.h
index bd14f5c..1cd4472 100644
--- a/include/linux/edac.h
+++ b/include/linux/edac.h
@@ -47,8 +47,18 @@ static inline void opstate_init(void)
 	return;
 }
 
+/* Max length of a DIMM label*/
 #define EDAC_MC_LABEL_LEN	31
 
+/* Maximum size of the location string */
+#define LOCATION_SIZE 80
+
+/* Defines the maximum number of labels that can be reported */
+#define EDAC_MAX_LABELS		8
+
+/* String used to join two or more labels */
+#define OTHER_LABEL " or "
+
 /**
  * enum dev_type - describe the type of memory DRAM chips used at the stick
  * @DEV_UNKNOWN:	Can't be determined, or MC doesn't support detect it
@@ -554,6 +564,46 @@ struct errcount_attribute_data {
 	int layer0, layer1, layer2;
 };
 
+/**
+ * edac_raw_error_desc - Raw error report structure
+ * @grain:			minimum granularity for an error report, in bytes
+ * @error_count:		number of errors of the same type
+ * @top_layer:			top layer of the error (layer[0])
+ * @mid_layer:			middle layer of the error (layer[1])
+ * @low_layer:			low layer of the error (layer[2])
+ * @page_frame_number:		page where the error happened
+ * @offset_in_page:		page offset
+ * @syndrome:			syndrome of the error (or 0 if unknown or if
+ * 				the syndrome is not applicable)
+ * @msg:			error message
+ * @location:			location of the error
+ * @label:			label of the affected DIMM(s)
+ * @other_detail:		other driver-specific detail about the error
+ * @enable_per_layer_report:	if false, the error affects all layers
+ *				(typically, a memory controller error)
+ */
+struct edac_raw_error_desc {
+	/*
+	 * NOTE: everything before grain won't be cleaned by
+	 * edac_raw_error_desc_clean()
+	 */
+	char location[LOCATION_SIZE];
+	char label[(EDAC_MC_LABEL_LEN + 1 + sizeof(OTHER_LABEL)) * EDAC_MAX_LABELS];
+	long grain;
+
+	/* the vars below and grain will be cleaned on every new error report */
+	u16 error_count;
+	int top_layer;
+	int mid_layer;
+	int low_layer;
+	unsigned long page_frame_number;
+	unsigned long offset_in_page;
+	unsigned long syndrome;
+	const char *msg;
+	const char *other_detail;
+	bool enable_per_layer_report;
+};
+
 /* MEMORY controller information structure
  */
 struct mem_ctl_info {
@@ -661,6 +711,12 @@ struct mem_ctl_info {
 	/* work struct for this MC */
 	struct delayed_work work;
 
+	/*
+	 * Used to report an error - by being at the global struct
+	 * makes the memory allocated by the EDAC core
+	 */
+	struct edac_raw_error_desc error_desc;
+
 	/* the internal state of this controller instance */
 	int op_state;
 
-- 
1.8.1.2



* [PATCH EDAC 4/6] ghes_edac: Make it compliant with UEFI spec 2.3.1
  2013-02-20 11:12 [PATCH EDAC 0/6] Improvements for ghes_edac Mauro Carvalho Chehab
                   ` (2 preceding siblings ...)
  2013-02-20 11:12 ` [PATCH EDAC 3/6] edac: put all arguments for the raw error handling call into a struct Mauro Carvalho Chehab
@ 2013-02-20 11:12 ` Mauro Carvalho Chehab
  2013-02-20 11:12 ` [PATCH EDAC 5/6] edac: add support on ras_event for error type "Info" Mauro Carvalho Chehab
  2013-02-20 11:12 ` [PATCH EDAC 6/6] ghes_edac: Fix RAS tracing Mauro Carvalho Chehab
  5 siblings, 0 replies; 8+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-20 11:12 UTC (permalink / raw)
  Cc: Mauro Carvalho Chehab, Linux Edac Mailing List,
	Linux Kernel Mailing List

The UEFI spec defines the memory error types and the bits that
validate each field of the memory error record in Appendix N,
items N.2.5 (Memory Error Section) and N.2.11 (Error Status).
Make the error description compliant with it, showing only the
valid fields.
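
The "only valid fields" rule can be sketched in userspace as below. The
bit assignments (VALID_*), struct mem_err and build_location() are
hypothetical names made up for this illustration, not the kernel's
CPER_MEM_VALID_* definitions, and the sketch uses snprintf() with bounds
checks where the patch uses plain sprintf():

```c
#include <stdio.h>
#include <stdint.h>

/* Illustrative bit assignments; not copied from the kernel's CPER header. */
#define VALID_NODE	(1u << 0)
#define VALID_CARD	(1u << 1)
#define VALID_MODULE	(1u << 2)
#define VALID_BANK	(1u << 3)

/* Hypothetical, reduced stand-in for struct cper_sec_mem_err. */
struct mem_err {
	uint64_t validation_bits;
	uint16_t node, card, module, bank;
};

/* Append "name:val " at offset len without overrunning buf. */
static size_t append_field(char *buf, size_t size, size_t len,
			   const char *name, unsigned int val)
{
	int n;

	if (len >= size)
		return len;
	n = snprintf(buf + len, size - len, "%s:%u ", name, val);
	if (n < 0)
		return len;
	if ((size_t)n >= size - len)
		return size - 1;	/* output was truncated */
	return len + (size_t)n;
}

/* Emit only the fields whose validation bit is set; returns the length. */
size_t build_location(const struct mem_err *m, char *buf, size_t size)
{
	size_t len = 0;

	if (size == 0)
		return 0;
	buf[0] = '\0';

	if (m->validation_bits & VALID_NODE)
		len = append_field(buf, size, len, "node", m->node);
	if (m->validation_bits & VALID_CARD)
		len = append_field(buf, size, len, "card", m->card);
	if (m->validation_bits & VALID_MODULE)
		len = append_field(buf, size, len, "module", m->module);
	if (m->validation_bits & VALID_BANK)
		len = append_field(buf, size, len, "bank", m->bank);

	/* Drop the trailing separator, like *(p - 1) = '\0' in the patch. */
	if (len > 0 && buf[len - 1] == ' ')
		buf[--len] = '\0';
	return len;
}
```

A record with only the node, card and module bits set yields
"node:3 card:0 module:1" style output, with the invalid bank field left
out entirely rather than printed as a misleading zero.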

The EDAC error log now reports the error properly:

[   55.058218] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[   55.067450] {1}[Hardware Error]: APEI generic hardware error status
[   55.074445] {1}[Hardware Error]: severity: 2, corrected
[   55.080284] {1}[Hardware Error]: section: 0, severity: 2, corrected
[   55.087287] {1}[Hardware Error]: flags: 0x01
[   55.092081] {1}[Hardware Error]: primary
[   55.096463] {1}[Hardware Error]: section_type: memory error
[   55.102707] {1}[Hardware Error]: error_status: 0x0000000000000400
[   55.109520] {1}[Hardware Error]: physical_address: 0x0000000809f56000
[   55.116721] {1}[Hardware Error]: node: 0
[   55.121125] {1}[Hardware Error]: card: 0
[   55.125508] {1}[Hardware Error]: module: 0
[   55.130127] {1}[Hardware Error]: device: 0
[   55.134724] {1}[Hardware Error]: error_type: 18, unknown
[   55.140699] EDAC MC0: 1 CE reserved error (18) on unknown label (node:0 card:0 module:0 page:0x809f56 offset:0x0 grain:0 syndrome:0x0 - status(0x0000000000000400): Storage error in memory (DRAM))

Tested on a 4-CPU E5-4650 Sandy Bridge machine.
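
The numbers in the log above can be checked by hand. The sketch below
redoes the arithmetic in userspace, assuming 4 KiB pages
(SKETCH_PAGE_SHIFT is a stand-in for the kernel's PAGE_SHIFT on that
machine); the three helper names are hypothetical:

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Page frame number: the address with the in-page bits shifted out. */
uint64_t addr_to_page(uint64_t phys_addr)
{
	return phys_addr >> SKETCH_PAGE_SHIFT;
}

/* Offset inside the page: the low SKETCH_PAGE_SHIFT bits of the address. */
uint64_t addr_to_offset(uint64_t phys_addr)
{
	return phys_addr & ((UINT64_C(1) << SKETCH_PAGE_SHIFT) - 1);
}

/* Error-type code: bits 8..15 of the status word, as in the patch's switch. */
unsigned int status_to_code(uint64_t status)
{
	return (unsigned int)((status >> 8) & 0xff);
}
```

For physical_address 0x0000000809f56000 this gives page 0x809f56 and
offset 0x0, and error_status 0x0000000000000400 gives code 4, which the
switch in this patch maps to "Storage error in memory (DRAM)".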

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 drivers/edac/ghes_edac.c | 188 +++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 173 insertions(+), 15 deletions(-)

diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 9d7f797..41db89a 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -177,19 +177,19 @@ void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 {
 	struct edac_raw_error_desc *e = &ghes->mci->error_desc;
 	enum hw_event_mc_err_type type;
+	char other_detail[160] = "";
+	char msg[80] = "";
+	char *p;
 
 	/* Cleans the error report buffer */
 	memset(e, 0, sizeof (*e));
 	e->error_count = 1;
-	e->msg = "APEI";
-	strcpy(e->label, "unknown");
-	e->other_detail = "";
-
-	if (mem_err->validation_bits & CPER_MEM_VALID_PHYSICAL_ADDRESS) {
-		e->page_frame_number = mem_err->physical_addr >> PAGE_SHIFT;
-		e->offset_in_page = mem_err->physical_addr & ~PAGE_MASK;
-		e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
-	}
+	strcpy(e->label, "unknown label");
+	e->msg = msg;
+	e->other_detail = other_detail;
+	e->top_layer = -1;
+	e->mid_layer = -1;
+	e->low_layer = -1;
 
 	switch(sev) {
 	case GHES_SEV_CORRECTED:
@@ -206,12 +206,170 @@ void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 		type = HW_EVENT_ERR_INFO;
 	}
 
-	sprintf(e->location,
-		"node:%d card:%d module:%d bank:%d device:%d row: %d column:%d bit_pos:%d",
-		mem_err->node, mem_err->card, mem_err->module,
-		mem_err->bank, mem_err->device, mem_err->row, mem_err->column,
-		mem_err->bit_pos);
-	edac_dbg(3, "error at location %s\n", e->location);
+	/* Error type, mapped on e->msg */
+	if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE) {
+		p = msg;
+		switch (mem_err->error_type) {
+		case 0:
+			p += sprintf(p, "Unknown");
+			break;
+		case 1:
+			p += sprintf(p, "No error");
+			break;
+		case 2:
+			p += sprintf(p, "Single-bit ECC");
+			break;
+		case 3:
+			p += sprintf(p, "Multi-bit ECC");
+			break;
+		case 4:
+			p += sprintf(p, "Single-symbol ChipKill ECC");
+			break;
+		case 5:
+			p += sprintf(p, "Multi-symbol ChipKill ECC");
+			break;
+		case 6:
+			p += sprintf(p, "Master abort");
+			break;
+		case 7:
+			p += sprintf(p, "Target abort");
+			break;
+		case 8:
+			p += sprintf(p, "Parity Error");
+			break;
+		case 9:
+			p += sprintf(p, "Watchdog timeout");
+			break;
+		case 10:
+			p += sprintf(p, "Invalid address");
+			break;
+		case 11:
+			p += sprintf(p, "Mirror Broken");
+			break;
+		case 12:
+			p += sprintf(p, "Memory Sparing");
+			break;
+		case 13:
+			p += sprintf(p, "Scrub corrected error");
+			break;
+		case 14:
+			p += sprintf(p, "Scrub uncorrected error");
+			break;
+		case 15:
+			p += sprintf(p, "Physical Memory Map-out event");
+			break;
+		default:
+			p += sprintf(p, "reserved error (%d)",
+				     mem_err->error_type);
+		}
+	} else {
+		strcpy(msg, "unknown error");
+	}
+
+	/* Error address */
+	if (mem_err->validation_bits & CPER_MEM_VALID_PHYSICAL_ADDRESS) {
+		e->page_frame_number = mem_err->physical_addr >> PAGE_SHIFT;
+		e->offset_in_page = mem_err->physical_addr & ~PAGE_MASK;
+	}
+
+	/* Error grain */
+	if (mem_err->validation_bits & CPER_MEM_VALID_PHYSICAL_ADDRESS_MASK) {
+		e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
+	}
+
+	/* Memory error location, mapped on e->location */
+	p = e->location;
+	if (mem_err->validation_bits & CPER_MEM_VALID_NODE)
+		p += sprintf(p, "node:%d ", mem_err->node);
+	if (mem_err->validation_bits & CPER_MEM_VALID_CARD)
+		p += sprintf(p, "card:%d ", mem_err->card);
+	if (mem_err->validation_bits & CPER_MEM_VALID_MODULE)
+		p += sprintf(p, "module:%d ", mem_err->module);
+	if (mem_err->validation_bits & CPER_MEM_VALID_BANK)
+		p += sprintf(p, "bank:%d ", mem_err->bank);
+	if (mem_err->validation_bits & CPER_MEM_VALID_ROW)
+		p += sprintf(p, "row:%d ", mem_err->row);
+	if (mem_err->validation_bits & CPER_MEM_VALID_COLUMN)
+		p += sprintf(p, "col:%d ", mem_err->column);
+	if (mem_err->validation_bits & CPER_MEM_VALID_BIT_POSITION)
+		p += sprintf(p, "bit_pos:%d ", mem_err->bit_pos);
+	if (p > e->location)
+		*(p - 1) = '\0';
+
+	/* All other fields are mapped on e->other_detail */
+	p = other_detail;
+	if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_STATUS) {
+		u64 status = mem_err->error_status;
+
+		p += sprintf(p, "status(0x%016llx): ", (long long)status);
+		switch ((status >> 8) & 0xff) {
+		case 1:
+			p += sprintf(p, "Error detected internal to the component ");
+			break;
+		case 16:
+			p += sprintf(p, "Error detected in the bus ");
+			break;
+		case 4:
+			p += sprintf(p, "Storage error in memory (DRAM) ");
+			break;
+		case 5:
+			p += sprintf(p, "Storage error in TLB ");
+			break;
+		case 6:
+			p += sprintf(p, "Storage error in cache ");
+			break;
+		case 7:
+			p += sprintf(p, "Error in one or more functional units ");
+			break;
+		case 8:
+			p += sprintf(p, "component failed self test ");
+			break;
+		case 9:
+			p += sprintf(p, "Overflow or undervalue of internal queue ");
+			break;
+		case 17:
+			p += sprintf(p, "Virtual address not found on IO-TLB or IO-PDIR ");
+			break;
+		case 18:
+			p += sprintf(p, "Improper access error ");
+			break;
+		case 19:
+			p += sprintf(p, "Access to a memory address which is not mapped to any component ");
+			break;
+		case 20:
+			p += sprintf(p, "Loss of Lockstep ");
+			break;
+		case 21:
+			p += sprintf(p, "Response not associated with a request ");
+			break;
+		case 22:
+			p += sprintf(p, "Bus parity error (must also set the A, C, or D Bits) ");
+			break;
+		case 23:
+			p += sprintf(p, "Detection of a PATH_ERROR ");
+			break;
+		case 25:
+			p += sprintf(p, "Bus operation timeout ");
+			break;
+		case 26:
+			p += sprintf(p, "A read was issued to data that has been poisoned ");
+			break;
+		default:
+			p += sprintf(p, "reserved ");
+			break;
+		}
+	}
+	if (mem_err->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
+		p += sprintf(p, "requestor ID: 0x%016llx ",
+			     (long long)mem_err->requestor_id);
+	if (mem_err->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
+		p += sprintf(p, "responder ID: 0x%016llx ",
+			     (long long)mem_err->responder_id);
+	if (mem_err->validation_bits & CPER_MEM_VALID_TARGET_ID)
+		p += sprintf(p, "target ID: 0x%016llx ",
+			     (long long)mem_err->target_id);
+	if (p > other_detail)
+		*(p - 1) = '\0';
 
 	edac_raw_mc_handle_error(type, ghes->mci, e);
 }
-- 
1.8.1.2
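
The hunk above decodes the error type from bits 15:8 of error_status and builds the location string only from the fields whose validation bit is set. A minimal user-space sketch of both steps follows. Every sketch_* name and the SKETCH_VALID_* bit values are hypothetical stand-ins, not the kernel's CPER_MEM_VALID_* definitions, and the bounded snprintf() is a choice made for this sketch where the patch uses plain sprintf().

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical validation bits; the values are illustrative only. */
#define SKETCH_VALID_NODE	(1u << 0)
#define SKETCH_VALID_CARD	(1u << 1)
#define SKETCH_VALID_MODULE	(1u << 2)

/* Hypothetical, reduced stand-in for the CPER memory error section. */
struct sketch_mem_err {
	uint64_t validation_bits;
	unsigned int node, card, module;
};

/* Extract the error-type byte (bits 15:8), as the switch in the patch does. */
static unsigned int sketch_status_type(uint64_t status)
{
	return (status >> 8) & 0xff;
}

/* Append "name:value " at offset off without overrunning buf; return the new offset. */
static size_t sketch_append(char *buf, size_t size, size_t off,
			    const char *name, unsigned int val)
{
	int n;

	if (off >= size)
		return off;
	n = snprintf(buf + off, size - off, "%s:%u ", name, val);
	if (n < 0)
		return off;
	if ((size_t)n >= size - off)
		return size - 1;	/* output was truncated */
	return off + (size_t)n;
}

/* Build the location string from the valid fields and drop the trailing space. */
static void sketch_location(const struct sketch_mem_err *m, char *buf, size_t size)
{
	size_t off = 0;

	if (size == 0)
		return;
	buf[0] = '\0';
	if (m->validation_bits & SKETCH_VALID_NODE)
		off = sketch_append(buf, size, off, "node", m->node);
	if (m->validation_bits & SKETCH_VALID_CARD)
		off = sketch_append(buf, size, off, "card", m->card);
	if (m->validation_bits & SKETCH_VALID_MODULE)
		off = sketch_append(buf, size, off, "module", m->module);
	if (off > 0 && buf[off - 1] == ' ')
		buf[off - 1] = '\0';
}
```

With the values from the cover letter (node 3, card 0, module 1, error_status 0x0000000000000400), this sketch yields the type byte 4, which the switch maps to "Storage error in memory (DRAM)", and the location string "node:3 card:0 module:1".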


* [PATCH EDAC 5/6] edac: add support on ras_event for error type "Info"
  2013-02-20 11:12 [PATCH EDAC 0/6] Improvements for ghes_edac Mauro Carvalho Chehab
                   ` (3 preceding siblings ...)
  2013-02-20 11:12 ` [PATCH EDAC 4/6] ghes_edac: Make it compliant with UEFI spec 2.3.1 Mauro Carvalho Chehab
@ 2013-02-20 11:12 ` Mauro Carvalho Chehab
  2013-02-20 11:12 ` [PATCH EDAC 6/6] ghes_edac: Fix RAS tracing Mauro Carvalho Chehab
  5 siblings, 0 replies; 8+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-20 11:12 UTC (permalink / raw)
  Cc: Mauro Carvalho Chehab, Linux Edac Mailing List,
	Linux Kernel Mailing List

The CPER spec defines a fourth type of error: informational
logs. Add support for it to the trace event interface.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 include/linux/edac.h    | 15 +++++++++++++++
 include/ras/ras_event.h |  4 +---
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/include/linux/edac.h b/include/linux/edac.h
index 1cd4472..4fd4999 100644
--- a/include/linux/edac.h
+++ b/include/linux/edac.h
@@ -112,6 +112,21 @@ enum hw_event_mc_err_type {
 	HW_EVENT_ERR_INFO,
 };
 
+static inline char *mc_event_error_type(const unsigned int err_type)
+{
+	switch (err_type) {
+	case HW_EVENT_ERR_CORRECTED:
+		return "Corrected";
+	case HW_EVENT_ERR_UNCORRECTED:
+		return "Uncorrected";
+	case HW_EVENT_ERR_FATAL:
+		return "Fatal";
+	default:
+	case HW_EVENT_ERR_INFO:
+		return "Info";
+	}
+}
+
 /**
  * enum mem_type - memory types. For a more detailed reference, please see
  *			http://en.wikipedia.org/wiki/DRAM
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index 260470e..21cdb0b 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -78,9 +78,7 @@ TRACE_EVENT(mc_event,
 
 	TP_printk("%d %s error%s:%s%s on %s (mc:%d location:%d:%d:%d address:0x%08lx grain:%d syndrome:0x%08lx%s%s)",
 		  __entry->error_count,
-		  (__entry->error_type == HW_EVENT_ERR_CORRECTED) ? "Corrected" :
-			((__entry->error_type == HW_EVENT_ERR_FATAL) ?
-			"Fatal" : "Uncorrected"),
+		  mc_event_error_type(__entry->error_type),
 		  __entry->error_count > 1 ? "s" : "",
 		  ((char *)__get_str(msg))[0] ? " " : "",
 		  __get_str(msg),
-- 
1.8.1.2
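
The helper added above replaces the nested ternary in TP_printk with a switch that also covers the informational type. A minimal user-space sketch of the same mapping follows. The sketch_* names and the enum are hypothetical stand-ins for enum hw_event_mc_err_type, and returning const char * is a choice made for this sketch, not what the patch does.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for enum hw_event_mc_err_type. */
enum sketch_err_type {
	SKETCH_ERR_CORRECTED,
	SKETCH_ERR_UNCORRECTED,
	SKETCH_ERR_FATAL,
	SKETCH_ERR_INFO,
};

/* Map an error type to its label; unknown values fall back to "Info". */
static const char *sketch_error_type(unsigned int err_type)
{
	switch (err_type) {
	case SKETCH_ERR_CORRECTED:
		return "Corrected";
	case SKETCH_ERR_UNCORRECTED:
		return "Uncorrected";
	case SKETCH_ERR_FATAL:
		return "Fatal";
	case SKETCH_ERR_INFO:
	default:
		return "Info";
	}
}
```

As in the patch, any value outside the four known types is reported as "Info", whereas the old ternary reported everything that was neither corrected nor fatal as "Uncorrected".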


* [PATCH EDAC 6/6] ghes_edac: Fix RAS tracing
  2013-02-20 11:12 [PATCH EDAC 0/6] Improvements for ghes_edac Mauro Carvalho Chehab
                   ` (4 preceding siblings ...)
  2013-02-20 11:12 ` [PATCH EDAC 5/6] edac: add support on ras_event for error type "Info" Mauro Carvalho Chehab
@ 2013-02-20 11:12 ` Mauro Carvalho Chehab
  5 siblings, 0 replies; 8+ messages in thread
From: Mauro Carvalho Chehab @ 2013-02-20 11:12 UTC (permalink / raw)
  Cc: Mauro Carvalho Chehab, Linux Edac Mailing List,
	Linux Kernel Mailing List

With the current version of CPER, there's no way to associate an
error with the memory error. So, the error location in EDAC
layers is unused.

As CPER has its own idea about memory architectural layers, just
output whatever is there inside the driver's detail at the RAS
tracepoint.

The EDAC location is left untouched, in case we can map the error
to the DIMM labels at some point in the future.

Now, the error message:

[   61.562475] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[   61.562477] {1}[Hardware Error]: APEI generic hardware error status
[   61.562479] {1}[Hardware Error]: severity: 2, corrected
[   61.562481] {1}[Hardware Error]: section: 0, severity: 2, corrected
[   61.562483] {1}[Hardware Error]: flags: 0x01
[   61.562485] {1}[Hardware Error]: primary
[   61.562486] {1}[Hardware Error]: section_type: memory error
[   61.562488] {1}[Hardware Error]: error_status: 0x0000000000000400
[   61.562489] {1}[Hardware Error]: node: 3
[   61.562490] {1}[Hardware Error]: card: 0
[   61.562491] {1}[Hardware Error]: module: 1
[   61.562492] {1}[Hardware Error]: device: 0
[   61.562493] {1}[Hardware Error]: error_type: 18, unknown
[   61.562518] EDAC MC0: 1 CE reserved error (18) on unknown label (node:3 card:0 module:1 page:0x0 offset:0x0 grain:0 syndrome:0x0 - status(0x0000000000000400): Storage error in memory (DRAM))

Is properly represented in the trace event:

mc_event: 1 Corrected error: reserved error (18) on unknown label (mc:0 location:-1:-1:-1 address:0x00000000 grain:1 syndrome:0x00000000 APEI location: node:3 card:0 module:1 status(0x0000000000000400): Storage error in memory (DRAM))

Tested on a 4-socket E5-4650 Sandy Bridge machine.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
 drivers/edac/edac_mc.c   | 16 ++++++++--------
 drivers/edac/ghes_edac.c | 16 +++++++++++++++-
 2 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index e436565..8d89bc0 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -1083,16 +1083,8 @@ void edac_raw_mc_handle_error(const enum hw_event_mc_err_type type,
 			      struct edac_raw_error_desc *e)
 {
 	char detail[80];
-	u8 grain_bits;
 	int pos[EDAC_MAX_LAYERS] = { e->top_layer, e->mid_layer, e->low_layer };
 
-	/* Report the error via the trace interface */
-	grain_bits = fls_long(e->grain) + 1;
-	trace_mc_event(type, e->msg, e->label, e->error_count,
-		       mci->mc_idx, e->top_layer, e->mid_layer, e->low_layer,
-		       PAGES_TO_MiB(e->page_frame_number) | e->offset_in_page,
-		       grain_bits, e->syndrome, e->other_detail);
-
 	/* Memory type dependent details about the error */
 	if (type == HW_EVENT_ERR_CORRECTED) {
 		snprintf(detail, sizeof(detail),
@@ -1149,6 +1141,7 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 	int row = -1, chan = -1;
 	int pos[EDAC_MAX_LAYERS] = { top_layer, mid_layer, low_layer };
 	int i, n_labels = 0;
+	u8 grain_bits;
 	struct edac_raw_error_desc *e = &mci->error_desc;
 
 	edac_dbg(3, "MC%d\n", mci->mc_idx);
@@ -1288,6 +1281,13 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
 	if (p > e->location)
 		*(p - 1) = '\0';
 
+	/* Report the error via the trace interface */
+	grain_bits = fls_long(e->grain) + 1;
+	trace_mc_event(type, e->msg, e->label, e->error_count,
+		       mci->mc_idx, e->top_layer, e->mid_layer, e->low_layer,
+		       PAGES_TO_MiB(e->page_frame_number) | e->offset_in_page,
+		       grain_bits, e->syndrome, e->other_detail);
+
 	edac_raw_mc_handle_error(type, mci, e);
 }
 EXPORT_SYMBOL_GPL(edac_mc_handle_error);
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 41db89a..2126aab 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -15,6 +15,7 @@
 #include <linux/edac.h>
 #include <linux/dmi.h>
 #include "edac_core.h"
+#include <ras/ras_event.h>
 
 #define GHES_EDAC_REVISION " Ver: 1.0.0"
 
@@ -177,9 +178,11 @@ void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 {
 	struct edac_raw_error_desc *e = &ghes->mci->error_desc;
 	enum hw_event_mc_err_type type;
+	char detail_location[240];
 	char other_detail[160] = "";
-	char msg[80] = "";
+	char msg[40] = "";
 	char *p;
+	u8 grain_bits;
 
 	/* Cleans the error report buffer */
 	memset(e, 0, sizeof (*e));
@@ -371,6 +374,17 @@ void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
 	if (p > other_detail)
 		*(p - 1) = '\0';
 
+	/* Generate the trace event */
+	grain_bits = fls_long(e->grain);
+	sprintf(detail_location, "APEI location: %s %s",
+		e->location, e->other_detail);
+	trace_mc_event(type, e->msg, e->label, e->error_count,
+		       ghes->mci->mc_idx,
+		       e->top_layer, e->mid_layer, e->low_layer,
+		       PAGES_TO_MiB(e->page_frame_number) | e->offset_in_page,
+		       grain_bits, e->syndrome, detail_location);
+
+	/* Report the error via EDAC API */
 	edac_raw_mc_handle_error(type, ghes->mci, e);
 }
 EXPORT_SYMBOL_GPL(ghes_edac_report_mem_error);
-- 
1.8.1.2
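
The last hunk composes the tracepoint detail string with sprintf() into a 240-byte buffer and derives grain_bits with fls_long(). A minimal user-space sketch of both steps follows. The sketch_* names are hypothetical, sketch_fls_long() only mimics the kernel's 1-based "find last set" convention, and the bounded snprintf() is a choice made for this sketch, not what the patch does.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* 1-based index of the highest set bit; 0 when no bit is set. */
static int sketch_fls_long(unsigned long x)
{
	int bits = 0;

	while (x) {
		bits++;
		x >>= 1;
	}
	return bits;
}

/*
 * Compose "APEI location: <location> <other detail>" into dst.
 * Returns 1 if the whole string fit, 0 if it was truncated or failed.
 */
static int sketch_detail_location(char *dst, size_t size,
				  const char *location, const char *other)
{
	int n = snprintf(dst, size, "APEI location: %s %s", location, other);

	return n >= 0 && (size_t)n < size;
}
```

For a grain of 32 this sketch returns 6. Note that the two call sites in the diff differ: edac_mc_handle_error() passes fls_long(e->grain) + 1 to the tracepoint, while ghes_edac_report_mem_error() passes fls_long(e->grain) without the increment.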


* Re: [PATCH EDAC 3/6] edac: put all arguments for the raw error handling call into a struct
  2013-02-20 11:12 ` [PATCH EDAC 3/6] edac: put all arguments for the raw error handling call into a struct Mauro Carvalho Chehab
@ 2013-02-20 11:20   ` Borislav Petkov
  0 siblings, 0 replies; 8+ messages in thread
From: Borislav Petkov @ 2013-02-20 11:20 UTC (permalink / raw)
  To: Mauro Carvalho Chehab; +Cc: Linux Edac Mailing List, Linux Kernel Mailing List

On Wed, Feb 20, 2013 at 08:12:49AM -0300, Mauro Carvalho Chehab wrote:
> The number of arguments for edac_raw_mc_handle_error() is too big;
> put them into a structure and allocate space for it inside
> edac_mc_alloc().
> 
> That reduces a lot the stack usage and simplifies the raw API call.
> 
> Tested with sb_edac driver and MCE error injection. Worked as expected:
> 
> [  143.066100] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
> [  143.086424] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
> [  143.106570] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
> [  143.126712] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x320 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>

Looks ok.

Acked-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
