From: Mauro Carvalho Chehab <mchehab@redhat.com>
To: Borislav Petkov <bp@amd64.org>
Cc: Tony Luck <tony.luck@intel.com>, Ingo Molnar <mingo@elte.hu>,
EDAC devel <linux-edac@vger.kernel.org>,
LKML <linux-kernel@vger.kernel.org>
Subject: [PATCHv5] EDAC core changes in order to properly report errors from all types of memory controllers
Date: Mon, 05 Mar 2012 19:00:04 -0300
Message-ID: <4F553764.5070305@redhat.com>
In-Reply-To: <4F54D4AF.9060802@redhat.com>
This is the 5th version of my patch series. It seemed too big to
send all those emails to the LKML/edac mailing lists for the 5th
time, so, instead, I'll point to the git tree where they're held.
I'm testing the entire patchset extensively with several different edac
drivers, so the biggest changes in this series are the bug-fix patches.
Besides that, there are a few other differences:
- the struct channel_info doesn't represent a channel. Its contents
represent a memory rank. So, it is now called "rank_info";
- a FIXME note was added as a reminder that, currently, the new "dimm_info"
structure represents a "rank" if the memory is addressed via csrows;
- when "dimm_info" represents a rank, its sysfs nodes are
called "rank" instead of "dimm";
- an agreement has not been reached yet on the MCA-based tracepoint. So, I've
removed it from the patch series;
- the out_of_range tracepoint got removed. Instead, a parse error will
generate only a printk message.
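
As a rough illustration of the rank/dimm naming rule in the list above, here is a minimal userspace sketch. The helper names (grain_prefix, grain_node_name) and the "<prefix><index>" node format are hypothetical and are not taken from the patch series; the real logic lives in drivers/edac/edac_mc_sysfs.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch series): pick the sysfs node
 * prefix for the smallest memory unit the driver can report.
 * When memory is addressed via csrows, that unit is a rank, not a DIMM.
 */
static const char *grain_prefix(bool addressed_by_csrow)
{
	return addressed_by_csrow ? "rank" : "dimm";
}

/*
 * Format "<prefix><index>" into buf. The node-name layout is an
 * assumption made for this sketch. Returns the snprintf() result.
 */
static int grain_node_name(char *buf, size_t len,
			   bool addressed_by_csrow, unsigned int index)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%s%u",
			grain_prefix(addressed_by_csrow), index);
}
```

A caller would pass true for csrow-based controllers and false for controllers that can identify a real DIMM.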
With those changes, there's just one tracepoint defined in this patchset:
http://git.kernel.org/?p=linux/kernel/git/mchehab/linux-edac.git;a=commit;h=fdfa64045e43c942e1250708365d9240cd0da9c3
The following changes since commit 805a6af8dba5dfdd35ec35dc52ec0122400b2610:
Linux 3.2 (2012-01-04 15:55:44 -0800)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac.git hw_events_v5
Mauro Carvalho Chehab (43):
edac/ppc4xx_edac: Fix compilation
edac: Better describe the memory concepts
drivers/edac: rename channel_info to rank_info
edac: Create a dimm struct and move the labels into it
edac: Add per dimm's sysfs nodes
edac: Prepare to push down to drivers the filling of the memset_info
i5400_edac: Convert it to report memory with the new location
i7300_edac: Convert it to report memory with the new location
edac: move dimm properties to struct memset_info
edac: Don't initialize csrow's first_page & friends when not needed
edac: move nr_pages to dimm struct
edac: Add per-dimm sysfs show nodes
edac: DIMM location cleanup
edac-mc: Allow reporting errors on a non-csrow oriented way
edac.h: Use kernel-doc-nano-HOWTO.txt notation for enums
edac: rework memory layer hierarchy description
edac: Export MC hierarchy counters for CE and UE
edac: Cleanup the logs for i7core and sb edac drivers
edac_mc: Some clenups at the log message
edac: Add a sysfs node to test the EDAC error report facility
edac_mc: Fix the enable label filter logic
edac: Initialize the dimm label with the known information
edac: don't OOPS if the csrow is not visible
edac: Fix sysfs csrow?/*ce*count counters
edac: Fix new error counts
edac: Fix per layer error count counters
edac: i5400: Fix DIMM memory filling
edac_mc: Improve the labels parsing
edac: Fix module removal logic
edac_mc_sysfs: don't create inactive errcount sysfs nodes
edac_mc: Fixes the logic that fills the dimms
i5400_edac: Avoid calling pci_put_device() twice
i5400_edac: Better represent the memory controller hierarchy
edac: fill the location with something useful if the DIMM is not found
edac: be sure to use the GET_POS macro to get memset_info struct
amd64_edac: remove a duplicated call to edac_mc_handle_error()
i5000_edac: Fix the logic that retrieves memory information
i5100_edac: Fix the logic
edac: add a sysfs node that stores the max possible memory location
edac: Call the minimum grain node as "rank" if chip select is used
i7300_edac: fixup
Fix memory error count
events/hw_event: Create a Hardware Events Report Mecanism (HERM)
drivers/edac/amd64_edac.c | 210 +++++++------
drivers/edac/amd64_edac_dbg.c | 6 +-
drivers/edac/amd64_edac_inj.c | 24 +-
drivers/edac/amd76x_edac.c | 44 ++-
drivers/edac/cell_edac.c | 42 ++-
drivers/edac/cpc925_edac.c | 93 +++--
drivers/edac/e752x_edac.c | 94 ++++--
drivers/edac/e7xxx_edac.c | 88 ++++--
drivers/edac/edac_core.h | 48 +--
drivers/edac/edac_device.c | 27 +-
drivers/edac/edac_mc.c | 700 ++++++++++++++++++++++++---------------
drivers/edac/edac_mc_sysfs.c | 625 ++++++++++++++++++++++++++++++++---
drivers/edac/edac_module.h | 2 +-
drivers/edac/edac_pci.c | 7 +-
drivers/edac/i3000_edac.c | 51 ++-
drivers/edac/i3200_edac.c | 57 ++--
drivers/edac/i5000_edac.c | 225 +++++++------
drivers/edac/i5100_edac.c | 105 +++---
drivers/edac/i5400_edac.c | 318 ++++++++++--------
drivers/edac/i7300_edac.c | 117 +++----
drivers/edac/i7core_edac.c | 267 +++++-----------
drivers/edac/i82443bxgx_edac.c | 43 ++-
drivers/edac/i82860_edac.c | 57 +++-
drivers/edac/i82875p_edac.c | 53 ++-
drivers/edac/i82975x_edac.c | 58 +++-
drivers/edac/mpc85xx_edac.c | 45 ++-
drivers/edac/mv64x60_edac.c | 47 ++-
drivers/edac/pasemi_edac.c | 51 ++--
drivers/edac/ppc4xx_edac.c | 62 ++--
drivers/edac/r82600_edac.c | 42 ++-
drivers/edac/sb_edac.c | 203 +++++-------
drivers/edac/tile_edac.c | 33 ++-
drivers/edac/x38_edac.c | 54 ++--
include/linux/edac.h | 454 ++++++++++++++++++++-----
include/trace/events/hw_event.h | 107 ++++++
35 files changed, 2856 insertions(+), 1603 deletions(-)
create mode 100644 include/trace/events/hw_event.h
-
When an agreement regarding the MCA-based tracepoint is reached, a simple
patch like the one below would be enough to use a separate tracepoint on the
x86 architecture when MCA is enabled and the error comes from it.
diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index ea7eb9a..348a396 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -1898,7 +1898,7 @@ static void amd64_handle_ce(struct mem_ctl_info *mci, struct mce *m)
-1, -1, -1,
EDAC_MOD_STR,
"HW has no ERROR_ADDRESS available",
- NULL);
+ m);
return;
}
@@ -1927,7 +1927,7 @@ static void amd64_handle_ue(struct mem_ctl_info *mci, struct mce *m)
-1, -1, -1,
EDAC_MOD_STR,
"HW has no ERROR_ADDRESS available",
- NULL);
+ m);
return;
}
@@ -1946,7 +1946,7 @@ static void amd64_handle_ue(struct mem_ctl_info *mci, struct mce *m)
page, offset, 0,
-1, -1, -1,
EDAC_MOD_STR,
- "ERROR ADDRESS NOT mapped to a MC", NULL);
+ "ERROR ADDRESS NOT mapped to a MC", m);
return;
}
@@ -1961,12 +1961,12 @@ static void amd64_handle_ue(struct mem_ctl_info *mci, struct mce *m)
-1, -1, -1,
EDAC_MOD_STR,
"ERROR ADDRESS NOT mapped to CS",
- NULL);
+ m);
} else {
edac_mc_handle_error(HW_EVENT_ERR_UNCORRECTED, mci,
page, offset, 0,
csrow, -1, -1,
- EDAC_MOD_STR, "", NULL);
+ EDAC_MOD_STR, "", m);
}
}
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index eb73ddc..dfd24d3 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -1055,8 +1055,17 @@ void edac_mc_handle_error(const enum hw_event_mc_err_type type,
"page 0x%lx offset 0x%lx grain %d",
page_frame_number, offset_in_page, grain);
+#ifdef CONFIG_X86
+ if (arch_log)
+ trace_mc_error_mce(type, mci->mc_idx, msg, label, location,
+ detail, other_detail, arch_log);
+ else
+ trace_mc_error(type, mci->mc_idx, msg, label, location,
+ detail, other_detail);
+#else
trace_mc_error(type, mci->mc_idx, msg, label, location,
detail, other_detail);
+#endif
if (type == HW_EVENT_ERR_CORRECTED) {
if (edac_mc_get_log_ce())
diff --git a/include/trace/events/hw_event.h b/include/trace/events/hw_event.h
index 9209c6b..76f4dd5 100644
--- a/include/trace/events/hw_event.h
+++ b/include/trace/events/hw_event.h
@@ -101,6 +101,110 @@ TRACE_EVENT(mc_error,
__get_str(driver_detail))
);
+/*
+ * X86 arch-specific events
+ */
+
+#ifdef CONFIG_X86
+#include <asm/mce.h>
+
+/*
+ * MCE event for memory-controller errors
+ */
+
+/*
+ * NOTE: due to trace constraints, we can't have mc_error_mce in the
+ * same file as mce_record, as they're used by different files. Including
+ * trace headers twice causes duplicated symbols. So, care is needed to
+ * sync changes here with changes at include/trace/events/mce.h.
+ */
+
+TRACE_EVENT(mc_error_mce,
+
+ TP_PROTO(const unsigned int err_type,
+ const unsigned int mc_index,
+ const char *msg,
+ const char *label,
+ const char *location,
+ const char *detail,
+ const char *driver_detail,
+ const struct mce *m),
+
+ TP_ARGS(err_type, mc_index, msg, label, location,
+ detail, driver_detail, m),
+
+ TP_STRUCT__entry(
+ __field( unsigned int, err_type )
+ __field( unsigned int, mc_index )
+ __string( msg, msg )
+ __string( label, label )
+ __string( detail, detail )
+ __string( location, location )
+ __string( driver_detail, driver_detail )
+ __field( u64, mcgcap )
+ __field( u64, mcgstatus )
+ __field( u64, status )
+ __field( u64, addr )
+ __field( u64, misc )
+ __field( u64, ip )
+ __field( u64, tsc )
+ __field( u64, walltime )
+ __field( u32, cpu )
+ __field( u32, cpuid )
+ __field( u32, apicid )
+ __field( u32, socketid )
+ __field( u8, cs )
+ __field( u8, bank )
+ __field( u8, cpuvendor )
+ ),
+
+ TP_fast_assign(
+ __entry->err_type = err_type;
+ __entry->mc_index = mc_index;
+ __assign_str(msg, msg);
+ __assign_str(label, label);
+ __assign_str(location, location);
+ __assign_str(detail, detail);
+ __assign_str(driver_detail, driver_detail);
+ __entry->mcgcap = m->mcgcap;
+ __entry->mcgstatus = m->mcgstatus;
+ __entry->status = m->status;
+ __entry->addr = m->addr;
+ __entry->misc = m->misc;
+ __entry->ip = m->ip;
+ __entry->tsc = m->tsc;
+ __entry->walltime = m->time;
+ __entry->cpu = m->extcpu;
+ __entry->cpuid = m->cpuid;
+ __entry->apicid = m->apicid;
+ __entry->socketid = m->socketid;
+ __entry->cs = m->cs;
+ __entry->bank = m->bank;
+ __entry->cpuvendor = m->cpuvendor;
+ ),
+
+ TP_printk("mce#%d: %s error %s on label \"%s\" (%s %s CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x %s)",
+ __entry->mc_index,
+ (__entry->err_type == HW_EVENT_ERR_CORRECTED) ? "Corrected" :
+ ((__entry->err_type == HW_EVENT_ERR_FATAL) ?
+ "Fatal" : "Uncorrected"),
+ __get_str(msg),
+ __get_str(label),
+ __get_str(location),
+ __get_str(detail),
+ __entry->cpu,
+ __entry->mcgcap, __entry->mcgstatus,
+ __entry->bank, __entry->status,
+ __entry->addr, __entry->misc,
+ __entry->cs, __entry->ip,
+ __entry->tsc,
+ __entry->cpuvendor, __entry->cpuid,
+ __entry->walltime,
+ __entry->socketid,
+ __entry->apicid,
+ __get_str(driver_detail))
+);
+
+#endif /* CONFIG_X86 */
+
#endif /* _TRACE_HW_EVENT_MC_H */
/* This part must be outside protection */
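
The dispatch added to edac_mc_handle_error() above can be sketched in userspace as follows. This is only an illustration under stated assumptions: struct mce_sketch, enum err_type_sketch, err_type_name() and report_error() are hypothetical stand-ins, and snprintf() takes the place of the trace_mc_error()/trace_mc_error_mce() tracepoints. The point is that a non-NULL arch_log selects the MCE-enriched report, while NULL falls back to the generic one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for two of the struct mce fields; the real one is in <asm/mce.h>. */
struct mce_sketch {
	unsigned int bank;
	unsigned long long status;
};

/* Hypothetical error types; the numeric values are assumptions. */
enum err_type_sketch { ERR_CORRECTED, ERR_UNCORRECTED, ERR_FATAL };

/* Same mapping as the ternary chain in TP_printk above. */
static const char *err_type_name(enum err_type_sketch t)
{
	return (t == ERR_CORRECTED) ? "Corrected" :
	       (t == ERR_FATAL) ? "Fatal" : "Uncorrected";
}

/*
 * Format an error report into buf. If arch_log is non-NULL it is
 * treated as MCE data and appended; otherwise only the generic part
 * is written. Returns the snprintf() result.
 */
static int report_error(char *buf, size_t len, enum err_type_sketch type,
			const char *msg, const void *arch_log)
{
	if (arch_log) {
		const struct mce_sketch *m = arch_log;

		return snprintf(buf, len, "%s error %s (MC%u: %016llx)",
				err_type_name(type), msg, m->bank, m->status);
	}
	return snprintf(buf, len, "%s error %s", err_type_name(type), msg);
}
```

Drivers that have no MCE record simply keep passing NULL, so they stay on the generic path.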