* perf, ftrace and MCEs
@ 2010-05-01 18:12 Borislav Petkov
2010-05-03 14:41 ` Steven Rostedt
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Borislav Petkov @ 2010-05-01 18:12 UTC (permalink / raw)
To: Ingo Molnar, Frederic Weisbecker, Steven Rostedt, Peter Zijlstra; +Cc: lkml
Hi,
so I finally had some spare time to stare at the perf/ftrace code and ponder
how to use those facilities for MCE collecting and reporting. Btw, I
have to say, it took me quite a while to understand what goes where - my
suggestion to anyone trying to understand how perf/ftrace works is
to run make <file.i> on a file containing at least one trace_XXX
record-emitting function call and start untangling the code paths from there.
So anyway, here are some questions I had; I may just as well have missed
something, so please correct me if I'm wrong:
1. Since machine checks can happen at any time, we need to have the
MCE tracepoint (trace_mce_record() in <include/trace/events/mce.h>)
always enabled. This, in turn, means that we need the ftrace/perf
infrastructure always compiled in (lockless ring buffer, perf_event.c
stuff) on any x86 system so that MCEs can be handled at any time. Is it
going to be ok to have this enabled on _all_ machines? Hmmm... I dunno,
maybe only a subset of those facilities at least.
2. Tangential to 1., we need that "thin" software layer prepared for
decoding and reporting them as early as possible. event_trace_init() is
an fs_initcall and is executed too late, IMHO. The ->perf_event_enable in
the ftrace_event_call is enabled even later, on the perf init path via
sys_perf_event_open, which happens only once userspace is up. In our case,
this is going to be executed by the error logging and decoding daemon, I guess.
3. Since we want to listen for MCEs all the time, the concept of
enabling and disabling those events does not apply as it does in
performance profiling. IOW, MCEs must be loggable to the
ring buffer at any time. I guess this is easily done - we simply enable
MCE events at the earliest moment possible and disable them on shutdown;
done.
So yeah, some food for thought but what do you guys think?
Thanks.
--
Regards/Gruss,
Boris.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: perf, ftrace and MCEs
2010-05-01 18:12 perf, ftrace and MCEs Borislav Petkov
@ 2010-05-03 14:41 ` Steven Rostedt
2010-05-03 21:20 ` Borislav Petkov
2010-05-04 10:15 ` Andi Kleen
2010-05-04 11:32 ` Ingo Molnar
2 siblings, 1 reply; 8+ messages in thread
From: Steven Rostedt @ 2010-05-03 14:41 UTC (permalink / raw)
To: Borislav Petkov; +Cc: Ingo Molnar, Frederic Weisbecker, Peter Zijlstra, lkml
On Sat, 2010-05-01 at 20:12 +0200, Borislav Petkov wrote:
> Hi,
>
> so I finally had some spare time to stare at the perf/ftrace code and ponder
> how to use those facilities for MCE collecting and reporting. Btw, I
> have to say, it took me quite a while to understand what goes where - my
> suggestion to anyone trying to understand how perf/ftrace works is
> to run make <file.i> on a file containing at least one trace_XXX
> record-emitting function call and start untangling the code paths from there.
>
> So anyway, here are some questions I had; I may just as well have missed
> something, so please correct me if I'm wrong:
>
> 1. Since machine checks can happen at any time, we need to have the
> MCE tracepoint (trace_mce_record() in <include/trace/events/mce.h>)
> always enabled. This, in turn, means that we need the ftrace/perf
> infrastructure always compiled in (lockless ring buffer, perf_event.c
> stuff) on any x86 system so that MCEs can be handled at any time. Is it
> going to be ok to have this enabled on _all_ machines? Hmmm... I dunno,
> maybe only a subset of those facilities at least.
I'm not exactly sure what your goal is, but if you need to do something
directly, you can bypass ftrace and perf. All trace events can be
connected to by anything, even when ftrace and perf are not enabled.
That is, you need to connect to the tracepoint and write your own
callback. This can be done pretty much at any time during boot-up. To see
how to connect to a tracepoint, you can look at
register_trace_sched_switch() in kernel/trace/ftrace.c. This registers a
callback to the trace_sched_switch() tracepoint in sched.c.
>
> 2. Tangential to 1., we need that "thin" software layer prepared for
> decoding and reporting them as early as possible. event_trace_init() is
> an fs_initcall and is executed too late, IMHO. The ->perf_event_enable in
> the ftrace_event_call is enabled even later, on the perf init path via
> sys_perf_event_open, which happens only once userspace is up. In our case,
> this is going to be executed by the error logging and decoding daemon, I guess.
>
> 3. Since we want to listen for MCEs all the time, the concept of
> enabling and disabling those events does not apply as it does in
> performance profiling. IOW, MCEs must be loggable to the
> ring buffer at any time. I guess this is easily done - we simply enable
> MCE events at the earliest moment possible and disable them on shutdown;
> done.
This looks like a good reason to have your own handler. More than one
callback may be registered to a tracepoint, so you do not need to worry
about having other handlers affect your code.
-- Steve
>
> So yeah, some food for thought but what do you guys think?
>
> Thanks.
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: perf, ftrace and MCEs
2010-05-03 14:41 ` Steven Rostedt
@ 2010-05-03 21:20 ` Borislav Petkov
0 siblings, 0 replies; 8+ messages in thread
From: Borislav Petkov @ 2010-05-03 21:20 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Ingo Molnar, Frederic Weisbecker, Peter Zijlstra, lkml
From: Steven Rostedt <rostedt@goodmis.org>
Date: Mon, May 03, 2010 at 10:41:12AM -0400
Hi Steven,
> On Sat, 2010-05-01 at 20:12 +0200, Borislav Petkov wrote:
> > Hi,
> >
> > so I finally had some spare time to stare at the perf/ftrace code and ponder
> > how to use those facilities for MCE collecting and reporting. Btw, I
> > have to say, it took me quite a while to understand what goes where - my
> > suggestion to anyone trying to understand how perf/ftrace works is
> > to run make <file.i> on a file containing at least one trace_XXX
> > record-emitting function call and start untangling the code paths from there.
> >
> > So anyway, here are some questions I had; I may just as well have missed
> > something, so please correct me if I'm wrong:
> >
> > 1. Since machine checks can happen at any time, we need to have the
> > MCE tracepoint (trace_mce_record() in <include/trace/events/mce.h>)
> > always enabled. This, in turn, means that we need the ftrace/perf
> > infrastructure always compiled in (lockless ring buffer, perf_event.c
> > stuff) on any x86 system so that MCEs can be handled at any time. Is it
> > going to be ok to have this enabled on _all_ machines? Hmmm... I dunno,
> > maybe only a subset of those facilities at least.
>
> I'm not exactly sure what your goal is, but if you need to do something
> directly, you can bypass ftrace and perf. All trace events can be
> connected to by anything, even when ftrace and perf are not enabled.
Right, so the idea is to use the perf/ftrace infrastructure to detect
failing hardware, which is signalled through machine checks, among other
things. I'm thinking a lockless ring buffer would be cool so we can
execute in any context... wait a minute, right, we have that already.
However, if I use the perf/ftrace facilities, I have to have them
enabled on every system, since machine checks are core processor
functionality and the software support for them has to be always
enabled. And the code has to be small and execute fast, since after a
critical mcheck happens, all bets are off.
So I'm thinking maybe a core perf/ftrace layer which is thin and is
always enabled... but I'm not sure, for I haven't stared at the code
enough yet.
> That is, you need to connect to the tracepoint and write your own
> callback. This can be done pretty much at any time during boot-up. To see
> how to connect to a tracepoint, you can look at
> register_trace_sched_switch() in kernel/trace/ftrace.c. This registers a
> callback to the trace_sched_switch() tracepoint in sched.c.
The base tracepoint functionality should suffice for now, but this is
definitely a cool option and good to know, thanks.
> >
> > 2. Tangential to 1., we need that "thin" software layer prepared for
> > decoding and reporting them as early as possible. event_trace_init() is
> > an fs_initcall and is executed too late, IMHO. The ->perf_event_enable in
> > the ftrace_event_call is enabled even later, on the perf init path via
> > sys_perf_event_open, which happens only once userspace is up. In our case,
> > this is going to be executed by the error logging and decoding daemon, I guess.
> >
> > 3. Since we want to listen for MCEs all the time, the concept of
> > enabling and disabling those events does not apply as it does in
> > performance profiling. IOW, MCEs must be loggable to the
> > ring buffer at any time. I guess this is easily done - we simply enable
> > MCE events at the earliest moment possible and disable them on shutdown;
> > done.
>
> This looks like a good reason to have your own handler. More than one
> callback may be registered to a tracepoint, so you do not need to worry
> about having other handlers affect your code.
Yep, good ideas, thanks. /me goes back to the drawing board.
--
Regards/Gruss,
Boris.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: perf, ftrace and MCEs
2010-05-01 18:12 perf, ftrace and MCEs Borislav Petkov
2010-05-03 14:41 ` Steven Rostedt
@ 2010-05-04 10:15 ` Andi Kleen
2010-05-04 11:32 ` Ingo Molnar
2 siblings, 0 replies; 8+ messages in thread
From: Andi Kleen @ 2010-05-04 10:15 UTC (permalink / raw)
To: Borislav Petkov
Cc: Ingo Molnar, Frederic Weisbecker, Steven Rostedt, Peter Zijlstra,
lkml
Borislav Petkov <bp@alien8.de> writes:
> so I finally had some spare time to stare at perf/ftrace code and ponder
> on how to use those facilities for MCE collecting and reporting. Btw, I
A good starting point for any such investigation would be to describe
exactly what problems you're trying to solve here.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: perf, ftrace and MCEs
2010-05-01 18:12 perf, ftrace and MCEs Borislav Petkov
2010-05-03 14:41 ` Steven Rostedt
2010-05-04 10:15 ` Andi Kleen
@ 2010-05-04 11:32 ` Ingo Molnar
2010-05-15 13:43 ` Borislav Petkov
2 siblings, 1 reply; 8+ messages in thread
From: Ingo Molnar @ 2010-05-04 11:32 UTC (permalink / raw)
To: Borislav Petkov, Frederic Weisbecker, Steven Rostedt,
Peter Zijlstra, lkml, Arnaldo Carvalho de Melo
* Borislav Petkov <bp@alien8.de> wrote:
> Hi,
>
> so I finally had some spare time to stare at the perf/ftrace code and ponder
> how to use those facilities for MCE collecting and reporting. Btw, I have to
> say, it took me quite a while to understand what goes where - my suggestion
> to anyone trying to understand how perf/ftrace works is to run make
> <file.i> on a file containing at least one trace_XXX record-emitting
> function call and start untangling the code paths from there.
>
> So anyway, here are some questions I had; I may just as well have missed
> something, so please correct me if I'm wrong:
>
> 1. Since machine checks can happen at any time, we need to have the MCE
> tracepoint (trace_mce_record() in <include/trace/events/mce.h>) always
> enabled. This, in turn, means that we need the ftrace/perf infrastructure
> always compiled in (lockless ring buffer, perf_event.c stuff) on any x86
> system so that MCEs can be handled at any time. Is it going to be ok to have
> this enabled on _all_ machines? Hmmm... I dunno, maybe only a subset of
> those facilities at least.
Yeah - and this happens on x86 anyway so you can rely on it.
> 2. Tangential to 1., we need that "thin" software layer prepared for
> decoding and reporting them as early as possible. event_trace_init() is an
> fs_initcall and is executed too late, IMHO. The ->perf_event_enable in the
> ftrace_event_call is enabled even later, on the perf init path via
> sys_perf_event_open, which happens only once userspace is up. In our case,
> this is going to be executed by the error logging and decoding daemon, I guess.
We could certainly move bits of this initialization earlier.
Also we could add the notion of 'persistent' events that don't have a
user-space process attached to them - and which would buffer to a certain
degree. Such persistent events could be initialized during bootup and the
daemon could pick up the events later on.
In-kernel actions/policy could work off the callback mechanism. (See for
example how the new NMI watchdog code in tip:perf/nmi makes use of it - or how
the hw-breakpoints code utilizes it.) These would work even if there's no
user-space daemon attached (or if the daemon has been stopped). So I don't see
a significant design problem here - it's all natural extensions of existing
perf facilities and could be used for other purposes as well.
> 3. Since we want to listen for MCEs all the time, the concept of enabling
> and disabling those events does not apply as it does in performance
> profiling. [...]
Correct.
> [...] IOW, MCEs must be loggable to the ring buffer at any time.
> I guess this is easily done - we simply enable MCE events at the earliest
> moment possible and disable them on shutdown; done.
>
> So yeah, some food for thought but what do you guys think?
Note that it doesn't _have to_ go to the ftrace ring-buffer. If the daemon (or
whatever facility picking up the events) keeps a global (per-cpu) MCE perf
event enabled all the time, then it might be doing that regardless of ftrace.
Some decoupling from ftrace could be done here easily. I'd suggest not to
worry about it - once we have the MCE event code we can certainly reshape the
underlying support code to be more readily available/configurable (or even
built-in). This is not really a significant design issue.
To start with this, a quick initial prototype could use the 'perf trace' live
mode tracing script. (See latest -tip, 'perf trace --script <script-name>' and
'perf record -o -' to activate live mode.)
Note that there's also 'perf inject' now, which can be used to simulate rare
events and test the daemon's reactions to them. (Right now perf-inject is only
used to inject certain special build-id events, but it could be used for the
injection of MCE events as well.)
Thanks,
Ingo
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: perf, ftrace and MCEs
2010-05-04 11:32 ` Ingo Molnar
@ 2010-05-15 13:43 ` Borislav Petkov
2010-05-16 11:26 ` Ingo Molnar
0 siblings, 1 reply; 8+ messages in thread
From: Borislav Petkov @ 2010-05-15 13:43 UTC (permalink / raw)
To: Ingo Molnar
Cc: Frederic Weisbecker, Steven Rostedt, Peter Zijlstra, lkml,
Arnaldo Carvalho de Melo
[-- Attachment #1: Type: text/plain, Size: 2808 bytes --]
From: Ingo Molnar <mingo@elte.hu>
Date: Tue, May 04, 2010 at 01:32:27PM +0200
Hi,
> To start with this, a quick initial prototype could use the 'perf trace' live
> mode tracing script. (See latest -tip, 'perf trace --script <script-name>' and
> 'perf record -o -' to activate live mode.)
so I did some experimenting with this and have a pretty rough prototype
which conveys decoded MCEs to userspace where they're read with perf.
More specifically, I did
perf record -e mce:mce_record -a
after tweaking the mce_record tracepoint to include the decoded error
string.
And then doing
perf trace -g python
perf trace -s perf-trace.py
got me:
in trace_begin
mce__mce_record 6 00600.700632283 0 init mcgcap=262, mcgstatus=0, bank=4, status=15888347641659525651, addr=26682366720, misc=13837309867997528064, ip=0, cs=0, tsc=0, walltime=1273928155, cpu=6, cpuid=1052561, apicid=6, socketid=0, cpuvendor=2, decoded_err= Northbridge Error, node 1ECC/ChipKill ECC error.
CE err addr: 0x636649b00
CE page 0x636649, offset 0xb00, grain 0, syndrome 0x1fd, row 3, channel 0
Transaction type: generic read(mem access), no t
in trace_end
which shows the signature of an ECC error which I injected earlier via the
EDAC sysfs interface. And yes, decoded_err appears truncated, so I'll
have to think of a slicker way to collect that info.
Although they're still pretty rough, I've attached the relevant patches so
that one can get an impression of where we're heading here.
0001-amd64_edac-Remove-polling-mechanism.patch removes the EDAC
polling mechanism in favor of hooking into the machine_check_poll
polling function using the atomic notifier which we already use for
uncorrectable errors.
The other two,
0002-mce-trace-Add-decoded-string-to-mce_record-s-format.patch
0003-edac-mce-Prepare-error-decoded-info.patch,
add that decoded_err string. I'm open to better ideas here, though.
Concerning the early MCE logging and reporting, I'm thinking of using
the mce.c ring buffer temporarily until the ftrace buffer has been
initialized and then copying all records into the latter. We might come
up with a more elegant solution in the future, after all that bootmem
churn has quieted down, and allocate memory early for a dedicated MCE
ring buffer or whatever.
Wrt critical MCEs, I'm leaning towards bypassing the perf/ftrace subsystem
altogether in favor of executing the smallest amount of code possible -
for example, switching to a tty, dumping the decoded error and, in
certain cases, not panicking but shutting down gracefully after a
timeout. Of course, graceful shutdown is completely dependent on the
type of hw failure, and in some cases we can't do anything but
freeze in order to prevent faulty data propagation.
I'm sure there's more...
Thanks.
--
Regards/Gruss,
Boris.
[-- Attachment #2: 0001-amd64_edac-Remove-polling-mechanism.patch --]
[-- Type: text/x-diff, Size: 5742 bytes --]
From 47927877707e57f7cda3093d426160bb70654291 Mon Sep 17 00:00:00 2001
From: Borislav Petkov <borislav.petkov@amd.com>
Date: Sat, 15 May 2010 13:51:57 +0200
Subject: [PATCH 1/4] amd64_edac: Remove polling mechanism
Switch to using the machine check polling mechanism of the mcheck core
instead of duplicating functionality.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
arch/x86/kernel/cpu/mcheck/mce.c | 1 +
drivers/edac/amd64_edac.c | 118 --------------------------------------
2 files changed, 1 insertions(+), 118 deletions(-)
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 8a6f0af..b57c185 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -581,6 +581,7 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
*/
if (!(flags & MCP_DONTLOG) && !mce_dont_log_ce) {
mce_log(&m);
+ atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m);
add_taint(TAINT_MACHINE_CHECK);
}
diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 297cc12..80600f1 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -1973,107 +1973,6 @@ static int get_channel_from_ecc_syndrome(struct mem_ctl_info *mci, u16 syndrome)
}
/*
- * Check for valid error in the NB Status High register. If so, proceed to read
- * NB Status Low, NB Address Low and NB Address High registers and store data
- * into error structure.
- *
- * Returns:
- * - 1: if hardware regs contains valid error info
- * - 0: if no valid error is indicated
- */
-static int amd64_get_error_info_regs(struct mem_ctl_info *mci,
- struct err_regs *regs)
-{
- struct amd64_pvt *pvt;
- struct pci_dev *misc_f3_ctl;
-
- pvt = mci->pvt_info;
- misc_f3_ctl = pvt->misc_f3_ctl;
-
- if (amd64_read_pci_cfg(misc_f3_ctl, K8_NBSH, &regs->nbsh))
- return 0;
-
- if (!(regs->nbsh & K8_NBSH_VALID_BIT))
- return 0;
-
- /* valid error, read remaining error information registers */
- if (amd64_read_pci_cfg(misc_f3_ctl, K8_NBSL, &regs->nbsl) ||
- amd64_read_pci_cfg(misc_f3_ctl, K8_NBEAL, &regs->nbeal) ||
- amd64_read_pci_cfg(misc_f3_ctl, K8_NBEAH, &regs->nbeah) ||
- amd64_read_pci_cfg(misc_f3_ctl, K8_NBCFG, &regs->nbcfg))
- return 0;
-
- return 1;
-}
-
-/*
- * This function is called to retrieve the error data from hardware and store it
- * in the info structure.
- *
- * Returns:
- * - 1: if a valid error is found
- * - 0: if no error is found
- */
-static int amd64_get_error_info(struct mem_ctl_info *mci,
- struct err_regs *info)
-{
- struct amd64_pvt *pvt;
- struct err_regs regs;
-
- pvt = mci->pvt_info;
-
- if (!amd64_get_error_info_regs(mci, info))
- return 0;
-
- /*
- * Here's the problem with the K8's EDAC reporting: There are four
- * registers which report pieces of error information. They are shared
- * between CEs and UEs. Furthermore, contrary to what is stated in the
- * BKDG, the overflow bit is never used! Every error always updates the
- * reporting registers.
- *
- * Can you see the race condition? All four error reporting registers
- * must be read before a new error updates them! There is no way to read
- * all four registers atomically. The best than can be done is to detect
- * that a race has occured and then report the error without any kind of
- * precision.
- *
- * What is still positive is that errors are still reported and thus
- * problems can still be detected - just not localized because the
- * syndrome and address are spread out across registers.
- *
- * Grrrrr!!!!! Here's hoping that AMD fixes this in some future K8 rev.
- * UEs and CEs should have separate register sets with proper overflow
- * bits that are used! At very least the problem can be fixed by
- * honoring the ErrValid bit in 'nbsh' and not updating registers - just
- * set the overflow bit - unless the current error is CE and the new
- * error is UE which would be the only situation for overwriting the
- * current values.
- */
-
- regs = *info;
-
- /* Use info from the second read - most current */
- if (unlikely(!amd64_get_error_info_regs(mci, info)))
- return 0;
-
- /* clear the error bits in hardware */
- pci_write_bits32(pvt->misc_f3_ctl, K8_NBSH, 0, K8_NBSH_VALID_BIT);
-
- /* Check for the possible race condition */
- if ((regs.nbsh != info->nbsh) ||
- (regs.nbsl != info->nbsl) ||
- (regs.nbeah != info->nbeah) ||
- (regs.nbeal != info->nbeal)) {
- amd64_mc_printk(mci, KERN_WARNING,
- "hardware STATUS read access race condition "
- "detected!\n");
- return 0;
- }
- return 1;
-}
-
-/*
* Handle any Correctable Errors (CEs) that have occurred. Check for valid ERROR
* ADDRESS and process.
*/
@@ -2197,20 +2096,6 @@ void amd64_decode_bus_error(int node_id, struct err_regs *regs)
}
/*
- * The main polling 'check' function, called FROM the edac core to perform the
- * error checking and if an error is encountered, error processing.
- */
-static void amd64_check(struct mem_ctl_info *mci)
-{
- struct err_regs regs;
-
- if (amd64_get_error_info(mci, &regs)) {
- struct amd64_pvt *pvt = mci->pvt_info;
- amd_decode_nb_mce(pvt->mc_node_id, &regs, 1);
- }
-}
-
-/*
* Input:
* 1) struct amd64_pvt which contains pvt->dram_f2_ctl pointer
* 2) AMD Family index value
@@ -2749,9 +2634,6 @@ static void amd64_setup_mci_misc_attributes(struct mem_ctl_info *mci)
mci->dev_name = pci_name(pvt->dram_f2_ctl);
mci->ctl_page_to_phys = NULL;
- /* IMPORTANT: Set the polling 'check' function in this module */
- mci->edac_check = amd64_check;
-
/* memory scrubber interface */
mci->set_sdram_scrub_rate = amd64_set_scrub_rate;
mci->get_sdram_scrub_rate = amd64_get_scrub_rate;
--
1.6.4.4
[-- Attachment #3: 0002-mce-trace-Add-decoded-string-to-mce_record-s-format.patch --]
[-- Type: text/x-diff, Size: 3258 bytes --]
From e5e0209868763073b4a82c6874a7e9d21fd7c8e5 Mon Sep 17 00:00:00 2001
From: Borislav Petkov <petkovbb@gmail.com>
Date: Sat, 27 Mar 2010 17:25:02 +0100
Subject: [PATCH 2/4] mce, trace: Add decoded string to mce_record's format
Put the decoded error string into the trace record.
Not-Signed-off-by: Borislav Petkov <petkovbb@gmail.com>
---
arch/x86/kernel/cpu/mcheck/mce.c | 2 +-
include/trace/events/mce.h | 43 +++++++++++++++++++++----------------
2 files changed, 25 insertions(+), 20 deletions(-)
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index b57c185..3880f3c 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -161,7 +161,7 @@ void mce_log(struct mce *mce)
unsigned next, entry;
/* Emit the trace record: */
- trace_mce_record(mce);
+ trace_mce_record(mce, "");
mce->finished = 0;
wmb();
diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h
index 7eee778..96b040a 100644
--- a/include/trace/events/mce.h
+++ b/include/trace/events/mce.h
@@ -8,11 +8,13 @@
#include <linux/tracepoint.h>
#include <asm/mce.h>
+#define DECODED_ERR_SZ 200
+
TRACE_EVENT(mce_record,
- TP_PROTO(struct mce *m),
+ TP_PROTO(struct mce *m, char *decoded_err),
- TP_ARGS(m),
+ TP_ARGS(m, decoded_err),
TP_STRUCT__entry(
__field( u64, mcgcap )
@@ -30,27 +32,29 @@ TRACE_EVENT(mce_record,
__field( u32, apicid )
__field( u32, socketid )
__field( u8, cpuvendor )
+ __array( char, decoded_err, DECODED_ERR_SZ)
),
TP_fast_assign(
- __entry->mcgcap = m->mcgcap;
- __entry->mcgstatus = m->mcgstatus;
- __entry->bank = m->bank;
- __entry->status = m->status;
- __entry->addr = m->addr;
- __entry->misc = m->misc;
- __entry->ip = m->ip;
- __entry->cs = m->cs;
- __entry->tsc = m->tsc;
- __entry->walltime = m->time;
- __entry->cpu = m->extcpu;
- __entry->cpuid = m->cpuid;
- __entry->apicid = m->apicid;
- __entry->socketid = m->socketid;
- __entry->cpuvendor = m->cpuvendor;
+ __entry->mcgcap = m->mcgcap;
+ __entry->mcgstatus = m->mcgstatus;
+ __entry->bank = m->bank;
+ __entry->status = m->status;
+ __entry->addr = m->addr;
+ __entry->misc = m->misc;
+ __entry->ip = m->ip;
+ __entry->cs = m->cs;
+ __entry->tsc = m->tsc;
+ __entry->walltime = m->time;
+ __entry->cpu = m->extcpu;
+ __entry->cpuid = m->cpuid;
+ __entry->apicid = m->apicid;
+ __entry->socketid = m->socketid;
+ __entry->cpuvendor = m->cpuvendor;
+ memcpy(__entry->decoded_err, decoded_err, DECODED_ERR_SZ);
),
- TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x",
+ TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x\nErr: %s\n",
__entry->cpu,
__entry->mcgcap, __entry->mcgstatus,
__entry->bank, __entry->status,
@@ -60,7 +64,8 @@ TRACE_EVENT(mce_record,
__entry->cpuvendor, __entry->cpuid,
__entry->walltime,
__entry->socketid,
- __entry->apicid)
+ __entry->apicid,
+ __entry->decoded_err)
);
#endif /* _TRACE_MCE_H */
--
1.6.4.4
[-- Attachment #4: 0003-edac-mce-Prepare-error-decoded-info.patch --]
[-- Type: text/x-diff, Size: 7063 bytes --]
From 980c6fdf00cf23ef76481a4c94ba682c0ff80d61 Mon Sep 17 00:00:00 2001
From: Borislav Petkov <petkovbb@gmail.com>
Date: Sat, 27 Mar 2010 18:42:08 +0100
Subject: [PATCH 3/4] edac, mce: Prepare error decoded info
Add a buffer where CECC error info is stored and dump it later into the
trace record.
Not-Signed-off-by: Borislav Petkov <petkovbb@gmail.com>
---
arch/x86/kernel/cpu/mcheck/mce.c | 2 +
drivers/edac/amd64_edac.c | 4 ++-
drivers/edac/edac_mc.c | 7 ++++
drivers/edac/edac_mce_amd.c | 60 +++++++++++++++++++++++++++++++------
drivers/edac/edac_mce_amd.h | 1 +
5 files changed, 63 insertions(+), 11 deletions(-)
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 3880f3c..0bcb488 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -160,8 +160,10 @@ void mce_log(struct mce *mce)
{
unsigned next, entry;
+#ifndef CONFIG_EDAC_DECODE_MCE
/* Emit the trace record: */
trace_mce_record(mce, "");
+#endif
mce->finished = 0;
wmb();
diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 80600f1..3e036f3 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -1993,7 +1993,9 @@ static void amd64_handle_ce(struct mem_ctl_info *mci,
sys_addr = pvt->ops->get_error_address(mci, info);
amd64_mc_printk(mci, KERN_ERR,
- "CE ERROR_ADDRESS= 0x%llx\n", sys_addr);
+ "CE err addr: 0x%llx\n", sys_addr);
+
+ edac_snprintf("CE err addr: 0x%llx\n", sys_addr);
pvt->ops->map_sysaddr_to_csrow(mci, info, sys_addr);
}
diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 3630308..f4b7de7 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -33,6 +33,7 @@
#include <asm/edac.h>
#include "edac_core.h"
#include "edac_module.h"
+#include "edac_mce_amd.h"
/* lock to memory controller's control array */
static DEFINE_MUTEX(mem_ctls_mutex);
@@ -702,6 +703,12 @@ void edac_mc_handle_ce(struct mem_ctl_info *mci,
mci->csrows[row].grain, syndrome, row, channel,
mci->csrows[row].channels[channel].label, msg);
+ edac_snprintf("CE page 0x%lx, offset 0x%lx, grain %d, syndrome "
+ "0x%lx, row %d, channel %d\n",
+ page_frame_number, offset_in_page,
+ mci->csrows[row].grain, syndrome, row, channel);
+
+
mci->ce_count++;
mci->csrows[row].ce_count++;
mci->csrows[row].channels[channel].ce_count++;
diff --git a/drivers/edac/edac_mce_amd.c b/drivers/edac/edac_mce_amd.c
index 97e64bc..86b374e 100644
--- a/drivers/edac/edac_mce_amd.c
+++ b/drivers/edac/edac_mce_amd.c
@@ -1,4 +1,6 @@
#include <linux/module.h>
+#include <linux/slab.h>
+#include <trace/events/mce.h>
#include "edac_mce_amd.h"
static bool report_gart_errors;
@@ -128,6 +130,33 @@ const char *ext_msgs[] = {
};
EXPORT_SYMBOL_GPL(ext_msgs);
+static char *decoded_err;
+static unsigned dec_len;
+
+void edac_snprintf(const char *fmt, ...)
+{
+ va_list args;
+ char *buf = decoded_err + dec_len;
+ unsigned size = DECODED_ERR_SZ - dec_len - 1;
+ int i;
+
+ if (dec_len >= DECODED_ERR_SZ-1)
+ return;
+
+ va_start(args, fmt);
+ i = vsnprintf(buf, size, fmt, args);
+ va_end(args);
+
+ if (i >= size) {
+ printk(KERN_ERR "MCE decode buffer truncated.\n");
+ dec_len = DECODED_ERR_SZ-1;
+ decoded_err[dec_len] = '\n';
+ } else {
+ dec_len += i;
+ }
+}
+EXPORT_SYMBOL_GPL(edac_snprintf);
+
static void amd_decode_dc_mce(u64 mc0_status)
{
u32 ec = mc0_status & 0xffff;
@@ -304,7 +333,7 @@ void amd_decode_nb_mce(int node_id, struct err_regs *regs, int handle_errors)
if (TLB_ERROR(ec) && !report_gart_errors)
return;
- pr_emerg(" Northbridge Error, node %d", node_id);
+ edac_snprintf(" Northbridge Error, node %d", node_id);
/*
* F10h, revD can disable ErrCpu[3:0] so check that first and also the
@@ -313,17 +342,17 @@ void amd_decode_nb_mce(int node_id, struct err_regs *regs, int handle_errors)
if ((boot_cpu_data.x86 == 0x10) &&
(boot_cpu_data.x86_model > 7)) {
if (regs->nbsh & K8_NBSH_ERR_CPU_VAL)
- pr_cont(", core: %u\n", (u8)(regs->nbsh & 0xf));
+ edac_snprintf(", core: %u\n", (u8)(regs->nbsh & 0xf));
} else {
u8 assoc_cpus = regs->nbsh & 0xf;
if (assoc_cpus > 0)
- pr_cont(", core: %d", fls(assoc_cpus) - 1);
+ edac_snprintf(", core: %d", fls(assoc_cpus) - 1);
- pr_cont("\n");
+ edac_snprintf("\n");
}
- pr_emerg("%s.\n", EXT_ERR_MSG(regs->nbsl));
+ edac_snprintf("%s.\n", EXT_ERR_MSG(regs->nbsl));
if (BUS_ERROR(ec) && nb_bus_decoder)
nb_bus_decoder(node_id, regs);
@@ -342,13 +371,13 @@ static void amd_decode_fr_mce(u64 mc5_status)
static inline void amd_decode_err_code(unsigned int ec)
{
if (TLB_ERROR(ec)) {
- pr_emerg(" Transaction: %s, Cache Level %s\n",
+ edac_snprintf(" Transaction: %s, Cache Level %s\n",
TT_MSG(ec), LL_MSG(ec));
} else if (MEM_ERROR(ec)) {
- pr_emerg(" Transaction: %s, Type: %s, Cache Level: %s",
+ edac_snprintf(" Transaction: %s, Type: %s, Cache Level: %s",
RRRR_MSG(ec), TT_MSG(ec), LL_MSG(ec));
} else if (BUS_ERROR(ec)) {
- pr_emerg(" Transaction type: %s(%s), %s, Cache Level: %s, "
+ edac_snprintf(" Transaction type: %s(%s), %s, Cache Level: %s, "
"Participating Processor: %s\n",
RRRR_MSG(ec), II_MSG(ec), TO_MSG(ec), LL_MSG(ec),
PP_MSG(ec));
@@ -363,9 +392,9 @@ static int amd_decode_mce(struct notifier_block *nb, unsigned long val,
struct err_regs regs;
int node, ecc;
- pr_emerg("MC%d_STATUS: ", m->bank);
+/* already in the MCE record: pr_emerg("MC%d_STATUS: ", m->bank); */
- pr_cont("%sorrected error, report: %s, MiscV: %svalid, "
+ pr_emerg("%sorrected error, report: %s, MiscV: %svalid, "
"CPU context corrupt: %s",
((m->status & MCI_STATUS_UC) ? "Unc" : "C"),
((m->status & MCI_STATUS_EN) ? "yes" : "no"),
@@ -416,6 +445,12 @@ static int amd_decode_mce(struct notifier_block *nb, unsigned long val,
amd_decode_err_code(m->status & 0xffff);
+ /* this has to be at the end */
+ pr_emerg("%s\n", decoded_err);
+
+ trace_mce_record(m, decoded_err);
+ dec_len = 0;
+
return NOTIFY_STOP;
}
@@ -432,6 +467,10 @@ static int __init mce_amd_init(void)
(boot_cpu_data.x86 >= 0xf))
atomic_notifier_chain_register(&x86_mce_decoder_chain, &amd_mce_dec_nb);
+ decoded_err = kzalloc(DECODED_ERR_SZ, GFP_KERNEL);
+ if (!decoded_err)
+ return -ENOMEM;
+
return 0;
}
early_initcall(mce_amd_init);
@@ -439,6 +478,7 @@ early_initcall(mce_amd_init);
#ifdef MODULE
static void __exit mce_amd_exit(void)
{
+ kfree(decoded_err);
atomic_notifier_chain_unregister(&x86_mce_decoder_chain, &amd_mce_dec_nb);
}
diff --git a/drivers/edac/edac_mce_amd.h b/drivers/edac/edac_mce_amd.h
index df23ee0..3ff1802 100644
--- a/drivers/edac/edac_mce_amd.h
+++ b/drivers/edac/edac_mce_amd.h
@@ -66,4 +66,5 @@ void amd_register_ecc_decoder(void (*f)(int, struct err_regs *));
void amd_unregister_ecc_decoder(void (*f)(int, struct err_regs *));
void amd_decode_nb_mce(int, struct err_regs *, int);
+void edac_snprintf(const char *fmt, ...);
#endif /* _EDAC_MCE_AMD_H */
--
1.6.4.4

* Re: perf, ftrace and MCEs
2010-05-15 13:43 ` Borislav Petkov
@ 2010-05-16 11:26 ` Ingo Molnar
2010-05-16 16:51 ` Borislav Petkov
0 siblings, 1 reply; 8+ messages in thread
From: Ingo Molnar @ 2010-05-16 11:26 UTC (permalink / raw)
To: Borislav Petkov, Frederic Weisbecker, Steven Rostedt,
Peter Zijlstra, lkml, Arnaldo Carvalho de Melo, Lin Ming
* Borislav Petkov <bp@alien8.de> wrote:
> From: Ingo Molnar <mingo@elte.hu>
> Date: Tue, May 04, 2010 at 01:32:27PM +0200
>
> Hi,
>
> > To start with this, a quick initial prototype could use the 'perf trace' live
> > mode tracing script. (See latest -tip, 'perf trace --script <script-name>' and
> > 'perf record -o -' to activate live mode.)
>
> so I did some experimenting with this and have a
> pretty rough prototype which conveys decoded MCEs
> to userspace where they're read with perf.
nice :-)
> More specifically, I did
>
> perf record -e mce:mce_record -a
>
> after tweaking the mce_record tracepoint to include
> the decoded error string.
>
> And then doing
>
> perf trace -g python
> perf trace -s perf-trace.py
>
> got me:
>
> in trace_begin
> mce__mce_record 6 00600.700632283 0 init mcgcap=262, mcgstatus=0, bank=4, status=15888347641659525651, addr=26682366720, misc=13837309867997528064, ip=0, cs=0, tsc=0, walltime=1273928155, cpu=6, cpuid=1052561, apicid=6, socketid=0, cpuvendor=2, decoded_err= Northbridge Error, node 1ECC/ChipKill ECC error.
> CE err addr: 0x636649b00
> CE page 0x636649, offset 0xb00, grain 0, syndrome 0x1fd, row 3, channel 0
> Transaction type: generic read(mem access), no t
> in trace_end
>
> which shows the signature of an ECC which I
> injected earlier over the EDAC sysfs interface. And
> yes, the decoded_err appears truncated so I'll have
> to think of a slicker way to collect that info.
>
> Although they're pretty rough yet, I've attached
> the relevant patches so that one could get an
> impression of where we're moving here.
>
> 0001-amd64_edac-Remove-polling-mechanism.patch
> removes the EDAC polling mechanism in favor of
> hooking into the machine_check_poll polling
> function using the atomic notifier which we already
> use for uncorrectable errors.
>
> The other two
>
> 0002-mce-trace-Add-decoded-string-to-mce_record-s-format.patch
> 0003-edac-mce-Prepare-error-decoded-info.patch
>
> add that decoded_err string. I'm open for better
> ideas here though.
Yes, this is exactly what i was thinking about.
> Concerning the early MCE logging and reporting, I'm
> thinking of using the mce.c ring buffer temporarily
> until the ftrace buffer has been initialized and
> then copying all records into the last. We might do
> a more elegant solution in the future after all
> that bootmem churn has quieted down and allocate
> memory early for a dedicated MCE ring buffer or
> whatever.
Agreed.
There's overlap here with the boot tracer, which we
also want to convert over to perf events.
That could be achieved via the concept of 'persistent
events': basically task-less events that are activated
early and attached to a buffer space into which they
dump their records.
That buffer space could be initialized very early on.
> Wrt critical MCEs, I'm leaning towards bypassing
> perf/ftrace subsystem altogether in favor of
> executing the smallest amount of code possible
> like, for example, switching to a tty, dumping the
> decoded error and in certain cases not panicking
> but shutting down gracefully after a timeout. Of
> course, graceful shutdown is completely dependent
> on the type of hw failure and in some cases we
> can't do anything else but freeze in order to
> prevent faulty data propagation.
I agree that the critical functionality itself should
be implemented in the kernel - and not all routed
through a user-space component.
But please generate the callbacks via perf events,
like the new watchdog code does in -tip
(tip:perf/nmi):
cafcd80: lockup_detector: Cross arch compile fixes
23637d4: lockup_detector: Introduce CONFIG_HARDLOCKUP_DETECTOR
c01d432: lockup_detector: Adapt CONFIG_PERF_EVENT_NMI to other archs
e16bb1d: lockup_detector: Update some config
5e85391: x86, watchdog: Fix build error in hw_nmi.c
0167c78: watchdog: Export touch_softlockup_watchdog
19cc36c: lockup_detector: Fix forgotten config conversion
89d7ce2: lockup_detector: Make BOOTPARAM_SOFTLOCKUP_PANIC depend on LOCKUP_DETECTOR
d7c5473: lockup_detector: Separate touch_nmi_watchdog code path from touch_watchdog
10f9014: x86: Cleanup hw_nmi.c cruft
7cbb7e7: x86: Move trigger_all_cpu_backtrace to its own die_notifier
f69bcf6: lockup_detector: Remove nmi_watchdog.c file
2508ce1: lockup_detector: Remove old softlockup code
332fbdb: lockup_detector: Touch_softlockup cleanups and softlockup_tick removal
58687ac: lockup_detector: Combine nmi_watchdog and softlockup detector
5671a10: nmi_watchdog: Tell the world we're active
c99c30f: nmi_watchdog: Turn it off by default
47195d5: nmi_watchdog: Clean up various small details
2cc4452: nmi_watchdog: Fix undefined 'apic' build bug
96ca402: nmi_watchdog: Properly configure for software events
6081b6c: nmi_watchdog: support for oprofile
cf454ae: nmi_watchdog: Fallback to software events when no hardware pmu detected
504d7cf: nmi_watchdog: Compile and portability fixes
c3128fb: nmi_watchdog: Use a boolean config flag for compiling
8e7672c: nmi_watchdog: Only enable on x86 for now
84e478c: nmi_watchdog: Config option to enable new nmi_watchdog
1fb9d6a: nmi_watchdog: Add new, generic implementation, using perf events
e40b172: x86: Move notify_die from nmi.c to traps.c
Then those callbacks can be used to implement some
minimal critical functionality: panic/print
decisions, tty switching, etc.
The essence of the method is (see kernel/watchdog.c):
event = perf_event_create_kernel_counter(
&wd_attr, hotcpu, -1, wd_overflow);
the wd_overflow() is the callback you get. It's
called when the event triggers. You can put arbitrary
functionality into that callback.
Doing it this way is basically a constant, gradual
process of unifying our currently woefully
inconsistent callbacks within the x86 platform and
the core kernel, and collecting these events into the
TRACE_EVENT and perf umbrella - without any loss in
functionality.
You can still call back on those events and add
arbitrary EDAC and RAS features on top of that. (if
there's something you cannot do via that method, or
if it feels clumsy please let us know and we'll
fix/enhance it.)
The advantage of that approach is that almost as a
side-effect we also gain an event described in a
structured way - which other tools can (and do) make
use of.
Another related development is the ongoing work to
describe events in sysfs, not in debugfs. The current
plans put them under special files:
/sys/kernel/sched/events/wakeup/id
/sys/kernel/sched/events/wakeup/format
/sys/devices/system/cpu/events/cycles/id
/sys/devices/system/cpu/events/instructions/id
/sys/kernel/events/mce/mce_record/id
/sys/kernel/events/mce/mce_record/format
The exact placement is not yet final - we could
reasonably put the MCE events under
/sys/devices/system/cpu/events/mce/ as well - and
alias them to multiple places.
Likewise, memory controller events could show up
under /sys/devices/system/node/events/ ?
All events could be found via /sys/class/events/ and
could be enumerated from there by EDAC/RAS tools.
See the discussion and patch from Lin Ming that
started this on lkml:
[RFC][PATCH 3/9] perf: export registerred pmus via sysfs
Plus, mid-term/long-term we could also split EDAC/RAS
functionality from tools/perf/ and put it into
tools/ras/, tools/mce/ or tools/edac/, and librarize
common functionality to share as much code as
possible.
We don't 'have to' have a 'perf mce' sub-tool - that's
just a convenient starting place for you, I suspect.
Thanks,
Ingo
* Re: perf, ftrace and MCEs
2010-05-16 11:26 ` Ingo Molnar
@ 2010-05-16 16:51 ` Borislav Petkov
0 siblings, 0 replies; 8+ messages in thread
From: Borislav Petkov @ 2010-05-16 16:51 UTC (permalink / raw)
To: Ingo Molnar
Cc: Frederic Weisbecker, Steven Rostedt, Peter Zijlstra, lkml,
Arnaldo Carvalho de Melo, Lin Ming
From: Ingo Molnar <mingo@elte.hu>
Date: Sun, May 16, 2010 at 01:26:41PM +0200
Hi,
> > in trace_begin
> > mce__mce_record 6 00600.700632283 0 init mcgcap=262, mcgstatus=0, bank=4, status=15888347641659525651, addr=26682366720, misc=13837309867997528064, ip=0, cs=0, tsc=0, walltime=1273928155, cpu=6, cpuid=1052561, apicid=6, socketid=0, cpuvendor=2, decoded_err= Northbridge Error, node 1ECC/ChipKill ECC error.
> > CE err addr: 0x636649b00
> > CE page 0x636649, offset 0xb00, grain 0, syndrome 0x1fd, row 3, channel 0
> > Transaction type: generic read(mem access), no t
> > in trace_end
> >
> > which shows the signature of an ECC which I
> > injected earlier over the EDAC sysfs interface. And
> > yes, the decoded_err appears truncated so I'll have
> > to think of a slicker way to collect that info.
> >
> > Although they're pretty rough yet, I've attached
> > the relevant patches so that one could get an
> > impression of where we're moving here.
> >
> > 0001-amd64_edac-Remove-polling-mechanism.patch
> > removes the EDAC polling mechanism in favor of
> > hooking into the machine_check_poll polling
> > function using the atomic notifier which we already
> > use for uncorrectable errors.
> >
> > The other two
> >
> > 0002-mce-trace-Add-decoded-string-to-mce_record-s-format.patch
> > 0003-edac-mce-Prepare-error-decoded-info.patch
> >
> > add that decoded_err string. I'm open for better
> > ideas here though.
>
> Yes, this is exactly what i was thinking about.
The decoded string handling is still clumsy: the string is of variable
length, so I have to allocate a buffer big enough for the longest
possible error message to avoid truncation.
We could avoid this by emitting fixed-length error codes instead,
which the RAS daemon would map to strings in userspace, but the coding
scheme would need some thought so that it works adequately for all
possible error types.
Anyway, we can always improve this if needed and as we go...
> > Concerning the early MCE logging and reporting, I'm
> > thinking of using the mce.c ring buffer temporarily
> > until the ftrace buffer has been initialized and
> > then copying all records into the last. We might do
> > a more elegant solution in the future after all
> > that bootmem churn has quieted down and allocate
> > memory early for a dedicated MCE ring buffer or
> > whatever.
>
> Agreed.
>
> There's overlap here with the boot tracer as well,
> which we want to convert over to perf events as well.
>
> That could be achieved via the concept of 'persistent
> events', which are basically task-less events brought
> active and attached to a buffer space to dump the
> events to.
Yep, sounds neat.
> That buffer space could be initialized very early on.
>
> > Wrt critical MCEs, I'm leaning towards bypassing
> > perf/ftrace subsystem altogether in favor of
> > executing the smallest amount of code possible
> > like, for example, switching to a tty, dumping the
> > decoded error and in certain cases not panicking
> > but shutting down gracefully after a timeout. Of
> > course, graceful shutdown is completely dependent
> > on the type of hw failure and in some cases we
> > can't do anything else but freeze in order to
> > prevent faulty data propagation.
>
> I agree that the critical functionality itself should
> be implemented in the kernel - and not all routed
> through a user-space component.
>
> But please generate the callbacks via perf events,
> like the new watchdog code does in -tip
> (tip:perf/nmi):
[snip]
My only concern here is that going through the perf events and
callbacks layer adds extra code to the path when we're in emergency
mode handling an uncorrectable error. However, I don't know how much
code that would be, or whether its overhead would actually matter...
> Then those callbacks can be used to implement some
> minimal critical functionality: panic/print
> decisions, tty switching, etc.
>
> The essence of the method is (see kernel/watchdog.c):
>
> event = perf_event_create_kernel_counter(
> &wd_attr, hotcpu, -1, wd_overflow);
>
> the wd_overflow() is the callback you get. It's
> called when the event triggers. You can put arbitrary
> functionality into that callback.
Correct me if I'm wrong but since the trace_mce_record() is a
tracepoint, we don't want to register callbacks over the perf_event*
interface but rather use the ftrace path, like Steven's example with
register_trace_sched_switch().
If we do it this way, there's no overhead at all and we add all
the needed functionality like tty switching and emergency shutdown
preparations to the proper path in the proper order - right after having
decoded the MCE and still in the NMI context of do_machine_check, i.e.
at the earliest possible moment and without wasting time.
And with tracepoints we still have the event unification/enumeration you
mention below so I think it can't get any better than that :).
> Doing it this way is basically a constant, gradual
> process of unifying our currently woefully
> inconsistent callbacks within the x86 platform and
> the core kernel, and collecting these events into the
> TRACE_EVENT and perf umbrella - without any loss in
> functionality.
>
> You can still call back on those events and add
> arbitrary EDAC and RAS features on top of that. (if
> there's something you cannot do via that method, or
> if it feels clumsy please let us know and we'll
> fix/enhance it.)
>
> The advantage of that approach is that almost as a
> side-effect we also gain an event described in a
> structured way - which other tools can (and do) make
> use of.
see above.
> Another related development is the ongoing work to
> describe events in sysfs, not in debugfs. The current
> plans put them under special files:
>
> /sys/kernel/sched/events/wakeup/id
> /sys/kernel/sched/events/wakeup/format
>
> /sys/devices/system/cpu/events/cycles/id
> /sys/devices/system/cpu/events/instructions/id
>
> /sys/kernel/events/mce/mce_record/id
> /sys/kernel/events/mce/mce_record/format
>
> The exact placement is not yet final - we could
> reasonably put the MCE events under
> /sys/devices/system/cpu/events/mce/ as well - and
> alias them to multiple places.
Well, since MCEs are per CPU, the proper way to map those would be
/sys/devices/system/cpu/cpuX/events/mce/
With this, we could have the additional functionality to disable some
MCEs per CPU if it makes sense for certain cases...
Hmm?
> Likewise, memory controller events could show up
> under /sys/devices/system/node/events/ ?
>
> All events could be found via /sys/class/events/ and
> could be enumerated from there by EDAC/RAS tools.
>
> See the discussion and patch from Lin Ming that
> started this on lkml:
>
> [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
>
> Plus, mid-term/long-term we could also split EDAC/RAS
> functionality from tools/perf/ and put it into
> tools/ras/, tools/mce/ or tools/edac/, and librarize
> common functionality to share as much code as
> possible.
Well, both MCE and EDAC are subsets of RAS, so calling it perf/ras/ is
the most sensible way to go, IMHO. Also, this is where perf inject
can be reused, since we can inject real hardware errors and not only
simulate them in software.
> We don't 'have to' have a 'perf mce' sub-tool - that's
> just a convenient starting place for you, I suspect.
ditto, 'perf ras' is what I'm thinking.
Thanks.
--
Regards/Gruss,
Boris.