public inbox for linux-kernel@vger.kernel.org
* [v3][PATCH 0/4] Work around perf NMI-induced hangs
@ 2013-05-29 22:27 Dave Hansen
  2013-05-29 22:27 ` [v3][PATCH 1/4] perf/x86: only print PMU state when also WARN()'ing Dave Hansen
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Dave Hansen @ 2013-05-29 22:27 UTC (permalink / raw)
  To: a.p.zijlstra; +Cc: mingo, paulus, acme, tglx, x86, linux-kernel, Dave Hansen

Changes from v2:

2/4:
 * Only warn on the longest NMIs.  Don't print when over
   a threshold.
 * Output in ms as opposed to ns
4/4:
 * Add some Documentation/ for the tracepoint
 * keep tracepoint delta in an s64 instead of an int, and
   call it 'delta_ns' instead of 'len'

Changes from v1:

 * keep a running average instead of taking a single value
   for determining NMI lengths.
 * Fixed some of the math converting from ns to/from
   percentages (it was backwards)
 * Included NMI length tracepoint at end of series

--

If root or an unprivileged user runs 'perf top', my system hangs.
If I'm lucky, I get a warning out to dmesg, along these lines:

	hrtimer: interrupt took 13915457 ns cpu: 132

or a hard-lockup message on occasion.

The proximate cause of this is that perf_event_nmi_handler() has
been observed to take tens of ms on occasion.  That needs to get
fixed, and I'm working on tracking the root cause down.

But these patches make the situation better: perf can no longer
simply wedge the box, and we have a safe, controlled exit path
when things go wrong.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [v3][PATCH 1/4] perf/x86: only print PMU state when also WARN()'ing
  2013-05-29 22:27 [v3][PATCH 0/4] Work around perf NMI-induced hangs Dave Hansen
@ 2013-05-29 22:27 ` Dave Hansen
  2013-05-29 22:27 ` [v3][PATCH 2/4] x86: warn when NMI handlers take large amounts of time Dave Hansen
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Dave Hansen @ 2013-05-29 22:27 UTC (permalink / raw)
  To: a.p.zijlstra; +Cc: mingo, paulus, acme, tglx, x86, linux-kernel, Dave Hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

intel_pmu_handle_irq() has a warning in it if it does too many
loops.  It is a WARN_ONCE(), but the perf_event_print_debug()
call beneath it is unconditional. For the first warning, you get
a nice backtrace and message, but subsequent ones just dump the
PMU state with no leading messages.  I doubt this is what was
intended.

This patch only prints the PMU state when paired with the
WARN() text.  It effectively open-codes WARN_ONCE()'s
one-time-only logic.

My suspicion is that the code really just wants to make sure we
do not sit in the loop and spit out a warning for every loop
iteration after the 100th.  From what I've seen, this is very
unlikely to happen since we also clear the PMU state.

After this patch, instead of seeing the PMU state dumped each
time, you will just see:

	[57494.894540] perf_event_intel: clearing PMU state on CPU#129
	[57579.539668] perf_event_intel: clearing PMU state on CPU#10
	[57587.137762] perf_event_intel: clearing PMU state on CPU#134
	[57623.039912] perf_event_intel: clearing PMU state on CPU#114
	[57644.559943] perf_event_intel: clearing PMU state on CPU#118
	...

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 linux.git-davehans/arch/x86/kernel/cpu/perf_event_intel.c |    8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff -puN arch/x86/kernel/cpu/perf_event_intel.c~debug-perf-hangs arch/x86/kernel/cpu/perf_event_intel.c
--- linux.git/arch/x86/kernel/cpu/perf_event_intel.c~debug-perf-hangs	2013-05-29 15:10:18.909649305 -0700
+++ linux.git-davehans/arch/x86/kernel/cpu/perf_event_intel.c	2013-05-29 15:10:18.912649437 -0700
@@ -1188,8 +1188,12 @@ static int intel_pmu_handle_irq(struct p
 again:
 	intel_pmu_ack_status(status);
 	if (++loops > 100) {
-		WARN_ONCE(1, "perfevents: irq loop stuck!\n");
-		perf_event_print_debug();
+		static bool warned = false;
+		if (!warned) {
+			WARN(1, "perfevents: irq loop stuck!\n");
+			perf_event_print_debug();
+			warned = true;
+		}
 		intel_pmu_reset();
 		goto done;
 	}
_


* [v3][PATCH 2/4] x86: warn when NMI handlers take large amounts of time
  2013-05-29 22:27 [v3][PATCH 0/4] Work around perf NMI-induced hangs Dave Hansen
  2013-05-29 22:27 ` [v3][PATCH 1/4] perf/x86: only print PMU state when also WARN()'ing Dave Hansen
@ 2013-05-29 22:27 ` Dave Hansen
  2013-05-30  8:33   ` Ingo Molnar
  2013-05-30  9:37   ` Peter Zijlstra
  2013-05-29 22:28 ` [v3][PATCH 3/4] perf: drop sample rate when sampling is too slow Dave Hansen
  2013-05-29 22:28 ` [v3][PATCH 4/4] x86: nmi length tracepoints Dave Hansen
  3 siblings, 2 replies; 7+ messages in thread
From: Dave Hansen @ 2013-05-29 22:27 UTC (permalink / raw)
  To: a.p.zijlstra; +Cc: mingo, paulus, acme, tglx, x86, linux-kernel, Dave Hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

I have a system which is causing all kinds of problems.  It has 8
NUMA nodes, and lots of cores that can fight over cachelines.
If things are not working _perfectly_, then NMIs can take longer
than expected.

If we get too many of them backed up to each other, we can easily
end up in a situation where we are doing nothing *but* running
NMIs.  The biggest problem, though, is that this happens
_silently_.  You might be lucky enough to get an hrtimer warning,
but most of the time, the system simply hangs.

This patch should at least give us some warning before we fall
off the cliff.  The warnings look like this:

	nmi_handle: perf_event_nmi_handler() took: 26095071 ns

The message is triggered whenever we notice an NMI that is the
longest we've seen to date.  You can view and reset this value
via the debugfs interface at any time.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 linux.git-davehans/arch/x86/kernel/nmi.c |   28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff -puN arch/x86/kernel/nmi.c~debug-nmi-timing arch/x86/kernel/nmi.c
--- linux.git/arch/x86/kernel/nmi.c~debug-nmi-timing	2013-05-29 15:10:19.159660273 -0700
+++ linux.git-davehans/arch/x86/kernel/nmi.c	2013-05-29 15:10:19.162660403 -0700
@@ -14,6 +14,7 @@
 #include <linux/kprobes.h>
 #include <linux/kdebug.h>
 #include <linux/nmi.h>
+#include <linux/debugfs.h>
 #include <linux/delay.h>
 #include <linux/hardirq.h>
 #include <linux/slab.h>
@@ -82,6 +83,15 @@ __setup("unknown_nmi_panic", setup_unkno
 
 #define nmi_to_desc(type) (&nmi_desc[type])
 
+static u64 nmi_longest_ns = 1000 * 1000 * 1000;
+static int __init nmi_warning_debugfs(void)
+{
+	debugfs_create_u64("nmi_longest_ns", 0644,
+			arch_debugfs_dir, &nmi_longest_ns);
+	return 0;
+}
+fs_initcall(nmi_warning_debugfs);
+
 static int __kprobes nmi_handle(unsigned int type, struct pt_regs *regs, bool b2b)
 {
 	struct nmi_desc *desc = nmi_to_desc(type);
@@ -96,8 +106,24 @@ static int __kprobes nmi_handle(unsigned
 	 * can be latched at any given time.  Walk the whole list
 	 * to handle those situations.
 	 */
-	list_for_each_entry_rcu(a, &desc->head, list)
+	list_for_each_entry_rcu(a, &desc->head, list) {
+		u64 before, delta, whole_msecs;
+		int decimal_msecs;
+
+		before = local_clock();
 		handled += a->handler(type, regs);
+		delta = local_clock() - before;
+
+		if (delta < nmi_longest_ns)
+			continue;
+
+		nmi_longest_ns = delta;
+		whole_msecs = delta / (1000 * 1000);
+		decimal_msecs = (delta / 1000) % 1000;
+		printk_ratelimited(KERN_INFO
+			"INFO: NMI handler took too long to run: "
+			"%lld.%03d msecs\n", whole_msecs, decimal_msecs);
+	}
 
 	rcu_read_unlock();
 
_


* [v3][PATCH 3/4] perf: drop sample rate when sampling is too slow
  2013-05-29 22:27 [v3][PATCH 0/4] Work around perf NMI-induced hangs Dave Hansen
  2013-05-29 22:27 ` [v3][PATCH 1/4] perf/x86: only print PMU state when also WARN()'ing Dave Hansen
  2013-05-29 22:27 ` [v3][PATCH 2/4] x86: warn when NMI handlers take large amounts of time Dave Hansen
@ 2013-05-29 22:28 ` Dave Hansen
  2013-05-29 22:28 ` [v3][PATCH 4/4] x86: nmi length tracepoints Dave Hansen
  3 siblings, 0 replies; 7+ messages in thread
From: Dave Hansen @ 2013-05-29 22:28 UTC (permalink / raw)
  To: a.p.zijlstra; +Cc: mingo, paulus, acme, tglx, x86, linux-kernel, Dave Hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

This patch keeps track of how long perf's NMI handler is taking,
and also calculates how many samples perf can take a second.  If
the sample length times the expected max number of samples exceeds
a configurable threshold, it drops the sample rate.  This way,
we don't have a runaway sampling process eating up the CPU.

This patch tends to drop the sample rate down to a level where
perf doesn't work very well.  *BUT* the alternative is that my
system hangs because it spends all of its time handling NMIs.

I'll take a busted performance tool over an entire system that's
busted and undebuggable any day.

BTW, my suspicion is that there's still an underlying bug here.
Using the HPET instead of the TSC is definitely a contributing
factor, but I suspect there are some other things going on.
But, I can't go dig down on a bug like that with my machine
hanging all the time.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 linux.git-davehans/Documentation/sysctl/kernel.txt  |   26 ++++++
 linux.git-davehans/arch/x86/kernel/cpu/perf_event.c |   12 ++-
 linux.git-davehans/include/linux/perf_event.h       |    7 +
 linux.git-davehans/kernel/events/core.c             |   78 ++++++++++++++++++++
 linux.git-davehans/kernel/sysctl.c                  |    9 ++
 5 files changed, 131 insertions(+), 1 deletion(-)

diff -puN arch/x86/kernel/cpu/perf_event.c~drop-perf-samples-when-sampling-is-too-slow arch/x86/kernel/cpu/perf_event.c
--- linux.git/arch/x86/kernel/cpu/perf_event.c~drop-perf-samples-when-sampling-is-too-slow	2013-05-29 15:10:19.413671415 -0700
+++ linux.git-davehans/arch/x86/kernel/cpu/perf_event.c	2013-05-29 15:10:19.424671899 -0700
@@ -1249,10 +1249,20 @@ void perf_events_lapic_init(void)
 static int __kprobes
 perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
 {
+	int ret;
+	u64 start_clock;
+	u64 finish_clock;
+
 	if (!atomic_read(&active_events))
 		return NMI_DONE;
 
-	return x86_pmu.handle_irq(regs);
+	start_clock = local_clock();
+	ret = x86_pmu.handle_irq(regs);
+	finish_clock = local_clock();
+
+	perf_sample_event_took(finish_clock - start_clock);
+
+	return ret;
 }
 
 struct event_constraint emptyconstraint;
diff -puN Documentation/sysctl/kernel.txt~drop-perf-samples-when-sampling-is-too-slow Documentation/sysctl/kernel.txt
--- linux.git/Documentation/sysctl/kernel.txt~drop-perf-samples-when-sampling-is-too-slow	2013-05-29 15:10:19.414671459 -0700
+++ linux.git-davehans/Documentation/sysctl/kernel.txt	2013-05-29 15:10:19.424671899 -0700
@@ -427,6 +427,32 @@ This file shows up if CONFIG_DEBUG_STACK
 
 ==============================================================
 
+perf_cpu_time_max_percent:
+
+Hints to the kernel how much CPU time it should be allowed to
+use to handle perf sampling events.  If the perf subsystem
+is informed that its samples are exceeding this limit, it
+will drop its sampling frequency to attempt to reduce its CPU
+usage.
+
+Some perf sampling happens in NMIs.  If these samples
+unexpectedly take too long to execute, the NMIs can become
+stacked up next to each other so much that nothing else is
+allowed to execute.
+
+0: disable the mechanism.  Do not monitor or correct perf's
+   sampling rate no matter how much CPU time it takes.
+
+1-100: attempt to throttle perf's sample rate to this
+   percentage of CPU.  Note: the kernel calculates an
+   "expected" length of each sample event.  100 here means
+   100% of that expected length.  Even if this is set to
+   100, you may still see sample throttling if this
+   length is exceeded.  Set to 0 if you truly do not care
+   how much CPU is consumed.
+
+==============================================================
+
 
 pid_max:
 
diff -puN include/linux/perf_event.h~drop-perf-samples-when-sampling-is-too-slow include/linux/perf_event.h
--- linux.git/include/linux/perf_event.h~drop-perf-samples-when-sampling-is-too-slow	2013-05-29 15:10:19.416671547 -0700
+++ linux.git-davehans/include/linux/perf_event.h	2013-05-29 15:10:19.425671943 -0700
@@ -696,10 +696,17 @@ static inline void perf_callchain_store(
 extern int sysctl_perf_event_paranoid;
 extern int sysctl_perf_event_mlock;
 extern int sysctl_perf_event_sample_rate;
+extern int sysctl_perf_cpu_time_max_percent;
+
+extern void perf_sample_event_took(u64 sample_len_ns);
 
 extern int perf_proc_update_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos);
+extern int perf_cpu_time_max_percent_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos);
+
 
 static inline bool perf_paranoid_tracepoint_raw(void)
 {
diff -puN kernel/events/core.c~drop-perf-samples-when-sampling-is-too-slow kernel/events/core.c
--- linux.git/kernel/events/core.c~drop-perf-samples-when-sampling-is-too-slow	2013-05-29 15:10:19.418671635 -0700
+++ linux.git-davehans/kernel/events/core.c	2013-05-29 15:10:19.427672029 -0700
@@ -166,9 +166,21 @@ int sysctl_perf_event_mlock __read_mostl
  * max perf event sample rate
  */
 #define DEFAULT_MAX_SAMPLE_RATE 100000
+#define DEFAULT_SAMPLE_PERIOD_NS	(NSEC_PER_SEC / DEFAULT_MAX_SAMPLE_RATE)
+#define DEFAULT_CPU_TIME_MAX_PERCENT 25
 int sysctl_perf_event_sample_rate __read_mostly = DEFAULT_MAX_SAMPLE_RATE;
 static int max_samples_per_tick __read_mostly =
 	DIV_ROUND_UP(DEFAULT_MAX_SAMPLE_RATE, HZ);
+static int perf_sample_period_ns __read_mostly = DEFAULT_SAMPLE_PERIOD_NS;
+static atomic_t perf_sample_allowed_ns __read_mostly = ATOMIC_INIT(
+		DEFAULT_SAMPLE_PERIOD_NS * DEFAULT_CPU_TIME_MAX_PERCENT / 100);
+
+void update_perf_cpu_limits(void)
+{
+	u64 tmp = perf_sample_period_ns;
+	tmp = tmp * sysctl_perf_cpu_time_max_percent / 100;
+	atomic_set(&perf_sample_allowed_ns, tmp);
+}
 
 int perf_proc_update_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
@@ -180,10 +192,76 @@ int perf_proc_update_handler(struct ctl_
 		return ret;
 
 	max_samples_per_tick = DIV_ROUND_UP(sysctl_perf_event_sample_rate, HZ);
+	perf_sample_period_ns = NSEC_PER_SEC / sysctl_perf_event_sample_rate;
+	update_perf_cpu_limits();
+
+	return 0;
+}
+
+int sysctl_perf_cpu_time_max_percent __read_mostly =
+					DEFAULT_CPU_TIME_MAX_PERCENT;
+int perf_cpu_time_max_percent_handler(struct ctl_table *table, int write,
+				void __user *buffer, size_t *lenp,
+				loff_t *ppos)
+{
+	int ret = proc_dointvec(table, write, buffer, lenp, ppos);
+
+	if (ret || !write)
+		return ret;
+
+	update_perf_cpu_limits();
 
 	return 0;
 }
 
+/*
+ * perf samples are done in some very critical code paths (NMIs).
+ * If they take too much CPU time, the system can lock up and not
+ * get any real work done.  This will drop the sample rate when
+ * we detect that events are taking too long.
+ */
+#define NR_ACCUMULATED_SAMPLES 128
+DEFINE_PER_CPU(u64, running_sample_length);
+void perf_sample_event_took(u64 sample_len_ns)
+{
+	u64 avg_local_sample_len;
+	u64 local_samples_len = __get_cpu_var(running_sample_length);
+
+	if (atomic_read(&perf_sample_allowed_ns) == 0)
+		return;
+
+	/* decay the counter by 1 average sample */
+	local_samples_len = __get_cpu_var(running_sample_length);
+	local_samples_len -= local_samples_len/NR_ACCUMULATED_SAMPLES;
+	local_samples_len += sample_len_ns;
+	__get_cpu_var(running_sample_length) = local_samples_len;
+
+	/*
+	 * note: this will be biased artificially low until we have
+	 * seen NR_ACCUMULATED_SAMPLES.  Doing it this way keeps us
+	 * from having to maintain a count.
+	 */
+	avg_local_sample_len = local_samples_len/NR_ACCUMULATED_SAMPLES;
+
+	if (avg_local_sample_len <= atomic_read(&perf_sample_allowed_ns))
+		return;
+
+	if (max_samples_per_tick <= 1)
+		return;
+
+	max_samples_per_tick = DIV_ROUND_UP(max_samples_per_tick, 2);
+	sysctl_perf_event_sample_rate = max_samples_per_tick * HZ;
+	perf_sample_period_ns = NSEC_PER_SEC / sysctl_perf_event_sample_rate;
+
+	printk_ratelimited(KERN_WARNING
+			"perf samples too long (%lld > %d), lowering "
+			"kernel.perf_event_max_sample_rate to %d\n",
+			sample_len_ns, atomic_read(&perf_sample_allowed_ns),
+			sysctl_perf_event_sample_rate);
+
+	update_perf_cpu_limits();
+}
+
 static atomic64_t perf_event_id;
 
 static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
diff -puN kernel/sysctl.c~drop-perf-samples-when-sampling-is-too-slow kernel/sysctl.c
--- linux.git/kernel/sysctl.c~drop-perf-samples-when-sampling-is-too-slow	2013-05-29 15:10:19.420671722 -0700
+++ linux.git-davehans/kernel/sysctl.c	2013-05-29 15:10:19.428672073 -0700
@@ -1044,6 +1044,15 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= perf_proc_update_handler,
 	},
+	{
+		.procname	= "perf_cpu_time_max_percent",
+		.data		= &sysctl_perf_cpu_time_max_percent,
+		.maxlen		= sizeof(sysctl_perf_cpu_time_max_percent),
+		.mode		= 0644,
+		.proc_handler	= perf_cpu_time_max_percent_handler,
+		.extra1		= &zero,
+		.extra2		= &one_hundred,
+	},
 #endif
 #ifdef CONFIG_KMEMCHECK
 	{
_


* [v3][PATCH 4/4] x86: nmi length tracepoints
  2013-05-29 22:27 [v3][PATCH 0/4] Work around perf NMI-induced hangs Dave Hansen
                   ` (2 preceding siblings ...)
  2013-05-29 22:28 ` [v3][PATCH 3/4] perf: drop sample rate when sampling is too slow Dave Hansen
@ 2013-05-29 22:28 ` Dave Hansen
  3 siblings, 0 replies; 7+ messages in thread
From: Dave Hansen @ 2013-05-29 22:28 UTC (permalink / raw)
  To: a.p.zijlstra; +Cc: mingo, paulus, acme, tglx, x86, linux-kernel, Dave Hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

This patch has been invaluable in my adventures finding
issues in the perf NMI handler.  I'm as big a fan of
printk() as anybody is, but using printk() in NMIs is
deadly when they're happening frequently.

Even hacking in trace_printk() ended up eating enough
CPU to throw off some of the measurements I was making.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 linux.git-davehans/Documentation/trace/events-nmi.txt |   43 ++++++++++++++++++
 linux.git-davehans/arch/x86/kernel/nmi.c              |    9 ++-
 linux.git-davehans/include/trace/events/nmi.h         |   37 +++++++++++++++
 3 files changed, 87 insertions(+), 2 deletions(-)

diff -puN arch/x86/kernel/nmi.c~nmi-tracepoints arch/x86/kernel/nmi.c
--- linux.git/arch/x86/kernel/nmi.c~nmi-tracepoints	2013-05-29 15:10:19.717684752 -0700
+++ linux.git-davehans/arch/x86/kernel/nmi.c	2013-05-29 15:10:19.720684883 -0700
@@ -30,6 +30,9 @@
 #include <asm/nmi.h>
 #include <asm/x86_init.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/nmi.h>
+
 struct nmi_desc {
 	spinlock_t lock;
 	struct list_head head;
@@ -108,11 +111,13 @@ static int __kprobes nmi_handle(unsigned
 	 */
 	list_for_each_entry_rcu(a, &desc->head, list) {
 		u64 before, delta, whole_msecs;
-		int decimal_msecs;
+		int decimal_msecs, thishandled;
 
 		before = local_clock();
-		handled += a->handler(type, regs);
+		thishandled = a->handler(type, regs);
+		handled += thishandled;
 		delta = local_clock() - before;
+		trace_nmi_handler(a->handler, (int)delta, thishandled);
 
 		if (delta < nmi_longest_ns)
 			continue;
diff -puN /dev/null Documentation/trace/events-nmi.txt
--- /dev/null	2013-03-17 11:51:50.342329668 -0700
+++ linux.git-davehans/Documentation/trace/events-nmi.txt	2013-05-29 15:10:19.720684883 -0700
@@ -0,0 +1,43 @@
+NMI Trace Events
+
+These events normally show up here:
+
+	/sys/kernel/debug/tracing/events/nmi
+
+--
+
+nmi_handler:
+
+You might want to use this tracepoint if you suspect that your
+NMI handlers are hogging large amounts of CPU time.  The kernel
+will warn if it sees long-running handlers:
+
+	INFO: NMI handler took too long to run: 9.207 msecs
+
+and this tracepoint will allow you to drill down and get some
+more details.
+
+Let's say you suspect that perf_event_nmi_handler() is causing
+you some problems and you only want to trace that handler
+specifically.  You need to find its address:
+
+	$ grep perf_event_nmi_handler /proc/kallsyms
+	ffffffff81625600 t perf_event_nmi_handler
+
+Let's also say you are only interested in when that function is
+really hogging a lot of CPU time, like a millisecond at a time.
+Note that the kernel's output is in milliseconds, but the input
+to the filter is in nanoseconds!  You can filter on 'delta_ns':
+
+cd /sys/kernel/debug/tracing/events/nmi/nmi_handler
+echo 'handler==0xffffffff81625600 && delta_ns>1000000' > filter
+echo 1 > enable
+
+Your output would then look like:
+
+$ cat /sys/kernel/debug/tracing/trace_pipe
+<idle>-0     [000] d.h3   505.397558: nmi_handler: perf_event_nmi_handler() delta_ns: 3236765 handled: 1
+<idle>-0     [000] d.h3   505.805893: nmi_handler: perf_event_nmi_handler() delta_ns: 3174234 handled: 1
+<idle>-0     [000] d.h3   506.158206: nmi_handler: perf_event_nmi_handler() delta_ns: 3084642 handled: 1
+<idle>-0     [000] d.h3   506.334346: nmi_handler: perf_event_nmi_handler() delta_ns: 3080351 handled: 1
+
diff -puN /dev/null include/trace/events/nmi.h
--- /dev/null	2013-03-17 11:51:50.342329668 -0700
+++ linux.git-davehans/include/trace/events/nmi.h	2013-05-29 15:10:19.720684883 -0700
@@ -0,0 +1,37 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM nmi
+
+#if !defined(_TRACE_NMI_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_NMI_H
+
+#include <linux/ktime.h>
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(nmi_handler,
+
+	TP_PROTO(void *handler, s64 delta_ns, int handled),
+
+	TP_ARGS(handler, delta_ns, handled),
+
+	TP_STRUCT__entry(
+		__field(	void *,		handler	)
+		__field(	s64,		delta_ns)
+		__field(	int,		handled	)
+	),
+
+	TP_fast_assign(
+		__entry->handler = handler;
+		__entry->delta_ns = delta_ns;
+		__entry->handled = handled;
+	),
+
+	TP_printk("%ps() delta_ns: %lld handled: %d",
+		__entry->handler,
+		__entry->delta_ns,
+		__entry->handled)
+);
+
+#endif /* _TRACE_NMI_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
_


* Re: [v3][PATCH 2/4] x86: warn when NMI handlers take large amounts of time
  2013-05-29 22:27 ` [v3][PATCH 2/4] x86: warn when NMI handlers take large amounts of time Dave Hansen
@ 2013-05-30  8:33   ` Ingo Molnar
  2013-05-30  9:37   ` Peter Zijlstra
  1 sibling, 0 replies; 7+ messages in thread
From: Ingo Molnar @ 2013-05-30  8:33 UTC (permalink / raw)
  To: Dave Hansen; +Cc: a.p.zijlstra, mingo, paulus, acme, tglx, x86, linux-kernel


* Dave Hansen <dave@sr71.net> wrote:

> @@ -96,8 +106,24 @@ static int __kprobes nmi_handle(unsigned
>  	 * can be latched at any given time.  Walk the whole list
>  	 * to handle those situations.
>  	 */
> -	list_for_each_entry_rcu(a, &desc->head, list)
> +	list_for_each_entry_rcu(a, &desc->head, list) {
> +		u64 before, delta, whole_msecs;
> +		int decimal_msecs;
> +
> +		before = local_clock();
>  		handled += a->handler(type, regs);
> +		delta = local_clock() - before;
> +
> +		if (delta < nmi_longest_ns)
> +			continue;
> +
> +		nmi_longest_ns = delta;
> +		whole_msecs = delta / (1000 * 1000);
> +		decimal_msecs = (delta / 1000) % 1000;
> +		printk_ratelimited(KERN_INFO
> +			"INFO: NMI handler took too long to run: "
> +			"%lld.%03d msecs\n", whole_msecs, decimal_msecs);
> +	}

One more thing, would it make sense to print the symbol name of the 
->handler here? That way the message would be more actionable:

	INFO: NMI handler (nmi_handler_foo) took too long to run: 12.613 msecs

(Note, printk() has a '%ps' format string extension to print function 
pointer symbolic names, so it's easy to add.)

Thanks,

	Ingo


* Re: [v3][PATCH 2/4] x86: warn when NMI handlers take large amounts of time
  2013-05-29 22:27 ` [v3][PATCH 2/4] x86: warn when NMI handlers take large amounts of time Dave Hansen
  2013-05-30  8:33   ` Ingo Molnar
@ 2013-05-30  9:37   ` Peter Zijlstra
  1 sibling, 0 replies; 7+ messages in thread
From: Peter Zijlstra @ 2013-05-30  9:37 UTC (permalink / raw)
  To: Dave Hansen; +Cc: mingo, paulus, acme, tglx, x86, linux-kernel

On Wed, May 29, 2013 at 03:27:59PM -0700, Dave Hansen wrote:
 
> +static u64 nmi_longest_ns = 1000 * 1000 * 1000;

1s? Isn't that a tad too long?

FWIW there's NSEC_PER_{U,M,}SEC for such cases, I find those macros
increase readability.

