public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Dave Hansen <dave.hansen@linux.intel.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	paulus@samba.org, acme@ghostprotocols.net,
	Dave Hansen <dave@sr71.net>, Ingo Molnar <mingo@kernel.org>,
	Weng Meiling <wengmeiling.weng@huawei.com>
Subject: [PATCH 3.10 36/40] perf: Drop sample rate when sampling is too slow
Date: Mon,  9 Jun 2014 15:49:07 -0700	[thread overview]
Message-ID: <20140609224840.392872335@linuxfoundation.org> (raw)
In-Reply-To: <20140609224839.127615063@linuxfoundation.org>

3.10-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Dave Hansen <dave.hansen@linux.intel.com>

commit 14c63f17b1fde5a575a28e96547a22b451c71fb5 upstream.

This patch keeps track of how long perf's NMI handler is taking,
and also calculates how many samples perf can take a second.  If
the sample length times the expected max number of samples
exceeds a configurable threshold, it drops the sample rate.

This way, we don't have a runaway sampling process eating up the
CPU.

This patch can tend to drop the sample rate down to level where
perf doesn't work very well.  *BUT* the alternative is that my
system hangs because it spends all of its time handling NMIs.

I'll take a busted performance tool over an entire system that's
busted and undebuggable any day.

BTW, my suspicion is that there's still an underlying bug here.
Using the HPET instead of the TSC is definitely a contributing
factor, but I suspect there are some other things going on.
But, I can't go dig down on a bug like that with my machine
hanging all the time.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: paulus@samba.org
Cc: acme@ghostprotocols.net
Cc: Dave Hansen <dave@sr71.net>
[ Prettified it a bit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Weng Meiling <wengmeiling.weng@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 Documentation/sysctl/kernel.txt  |   26 +++++++++++
 arch/x86/kernel/cpu/perf_event.c |   12 ++++-
 include/linux/perf_event.h       |    7 ++
 kernel/events/core.c             |   92 +++++++++++++++++++++++++++++++++++++--
 kernel/sysctl.c                  |    9 +++
 5 files changed, 141 insertions(+), 5 deletions(-)

--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -438,6 +438,32 @@ This file shows up if CONFIG_DEBUG_STACK
 
 ==============================================================
 
+perf_cpu_time_max_percent:
+
+Hints to the kernel how much CPU time it should be allowed to
+use to handle perf sampling events.  If the perf subsystem
+is informed that its samples are exceeding this limit, it
+will drop its sampling frequency to attempt to reduce its CPU
+usage.
+
+Some perf sampling happens in NMIs.  If these samples
+unexpectedly take too long to execute, the NMIs can become
+stacked up next to each other so much that nothing else is
+allowed to execute.
+
+0: disable the mechanism.  Do not monitor or correct perf's
+   sampling rate no matter how CPU time it takes.
+
+1-100: attempt to throttle perf's sample rate to this
+   percentage of CPU.  Note: the kernel calculates an
+   "expected" length of each sample event.  100 here means
+   100% of that expected length.  Even if this is set to
+   100, you may still see sample throttling if this
+   length is exceeded.  Set to 0 if you truly do not care
+   how much CPU is consumed.
+
+==============================================================
+
 
 pid_max:
 
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1252,10 +1252,20 @@ void perf_events_lapic_init(void)
 static int __kprobes
 perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
 {
+	int ret;
+	u64 start_clock;
+	u64 finish_clock;
+
 	if (!atomic_read(&active_events))
 		return NMI_DONE;
 
-	return x86_pmu.handle_irq(regs);
+	start_clock = local_clock();
+	ret = x86_pmu.handle_irq(regs);
+	finish_clock = local_clock();
+
+	perf_sample_event_took(finish_clock - start_clock);
+
+	return ret;
 }
 
 struct event_constraint emptyconstraint;
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -695,10 +695,17 @@ static inline void perf_callchain_store(
 extern int sysctl_perf_event_paranoid;
 extern int sysctl_perf_event_mlock;
 extern int sysctl_perf_event_sample_rate;
+extern int sysctl_perf_cpu_time_max_percent;
+
+extern void perf_sample_event_took(u64 sample_len_ns);
 
 extern int perf_proc_update_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos);
+extern int perf_cpu_time_max_percent_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos);
+
 
 static inline bool perf_paranoid_tracepoint_raw(void)
 {
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -165,10 +165,26 @@ int sysctl_perf_event_mlock __read_mostl
 /*
  * max perf event sample rate
  */
-#define DEFAULT_MAX_SAMPLE_RATE 100000
-int sysctl_perf_event_sample_rate __read_mostly = DEFAULT_MAX_SAMPLE_RATE;
-static int max_samples_per_tick __read_mostly =
-	DIV_ROUND_UP(DEFAULT_MAX_SAMPLE_RATE, HZ);
+#define DEFAULT_MAX_SAMPLE_RATE		100000
+#define DEFAULT_SAMPLE_PERIOD_NS	(NSEC_PER_SEC / DEFAULT_MAX_SAMPLE_RATE)
+#define DEFAULT_CPU_TIME_MAX_PERCENT	25
+
+int sysctl_perf_event_sample_rate __read_mostly	= DEFAULT_MAX_SAMPLE_RATE;
+
+static int max_samples_per_tick __read_mostly	= DIV_ROUND_UP(DEFAULT_MAX_SAMPLE_RATE, HZ);
+static int perf_sample_period_ns __read_mostly	= DEFAULT_SAMPLE_PERIOD_NS;
+
+static atomic_t perf_sample_allowed_ns __read_mostly =
+	ATOMIC_INIT( DEFAULT_SAMPLE_PERIOD_NS * DEFAULT_CPU_TIME_MAX_PERCENT / 100);
+
+void update_perf_cpu_limits(void)
+{
+	u64 tmp = perf_sample_period_ns;
+
+	tmp *= sysctl_perf_cpu_time_max_percent;
+	tmp = do_div(tmp, 100);
+	atomic_set(&perf_sample_allowed_ns, tmp);
+}
 
 int perf_proc_update_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
@@ -180,10 +196,78 @@ int perf_proc_update_handler(struct ctl_
 		return ret;
 
 	max_samples_per_tick = DIV_ROUND_UP(sysctl_perf_event_sample_rate, HZ);
+	perf_sample_period_ns = NSEC_PER_SEC / sysctl_perf_event_sample_rate;
+	update_perf_cpu_limits();
+
+	return 0;
+}
+
+int sysctl_perf_cpu_time_max_percent __read_mostly = DEFAULT_CPU_TIME_MAX_PERCENT;
+
+int perf_cpu_time_max_percent_handler(struct ctl_table *table, int write,
+				void __user *buffer, size_t *lenp,
+				loff_t *ppos)
+{
+	int ret = proc_dointvec(table, write, buffer, lenp, ppos);
+
+	if (ret || !write)
+		return ret;
+
+	update_perf_cpu_limits();
 
 	return 0;
 }
 
+/*
+ * perf samples are done in some very critical code paths (NMIs).
+ * If they take too much CPU time, the system can lock up and not
+ * get any real work done.  This will drop the sample rate when
+ * we detect that events are taking too long.
+ */
+#define NR_ACCUMULATED_SAMPLES 128
+DEFINE_PER_CPU(u64, running_sample_length);
+
+void perf_sample_event_took(u64 sample_len_ns)
+{
+	u64 avg_local_sample_len;
+	u64 local_samples_len = __get_cpu_var(running_sample_length);
+
+	if (atomic_read(&perf_sample_allowed_ns) == 0)
+		return;
+
+	/* decay the counter by 1 average sample */
+	local_samples_len = __get_cpu_var(running_sample_length);
+	local_samples_len -= local_samples_len/NR_ACCUMULATED_SAMPLES;
+	local_samples_len += sample_len_ns;
+	__get_cpu_var(running_sample_length) = local_samples_len;
+
+	/*
+	 * note: this will be biased artifically low until we have
+	 * seen NR_ACCUMULATED_SAMPLES.  Doing it this way keeps us
+	 * from having to maintain a count.
+	 */
+	avg_local_sample_len = local_samples_len/NR_ACCUMULATED_SAMPLES;
+
+	if (avg_local_sample_len <= atomic_read(&perf_sample_allowed_ns))
+		return;
+
+	if (max_samples_per_tick <= 1)
+		return;
+
+	max_samples_per_tick = DIV_ROUND_UP(max_samples_per_tick, 2);
+	sysctl_perf_event_sample_rate = max_samples_per_tick * HZ;
+	perf_sample_period_ns = NSEC_PER_SEC / sysctl_perf_event_sample_rate;
+
+	printk_ratelimited(KERN_WARNING
+			"perf samples too long (%lld > %d), lowering "
+			"kernel.perf_event_max_sample_rate to %d\n",
+			avg_local_sample_len,
+			atomic_read(&perf_sample_allowed_ns),
+			sysctl_perf_event_sample_rate);
+
+	update_perf_cpu_limits();
+}
+
 static atomic64_t perf_event_id;
 
 static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1050,6 +1050,15 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= perf_proc_update_handler,
 	},
+	{
+		.procname	= "perf_cpu_time_max_percent",
+		.data		= &sysctl_perf_cpu_time_max_percent,
+		.maxlen		= sizeof(sysctl_perf_cpu_time_max_percent),
+		.mode		= 0644,
+		.proc_handler	= perf_cpu_time_max_percent_handler,
+		.extra1		= &zero,
+		.extra2		= &one_hundred,
+	},
 #endif
 #ifdef CONFIG_KMEMCHECK
 	{



  parent reply	other threads:[~2014-06-09 22:51 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-06-09 22:48 [PATCH 3.10 00/40] 3.10.43-stable review Greg Kroah-Hartman
2014-06-09 22:48 ` [PATCH 3.10 01/40] sched: Use CPUPRI_NR_PRIORITIES instead of MAX_RT_PRIO in cpupri check Greg Kroah-Hartman
2014-06-09 22:48 ` [PATCH 3.10 02/40] sched: Sanitize irq accounting madness Greg Kroah-Hartman
2014-06-09 22:48 ` [PATCH 3.10 03/40] perf: Prevent false warning in perf_swevent_add Greg Kroah-Hartman
2014-06-09 22:48 ` [PATCH 3.10 04/40] perf: Limit perf_event_attr::sample_period to 63 bits Greg Kroah-Hartman
2014-06-09 22:48 ` [PATCH 3.10 05/40] perf: Fix race in removing an event Greg Kroah-Hartman
2014-06-09 22:48 ` [PATCH 3.10 06/40] perf evsel: Fix printing of perf_event_paranoid message Greg Kroah-Hartman
2014-06-09 22:48 ` [PATCH 3.10 07/40] mm/memory-failure.c: fix memory leak by race between poison and unpoison Greg Kroah-Hartman
2014-06-09 22:48 ` [PATCH 3.10 08/40] Documentation: fix DOCBOOKS=... building Greg Kroah-Hartman
2014-06-09 22:48 ` [PATCH 3.10 09/40] hwmon: (ntc_thermistor) Fix dependencies Greg Kroah-Hartman
2014-06-09 22:48 ` [PATCH 3.10 10/40] hwmon: (ntc_thermistor) Fix OF device ID mapping Greg Kroah-Hartman
2014-06-09 22:48 ` [PATCH 3.10 11/40] drm/gf119-/disp: fix nasty bug which can clobber SOR0s clock setup Greg Kroah-Hartman
2014-06-09 22:48 ` [PATCH 3.10 16/40] ARM: OMAP4: Fix the boot regression with CPU_IDLE enabled Greg Kroah-Hartman
2014-06-09 22:48 ` [PATCH 3.10 17/40] ARM: 8051/1: put_user: fix possible data corruption in put_user Greg Kroah-Hartman
2014-06-09 22:48 ` [PATCH 3.10 18/40] dm cache: always split discards on cache block boundaries Greg Kroah-Hartman
2014-06-09 22:48 ` [PATCH 3.10 19/40] sched: Fix hotplug vs. set_cpus_allowed_ptr() Greg Kroah-Hartman
2014-06-09 22:48 ` [PATCH 3.10 20/40] drm/i915: Only copy back the modified fields to userspace from execbuffer Greg Kroah-Hartman
2014-06-09 22:48 ` [PATCH 3.10 21/40] md: always set MD_RECOVERY_INTR when aborting a reshape or other "resync" Greg Kroah-Hartman
2014-06-09 22:48 ` [PATCH 3.10 22/40] md: always set MD_RECOVERY_INTR when interrupting a reshape thread Greg Kroah-Hartman
2014-06-09 22:48 ` [PATCH 3.10 24/40] Staging: speakup: Move pasting into a work item Greg Kroah-Hartman
2014-06-09 22:48 ` [PATCH 3.10 25/40] staging: comedi: ni_daq_700: add mux settling delay Greg Kroah-Hartman
2014-06-09 22:48 ` [PATCH 3.10 26/40] ALSA: hda/realtek - Correction of fixup codes for PB V7900 laptop Greg Kroah-Hartman
2014-06-09 22:48 ` [PATCH 3.10 27/40] ALSA: hda/realtek - Fix COEF widget NID for ALC260 replacer fixup Greg Kroah-Hartman
2014-06-09 22:48 ` [PATCH 3.10 28/40] USB: ftdi_sio: add NovaTech OrionLXm product ID Greg Kroah-Hartman
2014-06-09 22:49 ` [PATCH 3.10 31/40] USB: serial: option: add support for Novatel E371 PCIe card Greg Kroah-Hartman
2014-06-09 22:49 ` [PATCH 3.10 32/40] USB: io_ti: fix firmware download on big-endian machines (part 2) Greg Kroah-Hartman
2014-06-09 22:49 ` [PATCH 3.10 33/40] USB: Avoid runtime suspend loops for HCDs that cant handle suspend/resume Greg Kroah-Hartman
2014-06-09 22:49 ` [PATCH 3.10 34/40] mm: rmap: fix use-after-free in __put_anon_vma Greg Kroah-Hartman
2014-06-09 22:49 ` [PATCH 3.10 35/40] iser-target: Add missing target_put_sess_cmd for ImmedateData failure Greg Kroah-Hartman
2014-06-09 22:49 ` Greg Kroah-Hartman [this message]
2014-06-09 22:49 ` [PATCH 3.10 37/40] perf: Fix interrupt handler timing harness Greg Kroah-Hartman
2014-06-09 22:49 ` [PATCH 3.10 38/40] perf: Enforce 1 as lower limit for perf_event_max_sample_rate Greg Kroah-Hartman
2014-06-09 22:49 ` [PATCH 3.10 39/40] ARM: perf: hook up perf_sample_event_took around pmu irq handling Greg Kroah-Hartman
2014-06-09 22:49 ` [PATCH 3.10 40/40] netfilter: Fix potential use after free in ip6_route_me_harder() Greg Kroah-Hartman
2014-06-10 15:16 ` [PATCH 3.10 00/40] 3.10.43-stable review Guenter Roeck

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140609224840.392872335@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=a.p.zijlstra@chello.nl \
    --cc=acme@ghostprotocols.net \
    --cc=dave.hansen@linux.intel.com \
    --cc=dave@sr71.net \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=paulus@samba.org \
    --cc=stable@vger.kernel.org \
    --cc=wengmeiling.weng@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox