public inbox for linux-kernel@vger.kernel.org
* [RFC][PATCH 0/7] sched: Optimize sched_clock bits
@ 2013-11-29 17:36 Peter Zijlstra
  2013-11-29 17:36 ` [RFC][PATCH 1/7] math64: mul_u64_u32_shr() Peter Zijlstra
                   ` (7 more replies)
  0 siblings, 8 replies; 13+ messages in thread
From: Peter Zijlstra @ 2013-11-29 17:36 UTC (permalink / raw)
  To: Eliezer Tamir
  Cc: John Stultz, Thomas Gleixner, Steven Rostedt, Ingo Molnar,
	Mathieu Desnoyers, Andy Lutomirski, linux-kernel, Tony Luck, hpa,
	Peter Zijlstra

Hi all,

This series is supposed to optimize the kernel/sched/clock.c and x86
sched_clock() implementations.

So far it's only been boot tested, so I have no idea whether it actually
makes things faster, but it does remove the need to disable IRQs.

I'm hoping Eliezer will test this with his benchmark where he could measure a
performance regression between using sched_clock() and local_clock().

It looks like ia64, the only other CONFIG_HAVE_UNSTABLE_SCHED_CLOCK user,
already matches the new expectations -- mostly not requiring IRQs disabled for
calling sched_clock().


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC][PATCH 1/7] math64: mul_u64_u32_shr()
  2013-11-29 17:36 [RFC][PATCH 0/7] sched: Optimize sched_clock bits Peter Zijlstra
@ 2013-11-29 17:36 ` Peter Zijlstra
  2013-11-29 17:36 ` [RFC][PATCH 2/7] x86: Use mul_u64_u32_shr() for native_sched_clock() Peter Zijlstra
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2013-11-29 17:36 UTC (permalink / raw)
  To: Eliezer Tamir
  Cc: John Stultz, Thomas Gleixner, Steven Rostedt, Ingo Molnar,
	Mathieu Desnoyers, Andy Lutomirski, linux-kernel, Tony Luck, hpa,
	Ingo Molnar, Peter Zijlstra

[-- Attachment #1: peterz-mul_u64_u32_shr.patch --]
[-- Type: text/plain, Size: 1920 bytes --]

Introduce mul_u64_u32_shr() as proposed by Andy a while back; it
allows using 64x64->128 muls on 64-bit archs with a recent GCC.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: fweisbec@gmail.com
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/Kconfig       |    1 +
 include/linux/math64.h |   30 ++++++++++++++++++++++++++++++
 init/Kconfig           |    6 ++++++
 3 files changed, 37 insertions(+)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -26,6 +26,7 @@ config X86
 	select HAVE_AOUT if X86_32
 	select HAVE_UNSTABLE_SCHED_CLOCK
 	select ARCH_SUPPORTS_NUMA_BALANCING
+	select ARCH_SUPPORTS_INT128 if X86_64
 	select ARCH_WANTS_PROT_NUMA_PROT_NONE
 	select HAVE_IDE
 	select HAVE_OPROFILE
--- a/include/linux/math64.h
+++ b/include/linux/math64.h
@@ -133,4 +133,34 @@ __iter_div_u64_rem(u64 dividend, u32 div
 	return ret;
 }
 
+#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
+
+#ifndef mul_u64_u32_shr
+static inline u64 mul_u64_u32_shr(u64 a, u32 mul, unsigned int shift)
+{
+	return (u64)(((unsigned __int128)a * mul) >> shift);
+}
+#endif /* mul_u64_u32_shr */
+
+#else
+
+#ifndef mul_u64_u32_shr
+static inline u64 mul_u64_u32_shr(u64 a, u32 mul, unsigned int shift)
+{
+	u32 ah, al;
+	u64 ret;
+
+	al = a;
+	ah = a >> 32;
+
+	ret = ((u64)al * mul) >> shift;
+	if (ah)
+		ret += ((u64)ah * mul) << (32 - shift);
+
+	return ret;
+}
+#endif /* mul_u64_u32_shr */
+
+#endif
+
 #endif /* _LINUX_MATH64_H */
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -809,6 +809,12 @@ config GENERIC_SCHED_CLOCK
 config ARCH_SUPPORTS_NUMA_BALANCING
 	bool
 
+#
+# For architectures that know their GCC __int128 support is sound
+#
+config ARCH_SUPPORTS_INT128
+	bool
+
 # For architectures that (ab)use NUMA to represent different memory regions
 # all cpu-local but of different latencies, such as SuperH.
 #



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC][PATCH 2/7] x86: Use mul_u64_u32_shr() for native_sched_clock()
  2013-11-29 17:36 [RFC][PATCH 0/7] sched: Optimize sched_clock bits Peter Zijlstra
  2013-11-29 17:36 ` [RFC][PATCH 1/7] math64: mul_u64_u32_shr() Peter Zijlstra
@ 2013-11-29 17:36 ` Peter Zijlstra
  2013-11-29 17:37 ` [RFC][PATCH 3/7] x86: Avoid a runtime condition in native_sched_clock() Peter Zijlstra
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2013-11-29 17:36 UTC (permalink / raw)
  To: Eliezer Tamir
  Cc: John Stultz, Thomas Gleixner, Steven Rostedt, Ingo Molnar,
	Mathieu Desnoyers, Andy Lutomirski, linux-kernel, Tony Luck, hpa,
	Peter Zijlstra

[-- Attachment #1: peterz-cycles2ns-math64.patch --]
[-- Type: text/plain, Size: 3240 bytes --]

Use mul_u64_u32_shr() so that x86_64 can use a single 64x64->128 mul.

before:

0000000000000560 <native_sched_clock>:
 560:   44 8b 1d 00 00 00 00    mov    0x0(%rip),%r11d        # 567 <native_sched_clock+0x7>
 567:   55                      push   %rbp
 568:   48 89 e5                mov    %rsp,%rbp
 56b:   45 85 db                test   %r11d,%r11d
 56e:   75 4f                   jne    5bf <native_sched_clock+0x5f>
 570:   0f 31                   rdtsc  
 572:   89 c0                   mov    %eax,%eax
 574:   48 c1 e2 20             shl    $0x20,%rdx
 578:   48 c7 c1 00 00 00 00    mov    $0x0,%rcx
 57f:   48 09 c2                or     %rax,%rdx
 582:   48 c7 c7 00 00 00 00    mov    $0x0,%rdi
 589:   65 8b 04 25 00 00 00    mov    %gs:0x0,%eax
 590:   00 
 591:   48 98                   cltq   
 593:   48 8b 34 c5 00 00 00    mov    0x0(,%rax,8),%rsi
 59a:   00 
 59b:   48 89 d0                mov    %rdx,%rax
 59e:   81 e2 ff 03 00 00       and    $0x3ff,%edx
 5a4:   48 c1 e8 0a             shr    $0xa,%rax
 5a8:   48 0f af 14 0e          imul   (%rsi,%rcx,1),%rdx
 5ad:   48 0f af 04 0e          imul   (%rsi,%rcx,1),%rax
 5b2:   5d                      pop    %rbp
 5b3:   48 03 04 3e             add    (%rsi,%rdi,1),%rax
 5b7:   48 c1 ea 0a             shr    $0xa,%rdx
 5bb:   48 01 d0                add    %rdx,%rax
 5be:   c3                      retq 

after:

0000000000000550 <native_sched_clock>:
 550:   8b 3d 00 00 00 00       mov    0x0(%rip),%edi        # 556 <native_sched_clock+0x6>
 556:   55                      push   %rbp
 557:   48 89 e5                mov    %rsp,%rbp
 55a:   48 83 e4 f0             and    $0xfffffffffffffff0,%rsp
 55e:   85 ff                   test   %edi,%edi
 560:   75 2c                   jne    58e <native_sched_clock+0x3e>
 562:   0f 31                   rdtsc  
 564:   89 c0                   mov    %eax,%eax
 566:   48 c1 e2 20             shl    $0x20,%rdx
 56a:   48 09 c2                or     %rax,%rdx
 56d:   65 48 8b 04 25 00 00    mov    %gs:0x0,%rax
 574:   00 00 
 576:   89 c0                   mov    %eax,%eax
 578:   48 f7 e2                mul    %rdx
 57b:   65 48 8b 0c 25 00 00    mov    %gs:0x0,%rcx
 582:   00 00 
 584:   c9                      leaveq 
 585:   48 0f ac d0 0a          shrd   $0xa,%rdx,%rax
 58a:   48 01 c8                add    %rcx,%rax
 58d:   c3                      retq 

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/include/asm/timer.h |    7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

--- a/arch/x86/include/asm/timer.h
+++ b/arch/x86/include/asm/timer.h
@@ -4,6 +4,7 @@
 #include <linux/pm.h>
 #include <linux/percpu.h>
 #include <linux/interrupt.h>
+#include <linux/math64.h>
 
 #define TICK_SIZE (tick_nsec / 1000)
 
@@ -57,10 +58,8 @@ DECLARE_PER_CPU(unsigned long long, cyc2
 
 static inline unsigned long long __cycles_2_ns(unsigned long long cyc)
 {
-	int cpu = smp_processor_id();
-	unsigned long long ns = per_cpu(cyc2ns_offset, cpu);
-	ns += mult_frac(cyc, per_cpu(cyc2ns, cpu),
-			(1UL << CYC2NS_SCALE_FACTOR));
+	unsigned long long ns = this_cpu_read(cyc2ns_offset);
+	ns += mul_u64_u32_shr(cyc, this_cpu_read(cyc2ns), CYC2NS_SCALE_FACTOR);
 	return ns;
 }
 



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC][PATCH 3/7] x86: Avoid a runtime condition in native_sched_clock()
  2013-11-29 17:36 [RFC][PATCH 0/7] sched: Optimize sched_clock bits Peter Zijlstra
  2013-11-29 17:36 ` [RFC][PATCH 1/7] math64: mul_u64_u32_shr() Peter Zijlstra
  2013-11-29 17:36 ` [RFC][PATCH 2/7] x86: Use mul_u64_u32_shr() for native_sched_clock() Peter Zijlstra
@ 2013-11-29 17:37 ` Peter Zijlstra
  2013-11-29 17:37 ` [RFC][PATCH 4/7] x86: Move some code around Peter Zijlstra
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2013-11-29 17:37 UTC (permalink / raw)
  To: Eliezer Tamir
  Cc: John Stultz, Thomas Gleixner, Steven Rostedt, Ingo Molnar,
	Mathieu Desnoyers, Andy Lutomirski, linux-kernel, Tony Luck, hpa,
	Peter Zijlstra

[-- Attachment #1: peterz-tsc-static_key.patch --]
[-- Type: text/plain, Size: 1397 bytes --]

Use a static_key to avoid a runtime condition in native_sched_clock().

XXX: I still think tsc_disabled should die a horrid death.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/kernel/tsc.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -11,6 +11,7 @@
 #include <linux/clocksource.h>
 #include <linux/percpu.h>
 #include <linux/timex.h>
+#include <linux/static_key.h>
 
 #include <asm/hpet.h>
 #include <asm/timer.h>
@@ -37,6 +38,8 @@ static int __read_mostly tsc_unstable;
    erroneous rdtsc usage on !cpu_has_tsc processors */
 static int __read_mostly tsc_disabled = -1;
 
+static struct static_key __use_tsc = STATIC_KEY_INIT;
+
 int tsc_clocksource_reliable;
 /*
  * Scheduler clock - returns current time in nanosec units.
@@ -53,7 +56,7 @@ u64 native_sched_clock(void)
 	 *   very important for it to be as fast as the platform
 	 *   can achieve it. )
 	 */
-	if (unlikely(tsc_disabled)) {
+	if (!static_key_false(&__use_tsc)) {
 		/* No locking but a rare wrong value is not a big deal: */
 		return (jiffies_64 - INITIAL_JIFFIES) * (1000000000 / HZ);
 	}
@@ -1002,7 +1005,9 @@ void __init tsc_init(void)
 		return;
 
 	/* now allow native_sched_clock() to use rdtsc */
+
 	tsc_disabled = 0;
+	static_key_slow_inc(&__use_tsc);
 
 	if (!no_sched_irq_time)
 		enable_sched_clock_irqtime();



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC][PATCH 4/7] x86: Move some code around
  2013-11-29 17:36 [RFC][PATCH 0/7] sched: Optimize sched_clock bits Peter Zijlstra
                   ` (2 preceding siblings ...)
  2013-11-29 17:37 ` [RFC][PATCH 3/7] x86: Avoid a runtime condition in native_sched_clock() Peter Zijlstra
@ 2013-11-29 17:37 ` Peter Zijlstra
  2013-11-29 17:37 ` [RFC][PATCH 5/7] x86: Use latch data structure for cyc2ns Peter Zijlstra
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2013-11-29 17:37 UTC (permalink / raw)
  To: Eliezer Tamir
  Cc: John Stultz, Thomas Gleixner, Steven Rostedt, Ingo Molnar,
	Mathieu Desnoyers, Andy Lutomirski, linux-kernel, Tony Luck, hpa,
	Peter Zijlstra

[-- Attachment #1: peterz-tsc-move.patch --]
[-- Type: text/plain, Size: 6570 bytes --]

There are no __cycles_2_ns() users outside of arch/x86/kernel/tsc.c,
so move it there.

There are no cycles_2_ns() users.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/include/asm/timer.h |   59 ----------------------
 arch/x86/kernel/tsc.c        |  112 +++++++++++++++++++++++--------------------
 2 files changed, 61 insertions(+), 110 deletions(-)

--- a/arch/x86/include/asm/timer.h
+++ b/arch/x86/include/asm/timer.h
@@ -13,66 +13,7 @@ extern int recalibrate_cpu_khz(void);
 
 extern int no_timer_check;
 
-/* Accelerators for sched_clock()
- * convert from cycles(64bits) => nanoseconds (64bits)
- *  basic equation:
- *		ns = cycles / (freq / ns_per_sec)
- *		ns = cycles * (ns_per_sec / freq)
- *		ns = cycles * (10^9 / (cpu_khz * 10^3))
- *		ns = cycles * (10^6 / cpu_khz)
- *
- *	Then we use scaling math (suggested by george@mvista.com) to get:
- *		ns = cycles * (10^6 * SC / cpu_khz) / SC
- *		ns = cycles * cyc2ns_scale / SC
- *
- *	And since SC is a constant power of two, we can convert the div
- *  into a shift.
- *
- *  We can use khz divisor instead of mhz to keep a better precision, since
- *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
- *  (mathieu.desnoyers@polymtl.ca)
- *
- *			-johnstul@us.ibm.com "math is hard, lets go shopping!"
- *
- * In:
- *
- * ns = cycles * cyc2ns_scale / SC
- *
- * Although we may still have enough bits to store the value of ns,
- * in some cases, we may not have enough bits to store cycles * cyc2ns_scale,
- * leading to an incorrect result.
- *
- * To avoid this, we can decompose 'cycles' into quotient and remainder
- * of division by SC.  Then,
- *
- * ns = (quot * SC + rem) * cyc2ns_scale / SC
- *    = quot * cyc2ns_scale + (rem * cyc2ns_scale) / SC
- *
- *			- sqazi@google.com
- */
-
 DECLARE_PER_CPU(unsigned long, cyc2ns);
 DECLARE_PER_CPU(unsigned long long, cyc2ns_offset);
 
-#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
-
-static inline unsigned long long __cycles_2_ns(unsigned long long cyc)
-{
-	unsigned long long ns = this_cpu_read(cyc2ns_offset);
-	ns += mul_u64_u32_shr(cyc, this_cpu_read(cyc2ns), CYC2NS_SCALE_FACTOR);
-	return ns;
-}
-
-static inline unsigned long long cycles_2_ns(unsigned long long cyc)
-{
-	unsigned long long ns;
-	unsigned long flags;
-
-	local_irq_save(flags);
-	ns = __cycles_2_ns(cyc);
-	local_irq_restore(flags);
-
-	return ns;
-}
-
 #endif /* _ASM_X86_TIMER_H */
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -41,6 +41,66 @@ static int __read_mostly tsc_disabled =
 static struct static_key __use_tsc = STATIC_KEY_INIT;
 
 int tsc_clocksource_reliable;
+
+/* Accelerators for sched_clock()
+ * convert from cycles(64bits) => nanoseconds (64bits)
+ *  basic equation:
+ *              ns = cycles / (freq / ns_per_sec)
+ *              ns = cycles * (ns_per_sec / freq)
+ *              ns = cycles * (10^9 / (cpu_khz * 10^3))
+ *              ns = cycles * (10^6 / cpu_khz)
+ *
+ *      Then we use scaling math (suggested by george@mvista.com) to get:
+ *              ns = cycles * (10^6 * SC / cpu_khz) / SC
+ *              ns = cycles * cyc2ns_scale / SC
+ *
+ *      And since SC is a constant power of two, we can convert the div
+ *  into a shift.
+ *
+ *  We can use khz divisor instead of mhz to keep a better precision, since
+ *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
+ *  (mathieu.desnoyers@polymtl.ca)
+ *
+ *                      -johnstul@us.ibm.com "math is hard, lets go shopping!"
+ */
+
+DEFINE_PER_CPU(unsigned long, cyc2ns);
+DEFINE_PER_CPU(unsigned long long, cyc2ns_offset);
+
+#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
+
+static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+{
+	unsigned long long ns = this_cpu_read(cyc2ns_offset);
+	ns += mul_u64_u32_shr(cyc, this_cpu_read(cyc2ns), CYC2NS_SCALE_FACTOR);
+	return ns;
+}
+
+static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
+{
+	unsigned long long tsc_now, ns_now, *offset;
+	unsigned long flags, *scale;
+
+	local_irq_save(flags);
+	sched_clock_idle_sleep_event();
+
+	scale = &per_cpu(cyc2ns, cpu);
+	offset = &per_cpu(cyc2ns_offset, cpu);
+
+	rdtscll(tsc_now);
+	ns_now = cycles_2_ns(tsc_now);
+
+	if (cpu_khz) {
+		*scale = ((NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR) +
+				cpu_khz / 2) / cpu_khz;
+		*offset = ns_now - mult_frac(tsc_now, *scale,
+					     (1UL << CYC2NS_SCALE_FACTOR));
+	}
+
+	sched_clock_idle_wakeup_event(0);
+	local_irq_restore(flags);
+}
+
 /*
  * Scheduler clock - returns current time in nanosec units.
  */
@@ -65,7 +125,7 @@ u64 native_sched_clock(void)
 	rdtscll(this_offset);
 
 	/* return the value in ns */
-	return __cycles_2_ns(this_offset);
+	return cycles_2_ns(this_offset);
 }
 
 /* We need to define a real function for sched_clock, to override the
@@ -592,56 +652,6 @@ int recalibrate_cpu_khz(void)
 EXPORT_SYMBOL(recalibrate_cpu_khz);
 
 
-/* Accelerators for sched_clock()
- * convert from cycles(64bits) => nanoseconds (64bits)
- *  basic equation:
- *              ns = cycles / (freq / ns_per_sec)
- *              ns = cycles * (ns_per_sec / freq)
- *              ns = cycles * (10^9 / (cpu_khz * 10^3))
- *              ns = cycles * (10^6 / cpu_khz)
- *
- *      Then we use scaling math (suggested by george@mvista.com) to get:
- *              ns = cycles * (10^6 * SC / cpu_khz) / SC
- *              ns = cycles * cyc2ns_scale / SC
- *
- *      And since SC is a constant power of two, we can convert the div
- *  into a shift.
- *
- *  We can use khz divisor instead of mhz to keep a better precision, since
- *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
- *  (mathieu.desnoyers@polymtl.ca)
- *
- *                      -johnstul@us.ibm.com "math is hard, lets go shopping!"
- */
-
-DEFINE_PER_CPU(unsigned long, cyc2ns);
-DEFINE_PER_CPU(unsigned long long, cyc2ns_offset);
-
-static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
-{
-	unsigned long long tsc_now, ns_now, *offset;
-	unsigned long flags, *scale;
-
-	local_irq_save(flags);
-	sched_clock_idle_sleep_event();
-
-	scale = &per_cpu(cyc2ns, cpu);
-	offset = &per_cpu(cyc2ns_offset, cpu);
-
-	rdtscll(tsc_now);
-	ns_now = __cycles_2_ns(tsc_now);
-
-	if (cpu_khz) {
-		*scale = ((NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR) +
-				cpu_khz / 2) / cpu_khz;
-		*offset = ns_now - mult_frac(tsc_now, *scale,
-					     (1UL << CYC2NS_SCALE_FACTOR));
-	}
-
-	sched_clock_idle_wakeup_event(0);
-	local_irq_restore(flags);
-}
-
 static unsigned long long cyc2ns_suspend;
 
 void tsc_save_sched_clock_state(void)



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC][PATCH 5/7] x86: Use latch data structure for cyc2ns
  2013-11-29 17:36 [RFC][PATCH 0/7] sched: Optimize sched_clock bits Peter Zijlstra
                   ` (3 preceding siblings ...)
  2013-11-29 17:37 ` [RFC][PATCH 4/7] x86: Move some code around Peter Zijlstra
@ 2013-11-29 17:37 ` Peter Zijlstra
  2013-11-29 23:22   ` Andy Lutomirski
  2013-11-29 17:37 ` [RFC][PATCH 6/7] sched: Remove local_irq_disable() from the clocks Peter Zijlstra
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2013-11-29 17:37 UTC (permalink / raw)
  To: Eliezer Tamir
  Cc: John Stultz, Thomas Gleixner, Steven Rostedt, Ingo Molnar,
	Mathieu Desnoyers, Andy Lutomirski, linux-kernel, Tony Luck, hpa,
	Peter Zijlstra

[-- Attachment #1: peterz-tsc-latch.patch --]
[-- Type: text/plain, Size: 10184 bytes --]

Use the 'latch' data structure for cyc2ns.

This is a data structure first proposed by me and later named by
Mathieu. If anybody's got a better name, do holler.

It's a multi-version thing which always provides a coherent object;
we use this to avoid having to disable IRQs while reading
sched_clock(), and it also avoids problems when an NMI hits while
we're changing the cyc2ns data.

The patch should have plenty of comments actually explaining the thing.

The hope is that the extra logic is offset by no longer requiring the
sti;cli around reading the clock.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/include/asm/timer.h     |   21 +++-
 arch/x86/kernel/cpu/perf_event.c |   19 ++-
 arch/x86/kernel/tsc.c            |  186 ++++++++++++++++++++++++++++++++++-----
 3 files changed, 195 insertions(+), 31 deletions(-)

--- a/arch/x86/include/asm/timer.h
+++ b/arch/x86/include/asm/timer.h
@@ -13,7 +13,24 @@ extern int recalibrate_cpu_khz(void);
 
 extern int no_timer_check;
 
-DECLARE_PER_CPU(unsigned long, cyc2ns);
-DECLARE_PER_CPU(unsigned long long, cyc2ns_offset);
+/*
+ * We use the full linear equation: f(x) = a + b*x, in order to allow
+ * a continuous function in the face of dynamic freq changes.
+ *
+ * Continuity means that when our frequency changes our slope (b); we want to
+ * ensure that: f(t) == f'(t), which gives: a + b*t == a' + b'*t.
+ *
+ * Without an offset (a) the above would not be possible.
+ *
+ * See the comment near cycles_2_ns() for details on how we compute (b).
+ */
+struct cyc2ns_data {
+	u32 cyc2ns_mul;
+	u32 cyc2ns_shift;
+	u64 cyc2ns_offset;
+};
+
+extern struct cyc2ns_data __percpu *cyc2ns_read_begin(unsigned int *tail);
+extern bool cyc2ns_read_retry(unsigned int tail);
 
 #endif /* _ASM_X86_TIMER_H */
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1890,6 +1890,9 @@ static struct pmu pmu = {
 
 void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
 {
+	struct cyc2ns_data __percpu *data;
+	unsigned int tail;
+
 	userpg->cap_user_time = 0;
 	userpg->cap_user_time_zero = 0;
 	userpg->cap_user_rdpmc = x86_pmu.attr_rdpmc;
@@ -1898,13 +1901,17 @@ void arch_perf_update_userpage(struct pe
 	if (!sched_clock_stable)
 		return;
 
-	userpg->cap_user_time = 1;
-	userpg->time_mult = this_cpu_read(cyc2ns);
-	userpg->time_shift = CYC2NS_SCALE_FACTOR;
-	userpg->time_offset = this_cpu_read(cyc2ns_offset) - now;
+	do {
+		data = cyc2ns_read_begin(&tail);
+
+		userpg->cap_user_time = 1;
+		userpg->time_mult = this_cpu_read(data->cyc2ns_mul);
+		userpg->time_shift = this_cpu_read(data->cyc2ns_shift);
+		userpg->time_offset = this_cpu_read(data->cyc2ns_offset) - now;
 
-	userpg->cap_user_time_zero = 1;
-	userpg->time_zero = this_cpu_read(cyc2ns_offset);
+		userpg->cap_user_time_zero = 1;
+		userpg->time_zero = this_cpu_read(data->cyc2ns_offset);
+	} while (cyc2ns_read_retry(tail));
 }
 
 /*
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -42,7 +42,119 @@ static struct static_key __use_tsc = STA
 
 int tsc_clocksource_reliable;
 
-/* Accelerators for sched_clock()
+/*
+ * The latch data structure below is a multi-version (mostly) wait-free data
+ * structure that guarantees there's always a consistent @data part; even
+ * during writing.
+ *
+ * The data structure is an array of two entries (which is sufficient when we
+ * assume updates are sequential) with a head and tail counter; updates are
+ * ordered like:
+ *
+ *   head++                  t = tail
+ *   WMB    (A)              RMB    (C)
+ *   data[head & 1] = d      d = data[tail & 1]
+ *   WMB    (B)              RMB    (D)
+ *   tail++                  retry = (head - t) >= 2
+ *
+ * This means that we can always use an {offset, mul} pair to compute a ns
+ * value that is 'roughly' in the right direction, even if we're writing a new
+ * {offset, mul} pair during the clock read.
+ *
+ * The downside is that we can no longer guarantee strict monotonicity
+ * (assuming the TSC was monotonic to begin with), because while we compute the
+ * intersection point of the two clock slopes and make sure the time is
+ * continuous at the point of switching; we can no longer guarantee a reader is
+ * strictly before or after the switch point.
+ *
+ * It does mean a reader no longer needs to disable IRQs in order to avoid
+ * CPU-Freq updates messing with its times, and similarly an NMI reader will
+ * no longer run the risk of hitting half-written state.
+ */
+
+struct cyc2ns_latch {
+	unsigned int head, tail;
+	struct cyc2ns_data data[2];
+};
+
+static DEFINE_PER_CPU(struct cyc2ns_latch, cyc2ns);
+
+/*
+ * Use the {offset, mul} pair of this CPU.
+ *
+ * We use the @data entry that was completed last -- the tail entry.
+ */
+struct cyc2ns_data __percpu *cyc2ns_read_begin(unsigned int *tail)
+{
+	preempt_disable();
+	*tail = this_cpu_read(cyc2ns.tail);
+	/*
+	 * Ensure we read the tail value before we read the data corresponding
+	 * with it. This might actually be implied because of the data
+	 * dependency.
+	 */
+	smp_rmb(); /* C, matches B */
+	return cyc2ns.data + (*tail & 1);
+}
+
+/*
+ * We only need to retry when we observe two or more writes have happened since
+ * we started. This is because we only have storage for 2 @data entries.
+ *
+ * This still means the data structure is (mostly) wait-free because when
+ * writers are scarce this will (nearly) never happen; further it guarantees we
+ * can read while a writer is busy, because there's always the last completed
+ * state available.
+ */
+bool cyc2ns_read_retry(unsigned int tail)
+{
+	bool retry;
+
+	/*
+	 * Ensure we finish reading the data before reading the head entry.
+	 */
+	smp_rmb(); /* D, matches A */
+	retry = unlikely(this_cpu_read(cyc2ns.head) - tail >= 2);
+	preempt_enable();
+
+	return retry;
+}
+
+/*
+ * Begin writing a new @data entry for @cpu.
+ */
+static struct cyc2ns_data *cyc2ns_write_begin(int cpu)
+{
+	struct cyc2ns_latch *latch = &per_cpu(cyc2ns, cpu);
+
+	ACCESS_ONCE(latch->head)++;
+	/*
+	 * Order the head increment against the data writes; this guarantees
+	 * that a read will see this data entry invalidated before we actually
+	 * write to it.
+	 */
+	smp_wmb(); /* A, matches D */
+	return latch->data + (latch->head & 1);
+}
+
+/*
+ * Complete writing a new @data entry for @cpu.
+ */
+static void cyc2ns_write_end(int cpu)
+{
+	struct cyc2ns_latch *latch = &per_cpu(cyc2ns, cpu);
+
+	/*
+	 * Ensure the data entry is fully written before people can observe the
+	 * new tail index. This guarantees that if you observe a tail index,
+	 * the corresponding entry is indeed complete.
+	 */
+	smp_wmb(); /* B, matches C */
+	ACCESS_ONCE(latch->tail)++;
+}
+
+/*
+ * Accelerators for sched_clock()
  * convert from cycles(64bits) => nanoseconds (64bits)
  *  basic equation:
  *              ns = cycles / (freq / ns_per_sec)
@@ -64,49 +176,67 @@ int tsc_clocksource_reliable;
  *                      -johnstul@us.ibm.com "math is hard, lets go shopping!"
  */
 
-DEFINE_PER_CPU(unsigned long, cyc2ns);
-DEFINE_PER_CPU(unsigned long long, cyc2ns_offset);
-
 #define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
 
 static inline unsigned long long cycles_2_ns(unsigned long long cyc)
 {
-	unsigned long long ns = this_cpu_read(cyc2ns_offset);
-	ns += mul_u64_u32_shr(cyc, this_cpu_read(cyc2ns), CYC2NS_SCALE_FACTOR);
+	struct cyc2ns_data __percpu *data;
+	unsigned long long ns;
+	unsigned int tail;
+
+	do {
+		data = cyc2ns_read_begin(&tail);
+
+		ns = this_cpu_read(data->cyc2ns_offset);
+		ns += mul_u64_u32_shr(cyc, this_cpu_read(data->cyc2ns_mul),
+				CYC2NS_SCALE_FACTOR);
+	} while (cyc2ns_read_retry(tail));
+
 	return ns;
 }
 
+/* XXX surely we already have this someplace in the kernel?! */
+#define DIV_ROUND(n, d) (((n) + ((d) / 2)) / (d))
+
 static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
 {
-	unsigned long long tsc_now, ns_now, *offset;
-	unsigned long flags, *scale;
+	unsigned long long tsc_now, ns_now;
+	struct cyc2ns_data *data;
+	unsigned long flags;
 
 	local_irq_save(flags);
 	sched_clock_idle_sleep_event();
 
-	scale = &per_cpu(cyc2ns, cpu);
-	offset = &per_cpu(cyc2ns_offset, cpu);
+	if (!cpu_khz)
+		goto done;
+
+	data = cyc2ns_write_begin(cpu);
 
 	rdtscll(tsc_now);
 	ns_now = cycles_2_ns(tsc_now);
 
-	if (cpu_khz) {
-		*scale = ((NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR) +
-				cpu_khz / 2) / cpu_khz;
-		*offset = ns_now - mult_frac(tsc_now, *scale,
-					     (1UL << CYC2NS_SCALE_FACTOR));
-	}
+	/*
+	 * Compute a new multiplier as per the above comment and ensure our
+	 * time function is continuous; see the comment near struct
+	 * cyc2ns_data.
+	 */
+	data->cyc2ns_mul = DIV_ROUND(NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR, cpu_khz);
+	data->cyc2ns_shift = CYC2NS_SCALE_FACTOR;
+	data->cyc2ns_offset =
+		ns_now - mul_u64_u32_shr(tsc_now, data->cyc2ns_mul, CYC2NS_SCALE_FACTOR);
+
+	cyc2ns_write_end(cpu);
 
+done:
 	sched_clock_idle_wakeup_event(0);
 	local_irq_restore(flags);
 }
-
 /*
  * Scheduler clock - returns current time in nanosec units.
  */
 u64 native_sched_clock(void)
 {
-	u64 this_offset;
+	u64 tsc_now;
 
 	/*
 	 * Fall back to jiffies if there's no TSC available:
@@ -122,10 +252,10 @@ u64 native_sched_clock(void)
 	}
 
 	/* read the Time Stamp Counter: */
-	rdtscll(this_offset);
+	rdtscll(tsc_now);
 
 	/* return the value in ns */
-	return cycles_2_ns(this_offset);
+	return cycles_2_ns(tsc_now);
 }
 
 /* We need to define a real function for sched_clock, to override the
@@ -681,11 +811,21 @@ void tsc_restore_sched_clock_state(void)
 
 	local_irq_save(flags);
 
-	__this_cpu_write(cyc2ns_offset, 0);
+	/*
+	 * We're coming out of suspend, there's no concurrency yet; don't
+	 * bother being nice about the latch stuff, just write to both
+	 * data fields.
+	 */
+
+	this_cpu_write(cyc2ns.data[0].cyc2ns_offset, 0);
+	this_cpu_write(cyc2ns.data[1].cyc2ns_offset, 0);
+
 	offset = cyc2ns_suspend - sched_clock();
 
-	for_each_possible_cpu(cpu)
-		per_cpu(cyc2ns_offset, cpu) = offset;
+	for_each_possible_cpu(cpu) {
+		per_cpu(cyc2ns.data[0].cyc2ns_offset, cpu) = offset;
+		per_cpu(cyc2ns.data[1].cyc2ns_offset, cpu) = offset;
+	}
 
 	local_irq_restore(flags);
 }



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC][PATCH 6/7] sched: Remove local_irq_disable() from the clocks
  2013-11-29 17:36 [RFC][PATCH 0/7] sched: Optimize sched_clock bits Peter Zijlstra
                   ` (4 preceding siblings ...)
  2013-11-29 17:37 ` [RFC][PATCH 5/7] x86: Use latch data structure for cyc2ns Peter Zijlstra
@ 2013-11-29 17:37 ` Peter Zijlstra
  2013-11-29 17:37 ` [RFC][PATCH 7/7] sched: Use a static_key for sched_clock_stable Peter Zijlstra
  2013-12-01 18:08 ` [RFC][PATCH 0/7] sched: Optimize sched_clock bits Eliezer Tamir
  7 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2013-11-29 17:37 UTC (permalink / raw)
  To: Eliezer Tamir
  Cc: John Stultz, Thomas Gleixner, Steven Rostedt, Ingo Molnar,
	Mathieu Desnoyers, Andy Lutomirski, linux-kernel, Tony Luck, hpa,
	Peter Zijlstra

[-- Attachment #1: peterz-sched_clock_irq.patch --]
[-- Type: text/plain, Size: 2074 bytes --]

Now that x86 uses the 'latch' structure to avoid having to disable
IRQs while using sched_clock(), and ia64 never had this requirement
in the first place (it doesn't seem to do cpufreq at all), we can
drop the requirement that IRQs be disabled.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/clock.c |   30 ++++--------------------------
 1 file changed, 4 insertions(+), 26 deletions(-)

--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -26,9 +26,10 @@
  * at 0 on boot (but people really shouldn't rely on that).
  *
  * cpu_clock(i)       -- can be used from any context, including NMI.
- * sched_clock_cpu(i) -- must be used with local IRQs disabled (implied by NMI)
  * local_clock()      -- is cpu_clock() on the current cpu.
  *
+ * sched_clock_cpu(i)
+ *
  * How:
  *
  * The implementation either uses sched_clock() when
@@ -50,15 +51,6 @@
  * Furthermore, explicit sleep and wakeup hooks allow us to account for time
  * that is otherwise invisible (TSC gets stopped).
  *
- *
- * Notes:
- *
- * The !IRQ-safetly of sched_clock() and sched_clock_cpu() comes from things
- * like cpufreq interrupts that can change the base clock (TSC) multiplier
- * and cause funny jumps in time -- although the filtering provided by
- * sched_clock_cpu() should mitigate serious artifacts we cannot rely on it
- * in general since for !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK we fully rely on
- * sched_clock().
  */
 #include <linux/spinlock.h>
 #include <linux/hardirq.h>
@@ -316,14 +308,7 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeu
  */
 u64 cpu_clock(int cpu)
 {
-	u64 clock;
-	unsigned long flags;
-
-	local_irq_save(flags);
-	clock = sched_clock_cpu(cpu);
-	local_irq_restore(flags);
-
-	return clock;
+	return sched_clock_cpu(cpu);
 }
 
 /*
@@ -335,14 +320,7 @@ u64 cpu_clock(int cpu)
  */
 u64 local_clock(void)
 {
-	u64 clock;
-	unsigned long flags;
-
-	local_irq_save(flags);
-	clock = sched_clock_cpu(smp_processor_id());
-	local_irq_restore(flags);
-
-	return clock;
+	return sched_clock_cpu(smp_processor_id());
 }
 
 #else /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC][PATCH 7/7] sched: Use a static_key for sched_clock_stable
  2013-11-29 17:36 [RFC][PATCH 0/7] sched: Optimize sched_clock bits Peter Zijlstra
                   ` (5 preceding siblings ...)
  2013-11-29 17:37 ` [RFC][PATCH 6/7] sched: Remove local_irq_disable() from the clocks Peter Zijlstra
@ 2013-11-29 17:37 ` Peter Zijlstra
  2013-12-01 18:08 ` [RFC][PATCH 0/7] sched: Optimize sched_clock bits Eliezer Tamir
  7 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2013-11-29 17:37 UTC (permalink / raw)
  To: Eliezer Tamir
  Cc: John Stultz, Thomas Gleixner, Steven Rostedt, Ingo Molnar,
	Mathieu Desnoyers, Andy Lutomirski, linux-kernel, Tony Luck, hpa,
	Peter Zijlstra

[-- Attachment #1: peterz-sched_clock_static_key.patch --]
[-- Type: text/plain, Size: 6205 bytes --]

In order to avoid the runtime condition check, turn sched_clock_stable
into a static_key.

Also provide a shorter implementation of local_clock() and
cpu_clock(int) when sched_clock_stable==1.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/kernel/cpu/amd.c        |    2 -
 arch/x86/kernel/cpu/intel.c      |    2 -
 arch/x86/kernel/cpu/perf_event.c |    2 -
 arch/x86/kernel/tsc.c            |    6 ++---
 include/linux/sched.h            |    4 ++-
 kernel/sched/clock.c             |   41 ++++++++++++++++++++++++++++++++-------
 kernel/sched/debug.c             |    2 -
 kernel/time/tick-sched.c         |    2 -
 kernel/trace/ring_buffer.c       |    2 -
 9 files changed, 46 insertions(+), 17 deletions(-)

--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -487,7 +487,7 @@ static void early_init_amd(struct cpuinf
 		set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
 		set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
 		if (!check_tsc_unstable())
-			sched_clock_stable = 1;
+			set_sched_clock_stable();
 	}
 
 #ifdef CONFIG_X86_64
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -93,7 +93,7 @@ static void early_init_intel(struct cpui
 		set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
 		set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
 		if (!check_tsc_unstable())
-			sched_clock_stable = 1;
+			set_sched_clock_stable();
 	}
 
 	/* Penwell and Cloverview have the TSC which doesn't sleep on S3 */
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1898,7 +1898,7 @@ void arch_perf_update_userpage(struct pe
 	userpg->cap_user_rdpmc = x86_pmu.attr_rdpmc;
 	userpg->pmc_width = x86_pmu.cntval_bits;
 
-	if (!sched_clock_stable)
+	if (!sched_clock_stable())
 		return;
 
 	do {
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -786,7 +786,7 @@ static unsigned long long cyc2ns_suspend
 
 void tsc_save_sched_clock_state(void)
 {
-	if (!sched_clock_stable)
+	if (!sched_clock_stable())
 		return;
 
 	cyc2ns_suspend = sched_clock();
@@ -806,7 +806,7 @@ void tsc_restore_sched_clock_state(void)
 	unsigned long flags;
 	int cpu;
 
-	if (!sched_clock_stable)
+	if (!sched_clock_stable())
 		return;
 
 	local_irq_save(flags);
@@ -948,7 +948,7 @@ void mark_tsc_unstable(char *reason)
 {
 	if (!tsc_unstable) {
 		tsc_unstable = 1;
-		sched_clock_stable = 0;
+		clear_sched_clock_stable();
 		disable_sched_clock_irqtime();
 		pr_info("Marking TSC unstable due to %s\n", reason);
 		/* Change only the rating, when not registered */
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1880,7 +1880,9 @@ static inline void sched_clock_idle_wake
  * but then during bootup it turns out that sched_clock()
  * is reliable after all:
  */
-extern int sched_clock_stable;
+extern int sched_clock_stable(void);
+extern void set_sched_clock_stable(void);
+extern void clear_sched_clock_stable(void);
 
 extern void sched_clock_tick(void);
 extern void sched_clock_idle_sleep_event(void);
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -58,6 +58,7 @@
 #include <linux/percpu.h>
 #include <linux/ktime.h>
 #include <linux/sched.h>
+#include <linux/static_key.h>
 
 /*
  * Scheduler clock - returns current time in nanosec units.
@@ -74,7 +75,27 @@ EXPORT_SYMBOL_GPL(sched_clock);
 __read_mostly int sched_clock_running;
 
 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
-__read_mostly int sched_clock_stable;
+static struct static_key __sched_clock_stable = STATIC_KEY_INIT;
+
+int sched_clock_stable(void)
+{
+	if (static_key_false(&__sched_clock_stable))
+		return false;
+	return true;
+}
+
+void set_sched_clock_stable(void)
+{
+	if (!sched_clock_stable())
+		static_key_slow_dec(&__sched_clock_stable);
+}
+
+void clear_sched_clock_stable(void)
+{
+	/* XXX worry about clock continuity */
+	if (sched_clock_stable())
+		static_key_slow_inc(&__sched_clock_stable);
+}
 
 struct sched_clock_data {
 	u64			tick_raw;
@@ -236,7 +257,7 @@ u64 sched_clock_cpu(int cpu)
 
 	WARN_ON_ONCE(!irqs_disabled());
 
-	if (sched_clock_stable)
+	if (sched_clock_stable())
 		return sched_clock();
 
 	if (unlikely(!sched_clock_running))
@@ -257,7 +278,7 @@ void sched_clock_tick(void)
 	struct sched_clock_data *scd;
 	u64 now, now_gtod;
 
-	if (sched_clock_stable)
+	if (sched_clock_stable())
 		return;
 
 	if (unlikely(!sched_clock_running))
@@ -308,7 +329,10 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeu
  */
 u64 cpu_clock(int cpu)
 {
-	return sched_clock_cpu(cpu);
+	if (static_key_false(&__sched_clock_stable))
+		return sched_clock_cpu(cpu);
+
+	return sched_clock();
 }
 
 /*
@@ -320,7 +344,10 @@ u64 cpu_clock(int cpu)
  */
 u64 local_clock(void)
 {
-	return sched_clock_cpu(smp_processor_id());
+	if (static_key_false(&__sched_clock_stable))
+		return sched_clock_cpu(smp_processor_id());
+
+	return sched_clock();
 }
 
 #else /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */
@@ -340,12 +367,12 @@ u64 sched_clock_cpu(int cpu)
 
 u64 cpu_clock(int cpu)
 {
-	return sched_clock_cpu(cpu);
+	return sched_clock();
 }
 
 u64 local_clock(void)
 {
-	return sched_clock_cpu(0);
+	return sched_clock();
 }
 
 #endif /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -371,7 +371,7 @@ static void sched_debug_header(struct se
 	PN(cpu_clk);
 	P(jiffies);
 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
-	P(sched_clock_stable);
+	P(sched_clock_stable());
 #endif
 #undef PN
 #undef P
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -177,7 +177,7 @@ static bool can_stop_full_tick(void)
 	 * TODO: kick full dynticks CPUs when
 	 * sched_clock_stable is set.
 	 */
-	if (!sched_clock_stable) {
+	if (!sched_clock_stable()) {
 		trace_tick_stop(0, "unstable sched clock\n");
 		/*
 		 * Don't allow the user to think they can get
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2558,7 +2558,7 @@ rb_reserve_next_event(struct ring_buffer
 		if (unlikely(test_time_stamp(delta))) {
 			int local_clock_stable = 1;
 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
-			local_clock_stable = sched_clock_stable;
+			local_clock_stable = sched_clock_stable();
 #endif
 			WARN_ONCE(delta > (1ULL << 59),
 				  KERN_WARNING "Delta way too big! %llu ts=%llu write stamp = %llu\n%s",



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC][PATCH 5/7] x86: Use latch data structure for cyc2ns
  2013-11-29 17:37 ` [RFC][PATCH 5/7] x86: Use latch data structure for cyc2ns Peter Zijlstra
@ 2013-11-29 23:22   ` Andy Lutomirski
  2013-11-30  9:18     ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Andy Lutomirski @ 2013-11-29 23:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Eliezer Tamir, John Stultz, Thomas Gleixner, Steven Rostedt,
	Ingo Molnar, Mathieu Desnoyers, linux-kernel@vger.kernel.org,
	Tony Luck, H. Peter Anvin

On Fri, Nov 29, 2013 at 9:37 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> Use the 'latch' data structure for cyc2ns.
>
> This is a data structure first proposed by me and later named by
> Mathieu. If anybody's got a better name; do holler.

That structure must exist in the literature, but I have no idea what
it's called.  It's a multi-word lock-free atomic (I think -- maybe
it's just regular) register.  I even published a considerably fancier
version of much the same thing a few years ago.  :)

>
> Its a multi-version thing which allows always having a coherent
> object; we use this to avoid having to disable IRQs while reading
> sched_clock() and avoids a problem when getting an NMI while changing
> the cyc2ns data.
>
> The patch should have plenty comments actually explaining the thing.
>
> The hope is that the extra logic is offset by no longer requiring the
> sti;cli around reading the clock.

I've occasionally wondered whether it would be possible to make a
monotonicity-preserving version of this and use it for clock_gettime.
One approach: have the writer set the time for the update to be a bit
in the future and have the reader compare the current raw time to the
cutoff to see which set of frequency/offset to use.  (This requires
having some kind of bound on how long it takes to update the data
structures.)

The advantage: clock_gettime would never block.
The disadvantage: complicated, potentially nasty to implement, and it
would get complicated if anyone tried to allow multiple updates in
rapid succession.

Anyway, this is mostly irrelevant to your patches.

--Andy

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC][PATCH 5/7] x86: Use latch data structure for cyc2ns
  2013-11-29 23:22   ` Andy Lutomirski
@ 2013-11-30  9:18     ` Peter Zijlstra
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2013-11-30  9:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eliezer Tamir, John Stultz, Thomas Gleixner, Steven Rostedt,
	Ingo Molnar, Mathieu Desnoyers, linux-kernel@vger.kernel.org,
	Tony Luck, H. Peter Anvin

On Fri, Nov 29, 2013 at 03:22:45PM -0800, Andy Lutomirski wrote:
> On Fri, Nov 29, 2013 at 9:37 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > Use the 'latch' data structure for cyc2ns.
> >
> > This is a data structure first proposed by me and later named by
> > Mathieu. If anybody's got a better name; do holler.
> 
> That structure must exist in the literature, but I have no idea what
> it's called.  It's a multi-word lock-free atomic (I think -- maybe
> it's just regular) register.  I even published a considerably fancier
> version of much the same thing a few years ago.  :)

Yeah, it's a fairly straightforward thing; it has to be named someplace ;-)

> I've occasionally wondered whether it would be possible to make a
> monotonicity-preserving version of this and use it for clock_gettime.
> One approach: have the writer set the time for the update to be a bit
> in the future and have the reader compare the current raw time to the
> cutoff to see which set of frequency/offset to use.  (This requires
> having some kind of bound on how long it takes to update the data
> structures.)
> 
> The advantage: clock_gettime would never block.
> The disadvantage: complicated, potentially nasty to implement, and it
> would get complicated if anyone tried to allow multiple updates in
> rapid succession.

Yes, that way you can chain a number of linear segments in various
slots, but you're indeed right in that it will limit the update
frequency. More slots will give you more room, but eventually you're
limited.

I suppose NTP is the primary updater in that case; does that have a
limit on the updates? All the other updates we can artificially limit;
that shouldn't really matter.

But yeah in my case we pretty much assume the TSC is complete crap and a
little more crap simply doesn't matter.

For the 'stable' tsc on modern machines we never set the frequency and
it doesn't matter anyway.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC][PATCH 0/7] sched: Optimize sched_clock bits
  2013-11-29 17:36 [RFC][PATCH 0/7] sched: Optimize sched_clock bits Peter Zijlstra
                   ` (6 preceding siblings ...)
  2013-11-29 17:37 ` [RFC][PATCH 7/7] sched: Use a static_key for sched_clock_stable Peter Zijlstra
@ 2013-12-01 18:08 ` Eliezer Tamir
  2013-12-03 15:10   ` Peter Zijlstra
  7 siblings, 1 reply; 13+ messages in thread
From: Eliezer Tamir @ 2013-12-01 18:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: John Stultz, Thomas Gleixner, Steven Rostedt, Ingo Molnar,
	Mathieu Desnoyers, Andy Lutomirski, linux-kernel, Tony Luck, hpa

[-- Attachment #1: Type: text/plain, Size: 1746 bytes --]

On 29/11/2013 19:36, Peter Zijlstra wrote:
> Hi all,
> 
> This series is supposed to optimize the kernel/sched/clock.c and x86
> sched_clock() implementations.
> 
> So far its only been boot tested. So no clue if it really makes the thing
> faster, but it does remove the need to disable IRQs.
> 
> I'm hoping Eliezer will test this with his benchmark where he could measure a
> performance regression between using sched_clock() and local_clock().

So I tested and retested, but I'm not sure I understand the results.

The numbers I previously reported were with turbo boost enabled.
Since turbo boost changes the CPU frequency depending on how hot it is,
it has a complicated interaction with busy polling.
In general you see better numbers, but it's harder to tell what's
going on.

With turbo boost disabled in the BIOS, to get more linear behavior, I
see:

3.13.0-rc2 (no patches):                       82.0 KRR/s
busy poll using local_clock():                 80.2 KRR/s
local_clock() + sched_clock patches:           80.6 KRR/s
sched patches (busy poll using sched_clock()): 80.6 KRR/s

Note that there is a big variance between cores: on the SMT sibling of
the core the packets are steered to I see 81.8 KRR/s. (On the other
tests this core is slightly lower than the one that accepts the
packets; I'm not sure I can explain this.)

Maybe I'm doing something wrong?

Perf clearly affects the netperf results, but the delta is only a few
percent, so the numbers might still be good.
On the other hand, I'm seeing repeated warnings that the perf NMI
handler took too long to run, and I need to reboot to get perf to run
again.

Attached are the perf outputs.

If you can think of any other interesting tests, or anything I'm doing
wrong, I'm open to suggestions.

Thanks,
Eliezer

[-- Attachment #2: perf.local-clock.txt --]
[-- Type: text/plain, Size: 21545 bytes --]

# ========
# captured on: Sun Dec  1 06:15:44 2013
# hostname : ladj537.jer.intel.com
# os release : 3.13.0-rc2-local-clockmin3+IPv6+
# perf version : 3.11.9-200.fc19.x86_64
# arch : x86_64
# nrcpus online : 32
# nrcpus avail : 32
# cpudesc : Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
# cpuid : GenuineIntel,6,45,7
# total memory : 32905936 kB
# cmdline : /usr/bin/perf record netperf -t TCP_RR -H 192.168.1.1 -l 30 -T 4,4 -C -c 
# event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1, precise_ip = 0
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: cpu = 4, software = 1, uncore_pcu = 15, uncore_imc_0 = 17, uncore_imc_1 = 18, uncore_imc_2 = 19, uncore_imc_3 = 20, uncore_qpi_0 = 21, uncore_qpi_1 = 22, uncore_cbox_0 = 7, uncore_cbox_1 = 8, uncore_cbox_2 = 9, uncore_cbox_3 = 10, uncore_cbox_4 = 11, uncore_cbox_5 = 12, uncore_cbox_6 = 13, uncore_cbox_7 = 14, uncore_ha = 16, uncore_r2pcie = 23, uncore_r3qpi_0 = 24, uncore_r3qpi_1 = 25, breakpoint = 5, uncore_ubox = 6
# ========
#
# Samples: 119K of event 'cycles'
# Event count (approx.): 79804562284
#
# Overhead  Command      Shared Object                                      Symbol
# ........  .......  .................  ..........................................
#
    17.65%  netperf  [kernel.kallsyms]  [k] _raw_spin_lock_bh                     
     8.74%  netperf  [kernel.kallsyms]  [k] _raw_spin_unlock_bh                   
     8.44%  netperf  [kernel.kallsyms]  [k] native_sched_clock                    
     7.90%  netperf  [kernel.kallsyms]  [k] tcp_recvmsg                           
     7.71%  netperf  [kernel.kallsyms]  [k] local_clock                           
     6.55%  netperf  [ixgbe]            [k] ixgbe_clean_rx_irq                    
     5.76%  netperf  [kernel.kallsyms]  [k] sched_clock_cpu                       
     5.60%  netperf  [ixgbe]            [k] ixgbe_low_latency_recv                
     3.95%  netperf  [kernel.kallsyms]  [k] local_bh_enable_ip                    
     1.29%  netperf  [kernel.kallsyms]  [k] tcp_ack                               
     1.14%  netperf  [kernel.kallsyms]  [k] _raw_spin_lock                        
     0.88%  netperf  [kernel.kallsyms]  [k] tcp_sendmsg                           
     0.81%  netperf  [ixgbe]            [k] ixgbe_xmit_frame_ring                 
     0.81%  netperf  [kernel.kallsyms]  [k] tcp_transmit_skb                      
     0.79%  netperf  [kernel.kallsyms]  [k] __netif_receive_skb_core              
     0.74%  netperf  netperf            [.] send_omni_inner                       
     0.71%  netperf  [kernel.kallsyms]  [k] system_call                           
     0.68%  netperf  [kernel.kallsyms]  [k] local_bh_disable                      
     0.63%  netperf  [kernel.kallsyms]  [k] tcp_write_xmit                        
     0.51%  netperf  [kernel.kallsyms]  [k] ip_rcv                                
     0.51%  netperf  [kernel.kallsyms]  [k] tcp_rcv_established                   
     0.38%  netperf  [kernel.kallsyms]  [k] tcp_v4_rcv                            
     0.36%  netperf  [kernel.kallsyms]  [k] fget_light                            
     0.36%  netperf  netperf            [.] send_data                             
     0.34%  netperf  [kernel.kallsyms]  [k] dev_hard_start_xmit                   
     0.33%  netperf  [kernel.kallsyms]  [k] __copy_skb_header                     
     0.32%  netperf  [kernel.kallsyms]  [k] build_skb                             
     0.32%  netperf  [kernel.kallsyms]  [k] __might_sleep                         
     0.31%  netperf  [kernel.kallsyms]  [k] __alloc_skb                           
     0.30%  netperf  netperf            [.] recv_data                             
     0.28%  netperf  [kernel.kallsyms]  [k] ip_finish_output                      
     0.28%  netperf  [kernel.kallsyms]  [k] ip_queue_xmit                         
     0.27%  netperf  [kernel.kallsyms]  [k] __getnstimeofday                      
     0.27%  netperf  [kernel.kallsyms]  [k] kmem_cache_free                       
     0.26%  netperf  [kernel.kallsyms]  [k] sock_def_readable                     
     0.26%  netperf  [kernel.kallsyms]  [k] tcp_queue_rcv                         
     0.25%  netperf  [kernel.kallsyms]  [k] read_tsc                              
     0.25%  netperf  [kernel.kallsyms]  [k] __tcp_select_window                   
     0.24%  netperf  [kernel.kallsyms]  [k] skb_release_data                      
     0.24%  netperf  [kernel.kallsyms]  [k] __kfree_skb                           
     0.24%  netperf  [kernel.kallsyms]  [k] __inet_lookup_established             
     0.23%  netperf  [kernel.kallsyms]  [k] local_bh_enable                       
     0.22%  netperf  [kernel.kallsyms]  [k] skb_clone                             
     0.21%  netperf  [kernel.kallsyms]  [k] __netdev_alloc_frag                   
     0.21%  netperf  [kernel.kallsyms]  [k] kmem_cache_alloc_node                 
     0.21%  netperf  [kernel.kallsyms]  [k] __skb_clone                           
     0.21%  netperf  [kernel.kallsyms]  [k] tcp_schedule_loss_probe               
     0.20%  netperf  [kernel.kallsyms]  [k] __slab_free                           
     0.20%  netperf  libc-2.17.so       [.] __libc_send                           
     0.19%  netperf  [kernel.kallsyms]  [k] __kmalloc_node_track_caller           
     0.19%  netperf  libc-2.17.so       [.] __libc_recv                           
     0.19%  netperf  [kernel.kallsyms]  [k] skb_free_head                         
     0.18%  netperf  [kernel.kallsyms]  [k] tcp_v4_do_rcv                         
     0.18%  netperf  [kernel.kallsyms]  [k] dev_queue_xmit                        
     0.18%  netperf  [kernel.kallsyms]  [k] tcp_event_data_recv                   
     0.17%  netperf  [kernel.kallsyms]  [k] memcpy                                
     0.17%  netperf  [kernel.kallsyms]  [k] lock_sock_nested                      
     0.17%  netperf  [kernel.kallsyms]  [k] tcp_cleanup_rbuf                      
     0.17%  netperf  [kernel.kallsyms]  [k] tcp_v4_early_demux                    
     0.17%  netperf  [kernel.kallsyms]  [k] tcp_set_skb_tso_segs                  
     0.16%  netperf  [kernel.kallsyms]  [k] tcp_established_options               
     0.16%  netperf  [kernel.kallsyms]  [k] mod_timer                             
     0.16%  netperf  [kernel.kallsyms]  [k] ipv4_dst_check                        
     0.16%  netperf  [kernel.kallsyms]  [k] sk_filter                             
     0.16%  netperf  [kernel.kallsyms]  [k] dev_queue_xmit_nit                    
     0.16%  netperf  [kernel.kallsyms]  [k] kmem_cache_alloc                      
     0.15%  netperf  [kernel.kallsyms]  [k] skb_network_protocol                  
     0.15%  netperf  [kernel.kallsyms]  [k] tcp_send_mss                          
     0.15%  netperf  [kernel.kallsyms]  [k] __tcp_v4_send_check                   
     0.15%  netperf  [kernel.kallsyms]  [k] __netdev_pick_tx                      
     0.15%  netperf  [ixgbe]            [k] ixgbe_poll                            
     0.15%  netperf  [kernel.kallsyms]  [k] sockfd_lookup_light                   
     0.14%  netperf  [kernel.kallsyms]  [k] put_compound_page                     
     0.14%  netperf  [kernel.kallsyms]  [k] ip_send_check                         
     0.13%  netperf  [kernel.kallsyms]  [k] skb_copy_datagram_iovec               
     0.13%  netperf  [kernel.kallsyms]  [k] sch_direct_xmit                       
     0.13%  netperf  [kernel.kallsyms]  [k] tcp_rearm_rto                         
     0.12%  netperf  [kernel.kallsyms]  [k] sk_stream_alloc_skb                   
     0.12%  netperf  [kernel.kallsyms]  [k] tcp_rtt_estimator                     
     0.12%  netperf  [kernel.kallsyms]  [k] swiotlb_map_page                      
     0.12%  netperf  [kernel.kallsyms]  [k] skb_push                              
     0.12%  netperf  [ixgbe]            [k] ixgbe_alloc_rx_buffers                
     0.11%  netperf  [kernel.kallsyms]  [k] tcp_send_delayed_ack                  
     0.11%  netperf  [kernel.kallsyms]  [k] tcp_init_tso_segs                     
     0.11%  netperf  [kernel.kallsyms]  [k] ip_local_deliver                      
     0.11%  netperf  [kernel.kallsyms]  [k] SYSC_recvfrom                         
     0.11%  netperf  [kernel.kallsyms]  [k] sock_put                              
     0.11%  netperf  [kernel.kallsyms]  [k] __sk_dst_check                        
     0.11%  netperf  [kernel.kallsyms]  [k] eth_type_trans                        
     0.11%  netperf  [kernel.kallsyms]  [k] ksize                                 
     0.11%  netperf  [kernel.kallsyms]  [k] inet_ehashfn                          
     0.10%  netperf  [kernel.kallsyms]  [k] memcpy_toiovec                        
     0.10%  netperf  [kernel.kallsyms]  [k] __tcp_ack_snd_check                   
     0.10%  netperf  [kernel.kallsyms]  [k] tcp_parse_aligned_timestamp           
     0.10%  netperf  [kernel.kallsyms]  [k] bictcp_acked                          
     0.10%  netperf  [kernel.kallsyms]  [k] tcp_wfree                             
     0.10%  netperf  [kernel.kallsyms]  [k] bictcp_cong_avoid                     
     0.10%  netperf  [ixgbe]            [k] ixgbe_tx_ctxtdesc                     
     0.09%  netperf  [kernel.kallsyms]  [k] tcp_check_space                       
     0.09%  netperf  [kernel.kallsyms]  [k] SYSC_sendto                           
     0.09%  netperf  [kernel.kallsyms]  [k] __kmalloc_reserve.clone.52            
     0.09%  netperf  [kernel.kallsyms]  [k] might_fault                           
     0.09%  netperf  [kernel.kallsyms]  [k] sock_rfree                            
     0.09%  netperf  [kernel.kallsyms]  [k] sock_sendmsg                          
     0.09%  netperf  [kernel.kallsyms]  [k] skb_release_head_state                
     0.09%  netperf  [kernel.kallsyms]  [k] tcp_rcv_space_adjust                  
     0.09%  netperf  [kernel.kallsyms]  [k] sock_recvmsg                          
     0.08%  netperf  [kernel.kallsyms]  [k] release_sock                          
     0.08%  netperf  [kernel.kallsyms]  [k] raw_local_deliver                     
     0.08%  netperf  [kernel.kallsyms]  [k] irq_entries_start                     
     0.08%  netperf  [kernel.kallsyms]  [k] msecs_to_jiffies                      
     0.07%  netperf  [kernel.kallsyms]  [k] tcp_service_net_dma                   
     0.07%  netperf  [kernel.kallsyms]  [k] ns_to_timespec                        
     0.07%  netperf  [kernel.kallsyms]  [k] sock_wfree                            
     0.07%  netperf  [kernel.kallsyms]  [k] netdev_pick_tx                        
     0.07%  netperf  [ixgbe]            [k] __ixgbe_xmit_frame                    
     0.07%  netperf  [kernel.kallsyms]  [k] tcp_options_write                     
     0.07%  netperf  [kernel.kallsyms]  [k] get_seconds                           
     0.07%  netperf  [kernel.kallsyms]  [k] ktime_get_real                        
     0.07%  netperf  [kernel.kallsyms]  [k] netif_receive_skb                     
     0.07%  netperf  [kernel.kallsyms]  [k] ipv4_mtu                              
     0.07%  netperf  [kernel.kallsyms]  [k] kmalloc_slab                          
     0.06%  netperf  [kernel.kallsyms]  [k] sk_reset_timer                        
     0.06%  netperf  [kernel.kallsyms]  [k] kfree                                 
     0.06%  netperf  [kernel.kallsyms]  [k] skb_release_all                       
     0.06%  netperf  [kernel.kallsyms]  [k] copy_user_generic_string              
     0.05%  netperf  [kernel.kallsyms]  [k] tcp_current_mss                       
     0.05%  netperf  [kernel.kallsyms]  [k] __netdev_alloc_skb                    
     0.05%  netperf  [kernel.kallsyms]  [k] __ip_local_out                        
     0.05%  netperf  [kernel.kallsyms]  [k] inet_recvmsg                          
     0.05%  netperf  [kernel.kallsyms]  [k] swiotlb_sync_single                   
     0.05%  netperf  [kernel.kallsyms]  [k] harmonize_features                    
     0.05%  netperf  [kernel.kallsyms]  [k] getnstimeofday                        
     0.05%  netperf  [kernel.kallsyms]  [k] inet_sendmsg                          
     0.05%  netperf  [kernel.kallsyms]  [k] netif_skb_features                    
     0.04%  netperf  [kernel.kallsyms]  [k] __copy_user_nocache                   
     0.04%  netperf  [kernel.kallsyms]  [k] __slab_alloc                          
     0.04%  netperf  [kernel.kallsyms]  [k] napi_by_id                            
     0.04%  netperf  [kernel.kallsyms]  [k] ip_output                             
     0.04%  netperf  [kernel.kallsyms]  [k] skb_put                               
     0.04%  netperf  [kernel.kallsyms]  [k] _copy_to_user                         
     0.04%  netperf  [kernel.kallsyms]  [k] swiotlb_dma_mapping_error             
     0.04%  netperf  [kernel.kallsyms]  [k] tcp_is_cwnd_limited                   
     0.04%  netperf  [kernel.kallsyms]  [k] __netif_receive_skb                   
     0.04%  netperf  [kernel.kallsyms]  [k] ns_to_timeval                         
     0.04%  netperf  [kernel.kallsyms]  [k] __skb_dst_set_noref                   
     0.04%  netperf  [ixgbe]            [k] ixgbe_xmit_frame                      
     0.03%  netperf  [kernel.kallsyms]  [k] tcp_md5_do_lookup                     
     0.03%  netperf  [kernel.kallsyms]  [k] dql_completed                         
     0.03%  netperf  [kernel.kallsyms]  [k] tcp_v4_send_check                     
     0.03%  netperf  [kernel.kallsyms]  [k] __tcp_push_pending_frames             
     0.03%  netperf  [kernel.kallsyms]  [k] tcp_prequeue                          
     0.03%  netperf  [kernel.kallsyms]  [k] tcp_parse_md5sig_option               
     0.03%  netperf  [kernel.kallsyms]  [k] tcp_event_new_data_sent               
     0.03%  netperf  [kernel.kallsyms]  [k] tcp_stream_memory_free                
     0.03%  netperf  [kernel.kallsyms]  [k] radix_tree_lookup_element             
     0.03%  netperf  [kernel.kallsyms]  [k] put_page                              
     0.02%  netperf  [kernel.kallsyms]  [k] tcp_release_cb                        
     0.02%  netperf  [kernel.kallsyms]  [k] ip_local_out                          
     0.02%  netperf  [kernel.kallsyms]  [k] __do_softirq                          
     0.02%  netperf  [kernel.kallsyms]  [k] dev_kfree_skb_any                     
     0.02%  netperf  [kernel.kallsyms]  [k] native_apic_msr_eoi_write             
     0.02%  netperf  [kernel.kallsyms]  [k] eth_header                            
     0.02%  netperf  [kernel.kallsyms]  [k] ktime_get                             
     0.02%  netperf  [kernel.kallsyms]  [k] neigh_resolve_output                  
     0.01%  netperf  [kernel.kallsyms]  [k] add_interrupt_randomness              
     0.01%  netperf  [kernel.kallsyms]  [k] dma_issue_pending_all                 
     0.01%  netperf  [kernel.kallsyms]  [k] tcp_v4_md5_lookup                     
     0.01%  netperf  [kernel.kallsyms]  [k] swiotlb_sync_single_for_device        
     0.01%  netperf  [kernel.kallsyms]  [k] handle_edge_irq                       
     0.01%  netperf  [kernel.kallsyms]  [k] update_curr                           
     0.01%  netperf  [kernel.kallsyms]  [k] sys_sendto                            
     0.01%  netperf  [kernel.kallsyms]  [k] net_rx_action                         
     0.01%  netperf  [ixgbe]            [k] ixgbe_msix_clean_rings                
     0.01%  netperf  [kernel.kallsyms]  [k] rcu_irq_enter                         
     0.01%  netperf  [kernel.kallsyms]  [k] __napi_complete                       
     0.01%  netperf  [kernel.kallsyms]  [k] swiotlb_sync_single_for_cpu           
     0.01%  netperf  [kernel.kallsyms]  [k] do_softirq                            
     0.01%  netperf  [kernel.kallsyms]  [k] __local_bh_enable                     
     0.01%  netperf  [kernel.kallsyms]  [k] rcu_irq_exit                          
     0.01%  netperf  [kernel.kallsyms]  [k] unmap_single                          
     0.01%  netperf  [kernel.kallsyms]  [k] handle_irq_event                      
     0.01%  netperf  [kernel.kallsyms]  [k] common_interrupt                      
     0.01%  netperf  [kernel.kallsyms]  [k] perf_adjust_freq_unthr_context        
     0.01%  netperf  [kernel.kallsyms]  [k] napi_complete                         
     0.01%  netperf  [kernel.kallsyms]  [k] do_IRQ                                
     0.01%  netperf  netperf            [.] recv@plt                              
     0.01%  netperf  [kernel.kallsyms]  [k] do_softirq_own_stack                  
     0.01%  netperf  netperf            [.] send@plt                              
     0.01%  netperf  [kernel.kallsyms]  [k] radix_tree_lookup                     
     0.01%  netperf  [kernel.kallsyms]  [k] irq_enter                             
     0.01%  netperf  [kernel.kallsyms]  [k] __napi_schedule                       
     0.00%  netperf  [kernel.kallsyms]  [k] handle_irq_event_percpu               
     0.00%  netperf  [kernel.kallsyms]  [k] handle_irq                            
     0.00%  netperf  [kernel.kallsyms]  [k] consume_skb                           
     0.00%  netperf  [kernel.kallsyms]  [k] swiotlb_unmap_page                    
     0.00%  netperf  [kernel.kallsyms]  [k] sha_transform                         
     0.00%  netperf  [kernel.kallsyms]  [k] restore_args                          
     0.00%  netperf  [kernel.kallsyms]  [k] net_rps_action_and_irq_enable.clone.82
     0.00%  netperf  [kernel.kallsyms]  [k] rcu_bh_qs                             
     0.00%  netperf  [kernel.kallsyms]  [k] napi_gro_flush                        
     0.00%  netperf  [kernel.kallsyms]  [k] note_interrupt                        
     0.00%  netperf  [kernel.kallsyms]  [k] ir_ack_apic_edge                      
     0.00%  netperf  [kernel.kallsyms]  [k] irq_to_desc                           
     0.00%  netperf  [kernel.kallsyms]  [k] sys_recvfrom                          
     0.00%  netperf  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load            
     0.00%  netperf  [kernel.kallsyms]  [k] internal_add_timer                    
     0.00%  netperf  [kernel.kallsyms]  [k] apic_timer_interrupt                  
     0.00%  netperf  [kernel.kallsyms]  [k] lapic_next_deadline                   
     0.00%  netperf  [kernel.kallsyms]  [k] __d_alloc                             
     0.00%  netperf  [kernel.kallsyms]  [k] update_group_power                    
     0.00%  netperf  [kernel.kallsyms]  [k] __update_entity_load_avg_contrib      
     0.00%  netperf  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave                
     0.00%  netperf  [kernel.kallsyms]  [k] rb_next                               
     0.00%  netperf  [kernel.kallsyms]  [k] retint_restore_args                   
     0.00%  netperf  [kernel.kallsyms]  [k] smp_apic_timer_interrupt              
     0.00%  netperf  [kernel.kallsyms]  [k] account_system_time                   
     0.00%  netperf  [kernel.kallsyms]  [k] __remove_hrtimer                      
     0.00%  netperf  [kernel.kallsyms]  [k] rcu_check_callbacks                   
     0.00%  netperf  [kernel.kallsyms]  [k] timerqueue_del                        
     0.00%  netperf  [kernel.kallsyms]  [k] cpuacct_charge                        
     0.00%  netperf  [kernel.kallsyms]  [k] lookup_fast                           
     0.00%  netperf  [kernel.kallsyms]  [k] hrtimer_interrupt                     
     0.00%  netperf  [kernel.kallsyms]  [k] kill_fasync                           
     0.00%  netperf  [kernel.kallsyms]  [k] find_next_bit                         
     0.00%  netperf  [kernel.kallsyms]  [k] find_busiest_group                    
     0.00%  netperf  [kernel.kallsyms]  [k] put_cpu_partial                       
     0.00%  netperf  libc-2.17.so       [.] _IO_vfscanf                           
     0.00%  netperf  [kernel.kallsyms]  [k] update_blocked_averages               
     0.00%  netperf  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore           
     0.00%  netperf  [kernel.kallsyms]  [k] irq_exit                              
     0.00%  netperf  [kernel.kallsyms]  [k] update_rq_clock                       
     0.00%  netperf  [kernel.kallsyms]  [k] load_balance                          
     0.00%  netperf  [kernel.kallsyms]  [k] perf_event_task_tick                  
     0.00%  netperf  [kernel.kallsyms]  [k] vma_adjust                            
     0.00%  netperf  [kernel.kallsyms]  [k] __schedule                            
     0.00%  netperf  [kernel.kallsyms]  [k] pick_next_task_stop                   
     0.00%  netperf  [kernel.kallsyms]  [k] __perf_event_task_sched_in            
     0.00%  netperf  [kernel.kallsyms]  [k] sk_wait_data                          
     0.00%  netperf  [kernel.kallsyms]  [k] schedule_timeout                      
     0.00%  netperf  [kernel.kallsyms]  [k] intel_pmu_enable_all                  
     0.00%  netperf  [kernel.kallsyms]  [k] dst_release                           


#
# (For a higher level overview, try: perf report --sort comm,dso)
#

[-- Attachment #3: perf.local-clock+patched.txt --]
[-- Type: text/plain, Size: 22800 bytes --]

# ========
# captured on: Sun Dec  1 07:36:27 2013
# hostname : ladj537.jer.intel.com
# os release : 3.13.0-rc2-local-clock+sched-pathcesmin3+IPv6+
# perf version : 3.11.9-200.fc19.x86_64
# arch : x86_64
# nrcpus online : 32
# nrcpus avail : 32
# cpudesc : Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
# cpuid : GenuineIntel,6,45,7
# total memory : 32905936 kB
# cmdline : /usr/bin/perf record netperf -t TCP_RR -H 192.168.1.1 -l 30 -T 4,4 -C -c 
# event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1, precise_ip = 0
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: cpu = 4, software = 1, uncore_pcu = 15, uncore_imc_0 = 17, uncore_imc_1 = 18, uncore_imc_2 = 19, uncore_imc_3 = 20, uncore_qpi_0 = 21, uncore_qpi_1 = 22, uncore_cbox_0 = 7, uncore_cbox_1 = 8, uncore_cbox_2 = 9, uncore_cbox_3 = 10, uncore_cbox_4 = 11, uncore_cbox_5 = 12, uncore_cbox_6 = 13, uncore_cbox_7 = 14, uncore_ha = 16, uncore_r2pcie = 23, uncore_r3qpi_0 = 24, uncore_r3qpi_1 = 25, breakpoint = 5, uncore_ubox = 6
# ========
#
# Samples: 24K of event 'cycles'
# Event count (approx.): 15493465
#
# Overhead  Command      Shared Object                                      Symbol
# ........  .......  .................  ..........................................
#
     7.71%  netperf  [kernel.kallsyms]  [k] _raw_spin_lock_bh                     
     5.14%  netperf  [kernel.kallsyms]  [k] profile_tick                          
     4.15%  netperf  [kernel.kallsyms]  [k] lapic_next_deadline                   
     4.12%  netperf  [kernel.kallsyms]  [k] hrtimer_forward                       
     3.75%  netperf  [kernel.kallsyms]  [k] tcp_recvmsg                           
     3.60%  netperf  [kernel.kallsyms]  [k] run_posix_cpu_timers                  
     3.45%  netperf  [kernel.kallsyms]  [k] sk_wait_data                          
     3.07%  netperf  [kernel.kallsyms]  [k] ktime_get                             
     2.30%  netperf  [ixgbe]            [k] ixgbe_clean_rx_irq                    
     2.24%  netperf  [kernel.kallsyms]  [k] _raw_spin_unlock_bh                   
     2.20%  netperf  [kernel.kallsyms]  [k] tick_sched_timer                      
     2.12%  netperf  [kernel.kallsyms]  [k] _raw_spin_lock                        
     2.11%  netperf  [kernel.kallsyms]  [k] hrtimer_interrupt                     
     1.76%  netperf  [kernel.kallsyms]  [k] __run_hrtimer.clone.33                
     1.74%  netperf  [ixgbe]            [k] ixgbe_low_latency_recv                
     1.55%  netperf  [kernel.kallsyms]  [k] schedule_timeout                      
     1.47%  netperf  [kernel.kallsyms]  [k] clockevents_program_event             
     1.30%  netperf  [kernel.kallsyms]  [k] tcp_rcv_established                   
     1.26%  netperf  [kernel.kallsyms]  [k] timerqueue_add                        
     1.21%  netperf  [kernel.kallsyms]  [k] tcp_ack                               
     1.06%  netperf  [kernel.kallsyms]  [k] __do_softirq                          
     1.05%  netperf  [kernel.kallsyms]  [k] local_bh_enable                       
     1.02%  netperf  libc-2.17.so       [.] __libc_recv                           
     0.97%  netperf  [kernel.kallsyms]  [k] tcp_prequeue_process                  
     0.95%  netperf  [kernel.kallsyms]  [k] tcp_v4_do_rcv                         
     0.95%  netperf  [kernel.kallsyms]  [k] idle_cpu                              
     0.93%  netperf  [kernel.kallsyms]  [k] tcp_service_net_dma                   
     0.92%  netperf  [kernel.kallsyms]  [k] intel_pmu_enable_all                  
     0.92%  netperf  [kernel.kallsyms]  [k] __might_sleep                         
     0.86%  netperf  [kernel.kallsyms]  [k] find_busiest_group                    
     0.85%  netperf  [kernel.kallsyms]  [k] rcu_irq_exit                          
     0.84%  netperf  [kernel.kallsyms]  [k] local_bh_enable_ip                    
     0.81%  netperf  [kernel.kallsyms]  [k] lock_sock_nested                      
     0.79%  netperf  netperf            [.] send_omni_inner                       
     0.78%  netperf  [kernel.kallsyms]  [k] system_call                           
     0.77%  netperf  [kernel.kallsyms]  [k] rb_insert_color                       
     0.68%  netperf  [kernel.kallsyms]  [k] read_tsc                              
     0.67%  netperf  netperf            [.] send_data                             
     0.67%  netperf  [kernel.kallsyms]  [k] trigger_load_balance                  
     0.65%  netperf  [kernel.kallsyms]  [k] run_timer_softirq                     
     0.58%  netperf  [kernel.kallsyms]  [k] local_bh_disable                      
     0.58%  netperf  [kernel.kallsyms]  [k] __schedule                            
     0.54%  netperf  [kernel.kallsyms]  [k] tick_program_event                    
     0.51%  netperf  [kernel.kallsyms]  [k] raise_softirq                         
     0.47%  netperf  [kernel.kallsyms]  [k] tcp_md5_do_lookup                     
     0.45%  netperf  [kernel.kallsyms]  [k] tcp_event_data_recv                   
     0.40%  netperf  [kernel.kallsyms]  [k] tcp_transmit_skb                      
     0.39%  netperf  [kernel.kallsyms]  [k] tcp_cleanup_rbuf                      
     0.38%  netperf  [kernel.kallsyms]  [k] tcp_write_xmit                        
     0.38%  netperf  [kernel.kallsyms]  [k] finish_task_switch                    
     0.37%  netperf  [kernel.kallsyms]  [k] sock_def_readable                     
     0.37%  netperf  [kernel.kallsyms]  [k] ipv4_dst_check                        
     0.36%  netperf  [kernel.kallsyms]  [k] irq_exit                              
     0.35%  netperf  [kernel.kallsyms]  [k] fget_light                            
     0.33%  netperf  [ixgbe]            [k] ixgbe_poll                            
     0.33%  netperf  [kernel.kallsyms]  [k] copy_user_generic_string              
     0.32%  netperf  [kernel.kallsyms]  [k] dev_hard_start_xmit                   
     0.31%  netperf  [kernel.kallsyms]  [k] dev_queue_xmit                        
     0.30%  netperf  libc-2.17.so       [.] __libc_send                           
     0.30%  netperf  [kernel.kallsyms]  [k] kmem_cache_alloc                      
     0.29%  netperf  [kernel.kallsyms]  [k] sock_poll                             
     0.29%  netperf  [kernel.kallsyms]  [k] napi_by_id                            
     0.29%  netperf  [kernel.kallsyms]  [k] page_fault                            
     0.27%  netperf  [kernel.kallsyms]  [k] _raw_spin_lock_irq                    
     0.27%  netperf  [kernel.kallsyms]  [k] __perf_event_task_sched_in            
     0.27%  netperf  [kernel.kallsyms]  [k] dequeue_entity                        
     0.27%  netperf  netperf            [.] recv_data                             
     0.26%  netperf  [kernel.kallsyms]  [k] __getnstimeofday                      
     0.26%  netperf  [kernel.kallsyms]  [k] ip_rcv                                
     0.25%  netperf  [kernel.kallsyms]  [k] irq_entries_start                     
     0.24%  netperf  [kernel.kallsyms]  [k] skb_release_data                      
     0.24%  netperf  [kernel.kallsyms]  [k] tcp_sendmsg                           
     0.23%  netperf  [kernel.kallsyms]  [k] common_interrupt                      
     0.22%  netperf  [kernel.kallsyms]  [k] native_apic_msr_eoi_write             
     0.22%  netperf  [kernel.kallsyms]  [k] radix_tree_lookup_element             
     0.22%  netperf  [kernel.kallsyms]  [k] eth_type_trans                        
     0.22%  netperf  [kernel.kallsyms]  [k] skb_copy_datagram_iovec               
     0.22%  netperf  [kernel.kallsyms]  [k] finish_wait                           
     0.20%  netperf  [kernel.kallsyms]  [k] update_process_times                  
     0.20%  netperf  [kernel.kallsyms]  [k] __kfree_skb                           
     0.19%  netperf  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load            
     0.19%  netperf  libc-2.17.so       [.] _int_malloc                           
     0.19%  netperf  [kernel.kallsyms]  [k] perf_event_aux_ctx                    
     0.19%  netperf  [kernel.kallsyms]  [k] vma_interval_tree_insert              
     0.19%  netperf  [kernel.kallsyms]  [k] do_select                             
     0.18%  netperf  [kernel.kallsyms]  [k] ksize                                 
     0.18%  netperf  [kernel.kallsyms]  [k] SYSC_sendto                           
     0.17%  netperf  [kernel.kallsyms]  [k] cpumask_next_and                      
     0.17%  netperf  [kernel.kallsyms]  [k] tcp_v4_rcv                            
     0.17%  netperf  [kernel.kallsyms]  [k] kmem_cache_alloc_node                 
     0.17%  netperf  [kernel.kallsyms]  [k] scheduler_tick                        
     0.17%  netperf  [kernel.kallsyms]  [k] tcp_parse_aligned_timestamp           
     0.16%  netperf  [kernel.kallsyms]  [k] dev_queue_xmit_nit                    
     0.16%  netperf  [kernel.kallsyms]  [k] native_sched_clock                    
     0.16%  netperf  [kernel.kallsyms]  [k] lock_timer_base.clone.26              
     0.15%  netperf  [kernel.kallsyms]  [k] tcp_v4_early_demux                    
     0.15%  netperf  [kernel.kallsyms]  [k] perf_event_task_tick                  
     0.15%  netperf  [kernel.kallsyms]  [k] dequeue_task_fair                     
     0.15%  netperf  [kernel.kallsyms]  [k] load_balance                          
     0.15%  netperf  [kernel.kallsyms]  [k] find_next_bit                         
     0.15%  netperf  [kernel.kallsyms]  [k] __skb_clone                           
     0.15%  netperf  [kernel.kallsyms]  [k] __tcp_v4_send_check                   
     0.14%  netperf  [kernel.kallsyms]  [k] kfree                                 
     0.14%  netperf  [kernel.kallsyms]  [k] mod_timer                             
     0.13%  netperf  [kernel.kallsyms]  [k] __netdev_alloc_frag                   
     0.12%  netperf  [kernel.kallsyms]  [k] tcp_rearm_rto                         
     0.12%  netperf  [kernel.kallsyms]  [k] kmem_cache_free                       
     0.12%  netperf  [kernel.kallsyms]  [k] local_clock                           
     0.12%  netperf  [kernel.kallsyms]  [k] _copy_to_user                         
     0.12%  netperf  [kernel.kallsyms]  [k] rcu_note_context_switch               
     0.12%  netperf  [kernel.kallsyms]  [k] SYSC_recvfrom                         
     0.12%  netperf  [kernel.kallsyms]  [k] might_fault                           
     0.12%  netperf  [kernel.kallsyms]  [k] __local_bh_enable                     
     0.12%  netperf  [kernel.kallsyms]  [k] clear_buddies                         
     0.11%  netperf  [kernel.kallsyms]  [k] tcp_check_space                       
     0.11%  netperf  [kernel.kallsyms]  [k] tcp_rtt_estimator                     
     0.11%  netperf  [kernel.kallsyms]  [k] __copy_user_nocache                   
     0.11%  netperf  [kernel.kallsyms]  [k] memcpy_toiovec                        
     0.11%  netperf  [kernel.kallsyms]  [k] skb_free_head                         
     0.11%  netperf  [kernel.kallsyms]  [k] do_softirq_own_stack                  
     0.11%  netperf  [kernel.kallsyms]  [k] update_group_power                    
     0.10%  netperf  [kernel.kallsyms]  [k] __netif_receive_skb_core              
     0.10%  netperf  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave                
     0.10%  netperf  [kernel.kallsyms]  [k] kfree_skb_partial                     
     0.10%  netperf  [kernel.kallsyms]  [k] release_sock                          
     0.10%  netperf  [kernel.kallsyms]  [k] update_rq_clock                       
     0.10%  netperf  [ixgbe]            [k] ixgbe_xmit_frame_ring                 
     0.10%  netperf  [kernel.kallsyms]  [k] get_seconds                           
     0.10%  netperf  netperf            [.] send_request_n                        
     0.10%  netperf  libc-2.17.so       [.] __strchrnul                           
     0.10%  netperf  libc-2.17.so       [.] __get_nprocs                          
     0.10%  netperf  libc-2.17.so       [.] __nss_database_lookup                 
     0.10%  netperf  ld-2.17.so         [.] _dl_map_object_from_fd                
     0.10%  netperf  ld-2.17.so         [.] check_match.9336                      
     0.10%  netperf  ld-2.17.so         [.] _dl_lookup_symbol_x                   
     0.10%  netperf  ld-2.17.so         [.] dl_open_worker                        
     0.10%  netperf  [vdso]             [.] 0x0000000000000a21                    
     0.10%  netperf  [kernel.kallsyms]  [k] down_read_trylock                     
     0.10%  netperf  [kernel.kallsyms]  [k] page_waitqueue                        
     0.10%  netperf  [kernel.kallsyms]  [k] get_page_from_freelist                
     0.10%  netperf  [kernel.kallsyms]  [k] anon_vma_interval_tree_remove         
     0.10%  netperf  [kernel.kallsyms]  [k] unmap_single_vma                      
     0.10%  netperf  [kernel.kallsyms]  [k] unlink_file_vma                       
     0.10%  netperf  [kernel.kallsyms]  [k] vma_wants_writenotify                 
     0.10%  netperf  [kernel.kallsyms]  [k] proc_lookup_de                        
     0.10%  netperf  [kernel.kallsyms]  [k] copy_page_rep                         
     0.10%  netperf  [kernel.kallsyms]  [k] strncpy_from_user                     
     0.10%  netperf  [kernel.kallsyms]  [k] tcp_poll                              
     0.10%  netperf  [kernel.kallsyms]  [k] tcp_stream_memory_free                
     0.10%  netperf  [kernel.kallsyms]  [k] mutex_unlock                          
     0.10%  netperf  [kernel.kallsyms]  [k] __do_page_fault                       
     0.09%  netperf  [kernel.kallsyms]  [k] __netdev_pick_tx                      
     0.09%  netperf  [kernel.kallsyms]  [k] perf_adjust_freq_unthr_context        
     0.09%  netperf  [kernel.kallsyms]  [k] tcp_rcv_space_adjust                  
     0.09%  netperf  [kernel.kallsyms]  [k] __inet_lookup_established             
     0.09%  netperf  [kernel.kallsyms]  [k] prepare_to_wait                       
     0.09%  netperf  [ixgbe]            [k] ixgbe_msix_clean_rings                
     0.09%  netperf  [kernel.kallsyms]  [k] update_curr                           
     0.08%  netperf  [kernel.kallsyms]  [k] do_softirq                            
     0.07%  netperf  [kernel.kallsyms]  [k] memcpy                                
     0.07%  netperf  [kernel.kallsyms]  [k] net_rx_action                         
     0.07%  netperf  [kernel.kallsyms]  [k] kmalloc_slab                          
     0.07%  netperf  [kernel.kallsyms]  [k] skb_network_protocol                  
     0.07%  netperf  [kernel.kallsyms]  [k] ip_output                             
     0.07%  netperf  [kernel.kallsyms]  [k] tcp_init_tso_segs                     
     0.07%  netperf  [kernel.kallsyms]  [k] __tcp_select_window                   
     0.07%  netperf  [kernel.kallsyms]  [k] rcu_process_callbacks                 
     0.07%  netperf  [kernel.kallsyms]  [k] bictcp_acked                          
     0.07%  netperf  [kernel.kallsyms]  [k] sockfd_lookup_light                   
     0.07%  netperf  [kernel.kallsyms]  [k] __kmalloc_node_track_caller           
     0.07%  netperf  [kernel.kallsyms]  [k] tcp_wfree                             
     0.07%  netperf  [kernel.kallsyms]  [k] run_rebalance_domains                 
     0.07%  netperf  [kernel.kallsyms]  [k] tcp_schedule_loss_probe               
     0.07%  netperf  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore           
     0.07%  netperf  [kernel.kallsyms]  [k] __slab_free                           
     0.07%  netperf  [kernel.kallsyms]  [k] tcp_parse_md5sig_option               
     0.06%  netperf  [kernel.kallsyms]  [k] inet_ehashfn                          
     0.06%  netperf  [kernel.kallsyms]  [k] tcp_v4_send_check                     
     0.06%  netperf  [kernel.kallsyms]  [k] schedule                              
     0.06%  netperf  [kernel.kallsyms]  [k] sock_recvmsg                          
     0.06%  netperf  [kernel.kallsyms]  [k] inet_sendmsg                          
     0.06%  netperf  [kernel.kallsyms]  [k] wakeup_softirqd                       
     0.06%  netperf  [kernel.kallsyms]  [k] sk_reset_timer                        
     0.06%  netperf  [kernel.kallsyms]  [k] add_interrupt_randomness              
     0.06%  netperf  [kernel.kallsyms]  [k] __napi_schedule                       
     0.06%  netperf  [kernel.kallsyms]  [k] __queue_work                          
     0.05%  netperf  [kernel.kallsyms]  [k] sock_sendmsg                          
     0.05%  netperf  [kernel.kallsyms]  [k] sch_direct_xmit                       
     0.05%  netperf  [kernel.kallsyms]  [k] handle_edge_irq                       
     0.05%  netperf  [kernel.kallsyms]  [k] __tcp_ack_snd_check                   
     0.05%  netperf  [kernel.kallsyms]  [k] tcp_send_mss                          
     0.05%  netperf  [kernel.kallsyms]  [k] bictcp_cong_avoid                     
     0.05%  netperf  [kernel.kallsyms]  [k] update_blocked_averages               
     0.05%  netperf  [kernel.kallsyms]  [k] net_rps_action_and_irq_enable.clone.82
     0.05%  netperf  [kernel.kallsyms]  [k] perf_pmu_rotate_start.clone.45        
     0.05%  netperf  [kernel.kallsyms]  [k] inet_recvmsg                          
     0.05%  netperf  [ixgbe]            [k] __ixgbe_xmit_frame                    
     0.04%  netperf  [kernel.kallsyms]  [k] dst_release                           
     0.04%  netperf  [kernel.kallsyms]  [k] sk_stream_alloc_skb                   
     0.04%  netperf  [kernel.kallsyms]  [k] raw_local_deliver                     
     0.04%  netperf  [kernel.kallsyms]  [k] put_page                              
     0.04%  netperf  [kernel.kallsyms]  [k] tcp_write_timer_handler               
     0.04%  netperf  [kernel.kallsyms]  [k] netif_receive_skb                     
     0.04%  netperf  [kernel.kallsyms]  [k] netdev_pick_tx                        
     0.04%  netperf  [kernel.kallsyms]  [k] __sk_mem_reclaim                      
     0.04%  netperf  [kernel.kallsyms]  [k] tcp_current_mss                       
     0.04%  netperf  [kernel.kallsyms]  [k] enqueue_entity                        
     0.04%  netperf  [kernel.kallsyms]  [k] build_skb                             
     0.04%  netperf  [kernel.kallsyms]  [k] tcp_release_cb                        
     0.04%  netperf  [kernel.kallsyms]  [k] handle_irq_event_percpu               
     0.04%  netperf  [kernel.kallsyms]  [k] rcu_irq_enter                         
     0.04%  netperf  [kernel.kallsyms]  [k] handle_irq_event                      
     0.04%  netperf  [kernel.kallsyms]  [k] restore_args                          
     0.03%  netperf  [kernel.kallsyms]  [k] tcp_queue_rcv                         
     0.03%  netperf  [kernel.kallsyms]  [k] ns_to_timeval                         
     0.03%  netperf  [kernel.kallsyms]  [k] sock_wfree                            
     0.03%  netperf  [kernel.kallsyms]  [k] tcp_established_options               
     0.03%  netperf  [kernel.kallsyms]  [k] ns_to_timespec                        
     0.03%  netperf  [kernel.kallsyms]  [k] __alloc_skb                           
     0.03%  netperf  [kernel.kallsyms]  [k] ip_local_out                          
     0.03%  netperf  [kernel.kallsyms]  [k] tcp_v4_md5_lookup                     
     0.03%  netperf  [kernel.kallsyms]  [k] try_to_wake_up                        
     0.03%  netperf  [kernel.kallsyms]  [k] check_preempt_wakeup                  
     0.03%  netperf  [kernel.kallsyms]  [k] enqueue_task_fair                     
     0.03%  netperf  [kernel.kallsyms]  [k] napi_complete                         
     0.03%  netperf  [kernel.kallsyms]  [k] dequeue_task                          
     0.03%  netperf  [kernel.kallsyms]  [k] cpuacct_charge                        
     0.03%  netperf  [kernel.kallsyms]  [k] swiotlb_sync_single                   
     0.03%  netperf  [kernel.kallsyms]  [k] tcp_event_new_data_sent               
     0.03%  netperf  [kernel.kallsyms]  [k] pick_next_task_fair                   
     0.03%  netperf  [kernel.kallsyms]  [k] smp_apic_timer_interrupt              
     0.03%  netperf  [kernel.kallsyms]  [k] skb_push                              
     0.02%  netperf  [kernel.kallsyms]  [k] __netif_receive_skb                   
     0.02%  netperf  [kernel.kallsyms]  [k] __perf_event_task_sched_out           
     0.02%  netperf  [kernel.kallsyms]  [k] perf_event_context_sched_in           
     0.02%  netperf  [kernel.kallsyms]  [k] deactivate_task                       
     0.02%  netperf  [kernel.kallsyms]  [k] ktime_get_real                        
     0.02%  netperf  [kernel.kallsyms]  [k] __tcp_push_pending_frames             
     0.02%  netperf  [kernel.kallsyms]  [k] irq_to_desc                           
     0.02%  netperf  [kernel.kallsyms]  [k] consume_skb                           
     0.02%  netperf  [kernel.kallsyms]  [k] handle_irq                            
     0.02%  netperf  [kernel.kallsyms]  [k] __slab_alloc                          
     0.02%  netperf  [kernel.kallsyms]  [k] hrtimer_run_pending                   
     0.02%  netperf  [kernel.kallsyms]  [k] source_load                           
     0.02%  netperf  [kernel.kallsyms]  [k] idle_balance                          
     0.02%  netperf  [kernel.kallsyms]  [k] _raw_spin_trylock                     
     0.02%  netperf  netperf            [.] send@plt                              
     0.02%  netperf  [kernel.kallsyms]  [k] irq_enter                             
     0.02%  netperf  [kernel.kallsyms]  [k] note_interrupt                        
     0.02%  netperf  [kernel.kallsyms]  [k] target_load                           
     0.02%  netperf  [kernel.kallsyms]  [k] hrtick_update                         
     0.02%  netperf  [kernel.kallsyms]  [k] skb_release_all                       
     0.02%  netperf  [kernel.kallsyms]  [k] get_pwq.clone.19                      
     0.02%  netperf  [kernel.kallsyms]  [k] do_IRQ                                
     0.01%  netperf  libc-2.17.so       [.] _IO_file_write@@GLIBC_2.2.5           
     0.01%  netperf  [kernel.kallsyms]  [k] set_next_buddy                        
     0.01%  netperf  libc-2.17.so       [.] __GI___libc_write                     


#
# (For a higher level overview, try: perf report --sort comm,dso)
#

[-- Attachment #4: perf.patched.txt --]
[-- Type: text/plain, Size: 23467 bytes --]

# ========
# captured on: Sun Dec  1 09:53:03 2013
# hostname : ladj537.jer.intel.com
# os release : 3.13.0-rc2-sched-pathcesmin3+IPv6+
# perf version : 3.11.9-200.fc19.x86_64
# arch : x86_64
# nrcpus online : 32
# nrcpus avail : 32
# cpudesc : Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
# cpuid : GenuineIntel,6,45,7
# total memory : 32905936 kB
# cmdline : /usr/bin/perf record netperf -t TCP_RR -H 192.168.1.1 -l 30 -T 4,4 -C -c 
# event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1, precise_ip = 0
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: cpu = 4, software = 1, uncore_pcu = 15, uncore_imc_0 = 17, uncore_imc_1 = 18, uncore_imc_2 = 19, uncore_imc_3 = 20, uncore_qpi_0 = 21, uncore_qpi_1 = 22, uncore_cbox_0 = 7, uncore_cbox_1 = 8, uncore_cbox_2 = 9, uncore_cbox_3 = 10, uncore_cbox_4 = 11, uncore_cbox_5 = 12, uncore_cbox_6 = 13, uncore_cbox_7 = 14, uncore_ha = 16, uncore_r2pcie = 23, uncore_r3qpi_0 = 24, uncore_r3qpi_1 = 25, breakpoint = 5, uncore_ubox = 6
# ========
#
# Samples: 36K of event 'cycles'
# Event count (approx.): 19610852
#
# Overhead  Command      Shared Object                                Symbol
# ........  .......  .................  ....................................
#
    11.60%  netperf  [kernel.kallsyms]  [k] hrtimer_interrupt               
     7.19%  netperf  [kernel.kallsyms]  [k] timerqueue_add                  
     5.62%  netperf  [kernel.kallsyms]  [k] __run_hrtimer.clone.33          
     5.21%  netperf  [kernel.kallsyms]  [k] hrtimer_forward                 
     4.25%  netperf  [kernel.kallsyms]  [k] _raw_spin_lock_bh               
     4.08%  netperf  [kernel.kallsyms]  [k] _raw_spin_lock                  
     3.80%  netperf  [kernel.kallsyms]  [k] sk_wait_data                    
     3.76%  netperf  [kernel.kallsyms]  [k] lapic_next_deadline             
     3.65%  netperf  [kernel.kallsyms]  [k] tick_program_event              
     3.41%  netperf  [kernel.kallsyms]  [k] ktime_get                       
     3.04%  netperf  [kernel.kallsyms]  [k] finish_wait                     
     2.60%  netperf  [kernel.kallsyms]  [k] tick_sched_timer                
     2.51%  netperf  [kernel.kallsyms]  [k] tcp_recvmsg                     
     2.43%  netperf  [kernel.kallsyms]  [k] rb_insert_color                 
     1.88%  netperf  [kernel.kallsyms]  [k] lock_sock_nested                
     1.78%  netperf  [kernel.kallsyms]  [k] __might_sleep                   
     1.46%  netperf  [kernel.kallsyms]  [k] profile_tick                    
     1.40%  netperf  netperf            [.] recv_data                       
     1.27%  netperf  netperf            [.] send_omni_inner                 
     1.24%  netperf  [kernel.kallsyms]  [k] tcp_service_net_dma             
     1.23%  netperf  [kernel.kallsyms]  [k] clockevents_program_event       
     1.10%  netperf  [kernel.kallsyms]  [k] tcp_rcv_established             
     0.99%  netperf  [kernel.kallsyms]  [k] local_bh_enable                 
     0.99%  netperf  [ixgbe]            [k] ixgbe_clean_rx_irq              
     0.86%  netperf  [kernel.kallsyms]  [k] local_bh_disable                
     0.59%  netperf  [kernel.kallsyms]  [k] run_posix_cpu_timers            
     0.54%  netperf  [kernel.kallsyms]  [k] tcp_ack                         
     0.54%  netperf  [kernel.kallsyms]  [k] tcp_prequeue_process            
     0.52%  netperf  netperf            [.] send_data                       
     0.51%  netperf  [kernel.kallsyms]  [k] build_skb                       
     0.51%  netperf  [kernel.kallsyms]  [k] read_tsc                        
     0.50%  netperf  [kernel.kallsyms]  [k] idle_cpu                        
     0.49%  netperf  [kernel.kallsyms]  [k] run_timer_softirq               
     0.41%  netperf  [kernel.kallsyms]  [k] tcp_md5_do_lookup               
     0.40%  netperf  [kernel.kallsyms]  [k] perf_event_task_tick            
     0.39%  netperf  [kernel.kallsyms]  [k] _raw_spin_unlock_bh             
     0.36%  netperf  [kernel.kallsyms]  [k] ipv4_dst_check                  
     0.35%  netperf  [kernel.kallsyms]  [k] system_call                     
     0.35%  netperf  [kernel.kallsyms]  [k] trigger_load_balance            
     0.35%  netperf  [kernel.kallsyms]  [k] rcu_irq_exit                    
     0.35%  netperf  [kernel.kallsyms]  [k] intel_pmu_enable_all            
     0.33%  netperf  [ixgbe]            [k] ixgbe_low_latency_recv          
     0.32%  netperf  [kernel.kallsyms]  [k] __do_softirq                    
     0.31%  netperf  [kernel.kallsyms]  [k] mod_timer                       
     0.31%  netperf  [kernel.kallsyms]  [k] tcp_parse_aligned_timestamp     
     0.27%  netperf  [kernel.kallsyms]  [k] fget_light                      
     0.27%  netperf  [kernel.kallsyms]  [k] __netif_receive_skb_core        
     0.26%  netperf  [kernel.kallsyms]  [k] tcp_event_data_recv             
     0.26%  netperf  [kernel.kallsyms]  [k] sock_def_readable               
     0.25%  netperf  [kernel.kallsyms]  [k] irq_entries_start               
     0.25%  netperf  [kernel.kallsyms]  [k] find_busiest_group              
     0.24%  netperf  [kernel.kallsyms]  [k] tcp_v4_do_rcv                   
     0.22%  netperf  [kernel.kallsyms]  [k] local_bh_enable_ip              
     0.22%  netperf  [kernel.kallsyms]  [k] kmem_cache_alloc                
     0.22%  netperf  libc-2.17.so       [.] __libc_recv                     
     0.22%  netperf  [kernel.kallsyms]  [k] __netdev_alloc_frag             
     0.20%  netperf  [kernel.kallsyms]  [k] __kmalloc_node_track_caller     
     0.20%  netperf  [kernel.kallsyms]  [k] irq_exit                        
     0.20%  netperf  [kernel.kallsyms]  [k] common_interrupt                
     0.19%  netperf  [kernel.kallsyms]  [k] tcp_sendmsg                     
     0.18%  netperf  libc-2.17.so       [.] __libc_send                     
     0.18%  netperf  [kernel.kallsyms]  [k] copy_user_generic_string        
     0.18%  netperf  [kernel.kallsyms]  [k] tcp_rearm_rto                   
     0.17%  netperf  [kernel.kallsyms]  [k] rcu_process_callbacks           
     0.17%  netperf  [kernel.kallsyms]  [k] finish_task_switch              
     0.17%  netperf  [kernel.kallsyms]  [k] _raw_spin_lock_irq              
     0.16%  netperf  [kernel.kallsyms]  [k] tcp_rcv_space_adjust            
     0.16%  netperf  [kernel.kallsyms]  [k] scheduler_tick                  
     0.16%  netperf  [kernel.kallsyms]  [k] __getnstimeofday                
     0.16%  netperf  [kernel.kallsyms]  [k] skb_release_data                
     0.16%  netperf  [kernel.kallsyms]  [k] restore_args                    
     0.15%  netperf  [kernel.kallsyms]  [k] native_apic_msr_eoi_write       
     0.15%  netperf  [kernel.kallsyms]  [k] raise_softirq                   
     0.12%  netperf  [kernel.kallsyms]  [k] ip_queue_xmit                   
     0.12%  netperf  [kernel.kallsyms]  [k] ip_rcv                          
     0.12%  netperf  [kernel.kallsyms]  [k] skb_copy_datagram_iovec         
     0.12%  netperf  [kernel.kallsyms]  [k] tcp_transmit_skb                
     0.12%  netperf  [kernel.kallsyms]  [k] load_balance                    
     0.12%  netperf  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore     
     0.12%  netperf  [kernel.kallsyms]  [k] schedule_timeout                
     0.12%  netperf  [kernel.kallsyms]  [k] radix_tree_lookup_element       
     0.11%  netperf  [kernel.kallsyms]  [k] __kfree_skb                     
     0.11%  netperf  [kernel.kallsyms]  [k] do_softirq                      
     0.11%  netperf  [kernel.kallsyms]  [k] __inet_lookup_established       
     0.11%  netperf  [kernel.kallsyms]  [k] inet_ehashfn                    
     0.10%  netperf  [kernel.kallsyms]  [k] memcpy_toiovec                  
     0.10%  netperf  [kernel.kallsyms]  [k] netif_skb_features              
     0.09%  netperf  [kernel.kallsyms]  [k] tcp_write_xmit                  
     0.09%  netperf  [kernel.kallsyms]  [k] tcp_rtt_estimator               
     0.09%  netperf  [ixgbe]            [k] ixgbe_xmit_frame_ring           
     0.08%  netperf  [kernel.kallsyms]  [k] kmem_cache_free                 
     0.08%  netperf  [kernel.kallsyms]  [k] tcp_check_space                 
     0.08%  netperf  [kernel.kallsyms]  [k] __local_bh_enable               
     0.08%  netperf  [kernel.kallsyms]  [k] SYSC_sendto                     
     0.08%  netperf  [kernel.kallsyms]  [k] might_fault                     
     0.08%  netperf  ld-2.17.so         [.] do_lookup_x                     
     0.08%  netperf  [kernel.kallsyms]  [k] __inc_zone_state                
     0.08%  netperf  [kernel.kallsyms]  [k] bictcp_acked                    
     0.08%  netperf  [kernel.kallsyms]  [k] skb_free_head                   
     0.08%  netperf  [kernel.kallsyms]  [k] note_gp_changes                 
     0.08%  netperf  [kernel.kallsyms]  [k] tcp_cleanup_rbuf                
     0.07%  netperf  [kernel.kallsyms]  [k] perf_adjust_freq_unthr_context  
     0.07%  netperf  [kernel.kallsyms]  [k] __schedule                      
     0.07%  netperf  [kernel.kallsyms]  [k] get_seconds                     
     0.07%  netperf  [kernel.kallsyms]  [k] sockfd_lookup_light             
     0.07%  netperf  [kernel.kallsyms]  [k] ip_local_deliver                
     0.07%  netperf  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave          
     0.07%  netperf  [kernel.kallsyms]  [k] dev_queue_xmit_nit              
     0.06%  netperf  [kernel.kallsyms]  [k] ip_send_check                   
     0.06%  netperf  [kernel.kallsyms]  [k] tcp_wfree                       
     0.06%  netperf  [kernel.kallsyms]  [k] sk_filter                       
     0.06%  netperf  [kernel.kallsyms]  [k] tcp_current_mss                 
     0.06%  netperf  [kernel.kallsyms]  [k] tcp_write_timer_handler         
     0.06%  netperf  [kernel.kallsyms]  [k] dev_queue_xmit                  
     0.06%  netperf  [kernel.kallsyms]  [k] prepare_to_wait                 
     0.06%  netperf  [kernel.kallsyms]  [k] __queue_work                    
     0.06%  netperf  [kernel.kallsyms]  [k] kfree                           
     0.05%  netperf  [kernel.kallsyms]  [k] put_prev_task_fair              
     0.05%  netperf  libc-2.17.so       [.] __sysconf                       
     0.05%  netperf  libc-2.17.so       [.] freeaddrinfo                    
     0.05%  netperf  ld-2.17.so         [.] _dl_lookup_symbol_x             
     0.05%  netperf  [kernel.kallsyms]  [k] find_get_page                   
     0.05%  netperf  [kernel.kallsyms]  [k] filemap_fault                   
     0.05%  netperf  [kernel.kallsyms]  [k] link_path_walk                  
     0.05%  netperf  [kernel.kallsyms]  [k] clear_page_c                    
     0.05%  netperf  [kernel.kallsyms]  [k] sys_sendto                      
     0.05%  netperf  [kernel.kallsyms]  [k] page_fault                      
     0.05%  netperf  [kernel.kallsyms]  [k] tcp_prequeue                    
     0.05%  netperf  [kernel.kallsyms]  [k] eth_header                      
     0.05%  netperf  [kernel.kallsyms]  [k] sock_wfree                      
     0.05%  netperf  [kernel.kallsyms]  [k] __skb_clone                     
     0.05%  netperf  [kernel.kallsyms]  [k] swiotlb_sync_single             
     0.05%  netperf  [kernel.kallsyms]  [k] native_sched_clock              
     0.05%  netperf  [kernel.kallsyms]  [k] raw_local_deliver               
     0.05%  netperf  [kernel.kallsyms]  [k] add_interrupt_randomness        
     0.05%  netperf  [kernel.kallsyms]  [k] tcp_is_cwnd_limited             
     0.05%  netperf  [kernel.kallsyms]  [k] update_process_times            
     0.05%  netperf  [kernel.kallsyms]  [k] __x2apic_send_IPI_mask          
     0.05%  netperf  [kernel.kallsyms]  [k] inet_sendmsg                    
     0.04%  netperf  [kernel.kallsyms]  [k] __copy_user_nocache             
     0.04%  netperf  [kernel.kallsyms]  [k] tcp_schedule_loss_probe         
     0.04%  netperf  [kernel.kallsyms]  [k] deactivate_task                 
     0.04%  netperf  [kernel.kallsyms]  [k] do_softirq_own_stack            
     0.04%  netperf  [kernel.kallsyms]  [k] hrtimer_run_pending             
     0.04%  netperf  [kernel.kallsyms]  [k] net_rx_action                   
     0.04%  netperf  [kernel.kallsyms]  [k] SYSC_recvfrom                   
     0.04%  netperf  [kernel.kallsyms]  [k] kmalloc_slab                    
     0.04%  netperf  [kernel.kallsyms]  [k] tcp_options_write               
     0.04%  netperf  [kernel.kallsyms]  [k] netif_receive_skb               
     0.04%  netperf  [kernel.kallsyms]  [k] tcp_v4_rcv                      
     0.04%  netperf  [kernel.kallsyms]  [k] msecs_to_jiffies                
     0.04%  netperf  [kernel.kallsyms]  [k] __tcp_select_window             
     0.04%  netperf  [kernel.kallsyms]  [k] get_work_pool                   
     0.04%  netperf  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load      
     0.04%  netperf  [kernel.kallsyms]  [k] internal_add_timer              
     0.04%  netperf  [ixgbe]            [k] ixgbe_poll                      
     0.03%  netperf  [kernel.kallsyms]  [k] ip_finish_output                
     0.03%  netperf  [kernel.kallsyms]  [k] tcp_parse_md5sig_option         
     0.03%  netperf  [kernel.kallsyms]  [k] skb_release_head_state          
     0.03%  netperf  [kernel.kallsyms]  [k] sock_rfree                      
     0.03%  netperf  [kernel.kallsyms]  [k] __tcp_ack_snd_check             
     0.03%  netperf  [kernel.kallsyms]  [k] tcp_release_cb                  
     0.03%  netperf  [kernel.kallsyms]  [k] dequeue_entity                  
     0.03%  netperf  [kernel.kallsyms]  [k] irq_work_interrupt              
     0.03%  netperf  [kernel.kallsyms]  [k] __slab_free                     
     0.03%  netperf  [kernel.kallsyms]  [k] file_free_rcu                   
     0.03%  netperf  [kernel.kallsyms]  [k] rcu_irq_enter                   
     0.03%  netperf  [kernel.kallsyms]  [k] release_sock                    
     0.03%  netperf  [kernel.kallsyms]  [k] __sk_mem_reclaim                
     0.03%  netperf  [kernel.kallsyms]  [k] dql_completed                   
     0.03%  netperf  [kernel.kallsyms]  [k] update_curr                     
     0.03%  netperf  [kernel.kallsyms]  [k] swiotlb_sync_single_for_device  
     0.03%  netperf  [kernel.kallsyms]  [k] __alloc_skb                     
     0.03%  netperf  [kernel.kallsyms]  [k] put_cpu_partial                 
     0.03%  netperf  [kernel.kallsyms]  [k] __napi_schedule                 
     0.03%  netperf  [kernel.kallsyms]  [k] rcu_note_context_switch         
     0.03%  netperf  netperf            [.] memset@plt                      
     0.03%  netperf  netperf            [.] send_request_n                  
     0.03%  netperf  netperf            [.] recv_response_timed_n           
     0.03%  netperf  netperf            [.] get_remote_system_info          
     0.03%  netperf  libc-2.17.so       [.] __GI_____strtoll_l_internal     
     0.03%  netperf  libc-2.17.so       [.] _int_malloc                     
     0.03%  netperf  libc-2.17.so       [.] free                            
     0.03%  netperf  libc-2.17.so       [.] _getopt_internal_r              
     0.03%  netperf  libc-2.17.so       [.] getaddrinfo                     
     0.03%  netperf  libc-2.17.so       [.] __get_nprocs                    
     0.03%  netperf  libc-2.17.so       [.] __GI___bind                     
     0.03%  netperf  libc-2.17.so       [.] __GI___socket                   
     0.03%  netperf  libc-2.17.so       [.] __libc_alloca_cutoff            
     0.03%  netperf  libc-2.17.so       [.] __check_pf                      
     0.03%  netperf  libc-2.17.so       [.] __strcasecmp_l_avx              
     0.03%  netperf  ld-2.17.so         [.] _dl_fixup                       
     0.03%  netperf  [kernel.kallsyms]  [k] destroy_context                 
     0.03%  netperf  [kernel.kallsyms]  [k] sched_setaffinity               
     0.03%  netperf  [kernel.kallsyms]  [k] __wake_up_bit                   
     0.03%  netperf  [kernel.kallsyms]  [k] __mutex_init                    
     0.03%  netperf  [kernel.kallsyms]  [k] __alloc_pages_nodemask          
     0.03%  netperf  [kernel.kallsyms]  [k] lru_cache_add                   
     0.03%  netperf  [kernel.kallsyms]  [k] find_vma                        
     0.03%  netperf  [kernel.kallsyms]  [k] policy_nodemask                 
     0.03%  netperf  [kernel.kallsyms]  [k] path_init                       
     0.03%  netperf  [kernel.kallsyms]  [k] select_estimate_accuracy        
     0.03%  netperf  [kernel.kallsyms]  [k] do_select                       
     0.03%  netperf  [kernel.kallsyms]  [k] d_flags_for_inode               
     0.03%  netperf  [kernel.kallsyms]  [k] __d_lookup                      
     0.03%  netperf  [kernel.kallsyms]  [k] inode_init_always               
     0.03%  netperf  [kernel.kallsyms]  [k] fd_install                      
     0.03%  netperf  [kernel.kallsyms]  [k] proc_lookup_de                  
     0.03%  netperf  [kernel.kallsyms]  [k] sysfs_open_file                 
     0.03%  netperf  [kernel.kallsyms]  [k] security_d_instantiate          
     0.03%  netperf  [kernel.kallsyms]  [k] security_file_permission        
     0.03%  netperf  [kernel.kallsyms]  [k] strcmp                          
     0.03%  netperf  [kernel.kallsyms]  [k] strlen                          
     0.03%  netperf  [kernel.kallsyms]  [k] vsnprintf                       
     0.03%  netperf  [kernel.kallsyms]  [k] lockref_get_not_dead            
     0.03%  netperf  [kernel.kallsyms]  [k] bitmap_scnlistprintf            
     0.03%  netperf  [kernel.kallsyms]  [k] strncpy_from_user               
     0.03%  netperf  [kernel.kallsyms]  [k] __sock_create                   
     0.03%  netperf  [kernel.kallsyms]  [k] sys_bind                        
     0.03%  netperf  [kernel.kallsyms]  [k] sock_init_data                  
     0.03%  netperf  [kernel.kallsyms]  [k] __netlink_create                
     0.03%  netperf  [kernel.kallsyms]  [k] netlink_autobind.clone.30       
     0.03%  netperf  [kernel.kallsyms]  [k] tcp_send_mss                    
     0.03%  netperf  [kernel.kallsyms]  [k] mutex_lock                      
     0.03%  netperf  [kernel.kallsyms]  [k] __do_page_fault                 
     0.03%  netperf  [kernel.kallsyms]  [k] tcp_queue_rcv                   
     0.03%  netperf  [kernel.kallsyms]  [k] handle_irq_event                
     0.02%  netperf  [kernel.kallsyms]  [k] lock_timer_base.clone.26        
     0.02%  netperf  [kernel.kallsyms]  [k] group_balance_cpu               
     0.02%  netperf  [kernel.kallsyms]  [k] inet_recvmsg                    
     0.02%  netperf  [kernel.kallsyms]  [k] __kmalloc_reserve.clone.52      
     0.02%  netperf  [kernel.kallsyms]  [k] dequeue_task                    
     0.02%  netperf  [kernel.kallsyms]  [k] idle_balance                    
     0.02%  netperf  [kernel.kallsyms]  [k] find_next_bit                   
     0.02%  netperf  [kernel.kallsyms]  [k] ipv4_mtu                        
     0.02%  netperf  [kernel.kallsyms]  [k] __update_entity_load_avg_contrib
     0.02%  netperf  [ixgbe]            [k] ixgbe_alloc_rx_buffers          
     0.02%  netperf  [kernel.kallsyms]  [k] __compute_runnable_contrib      
     0.02%  netperf  [kernel.kallsyms]  [k] sk_stream_alloc_skb             
     0.02%  netperf  [kernel.kallsyms]  [k] __tcp_push_pending_frames       
     0.02%  netperf  [kernel.kallsyms]  [k] try_to_wake_up                  
     0.02%  netperf  [kernel.kallsyms]  [k] source_load                     
     0.02%  netperf  [kernel.kallsyms]  [k] __wake_up_common                
     0.02%  netperf  [kernel.kallsyms]  [k] rcu_accelerate_cbs              
     0.02%  netperf  [kernel.kallsyms]  [k] llist_add_batch                 
     0.02%  netperf  [kernel.kallsyms]  [k] __netdev_pick_tx                
     0.02%  netperf  [kernel.kallsyms]  [k] tcp_set_skb_tso_segs            
     0.02%  netperf  [kernel.kallsyms]  [k] tcp_stream_memory_free          
     0.02%  netperf  [kernel.kallsyms]  [k] memcpy                          
     0.02%  netperf  [kernel.kallsyms]  [k] swiotlb_map_page                
     0.02%  netperf  [kernel.kallsyms]  [k] harmonize_features              
     0.02%  netperf  [kernel.kallsyms]  [k] intel_pmu_disable_all           
     0.02%  netperf  [kernel.kallsyms]  [k] update_cfs_shares               
     0.02%  netperf  [kernel.kallsyms]  [k] enqueue_entity                  
     0.02%  netperf  [kernel.kallsyms]  [k] retint_restore_args             
     0.02%  netperf  [kernel.kallsyms]  [k] __netdev_alloc_skb              
     0.02%  netperf  [kernel.kallsyms]  [k] irq_to_desc                     
     0.02%  netperf  [ixgbe]            [k] ixgbe_msix_clean_rings          
     0.02%  netperf  [kernel.kallsyms]  [k] tcp_v4_early_demux              
     0.02%  netperf  [kernel.kallsyms]  [k] eth_type_trans                  
     0.02%  netperf  [kernel.kallsyms]  [k] update_rq_clock                 
     0.02%  netperf  [kernel.kallsyms]  [k] perf_pmu_rotate_start.clone.45  
     0.02%  netperf  [kernel.kallsyms]  [k] schedule                        
     0.02%  netperf  [kernel.kallsyms]  [k] dev_hard_start_xmit             
     0.02%  netperf  [kernel.kallsyms]  [k] put_page                        
     0.02%  netperf  [kernel.kallsyms]  [k] tcp_send_delayed_ack            
     0.02%  netperf  libc-2.17.so       [.] vfprintf                        
     0.02%  netperf  libc-2.17.so       [.] fprintf                         
     0.02%  netperf  libc-2.17.so       [.] _IO_file_write@@GLIBC_2.2.5     
     0.02%  netperf  libc-2.17.so       [.] _IO_file_xsputn@@GLIBC_2.2.5    
     0.02%  netperf  [kernel.kallsyms]  [k] swiotlb_dma_mapping_error       
     0.02%  netperf  [kernel.kallsyms]  [k] netdev_pick_tx                  
     0.02%  netperf  [kernel.kallsyms]  [k] ns_to_timespec                  
     0.02%  netperf  [kernel.kallsyms]  [k] __note_gp_changes               
     0.02%  netperf  [kernel.kallsyms]  [k] unmap_single                    
     0.02%  netperf  [kernel.kallsyms]  [k] __netif_receive_skb             
     0.02%  netperf  [kernel.kallsyms]  [k] __perf_event_task_sched_in      
     0.01%  netperf  [kernel.kallsyms]  [k] do_IRQ                          
     0.01%  netperf  [kernel.kallsyms]  [k] swiotlb_unmap_page              
     0.01%  netperf  [kernel.kallsyms]  [k] consume_skb                     
     0.01%  netperf  [kernel.kallsyms]  [k] dev_kfree_skb_any               
     0.01%  netperf  [kernel.kallsyms]  [k] skb_clone                       
     0.01%  netperf  [kernel.kallsyms]  [k] tcp_established_options         
     0.01%  netperf  [kernel.kallsyms]  [k] irq_enter                       
     0.01%  netperf  [kernel.kallsyms]  [k] handle_edge_irq                 
     0.01%  netperf  [kernel.kallsyms]  [k] tcp_v4_md5_lookup               


#
# (For a higher level overview, try: perf report --sort comm,dso)
#

[-- Attachment #5: perf.sched-clock.txt --]
[-- Type: text/plain, Size: 21533 bytes --]

# ========
# captured on: Sun Dec  1 06:43:34 2013
# hostname : ladj537.jer.intel.com
# os release : 3.13.0-rc2min3+IPv6+
# perf version : 3.11.9-200.fc19.x86_64
# arch : x86_64
# nrcpus online : 32
# nrcpus avail : 32
# cpudesc : Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
# cpuid : GenuineIntel,6,45,7
# total memory : 32905936 kB
# cmdline : /usr/bin/perf record netperf -t TCP_RR -H 192.168.1.1 -l 30 -T 4,4 -C -c 
# event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1, precise_ip = 0
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: cpu = 4, software = 1, uncore_pcu = 15, uncore_imc_0 = 17, uncore_imc_1 = 18, uncore_imc_2 = 19, uncore_imc_3 = 20, uncore_qpi_0 = 21, uncore_qpi_1 = 22, uncore_cbox_0 = 7, uncore_cbox_1 = 8, uncore_cbox_2 = 9, uncore_cbox_3 = 10, uncore_cbox_4 = 11, uncore_cbox_5 = 12, uncore_cbox_6 = 13, uncore_cbox_7 = 14, uncore_ha = 16, uncore_r2pcie = 23, uncore_r3qpi_0 = 24, uncore_r3qpi_1 = 25, breakpoint = 5, uncore_ubox = 6
# ========
#
# Samples: 119K of event 'cycles'
# Event count (approx.): 80231833254
#
# Overhead  Command      Shared Object                                      Symbol
# ........  .......  .................  ..........................................
#
    22.74%  netperf  [kernel.kallsyms]  [k] _raw_spin_lock_bh                     
    11.63%  netperf  [kernel.kallsyms]  [k] _raw_spin_unlock_bh                   
    10.62%  netperf  [kernel.kallsyms]  [k] native_sched_clock                    
    10.45%  netperf  [kernel.kallsyms]  [k] tcp_recvmsg                           
     8.26%  netperf  [ixgbe]            [k] ixgbe_clean_rx_irq                    
     5.29%  netperf  [kernel.kallsyms]  [k] local_bh_enable_ip                    
     5.04%  netperf  [ixgbe]            [k] ixgbe_low_latency_recv                
     1.15%  netperf  [kernel.kallsyms]  [k] tcp_ack                               
     1.01%  netperf  [kernel.kallsyms]  [k] _raw_spin_lock                        
     0.90%  netperf  [kernel.kallsyms]  [k] local_bh_disable                      
     0.89%  netperf  [kernel.kallsyms]  [k] tcp_sendmsg                           
     0.75%  netperf  netperf            [.] send_omni_inner                       
     0.74%  netperf  [ixgbe]            [k] ixgbe_xmit_frame_ring                 
     0.73%  netperf  [kernel.kallsyms]  [k] tcp_transmit_skb                      
     0.64%  netperf  [kernel.kallsyms]  [k] __netif_receive_skb_core              
     0.63%  netperf  [kernel.kallsyms]  [k] system_call                           
     0.63%  netperf  [kernel.kallsyms]  [k] tcp_write_xmit                        
     0.48%  netperf  [kernel.kallsyms]  [k] tcp_rcv_established                   
     0.45%  netperf  [kernel.kallsyms]  [k] ip_rcv                                
     0.35%  netperf  [kernel.kallsyms]  [k] tcp_v4_rcv                            
     0.31%  netperf  [kernel.kallsyms]  [k] __copy_skb_header                     
     0.31%  netperf  [kernel.kallsyms]  [k] dev_hard_start_xmit                   
     0.30%  netperf  [kernel.kallsyms]  [k] ip_queue_xmit                         
     0.30%  netperf  [kernel.kallsyms]  [k] __might_sleep                         
     0.28%  netperf  [kernel.kallsyms]  [k] __alloc_skb                           
     0.27%  netperf  [kernel.kallsyms]  [k] fget_light                            
     0.26%  netperf  netperf            [.] send_data                             
     0.24%  netperf  netperf            [.] recv_data                             
     0.24%  netperf  [kernel.kallsyms]  [k] skb_release_data                      
     0.24%  netperf  [kernel.kallsyms]  [k] sock_def_readable                     
     0.24%  netperf  [kernel.kallsyms]  [k] tcp_queue_rcv                         
     0.23%  netperf  [kernel.kallsyms]  [k] __getnstimeofday                      
     0.23%  netperf  [kernel.kallsyms]  [k] __tcp_select_window                   
     0.23%  netperf  [kernel.kallsyms]  [k] __inet_lookup_established             
     0.23%  netperf  [kernel.kallsyms]  [k] ip_finish_output                      
     0.23%  netperf  [kernel.kallsyms]  [k] dev_queue_xmit                        
     0.23%  netperf  [kernel.kallsyms]  [k] build_skb                             
     0.22%  netperf  libc-2.17.so       [.] __libc_recv                           
     0.22%  netperf  [kernel.kallsyms]  [k] skb_clone                             
     0.21%  netperf  [kernel.kallsyms]  [k] local_bh_enable                       
     0.21%  netperf  [kernel.kallsyms]  [k] kmem_cache_alloc_node                 
     0.21%  netperf  [kernel.kallsyms]  [k] tcp_schedule_loss_probe               
     0.20%  netperf  [kernel.kallsyms]  [k] kmem_cache_free                       
     0.19%  netperf  [kernel.kallsyms]  [k] tcp_v4_do_rcv                         
     0.19%  netperf  [kernel.kallsyms]  [k] __kmalloc_node_track_caller           
     0.19%  netperf  [kernel.kallsyms]  [k] read_tsc                              
     0.19%  netperf  [kernel.kallsyms]  [k] __skb_clone                           
     0.19%  netperf  [kernel.kallsyms]  [k] __slab_free                           
     0.18%  netperf  [kernel.kallsyms]  [k] __kfree_skb                           
     0.18%  netperf  libc-2.17.so       [.] __libc_send                           
     0.18%  netperf  [kernel.kallsyms]  [k] mod_timer                             
     0.18%  netperf  [kernel.kallsyms]  [k] __tcp_v4_send_check                   
     0.17%  netperf  [kernel.kallsyms]  [k] skb_free_head                         
     0.17%  netperf  [kernel.kallsyms]  [k] sockfd_lookup_light                   
     0.17%  netperf  [kernel.kallsyms]  [k] tcp_cleanup_rbuf                      
     0.17%  netperf  [kernel.kallsyms]  [k] tcp_set_skb_tso_segs                  
     0.16%  netperf  [kernel.kallsyms]  [k] __netdev_alloc_frag                   
     0.16%  netperf  [kernel.kallsyms]  [k] tcp_send_mss                          
     0.15%  netperf  [kernel.kallsyms]  [k] tcp_established_options               
     0.15%  netperf  [kernel.kallsyms]  [k] tcp_event_data_recv                   
     0.15%  netperf  [kernel.kallsyms]  [k] ksize                                 
     0.15%  netperf  [kernel.kallsyms]  [k] kmem_cache_alloc                      
     0.14%  netperf  [kernel.kallsyms]  [k] lock_sock_nested                      
     0.14%  netperf  [kernel.kallsyms]  [k] skb_network_protocol                  
     0.14%  netperf  [kernel.kallsyms]  [k] ip_send_check                         
     0.14%  netperf  [kernel.kallsyms]  [k] sk_filter                             
     0.14%  netperf  [kernel.kallsyms]  [k] memcpy                                
     0.14%  netperf  [kernel.kallsyms]  [k] __netdev_pick_tx                      
     0.13%  netperf  [kernel.kallsyms]  [k] skb_push                              
     0.13%  netperf  [kernel.kallsyms]  [k] swiotlb_map_page                      
     0.13%  netperf  [kernel.kallsyms]  [k] tcp_service_net_dma                   
     0.13%  netperf  [kernel.kallsyms]  [k] skb_copy_datagram_iovec               
     0.13%  netperf  [kernel.kallsyms]  [k] ipv4_dst_check                        
     0.13%  netperf  [kernel.kallsyms]  [k] dev_queue_xmit_nit                    
     0.12%  netperf  [kernel.kallsyms]  [k] tcp_rearm_rto                         
     0.12%  netperf  [kernel.kallsyms]  [k] sch_direct_xmit                       
     0.12%  netperf  [kernel.kallsyms]  [k] SYSC_recvfrom                         
     0.12%  netperf  [kernel.kallsyms]  [k] memcpy_toiovec                        
     0.12%  netperf  [kernel.kallsyms]  [k] tcp_init_tso_segs                     
     0.11%  netperf  [kernel.kallsyms]  [k] eth_type_trans                        
     0.11%  netperf  [kernel.kallsyms]  [k] raw_local_deliver                     
     0.11%  netperf  [ixgbe]            [k] ixgbe_poll                            
     0.11%  netperf  [kernel.kallsyms]  [k] __kmalloc_reserve.clone.52            
     0.11%  netperf  [kernel.kallsyms]  [k] ip_local_deliver                      
     0.11%  netperf  [kernel.kallsyms]  [k] sk_stream_alloc_skb                   
     0.11%  netperf  [kernel.kallsyms]  [k] tcp_v4_early_demux                    
     0.11%  netperf  [kernel.kallsyms]  [k] tcp_rtt_estimator                     
     0.11%  netperf  [kernel.kallsyms]  [k] put_compound_page                     
     0.11%  netperf  [kernel.kallsyms]  [k] tcp_send_delayed_ack                  
     0.10%  netperf  [kernel.kallsyms]  [k] sock_recvmsg                          
     0.10%  netperf  [kernel.kallsyms]  [k] __tcp_ack_snd_check                   
     0.10%  netperf  [kernel.kallsyms]  [k] SYSC_sendto                           
     0.10%  netperf  [kernel.kallsyms]  [k] __sk_dst_check                        
     0.10%  netperf  [kernel.kallsyms]  [k] sock_sendmsg                          
     0.09%  netperf  [kernel.kallsyms]  [k] bictcp_acked                          
     0.09%  netperf  [ixgbe]            [k] ixgbe_tx_ctxtdesc                     
     0.09%  netperf  [kernel.kallsyms]  [k] inet_ehashfn                          
     0.09%  netperf  [kernel.kallsyms]  [k] netdev_pick_tx                        
     0.09%  netperf  [kernel.kallsyms]  [k] sock_wfree                            
     0.09%  netperf  [kernel.kallsyms]  [k] kmalloc_slab                          
     0.09%  netperf  [kernel.kallsyms]  [k] ns_to_timespec                        
     0.09%  netperf  [kernel.kallsyms]  [k] bictcp_cong_avoid                     
     0.09%  netperf  [kernel.kallsyms]  [k] sock_put                              
     0.08%  netperf  [kernel.kallsyms]  [k] might_fault                           
     0.08%  netperf  [kernel.kallsyms]  [k] tcp_parse_aligned_timestamp           
     0.08%  netperf  [kernel.kallsyms]  [k] tcp_options_write                     
     0.08%  netperf  [kernel.kallsyms]  [k] tcp_rcv_space_adjust                  
     0.08%  netperf  [kernel.kallsyms]  [k] release_sock                          
     0.08%  netperf  [ixgbe]            [k] __ixgbe_xmit_frame                    
     0.08%  netperf  [kernel.kallsyms]  [k] sock_rfree                            
     0.08%  netperf  [ixgbe]            [k] ixgbe_alloc_rx_buffers                
     0.08%  netperf  [kernel.kallsyms]  [k] inet_sendmsg                          
     0.07%  netperf  [kernel.kallsyms]  [k] irq_entries_start                     
     0.07%  netperf  [kernel.kallsyms]  [k] skb_release_head_state                
     0.07%  netperf  [kernel.kallsyms]  [k] tcp_wfree                             
     0.07%  netperf  [kernel.kallsyms]  [k] get_seconds                           
     0.07%  netperf  [kernel.kallsyms]  [k] swiotlb_sync_single                   
     0.06%  netperf  [kernel.kallsyms]  [k] copy_user_generic_string              
     0.06%  netperf  [kernel.kallsyms]  [k] skb_release_all                       
     0.06%  netperf  [kernel.kallsyms]  [k] tcp_check_space                       
     0.06%  netperf  [kernel.kallsyms]  [k] msecs_to_jiffies                      
     0.06%  netperf  [kernel.kallsyms]  [k] ipv4_mtu                              
     0.05%  netperf  [kernel.kallsyms]  [k] __ip_local_out                        
     0.05%  netperf  [kernel.kallsyms]  [k] ktime_get_real                        
     0.05%  netperf  [kernel.kallsyms]  [k] tcp_current_mss                       
     0.05%  netperf  [kernel.kallsyms]  [k] getnstimeofday                        
     0.05%  netperf  [kernel.kallsyms]  [k] sk_reset_timer                        
     0.05%  netperf  [kernel.kallsyms]  [k] ip_output                             
     0.05%  netperf  [kernel.kallsyms]  [k] __copy_user_nocache                   
     0.05%  netperf  [kernel.kallsyms]  [k] __netdev_alloc_skb                    
     0.04%  netperf  [kernel.kallsyms]  [k] netif_skb_features                    
     0.04%  netperf  [kernel.kallsyms]  [k] harmonize_features                    
     0.04%  netperf  [kernel.kallsyms]  [k] netif_receive_skb                     
     0.04%  netperf  [kernel.kallsyms]  [k] __slab_alloc                          
     0.04%  netperf  [kernel.kallsyms]  [k] tcp_parse_md5sig_option               
     0.04%  netperf  [ixgbe]            [k] ixgbe_xmit_frame                      
     0.04%  netperf  [kernel.kallsyms]  [k] tcp_stream_memory_free                
     0.04%  netperf  [kernel.kallsyms]  [k] inet_recvmsg                          
     0.04%  netperf  [kernel.kallsyms]  [k] kfree                                 
     0.04%  netperf  [kernel.kallsyms]  [k] tcp_md5_do_lookup                     
     0.04%  netperf  [kernel.kallsyms]  [k] __skb_dst_set_noref                   
     0.03%  netperf  [kernel.kallsyms]  [k] tcp_event_new_data_sent               
     0.03%  netperf  [kernel.kallsyms]  [k] tcp_v4_send_check                     
     0.03%  netperf  [kernel.kallsyms]  [k] ns_to_timeval                         
     0.03%  netperf  [kernel.kallsyms]  [k] put_page                              
     0.03%  netperf  [kernel.kallsyms]  [k] skb_put                               
     0.03%  netperf  [kernel.kallsyms]  [k] swiotlb_dma_mapping_error             
     0.03%  netperf  [kernel.kallsyms]  [k] tcp_is_cwnd_limited                   
     0.03%  netperf  [kernel.kallsyms]  [k] __netif_receive_skb                   
     0.03%  netperf  [kernel.kallsyms]  [k] _copy_to_user                         
     0.03%  netperf  [kernel.kallsyms]  [k] dql_completed                         
     0.03%  netperf  [kernel.kallsyms]  [k] __tcp_push_pending_frames             
     0.03%  netperf  [kernel.kallsyms]  [k] neigh_resolve_output                  
     0.02%  netperf  [kernel.kallsyms]  [k] tcp_release_cb                        
     0.02%  netperf  [kernel.kallsyms]  [k] ktime_get                             
     0.02%  netperf  [kernel.kallsyms]  [k] native_apic_msr_eoi_write             
     0.02%  netperf  [kernel.kallsyms]  [k] napi_by_id                            
     0.02%  netperf  [kernel.kallsyms]  [k] ip_local_out                          
     0.02%  netperf  [kernel.kallsyms]  [k] eth_header                            
     0.02%  netperf  [kernel.kallsyms]  [k] dev_kfree_skb_any                     
     0.02%  netperf  [kernel.kallsyms]  [k] radix_tree_lookup_element             
     0.02%  netperf  [kernel.kallsyms]  [k] tcp_prequeue                          
     0.02%  netperf  [kernel.kallsyms]  [k] __do_softirq                          
     0.02%  netperf  [kernel.kallsyms]  [k] handle_edge_irq                       
     0.01%  netperf  [kernel.kallsyms]  [k] net_rx_action                         
     0.01%  netperf  [kernel.kallsyms]  [k] add_interrupt_randomness              
     0.01%  netperf  [kernel.kallsyms]  [k] do_softirq                            
     0.01%  netperf  [kernel.kallsyms]  [k] unmap_single                          
     0.01%  netperf  [kernel.kallsyms]  [k] swiotlb_sync_single_for_cpu           
     0.01%  netperf  [kernel.kallsyms]  [k] dma_issue_pending_all                 
     0.01%  netperf  [kernel.kallsyms]  [k] rcu_irq_exit                          
     0.01%  netperf  [kernel.kallsyms]  [k] rcu_irq_enter                         
     0.01%  netperf  netperf            [.] recv@plt                              
     0.01%  netperf  [kernel.kallsyms]  [k] __napi_complete                       
     0.01%  netperf  [kernel.kallsyms]  [k] __napi_schedule                       
     0.01%  netperf  [kernel.kallsyms]  [k] tcp_v4_md5_lookup                     
     0.01%  netperf  [kernel.kallsyms]  [k] swiotlb_sync_single_for_device        
     0.01%  netperf  [kernel.kallsyms]  [k] swiotlb_unmap_page                    
     0.01%  netperf  [kernel.kallsyms]  [k] common_interrupt                      
     0.01%  netperf  netperf            [.] send@plt                              
     0.01%  netperf  [ixgbe]            [k] ixgbe_msix_clean_rings                
     0.01%  netperf  [kernel.kallsyms]  [k] do_IRQ                                
     0.01%  netperf  [kernel.kallsyms]  [k] sys_recvfrom                          
     0.01%  netperf  [kernel.kallsyms]  [k] consume_skb                           
     0.01%  netperf  [kernel.kallsyms]  [k] rcu_bh_qs                             
     0.01%  netperf  [kernel.kallsyms]  [k] sys_sendto                            
     0.01%  netperf  [kernel.kallsyms]  [k] irq_enter                             
     0.01%  netperf  [kernel.kallsyms]  [k] __local_bh_enable                     
     0.01%  netperf  [kernel.kallsyms]  [k] handle_irq_event_percpu               
     0.01%  netperf  [kernel.kallsyms]  [k] note_interrupt                        
     0.01%  netperf  [kernel.kallsyms]  [k] napi_complete                         
     0.01%  netperf  [kernel.kallsyms]  [k] update_curr                           
     0.00%  netperf  [kernel.kallsyms]  [k] do_softirq_own_stack                  
     0.00%  netperf  [kernel.kallsyms]  [k] sha_transform                         
     0.00%  netperf  [kernel.kallsyms]  [k] handle_irq                            
     0.00%  netperf  [kernel.kallsyms]  [k] rcu_check_callbacks                   
     0.00%  netperf  [kernel.kallsyms]  [k] task_tick_fair                        
     0.00%  netperf  [kernel.kallsyms]  [k] intel_pmu_disable_all                 
     0.00%  netperf  [kernel.kallsyms]  [k] irq_to_desc                           
     0.00%  netperf  [kernel.kallsyms]  [k] apic_timer_interrupt                  
     0.00%  netperf  [kernel.kallsyms]  [k] cpuacct_charge                        
     0.00%  netperf  [kernel.kallsyms]  [k] radix_tree_lookup                     
     0.00%  netperf  [kernel.kallsyms]  [k] put_cpu_partial                       
     0.00%  netperf  [kernel.kallsyms]  [k] __wake_up_bit                         
     0.00%  netperf  [kernel.kallsyms]  [k] ktime_get_update_offsets              
     0.00%  netperf  [kernel.kallsyms]  [k] lapic_next_deadline                   
     0.00%  netperf  [kernel.kallsyms]  [k] run_timer_softirq                     
     0.00%  netperf  [kernel.kallsyms]  [k] _raw_spin_lock_irq                    
     0.00%  netperf  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave                
     0.00%  netperf  [kernel.kallsyms]  [k] handle_irq_event                      
     0.00%  netperf  [kernel.kallsyms]  [k] _mix_pool_bytes                       
     0.00%  netperf  [kernel.kallsyms]  [k] net_rps_action_and_irq_enable.clone.82
     0.00%  netperf  [kernel.kallsyms]  [k] internal_add_timer                    
     0.00%  netperf  [kernel.kallsyms]  [k] credit_entropy_bits                   
     0.00%  netperf  [kernel.kallsyms]  [k] restore_args                          
     0.00%  netperf  [kernel.kallsyms]  [k] exit_idle                             
     0.00%  netperf  [kernel.kallsyms]  [k] get_page_from_freelist                
     0.00%  netperf  [kernel.kallsyms]  [k] raise_softirq                         
     0.00%  netperf  [kernel.kallsyms]  [k] account_system_time                   
     0.00%  netperf  [kernel.kallsyms]  [k] __acct_update_integrals               
     0.00%  netperf  [kernel.kallsyms]  [k] update_rq_clock                       
     0.00%  netperf  [kernel.kallsyms]  [k] load_balance                          
     0.00%  netperf  [kernel.kallsyms]  [k] ir_ack_apic_edge                      
     0.00%  netperf  [kernel.kallsyms]  [k] sched_avg_update                      
     0.00%  netperf  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load            
     0.00%  netperf  [kernel.kallsyms]  [k] update_cpu_load_active                
     0.00%  netperf  [kernel.kallsyms]  [k] cpu_needs_another_gp                  
     0.00%  netperf  [kernel.kallsyms]  [k] queue_work_on                         
     0.00%  netperf  [kernel.kallsyms]  [k] rb_erase                              
     0.00%  netperf  [kernel.kallsyms]  [k] find_next_bit                         
     0.00%  netperf  [kernel.kallsyms]  [k] rb_next                               
     0.00%  netperf  [kernel.kallsyms]  [k] enqueue_entity                        
     0.00%  netperf  [kernel.kallsyms]  [k] update_blocked_averages               
     0.00%  netperf  [kernel.kallsyms]  [k] tick_program_event                    
     0.00%  netperf  [kernel.kallsyms]  [k] napi_gro_flush                        
     0.00%  netperf  [kernel.kallsyms]  [k] irq_exit                              
     0.00%  netperf  [kernel.kallsyms]  [k] __schedule                            
     0.00%  netperf  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore           
     0.00%  netperf  [kernel.kallsyms]  [k] perf_adjust_freq_unthr_context        
     0.00%  netperf  [kernel.kallsyms]  [k] touch_atime                           
     0.00%  netperf  [kernel.kallsyms]  [k] intel_pmu_enable_all                  
     0.00%  netperf  [kernel.kallsyms]  [k] perf_pmu_rotate_start.clone.45        


#
# (For a higher level overview, try: perf report --sort comm,dso)
#


* Re: [RFC][PATCH 0/7] sched: Optimize sched_clock bits
  2013-12-01 18:08 ` [RFC][PATCH 0/7] sched: Optimize sched_clock bits Eliezer Tamir
@ 2013-12-03 15:10   ` Peter Zijlstra
  2013-12-10 14:47     ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2013-12-03 15:10 UTC (permalink / raw)
  To: Eliezer Tamir
  Cc: John Stultz, Thomas Gleixner, Steven Rostedt, Ingo Molnar,
	Mathieu Desnoyers, Andy Lutomirski, linux-kernel, Tony Luck, hpa

On Sun, Dec 01, 2013 at 08:08:39PM +0200, Eliezer Tamir wrote:
> If you can think of any other interesting tests, or anything I'm doing
> wrong, I'm open to suggestions.

I haven't actually looked at your results yet -- will do next week (I'm
supposed to have PTO now ;-) but I couldn't resist tinkering a little:

The below is a simple cycles benchmark I used on my WSM-EP (which, as you
might guess from the code, has a 12M L3 cache); it pretty much says that
with these patches (and a few fixes included below) local_clock() is
about 3 times faster than sched_clock() used to be.

Obviously the unstable case is still sucky, and in fact we suck worse
relative to sched_clock() even though we're now more than twice as fast
as we used to be: whereas we used to be ~4 times slower than
sched_clock(), we're now actually ~8 times slower.

Go figure :-)

Anyway, this series is a keeper based on these (simple) numbers from
my one-machine test.

I'll fix them up and post them again next week.

Also, static_key and its related APIs suck donkey ballz -- I did indeed
promise myself I'd not step into that bikeshed contest, but having had
to use them again was *painful*.

PRE:

[    5.616495] sched_clock_stable: 1
[   19.465079] (cold) sched_clock: 1643511
[   33.146204] (cold) local_clock: 1701454
[   33.150126] (warm) sched_clock: 39471
[   33.153909] (warm) local_clock: 132954
[    0.004000] sched_clock_stable: 0
[   45.110749] (cold) sched_clock: 1725363
[   58.798048] (cold) local_clock: 1799885
[   58.801976] (warm) sched_clock: 39420
[   58.805784] (warm) local_clock: 185943

[    5.615367] sched_clock_stable: 1
[   19.463919] (cold) sched_clock: 1753379
[   33.145528] (cold) local_clock: 1755582
[   33.149449] (warm) sched_clock: 39492
[   33.153237] (warm) local_clock: 132915
[    0.004000] sched_clock_stable: 0
[   45.114798] (cold) sched_clock: 1798290
[   58.802376] (cold) local_clock: 1880658
[   58.806301] (warm) sched_clock: 39482
[   58.810108] (warm) local_clock: 185943

POST:

[    5.061000] sched_clock_stable: 1
[   18.916000] (cold) sched_clock: 1335206
[   32.576000] (cold) local_clock: 1365236
[   32.580000] (warm) sched_clock: 11771
[   32.584000] (warm) local_clock: 14725
[   32.609024] sched_clock_stable: 0
[   46.298928] (cold) sched_clock: 1615798
[   59.965926] (cold) local_clock: 1780387
[   59.969844] (warm) sched_clock: 11803
[   59.973575] (warm) local_clock: 80769

[    5.059000] sched_clock_stable: 1
[   18.912000] (cold) sched_clock: 1323258
[   32.576000] (cold) local_clock: 1404544
[   32.580000] (warm) sched_clock: 11758
[   32.584000] (warm) local_clock: 14714
[   32.609498] sched_clock_stable: 0
[   46.294431] (cold) sched_clock: 1436083
[   59.965695] (cold) local_clock: 1506272
[   59.969608] (warm) sched_clock: 11801
[   59.973340] (warm) local_clock: 80782


Boot with tsc=reliable and ignore SPLATS, not for actual merging.

---
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -77,7 +77,7 @@ struct cyc2ns_latch {
 	struct cyc2ns_data data[2];
 };
 
-static DEFINE_PER_CPU(struct cyc2ns_latch, cyc2ns);
+static DEFINE_PER_CPU_ALIGNED(struct cyc2ns_latch, cyc2ns);
 
 /*
  * Use a {offset, mul} pair of this cpu.
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -87,14 +87,14 @@ int sched_clock_stable(void)
 void set_sched_clock_stable(void)
 {
 	if (!sched_clock_stable())
-		static_key_slow_inc(&__sched_clock_stable);
+		static_key_slow_dec(&__sched_clock_stable);
 }
 
 void clear_sched_clock_stable(void)
 {
 	/* XXX worry about clock continuity */
 	if (sched_clock_stable())
-		static_key_slow_dec(&__sched_clock_stable);
+		static_key_slow_inc(&__sched_clock_stable);
 }
 
 struct sched_clock_data {
@@ -255,20 +255,20 @@ u64 sched_clock_cpu(int cpu)
 	struct sched_clock_data *scd;
 	u64 clock;
 
-	WARN_ON_ONCE(!irqs_disabled());
-
 	if (sched_clock_stable())
 		return sched_clock();
 
 	if (unlikely(!sched_clock_running))
 		return 0ull;
 
+	preempt_disable();
 	scd = cpu_sdc(cpu);
 
 	if (cpu != smp_processor_id())
 		clock = sched_clock_remote(scd);
 	else
 		clock = sched_clock_local(scd);
+	preempt_enable();
 
 	return clock;
 }
@@ -345,7 +345,7 @@ u64 cpu_clock(int cpu)
 u64 local_clock(void)
 {
 	if (static_key_false(&__sched_clock_stable))
-		return sched_clock_cpu(smp_processor_id());
+		return sched_clock_cpu(raw_smp_processor_id());
 
 	return sched_clock();
 }
@@ -379,3 +379,122 @@ u64 local_clock(void)
 
 EXPORT_SYMBOL_GPL(cpu_clock);
 EXPORT_SYMBOL_GPL(local_clock);
+
+#include <linux/perf_event.h>
+
+static char sched_clock_cache[12*1024*1024]; /* 12M l3 cache */
+static struct perf_event *__sched_clock_cycles;
+
+static u64 sched_clock_cycles(void)
+{
+	return perf_event_read(__sched_clock_cycles);
+}
+
+static __init void sched_clock_wipe_cache(void)
+{
+	int i;
+
+	for (i = 0; i < sizeof(sched_clock_cache); i++)
+		ACCESS_ONCE(sched_clock_cache[i]) = 0;
+}
+
+static __init u64 cache_cold_sched_clock(void)
+{
+	u64 cycles;
+
+	local_irq_disable();
+	sched_clock_wipe_cache();
+	cycles = sched_clock_cycles();
+	(void)sched_clock();
+	cycles = sched_clock_cycles() - cycles;
+	local_irq_enable();
+
+	return cycles;
+}
+
+static __init u64 cache_cold_local_clock(void)
+{
+	u64 cycles;
+
+	local_irq_disable();
+	sched_clock_wipe_cache();
+	cycles = sched_clock_cycles();
+	(void)local_clock();
+	cycles = sched_clock_cycles() - cycles;
+	local_irq_enable();
+
+	return cycles;
+}
+
+static __init void do_bench(void)
+{
+	u64 cycles;
+	u64 tmp;
+	int i;
+
+	printk("sched_clock_stable: %d\n", sched_clock_stable());
+
+	cycles = 0;
+	for (i = 0; i < 1000; i++)
+		cycles += cache_cold_sched_clock();
+
+	printk("(cold) sched_clock: %llu\n", cycles);
+
+	cycles = 0;
+	for (i = 0; i < 1000; i++)
+		cycles += cache_cold_local_clock();
+
+	printk("(cold) local_clock: %llu\n", cycles);
+
+	local_irq_disable();
+	ACCESS_ONCE(tmp) = sched_clock();
+
+	cycles = sched_clock_cycles();
+
+	for (i = 0; i < 1000; i++)
+		ACCESS_ONCE(tmp) = sched_clock();
+
+	cycles = sched_clock_cycles() - cycles;
+	local_irq_enable();
+
+	printk("(warm) sched_clock: %llu\n", cycles);
+
+	local_irq_disable();
+	ACCESS_ONCE(tmp) = local_clock();
+
+	cycles = sched_clock_cycles();
+
+	for (i = 0; i < 1000; i++)
+		ACCESS_ONCE(tmp) = local_clock();
+
+	cycles = sched_clock_cycles() - cycles;
+	local_irq_enable();
+
+	printk("(warm) local_clock: %llu\n", cycles);
+}
+
+static __init int sched_clock_bench(void)
+{
+	struct perf_event_attr perf_attr = {
+		.type = PERF_TYPE_HARDWARE,
+		.config = PERF_COUNT_HW_CPU_CYCLES,
+		.size = sizeof(struct perf_event_attr),
+		.pinned = 1,
+	};
+
+	__sched_clock_cycles = perf_event_create_kernel_counter(&perf_attr, -1, current, NULL, NULL);
+
+	set_sched_clock_stable();
+	do_bench();
+
+	clear_sched_clock_stable();
+	do_bench();
+
+	set_sched_clock_stable();
+
+	perf_event_release_kernel(__sched_clock_cycles);
+
+	return 0;
+}
+
+late_initcall(sched_clock_bench);


* Re: [RFC][PATCH 0/7] sched: Optimize sched_clock bits
  2013-12-03 15:10   ` Peter Zijlstra
@ 2013-12-10 14:47     ` Peter Zijlstra
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2013-12-10 14:47 UTC (permalink / raw)
  To: Eliezer Tamir
  Cc: John Stultz, Thomas Gleixner, Steven Rostedt, Ingo Molnar,
	Mathieu Desnoyers, Andy Lutomirski, linux-kernel, Tony Luck, hpa

On Tue, Dec 03, 2013 at 04:10:53PM +0100, Peter Zijlstra wrote:
> Also, static_key and its related APIs suck donkey ballz -- I did indeed
> promise myself I'd not step into that bikeshed contest, but having had
> to use them again was *painful*.

> POST:
> 
> [    5.061000] sched_clock_stable: 1
> [   18.916000] (cold) sched_clock: 1335206
> [   32.576000] (cold) local_clock: 1365236
> [   32.580000] (warm) sched_clock: 11771
> [   32.584000] (warm) local_clock: 14725

It looks like this was using the jiffies (tsc_disabled) path instead of the rdtsc path.

/me curses static_key and co more.


end of thread, other threads:[~2013-12-10 14:48 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-11-29 17:36 [RFC][PATCH 0/7] sched: Optimize sched_clock bits Peter Zijlstra
2013-11-29 17:36 ` [RFC][PATCH 1/7] math64: mul_u64_u32_shr() Peter Zijlstra
2013-11-29 17:36 ` [RFC][PATCH 2/7] x86: Use mul_u64_u32_shr() for native_sched_clock() Peter Zijlstra
2013-11-29 17:37 ` [RFC][PATCH 3/7] x86: Avoid a runtime condition in native_sched_clock() Peter Zijlstra
2013-11-29 17:37 ` [RFC][PATCH 4/7] x86: Move some code around Peter Zijlstra
2013-11-29 17:37 ` [RFC][PATCH 5/7] x86: Use latch data structure for cyc2ns Peter Zijlstra
2013-11-29 23:22   ` Andy Lutomirski
2013-11-30  9:18     ` Peter Zijlstra
2013-11-29 17:37 ` [RFC][PATCH 6/7] sched: Remove local_irq_disable() from the clocks Peter Zijlstra
2013-11-29 17:37 ` [RFC][PATCH 7/7] sched: Use a static_key for sched_clock_stable Peter Zijlstra
2013-12-01 18:08 ` [RFC][PATCH 0/7] sched: Optimize sched_clock bits Eliezer Tamir
2013-12-03 15:10   ` Peter Zijlstra
2013-12-10 14:47     ` Peter Zijlstra
