Subject: [patch] Performance Counters for Linux, v2
From: Ingo Molnar @ 2008-12-08  1:22 UTC
  To: linux-kernel
  Cc: Thomas Gleixner, linux-arch, Andrew Morton, Stephane Eranian,
	Eric Dumazet, Robert Richter, Arjan van de Veen, Peter Anvin,
	Peter Zijlstra, Steven Rostedt, David Miller, Paul Mackerras


[ Performance counters are special hardware registers available on most 
  modern CPUs. These registers count the number of certain types of hw 
  events - such as instructions executed, cache misses suffered, or 
  branches mis-predicted - without slowing down the kernel or 
  applications. These registers can also trigger interrupts when a 
  threshold number of events have passed - and can thus be used to 
  profile the code that runs on that CPU. ]

This is version 2 of our Performance Counters subsystem implementation.

The biggest user-visible change in this release is a new user-space 
text-mode profiling utility that is based on this code: KernelTop.

KernelTop can be downloaded from:

  http://redhat.com/~mingo/perfcounters/kerneltop.c

It's a standalone .c file that needs no extra libraries - it only needs a 
CONFIG_PERF_COUNTERS=y kernel to run on.

This utility is intended for kernel developers - it's basically a dynamic 
kernel profiler that gets hardware counter events dispatched to it 
continuously, feeds them into a histogram and outputs the result 
periodically.

Here is a screenshot of it:

------------------------------------------------------------------------------
  KernelTop: 250880 irqs/sec  [NMI, 10000 cycles], (all, cpu: 0)
------------------------------------------------------------------------------

             events         RIP          kernel function
             ______   ________________   _______________

              17319 - ffffffff8106f8fa : audit_syscall_exit
              16300 - ffffffff81042ce2 : sys_rt_sigprocmask
              11031 - ffffffff8106fdc8 : audit_syscall_entry
              10880 - ffffffff8100bd8d : rff_trace
               9780 - ffffffff810a232f : kfree	[ehci_hcd]
               9707 - ffffffff81298cb7 : _spin_lock_irq	[ehci_hcd]
               7903 - ffffffff8106db17 : unroll_tree_refs
               7266 - ffffffff81138d10 : copy_user_generic_string
               5751 - ffffffff8100be45 : sysret_check
               4803 - ffffffff8100bea8 : sysret_signal
               4696 - ffffffff8100bdb0 : system_call
               4425 - ffffffff8100bdc0 : system_call_after_swapgs
               2855 - ffffffff810ae183 : path_put	[ext3]
               2773 - ffffffff8100bedb : auditsys
               1589 - ffffffff810b6864 : dput	[sunrpc]
               1253 - ffffffff8100be40 : ret_from_sys_call
                690 - ffffffff8105034c : current_kernel_time	[ext3]
                673 - ffffffff81042bd4 : sys_sigprocmask
                611 - ffffffff8100bf25 : sysret_audit

It will correctly profile the core kernel, module space and vsyscall 
areas as well. It allows the use of the most common hw counters: cycles, 
instructions, branches, cachemisses, cache-references and branch-misses.

KernelTop does not have to be started/stopped - it continuously profiles 
the system and updates the histogram as the workload changes. The 
histogram is not cumulative: old workload effects time out gradually. For 
example, if the system goes idle, the profiler output drops to near zero 
within 10-20 seconds. So there's no need to stop or restart profiling - 
it all updates automatically as the workload changes its characteristics.

KernelTop can also profile raw event IDs. For example, on a Core2 CPU, to 
profile the "Number of instruction length decoder stalls" (raw event 
0x0087) during a hackbench run, i did this:

  $ ./kerneltop -e -$(printf "%d\n" 0x00000087) -c 10000 -n 1

------------------------------------------------------------------------------
 KernelTop:     331 irqs/sec  [NMI, 10000 raw:0087],  (all, 2 CPUs)
------------------------------------------------------------------------------

             events         RIP          kernel function
             ______   ________________   _______________

               1016 - ffffffff802a611e : kmem_cache_alloc_node
                898 - ffffffff804ca381 : sock_wfree
                 64 - ffffffff80567306 : schedule
                 50 - ffffffff804cdb39 : skb_release_head_state
                 45 - ffffffff8053ed54 : unix_write_space
                 33 - ffffffff802a6a4d : __kmalloc_node
                 18 - ffffffff802a642c : cache_alloc_refill
                 13 - ffffffff804cdd50 : __alloc_skb
                  7 - ffffffff8053ec0a : unix_shutdown

[ The printf is done to pass in a negative event number as a parameter. ]
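
For reference, here is a minimal sketch (not part of the patch, nor of 
kerneltop.c) of how such a raw counter could be opened directly via the 
new system call - it assumes the x86-64 syscall number (295) that this 
patch assigns to sys_perf_counter_open():

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_perf_counter_open
# define __NR_perf_counter_open 295	/* x86-64 number from this patch */
#endif

int main(void)
{
	uint64_t count;
	int fd;

	/* Raw hw event types are passed in as negative numbers: */
	fd = syscall(__NR_perf_counter_open,
		     -0x0087,	/* Core2: instruction length decoder stalls */
		     0,		/* hw_event_period 0: non-blocking counter   */
		     0,		/* record_type: PERF_RECORD_SIMPLE           */
		     0,		/* pid 0: attach to the current task         */
		     -1);	/* cpu -1: count on all CPUs                  */
	if (fd < 0)
		return 1;

	/* ... run the code to be measured here ... */

	if (read(fd, &count, sizeof(count)) == sizeof(count))
		printf("raw 0x0087 events: %llu\n", (unsigned long long)count);
	close(fd);
	return 0;
}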

We also made a good number of internal changes to the subsystem:

There's a new "counter group record" facility that is a straightforward 
extension of the existing "irq record" notification type. This record 
type can be set on a 'master' counter, and if the master counter triggers 
an IRQ or an NMI, all the 'secondary' counters are read out atomically 
and are put into the counter-group record. The result can then be read() 
out by userspace via a single system call. (Based on extensive feedback 
from Paul Mackerras and David Miller, thanks guys!)
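
To illustrate the intended usage, here is a rough sketch (not code from 
the patch) of how user-space might consume such a record. It assumes the 
record is the sequence of (hw event type, counter value) u64 pairs that 
the x86 IRQ handler stores via perf_store_irq_data() - the exact layout 
is defined by the core code:

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

/* Illustrative helper, not part of the patch: */
static void read_group_record(int group_fd)
{
	uint64_t buf[64];
	ssize_t len, i;

	/* A single read() returns the whole counter-group record: */
	len = read(group_fd, buf, sizeof(buf));

	/* Walk the (hw event type, counter value) u64 pairs: */
	for (i = 0; i + 1 < len / (ssize_t)sizeof(uint64_t); i += 2)
		printf("event %#llx: %llu events\n",
		       (unsigned long long)buf[i],
		       (unsigned long long)buf[i + 1]);
}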

The other big change is support for virtual task counters via counter 
scheduling: a task can specify more counters than there are hardware 
counters on the CPU; the kernel then schedules the counters periodically 
to spread out the hw resources. For example, if a task starts 6 counters 
on a CPU that has only two hardware counters, it still gets this output:

 counter[0 cycles              ]:   5204680573 , delta:   1733680843 events
 counter[1 instructions        ]:   1364468045 , delta:    454818351 events
 counter[2 cache-refs          ]:        12732 , delta:         4399 events
 counter[3 cache-misses        ]:         1009 , delta:          336 events
 counter[4 branch-instructions ]:    125993304 , delta:     42006998 events
 counter[5 branch-misses       ]:         1946 , delta:          649 events

See this sample code at:

  http://redhat.com/~mingo/perfcounters/hello-loop.c
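
For illustration, here is a minimal sketch along those lines (not the 
actual hello-loop.c): it opens all six generalized counters on the 
current task, does some work, then reads the counts back - counter 
scheduling takes care of multiplexing them onto the available hardware 
counters. It assumes the x86-64 syscall number (295) from this patch:

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_perf_counter_open
# define __NR_perf_counter_open 295	/* x86-64 number from this patch */
#endif

#define NR_EVENTS 6	/* PERF_COUNT_CYCLES .. PERF_COUNT_BRANCH_MISSES */

int main(void)
{
	volatile unsigned long i, sum = 0;
	uint64_t count;
	int fd[NR_EVENTS], type;

	for (type = 0; type < NR_EVENTS; type++)
		fd[type] = syscall(__NR_perf_counter_open,
				   type,	/* generalized hw event type  */
				   0,		/* non-blocking counter       */
				   0,		/* PERF_RECORD_SIMPLE         */
				   0,		/* attach to the current task */
				   -1);		/* follow the task on any CPU */

	/* Some work for the counters to measure: */
	for (i = 0; i < 100000000; i++)
		sum += i;

	for (type = 0; type < NR_EVENTS; type++) {
		if (fd[type] < 0)
			continue;
		if (read(fd[type], &count, sizeof(count)) == sizeof(count))
			printf("counter[%d]: %20llu events\n",
			       type, (unsigned long long)count);
		close(fd[type]);
	}
	return 0;
}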

There's also now the ability to do NMI profiling: this works both for per 
CPU and per task counters. NMI counters are transparent and are enabled 
via the PERF_COUNT_NMI bit in the "hardware event type" parameter of 
the sys_perf_counter_open() system call.
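
As an illustration (not code from the patch), an NMI cycle counter could 
be opened like this - using the event type and record type values this 
patch adds to include/linux/perf_counter.h; note that the x86 code only 
honors the NMI bit for CAP_SYS_ADMIN capable tasks:

#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_perf_counter_open
# define __NR_perf_counter_open 295	/* x86-64 number from this patch */
#endif

/* Values from include/linux/perf_counter.h in this patch: */
#define PERF_COUNT_CYCLES	0
#define PERF_COUNT_NMI		(1 << 30)
#define PERF_RECORD_IRQ		1

/* Open a cycle counter that samples via NMI on the current task: */
static int open_nmi_cycle_counter(void)
{
	return syscall(__NR_perf_counter_open,
		       PERF_COUNT_CYCLES | PERF_COUNT_NMI,
		       10000,		/* wake a blocked read() every 10000 events */
		       PERF_RECORD_IRQ,	/* read() also returns the interrupted RIP  */
		       0,		/* current task */
		       -1);		/* any CPU      */
}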

There's also more generic x86 support: all 4 generic PMCs of Nehalem / 
Core i7 are supported - i've run 4 instances of KernelTop and they used 
up four separate PMCs.

There's also perf counters debug output that can be triggered via sysrq, 
for diagnostic purposes.

	Ingo, Thomas

------------------->

The latest performance counters experimental git tree can be found at:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git perfcounters/core

--------------->
Ingo Molnar (2):
      performance counters: documentation
      performance counters: x86 support

Thomas Gleixner (1):
      performance counters: core code


 Documentation/perf-counters.txt                |  104 +++
 arch/x86/Kconfig                               |    1 +
 arch/x86/ia32/ia32entry.S                      |    3 +-
 arch/x86/include/asm/hardirq_32.h              |    1 +
 arch/x86/include/asm/hw_irq.h                  |    2 +
 arch/x86/include/asm/intel_arch_perfmon.h      |   34 +-
 arch/x86/include/asm/irq_vectors.h             |    5 +
 arch/x86/include/asm/mach-default/entry_arch.h |    5 +
 arch/x86/include/asm/pda.h                     |    1 +
 arch/x86/include/asm/thread_info.h             |    4 +-
 arch/x86/include/asm/unistd_32.h               |    1 +
 arch/x86/include/asm/unistd_64.h               |    3 +-
 arch/x86/kernel/apic.c                         |    2 +
 arch/x86/kernel/cpu/Makefile                   |   12 +-
 arch/x86/kernel/cpu/common.c                   |    2 +
 arch/x86/kernel/cpu/perf_counter.c             |  571 ++++++++++++++
 arch/x86/kernel/entry_64.S                     |    6 +
 arch/x86/kernel/irq.c                          |    5 +
 arch/x86/kernel/irqinit_32.c                   |    3 +
 arch/x86/kernel/irqinit_64.c                   |    5 +
 arch/x86/kernel/signal_32.c                    |    8 +-
 arch/x86/kernel/signal_64.c                    |    5 +
 arch/x86/kernel/syscall_table_32.S             |    1 +
 drivers/char/sysrq.c                           |    2 +
 include/linux/perf_counter.h                   |  171 +++++
 include/linux/sched.h                          |    9 +
 include/linux/syscalls.h                       |    6 +
 init/Kconfig                                   |   29 +
 kernel/Makefile                                |    1 +
 kernel/fork.c                                  |    1 +
 kernel/perf_counter.c                          |  945 ++++++++++++++++++++++++
 kernel/sched.c                                 |   24 +
 kernel/sys_ni.c                                |    3 +
 33 files changed, 1954 insertions(+), 21 deletions(-)
 create mode 100644 Documentation/perf-counters.txt
 create mode 100644 arch/x86/kernel/cpu/perf_counter.c
 create mode 100644 include/linux/perf_counter.h
 create mode 100644 kernel/perf_counter.c

diff --git a/Documentation/perf-counters.txt b/Documentation/perf-counters.txt
new file mode 100644
index 0000000..19033a0
--- /dev/null
+++ b/Documentation/perf-counters.txt
@@ -0,0 +1,104 @@
+
+Performance Counters for Linux
+------------------------------
+
+Performance counters are special hardware registers available on most modern
+CPUs. These registers count the number of certain types of hw events: such
+as instructions executed, cachemisses suffered, or branches mis-predicted -
+without slowing down the kernel or applications. These registers can also
+trigger interrupts when a threshold number of events have passed - and can
+thus be used to profile the code that runs on that CPU.
+
+The Linux Performance Counter subsystem provides an abstraction of these
+hardware capabilities. It provides per task and per CPU counters, and
+it provides event capabilities on top of those.
+
+Performance counters are accessed via special file descriptors.
+There's one file descriptor per virtual counter used.
+
+The special file descriptor is opened via the perf_counter_open()
+system call:
+
+ int
+ perf_counter_open(u32 hw_event_type,
+                   u32 hw_event_period,
+                   u32 record_type,
+                   pid_t pid,
+                   int cpu);
+
+The syscall returns the new fd. The fd can be used via the normal
+VFS system calls: read() can be used to read the counter, fcntl()
+can be used to set the blocking mode, etc.
+
+Multiple counters can be kept open at a time, and the counters
+can be poll()ed.
+
+When creating a new counter fd, 'hw_event_type' is one of:
+
+ enum hw_event_types {
+	PERF_COUNT_CYCLES,
+	PERF_COUNT_INSTRUCTIONS,
+	PERF_COUNT_CACHE_REFERENCES,
+	PERF_COUNT_CACHE_MISSES,
+	PERF_COUNT_BRANCH_INSTRUCTIONS,
+	PERF_COUNT_BRANCH_MISSES,
+ };
+
+These are standardized types of events that work uniformly on all CPUs
+that implement Performance Counters support under Linux. If a CPU is
+not able to count branch-misses, then the system call will return
+-EINVAL.
+
+[ Note: more hw_event_types are supported as well, but they are CPU
+  specific and are enumerated via /sys on a per CPU basis. Raw hw event
+  types can be passed in as negative numbers. For example, to count
+  "External bus cycles while bus lock signal asserted" events on Intel
+  Core CPUs, pass in a -0x4064 event type value. ]
+
+The parameter 'hw_event_period' is the number of events before waking up
+a read() that is blocked on a counter fd. Zero value means a non-blocking
+counter.
+
+'record_type' is the type of data that a read() will provide for the
+counter, and it can be one of:
+
+  enum perf_record_type {
+	PERF_RECORD_SIMPLE,
+	PERF_RECORD_IRQ,
+  };
+
+A "simple" counter is one that counts hardware events and allows
+them to be read out into a u64 count value. (read() returns 8 on
+a successful read of a simple counter.)
+
+An "irq" counter is one that will also provide IRQ context information:
+the IP of the interrupted context. In this case read() will return
+the 8-byte counter value, plus the Instruction Pointer address of the
+interrupted context.
+
+The 'pid' parameter allows the counter to be specific to a task:
+
+ pid == 0: if the pid parameter is zero, the counter is attached to the
+ current task.
+
+ pid > 0: the counter is attached to a specific task (if the current task
+ has sufficient privilege to do so)
+
+ pid < 0: all tasks are counted (per cpu counters)
+
+The 'cpu' parameter allows a counter to be made specific to a full
+CPU:
+
+ cpu >= 0: the counter is restricted to a specific CPU
+ cpu == -1: the counter counts on all CPUs
+
+Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.
+
+A 'pid > 0' and 'cpu == -1' counter is a per task counter that counts
+events of that task and 'follows' that task to whatever CPU the task
+gets scheduled to. Per task counters can be created by any user, for
+their own tasks.
+
+A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
+all events on CPU-x. Per CPU counters need CAP_SYS_ADMIN privilege.
+
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ac22bb7..5a2d74a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -651,6 +651,7 @@ config X86_UP_IOAPIC
 config X86_LOCAL_APIC
 	def_bool y
 	depends on X86_64 || (X86_32 && (X86_UP_APIC || (SMP && !X86_VOYAGER) || X86_GENERICARCH))
+	select HAVE_PERF_COUNTERS
 
 config X86_IO_APIC
 	def_bool y
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 256b00b..3c14ed0 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -823,7 +823,8 @@ ia32_sys_call_table:
 	.quad compat_sys_signalfd4
 	.quad sys_eventfd2
 	.quad sys_epoll_create1
-	.quad sys_dup3			/* 330 */
+	.quad sys_dup3				/* 330 */
 	.quad sys_pipe2
 	.quad sys_inotify_init1
+	.quad sys_perf_counter_open
 ia32_syscall_end:
diff --git a/arch/x86/include/asm/hardirq_32.h b/arch/x86/include/asm/hardirq_32.h
index 5ca135e..b3e475d 100644
--- a/arch/x86/include/asm/hardirq_32.h
+++ b/arch/x86/include/asm/hardirq_32.h
@@ -9,6 +9,7 @@ typedef struct {
 	unsigned long idle_timestamp;
 	unsigned int __nmi_count;	/* arch dependent */
 	unsigned int apic_timer_irqs;	/* arch dependent */
+	unsigned int apic_perf_irqs;	/* arch dependent */
 	unsigned int irq0_irqs;
 	unsigned int irq_resched_count;
 	unsigned int irq_call_count;
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index b97aecb..c22900e 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -30,6 +30,8 @@
 /* Interrupt handlers registered during init_IRQ */
 extern void apic_timer_interrupt(void);
 extern void error_interrupt(void);
+extern void perf_counter_interrupt(void);
+
 extern void spurious_interrupt(void);
 extern void thermal_interrupt(void);
 extern void reschedule_interrupt(void);
diff --git a/arch/x86/include/asm/intel_arch_perfmon.h b/arch/x86/include/asm/intel_arch_perfmon.h
index fa0fd06..71598a9 100644
--- a/arch/x86/include/asm/intel_arch_perfmon.h
+++ b/arch/x86/include/asm/intel_arch_perfmon.h
@@ -1,22 +1,24 @@
 #ifndef _ASM_X86_INTEL_ARCH_PERFMON_H
 #define _ASM_X86_INTEL_ARCH_PERFMON_H
 
-#define MSR_ARCH_PERFMON_PERFCTR0		0xc1
-#define MSR_ARCH_PERFMON_PERFCTR1		0xc2
+#define MSR_ARCH_PERFMON_PERFCTR0			      0xc1
+#define MSR_ARCH_PERFMON_PERFCTR1			      0xc2
 
-#define MSR_ARCH_PERFMON_EVENTSEL0		0x186
-#define MSR_ARCH_PERFMON_EVENTSEL1		0x187
+#define MSR_ARCH_PERFMON_EVENTSEL0			     0x186
+#define MSR_ARCH_PERFMON_EVENTSEL1			     0x187
 
-#define ARCH_PERFMON_EVENTSEL0_ENABLE	(1 << 22)
-#define ARCH_PERFMON_EVENTSEL_INT	(1 << 20)
-#define ARCH_PERFMON_EVENTSEL_OS	(1 << 17)
-#define ARCH_PERFMON_EVENTSEL_USR	(1 << 16)
+#define ARCH_PERFMON_EVENTSEL0_ENABLE			  (1 << 22)
+#define ARCH_PERFMON_EVENTSEL_INT			  (1 << 20)
+#define ARCH_PERFMON_EVENTSEL_OS			  (1 << 17)
+#define ARCH_PERFMON_EVENTSEL_USR			  (1 << 16)
 
-#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_SEL	(0x3c)
-#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_UMASK	(0x00 << 8)
-#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX (0)
+#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_SEL		      0x3c
+#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_UMASK		(0x00 << 8)
+#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX 		 0
 #define ARCH_PERFMON_UNHALTED_CORE_CYCLES_PRESENT \
-	(1 << (ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX))
+		(1 << (ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX))
+
+#define ARCH_PERFMON_BRANCH_MISSES_RETIRED			 6
 
 union cpuid10_eax {
 	struct {
@@ -28,4 +30,12 @@ union cpuid10_eax {
 	unsigned int full;
 };
 
+#ifdef CONFIG_PERF_COUNTERS
+extern void init_hw_perf_counters(void);
+extern void perf_counters_lapic_init(int nmi);
+#else
+static inline void init_hw_perf_counters(void)		{ }
+static inline void perf_counters_lapic_init(int nmi)	{ }
+#endif
+
 #endif /* _ASM_X86_INTEL_ARCH_PERFMON_H */
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 0005adb..b8d277f 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -87,6 +87,11 @@
 #define LOCAL_TIMER_VECTOR	0xef
 
 /*
+ * Performance monitoring interrupt vector:
+ */
+#define LOCAL_PERF_VECTOR	0xee
+
+/*
  * First APIC vector available to drivers: (vectors 0x30-0xee) we
  * start at 0x31(0x41) to spread out vectors evenly between priority
  * levels. (0x80 is the syscall vector)
diff --git a/arch/x86/include/asm/mach-default/entry_arch.h b/arch/x86/include/asm/mach-default/entry_arch.h
index 6b1add8..ad31e5d 100644
--- a/arch/x86/include/asm/mach-default/entry_arch.h
+++ b/arch/x86/include/asm/mach-default/entry_arch.h
@@ -25,10 +25,15 @@ BUILD_INTERRUPT(irq_move_cleanup_interrupt,IRQ_MOVE_CLEANUP_VECTOR)
  * a much simpler SMP time architecture:
  */
 #ifdef CONFIG_X86_LOCAL_APIC
+
 BUILD_INTERRUPT(apic_timer_interrupt,LOCAL_TIMER_VECTOR)
 BUILD_INTERRUPT(error_interrupt,ERROR_APIC_VECTOR)
 BUILD_INTERRUPT(spurious_interrupt,SPURIOUS_APIC_VECTOR)
 
+#ifdef CONFIG_PERF_COUNTERS
+BUILD_INTERRUPT(perf_counter_interrupt, LOCAL_PERF_VECTOR)
+#endif
+
 #ifdef CONFIG_X86_MCE_P4THERMAL
 BUILD_INTERRUPT(thermal_interrupt,THERMAL_APIC_VECTOR)
 #endif
diff --git a/arch/x86/include/asm/pda.h b/arch/x86/include/asm/pda.h
index 2fbfff8..90a8d9d 100644
--- a/arch/x86/include/asm/pda.h
+++ b/arch/x86/include/asm/pda.h
@@ -30,6 +30,7 @@ struct x8664_pda {
 	short isidle;
 	struct mm_struct *active_mm;
 	unsigned apic_timer_irqs;
+	unsigned apic_perf_irqs;
 	unsigned irq0_irqs;
 	unsigned irq_resched_count;
 	unsigned irq_call_count;
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index e44d379..810bf26 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -80,6 +80,7 @@ struct thread_info {
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
 #define TIF_SECCOMP		8	/* secure computing */
 #define TIF_MCE_NOTIFY		10	/* notify userspace of an MCE */
+#define TIF_PERF_COUNTERS	11	/* notify perf counter work */
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* 32bit process */
 #define TIF_FORK		18	/* ret_from_fork */
@@ -103,6 +104,7 @@ struct thread_info {
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_MCE_NOTIFY		(1 << TIF_MCE_NOTIFY)
+#define _TIF_PERF_COUNTERS	(1 << TIF_PERF_COUNTERS)
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
 #define _TIF_FORK		(1 << TIF_FORK)
@@ -135,7 +137,7 @@ struct thread_info {
 
 /* Only used for 64 bit */
 #define _TIF_DO_NOTIFY_MASK						\
-	(_TIF_SIGPENDING|_TIF_MCE_NOTIFY|_TIF_NOTIFY_RESUME)
+	(_TIF_SIGPENDING|_TIF_MCE_NOTIFY|_TIF_PERF_COUNTERS|_TIF_NOTIFY_RESUME)
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW							\
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index f2bba78..7e47658 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -338,6 +338,7 @@
 #define __NR_dup3		330
 #define __NR_pipe2		331
 #define __NR_inotify_init1	332
+#define __NR_perf_counter_open	333
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index d2e415e..53025fe 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -653,7 +653,8 @@ __SYSCALL(__NR_dup3, sys_dup3)
 __SYSCALL(__NR_pipe2, sys_pipe2)
 #define __NR_inotify_init1			294
 __SYSCALL(__NR_inotify_init1, sys_inotify_init1)
-
+#define __NR_perf_counter_open		295
+__SYSCALL(__NR_perf_counter_open, sys_perf_counter_open)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/apic.c b/arch/x86/kernel/apic.c
index 16f9487..8ab8c18 100644
--- a/arch/x86/kernel/apic.c
+++ b/arch/x86/kernel/apic.c
@@ -31,6 +31,7 @@
 #include <linux/dmi.h>
 #include <linux/dmar.h>
 
+#include <asm/intel_arch_perfmon.h>
 #include <asm/atomic.h>
 #include <asm/smp.h>
 #include <asm/mtrr.h>
@@ -1147,6 +1148,7 @@ void __cpuinit setup_local_APIC(void)
 		apic_write(APIC_ESR, 0);
 	}
 #endif
+	perf_counters_lapic_init(0);
 
 	preempt_disable();
 
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 82ec607..89e5336 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -1,5 +1,5 @@
 #
-# Makefile for x86-compatible CPU details and quirks
+# Makefile for x86-compatible CPU details, features and quirks
 #
 
 obj-y			:= intel_cacheinfo.o addon_cpuid_features.o
@@ -16,11 +16,13 @@ obj-$(CONFIG_CPU_SUP_CENTAUR_64)	+= centaur_64.o
 obj-$(CONFIG_CPU_SUP_TRANSMETA_32)	+= transmeta.o
 obj-$(CONFIG_CPU_SUP_UMC_32)		+= umc.o
 
-obj-$(CONFIG_X86_MCE)	+= mcheck/
-obj-$(CONFIG_MTRR)	+= mtrr/
-obj-$(CONFIG_CPU_FREQ)	+= cpufreq/
+obj-$(CONFIG_PERF_COUNTERS)		+= perf_counter.o
 
-obj-$(CONFIG_X86_LOCAL_APIC) += perfctr-watchdog.o
+obj-$(CONFIG_X86_MCE)			+= mcheck/
+obj-$(CONFIG_MTRR)			+= mtrr/
+obj-$(CONFIG_CPU_FREQ)			+= cpufreq/
+
+obj-$(CONFIG_X86_LOCAL_APIC)		+= perfctr-watchdog.o
 
 quiet_cmd_mkcapflags = MKCAP   $@
       cmd_mkcapflags = $(PERL) $(srctree)/$(src)/mkcapflags.pl $< $@
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index b9c9ea0..4461011 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -17,6 +17,7 @@
 #include <asm/mmu_context.h>
 #include <asm/mtrr.h>
 #include <asm/mce.h>
+#include <asm/intel_arch_perfmon.h>
 #include <asm/pat.h>
 #include <asm/asm.h>
 #include <asm/numa.h>
@@ -750,6 +751,7 @@ void __init identify_boot_cpu(void)
 #else
 	vgetcpu_set_mode();
 #endif
+	init_hw_perf_counters();
 }
 
 void __cpuinit identify_secondary_cpu(struct cpuinfo_x86 *c)
diff --git a/arch/x86/kernel/cpu/perf_counter.c b/arch/x86/kernel/cpu/perf_counter.c
new file mode 100644
index 0000000..82440cb
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_counter.c
@@ -0,0 +1,571 @@
+/*
+ * Performance counter x86 architecture code
+ *
+ *  Copyright(C) 2008 Thomas Gleixner <tglx@linutronix.de>
+ *  Copyright(C) 2008 Red Hat, Inc., Ingo Molnar
+ *
+ *  For licencing details see kernel-base/COPYING
+ */
+
+#include <linux/perf_counter.h>
+#include <linux/capability.h>
+#include <linux/notifier.h>
+#include <linux/hardirq.h>
+#include <linux/kprobes.h>
+#include <linux/kdebug.h>
+#include <linux/sched.h>
+
+#include <asm/intel_arch_perfmon.h>
+#include <asm/apic.h>
+
+static bool perf_counters_initialized __read_mostly;
+
+/*
+ * Number of (generic) HW counters:
+ */
+static int nr_hw_counters __read_mostly;
+static u32 perf_counter_mask __read_mostly;
+
+/* No support for fixed function counters yet */
+
+#define MAX_HW_COUNTERS		8
+
+struct cpu_hw_counters {
+	struct perf_counter	*counters[MAX_HW_COUNTERS];
+	unsigned long		used[BITS_TO_LONGS(MAX_HW_COUNTERS)];
+	int			enable_all;
+};
+
+/*
+ * Intel PerfMon v3. Used on Core2 and later.
+ */
+static DEFINE_PER_CPU(struct cpu_hw_counters, cpu_hw_counters);
+
+const int intel_perfmon_event_map[] =
+{
+  [PERF_COUNT_CYCLES]			= 0x003c,
+  [PERF_COUNT_INSTRUCTIONS]		= 0x00c0,
+  [PERF_COUNT_CACHE_REFERENCES]		= 0x4f2e,
+  [PERF_COUNT_CACHE_MISSES]		= 0x412e,
+  [PERF_COUNT_BRANCH_INSTRUCTIONS]	= 0x00c4,
+  [PERF_COUNT_BRANCH_MISSES]		= 0x00c5,
+};
+
+const int max_intel_perfmon_events = ARRAY_SIZE(intel_perfmon_event_map);
+
+/*
+ * Setup the hardware configuration for a given hw_event_type
+ */
+int hw_perf_counter_init(struct perf_counter *counter, s32 hw_event_type)
+{
+	struct hw_perf_counter *hwc = &counter->hw;
+
+	if (unlikely(!perf_counters_initialized))
+		return -EINVAL;
+
+	/*
+	 * Count user events, and generate PMC IRQs:
+	 * (keep 'enabled' bit clear for now)
+	 */
+	hwc->config = ARCH_PERFMON_EVENTSEL_USR | ARCH_PERFMON_EVENTSEL_INT;
+
+	/*
+	 * If privileged enough, count OS events too, and allow
+	 * NMI events as well:
+	 */
+	hwc->nmi = 0;
+	if (capable(CAP_SYS_ADMIN)) {
+		hwc->config |= ARCH_PERFMON_EVENTSEL_OS;
+		if (hw_event_type & PERF_COUNT_NMI)
+			hwc->nmi = 1;
+	}
+
+	hwc->config_base = MSR_ARCH_PERFMON_EVENTSEL0;
+	hwc->counter_base = MSR_ARCH_PERFMON_PERFCTR0;
+
+	hwc->irq_period = counter->__irq_period;
+	/*
+	 * Intel PMCs cannot be accessed sanely above 32 bit width,
+	 * so we install an artificial 1<<31 period regardless of
+	 * the generic counter period:
+	 */
+	if (!hwc->irq_period)
+		hwc->irq_period = 0x7FFFFFFF;
+
+	hwc->next_count = -((s32) hwc->irq_period);
+
+	/*
+	 * Negative event types mean raw encoded event+umask values:
+	 */
+	if (hw_event_type < 0) {
+		counter->hw_event_type = -hw_event_type;
+		counter->hw_event_type &= ~PERF_COUNT_NMI;
+	} else {
+		hw_event_type &= ~PERF_COUNT_NMI;
+		if (hw_event_type >= max_intel_perfmon_events)
+			return -EINVAL;
+		/*
+		 * The generic map:
+		 */
+		counter->hw_event_type = intel_perfmon_event_map[hw_event_type];
+	}
+	hwc->config |= counter->hw_event_type;
+	counter->wakeup_pending = 0;
+
+	return 0;
+}
+
+static void __hw_perf_enable_all(void)
+{
+	wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, perf_counter_mask, 0);
+}
+
+void hw_perf_enable_all(void)
+{
+	struct cpu_hw_counters *cpuc = &__get_cpu_var(cpu_hw_counters);
+
+	cpuc->enable_all = 1;
+	__hw_perf_enable_all();
+}
+
+void hw_perf_disable_all(void)
+{
+	struct cpu_hw_counters *cpuc = &__get_cpu_var(cpu_hw_counters);
+
+	cpuc->enable_all = 0;
+	wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, 0, 0);
+}
+
+static DEFINE_PER_CPU(u64, prev_next_count[MAX_HW_COUNTERS]);
+
+static void __hw_perf_counter_enable(struct hw_perf_counter *hwc, int idx)
+{
+	per_cpu(prev_next_count[idx], smp_processor_id()) = hwc->next_count;
+
+	wrmsr(hwc->counter_base + idx, hwc->next_count, 0);
+	wrmsr(hwc->config_base + idx, hwc->config, 0);
+}
+
+void hw_perf_counter_enable(struct perf_counter *counter)
+{
+	struct cpu_hw_counters *cpuc = &__get_cpu_var(cpu_hw_counters);
+	struct hw_perf_counter *hwc = &counter->hw;
+	int idx = hwc->idx;
+
+	/* Try to get the previous counter again */
+	if (test_and_set_bit(idx, cpuc->used)) {
+		idx = find_first_zero_bit(cpuc->used, nr_hw_counters);
+		set_bit(idx, cpuc->used);
+		hwc->idx = idx;
+	}
+
+	perf_counters_lapic_init(hwc->nmi);
+
+	wrmsr(hwc->config_base + idx,
+	      hwc->config & ~ARCH_PERFMON_EVENTSEL0_ENABLE, 0);
+
+	cpuc->counters[idx] = counter;
+	counter->hw.config |= ARCH_PERFMON_EVENTSEL0_ENABLE;
+	__hw_perf_counter_enable(hwc, idx);
+}
+
+#ifdef CONFIG_X86_64
+static inline void atomic64_counter_set(struct perf_counter *counter, u64 val)
+{
+	atomic64_set(&counter->count, val);
+}
+
+static inline u64 atomic64_counter_read(struct perf_counter *counter)
+{
+	return atomic64_read(&counter->count);
+}
+#else
+/*
+ * Todo: add proper atomic64_t support to 32-bit x86:
+ */
+static inline void atomic64_counter_set(struct perf_counter *counter, u64 val64)
+{
+	u32 *val32 = (void *)&val64;
+
+	atomic_set(counter->count32 + 0, *(val32 + 0));
+	atomic_set(counter->count32 + 1, *(val32 + 1));
+}
+
+static inline u64 atomic64_counter_read(struct perf_counter *counter)
+{
+	return atomic_read(counter->count32 + 0) |
+		(u64) atomic_read(counter->count32 + 1) << 32;
+}
+#endif
+
+static void __hw_perf_save_counter(struct perf_counter *counter,
+				   struct hw_perf_counter *hwc, int idx)
+{
+	s64 raw = -1;
+	s64 delta;
+	int err;
+
+	/*
+	 * Get the raw hw counter value:
+	 */
+	err = rdmsrl_safe(hwc->counter_base + idx, &raw);
+	WARN_ON_ONCE(err);
+
+	/*
+	 * Rebase it to zero (it started counting at -irq_period),
+	 * to see the delta since ->prev_count:
+	 */
+	delta = (s64)hwc->irq_period + (s64)(s32)raw;
+
+	atomic64_counter_set(counter, hwc->prev_count + delta);
+
+	/*
+	 * Adjust the ->prev_count offset - if we went beyond
+	 * irq_period of units, then we got an IRQ and the counter
+	 * was set back to -irq_period:
+	 */
+	while (delta >= (s64)hwc->irq_period) {
+		hwc->prev_count += hwc->irq_period;
+		delta -= (s64)hwc->irq_period;
+	}
+
+	/*
+	 * Calculate the next raw counter value we'll write into
+	 * the counter at the next sched-in time:
+	 */
+	delta -= (s64)hwc->irq_period;
+
+	hwc->next_count = (s32)delta;
+}
+
+void perf_counter_print_debug(void)
+{
+	u64 ctrl, status, overflow, pmc_ctrl, pmc_count, next_count;
+	int cpu, err, idx;
+
+	local_irq_disable();
+
+	cpu = smp_processor_id();
+
+	err = rdmsrl_safe(MSR_CORE_PERF_GLOBAL_CTRL, &ctrl);
+	WARN_ON_ONCE(err);
+
+	err = rdmsrl_safe(MSR_CORE_PERF_GLOBAL_STATUS, &status);
+	WARN_ON_ONCE(err);
+
+	err = rdmsrl_safe(MSR_CORE_PERF_GLOBAL_OVF_CTRL, &overflow);
+	WARN_ON_ONCE(err);
+
+	printk(KERN_INFO "\n");
+	printk(KERN_INFO "CPU#%d: ctrl:       %016llx\n", cpu, ctrl);
+	printk(KERN_INFO "CPU#%d: status:     %016llx\n", cpu, status);
+	printk(KERN_INFO "CPU#%d: overflow:   %016llx\n", cpu, overflow);
+
+	for (idx = 0; idx < nr_hw_counters; idx++) {
+		err = rdmsrl_safe(MSR_ARCH_PERFMON_EVENTSEL0 + idx, &pmc_ctrl);
+		WARN_ON_ONCE(err);
+
+		err = rdmsrl_safe(MSR_ARCH_PERFMON_PERFCTR0 + idx, &pmc_count);
+		WARN_ON_ONCE(err);
+
+		next_count = per_cpu(prev_next_count[idx], cpu);
+
+		printk(KERN_INFO "CPU#%d: PMC%d ctrl:  %016llx\n",
+			cpu, idx, pmc_ctrl);
+		printk(KERN_INFO "CPU#%d: PMC%d count: %016llx\n",
+			cpu, idx, pmc_count);
+		printk(KERN_INFO "CPU#%d: PMC%d next:  %016llx\n",
+			cpu, idx, next_count);
+	}
+	local_irq_enable();
+}
+
+void hw_perf_counter_disable(struct perf_counter *counter)
+{
+	struct cpu_hw_counters *cpuc = &__get_cpu_var(cpu_hw_counters);
+	struct hw_perf_counter *hwc = &counter->hw;
+	unsigned int idx = hwc->idx;
+
+	counter->hw.config &= ~ARCH_PERFMON_EVENTSEL0_ENABLE;
+	wrmsr(hwc->config_base + idx, hwc->config, 0);
+
+	clear_bit(idx, cpuc->used);
+	cpuc->counters[idx] = NULL;
+	__hw_perf_save_counter(counter, hwc, idx);
+}
+
+void hw_perf_counter_read(struct perf_counter *counter)
+{
+	struct hw_perf_counter *hwc = &counter->hw;
+	unsigned long addr = hwc->counter_base + hwc->idx;
+	s64 offs, val = -1LL;
+	s32 val32;
+	int err;
+
+	/* Careful: NMI might modify the counter offset */
+	do {
+		offs = hwc->prev_count;
+		err = rdmsrl_safe(addr, &val);
+		WARN_ON_ONCE(err);
+	} while (offs != hwc->prev_count);
+
+	val32 = (s32) val;
+	val =  (s64)hwc->irq_period + (s64)val32;
+	atomic64_counter_set(counter, hwc->prev_count + val);
+}
+
+static void perf_store_irq_data(struct perf_counter *counter, u64 data)
+{
+	struct perf_data *irqdata = counter->irqdata;
+
+	if (irqdata->len > PERF_DATA_BUFLEN - sizeof(u64)) {
+		irqdata->overrun++;
+	} else {
+		u64 *p = (u64 *) &irqdata->data[irqdata->len];
+
+		*p = data;
+		irqdata->len += sizeof(u64);
+	}
+}
+
+static void perf_save_and_restart(struct perf_counter *counter)
+{
+	struct hw_perf_counter *hwc = &counter->hw;
+	int idx = hwc->idx;
+
+	wrmsr(hwc->config_base + idx,
+	      hwc->config & ~ARCH_PERFMON_EVENTSEL0_ENABLE, 0);
+
+	if (hwc->config & ARCH_PERFMON_EVENTSEL0_ENABLE) {
+		__hw_perf_save_counter(counter, hwc, idx);
+		__hw_perf_counter_enable(hwc, idx);
+	}
+}
+
+static void
+perf_handle_group(struct perf_counter *leader, u64 *status, u64 *overflown)
+{
+	struct perf_counter_context *ctx = leader->ctx;
+	struct perf_counter *counter;
+	int bit;
+
+	list_for_each_entry(counter, &ctx->counters, list) {
+		if (counter->record_type != PERF_RECORD_SIMPLE ||
+		    counter == leader)
+			continue;
+
+		if (counter->active) {
+			/*
+			 * When the counter was not in the overflow mask, we
+			 * have to read it from hardware. We also read it when
+			 * it has not been read yet, and clear the bit in the
+			 * status mask.
+			 */
+			bit = counter->hw.idx;
+			if (!test_bit(bit, (unsigned long *) overflown) ||
+			    test_bit(bit, (unsigned long *) status)) {
+				clear_bit(bit, (unsigned long *) status);
+				perf_save_and_restart(counter);
+			}
+		}
+		perf_store_irq_data(leader, counter->hw_event_type);
+		perf_store_irq_data(leader, atomic64_counter_read(counter));
+	}
+}
+
+/*
+ * This handler is triggered by the local APIC, so the APIC IRQ handling
+ * rules apply:
+ */
+static void __smp_perf_counter_interrupt(struct pt_regs *regs, int nmi)
+{
+	int bit, cpu = smp_processor_id();
+	struct cpu_hw_counters *cpuc;
+	u64 ack, status;
+
+	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
+	if (!status) {
+		ack_APIC_irq();
+		return;
+	}
+
+	/* Disable counters globally */
+	wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, 0, 0);
+	ack_APIC_irq();
+
+	cpuc = &per_cpu(cpu_hw_counters, cpu);
+
+again:
+	ack = status;
+	for_each_bit(bit, (unsigned long *) &status, nr_hw_counters) {
+		struct perf_counter *counter = cpuc->counters[bit];
+
+		clear_bit(bit, (unsigned long *) &status);
+		if (!counter)
+			continue;
+
+		perf_save_and_restart(counter);
+
+		switch (counter->record_type) {
+		case PERF_RECORD_SIMPLE:
+			continue;
+		case PERF_RECORD_IRQ:
+			perf_store_irq_data(counter, instruction_pointer(regs));
+			break;
+		case PERF_RECORD_GROUP:
+			perf_store_irq_data(counter, counter->hw_event_type);
+			perf_store_irq_data(counter,
+					    atomic64_counter_read(counter));
+			perf_handle_group(counter, &status, &ack);
+			break;
+		}
+		/*
+		 * From NMI context we cannot call into the scheduler to
+		 * do a task wakeup - but we mark these counters as
+		 * wakeup_pending and initiate a wakeup callback:
+		 */
+		if (nmi) {
+			counter->wakeup_pending = 1;
+			set_tsk_thread_flag(current, TIF_PERF_COUNTERS);
+		} else {
+			wake_up(&counter->waitq);
+		}
+	}
+
+	wrmsr(MSR_CORE_PERF_GLOBAL_OVF_CTRL, ack, 0);
+
+	/*
+	 * Repeat if there is more work to be done:
+	 */
+	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
+	if (status)
+		goto again;
+
+	/*
+	 * Do not reenable when global enable is off:
+	 */
+	if (cpuc->enable_all)
+		__hw_perf_enable_all();
+}
+
+void smp_perf_counter_interrupt(struct pt_regs *regs)
+{
+	irq_enter();
+#ifdef CONFIG_X86_64
+	add_pda(apic_perf_irqs, 1);
+#else
+	per_cpu(irq_stat, smp_processor_id()).apic_perf_irqs++;
+#endif
+	apic_write(APIC_LVTPC, LOCAL_PERF_VECTOR);
+	__smp_perf_counter_interrupt(regs, 0);
+
+	irq_exit();
+}
+
+/*
+ * This handler is triggered by NMI contexts:
+ */
+void perf_counter_notify(struct pt_regs *regs)
+{
+	struct cpu_hw_counters *cpuc;
+	unsigned long flags;
+	int bit, cpu;
+
+	local_irq_save(flags);
+	cpu = smp_processor_id();
+	cpuc = &per_cpu(cpu_hw_counters, cpu);
+
+	for_each_bit(bit, cpuc->used, nr_hw_counters) {
+		struct perf_counter *counter = cpuc->counters[bit];
+
+		if (!counter)
+			continue;
+
+		if (counter->wakeup_pending) {
+			counter->wakeup_pending = 0;
+			wake_up(&counter->waitq);
+		}
+	}
+
+	local_irq_restore(flags);
+}
+
+void __cpuinit perf_counters_lapic_init(int nmi)
+{
+	u32 apic_val;
+
+	if (!perf_counters_initialized)
+		return;
+	/*
+	 * Enable the performance counter vector in the APIC LVT:
+	 */
+	apic_val = apic_read(APIC_LVTERR);
+
+	apic_write(APIC_LVTERR, apic_val | APIC_LVT_MASKED);
+	if (nmi)
+		apic_write(APIC_LVTPC, APIC_DM_NMI);
+	else
+		apic_write(APIC_LVTPC, LOCAL_PERF_VECTOR);
+	apic_write(APIC_LVTERR, apic_val);
+}
+
+static int __kprobes
+perf_counter_nmi_handler(struct notifier_block *self,
+			 unsigned long cmd, void *__args)
+{
+	struct die_args *args = __args;
+	struct pt_regs *regs;
+
+	if (likely(cmd != DIE_NMI_IPI))
+		return NOTIFY_DONE;
+
+	regs = args->regs;
+
+	apic_write(APIC_LVTPC, APIC_DM_NMI);
+	__smp_perf_counter_interrupt(regs, 1);
+
+	return NOTIFY_STOP;
+}
+
+static __read_mostly struct notifier_block perf_counter_nmi_notifier = {
+	.notifier_call		= perf_counter_nmi_handler
+};
+
+void __init init_hw_perf_counters(void)
+{
+	union cpuid10_eax eax;
+	unsigned int unused;
+	unsigned int ebx;
+
+	if (!cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON))
+		return;
+
+	/*
+	 * Check whether the Architectural PerfMon supports
+	 * Branch Misses Retired Event or not.
+	 */
+	cpuid(10, &(eax.full), &ebx, &unused, &unused);
+	if (eax.split.mask_length <= ARCH_PERFMON_BRANCH_MISSES_RETIRED)
+		return;
+
+	printk(KERN_INFO "Intel Performance Monitoring support detected.\n");
+
+	printk(KERN_INFO "... version:      %d\n", eax.split.version_id);
+	printk(KERN_INFO "... num_counters: %d\n", eax.split.num_counters);
+	nr_hw_counters = eax.split.num_counters;
+	if (nr_hw_counters > MAX_HW_COUNTERS) {
+		nr_hw_counters = MAX_HW_COUNTERS;
+		WARN(1, KERN_ERR "hw perf counters %d > max(%d), clipping!",
+			nr_hw_counters, MAX_HW_COUNTERS);
+	}
+	perf_counter_mask = (1 << nr_hw_counters) - 1;
+	perf_max_counters = nr_hw_counters;
+
+	printk(KERN_INFO "... bit_width:    %d\n", eax.split.bit_width);
+	printk(KERN_INFO "... mask_length:  %d\n", eax.split.mask_length);
+
+	perf_counters_lapic_init(0);
+	register_die_notifier(&perf_counter_nmi_notifier);
+
+	perf_counters_initialized = true;
+}
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index b86f332..ad70f59 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -869,6 +869,12 @@ END(error_interrupt)
 ENTRY(spurious_interrupt)
 	apicinterrupt SPURIOUS_APIC_VECTOR,smp_spurious_interrupt
 END(spurious_interrupt)
+
+#ifdef CONFIG_PERF_COUNTERS
+ENTRY(perf_counter_interrupt)
+	apicinterrupt LOCAL_PERF_VECTOR,smp_perf_counter_interrupt
+END(perf_counter_interrupt)
+#endif
 				
 /*
  * Exception entry points.
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index d1d4dc5..d92bc71 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -56,6 +56,10 @@ static int show_other_interrupts(struct seq_file *p)
 	for_each_online_cpu(j)
 		seq_printf(p, "%10u ", irq_stats(j)->apic_timer_irqs);
 	seq_printf(p, "  Local timer interrupts\n");
+	seq_printf(p, "CNT: ");
+	for_each_online_cpu(j)
+		seq_printf(p, "%10u ", irq_stats(j)->apic_perf_irqs);
+	seq_printf(p, "  Performance counter interrupts\n");
 #endif
 #ifdef CONFIG_SMP
 	seq_printf(p, "RES: ");
@@ -160,6 +164,7 @@ u64 arch_irq_stat_cpu(unsigned int cpu)
 
 #ifdef CONFIG_X86_LOCAL_APIC
 	sum += irq_stats(cpu)->apic_timer_irqs;
+	sum += irq_stats(cpu)->apic_perf_irqs;
 #endif
 #ifdef CONFIG_SMP
 	sum += irq_stats(cpu)->irq_resched_count;
diff --git a/arch/x86/kernel/irqinit_32.c b/arch/x86/kernel/irqinit_32.c
index 845aa98..de2bb7c 100644
--- a/arch/x86/kernel/irqinit_32.c
+++ b/arch/x86/kernel/irqinit_32.c
@@ -160,6 +160,9 @@ void __init native_init_IRQ(void)
 	/* IPI vectors for APIC spurious and error interrupts */
 	alloc_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt);
 	alloc_intr_gate(ERROR_APIC_VECTOR, error_interrupt);
+# ifdef CONFIG_PERF_COUNTERS
+	alloc_intr_gate(LOCAL_PERF_VECTOR, perf_counter_interrupt);
+# endif
 #endif
 
 #if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86_MCE_P4THERMAL)
diff --git a/arch/x86/kernel/irqinit_64.c b/arch/x86/kernel/irqinit_64.c
index ff02353..eb04dd9 100644
--- a/arch/x86/kernel/irqinit_64.c
+++ b/arch/x86/kernel/irqinit_64.c
@@ -204,6 +204,11 @@ static void __init apic_intr_init(void)
 	/* IPI vectors for APIC spurious and error interrupts */
 	alloc_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt);
 	alloc_intr_gate(ERROR_APIC_VECTOR, error_interrupt);
+
+	/* Performance monitoring interrupt: */
+#ifdef CONFIG_PERF_COUNTERS
+	alloc_intr_gate(LOCAL_PERF_VECTOR, perf_counter_interrupt);
+#endif
 }
 
 void __init native_init_IRQ(void)
diff --git a/arch/x86/kernel/signal_32.c b/arch/x86/kernel/signal_32.c
index d6dd057..6d39c27 100644
--- a/arch/x86/kernel/signal_32.c
+++ b/arch/x86/kernel/signal_32.c
@@ -6,7 +6,9 @@
  */
 #include <linux/list.h>
 
+#include <linux/perf_counter.h>
 #include <linux/personality.h>
+#include <linux/tracehook.h>
 #include <linux/binfmts.h>
 #include <linux/suspend.h>
 #include <linux/kernel.h>
@@ -17,7 +19,6 @@
 #include <linux/errno.h>
 #include <linux/sched.h>
 #include <linux/wait.h>
-#include <linux/tracehook.h>
 #include <linux/elf.h>
 #include <linux/smp.h>
 #include <linux/mm.h>
@@ -694,6 +695,11 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 		tracehook_notify_resume(regs);
 	}
 
+	if (thread_info_flags & _TIF_PERF_COUNTERS) {
+		clear_thread_flag(TIF_PERF_COUNTERS);
+		perf_counter_notify(regs);
+	}
+
 #ifdef CONFIG_X86_32
 	clear_thread_flag(TIF_IRET);
 #endif /* CONFIG_X86_32 */
diff --git a/arch/x86/kernel/signal_64.c b/arch/x86/kernel/signal_64.c
index a5c9627..066a13f 100644
--- a/arch/x86/kernel/signal_64.c
+++ b/arch/x86/kernel/signal_64.c
@@ -7,6 +7,7 @@
  *  2000-2002   x86-64 support by Andi Kleen
  */
 
+#include <linux/perf_counter.h>
 #include <linux/sched.h>
 #include <linux/mm.h>
 #include <linux/smp.h>
@@ -493,6 +494,10 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 		clear_thread_flag(TIF_NOTIFY_RESUME);
 		tracehook_notify_resume(regs);
 	}
+	if (thread_info_flags & _TIF_PERF_COUNTERS) {
+		clear_thread_flag(TIF_PERF_COUNTERS);
+		perf_counter_notify(regs);
+	}
 
 #ifdef CONFIG_X86_32
 	clear_thread_flag(TIF_IRET);
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index d44395f..496726d 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -332,3 +332,4 @@ ENTRY(sys_call_table)
 	.long sys_dup3			/* 330 */
 	.long sys_pipe2
 	.long sys_inotify_init1
+	.long sys_perf_counter_open
diff --git a/drivers/char/sysrq.c b/drivers/char/sysrq.c
index ce0d9da..52146c2 100644
--- a/drivers/char/sysrq.c
+++ b/drivers/char/sysrq.c
@@ -25,6 +25,7 @@
 #include <linux/kbd_kern.h>
 #include <linux/proc_fs.h>
 #include <linux/quotaops.h>
+#include <linux/perf_counter.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/suspend.h>
@@ -244,6 +245,7 @@ static void sysrq_handle_showregs(int key, struct tty_struct *tty)
 	struct pt_regs *regs = get_irq_regs();
 	if (regs)
 		show_regs(regs);
+	perf_counter_print_debug();
 }
 static struct sysrq_key_op sysrq_showregs_op = {
 	.handler	= sysrq_handle_showregs,
diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
new file mode 100644
index 0000000..22c4469
--- /dev/null
+++ b/include/linux/perf_counter.h
@@ -0,0 +1,171 @@
+/*
+ *  Performance counters:
+ *
+ *   Copyright(C) 2008, Thomas Gleixner <tglx@linutronix.de>
+ *   Copyright(C) 2008, Red Hat, Inc., Ingo Molnar
+ *
+ *  Data type definitions, declarations, prototypes.
+ *
+ *  Started by: Thomas Gleixner and Ingo Molnar
+ *
+ *  For licencing details see kernel-base/COPYING
+ */
+#ifndef _LINUX_PERF_COUNTER_H
+#define _LINUX_PERF_COUNTER_H
+
+#include <asm/atomic.h>
+
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/rculist.h>
+#include <linux/rcupdate.h>
+#include <linux/spinlock.h>
+
+struct task_struct;
+
+/*
+ * Generalized hardware event types, used by the hw_event_type parameter
+ * of the sys_perf_counter_open() syscall:
+ */
+enum hw_event_types {
+	PERF_COUNT_CYCLES,
+	PERF_COUNT_INSTRUCTIONS,
+	PERF_COUNT_CACHE_REFERENCES,
+	PERF_COUNT_CACHE_MISSES,
+	PERF_COUNT_BRANCH_INSTRUCTIONS,
+	PERF_COUNT_BRANCH_MISSES,
+	/*
+	 * If this bit is set in the type, then trigger NMI sampling:
+	 */
+	PERF_COUNT_NMI			= (1 << 30),
+};
+
+/*
+ * IRQ-notification data record type:
+ */
+enum perf_record_type {
+	PERF_RECORD_SIMPLE,
+	PERF_RECORD_IRQ,
+	PERF_RECORD_GROUP,
+};
+
+/**
+ * struct hw_perf_counter - performance counter hardware details
+ */
+struct hw_perf_counter {
+	u64			config;
+	unsigned long		config_base;
+	unsigned long		counter_base;
+	int			nmi;
+	unsigned int		idx;
+	u64			prev_count;
+	s32			next_count;
+	u64			irq_period;
+};
+
+/*
+ * Hardcoded buffer length limit for now, for IRQ-fed events:
+ */
+#define PERF_DATA_BUFLEN	2048
+
+/**
+ * struct perf_data - performance counter IRQ data sampling ...
+ */
+struct perf_data {
+	int			len;
+	int			rd_idx;
+	int			overrun;
+	u8			data[PERF_DATA_BUFLEN];
+};
+
+/**
+ * struct perf_counter - performance counter kernel representation:
+ */
+struct perf_counter {
+	struct list_head		list;
+	int				active;
+#if BITS_PER_LONG == 64
+	atomic64_t			count;
+#else
+	atomic_t			count32[2];
+#endif
+	u64				__irq_period;
+
+	struct hw_perf_counter		hw;
+
+	struct perf_counter_context	*ctx;
+	struct task_struct		*task;
+
+	/*
+	 * Protect attach/detach:
+	 */
+	struct mutex			mutex;
+
+	int				oncpu;
+	int				cpu;
+
+	s32				hw_event_type;
+	enum perf_record_type		record_type;
+
+	/* read() / irq related data */
+	wait_queue_head_t		waitq;
+	/* optional: for NMIs */
+	int				wakeup_pending;
+	struct perf_data		*irqdata;
+	struct perf_data		*usrdata;
+	struct perf_data		data[2];
+};
+
+/**
+ * struct perf_counter_context - counter context structure
+ *
+ * Used as a container for task counters and CPU counters as well:
+ */
+struct perf_counter_context {
+#ifdef CONFIG_PERF_COUNTERS
+	/*
+	 * Protect the list of counters:
+	 */
+	spinlock_t		lock;
+	struct list_head	counters;
+	int			nr_counters;
+	int			nr_active;
+	struct task_struct	*task;
+#endif
+};
+
+/**
+ * struct perf_counter_cpu_context - per cpu counter context structure
+ */
+struct perf_cpu_context {
+	struct perf_counter_context	ctx;
+	struct perf_counter_context	*task_ctx;
+	int				active_oncpu;
+	int				max_pertask;
+};
+
+/*
+ * Set by architecture code:
+ */
+extern int perf_max_counters;
+
+#ifdef CONFIG_PERF_COUNTERS
+extern void perf_counter_task_sched_in(struct task_struct *task, int cpu);
+extern void perf_counter_task_sched_out(struct task_struct *task, int cpu);
+extern void perf_counter_task_tick(struct task_struct *task, int cpu);
+extern void perf_counter_init_task(struct task_struct *task);
+extern void perf_counter_notify(struct pt_regs *regs);
+extern void perf_counter_print_debug(void);
+#else
+static inline void
+perf_counter_task_sched_in(struct task_struct *task, int cpu)		{ }
+static inline void
+perf_counter_task_sched_out(struct task_struct *task, int cpu)		{ }
+static inline void
+perf_counter_task_tick(struct task_struct *task, int cpu)		{ }
+static inline void perf_counter_init_task(struct task_struct *task)	{ }
+static inline void perf_counter_notify(struct pt_regs *regs)		{ }
+static inline void perf_counter_print_debug(void)			{ }
+#endif
+
+#endif /* _LINUX_PERF_COUNTER_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 55e30d1..4c53027 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -71,6 +71,7 @@ struct sched_param {
 #include <linux/fs_struct.h>
 #include <linux/compiler.h>
 #include <linux/completion.h>
+#include <linux/perf_counter.h>
 #include <linux/pid.h>
 #include <linux/percpu.h>
 #include <linux/topology.h>
@@ -1326,6 +1327,7 @@ struct task_struct {
 	struct list_head pi_state_list;
 	struct futex_pi_state *pi_state_cache;
 #endif
+	struct perf_counter_context perf_counter_ctx;
 #ifdef CONFIG_NUMA
 	struct mempolicy *mempolicy;
 	short il_next;
@@ -2285,6 +2287,13 @@ static inline void inc_syscw(struct task_struct *tsk)
 #define TASK_SIZE_OF(tsk)	TASK_SIZE
 #endif
 
+/*
+ * Call the function if the target task is executing on a CPU right now:
+ */
+extern void task_oncpu_function_call(struct task_struct *p,
+				     void (*func) (void *info), void *info);
+
+
 #ifdef CONFIG_MM_OWNER
 extern void mm_update_next_owner(struct mm_struct *mm);
 extern void mm_init_owner(struct mm_struct *mm, struct task_struct *p);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 04fb47b..6cce728 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -624,4 +624,10 @@ asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
+asmlinkage int
+sys_perf_counter_open(u32 hw_event_type,
+		      u32 hw_event_period,
+		      u32 record_type,
+		      pid_t pid,
+		      int cpu);
 #endif
diff --git a/init/Kconfig b/init/Kconfig
index f763762..78bede2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -732,6 +732,35 @@ config AIO
           by some high performance threaded applications. Disabling
           this option saves about 7k.
 
+config HAVE_PERF_COUNTERS
+	bool
+
+menu "Performance Counters"
+
+config PERF_COUNTERS
+	bool "Kernel Performance Counters"
+	depends on HAVE_PERF_COUNTERS
+	default y
+	help
+	  Enable kernel support for performance counter hardware.
+
+	  Performance counters are special hardware registers available
+	  on most modern CPUs. These registers count the number of certain
+	  types of hw events: such as instructions executed, cachemisses
+	  suffered, or branches mis-predicted - without slowing down the
+	  kernel or applications. These registers can also trigger interrupts
+	  when a threshold number of events have passed - and can thus be
+	  used to profile the code that runs on that CPU.
+
+	  The Linux Performance Counter subsystem provides an abstraction of
+	  these hardware capabilities, available via a system call. It
+	  provides per task and per CPU counters, and it provides event
+	  capabilities on top of those.
+
+	  Say Y if unsure.
+
+endmenu
+
 config VM_EVENT_COUNTERS
 	default y
 	bool "Enable VM event counters for /proc/vmstat" if EMBEDDED
diff --git a/kernel/Makefile b/kernel/Makefile
index 19fad00..1f184a1 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -89,6 +89,7 @@ obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
 obj-$(CONFIG_FUNCTION_TRACER) += trace/
 obj-$(CONFIG_TRACING) += trace/
 obj-$(CONFIG_SMP) += sched_cpupri.o
+obj-$(CONFIG_PERF_COUNTERS) += perf_counter.o
 
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff --git a/kernel/fork.c b/kernel/fork.c
index 2a372a0..441fadf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -975,6 +975,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto fork_out;
 
 	rt_mutex_init_task(p);
+	perf_counter_init_task(p);
 
 #ifdef CONFIG_PROVE_LOCKING
 	DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled);
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
new file mode 100644
index 0000000..f84b400
--- /dev/null
+++ b/kernel/perf_counter.c
@@ -0,0 +1,945 @@
+/*
+ * Performance counter core code
+ *
+ *  Copyright(C) 2008 Thomas Gleixner <tglx@linutronix.de>
+ *  Copyright(C) 2008 Red Hat, Inc., Ingo Molnar
+ *
+ *  For licencing details see kernel-base/COPYING
+ */
+
+#include <linux/fs.h>
+#include <linux/cpu.h>
+#include <linux/smp.h>
+#include <linux/poll.h>
+#include <linux/sysfs.h>
+#include <linux/ptrace.h>
+#include <linux/percpu.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+#include <linux/anon_inodes.h>
+#include <linux/perf_counter.h>
+
+/*
+ * Each CPU has a list of per CPU counters:
+ */
+DEFINE_PER_CPU(struct perf_cpu_context, perf_cpu_context);
+
+int perf_max_counters __read_mostly;
+static int perf_reserved_percpu __read_mostly;
+static int perf_overcommit __read_mostly = 1;
+
+/*
+ * Mutex for (sysadmin-configurable) counter reservations:
+ */
+static DEFINE_MUTEX(perf_resource_mutex);
+
+/*
+ * Architecture provided APIs - weak aliases:
+ */
+
+int __weak hw_perf_counter_init(struct perf_counter *counter, u32 hw_event_type)
+{
+	return -EINVAL;
+}
+
+void __weak hw_perf_counter_enable(struct perf_counter *counter)	 { }
+void __weak hw_perf_counter_disable(struct perf_counter *counter)	 { }
+void __weak hw_perf_counter_read(struct perf_counter *counter)		 { }
+void __weak hw_perf_disable_all(void) { }
+void __weak hw_perf_enable_all(void) { }
+void __weak hw_perf_counter_setup(void) { }
+
+#if BITS_PER_LONG == 64
+
+/*
+ * Read the cached counter in 'counter', safe against cross CPU / NMI
+ * modifications. 64 bit version - no complications.
+ */
+static inline u64 perf_read_counter_safe(struct perf_counter *counter)
+{
+	return (u64) atomic64_read(&counter->count);
+}
+
+#else
+
+/*
+ * Read the cached counter in 'counter', safe against cross CPU / NMI
+ * modifications. 32 bit version.
+ */
+static u64 perf_read_counter_safe(struct perf_counter *counter)
+{
+	u32 cntl, cnth;
+
+	local_irq_disable();
+	do {
+		cnth = atomic_read(&counter->count32[1]);
+		cntl = atomic_read(&counter->count32[0]);
+	} while (cnth != atomic_read(&counter->count32[1]));
+
+	local_irq_enable();
+
+	return cntl | ((u64) cnth) << 32;
+}
+
+#endif
+
+/*
+ * Cross CPU call to remove a performance counter
+ *
+ * We disable the counter on the hardware level first. After that we
+ * remove it from the context list.
+ */
+static void __perf_remove_from_context(void *info)
+{
+	struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+	struct perf_counter *counter = info;
+	struct perf_counter_context *ctx = counter->ctx;
+
+	/*
+	 * If this is a task context, we need to check whether it is
+	 * the current task context of this cpu. If not it has been
+	 * scheduled out before the smp call arrived.
+	 */
+	if (ctx->task && cpuctx->task_ctx != ctx)
+		return;
+
+	spin_lock(&ctx->lock);
+
+	if (counter->active) {
+		hw_perf_counter_disable(counter);
+		counter->active = 0;
+		ctx->nr_active--;
+		cpuctx->active_oncpu--;
+		counter->task = NULL;
+	}
+	ctx->nr_counters--;
+
+	/*
+	 * Protect the list operation against NMI by disabling the
+	 * counters on a global level. NOP for non NMI based counters.
+	 */
+	hw_perf_disable_all();
+	list_del_init(&counter->list);
+	hw_perf_enable_all();
+
+	if (!ctx->task) {
+		/*
+		 * Allow more per task counters with respect to the
+		 * reservation:
+		 */
+		cpuctx->max_pertask =
+			min(perf_max_counters - ctx->nr_counters,
+			    perf_max_counters - perf_reserved_percpu);
+	}
+
+	spin_unlock(&ctx->lock);
+}
+
+
+/*
+ * Remove the counter from a task's (or a CPU's) list of counters.
+ *
+ * Must be called with counter->mutex held.
+ *
+ * CPU counters are removed with a smp call. For task counters we only
+ * call when the task is on a CPU.
+ */
+static void perf_remove_from_context(struct perf_counter *counter)
+{
+	struct perf_counter_context *ctx = counter->ctx;
+	struct task_struct *task = ctx->task;
+
+	if (!task) {
+		/*
+		 * Per cpu counters are removed via an smp call and
+		 * the removal is always successful.
+		 */
+		smp_call_function_single(counter->cpu,
+					 __perf_remove_from_context,
+					 counter, 1);
+		return;
+	}
+
+retry:
+	task_oncpu_function_call(task, __perf_remove_from_context,
+				 counter);
+
+	spin_lock_irq(&ctx->lock);
+	/*
+	 * If the context is active we need to retry the smp call.
+	 */
+	if (ctx->nr_active && !list_empty(&counter->list)) {
+		spin_unlock_irq(&ctx->lock);
+		goto retry;
+	}
+
+	/*
+	 * The lock prevents this context from being scheduled in, so we
+	 * can remove the counter safely if the call above did not
+	 * succeed.
+	 */
+	if (!list_empty(&counter->list)) {
+		ctx->nr_counters--;
+		list_del_init(&counter->list);
+		counter->task = NULL;
+	}
+	spin_unlock_irq(&ctx->lock);
+}
+
+/*
+ * Cross CPU call to install and enable a performance counter
+ */
+static void __perf_install_in_context(void *info)
+{
+	struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+	struct perf_counter *counter = info;
+	struct perf_counter_context *ctx = counter->ctx;
+	int cpu = smp_processor_id();
+
+	/*
+	 * If this is a task context, we need to check whether it is
+	 * the current task context of this cpu. If not it has been
+	 * scheduled out before the smp call arrived.
+	 */
+	if (ctx->task && cpuctx->task_ctx != ctx)
+		return;
+
+	spin_lock(&ctx->lock);
+
+	/*
+	 * Protect the list operation against NMI by disabling the
+	 * counters on a global level. NOP for non NMI based counters.
+	 */
+	hw_perf_disable_all();
+	list_add_tail(&counter->list, &ctx->counters);
+	hw_perf_enable_all();
+
+	ctx->nr_counters++;
+
+	if (cpuctx->active_oncpu < perf_max_counters) {
+		hw_perf_counter_enable(counter);
+		counter->active = 1;
+		counter->oncpu = cpu;
+		ctx->nr_active++;
+		cpuctx->active_oncpu++;
+	}
+
+	if (!ctx->task && cpuctx->max_pertask)
+		cpuctx->max_pertask--;
+
+	spin_unlock(&ctx->lock);
+}
+
+/*
+ * Attach a performance counter to a context
+ *
+ * First we add the counter to the list with the hardware enable bit
+ * in counter->hw_config cleared.
+ *
+ * If the counter is attached to a task which is on a CPU we use a smp
+ * call to enable it in the task context. The task might have been
+ * scheduled away, but we check this in the smp call again.
+ */
+static void
+perf_install_in_context(struct perf_counter_context *ctx,
+			struct perf_counter *counter,
+			int cpu)
+{
+	struct task_struct *task = ctx->task;
+
+	counter->ctx = ctx;
+	if (!task) {
+		/*
+		 * Per cpu counters are installed via an smp call and
+		 * the install is always successful.
+		 */
+		smp_call_function_single(cpu, __perf_install_in_context,
+					 counter, 1);
+		return;
+	}
+
+	counter->task = task;
+retry:
+	task_oncpu_function_call(task, __perf_install_in_context,
+				 counter);
+
+	spin_lock_irq(&ctx->lock);
+	/*
+	 * If the context is active and the counter has not been added
+	 * we need to retry the smp call.
+	 */
+	if (ctx->nr_active && list_empty(&counter->list)) {
+		spin_unlock_irq(&ctx->lock);
+		goto retry;
+	}
+
+	/*
+	 * The lock prevents this context from being scheduled in, so we
+	 * can add the counter safely if the call above did not
+	 * succeed.
+	 */
+	if (list_empty(&counter->list)) {
+		list_add_tail(&counter->list, &ctx->counters);
+		ctx->nr_counters++;
+	}
+	spin_unlock_irq(&ctx->lock);
+}
+
+/*
+ * Called from scheduler to remove the counters of the current task,
+ * with interrupts disabled.
+ *
+ * We stop each counter and update the counter value in counter->count.
+ *
+ * This does not protect us against NMI, but hw_perf_counter_disable()
+ * sets the disabled bit in the control field of counter _before_
+ * accessing the counter control register. If a NMI hits, then it will
+ * not restart the counter.
+ */
+void perf_counter_task_sched_out(struct task_struct *task, int cpu)
+{
+	struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu);
+	struct perf_counter_context *ctx = &task->perf_counter_ctx;
+	struct perf_counter *counter;
+
+	if (likely(!cpuctx->task_ctx))
+		return;
+
+	spin_lock(&ctx->lock);
+	list_for_each_entry(counter, &ctx->counters, list) {
+		if (!ctx->nr_active)
+			break;
+		if (counter->active) {
+			hw_perf_counter_disable(counter);
+			counter->active = 0;
+			counter->oncpu = -1;
+			ctx->nr_active--;
+			cpuctx->active_oncpu--;
+		}
+	}
+	spin_unlock(&ctx->lock);
+	cpuctx->task_ctx = NULL;
+}
+
+/*
+ * Called from scheduler to add the counters of the current task
+ * with interrupts disabled.
+ *
+ * We restore the counter value and then enable it.
+ *
+ * This does not protect us against NMI, but hw_perf_counter_enable()
+ * sets the enabled bit in the control field of counter _before_
+ * accessing the counter control register. If a NMI hits, then it will
+ * keep the counter running.
+ */
+void perf_counter_task_sched_in(struct task_struct *task, int cpu)
+{
+	struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu);
+	struct perf_counter_context *ctx = &task->perf_counter_ctx;
+	struct perf_counter *counter;
+
+	if (likely(!ctx->nr_counters))
+		return;
+
+	spin_lock(&ctx->lock);
+	list_for_each_entry(counter, &ctx->counters, list) {
+		if (ctx->nr_active == cpuctx->max_pertask)
+			break;
+		if (counter->cpu != -1 && counter->cpu != cpu)
+			continue;
+
+		hw_perf_counter_enable(counter);
+		counter->active = 1;
+		counter->oncpu = cpu;
+		ctx->nr_active++;
+		cpuctx->active_oncpu++;
+	}
+	spin_unlock(&ctx->lock);
+	cpuctx->task_ctx = ctx;
+}
+
+void perf_counter_task_tick(struct task_struct *curr, int cpu)
+{
+	struct perf_counter_context *ctx = &curr->perf_counter_ctx;
+	struct perf_counter *counter;
+
+	if (likely(!ctx->nr_counters))
+		return;
+
+	perf_counter_task_sched_out(curr, cpu);
+
+	spin_lock(&ctx->lock);
+
+	/*
+	 * Rotate the first entry last:
+	 */
+	hw_perf_disable_all();
+	list_for_each_entry(counter, &ctx->counters, list) {
+		list_del(&counter->list);
+		list_add_tail(&counter->list, &ctx->counters);
+		break;
+	}
+	hw_perf_enable_all();
+
+	spin_unlock(&ctx->lock);
+
+	perf_counter_task_sched_in(curr, cpu);
+}
+
+/*
+ * Initialize the perf_counter context in task_struct
+ */
+void perf_counter_init_task(struct task_struct *task)
+{
+	struct perf_counter_context *ctx = &task->perf_counter_ctx;
+
+	spin_lock_init(&ctx->lock);
+	INIT_LIST_HEAD(&ctx->counters);
+	ctx->nr_counters = 0;
+	ctx->task = task;
+}
+
+/*
+ * Cross CPU call to read the hardware counter
+ */
+static void __hw_perf_counter_read(void *info)
+{
+	hw_perf_counter_read(info);
+}
+
+static u64 perf_read_counter(struct perf_counter *counter)
+{
+	/*
+	 * If counter is enabled and currently active on a CPU, update the
+	 * value in the counter structure:
+	 */
+	if (counter->active) {
+		smp_call_function_single(counter->oncpu,
+					 __hw_perf_counter_read, counter, 1);
+	}
+
+	return perf_read_counter_safe(counter);
+}
+
+/*
+ * Cross CPU call to switch performance data pointers
+ */
+static void __perf_switch_irq_data(void *info)
+{
+	struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+	struct perf_counter *counter = info;
+	struct perf_counter_context *ctx = counter->ctx;
+	struct perf_data *oldirqdata = counter->irqdata;
+
+	/*
+	 * If this is a task context, we need to check whether it is
+	 * the current task context of this cpu. If not it has been
+	 * scheduled out before the smp call arrived.
+	 */
+	if (ctx->task) {
+		if (cpuctx->task_ctx != ctx)
+			return;
+		spin_lock(&ctx->lock);
+	}
+
+	/* Change the pointer NMI safe */
+	atomic_long_set((atomic_long_t *)&counter->irqdata,
+			(unsigned long) counter->usrdata);
+	counter->usrdata = oldirqdata;
+
+	if (ctx->task)
+		spin_unlock(&ctx->lock);
+}
+
+static struct perf_data *perf_switch_irq_data(struct perf_counter *counter)
+{
+	struct perf_counter_context *ctx = counter->ctx;
+	struct perf_data *oldirqdata = counter->irqdata;
+	struct task_struct *task = ctx->task;
+
+	if (!task) {
+		smp_call_function_single(counter->cpu,
+					 __perf_switch_irq_data,
+					 counter, 1);
+		return counter->usrdata;
+	}
+
+retry:
+	spin_lock_irq(&ctx->lock);
+	if (!counter->active) {
+		counter->irqdata = counter->usrdata;
+		counter->usrdata = oldirqdata;
+		spin_unlock_irq(&ctx->lock);
+		return oldirqdata;
+	}
+	spin_unlock_irq(&ctx->lock);
+	task_oncpu_function_call(task, __perf_switch_irq_data, counter);
+	/* Might have failed, because task was scheduled out */
+	if (counter->irqdata == oldirqdata)
+		goto retry;
+
+	return counter->usrdata;
+}
+
+static void put_context(struct perf_counter_context *ctx)
+{
+	if (ctx->task) {
+		put_task_struct(ctx->task);
+		ctx->task = NULL;
+	}
+}
+
+static struct perf_counter_context *find_get_context(pid_t pid, int cpu)
+{
+	struct perf_cpu_context *cpuctx;
+	struct perf_counter_context *ctx;
+	struct task_struct *task;
+
+	/*
+	 * If cpu is not a wildcard then this is a percpu counter:
+	 */
+	if (cpu != -1) {
+		/* Must be root to operate on a CPU counter: */
+		if (!capable(CAP_SYS_ADMIN))
+			return ERR_PTR(-EACCES);
+
+		if (cpu < 0 || cpu >= num_possible_cpus())
+			return ERR_PTR(-EINVAL);
+
+		/*
+		 * We could be clever and allow to attach a counter to an
+		 * offline CPU and activate it when the CPU comes up, but
+		 * that's for later.
+		 */
+		if (!cpu_isset(cpu, cpu_online_map))
+			return ERR_PTR(-ENODEV);
+
+		cpuctx = &per_cpu(perf_cpu_context, cpu);
+		ctx = &cpuctx->ctx;
+
+		WARN_ON_ONCE(ctx->task);
+		return ctx;
+	}
+
+	rcu_read_lock();
+	if (!pid)
+		task = current;
+	else
+		task = find_task_by_vpid(pid);
+	if (task)
+		get_task_struct(task);
+	rcu_read_unlock();
+
+	if (!task)
+		return ERR_PTR(-ESRCH);
+
+	ctx = &task->perf_counter_ctx;
+	ctx->task = task;
+
+	/* Reuse ptrace permission checks for now. */
+	if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
+		put_context(ctx);
+		return ERR_PTR(-EACCES);
+	}
+
+	return ctx;
+}
+
+/*
+ * Called when the last reference to the file is gone.
+ */
+static int perf_release(struct inode *inode, struct file *file)
+{
+	struct perf_counter *counter = file->private_data;
+	struct perf_counter_context *ctx = counter->ctx;
+
+	file->private_data = NULL;
+
+	mutex_lock(&counter->mutex);
+
+	perf_remove_from_context(counter);
+	put_context(ctx);
+
+	mutex_unlock(&counter->mutex);
+
+	kfree(counter);
+
+	return 0;
+}
+
+/*
+ * Read the performance counter - simple non blocking version for now
+ */
+static ssize_t
+perf_read_hw(struct perf_counter *counter, char __user *buf, size_t count)
+{
+	u64 cntval;
+
+	if (count != sizeof(cntval))
+		return -EINVAL;
+
+	mutex_lock(&counter->mutex);
+	cntval = perf_read_counter(counter);
+	mutex_unlock(&counter->mutex);
+
+	return put_user(cntval, (u64 __user *) buf) ? -EFAULT : sizeof(cntval);
+}
+
+static ssize_t
+perf_copy_usrdata(struct perf_data *usrdata, char __user *buf, size_t count)
+{
+	if (!usrdata->len)
+		return 0;
+
+	count = min(count, (size_t)usrdata->len);
+	if (copy_to_user(buf, usrdata->data + usrdata->rd_idx, count))
+		return -EFAULT;
+
+	/* Adjust the counters */
+	usrdata->len -= count;
+	if (!usrdata->len)
+		usrdata->rd_idx = 0;
+	else
+		usrdata->rd_idx += count;
+
+	return count;
+}
+
+static ssize_t
+perf_read_irq_data(struct perf_counter	*counter,
+		   char __user		*buf,
+		   size_t		count,
+		   int			nonblocking)
+{
+	struct perf_data *irqdata, *usrdata;
+	DECLARE_WAITQUEUE(wait, current);
+	ssize_t res;
+
+	irqdata = counter->irqdata;
+	usrdata = counter->usrdata;
+
+	if (usrdata->len + irqdata->len >= count)
+		goto read_pending;
+
+	if (nonblocking)
+		return -EAGAIN;
+
+	spin_lock_irq(&counter->waitq.lock);
+	__add_wait_queue(&counter->waitq, &wait);
+	for (;;) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (usrdata->len + irqdata->len >= count)
+			break;
+
+		if (signal_pending(current))
+			break;
+
+		spin_unlock_irq(&counter->waitq.lock);
+		schedule();
+		spin_lock_irq(&counter->waitq.lock);
+	}
+	__remove_wait_queue(&counter->waitq, &wait);
+	__set_current_state(TASK_RUNNING);
+	spin_unlock_irq(&counter->waitq.lock);
+
+	if (usrdata->len + irqdata->len < count)
+		return -ERESTARTSYS;
+read_pending:
+	mutex_lock(&counter->mutex);
+
+	/* Drain pending data first: */
+	res = perf_copy_usrdata(usrdata, buf, count);
+	if (res < 0 || res == count)
+		goto out;
+
+	/* Switch irq buffer: */
+	usrdata = perf_switch_irq_data(counter);
+	if (perf_copy_usrdata(usrdata, buf + res, count - res) < 0) {
+		if (!res)
+			res = -EFAULT;
+	} else {
+		res = count;
+	}
+out:
+	mutex_unlock(&counter->mutex);
+
+	return res;
+}
+
+static ssize_t
+perf_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
+{
+	struct perf_counter *counter = file->private_data;
+
+	switch (counter->record_type) {
+	case PERF_RECORD_SIMPLE:
+		return perf_read_hw(counter, buf, count);
+
+	case PERF_RECORD_IRQ:
+	case PERF_RECORD_GROUP:
+		return perf_read_irq_data(counter, buf, count,
+					  file->f_flags & O_NONBLOCK);
+	}
+	return -EINVAL;
+}
+
+static unsigned int perf_poll(struct file *file, poll_table *wait)
+{
+	struct perf_counter *counter = file->private_data;
+	unsigned int events = 0;
+	unsigned long flags;
+
+	poll_wait(file, &counter->waitq, wait);
+
+	spin_lock_irqsave(&counter->waitq.lock, flags);
+	if (counter->usrdata->len || counter->irqdata->len)
+		events |= POLLIN;
+	spin_unlock_irqrestore(&counter->waitq.lock, flags);
+
+	return events;
+}
+
+static const struct file_operations perf_fops = {
+	.release		= perf_release,
+	.read			= perf_read,
+	.poll			= perf_poll,
+};
+
+/*
+ * Allocate and initialize a counter structure
+ */
+static struct perf_counter *
+perf_counter_alloc(u32 hw_event_period, int cpu, u32 record_type)
+{
+	struct perf_counter *counter = kzalloc(sizeof(*counter), GFP_KERNEL);
+
+	if (!counter)
+		return NULL;
+
+	mutex_init(&counter->mutex);
+	INIT_LIST_HEAD(&counter->list);
+	init_waitqueue_head(&counter->waitq);
+
+	counter->irqdata	= &counter->data[0];
+	counter->usrdata	= &counter->data[1];
+	counter->cpu		= cpu;
+	counter->record_type	= record_type;
+	counter->__irq_period	= hw_event_period;
+	counter->wakeup_pending = 0;
+
+	return counter;
+}
+
+/**
+ * sys_perf_counter_open - open a performance counter and associate it with a task
+ * @hw_event_type:	event type for monitoring/sampling...
+ * @pid:		target pid
+ */
+asmlinkage int
+sys_perf_counter_open(u32 hw_event_type,
+		      u32 hw_event_period,
+		      u32 record_type,
+		      pid_t pid,
+		      int cpu)
+{
+	struct perf_counter_context *ctx;
+	struct perf_counter *counter;
+	int ret;
+
+	ctx = find_get_context(pid, cpu);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	ret = -ENOMEM;
+	counter = perf_counter_alloc(hw_event_period, cpu, record_type);
+	if (!counter)
+		goto err_put_context;
+
+	ret = hw_perf_counter_init(counter, hw_event_type);
+	if (ret)
+		goto err_free_put_context;
+
+	perf_install_in_context(ctx, counter, cpu);
+
+	ret = anon_inode_getfd("[perf_counter]", &perf_fops, counter, 0);
+	if (ret < 0)
+		goto err_remove_free_put_context;
+
+	return ret;
+
+err_remove_free_put_context:
+	mutex_lock(&counter->mutex);
+	perf_remove_from_context(counter);
+	mutex_unlock(&counter->mutex);
+
+err_free_put_context:
+	kfree(counter);
+
+err_put_context:
+	put_context(ctx);
+
+	return ret;
+}
+
+static void __cpuinit perf_init_cpu(int cpu)
+{
+	struct perf_cpu_context *ctx;
+
+	ctx = &per_cpu(perf_cpu_context, cpu);
+	spin_lock_init(&ctx->ctx.lock);
+	INIT_LIST_HEAD(&ctx->ctx.counters);
+
+	mutex_lock(&perf_resource_mutex);
+	ctx->max_pertask = perf_max_counters - perf_reserved_percpu;
+	mutex_unlock(&perf_resource_mutex);
+	hw_perf_counter_setup();
+}
+
+#ifdef CONFIG_HOTPLUG_CPU
+static void __perf_exit_cpu(void *info)
+{
+	struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+	struct perf_counter_context *ctx = &cpuctx->ctx;
+	struct perf_counter *counter, *tmp;
+
+	list_for_each_entry_safe(counter, tmp, &ctx->counters, list)
+		__perf_remove_from_context(counter);
+}
+
+static void perf_exit_cpu(int cpu)
+{
+	smp_call_function_single(cpu, __perf_exit_cpu, NULL, 1);
+}
+#else
+static inline void perf_exit_cpu(int cpu) { }
+#endif
+
+static int __cpuinit
+perf_cpu_notify(struct notifier_block *self, unsigned long action, void *hcpu)
+{
+	unsigned int cpu = (long)hcpu;
+
+	switch (action) {
+
+	case CPU_UP_PREPARE:
+	case CPU_UP_PREPARE_FROZEN:
+		perf_init_cpu(cpu);
+		break;
+
+	case CPU_DOWN_PREPARE:
+	case CPU_DOWN_PREPARE_FROZEN:
+		perf_exit_cpu(cpu);
+		break;
+
+	default:
+		break;
+	}
+
+	return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata perf_cpu_nb = {
+	.notifier_call		= perf_cpu_notify,
+};
+
+static int __init perf_counter_init(void)
+{
+	perf_cpu_notify(&perf_cpu_nb, (unsigned long)CPU_UP_PREPARE,
+			(void *)(long)smp_processor_id());
+	register_cpu_notifier(&perf_cpu_nb);
+
+	return 0;
+}
+early_initcall(perf_counter_init);
+
+static ssize_t perf_show_reserve_percpu(struct sysdev_class *class, char *buf)
+{
+	return sprintf(buf, "%d\n", perf_reserved_percpu);
+}
+
+static ssize_t
+perf_set_reserve_percpu(struct sysdev_class *class,
+			const char *buf,
+			size_t count)
+{
+	struct perf_cpu_context *cpuctx;
+	unsigned long val;
+	int err, cpu, mpt;
+
+	err = strict_strtoul(buf, 10, &val);
+	if (err)
+		return err;
+	if (val > perf_max_counters)
+		return -EINVAL;
+
+	mutex_lock(&perf_resource_mutex);
+	perf_reserved_percpu = val;
+	for_each_online_cpu(cpu) {
+		cpuctx = &per_cpu(perf_cpu_context, cpu);
+		spin_lock_irq(&cpuctx->ctx.lock);
+		mpt = min(perf_max_counters - cpuctx->ctx.nr_counters,
+			  perf_max_counters - perf_reserved_percpu);
+		cpuctx->max_pertask = mpt;
+		spin_unlock_irq(&cpuctx->ctx.lock);
+	}
+	mutex_unlock(&perf_resource_mutex);
+
+	return count;
+}
+
+static ssize_t perf_show_overcommit(struct sysdev_class *class, char *buf)
+{
+	return sprintf(buf, "%d\n", perf_overcommit);
+}
+
+static ssize_t
+perf_set_overcommit(struct sysdev_class *class, const char *buf, size_t count)
+{
+	unsigned long val;
+	int err;
+
+	err = strict_strtoul(buf, 10, &val);
+	if (err)
+		return err;
+	if (val > 1)
+		return -EINVAL;
+
+	mutex_lock(&perf_resource_mutex);
+	perf_overcommit = val;
+	mutex_unlock(&perf_resource_mutex);
+
+	return count;
+}
+
+static SYSDEV_CLASS_ATTR(
+				reserve_percpu,
+				0644,
+				perf_show_reserve_percpu,
+				perf_set_reserve_percpu
+			);
+
+static SYSDEV_CLASS_ATTR(
+				overcommit,
+				0644,
+				perf_show_overcommit,
+				perf_set_overcommit
+			);
+
+static struct attribute *perfclass_attrs[] = {
+	&attr_reserve_percpu.attr,
+	&attr_overcommit.attr,
+	NULL
+};
+
+static struct attribute_group perfclass_attr_group = {
+	.attrs			= perfclass_attrs,
+	.name			= "perf_counters",
+};
+
+static int __init perf_counter_sysfs_init(void)
+{
+	return sysfs_create_group(&cpu_sysdev_class.kset.kobj,
+				  &perfclass_attr_group);
+}
+device_initcall(perf_counter_sysfs_init);
+
diff --git a/kernel/sched.c b/kernel/sched.c
index b7480fb..254d56d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2212,6 +2212,27 @@ static int sched_balance_self(int cpu, int flag)
 
 #endif /* CONFIG_SMP */
 
+/**
+ * task_oncpu_function_call - call a function on the cpu on which a task runs
+ * @p:		the task to evaluate
+ * @func:	the function to be called
+ * @info:	the function call argument
+ *
+ * Calls the function @func only if the task is currently running. This might
+ * be on the current CPU, in which case the function is called directly.
+ */
+void task_oncpu_function_call(struct task_struct *p,
+			      void (*func) (void *info), void *info)
+{
+	int cpu;
+
+	preempt_disable();
+	cpu = task_cpu(p);
+	if (task_curr(p))
+		smp_call_function_single(cpu, func, info, 1);
+	preempt_enable();
+}
+
 /***
  * try_to_wake_up - wake up a thread
  * @p: the to-be-woken-up thread
@@ -2534,6 +2555,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
 		    struct task_struct *next)
 {
 	fire_sched_out_preempt_notifiers(prev, next);
+	perf_counter_task_sched_out(prev, cpu_of(rq));
 	prepare_lock_switch(rq, next);
 	prepare_arch_switch(next);
 }
@@ -2574,6 +2596,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	 */
 	prev_state = prev->state;
 	finish_arch_switch(prev);
+	perf_counter_task_sched_in(current, cpu_of(rq));
 	finish_lock_switch(rq, prev);
 #ifdef CONFIG_SMP
 	if (current->sched_class->post_schedule)
@@ -4296,6 +4319,7 @@ void scheduler_tick(void)
 	rq->idle_at_tick = idle_cpu(cpu);
 	trigger_load_balance(rq, cpu);
 #endif
+	perf_counter_task_tick(curr, cpu);
 }
 
 #if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index e14a232..4be8bbc 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -174,3 +174,6 @@ cond_syscall(compat_sys_timerfd_settime);
 cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
+
+/* performance counters: */
+cond_syscall(sys_perf_counter_open);
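
For illustration, here is a minimal user-space sketch of how the new syscall
and its read() protocol can be driven. The syscall number and the event/record
type constants below are placeholders - the real values come from the patched
kernel headers - so treat this as a sketch rather than a ready-made client:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_perf_counter_open
#define __NR_perf_counter_open	333	/* placeholder syscall number */
#endif
#define PERF_COUNT_CYCLES	0	/* placeholder id for the "cycles" hw event */
#define PERF_RECORD_SIMPLE	0	/* placeholder value: plain 64-bit readout */

int main(void)
{
	uint64_t count;

	/* event type, irq period (0: assumed to mean no sampling),
	   record type, pid (0: current task), cpu (-1: any cpu): */
	int fd = syscall(__NR_perf_counter_open,
			 PERF_COUNT_CYCLES, 0, PERF_RECORD_SIMPLE, 0, -1);

	if (fd < 0) {
		perror("perf_counter_open");
		return 1;
	}

	/* ... run the code to be measured ... */

	if (read(fd, &count, sizeof(count)) == sizeof(count))
		printf("cycles: %llu\n", (unsigned long long)count);

	close(fd);
	return 0;
}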

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-08  1:22 [patch] Performance Counters for Linux, v2 Ingo Molnar
@ 2008-12-08  1:49 ` Arjan van de Ven
  2008-12-08 11:49   ` Ingo Molnar
  2008-12-08  3:24 ` Paul Mackerras
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 33+ messages in thread
From: Arjan van de Ven @ 2008-12-08  1:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Thomas Gleixner, linux-arch, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter, Peter Anvin,
	Peter Zijlstra, Steven Rostedt, David Miller, Paul Mackerras

On Mon, 8 Dec 2008 02:22:12 +0100
Ingo Molnar <mingo@elte.hu> wrote:

> 
> [ Performance counters are special hardware registers available on
> most modern CPUs. These register count the number of certain types of
> hw events: such as instructions executed, cachemisses suffered, or 
>   branches mis-predicted, without slowing down the kernel or 
>   applications. These registers can also trigger interrupts when a 
>   threshold number of events have passed - and can thus be used to 
>   profile the code that runs on that CPU. ]
> 
> This is version 2 of our Performance Counters subsystem
> implementation.
> 
> The biggest user-visible change in this release is a new user-space 
> text-mode profiling utility that is based on this code: KernelTop.
> 
> KernelTop can be downloaded from:
> 
>   http://redhat.com/~mingo/perfcounters/kerneltop.c
> 
> It's a standalone .c file that needs no extra libraries - it only
> needs a CONFIG_PERF_COUNTERS=y kernel to run on.
> 
> This utility is intended for kernel developers - it's basically a
> dynamic kernel profiler that gets hardware counter events dispatched
> to it continuously, which it feeds into a histogram and outputs it 
> periodically.
> 

I played with this a little, and while it works neat, I wanted a
feature added where it shows a detailed profile for the top function.
I've hacked this in quickly (the usability isn't all that great yet)
and put the source code up at
http://www.tglx.de/~arjan/kerneltop-0.02.tar.gz

with this it looks like this:

$ sudo ./kerneltop --vmlinux=/home/arjan/linux-2.6.git/vmlinux

------------------------------------------------------------------------------
 KernelTop:     274 irqs/sec  [NMI, 1000000 cycles],  (all, 2 CPUs)
------------------------------------------------------------------------------

             events         RIP          kernel function
             ______   ________________   _______________

                230 - 00000000c04189e9 : read_hpet
                 82 - 00000000c0409439 : mwait_idle_with_hints
                 77 - 00000000c051a7b7 : acpi_os_read_port
                 52 - 00000000c053cb3a : acpi_idle_enter_bm
                 38 - 00000000c0418d93 : hpet_next_event
                 19 - 00000000c051a802 : acpi_os_write_port
                 14 - 00000000c04f8704 : __copy_to_user_ll
                 13 - 00000000c0460c20 : get_page_from_freelist
                  7 - 00000000c041c96c : kunmap_atomic
                  5 - 00000000c06a30d2 : _spin_lock	[joydev]
                  4 - 00000000c04f79b7 : vsnprintf	[snd_seq]
                  4 - 00000000c06a3048 : _spin_lock_irqsave	[pcspkr]
                  3 - 00000000c0403b3c : irq_entries_start
                  3 - 00000000c0423fee : run_rebalance_domains
                  3 - 00000000c0425e2c : scheduler_tick
                  3 - 00000000c0430938 : get_next_timer_interrupt
                   3 - 00000000c043cdfa : __update_sched_clock
                  3 - 00000000c0448b14 : update_iter
                  2 - 00000000c04304bd : run_timer_softirq

Showing details for read_hpet
       0	c04189e9 <read_hpet>:
       2	c04189e9:	a1 b0 e0 89 c0       	mov    0xc089e0b0,%eax
       0	
       0	/*
       0	 * Clock source related code
       0	 */
       0	static cycle_t read_hpet(void)
       0	{
       1	c04189ee:	55                   	push   %ebp
       0	c04189ef:	89 e5                	mov    %esp,%ebp
       1	c04189f1:	05 f0 00 00 00       	add    $0xf0,%eax
       0	c04189f6:	8b 00                	mov    (%eax),%eax
       0		return (cycle_t)hpet_readl(HPET_COUNTER);
       0	}
     300	c04189f8:	31 d2                	xor    %edx,%edx
       0	c04189fa:	5d                   	pop    %ebp
       0	c04189fb:	c3                   	ret    
       0	

As is usual with profile outputs, the cost for the function always gets added to the instruction after
the really guilty one. I'd move the count one back, but this is hard if the previous instruction was a
(conditional) jump...

-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-08  1:22 [patch] Performance Counters for Linux, v2 Ingo Molnar
  2008-12-08  1:49 ` Arjan van de Ven
@ 2008-12-08  3:24 ` Paul Mackerras
  2008-12-08 11:33   ` Ingo Molnar
  2008-12-08  8:32 ` Corey J Ashford
  2008-12-09  6:37 ` stephane eranian
  3 siblings, 1 reply; 33+ messages in thread
From: Paul Mackerras @ 2008-12-08  3:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Thomas Gleixner, linux-arch, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter, Arjan van de Veen,
	Peter Anvin, Peter Zijlstra, Steven Rostedt, David Miller

Ingo Molnar writes:

> There's a new "counter group record" facility that is a straightforward 
> extension of the existing "irq record" notification type. This record 
> type can be set on a 'master' counter, and if the master counter triggers 
> an IRQ or an NMI, all the 'secondary' counters are read out atomically 
> and are put into the counter-group record. The result can then be read() 
> out by userspace via a single system call. (Based on extensive feedback 
> from Paul Mackerras and David Miller, thanks guys!)
> 
> The other big change is the support of virtual task counters via counter 
> scheduling: a task can specify more counters than there are on the CPU, 
> the kernel will then schedule the counters periodically to spread out hw 
> resources.

Still not good enough, I'm sorry.

* I have no guarantee that the secondary counters were all counting
  at the same time(s) as the master counter, so the numbers are
  virtually useless.

* I might legitimately want to be notified based on any of the
  "secondary" counters reaching particular values.  The "master"
  vs. "secondary" distinction is an artificial one that is going to
  make certain reasonable use-cases impossible.

These things are both symptoms of the fact that you still have the
abstraction at the wrong level.  The basic abstraction really needs to
be a counter-set, not an individual counter.

I think your patch can be extended to do counter-sets without
complicating the interface too much.  We could have:

struct event_spec {
	u32 hw_event_type;
	u32 hw_event_period;
	u64 hw_raw_ctrl;
};

int perf_counterset_open(u32 n_counters,
    			 struct event_spec *counters,
			 u32 record_type,
			 pid_t pid,
			 int cpu);

and then you could have perf_counter_open as a simple wrapper around
perf_counterset_open.
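
For concreteness, the wrapper might look something like this (just a sketch
built on the hypothetical perf_counterset_open() above):

int perf_counter_open(u32 hw_event_type,
		      u32 hw_event_period,
		      u32 record_type,
		      pid_t pid,
		      int cpu)
{
	/* a single counter is simply a counter-set of size 1: */
	struct event_spec spec = {
		.hw_event_type		= hw_event_type,
		.hw_event_period	= hw_event_period,
		.hw_raw_ctrl		= 0,
	};

	return perf_counterset_open(1, &spec, record_type, pid, cpu);
}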

With an approach like this we can also provide an "exclusive" mode for
the PMU (e.g. with a flag bit in record_type or n_counters), which
means that the counter-set occupies the whole PMU.  That will give a
way for userspace to specify all the details of how the PMU is to be
programmed, which in turn means that the kernel doesn't need to know
all the arcane details of every event on every processor; it just
needs to know the common events.

I notice the implementation also still assumes it can add any counter
at any time subject only to a limit on the number of counters in use.
That will have to be fixed before it is usable on powerpc (and
apparently on some x86 processors too).

Paul.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-08  1:22 [patch] Performance Counters for Linux, v2 Ingo Molnar
  2008-12-08  1:49 ` Arjan van de Ven
  2008-12-08  3:24 ` Paul Mackerras
@ 2008-12-08  8:32 ` Corey J Ashford
  2008-12-09  6:37 ` stephane eranian
  3 siblings, 0 replies; 33+ messages in thread
From: Corey J Ashford @ 2008-12-08  8:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Arjan van de Veen, Peter Zijlstra, Eric Dumazet,
	David Miller, Stephane Eranian, Peter Anvin, linux-arch,
	linux-kernel, linux-kernel-owner, Paul Mackerras, Robert Richter,
	Steven Rostedt, Thomas Gleixner

linux-kernel-owner@vger.kernel.org wrote on 12/07/2008 05:22:12 PM:
[snip]
> 
> The other big change is the support of virtual task counters via counter 
> scheduling: a task can specify more counters than there are on the CPU, 
> the kernel will then schedule the counters periodically to spread out hw 
> resources. So for example if a task starts 6 counters on a CPU that has 
> only two hardware counters, it still gets this output:
> 
>  counter[0 cycles              ]:   5204680573 , delta:   1733680843 events
>  counter[1 instructions        ]:   1364468045 , delta:    454818351 events
>  counter[2 cache-refs          ]:        12732 , delta:         4399 events
>  counter[3 cache-misses        ]:         1009 , delta:          336 events
>  counter[4 branch-instructions ]:    125993304 , delta:     42006998 events
>  counter[5 branch-misses       ]:         1946 , delta:          649 events
>

Hello Ingo,

I posted some questions about this capability in your proposal on LKML, 
but I wasn't able to get the reply threaded in properly.  Could you take a 
look at this post, please?

http://lkml.org/lkml/2008/12/5/299

Thanks for your consideration,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR 
503-578-3507 
cjashfor@us.ibm.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-08  3:24 ` Paul Mackerras
@ 2008-12-08 11:33   ` Ingo Molnar
  2008-12-08 12:02     ` David Miller
                       ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Ingo Molnar @ 2008-12-08 11:33 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: linux-kernel, Thomas Gleixner, linux-arch, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter, Arjan van de Veen,
	Peter Anvin, Peter Zijlstra, Steven Rostedt, David Miller


* Paul Mackerras <paulus@samba.org> wrote:

> Ingo Molnar writes:
> 
> > There's a new "counter group record" facility that is a straightforward 
> > extension of the existing "irq record" notification type. This record 
> > type can be set on a 'master' counter, and if the master counter triggers 
> > an IRQ or an NMI, all the 'secondary' counters are read out atomically 
> > and are put into the counter-group record. The result can then be read() 
> > out by userspace via a single system call. (Based on extensive feedback 
> > from Paul Mackerras and David Miller, thanks guys!)
> > 
> > The other big change is the support of virtual task counters via counter 
> > scheduling: a task can specify more counters than there are on the CPU, 
> > the kernel will then schedule the counters periodically to spread out hw 
> > resources.
> 
> Still not good enough, I'm sorry.
> 
> * I have no guarantee that the secondary counters were all counting
>   at the same time(s) as the master counter, so the numbers are
>   virtually useless.

If you want a _guarantee_ that multiple counters can count at once you 
can still do it: for example by using the separate, orthogonal 
reservation mechanism we had in -v1 already.

Also, you don't _have to_ overcommit counters.

Your whole statistical argument that group readout is a must-have for 
precision is fundamentally flawed as well: counters _themselves_, as used 
by most applications, by their nature, are a statistical sample to begin 
with. There's way too many hardware events to track each of them 
unintrusively - so this type of instrumentation is _all_ sampling based, 
and fundamentally so. (with a few narrow exceptions such as single-event 
interrupts for certain rare event types)

This means that the only correct technical/mathematical argument is to 
talk about "levels of noise" and how they compare and correlate - and 
i've seen no actual measurements or estimations pro or contra. Group 
readout of counters can reduce noise for sure, but it is wrong for you to 
try to turn this into some sort of all-or-nothing property. Other sources 
of noise tend to be of much higher magnitude.

You need really stable workloads to see such low noise levels that group 
readout of counters starts to matter - and the thing is that often such 
'stable' workloads are rather boringly artificial, because in real life 
there's no such thing as a stable workload.

Finally, the basic API to user-space is not the way to impose the rigid "I 
own the whole PMU" notion that you are pushing. That notion can be 
achieved in different, system administration means - and a perf-counter 
reservation facility was included in the v1 patchset.

Note that you are doing something that is a kernel design no-no: you are 
trying to design a "guarantee" for hardware constraints by complicating 
it into the userspace ABI - and that is a fundamentally losing 
proposition.

It's a tail-wags-the-dog design situation that we are routinely resisting 
in the upstream kernel: you are putting hardware constraints ahead of 
usability, you are putting hardware constraints ahead of sane interface 
design - and such an approach is wrong and shortsighted on every level.

It's also shortsighted because it's a red herring: there's nothing that 
forbids the counter scheduler from listening to the hw constraints, for 
CPUs where there's a lot of counter constraints.

> * I might legitimately want to be notified based on any of the
>   "secondary" counters reaching particular values.  The "master" vs. 
>   "secondary" distinction is an artificial one that is going to make 
>   certain reasonable use-cases impossible.

the secondary counters can cause records too - independently of the 
master counter. This is because the objects (and fds) are separate so 
there's no restriction at all on the secondary counters. This is a lot 
less natural to do if you have a "vector of counters" abstraction.

> These things are both symptoms of the fact that you still have the 
> abstraction at the wrong level.  The basic abstraction really needs to 
> be a counter-set, not an individual counter.

Being per object is a very fundamental property of Linux, and you have to 
understand and respect that down to your bone if you want to design new 
syscall ABIs for Linux.

The "perfmon v3 light" system calls, all five of them, are a classic 
laundry list of what _not_ to do in new Linux APIs: they are too 
specific, too complex and way too limited on every level.

Per object and per fd abstractions are a _very strong_ conceptual 
property of Linux. Look at what they bring in the performance counters 
case:

 - All the VFS syscalls work naturally: sys_read(), sys_close(),
   sys_dup(), you name it.

 - It makes all counters poll()able. Any subset of them, and at any time,
   independently of any context descriptor. Look at kerneltop.c: it has a
   USE_POLLING switch to switch to a poll() loop, and it just works the 
   way you'd expect it to work.
 
 - We can share fds between monitor threads and you can do a thread pool
   that works down new events - without forcing any counter scheduling in
   the monitored task.

 - It makes the same task monitorable by multiple monitors, trivially
   so. There's no forced context notion that privatizes the PMU - with 
   some 'virtual context' extra dimension slapped on top of it.

 - Doing a proper per object abstraction simplifies event and error
   handling significantly: instead of having to work down a vector of 
   counters and demultiplexing events and matching them up to individual 
   counters, the demultiplexing is done by the _kernel_.

 - It makes counter scheduling very dynamic. Instead of exposing
   user-space to a static "counter allocation" (with all the insane ABI
   and kernel internal complications this brings), perf-counters
   subsystem does not expose user-space to such scheduling details
   _at all_.

 - Difference in complexity. The "v3 light" version of perfmon (which 
   does not even schedule any PMU contexts), contains these core kernel 
   files:

         19 files changed, 4424 insertions(+)

   Our code has this core kernel impact:

         10 files changed, 1191 insertions(+)

   And in some areas it's already more capable than "perfmon v3".
   The difference is very obvious.

All in one, using the 1:1 fd:counter design is a powerful, modern Linux 
abstraction to its core. It's much easier to think about for application 
developers as well, so we'll see a much sharper adoption rate. 
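
To put that in code: monitoring two hw events of a task is just two
independent fds (a sketch - the event ids and the thin perf_counter_open()
wrapper around the syscall are placeholders):

	/* 'pid' is the task to monitor */
	int cycles_fd = perf_counter_open(PERF_COUNT_CYCLES,
					  0, PERF_RECORD_SIMPLE, pid, -1);
	int insns_fd  = perf_counter_open(PERF_COUNT_INSTRUCTIONS,
					  0, PERF_RECORD_SIMPLE, pid, -1);
	uint64_t cycles, insns;

	/* each counter is read (or poll()ed, dup()ed, passed around)
	   on its own - no context object anywhere in sight: */
	read(cycles_fd, &cycles, sizeof(cycles));
	read(insns_fd,  &insns,  sizeof(insns));

	if (cycles)
		printf("insns per cycle: %.2f\n", (double)insns / cycles);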

Also, i noticed that your claims about our code tend to be rather 
abstract and are often dwelling on issues that IMO have no big practical 
relevance - so may i suggest the following approach instead to break the 
(mutual!) cycle of miscommunication: if you think an issue is important, 
could you please point out the problem in practical terms what you think 
would not be possible with our scheme? We tend to prioritize items by 
practical value.

Things like: "kerneltop would not be as accurate with: ..., to the level 
of adding 5% of extra noise.". Would that work for you?

> I think your patch can be extended to do counter-sets without 
> complicating the interface too much.  We could have:
> 
> struct event_spec {
> 	u32 hw_event_type;
> 	u32 hw_event_period;
> 	u64 hw_raw_ctrl;
> };

This needless vectoring and the exposing of contexts would kill many good 
properties of the new subsystem, without any tangible benefits - see 
above.

This is really scheduling school 101: a hardware context allocation is 
the _last_ thing we want to expose to user-space in this particular case. 
This is a fundamental property of hardware resource scheduling. We _don't_ 
want to tie the hands of the kernel by putting resource scheduling into 
user-space!

Your arguments remind me a bit of the "user-space threads have to be 
scheduled in user-space!" N:M threading design discussions we had years 
ago. IBM folks were pushing NGPT very strongly back then and claimed that 
it's the right design for high-performance threading, etc. etc.

In reality, doing user-space scheduling for cheap-to-context-switch 
hardware resources was a fundamentally wrong proposition back then too, 
and it is still the wrong concept today as well.

> int perf_counterset_open(u32 n_counters,
>     			 struct event_spec *counters,
> 			 u32 record_type,
> 			 pid_t pid,
> 			 int cpu);
> 
> and then you could have perf_counter_open as a simple wrapper around 
> perf_counterset_open.
> 
> With an approach like this we can also provide an "exclusive" mode for 
> the PMU [...]

You can already allocate "exclusive" counters in a guaranteed way via our 
code, here and today.

> [...] (e.g. with a flag bit in record_type or n_counters), which means 
> that the counter-set occupies the whole PMU.  That will give a way for 
> userspace to specify all the details of how the PMU is to be 
> programmed, which in turn means that the kernel doesn't need to know 
> all the arcane details of every event on every processor; it just needs 
> to know the common events.
> 
> I notice the implementation also still assumes it can add any counter 
> at any time subject only to a limit on the number of counters in use. 
> That will have to be fixed before it is usable on powerpc (and 
> apparently on some x86 processors too).

There's constrained PMCs on x86 too, as you mention. Instead of repeating 
the answer that i gave before (that this is easy and natural), how about 
this approach: if we added real, working support for constrained PMCs on 
x86, that will then address this point of yours rather forcefully, 
correct?

	Ingo

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-08  1:49 ` Arjan van de Ven
@ 2008-12-08 11:49   ` Ingo Molnar
  2009-01-07  7:43     ` Zhang, Yanmin
  0 siblings, 1 reply; 33+ messages in thread
From: Ingo Molnar @ 2008-12-08 11:49 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-kernel, Thomas Gleixner, linux-arch, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter, Peter Anvin,
	Peter Zijlstra, Steven Rostedt, David Miller, Paul Mackerras


* Arjan van de Ven <arjan@infradead.org> wrote:

> On Mon, 8 Dec 2008 02:22:12 +0100
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > 
> > [ Performance counters are special hardware registers available on
> > most modern CPUs. These register count the number of certain types of
> > hw events: such as instructions executed, cachemisses suffered, or 
> >   branches mis-predicted, without slowing down the kernel or 
> >   applications. These registers can also trigger interrupts when a 
> >   threshold number of events have passed - and can thus be used to 
> >   profile the code that runs on that CPU. ]
> > 
> > This is version 2 of our Performance Counters subsystem
> > implementation.
> > 
> > The biggest user-visible change in this release is a new user-space 
> > text-mode profiling utility that is based on this code: KernelTop.
> > 
> > KernelTop can be downloaded from:
> > 
> >   http://redhat.com/~mingo/perfcounters/kerneltop.c
> > 
> > It's a standalone .c file that needs no extra libraries - it only
> > needs a CONFIG_PERF_COUNTERS=y kernel to run on.
> > 
> > This utility is intended for kernel developers - it's basically a
> > dynamic kernel profiler that gets hardware counter events dispatched
> > to it continuously, which it feeds into a histogram and outputs it 
> > periodically.
> > 
> 
> I played with this a little, and while it works neat, I wanted a 
> feature added where it shows a detailed profile for the top function.

ah, very nice idea!

> I've hacked this in quickly (the usability isn't all that great yet) 
> and put the source code up at
>
> http://www.tglx.de/~arjan/kerneltop-0.02.tar.gz

ok, picked it up :-)

> with this it looks like this:
> 
> $ sudo ./kerneltop --vmlinux=/home/arjan/linux-2.6.git/vmlinux
> 
> ------------------------------------------------------------------------------
>  KernelTop:     274 irqs/sec  [NMI, 1000000 cycles],  (all, 2 CPUs)
> ------------------------------------------------------------------------------
> 
>              events         RIP          kernel function
>              ______   ________________   _______________
> 
>                 230 - 00000000c04189e9 : read_hpet
>                  82 - 00000000c0409439 : mwait_idle_with_hints
>                  77 - 00000000c051a7b7 : acpi_os_read_port
>                  52 - 00000000c053cb3a : acpi_idle_enter_bm
>                  38 - 00000000c0418d93 : hpet_next_event
>                  19 - 00000000c051a802 : acpi_os_write_port
>                  14 - 00000000c04f8704 : __copy_to_user_ll
>                  13 - 00000000c0460c20 : get_page_from_freelist
>                   7 - 00000000c041c96c : kunmap_atomic
>                   5 - 00000000c06a30d2 : _spin_lock	[joydev]
>                   4 - 00000000c04f79b7 : vsnprintf	[snd_seq]
>                   4 - 00000000c06a3048 : _spin_lock_irqsave	[pcspkr]
>                   3 - 00000000c0403b3c : irq_entries_start
>                   3 - 00000000c0423fee : run_rebalance_domains
>                   3 - 00000000c0425e2c : scheduler_tick
>                   3 - 00000000c0430938 : get_next_timer_interrupt
>                   3 - 00000000c043cdfa : __update_sched_clock
>                   3 - 00000000c0448b14 : update_iter
>                   2 - 00000000c04304bd : run_timer_softirq
> 
> Showing details for read_hpet
>        0	c04189e9 <read_hpet>:
>        2	c04189e9:	a1 b0 e0 89 c0       	mov    0xc089e0b0,%eax
>        0	
>        0	/*
>        0	 * Clock source related code
>        0	 */
>        0	static cycle_t read_hpet(void)
>        0	{
>        1	c04189ee:	55                   	push   %ebp
>        0	c04189ef:	89 e5                	mov    %esp,%ebp
>        1	c04189f1:	05 f0 00 00 00       	add    $0xf0,%eax
>        0	c04189f6:	8b 00                	mov    (%eax),%eax
>        0		return (cycle_t)hpet_readl(HPET_COUNTER);
>        0	}
>      300	c04189f8:	31 d2                	xor    %edx,%edx
>        0	c04189fa:	5d                   	pop    %ebp
>        0	c04189fb:	c3                   	ret    
>        0	

very nice and useful output! This for example shows that it's the readl() 
on the HPET_COUNTER IO address that is causing the overhead. That is to 
be expected - HPET is mapped uncached and the access goes out to the 
chipset.

> As is usual with profile outputs, the cost for the function always gets 
> added to the instruction after the really guilty one. I'd move the 
> count one back, but this is hard if the previous instruction was a 
> (conditional) jump...

yeah. Sometimes the delay can be multiple instructions - so it's best to 
leave the profiling picture as pristine as possible, and let the kernel 
developer choose the right counter type that displays particular problem 
areas in the most expressive way.

For example when i'm doing SMP scalability work, i generally look at 
cachemiss counts, for cacheline bouncing. The following kerneltop output 
shows last-level data-cache misses in the kernel during a tbench 64 run 
on a 16-way box, using latest mainline -git:

------------------------------------------------------------------------------
 KernelTop:    3744 irqs/sec  [NMI, 1000 cache-misses],  (all, 16 CPUs)
------------------------------------------------------------------------------

             events         RIP          kernel function
             ______   ________________   _______________

               7757 - ffffffff804d723e : dst_release
               7649 - ffffffff804e3611 : eth_type_trans
               6402 - ffffffff8050e470 : tcp_established_options
               5975 - ffffffff804fa054 : ip_rcv_finish
               5530 - ffffffff80365fb0 : copy_user_generic_string!
               3979 - ffffffff804ccf0c : skb_push
               3474 - ffffffff804fe6cb : ip_queue_xmit
               1950 - ffffffff804cdcdd : skb_release_head_state
               1595 - ffffffff804cce4f : skb_copy_and_csum_dev
               1365 - ffffffff80501079 : __inet_lookup_established
                908 - ffffffff804fa5fc : ip_local_deliver_finish
                743 - ffffffff8036cbcc : unmap_single
                452 - ffffffff80569402 : _read_lock
                411 - ffffffff80283010 : get_page_from_freelist
                410 - ffffffff80505b16 : tcp_sendmsg
                406 - ffffffff8028631a : put_page
                386 - ffffffff80509067 : tcp_ack
                204 - ffffffff804d2d55 : netif_rx
                194 - ffffffff8050b94b : tcp_data_queue

Cachemiss event samples tend to line up quite close to the instructions 
that cause them.

Looking at pure cycles (same workload) gives a different view:

------------------------------------------------------------------------------
 KernelTop:   27357 irqs/sec  [NMI, 1000000 cycles],  (all, 16 CPUs)
------------------------------------------------------------------------------

             events         RIP          kernel function
             ______   ________________   _______________

              16602 - ffffffff80365fb0 : copy_user_generic_string!
               7947 - ffffffff80505b16 : tcp_sendmsg
               7450 - ffffffff80509067 : tcp_ack
               7384 - ffffffff80332881 : avc_has_perm_noaudit
               6888 - ffffffff80504e7c : tcp_recvmsg
               6564 - ffffffff8056745e : schedule
               6170 - ffffffff8050ecd5 : tcp_transmit_skb
               4949 - ffffffff8020a75b : __switch_to
               4417 - ffffffff8050cc4f : tcp_rcv_established
               4283 - ffffffff804d723e : dst_release
               3842 - ffffffff804fed58 : ip_finish_output
               3760 - ffffffff804fe6cb : ip_queue_xmit
               3580 - ffffffff80501079 : __inet_lookup_established
               3540 - ffffffff80514ce5 : tcp_v4_rcv
               3475 - ffffffff8026c31f : audit_syscall_exit
               3411 - ffffffffff600130 : vread_hpet
               3267 - ffffffff802a73de : kfree
               3058 - ffffffff804d39ed : dev_queue_xmit
               3047 - ffffffff804eecf8 : nf_iterate

Cycles overhead tends to be harder to match up with instructions.

	Ingo

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-08 11:33   ` Ingo Molnar
@ 2008-12-08 12:02     ` David Miller
  2008-12-08 14:41     ` Andi Kleen
  2008-12-08 22:03     ` Paul Mackerras
  2 siblings, 0 replies; 33+ messages in thread
From: David Miller @ 2008-12-08 12:02 UTC (permalink / raw)
  To: mingo
  Cc: paulus, linux-kernel, tglx, linux-arch, akpm, eranian, dada1,
	robert.richter, arjan, hpa, a.p.zijlstra, rostedt

From: Ingo Molnar <mingo@elte.hu>
Date: Mon, 8 Dec 2008 12:33:18 +0100

> Your whole statistical argument that group readout is a must-have for 
> precision is fundamentally flawed as well: counters _themselves_, as used 
> by most applications, by their nature, are a statistical sample to begin 
> with. There's way too many hardware events to track each of them 
> unintrusively - so this type of instrumentation is _all_ sampling based, 
> and fundamentally so. (with a few narrow exceptions such as single-event 
> interrupts for certain rare event types)

There are a lot of people who are going to fundamentally
disagree with this, myself included.

A lot of things are being stated about what people do with this stuff,
but I think there are people working longer in this area who quite
possibly know a lot better.  But they were blindsided by this new work
instead of being consulted, which was pretty unnice.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-08 11:33   ` Ingo Molnar
  2008-12-08 12:02     ` David Miller
@ 2008-12-08 14:41     ` Andi Kleen
  2008-12-08 22:03     ` Paul Mackerras
  2 siblings, 0 replies; 33+ messages in thread
From: Andi Kleen @ 2008-12-08 14:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul Mackerras, linux-kernel, Thomas Gleixner, linux-arch,
	Andrew Morton, Stephane Eranian, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Peter Zijlstra, Steven Rostedt,
	David Miller

Ingo Molnar <mingo@elte.hu> writes:

> This means that the only correct technical/mathematical argument is to 
> talk about "levels of noise" and how they compare and correlate - and 
> i've seen no actual measurements or estimations pro or contra. Group 
> readout of counters can reduce noise for sure, but it is wrong for you to 
> try to turn this into some sort of all-or-nothing property. Other sources 
> of noise tend to be of much higher of magnitude.

Ingo, could you please describe how PEBS and IBS fit into your model?

Thanks.

-Andi

-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-08 11:33   ` Ingo Molnar
  2008-12-08 12:02     ` David Miller
  2008-12-08 14:41     ` Andi Kleen
@ 2008-12-08 22:03     ` Paul Mackerras
  2008-12-09 13:00       ` Ingo Molnar
  2 siblings, 1 reply; 33+ messages in thread
From: Paul Mackerras @ 2008-12-08 22:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Thomas Gleixner, linux-arch, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter, Arjan van de Veen,
	Peter Anvin, Peter Zijlstra, Steven Rostedt, David Miller

Ingo Molnar writes:

> If you want a _guarantee_ that multiple counters can count at once you 
> can still do it: for example by using the separate, orthogonal 
> reservation mechanism we had in -v1 already.

Is that this?

" - There's a /sys based reservation facility that allows the allocation
   of a certain number of hw counters for guaranteed sysadmin access."

Sounds like I can't do that as an ordinary user, even on my own
processes...

I don't want the whole PMU all the time, I just want it while my
monitored process is running, and only on the CPU where it is
running.

> Also, you dont _have to_ overcommit counters.
> 
> Your whole statistical argument that group readout is a must-have for 
> precision is fundamentally flawed as well: counters _themselves_, as used 
> by most applications, by their nature, are a statistical sample to begin 
> with. There's way too many hardware events to track each of them 
> unintrusively - so this type of instrumentation is _all_ sampling based, 
> and fundamentally so. (with a few narrow exceptions such as single-event 
> interrupts for certain rare event types)

No - at least on the machines I'm familiar with, I can count every
single cache miss and hit at every level of the memory hierarchy,
every single TLB miss, every load and store instruction, etc. etc.

I want to be able to work out things like cache hit rates, just as one
example.  To do that I need two numbers that are directly comparable
because they relate to the same set of instructions.  If I have a
count of L1 Dcache hits for one set of instructions and a count of L1
Dcache misses over some different stretch of instructions, the ratio
of them doesn't mean anything.

Your argument about "it's all statistical" is bogus because even if
the things we are measuring are statistical, that's still no excuse
for being sloppy about how we make our estimates.  And not being able
to have synchronized counters is just sloppy.  The users want it, the
hardware provides it, so that makes it a must-have as far as I am
concerned.

> This means that the only correct technical/mathematical argument is to 
> talk about "levels of noise" and how they compare and correlate - and 
> i've seen no actual measurements or estimations pro or contra. Group 
> readout of counters can reduce noise for sure, but it is wrong for you to 
> try to turn this into some sort of all-or-nothing property. Other sources 
> of noise tend to be of much higher of magnitude.

What can you back that assertion up with?

> You need really stable workloads to see such low noise levels that group 
> readout of counters starts to matter - and the thing is that often such 
> 'stable' workloads are rather boringly artificial, because in real life 
> there's no such thing as a stable workload.

More unsupported assertions that sound wrong to me...

> Finally, the basic API to user-space is not the way to impose rigid "I 
> own the whole PMU" notion that you are pushing. That notion can be 
> achieved in different, system administration means - and a perf-counter 
> reservation facility was included in the v1 patchset.

Only for root, which isn't good enough.

What I was proposing was NOT a rigid notion - you don't have to own
the whole PMU if you are happy to use the events that the kernel knows
about.  If you do want the whole PMU, you can have it while the
process you're monitoring is running, and the kernel will
context-switch it between you and other users, who can also have the
whole PMU when their processes are running.

> Note that you are doing something that is a kernel design no-no: you are 
> trying to design a "guarantee" for hardware constraints by complicating 
> it into the userpace ABI - and that is a fundamentally losing 
> proposition.

Perhaps you have misunderstood my proposal.  A counter-set doesn't
have to be the whole PMU, and you can have multiple counter-sets
active at the same time as long as they fit.  You can even have
multiple "whole PMU" counter-sets and the kernel will multiplex them
onto the real PMU.

> It's a tail-wags-the-dog design situation that we are routinely resisting 
> in the upstream kernel: you are putting hardware constraints ahead of 
> usability, you are putting hardware constraints ahead of sane interface 
> design - and such an approach is wrong and shortsighted on every level.

Well, I'll ignore the patronizing tone (but please try to avoid it in
future).

The PRIMARY reason for wanting counter-sets is because THAT IS WHAT
THE USERS WANT.  A "usable" and "sane" interface design that doesn't
do what users want is useless.

Anyway, my proposal is just as "usable" as yours, since users still
have perf_counter_open, exactly as in your proposal.  Users with
simpler requirements can do things exactly the same way as with your
proposal.

> It's also shortsighted because it's a red herring: there's nothing that 
> forbids the counter scheduler from listening to the hw constraints, for 
> CPUs where there's a lot of counter constraints.

Handling the counter constraints is indeed a matter of implementation,
and as I noted previously, your current proposed implementation
doesn't handle them.

> Being per object is a very fundamental property of Linux, and you have to 
> understand and respect that down to your bone if you want to design new 
> syscall ABIs for Linux.

It's the choice of a single counter as being your "object" that I
object to. :)

>  - It makes counter scheduling very dynamic. Instead of exposing
>    user-space to a static "counter allocation" (with all the insane ABI
>    and kernel internal complications this brings), perf-counters
>    subsystem does not expose user-space to such scheduling details
>    _at all_.

Which is not necessarily a good thing.  Fundamentally, if you are
trying to measure something, and you get a number, you need to know
what exactly got measured.

For example, suppose I am trying to count TLB misses during the
execution of a program.  If my TLB miss counter keeps getting bumped
off because the kernel is scheduling my counter along with a dozen
other counters, then I *at least* want to know about it, and
preferably control it.  Otherwise I'll be getting results that vary by
an order of magnitude with no way to tell why.

> All in one, using the 1:1 fd:counter design is a powerful, modern Linux 
> abstraction to its core. It's much easier to think about for application 
> developers as well, so we'll see a much sharper adoption rate. 

For simple things, yes it is simpler.  But it can't do the more
complex things in any sort of clean or sane way.

> Also, i noticed that your claims about our code tend to be rather 
> abstract

... because the design of your code is wrong at an abstract level ...

> and are often dwelling on issues that IMO have no big practical 
> relevance - so may i suggest the following approach instead to break the 
> (mutual!) cycle of miscommunication: if you think an issue is important, 
> could you please point out the problem in practical terms what you think 
> would not be possible with our scheme? We tend to prioritize items by 
> practical value.
> 
> Things like: "kerneltop would not be as accurate with: ..., to the level 
> of adding 5% of extra noise.". Would that work for you?

OK, here's an example.  I have an application whose execution has
several different phases, and I want to measure the L1 Icache hit rate
and the L1 Dcache hit rate as a function of time and make a graph.  So
I need counters for L1 Icache accesses, L1 Icache misses, L1 Dcache
accesses, and L1 Dcache misses.  I want to sample at 1ms intervals.
The CPU I'm running on has two counters.

With your current proposal, I don't see any way to make sure that the
counter scheduler counts L1 Dcache accesses and L1 Dcache misses at
the same time, then schedules L1 Icache accesses and L1 Icache
misses.  I could end up with L1 Dcache accesses and L1 Icache
accesses, then L1 Dcache misses and L1 Icache misses - and get a
nonsensical situation like the misses being greater than the accesses.
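
To make the failure mode concrete, here is a toy sketch in C
(illustrative numbers only - nothing here is taken from the patchset):

/*
 * Toy illustration: a program with two phases and a CPU with two
 * counters.  If the counter scheduler happens to count "D-cache
 * accesses" during the quiet phase and "D-cache misses" during the
 * D-cache-heavy phase, the raw readings claim more misses than
 * accesses.  All numbers are made up.
 */
#include <stdio.h>

int main(void)
{
        /* ground truth per phase: accesses, misses */
        long acc_A = 100000,  miss_A = 1000;    /* quiet phase */
        long acc_B = 2000000, miss_B = 150000;  /* D-cache heavy phase */

        long reported_acc  = acc_A;     /* accesses counted in phase A only */
        long reported_miss = miss_B;    /* misses counted in phase B only */

        printf("true miss rate: %.2f%%\n",
               100.0 * (miss_A + miss_B) / (acc_A + acc_B));
        printf("reported: %ld accesses, %ld misses (\"%.0f%%\" miss rate)\n",
               reported_acc, reported_miss,
               100.0 * reported_miss / reported_acc);
        return 0;
}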

> This needless vectoring and the exposing of contexts would kill many good 
> properties of the new subsystem, without any tangible benefits - see 
> above.

No.  Where did you get contexts from?  I didn't write anything about
contexts.  Please read what I wrote.

> This is really scheduling school 101: a hardware context allocation is 
> the _last_ thing we want to expose to user-space in this particular case. 

Please drop the patronizing tone, again.

What user-space applications want to be able to do is this:

* Ensure that a set of counters are all counting at the same time.

* Know when counters get scheduled on and off the process so that the
  results can be interpreted properly.  Either that or be able to
  control the scheduling.

* Sophisticated applications want to be able to do things with the PMU
  that the kernel doesn't necessarily understand.

> This is a fundamental property of hardware resource scheduling. We _dont_ 
> want to tie the hands of the kernel by putting resource scheduling into 
> user-space!

You'd rather provide useless numbers to userspace? :)

> Your arguments remind me a bit of the "user-space threads have to be 
> scheduled in user-space!" N:M threading design discussions we had years 
> ago. IBM folks were pushing NGPT very strongly back then and claimed that 
> it's the right design for high-performance threading, etc. etc.

Your arguments remind me of a filesystem that a colleague of mine once
designed that only had files, but no directories (you could have "/"
characters in the filenames, though).  This whole discussion is a bit
like you arguing that directories are an unnecessary complication that
only messes up the interface and adds extra system calls.

> You can already allocate "exclusive" counters in a guaranteed way via our 
> code, here and today.

But then I don't get context-switching between processes.

> There's constrained PMCs on x86 too, as you mention. Instead of repeating 
> the answer that i gave before (that this is easy and natural), how about 
> this approach: if we added real, working support for constrained PMCs on 
> x86, that will then address this point of yours rather forcefully, 
> correct?

It still means we end up having to add something approaching 29,000
lines of code and 320kB to the kernel, just for the IBM 64-bit PowerPC
processors.  (I don't guarantee that code is optimal, but that is some
indication of the complexity required.)

I am perfectly happy to add code for the kernel to know about the most
commonly-used, simple events on those processors.  But I surely don't
want to have to teach the kernel about every last event and every last
capability of those machines' PMUs.

For example, there is a facility on POWER6 where certain instructions
can be selected (based on (instruction_word & mask) == value) and
marked, and then there are events that allow you to measure how long
marked instructions take in various stages of execution.  How would I
make such a feature available for applications to use, within your
framework?
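
(For readers unfamiliar with the facility: the selection is literally a
mask/match predicate on the instruction encoding.  A minimal sketch,
with made-up mask/value constants that select PowerPC primary opcode 31:)

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* an instruction is "marked" when (instruction_word & mask) == value */
static bool insn_is_marked(uint32_t insn, uint32_t mask, uint32_t value)
{
        return (insn & mask) == value;
}

int main(void)
{
        uint32_t mask  = 0xfc000000;    /* primary opcode field (top 6 bits) */
        uint32_t value = 31u << 26;     /* primary opcode 31 */

        /* 0x7c0802a6 is "mflr r0" (opcode 31) - it would be marked */
        printf("%d\n", insn_is_marked(0x7c0802a6, mask, value));
        return 0;
}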

Paul.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-08  1:22 [patch] Performance Counters for Linux, v2 Ingo Molnar
                   ` (2 preceding siblings ...)
  2008-12-08  8:32 ` Corey J Ashford
@ 2008-12-09  6:37 ` stephane eranian
  2008-12-09 11:02   ` Ingo Molnar
  2008-12-09 13:46   ` Ingo Molnar
  3 siblings, 2 replies; 33+ messages in thread
From: stephane eranian @ 2008-12-09  6:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Thomas Gleixner, linux-arch, Andrew Morton,
	Eric Dumazet, Robert Richter, Arjan van de Veen, Peter Anvin,
	Peter Zijlstra, Steven Rostedt, David Miller, Paul Mackerras

Hi,

On Mon, Dec 8, 2008 at 2:22 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
>
> There's a new "counter group record" facility that is a straightforward
> extension of the existing "irq record" notification type. This record
> type can be set on a 'master' counter, and if the master counter triggers
> an IRQ or an NMI, all the 'secondary' counters are read out atomically
> and are put into the counter-group record. The result can then be read()
> out by userspace via a single system call. (Based on extensive feedback
> from Paul Mackerras and David Miller, thanks guys!)
>

That is unfortunately not generic enough. You need a bit more
flexibility than master/secondaries, I am afraid.  What tools want
is to be able to express:
   - when event X overflows, record values of events  J, K
   - when event Y overflows, record values of events  Z, J

I am not making this up. I know tools that do just that, i.e., that
collect two distinct profiles in a single run. This is how, for
instance, you can collect a flat profile and the call graph in one run,
very much like gprof.

When you get a notification and you read out the sample, you'd like to
know in which order the values are returned. Given that you do not
expose counters, I would assume the only possibility would be to return
them in file descriptor order. But that assumes that at the time you
create the file descriptor for an event you already have all the other
file descriptors you need...

> There's also more generic x86 support: all 4 generic PMCs of Nehalem /
> Core i7 are supported - i've run 4 instances of KernelTop and they used
> up four separate PMCs.
>
Core/Atom have 5 counters, Nehalem has 7.
Why are you not using all of them already?

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-09  6:37 ` stephane eranian
@ 2008-12-09 11:02   ` Ingo Molnar
  2008-12-09 11:11     ` David Miller
  2008-12-09 13:46   ` Ingo Molnar
  1 sibling, 1 reply; 33+ messages in thread
From: Ingo Molnar @ 2008-12-09 11:02 UTC (permalink / raw)
  To: eranian
  Cc: linux-kernel, Thomas Gleixner, linux-arch, Andrew Morton,
	Eric Dumazet, Robert Richter, Arjan van de Veen, Peter Anvin,
	Peter Zijlstra, Steven Rostedt, David Miller, Paul Mackerras


* stephane eranian <eranian@googlemail.com> wrote:

> > There's also more generic x86 support: all 4 generic PMCs of Nehalem 
> > / Core i7 are supported - i've run 4 instances of KernelTop and they 
> > used up four separate PMCs.
>
> Core/Atom have 5 counters, Nehalem has 7. Why are you not using all of 
> them already?

no, Nehalem has 4 generic purpose PMCs and 3 fixed-purpose PMCs (7 
total), Core/Atom has 2 generic PMCs and 3 fixed-purpose PMCs (5 total). 
Saying that it has 7 is misleading. (and even the generic PMCs have 
constraints)

	Ingo

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-09 11:02   ` Ingo Molnar
@ 2008-12-09 11:11     ` David Miller
  2008-12-09 11:22       ` Ingo Molnar
  0 siblings, 1 reply; 33+ messages in thread
From: David Miller @ 2008-12-09 11:11 UTC (permalink / raw)
  To: mingo
  Cc: eranian, linux-kernel, tglx, linux-arch, akpm, dada1,
	robert.richter, arjan, hpa, a.p.zijlstra, rostedt, paulus

From: Ingo Molnar <mingo@elte.hu>
Date: Tue, 9 Dec 2008 12:02:46 +0100

> 
> * stephane eranian <eranian@googlemail.com> wrote:
> 
> > > There's also more generic x86 support: all 4 generic PMCs of Nehalem 
> > > / Core i7 are supported - i've run 4 instances of KernelTop and they 
> > > used up four separate PMCs.
> >
> > Core/Atom have 5 counters, Nehalem has 7. Why are you not using all of 
> > them already?
> 
> no, Nehalem has 4 generic purpose PMCs and 3 fixed-purpose PMCs (7 
> total),
 ...
> Saying that it has 7 is misleading.

Even you just did.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-09 11:11     ` David Miller
@ 2008-12-09 11:22       ` Ingo Molnar
  2008-12-09 11:29         ` David Miller
  0 siblings, 1 reply; 33+ messages in thread
From: Ingo Molnar @ 2008-12-09 11:22 UTC (permalink / raw)
  To: David Miller
  Cc: eranian, linux-kernel, tglx, linux-arch, akpm, dada1,
	robert.richter, arjan, hpa, a.p.zijlstra, rostedt, paulus


* David Miller <davem@davemloft.net> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> Date: Tue, 9 Dec 2008 12:02:46 +0100
> 
> > 
> > * stephane eranian <eranian@googlemail.com> wrote:
> > 
> > > > There's also more generic x86 support: all 4 generic PMCs of Nehalem 
> > > > / Core i7 are supported - i've run 4 instances of KernelTop and they 
> > > > used up four separate PMCs.
> > >
> > > Core/Atom have 5 counters, Nehalem has 7. Why are you not using all of 
> > > them already?
> > 
> > no, Nehalem has 4 generic purpose PMCs and 3 fixed-purpose PMCs (7 
> > total),
>  ...
> > Saying that it has 7 is misleading.
> 
> Even you just did.

which portion of my point stressing the general purpose attribute was 
unclear to you?

Saying it has 7 is misleading in the same way as if i told you now: 
"look, i have four eyes!". (they are: left eye looking right, left eye 
looking left, right eye looking left and right eye looking right)

Nehalem has 4 general purpose PMCs, not 7. Yes, it has 7 counters but 
they are not all general purpose. The P4 has 18.

	Ingo

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-09 11:22       ` Ingo Molnar
@ 2008-12-09 11:29         ` David Miller
  2008-12-09 12:14           ` Paolo Ciarrocchi
  0 siblings, 1 reply; 33+ messages in thread
From: David Miller @ 2008-12-09 11:29 UTC (permalink / raw)
  To: mingo
  Cc: eranian, linux-kernel, tglx, linux-arch, akpm, dada1,
	robert.richter, arjan, hpa, a.p.zijlstra, rostedt, paulus

From: Ingo Molnar <mingo@elte.hu>
Date: Tue, 9 Dec 2008 12:22:25 +0100

> 
> * David Miller <davem@davemloft.net> wrote:
> 
> > From: Ingo Molnar <mingo@elte.hu>
> > Date: Tue, 9 Dec 2008 12:02:46 +0100
> > 
> > > 
> > > * stephane eranian <eranian@googlemail.com> wrote:
> > > 
> > > > > There's also more generic x86 support: all 4 generic PMCs of Nehalem 
> > > > > / Core i7 are supported - i've run 4 instances of KernelTop and they 
> > > > > used up four separate PMCs.
> > > >
> > > > Core/Atom have 5 counters, Nehalem has 7. Why are you not using all of 
> > > > them already?
> > > 
> > > no, Nehalem has 4 generic purpose PMCs and 3 fixed-purpose PMCs (7 
> > > total),
> >  ...
> > > Saying that it has 7 is misleading.
> > 
> > Even you just did.
> 
> which portion of my point stressing the general purpose attribute was 
> unclear to you?

I'm just teasing you because you picked a trite point from stephane's
email instead of the meat later on, which I would have found more
interesting to hear you comment on.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-09 11:29         ` David Miller
@ 2008-12-09 12:14           ` Paolo Ciarrocchi
  0 siblings, 0 replies; 33+ messages in thread
From: Paolo Ciarrocchi @ 2008-12-09 12:14 UTC (permalink / raw)
  To: David Miller, Ingo Molnar
  Cc: eranian, linux-kernel, tglx, linux-arch, akpm, dada1,
	robert.richter, arjan, hpa, a.p.zijlstra, rostedt, paulus

On Tue, Dec 9, 2008 at 12:29 PM, David Miller <davem@davemloft.net> wrote:
> From: Ingo Molnar <mingo@elte.hu>
> Date: Tue, 9 Dec 2008 12:22:25 +0100
>
>>
>> * David Miller <davem@davemloft.net> wrote:
>>
>> > From: Ingo Molnar <mingo@elte.hu>
>> > Date: Tue, 9 Dec 2008 12:02:46 +0100
>> >
>> > >
>> > > * stephane eranian <eranian@googlemail.com> wrote:
>> > >
>> > > > > There's also more generic x86 support: all 4 generic PMCs of Nehalem
>> > > > > / Core i7 are supported - i've run 4 instances of KernelTop and they
>> > > > > used up four separate PMCs.
>> > > >
>> > > > Core/Atom have 5 counters, Nehalem has 7. Why are you not using all of
>> > > > them already?
>> > >
>> > > no, Nehalem has 4 generic purpose PMCs and 3 fixed-purpose PMCs (7
>> > > total),
>> >  ...
>> > > Saying that it has 7 is misleading.
>> >
>> > Even you just did.
>>
>> which portion of my point stressing the general purpose attribute was
>> unclear to you?
>
> I'm just teasing you because you picked a trite point from stephane's
> email instead of the meat later on, which I would have found more
> interesting to hear you comment on.

I'm interested in Ingo's comments on that argument as well, but I don't
feel the need to act like we are all in a kindergarten.

You are two outstanding developers, and I'm sure you can demonstrate
that you can have a purely technical discussion on this topic, as you
have done several times in the past.

Regards,
-- 
Paolo
http://paolo.ciarrocchi.googlepages.com/
http://mypage.vodafone.it/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-08 22:03     ` Paul Mackerras
@ 2008-12-09 13:00       ` Ingo Molnar
  2008-12-09 23:00         ` Paul Mackerras
  0 siblings, 1 reply; 33+ messages in thread
From: Ingo Molnar @ 2008-12-09 13:00 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: linux-kernel, Thomas Gleixner, linux-arch, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter, Arjan van de Veen,
	Peter Anvin, Peter Zijlstra, Steven Rostedt, David Miller


* Paul Mackerras <paulus@samba.org> wrote:

> > Things like: "kerneltop would not be as accurate with: ..., to the 
> > level of adding 5% of extra noise.". Would that work for you?
> 
> OK, here's an example.  I have an application whose execution has 
> several different phases, and I want to measure the L1 Icache hit rate 
> and the L1 Dcache hit rate as a function of time and make a graph.  So 
> I need counters for L1 Icache accesses, L1 Icache misses, L1 Dcache 
> accesses, and L1 Dcache misses.  I want to sample at 1ms intervals. The 
> CPU I'm running on has two counters.
> 
> With your current proposal, I don't see any way to make sure that the 
> counter scheduler counts L1 Dcache accesses and L1 Dcache misses at the 
> same time, then schedules L1 Icache accesses and L1 Icache misses.  I 
> could end up with L1 Dcache accesses and L1 Icache accesses, then L1 
> Dcache misses and L1 Icache misses - and get a nonsensical situation 
> like the misses being greater than the accesses.

yes, agreed, this is a valid special case of simple counter readout - 
we'll add support to couple counters like that.

Note that this issue does not impact use of multiple counters in 
profilers. (i.e. anything that is not a pure readout of the counter, 
along linear time, as your example above suggests).

Once we start sampling the context, grouping of counters becomes 
irrelevant (and a hindrance) and static frequency sampling becomes an 
inferior method of sampling.

( The highest quality statistical approach is the kind of multi-counter
  sampling model you can see implemented in KernelTop for example, where
  the counters are independently sampled. Can go on in great detail about
  this if you are interested - this is the far more interesting usecase
  in practice. )

	Ingo

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-09  6:37 ` stephane eranian
  2008-12-09 11:02   ` Ingo Molnar
@ 2008-12-09 13:46   ` Ingo Molnar
  2008-12-09 16:39     ` Chris Friesen
                       ` (3 more replies)
  1 sibling, 4 replies; 33+ messages in thread
From: Ingo Molnar @ 2008-12-09 13:46 UTC (permalink / raw)
  To: eranian
  Cc: linux-kernel, Thomas Gleixner, linux-arch, Andrew Morton,
	Eric Dumazet, Robert Richter, Arjan van de Veen, Peter Anvin,
	Peter Zijlstra, Steven Rostedt, David Miller, Paul Mackerras,
	Paolo Ciarrocchi


* stephane eranian <eranian@googlemail.com> wrote:

> > There's a new "counter group record" facility that is a 
> > straightforward extension of the existing "irq record" notification 
> > type. This record type can be set on a 'master' counter, and if the 
> > master counter triggers an IRQ or an NMI, all the 'secondary' 
> > counters are read out atomically and are put into the counter-group 
> > record. The result can then be read() out by userspace via a single 
> > system call. (Based on extensive feedback from Paul Mackerras and 
> > David Miller, thanks guys!)
> 
> That is unfortunately not generic enough. You need a bit more 
> flexibility than master/secondaries, I am afraid.  What tools want is 
> to be able to express:
>
>    - when event X overflows, record values of events  J, K
>    - when event Y overflows, record values of events  Z, J

hm, the new group code in perfcounters-v2 can already do this. Have you 
tried to use it and it didn't work? If so then that's a bug. Nothing in 
the design prevents that kind of group readout.

[ We could (and probably will) enhance the grouping relationship some
  more, but group readouts are a fundamentally inferior mode of 
  profiling. (see below for the explanation) ]

> I am not making this up. I know tools that do just that, i.e., that is 
> collecting two distinct profiles in a single run. This is how, for 
> instance, you can collect a flat profile and the call graph in one run, 
> very much like gprof.

yeah, but it's still the fundamentally wrong thing to do.

Being able to extract high-quality performance information from the 
system is the cornerstone of our design, and choosing the right sampling 
model permeates the whole issue of single-counter versus group-readout.

I don't think finer design aspects of kernel support for performance 
counters can be argued without being on the same page about this, so 
please let me outline our view on these things, in (boringly) verbose 
detail - spiked with examples and code as well.

Firstly, sampling "at 1msec intervals" or any fixed period is a _very_ 
wrong mindset - and cross-sampling counters is a similarly wrong mindset.

When there are two (or more) hw metrics to profile, the ideally best 
(i.e. the statistically most stable and most relevant) sampling for the 
two statistical variables (say of l2_misses versus l2_accesses) is to 
sample them independently, via their own metric. Not via a static 1khz 
rate - or via picking one of the variables to generate samples.

[ Sidenote: as long as the hw supports such sort of independent sampling 
  - lets assume so for the sake of argument - not all CPUs are capable of 
  that - most modern CPUs do though. ]

Static frequency [time] sampling has a number of disadvantages that 
drastically reduce its precision and reduce its utility, and 'group' 
sampling where one counter controls the events has similar problems:

- It under-samples rare events such as cachemisses.

  An example: say we have a workload that executes 1 billion instructions 
  a second, of which 5000 generate a cachemiss. Only one in 200,000
  instructions generates a cachemiss. The chance for a static sampling 
  IRQ to hit exactly an instruction that causes the cachemiss is 1:200 
  (0.5%) in every second. That is very low probability, and the profile 
  would not be very helpful - even though it samples at a seemingly 
  adequate frequency of 1000 events per second! (The short sketch right
  after this list works through these numbers.)

  With per event counters and per event sampling that KernelTop uses, we 
  get an event next to the instruction that causes a cachemiss with a
  100% certainty, all the time. The profile and its per instruction
  aspects suddenly become a whole lot more accurate and whole lot more 
  interesting.

- Static frequency and group sampling also runs the risk of systematic 
  error/skew of sampling if any workload component has any correlation
  with the "1msec" global sampling period.

  For example: say we profile a workload that runs a timer every 20
  msecs. In such a case the profile could be skewed asymmetrically
  against [or in favor of] the timer activity it does every 20
  milliseconds.

  Good sampling wants the samples to be generated in proportion to the
  variable itself, not proportional to absolute time.

- Static sampling also over-samples when the workload activity goes
  down (when it goes more idle).

  For example: we profile a fluctuating workload that is sometimes only
  0.2% busy, i.e. running only for 2 milliseconds every second. Still we
  keep interrupting it at 1 khz - that can be a very brutal systematic
  skew if the sampling overhead is 2 microseconds, totalling to 2 msecs
  overhead every second - so 50% of what runs on the CPU will be sampling
  code - impacting/skewing the sampled code.

  Good sampling wants to 'follow' the ebb and flow of the actual hw
  events that the CPU has.
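
As a quick sanity check of the under-sampling example above (the
1 billion instructions/sec, 5000 cachemisses/sec workload), here is a
tiny back-of-the-envelope sketch - illustrative arithmetic only:

#include <stdio.h>

int main(void)
{
        /*
         * Workload from the example above: 1e9 instructions/sec, of
         * which 5000/sec miss the cache.  Compare how many samples land
         * next to a cachemiss with static 1 kHz sampling versus with
         * one sampling IRQ per 200 cachemisses.
         */
        double insns_per_sec  = 1e9;
        double misses_per_sec = 5000.0;
        double static_hz      = 1000.0;
        double miss_period    = 200.0;

        /* chance that one random static sample hits a missing instruction */
        double p_hit = misses_per_sec / insns_per_sec;  /* 1 in 200,000 */

        printf("static sampling: %.3f miss samples/sec expected\n",
               static_hz * p_hit);                      /* ~0.005 (0.5%/sec) */
        printf("per-event      : %.1f miss samples/sec\n",
               misses_per_sec / miss_period);           /* 25.0 */
        return 0;
}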

The best way to sample two metrics such as "cache accesses" and "cache 
misses" (or say "cache misses" versus "TLB misses") is to sample the two 
variables _independently_, and to build independent histograms out of 
them.

The combination (or 'grouping') of the measured variables is thus done at 
the output stage _after_ data acquisition, to provide a weighted 
histogram (or a split-view double histogram).

For example, in a "l2 misses" versus "l2 accesses" case, the highest 
quality of sampling is to use two independent sampling IRQs with such 
sampling parameters:

  - one notification every     200 L2 cache misses
  - one notification every  10,000 L2 cache accesses

[ this is a ballpark figure - the sample rate is a function of the
  averages of the workload and the characteristics of the CPU. ]

And at the output stage display a combination of:

  l2_accesses[pc]
  l2_misses[pc]
  l2_misses[pc] / l2_accesses[pc]

Note that if we had a third variable as well - say icache_misses[], we 
could combine the three metrics:

  l2_misses[pc] / l2_accesses[pc] / icache_misses[pc]

  ( such a view expresses the miss/access ratio in a branch-weighted
    fashion: it weighs down instructions that also show signs of icache 
    pressure and goes for the functions with a high dcache rate but low 
    icache pressure - i.e. commonly executed functions with a high data 
    miss rate. )
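
As a rough sketch of what that output-stage combination amounts to
(illustrative arithmetic only - this is not the actual kerneltop code,
and the numbers are made up, loosely echoing the outputs below):

#include <stdio.h>

#define NSYMS 3

int main(void)
{
        /*
         * Two independently sampled histograms, indexed here by a
         * symbol id for simplicity, merged into one weighted view
         * (misses per access) at the output stage.
         */
        const char *sym[NSYMS]    = { "copy_user_generic_string",
                                      "eth_type_trans", "skb_push" };
        double l2_accesses[NSYMS] = { 5717.0,  160.0,  40.0 };
        double l2_misses[NSYMS]   = { 1419.0, 1075.0, 569.0 };
        int i;

        for (i = 0; i < NSYMS; i++) {
                double weight = l2_accesses[i] > 0.0 ?
                                l2_misses[i] / l2_accesses[i] : 0.0;
                printf("%8.2f - %s\n", weight, sym[i]);
        }
        return 0;
}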

Sampling at a static frequency is acceptable as well in some cases, and 
will lead to an output that is usable for some things. It's just not the 
best sampling model, and it's not usable at all for certain important 
things such as highly derived views, good instruction level profiles or 
rare hw events.

I've uploaded a new version of kerneltop.c that has such a multi-counter 
sampling model that follows this statistical model:

    http://redhat.com/~mingo/perfcounters/kerneltop.c

Example of usage:

I've started a tbench 64 localhost workload on a 16way x86 box. I want to 
check the miss/refs ratio. I first sampled one of the metrics, 
cache-references:

$ ./kerneltop -e 2 -c 100000 -C 2

------------------------------------------------------------------------------
 KernelTop:    1311 irqs/sec  [NMI, 10000 cache-refs],  (all, cpu: 2)
------------------------------------------------------------------------------

             events         RIP          kernel function
             ______   ________________   _______________

            5717.00 - ffffffff803666c0 : copy_user_generic_string!
             355.00 - ffffffff80507646 : tcp_sendmsg
             315.00 - ffffffff8050abcb : tcp_ack
             222.00 - ffffffff804fbb20 : ip_rcv_finish
             215.00 - ffffffff8020a75b : __switch_to
             194.00 - ffffffff804d0b76 : skb_copy_datagram_iovec
             187.00 - ffffffff80502b5d : __inet_lookup_established
             183.00 - ffffffff8051083d : tcp_transmit_skb
             160.00 - ffffffff804e4fc9 : eth_type_trans
             156.00 - ffffffff8026ae31 : audit_syscall_exit


Then i checked the characteristics of the other metric [cache-misses]: 

$ ./kerneltop -e 3 -c 200 -C 2

------------------------------------------------------------------------------
 KernelTop:    1362 irqs/sec  [NMI, 200 cache-misses],  (all, cpu: 2)
------------------------------------------------------------------------------

             events         RIP          kernel function
             ______   ________________   _______________

            1419.00 - ffffffff803666c0 : copy_user_generic_string!
            1075.00 - ffffffff804e4fc9 : eth_type_trans
            1059.00 - ffffffff804d8baa : dst_release
             949.00 - ffffffff80510004 : tcp_established_options
             841.00 - ffffffff804fbb20 : ip_rcv_finish
             569.00 - ffffffff804ce808 : skb_push
             454.00 - ffffffff80502b5d : __inet_lookup_established
             453.00 - ffffffff805001a3 : ip_queue_xmit
             298.00 - ffffffff804cf5d8 : skb_release_head_state
             247.00 - ffffffff804ce74b : skb_copy_and_csum_dev

then, to get the "combination" view of the two counters, i appended the 
two command lines:

 $ ./kerneltop -e 3 -c 200 -e 2 -c 10000 -C 2

------------------------------------------------------------------------------
 KernelTop:    2669 irqs/sec  [NMI, cache-misses/cache-refs],  (all, cpu: 2)
------------------------------------------------------------------------------

             weight         RIP          kernel function
             ______   ________________   _______________

              35.20 - ffffffff804ce74b : skb_copy_and_csum_dev
              33.00 - ffffffff804cb740 : sock_alloc_send_skb
              31.26 - ffffffff804ce808 : skb_push
              22.43 - ffffffff80510004 : tcp_established_options
              19.00 - ffffffff8027d250 : find_get_page
              15.76 - ffffffff804e4fc9 : eth_type_trans
              15.20 - ffffffff804d8baa : dst_release
              14.86 - ffffffff804cf5d8 : skb_release_head_state
              14.00 - ffffffff802217d5 : read_hpet
              12.00 - ffffffff804ffb7f : __ip_local_out
              11.97 - ffffffff804fc0c8 : ip_local_deliver_finish
               8.54 - ffffffff805001a3 : ip_queue_xmit

[ It's interesting to see that a seemingly common function, 
  copy_user_generic_string(), got eliminated from the top spots - because 
  there are other functions whose relative cachemiss rate is far more 
  serious. ]

The above "derived" profile output is relatively stable under kerneltop 
with the use of ~2600 sample irqs/sec and the default 2-second refresh. 
I'd encourage you to try to achieve the same quality of output with 
static 2600 Hz sampling - it won't work with the kind of event rates i've 
worked with above, no matter whether you read out a single counter or a 
group of counters, atomically or not. (because we just don't get 
notification PCs at the relevant hw events - we get PCs with a time 
sample)

And that is just one 'rare' event type (cachemisses) - if we had two such 
sources (say l2 cachemisses and TLB misses) then such type of combined 
view would only be possible if we got independent events from both 
hardware events.

And note that once you accept that the highest quality approach is to 
sample the hw events independently, all the "group readout" approaches 
become a second-tier mechanism. KernelTop uses that model and works just 
fine without any group readout and it is making razor sharp profiles, 
down to the instruction level.

[ Note that there's special-cases where group-sampling can limp along
  with acceptable results: if one of the two counters has so many events
  that sampling by time or sampling by the rare event type gives relevant
  context info. But the moment both event sources are rare, the group
  model breaks down completely and produces meaningless results. It's
  just a fundamentally wrong kind of abstraction to mix together
  unrelated statistical variables. And that's one of the fundamental
  design problems i see with perfmon-v3. ]

	Ingo

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-09 13:46   ` Ingo Molnar
@ 2008-12-09 16:39     ` Chris Friesen
  2008-12-09 19:02       ` Ingo Molnar
  2008-12-09 16:46     ` Will Newton
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 33+ messages in thread
From: Chris Friesen @ 2008-12-09 16:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: eranian, linux-kernel, Thomas Gleixner, linux-arch, Andrew Morton,
	Eric Dumazet, Robert Richter, Arjan van de Veen, Peter Anvin,
	Peter Zijlstra, Steven Rostedt, David Miller, Paul Mackerras,
	Paolo Ciarrocchi

Ingo Molnar wrote:

> When there are two (or more) hw metrics to profile, the ideally best 
> (i.e. the statistically most stable and most relevant) sampling for the 
> two statistical variables (say of l2_misses versus l2_accesses) is to 
> sample them independently, via their own metric. Not via a static 1khz 
> rate - or via picking one of the variables to generate samples.

Regardless of sampling method, don't you still want some way to 
enable/disable the various counters as close to simultaneously as possible?

Chris

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-09 13:46   ` Ingo Molnar
  2008-12-09 16:39     ` Chris Friesen
@ 2008-12-09 16:46     ` Will Newton
  2008-12-09 17:35       ` Chris Friesen
  2008-12-09 21:16     ` stephane eranian
  2008-12-09 22:19     ` Paul Mackerras
  3 siblings, 1 reply; 33+ messages in thread
From: Will Newton @ 2008-12-09 16:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: eranian, linux-kernel, Thomas Gleixner, linux-arch, Andrew Morton,
	Eric Dumazet, Robert Richter, Arjan van de Veen, Peter Anvin,
	Peter Zijlstra, Steven Rostedt, David Miller, Paul Mackerras,
	Paolo Ciarrocchi

On Tue, Dec 9, 2008 at 1:46 PM, Ingo Molnar <mingo@elte.hu> wrote:

> Firstly, sampling "at 1msec intervals" or any fixed period is a _very_
> wrong mindset - and cross-sampling counters is a similarly wrong mindset.

If your hardware does not interrupt on overflow I don't think you have
any choice in the matter. I know such hardware is less than ideal but
it exists so it should be supported.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-09 16:46     ` Will Newton
@ 2008-12-09 17:35       ` Chris Friesen
  0 siblings, 0 replies; 33+ messages in thread
From: Chris Friesen @ 2008-12-09 17:35 UTC (permalink / raw)
  To: Will Newton
  Cc: Ingo Molnar, eranian, linux-kernel, Thomas Gleixner, linux-arch,
	Andrew Morton, Eric Dumazet, Robert Richter, Arjan van de Veen,
	Peter Anvin, Peter Zijlstra, Steven Rostedt, David Miller,
	Paul Mackerras, Paolo Ciarrocchi

Will Newton wrote:
> On Tue, Dec 9, 2008 at 1:46 PM, Ingo Molnar <mingo@elte.hu> wrote:
> 
>> Firstly, sampling "at 1msec intervals" or any fixed period is a _very_
>> wrong mindset - and cross-sampling counters is a similarly wrong mindset.
> 
> If your hardware does not interrupt on overflow I don't think you have
> any choice in the matter. I know such hardware is less than ideal but
> it exists so it should be supported.

I think you could still set up the counters as Ingo describes and then 
sample the counters (as opposed to the program) at a suitable interval 
(chosen such that the counters won't overflow more than once between 
samples).
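
A minimal sketch of that interval calculation, assuming a free-running
N-bit counter and a known worst-case event rate (both values below are
made up for illustration):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        /*
         * Poll often enough that a counter which cannot interrupt on
         * overflow can never wrap more than once between two reads.
         */
        unsigned counter_bits     = 32;         /* width of the hw counter */
        double   max_events_per_s = 4e9;        /* worst case: one event/cycle */

        double wrap_seconds = (double)((uint64_t)1 << counter_bits)
                              / max_events_per_s;

        /* halve it to leave headroom for a late timer tick */
        printf("wrap time %.2f s -> poll at least every %.0f ms\n",
               wrap_seconds, wrap_seconds / 2.0 * 1000.0);
        return 0;
}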

Chris

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-09 16:39     ` Chris Friesen
@ 2008-12-09 19:02       ` Ingo Molnar
  2008-12-09 19:51         ` Chris Friesen
  0 siblings, 1 reply; 33+ messages in thread
From: Ingo Molnar @ 2008-12-09 19:02 UTC (permalink / raw)
  To: Chris Friesen
  Cc: eranian, linux-kernel, Thomas Gleixner, linux-arch, Andrew Morton,
	Eric Dumazet, Robert Richter, Arjan van de Veen, Peter Anvin,
	Peter Zijlstra, Steven Rostedt, David Miller, Paul Mackerras,
	Paolo Ciarrocchi


* Chris Friesen <cfriesen@nortel.com> wrote:

> Ingo Molnar wrote:
>
>> When there are two (or more) hw metrics to profile, the ideally best 
>> (i.e. the statistically most stable and most relevant) sampling for 
>> the two statistical variables (say of l2_misses versus l2_accesses) is 
>> to sample them independently, via their own metric. Not via a static 
>> 1khz rate - or via picking one of the variables to generate samples.
>
> Regardless of sampling method, don't you still want some way to 
> enable/disable the various counters as close to simultaneously as 
> possible?

If it's about counter control for the monitored task, then we sure could 
do something about that. (apps/libraries could thus select a subset of 
functions to profile/measure, runtime, etc.)

If it's about counter control for the profiler/debugger, i'm not sure how 
useful that is - do you have a good usecase for it?

	Ingo

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-09 19:02       ` Ingo Molnar
@ 2008-12-09 19:51         ` Chris Friesen
  0 siblings, 0 replies; 33+ messages in thread
From: Chris Friesen @ 2008-12-09 19:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: eranian, linux-kernel, Thomas Gleixner, linux-arch, Andrew Morton,
	Eric Dumazet, Robert Richter, Arjan van de Veen, Peter Anvin,
	Peter Zijlstra, Steven Rostedt, David Miller, Paul Mackerras,
	Paolo Ciarrocchi

Ingo Molnar wrote:
> * Chris Friesen <cfriesen@nortel.com> wrote:

>> Regardless of sampling method, don't you still want some way to 
>> enable/disable the various counters as close to simultaneously as 
>> possible?
> 
> If it's about counter control for the monitored task, then we sure could 
> do something about that. (apps/libraries could thus select a subset of 
> functions to profile/measure, runtime, etc.)
> 
> If it's about counter control for the profiler/debugger, i'm not sure how 
> useful that is - do you have a good usecase for it?

I'm sure that others could give more usecases, but I was thinking about 
cases like "I want to test _these_ multiple metrics simultaneously over 
_this_ specific section of code".  In a case like this, it seems 
desirable to start/stop the various performance counters as close 
together as possible, especially if the section of code being tested is 
short.

Chris

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-09 13:46   ` Ingo Molnar
  2008-12-09 16:39     ` Chris Friesen
  2008-12-09 16:46     ` Will Newton
@ 2008-12-09 21:16     ` stephane eranian
  2008-12-09 21:16       ` stephane eranian
  2008-12-09 22:19     ` Paul Mackerras
  3 siblings, 1 reply; 33+ messages in thread
From: stephane eranian @ 2008-12-09 21:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-arch, Peter Zijlstra, David Miller, linux-kernel,
	Steven Rostedt, Eric Dumazet, Paul Mackerras, Paolo Ciarrocchi,
	Peter Anvin, Thomas Gleixner, Andrew Morton, perfmon2-devel,
	Arjan van de Veen

Hi,

On Tue, Dec 9, 2008 at 2:46 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * stephane eranian <eranian@googlemail.com> wrote:
>
>> > There's a new "counter group record" facility that is a
>> > straightforward extension of the existing "irq record" notification
>> > type. This record type can be set on a 'master' counter, and if the
>> > master counter triggers an IRQ or an NMI, all the 'secondary'
>> > counters are read out atomically and are put into the counter-group
>> > record. The result can then be read() out by userspace via a single
>> > system call. (Based on extensive feedback from Paul Mackerras and
>> > David Miller, thanks guys!)
>>
>> That is unfortunately not generic enough. You need a bit more
>> flexibility than master/secondaries, I am afraid.  What tools want is
>> to be able to express:
>>
>>    - when event X overflows, record values of events  J, K
>>    - when event Y overflows, record values of events  Z, J
>
> hm, the new group code in perfcounters-v2 can already do this. Have you
> tried to use it and it didnt work? If so then that's a bug. Nothing in
> the design prevents that kind of group readout.
>
> [ We could (and probably will) enhance the grouping relationship some
>  more, but group readouts are a fundamentally inferior mode of
>  profiling. (see below for the explanation) ]
>
>> I am not making this up. I know tools that do just that, i.e., that is
>> collecting two distinct profiles in a single run. This is how, for
>> instance, you can collect a flat profile and the call graph in one run,
>> very much like gprof.
>
> yeah, but it's still the fundamentally wrong thing to do.
>
That's not for you to say. This is a decision for the tool writers.

There is absolutely nothing wrong with this. In fact, people do this
kind of measurement all the time. Your horizon seems a bit too
limited, maybe.

Certain PMU features do not count events; they capture information about
where the events occur, so they are more like buffers. Sometimes they are
hosted in registers. For instance, Itanium has long been able to capture
where cache misses occur. The data is stored in a couple of PMU registers,
one cache miss at a time. There is a PMU event that counts how many misses
are captured. So you program that event into a counter, and when it
overflows you want to read out the pair of data registers containing the
last captured cache miss. Thus, when event X overflows, you capture the
values in registers Z, Y. There is nothing wrong with this. You do the
same thing when you want to sample on a branch trace buffer, like the x86
LBR. Again, nothing wrong with this. In fact you can collect both at the
same time and in an independent manner.


> Being able to extract high-quality performance information from the
> system is the cornerstone of our design, and chosing the right sampling
> model permeates the whole issue of single-counter versus group-readout.
>
> I dont think finer design aspects of kernel support for performance
> counters can be argued without being on the same page about this, so
> please let me outline our view on these things, in (boringly) verbose
> detail - spiked with examples and code as well.
>
> Firstly, sampling "at 1msec intervals" or any fixed period is a _very_
> wrong mindset - and cross-sampling counters is a similarly wrong mindset.
>
> When there are two (or more) hw metrics to profile, the ideally best
> (i.e. the statistically most stable and most relevant) sampling for the
> two statistical variables (say of l2_misses versus l2_accesses) is to
> sample them independently, via their own metric. Not via a static 1khz
> rate - or via picking one of the variables to generate samples.
>

Did I talk about a static sampling period?

> [ Sidenote: as long as the hw supports such sort of independent sampling
>  - lets assume so for the sake of argument - not all CPUs are capable of
>  that - most modern CPUs do though. ]
>
> Static frequency [time] sampling has a number of disadvantages that
> drastically reduce its precision and reduce its utility, and 'group'
> sampling where one counter controls the events has similar problems:
>
> - It under-samples rare events such as cachemisses.
>
>  An example: say we have a workload that executes 1 billion instructions
>  a second, of which 5000 generate a cachemiss. Only one in 200,000
>  instructions generates a cachemiss. The chance for a static sampling
>  IRQ to hit exactly an instruction that causes the cachemiss is 1:200
>  (0.5%) in every second. That is very low probability, and the profile
>  would not be very helpful - even though it samples at a seemingly
>  adequate frequency of 1000 events per second!
>
Who talked about periods expressed as events per second?

I did not talk about that. If you had looked at the perfmon API, you would
have noticed that it does not know anything about sampling periods. It
only sees register values. Tools are free to pick whatever value they like.
And the value, by definition, is the number of occurrences of
the event, not the number of occurrences per second. You can say:
every 2000 cache misses, take a sample; just program that counter
with -2000.
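
(For readers unfamiliar with the idiom: "-2000" works because the
counter is seeded with the negated period, so it overflows after
exactly that many events.  A minimal sketch of the arithmetic, with a
made-up counter width:)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        unsigned width  = 48;           /* counter width, example value */
        uint64_t period = 2000;         /* sample every 2000 events */

        /* seed = 2^width - period, i.e. -period modulo 2^width */
        uint64_t seed = ((uint64_t)1 << width) - period;

        printf("seed the counter with 0x%012llx; it overflows after %llu events\n",
               (unsigned long long)seed, (unsigned long long)period);
        return 0;
}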

>  With per event counters and per event sampling that KernelTop uses, we
>  get an event next to the instruction that causes a cachemiss with a

You have no guarantee of how close the RIP is to where the cache
miss occurred. It can be several instructions away (NMI or not, by the way).
There is nothing software can do about it, neither my inferior design nor
your superior design.

>
> And note that once you accept that the highest quality approach is to
> sample the hw events independently, all the "group readout" approaches
> become a second-tier mechanism. KernelTop uses that model and works just
> fine without any group readout and it is making razor sharp profiles,
> down to the instruction level.
>

And you think you cannot do independent sampling with perfmon3?

As for 'razor sharp', that is your interpretation. As far as I know a
RIP is always pointing to an instruction anyway. What you seem to be
ignoring here is the fact that the RIP is only as good as the hardware
can give you. And it just happens that on ALL processor architectures it
is off compared to where the event actually occurred. It can be several
cycles away actually: skid. Your superior design does not improve that
precision whatsoever. It has to be handled at the hardware level. Why do
you think AMD added IBS, why Intel added PEBS on x86, and why Intel
added IP-EAR on Itanium2? Even PEBS does not solve that issue
completely. As far as I know the quality of your profiles is as good as
that of OProfile, VTune, or perfmon.


> [ Note that there's special-cases where group-sampling can limp along
>  with acceptable results: if one of the two counters has so many events
>  that sampling by time or sampling by the rare event type gives relevant
>  context info. But the moment both event sources are rare, the group

>  model breaks down completely and produces meaningless results. It's
>  just a fundamentally wrong kind of abstraction to mix together
>  unrelated statistical variables. And that's one of the fundamental
>  design problems i see with perfmon-v3. ]
>

Again an unfounded statement: perfmon3 does not mandate what is recorded
on overflow. It does not mandate how many events you can sample on at
the same time. It does not know about sampling periods; it only knows
about data register values and reset values on overflow. For each
counter, you can freely specify what you want recorded using a simple
bitmask.

Are we on the same page, then?

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-09 13:46   ` Ingo Molnar
                       ` (2 preceding siblings ...)
  2008-12-09 21:16     ` stephane eranian
@ 2008-12-09 22:19     ` Paul Mackerras
  2008-12-09 22:40       ` Andi Kleen
  3 siblings, 1 reply; 33+ messages in thread
From: Paul Mackerras @ 2008-12-09 22:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: eranian, linux-kernel, Thomas Gleixner, linux-arch, Andrew Morton,
	Eric Dumazet, Robert Richter, Arjan van de Veen, Peter Anvin,
	Peter Zijlstra, Steven Rostedt, David Miller, Paolo Ciarrocchi

Ingo Molnar writes:

> yeah, but it's still the fundamentally wrong thing to do.
> 
> Being able to extract high-quality performance information from the 
> system is the cornerstone of our design, and chosing the right sampling 
> model permeates the whole issue of single-counter versus group-readout.

Thanks for taking the time to write all this down, and I will respond
in detail once I have thought about it some more.

The thing that stands out to me immediately, though, is that you are
concentrating entirely on _sampling_ as opposed to _counting_.
Perhaps this is the main reason we have been disagreeing.

Now of course sampling is interesting, but counting is also
interesting, whether over the whole execution of a program or over
short intervals during the execution of a program (based either on
time, or on the execution of certain functions).

It seems to me that a well-designed performance monitor infrastructure
should support both counting and sampling.  And for counting, getting
high-quality data requires synchronized counters (ones that all start
counting and stop counting at the same times).

Looking back at the discussion so far, I can see that your arguments
make more sense if you are only wanting to do sampling.  And I have
been arguing for what I believe we need to do counting properly (I
have focused more on counting because we already have infrastructure
for sampling, namely oprofile).

So, can we agree to discuss both sampling and counting, and design an
infrastructure that's good for both?

Paul.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-09 22:19     ` Paul Mackerras
@ 2008-12-09 22:40       ` Andi Kleen
  2008-12-10  4:44         ` Paul Mackerras
  0 siblings, 1 reply; 33+ messages in thread
From: Andi Kleen @ 2008-12-09 22:40 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Ingo Molnar, eranian, linux-kernel, Thomas Gleixner, linux-arch,
	Andrew Morton, Eric Dumazet, Robert Richter, Arjan van de Veen,
	Peter Anvin, Peter Zijlstra, Steven Rostedt, David Miller,
	Paolo Ciarrocchi

Paul Mackerras <paulus@samba.org> writes:

> So, can we agree to discuss both sampling and counting, and design an
> infrastructure that's good for both?

When you say counting you should also include "event ring buffers with
metadata", like PEBS on Intel x86. 

-Andi

-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-09 13:00       ` Ingo Molnar
@ 2008-12-09 23:00         ` Paul Mackerras
  0 siblings, 0 replies; 33+ messages in thread
From: Paul Mackerras @ 2008-12-09 23:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Thomas Gleixner, linux-arch, Andrew Morton,
	Stephane Eranian, Eric Dumazet, Robert Richter, Arjan van de Veen,
	Peter Anvin, Peter Zijlstra, Steven Rostedt, David Miller

Ingo Molnar writes:

> * Paul Mackerras <paulus@samba.org> wrote:
> 
> > > Things like: "kerneltop would not be as accurate with: ..., to the 
> > > level of adding 5% of extra noise.". Would that work for you?
> > 
> > OK, here's an example.  I have an application whose execution has 
> > several different phases, and I want to measure the L1 Icache hit rate 
> > and the L1 Dcache hit rate as a function of time and make a graph.  So 
> > I need counters for L1 Icache accesses, L1 Icache misses, L1 Dcache 
> > accesses, and L1 Dcache misses.  I want to sample at 1ms intervals. The 
> > CPU I'm running on has two counters.
> > 
> > With your current proposal, I don't see any way to make sure that the 
> > counter scheduler counts L1 Dcache accesses and L1 Dcache misses at the 
> > same time, then schedules L1 Icache accesses and L1 Icache misses.  I 
> > could end up with L1 Dcache accesses and L1 Icache accesses, then L1 
> > Dcache misses and L1 Icache misses - and get a nonsensical situation 
> > like the misses being greater than the accesses.
> 
> yes, agreed, this is a valid special case of simple counter readout - 
> we'll add support to couple counters like that.

This is an example of a sampling problem, but one where the thing
being sampled is a derived statistic from two counter values.

I don't agree that this is really a "special case".  There are lots of
derived statistics that are interesting for performance analysis,
starting with CPI (cycles per instruction), proportions of various
instructions in the code, cache hit/miss rates for various different
caches, etc., etc.
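
As a concrete illustration of the kind of derived statistic meant here,
below is a minimal, self-contained C sketch that turns one synchronized
snapshot of four L1 counts into hit rates; the struct, the field names
and the numbers are invented for this illustration and are not part of
any proposed kernel API:

#include <stdio.h>
#include <stdint.h>

struct l1_counts {
	uint64_t icache_access, icache_miss;
	uint64_t dcache_access, dcache_miss;
};

static void print_hit_rates(const struct l1_counts *c)
{
	printf("L1I hit rate: %.2f%%\n",
	       100.0 * (double)(c->icache_access - c->icache_miss)
		     / (double)c->icache_access);
	printf("L1D hit rate: %.2f%%\n",
	       100.0 * (double)(c->dcache_access - c->dcache_miss)
		     / (double)c->dcache_access);
}

int main(void)
{
	/* One synchronized snapshot: all four values cover the same
	 * interval, so misses can never exceed accesses. */
	struct l1_counts snap = {
		.icache_access = 1200000, .icache_miss = 30000,
		.dcache_access =  800000, .dcache_miss = 64000,
	};

	print_hit_rates(&snap);
	return 0;
}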

> Note that this issue does not impact use of multiple counters in 
> profilers. (i.e. anything that is not a pure readout of the counter, 
> along linear time, as your example above suggests).

Well, that's the sampling vs. counting distinction that I made in my
other email.  We need to do both well.

As far as I can see, my "counter set" proposal does everything yours
does (since a counter set can be just a single counter), and also
cleanly accommodates what's needed for counting and for sampling
derived statistics.  No?

Paul.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-09 22:40       ` Andi Kleen
@ 2008-12-10  4:44         ` Paul Mackerras
  2008-12-10  5:03           ` stephane eranian
  2008-12-10 10:26           ` Andi Kleen
  0 siblings, 2 replies; 33+ messages in thread
From: Paul Mackerras @ 2008-12-10  4:44 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, eranian, linux-kernel, Thomas Gleixner, linux-arch,
	Andrew Morton, Eric Dumazet, Robert Richter, Arjan van de Veen,
	Peter Anvin, Peter Zijlstra, Steven Rostedt, David Miller,
	Paolo Ciarrocchi

Andi Kleen writes:

> When you say counting you should also include "event ring buffers with
> metadata", like PEBS on Intel x86. 

I'm not familiar with PEBS.  Maybe it's something different again,
neither sampling nor counting, but a third thing?

Paul.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-10  4:44         ` Paul Mackerras
@ 2008-12-10  5:03           ` stephane eranian
  2008-12-10  5:03             ` stephane eranian
  2008-12-10 10:26           ` Andi Kleen
  1 sibling, 1 reply; 33+ messages in thread
From: stephane eranian @ 2008-12-10  5:03 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Andi Kleen, Ingo Molnar, linux-kernel, Thomas Gleixner,
	linux-arch, Andrew Morton, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Peter Zijlstra, Steven Rostedt,
	David Miller, Paolo Ciarrocchi

Paul,

On Wed, Dec 10, 2008 at 5:44 AM, Paul Mackerras <paulus@samba.org> wrote:
> Andi Kleen writes:
>
>> When you say counting you should also include "event ring buffers with
>> metadata", like PEBS on Intel x86.
>
> I'm not familiar with PEBS.  Maybe it's something different again,
> neither sampling nor counting, but a third thing?
>
PEBS is an Intel-only feature used for sampling. However, this time
the hardware (and microcode) does the sampling for you. You point
the CPU to a structure in memory, called DS, which then points to
a region of memory you designate, i.e., the sampling buffer. The
buffer can be any size you want.

Then you program counter0 with an event and a sampling period.
When the counter overflows, there is no interrupt; the microcode
records the RIP and full machine state, and reloads the counter with
the period specified in DS. The OS gets an interrupt ONLY when the
buffer fills up. Overhead is thus minimized, but you have no control
over the format of the samples. The precision (P) comes from the fact
that the RIP is guaranteed to point to an instruction that is just after
the instruction which generated the event you're sampling on. The catch
is that not all events support PEBS, and only one counter works with PEBS
on Core 2. Nehalem is better: more events support PEBS, and all 4 generic
counters support it. Furthermore, PEBS can now capture where
cache misses occur, very much like what Itanium can do.
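
For reference, here is a rough C sketch of the 64-bit DS save area
described above; the field names follow the commonly documented layout,
but the exact layout (e.g. the number of reset fields) differs between
CPU generations, so treat this as an illustrative sketch only:

#include <stdint.h>

/*
 * Sketch of the 64-bit Debug Store (DS) save area: the CPU is pointed
 * at this structure, which in turn points at the buffer the microcode
 * fills with PEBS records.  The OS is only interrupted once pebs_index
 * crosses pebs_interrupt_threshold.
 */
struct ds_area_64 {
	uint64_t bts_buffer_base;
	uint64_t bts_index;
	uint64_t bts_absolute_maximum;
	uint64_t bts_interrupt_threshold;
	uint64_t pebs_buffer_base;		/* start of the sampling buffer */
	uint64_t pebs_index;			/* next free PEBS record */
	uint64_t pebs_absolute_maximum;		/* end of the buffer */
	uint64_t pebs_interrupt_threshold;	/* interrupt once index gets here */
	uint64_t pebs_counter_reset[4];		/* reload value(s) for the counter(s) */
};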

Needless to say all of this is supported by perfmon.

Hope this helps.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-10  5:03           ` stephane eranian
@ 2008-12-10  5:03             ` stephane eranian
  0 siblings, 0 replies; 33+ messages in thread
From: stephane eranian @ 2008-12-10  5:03 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Andi Kleen, Ingo Molnar, linux-kernel, Thomas Gleixner,
	linux-arch, Andrew Morton, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Peter Zijlstra, Steven Rostedt,
	David Miller, Paolo Ciarrocchi

Paul,

On Wed, Dec 10, 2008 at 5:44 AM, Paul Mackerras <paulus@samba.org> wrote:
> Andi Kleen writes:
>
>> When you say counting you should also include "event ring buffers with
>> metadata", like PEBS on Intel x86.
>
> I'm not familiar with PEBS.  Maybe it's something different again,
> neither sampling nor counting, but a third thing?
>
PEBS is an Intel-only feature used for sampling. However, this time
the hardware (and microcode) does the sampling for you. You point
the CPU to a structure in memory, called DS, which then points to
a region of memory you designate, i.e., the sampling buffer. The
buffer can be any size you want.

Then you program counter0 with an event and a sampling period.
When the counter overflows, there is no interrupt; the microcode
records the RIP and full machine state, and reloads the counter with
the period specified in DS. The OS gets an interrupt ONLY when the
buffer fills up. Overhead is thus minimized, but you have no control
over the format of the samples. The precision (P) comes from the fact
that the RIP is guaranteed to point to an instruction that is just after
the instruction which generated the event you're sampling on. The catch
is that not all events support PEBS, and only one counter works with PEBS
on Core 2. Nehalem is better: more events support PEBS, and all 4 generic
counters support it. Furthermore, PEBS can now capture where
cache misses occur, very much like what Itanium can do.

Needless to say all of this is supported by perfmon.

Hope this helps.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-10  4:44         ` Paul Mackerras
  2008-12-10  5:03           ` stephane eranian
@ 2008-12-10 10:26           ` Andi Kleen
  1 sibling, 0 replies; 33+ messages in thread
From: Andi Kleen @ 2008-12-10 10:26 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Andi Kleen, Ingo Molnar, eranian, linux-kernel, Thomas Gleixner,
	linux-arch, Andrew Morton, Eric Dumazet, Robert Richter,
	Arjan van de Veen, Peter Anvin, Peter Zijlstra, Steven Rostedt,
	David Miller, Paolo Ciarrocchi

On Wed, Dec 10, 2008 at 03:44:31PM +1100, Paul Mackerras wrote:
> Andi Kleen writes:
> 
> > When you say counting you should also include "event ring buffers with
> > metadata", like PEBS on Intel x86. 
> 
> I'm not familiar with PEBS.  Maybe it's something different again,
> neither sampling nor counting, but a third thing?

Yes, it's a third thing: a CPU-controlled ring event buffer. See Stephane's
description.

There are also some crossovers, e.g. AMD's IBS (which is essentially a counter +
some additional registers that give more details about the interrupted
instruction).

-Andi

-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2008-12-08 11:49   ` Ingo Molnar
@ 2009-01-07  7:43     ` Zhang, Yanmin
  2009-01-09  1:07       ` Zhang, Yanmin
  0 siblings, 1 reply; 33+ messages in thread
From: Zhang, Yanmin @ 2009-01-07  7:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, linux-kernel, Thomas Gleixner, linux-arch,
	Andrew Morton, Stephane Eranian, Eric Dumazet, Robert Richter,
	Peter Anvin, Peter Zijlstra, Steven Rostedt, David Miller,
	Paul Mackerras

On Mon, 2008-12-08 at 12:49 +0100, Ingo Molnar wrote:
> * Arjan van de Ven <arjan@infradead.org> wrote:
> 
> > On Mon, 8 Dec 2008 02:22:12 +0100
> > Ingo Molnar <mingo@elte.hu> wrote:
> > 
> > > 
> > > [ Performance counters are special hardware registers available on
> > > most modern CPUs. These register count the number of certain types of
> > > hw events: such as instructions executed, cachemisses suffered, or 
> > >   branches mis-predicted, without slowing down the kernel or 
> > >   applications. These registers can also trigger interrupts when a 
> > >   threshold number of events have passed - and can thus be used to 
> > >   profile the code that runs on that CPU. ]
> > > 
> > > This is version 2 of our Performance Counters subsystem
> > > implementation.
> > > 
> > > The biggest user-visible change in this release is a new user-space 
> > > text-mode profiling utility that is based on this code: KernelTop.
> > > 
> > > KernelTop can be downloaded from:
> > > 
> > >   http://redhat.com/~mingo/perfcounters/kerneltop.c
> > > 
> > > It's a standalone .c file that needs no extra libraries - it only
> > > needs a CONFIG_PERF_COUNTERS=y kernel to run on.
> > > 
> > > This utility is intended for kernel developers - it's basically a
> > > dynamic kernel profiler that gets hardware counter events dispatched
> > > to it continuously, which it feeds into a histogram and outputs it 
> > > periodically.
> > > 
> > 
> > I played with this a little, and while it works neat, I wanted a 
> > feature added where it shows a detailed profile for the top function.
> 
> ah, very nice idea!
> 
> > I've hacked this in quickly (the usability isn't all that great yet) 
> > and put the source code up at
> >
> > http://www.tglx.de/~arjan/kerneltop-0.02.tar.gz
> 
> ok, picked it up :-)
Ingo,

I tried to use patch V5 and the latest kerneltop to collect some cachemiss data.

It seems kerneltop just shows the ip address of the first instruction of each
function. Does the latest kerneltop include the enhancement from Arjan? As you
know, with oprofile we can get the detailed instruction ip address which causes
the cache miss, although that ip address mostly needs to be moved back by one
instruction.

> 
> > with this it looks like this:
> > 
> > $ sudo ./kerneltop --vmlinux=/home/arjan/linux-2.6.git/vmlinux
> > 
> > ------------------------------------------------------------------------------
> >  KernelTop:     274 irqs/sec  [NMI, 1000000 cycles],  (all, 2 CPUs)
> > ------------------------------------------------------------------------------
> > 
> >              events         RIP          kernel function
> >              ______   ________________   _______________
> > 
> >                 230 - 00000000c04189e9 : read_hpet
> >                  82 - 00000000c0409439 : mwait_idle_with_hints
> >                  77 - 00000000c051a7b7 : acpi_os_read_port
> >                  52 - 00000000c053cb3a : acpi_idle_enter_bm
> >                  38 - 00000000c0418d93 : hpet_next_event
> >                  19 - 00000000c051a802 : acpi_os_write_port
> >                  14 - 00000000c04f8704 : __copy_to_user_ll
> >                  13 - 00000000c0460c20 : get_page_from_freelist
> >                   7 - 00000000c041c96c : kunmap_atomic
> >                   5 - 00000000c06a30d2 : _spin_lock	[joydev]
> >                   4 - 00000000c04f79b7 : vsnprintf	[snd_seq]
> >                   4 - 00000000c06a3048 : _spin_lock_irqsave	[pcspkr]
> >                   3 - 00000000c0403b3c : irq_entries_start
> >                   3 - 00000000c0423fee : run_rebalance_domains
> >                   3 - 00000000c0425e2c : scheduler_tick
> >                   3 - 00000000c0430938 : get_next_timer_interrupt
> >                   3 - 00000000c043cdfa : __update_sched_clock
> >                   3 - 00000000c0448b14 : update_iter
> >                   2 - 00000000c04304bd : run_timer_softirq
> > 
> > Showing details for read_hpet
> >        0	c04189e9 <read_hpet>:
> >        2	c04189e9:	a1 b0 e0 89 c0       	mov    0xc089e0b0,%eax
> >        0	
> >        0	/*
> >        0	 * Clock source related code
> >        0	 */
> >        0	static cycle_t read_hpet(void)
> >        0	{
> >        1	c04189ee:	55                   	push   %ebp
> >        0	c04189ef:	89 e5                	mov    %esp,%ebp
> >        1	c04189f1:	05 f0 00 00 00       	add    $0xf0,%eax
> >        0	c04189f6:	8b 00                	mov    (%eax),%eax
> >        0		return (cycle_t)hpet_readl(HPET_COUNTER);
> >        0	}
> >      300	c04189f8:	31 d2                	xor    %edx,%edx
> >        0	c04189fa:	5d                   	pop    %ebp
> >        0	c04189fb:	c3                   	ret    
> >        0	
> 
I'd like to get detailed information just like the above.

Thanks,
Yanmin

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] Performance Counters for Linux, v2
  2009-01-07  7:43     ` Zhang, Yanmin
@ 2009-01-09  1:07       ` Zhang, Yanmin
  0 siblings, 0 replies; 33+ messages in thread
From: Zhang, Yanmin @ 2009-01-09  1:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, linux-kernel, Thomas Gleixner, linux-arch,
	Andrew Morton, Stephane Eranian, Eric Dumazet, Robert Richter,
	Peter Anvin, Peter Zijlstra, Steven Rostedt, David Miller,
	Paul Mackerras

On Wed, 2009-01-07 at 15:43 +0800, Zhang, Yanmin wrote:
> On Mon, 2008-12-08 at 12:49 +0100, Ingo Molnar wrote:
> > * Arjan van de Ven <arjan@infradead.org> wrote:
> > 
> > > On Mon, 8 Dec 2008 02:22:12 +0100
> > > Ingo Molnar <mingo@elte.hu> wrote:
> > > 
> > > > 
> > > > [ Performance counters are special hardware registers available on
> > > > most modern CPUs. These register count the number of certain types of
> > > > hw events: such as instructions executed, cachemisses suffered, or 
> > > >   branches mis-predicted, without slowing down the kernel or 
> > > >   applications. These registers can also trigger interrupts when a 
> > > >   threshold number of events have passed - and can thus be used to 
> > > >   profile the code that runs on that CPU. ]
> > > > 
> > > > This is version 2 of our Performance Counters subsystem
> > > > implementation.
> > > > 
> > > > The biggest user-visible change in this release is a new user-space 
> > > > text-mode profiling utility that is based on this code: KernelTop.
> > > > 
> > > > KernelTop can be downloaded from:
> > > > 
> > > >   http://redhat.com/~mingo/perfcounters/kerneltop.c
> > > > 
> > > > It's a standalone .c file that needs no extra libraries - it only
> > > > needs a CONFIG_PERF_COUNTERS=y kernel to run on.
> > > > 
> > > > This utility is intended for kernel developers - it's basically a
> > > > dynamic kernel profiler that gets hardware counter events dispatched
> > > > to it continuously, which it feeds into a histogram and outputs it 
> > > > periodically.
> > > > 
> > > 
> > > I played with this a little, and while it works neat, I wanted a 
> > > feature added where it shows a detailed profile for the top function.
> > 
> > ah, very nice idea!
> > 
> > > I've hacked this in quickly (the usability isn't all that great yet) 
> > > and put the source code up at
> > >
> > > http://www.tglx.de/~arjan/kerneltop-0.02.tar.gz
> > 
> > ok, picked it up :-)
> Ingo,
> 
> I tried to use patch V5 and the latest kerneltop to collect some cachemiss data.
> 
> It seems kerneltop just shows the ip address of the first instruction of each
> function. Does the latest kerneltop include the enhancement from Arjan? As you
> know, with oprofile we can get the detailed instruction ip address which causes
> the cache miss, although that ip address mostly needs to be moved back by one
> instruction.
As a matter of fact, the original kerneltop has a parameter -s to support it. But
kerneltop has a bug in showing details of the symbol: sym_filter_entry should be
initialized after the qsort.

Below is an example.
#./kerneltop --vmlinux=/root/linux-2.6.28_slqb1230flush/vmlinux -d 20 -e 3 -f 1000 -s flush_free_list

------------------------------------------------------------------------------
 KernelTop:   20297 irqs/sec  [NMI, 10000 cache-misses],  (all, 8 CPUs)
------------------------------------------------------------------------------

             events         RIP          kernel function
  ______     ______   ________________   _______________

           12816.00 - ffffffff803d5760 : copy_user_generic_string!
           11751.00 - ffffffff80647a2c : unix_stream_recvmsg
           10215.00 - ffffffff805eda5f : sock_alloc_send_skb
            9738.00 - ffffffff80284821 : flush_free_list
            6749.00 - ffffffff802854a1 : __kmalloc_track_caller
            3663.00 - ffffffff805f09fa : skb_dequeue
            3591.00 - ffffffff80284be2 : kmem_cache_alloc       [qla2xxx]
            3501.00 - ffffffff805f15f5 : __alloc_skb
            1296.00 - ffffffff803d8eb4 : list_del       [qla2xxx]
            1110.00 - ffffffff805f0ed2 : kfree_skb
Showing details for flush_free_list
       0        ffffffff8028488a:       78 00 00 
       0        ffffffff8028488d:       49 8d 04 00             lea    (%r8,%rax,1),%rax
       0        ffffffff80284891:       4c 8b 31                mov    (%rcx),%r14
    1143        ffffffff80284894:       48 c1 e8 0c             shr    $0xc,%rax
       0        ffffffff80284898:       48 6b c0 38             imul   $0x38,%rax,%rax
       0        ffffffff8028489c:       48 8d 1c 10             lea    (%rax,%rdx,1),%rbx
       0        ffffffff802848a0:       48 8b 03                mov    (%rbx),%rax
    3195        ffffffff802848a3:       25 00 40 00 00          and    $0x4000,%eax


The disassembly of many functions is big, so the new kerneltop truncates it by filtering out
instructions whose count is smaller than count_filter. It only shows the instructions whose count
is at least count_filter, plus the 3 instructions preceding each reported instruction. For example, before
printing
3195 ffffffff802848a3: 25 00 40 00 00 and $0x4000,%eax
the new kerneltop prints the 3 preceding instructions:
0        ffffffff80284898:       48 6b c0 38             imul   $0x38,%rax,%rax
0        ffffffff8028489c:       48 8d 1c 10             lea    (%rax,%rdx,1),%rbx
0        ffffffff802848a0:       48 8b 03                mov    (%rbx),%rax

So users can quickly go back to find the instruction that really causes the event (not the
instruction reported by the performance counter).
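
Restated as a simplified, standalone C sketch (print_filtered() and the
sample data are invented for illustration; TRACE_COUNT and count_filter
correspond to the patch below), the new code keeps a small queue of
recent lines and flushes it whenever a hot line is found:

#include <stdio.h>

#define TRACE_COUNT 3

/* Print only lines with count >= count_filter, each preceded by up to
 * TRACE_COUNT context lines, mirroring the show_details() change below. */
static void print_filtered(const long *count, const char * const *line,
			   int nr_lines, long count_filter)
{
	int queued = 0, start = 0, i, j;

	for (i = 0; i < nr_lines; i++) {
		if (!queued)
			start = i;
		queued++;

		if (count[i] >= count_filter) {
			/* flush the queued context lines plus the hot line */
			for (j = start; j <= i; j++)
				printf("%8ld\t%s\n", count[j], line[j]);
			queued = 0;
		} else if (queued > TRACE_COUNT) {
			start++;	/* drop the oldest queued line */
			queued--;
		}
	}
}

int main(void)
{
	const char * const line[] = { "insn A", "insn B", "insn C",
				      "insn D", "hot insn E", "insn F" };
	long count[] = { 0, 0, 0, 0, 3195, 0 };

	/* prints insn B, C, D and the hot insn E */
	print_filtered(count, line, 6, 100);
	return 0;
}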

Below is the patch against the kerneltop version of Dec/23/2008.

yanmin

---

--- kerneltop.c.orig	2009-01-08 16:39:16.000000000 +0800
+++ kerneltop.c	2009-01-08 16:39:16.000000000 +0800
@@ -3,7 +3,7 @@
 
    Build with:
 
-     cc -O6 -Wall `pkg-config --cflags glib-2.0` -c -o kerneltop.o kerneltop.c
+     cc -O6 -Wall `pkg-config --cflags --libs glib-2.0` -o kerneltop kerneltop.c
 
    Sample output:
 
@@ -291,8 +291,6 @@ static void process_options(int argc, ch
 		else
 			event_count[counter] = 100000;
 	}
-	if (nr_counters == 1)
-		count_filter = 0;
 }
 
 static uint64_t			min_ip;
@@ -307,7 +305,7 @@ struct sym_entry {
 
 #define MAX_SYMS		100000
 
-static unsigned int sym_table_count;
+static int sym_table_count;
 
 struct sym_entry		*sym_filter_entry;
 
@@ -350,7 +348,7 @@ static struct sym_entry		tmp[MAX_SYMS];
 
 static void print_sym_table(void)
 {
-	unsigned int i, printed;
+	int i, printed;
 	int counter;
 
 	memcpy(tmp, sym_table, sizeof(sym_table[0])*sym_table_count);
@@ -494,7 +492,6 @@ static int read_symbol(FILE *in, struct 
 			printf("symbol filter start: %016lx\n", filter_start);
 			printf("                end: %016lx\n", filter_end);
 			filter_end = filter_start = 0;
-			sym_filter_entry = NULL;
 			sym_filter = NULL;
 			sleep(1);
 		}
@@ -502,7 +499,6 @@ static int read_symbol(FILE *in, struct 
 	if (filter_match == 0 && sym_filter && !strcmp(s->sym, sym_filter)) {
 		filter_match = 1;
 		filter_start = s->addr;
-		sym_filter_entry = s;
 	}
 
 	return 0;
@@ -538,6 +534,16 @@ static void parse_symbols(void)
 	last->sym = "<end>";
 
 	qsort(sym_table, sym_table_count, sizeof(sym_table[0]), compare_addr);
+
+	if (filter_end) {
+		int count;
+		for (count=0; count < sym_table_count; count ++) {
+			if (!strcmp(sym_table[count].sym, sym_filter)) {
+				sym_filter_entry = &sym_table[count];
+				break;
+			}
+		}
+	}
 }
 
 
@@ -617,11 +623,27 @@ static void lookup_sym_in_vmlinux(struct
 	}
 }
 
+void show_lines(GList *item_queue, int item_queue_count)
+{
+	int i;
+	struct source_line *line;
+
+	for (i = 0; i < item_queue_count; i++) {
+		line = item_queue->data;
+		printf("%8li\t%s\n", line->count, line->line);
+		item_queue = g_list_next(item_queue);
+	}
+}
+
+#define TRACE_COUNT     3
+
 static void show_details(struct sym_entry *sym)
 {
 	struct source_line *line;
 	GList *item;
 	int displayed = 0;
+	GList *item_queue;
+	int item_queue_count = 0;
 
 	if (!sym->source)
 		lookup_sym_in_vmlinux(sym);
@@ -633,16 +655,28 @@ static void show_details(struct sym_entr
 	item = sym->source;
 	while (item) {
 		line = item->data;
-		item = g_list_next(item);
 		if (displayed && strstr(line->line, ">:"))
 			break;
 
-		printf("%8li\t%s\n", line->count, line->line);
+		if (!item_queue_count)
+			item_queue = item;
+		item_queue_count ++;
+
+		if (line->count >= count_filter) {
+			show_lines(item_queue, item_queue_count);
+			item_queue_count = 0;
+			item_queue = NULL;
+		} else if (item_queue_count > TRACE_COUNT) {
+			item_queue = g_list_next(item_queue);
+			item_queue_count --;
+		}
+
+		line->count = 0;
 		displayed++;
 		if (displayed > 300)
 			break;
+		item = g_list_next(item);
 	}
-	exit(0);
 }
 
 /*

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2009-01-09  1:07 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-12-08  1:22 [patch] Performance Counters for Linux, v2 Ingo Molnar
2008-12-08  1:49 ` Arjan van de Ven
2008-12-08 11:49   ` Ingo Molnar
2009-01-07  7:43     ` Zhang, Yanmin
2009-01-09  1:07       ` Zhang, Yanmin
2008-12-08  3:24 ` Paul Mackerras
2008-12-08 11:33   ` Ingo Molnar
2008-12-08 12:02     ` David Miller
2008-12-08 14:41     ` Andi Kleen
2008-12-08 22:03     ` Paul Mackerras
2008-12-09 13:00       ` Ingo Molnar
2008-12-09 23:00         ` Paul Mackerras
2008-12-08  8:32 ` Corey J Ashford
2008-12-09  6:37 ` stephane eranian
2008-12-09 11:02   ` Ingo Molnar
2008-12-09 11:11     ` David Miller
2008-12-09 11:22       ` Ingo Molnar
2008-12-09 11:29         ` David Miller
2008-12-09 12:14           ` Paolo Ciarrocchi
2008-12-09 13:46   ` Ingo Molnar
2008-12-09 16:39     ` Chris Friesen
2008-12-09 19:02       ` Ingo Molnar
2008-12-09 19:51         ` Chris Friesen
2008-12-09 16:46     ` Will Newton
2008-12-09 17:35       ` Chris Friesen
2008-12-09 21:16     ` stephane eranian
2008-12-09 21:16       ` stephane eranian
2008-12-09 22:19     ` Paul Mackerras
2008-12-09 22:40       ` Andi Kleen
2008-12-10  4:44         ` Paul Mackerras
2008-12-10  5:03           ` stephane eranian
2008-12-10  5:03             ` stephane eranian
2008-12-10 10:26           ` Andi Kleen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).