public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/9] perf_counter patches
@ 2009-03-28 19:43 Peter Zijlstra
  2009-03-28 19:44 ` [PATCH 1/9] perf_counter: unify and fix delayed counter wakeup Peter Zijlstra
                   ` (8 more replies)
  0 siblings, 9 replies; 24+ messages in thread
From: Peter Zijlstra @ 2009-03-28 19:43 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel
  Cc: Paul Mackerras, Mike Galbraith, Arjan van de Ven, Wu Fengguang,
	Peter Zijlstra

This set fixes 2 problems I found while going over the perf_counter code.
 - the delayed wakeups were incomplete (and broken on x86)
 - update_userpage() wasn't used properly

It also adds tracking of the mmap() data so that we can profile userspace.

I also folded in Paulus' pending patches (I had to fix up a few
conflicts, please double check).

Finally, the writable control page patch... still not convinced we need that.
-- 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/9] perf_counter: unify and fix delayed counter wakeup
  2009-03-28 19:43 [PATCH 0/9] perf_counter patches Peter Zijlstra
@ 2009-03-28 19:44 ` Peter Zijlstra
  2009-03-29  0:14   ` Paul Mackerras
  2009-03-28 19:44 ` [PATCH 2/9] perf_counter: fix update_userpage() Peter Zijlstra
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2009-03-28 19:44 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel
  Cc: Paul Mackerras, Mike Galbraith, Arjan van de Ven, Wu Fengguang,
	Peter Zijlstra, Eric Dumazet

[-- Attachment #1: perf_counter-fix-delayed-wakeup.patch --]
[-- Type: text/plain, Size: 17744 bytes --]

While going over the wakeup code I noticed that delayed wakeups only
work for hardware counters, while basically all software counters rely
on them.

This patch unifies and generalizes the delayed wakeup to fix this
issue.

Since we're dealing with NMI context bits here, use a cmpxchg()-based
singly linked list implementation to track counters that have pending
wakeups.

[ This should really be generic code for delayed wakeups, but since we
  cannot use cmpxchg()/xchg() in generic code, I've let it live in the
  perf_counter code. -- Eric Dumazet could use it to aggregate the
  network wakeups. ]

Furthermore, the x86 method of using TIF flags was flawed in that it's
quite possible to end up setting the bit on the idle task, losing the
wakeup.

The powerpc method uses per-cpu storage and would appear to work -- it
would check the per-cpu pending state every time it enabled IRQs.
However, since I fully removed the x86 code, I pulled it as well.

If we find the current jiffy tick based approach too slow, we can
re-instate the powerpc method and implement something similar on x86.
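The cmpxchg()-based pending list described above can be sketched in
userspace with C11 atomics standing in for the kernel's cmpxchg()/xchg().
This is an illustrative model of the technique, not kernel code: the
names and the TAIL sentinel are made up for the example. A NULL `next`
means "not queued", so a single compare-and-swap both detects and claims
a pending entry; the consumer detaches the whole list with one exchange.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Sentinel marking both "end of list" and the empty-list head. */
#define PENDING_TAIL ((struct wakeup_entry *)-1UL)

struct wakeup_entry {
	struct wakeup_entry *_Atomic next;	/* NULL == not queued */
};

static struct wakeup_entry *_Atomic wakeup_head = PENDING_TAIL;

/* Queue an entry; returns 0 if it was already pending. */
static int pending_queue(struct wakeup_entry *entry)
{
	struct wakeup_entry *expected = NULL;

	/* Claim the entry with a single CAS: NULL -> TAIL. */
	if (!atomic_compare_exchange_strong(&entry->next, &expected,
					    PENDING_TAIL))
		return 0;

	/* Push onto the head of the list, retrying on contention. */
	struct wakeup_entry *prev = atomic_load(&wakeup_head);
	do {
		atomic_store(&entry->next, prev);
	} while (!atomic_compare_exchange_weak(&wakeup_head, &prev, entry));

	return 1;
}

/* Detach the whole list with one exchange; returns entries processed. */
static int pending_run(void)
{
	struct wakeup_entry *list = atomic_exchange(&wakeup_head, PENDING_TAIL);
	int nr = 0;

	while (list != PENDING_TAIL) {
		struct wakeup_entry *next = atomic_load(&list->next);

		atomic_store(&list->next, NULL);	/* mark "not queued" */
		/* the real code issues the wakeup for this counter here */
		list = next;
		nr++;
	}
	return nr;
}
```

The NMI-safe property comes from producers only ever CAS-ing the head
while the consumer detaches the entire list atomically, so no lock is
ever needed.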

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Eric Dumazet <dada1@cosmosbay.com>
---
 arch/powerpc/include/asm/hw_irq.h   |   39 -----------
 arch/powerpc/include/asm/paca.h     |    1 
 arch/powerpc/kernel/asm-offsets.c   |    1 
 arch/powerpc/kernel/entry_64.S      |    9 --
 arch/powerpc/kernel/irq.c           |    5 -
 arch/powerpc/kernel/perf_counter.c  |   33 ---------
 arch/x86/include/asm/perf_counter.h |    3 
 arch/x86/include/asm/thread_info.h  |    4 -
 arch/x86/kernel/cpu/perf_counter.c  |   29 --------
 arch/x86/kernel/signal.c            |    6 -
 include/linux/perf_counter.h        |   15 ++--
 kernel/perf_counter.c               |  126 +++++++++++++++++++++++++++++++++---
 kernel/timer.c                      |    3 
 13 files changed, 133 insertions(+), 141 deletions(-)

Index: linux-2.6/arch/powerpc/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/arch/powerpc/kernel/perf_counter.c
+++ linux-2.6/arch/powerpc/kernel/perf_counter.c
@@ -650,24 +650,6 @@ hw_perf_counter_init(struct perf_counter
 }
 
 /*
- * Handle wakeups.
- */
-void perf_counter_do_pending(void)
-{
-	int i;
-	struct cpu_hw_counters *cpuhw = &__get_cpu_var(cpu_hw_counters);
-	struct perf_counter *counter;
-
-	for (i = 0; i < cpuhw->n_counters; ++i) {
-		counter = cpuhw->counter[i];
-		if (counter && counter->wakeup_pending) {
-			counter->wakeup_pending = 0;
-			wake_up(&counter->waitq);
-		}
-	}
-}
-
-/*
  * A counter has overflowed; update its count and record
  * things if requested.  Note that interrupts are hard-disabled
  * here so there is no possibility of being interrupted.
@@ -720,7 +702,7 @@ static void perf_counter_interrupt(struc
 	struct cpu_hw_counters *cpuhw = &__get_cpu_var(cpu_hw_counters);
 	struct perf_counter *counter;
 	long val;
-	int need_wakeup = 0, found = 0;
+	int found = 0;
 
 	for (i = 0; i < cpuhw->n_counters; ++i) {
 		counter = cpuhw->counter[i];
@@ -754,19 +736,6 @@ static void perf_counter_interrupt(struc
 	 * we get back out of this interrupt.
 	 */
 	mtspr(SPRN_MMCR0, cpuhw->mmcr[0]);
-
-	/*
-	 * If we need a wakeup, check whether interrupts were soft-enabled
-	 * when we took the interrupt.  If they were, we can wake stuff up
-	 * immediately; otherwise we'll have do the wakeup when interrupts
-	 * get soft-enabled.
-	 */
-	if (get_perf_counter_pending() && regs->softe) {
-		irq_enter();
-		clear_perf_counter_pending();
-		perf_counter_do_pending();
-		irq_exit();
-	}
 }
 
 void hw_perf_counter_setup(int cpu)
Index: linux-2.6/arch/x86/kernel/cpu/perf_counter.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_counter.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_counter.c
@@ -227,7 +227,6 @@ static int __hw_perf_counter_init(struct
 		 */
 		hwc->config |= pmc_ops->event_map(perf_event_id(hw_event));
 	}
-	counter->wakeup_pending = 0;
 
 	return 0;
 }
@@ -773,34 +772,6 @@ void smp_perf_counter_interrupt(struct p
 	irq_exit();
 }
 
-/*
- * This handler is triggered by NMI contexts:
- */
-void perf_counter_notify(struct pt_regs *regs)
-{
-	struct cpu_hw_counters *cpuc;
-	unsigned long flags;
-	int bit, cpu;
-
-	local_irq_save(flags);
-	cpu = smp_processor_id();
-	cpuc = &per_cpu(cpu_hw_counters, cpu);
-
-	for_each_bit(bit, cpuc->used, X86_PMC_IDX_MAX) {
-		struct perf_counter *counter = cpuc->counters[bit];
-
-		if (!counter)
-			continue;
-
-		if (counter->wakeup_pending) {
-			counter->wakeup_pending = 0;
-			wake_up(&counter->waitq);
-		}
-	}
-
-	local_irq_restore(flags);
-}
-
 void perf_counters_lapic_init(int nmi)
 {
 	u32 apic_val;
Index: linux-2.6/include/linux/perf_counter.h
===================================================================
--- linux-2.6.orig/include/linux/perf_counter.h
+++ linux-2.6/include/linux/perf_counter.h
@@ -275,6 +275,10 @@ struct perf_mmap_data {
 	void 				*data_pages[0];
 };
 
+struct perf_wakeup_entry {
+	struct perf_wakeup_entry *next;
+};
+
 /**
  * struct perf_counter - performance counter kernel representation:
  */
@@ -350,7 +354,7 @@ struct perf_counter {
 	/* poll related */
 	wait_queue_head_t		waitq;
 	/* optional: for NMIs */
-	int				wakeup_pending;
+	struct perf_wakeup_entry	wakeup;
 
 	void (*destroy)(struct perf_counter *);
 	struct rcu_head			rcu_head;
@@ -427,7 +431,7 @@ extern void perf_counter_task_sched_out(
 extern void perf_counter_task_tick(struct task_struct *task, int cpu);
 extern void perf_counter_init_task(struct task_struct *child);
 extern void perf_counter_exit_task(struct task_struct *child);
-extern void perf_counter_notify(struct pt_regs *regs);
+extern void perf_counter_do_pending(void);
 extern void perf_counter_print_debug(void);
 extern void perf_counter_unthrottle(void);
 extern u64 hw_perf_save_disable(void);
@@ -461,7 +465,7 @@ static inline void
 perf_counter_task_tick(struct task_struct *task, int cpu)		{ }
 static inline void perf_counter_init_task(struct task_struct *child)	{ }
 static inline void perf_counter_exit_task(struct task_struct *child)	{ }
-static inline void perf_counter_notify(struct pt_regs *regs)		{ }
+static inline void perf_counter_do_pending(void)			{ }
 static inline void perf_counter_print_debug(void)			{ }
 static inline void perf_counter_unthrottle(void)			{ }
 static inline void hw_perf_restore(u64 ctrl)				{ }
@@ -469,8 +473,9 @@ static inline u64 hw_perf_save_disable(v
 static inline int perf_counter_task_disable(void)	{ return -EINVAL; }
 static inline int perf_counter_task_enable(void)	{ return -EINVAL; }
 
-static inline void perf_swcounter_event(u32 event, u64 nr,
-					int nmi, struct pt_regs *regs)	{ }
+static inline void
+perf_swcounter_event(u32 event, u64 nr, int nmi, struct pt_regs *regs)	{ }
+
 #endif
 
 #endif /* __KERNEL__ */
Index: linux-2.6/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/kernel/perf_counter.c
+++ linux-2.6/kernel/perf_counter.c
@@ -1197,8 +1197,12 @@ static void free_counter_rcu(struct rcu_
 	kfree(counter);
 }
 
+static void perf_pending_sync(struct perf_counter *counter);
+
 static void free_counter(struct perf_counter *counter)
 {
+	perf_pending_sync(counter);
+
 	if (counter->destroy)
 		counter->destroy(counter);
 
@@ -1529,6 +1533,116 @@ static const struct file_operations perf
 };
 
 /*
+ * Perf counter wakeup
+ *
+ * If there's data, ensure we set the poll() state and publish everything
+ * to user-space before waking everybody up.
+ */
+
+void perf_counter_wakeup(struct perf_counter *counter)
+{
+	struct perf_mmap_data *data;
+
+	rcu_read_lock();
+	data = rcu_dereference(counter->data);
+	if (data) {
+		(void)atomic_xchg(&data->wakeup, POLL_IN);
+		__perf_counter_update_userpage(counter, data);
+	}
+	rcu_read_unlock();
+
+	wake_up_all(&counter->waitq);
+}
+
+/*
+ * Pending wakeups
+ *
+ * Handle the case where we need to wake up from NMI (or rq->lock) context.
+ *
+ * The NMI bit means we cannot possibly take locks. Therefore, maintain a
+ * single linked list and use cmpxchg() to add entries lockless.
+ */
+
+#define PENDING_TAIL ((struct perf_wakeup_entry *)-1UL)
+
+static DEFINE_PER_CPU(struct perf_wakeup_entry *, perf_wakeup_head) = {
+	PENDING_TAIL,
+};
+
+static void perf_pending_queue(struct perf_counter *counter)
+{
+	struct perf_wakeup_entry **head;
+	struct perf_wakeup_entry *prev, *next;
+
+	if (cmpxchg(&counter->wakeup.next, NULL, PENDING_TAIL) != NULL)
+		return;
+
+	head = &get_cpu_var(perf_wakeup_head);
+
+	do {
+		prev = counter->wakeup.next = *head;
+		next = &counter->wakeup;
+	} while (cmpxchg(head, prev, next) != prev);
+
+	put_cpu_var(perf_wakeup_head);
+}
+
+static int __perf_pending_run(void)
+{
+	struct perf_wakeup_entry *list;
+	int nr = 0;
+
+	list = xchg(&__get_cpu_var(perf_wakeup_head), PENDING_TAIL);
+	while (list != PENDING_TAIL) {
+		struct perf_counter *counter = container_of(list,
+				struct perf_counter, wakeup);
+
+		list = list->next;
+
+		counter->wakeup.next = NULL;
+		/*
+		 * Ensure we observe the unqueue before we issue the wakeup,
+		 * so that we won't be waiting forever.
+		 * -- see perf_not_pending().
+		 */
+		smp_wmb();
+
+		perf_counter_wakeup(counter);
+		nr++;
+	}
+
+	return nr;
+}
+
+static inline int perf_not_pending(struct perf_counter *counter)
+{
+	/*
+	 * If we flush on whatever cpu we run, there is a chance we don't
+	 * need to wait.
+	 */
+	get_cpu();
+	__perf_pending_run();
+	put_cpu();
+
+	/*
+	 * Ensure we see the proper queue state before going to sleep
+	 * so that we do not miss the wakeup. -- see perf_pending_handle()
+	 */
+	smp_rmb();
+	return counter->wakeup.next == NULL;
+}
+
+static void perf_pending_sync(struct perf_counter *counter)
+{
+	wait_event(counter->waitq, perf_not_pending(counter));
+}
+
+void perf_counter_do_pending(void)
+{
+	__perf_pending_run();
+}
+
+/*
  * Output
  */
 
@@ -1611,13 +1725,10 @@ static void perf_output_copy(struct perf
 static void perf_output_end(struct perf_output_handle *handle, int nmi)
 {
 	if (handle->wakeup) {
-		(void)atomic_xchg(&handle->data->wakeup, POLL_IN);
-		__perf_counter_update_userpage(handle->counter, handle->data);
-		if (nmi) {
-			handle->counter->wakeup_pending = 1;
-			set_perf_counter_pending();
-		} else
-			wake_up(&handle->counter->waitq);
+		if (nmi)
+			perf_pending_queue(handle->counter);
+		else
+			perf_counter_wakeup(handle->counter);
 	}
 	rcu_read_unlock();
 }
@@ -2211,7 +2322,6 @@ perf_counter_alloc(struct perf_counter_h
 
 	counter->cpu			= cpu;
 	counter->hw_event		= *hw_event;
-	counter->wakeup_pending		= 0;
 	counter->group_leader		= group_leader;
 	counter->hw_ops			= NULL;
 	counter->ctx			= ctx;
Index: linux-2.6/arch/x86/kernel/signal.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/signal.c
+++ linux-2.6/arch/x86/kernel/signal.c
@@ -6,7 +6,6 @@
  *  2000-06-20  Pentium III FXSR, SSE support by Gareth Hughes
  *  2000-2002   x86-64 support by Andi Kleen
  */
-#include <linux/perf_counter.h>
 #include <linux/sched.h>
 #include <linux/mm.h>
 #include <linux/smp.h>
@@ -872,11 +871,6 @@ do_notify_resume(struct pt_regs *regs, v
 		tracehook_notify_resume(regs);
 	}
 
-	if (thread_info_flags & _TIF_PERF_COUNTERS) {
-		clear_thread_flag(TIF_PERF_COUNTERS);
-		perf_counter_notify(regs);
-	}
-
 #ifdef CONFIG_X86_32
 	clear_thread_flag(TIF_IRET);
 #endif /* CONFIG_X86_32 */
Index: linux-2.6/arch/powerpc/kernel/irq.c
===================================================================
--- linux-2.6.orig/arch/powerpc/kernel/irq.c
+++ linux-2.6/arch/powerpc/kernel/irq.c
@@ -135,11 +135,6 @@ notrace void raw_local_irq_restore(unsig
 			iseries_handle_interrupts();
 	}
 
-	if (get_perf_counter_pending()) {
-		clear_perf_counter_pending();
-		perf_counter_do_pending();
-	}
-
 	/*
 	 * if (get_paca()->hard_enabled) return;
 	 * But again we need to take care that gcc gets hard_enabled directly
Index: linux-2.6/arch/x86/include/asm/perf_counter.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/perf_counter.h
+++ linux-2.6/arch/x86/include/asm/perf_counter.h
@@ -84,9 +84,6 @@ union cpuid10_edx {
 #define MSR_ARCH_PERFMON_FIXED_CTR2			0x30b
 #define X86_PMC_IDX_FIXED_BUS_CYCLES			(X86_PMC_IDX_FIXED + 2)
 
-#define set_perf_counter_pending()	\
-		set_tsk_thread_flag(current, TIF_PERF_COUNTERS);
-
 #ifdef CONFIG_PERF_COUNTERS
 extern void init_hw_perf_counters(void);
 extern void perf_counters_lapic_init(int nmi);
Index: linux-2.6/arch/powerpc/include/asm/hw_irq.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/hw_irq.h
+++ linux-2.6/arch/powerpc/include/asm/hw_irq.h
@@ -131,44 +131,5 @@ static inline int irqs_disabled_flags(un
  */
 struct hw_interrupt_type;
 
-#ifdef CONFIG_PERF_COUNTERS
-static inline unsigned long get_perf_counter_pending(void)
-{
-	unsigned long x;
-
-	asm volatile("lbz %0,%1(13)"
-		: "=r" (x)
-		: "i" (offsetof(struct paca_struct, perf_counter_pending)));
-	return x;
-}
-
-static inline void set_perf_counter_pending(void)
-{
-	asm volatile("stb %0,%1(13)" : :
-		"r" (1),
-		"i" (offsetof(struct paca_struct, perf_counter_pending)));
-}
-
-static inline void clear_perf_counter_pending(void)
-{
-	asm volatile("stb %0,%1(13)" : :
-		"r" (0),
-		"i" (offsetof(struct paca_struct, perf_counter_pending)));
-}
-
-extern void perf_counter_do_pending(void);
-
-#else
-
-static inline unsigned long get_perf_counter_pending(void)
-{
-	return 0;
-}
-
-static inline void set_perf_counter_pending(void) {}
-static inline void clear_perf_counter_pending(void) {}
-static inline void perf_counter_do_pending(void) {}
-#endif /* CONFIG_PERF_COUNTERS */
-
 #endif	/* __KERNEL__ */
 #endif	/* _ASM_POWERPC_HW_IRQ_H */
Index: linux-2.6/arch/powerpc/include/asm/paca.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/paca.h
+++ linux-2.6/arch/powerpc/include/asm/paca.h
@@ -99,7 +99,6 @@ struct paca_struct {
 	u8 soft_enabled;		/* irq soft-enable flag */
 	u8 hard_enabled;		/* set if irqs are enabled in MSR */
 	u8 io_sync;			/* writel() needs spin_unlock sync */
-	u8 perf_counter_pending;	/* PM interrupt while soft-disabled */
 
 	/* Stuff for accurate time accounting */
 	u64 user_time;			/* accumulated usermode TB ticks */
Index: linux-2.6/arch/powerpc/kernel/asm-offsets.c
===================================================================
--- linux-2.6.orig/arch/powerpc/kernel/asm-offsets.c
+++ linux-2.6/arch/powerpc/kernel/asm-offsets.c
@@ -131,7 +131,6 @@ int main(void)
 	DEFINE(PACAKMSR, offsetof(struct paca_struct, kernel_msr));
 	DEFINE(PACASOFTIRQEN, offsetof(struct paca_struct, soft_enabled));
 	DEFINE(PACAHARDIRQEN, offsetof(struct paca_struct, hard_enabled));
-	DEFINE(PACAPERFPEND, offsetof(struct paca_struct, perf_counter_pending));
 	DEFINE(PACASLBCACHE, offsetof(struct paca_struct, slb_cache));
 	DEFINE(PACASLBCACHEPTR, offsetof(struct paca_struct, slb_cache_ptr));
 	DEFINE(PACACONTEXTID, offsetof(struct paca_struct, context.id));
Index: linux-2.6/arch/powerpc/kernel/entry_64.S
===================================================================
--- linux-2.6.orig/arch/powerpc/kernel/entry_64.S
+++ linux-2.6/arch/powerpc/kernel/entry_64.S
@@ -526,15 +526,6 @@ ALT_FW_FTR_SECTION_END_IFCLR(FW_FEATURE_
 2:
 	TRACE_AND_RESTORE_IRQ(r5);
 
-#ifdef CONFIG_PERF_COUNTERS
-	/* check paca->perf_counter_pending if we're enabling ints */
-	lbz	r3,PACAPERFPEND(r13)
-	and.	r3,r3,r5
-	beq	27f
-	bl	.perf_counter_do_pending
-27:
-#endif /* CONFIG_PERF_COUNTERS */
-
 	/* extract EE bit and use it to restore paca->hard_enabled */
 	ld	r3,_MSR(r1)
 	rldicl	r4,r3,49,63		/* r0 = (r3 >> 15) & 1 */
Index: linux-2.6/arch/x86/include/asm/thread_info.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/thread_info.h
+++ linux-2.6/arch/x86/include/asm/thread_info.h
@@ -83,7 +83,6 @@ struct thread_info {
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
 #define TIF_SECCOMP		8	/* secure computing */
 #define TIF_MCE_NOTIFY		10	/* notify userspace of an MCE */
-#define TIF_PERF_COUNTERS	11	/* notify perf counter work */
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* 32bit process */
 #define TIF_FORK		18	/* ret_from_fork */
@@ -107,7 +106,6 @@ struct thread_info {
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_MCE_NOTIFY		(1 << TIF_MCE_NOTIFY)
-#define _TIF_PERF_COUNTERS	(1 << TIF_PERF_COUNTERS)
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
 #define _TIF_FORK		(1 << TIF_FORK)
@@ -141,7 +139,7 @@ struct thread_info {
 
 /* Only used for 64 bit */
 #define _TIF_DO_NOTIFY_MASK						\
-	(_TIF_SIGPENDING|_TIF_MCE_NOTIFY|_TIF_PERF_COUNTERS|_TIF_NOTIFY_RESUME)
+	(_TIF_SIGPENDING|_TIF_MCE_NOTIFY|_TIF_NOTIFY_RESUME)
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW							\
Index: linux-2.6/kernel/timer.c
===================================================================
--- linux-2.6.orig/kernel/timer.c
+++ linux-2.6/kernel/timer.c
@@ -37,6 +37,7 @@
 #include <linux/delay.h>
 #include <linux/tick.h>
 #include <linux/kallsyms.h>
+#include <linux/perf_counter.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -1167,6 +1168,8 @@ static void run_timer_softirq(struct sof
 {
 	struct tvec_base *base = __get_cpu_var(tvec_bases);
 
+	perf_counter_do_pending();
+
 	hrtimer_run_pending();
 
 	if (time_after_eq(jiffies, base->timer_jiffies))

-- 



* [PATCH 2/9] perf_counter: fix update_userpage()
  2009-03-28 19:43 [PATCH 0/9] perf_counter patches Peter Zijlstra
  2009-03-28 19:44 ` [PATCH 1/9] perf_counter: unify and fix delayed counter wakeup Peter Zijlstra
@ 2009-03-28 19:44 ` Peter Zijlstra
  2009-03-29  0:24   ` Paul Mackerras
  2009-03-28 19:44 ` [PATCH 3/9] perf_counter: kerneltop: simplify data_head read Peter Zijlstra
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2009-03-28 19:44 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel
  Cc: Paul Mackerras, Mike Galbraith, Arjan van de Ven, Wu Fengguang,
	Peter Zijlstra

[-- Attachment #1: perf_counter-fix-update_userpage.patch --]
[-- Type: text/plain, Size: 4306 bytes --]

It just occurred to me that it is possible to have multiple contending
updates of the userpage (mmap information vs overflow vs counter).
This would break the seqlock logic.

It appears the arch code uses this from NMI context, so we cannot
possibly serialize its use; therefore, separate the data_head update
from it and let the rest return to its original use.

The arch code needs to make sure there are no contending callers by
disabling the counter before using it -- powerpc appears to do this
nicely.
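The seqlock protocol being protected here can be modelled in userspace
C11 (a minimal sketch; the struct is a simplified stand-in for
perf_counter_mmap_page, not the real layout). It also shows why nested
or contending writers break it: a second writer nested inside the first
bumps `lock` twice more, so a reader can observe an even sequence count
in the middle of an update and trust torn data.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Writers bump `lock` to odd before the update and back to even after
 * it; readers retry while `lock` is odd or changed under them. */
struct counter_page {
	_Atomic uint32_t lock;
	uint32_t index;
	int64_t  offset;
};

static void page_write(struct counter_page *pc, uint32_t index, int64_t offset)
{
	atomic_fetch_add(&pc->lock, 1);		/* odd: update in progress */
	atomic_thread_fence(memory_order_release);
	pc->index  = index;
	pc->offset = offset;
	atomic_thread_fence(memory_order_release);
	atomic_fetch_add(&pc->lock, 1);		/* even: update published */
}

static void page_read(struct counter_page *pc, uint32_t *index, int64_t *offset)
{
	uint32_t seq;

	do {
		while ((seq = atomic_load(&pc->lock)) & 1)
			;			/* writer active: spin */
		atomic_thread_fence(memory_order_acquire);
		*index  = pc->index;
		*offset = pc->offset;
		atomic_thread_fence(memory_order_acquire);
	} while (atomic_load(&pc->lock) != seq);	/* raced: retry */
}
```

Disabling the counter before the arch code updates the page, as the
commit message suggests, is what guarantees writers never nest.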

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/perf_counter.h |   35 +++++++++++++++++++++++++++++++++++
 kernel/perf_counter.c        |   38 +++++++++++++++++++++++---------------
 2 files changed, 58 insertions(+), 15 deletions(-)

Index: linux-2.6/include/linux/perf_counter.h
===================================================================
--- linux-2.6.orig/include/linux/perf_counter.h
+++ linux-2.6/include/linux/perf_counter.h
@@ -160,10 +160,45 @@ struct perf_counter_hw_event {
 struct perf_counter_mmap_page {
 	__u32	version;		/* version number of this structure */
 	__u32	compat_version;		/* lowest version this is compat with */
+
+	/*
+	 * Bits needed to read the hw counters in user-space.
+	 *
+	 * The index and offset should be read atomically using the seqlock:
+	 *
+	 *   __u32 seq, index;
+	 *   __s64 offset;
+	 *
+	 * again:
+	 *   rmb();
+	 *   seq = pc->lock;
+	 *
+	 *   if (unlikely(seq & 1)) {
+	 *     cpu_relax();
+	 *     goto again;
+	 *   }
+	 *
+	 *   index = pc->index;
+	 *   offset = pc->offset;
+	 *
+	 *   rmb();
+	 *   if (pc->lock != seq)
+	 *     goto again;
+	 *
+	 * After this, index contains architecture specific counter index + 1,
+	 * so that 0 means unavailable, offset contains the value to be added
+	 * to the result of the raw timer read to obtain this counter's value.
+	 */
 	__u32	lock;			/* seqlock for synchronization */
 	__u32	index;			/* hardware counter identifier */
 	__s64	offset;			/* add to hardware counter value */
 
+	/*
+	 * Control data for the mmap() data buffer.
+	 *
+	 * User-space should issue an rmb(), on SMP capable platforms, after
+	 * reading this value -- see perf_counter_wakeup().
+	 */
 	__u32   data_head;		/* head in the data section */
 };
 
Index: linux-2.6/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/kernel/perf_counter.c
+++ linux-2.6/kernel/perf_counter.c
@@ -1316,10 +1316,22 @@ static long perf_ioctl(struct file *file
 	return err;
 }
 
-static void __perf_counter_update_userpage(struct perf_counter *counter,
-					   struct perf_mmap_data *data)
+/*
+ * Callers need to ensure there can be no nesting of this function, otherwise
+ * the seqlock logic goes bad. We can not serialize this because the arch
+ * code calls this from NMI context.
+ */
+void perf_counter_update_userpage(struct perf_counter *counter)
 {
-	struct perf_counter_mmap_page *userpg = data->user_page;
+	struct perf_mmap_data *data;
+	struct perf_counter_mmap_page *userpg;
+
+	rcu_read_lock();
+	data = rcu_dereference(counter->data);
+	if (!data)
+		goto unlock;
+
+	userpg = data->user_page;
 
 	/*
 	 * Disable preemption so as to not let the corresponding user-space
@@ -1333,20 +1345,10 @@ static void __perf_counter_update_userpa
 	if (counter->state == PERF_COUNTER_STATE_ACTIVE)
 		userpg->offset -= atomic64_read(&counter->hw.prev_count);
 
-	userpg->data_head = atomic_read(&data->head);
 	smp_wmb();
 	++userpg->lock;
 	preempt_enable();
-}
-
-void perf_counter_update_userpage(struct perf_counter *counter)
-{
-	struct perf_mmap_data *data;
-
-	rcu_read_lock();
-	data = rcu_dereference(counter->data);
-	if (data)
-		__perf_counter_update_userpage(counter, data);
+unlock:
 	rcu_read_unlock();
 }
 
@@ -1547,7 +1549,13 @@ void perf_counter_wakeup(struct perf_cou
 	data = rcu_dereference(counter->data);
 	if (data) {
 		(void)atomic_xchg(&data->wakeup, POLL_IN);
-		__perf_counter_update_userpage(counter, data);
+		/*
+		 * Ensure all data writes are issued before updating the
+		 * user-space data head information. The matching rmb()
+		 * will be in userspace after reading this value.
+		 */
+		smp_wmb();
+		data->user_page->data_head = atomic_read(&data->head);
 	}
 	rcu_read_unlock();
 

-- 



* [PATCH 3/9] perf_counter: kerneltop: simplify data_head read
  2009-03-28 19:43 [PATCH 0/9] perf_counter patches Peter Zijlstra
  2009-03-28 19:44 ` [PATCH 1/9] perf_counter: unify and fix delayed counter wakeup Peter Zijlstra
  2009-03-28 19:44 ` [PATCH 2/9] perf_counter: fix update_userpage() Peter Zijlstra
@ 2009-03-28 19:44 ` Peter Zijlstra
  2009-03-28 19:44 ` [PATCH 4/9] perf_counter: executable mmap() information Peter Zijlstra
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2009-03-28 19:44 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel
  Cc: Paul Mackerras, Mike Galbraith, Arjan van de Ven, Wu Fengguang,
	Peter Zijlstra

[-- Attachment #1: kerneltop-new-mmap.patch --]
[-- Type: text/plain, Size: 852 bytes --]

Now that the kernel side changed, match up again.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 Documentation/perf_counter/kerneltop.c |   14 +-------------
 1 file changed, 1 insertion(+), 13 deletions(-)

Index: linux-2.6/Documentation/perf_counter/kerneltop.c
===================================================================
--- linux-2.6.orig/Documentation/perf_counter/kerneltop.c
+++ linux-2.6/Documentation/perf_counter/kerneltop.c
@@ -1125,22 +1125,10 @@ struct mmap_data {
 static unsigned int mmap_read_head(struct mmap_data *md)
 {
 	struct perf_counter_mmap_page *pc = md->base;
-	unsigned int seq, head;
-
-repeat:
-	rmb();
-	seq = pc->lock;
-
-	if (unlikely(seq & 1)) {
-		cpu_relax();
-		goto repeat;
-	}
+	int head;
 
 	head = pc->data_head;
-
 	rmb();
-	if (pc->lock != seq)
-		goto repeat;
 
 	return head;
 }

-- 



* [PATCH 4/9] perf_counter: executable mmap() information
  2009-03-28 19:43 [PATCH 0/9] perf_counter patches Peter Zijlstra
                   ` (2 preceding siblings ...)
  2009-03-28 19:44 ` [PATCH 3/9] perf_counter: kerneltop: simplify data_head read Peter Zijlstra
@ 2009-03-28 19:44 ` Peter Zijlstra
  2009-03-28 19:44 ` [PATCH 5/9] perf_counter: kerneltop: parse the mmap data stream Peter Zijlstra
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2009-03-28 19:44 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel
  Cc: Paul Mackerras, Mike Galbraith, Arjan van de Ven, Wu Fengguang,
	Peter Zijlstra, Andrew Morton

[-- Attachment #1: perf_counter-mmap-tracking.patch --]
[-- Type: text/plain, Size: 7409 bytes --]

Currently the profiling information returns userspace IPs but provides
no way to correlate them with userspace code. Userspace could look at
/proc/$pid/maps, but that might not be current, or even present any
more, by the time the IPs are analyzed.

Therefore, provide the means to track the mmap() information and emit
it in the output stream.

XXX: only covers mmap()/munmap(), mremap() and mprotect() are missing.
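The record sizing used in the patch (the file name length rounded up to
a u64 boundary, added to the fixed event header) can be checked with a
small userspace sketch. The struct below is a simplified stand-in for
the on-stream layout, not the exact kernel structure; note that, as in
the patch, the NUL terminator is not counted before rounding.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Round `len` up to the next multiple of `align` (a power of two),
 * as the kernel's ALIGN() macro does. */
static size_t align_up(size_t len, size_t align)
{
	return (len + align - 1) & ~(align - 1);
}

/* Fixed part of an MMAP/MUNMAP record, mirroring the patch's layout. */
struct mmap_record {
	uint32_t type, size;		/* stand-in for perf_event_header */
	uint32_t pid, tid;
	uint64_t start, len, pgoff;
};

/* Total on-stream size: fixed header plus the u64-padded file name. */
static size_t mmap_record_size(const char *name)
{
	return sizeof(struct mmap_record) +
	       align_up(strlen(name), sizeof(uint64_t));
}
```

Keeping every record a multiple of 8 bytes means consumers can walk the
mmap() buffer with naturally aligned u64 loads.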

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/perf_counter.h |   24 ++++++-
 kernel/perf_counter.c        |  145 +++++++++++++++++++++++++++++++++++++++++++
 mm/mmap.c                    |   10 ++
 3 files changed, 177 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/perf_counter.h
===================================================================
--- linux-2.6.orig/include/linux/perf_counter.h
+++ linux-2.6/include/linux/perf_counter.h
@@ -137,9 +137,11 @@ struct perf_counter_hw_event {
 				exclude_kernel :  1, /* ditto kernel          */
 				exclude_hv     :  1, /* ditto hypervisor      */
 				exclude_idle   :  1, /* don't count when idle */
-				include_tid    :  1, /* include the tid */
+				include_tid    :  1, /* include the tid       */
+				mmap           :  1, /* include mmap data     */
+				munmap         :  1, /* include munmap data   */
 
-				__reserved_1   : 54;
+				__reserved_1   : 52;
 
 	__u32			extra_config_len;
 	__u32			__reserved_4;
@@ -211,6 +213,9 @@ enum perf_event_type {
 	PERF_EVENT_IP		= 0,
 	PERF_EVENT_GROUP	= 1,
 
+	PERF_EVENT_MMAP		= 2,
+	PERF_EVENT_MUNMAP	= 3,
+
 	__PERF_EVENT_TID	= 0x100,
 };
 
@@ -491,6 +496,12 @@ static inline int is_software_counter(st
 
 extern void perf_swcounter_event(u32, u64, int, struct pt_regs *);
 
+extern void perf_counter_mmap(unsigned long addr, unsigned long len,
+			      unsigned long pgoff, struct file *file);
+
+extern void perf_counter_munmap(unsigned long addr, unsigned long len,
+				unsigned long pgoff, struct file *file);
+
 #else
 static inline void
 perf_counter_task_sched_in(struct task_struct *task, int cpu)		{ }
@@ -511,6 +522,15 @@ static inline int perf_counter_task_enab
 static inline void
 perf_swcounter_event(u32 event, u64 nr, int nmi, struct pt_regs *regs)	{ }
 
+
+static inline void
+perf_counter_mmap(unsigned long addr, unsigned long len,
+		  unsigned long pgoff, struct file *file)		{ }
+
+static inline void
+perf_counter_munmap(unsigned long addr, unsigned long len,
+		    unsigned long pgoff, struct file *file) 		{ }
+
 #endif
 
 #endif /* __KERNEL__ */
Index: linux-2.6/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/kernel/perf_counter.c
+++ linux-2.6/kernel/perf_counter.c
@@ -25,6 +25,7 @@
 #include <linux/anon_inodes.h>
 #include <linux/kernel_stat.h>
 #include <linux/perf_counter.h>
+#include <linux/dcache.h>
 
 #include <asm/irq_regs.h>
 
@@ -1842,6 +1843,150 @@ void perf_counter_output(struct perf_cou
 }
 
 /*
+ * mmap tracking
+ */
+
+struct perf_mmap_event {
+	struct file	*file;
+	char		*file_name;
+	int		file_size;
+
+	struct {
+		struct perf_event_header	header;
+
+		u32				pid;
+		u32				tid;
+		u64				start;
+		u64				len;
+		u64				pgoff;
+	} event;
+};
+
+static void perf_counter_mmap_output(struct perf_counter *counter,
+				     struct perf_mmap_event *mmap_event)
+{
+	struct perf_output_handle handle;
+	int size = mmap_event->event.header.size;
+	int ret = perf_output_begin(&handle, counter, size);
+
+	if (ret)
+		return;
+
+	perf_output_put(&handle, mmap_event->event);
+	perf_output_copy(&handle, mmap_event->file_name,
+				   mmap_event->file_size);
+	perf_output_end(&handle, 0);
+}
+
+static int perf_counter_mmap_match(struct perf_counter *counter,
+				   struct perf_mmap_event *mmap_event)
+{
+	if (counter->hw_event.mmap &&
+	    mmap_event->event.header.type == PERF_EVENT_MMAP)
+		return 1;
+
+	if (counter->hw_event.munmap &&
+	    mmap_event->event.header.type == PERF_EVENT_MUNMAP)
+		return 1;
+
+	return 0;
+}
+
+static void perf_counter_mmap_ctx(struct perf_counter_context *ctx,
+				  struct perf_mmap_event *mmap_event)
+{
+	struct perf_counter *counter;
+
+	if (system_state != SYSTEM_RUNNING || list_empty(&ctx->event_list))
+		return;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(counter, &ctx->event_list, event_entry) {
+		if (perf_counter_mmap_match(counter, mmap_event))
+			perf_counter_mmap_output(counter, mmap_event);
+	}
+	rcu_read_unlock();
+}
+
+static void perf_counter_mmap_event(struct perf_mmap_event *mmap_event)
+{
+	struct perf_cpu_context *cpuctx;
+	struct file *file = mmap_event->file;
+	unsigned int size;
+	char tmp[16];
+	char *buf = NULL;
+	char *name;
+
+	if (file) {
+		buf = kzalloc(PATH_MAX, GFP_KERNEL);
+		if (!buf) {
+			name = strncpy(tmp, "//enomem", sizeof(tmp));
+			goto got_name;
+		}
+		name = dentry_path(file->f_dentry, buf, PATH_MAX);
+		if (IS_ERR(name)) {
+			name = strncpy(tmp, "//toolong", sizeof(tmp));
+			goto got_name;
+		}
+	} else {
+		name = strncpy(tmp, "//anon", sizeof(tmp));
+		goto got_name;
+	}
+
+got_name:
+	size = ALIGN(strlen(name), sizeof(u64));
+
+	mmap_event->file_name = name;
+	mmap_event->file_size = size;
+
+	mmap_event->event.header.size = sizeof(mmap_event->event) + size;
+
+	cpuctx = &get_cpu_var(perf_cpu_context);
+	perf_counter_mmap_ctx(&cpuctx->ctx, mmap_event);
+	put_cpu_var(perf_cpu_context);
+
+	perf_counter_mmap_ctx(&current->perf_counter_ctx, mmap_event);
+
+	kfree(buf);
+}
+
+void perf_counter_mmap(unsigned long addr, unsigned long len,
+		       unsigned long pgoff, struct file *file)
+{
+	struct perf_mmap_event mmap_event = {
+		.file   = file,
+		.event  = {
+			.header = { .type = PERF_EVENT_MMAP, },
+			.pid	= current->group_leader->pid,
+			.tid	= current->pid,
+			.start  = addr,
+			.len    = len,
+			.pgoff  = pgoff,
+		},
+	};
+
+	perf_counter_mmap_event(&mmap_event);
+}
+
+void perf_counter_munmap(unsigned long addr, unsigned long len,
+			 unsigned long pgoff, struct file *file)
+{
+	struct perf_mmap_event mmap_event = {
+		.file   = file,
+		.event  = {
+			.header = { .type = PERF_EVENT_MUNMAP, },
+			.pid	= current->group_leader->pid,
+			.tid	= current->pid,
+			.start  = addr,
+			.len    = len,
+			.pgoff  = pgoff,
+		},
+	};
+
+	perf_counter_mmap_event(&mmap_event);
+}
+
+/*
  * Generic software counter infrastructure
  */
 
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -27,6 +27,7 @@
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
 #include <linux/mmu_notifier.h>
+#include <linux/perf_counter.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -1219,6 +1220,9 @@ munmap_back:
 	if (correct_wcount)
 		atomic_inc(&inode->i_writecount);
 out:
+	if (vm_flags & VM_EXEC)
+		perf_counter_mmap(addr, len, pgoff, file);
+
 	mm->total_vm += len >> PAGE_SHIFT;
 	vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT);
 	if (vm_flags & VM_LOCKED) {
@@ -1752,6 +1756,12 @@ static void remove_vma_list(struct mm_st
 	do {
 		long nrpages = vma_pages(vma);
 
+		if (vma->vm_flags & VM_EXEC) {
+			perf_counter_munmap(vma->vm_start,
+					nrpages << PAGE_SHIFT,
+					vma->vm_pgoff, vma->vm_file);
+		}
+
 		mm->total_vm -= nrpages;
 		vm_stat_account(mm, vma->vm_flags, vma->vm_file, -nrpages);
 		vma = remove_vma(vma);

-- 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 5/9] perf_counter: kerneltop: parse the mmap data stream
  2009-03-28 19:43 [PATCH 0/9] perf_counter patches Peter Zijlstra
                   ` (3 preceding siblings ...)
  2009-03-28 19:44 ` [PATCH 4/9] perf_counter: executable mmap() information Peter Zijlstra
@ 2009-03-28 19:44 ` Peter Zijlstra
  2009-03-28 19:44 ` [PATCH 6/9] perf_counter: powerpc: only reserve PMU hardware when we need it Peter Zijlstra
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2009-03-28 19:44 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel
  Cc: Paul Mackerras, Mike Galbraith, Arjan van de Ven, Wu Fengguang,
	Peter Zijlstra

[-- Attachment #1: kerneltop-mmap-events.patch --]
[-- Type: text/plain, Size: 4049 bytes --]

Frob the kerneltop code to print the mmap data in the stream.

A better use would be collecting the IPs per PID and mapping them onto
the provided userspace code -- TODO.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 Documentation/perf_counter/kerneltop.c |   50 +++++++++++++++++++++++++++++----
 1 file changed, 44 insertions(+), 6 deletions(-)

Index: linux-2.6/Documentation/perf_counter/kerneltop.c
===================================================================
--- linux-2.6.orig/Documentation/perf_counter/kerneltop.c
+++ linux-2.6/Documentation/perf_counter/kerneltop.c
@@ -184,6 +184,8 @@ static int			nmi				=  1;
 static int			group				=  0;
 static unsigned int		page_size;
 static unsigned int		mmap_pages			=  16;
+static int			use_mmap			= 0;
+static int			use_munmap			= 0;
 
 static char			*vmlinux;
 
@@ -333,6 +335,8 @@ static void display_help(void)
 	" -z        --zero             # zero counts after display\n"
 	" -D        --dump_symtab      # dump symbol table to stderr on startup\n"
 	" -m pages  --mmap_pages=<pages> # number of mmap data pages\n"
+	" -M        --mmap_info        # print mmap info stream\n"
+	" -U        --munmap_info      # print munmap info stream\n"
 	);
 
 	exit(0);
@@ -1052,9 +1056,11 @@ static void process_options(int argc, ch
 			{"stat",	no_argument,		NULL, 'S'},
 			{"zero",	no_argument,		NULL, 'z'},
 			{"mmap_pages",	required_argument,	NULL, 'm'},
+			{"mmap_info",	no_argument,		NULL, 'M'},
+			{"munmap_info",	no_argument,		NULL, 'U'},
 			{NULL,		0,			NULL,  0 }
 		};
-		int c = getopt_long(argc, argv, "+:ac:C:d:De:f:g:hn:m:p:s:Sx:z",
+		int c = getopt_long(argc, argv, "+:ac:C:d:De:f:g:hn:m:p:s:Sx:zMU",
 				    long_options, &option_index);
 		if (c == -1)
 			break;
@@ -1092,6 +1098,8 @@ static void process_options(int argc, ch
 		case 'x': vmlinux			= strdup(optarg); break;
 		case 'z': zero				=              1; break;
 		case 'm': mmap_pages			=   atoi(optarg); break;
+		case 'M': use_mmap			=              1; break;
+		case 'U': use_munmap			=              1; break;
 		default: error = 1; break;
 		}
 	}
@@ -1172,12 +1180,29 @@ static void mmap_read(struct mmap_data *
 	last_read = this_read;
 
 	for (; old != head;) {
-		struct event_struct {
+		struct ip_event {
 			struct perf_event_header header;
 			__u64 ip;
 			__u32 pid, tid;
-		} *event = (struct event_struct *)&data[old & md->mask];
-		struct event_struct event_copy;
+		};
+		struct mmap_event {
+			struct perf_event_header header;
+			__u32 pid, tid;
+			__u64 start;
+			__u64 len;
+			__u64 pgoff;
+			char filename[PATH_MAX];
+		};
+
+		typedef union event_union {
+			struct perf_event_header header;
+			struct ip_event ip;
+			struct mmap_event mmap;
+		} event_t;
+
+		event_t *event = (event_t *)&data[old & md->mask];
+
+		event_t event_copy;
 
 		unsigned int size = event->header.size;
 
@@ -1187,7 +1212,7 @@ static void mmap_read(struct mmap_data *
 		 */
 		if ((old & md->mask) + size != ((old + size) & md->mask)) {
 			unsigned int offset = old;
-			unsigned int len = sizeof(*event), cpy;
+			unsigned int len = min(sizeof(*event), size), cpy;
 			void *dst = &event_copy;
 
 			do {
@@ -1206,7 +1231,18 @@ static void mmap_read(struct mmap_data *
 		switch (event->header.type) {
 		case PERF_EVENT_IP:
 		case PERF_EVENT_IP | __PERF_EVENT_TID:
-			process_event(event->ip, md->counter);
+			process_event(event->ip.ip, md->counter);
+			break;
+
+		case PERF_EVENT_MMAP:
+		case PERF_EVENT_MUNMAP:
+			printf("%s: %Lu %Lu %Lu %s\n",
+					event->header.type == PERF_EVENT_MMAP
+					  ? "mmap" : "munmap",
+					event->mmap.start,
+					event->mmap.len,
+					event->mmap.pgoff,
+					event->mmap.filename);
 			break;
 		}
 	}
@@ -1255,6 +1291,8 @@ int main(int argc, char *argv[])
 			hw_event.record_type	= PERF_RECORD_IRQ;
 			hw_event.nmi		= nmi;
 			hw_event.include_tid	= 1;
+			hw_event.mmap		= use_mmap;
+			hw_event.munmap		= use_munmap;
 
 			fd[i][counter] = sys_perf_counter_open(&hw_event, tid, cpu, group_fd, 0);
 			if (fd[i][counter] < 0) {

-- 



* [PATCH 6/9] perf_counter: powerpc: only reserve PMU hardware when we need it
  2009-03-28 19:43 [PATCH 0/9] perf_counter patches Peter Zijlstra
                   ` (4 preceding siblings ...)
  2009-03-28 19:44 ` [PATCH 5/9] perf_counter: kerneltop: parse the mmap data stream Peter Zijlstra
@ 2009-03-28 19:44 ` Peter Zijlstra
  2009-03-28 19:44 ` [PATCH 7/9] perf_counter: make it possible for hw_perf_counter_init to return error codes Peter Zijlstra
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2009-03-28 19:44 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel
  Cc: Paul Mackerras, Mike Galbraith, Arjan van de Ven, Wu Fengguang,
	Peter Zijlstra

[-- Attachment #1: paulus-perf_counter-powerpc-only_reserve_PMU_hardware_when_we_need_it.patch --]
[-- Type: text/plain, Size: 3836 bytes --]

From: Paul Mackerras <paulus@samba.org>

Impact: cooperate with oprofile

At present, on PowerPC, if you have perf_counters compiled in, oprofile
doesn't work.  There is code to allow the PMU to be shared between
competing subsystems, such as perf_counters and oprofile, but currently
the perf_counter subsystem reserves the PMU for itself at boot time,
and never releases it.

This makes perf_counter play nicely with oprofile.  Now we keep a count
of how many perf_counter instances are counting hardware events, and
reserve the PMU when that count becomes non-zero, and release the PMU
when that count becomes zero.  This means that it is possible to have
perf_counters compiled in and still use oprofile, as long as there are
no hardware perf_counters active.  This also means that if oprofile is
active, sys_perf_counter_open will fail if the hw_event specifies a
hardware event.

To avoid races with other tasks creating and destroying perf_counters,
we use a mutex.  We use atomic_inc_not_zero and atomic_add_unless to
avoid having to take the mutex unless there is a possibility of the
count going between 0 and 1.
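The fast-path/slow-path shape described above can be sketched in user
space.  This is a hypothetical single-threaded illustration, not the
kernel code: the real implementation uses atomic_inc_not_zero(),
atomic_add_unless() and pmc_reserve_mutex, and the flag/helper names
below are made up for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

static int num_counters;     /* how many counters are using the PMU */
static int pmu_reserved;     /* stands in for reserve/release_pmc_hardware() */

/* atomic_inc_not_zero(): increment unless the value is 0. */
static bool inc_not_zero(int *v)
{
	if (*v == 0)
		return false;
	(*v)++;
	return true;
}

/* atomic_add_unless(v, -1, 1): decrement unless the value is 1. */
static bool dec_unless_one(int *v)
{
	if (*v == 1)
		return false;
	(*v)--;
	return true;
}

static void counter_init(void)
{
	if (!inc_not_zero(&num_counters)) {
		/* slow path: would take pmc_reserve_mutex here */
		if (num_counters == 0)
			pmu_reserved = 1;	/* reserve_pmc_hardware() */
		num_counters++;
	}
}

static void counter_destroy(void)
{
	if (!dec_unless_one(&num_counters)) {
		/* slow path: would take pmc_reserve_mutex here */
		if (--num_counters == 0)
			pmu_reserved = 0;	/* release_pmc_hardware() */
	}
}
```

The point of the two atomic helpers is that the mutex is only ever
taken when the count might cross the 0<->1 boundary, i.e. when the
reservation state could actually change.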

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---

 arch/powerpc/kernel/perf_counter.c |   47 +++++++++++++++++++++++++++++++++----
 1 file changed, 42 insertions(+), 5 deletions(-)

Index: linux-2.6/arch/powerpc/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/arch/powerpc/kernel/perf_counter.c
+++ linux-2.6/arch/powerpc/kernel/perf_counter.c
@@ -41,6 +41,8 @@ struct power_pmu *ppmu;
  */
 static unsigned int freeze_counters_kernel = MMCR0_FCS;
 
+static void perf_counter_interrupt(struct pt_regs *regs);
+
 void perf_counter_print_debug(void)
 {
 }
@@ -594,6 +596,24 @@ struct hw_perf_counter_ops power_perf_op
 	.read = power_perf_read
 };
 
+/* Number of perf_counters counting hardware events */
+static atomic_t num_counters;
+/* Used to avoid races in calling reserve/release_pmc_hardware */
+static DEFINE_MUTEX(pmc_reserve_mutex);
+
+/*
+ * Release the PMU if this is the last perf_counter.
+ */
+static void hw_perf_counter_destroy(struct perf_counter *counter)
+{
+	if (!atomic_add_unless(&num_counters, -1, 1)) {
+		mutex_lock(&pmc_reserve_mutex);
+		if (atomic_dec_return(&num_counters) == 0)
+			release_pmc_hardware();
+		mutex_unlock(&pmc_reserve_mutex);
+	}
+}
+
 const struct hw_perf_counter_ops *
 hw_perf_counter_init(struct perf_counter *counter)
 {
@@ -601,6 +621,7 @@ hw_perf_counter_init(struct perf_counter
 	struct perf_counter *ctrs[MAX_HWCOUNTERS];
 	unsigned int events[MAX_HWCOUNTERS];
 	int n;
+	int err;
 
 	if (!ppmu)
 		return NULL;
@@ -646,6 +667,27 @@ hw_perf_counter_init(struct perf_counter
 
 	counter->hw.config = events[n];
 	atomic64_set(&counter->hw.period_left, counter->hw_event.irq_period);
+
+	/*
+	 * See if we need to reserve the PMU.
+	 * If no counters are currently in use, then we have to take a
+	 * mutex to ensure that we don't race with another task doing
+	 * reserve_pmc_hardware or release_pmc_hardware.
+	 */
+	err = 0;
+	if (!atomic_inc_not_zero(&num_counters)) {
+		mutex_lock(&pmc_reserve_mutex);
+		if (atomic_read(&num_counters) == 0 &&
+		    reserve_pmc_hardware(perf_counter_interrupt))
+			err = -EBUSY;
+		else
+			atomic_inc(&num_counters);
+		mutex_unlock(&pmc_reserve_mutex);
+	}
+	counter->destroy = hw_perf_counter_destroy;
+
+	if (err)
+		return NULL;
 	return &power_perf_ops;
 }
 
@@ -756,11 +798,6 @@ static int init_perf_counters(void)
 {
 	unsigned long pvr;
 
-	if (reserve_pmc_hardware(perf_counter_interrupt)) {
-		printk(KERN_ERR "Couldn't init performance monitor subsystem\n");
-		return -EBUSY;
-	}
-
 	/* XXX should get this from cputable */
 	pvr = mfspr(SPRN_PVR);
 	switch (PVR_VER(pvr)) {

-- 



* [PATCH 7/9] perf_counter: make it possible for hw_perf_counter_init to return error codes
  2009-03-28 19:43 [PATCH 0/9] perf_counter patches Peter Zijlstra
                   ` (5 preceding siblings ...)
  2009-03-28 19:44 ` [PATCH 6/9] perf_counter: powerpc: only reserve PMU hardware when we need it Peter Zijlstra
@ 2009-03-28 19:44 ` Peter Zijlstra
  2009-03-30  4:13   ` Paul Mackerras
  2009-03-28 19:44 ` [PATCH 8/9] perf_counter tools: optionally scale counter values in perfstat mode Peter Zijlstra
  2009-03-28 19:44 ` [PATCH 9/9] RFC perf_counter: event overflow handling Peter Zijlstra
  8 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2009-03-28 19:44 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel
  Cc: Paul Mackerras, Mike Galbraith, Arjan van de Ven, Wu Fengguang,
	Peter Zijlstra

[-- Attachment #1: paulus-perf_counter-make_it_possible_for_hw_perf_counter_init_to_return_error_codes.patch --]
[-- Type: text/plain, Size: 5191 bytes --]

From: Paul Mackerras <paulus@samba.org>

Impact: better error reporting

At present, if hw_perf_counter_init encounters an error, all it can do
is return NULL, which causes sys_perf_counter_open to return an EINVAL
error to userspace.  This isn't very informative for userspace; it means
that userspace can't tell the difference between "sorry, oprofile is
already using the PMU" and "we don't support this CPU" and "this CPU
doesn't support the requested generic hardware event".

This commit uses the PTR_ERR/ERR_PTR/IS_ERR set of macros to let
hw_perf_counter_init return an error code on error rather than just NULL
if it wishes.  If it does so, that error code will be returned from
sys_perf_counter_open to userspace.  If it returns NULL, an EINVAL
error will be returned to userspace, as before.
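For readers unfamiliar with the convention, here is a minimal
user-space sketch of how the ERR_PTR/PTR_ERR/IS_ERR trio packs an
errno into a pointer-sized return value (the kernel versions live in
<linux/err.h>; backend_busy and counter_ops_init() are made-up names
standing in for the oprofile conflict and hw_perf_counter_init()):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Error codes occupy the top 4095 values of the pointer space, a
 * range no valid kernel pointer can fall into, so one return value
 * can carry either a valid pointer or a negative errno. */
#define MAX_ERRNO	4095

static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static int backend_busy;	/* assumed flag: someone else owns the PMU */

/* Hypothetical allocator in the style of hw_perf_counter_init(). */
static void *counter_ops_init(void)
{
	static int dummy_ops;		/* stands in for &power_perf_ops */

	if (backend_busy)
		return ERR_PTR(-EBUSY);	/* caller sees the real reason */
	return &dummy_ops;
}
```

The caller then does `if (IS_ERR(ops)) return PTR_ERR(ops);` instead
of collapsing every failure to EINVAL, which is exactly the change
made to sys_perf_counter_open above.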

This also adapts the powerpc hw_perf_counter_init to make use of this
to return ENXIO, EINVAL, EBUSY, or EOPNOTSUPP as appropriate.  It would
be good to add extra error numbers in future to allow userspace to
distinguish the various errors that are currently reported as EINVAL,
i.e. irq_period < 0, too many events in a group, conflict between
exclude_* settings in a group, and PMU resource conflict in a group.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---

 arch/powerpc/kernel/perf_counter.c |   14 +++++++-------
 kernel/perf_counter.c              |   33 +++++++++++++++++++++------------
 2 files changed, 28 insertions(+), 19 deletions(-)

Index: linux-2.6/arch/powerpc/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/arch/powerpc/kernel/perf_counter.c
+++ linux-2.6/arch/powerpc/kernel/perf_counter.c
@@ -624,13 +624,13 @@ hw_perf_counter_init(struct perf_counter
 	int err;
 
 	if (!ppmu)
-		return NULL;
+		return ERR_PTR(-ENXIO);
 	if ((s64)counter->hw_event.irq_period < 0)
-		return NULL;
+		return ERR_PTR(-EINVAL);
 	if (!perf_event_raw(&counter->hw_event)) {
 		ev = perf_event_id(&counter->hw_event);
 		if (ev >= ppmu->n_generic || ppmu->generic_events[ev] == 0)
-			return NULL;
+			return ERR_PTR(-EOPNOTSUPP);
 		ev = ppmu->generic_events[ev];
 	} else {
 		ev = perf_event_config(&counter->hw_event);
@@ -656,14 +656,14 @@ hw_perf_counter_init(struct perf_counter
 		n = collect_events(counter->group_leader, ppmu->n_counter - 1,
 				   ctrs, events);
 		if (n < 0)
-			return NULL;
+			return ERR_PTR(-EINVAL);
 	}
 	events[n] = ev;
 	ctrs[n] = counter;
 	if (check_excludes(ctrs, n, 1))
-		return NULL;
+		return ERR_PTR(-EINVAL);
 	if (power_check_constraints(events, n + 1))
-		return NULL;
+		return ERR_PTR(-EINVAL);
 
 	counter->hw.config = events[n];
 	atomic64_set(&counter->hw.period_left, counter->hw_event.irq_period);
@@ -687,7 +687,7 @@ hw_perf_counter_init(struct perf_counter
 	counter->destroy = hw_perf_counter_destroy;
 
 	if (err)
-		return NULL;
+		return ERR_PTR(err);
 	return &power_perf_ops;
 }
 
Index: linux-2.6/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/kernel/perf_counter.c
+++ linux-2.6/kernel/perf_counter.c
@@ -2451,10 +2451,11 @@ perf_counter_alloc(struct perf_counter_h
 {
 	const struct hw_perf_counter_ops *hw_ops;
 	struct perf_counter *counter;
+	long err;
 
 	counter = kzalloc(sizeof(*counter), gfpflags);
 	if (!counter)
-		return NULL;
+		return ERR_PTR(-ENOMEM);
 
 	/*
 	 * Single counters are their own group leaders, with an
@@ -2504,9 +2505,15 @@ perf_counter_alloc(struct perf_counter_h
 		break;
 	}
 
-	if (!hw_ops) {
+	err = 0;
+	if (!hw_ops)
+		err = -EINVAL;
+	else if (IS_ERR(hw_ops))
+		err = PTR_ERR(hw_ops);
+
+	if (err) {
 		kfree(counter);
-		return NULL;
+		return ERR_PTR(err);
 	}
 done:
 	counter->hw_ops = hw_ops;
@@ -2581,10 +2588,10 @@ SYSCALL_DEFINE5(perf_counter_open,
 			goto err_put_context;
 	}
 
-	ret = -EINVAL;
 	counter = perf_counter_alloc(&hw_event, cpu, ctx, group_leader,
 				     GFP_KERNEL);
-	if (!counter)
+	ret = PTR_ERR(counter);
+	if (IS_ERR(counter))
 		goto err_put_context;
 
 	ret = anon_inode_getfd("[perf_counter]", &perf_fops, counter, 0);
@@ -2656,8 +2663,8 @@ inherit_counter(struct perf_counter *par
 	child_counter = perf_counter_alloc(&parent_counter->hw_event,
 					   parent_counter->cpu, child_ctx,
 					   group_leader, GFP_KERNEL);
-	if (!child_counter)
-		return NULL;
+	if (IS_ERR(child_counter))
+		return child_counter;
 
 	/*
 	 * Link it up in the child's context:
@@ -2708,15 +2715,17 @@ static int inherit_group(struct perf_cou
 {
 	struct perf_counter *leader;
 	struct perf_counter *sub;
+	struct perf_counter *child_ctr;
 
 	leader = inherit_counter(parent_counter, parent, parent_ctx,
 				 child, NULL, child_ctx);
-	if (!leader)
-		return -ENOMEM;
+	if (IS_ERR(leader))
+		return PTR_ERR(leader);
 	list_for_each_entry(sub, &parent_counter->sibling_list, list_entry) {
-		if (!inherit_counter(sub, parent, parent_ctx,
-				     child, leader, child_ctx))
-			return -ENOMEM;
+		child_ctr = inherit_counter(sub, parent, parent_ctx,
+					    child, leader, child_ctx);
+		if (IS_ERR(child_ctr))
+			return PTR_ERR(child_ctr);
 	}
 	return 0;
 }

-- 



* [PATCH 8/9] perf_counter tools: optionally scale counter values in perfstat mode
  2009-03-28 19:43 [PATCH 0/9] perf_counter patches Peter Zijlstra
                   ` (6 preceding siblings ...)
  2009-03-28 19:44 ` [PATCH 7/9] perf_counter: make it possible for hw_perf_counter_init to return error codes Peter Zijlstra
@ 2009-03-28 19:44 ` Peter Zijlstra
  2009-03-28 19:44 ` [PATCH 9/9] RFC perf_counter: event overlow handling Peter Zijlstra
  8 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2009-03-28 19:44 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel
  Cc: Paul Mackerras, Mike Galbraith, Arjan van de Ven, Wu Fengguang,
	Peter Zijlstra

[-- Attachment #1: paulus-perf_counter_tools-optionally_scale_counter_values_in_perfstat_mode.patch --]
[-- Type: text/plain, Size: 5885 bytes --]

From: Paul Mackerras <paulus@samba.org>

Impact: new functionality

This adds an option to the perfstat mode of kerneltop to scale the
reported counter values according to the fraction of time that each
counter gets to count.  This is invoked with the -l option (I used 'l'
because s, c, a and e were all taken already).  This uses the new
PERF_FORMAT_TOTAL_TIME_{ENABLED,RUNNING} read format options.

With this, we get output like this:

$ ./perfstat -l -e 0:0,0:1,0:2,0:3,0:4,0:5 ./spin

 Performance counter stats for './spin':

     4016072055  CPU cycles           (events)  (scaled from 66.53%)
     2005887318  instructions         (events)  (scaled from 66.53%)
        1762849  cache references     (events)  (scaled from 66.69%)
         165229  cache misses         (events)  (scaled from 66.85%)
     1001298009  branches             (events)  (scaled from 66.78%)
          41566  branch misses        (events)  (scaled from 66.61%)

 Wall-clock time elapsed:  2438.227446 msecs

This also lets us detect when a counter is zero because the counter
never got to go on the CPU at all.  In that case we print <not counted>
rather than 0.
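The scaling arithmetic in do_perfstat() boils down to extrapolating
the raw count by time_enabled/time_running; a small stand-alone sketch
of that calculation:

```c
#include <assert.h>

typedef unsigned long long u64;

/* With PERF_FORMAT_TOTAL_TIME_{ENABLED,RUNNING}, read() returns
 * { count, time_enabled, time_running }.  When counters were
 * multiplexed (running < enabled), extrapolate the raw count to
 * estimate what it would have read had it been scheduled the whole
 * time; running == 0 means the counter never got on the CPU. */
static u64 scale_count(u64 count, u64 enabled, u64 running)
{
	if (running == 0)		/* "<not counted>" case */
		return 0;
	if (running < enabled)		/* counter was multiplexed */
		return (u64)((double)count * enabled / running + 0.5);
	return count;
}
```

The "(scaled from 66.53%)" annotations in the output above are simply
running/enabled expressed as a percentage.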

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 Documentation/perf_counter/kerneltop.c |   56 ++++++++++++++++++++++++++-------
 1 file changed, 45 insertions(+), 11 deletions(-)

Index: linux-2.6/Documentation/perf_counter/kerneltop.c
===================================================================
--- linux-2.6.orig/Documentation/perf_counter/kerneltop.c
+++ linux-2.6/Documentation/perf_counter/kerneltop.c
@@ -197,6 +197,8 @@ static int			delay_secs			=  2;
 static int			zero;
 static int			dump_symtab;
 
+static int			scale;
+
 struct source_line {
 	uint64_t		EIP;
 	unsigned long		count;
@@ -305,6 +307,7 @@ static void display_perfstat_help(void)
 	display_events_help();
 
 	printf(
+	" -l                           # scale counter values\n"
 	" -a                           # system-wide collection\n");
 	exit(0);
 }
@@ -328,6 +331,7 @@ static void display_help(void)
 	" -c CNT    --count=CNT        # event period to sample\n\n"
 	" -C CPU    --cpu=CPU          # CPU (-1 for all)                 [default: -1]\n"
 	" -p PID    --pid=PID          # PID of sampled task (-1 for all) [default: -1]\n\n"
+	" -l                           # show scale factor for RR events\n"
 	" -d delay  --delay=<seconds>  # sampling/display delay           [default:  2]\n"
 	" -f CNT    --filter=CNT       # min-event-count filter          [default: 100]\n\n"
 	" -s symbol --symbol=<symbol>  # function to be showed annotated one-shot\n"
@@ -436,6 +440,9 @@ static void create_perfstat_counter(int 
 	hw_event.config		= event_id[counter];
 	hw_event.record_type	= PERF_RECORD_SIMPLE;
 	hw_event.nmi		= 0;
+	if (scale)
+		hw_event.read_format	= PERF_FORMAT_TOTAL_TIME_ENABLED |
+					  PERF_FORMAT_TOTAL_TIME_RUNNING;
 
 	if (system_wide) {
 		int cpu;
@@ -507,28 +514,53 @@ int do_perfstat(int argc, char *argv[])
 	fprintf(stderr, "\n");
 
 	for (counter = 0; counter < nr_counters; counter++) {
-		int cpu;
-		__u64 count, single_count;
+		int cpu, nv;
+		__u64 count[3], single_count[3];
+		int scaled;
 
-		count = 0;
+		count[0] = count[1] = count[2] = 0;
+		nv = scale ? 3 : 1;
 		for (cpu = 0; cpu < nr_cpus; cpu ++) {
 			res = read(fd[cpu][counter],
-					(char *) &single_count, sizeof(single_count));
-			assert(res == sizeof(single_count));
-			count += single_count;
+				   single_count, nv * sizeof(__u64));
+			assert(res == nv * sizeof(__u64));
+
+			count[0] += single_count[0];
+			if (scale) {
+				count[1] += single_count[1];
+				count[2] += single_count[2];
+			}
+		}
+
+		scaled = 0;
+		if (scale) {
+			if (count[2] == 0) {
+				fprintf(stderr, " %14s  %-20s\n",
+					"<not counted>", event_name(counter));
+				continue;
+			}
+			if (count[2] < count[1]) {
+				scaled = 1;
+				count[0] = (unsigned long long)
+					((double)count[0] * count[1] / count[2] + 0.5);
+			}
 		}
 
 		if (event_id[counter] == EID(PERF_TYPE_SOFTWARE, PERF_COUNT_CPU_CLOCK) ||
 		    event_id[counter] == EID(PERF_TYPE_SOFTWARE, PERF_COUNT_TASK_CLOCK)) {
 
-			double msecs = (double)count / 1000000;
+			double msecs = (double)count[0] / 1000000;
 
-			fprintf(stderr, " %14.6f  %-20s (msecs)\n",
+			fprintf(stderr, " %14.6f  %-20s (msecs)",
 				msecs, event_name(counter));
 		} else {
-			fprintf(stderr, " %14Ld  %-20s (events)\n",
-				count, event_name(counter));
+			fprintf(stderr, " %14Ld  %-20s (events)",
+				count[0], event_name(counter));
 		}
+		if (scaled)
+			fprintf(stderr, "  (scaled from %.2f%%)",
+				(double) count[2] / count[1] * 100);
+		fprintf(stderr, "\n");
 	}
 	fprintf(stderr, "\n");
 	fprintf(stderr, " Wall-clock time elapsed: %12.6f msecs\n",
@@ -1049,6 +1081,7 @@ static void process_options(int argc, ch
 			{"filter",	required_argument,	NULL, 'f'},
 			{"group",	required_argument,	NULL, 'g'},
 			{"help",	no_argument,		NULL, 'h'},
+			{"scale",	no_argument,		NULL, 'l'},
 			{"nmi",		required_argument,	NULL, 'n'},
 			{"pid",		required_argument,	NULL, 'p'},
 			{"vmlinux",	required_argument,	NULL, 'x'},
@@ -1060,7 +1093,7 @@ static void process_options(int argc, ch
 			{"munmap_info",	no_argument,		NULL, 'U'},
 			{NULL,		0,			NULL,  0 }
 		};
-		int c = getopt_long(argc, argv, "+:ac:C:d:De:f:g:hn:m:p:s:Sx:zMU",
+		int c = getopt_long(argc, argv, "+:ac:C:d:De:f:g:hln:m:p:s:Sx:zMU",
 				    long_options, &option_index);
 		if (c == -1)
 			break;
@@ -1084,6 +1117,7 @@ static void process_options(int argc, ch
 		case 'f': count_filter			=   atoi(optarg); break;
 		case 'g': group				=   atoi(optarg); break;
 		case 'h':      				  display_help(); break;
+		case 'l': scale				=	       1; break;
 		case 'n': nmi				=   atoi(optarg); break;
 		case 'p':
 			/* CPU and PID are mutually exclusive */

-- 



* [PATCH 9/9] RFC perf_counter: event overflow handling
  2009-03-28 19:43 [PATCH 0/9] perf_counter patches Peter Zijlstra
                   ` (7 preceding siblings ...)
  2009-03-28 19:44 ` [PATCH 8/9] perf_counter tools: optionally scale counter values in perfstat mode Peter Zijlstra
@ 2009-03-28 19:44 ` Peter Zijlstra
  8 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2009-03-28 19:44 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel
  Cc: Paul Mackerras, Mike Galbraith, Arjan van de Ven, Wu Fengguang,
	Peter Zijlstra

[-- Attachment #1: perf_counter-mmap-write.patch --]
[-- Type: text/plain, Size: 8511 bytes --]

Alternative method of mmap() data output handling that provides better
overflow management.

Unlike the previous method, which didn't have any user->kernel
feedback and relied on userspace keeping up, this method relies on
userspace writing its last-read position into the control page.

It ensures new output doesn't overwrite not-yet-read events; new
events for which there is no space left are lost and the overflow
counter is incremented, providing exact event-loss numbers.
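The overwrite check that makes this possible is the modular-arithmetic
test in perf_output_overflow() below; a user-space sketch of the same
test (buf_size assumed to be a power of two, as the mmap data area is):

```c
#include <assert.h>
#include <stdbool.h>

/* All positions are free-running unsigned offsets into a power-of-two
 * ring buffer.  Rebasing write start/end against the user-written
 * tail turns "would this write pass the reader?" into a simple signed
 * comparison, and keeps working across counter wrap-around. */
static bool output_would_overflow(unsigned int buf_size,
				  unsigned int tail,	/* user's last-read pos */
				  unsigned int offset,	/* write start */
				  unsigned int head)	/* write end */
{
	unsigned int mask = buf_size - 1;

	offset = (offset - tail) & mask;
	head   = (head   - tail) & mask;

	/* head landing "before" offset (mod size) means the record
	 * wrapped past the tail, i.e. would clobber unread data */
	return (int)(head - offset) < 0;
}
```

When the test fires, the kernel drops the event, bumps the overflow
count, and wakes the reader instead of writing.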

Untested -- not sure it's really worth the overhead; the most
important thing to know is _if_ you're losing data, and either method
allows for that.

Still unsure about this; Paulus seems to agree.

Maybe-Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/perf_counter.h |   14 ++++-
 kernel/perf_counter.c        |  114 +++++++++++++++++++++++++++++++++----------
 2 files changed, 101 insertions(+), 27 deletions(-)

Index: linux-2.6/include/linux/perf_counter.h
===================================================================
--- linux-2.6.orig/include/linux/perf_counter.h
+++ linux-2.6/include/linux/perf_counter.h
@@ -198,10 +198,18 @@ struct perf_counter_mmap_page {
 	/*
 	 * Control data for the mmap() data buffer.
 	 *
-	 * User-space reading this value should issue an rmb(), on SMP capable
-	 * platforms, after reading this value. -- see perf_counter_wakeup().
+	 * User-space reading the @data_head value should issue an rmb(), on
+	 * SMP capable platforms, after reading this value -- see
+	 * perf_counter_wakeup().
+	 *
+	 * When the mapping is PROT_WRITE the @data_tail value should be
+	 * written by userspace to reflect the last read data. In this case
+	 * the kernel will not over-write unread data, and @overflow will
+	 * contain the number of events that were lost due to this.
 	 */
 	__u32   data_head;		/* head in the data section */
+	__u32	data_tail;		/* user-space written tail */
+	__u32	overflow;		/* number of lost events */
 };
 
 struct perf_event_header {
@@ -309,8 +317,10 @@ struct file;
 struct perf_mmap_data {
 	struct rcu_head			rcu_head;
 	int				nr_pages;
+	int				writable;
 	atomic_t			wakeup;
 	atomic_t			head;
+	atomic_t			overflow;
 	struct perf_counter_mmap_page   *user_page;
 	void 				*data_pages[0];
 };
Index: linux-2.6/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/kernel/perf_counter.c
+++ linux-2.6/kernel/perf_counter.c
@@ -1382,6 +1382,26 @@ unlock:
 	return ret;
 }
 
+static int perf_mmap_mkwrite(struct vm_area_struct *vma, struct page *page)
+{
+	struct perf_counter *counter = vma->vm_file->private_data;
+	struct perf_mmap_data *data;
+	int ret = -EINVAL;
+
+	rcu_read_lock();
+	data = rcu_dereference(counter->data);
+
+	/*
+	 * Only allow writes to the control page.
+	 */
+	if (data && (page == virt_to_page(data->user_page)))
+		ret = 0;
+
+	rcu_read_unlock();
+
+	return ret;
+}
+
 static int perf_mmap_data_alloc(struct perf_counter *counter, int nr_pages)
 {
 	struct perf_mmap_data *data;
@@ -1467,9 +1487,10 @@ static void perf_mmap_close(struct vm_ar
 }
 
 static struct vm_operations_struct perf_mmap_vmops = {
-	.open = perf_mmap_open,
-	.close = perf_mmap_close,
-	.fault = perf_mmap_fault,
+	.open		= perf_mmap_open,
+	.close		= perf_mmap_close,
+	.fault		= perf_mmap_fault,
+	.page_mkwrite	= perf_mmap_mkwrite,
 };
 
 static int perf_mmap(struct file *file, struct vm_area_struct *vma)
@@ -1480,7 +1501,7 @@ static int perf_mmap(struct file *file, 
 	unsigned long locked, lock_limit;
 	int ret = 0;
 
-	if (!(vma->vm_flags & VM_SHARED) || (vma->vm_flags & VM_WRITE))
+	if (!(vma->vm_flags & VM_SHARED))
 		return -EINVAL;
 
 	vma_size = vma->vm_end - vma->vm_start;
@@ -1510,16 +1531,19 @@ static int perf_mmap(struct file *file, 
 
 	mutex_lock(&counter->mmap_mutex);
 	if (atomic_inc_not_zero(&counter->mmap_count))
-		goto out;
+		goto unlock;
 
 	WARN_ON(counter->data);
 	ret = perf_mmap_data_alloc(counter, nr_pages);
-	if (!ret)
-		atomic_set(&counter->mmap_count, 1);
-out:
+	if (ret)
+		goto unlock;
+
+	atomic_set(&counter->mmap_count, 1);
+	if (vma->vm_flags & VM_WRITE)
+		counter->data->writable = 1;
+unlock:
 	mutex_unlock(&counter->mmap_mutex);
 
-	vma->vm_flags &= ~VM_MAYWRITE;
 	vma->vm_flags |= VM_RESERVED;
 	vma->vm_ops = &perf_mmap_vmops;
 
@@ -1550,6 +1574,7 @@ void perf_counter_wakeup(struct perf_cou
 	data = rcu_dereference(counter->data);
 	if (data) {
 		(void)atomic_xchg(&data->wakeup, POLL_IN);
+		data->user_page->overflow = atomic_read(&data->overflow);
 		/*
 		 * Ensure all data writes are issued before updating the
 		 * user-space data head information. The matching rmb()
@@ -1661,10 +1686,46 @@ struct perf_output_handle {
 	unsigned int		offset;
 	unsigned int		head;
 	int			wakeup;
+	int			nmi;
 };
 
+static bool perf_output_overflow(struct perf_mmap_data *data,
+				 unsigned int offset, unsigned int head)
+{
+	unsigned int tail;
+	unsigned int mask;
+
+	if (!data->writable)
+		return false;
+
+	mask = (data->nr_pages << PAGE_SHIFT) - 1;
+	/*
+	 * Userspace could choose to issue a wmb() after updating the tail
+	 * pointer. This ensures we get to see the tail move sooner.
+	 */
+	smp_rmb();
+	tail = ACCESS_ONCE(data->user_page->data_tail);
+
+	offset = (offset - tail) & mask;
+	head   = (head   - tail) & mask;
+
+	if ((int)(head - offset) < 0)
+		return true;
+
+	return false;
+}
+
+static inline void __perf_output_wakeup(struct perf_output_handle *handle)
+{
+	if (handle->nmi)
+		perf_pending_queue(handle->counter);
+	else
+		perf_counter_wakeup(handle->counter);
+}
+
 static int perf_output_begin(struct perf_output_handle *handle,
-			     struct perf_counter *counter, unsigned int size)
+			     struct perf_counter *counter, unsigned int size,
+			     int nmi)
 {
 	struct perf_mmap_data *data;
 	unsigned int offset, head;
@@ -1674,15 +1735,19 @@ static int perf_output_begin(struct perf
 	if (!data)
 		goto out;
 
+	handle->counter	= counter;
+	handle->nmi	= nmi;
+
 	if (!data->nr_pages)
-		goto out;
+		goto fail;
 
 	do {
 		offset = head = atomic_read(&data->head);
 		head += size;
+		if (unlikely(perf_output_overflow(data, offset, head)))
+			goto fail;
 	} while (atomic_cmpxchg(&data->head, offset, head) != offset);
 
-	handle->counter	= counter;
 	handle->data	= data;
 	handle->offset	= offset;
 	handle->head	= head;
@@ -1690,6 +1755,9 @@ static int perf_output_begin(struct perf
 
 	return 0;
 
+fail:
+	atomic_inc(&data->overflow);
+	__perf_output_wakeup(handle);
 out:
 	rcu_read_unlock();
 
@@ -1731,14 +1799,10 @@ static void perf_output_copy(struct perf
 #define perf_output_put(handle, x) \
 	perf_output_copy((handle), &(x), sizeof(x))
 
-static void perf_output_end(struct perf_output_handle *handle, int nmi)
+static void perf_output_end(struct perf_output_handle *handle)
 {
-	if (handle->wakeup) {
-		if (nmi)
-			perf_pending_queue(handle->counter);
-		else
-			perf_counter_wakeup(handle->counter);
-	}
+	if (handle->wakeup)
+		__perf_output_wakeup(handle);
 	rcu_read_unlock();
 }
 
@@ -1748,12 +1812,12 @@ static int perf_output_write(struct perf
 	struct perf_output_handle handle;
 	int ret;
 
-	ret = perf_output_begin(&handle, counter, size);
+	ret = perf_output_begin(&handle, counter, size, nmi);
 	if (ret)
 		goto out;
 
 	perf_output_copy(&handle, buf, size);
-	perf_output_end(&handle, nmi);
+	perf_output_end(&handle);
 
 out:
 	return ret;
@@ -1802,7 +1866,7 @@ static void perf_output_group(struct per
 
 	size = sizeof(header) + counter->nr_siblings * sizeof(entry);
 
-	ret = perf_output_begin(&handle, counter, size);
+	ret = perf_output_begin(&handle, counter, size, nmi);
 	if (ret)
 		return;
 
@@ -1822,7 +1886,7 @@ static void perf_output_group(struct per
 		perf_output_put(&handle, entry);
 	}
 
-	perf_output_end(&handle, nmi);
+	perf_output_end(&handle);
 }
 
 void perf_counter_output(struct perf_counter *counter,
@@ -1867,7 +1931,7 @@ static void perf_counter_mmap_output(str
 {
 	struct perf_output_handle handle;
 	int size = mmap_event->event.header.size;
-	int ret = perf_output_begin(&handle, counter, size);
+	int ret = perf_output_begin(&handle, counter, size, 0);
 
 	if (ret)
 		return;
@@ -1875,7 +1939,7 @@ static void perf_counter_mmap_output(str
 	perf_output_put(&handle, mmap_event->event);
 	perf_output_copy(&handle, mmap_event->file_name,
 				   mmap_event->file_size);
-	perf_output_end(&handle, 0);
+	perf_output_end(&handle);
 }
 
 static int perf_counter_mmap_match(struct perf_counter *counter,

-- 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/9] perf_counter: unify and fix delayed counter wakeup
  2009-03-28 19:44 ` [PATCH 1/9] perf_counter: unify and fix delayed counter wakeup Peter Zijlstra
@ 2009-03-29  0:14   ` Paul Mackerras
  2009-03-29  9:16     ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Paul Mackerras @ 2009-03-29  0:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Mike Galbraith, Arjan van de Ven,
	Wu Fengguang, Eric Dumazet

Peter Zijlstra writes:

> While going over the wakeup code I noticed delayed wakeups only work
> for hardware counters but basically all software counters rely on
> them.

Hmmm, I don't like the extra latency this introduces, particularly
since on powerpc we already have a good way to avoid the latency.

I did a grep for perf_swcounter_event calls that have nmi=1, and there
are a couple, to my surprise.  Why does the context switch one have
nmi=1?  It certainly isn't called from an actual NMI handler.  Is it
because of locking issues?

The other one is the tracepoint call in perf_tpcounter_event.  I
assume you have put nmi=1 there because you don't know what context
we're in.  That means we'll always delay the wakeup even when we might
be in an ordinary interrupts-on process context.  Couldn't we do
better?

Paul.


* Re: [PATCH 2/9] perf_counter: fix update_userpage()
  2009-03-28 19:44 ` [PATCH 2/9] perf_counter: fix update_userpage() Peter Zijlstra
@ 2009-03-29  0:24   ` Paul Mackerras
  2009-04-02  8:50     ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Paul Mackerras @ 2009-03-29  0:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Mike Galbraith, Arjan van de Ven,
	Wu Fengguang

Peter Zijlstra writes:

> It just occurred to me it is possible to have multiple contending
> updates of the userpage (mmap information vs overflow vs counter).
> This would break the seqlock logic.
> 
> It appears the arch code uses this from NMI context, so we cannot
> possibly serialize its use; therefore, separate the data_head update
> from it and let it return to its original use.

That sounds reasonable, and thanks for putting in a big comment.

Acked-by: Paul Mackerras <paulus@samba.org>

> --- linux-2.6.orig/include/linux/perf_counter.h
> +++ linux-2.6/include/linux/perf_counter.h
> @@ -160,10 +160,45 @@ struct perf_counter_hw_event {
>  struct perf_counter_mmap_page {
>  	__u32	version;		/* version number of this structure */
>  	__u32	compat_version;		/* lowest version this is compat with */
> +
> +	/*
> +	 * Bits needed to read the hw counters in user-space.
> +	 *
> +	 * The index and offset should be read atomically using the seqlock:
> +	 *
> +	 *   __u32 seq, index;
> +	 *   __s64 offset;
> +	 *
> +	 * again:
> +	 *   rmb();
> +	 *   seq = pc->lock;
> +	 *
> +	 *   if (unlikely(seq & 1)) {
> +	 *     cpu_relax();
> +	 *     goto again;
> +	 *   }
> +	 *
> +	 *   index = pc->index;
> +	 *   offset = pc->offset;
> +	 *
> +	 *   rmb();
> +	 *   if (pc->lock != seq)
> +	 *     goto again;
> +	 *
> +	 * After this, index contains architecture specific counter index + 1,
> +	 * so that 0 means unavailable, offset contains the value to be added
> +	 * to the result of the raw timer read to obtain this counter's value.
> +	 */
>  	__u32	lock;			/* seqlock for synchronization */
>  	__u32	index;			/* hardware counter identifier */
>  	__s64	offset;			/* add to hardware counter value */

I think we can simplify this (in a follow-on patch).

It has occurred to me that we don't need to do all this on the
userspace side, because we are necessarily reading and writing these
fields on the same CPU.  If the reader and writer were on different
CPUs, that would make no sense since they would be accessing different
hardware counter registers.

That means that we don't need any CPU memory barriers on either side.
All the kernel needs to do is to increment `lock' when it updates
things, and the user side can be:

	do {
		seq = pc->lock;
		index = pc->index;
		offset = pc->offset;
		barrier();
	} while (pc->lock != seq);

and all that's needed is a compiler barrier to stop the compiler from
optimizing too much.

Paul.


* Re: [PATCH 1/9] perf_counter: unify and fix delayed counter wakeup
  2009-03-29  0:14   ` Paul Mackerras
@ 2009-03-29  9:16     ` Peter Zijlstra
  2009-03-29  9:25       ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2009-03-29  9:16 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Ingo Molnar, linux-kernel, Mike Galbraith, Arjan van de Ven,
	Wu Fengguang, Eric Dumazet

On Sun, 2009-03-29 at 11:14 +1100, Paul Mackerras wrote:
> Peter Zijlstra writes:
> 
> > While going over the wakeup code I noticed delayed wakeups only work
> > for hardware counters but basically all software counters rely on
> > them.
> 
> Hmmm, I don't like the extra latency this introduces, particularly
> since on powerpc we already have a good way to avoid the latency.

Right, so I can re-instate the powerpc bits and have it call
perf_counter_do_pending() whenever it finds the per-cpu pending bit set.

I'd have to look into the fancy new per-cpu stuff for the x86 bits, but
I'm reasonably sure something like that should be doable.

> I did a grep for perf_swcounter_event calls that have nmi=1, and there
> are a couple, to my surprise.  Why does the context switch one have
> nmi=1?  It certainly isn't called from an actual NMI handler.  Is it
> because of locking issues?

Yeah, can't do a wakeup while holding the rq->lock.

> The other one is the tracepoint call in perf_tpcounter_event.  I
> assume you have put nmi=1 there because you don't know what context
> we're in.  That means we'll always delay the wakeup even when we might
> be in an ordinary interrupt-on process context.  Couldn't we do
> better?

Maybe; it's not only real in_nmi() tracepoints that have that problem, we
also have lock_acquire()-like tracepoints that could call into the event
code in the middle of a lock acquisition (which might be rq->lock).

So always using nmi=1 for those seemed like the safe way out.



* Re: [PATCH 1/9] perf_counter: unify and fix delayed counter wakeup
  2009-03-29  9:16     ` Peter Zijlstra
@ 2009-03-29  9:25       ` Peter Zijlstra
  2009-03-29 10:02         ` Paul Mackerras
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2009-03-29  9:25 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Ingo Molnar, linux-kernel, Mike Galbraith, Arjan van de Ven,
	Wu Fengguang, Eric Dumazet

On Sun, 2009-03-29 at 11:16 +0200, Peter Zijlstra wrote:
> On Sun, 2009-03-29 at 11:14 +1100, Paul Mackerras wrote:
> > Peter Zijlstra writes:
> > 
> > > While going over the wakeup code I noticed delayed wakeups only work
> > > for hardware counters but basically all software counters rely on
> > > them.
> > 
> > Hmmm, I don't like the extra latency this introduces, particularly
> > since on powerpc we already have a good way to avoid the latency.
> 
> Right, so I can re-instate the powerpc bits and have it call
> perf_counter_do_pending() whenever it finds the per-cpu pending bit set.
> 
> I'd have to look into the fancy new per-cpu stuff for the x86 bits, but
> I'm reasonably sure something like that should be doable.

In a perfect world, I'd introduce a self-ipi on UP and use that. Also,
in that same perfect world, all arches would support cmpxchg()/xchg() so
we could put the whole thing in generic code.


* Re: [PATCH 1/9] perf_counter: unify and fix delayed counter wakeup
  2009-03-29  9:25       ` Peter Zijlstra
@ 2009-03-29 10:02         ` Paul Mackerras
  0 siblings, 0 replies; 24+ messages in thread
From: Paul Mackerras @ 2009-03-29 10:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Mike Galbraith, Arjan van de Ven,
	Wu Fengguang, Eric Dumazet

Peter Zijlstra writes:

> In a perfect world, I'd introduce a self-ipi on UP and use that. Also,
> in that same perfect world, all arches would support cmpxchg()/xchg() so
> we could put the whole thing in generic code.

We already require atomic64_t, which not all architectures have - in
particular, ppc32 doesn't have it, though it does have cmpxchg/xchg
for 32-bit quantities.  The list of architectures that have atomic64_t
seems to be a proper subset of the list of architectures that have
cmpxchg, from a quick grep.  So I would think you can use cmpxchg if
necessary.

Paul.


* Re: [PATCH 7/9] perf_counter: make it possible for hw_perf_counter_init to return error codes
  2009-03-28 19:44 ` [PATCH 7/9] perf_counter: make it possible for hw_perf_counter_init to return error codes Peter Zijlstra
@ 2009-03-30  4:13   ` Paul Mackerras
  0 siblings, 0 replies; 24+ messages in thread
From: Paul Mackerras @ 2009-03-30  4:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Mike Galbraith, Arjan van de Ven,
	Wu Fengguang

$SUBJECT patch has a bug; it doesn't recognize when
hw_perf_counter_init returns an error on a raw hardware event.  I'll
post a new version.

Paul.


* Re: [PATCH 2/9] perf_counter: fix update_userpage()
  2009-03-29  0:24   ` Paul Mackerras
@ 2009-04-02  8:50     ` Peter Zijlstra
  2009-04-02  9:00       ` Peter Zijlstra
  2009-04-02  9:15       ` Paul Mackerras
  0 siblings, 2 replies; 24+ messages in thread
From: Peter Zijlstra @ 2009-04-02  8:50 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Ingo Molnar, linux-kernel, Mike Galbraith, Arjan van de Ven,
	Wu Fengguang

On Sun, 2009-03-29 at 11:24 +1100, Paul Mackerras wrote:

> > --- linux-2.6.orig/include/linux/perf_counter.h
> > +++ linux-2.6/include/linux/perf_counter.h
> > @@ -160,10 +160,45 @@ struct perf_counter_hw_event {
> >  struct perf_counter_mmap_page {
> >  	__u32	version;		/* version number of this structure */
> >  	__u32	compat_version;		/* lowest version this is compat with */
> > +
> > +	/*
> > +	 * Bits needed to read the hw counters in user-space.
> > +	 *
> > +	 * The index and offset should be read atomically using the seqlock:
> > +	 *
> > +	 *   __u32 seq, index;
> > +	 *   __s64 offset;
> > +	 *
> > +	 * again:
> > +	 *   rmb();
> > +	 *   seq = pc->lock;
> > +	 *
> > +	 *   if (unlikely(seq & 1)) {
> > +	 *     cpu_relax();
> > +	 *     goto again;
> > +	 *   }
> > +	 *
> > +	 *   index = pc->index;
> > +	 *   offset = pc->offset;
> > +	 *
> > +	 *   rmb();
> > +	 *   if (pc->lock != seq)
> > +	 *     goto again;
> > +	 *
> > +	 * After this, index contains architecture specific counter index + 1,
> > +	 * so that 0 means unavailable, offset contains the value to be added
> > +	 * to the result of the raw timer read to obtain this counter's value.
> > +	 */
> >  	__u32	lock;			/* seqlock for synchronization */
> >  	__u32	index;			/* hardware counter identifier */
> >  	__s64	offset;			/* add to hardware counter value */
> 
> I think we can simplify this (in a follow-on patch).
> 
> It has occurred to me that we don't need to do all this on the
> userspace side, because we are necessarily reading and writing these
> fields on the same CPU.  If the reader and writer were on different
> CPUs, that would make no sense since they would be accessing different
> hardware counter registers.
> 
> That means that we don't need any CPU memory barriers on either side.
> All the kernel needs to do is to increment `lock' when it updates
> things, and the user side can be:
> 
> 	do {
> 		seq = pc->lock;
> 		index = pc->index;
> 		offset = pc->offset;
> 		barrier();
> 	} while (pc->lock != seq);
> 
> and all that's needed is a compiler barrier to stop the compiler from
> optimizing too much.

Can this work at all?

I mean, user-space could get preempted/rescheduled after we read the
mmap() data using that seqlock and before we actually did the read-pmc
bit.

In that case, the counter can have changed underneath us and we're
reading rubbish.



* Re: [PATCH 2/9] perf_counter: fix update_userpage()
  2009-04-02  8:50     ` Peter Zijlstra
@ 2009-04-02  9:00       ` Peter Zijlstra
  2009-04-02  9:21         ` Paul Mackerras
  2009-04-02  9:15       ` Paul Mackerras
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2009-04-02  9:00 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Ingo Molnar, linux-kernel, Mike Galbraith, Arjan van de Ven,
	Wu Fengguang

On Thu, 2009-04-02 at 10:50 +0200, Peter Zijlstra wrote:
> On Sun, 2009-03-29 at 11:24 +1100, Paul Mackerras wrote:
> 
> > > --- linux-2.6.orig/include/linux/perf_counter.h
> > > +++ linux-2.6/include/linux/perf_counter.h
> > > @@ -160,10 +160,45 @@ struct perf_counter_hw_event {
> > >  struct perf_counter_mmap_page {
> > >  	__u32	version;		/* version number of this structure */
> > >  	__u32	compat_version;		/* lowest version this is compat with */
> > > +
> > > +	/*
> > > +	 * Bits needed to read the hw counters in user-space.
> > > +	 *
> > > +	 * The index and offset should be read atomically using the seqlock:
> > > +	 *
> > > +	 *   __u32 seq, index;
> > > +	 *   __s64 offset;
> > > +	 *
> > > +	 * again:
> > > +	 *   rmb();
> > > +	 *   seq = pc->lock;
> > > +	 *
> > > +	 *   if (unlikely(seq & 1)) {
> > > +	 *     cpu_relax();
> > > +	 *     goto again;
> > > +	 *   }
> > > +	 *
> > > +	 *   index = pc->index;
> > > +	 *   offset = pc->offset;
> > > +	 *
> > > +	 *   rmb();
> > > +	 *   if (pc->lock != seq)
> > > +	 *     goto again;
> > > +	 *
> > > +	 * After this, index contains architecture specific counter index + 1,
> > > +	 * so that 0 means unavailable, offset contains the value to be added
> > > +	 * to the result of the raw timer read to obtain this counter's value.
> > > +	 */
> > >  	__u32	lock;			/* seqlock for synchronization */
> > >  	__u32	index;			/* hardware counter identifier */
> > >  	__s64	offset;			/* add to hardware counter value */
> > 
> > I think we can simplify this (in a follow-on patch).
> > 
> > It has occurred to me that we don't need to do all this on the
> > userspace side, because we are necessarily reading and writing these
> > fields on the same CPU.  If the reader and writer were on different
> > CPUs, that would make no sense since they would be accessing different
> > hardware counter registers.
> > 
> > That means that we don't need any CPU memory barriers on either side.
> > All the kernel needs to do is to increment `lock' when it updates
> > things, and the user side can be:
> > 
> > 	do {
> > 		seq = pc->lock;
> > 		index = pc->index;
> > 		offset = pc->offset;
> > 		barrier();
> > 	} while (pc->lock != seq);
> > 
> > and all that's needed is a compiler barrier to stop the compiler from
> > optimizing too much.
> 
> Can this work at all?
> 
> I mean, user-space could get preempted/rescheduled after we read the
> mmap() data using that seqlock and before we actually did the read-pmc
> bit.
> 
> In that case, the counter can have changed underneath us and we're
> reading rubbish.

The below might work:

  u32 seq;
  s64 count;

again:
  seq = pc->lock;

  if (unlikely(seq & 1)) {
    cpu_relax();
    goto again;
  }

  count = pmc_read(pc->index);
  count += pc->offset;

  barrier();
  if (pc->lock != seq)
    goto again;


   



* Re: [PATCH 2/9] perf_counter: fix update_userpage()
  2009-04-02  8:50     ` Peter Zijlstra
  2009-04-02  9:00       ` Peter Zijlstra
@ 2009-04-02  9:15       ` Paul Mackerras
  2009-04-02  9:36         ` Peter Zijlstra
  1 sibling, 1 reply; 24+ messages in thread
From: Paul Mackerras @ 2009-04-02  9:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Mike Galbraith, Arjan van de Ven,
	Wu Fengguang

Peter Zijlstra writes:

> > That means that we don't need any CPU memory barriers on either side.
> > All the kernel needs to do is to increment `lock' when it updates
> > things, and the user side can be:
> > 
> > 	do {
> > 		seq = pc->lock;
> > 		index = pc->index;
> > 		offset = pc->offset;
> > 		barrier();
> > 	} while (pc->lock != seq);
> > 
> > and all that's needed is a compiler barrier to stop the compiler from
> > optimizing too much.
> 
> Can this work at all?
> 
> I mean, user-space could get preempted/rescheduled after we read the
> mmap() data using that seqlock and before we actually did the read-pmc
> bit.
> 
> In that case, the counter can have changed underneath us and we're
> reading rubbish.

Good point.  This should work, though:

	do {
		seq = pc->lock;
		barrier();
		value = read_pmc(pc->index) + pc->offset;
		barrier();
	} while (pc->lock != seq);
	return value;

No?

Paul.


* Re: [PATCH 2/9] perf_counter: fix update_userpage()
  2009-04-02  9:00       ` Peter Zijlstra
@ 2009-04-02  9:21         ` Paul Mackerras
  2009-04-02  9:28           ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Paul Mackerras @ 2009-04-02  9:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Mike Galbraith, Arjan van de Ven,
	Wu Fengguang

Peter Zijlstra writes:

> The below might work:
> 
>   u32 seq;
>   s64 count;
> 
> again:
>   seq = pc->lock;
> 
>   if (unlikely(seq & 1)) {

I don't believe we can ever see this condition, since pc->lock is
updated in the kernel either at interrupt level on the cpu this task
is running on, or in the kernel in the context of this task.  So this
userspace code can never run in the middle of the kernel updating
things.

Paul.


* Re: [PATCH 2/9] perf_counter: fix update_userpage()
  2009-04-02  9:21         ` Paul Mackerras
@ 2009-04-02  9:28           ` Peter Zijlstra
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2009-04-02  9:28 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Ingo Molnar, linux-kernel, Mike Galbraith, Arjan van de Ven,
	Wu Fengguang

On Thu, 2009-04-02 at 20:21 +1100, Paul Mackerras wrote:
> Peter Zijlstra writes:
> 
> > The below might work:
> > 
> >   u32 seq;
> >   s64 count;
> > 
> > again:
> >   seq = pc->lock;
> > 
> >   if (unlikely(seq & 1)) {
> 
> I don't believe we can ever see this condition, since pc->lock is
> updated in the kernel either at interrupt level on the cpu this task
> is running on, or in the kernel in the context of this task.  So this
> userspace code can never run in the middle of the kernel updating
> things.

Colour me paranoid ;-)


* Re: [PATCH 2/9] perf_counter: fix update_userpage()
  2009-04-02  9:15       ` Paul Mackerras
@ 2009-04-02  9:36         ` Peter Zijlstra
  2009-04-02  9:58           ` Paul Mackerras
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2009-04-02  9:36 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Ingo Molnar, linux-kernel, Mike Galbraith, Arjan van de Ven,
	Wu Fengguang

On Thu, 2009-04-02 at 20:15 +1100, Paul Mackerras wrote:
> Peter Zijlstra writes:
> 
> > > That means that we don't need any CPU memory barriers on either side.
> > > All the kernel needs to do is to increment `lock' when it updates
> > > things, and the user side can be:
> > > 
> > > 	do {
> > > 		seq = pc->lock;
> > > 		index = pc->index;
> > > 		offset = pc->offset;
> > > 		barrier();
> > > 	} while (pc->lock != seq);
> > > 
> > > and all that's needed is a compiler barrier to stop the compiler from
> > > optimizing too much.
> > 
> > Can this work at all?
> > 
> > I mean, user-space could get preempted/rescheduled after we read the
> > mmap() data using that seqlock and before we actually did the read-pmc
> > bit.
> > 
> > In that case, the counter can have changed underneath us and we're
> > reading rubbish.
> 
> Good point.  This should work, though:
> 
> 	do {
> 		seq = pc->lock;
> 		barrier();
> 		value = read_pmc(pc->index) + pc->offset;
> 		barrier();
> 	} while (pc->lock != seq);
> 	return value;

I don't think you need the first barrier(); all you need to avoid is the
compiler reusing the first pc->lock read, so one should suffice.

Also, you need to handle the !pc->index case.

But yeah.



* Re: [PATCH 2/9] perf_counter: fix update_userpage()
  2009-04-02  9:36         ` Peter Zijlstra
@ 2009-04-02  9:58           ` Paul Mackerras
  2009-04-02 10:36             ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Paul Mackerras @ 2009-04-02  9:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Mike Galbraith, Arjan van de Ven,
	Wu Fengguang

Peter Zijlstra writes:

> > Good point.  This should work, though:
> > 
> > 	do {
> > 		seq = pc->lock;
> > 		barrier();
> > 		value = read_pmc(pc->index) + pc->offset;
> > 		barrier();
> > 	} while (pc->lock != seq);
> > 	return value;
> 
> I don't think you need the first barrier(); all you need to avoid is the
> compiler reusing the first pc->lock read, so one should suffice.

I need it to make sure that the compiler doesn't put the load of
pc->index or pc->offset before the first load of pc->lock.  The second
barrier is needed to make sure the compiler puts the second load of
pc->lock after the loads of pc->index and pc->offset.  So I think I do
need two barrier()s (but only compiler barriers, not cpu memory
barriers).

> Also, you need to handle the !pc->index case.

Hmmm, yeah.  I claim that read_pmc(0) always returns 0. :)

Paul.


* Re: [PATCH 2/9] perf_counter: fix update_userpage()
  2009-04-02  9:58           ` Paul Mackerras
@ 2009-04-02 10:36             ` Peter Zijlstra
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2009-04-02 10:36 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Ingo Molnar, linux-kernel, Mike Galbraith, Arjan van de Ven,
	Wu Fengguang

On Thu, 2009-04-02 at 20:58 +1100, Paul Mackerras wrote:
> Peter Zijlstra writes:
> 
> > > Good point.  This should work, though:
> > > 
> > > 	do {
> > > 		seq = pc->lock;
> > > 		barrier();
> > > 		value = read_pmc(pc->index) + pc->offset;
> > > 		barrier();
> > > 	} while (pc->lock != seq);
> > > 	return value;
> > 
> > I don't think you need the first barrier(); all you need to avoid is the
> > compiler reusing the first pc->lock read, so one should suffice.
> 
> I need it to make sure that the compiler doesn't put the load of
> pc->index or pc->offset before the first load of pc->lock.  The second
> barrier is needed to make sure the compiler puts the second load of
> pc->lock after the loads of pc->index and pc->offset.  So I think I do
> need two barrier()s (but only compiler barriers, not cpu memory
> barriers).

Ah, you're right indeed.

> > Also, you need to handle the !pc->index case.
> 
> Hmmm, yeah.  I claim that read_pmc(0) always returns 0. :)

Hehe :-)

Ok, updated that patch.



end of thread, other threads:[~2009-04-02 10:35 UTC | newest]

Thread overview: 24+ messages
2009-03-28 19:43 [PATCH 0/9] perf_counter patches Peter Zijlstra
2009-03-28 19:44 ` [PATCH 1/9] perf_counter: unify and fix delayed counter wakeup Peter Zijlstra
2009-03-29  0:14   ` Paul Mackerras
2009-03-29  9:16     ` Peter Zijlstra
2009-03-29  9:25       ` Peter Zijlstra
2009-03-29 10:02         ` Paul Mackerras
2009-03-28 19:44 ` [PATCH 2/9] perf_counter: fix update_userpage() Peter Zijlstra
2009-03-29  0:24   ` Paul Mackerras
2009-04-02  8:50     ` Peter Zijlstra
2009-04-02  9:00       ` Peter Zijlstra
2009-04-02  9:21         ` Paul Mackerras
2009-04-02  9:28           ` Peter Zijlstra
2009-04-02  9:15       ` Paul Mackerras
2009-04-02  9:36         ` Peter Zijlstra
2009-04-02  9:58           ` Paul Mackerras
2009-04-02 10:36             ` Peter Zijlstra
2009-03-28 19:44 ` [PATCH 3/9] perf_counter: kerneltop: simplify data_head read Peter Zijlstra
2009-03-28 19:44 ` [PATCH 4/9] perf_counter: executable mmap() information Peter Zijlstra
2009-03-28 19:44 ` [PATCH 5/9] perf_counter: kerneltop: parse the mmap data stream Peter Zijlstra
2009-03-28 19:44 ` [PATCH 6/9] perf_counter: powerpc: only reserve PMU hardware when we need it Peter Zijlstra
2009-03-28 19:44 ` [PATCH 7/9] perf_counter: make it possible for hw_perf_counter_init to return error codes Peter Zijlstra
2009-03-30  4:13   ` Paul Mackerras
2009-03-28 19:44 ` [PATCH 8/9] perf_counter tools: optionally scale counter values in perfstat mode Peter Zijlstra
2009-03-28 19:44 ` [PATCH 9/9] RFC perf_counter: event overlow handling Peter Zijlstra
