* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-07-08 12:15 ` Peter Zijlstra
@ 2013-07-09 6:18 ` Namhyung Kim
2013-07-09 7:40 ` Peter Zijlstra
2013-07-09 7:05 ` Yan, Zheng
2013-07-10 11:37 ` Yan, Zheng
2 siblings, 1 reply; 12+ messages in thread
From: Namhyung Kim @ 2013-07-09 6:18 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Yan, Zheng, linux-kernel, mingo, eranian, ak
Hi Peter,
On Mon, 8 Jul 2013 14:15:57 +0200, Peter Zijlstra wrote:
[SNIP]
> +static void
> +perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb)
> +{
> + if (event->overflow_handler != perf_event_output ||
I guess you wanted "&&" instead of "||" here.
Thanks,
Namhyung
> + event->overflow_handler != perf_event_output_overwrite)
> + return;
> +
> + if (rb->overwrite)
> + event->overflow_handler = perf_event_output_overwrite;
> + else
> + event->overflow_handler = perf_event_output;
> +}
* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-07-09 6:18 ` Namhyung Kim
@ 2013-07-09 7:40 ` Peter Zijlstra
0 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2013-07-09 7:40 UTC (permalink / raw)
To: Namhyung Kim; +Cc: Yan, Zheng, linux-kernel, mingo, eranian, ak
On Tue, Jul 09, 2013 at 03:18:20PM +0900, Namhyung Kim wrote:
> Hi Peter,
>
> On Mon, 8 Jul 2013 14:15:57 +0200, Peter Zijlstra wrote:
> [SNIP]
> > +static void
> > +perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb)
> > +{
> > + if (event->overflow_handler != perf_event_output ||
>
> I guess you wanted "&&" instead of "||" here.
Uhm.. yeah. /me dons the brown paper bag.
* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-07-08 12:15 ` Peter Zijlstra
2013-07-09 6:18 ` Namhyung Kim
@ 2013-07-09 7:05 ` Yan, Zheng
2013-07-09 8:05 ` Peter Zijlstra
2013-07-10 11:37 ` Yan, Zheng
2 siblings, 1 reply; 12+ messages in thread
From: Yan, Zheng @ 2013-07-09 7:05 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, mingo, eranian, ak
On 07/08/2013 08:15 PM, Peter Zijlstra wrote:
> On Thu, Jun 06, 2013 at 01:58:06PM +0800, Yan, Zheng wrote:
>> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>>
>> If the perf event buffer is in overwrite mode, the kernel only updates
>> the data head when it overwrites old samples. The program that owns
>> the buffer needs to periodically check the buffer and update a variable
>> that tracks the data tail. If the program fails to do this in time,
>> the data tail can be overwritten by new samples. The program has to
>> rewind the buffer because it does not know where the first valid
>> sample is.
>>
>> This patch makes the kernel update the data tail when it overwrites
>> old events, so the program that owns the event buffer can always
>> read the latest samples. This is convenient for programs that use
>> perf to do branch tracing. One use case is GDB branch tracing:
>> (http://sourceware.org/ml/gdb-patches/2012-06/msg00172.html)
>> It uses the perf interface to read BTS, but only cares about the
>> branches before the ptrace event.
>>
>> I added code to perf_output_{begin/end} to count how many cycles
>> are spent on sample output, then ran "perf record" to profile a kernel
>> compilation 10 times on IvyBridge-EP. (perf record -a make -j 60)
>> The first number is scaled to 1000; the remaining numbers are scaled
>> by the same factor.
>>
>>          before  overwrite mode   after  overwrite mode
>> AVG       1000        999          1046       1044
>> STDEV     19.4       19.5          17.1       17.9
>
> OK, so I was sure I replied to this email; but apparently I didn't :/
>
> So its still adding about 5% overhead to the regular case; this is sad.
>
> What does something like the below do?
Thank you for your help. I ran the same test; the results for the regular
case are much better, but there is still about 1% overhead, probably
because enlarging the ring_buffer structure makes it less cache friendly.

        origin   with the patch
AVG      1000        1013
STDEV    13.4        15.0
Regards
Yan, Zheng
>
> ---
> include/linux/perf_event.h | 2 +
> kernel/events/core.c | 60 ++++++++++++++++++++++++------
> kernel/events/internal.h | 2 +
> kernel/events/ring_buffer.c | 90 +++++++++++++++++++++++++++------------------
> 4 files changed, 106 insertions(+), 48 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 8873f82..bcce98a 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -753,6 +753,8 @@ static inline bool has_branch_stack(struct perf_event *event)
>
> extern int perf_output_begin(struct perf_output_handle *handle,
> struct perf_event *event, unsigned int size);
> +extern int perf_output_begin_overwrite(struct perf_output_handle *handle,
> + struct perf_event *event, unsigned int size);
> extern void perf_output_end(struct perf_output_handle *handle);
> extern unsigned int perf_output_copy(struct perf_output_handle *handle,
> const void *buf, unsigned int len);
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 1833bc5..4d674e9 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3878,6 +3878,8 @@ static const struct vm_operations_struct perf_mmap_vmops = {
> .page_mkwrite = perf_mmap_fault,
> };
>
> +static void perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb);
> +
> static int perf_mmap(struct file *file, struct vm_area_struct *vma)
> {
> struct perf_event *event = file->private_data;
> @@ -3985,6 +3987,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
> vma->vm_mm->pinned_vm += extra;
>
> ring_buffer_attach(event, rb);
> + perf_event_set_overflow(event, rb);
> rcu_assign_pointer(event->rb, rb);
>
> perf_event_update_userpage(event);
> @@ -4595,9 +4598,12 @@ void perf_prepare_sample(struct perf_event_header *header,
> }
> }
>
> -static void perf_event_output(struct perf_event *event,
> - struct perf_sample_data *data,
> - struct pt_regs *regs)
> +static __always_inline void
> +__perf_event_output(struct perf_event *event,
> + struct perf_sample_data *data,
> + struct pt_regs *regs,
> + int (*output_begin)(struct perf_output_handle *,
> + struct perf_event *, unsigned int))
> {
> struct perf_output_handle handle;
> struct perf_event_header header;
> @@ -4607,7 +4613,7 @@ static void perf_event_output(struct perf_event *event,
>
> perf_prepare_sample(&header, data, event, regs);
>
> - if (perf_output_begin(&handle, event, header.size))
> + if (output_begin(&handle, event, header.size))
> goto exit;
>
> perf_output_sample(&handle, &header, data, event);
> @@ -4618,6 +4624,33 @@ static void perf_event_output(struct perf_event *event,
> rcu_read_unlock();
> }
>
> +static void perf_event_output(struct perf_event *event,
> + struct perf_sample_data *data,
> + struct pt_regs *regs)
> +{
> + __perf_event_output(event, data, regs, perf_output_begin);
> +}
> +
> +static void perf_event_output_overwrite(struct perf_event *event,
> + struct perf_sample_data *data,
> + struct pt_regs *regs)
> +{
> + __perf_event_output(event, data, regs, perf_output_begin_overwrite);
> +}
> +
> +static void
> +perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb)
> +{
> + if (event->overflow_handler != perf_event_output ||
> + event->overflow_handler != perf_event_output_overwrite)
> + return;
> +
> + if (rb->overwrite)
> + event->overflow_handler = perf_event_output_overwrite;
> + else
> + event->overflow_handler = perf_event_output;
> +}
> +
> /*
> * read event_id
> */
> @@ -5183,10 +5216,7 @@ static int __perf_event_overflow(struct perf_event *event,
> irq_work_queue(&event->pending);
> }
>
> - if (event->overflow_handler)
> - event->overflow_handler(event, data, regs);
> - else
> - perf_event_output(event, data, regs);
> + event->overflow_handler(event, data, regs);
>
> if (event->fasync && event->pending_kill) {
> event->pending_wakeup = 1;
> @@ -6501,8 +6531,13 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
> context = parent_event->overflow_handler_context;
> }
>
> - event->overflow_handler = overflow_handler;
> - event->overflow_handler_context = context;
> + if (overflow_handler) {
> + event->overflow_handler = overflow_handler;
> + event->overflow_handler_context = context;
> + } else {
> + event->overflow_handler = perf_event_output;
> + event->overflow_handler_context = NULL;
> + }
>
> perf_event__state_init(event);
>
> @@ -6736,9 +6771,10 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
> if (old_rb)
> ring_buffer_detach(event, old_rb);
>
> - if (rb)
> + if (rb) {
> ring_buffer_attach(event, rb);
> -
> + perf_event_set_overflow(event, rb);
> + }
> rcu_assign_pointer(event->rb, rb);
>
> if (old_rb) {
> diff --git a/kernel/events/internal.h b/kernel/events/internal.h
> index ca65997..c4e4610 100644
> --- a/kernel/events/internal.h
> +++ b/kernel/events/internal.h
> @@ -20,6 +20,8 @@ struct ring_buffer {
>
> atomic_t poll; /* POLL_ for wakeups */
>
> + local_t tail; /* read position */
> + local_t next_tail; /* next read position */
> local_t head; /* write position */
> local_t nest; /* nested writers */
> local_t events; /* event limit */
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index cd55144..5887044 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -15,28 +15,9 @@
>
> #include "internal.h"
>
> -static bool perf_output_space(struct ring_buffer *rb, unsigned long tail,
> - unsigned long offset, unsigned long head)
> +static bool perf_output_space(unsigned long tail, unsigned long offset,
> + unsigned long head, unsigned long mask)
> {
> - unsigned long sz = perf_data_size(rb);
> - unsigned long mask = sz - 1;
> -
> - /*
> - * check if user-writable
> - * overwrite : over-write its own tail
> - * !overwrite: buffer possibly drops events.
> - */
> - if (rb->overwrite)
> - return true;
> -
> - /*
> - * verify that payload is not bigger than buffer
> - * otherwise masking logic may fail to detect
> - * the "not enough space" condition
> - */
> - if ((head - offset) > sz)
> - return false;
> -
> offset = (offset - tail) & mask;
> head = (head - tail) & mask;
>
> @@ -109,11 +90,11 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
> preempt_enable();
> }
>
> -int perf_output_begin(struct perf_output_handle *handle,
> - struct perf_event *event, unsigned int size)
> +static __always_inline int __perf_output_begin(struct perf_output_handle *handle,
> + struct perf_event *event, unsigned int size, bool overwrite)
> {
> struct ring_buffer *rb;
> - unsigned long tail, offset, head;
> + unsigned long tail, offset, head, max_size;
> int have_lost;
> struct perf_sample_data sample_data;
> struct {
> @@ -136,7 +117,8 @@ int perf_output_begin(struct perf_output_handle *handle,
> handle->rb = rb;
> handle->event = event;
>
> - if (!rb->nr_pages)
> + max_size = perf_data_size(rb);
> + if (size > max_size)
> goto out;
>
> have_lost = local_read(&rb->lost);
> @@ -149,19 +131,43 @@ int perf_output_begin(struct perf_output_handle *handle,
>
> perf_output_get_handle(handle);
>
> - do {
> + if (overwrite) {
> + do {
> + tail = local_read(&rb->tail);
> + offset = local_read(&rb->head);
> + head = offset + size;
> + if (unlikely(!perf_output_space(tail, offset, head,
> + max_size - 1))) {
> + tail = local_read(&rb->next_tail);
> + local_set(&rb->tail, tail);
> + rb->user_page->data_tail = tail;
> + }
> + } while (local_cmpxchg(&rb->head, offset, head) != offset);
> +
> /*
> - * Userspace could choose to issue a mb() before updating the
> - * tail pointer. So that all reads will be completed before the
> - * write is issued.
> + * Save the start of next event when half of the buffer
> + * has been filled. Later when the event buffer overflows,
> + * update the tail pointer to point to it.
> */
> - tail = ACCESS_ONCE(rb->user_page->data_tail);
> - smp_rmb();
> - offset = head = local_read(&rb->head);
> - head += size;
> - if (unlikely(!perf_output_space(rb, tail, offset, head)))
> - goto fail;
> - } while (local_cmpxchg(&rb->head, offset, head) != offset);
> + if (tail == local_read(&rb->next_tail) &&
> + head - tail >= (max_size >> 1))
> + local_cmpxchg(&rb->next_tail, tail, head);
> + } else {
> + do {
> + /*
> + * Userspace could choose to issue a mb() before
> + * updating the tail pointer. So that all reads will
> + * be completed before the write is issued.
> + */
> + tail = ACCESS_ONCE(rb->user_page->data_tail);
> + smp_rmb();
> + offset = local_read(&rb->head);
> + head = offset + size;
> + if (unlikely(!perf_output_space(tail, offset, head,
> + max_size - 1)))
> + goto fail;
> + } while (local_cmpxchg(&rb->head, offset, head) != offset);
> + }
>
> if (head - local_read(&rb->wakeup) > rb->watermark)
> local_add(rb->watermark, &rb->wakeup);
> @@ -194,6 +200,18 @@ int perf_output_begin(struct perf_output_handle *handle,
> return -ENOSPC;
> }
>
> +int perf_output_begin(struct perf_output_handle *handle,
> + struct perf_event *event, unsigned int size)
> +{
> + return __perf_output_begin(handle, event, size, false);
> +}
> +
> +int perf_output_begin_overwrite(struct perf_output_handle *handle,
> + struct perf_event *event, unsigned int size)
> +{
> + return __perf_output_begin(handle, event, size, true);
> +}
> +
> unsigned int perf_output_copy(struct perf_output_handle *handle,
> const void *buf, unsigned int len)
> {
>
* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-07-09 7:05 ` Yan, Zheng
@ 2013-07-09 8:05 ` Peter Zijlstra
2013-07-09 13:52 ` Yan, Zheng
0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2013-07-09 8:05 UTC (permalink / raw)
To: Yan, Zheng; +Cc: linux-kernel, mingo, eranian, ak
On Tue, Jul 09, 2013 at 03:05:41PM +0800, Yan, Zheng wrote:
> Thank you for your help. I ran the same test; the results for the regular
> case are much better, but there is still about 1% overhead, probably
> because enlarging the ring_buffer structure makes it less cache friendly.
>
>         origin   with the patch
> AVG      1000        1013
> STDEV    13.4        15.0
And this is the !overwrite case, right? I don't suppose you cured the logic
Namhyung Kim pointed out? That should affect the overwrite case I suppose since
it won't switch to perf_event_output_overwrite().
tip/master:
struct ring_buffer {
atomic_t refcount; /* 0 4 */
/* XXX 4 bytes hole, try to pack */
struct callback_head callback_head; /* 8 16 */
int nr_pages; /* 24 4 */
int overwrite; /* 28 4 */
atomic_t poll; /* 32 4 */
/* XXX 4 bytes hole, try to pack */
local_t head; /* 40 8 */
local_t nest; /* 48 8 */
local_t events; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
local_t wakeup; /* 64 8 */
local_t lost; /* 72 8 */
long int watermark; /* 80 8 */
spinlock_t event_lock; /* 88 56 */
/* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */
struct list_head event_list; /* 144 16 */
atomic_t mmap_count; /* 160 4 */
/* XXX 4 bytes hole, try to pack */
long unsigned int mmap_locked; /* 168 8 */
struct user_struct * mmap_user; /* 176 8 */
struct perf_event_mmap_page * user_page; /* 184 8 */
/* --- cacheline 3 boundary (192 bytes) --- */
void * data_pages[0]; /* 192 0 */
/* size: 192, cachelines: 3, members: 18 */
/* sum members: 180, holes: 3, sum holes: 12 */
};
tip/master + patch:
struct ring_buffer {
atomic_t refcount; /* 0 4 */
/* XXX 4 bytes hole, try to pack */
struct callback_head callback_head; /* 8 16 */
int nr_pages; /* 24 4 */
int overwrite; /* 28 4 */
atomic_t poll; /* 32 4 */
/* XXX 4 bytes hole, try to pack */
local_t tail; /* 40 8 */
local_t next_tail; /* 48 8 */
local_t head; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
local_t nest; /* 64 8 */
local_t events; /* 72 8 */
local_t wakeup; /* 80 8 */
local_t lost; /* 88 8 */
long int watermark; /* 96 8 */
spinlock_t event_lock; /* 104 56 */
/* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
struct list_head event_list; /* 160 16 */
atomic_t mmap_count; /* 176 4 */
/* XXX 4 bytes hole, try to pack */
long unsigned int mmap_locked; /* 184 8 */
/* --- cacheline 3 boundary (192 bytes) --- */
struct user_struct * mmap_user; /* 192 8 */
struct perf_event_mmap_page * user_page; /* 200 8 */
void * data_pages[0]; /* 208 0 */
/* size: 208, cachelines: 4, members: 20 */
/* sum members: 196, holes: 3, sum holes: 12 */
/* last cacheline: 16 bytes */
};
tip/master + patch^2:
struct ring_buffer {
atomic_t refcount; /* 0 4 */
atomic_t mmap_count; /* 4 4 */
union {
int overwrite; /* 4 */
struct callback_head callback_head; /* 16 */
}; /* 8 16 */
int nr_pages; /* 24 4 */
atomic_t poll; /* 28 4 */
local_t tail; /* 32 8 */
local_t next_tail; /* 40 8 */
local_t head; /* 48 8 */
local_t nest; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
local_t events; /* 64 8 */
local_t wakeup; /* 72 8 */
local_t lost; /* 80 8 */
long int watermark; /* 88 8 */
spinlock_t event_lock; /* 96 56 */
/* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
struct list_head event_list; /* 152 16 */
long unsigned int mmap_locked; /* 168 8 */
struct user_struct * mmap_user; /* 176 8 */
struct perf_event_mmap_page * user_page; /* 184 8 */
/* --- cacheline 3 boundary (192 bytes) --- */
void * data_pages[0]; /* 192 0 */
/* size: 192, cachelines: 3, members: 19 */
};
---
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4641,7 +4641,7 @@ static void perf_event_output_overwrite(
static void
perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb)
{
- if (event->overflow_handler != perf_event_output ||
+ if (event->overflow_handler != perf_event_output &&
event->overflow_handler != perf_event_output_overwrite)
return;
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -10,13 +10,16 @@
struct ring_buffer {
atomic_t refcount;
- struct rcu_head rcu_head;
+ atomic_t mmap_count;
+ union {
+ int overwrite; /* can overwrite itself */
+ struct rcu_head rcu_head;
+ };
#ifdef CONFIG_PERF_USE_VMALLOC
struct work_struct work;
int page_order; /* allocation order */
#endif
int nr_pages; /* nr of data pages */
- int overwrite; /* can overwrite itself */
atomic_t poll; /* POLL_ for wakeups */
@@ -33,7 +36,6 @@ struct ring_buffer {
spinlock_t event_lock;
struct list_head event_list;
- atomic_t mmap_count;
unsigned long mmap_locked;
struct user_struct *mmap_user;
* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-07-09 8:05 ` Peter Zijlstra
@ 2013-07-09 13:52 ` Yan, Zheng
2013-07-09 14:31 ` Peter Zijlstra
0 siblings, 1 reply; 12+ messages in thread
From: Yan, Zheng @ 2013-07-09 13:52 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, mingo, eranian, ak
On 07/09/2013 04:05 PM, Peter Zijlstra wrote:
> On Tue, Jul 09, 2013 at 03:05:41PM +0800, Yan, Zheng wrote:
>
>> Thank you for your help. I ran the same test; the results for the regular
>> case are much better, but there is still about 1% overhead, probably
>> because enlarging the ring_buffer structure makes it less cache friendly.
>>
>>         origin   with the patch
>> AVG      1000        1013
>> STDEV    13.4        15.0
>
> And this is the !overwrite case, right? I don't suppose you cured the logic
> Namhyung Kim pointed out? That should affect the overwrite case I suppose since
> it won't switch to perf_event_output_overwrite().
yes, it's the overwrite case.
>
> tip/master:
>
> struct ring_buffer {
> atomic_t refcount; /* 0 4 */
>
> /* XXX 4 bytes hole, try to pack */
>
> struct callback_head callback_head; /* 8 16 */
> int nr_pages; /* 24 4 */
> int overwrite; /* 28 4 */
> atomic_t poll; /* 32 4 */
>
> /* XXX 4 bytes hole, try to pack */
>
> local_t head; /* 40 8 */
> local_t nest; /* 48 8 */
> local_t events; /* 56 8 */
> /* --- cacheline 1 boundary (64 bytes) --- */
> local_t wakeup; /* 64 8 */
> local_t lost; /* 72 8 */
> long int watermark; /* 80 8 */
> spinlock_t event_lock; /* 88 56 */
> /* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */
> struct list_head event_list; /* 144 16 */
> atomic_t mmap_count; /* 160 4 */
>
> /* XXX 4 bytes hole, try to pack */
>
> long unsigned int mmap_locked; /* 168 8 */
> struct user_struct * mmap_user; /* 176 8 */
> struct perf_event_mmap_page * user_page; /* 184 8 */
> /* --- cacheline 3 boundary (192 bytes) --- */
> void * data_pages[0]; /* 192 0 */
>
> /* size: 192, cachelines: 3, members: 18 */
> /* sum members: 180, holes: 3, sum holes: 12 */
> };
>
> tip/master + patch:
>
> struct ring_buffer {
> atomic_t refcount; /* 0 4 */
>
> /* XXX 4 bytes hole, try to pack */
>
> struct callback_head callback_head; /* 8 16 */
> int nr_pages; /* 24 4 */
> int overwrite; /* 28 4 */
> atomic_t poll; /* 32 4 */
>
> /* XXX 4 bytes hole, try to pack */
>
> local_t tail; /* 40 8 */
> local_t next_tail; /* 48 8 */
> local_t head; /* 56 8 */
> /* --- cacheline 1 boundary (64 bytes) --- */
> local_t nest; /* 64 8 */
> local_t events; /* 72 8 */
> local_t wakeup; /* 80 8 */
> local_t lost; /* 88 8 */
> long int watermark; /* 96 8 */
> spinlock_t event_lock; /* 104 56 */
> /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
> struct list_head event_list; /* 160 16 */
> atomic_t mmap_count; /* 176 4 */
>
> /* XXX 4 bytes hole, try to pack */
>
> long unsigned int mmap_locked; /* 184 8 */
> /* --- cacheline 3 boundary (192 bytes) --- */
> struct user_struct * mmap_user; /* 192 8 */
> struct perf_event_mmap_page * user_page; /* 200 8 */
> void * data_pages[0]; /* 208 0 */
>
> /* size: 208, cachelines: 4, members: 20 */
> /* sum members: 196, holes: 3, sum holes: 12 */
> /* last cacheline: 16 bytes */
> };
>
> tip/master + patch^2:
>
> struct ring_buffer {
> atomic_t refcount; /* 0 4 */
> atomic_t mmap_count; /* 4 4 */
> union {
> int overwrite; /* 4 */
> struct callback_head callback_head; /* 16 */
> }; /* 8 16 */
> int nr_pages; /* 24 4 */
> atomic_t poll; /* 28 4 */
> local_t tail; /* 32 8 */
> local_t next_tail; /* 40 8 */
> local_t head; /* 48 8 */
> local_t nest; /* 56 8 */
> /* --- cacheline 1 boundary (64 bytes) --- */
> local_t events; /* 64 8 */
> local_t wakeup; /* 72 8 */
> local_t lost; /* 80 8 */
> long int watermark; /* 88 8 */
> spinlock_t event_lock; /* 96 56 */
> /* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
> struct list_head event_list; /* 152 16 */
> long unsigned int mmap_locked; /* 168 8 */
> struct user_struct * mmap_user; /* 176 8 */
> struct perf_event_mmap_page * user_page; /* 184 8 */
> /* --- cacheline 3 boundary (192 bytes) --- */
> void * data_pages[0]; /* 192 0 */
>
> /* size: 192, cachelines: 3, members: 19 */
> };
>
>
> ---
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -4641,7 +4641,7 @@ static void perf_event_output_overwrite(
> static void
> perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb)
> {
> - if (event->overflow_handler != perf_event_output ||
> + if (event->overflow_handler != perf_event_output &&
> event->overflow_handler != perf_event_output_overwrite)
> return;
>
> --- a/kernel/events/internal.h
> +++ b/kernel/events/internal.h
> @@ -10,13 +10,16 @@
>
> struct ring_buffer {
> atomic_t refcount;
> - struct rcu_head rcu_head;
> + atomic_t mmap_count;
> + union {
> + int overwrite; /* can overwrite itself */
> + struct rcu_head rcu_head;
> + };
> #ifdef CONFIG_PERF_USE_VMALLOC
> struct work_struct work;
> int page_order; /* allocation order */
> #endif
> int nr_pages; /* nr of data pages */
> - int overwrite; /* can overwrite itself */
>
> atomic_t poll; /* POLL_ for wakeups */
>
> @@ -33,7 +36,6 @@ struct ring_buffer {
> spinlock_t event_lock;
> struct list_head event_list;
>
> - atomic_t mmap_count;
> unsigned long mmap_locked;
> struct user_struct *mmap_user;
>
>
        origin   patch   patch^2
AVG      1000     1013     1028
STDEV    13.4     15.0     14.6
patch^2 doesn't help. I will try moving the new fields down and re-test tomorrow.
Regards
Yan, Zheng
* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-07-09 13:52 ` Yan, Zheng
@ 2013-07-09 14:31 ` Peter Zijlstra
0 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2013-07-09 14:31 UTC (permalink / raw)
To: Yan, Zheng; +Cc: linux-kernel, mingo, eranian, ak
On Tue, Jul 09, 2013 at 09:52:17PM +0800, Yan, Zheng wrote:
> On 07/09/2013 04:05 PM, Peter Zijlstra wrote:
> > On Tue, Jul 09, 2013 at 03:05:41PM +0800, Yan, Zheng wrote:
> >
> >> Thank you for your help. I ran the same test; the results for the regular
> >> case are much better, but there is still about 1% overhead, probably
> >> because enlarging the ring_buffer structure makes it less cache friendly.
> >>
> >>         origin   with the patch
> >> AVG      1000        1013
> >> STDEV    13.4        15.0
> >
> > And this is the !overwrite case, right? I don't suppose you cured the logic
> > Namhyung Kim pointed out? That should affect the overwrite case I suppose since
> > it won't switch to perf_event_output_overwrite().
>
> yes, it's the overwrite case.
So the most common case is the !overwrite one; we should ensure no significant
regression there. The overwrite case isn't used that much because, as you've
found, it's really hard to use without a valid tail pointer. So I'm not too
bothered about making the overwrite case a _little_ more expensive if that
makes it far more usable.
* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-07-08 12:15 ` Peter Zijlstra
2013-07-09 6:18 ` Namhyung Kim
2013-07-09 7:05 ` Yan, Zheng
@ 2013-07-10 11:37 ` Yan, Zheng
2013-07-10 11:44 ` Peter Zijlstra
2 siblings, 1 reply; 12+ messages in thread
From: Yan, Zheng @ 2013-07-10 11:37 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, mingo, eranian, ak
On 07/08/2013 08:15 PM, Peter Zijlstra wrote:
> On Thu, Jun 06, 2013 at 01:58:06PM +0800, Yan, Zheng wrote:
>> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>>
>> If the perf event buffer is in overwrite mode, the kernel only updates
>> the data head when it overwrites old samples. The program that owns
>> the buffer needs to periodically check the buffer and update a variable
>> that tracks the data tail. If the program fails to do this in time,
>> the data tail can be overwritten by new samples. The program has to
>> rewind the buffer because it does not know where the first valid
>> sample is.
>>
>> This patch makes the kernel update the data tail when it overwrites
>> old events, so the program that owns the event buffer can always
>> read the latest samples. This is convenient for programs that use
>> perf to do branch tracing. One use case is GDB branch tracing:
>> (http://sourceware.org/ml/gdb-patches/2012-06/msg00172.html)
>> It uses the perf interface to read BTS, but only cares about the
>> branches before the ptrace event.
>>
>> I added code to perf_output_{begin/end} to count how many cycles
>> are spent on sample output, then ran "perf record" to profile a kernel
>> compilation 10 times on IvyBridge-EP. (perf record -a make -j 60)
>> The first number is scaled to 1000; the remaining numbers are scaled
>> by the same factor.
>>
>>          before  overwrite mode   after  overwrite mode
>> AVG       1000        999          1046       1044
>> STDEV     19.4       19.5          17.1       17.9
>
> OK, so I was sure I replied to this email; but apparently I didn't :/
>
> So its still adding about 5% overhead to the regular case; this is sad.
>
> What does something like the below do?
>
I re-tested the patch on a different 32-core SandyBridge-EP machine; the
results are quite good.

        origin   origin overwrite   modified   modified overwrite
AVG      1000         1044             960           1006
STDEV    39.0         26.0             28.1          14.4
Regards
Yan, Zheng
> ---
> include/linux/perf_event.h | 2 +
> kernel/events/core.c | 60 ++++++++++++++++++++++++------
> kernel/events/internal.h | 2 +
> kernel/events/ring_buffer.c | 90 +++++++++++++++++++++++++++------------------
> 4 files changed, 106 insertions(+), 48 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 8873f82..bcce98a 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -753,6 +753,8 @@ static inline bool has_branch_stack(struct perf_event *event)
>
> extern int perf_output_begin(struct perf_output_handle *handle,
> struct perf_event *event, unsigned int size);
> +extern int perf_output_begin_overwrite(struct perf_output_handle *handle,
> + struct perf_event *event, unsigned int size);
> extern void perf_output_end(struct perf_output_handle *handle);
> extern unsigned int perf_output_copy(struct perf_output_handle *handle,
> const void *buf, unsigned int len);
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 1833bc5..4d674e9 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3878,6 +3878,8 @@ static const struct vm_operations_struct perf_mmap_vmops = {
> .page_mkwrite = perf_mmap_fault,
> };
>
> +static void perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb);
> +
> static int perf_mmap(struct file *file, struct vm_area_struct *vma)
> {
> struct perf_event *event = file->private_data;
> @@ -3985,6 +3987,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
> vma->vm_mm->pinned_vm += extra;
>
> ring_buffer_attach(event, rb);
> + perf_event_set_overflow(event, rb);
> rcu_assign_pointer(event->rb, rb);
>
> perf_event_update_userpage(event);
> @@ -4595,9 +4598,12 @@ void perf_prepare_sample(struct perf_event_header *header,
> }
> }
>
> -static void perf_event_output(struct perf_event *event,
> - struct perf_sample_data *data,
> - struct pt_regs *regs)
> +static __always_inline void
> +__perf_event_output(struct perf_event *event,
> + struct perf_sample_data *data,
> + struct pt_regs *regs,
> + int (*output_begin)(struct perf_output_handle *,
> + struct perf_event *, unsigned int))
> {
> struct perf_output_handle handle;
> struct perf_event_header header;
> @@ -4607,7 +4613,7 @@ static void perf_event_output(struct perf_event *event,
>
> perf_prepare_sample(&header, data, event, regs);
>
> - if (perf_output_begin(&handle, event, header.size))
> + if (output_begin(&handle, event, header.size))
> goto exit;
>
> perf_output_sample(&handle, &header, data, event);
> @@ -4618,6 +4624,33 @@ static void perf_event_output(struct perf_event *event,
> rcu_read_unlock();
> }
>
> +static void perf_event_output(struct perf_event *event,
> + struct perf_sample_data *data,
> + struct pt_regs *regs)
> +{
> + __perf_event_output(event, data, regs, perf_output_begin);
> +}
> +
> +static void perf_event_output_overwrite(struct perf_event *event,
> + struct perf_sample_data *data,
> + struct pt_regs *regs)
> +{
> + __perf_event_output(event, data, regs, perf_output_begin_overwrite);
> +}
> +
> +static void
> +perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb)
> +{
> + if (event->overflow_handler != perf_event_output ||
> + event->overflow_handler != perf_event_output_overwrite)
> + return;
> +
> + if (rb->overwrite)
> + event->overflow_handler = perf_event_output_overwrite;
> + else
> + event->overflow_handler = perf_event_output;
> +}
> +
> /*
> * read event_id
> */
> @@ -5183,10 +5216,7 @@ static int __perf_event_overflow(struct perf_event *event,
> irq_work_queue(&event->pending);
> }
>
> - if (event->overflow_handler)
> - event->overflow_handler(event, data, regs);
> - else
> - perf_event_output(event, data, regs);
> + event->overflow_handler(event, data, regs);
>
> if (event->fasync && event->pending_kill) {
> event->pending_wakeup = 1;
> @@ -6501,8 +6531,13 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
> context = parent_event->overflow_handler_context;
> }
>
> - event->overflow_handler = overflow_handler;
> - event->overflow_handler_context = context;
> + if (overflow_handler) {
> + event->overflow_handler = overflow_handler;
> + event->overflow_handler_context = context;
> + } else {
> + event->overflow_handler = perf_event_output;
> + event->overflow_handler_context = NULL;
> + }
>
> perf_event__state_init(event);
>
> @@ -6736,9 +6771,10 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
> if (old_rb)
> ring_buffer_detach(event, old_rb);
>
> - if (rb)
> + if (rb) {
> ring_buffer_attach(event, rb);
> -
> + perf_event_set_overflow(event, rb);
> + }
> rcu_assign_pointer(event->rb, rb);
>
> if (old_rb) {
> diff --git a/kernel/events/internal.h b/kernel/events/internal.h
> index ca65997..c4e4610 100644
> --- a/kernel/events/internal.h
> +++ b/kernel/events/internal.h
> @@ -20,6 +20,8 @@ struct ring_buffer {
>
> atomic_t poll; /* POLL_ for wakeups */
>
> + local_t tail; /* read position */
> + local_t next_tail; /* next read position */
> local_t head; /* write position */
> local_t nest; /* nested writers */
> local_t events; /* event limit */
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index cd55144..5887044 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -15,28 +15,9 @@
>
> #include "internal.h"
>
> -static bool perf_output_space(struct ring_buffer *rb, unsigned long tail,
> - unsigned long offset, unsigned long head)
> +static bool perf_output_space(unsigned long tail, unsigned long offset,
> + unsigned long head, unsigned long mask)
> {
> - unsigned long sz = perf_data_size(rb);
> - unsigned long mask = sz - 1;
> -
> - /*
> - * check if user-writable
> - * overwrite : over-write its own tail
> - * !overwrite: buffer possibly drops events.
> - */
> - if (rb->overwrite)
> - return true;
> -
> - /*
> - * verify that payload is not bigger than buffer
> - * otherwise masking logic may fail to detect
> - * the "not enough space" condition
> - */
> - if ((head - offset) > sz)
> - return false;
> -
> offset = (offset - tail) & mask;
> head = (head - tail) & mask;
>
> @@ -109,11 +90,11 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
> preempt_enable();
> }
>
> -int perf_output_begin(struct perf_output_handle *handle,
> - struct perf_event *event, unsigned int size)
> +static __always_inline int __perf_output_begin(struct perf_output_handle *handle,
> + struct perf_event *event, unsigned int size, bool overwrite)
> {
> struct ring_buffer *rb;
> - unsigned long tail, offset, head;
> + unsigned long tail, offset, head, max_size;
> int have_lost;
> struct perf_sample_data sample_data;
> struct {
> @@ -136,7 +117,8 @@ int perf_output_begin(struct perf_output_handle *handle,
> handle->rb = rb;
> handle->event = event;
>
> - if (!rb->nr_pages)
> + max_size = perf_data_size(rb);
> + if (size > max_size)
> goto out;
>
> have_lost = local_read(&rb->lost);
> @@ -149,19 +131,43 @@ int perf_output_begin(struct perf_output_handle *handle,
>
> perf_output_get_handle(handle);
>
> - do {
> + if (overwrite) {
> + do {
> + tail = local_read(&rb->tail);
> + offset = local_read(&rb->head);
> + head = offset + size;
> + if (unlikely(!perf_output_space(tail, offset, head,
> + max_size - 1))) {
> + tail = local_read(&rb->next_tail);
> + local_set(&rb->tail, tail);
> + rb->user_page->data_tail = tail;
> + }
> + } while (local_cmpxchg(&rb->head, offset, head) != offset);
> +
> /*
> - * Userspace could choose to issue a mb() before updating the
> - * tail pointer. So that all reads will be completed before the
> - * write is issued.
> + * Save the start of next event when half of the buffer
> + * has been filled. Later when the event buffer overflows,
> + * update the tail pointer to point to it.
> */
> - tail = ACCESS_ONCE(rb->user_page->data_tail);
> - smp_rmb();
> - offset = head = local_read(&rb->head);
> - head += size;
> - if (unlikely(!perf_output_space(rb, tail, offset, head)))
> - goto fail;
> - } while (local_cmpxchg(&rb->head, offset, head) != offset);
> + if (tail == local_read(&rb->next_tail) &&
> + head - tail >= (max_size >> 1))
> + local_cmpxchg(&rb->next_tail, tail, head);
> + } else {
> + do {
> + /*
> + * Userspace could choose to issue a mb() before
> + * updating the tail pointer. So that all reads will
> + * be completed before the write is issued.
> + */
> + tail = ACCESS_ONCE(rb->user_page->data_tail);
> + smp_rmb();
> + offset = local_read(&rb->head);
> + head = offset + size;
> + if (unlikely(!perf_output_space(tail, offset, head,
> + max_size - 1)))
> + goto fail;
> + } while (local_cmpxchg(&rb->head, offset, head) != offset);
> + }
>
> if (head - local_read(&rb->wakeup) > rb->watermark)
> local_add(rb->watermark, &rb->wakeup);
> @@ -194,6 +200,18 @@ int perf_output_begin(struct perf_output_handle *handle,
> return -ENOSPC;
> }
>
> +int perf_output_begin(struct perf_output_handle *handle,
> + struct perf_event *event, unsigned int size)
> +{
> + return __perf_output_begin(handle, event, size, false);
> +}
> +
> +int perf_output_begin_overwrite(struct perf_output_handle *handle,
> + struct perf_event *event, unsigned int size)
> +{
> + return __perf_output_begin(handle, event, size, true);
> +}
> +
> unsigned int perf_output_copy(struct perf_output_handle *handle,
> const void *buf, unsigned int len)
> {
>
^ permalink raw reply [flat|nested] 12+ messages in thread

* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-07-10 11:37 ` Yan, Zheng
@ 2013-07-10 11:44 ` Peter Zijlstra
2013-07-11 0:46 ` Yan, Zheng
0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2013-07-10 11:44 UTC (permalink / raw)
To: Yan, Zheng; +Cc: linux-kernel, mingo, eranian, ak
On Wed, Jul 10, 2013 at 07:37:43PM +0800, Yan, Zheng wrote:
> On 07/08/2013 08:15 PM, Peter Zijlstra wrote:
> > On Thu, Jun 06, 2013 at 01:58:06PM +0800, Yan, Zheng wrote:
> >> From: "Yan, Zheng" <zheng.z.yan@intel.com>
> >>        before  overwrite mode  after  overwrite mode
> >> AVG    1000    999             1046   1044
> >> STDEV  19.4    19.5            17.1   17.9
> >
> > OK, so I was sure I replied to this email; but apparently I didn't :/
> >
> > So it's still adding about 5% overhead to the regular case; this is sad.
> >
> > What does something like the below do?
> >
>
> I re-tested the patch on a different 32-core sandybridge-ep machine. The result is quite good.
>
>        origin  origin overwrite  modified  modified overwrite
> AVG    1000    1044              960       1006
> STDEV  39.0    26.0              28.1      14.4
Nice! -- you did fix the snafu for the overwrite mode before testing, right?
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-07-10 11:44 ` Peter Zijlstra
@ 2013-07-11 0:46 ` Yan, Zheng
0 siblings, 0 replies; 12+ messages in thread
From: Yan, Zheng @ 2013-07-11 0:46 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, mingo, eranian, ak
On 07/10/2013 07:44 PM, Peter Zijlstra wrote:
> On Wed, Jul 10, 2013 at 07:37:43PM +0800, Yan, Zheng wrote:
>> On 07/08/2013 08:15 PM, Peter Zijlstra wrote:
>>> On Thu, Jun 06, 2013 at 01:58:06PM +0800, Yan, Zheng wrote:
>>>> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>
>>>>        before  overwrite mode  after  overwrite mode
>>>> AVG    1000    999             1046   1044
>>>> STDEV  19.4    19.5            17.1   17.9
>>>
>>> OK, so I was sure I replied to this email; but apparently I didn't :/
>>>
>>> So it's still adding about 5% overhead to the regular case; this is sad.
>>>
>>> What does something like the below do?
>>>
>>
>> I re-tested the patch on a different 32-core sandybridge-ep machine. The result is quite good.
>>
>>        origin  origin overwrite  modified  modified overwrite
>> AVG    1000    1044              960       1006
>> STDEV  39.0    26.0              28.1      14.4
>
> Nice! -- you did fix the snafu for the overwrite mode before testing, right?
>
yes, of course.
Regards
Yan, Zheng
^ permalink raw reply [flat|nested] 12+ messages in thread