* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-07-08 12:15 ` Peter Zijlstra
@ 2013-07-09 6:18 ` Namhyung Kim
2013-07-09 7:40 ` Peter Zijlstra
2013-07-09 7:05 ` Yan, Zheng
2013-07-10 11:37 ` Yan, Zheng
2 siblings, 1 reply; 12+ messages in thread
From: Namhyung Kim @ 2013-07-09 6:18 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Yan, Zheng, linux-kernel, mingo, eranian, ak
Hi Peter,
On Mon, 8 Jul 2013 14:15:57 +0200, Peter Zijlstra wrote:
[SNIP]
> +static void
> +perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb)
> +{
> + if (event->overflow_handler != perf_event_output ||
I guess you wanted "&&" instead of "||" here.
Thanks,
Namhyung
> + event->overflow_handler != perf_event_output_overwrite)
> + return;
> +
> + if (rb->overwrite)
> + event->overflow_handler = perf_event_output_overwrite;
> + else
> + event->overflow_handler = perf_event_output;
> +}
* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-07-09 6:18 ` Namhyung Kim
@ 2013-07-09 7:40 ` Peter Zijlstra
0 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2013-07-09 7:40 UTC (permalink / raw)
To: Namhyung Kim; +Cc: Yan, Zheng, linux-kernel, mingo, eranian, ak
On Tue, Jul 09, 2013 at 03:18:20PM +0900, Namhyung Kim wrote:
> Hi Peter,
>
> On Mon, 8 Jul 2013 14:15:57 +0200, Peter Zijlstra wrote:
> [SNIP]
> > +static void
> > +perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb)
> > +{
> > + if (event->overflow_handler != perf_event_output ||
>
> I guess you wanted "&&" instead of "||" here.
Uhm.. yeah. /me dons the brown paper bag.
* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-07-08 12:15 ` Peter Zijlstra
2013-07-09 6:18 ` Namhyung Kim
@ 2013-07-09 7:05 ` Yan, Zheng
2013-07-09 8:05 ` Peter Zijlstra
2013-07-10 11:37 ` Yan, Zheng
2 siblings, 1 reply; 12+ messages in thread
From: Yan, Zheng @ 2013-07-09 7:05 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, mingo, eranian, ak
On 07/08/2013 08:15 PM, Peter Zijlstra wrote:
> On Thu, Jun 06, 2013 at 01:58:06PM +0800, Yan, Zheng wrote:
>> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>>
>> If the perf event buffer is in overwrite mode, the kernel only updates
>> the data head when it overwrites old samples. The program that owns
>> the buffer needs to periodically check the buffer and update a variable
>> that tracks the data tail. If the program fails to do this in time,
>> the data tail can be overwritten by new samples. The program has to
>> rewind the buffer because it does not know where the first valid
>> sample is.
>>
>> This patch makes the kernel update the data tail when it overwrites
>> old events, so the program that owns the event buffer can always
>> read the latest samples. This is convenient for programs that use
>> perf to do branch tracing. One use case is GDB branch tracing:
>> (http://sourceware.org/ml/gdb-patches/2012-06/msg00172.html)
>> It uses the perf interface to read BTS, but only cares about the
>> branches before the ptrace event.
>>
>> I added code to perf_output_{begin/end} to count how many cycles
>> are spent on sample output, then ran "perf record" to profile a kernel
>> compilation 10 times on IvyBridge-EP. (perf record -a make -j 60)
>> The first number is scaled to 1000; the remaining numbers are scaled
>> by the same factor.
>>
>>          before  overwrite mode   after  overwrite mode
>> AVG       1000        999          1046       1044
>> STDEV     19.4       19.5          17.1       17.9
>
> OK, so I was sure I replied to this email; but apparently I didn't :/
>
> So its still adding about 5% overhead to the regular case; this is sad.
>
> What does something like the below do?
Thank you for your help. I ran the same test; the results for the regular
case are much better, but there is still about 1% overhead, probably
because enlarging the ring_buffer structure makes it less cache friendly.

        origin   with the patch
AVG      1000        1013
STDEV    13.4        15.0
Regards
Yan, Zheng
>
> ---
> include/linux/perf_event.h | 2 +
> kernel/events/core.c | 60 ++++++++++++++++++++++++------
> kernel/events/internal.h | 2 +
> kernel/events/ring_buffer.c | 90 +++++++++++++++++++++++++++------------------
> 4 files changed, 106 insertions(+), 48 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 8873f82..bcce98a 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -753,6 +753,8 @@ static inline bool has_branch_stack(struct perf_event *event)
>
> extern int perf_output_begin(struct perf_output_handle *handle,
> struct perf_event *event, unsigned int size);
> +extern int perf_output_begin_overwrite(struct perf_output_handle *handle,
> + struct perf_event *event, unsigned int size);
> extern void perf_output_end(struct perf_output_handle *handle);
> extern unsigned int perf_output_copy(struct perf_output_handle *handle,
> const void *buf, unsigned int len);
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 1833bc5..4d674e9 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3878,6 +3878,8 @@ static const struct vm_operations_struct perf_mmap_vmops = {
> .page_mkwrite = perf_mmap_fault,
> };
>
> +static void perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb);
> +
> static int perf_mmap(struct file *file, struct vm_area_struct *vma)
> {
> struct perf_event *event = file->private_data;
> @@ -3985,6 +3987,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
> vma->vm_mm->pinned_vm += extra;
>
> ring_buffer_attach(event, rb);
> + perf_event_set_overflow(event, rb);
> rcu_assign_pointer(event->rb, rb);
>
> perf_event_update_userpage(event);
> @@ -4595,9 +4598,12 @@ void perf_prepare_sample(struct perf_event_header *header,
> }
> }
>
> -static void perf_event_output(struct perf_event *event,
> - struct perf_sample_data *data,
> - struct pt_regs *regs)
> +static __always_inline void
> +__perf_event_output(struct perf_event *event,
> + struct perf_sample_data *data,
> + struct pt_regs *regs,
> + int (*output_begin)(struct perf_output_handle *,
> + struct perf_event *, unsigned int))
> {
> struct perf_output_handle handle;
> struct perf_event_header header;
> @@ -4607,7 +4613,7 @@ static void perf_event_output(struct perf_event *event,
>
> perf_prepare_sample(&header, data, event, regs);
>
> - if (perf_output_begin(&handle, event, header.size))
> + if (output_begin(&handle, event, header.size))
> goto exit;
>
> perf_output_sample(&handle, &header, data, event);
> @@ -4618,6 +4624,33 @@ static void perf_event_output(struct perf_event *event,
> rcu_read_unlock();
> }
>
> +static void perf_event_output(struct perf_event *event,
> + struct perf_sample_data *data,
> + struct pt_regs *regs)
> +{
> + __perf_event_output(event, data, regs, perf_output_begin);
> +}
> +
> +static void perf_event_output_overwrite(struct perf_event *event,
> + struct perf_sample_data *data,
> + struct pt_regs *regs)
> +{
> + __perf_event_output(event, data, regs, perf_output_begin_overwrite);
> +}
> +
> +static void
> +perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb)
> +{
> + if (event->overflow_handler != perf_event_output ||
> + event->overflow_handler != perf_event_output_overwrite)
> + return;
> +
> + if (rb->overwrite)
> + event->overflow_handler = perf_event_output_overwrite;
> + else
> + event->overflow_handler = perf_event_output;
> +}
> +
> /*
> * read event_id
> */
> @@ -5183,10 +5216,7 @@ static int __perf_event_overflow(struct perf_event *event,
> irq_work_queue(&event->pending);
> }
>
> - if (event->overflow_handler)
> - event->overflow_handler(event, data, regs);
> - else
> - perf_event_output(event, data, regs);
> + event->overflow_handler(event, data, regs);
>
> if (event->fasync && event->pending_kill) {
> event->pending_wakeup = 1;
> @@ -6501,8 +6531,13 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
> context = parent_event->overflow_handler_context;
> }
>
> - event->overflow_handler = overflow_handler;
> - event->overflow_handler_context = context;
> + if (overflow_handler) {
> + event->overflow_handler = overflow_handler;
> + event->overflow_handler_context = context;
> + } else {
> + event->overflow_handler = perf_event_output;
> + event->overflow_handler_context = NULL;
> + }
>
> perf_event__state_init(event);
>
> @@ -6736,9 +6771,10 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
> if (old_rb)
> ring_buffer_detach(event, old_rb);
>
> - if (rb)
> + if (rb) {
> ring_buffer_attach(event, rb);
> -
> + perf_event_set_overflow(event, rb);
> + }
> rcu_assign_pointer(event->rb, rb);
>
> if (old_rb) {
> diff --git a/kernel/events/internal.h b/kernel/events/internal.h
> index ca65997..c4e4610 100644
> --- a/kernel/events/internal.h
> +++ b/kernel/events/internal.h
> @@ -20,6 +20,8 @@ struct ring_buffer {
>
> atomic_t poll; /* POLL_ for wakeups */
>
> + local_t tail; /* read position */
> + local_t next_tail; /* next read position */
> local_t head; /* write position */
> local_t nest; /* nested writers */
> local_t events; /* event limit */
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index cd55144..5887044 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -15,28 +15,9 @@
>
> #include "internal.h"
>
> -static bool perf_output_space(struct ring_buffer *rb, unsigned long tail,
> - unsigned long offset, unsigned long head)
> +static bool perf_output_space(unsigned long tail, unsigned long offset,
> + unsigned long head, unsigned long mask)
> {
> - unsigned long sz = perf_data_size(rb);
> - unsigned long mask = sz - 1;
> -
> - /*
> - * check if user-writable
> - * overwrite : over-write its own tail
> - * !overwrite: buffer possibly drops events.
> - */
> - if (rb->overwrite)
> - return true;
> -
> - /*
> - * verify that payload is not bigger than buffer
> - * otherwise masking logic may fail to detect
> - * the "not enough space" condition
> - */
> - if ((head - offset) > sz)
> - return false;
> -
> offset = (offset - tail) & mask;
> head = (head - tail) & mask;
>
> @@ -109,11 +90,11 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
> preempt_enable();
> }
>
> -int perf_output_begin(struct perf_output_handle *handle,
> - struct perf_event *event, unsigned int size)
> +static __always_inline int __perf_output_begin(struct perf_output_handle *handle,
> + struct perf_event *event, unsigned int size, bool overwrite)
> {
> struct ring_buffer *rb;
> - unsigned long tail, offset, head;
> + unsigned long tail, offset, head, max_size;
> int have_lost;
> struct perf_sample_data sample_data;
> struct {
> @@ -136,7 +117,8 @@ int perf_output_begin(struct perf_output_handle *handle,
> handle->rb = rb;
> handle->event = event;
>
> - if (!rb->nr_pages)
> + max_size = perf_data_size(rb);
> + if (size > max_size)
> goto out;
>
> have_lost = local_read(&rb->lost);
> @@ -149,19 +131,43 @@ int perf_output_begin(struct perf_output_handle *handle,
>
> perf_output_get_handle(handle);
>
> - do {
> + if (overwrite) {
> + do {
> + tail = local_read(&rb->tail);
> + offset = local_read(&rb->head);
> + head = offset + size;
> + if (unlikely(!perf_output_space(tail, offset, head,
> + max_size - 1))) {
> + tail = local_read(&rb->next_tail);
> + local_set(&rb->tail, tail);
> + rb->user_page->data_tail = tail;
> + }
> + } while (local_cmpxchg(&rb->head, offset, head) != offset);
> +
> /*
> - * Userspace could choose to issue a mb() before updating the
> - * tail pointer. So that all reads will be completed before the
> - * write is issued.
> + * Save the start of next event when half of the buffer
> + * has been filled. Later when the event buffer overflows,
> + * update the tail pointer to point to it.
> */
> - tail = ACCESS_ONCE(rb->user_page->data_tail);
> - smp_rmb();
> - offset = head = local_read(&rb->head);
> - head += size;
> - if (unlikely(!perf_output_space(rb, tail, offset, head)))
> - goto fail;
> - } while (local_cmpxchg(&rb->head, offset, head) != offset);
> + if (tail == local_read(&rb->next_tail) &&
> + head - tail >= (max_size >> 1))
> + local_cmpxchg(&rb->next_tail, tail, head);
> + } else {
> + do {
> + /*
> + * Userspace could choose to issue a mb() before
> + * updating the tail pointer. So that all reads will
> + * be completed before the write is issued.
> + */
> + tail = ACCESS_ONCE(rb->user_page->data_tail);
> + smp_rmb();
> + offset = local_read(&rb->head);
> + head = offset + size;
> + if (unlikely(!perf_output_space(tail, offset, head,
> + max_size - 1)))
> + goto fail;
> + } while (local_cmpxchg(&rb->head, offset, head) != offset);
> + }
>
> if (head - local_read(&rb->wakeup) > rb->watermark)
> local_add(rb->watermark, &rb->wakeup);
> @@ -194,6 +200,18 @@ int perf_output_begin(struct perf_output_handle *handle,
> return -ENOSPC;
> }
>
> +int perf_output_begin(struct perf_output_handle *handle,
> + struct perf_event *event, unsigned int size)
> +{
> + return __perf_output_begin(handle, event, size, false);
> +}
> +
> +int perf_output_begin_overwrite(struct perf_output_handle *handle,
> + struct perf_event *event, unsigned int size)
> +{
> + return __perf_output_begin(handle, event, size, true);
> +}
> +
> unsigned int perf_output_copy(struct perf_output_handle *handle,
> const void *buf, unsigned int len)
> {
>
* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-07-09 7:05 ` Yan, Zheng
@ 2013-07-09 8:05 ` Peter Zijlstra
2013-07-09 13:52 ` Yan, Zheng
0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2013-07-09 8:05 UTC (permalink / raw)
To: Yan, Zheng; +Cc: linux-kernel, mingo, eranian, ak
On Tue, Jul 09, 2013 at 03:05:41PM +0800, Yan, Zheng wrote:
> Thank you for your help. I ran the same test; the results for the regular
> case are much better, but there is still about 1% overhead, probably
> because enlarging the ring_buffer structure makes it less cache friendly.
>
>         origin   with the patch
> AVG      1000        1013
> STDEV    13.4        15.0
And this is the !overwrite case, right? I don't suppose you cured the logic
Namhyung Kim pointed out? That should affect the overwrite case I suppose since
it won't switch to perf_event_output_overwrite().
tip/master:
struct ring_buffer {
atomic_t refcount; /* 0 4 */
/* XXX 4 bytes hole, try to pack */
struct callback_head callback_head; /* 8 16 */
int nr_pages; /* 24 4 */
int overwrite; /* 28 4 */
atomic_t poll; /* 32 4 */
/* XXX 4 bytes hole, try to pack */
local_t head; /* 40 8 */
local_t nest; /* 48 8 */
local_t events; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
local_t wakeup; /* 64 8 */
local_t lost; /* 72 8 */
long int watermark; /* 80 8 */
spinlock_t event_lock; /* 88 56 */
/* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */
struct list_head event_list; /* 144 16 */
atomic_t mmap_count; /* 160 4 */
/* XXX 4 bytes hole, try to pack */
long unsigned int mmap_locked; /* 168 8 */
struct user_struct * mmap_user; /* 176 8 */
struct perf_event_mmap_page * user_page; /* 184 8 */
/* --- cacheline 3 boundary (192 bytes) --- */
void * data_pages[0]; /* 192 0 */
/* size: 192, cachelines: 3, members: 18 */
/* sum members: 180, holes: 3, sum holes: 12 */
};
tip/master + patch:
struct ring_buffer {
atomic_t refcount; /* 0 4 */
/* XXX 4 bytes hole, try to pack */
struct callback_head callback_head; /* 8 16 */
int nr_pages; /* 24 4 */
int overwrite; /* 28 4 */
atomic_t poll; /* 32 4 */
/* XXX 4 bytes hole, try to pack */
local_t tail; /* 40 8 */
local_t next_tail; /* 48 8 */
local_t head; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
local_t nest; /* 64 8 */
local_t events; /* 72 8 */
local_t wakeup; /* 80 8 */
local_t lost; /* 88 8 */
long int watermark; /* 96 8 */
spinlock_t event_lock; /* 104 56 */
/* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
struct list_head event_list; /* 160 16 */
atomic_t mmap_count; /* 176 4 */
/* XXX 4 bytes hole, try to pack */
long unsigned int mmap_locked; /* 184 8 */
/* --- cacheline 3 boundary (192 bytes) --- */
struct user_struct * mmap_user; /* 192 8 */
struct perf_event_mmap_page * user_page; /* 200 8 */
void * data_pages[0]; /* 208 0 */
/* size: 208, cachelines: 4, members: 20 */
/* sum members: 196, holes: 3, sum holes: 12 */
/* last cacheline: 16 bytes */
};
tip/master + patch^2:
struct ring_buffer {
atomic_t refcount; /* 0 4 */
atomic_t mmap_count; /* 4 4 */
union {
int overwrite; /* 4 */
struct callback_head callback_head; /* 16 */
}; /* 8 16 */
int nr_pages; /* 24 4 */
atomic_t poll; /* 28 4 */
local_t tail; /* 32 8 */
local_t next_tail; /* 40 8 */
local_t head; /* 48 8 */
local_t nest; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
local_t events; /* 64 8 */
local_t wakeup; /* 72 8 */
local_t lost; /* 80 8 */
long int watermark; /* 88 8 */
spinlock_t event_lock; /* 96 56 */
/* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
struct list_head event_list; /* 152 16 */
long unsigned int mmap_locked; /* 168 8 */
struct user_struct * mmap_user; /* 176 8 */
struct perf_event_mmap_page * user_page; /* 184 8 */
/* --- cacheline 3 boundary (192 bytes) --- */
void * data_pages[0]; /* 192 0 */
/* size: 192, cachelines: 3, members: 19 */
};
---
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4641,7 +4641,7 @@ static void perf_event_output_overwrite(
static void
perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb)
{
- if (event->overflow_handler != perf_event_output ||
+ if (event->overflow_handler != perf_event_output &&
event->overflow_handler != perf_event_output_overwrite)
return;
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -10,13 +10,16 @@
struct ring_buffer {
atomic_t refcount;
- struct rcu_head rcu_head;
+ atomic_t mmap_count;
+ union {
+ int overwrite; /* can overwrite itself */
+ struct rcu_head rcu_head;
+ };
#ifdef CONFIG_PERF_USE_VMALLOC
struct work_struct work;
int page_order; /* allocation order */
#endif
int nr_pages; /* nr of data pages */
- int overwrite; /* can overwrite itself */
atomic_t poll; /* POLL_ for wakeups */
@@ -33,7 +36,6 @@ struct ring_buffer {
spinlock_t event_lock;
struct list_head event_list;
- atomic_t mmap_count;
unsigned long mmap_locked;
struct user_struct *mmap_user;
* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-07-09 8:05 ` Peter Zijlstra
@ 2013-07-09 13:52 ` Yan, Zheng
2013-07-09 14:31 ` Peter Zijlstra
0 siblings, 1 reply; 12+ messages in thread
From: Yan, Zheng @ 2013-07-09 13:52 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, mingo, eranian, ak
On 07/09/2013 04:05 PM, Peter Zijlstra wrote:
> On Tue, Jul 09, 2013 at 03:05:41PM +0800, Yan, Zheng wrote:
>
>> Thank you for your help. I ran the same test; the results for the regular
>> case are much better, but there is still about 1% overhead, probably
>> because enlarging the ring_buffer structure makes it less cache friendly.
>>
>>         origin   with the patch
>> AVG      1000        1013
>> STDEV    13.4        15.0
>
> And this is the !overwrite case, right? I don't suppose you cured the logic
> Namhyung Kim pointed out? That should affect the overwrite case I suppose since
> it won't switch to perf_event_output_overwrite().
yes, it's the overwrite case.
>
> tip/master:
>
> struct ring_buffer {
> atomic_t refcount; /* 0 4 */
>
> /* XXX 4 bytes hole, try to pack */
>
> struct callback_head callback_head; /* 8 16 */
> int nr_pages; /* 24 4 */
> int overwrite; /* 28 4 */
> atomic_t poll; /* 32 4 */
>
> /* XXX 4 bytes hole, try to pack */
>
> local_t head; /* 40 8 */
> local_t nest; /* 48 8 */
> local_t events; /* 56 8 */
> /* --- cacheline 1 boundary (64 bytes) --- */
> local_t wakeup; /* 64 8 */
> local_t lost; /* 72 8 */
> long int watermark; /* 80 8 */
> spinlock_t event_lock; /* 88 56 */
> /* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */
> struct list_head event_list; /* 144 16 */
> atomic_t mmap_count; /* 160 4 */
>
> /* XXX 4 bytes hole, try to pack */
>
> long unsigned int mmap_locked; /* 168 8 */
> struct user_struct * mmap_user; /* 176 8 */
> struct perf_event_mmap_page * user_page; /* 184 8 */
> /* --- cacheline 3 boundary (192 bytes) --- */
> void * data_pages[0]; /* 192 0 */
>
> /* size: 192, cachelines: 3, members: 18 */
> /* sum members: 180, holes: 3, sum holes: 12 */
> };
>
> tip/master + patch:
>
> struct ring_buffer {
> atomic_t refcount; /* 0 4 */
>
> /* XXX 4 bytes hole, try to pack */
>
> struct callback_head callback_head; /* 8 16 */
> int nr_pages; /* 24 4 */
> int overwrite; /* 28 4 */
> atomic_t poll; /* 32 4 */
>
> /* XXX 4 bytes hole, try to pack */
>
> local_t tail; /* 40 8 */
> local_t next_tail; /* 48 8 */
> local_t head; /* 56 8 */
> /* --- cacheline 1 boundary (64 bytes) --- */
> local_t nest; /* 64 8 */
> local_t events; /* 72 8 */
> local_t wakeup; /* 80 8 */
> local_t lost; /* 88 8 */
> long int watermark; /* 96 8 */
> spinlock_t event_lock; /* 104 56 */
> /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
> struct list_head event_list; /* 160 16 */
> atomic_t mmap_count; /* 176 4 */
>
> /* XXX 4 bytes hole, try to pack */
>
> long unsigned int mmap_locked; /* 184 8 */
> /* --- cacheline 3 boundary (192 bytes) --- */
> struct user_struct * mmap_user; /* 192 8 */
> struct perf_event_mmap_page * user_page; /* 200 8 */
> void * data_pages[0]; /* 208 0 */
>
> /* size: 208, cachelines: 4, members: 20 */
> /* sum members: 196, holes: 3, sum holes: 12 */
> /* last cacheline: 16 bytes */
> };
>
> tip/master + patch^2:
>
> struct ring_buffer {
> atomic_t refcount; /* 0 4 */
> atomic_t mmap_count; /* 4 4 */
> union {
> int overwrite; /* 4 */
> struct callback_head callback_head; /* 16 */
> }; /* 8 16 */
> int nr_pages; /* 24 4 */
> atomic_t poll; /* 28 4 */
> local_t tail; /* 32 8 */
> local_t next_tail; /* 40 8 */
> local_t head; /* 48 8 */
> local_t nest; /* 56 8 */
> /* --- cacheline 1 boundary (64 bytes) --- */
> local_t events; /* 64 8 */
> local_t wakeup; /* 72 8 */
> local_t lost; /* 80 8 */
> long int watermark; /* 88 8 */
> spinlock_t event_lock; /* 96 56 */
> /* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
> struct list_head event_list; /* 152 16 */
> long unsigned int mmap_locked; /* 168 8 */
> struct user_struct * mmap_user; /* 176 8 */
> struct perf_event_mmap_page * user_page; /* 184 8 */
> /* --- cacheline 3 boundary (192 bytes) --- */
> void * data_pages[0]; /* 192 0 */
>
> /* size: 192, cachelines: 3, members: 19 */
> };
>
>
> ---
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -4641,7 +4641,7 @@ static void perf_event_output_overwrite(
> static void
> perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb)
> {
> - if (event->overflow_handler != perf_event_output ||
> + if (event->overflow_handler != perf_event_output &&
> event->overflow_handler != perf_event_output_overwrite)
> return;
>
> --- a/kernel/events/internal.h
> +++ b/kernel/events/internal.h
> @@ -10,13 +10,16 @@
>
> struct ring_buffer {
> atomic_t refcount;
> - struct rcu_head rcu_head;
> + atomic_t mmap_count;
> + union {
> + int overwrite; /* can overwrite itself */
> + struct rcu_head rcu_head;
> + };
> #ifdef CONFIG_PERF_USE_VMALLOC
> struct work_struct work;
> int page_order; /* allocation order */
> #endif
> int nr_pages; /* nr of data pages */
> - int overwrite; /* can overwrite itself */
>
> atomic_t poll; /* POLL_ for wakeups */
>
> @@ -33,7 +36,6 @@ struct ring_buffer {
> spinlock_t event_lock;
> struct list_head event_list;
>
> - atomic_t mmap_count;
> unsigned long mmap_locked;
> struct user_struct *mmap_user;
>
>
        origin   patch   patch^2
AVG      1000     1013     1028
STDEV    13.4     15.0     14.6
patch^2 doesn't help. I will try moving the new fields down and re-test tomorrow.
Regards
Yan, Zheng
* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-07-09 13:52 ` Yan, Zheng
@ 2013-07-09 14:31 ` Peter Zijlstra
0 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2013-07-09 14:31 UTC (permalink / raw)
To: Yan, Zheng; +Cc: linux-kernel, mingo, eranian, ak
On Tue, Jul 09, 2013 at 09:52:17PM +0800, Yan, Zheng wrote:
> On 07/09/2013 04:05 PM, Peter Zijlstra wrote:
> > On Tue, Jul 09, 2013 at 03:05:41PM +0800, Yan, Zheng wrote:
> >
> >> Thank you for your help. I ran the same test; the results for the regular
> >> case are much better, but there is still about 1% overhead, probably
> >> because enlarging the ring_buffer structure makes it less cache friendly.
> >>
> >>         origin   with the patch
> >> AVG      1000        1013
> >> STDEV    13.4        15.0
> >
> > And this is the !overwrite case, right? I don't suppose you cured the logic
> > Namhyung Kim pointed out? That should affect the overwrite case I suppose since
> > it won't switch to perf_event_output_overwrite().
>
> yes, it's the overwrite case.
So the most common case is the !overwrite one; we should ensure no significant
regression there. The overwrite case isn't used that much because, as you've
found, it's really hard to use without a valid tail pointer. So I'm not too
bothered about making the overwrite case a _little_ more expensive if that
makes it far more usable.
* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-07-08 12:15 ` Peter Zijlstra
2013-07-09 6:18 ` Namhyung Kim
2013-07-09 7:05 ` Yan, Zheng
@ 2013-07-10 11:37 ` Yan, Zheng
2013-07-10 11:44 ` Peter Zijlstra
2 siblings, 1 reply; 12+ messages in thread
From: Yan, Zheng @ 2013-07-10 11:37 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, mingo, eranian, ak
On 07/08/2013 08:15 PM, Peter Zijlstra wrote:
> On Thu, Jun 06, 2013 at 01:58:06PM +0800, Yan, Zheng wrote:
>> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>>
>> If the perf event buffer is in overwrite mode, the kernel only updates
>> the data head when it overwrites old samples. The program that owns
>> the buffer needs to periodically check the buffer and update a variable
>> that tracks the data tail. If the program fails to do this in time,
>> the data tail can be overwritten by new samples. The program has to
>> rewind the buffer because it does not know where the first valid
>> sample is.
>>
>> This patch makes the kernel update the data tail when it overwrites
>> old events, so the program that owns the event buffer can always
>> read the latest samples. This is convenient for programs that use
>> perf to do branch tracing. One use case is GDB branch tracing:
>> (http://sourceware.org/ml/gdb-patches/2012-06/msg00172.html)
>> It uses the perf interface to read BTS, but only cares about the
>> branches before the ptrace event.
>>
>> I added code to perf_output_{begin/end} to count how many cycles
>> are spent on sample output, then ran "perf record" to profile a kernel
>> compilation 10 times on IvyBridge-EP. (perf record -a make -j 60)
>> The first number is scaled to 1000; the remaining numbers are scaled
>> by the same factor.
>>
>>          before  overwrite mode   after  overwrite mode
>> AVG       1000        999          1046       1044
>> STDEV     19.4       19.5          17.1       17.9
>
> OK, so I was sure I replied to this email; but apparently I didn't :/
>
> So its still adding about 5% overhead to the regular case; this is sad.
>
> What does something like the below do?
>
I re-tested the patch on a different 32-core SandyBridge-EP machine; the
results are quite good.

        origin   origin overwrite   modified   modified overwrite
AVG      1000         1044             960           1006
STDEV    39.0         26.0             28.1          14.4
Regards
Yan, Zheng
> ---
> include/linux/perf_event.h | 2 +
> kernel/events/core.c | 60 ++++++++++++++++++++++++------
> kernel/events/internal.h | 2 +
> kernel/events/ring_buffer.c | 90 +++++++++++++++++++++++++++------------------
> 4 files changed, 106 insertions(+), 48 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 8873f82..bcce98a 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -753,6 +753,8 @@ static inline bool has_branch_stack(struct perf_event *event)
>
> extern int perf_output_begin(struct perf_output_handle *handle,
> struct perf_event *event, unsigned int size);
> +extern int perf_output_begin_overwrite(struct perf_output_handle *handle,
> + struct perf_event *event, unsigned int size);
> extern void perf_output_end(struct perf_output_handle *handle);
> extern unsigned int perf_output_copy(struct perf_output_handle *handle,
> const void *buf, unsigned int len);
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 1833bc5..4d674e9 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3878,6 +3878,8 @@ static const struct vm_operations_struct perf_mmap_vmops = {
> .page_mkwrite = perf_mmap_fault,
> };
>
> +static void perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb);
> +
> static int perf_mmap(struct file *file, struct vm_area_struct *vma)
> {
> struct perf_event *event = file->private_data;
> @@ -3985,6 +3987,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
> vma->vm_mm->pinned_vm += extra;
>
> ring_buffer_attach(event, rb);
> + perf_event_set_overflow(event, rb);
> rcu_assign_pointer(event->rb, rb);
>
> perf_event_update_userpage(event);
> @@ -4595,9 +4598,12 @@ void perf_prepare_sample(struct perf_event_header *header,
> }
> }
>
> -static void perf_event_output(struct perf_event *event,
> - struct perf_sample_data *data,
> - struct pt_regs *regs)
> +static __always_inline void
> +__perf_event_output(struct perf_event *event,
> + struct perf_sample_data *data,
> + struct pt_regs *regs,
> + int (*output_begin)(struct perf_output_handle *,
> + struct perf_event *, unsigned int))
> {
> struct perf_output_handle handle;
> struct perf_event_header header;
> @@ -4607,7 +4613,7 @@ static void perf_event_output(struct perf_event *event,
>
> perf_prepare_sample(&header, data, event, regs);
>
> - if (perf_output_begin(&handle, event, header.size))
> + if (output_begin(&handle, event, header.size))
> goto exit;
>
> perf_output_sample(&handle, &header, data, event);
> @@ -4618,6 +4624,33 @@ static void perf_event_output(struct perf_event *event,
> rcu_read_unlock();
> }
>
> +static void perf_event_output(struct perf_event *event,
> + struct perf_sample_data *data,
> + struct pt_regs *regs)
> +{
> + __perf_event_output(event, data, regs, perf_output_begin);
> +}
> +
> +static void perf_event_output_overwrite(struct perf_event *event,
> + struct perf_sample_data *data,
> + struct pt_regs *regs)
> +{
> + __perf_event_output(event, data, regs, perf_output_begin_overwrite);
> +}
> +
> +static void
> +perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb)
> +{
> + if (event->overflow_handler != perf_event_output ||
> + event->overflow_handler != perf_event_output_overwrite)
> + return;
> +
> + if (rb->overwrite)
> + event->overflow_handler = perf_event_output_overwrite;
> + else
> + event->overflow_handler = perf_event_output;
> +}
> +
> /*
> * read event_id
> */
> @@ -5183,10 +5216,7 @@ static int __perf_event_overflow(struct perf_event *event,
> irq_work_queue(&event->pending);
> }
>
> - if (event->overflow_handler)
> - event->overflow_handler(event, data, regs);
> - else
> - perf_event_output(event, data, regs);
> + event->overflow_handler(event, data, regs);
>
> if (event->fasync && event->pending_kill) {
> event->pending_wakeup = 1;
> @@ -6501,8 +6531,13 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
> context = parent_event->overflow_handler_context;
> }
>
> - event->overflow_handler = overflow_handler;
> - event->overflow_handler_context = context;
> + if (overflow_handler) {
> + event->overflow_handler = overflow_handler;
> + event->overflow_handler_context = context;
> + } else {
> + event->overflow_handler = perf_event_output;
> + event->overflow_handler_context = NULL;
> + }
>
> perf_event__state_init(event);
>
> @@ -6736,9 +6771,10 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
> if (old_rb)
> ring_buffer_detach(event, old_rb);
>
> - if (rb)
> + if (rb) {
> ring_buffer_attach(event, rb);
> -
> + perf_event_set_overflow(event, rb);
> + }
> rcu_assign_pointer(event->rb, rb);
>
> if (old_rb) {
> diff --git a/kernel/events/internal.h b/kernel/events/internal.h
> index ca65997..c4e4610 100644
> --- a/kernel/events/internal.h
> +++ b/kernel/events/internal.h
> @@ -20,6 +20,8 @@ struct ring_buffer {
>
> atomic_t poll; /* POLL_ for wakeups */
>
> + local_t tail; /* read position */
> + local_t next_tail; /* next read position */
> local_t head; /* write position */
> local_t nest; /* nested writers */
> local_t events; /* event limit */
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index cd55144..5887044 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -15,28 +15,9 @@
>
> #include "internal.h"
>
> -static bool perf_output_space(struct ring_buffer *rb, unsigned long tail,
> - unsigned long offset, unsigned long head)
> +static bool perf_output_space(unsigned long tail, unsigned long offset,
> + unsigned long head, unsigned long mask)
> {
> - unsigned long sz = perf_data_size(rb);
> - unsigned long mask = sz - 1;
> -
> - /*
> - * check if user-writable
> - * overwrite : over-write its own tail
> - * !overwrite: buffer possibly drops events.
> - */
> - if (rb->overwrite)
> - return true;
> -
> - /*
> - * verify that payload is not bigger than buffer
> - * otherwise masking logic may fail to detect
> - * the "not enough space" condition
> - */
> - if ((head - offset) > sz)
> - return false;
> -
> offset = (offset - tail) & mask;
> head = (head - tail) & mask;
>
> @@ -109,11 +90,11 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
> preempt_enable();
> }
>
> -int perf_output_begin(struct perf_output_handle *handle,
> - struct perf_event *event, unsigned int size)
> +static __always_inline int __perf_output_begin(struct perf_output_handle *handle,
> + struct perf_event *event, unsigned int size, bool overwrite)
> {
> struct ring_buffer *rb;
> - unsigned long tail, offset, head;
> + unsigned long tail, offset, head, max_size;
> int have_lost;
> struct perf_sample_data sample_data;
> struct {
> @@ -136,7 +117,8 @@ int perf_output_begin(struct perf_output_handle *handle,
> handle->rb = rb;
> handle->event = event;
>
> - if (!rb->nr_pages)
> + max_size = perf_data_size(rb);
> + if (size > max_size)
> goto out;
>
> have_lost = local_read(&rb->lost);
> @@ -149,19 +131,43 @@ int perf_output_begin(struct perf_output_handle *handle,
>
> perf_output_get_handle(handle);
>
> - do {
> + if (overwrite) {
> + do {
> + tail = local_read(&rb->tail);
> + offset = local_read(&rb->head);
> + head = offset + size;
> + if (unlikely(!perf_output_space(tail, offset, head,
> + max_size - 1))) {
> + tail = local_read(&rb->next_tail);
> + local_set(&rb->tail, tail);
> + rb->user_page->data_tail = tail;
> + }
> + } while (local_cmpxchg(&rb->head, offset, head) != offset);
> +
> /*
> - * Userspace could choose to issue a mb() before updating the
> - * tail pointer. So that all reads will be completed before the
> - * write is issued.
> + * Save the start of next event when half of the buffer
> + * has been filled. Later when the event buffer overflows,
> + * update the tail pointer to point to it.
> */
> - tail = ACCESS_ONCE(rb->user_page->data_tail);
> - smp_rmb();
> - offset = head = local_read(&rb->head);
> - head += size;
> - if (unlikely(!perf_output_space(rb, tail, offset, head)))
> - goto fail;
> - } while (local_cmpxchg(&rb->head, offset, head) != offset);
> + if (tail == local_read(&rb->next_tail) &&
> + head - tail >= (max_size >> 1))
> + local_cmpxchg(&rb->next_tail, tail, head);
> + } else {
> + do {
> + /*
> + * Userspace could choose to issue a mb() before
> + * updating the tail pointer. So that all reads will
> + * be completed before the write is issued.
> + */
> + tail = ACCESS_ONCE(rb->user_page->data_tail);
> + smp_rmb();
> + offset = local_read(&rb->head);
> + head = offset + size;
> + if (unlikely(!perf_output_space(tail, offset, head,
> + max_size - 1)))
> + goto fail;
> + } while (local_cmpxchg(&rb->head, offset, head) != offset);
> + }
>
> if (head - local_read(&rb->wakeup) > rb->watermark)
> local_add(rb->watermark, &rb->wakeup);
> @@ -194,6 +200,18 @@ int perf_output_begin(struct perf_output_handle *handle,
> return -ENOSPC;
> }
>
> +int perf_output_begin(struct perf_output_handle *handle,
> + struct perf_event *event, unsigned int size)
> +{
> + return __perf_output_begin(handle, event, size, false);
> +}
> +
> +int perf_output_begin_overwrite(struct perf_output_handle *handle,
> + struct perf_event *event, unsigned int size)
> +{
> + return __perf_output_begin(handle, event, size, true);
> +}
> +
> unsigned int perf_output_copy(struct perf_output_handle *handle,
> const void *buf, unsigned int len)
> {
>
^ permalink raw reply [flat|nested] 12+ messages in thread

* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-07-10 11:37 ` Yan, Zheng
@ 2013-07-10 11:44 ` Peter Zijlstra
2013-07-11 0:46 ` Yan, Zheng
0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2013-07-10 11:44 UTC (permalink / raw)
To: Yan, Zheng; +Cc: linux-kernel, mingo, eranian, ak
On Wed, Jul 10, 2013 at 07:37:43PM +0800, Yan, Zheng wrote:
> On 07/08/2013 08:15 PM, Peter Zijlstra wrote:
> > On Thu, Jun 06, 2013 at 01:58:06PM +0800, Yan, Zheng wrote:
> >> From: "Yan, Zheng" <zheng.z.yan@intel.com>
> >>        before  overwrite mode  after  overwrite mode
> >> AVG    1000    999             1046   1044
> >> STDEV  19.4    19.5            17.1   17.9
> >
> > OK, so I was sure I replied to this email; but apparently I didn't :/
> >
> > So it's still adding about 5% overhead to the regular case; this is sad.
> >
> > What does something like the below do?
> >
>
> I re-tested the patch on a different 32-core sandybridge-ep machine. The result is quite good.
>
>        origin  origin overwrite  modified  modified overwrite
> AVG    1000    1044              960       1006
> STDEV  39.0    26.0              28.1      14.4
Nice! -- you did fix the snafu for the overwrite mode before testing, right?
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-07-10 11:44 ` Peter Zijlstra
@ 2013-07-11 0:46 ` Yan, Zheng
0 siblings, 0 replies; 12+ messages in thread
From: Yan, Zheng @ 2013-07-11 0:46 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, mingo, eranian, ak
On 07/10/2013 07:44 PM, Peter Zijlstra wrote:
> On Wed, Jul 10, 2013 at 07:37:43PM +0800, Yan, Zheng wrote:
>> On 07/08/2013 08:15 PM, Peter Zijlstra wrote:
>>> On Thu, Jun 06, 2013 at 01:58:06PM +0800, Yan, Zheng wrote:
>>>> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>
>>>>        before  overwrite mode  after  overwrite mode
>>>> AVG    1000    999             1046   1044
>>>> STDEV  19.4    19.5            17.1   17.9
>>>
>>> OK, so I was sure I replied to this email; but apparently I didn't :/
>>>
>>> So it's still adding about 5% overhead to the regular case; this is sad.
>>>
>>> What does something like the below do?
>>>
>>
>> I re-tested the patch on a different 32-core sandybridge-ep machine. The result is quite good.
>>
>>        origin  origin overwrite  modified  modified overwrite
>> AVG    1000    1044              960       1006
>> STDEV  39.0    26.0              28.1      14.4
>
> Nice! -- you did fix the snafu for the overwrite mode before testing, right?
>
yes, of course.
Regards
Yan, Zheng
^ permalink raw reply [flat|nested] 12+ messages in thread