Re: [PATCH V5 1/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling

All of lore.kernel.org
 help / color / mirror / Atom feed

From: xiakaixu <xiakaixu@huawei.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: "Wangnan (F)" <wangnan0@huawei.com>,
	Alexei Starovoitov <ast@plumgrid.com>,
	pi3orama <pi3orama@163.com>, <davem@davemloft.net>,
	<acme@kernel.org>, <mingo@redhat.com>,
	<masami.hiramatsu.pt@hitachi.com>, <jolsa@kernel.org>,
	<daniel@iogearbox.net>, <linux-kernel@vger.kernel.org>,
	<hekuang@huawei.com>, <netdev@vger.kernel.org>
Subject: Re: [PATCH V5 1/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling
Date: Tue, 27 Oct 2015 14:43:45 +0800	[thread overview]
Message-ID: <562F1D21.1050607@huawei.com> (raw)
In-Reply-To: <20151023151205.GW11639@twins.programming.kicks-ass.net>

于 2015/10/23 23:12, Peter Zijlstra 写道:
> On Fri, Oct 23, 2015 at 02:52:11PM +0200, Peter Zijlstra wrote:
>> On Thu, Oct 22, 2015 at 06:28:22PM +0800, Wangnan (F) wrote:
>>> information to analysis when glitch happen. Another way we are trying to
>>> implement
>>> now is to dynamically turn events on and off, or at least enable/disable
>>> sampling dynamically because the overhead of copying those samples
>>> is a big part of perf's total overhead. After that we can trace as many
>>> event as possible, but only fetch data from them when we detect a glitch.
>>
>> So why don't you 'fix' the flight recorder mode and just leave the data
>> in memory and not bother copying it out until a glitch happens?
>>
>> Something like this:
>>
>> lkml.kernel.org/r/20130708121557.GA17211@twins.programming.kicks-ass.net
>>
>> it appears we never quite finished that.
> 
> Updated to current sources, compile tested only.
> 
> It obviously needs testing and performance numbers..  and some
> userspace.
> 
> ---
> Subject: perf: Update event buffer tail when overwriting old events
> From: Peter Zijlstra <peterz@infradead.org>
> 
>> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>>
>> If perf event buffer is in overwrite mode, the kernel only updates
>> the data head when it overwrites old samples. The program that owns
>> the buffer need periodically check the buffer and update a variable
>> that tracks the date tail. If the program fails to do this in time,
>> the data tail can be overwritted by new samples. The program has to
>> rewind the buffer because it does not know where is the first vaild
>> sample.
>>
>> This patch makes the kernel update the date tail when it overwrites
>> old events. So the program that owns the event buffer can always
>> read the latest samples. This is convenient for programs that use
>> perf to do branch tracing. One use case is GDB branch tracing:
>> (http://sourceware.org/ml/gdb-patches/2012-06/msg00172.html)
>> It uses perf interface to read BTS, but only cares the branches
>> before the ptrace event.
> 
> Original-patch-by: "Yan, Zheng" <zheng.z.yan@intel.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/x86/kernel/cpu/perf_event_intel_ds.c |    2 
>  include/linux/perf_event.h                |    6 --
>  kernel/events/core.c                      |   56 +++++++++++++++++----
>  kernel/events/internal.h                  |    2 
>  kernel/events/ring_buffer.c               |   77 +++++++++++++++++++++---------
>  5 files changed, 107 insertions(+), 36 deletions(-)
> 
> --- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> @@ -1140,7 +1140,7 @@ static void __intel_pmu_pebs_event(struc
>  
>  	while (count > 1) {
>  		setup_pebs_sample_data(event, iregs, at, &data, &regs);
> -		perf_event_output(event, &data, &regs);
> +		event->overflow_handler(event, &data, &regs);
>  		at += x86_pmu.pebs_record_size;
>  		at = get_next_pebs_record_by_bit(at, top, bit);
>  		count--;
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -828,10 +828,6 @@ extern int perf_event_overflow(struct pe
>  				 struct perf_sample_data *data,
>  				 struct pt_regs *regs);
>  
> -extern void perf_event_output(struct perf_event *event,
> -				struct perf_sample_data *data,
> -				struct pt_regs *regs);
> -
>  extern void
>  perf_event_header__init_id(struct perf_event_header *header,
>  			   struct perf_sample_data *data,
> @@ -1032,6 +1028,8 @@ static inline bool has_aux(struct perf_e
>  
>  extern int perf_output_begin(struct perf_output_handle *handle,
>  			     struct perf_event *event, unsigned int size);
> +extern int perf_output_begin_overwrite(struct perf_output_handle *handle,
> +			     struct perf_event *event, unsigned int size);
>  extern void perf_output_end(struct perf_output_handle *handle);
>  extern unsigned int perf_output_copy(struct perf_output_handle *handle,
>  			     const void *buf, unsigned int len);
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -4515,6 +4515,8 @@ static int perf_mmap_fault(struct vm_are
>  	return ret;
>  }
>  
> +static void perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb);
> +
>  static void ring_buffer_attach(struct perf_event *event,
>  			       struct ring_buffer *rb)
>  {
> @@ -4546,6 +4548,8 @@ static void ring_buffer_attach(struct pe
>  		spin_lock_irqsave(&rb->event_lock, flags);
>  		list_add_rcu(&event->rb_entry, &rb->event_list);
>  		spin_unlock_irqrestore(&rb->event_lock, flags);
> +
> +		perf_event_set_overflow(event, rb);
>  	}
>  
>  	rcu_assign_pointer(event->rb, rb);
> @@ -5579,9 +5583,12 @@ void perf_prepare_sample(struct perf_eve
>  	}
>  }
>  
> -void perf_event_output(struct perf_event *event,
> -			struct perf_sample_data *data,
> -			struct pt_regs *regs)
> +static __always_inline void
> +__perf_event_output(struct perf_event *event,
> +		    struct perf_sample_data *data,
> +		    struct pt_regs *regs,
> +		    int (*output_begin)(struct perf_output_handle *,
> +			                struct perf_event *, unsigned int))
>  {
>  	struct perf_output_handle handle;
>  	struct perf_event_header header;
> @@ -5591,7 +5598,7 @@ void perf_event_output(struct perf_event
>  
>  	perf_prepare_sample(&header, data, event, regs);
>  
> -	if (perf_output_begin(&handle, event, header.size))
> +	if (output_begin(&handle, event, header.size))
>  		goto exit;
>  
>  	perf_output_sample(&handle, &header, data, event);
> @@ -5602,6 +5609,33 @@ void perf_event_output(struct perf_event
>  	rcu_read_unlock();
>  }
>  
> +static void perf_event_output(struct perf_event *event,
> +				struct perf_sample_data *data,
> +				struct pt_regs *regs)
> +{
> +	__perf_event_output(event, data, regs, perf_output_begin);
> +}
> +
> +static void perf_event_output_overwrite(struct perf_event *event,
> +				struct perf_sample_data *data,
> +				struct pt_regs *regs)
> +{
> +	__perf_event_output(event, data, regs, perf_output_begin_overwrite);
> +}
> +
> +static void
> +perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb)
> +{
> +	if (event->overflow_handler != perf_event_output &&
> +	    event->overflow_handler != perf_event_output_overwrite)
> +		return;
> +
> +	if (rb->overwrite)
> +		event->overflow_handler = perf_event_output_overwrite;
> +	else
> +		event->overflow_handler = perf_event_output;
> +}
> +
>  /*
>   * read event_id
>   */
> @@ -6426,10 +6460,7 @@ static int __perf_event_overflow(struct
>  		irq_work_queue(&event->pending);
>  	}
>  
> -	if (event->overflow_handler)
> -		event->overflow_handler(event, data, regs);
> -	else
> -		perf_event_output(event, data, regs);
> +	event->overflow_handler(event, data, regs);
>  
>  	if (*perf_event_fasync(event) && event->pending_kill) {
>  		event->pending_wakeup = 1;
> @@ -7904,8 +7935,13 @@ perf_event_alloc(struct perf_event_attr
>  		context = parent_event->overflow_handler_context;
>  	}
>  
> -	event->overflow_handler	= overflow_handler;
> -	event->overflow_handler_context = context;
> +	if (overflow_handler) {
> +		event->overflow_handler	= overflow_handler;
> +		event->overflow_handler_context = context;
> +	} else {
> +		event->overflow_handler = perf_event_output;
> +		event->overflow_handler_context = NULL;
> +	}
>  
>  	perf_event__state_init(event);
>  
> --- a/kernel/events/internal.h
> +++ b/kernel/events/internal.h
> @@ -21,6 +21,8 @@ struct ring_buffer {
>  
>  	atomic_t			poll;		/* POLL_ for wakeups */
>  
> +	local_t				tail;		/* read position     */
> +	local_t				next_tail;	/* next read position */
>  	local_t				head;		/* write position    */
>  	local_t				nest;		/* nested writers    */
>  	local_t				events;		/* event limit       */
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -102,11 +102,11 @@ static void perf_output_put_handle(struc
>  	preempt_enable();
>  }
>  
> -int perf_output_begin(struct perf_output_handle *handle,
> -		      struct perf_event *event, unsigned int size)
> +static __always_inline int __perf_output_begin(struct perf_output_handle *handle,
> +		      struct perf_event *event, unsigned int size, bool overwrite)
>  {
>  	struct ring_buffer *rb;
> -	unsigned long tail, offset, head;
> +	unsigned long tail, offset, head, max_size;
>  	int have_lost, page_shift;
>  	struct {
>  		struct perf_event_header header;
> @@ -125,7 +125,8 @@ int perf_output_begin(struct perf_output
>  	if (unlikely(!rb))
>  		goto out;
>  
> -	if (unlikely(!rb->nr_pages))
> +	max_size = perf_data_size(rb);
> +	if (unlikely(size > max_size))
>  		goto out;
>  
>  	handle->rb    = rb;
> @@ -140,27 +141,49 @@ int perf_output_begin(struct perf_output
>  
>  	perf_output_get_handle(handle);
>  
> -	do {
> -		tail = READ_ONCE_CTRL(rb->user_page->data_tail);
> -		offset = head = local_read(&rb->head);
> -		if (!rb->overwrite &&
> -		    unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size))
> -			goto fail;
> +	if (overwrite) {
> +		do {
> +			tail = local_read(&rb->tail);
> +			offset = local_read(&rb->head);
> +			head = offset + size;
> +			if (unlikely(CIRC_SPACE(head, tail, max_size) < size)) {
Should be 'if (unlikely(CIRC_SPACE(offset, tail, max_size) < size))'?

> +				tail = local_read(&rb->next_tail);
> +				local_set(&rb->tail, tail);
> +				rb->user_page->data_tail = tail;
> +			}
> +		} while (local_cmpxchg(&rb->head, offset, head) != offset);
>  
>  		/*
> -		 * The above forms a control dependency barrier separating the
> -		 * @tail load above from the data stores below. Since the @tail
> -		 * load is required to compute the branch to fail below.
> -		 *
> -		 * A, matches D; the full memory barrier userspace SHOULD issue
> -		 * after reading the data and before storing the new tail
> -		 * position.
> -		 *
> -		 * See perf_output_put_handle().
> +		 * Save the start of next event when half of the buffer
> +		 * has been filled. Later when the event buffer overflows,
> +		 * update the tail pointer to point to it.
>  		 */
> +		if (tail == local_read(&rb->next_tail) &&
> +		    CIRC_CNT(head, tail, max_size) >= (max_size / 2))
> +			local_cmpxchg(&rb->next_tail, tail, head);
> +	} else {
> +		do {
> +			tail = READ_ONCE_CTRL(rb->user_page->data_tail);
> +			offset = head = local_read(&rb->head);
> +			if (!rb->overwrite &&
> +			    unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size))
> +				goto fail;
> +
> +			/*
> +			 * The above forms a control dependency barrier separating the
> +			 * @tail load above from the data stores below. Since the @tail
> +			 * load is required to compute the branch to fail below.
> +			 *
> +			 * A, matches D; the full memory barrier userspace SHOULD issue
> +			 * after reading the data and before storing the new tail
> +			 * position.
> +			 *
> +			 * See perf_output_put_handle().
> +			 */
>  
> -		head += size;
> -	} while (local_cmpxchg(&rb->head, offset, head) != offset);
> +			head += size;
> +		} while (local_cmpxchg(&rb->head, offset, head) != offset);
> +	}
>  
>  	/*
>  	 * We rely on the implied barrier() by local_cmpxchg() to ensure
> @@ -203,6 +226,18 @@ int perf_output_begin(struct perf_output
>  	return -ENOSPC;
>  }
>  
> +int perf_output_begin(struct perf_output_handle *handle,
> +		      struct perf_event *event, unsigned int size)
> +{
> +	return __perf_output_begin(handle, event, size, false);
> +}
> +
> +int perf_output_begin_overwrite(struct perf_output_handle *handle,
> +		      struct perf_event *event, unsigned int size)
> +{
> +	return __perf_output_begin(handle, event, size, true);
> +}
> +
>  unsigned int perf_output_copy(struct perf_output_handle *handle,
>  		      const void *buf, unsigned int len)
>  {
> 
> .
> 


-- 
Regards
Kaixu Xia

next prev parent reply	other threads:[~2015-10-27  6:44 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-10-20  7:22 [PATCH V5 0/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling Kaixu Xia
2015-10-20  7:22 ` [PATCH V5 1/1] " Kaixu Xia
2015-10-20 22:53   ` Alexei Starovoitov
2015-10-21  9:12     ` Peter Zijlstra
2015-10-21 10:31       ` xiakaixu
2015-10-21 11:33         ` Peter Zijlstra
2015-10-21 11:49           ` Wangnan (F)
2015-10-21 12:17             ` Peter Zijlstra
2015-10-21 13:42               ` Wangnan (F)
2015-10-21 13:49                 ` Peter Zijlstra
2015-10-21 14:01                   ` pi3orama
2015-10-21 14:09                     ` Peter Zijlstra
2015-10-21 15:06                       ` pi3orama
2015-10-21 16:57                         ` Peter Zijlstra
2015-10-21 21:19                           ` Alexei Starovoitov
2015-10-22  9:06                             ` Peter Zijlstra
2015-10-22 10:28                               ` Wangnan (F)
2015-10-23 12:52                                 ` Peter Zijlstra
2015-10-23 15:12                                   ` Peter Zijlstra
2015-10-27  6:43                                     ` xiakaixu [this message]
2015-10-22  2:46                           ` Wangnan (F)
2015-10-22  7:39                             ` Ingo Molnar
2015-10-22  7:51                               ` Wangnan (F)
2015-10-22  9:24                                 ` Peter Zijlstra
2015-10-22  1:56                 ` Wangnan (F)
2015-10-22  3:09                   ` Alexei Starovoitov
2015-10-22  3:12                     ` Wangnan (F)
2015-10-22  3:26                       ` Alexei Starovoitov
2015-10-22  9:49                       ` Peter Zijlstra
2015-10-21 11:34       ` Wangnan (F)
2015-10-21 11:56         ` Peter Zijlstra
2015-10-21 12:03           ` Wangnan (F)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=562F1D21.1050607@huawei.com \
    --to=xiakaixu@huawei.com \
    --cc=acme@kernel.org \
    --cc=ast@plumgrid.com \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=hekuang@huawei.com \
    --cc=jolsa@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=masami.hiramatsu.pt@hitachi.com \
    --cc=mingo@redhat.com \
    --cc=netdev@vger.kernel.org \
    --cc=peterz@infradead.org \
    --cc=pi3orama@163.com \
    --cc=wangnan0@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.