Re: [PATCH] perf_events: fix broken intr throttling (v3)

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Anton Blanchard <anton@samba.org>
To: Stephane Eranian <eranian@google.com>
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
	mingo@elte.hu, gleb@redhat.com, wcohen@redhat.com,
	vince@deater.net, asharma@fb.com, andi@firstfloor.org,
	paulus@samba.org, emunson@mgebm.net, imunsie@au1.ibm.com,
	benh@kernel.crashing.org, sukadev@linux.vnet.ibm.com
Subject: Re: [PATCH] perf_events: fix broken intr throttling (v3)
Date: Wed, 8 Feb 2012 22:32:50 +1100	[thread overview]
Message-ID: <20120208223250.282861cb@kryten> (raw)
In-Reply-To: <20120126160319.GA5655@quad>


Hi,

On Thu, 26 Jan 2012 17:03:19 +0100
Stephane Eranian <eranian@google.com> wrote:

> This patch fixes the throttling mechanism. It was broken
> in 3.2.0. Events were not being unthrottled. The unthrottling
> mechanism required that events be checked at each timer tick.

This patch breaks perf on POWER. I haven't had a chance to work out why
yet, but will investigate tomorrow.

Anton

> This patch solves this problem and also separates:
>   - unthrottling
>   - multiplexing
>   - frequency-mode period adjustments
> 
> Not all of them need to be executed at each timer tick.
> 
> This third version of the patch is based on my original patch +
> PeterZ proposal (https://lkml.org/lkml/2012/1/7/87).
> 
> At each timer tick, for each context:
>   - if the current CPU has throttled events, we unthrottle events
> 
>   - if context has frequency-based events, we adjust sampling periods
> 
>   - if we have reached the jiffies interval, we multiplex (rotate)
> 
> We decoupled rotation (multiplexing) from frequency-mode sampling
> period adjustments.  They should not necessarily happen at the same
> rate. Multiplexing is subject to jiffies_interval (currently at 1
> but could be higher once the tunable is exposed via sysfs).
> 
> We have grouped frequency-mode adjustment and unthrottling into the
> same routine to minimize code duplication. When throttled while in
> frequency mode, we scan the events only once.
> 
> We have fixed the threshold enforcement code in
> __perf_event_overflow(). There was a bug whereby it would allow more
> than the authorized rate because an increment of hwc->interrupts was
> not executed at the right place.
> 
> The patch was tested with low sampling limit (2000) and fixed periods,
> frequency mode, overcommitted PMU.
> 
> On a 2.1GHz AMD CPU:
> $ cat /proc/sys/kernel/perf_event_max_sample_rate
> 2000
> 
> We set a rate of 3000 samples/sec (2.1GHz/3000 = 700000):
> 
> $ perf record -e cycles,cycles -c 700000  noploop 10
> $ perf report -D | tail -21
> Aggregated stats:
>            TOTAL events:      80086
>             MMAP events:         88
>             COMM events:          2
>             EXIT events:          4
>         THROTTLE events:      19996
>       UNTHROTTLE events:      19996
>           SAMPLE events:      40000
> cycles stats:
>            TOTAL events:      40006
>             MMAP events:          5
>             COMM events:          1
>             EXIT events:          4
>         THROTTLE events:       9998
>       UNTHROTTLE events:       9998
>           SAMPLE events:      20000
> cycles stats:
>            TOTAL events:      39996
>         THROTTLE events:       9998
>       UNTHROTTLE events:       9998
>           SAMPLE events:      20000
> 
> For 10s, the cap is 2x2000x10 = 40000 samples.
> We get exactly that: 20000 samples/event.
> 
> Signed-off-by: Stephane Eranian <eranian@google.com>
> ---
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 0b91db2..412b790 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -589,6 +589,7 @@ struct hw_perf_event {
>  	u64				sample_period;
>  	u64				last_period;
>  	local64_t			period_left;
> +	u64                             interrupts_seq;
>  	u64				interrupts;
>  
>  	u64				freq_time_stamp;
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index de859fb..7c3b9de 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2300,6 +2300,9 @@ do {					\
>  	return div64_u64(dividend, divisor);
>  }
>  
> +static DEFINE_PER_CPU(int, perf_throttled_count);
> +static DEFINE_PER_CPU(u64, perf_throttled_seq);
> +
>  static void perf_adjust_period(struct perf_event *event, u64 nsec,
> u64 count) {
>  	struct hw_perf_event *hwc = &event->hw;
> @@ -2325,16 +2328,29 @@ static void perf_adjust_period(struct
> perf_event *event, u64 nsec, u64 count) }
>  }
>  
> -static void perf_ctx_adjust_freq(struct perf_event_context *ctx, u64
> period) +/*
> + * combine freq adjustment with unthrottling to avoid two passes
> over the
> + * events. At the same time, make sure, having freq events does not
> change
> + * the rate of unthrottling as that would introduce bias.
> + */
> +static void perf_adjust_freq_unthr_context(struct perf_event_context
> *ctx,
> +					   int needs_unthr)
>  {
>  	struct perf_event *event;
>  	struct hw_perf_event *hwc;
> -	u64 interrupts, now;
> +	u64 now, period = TICK_NSEC;
>  	s64 delta;
>  
> -	if (!ctx->nr_freq)
> +	/*
> +	 * only need to iterate over all events iff:
> +	 * - context have events in frequency mode (needs freq
> adjust)
> +	 * - there are events to unthrottle on this cpu
> +	 */
> +	if (!(ctx->nr_freq || needs_unthr))
>  		return;
>  
> +	raw_spin_lock(&ctx->lock);
> +
>  	list_for_each_entry_rcu(event, &ctx->event_list,
> event_entry) { if (event->state != PERF_EVENT_STATE_ACTIVE)
>  			continue;
> @@ -2344,13 +2360,8 @@ static void perf_ctx_adjust_freq(struct
> perf_event_context *ctx, u64 period) 
>  		hwc = &event->hw;
>  
> -		interrupts = hwc->interrupts;
> -		hwc->interrupts = 0;
> -
> -		/*
> -		 * unthrottle events on the tick
> -		 */
> -		if (interrupts == MAX_INTERRUPTS) {
> +		if (needs_unthr && hwc->interrupts ==
> MAX_INTERRUPTS) {
> +			hwc->interrupts = 0;
>  			perf_log_throttle(event, 1);
>  			event->pmu->start(event, 0);
>  		}
> @@ -2358,14 +2369,26 @@ static void perf_ctx_adjust_freq(struct
> perf_event_context *ctx, u64 period) if (!event->attr.freq
> || !event->attr.sample_freq) continue;
>  
> -		event->pmu->read(event);
> +		/*
> +		 * stop the event and update event->count
> +		 */
> +		event->pmu->stop(event, PERF_EF_UPDATE);
> +
>  		now = local64_read(&event->count);
>  		delta = now - hwc->freq_count_stamp;
>  		hwc->freq_count_stamp = now;
>  
> +		/*
> +		 * restart the event
> +		 * reload only if value has changed
> +		 */
>  		if (delta > 0)
>  			perf_adjust_period(event, period, delta);
> +
> +		event->pmu->start(event, delta > 0 ?
> PERF_EF_RELOAD : 0); }
> +
> +	raw_spin_unlock(&ctx->lock);
>  }
>  
>  /*
> @@ -2388,16 +2411,13 @@ static void rotate_ctx(struct
> perf_event_context *ctx) */
>  static void perf_rotate_context(struct perf_cpu_context *cpuctx)
>  {
> -	u64 interval = (u64)cpuctx->jiffies_interval * TICK_NSEC;
>  	struct perf_event_context *ctx = NULL;
> -	int rotate = 0, remove = 1, freq = 0;
> +	int rotate = 0, remove = 1;
>  
>  	if (cpuctx->ctx.nr_events) {
>  		remove = 0;
>  		if (cpuctx->ctx.nr_events != cpuctx->ctx.nr_active)
>  			rotate = 1;
> -		if (cpuctx->ctx.nr_freq)
> -			freq = 1;
>  	}
>  
>  	ctx = cpuctx->task_ctx;
> @@ -2405,37 +2425,26 @@ static void perf_rotate_context(struct
> perf_cpu_context *cpuctx) remove = 0;
>  		if (ctx->nr_events != ctx->nr_active)
>  			rotate = 1;
> -		if (ctx->nr_freq)
> -			freq = 1;
>  	}
>  
> -	if (!rotate && !freq)
> +	if (!rotate)
>  		goto done;
>  
>  	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>  	perf_pmu_disable(cpuctx->ctx.pmu);
>  
> -	if (freq) {
> -		perf_ctx_adjust_freq(&cpuctx->ctx, interval);
> -		if (ctx)
> -			perf_ctx_adjust_freq(ctx, interval);
> -	}
> -
> -	if (rotate) {
> -		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
> -		if (ctx)
> -			ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
> +	cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
> +	if (ctx)
> +		ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
>  
> -		rotate_ctx(&cpuctx->ctx);
> -		if (ctx)
> -			rotate_ctx(ctx);
> +	rotate_ctx(&cpuctx->ctx);
> +	if (ctx)
> +		rotate_ctx(ctx);
>  
> -		perf_event_sched_in(cpuctx, ctx, current);
> -	}
> +	perf_event_sched_in(cpuctx, ctx, current);
>  
>  	perf_pmu_enable(cpuctx->ctx.pmu);
>  	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> -
>  done:
>  	if (remove)
>  		list_del_init(&cpuctx->rotation_list);
> @@ -2445,10 +2454,22 @@ void perf_event_task_tick(void)
>  {
>  	struct list_head *head = &__get_cpu_var(rotation_list);
>  	struct perf_cpu_context *cpuctx, *tmp;
> +	struct perf_event_context *ctx;
> +	int throttled;
>  
>  	WARN_ON(!irqs_disabled());
>  
> +	__this_cpu_inc(perf_throttled_seq);
> +	throttled = __this_cpu_xchg(perf_throttled_count, 0);
> +
>  	list_for_each_entry_safe(cpuctx, tmp, head, rotation_list) {
> +		ctx = &cpuctx->ctx;
> +		perf_adjust_freq_unthr_context(ctx, throttled);
> +
> +		ctx = cpuctx->task_ctx;
> +		if (ctx)
> +			perf_adjust_freq_unthr_context(ctx,
> throttled); +
>  		if (cpuctx->jiffies_interval == 1 ||
>  				!(jiffies %
> cpuctx->jiffies_interval)) perf_rotate_context(cpuctx);
> @@ -4514,6 +4535,7 @@ static int __perf_event_overflow(struct
> perf_event *event, {
>  	int events = atomic_read(&event->event_limit);
>  	struct hw_perf_event *hwc = &event->hw;
> +	u64 seq;
>  	int ret = 0;
>  
>  	/*
> @@ -4523,14 +4545,20 @@ static int __perf_event_overflow(struct
> perf_event *event, if (unlikely(!is_sampling_event(event)))
>  		return 0;
>  
> -	if (unlikely(hwc->interrupts >= max_samples_per_tick)) {
> -		if (throttle) {
> +	seq = __this_cpu_read(perf_throttled_seq);
> +	if (seq != hwc->interrupts_seq) {
> +		hwc->interrupts_seq = seq;
> +		hwc->interrupts = 1;
> +	} else {
> +		hwc->interrupts++;
> +		if (unlikely(throttle
> +			     && hwc->interrupts >=
> max_samples_per_tick)) {
> +			__this_cpu_inc(perf_throttled_count);
>  			hwc->interrupts = MAX_INTERRUPTS;
>  			perf_log_throttle(event, 0);
>  			ret = 1;
>  		}
> -	} else
> -		hwc->interrupts++;
> +	}
>  
>  	if (event->attr.freq) {
>  		u64 now = perf_clock();
> --
> To unsubscribe from this list: send the line "unsubscribe
> linux-kernel" in the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

next prev parent reply	other threads:[~2012-02-08 11:32 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-01-26 16:03 [PATCH] perf_events: fix broken intr throttling (v3) Stephane Eranian
2012-01-26 16:16 ` Peter Zijlstra
2012-01-26 18:22   ` Arun Sharma
2012-01-28 12:04 ` [tip:perf/urgent] perf: Fix broken interrupt rate throttling tip-bot for Stephane Eranian
2012-02-08 11:32 ` Anton Blanchard [this message]
2012-02-08 11:43   ` [PATCH] perf_events: fix broken intr throttling (v3) Stephane Eranian
2012-02-10  3:19     ` Sukadev Bhattiprolu
2012-02-10  6:15       ` Benjamin Herrenschmidt
2012-02-15  2:37         ` Sukadev Bhattiprolu
2012-02-19 11:26           ` Stephane Eranian
2012-02-20 19:01             ` Sukadev Bhattiprolu
2012-02-09  5:23 ` Anshuman Khandual
2012-02-09 22:45   ` Stephane Eranian

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120208223250.282861cb@kryten \
    --to=anton@samba.org \
    --cc=andi@firstfloor.org \
    --cc=asharma@fb.com \
    --cc=benh@kernel.crashing.org \
    --cc=emunson@mgebm.net \
    --cc=eranian@google.com \
    --cc=gleb@redhat.com \
    --cc=imunsie@au1.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=paulus@samba.org \
    --cc=peterz@infradead.org \
    --cc=sukadev@linux.vnet.ibm.com \
    --cc=vince@deater.net \
    --cc=wcohen@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.