Message-ID: <4F335858.4090501@linux.vnet.ibm.com>
Date: Thu, 09 Feb 2012 10:53:36 +0530
From: Anshuman Khandual
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.17) Gecko/20110424 Thunderbird/3.1.10
MIME-Version: 1.0
To: Stephane Eranian
CC: linux-kernel@vger.kernel.org, peterz@infradead.org, mingo@elte.hu, gleb@redhat.com, wcohen@redhat.com, vince@deater.net, asharma@fb.com, andi@firstfloor.org
Subject: Re: [PATCH] perf_events: fix broken intr throttling (v3)
References: <20120126160319.GA5655@quad>
In-Reply-To: <20120126160319.GA5655@quad>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

On Thursday 26 January 2012 09:33 PM, Stephane Eranian wrote:
>
> This patch fixes the throttling mechanism. It was broken
> in 3.2.0. Events were not being unthrottled. The unthrottling
> mechanism required that events be checked at each timer tick.
>
> This patch solves this problem and also separates:
> - unthrottling
> - multiplexing
> - frequency-mode period adjustments
>
> Not all of them need to be executed at each timer tick.
>
> This third version of the patch is based on my original patch +
> PeterZ proposal (https://lkml.org/lkml/2012/1/7/87).
>
> At each timer tick, for each context:
> - if the current CPU has throttled events, we unthrottle events
>
> - if context has frequency-based events, we adjust sampling periods
>
> - if we have reached the jiffies interval, we multiplex (rotate)
>
> We decoupled rotation (multiplexing) from frequency-mode sampling
> period adjustments. They should not necessarily happen at the same
> rate. Multiplexing is subject to jiffies_interval (currently at 1
> but could be higher once the tunable is exposed via sysfs).

I am wondering how high a value jiffies_interval could be assigned.
Once it is exposed to the user through sysfs it could take any value.
Are we planning to impose some sort of MAX limit on this?
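Just to make the question concrete, below is a minimal sketch of the kind of
bound check I have in mind for when the tunable gets exposed. The helper name
and the MAX constant are made up for illustration and are not part of this
patch; only cpuctx->jiffies_interval and the jiffies modulo check in
perf_event_task_tick() come from the actual code.

/*
 * Hypothetical validation for a future sysfs write handler -- not in
 * this patch.  Rotation only runs when jiffies % jiffies_interval == 0,
 * so an unbounded value could postpone multiplexing almost indefinitely.
 * Clamping it to something on the order of HZ (about one second) would
 * keep flexible events from being starved.
 */
#define PERF_JIFFIES_INTERVAL_MAX	HZ	/* made-up upper bound */

static int perf_set_jiffies_interval(struct perf_cpu_context *cpuctx, int val)
{
	if (val < 1 || val > PERF_JIFFIES_INTERVAL_MAX)
		return -EINVAL;

	cpuctx->jiffies_interval = val;
	return 0;
}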
>
> We have grouped frequency-mode adjustment and unthrottling into the
> same routine to minimize code duplication. When throttled while in
> frequency mode, we scan the events only once.
>
> We have fixed the threshold enforcement code in __perf_event_overflow().
> There was a bug whereby it would allow more than the authorized rate
> because an increment of hwc->interrupts was not executed at the right
> place.
>
> The patch was tested with low sampling limit (2000) and fixed periods,
> frequency mode, overcommitted PMU.
>
> On a 2.1GHz AMD CPU:
> $ cat /proc/sys/kernel/perf_event_max_sample_rate
> 2000
>
> We set a rate of 3000 samples/sec (2.1GHz/3000 = 700000):
>
> $ perf record -e cycles,cycles -c 700000 noploop 10
> $ perf report -D | tail -21
> Aggregated stats:
>           TOTAL events:      80086
>            MMAP events:         88
>            COMM events:          2
>            EXIT events:          4
>        THROTTLE events:      19996
>      UNTHROTTLE events:      19996
>          SAMPLE events:      40000
> cycles stats:
>           TOTAL events:      40006
>            MMAP events:          5
>            COMM events:          1
>            EXIT events:          4
>        THROTTLE events:       9998
>      UNTHROTTLE events:       9998
>          SAMPLE events:      20000
> cycles stats:
>           TOTAL events:      39996
>        THROTTLE events:       9998
>      UNTHROTTLE events:       9998
>          SAMPLE events:      20000
>
> For 10s, the cap is 2x2000x10 = 40000 samples.
> We get exactly that: 20000 samples/event.
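Spelling out the arithmetic above for my own benefit (only the numbers
already quoted, nothing new): the requested 3000 samples/sec per event
exceeds the 2000 samples/sec limit, so each event gets clamped to the limit.

	period per sample = 2.1e9 cycles/sec / 3000 samples/sec = 700000 cycles (the -c 700000)
	total cap         = 2 events x 2000 samples/sec x 10 sec = 40000 samples
	per event         = 40000 / 2 = 20000 samples

which matches the SAMPLE count reported for each cycles event.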
>
> Signed-off-by: Stephane Eranian
> ---
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 0b91db2..412b790 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -589,6 +589,7 @@ struct hw_perf_event {
>  	u64			sample_period;
>  	u64			last_period;
>  	local64_t		period_left;
> +	u64			interrupts_seq;
>  	u64			interrupts;
>
>  	u64			freq_time_stamp;
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index de859fb..7c3b9de 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2300,6 +2300,9 @@ do { \
>  	return div64_u64(dividend, divisor);
>  }
>
> +static DEFINE_PER_CPU(int, perf_throttled_count);
> +static DEFINE_PER_CPU(u64, perf_throttled_seq);
> +
>  static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count)
>  {
>  	struct hw_perf_event *hwc = &event->hw;
> @@ -2325,16 +2328,29 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count)
>  	}
>  }
>
> -static void perf_ctx_adjust_freq(struct perf_event_context *ctx, u64 period)
> +/*
> + * combine freq adjustment with unthrottling to avoid two passes over the
> + * events. At the same time, make sure, having freq events does not change
> + * the rate of unthrottling as that would introduce bias.
> + */
> +static void perf_adjust_freq_unthr_context(struct perf_event_context *ctx,
> +					   int needs_unthr)
>  {
>  	struct perf_event *event;
>  	struct hw_perf_event *hwc;
> -	u64 interrupts, now;
> +	u64 now, period = TICK_NSEC;
>  	s64 delta;
>
> -	if (!ctx->nr_freq)
> +	/*
> +	 * only need to iterate over all events iff:
> +	 * - context have events in frequency mode (needs freq adjust)
> +	 * - there are events to unthrottle on this cpu
> +	 */
> +	if (!(ctx->nr_freq || needs_unthr))
>  		return;
>
> +	raw_spin_lock(&ctx->lock);
> +
>  	list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
>  		if (event->state != PERF_EVENT_STATE_ACTIVE)
>  			continue;
> @@ -2344,13 +2360,8 @@ static void perf_ctx_adjust_freq(struct perf_event_context *ctx, u64 period)
>
>  		hwc = &event->hw;
>
> -		interrupts = hwc->interrupts;
> -		hwc->interrupts = 0;
> -
> -		/*
> -		 * unthrottle events on the tick
> -		 */
> -		if (interrupts == MAX_INTERRUPTS) {
> +		if (needs_unthr && hwc->interrupts == MAX_INTERRUPTS) {
> +			hwc->interrupts = 0;
>  			perf_log_throttle(event, 1);
>  			event->pmu->start(event, 0);
>  		}
> @@ -2358,14 +2369,26 @@ static void perf_ctx_adjust_freq(struct perf_event_context *ctx, u64 period)
>  		if (!event->attr.freq || !event->attr.sample_freq)
>  			continue;
>
> -		event->pmu->read(event);
> +		/*
> +		 * stop the event and update event->count
> +		 */
> +		event->pmu->stop(event, PERF_EF_UPDATE);
> +
>  		now = local64_read(&event->count);
>  		delta = now - hwc->freq_count_stamp;
>  		hwc->freq_count_stamp = now;
>
> +		/*
> +		 * restart the event
> +		 * reload only if value has changed
> +		 */
>  		if (delta > 0)
>  			perf_adjust_period(event, period, delta);
> +
> +		event->pmu->start(event, delta > 0 ? PERF_EF_RELOAD : 0);
>  	}
> +
> +	raw_spin_unlock(&ctx->lock);
>  }
>
>  /*
> @@ -2388,16 +2411,13 @@ static void rotate_ctx(struct perf_event_context *ctx)
>   */
>  static void perf_rotate_context(struct perf_cpu_context *cpuctx)
>  {
> -	u64 interval = (u64)cpuctx->jiffies_interval * TICK_NSEC;
>  	struct perf_event_context *ctx = NULL;
> -	int rotate = 0, remove = 1, freq = 0;
> +	int rotate = 0, remove = 1;
>
>  	if (cpuctx->ctx.nr_events) {
>  		remove = 0;
>  		if (cpuctx->ctx.nr_events != cpuctx->ctx.nr_active)
>  			rotate = 1;
> -		if (cpuctx->ctx.nr_freq)
> -			freq = 1;
>  	}
>
>  	ctx = cpuctx->task_ctx;
> @@ -2405,37 +2425,26 @@ static void perf_rotate_context(struct perf_cpu_context *cpuctx)
>  		remove = 0;
>  		if (ctx->nr_events != ctx->nr_active)
>  			rotate = 1;
> -		if (ctx->nr_freq)
> -			freq = 1;
>  	}
>
> -	if (!rotate && !freq)
> +	if (!rotate)
>  		goto done;
>
>  	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>  	perf_pmu_disable(cpuctx->ctx.pmu);
>
> -	if (freq) {
> -		perf_ctx_adjust_freq(&cpuctx->ctx, interval);
> -		if (ctx)
> -			perf_ctx_adjust_freq(ctx, interval);
> -	}
> -
> -	if (rotate) {
> -		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
> -		if (ctx)
> -			ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
> +	cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
> +	if (ctx)
> +		ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
>
> -		rotate_ctx(&cpuctx->ctx);
> -		if (ctx)
> -			rotate_ctx(ctx);
> +	rotate_ctx(&cpuctx->ctx);
> +	if (ctx)
> +		rotate_ctx(ctx);
>
> -		perf_event_sched_in(cpuctx, ctx, current);
> -	}
> +	perf_event_sched_in(cpuctx, ctx, current);
>
>  	perf_pmu_enable(cpuctx->ctx.pmu);
>  	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> -
>  done:
>  	if (remove)
>  		list_del_init(&cpuctx->rotation_list);
> @@ -2445,10 +2454,22 @@ void perf_event_task_tick(void)
>  {
>  	struct list_head *head = &__get_cpu_var(rotation_list);
>  	struct perf_cpu_context *cpuctx, *tmp;
> +	struct perf_event_context *ctx;
> +	int throttled;
>
>  	WARN_ON(!irqs_disabled());
>
> +	__this_cpu_inc(perf_throttled_seq);
> +	throttled = __this_cpu_xchg(perf_throttled_count, 0);
> +
>  	list_for_each_entry_safe(cpuctx, tmp, head, rotation_list) {
> +		ctx = &cpuctx->ctx;
> +		perf_adjust_freq_unthr_context(ctx, throttled);
> +
> +		ctx = cpuctx->task_ctx;
> +		if (ctx)
> +			perf_adjust_freq_unthr_context(ctx, throttled);
> +
>  		if (cpuctx->jiffies_interval == 1 ||
>  				!(jiffies % cpuctx->jiffies_interval))
>  			perf_rotate_context(cpuctx);
> @@ -4514,6 +4535,7 @@ static int __perf_event_overflow(struct perf_event *event,
>  {
>  	int events = atomic_read(&event->event_limit);
>  	struct hw_perf_event *hwc = &event->hw;
> +	u64 seq;
>  	int ret = 0;
>
>  	/*
> @@ -4523,14 +4545,20 @@ static int __perf_event_overflow(struct perf_event *event,
>  	if (unlikely(!is_sampling_event(event)))
>  		return 0;
>
> -	if (unlikely(hwc->interrupts >= max_samples_per_tick)) {
> -		if (throttle) {
> +	seq = __this_cpu_read(perf_throttled_seq);
> +	if (seq != hwc->interrupts_seq) {
> +		hwc->interrupts_seq = seq;
> +		hwc->interrupts = 1;
> +	} else {
> +		hwc->interrupts++;
> +		if (unlikely(throttle
> +			     && hwc->interrupts >= max_samples_per_tick)) {
> +			__this_cpu_inc(perf_throttled_count);
>  			hwc->interrupts = MAX_INTERRUPTS;
>  			perf_log_throttle(event, 0);
>  			ret = 1;
>  		}
> -	} else
> -		hwc->interrupts++;
> +	}
>
>  	if (event->attr.freq) {
>  		u64 now = perf_clock();
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
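To double-check my reading of the new accounting, I put together a tiny
userspace mock of the seq/count logic from the __perf_event_overflow() and
perf_event_task_tick() hunks above. It is only a sketch: one "CPU", one
event, max_samples_per_tick hard-coded to 2, the stopped PMU simulated by an
early return, and all names are shortened stand-ins rather than kernel code.

#include <stdio.h>

#define MAX_INTERRUPTS	(~0U)	/* sentinel, in the spirit of the kernel's */

/* stand-ins for the two new per-cpu variables */
static unsigned long perf_throttled_seq;
static int perf_throttled_count;

/* the two hw_perf_event fields this patch cares about */
struct hw_mock {
	unsigned long	interrupts_seq;
	unsigned int	interrupts;
};

/* mock of the accounting in __perf_event_overflow(); returns 1 when the
 * sample is taken, 0 when the event is still stopped by throttling */
static int overflow(struct hw_mock *hwc, unsigned int max_per_tick)
{
	if (hwc->interrupts == MAX_INTERRUPTS)
		return 0;	/* throttled: the PMU would be stopped */

	if (hwc->interrupts_seq != perf_throttled_seq) {
		/* first overflow since the last tick: restart the count */
		hwc->interrupts_seq = perf_throttled_seq;
		hwc->interrupts = 1;
	} else if (++hwc->interrupts >= max_per_tick) {
		/* per-tick limit reached: throttle until the next tick */
		perf_throttled_count++;
		hwc->interrupts = MAX_INTERRUPTS;
	}
	return 1;
}

int main(void)
{
	struct hw_mock hwc = { 0, 0 };
	int tick, i, taken = 0;

	for (tick = 0; tick < 10; tick++) {
		/* mock of perf_event_task_tick(): bump seq, then unthrottle */
		perf_throttled_seq++;
		if (perf_throttled_count) {
			perf_throttled_count = 0;
			hwc.interrupts = 0;	/* event restarted */
		}
		/* overcommitted: far more overflows than the cap allows */
		for (i = 0; i < 100; i++)
			taken += overflow(&hwc, 2);
	}
	printf("samples taken: %d (cap = 2 per tick x 10 ticks = 20)\n", taken);
	return 0;
}

It prints 20, i.e. the per-tick cap is enforced exactly, and the interrupt
count restarts on the seq change rather than relying on the tick visiting
every event, which matches the description of the fix above.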