From: Alexey Budankov <alexey.budankov@linux.intel.com>
To: David Carrillo-Cisneros <davidcc@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>,
Arnaldo Carvalho de Melo <acme@kernel.org>,
Alexander Shishkin <alexander.shishkin@linux.intel.com>,
Andi Kleen <ak@linux.intel.com>, Kan Liang <kan.liang@intel.com>,
Dmitri Prokhorov <Dmitry.Prohorov@intel.com>,
Valery Cherepennikov <valery.cherepennikov@intel.com>,
Stephane Eranian <eranian@google.com>,
Mark Rutland <mark.rutland@arm.com>,
linux-kernel <linux-kernel@vger.kernel.org>,
Arun Kalyanasundaram <arunkaly@google.com>
Subject: Re: [PATCH v2]: perf/core: addressing 4x slowdown during per-process, profiling of STREAM benchmark on Intel Xeon Phi
Date: Wed, 14 Jun 2017 14:27:10 +0300 [thread overview]
Message-ID: <3b886d12-e71a-c7cf-4410-d0247f73b4a3@linux.intel.com> (raw)
In-Reply-To: <CALcN6mgbT2j_sb4L-SXt9KNvZUk6pE9hRoHtPn_quVUEaixxXQ@mail.gmail.com>
On 01.06.2017 0:33, David Carrillo-Cisneros wrote:
> On Sat, May 27, 2017 at 4:19 AM, Alexey Budankov
> <alexey.budankov@linux.intel.com> wrote:
>> Motivation:
>>
>> The issue manifests as a 4x slowdown when profiling the single-threaded STREAM
>> benchmark on Intel Xeon Phi running RHEL7.2 (Intel MPSS distribution).
>> Perf profiling is done in per-process mode and involves about 30 core
>> events. When the benchmark is OpenMP-based and runs under profiling in
>> 272 threads, the overhead becomes even more dramatic: 512.144s versus
>> 36.192s (with this patch).
>
> How long does it take without any perf monitoring? Could you provide
> more details about the benchmark? how many CPUs are being monitored?
STREAM runs for about 10 seconds without perf monitoring on Intel KNL in 272
threads (272 CPUs are monitored). More on the benchmark is here:
https://www.cs.virginia.edu/stream/
>
> SNIP
>> different from the one executing the handler. Additionally, for every
>> filtered-out group, group_sched_out() updates tstamp values to the current
>> interrupt time. This updating work is now done only once by
>> update_context_time(), called by ctx_sched_out() before the cpu groups
>> iteration.
>
> I don't see this. e.g.:
It is done here:
@@ -1383,6 +1384,9 @@ static void update_context_time(struct perf_event_context *ctx)
 	ctx->time += now - ctx->timestamp;
 	ctx->timestamp = now;
+
+	ctx->tstamp_data.running += ctx->time - ctx->tstamp_data.stopped;
+	ctx->tstamp_data.stopped = ctx->time;
 }
Unscheduled events' tstamps are updated all at once in
update_context_time(), called from ctx_sched_out().
> in your patch task_ctx_sched_out calls ctx_sched_out with mux == 0,
> that path does the exact same thing as before your patch.
>
> I understand why you want to move the event's times to a different
> structure and keep a pointer in the event, but I don't see that you
> are avoiding the update of the times of unscheduled events.
>
>>
>> static u64 perf_event_time(struct perf_event *event)
>> @@ -1424,16 +1428,16 @@ static void update_event_times(struct perf_event *event)
>> else if (ctx->is_active)
>> run_end = ctx->time;
>> else
>> - run_end = event->tstamp_stopped;
>> + run_end = event->tstamp->stopped;
>>
>> - event->total_time_enabled = run_end - event->tstamp_enabled;
>> + event->total_time_enabled = run_end - event->tstamp->enabled;
>>
>> if (event->state == PERF_EVENT_STATE_INACTIVE)
>> - run_end = event->tstamp_stopped;
>> + run_end = event->tstamp->stopped;
>> else
>> run_end = perf_event_time(event);
>>
>> - event->total_time_running = run_end - event->tstamp_running;
>> + event->total_time_running = run_end - event->tstamp->running;
>
> FWICT, this is run for ALL events in context with matching CPU.
>
>
> SNIP
>> }
>> @@ -3051,9 +3277,9 @@ void __perf_event_task_sched_out(struct task_struct *task,
>> * Called with IRQs disabled
>> */
>> static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
>> - enum event_type_t event_type)
>> + enum event_type_t event_type, int mux)
>> {
>> - ctx_sched_out(&cpuctx->ctx, cpuctx, event_type);
>> + ctx_sched_out(&cpuctx->ctx, cpuctx, event_type, mux);
>> }
>>
>> static void
>> @@ -3061,29 +3287,8 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
>> struct perf_cpu_context *cpuctx)
>> {
>> struct perf_event *event;
>> -
>> - list_for_each_entry(event, &ctx->pinned_groups, group_entry) {
>> - if (event->state <= PERF_EVENT_STATE_OFF)
>> - continue;
>> - if (!event_filter_match(event))
>> - continue;
>
> Could we remove or simplify the tests in event_filter_match since the
> rb-tree filters by cpu?
Why do you want to remove or simplify event_filter_match()?
>
>> -
>> - /* may need to reset tstamp_enabled */
>> - if (is_cgroup_event(event))
>> - perf_cgroup_mark_enabled(event, ctx);
>> -
>> - if (group_can_go_on(event, cpuctx, 1))
>> - group_sched_in(event, cpuctx, ctx);
>> -
>> - /*
>> - * If this pinned group hasn't been scheduled,
>> - * put it in error state.
>> - */
>> - if (event->state == PERF_EVENT_STATE_INACTIVE) {
>> - update_group_times(event);
>> - event->state = PERF_EVENT_STATE_ERROR;
>> - }
>> - }
>> + list_for_each_entry(event, &ctx->pinned_groups, group_entry)
>> + ctx_sched_in_pinned_group(ctx, cpuctx, event);
>> }
>>
>> static void
>> @@ -3092,37 +3297,19 @@ ctx_flexible_sched_in(struct perf_event_context *ctx,
>> {
>> struct perf_event *event;
>> int can_add_hw = 1;
>> -
>> - list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
>> - /* Ignore events in OFF or ERROR state */
>> - if (event->state <= PERF_EVENT_STATE_OFF)
>> - continue;
>> - /*
>> - * Listen to the 'cpu' scheduling filter constraint
>> - * of events:
>> - */
>> - if (!event_filter_match(event))
>> - continue;
> same as before.
>