Re: [RFC 04/10] drm/i915: Expose a PMU interface for perf queries

From: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com>
To: "peterz@infradead.org" <peterz@infradead.org>
Cc: "Intel-gfx@lists.freedesktop.org" <Intel-gfx@lists.freedesktop.org>
Subject: Re: [RFC 04/10] drm/i915: Expose a PMU interface for perf queries
Date: Wed, 23 Aug 2017 17:51:38 +0000	[thread overview]
Message-ID: <1503481771.5383.31.camel@intel.com> (raw)
In-Reply-To: <20170822181755.GL32112@worktop.programming.kicks-ass.net>

On Tue, 2017-08-22 at 20:17 +0200, Peter Zijlstra wrote:
> On Sat, Aug 12, 2017 at 02:15:13AM +0000, Rogozhkin, Dmitry V wrote:
> > $ perf stat -e instructions,i915/rcs0-busy/ workload.sh
> > <... wrokload.sh output...>
> > 
> > Performance counter stats for 'workload.sh':
> >      1,204,616,268      instructions
> >                  0      i915/rcs0-busy/
> > 
> >        1.869169153 seconds time elapsed
> > 
> > As you can see instructions event works pretty well, i915/rcs0-busy/
> > doesn't.
> > 
> > I afraid that our current understanding of how PMU should work is not
> > fully correct.
> 
> Can we start off by explaining to me how this i915 stuff works. Because
> all I have is ~750 lines of patch without comments. Which sort of leaves
> me confused.

i915 PMU tries to expose a number of SW metrics to characterize i915
performance: i915 interrupt counts, time (counted in microseconds) GPU
spent in specific states like RC6, time  (counted in microseconds) GPU
engines were busy executing some tasks or were idle, etc. As for now
these metrics are i915 global and can't be attached to some particular
task, however, we may consider such metrics in the future. I think that
global PMU is the nice first step. I had a talk with Andi Kleen who
pointed that this is similar to how uncore PMU behaves and suggested I
should use 'perf stat -e <event> -a -C0' and probably adjust the code to
register i915 PMU with the perf_invalid_context. Which I did in the
patch which probably should be squashed to the patch from Tvrtko (I did
not want to intersect with his job). This change can be found here:
https://patchwork.freedesktop.org/patch/171953/. This patch makes 'perf
stat -e i915/rcs0-busy/' to error out and supports 'perf stat -e
i915/rcs0-busy/ -a -C0'. I still think I miss something since 'perf stat
-e i915/rcs0-busy/ -a' will give metrics multiplied by number of active
CPUs, but that's look closer to what is needed.

Anyhow, returning to the metrics i915 exposes. Some metrics are just
exposure of some counters supported already inside i915 PMU which do not
require any special sampling: at any given moment you can request the
counter value (these are interrupts counts, i915 power consumption).
Other metrics are similar to the ever-existing which I just described,
but they require activation for i915 to start to count them - this is
done on the event initialization (these are engine busy stats). Finally,
there is a third group which require sampling counting: they are needed
to be initialized and i915 pmu starts an internal timer to count these
values (these are some engines characteristics referenced in the code as
QUEUED, SEMA, WAIT).

I hope this clarifies. As well I hope I correctly covered high level
description of i915 PMU. I hope Chris will correct me if I was wrong
somewhere. Mind that another author of i915 PMU Tvrtko is for one more
week or so on vacation.

> 
> The above command tries to add an event 'i915/rcs0-busy/' to a task. How
> are i915 resource associated to any one particular task?

Currently in no way, they are global.

> 
> Is there a unique i915 resource for each task? If not, I don't see how
> per-task event can ever work as expected.

This depends on what you mean under "expected"? I see 2 variants:
1. Either this command line should return correct metric values
2. This command line should error out
Right now i915 PMU produces global metric, thus, I think 2nd variant is
true. Which I think is achievable if PMU is registered with
perf_invalid_context.

> 
> > I think so, because the way PMU entry points init(),
> > add(), del(), start(), stop(), read() are implemented do not correlate
> > with how many times they are called. I have counted them and here is the
> > result:
> > init()=19, add()=44310, del()=43900, start()=44534, stop()=0, read()=0
> > 
> > Which means that we are regularly attempt to start/stop timer and/or
> > busy stats calculations. Another thing which pay attention is that
> > read() was not called at all. How perf supposes to get counter value?
> 
> Both stop() and del() are supposed to update event->count. Only if we do
> sys_read() while the event is active (something perf-stat never does
> IIRC) will it issue pmu::read() to get an up-to-date number.
> 
> > Yet another thing, where we are supposed to initialize our internal
> > staff: numbers above are from single run and even init is called
> > multiple times? Where we are supposed to de-init our staff: each time on
> > del() - this hardly makes sense?
> 
> init happens in pmu::event_init(), that can set an optional
> event->destroy() function for de-init.
> 
> init() is called once for each event created, the above creates an
> inherited per-task event (I think, I lost track of what perf tool does)
> and 19 seems to suggest you did some 18 fork()/clone() calls after that,
> resulting in your 1 parent event with 18 children.

This behavior changed significantly once I have registered i915 PMU with
the perf_invalid_context notifying perf subsystem that out PMU produces
global metrics. Right now what I have is:
1. If I call "perf stat -e <event> -a -C0", then event is initialized
once
2. If I call "perf stat -e <event> -a", then event is initialized as
many times as there are logical CPUs. On my system there were 8.

And there is another problem associated with the "perf stat -e <event>
-a": since 8 events are initialized, I get metric values multiplied by
8. How I am supposed to fix that? I suspect that this is somehow related
to cpumask attribute exposed by some other PMUs, for example uncore.
They are doing something like:

static cpumask_t uncore_cpu_mask;

static DEVICE_ATTR(cpumask, S_IRUGO, uncore_get_attr_cpumask, NULL);

static struct attribute *uncore_pmu_attrs[] = {
	&dev_attr_cpumask.attr,
	NULL,
};

static struct attribute_group uncore_pmu_attr_group = {
	.attrs = uncore_pmu_attrs,
};

Should i915 follow this way as well?

> 
> > I should note that if perf will be issued with -I 10 option, then read()
> > is being called: init_c()=265, add_c()=132726, del_c()=131482,
> > start_c()=133412, stop()=0, read()=71. However, i915 counter is still 0.
> > I have tried to print counter values from within read() and these values
> > are non 0. Actually read() returns sequence of <non_zero>, 0, 0, 0, ...,
> > <no_zero> because with our add(), del() code we regularly start/stop our
> > counter and execution in read() follows different branches.
> > 
> > Thus, I think that right now we do not implement PMU correctly and do
> > not meet perf expectations from the PMU. Unfortunately, right now I have
> > no idea what are these expectations.
> 
> Please as to clarify how i915 works, I have no idea where to go.

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx