Multi-PMU groups with the perf tool


From: Pawel Moll @ 2015-09-08 14:30 UTC
  To: linux-perf-users

Greetings,

I was recently asked about using "uncore-like" PMUs with the "perf
record" command. The issue is that those PMUs are generally
non-sampling, so one cannot simply do

# perf record -a -e ccn/cycles/,ccn/xp_valid_flit,xp=1,port=0,vc=1,dir=1/ sleep 1

as the PMU's event_init function will return -EINVAL. In most cases,
"perf stat" provides a valid equivalent:

# perf stat -a -e ccn/cycles/,ccn/xp_valid_flit,xp=1,port=0,vc=1,dir=1/ sleep 1 

which gives the difference between counter values "before" and "after".
But what if you're looking at something more complex than "sleep 1" and
you want to observe memory system behaviour over time? Of course no one
expects the samples to be correlated with a particular PC, but sampling
and visualising the data with, say, a 1ms period can be useful for
certain use cases.

My initial answer was: create a group with a 1ms cpu-clock (so hrtimer)
as a leader and, with PERF_SAMPLE_READ, attach "uncore" children to it.
That way, you'll get uncore counters read every ms.
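
To make concrete what such a group sample carries: when the group is read
with read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID, the read data is a
u64 member count followed by one (value, id) pair of u64s per group
member, leader first. Below is a minimal Python sketch of decoding that
payload, assuming the bytes have already been extracted from the sample
record. The name parse_group_read is made up for this illustration and is
not part of the perf tool or of any library.

```python
import struct


def parse_group_read(buf):
    """Decode a PERF_FORMAT_GROUP | PERF_FORMAT_ID read payload.

    Assumed layout (native-endian u64 fields, no TOTAL_TIME_* fields):
        u64 nr; then nr times { u64 value; u64 id; }
    Returns a list of (id, value) tuples in group order (leader first).
    """
    if len(buf) < 8:
        raise ValueError("buffer too short for the member count")
    (nr,) = struct.unpack_from("=Q", buf, 0)
    needed = 8 + nr * 16
    if len(buf) < needed:
        raise ValueError("buffer holds fewer than %d members" % nr)
    members = []
    for i in range(nr):
        # Each member is a (value, id) pair; swap it to (id, value).
        value, event_id = struct.unpack_from("=QQ", buf, 8 + i * 16)
        members.append((event_id, value))
    return members
```

The values are cumulative counts, so a per-interval figure is the
difference between two consecutive samples for the same id.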

I know that this works from the kernel point of view, because my custom
tool does exactly this, but I wanted to provide an example using the
standard perf tool (the user, obviously, already had it). So I spent
some time trying to create such a group with the perf tool and, after
going as far as reading the code :-) must admit defeat. What I would
like to do is something along the lines of (with the parenthesis group
syntax being just a product of my imagination ;-)

# perf record -F 1000 -a \
	-e cpu-clock(ccn/cycles/,ccn/xp_valid_flit,xp=1,port=0,vc=1,dir=1/) \
	sleep 1 

Have I missed something (very likely, given the number of
"not-really-obvious" features of the tool ;-), or is there no way of
creating such a custom group with the perf tool today? It's not a
critique, merely a question.

Cheers!

Pawel


* Re: Multi-PMU groups with the perf tool
From: Pawel Moll @ 2016-03-01 17:22 UTC
  To: linux-perf-users

I'm sure there's a medical name for answering one's own questions,
particularly after 6 months, but hey, it may help someone else :-)

On Tue, 2015-09-08 at 15:30 +0100, Pawel Moll wrote:
> My initial answer was: create a group with a 1ms cpu-clock (so hrtimer)
> as a leader and, with PERF_SAMPLE_READ, attach "uncore" children to it.
> That way, you'll get uncore counters read every ms.
> 
> I know that this works from the kernel point of view, because my custom
> tool does exactly this, but I wanted to provide an example using the
> standard perf tool (the user, obviously, already had it). So I spent
> some time trying to create such a group with the perf tool and, after
> going as far as reading the code :-) must admit defeat. What I would
> like to do is something along the lines of (with the parenthesis group
> syntax being just a product of my imagination ;-)
> 
> # perf record -F 1000 -a \
> 	-e cpu-clock(ccn/cycles/,ccn/xp_valid_flit,xp=1,port=0,vc=1,dir=1/) \
> 	sleep 1
> 
> Have I missed something (very likely, given the number of
> "not-really-obvious" features of the tool ;-), or is there no way of
> creating such a custom group with the perf tool today? It's not a
> critique, merely a question.

By pure accident I finally realised it is - in principle - possible
with the existing tool. The incantation would look like this (I've
skipped the "ccn/xp_valid_flit.../" event for the sake of line length):


# perf record -a -e '{cpu-clock,ccn/cycles/}:S' sleep 1


only it still doesn't work, failing with -EINVAL. After some digging
I've realised the reason: the tool tries to create such a group on each
of the CPUs, because I've asked for it with -a, right? But "uncore"
events like "ccn/cycles/" are meant to be pinned to a single CPU, and to
communicate this their PMUs export a "cpumask" sysfs attribute. Let's
have a look:


# cat /sys/bus/event_source/devices/ccn/cpumask
3
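
A script that wants to pick the right CPU automatically could read this
attribute itself. The sketch below assumes the file is in the kernel's
CPU-list format (e.g. "3" or "0,4-7"), which matches the output above,
where "3" led to the event being requested on CPU3. The helper names
parse_cpulist and pmu_cpus are made up for this example.

```python
def parse_cpulist(text):
    """Expand a CPU-list string such as "3" or "0,4-7" into sorted CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty file or a stray comma
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def pmu_cpus(pmu, sysfs_root="/sys/bus/event_source/devices"):
    """Return the CPUs listed in <sysfs_root>/<pmu>/cpumask."""
    with open("%s/%s/cpumask" % (sysfs_root, pmu)) as f:
        return parse_cpulist(f.read())
```

On the system shown here, pmu_cpus("ccn") would be expected to return
[3], which is the value to pass to "perf record -C".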


And now, if we try to run the command above with some verbosity:

# perf record -vv -a -e '{cpu-clock,ccn/cycles/}:S' sleep 1
------------------------------------------------------------
perf_event_attr:
  type                             1
  size                             112
  { sample_period, sample_freq }   4000
  sample_type                      IP|TID|TIME|READ|ID|CPU|PERIOD
  read_format                      ID|GROUP
  disabled                         1
  mmap                             1
  comm                             1
  freq                             1
  task                             1
  sample_id_all                    1
  exclude_guest                    1
  mmap2                            1
  comm_exec                        1
------------------------------------------------------------
sys_perf_event_open: pid -1  cpu 0  group_fd -1  flags 0x8
sys_perf_event_open: pid -1  cpu 1  group_fd -1  flags 0x8
sys_perf_event_open: pid -1  cpu 2  group_fd -1  flags 0x8
sys_perf_event_open: pid -1  cpu 3  group_fd -1  flags 0x8
sys_perf_event_open: pid -1  cpu 4  group_fd -1  flags 0x8
sys_perf_event_open: pid -1  cpu 5  group_fd -1  flags 0x8
sys_perf_event_open: pid -1  cpu 6  group_fd -1  flags 0x8
sys_perf_event_open: pid -1  cpu 7  group_fd -1  flags 0x8
------------------------------------------------------------
perf_event_attr:
  type                             7
  size                             112
  config                           0xff00
  sample_type                      IP|TID|TIME|READ|ID|CPU|PERIOD
  read_format                      ID|GROUP
  freq                             1
  sample_id_all                    1
  exclude_guest                    1
------------------------------------------------------------
sys_perf_event_open: pid -1  cpu 3  group_fd 4  flags 0x8
sys_perf_event_open failed, error -22


The tool seemingly did the right thing and requested the "ccn" event on
CPU3, but as a member of the group led by fd 4, which is the "cpu-clock"
event on CPU0, and that is not allowed. The perf core has this check:


			/*
			 * Make sure we're both events for the same CPU;
			 * grouping events for different CPUs is broken; since
			 * you can never concurrently schedule them anyhow.
			 */
			if (group_leader->cpu != event->cpu)
				goto err_context;
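
The effect of that check on a system-wide run can be shown with a toy
model in Python. This is not kernel code: it only assumes, as described
in this thread, that the uncore event always ends up on the PMU's
"cpumask" CPU and that a sibling can join a group only if its CPU equals
the leader's. The function name group_open_results is hypothetical.

```python
def group_open_results(leader_cpus, pmu_cpu):
    """Map each leader CPU to whether the uncore sibling could join its group.

    leader_cpus: CPUs on which a group leader (e.g. cpu-clock) is opened.
    pmu_cpu:     the single CPU the uncore PMU is pinned to (its cpumask).
    """
    results = {}
    for leader_cpu in leader_cpus:
        # Assumption: the sibling's effective CPU is always the PMU's CPU.
        event_cpu = pmu_cpu
        # Mirrors "if (group_leader->cpu != event->cpu) goto err_context;"
        results[leader_cpu] = (leader_cpu == event_cpu)
    return results
```

With leaders on CPUs 0-7 and the PMU pinned to CPU3, only the group on
CPU3 passes, so a "-a" run is bound to hit -EINVAL on the others.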


If I now, instead of using "-a", request two CPUs in the right order, I
get the proof:


# perf record -vv -C3,0 -e '{cpu-clock,ccn/cycles/}:S' sleep 1
------------------------------------------------------------
perf_event_attr:
  type                             1
  size                             112
  { sample_period, sample_freq }   4000
  sample_type                      IP|TID|TIME|READ|ID|CPU|PERIOD
  read_format                      ID|GROUP
  disabled                         1
  mmap                             1
  comm                             1
  freq                             1
  task                             1
  sample_id_all                    1
  exclude_guest                    1
  mmap2                            1
  comm_exec                        1
------------------------------------------------------------
sys_perf_event_open: pid -1  cpu 3  group_fd -1  flags 0x8
sys_perf_event_open: pid -1  cpu 0  group_fd -1  flags 0x8
------------------------------------------------------------
perf_event_attr:
  type                             7
  size                             112
  config                           0xff00
  sample_type                      IP|TID|TIME|READ|ID|CPU|PERIOD
  read_format                      ID|GROUP
  freq                             1
  sample_id_all                    1
  exclude_guest                    1
------------------------------------------------------------
sys_perf_event_open: pid -1  cpu 3  group_fd 4  flags 0x8
sys_perf_event_open: pid -1  cpu 0  group_fd 5  flags 0x8
sys_perf_event_open failed, error -22


Everything works fine for the first group, created on CPU3, and then
fails for the second one. Interestingly, it seems that in this case the
tool ignores "cpumask" and tries to create the "ccn" event on CPU0. The
thing is that the "cpu" argument gets overridden by the driver to match
the "cpumask" value, and we're back to square one.

So finally, if I only run record on the single, correct CPU, it works:


# perf record -C3 -e '{cpu-clock,ccn/cycles/}:S' sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.330 MB perf.data (1874 samples) ]


Only it's not exactly what I wanted (I want system-wide CPU data - with
other events in the list - with the uncore events included in it at
regular intervals)... The obvious "fix" for it would be making sure the
tool does not create "cpumask"ed events in a group belonging to a "wrong"
CPU, but whether it's the correct solution, I'm not sure yet.
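
To make that idea concrete, here is a small Python sketch of the planning
step such a fix might involve. It is purely illustrative: it does not
reflect how the perf tool actually builds its event lists, and
plan_group_opens and its parameters are hypothetical names.

```python
def plan_group_opens(requested_cpus, leader, sibling, sibling_cpus):
    """Decide which events to open in the group on each requested CPU.

    The leader is opened on every requested CPU; the "cpumask"ed sibling
    is added only to groups whose CPU appears in sibling_cpus.
    Returns a list of (cpu, [event names]) tuples in request order.
    """
    allowed = set(sibling_cpus)
    plan = []
    for cpu in requested_cpus:
        events = [leader]
        if cpu in allowed:
            events.append(sibling)
        plan.append((cpu, events))
    return plan
```

For CPUs 0-7 with the "ccn" PMU pinned to CPU3, this yields a lone
cpu-clock event on seven CPUs and the full group on CPU3 only.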

Comments welcome, silence won't be taken as an offence ;-)

Pawel
