* [RFC] Extending ARM perf-events for multiple PMUs
@ 2011-04-08 17:15 Will Deacon
2011-04-08 18:10 ` Linus Walleij
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Will Deacon @ 2011-04-08 17:15 UTC (permalink / raw)
To: linux-arm-kernel
Hello,
Currently the perf code on ARM only caters for the core CPU PMU. In actual
fact, this only represents a subset of the performance monitoring hardware
available in real SoCs and is arguably the simplest to interact with. This
long-winded email is an attempt to classify the possible event sources that we
might see so that we can have clean support for them in the future. I think
that the perf tools might also need tweaking slightly so they can handle PMUs
which can't service per-cpu or per-task events (instead, you essentially have
a single system-wide event).
We can split PMUs up into two basic categories (an `action' here is usually an
interrupt but could be defined as any state recording or signalling operation).
(1) CPU-aware PMUs
This type of PMU is typically per-CPU and accessed via co-processor
instructions. Actions may be delivered as PPIs. Events scheduled onto
a CPU-aware PMU can be grouped, possibly with events scheduled for other
per-CPU PMUs on the same CPU. An action delivered by one of these PMUs
can *always* be attributed to a specific CPU but not necessarily a
specific task. Accessing a CPU-aware PMU is a synchronous operation.
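For illustration, accessing such a PMU really is just a co-processor
access; a minimal sketch, assuming an ARMv7 core with the cycle
counter already enabled:

    /* Read PMCCNTR, the ARMv7 cycle counter, via CP15. */
    static inline u32 read_pmccntr(void)
    {
            u32 cycles;

            asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r" (cycles));
            return cycles;
    }

The value is in a register as soon as the instruction completes, which
is what makes the access synchronous.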
(2) System PMUs
System PMUs are typically outside of the CPU domain. Bus monitors, GPU
counters and external L2 cache controller monitors are all system PMUs.
Actions delivered by these PMUs cannot be attributed to a particular CPU
and certainly cannot be associated with a particular piece of code. They
are memory-mapped and cannot be grouped with other PMUs of any type.
Accesses to a system PMU may be asynchronous.
System PMUs can be further split up into `counting' and `filtering'
PMUs:
(i) Counting PMUs
Counting PMUs increment a counter whenever a particular event occurs
and can deliver an action periodically (for example, on overflow or
after a certain amount of time has passed). The event types are
hardwired as particular, discrete events such as `cycles' or
`misses'.
(ii) Filtering PMUs
Filtering PMUs respond to a query. For example, `generate an action
whenever you see a bus access which fits the following criteria'. The
action may simply be to increment a counter, in which case this PMU
can act as a highly configurable counting PMU, where the event types
are dynamic.
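To make the access model concrete: driving a counting System PMU is
plain MMIO. A sketch, with the register offsets invented purely for
illustration (a real driver would take the base address and layout
from platform data or the device tree):

    /* Hypothetical register layout for a counting System PMU. */
    #define SYS_PMU_EV_CTRL         0x00    /* invented offset */
    #define SYS_PMU_EV_CNT0         0x10    /* invented offset */

    static u32 sys_pmu_read_counter(void __iomem *base)
    {
            return readl_relaxed(base + SYS_PMU_EV_CNT0);
    }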
Now, we currently support the core CPU PMU, which is obviously a CPU-aware PMU
that generates interrupts as actions. Another example of a CPU-aware PMU is
the VFP PMU in Qualcomm's Scorpion. The next step (moving outwards from the
core) is to add support for L2 cache controllers. I expect most of these to be
Counting System PMUs, although I can envisage them being CPU-aware if built
into the core with enough extra hardware.
Implementing support for CPU-aware PMUs can be done alongside the current CPU
PMU code and much of the code can be shared with the core PMU providing that
the event namespaces are distinct.
Implementing support for Counting System PMUs can reuse a lot of the
functionality in perf_event.c (for example, struct arm_pmu) but the low-level
accessors should be separate and a new struct pmu should be used. This means
that we will want multiple instances of struct arm_pmu and a method to translate
from a struct pmu to a struct arm_pmu. We'll also need to clean up some of the
armpmu_* functions to ensure the correct indirection is used when invoking
per-pmu functions.
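Concretely, I imagine embedding the generic struct pmu inside struct
arm_pmu and translating back with container_of() in the armpmu_*
callbacks. A sketch only:

    struct arm_pmu {
            struct pmu      pmu;
            /* low-level accessors, counter state, ... */
    };

    #define to_arm_pmu(p) (container_of(p, struct arm_pmu, pmu))

    static void armpmu_enable(struct pmu *pmu)
    {
            struct arm_pmu *armpmu = to_arm_pmu(pmu);

            /* indirect through armpmu instead of a global instance */
    }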
Finally, the Filtering System PMUs will probably need their own struct pmu
instances for each device and can make use of the dynamic sysfs interface via
perf_pmu_register. I don't see any scope for common code in this space yet.
I appreciate this is especially hand-wavy stuff, but I'd like to check we've
got all of our bases covered before introducing system PMUs to ARM. The first
victim is the PL310 L2CC on the Cortex-A9.
Feedback welcome,
Will
* [RFC] Extending ARM perf-events for multiple PMUs
2011-04-08 17:15 [RFC] Extending ARM perf-events for multiple PMUs Will Deacon
@ 2011-04-08 18:10 ` Linus Walleij
2011-04-11 11:12 ` Will Deacon
2011-04-09 11:40 ` Peter Zijlstra
2011-04-11 17:29 ` Ashwin Chaugule
2 siblings, 1 reply; 17+ messages in thread
From: Linus Walleij @ 2011-04-08 18:10 UTC (permalink / raw)
To: linux-arm-kernel
Hi Will, thanks for this quite informative letter!
On Fri, Apr 8, 2011 at 7:15 PM, Will Deacon <will.deacon@arm.com> wrote:
> Implementing support for Counting System PMUs can reuse a lot of the
> functionality in perf_event.c (for example, struct arm_pmu) but the low-level
> accessors should be separate and a new struct pmu should be used. This means
> that we will want multiple instances of struct arm_pmu and a method to translate
> from a struct pmu to a struct arm_pmu. We'll also need to clean up some of the
> armpmu_* functions to ensure the correct indirection is used when invoking
> per-pmu functions.
What I start wondering at this point in the description is that there is some
implicit assumption that counting system PMUs are an arch/arm/* thing,
that they should even be named arm_*, and I guess as such that they are
some PrimeCell kind of thing.
I can surely accept the per-CPU, close-to-CPU counters living
in arch/arm/*, but...
I am thinking that a SoC vendor like Renesas may be implementing a
System PMU monitoring a bus shared between an SH and an ARM
core.
Unless you're ARM Ltd, you can also think about vendors doing System
PMU IP blocks and synthesizing these in both ARM and other-arch
systems.
So maybe this needs a multiarch-spanning solution? I start
thinking about decoupling these babies from the arch and abstracting
them into something like drivers/perf.
Am I totally misguided in this?
Yours,
Linus Walleij
* [RFC] Extending ARM perf-events for multiple PMUs
2011-04-08 17:15 [RFC] Extending ARM perf-events for multiple PMUs Will Deacon
2011-04-08 18:10 ` Linus Walleij
@ 2011-04-09 11:40 ` Peter Zijlstra
2011-04-11 11:29 ` Will Deacon
` (2 more replies)
2011-04-11 17:29 ` Ashwin Chaugule
2 siblings, 3 replies; 17+ messages in thread
From: Peter Zijlstra @ 2011-04-09 11:40 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, 2011-04-08 at 18:15 +0100, Will Deacon wrote:
> Hello,
>
> Currently the perf code on ARM only caters for the core CPU PMU. In actual
> fact, this only represents a subset of the performance monitoring hardware
> available in real SoCs and is arguably the simplest to interact with. This
> long-winded email is an attempt to classify the possible event sources that we
> might see so that we can have clean support for them in the future. I think
> that the perf tools might also need tweaking slightly so they can handle PMUs
> which can't service per-cpu or per-task events (instead, you essentially have
> a single system-wide event).
>
> We can split PMUs up into two basic categories (an `action' here is usually an
> interrupt but could be defined as any state recording or signalling operation).
>
> (1) CPU-aware PMUs
>
> This type of PMU is typically per-CPU and accessed via co-processor
> instructions. Actions may be delivered as PPIs. Events scheduled onto
> a CPU-aware PMU can be grouped, possibly with events scheduled for other
> per-CPU PMUs on the same CPU. An action delivered by one of these PMUs
> can *always* be attributed to a specific CPU but not necessarily a
> specific task. Accessing a CPU-aware PMU is a synchronous operation.
>
> (2) System PMUs
>
> System PMUs are typically outside of the CPU domain. Bus monitors, GPU
> counters and external L2 cache controller monitors are all system PMUs.
> Actions delivered by these PMUs cannot be attributed to a particular CPU
> and certainly cannot be associated with a particular piece of code. They
> are memory-mapped and cannot be grouped with other PMUs of any type.
> Accesses to a system PMU may be asynchronous.
>
> System PMUs can be further split up into `counting' and `filtering'
> PMUs:
>
> (i) Counting PMUs
>
> Counting PMUs increment a counter whenever a particular event occurs
> and can deliver an action periodically (for example, on overflow or
> after a certain amount of time has passed). The event types are
> hardwired as particular, discrete events such as `cycles' or
> `misses'.
>
> (ii) Filtering PMUs
>
> Filtering PMUs respond to a query. For example, `generate an action
> whenever you see a bus access which fits the following criteria'. The
> action may simply be to increment a counter, in which case this PMU
> can act as a highly configurable counting PMU, where the event types
> are dynamic.
I don't see this distinction, both will have to count, and telling it
what to count is a function of perf_event_attr::config* and how the
hardware implements that is of no interest.
> Now, we currently support the core CPU PMU, which is obviously a CPU-aware PMU
> that generates interrupts as actions. Another example of a CPU-aware PMU is
> the VFP PMU in Qualcomm's Scorpion. The next step (moving outwards from the
> core) is to add support for L2 cache controllers. I expect most of these to be
> Counting System PMUs, although I can envisage them being CPU-aware if built
> into the core with enough extra hardware.
>
> Implementing support for CPU-aware PMUs can be done alongside the current CPU
> PMU code and much of the code can be shared with the core PMU providing that
> the event namespaces are distinct.
>
> Implementing support for Counting System PMUs can reuse a lot of the
> functionality in perf_event.c (for example, struct arm_pmu) but the low-level
> accessors should be separate and a new struct pmu should be used. This means
> that we will want multiple instances of struct arm_pmu and a method to translate
> from a struct pmu to a struct arm_pmu. We'll also need to clean up some of the
> > armpmu_* functions to ensure the correct indirection is used when invoking
> per-pmu functions.
>
> Finally, the Filtering System PMUs will probably need their own struct pmu
> instances for each device and can make use of the dynamic sysfs interface via
> perf_pmu_register. I don't see any scope for common code in this space yet.
>
> I appreciate this is especially hand-wavy stuff, but I'd like to check we've
> got all of our bases covered before introducing system PMUs to ARM. The first
> victim is the PL310 L2CC on the Cortex-A9.
Right, so x86 has this too, and we have a fairly complete implementation
of the Nehalem/Westmere uncore PMU, which is a NODE/memory controller
PMU. Afaik we're mostly waiting on Intel to clarify some hardware
details.
So the perf core supports multiple hardware PMUs, but currently only one
of them can do per-task sampling; if you've got multiple CPU-local PMUs
we need to do a little extra.
See perf_pmu_register(); what, say, a memory controller PMU would do is
something like:
perf_pmu_register(&node_pmu, "node", -1);
that will create a /sys/bus/event_source/devices/node/ directory which
will host the PMU details for userspace. This is currently limited
to a single 'type' file which includes the number to provide
perf_event_attr::type, but could (and should) be extended to provide
some important events as well, which will provide the bits to put in
perf_event_attr::config.
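From the user's side that would look something like the sketch below
(error handling trimmed; the config value is a placeholder for
whatever encoding the PMU documents):

    #include <linux/perf_event.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int open_node_event(unsigned long long config)
    {
            struct perf_event_attr attr;
            FILE *f;
            int type;

            /* the 'type' file created by perf_pmu_register() */
            f = fopen("/sys/bus/event_source/devices/node/type", "r");
            if (!f)
                    return -1;
            if (fscanf(f, "%d", &type) != 1) {
                    fclose(f);
                    return -1;
            }
            fclose(f);

            memset(&attr, 0, sizeof(attr));
            attr.type = type;
            attr.size = sizeof(attr);
            attr.config = config;   /* placeholder encoding */

            /* system-wide event, driven from CPU 0 */
            return syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
    }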
I just haven't figured out a way to dynamically add files/directories in
the whole struct device sysfs muck (that also pleases the driver/sysfs
folks). Nor have we agreed on a sane layout for such events there.
What we do for the events is map the provided CPU number to a memory
controller (cpu_to_node() does that for our case), and then use the
first online cpu in that node mask to drive the event.
If you've got system wide things like GPUs, where every cpu maps to the
same device, simply use the first online cpu and create a pmu instance
per device.
Now, I've also wanted to make symlinks in the regular sysfs topology to
these bus/event_source nodes, but again, that's something I've not
managed to find out how to do yet.
That is, for the currently existing "cpu" node, I'd like to have:
/sys/devices/system/cpu/cpuN/event_source -> /sys/bus/event_source/devices/cpu
And similar for the node thing:
/sys/devices/system/node/nodeN/event_source -> /sys/bus/event_source/devices/node
And for a GPU we could have:
/sys/devices/pci0000:00/0000:00:02.0/drm/card0/event_source -> /sys/bus/event_source/devices/IGC0
* [RFC] Extending ARM perf-events for multiple PMUs
2011-04-08 18:10 ` Linus Walleij
@ 2011-04-11 11:12 ` Will Deacon
0 siblings, 0 replies; 17+ messages in thread
From: Will Deacon @ 2011-04-11 11:12 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, 2011-04-08 at 19:10 +0100, Linus Walleij wrote:
> Hi Will, thanks for this quite informative letter!
Hi Linus,
> On Fri, Apr 8, 2011 at 7:15 PM, Will Deacon <will.deacon@arm.com> wrote:
>
> > Implementing support for Counting System PMUs can reuse a lot of the
> > functionality in perf_event.c (for example, struct arm_pmu) but the low-level
> > accessors should be separate and a new struct pmu should be used. This means
> > that we will want multiple instances of struct arm_pmu and a method to translate
> > from a struct pmu to a struct arm_pmu. We'll also need to clean up some of the
> > armpmu_* functions to ensure the correct indirection is used when invoking
> > per-pmu functions.
>
> What I start wondering at this point in the description is that there is some
> implicit assumption that counting system PMUs are an arch/arm/* thing,
> that they should even be named arm_*, and I guess as such that they are
> some PrimeCell kind of thing.
PMUs are typically built into other devices (the one which they're
profiling) so they tend to be tightly-coupled to some other code.
So yes, for things like a PMU in a graphics chip, there should be a
separate struct pmu which lives near the graphics driver.
However, for things like CPUs, busses and cache-controllers they should
probably live under arch/arm/. I'd like to identify the common code for
these PMUs, like we have done for the CPU, rather than see half a dozen
struct pmus for L2 cache-controllers appear from nowhere.
> I am thinking that a SoC vendor like Renesas may be implementing a
> System PMU monitoring a bus shared between an SH and an ARM
> core.
> Unless you're ARM Ltd, you can also think about vendors doing System
> PMU IP blocks and synthesizing these in both ARM and other-arch
> systems.
>
> So maybe this needs a multiarch-spanning solution? I start
> thinking about decoupling these babies from the arch and abstracting
> them into something like drivers/perf.
>
Well, the PMU registration mechanism in perf is already common to all
architectures. I'm just trying to make sure that we identify common PMU
code under arch/arm/ and avoid duplication under different platforms.
PMUs that need to operate on shared busses and the like will likely have
their own struct pmu which can indeed live somewhere more generic.
Cheers,
Will
* [RFC] Extending ARM perf-events for multiple PMUs
2011-04-09 11:40 ` Peter Zijlstra
@ 2011-04-11 11:29 ` Will Deacon
2011-04-11 12:47 ` Peter Zijlstra
2011-04-11 17:44 ` Ashwin Chaugule
2011-04-11 18:00 ` Ashwin Chaugule
2011-04-12 7:39 ` Ming Lei
2 siblings, 2 replies; 17+ messages in thread
From: Will Deacon @ 2011-04-11 11:29 UTC (permalink / raw)
To: linux-arm-kernel
Hi Peter,
On Sat, 2011-04-09 at 12:40 +0100, Peter Zijlstra wrote:
> > System PMUs can be further split up into `counting' and `filtering'
> > PMUs:
> >
> > (i) Counting PMUs
> >
> > Counting PMUs increment a counter whenever a particular event occurs
> > and can deliver an action periodically (for example, on overflow or
> > after a certain amount of time has passed). The event types are
> > hardwired as particular, discrete events such as `cycles' or
> > `misses'.
> >
> > (ii) Filtering PMUs
> >
> > Filtering PMUs respond to a query. For example, `generate an action
> > whenever you see a bus access which fits the following criteria'. The
> > action may simply be to increment a counter, in which case this PMU
> > can act as a highly configurable counting PMU, where the event types
> > are dynamic.
>
> I don't see this distinction, both will have to count, and telling it
> what to count is a function of perf_event_attr::config* and how the
> hardware implements that is of no interest.
Sure, fundamentally we're just writing bits rather than interpreting
them. The reason I mention the difference is that filtering PMUs will
always need their own struct pmu because of the lack of an event
namespace. The other problem is only an issue for some userspace tools
(like Oprofile) which require lists of events and their hex codes.
> > I appreciate this is especially hand-wavy stuff, but I'd like to check we've
> > got all of our bases covered before introducing system PMUs to ARM. The first
> > victim is the PL310 L2CC on the Cortex-A9.
>
> Right, so x86 has this too, and we have a fairly complete implementation
> of the Nehalem/Westmere uncore PMU, which is a NODE/memory controller
> PMU. Afaik we're mostly waiting on Intel to clarify some hardware
> details.
>
> So the perf core supports multiple hardware PMUs, but currently only one
> of them can do per-task sampling; if you've got multiple CPU-local PMUs
> we need to do a little extra.
>
> See perf_pmu_register(); what, say, a memory controller PMU would do is
> something like:
>
> perf_pmu_register(&node_pmu, "node", -1);
>
> that will create a /sys/bus/event_source/devices/node/ directory which
> will host the PMU details for userspace. This is currently limited
> to a single 'type' file which includes the number to provide
> perf_event_attr::type, but could (and should) be extended to provide
> some important events as well, which will provide the bits to put in
> perf_event_attr::config.
Yup, the registration stuff is a good fit for these. I think we may want
an extra level of indirection under arch/arm/ to avoid lots of code
duplication for the struct pmu functions though (like we have for the
CPU PMU).
> I just haven't figured out a way to dynamically add files/directories in
> the whole struct device sysfs muck (that also pleases the driver/sysfs
> folks). Nor have we agreed on a sane layout for such events there.
>
> What we do for the events is map the provided CPU number to a memory
> controller (cpu_to_node() does that for our case), and then use the
> first online cpu in that node mask to drive the event.
>
> If you've got system wide things like GPUs, where every cpu maps to the
> same device, simply use the first online cpu and create a pmu instance
> per device.
Would this result in userspace attributing all of the data to a
particular CPU? We could consider allowing events where the cpu is -1
and the task pid is -1 as well. Non system-wide PMUs could reject these
and demand multiple events instead.
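For reference, perf_event_open() currently rejects the pid == -1 &&
cpu == -1 combination with -EINVAL, so under such an extension the
call might look like this (sketch only; sys_pmu_type stands in for a
type number read from sysfs):

    struct perf_event_attr attr = {
            .type   = sys_pmu_type,         /* placeholder */
            .size   = sizeof(attr),
            .config = 0,                    /* placeholder event */
    };

    /* hypothetical: a single event with no CPU or task affinity */
    int fd = syscall(__NR_perf_event_open, &attr, -1, -1, -1, 0);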
> Now, I've also wanted to make symlinks in the regular sysfs topology to
> these bus/event_source nodes, but again, that's something I've not
> managed to find out how to do yet.
>
> That is, for the currently existing "cpu" node, I'd like to have:
>
> /sys/devices/system/cpu/cpuN/event_source -> /sys/bus/event_source/devices/cpu
>
> And similar for the node thing:
>
> /sys/devices/system/node/nodeN/event_source -> /sys/bus/event_source/devices/node
>
> And for a GPU we could have:
>
> /sys/devices/pci0000:00/0000:00:02.0/drm/card0/event_source -> /sys/bus/event_source/devices/IGC0
That looks like a good way to show the topology of the event sources to
me.
Thanks for your feedback,
Will
* [RFC] Extending ARM perf-events for multiple PMUs
2011-04-11 11:29 ` Will Deacon
@ 2011-04-11 12:47 ` Peter Zijlstra
2011-04-11 17:44 ` Ashwin Chaugule
1 sibling, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2011-04-11 12:47 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, 2011-04-11 at 12:29 +0100, Will Deacon wrote:
> > If you've got system wide things like GPUs, where every cpu maps to the
> > same device, simply use the first online cpu and create a pmu instance
> > per device.
>
> Would this result in userspace attributing all of the data to a
> particular CPU? We could consider allowing events where the cpu is -1
> and the task pid is -1 as well. Non system-wide PMUs could reject these
> and demand multiple events instead.
Not as such, but you need a cpu to receive interrupts on and program the
hardware from etc. Currently most core code assumes things are either
restrained to a single cpu or serialized by virtue of a task never
running on more than 1 cpu at a time.
I'm not quite sure how hard these assumptions are, and we might be able
to get away with making it a little less strict, but that's something
you'd have to play with.
* [RFC] Extending ARM perf-events for multiple PMUs
2011-04-08 17:15 [RFC] Extending ARM perf-events for multiple PMUs Will Deacon
2011-04-08 18:10 ` Linus Walleij
2011-04-09 11:40 ` Peter Zijlstra
@ 2011-04-11 17:29 ` Ashwin Chaugule
2011-04-11 18:00 ` Will Deacon
2 siblings, 1 reply; 17+ messages in thread
From: Ashwin Chaugule @ 2011-04-11 17:29 UTC (permalink / raw)
To: linux-arm-kernel
Hi Will,
Thanks for starting the discussion here.
On 4/8/2011 1:15 PM, Will Deacon wrote:
>
> (1) CPU-aware PMUs
>
> This type of PMU is typically per-CPU and accessed via co-processor
> instructions. Actions may be delivered as PPIs. Events scheduled onto
> a CPU-aware PMU can be grouped, possibly with events scheduled for other
> per-CPU PMUs on the same CPU. An action delivered by one of these PMUs
> can *always* be attributed to a specific CPU but not necessarily a
> specific task. Accessing a CPU-aware PMU is a synchronous operation.
>
I didn't understand when an action would not be attributed to a task in
this category? If we know which CPU "enabled" the event, this should be
possible?
> (2) System PMUs
>
> System PMUs are typically outside of the CPU domain. Bus monitors, GPU
> counters and external L2 cache controller monitors are all system PMUs.
> Actions delivered by these PMUs cannot be attributed to a particular CPU
> and certainly cannot be associated with a particular piece of code. They
> are memory-mapped and cannot be grouped with other PMUs of any type.
> Accesses to a system PMU may be asynchronous.
>
> System PMUs can be further split up into `counting' and `filtering'
> PMUs:
>
> (i) Counting PMUs
>
> Counting PMUs increment a counter whenever a particular event occurs
> and can deliver an action periodically (for example, on overflow or
> after a certain amount of time has passed). The event types are
> hardwired as particular, discrete events such as `cycles' or
> `misses'.
>
> (ii) Filtering PMUs
>
> Filtering PMUs respond to a query. For example, `generate an action
> whenever you see a bus access which fits the following criteria'. The
> action may simply be to increment a counter, in which case this PMU
> can act as a highly configurable counting PMU, where the event types
> are dynamic.
>
> Now, we currently support the core CPU PMU, which is obviously a CPU-aware PMU
> that generates interrupts as actions. Another example of a CPU-aware PMU is
> the VFP PMU in Qualcomm's Scorpion. The next step (moving outwards from the
> core) is to add support for L2 cache controllers. I expect most of these to be
> Counting System PMUs, although I can envisage them being CPU-aware if built
> into the core with enough extra hardware.
For the Qcom L2CC, the PMU can be configured to filter events based on
specific masters. This fact would make it a CPU-aware PMU, although it's
NOT per-core and triggers SPIs.
In such a case, I found it to be quite ugly trying to reuse the per-cpu
data structures, especially in the interrupt handler, since the interrupt can
trigger on a CPU where the event wasn't enabled. A cleaner approach was to
use a separate struct pmu. However, I agree that this approach would lead
to several pmus popping up in arch/arm.
So, I think we could add another category for such highly configurable
PMUs, which are not per-core, but have enough extra h/w to make them
cpu-aware. These need to be treated differently by arm perf, because they
can't really use the per-cpu data structures of the cpu-aware pmus and as
such can't easily re-use many of the functions.
In fact, most of Qcomm PMUs (bus, fabric, etc.) will fall under this new
category. At first glance, these would appear to fall under the System PMU
(counting) category, but they don't because of the extra h/w logic that
allows origin filtering of events.
Also, having all this origin filtering logic helps us track per-process
events on these PMUs, for which we need extra functions to decide how to
allocate and configure counters based on which context (task, cpu) the
event is enabled in.
Cheers,
Ashwin
--
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
* [RFC] Extending ARM perf-events for multiple PMUs
2011-04-11 11:29 ` Will Deacon
2011-04-11 12:47 ` Peter Zijlstra
@ 2011-04-11 17:44 ` Ashwin Chaugule
2011-04-12 17:45 ` Will Deacon
1 sibling, 1 reply; 17+ messages in thread
From: Ashwin Chaugule @ 2011-04-11 17:44 UTC (permalink / raw)
To: linux-arm-kernel
Hi Will,
On 4/11/2011 7:29 AM, Will Deacon wrote:
>> I don't see this distinction, both will have to count, and telling it
>> what to count is a function of perf_event_attr::config* and how the
>> hardware implements that is of no interest.
>
> Sure, fundamentally we're just writing bits rather than interpreting
> them. The reason I mention the difference is that filtering PMUs will
> always need their own struct pmu because of the lack of an event
> namespace. The other problem is only an issue for some userspace tools
> (like Oprofile) which require lists of events and their hex codes.
>
If you mean namespace = perf_event_attr::config, it's 64 bits + another 64
bits of config_base + event_base on ARM? Not too sure, but it would seem
like that should be enough to set up such event chaining.
>
> Would this result in userspace attributing all of the data to a
> particular CPU? We could consider allowing events where the cpu is -1
> and the task pid is -1 as well. Non system-wide PMUs could reject these
> and demand multiple events instead.
Agreed. perf stat -a on PMUs that are not CPU-aware would report
incorrect output. Task counting on such PMUs would be pointless.
Cheers,
Ashwin
--
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
* [RFC] Extending ARM perf-events for multiple PMUs
2011-04-11 17:29 ` Ashwin Chaugule
@ 2011-04-11 18:00 ` Will Deacon
2011-04-11 20:46 ` Ashwin Chaugule
0 siblings, 1 reply; 17+ messages in thread
From: Will Deacon @ 2011-04-11 18:00 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, 2011-04-11 at 18:29 +0100, Ashwin Chaugule wrote:
> Hi Will,
Hi Ashwin,
> Thanks for starting the discussion here.
>
> On 4/8/2011 1:15 PM, Will Deacon wrote:
> >
> > (1) CPU-aware PMUs
> >
> > This type of PMU is typically per-CPU and accessed via co-processor
> > instructions. Actions may be delivered as PPIs. Events scheduled onto
> > a CPU-aware PMU can be grouped, possibly with events scheduled for other
> > per-CPU PMUs on the same CPU. An action delivered by one of these PMUs
> > can *always* be attributed to a specific CPU but not necessarily a
> > specific task. Accessing a CPU-aware PMU is a synchronous operation.
> >
>
> I didn't understand when an action would not be attributed to a task in
> this category? If we know which CPU "enabled" the event, this should be
> possible?
I don't think that's enough from a profiling perspective because the
state of the device will be altered by other tasks. For example, the
number of misses in the L2 cache for a given task is going to be
affected by the other tasks running in the system, even if we only
profile during the period in which the task is running. I think it's
better to permit only per-CPU events in this case, attributing the
samples to tasks and letting the user work out what's going on.
> > Now, we currently support the core CPU PMU, which is obviously a CPU-aware PMU
> > that generates interrupts as actions. Another example of a CPU-aware PMU is
> > the VFP PMU in Qualcomm's Scorpion. The next step (moving outwards from the
> > core) is to add support for L2 cache controllers. I expect most of these to be
> > Counting System PMUs, although I can envisage them being CPU-aware if built
> > into the core with enough extra hardware.
>
> For the Qcom L2CC, the PMU can be configured to filter events based on
> specific masters. This fact would make it a CPU-aware PMU, although it's
> NOT per-core and triggers SPIs.
I have a similar issue with this; filtering based on the master *isn't*
the same as having per-master samples, simply because the combined
effect of the masters will influence all of the results. That doesn't
mean that the filtering feature isn't useful, just that it should be
described in the event encoding rather than by pretending to support
per-CPU events.
> In such a case, I found it to be quite ugly trying to reuse the per-cpu
> data structures, especially in the interrupt handler, since the interrupt can
> trigger on a CPU where the event wasn't enabled. A cleaner approach was to
> use a separate struct pmu. However, I agree that this approach would lead
> to several pmus popping up in arch/arm.
>
I expect to see new struct pmus, I'd just like to try and identify
common patterns before the code mounts up. I imagine that we'll have a
struct pmu for L2 cache controllers, for example, from which people can
hang their own specific accessors. Whether or not we can hang other
system PMUs off such an implementation is unclear to me at the moment.
> So, I think we could add another category for such highly configurable
> PMUs, which are not per-core, but have enough extra h/w to make them
> cpu-aware. These need to be treated differently by arm perf, because they
> can't really use the per-cpu data structures of the cpu-aware pmus and as
> such can't easily re-use many of the functions.
>
> In fact, most of Qcomm PMUs (bus, fabric, etc.) will fall under this new
> category. At first glance, these would appear to fall under the System PMU
> (counting) category, but they don't because of the extra h/w logic that
> allows origin filtering of events.
I think they do; they just have some event encodings that monitor events
specific to particular masters (but may not necessarily be attributable
to them).
> Also, having all this origin filtering logic helps us track per-process
> events on these PMU's, for which we need extra functions to decide how to
> allocate and configure counters based on which context (task, cpu) the
> event is enabled in.
I don't think we should go down the road of splitting up the counters on
a given PMU so that they can be shared between different tasks on
different CPUs. There will probably be a single control register, so
keeping everything in sync will be impossible.
Will
* [RFC] Extending ARM perf-events for multiple PMUs
2011-04-09 11:40 ` Peter Zijlstra
2011-04-11 11:29 ` Will Deacon
@ 2011-04-11 18:00 ` Ashwin Chaugule
2011-04-12 7:39 ` Ming Lei
2 siblings, 0 replies; 17+ messages in thread
From: Ashwin Chaugule @ 2011-04-11 18:00 UTC (permalink / raw)
To: linux-arm-kernel
Hi Peter,
On 4/9/2011 7:40 AM, Peter Zijlstra wrote:
>
> So the perf core supports multiple hardware PMUs, but currently only one
> of them can do per-task sampling; if you've got multiple CPU-local PMUs
> we need to do a little extra.
Would this restriction be removed if task_struct->perf_event_ctxp[] had
more entries for each PMU_TYPE?
On a related note, I haven't had time yet to look deeper for a proper fix
for the stale perf_event_ctx pointer issue, other than the one-liner in
perf_task_event_sched_in() we discussed, but that'll need fixing
before more PMUs start popping up.
> What we do for the events is map the provided CPU number to a memory
> controller (cpu_to_node() does that for our case), and then use the
> first online cpu in that node mask to drive the event.
I suppose you're referring to the uncore patches here.
So only one CPU has access to the uncore PMU at a time?
Cheers,
Ashwin
--
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
* [RFC] Extending ARM perf-events for multiple PMUs
2011-04-11 18:00 ` Will Deacon
@ 2011-04-11 20:46 ` Ashwin Chaugule
2011-04-12 18:08 ` Will Deacon
0 siblings, 1 reply; 17+ messages in thread
From: Ashwin Chaugule @ 2011-04-11 20:46 UTC (permalink / raw)
To: linux-arm-kernel
Hi Will,
On 4/11/2011 2:00 PM, Will Deacon wrote:
>
> I don't think that's enough from a profiling perspective because the
> state of the device will be altered by other tasks. For example, the
> number of misses in the L2 cache for a given task is going to be
> affected by the other tasks running in the system, even if we only
> profile during the period in which the task is running.
I'm probably missing something. If another task affects the cache
contents, this will manifest as an increase in cache misses/hits for the
task that is being profiled during this interval. This will also happen
when interrupts trigger and wipe out cache lines anyway. IOW, a counter
that's counting events from CPU0 will not increment if the event it is
counting gets affected by CPU1.
>>
>> For the Qcom L2CC, the PMU can be configured to filter events based on
>> specific masters. This fact would make it a CPU-aware PMU, although its
>> NOT per-core and triggers SPI's.
>
> I have a similar issue with this; filtering based on the master *isn't*
> the same as having per-master samples, simply because the combined
> effect of the masters will influence all of the results. That doesn't
> mean that the filtering feature isn't useful, just that it should be
> described in the event encoding rather than by pretending to support
> per-CPU events.
I'll talk with the h/w guys who designed this, but from the spec it seems
like each event either has an Origin ID, or is Origin independent. If the
event has an OID, then the counter should *not* be counting the effect of
the other masters.
> I expect to see new struct pmus, I'd just like to try and identify
> common patterns before the code mounts up. I imagine that we'll have a
> struct pmu for L2 cache controllers, for example, from which people can
> hang their own specific accessors. Whether or not we can hang other
> system PMUs off such an implementation is unclear to me at the moment.
Agreed. Thanks for initiating the discussion on LAKML.
>
>> So, I think we could add another category for such highly configurable
>> PMUs, which are not per-core, but have enough extra h/w to make them
>> cpu-aware. These need to be treated differently by arm perf, because they
>> can't really use the per-cpu data structures of the cpu-aware pmus and as
>> such can't easily re-use many of the functions.
>>
>> In fact, most of Qcomm PMUs (bus, fabric, etc.) will fall under this new
>> category. At first glance, these would appear to fall under the System PMU
>> (counting) category, but they don't because of the extra h/w logic that
>> allows origin filtering of events.
>
> I think they do; they just have some event encodings that monitor events
> specific to particular masters (but may not necessarily be attributable
> to them).
In our case, I think they are attributable, but I'll reconfirm by talking
to the h/w designers. Verifying these counter outputs is another challenge
I'm pursuing.
>
>> Also, having all this origin filtering logic helps us track per-process
>> events on these PMUs, for which we need extra functions to decide how to
>> allocate and configure counters based on which context (task, cpu) the
>> event is enabled in.
>
> I don't think we should go down the road of splitting up the counters on
> a given PMU so that they can be shared between different tasks on
> different CPUs. There will probably be a single control register, so
> keeping everything in sync will be impossible.
So, for the L2CC on the 8660 (AFAIK, even the bus/fabric monitors), each
counter has its own origin filter. So the various counters can count from
different masters at different profiling intervals.
Cheers,
Ashwin
--
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
* [RFC] Extending ARM perf-events for multiple PMUs
2011-04-09 11:40 ` Peter Zijlstra
2011-04-11 11:29 ` Will Deacon
2011-04-11 18:00 ` Ashwin Chaugule
@ 2011-04-12 7:39 ` Ming Lei
2011-04-12 10:30 ` Peter Zijlstra
2 siblings, 1 reply; 17+ messages in thread
From: Ming Lei @ 2011-04-12 7:39 UTC (permalink / raw)
To: linux-arm-kernel
On Sat, 09 Apr 2011 13:40:35 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2011-04-08 at 18:15 +0100, Will Deacon wrote:
> > Hello,
> >
> > Currently the perf code on ARM only caters for the core CPU PMU. In
> > actual fact, this only represents a subset of the performance
> > monitoring hardware available in real SoCs and is arguably the
> > simplest to interact with. This long-winded email is an attempt to
> > classify the possible event sources that we might see so that we
> > can have clean support for them in the future. I think that the
> > perf tools might also need tweaking slightly so they can handle
> > PMUs which can't service per-cpu or per-task events (instead, you
> > essentially have a single system-wide event).
> >
> > We can split PMUs up into two basic categories (an `action' here is
> > usually an interrupt but could be defined as any state recording or
> > signalling operation).
> >
> > (1) CPU-aware PMUs
> >
> > This type of PMU is typically per-CPU and accessed via
> > co-processor instructions. Actions may be delivered as PPIs. Events
> > scheduled onto a CPU-aware PMU can be grouped, possibly with events
> > scheduled for other per-CPU PMUs on the same CPU. An action
> > delivered by one of these PMUs can *always* be attributed to a
> > specific CPU but not necessarily a specific task. Accessing a
> > CPU-aware PMU is a synchronous operation.
> >
> > (2) System PMUs
> >
> > System PMUs are typically outside of the CPU domain. Bus
> > monitors, GPU counters and external L2 cache controller monitors
> > are all system PMUs. Actions delivered by these PMUs cannot be
> > attributed to a particular CPU and certainly cannot be associated
> > with a particular piece of code. They are memory-mapped and cannot
> > be grouped with other PMUs of any type. Accesses to a system PMU
> > may be asynchronous.
> >
> > System PMUs can be further split up into `counting' and
> > `filtering' PMUs:
> >
> > (i) Counting PMUs
> >
> > Counting PMUs increment a counter whenever a particular
> > event occurs and can deliver an action periodically (for example,
> > on overflow or after a certain amount of time has passed). The
> > event types are hardwired as particular, discrete events such as
> > `cycles' or `misses'.
> >
> > (ii) Filtering PMUs
> >
> > Filtering PMUs respond to a query. For example, `generate
> > an action whenever you see a bus access which fits the following
> > criteria'. The action may simply be to increment a counter, in
> > which case this PMU can act as a highly configurable counting PMU,
> > where the event types are dynamic.
>
> I don't see this distinction, both will have to count, and telling it
> what to count is a function of perf_event_attr::config* and how the
> hardware implements that is of no interest.
>
> > Now, we currently support the core CPU PMU, which is obviously a
> > CPU-aware PMU that generates interrupts as actions. Another example
> > of a CPU-aware PMU is the VFP PMU in Qualcomm's Scorpion. The next
> > step (moving outwards from the core) is to add support for L2 cache
> > controllers. I expect most of these to be Counting System PMUs,
> > although I can envisage them being CPU-aware if built into the core
> > with enough extra hardware.
> >
> > Implementing support for CPU-aware PMUs can be done alongside the
> > current CPU PMU code and much of the code can be shared with the
> > core PMU providing that the event namespaces are distinct.
> >
> > Implementing support for Counting System PMUs can reuse a lot of the
> > functionality in perf_event.c (for example, struct arm_pmu) but the
> > low-level accessors should be separate and a new struct pmu should
> > be used. This means that we will want multiple instances of struct
> > arm_pmu and a method to translate from a struct pmu to a struct
> > arm_pmu. We'll also need to clean up some of the armpmu_* functions
> > to ensure the correct indirection is used when invoking per-pmu
> > functions.
> >
> > Finally, the Filtering System PMUs will probably need their own
> > struct pmu instances for each device and can make use of the
> > dynamic sysfs interface via perf_pmu_register. I don't see any
> > scope for common code in this space yet.
> >
> > I appreciate this is especially hand-wavy stuff, but I'd like to
> > check we've got all of our bases covered before introducing system
> > PMUs to ARM. The first victim is the PL310 L2CC on the Cortex-A9.
>
> Right, so x86 has this too, and we have a fairly complete
> implementation of the Nehalem/Westmere uncore PMU, which is a
> NODE/memory controller PMU. Afaik we're mostly waiting on Intel to
> clarify some hardware details.
>
> So the perf core supports multiple hardware PMUs, but currently only
> one of them can do per-task sampling; if you've got multiple CPU-local
> PMUs we need to do a little extra.
>
> See perf_pmu_register(); what, say, a memory controller PMU would do is
> something like:
>
> perf_pmu_register(&node_pmu, "node", -1);
>
> that will create a /sys/bus/event_source/devices/node/ directory which
> will host the PMU details for userspace. This is currently
> limited to a single 'type' file which includes the number to provide
> perf_event_attr::type, but could (and should) be extended to provide
> some important events as well, which will provide the bits to put
> in perf_event_attr::config.
>
> I just haven't figured out a way to dynamically add files/directories
Seems not very difficult: we have pmu_bus already, so introduce the
.match to find the driver according to the device name, then implement a
driver for the pmu device to add the needed attributes (files).
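Something along these lines, i.e. giving the existing pmu_bus a .match
(untested sketch, includes omitted):

    /* Match a driver to a pmu device by name, so the driver's
     * probe() can add the extra attribute files.
     */
    static int pmu_bus_match(struct device *dev, struct device_driver *drv)
    {
            return !strcmp(dev_name(dev), drv->name);
    }

    static struct bus_type pmu_bus = {
            .name   = "event_source",
            .match  = pmu_bus_match,
    };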
> in the whole struct device sysfs muck (that also pleases the
> driver/sysfs folks). Nor have we agreed on a sane layout for such
> events there.
You mean we can find the event names here and pass them to perf -e?
> What we do for the events is map the provided CPU number to a memory
> controller (cpu_to_node() does that for our case), and then use the
> first online cpu in that node mask to drive the event.
>
> If you've got system wide things like GPUs, where every cpu maps to
> the same device, simply use the first online cpu and create a pmu
> instance per device.
>
> Now, I've also wanted to make symlinks in the regular sysfs topology
> to these bus/event_source nodes, but again, that's something I've not
> managed to find out how to do yet.
>
> That is, for the currently existing "cpu" node, I'd like to have:
>
> /sys/devices/system/cpu/cpuN/event_source
> -> /sys/bus/event_source/devices/cpu
>
> And similar for the node thing:
>
> /sys/devices/system/node/nodeN/event_source
> -> /sys/bus/event_source/devices/node
>
> And for a GPU we could have:
>
> /sys/devices/pci0000:00/0000:00:02.0/drm/card0/event_source
> -> /sys/bus/event_source/devices/IGC0
* [RFC] Extending ARM perf-events for multiple PMUs
2011-04-12 7:39 ` Ming Lei
@ 2011-04-12 10:30 ` Peter Zijlstra
2011-04-12 11:12 ` Ming Lei
0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2011-04-12 10:30 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, 2011-04-12 at 15:39 +0800, Ming Lei wrote:
> > I just haven't figured out a way to dynamically add files/directories
>
> Seems not very difficult: we have pmu_bus already, so introduce the
> .match to find the driver according to the device name, then implement a
> driver for the pmu device to add the needed attributes (files).
It probably isn't very hard, but I'm not sysfs/driver skilled and
haven't been able to put a lot of time in.
> > in the whole struct device sysfs muck (that also pleases the
> > driver/sysfs folks). Nor have we agreed on a sane layout for such
> > events there.
>
> You mean we can find the event names here and pass them to perf -e?
That's the purpose, yes. The intermediate problem is how to represent
these events in the sysfs hierarchy such that not every pmu
implementation does it differently.
* [RFC] Extending ARM perf-events for multiple PMUs
2011-04-12 10:30 ` Peter Zijlstra
@ 2011-04-12 11:12 ` Ming Lei
0 siblings, 0 replies; 17+ messages in thread
From: Ming Lei @ 2011-04-12 11:12 UTC (permalink / raw)
To: linux-arm-kernel
Hi Peter,
2011/4/12 Peter Zijlstra <peterz@infradead.org>:
> On Tue, 2011-04-12 at 15:39 +0800, Ming Lei wrote:
>> > I just haven't figured out a way to dynamically add files/directories
>>
>> Seems not very difficult: we have pmu_bus already, so introduce the
>> .match to find the driver according to the device name, then implement a
>> driver for the pmu device to add the needed attributes (files).
>
> It probably isn't very hard, but I'm not sysfs/driver skilled and
> haven't been able to put a lot of time in.
>
>> > in the whole struct device sysfs muck (that also pleases the
>> > driver/sysfs folks). Nor have we agreed on a sane layout for such
>> > events there.
>>
>> You mean we can find the event names here and pass them to perf -e?
>
> That's the purpose, yes. The intermediate problem is how to represent
> these events in the sysfs hierarchy such that not every pmu
> implementation does it differently.
How about the below idea?
- for each pmu device, one attribute group (directory) named 'events'
is created to accommodate all events this pmu can handle, such as
/sys/devices/cpu/events/ for the 'cpu' pmu
- perf_pmu_register will populate all events that this pmu can handle under
the 'events' directory, using information from the defined pmu instance
- the 'perf' utility can get all events for each pmu by walking the
'events' directory of all pmu devices, which can be found under
'/sys/bus/event_source/devices'.
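A rough kernel-side sketch of one such 'events' entry, with the name
and encoding invented purely for illustration:

    static ssize_t cycles_show(struct device *dev,
                               struct device_attribute *attr, char *buf)
    {
            return sprintf(buf, "config=0x11\n");   /* invented encoding */
    }
    static DEVICE_ATTR(cycles, 0444, cycles_show, NULL);

    static struct attribute *pmu_event_attrs[] = {
            &dev_attr_cycles.attr,
            NULL,
    };

    static struct attribute_group pmu_events_group = {
            .name   = "events",     /* the events/ sub-directory */
            .attrs  = pmu_event_attrs,
    };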
thanks,
--
Ming Lei
* [RFC] Extending ARM perf-events for multiple PMUs
2011-04-11 17:44 ` Ashwin Chaugule
@ 2011-04-12 17:45 ` Will Deacon
0 siblings, 0 replies; 17+ messages in thread
From: Will Deacon @ 2011-04-12 17:45 UTC (permalink / raw)
To: linux-arm-kernel
Hi Ashwin,
On Mon, 2011-04-11 at 18:44 +0100, Ashwin Chaugule wrote:
> > Sure, fundamentally we're just writing bits rather than interpreting
> > them. The reason I mention the difference is that filtering PMUs will
> > always need their own struct pmu because of the lack of an event
> > namespace. The other problem is only an issue for some userspace tools
> > (like Oprofile) which require lists of events and their hex codes.
> >
>
> If you mean namespace = perf_event_attr::config, its 64 bits + another 64
> bits of config_base + event_base on ARM ? Not too sure, but it would seem
> like that should be enough to setup such event chaining.
If you have a filtering PMU on a bus with large physical addresses, by
the time you've specified an address range you've already used up a
decent proportion of those bits so I don't think we should restrict
ourselves if we don't have to.
Will
* [RFC] Extending ARM perf-events for multiple PMUs
2011-04-11 20:46 ` Ashwin Chaugule
@ 2011-04-12 18:08 ` Will Deacon
2011-04-13 5:09 ` Ashwin Chaugule
0 siblings, 1 reply; 17+ messages in thread
From: Will Deacon @ 2011-04-12 18:08 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, 2011-04-11 at 21:46 +0100, Ashwin Chaugule wrote:
> Hi Will,
Hello,
> On 4/11/2011 2:00 PM, Will Deacon wrote:
> >
> > I don't think that's enough from a profiling perspective because the
> > state of the device will be altered by other tasks. For example, the
> > number of misses in the L2 cache for a given task is going to be
> > affected by the other tasks running in the system, even if we only
> > profile during the period in which the task is running.
>
> I'm probably missing something. If another task affects the cache
> contents, this will manifest as an increase in cache misses/hits for the
> task that is being profiled during this interval. This will also happen
> when interrupts trigger and wipe out cache lines anyway. IOW, a counter
> that's counting events from CPU0 will not increment if the event it is
> counting gets affected by CPU1.
How can you enforce this? If a task on CPU1 has a large working set and
clobbers all of L2, then a task on CPU0 will have no choice but to miss
at L2 if it misses at L1. I think this scenario is similar for all PMUs
that have multiple masters.
> >>
> >> For the Qcom L2CC, the PMU can be configured to filter events based on
> >> specific masters. This fact would make it a CPU-aware PMU, although it's
> >> NOT per-core and triggers SPIs.
> >
> > I have a similar issue with this; filtering based on the master *isn't*
> > the same as having per-master samples, simply because the combined
> > effect of the masters will influence all of the results. That doesn't
> > mean that the filtering feature isn't useful, just that it should be
> > described in the event encoding rather than by pretending to support
> > per-CPU events.
>
> I'll talk with the h/w guys who designed this, but from the spec it seems
> like each event either has an Origin ID, or is Origin independent. If the
> event has an OID, then the counter should *not* be counting the effect of
> the other masters.
Ok, some feedback from the hardware guys would be useful so we know what
we're dealing with. However, we still have some other problems for these
system PMUs if you want to allow the events to specify CPU affinity:
- What do you do if there are more masters than CPUs?
- How do you handle mixing events that can be filtered by origin with
those that can't?
So another argument for avoiding CPU affinity is simply that it
complicates the code. I think this complication is unnecessary if we can
get perf working with CPU=1, pid=-1 (I fear there may be locking issues
but I don't know yet). You can specify masters in the event encoding
instead which has the benefit of forcing userspace to think more
carefully about what they are doing (rather than erroneously attributing
samples to CPUs) and also providing more flexibility (for example, if
you have an event that counts interactions between two CPUs - which one
do you attribute it to?).
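For example, the originating master could be packed into
perf_event_attr::config next to the event number; the field layout
below is invented purely for illustration:

    #define SYS_PMU_EVENT_MASK      0x00ffULL
    #define SYS_PMU_MASTER_SHIFT    8

    static inline u64 sys_pmu_encode_event(u64 event, u64 master)
    {
            return (event & SYS_PMU_EVENT_MASK) |
                   (master << SYS_PMU_MASTER_SHIFT);
    }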
> >
> >> Also, having all this origin filtering logic helps us track per-process
> >> events on these PMUs, for which we need extra functions to decide how to
> >> allocate and configure counters based on which context (task, cpu) the
> >> event is enabled in.
> >
> > I don't think we should go down the road of splitting up the counters on
> > a given PMU so that they can be shared between different tasks on
> > different CPUs. There will probably be a single control register, so
> > keeping everything in sync will be impossible.
>
> So, for the L2CC on the 8660 (AFAIK, even the bus/fabric monitors), each
> counter has its own origin filter. So the various counters can count from
> different masters at different profiling intervals.
Ok, that tidies this problem up nicely in this case but for other PMUs
we might not be as fortunate.
Cheers,
Will
* [RFC] Extending ARM perf-events for multiple PMUs
2011-04-12 18:08 ` Will Deacon
@ 2011-04-13 5:09 ` Ashwin Chaugule
0 siblings, 0 replies; 17+ messages in thread
From: Ashwin Chaugule @ 2011-04-13 5:09 UTC (permalink / raw)
To: linux-arm-kernel
Hi Will,
On 4/12/2011 2:08 PM, Will Deacon wrote:
> Ok, some feedback from the hardware guys would be useful so we know what
> we're dealing with. However, we still have some other problems for these
> system PMUs if you want to allow the events to specify CPU affinity:
>
I don't intend to allow *all* events to specify CPU affinity, just those
that have OIDs (and there's lots of these).
Also, I confirmed with the h/w guys about the filtering logic. A correctly
configured counter won't be affected by the other masters. This works
quite well for filtering by CPUs, ergo filtering by task/cpu works with
what's already there in perf for PMUs where each counter can be filtered by
origin. It just needs some extra handling beyond what's in the arm-perf code today.
> - What do you do if there are more masters than CPUs?
Sure. For non-cpu masters, as I agreed earlier, we still need to extend
perf to allow cpu = -1 and task = -1.
> - How do you handle mixing events that can be filtered by origin with
> those that can't?
I haven't reached the point of handling events that can't be filtered by
origin. They're very few and super esoteric ;)
>
> So another argument for avoiding CPU affinity is simply that it
> complicates the code. I think this complication is unnecessary if we can
> get perf working with CPU=1, pid=-1 (I fear there may be locking issues
> but I don't know yet). You can specify masters in the event encoding
> instead which has the benefit of forcing userspace to think more
> carefully about what they are doing (rather than erroneously attributing
> samples to CPUs) and also providing more flexibility (for example, if
> you have an event that counts interactions between two CPUs - which one
> do you attribute it to?).
Guess you mean cpu = -1? I've been dealing with CPU-side events for these
PMUs since these seem to be in demand the most. The only real
complication I found was using the per-cpu data structures from the arm-perf
code for the cpu context counting. It makes things ugly in the (SPI) interrupt
handler. However, as an alternative, simpler solution, I skipped the
arm_pmu fops, registered a new PMU TYPE, and handled it separately.
Looking ahead, encoding masters in the event code makes sense. We'll need
to make perf-core code aware of this too. Currently it only seems to look
at cpu and task affinity and stores the perf_event in the appropriate
context lists.
Primarily, perf needs to be changed to allow specifying context for each
event. Currently the "-a/-p/-t" option applies to all events specified.
When cpu = -1, task = -1 for an event, we could store it in the cpu
context list of the cpu that never goes down (CPU0?). Then let the counter
spin; perf should report back the output against the appropriate raw event
code. The user should be able to understand the output.
Time for some more experiments and more thinking. Arming the PMU counters
here is not so much a problem as reporting the results back.
There's also the case where some PMUs start multiple counters for each
event. I'm thinking of leveraging event groups to report back the results as
non-cumulative output.
>>
>> So, for the L2CC on the 8660 (AFAIK, even the bus/fabric monitors), each
>> counter has its own origin filter. So the various counters can count from
>> different masters at different profiling intervals.
>
> Ok, that tidies this problem up nicely in this case but for other PMUs
> we might not be as fortunate.
>
>
Hence the suggestion to have another category in your initial email. ;)
Cheers,
Ashwin
--
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.