* [RFC] perf_events: support for uncore a.k.a. nest units
@ 2010-01-19 19:41 Corey Ashford
2010-01-20 0:44 ` Andi Kleen
` (2 more replies)
0 siblings, 3 replies; 55+ messages in thread
From: Corey Ashford @ 2010-01-19 19:41 UTC (permalink / raw)
To: LKML, Ingo Molnar, Andi Kleen, Paul Mackerras, Stephane Eranian,
Peter Zijlstra, Frederic Weisbecker, Xiao Guangrong
Cc: Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love
-----
Intro
-----
One subject that hasn't been addressed since the introduction of perf_events in
the Linux kernel is support for "uncore" or "nest" unit events. "Uncore" is the
term Intel engineers use for units that are off-core but still on the same die
as the cores; "nest" means exactly the same thing to IBM Power processor
engineers. I will use the term uncore for brevity and because it's in common
parlance, but the issues and design possibilities below are relevant to both. I
will also broaden the term: uncore will also refer to PMUs that are completely
off the processor chip altogether.
Contents
--------
1. Why support PMUs in uncore units? Is there anything interesting to look at?
2. How do uncore events differ from core events?
3. Why does a CPU need to be assigned to manage a particular uncore unit's events?
4. How do you encode uncore events?
5. How do you address a particular uncore PMU?
6. Event rotation issues with uncore PMUs
7. Other issues?
8. Feedback?
----
1. Why support PMUs in uncore units? Is there anything interesting to look at?
----
Today, many x86 chips contain uncore units, and we think that it's likely that
the trend will continue, as more devices - I/O, memory interfaces, shared
caches, accelerators, etc. - are integrated onto multi-core chips. As these
devices become more sophisticated and more workload is diverted off-core,
engineers and performance analysts are going to want to look at what's happening
in these units so that they can find bottlenecks.
In addition, we think that even off-chip I/O and interconnect devices are likely
to gain PMUs because engineers will want to find bottlenecks in their massively
parallel systems.
----
2. How do uncore events differ from core events?
----
The main difference is that uncore events are most likely not going to be tied
to a particular Linux task, or even a CPU context. Uncore units are resources
that are in some sense system-wide, though they may not really be accessible
system-wide on some architectures. In the case of accelerators and I/O devices,
it's likely they will run asynchronously from the cores, and thus keeping track
of events on a per-task basis doesn't make a lot of sense. The other existing
mode in perf_events is a per-CPU context, and it turns out that this mode does
match up well with uncore units, though the choice of which CPU to use to manage
a given uncore unit will need to be arch-dependent and may involve other
issues as well, such as minimizing access latency between the uncore unit and
the CPU which is managing it.
----
3. Why does a CPU need to be assigned to manage a particular uncore unit's events?
----
* The control registers of the uncore unit's PMU need to be read and written,
and that may be possible only from a subset of processors in the system.
* A processor is needed to rotate the event list on the uncore unit on every
tick for the purposes of event scheduling.
* Because of access latency issues, we may want the CPU to be close in locality
to the PMU.
It seems like a good idea to let the kernel decide which CPU to use to monitor a
particular uncore event, based on the location of the uncore unit, and possibly
current system load balance. The user will not want to have to figure out this
detailed information.
----
4. How do you encode uncore events?
----
Uncore events will need to be encoded in the config field of the perf_event_attr
struct using the existing PERF_TYPE_RAW encoding. 64 bits are available in the
config field, and that may be sufficient to support events on most systems.
However, due to the proliferation and added complexity of PMUs we envision, we
might want to add another 64-bit config (perhaps call it config_extra or
config2) field to encode any extra attributes that might be needed. The exact
encoding used, just as for the current encoding for core events, will be on a
per-arch and possibly per-system basis.
----
5. How do you address a particular uncore PMU?
----
This one is going to be very system- and arch-dependent, but it seems fairly
clear that we need some sort of addressing scheme that can be
system/arch-defined by the kernel.
From a hierarchical perspective, here's an example of possible uncore PMU
locations in a large system:
1) Per-core - units that are shared between all hardware threads in a core
2) Per-node - units that are shared between all cores in a node
3) Per-chip - units that are shared between all nodes in a chip
4) Per-blade - units that are shared between all chips on a blade
5) Per-rack - units that are shared between all blades in a rack
Addressing Option 1)
Reuse the cpu argument: cpu would be interpreted differently if an uncore unit
is specified (via the perf_event_attr struct's config field).
For the hypothetical system described above, we'd want to have an address that
contains enough address bits for each of the above. For example:
bits    field
------  -----
3..0    PMU number 0-15  /* specifies which of several identical PMUs is
                            being addressed */
7..4    core id    0-15
8..8    node id    0-1
11..9   chip id    0-7
16..12  blade id   0-31
23..17  rack id    0-127
These fields would be exposed via /usr/include/linux/perf_events_uncore_addr.h
(for example). How you actually assign these numbers to actual hardware is,
again, system-design dependent, and may be influenced by the use of a
hypervisor, or other software which allocates resources available to the system
dynamically.
How does the user discover the mapping between the hardware made available to
the system and the addresses shown above? Again, this is system-dependent, and
probably outside the scope of this proposal. In other words, I don't know how
to do this in a general way, though I could probably put something together for
a particular system.
Addressing Option 2)
Have the kernel create nodes for each uncore PMU in /sys/devices/system or other
pseudo file system, such as the existing /proc/device-tree on Power systems.
/sys/devices/system or /proc/device-tree could be explored by the
user tool, and the user could then specify the path of the requested PMU via a
string which the kernel could interpret. To be overly simplistic, something
like "/sys/devices/system/pmus/blade4/cpu0/vectorcopro1". If we settled on a
common tree root to use, we could specify only the relative path name,
"blade4/cpu0/vectorcopro1".
One way to provide this extra "PMU path" argument to sys_perf_event_open()
would be to add a bit to the flags argument that says we're adding a PMU path
string onto the end of the argument list.
This path-string-based addressing option seems more flexible in the long run,
and does not have as serious an issue in mapping PMUs to user space; the
kernel essentially exposes to user space all of the available PMUs for the
current partition. This might create more work on the kernel side, but should
make the system more transparent for user-space tools. Another system- or at
least arch-dependent tool would have to be written for user space to help users
navigate the device tree to find the PMU they want to use. I don't think it
would make sense to build that capability into perf, because the software would
be arch- or system-dependent.
It could be argued that we should use a common user space tree to represent PMUs
for all architectures and systems, so that the arch-independent perf code would
be able to display available uncore PMUs. That may be a goal that's very hard
to achieve because of the wide variation in architectures. Any thoughts on that?
----
6. Event rotation issues with uncore PMUs
----
Currently, the perf_events code rotates the set of events assigned to a CPU or
task on every system tick, so that event scheduling collisions on a PMU are
mitigated. This turns out to cause problems for uncore units for two reasons:
inefficiency and CPU load.
a) Rotating a single combined list of events that spans more than one PMU is inefficient.
Consider the following event list; the letter designates the PMU and the number
is the event number on that PMU.
A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4 C5
After one rotation, you can see that the event list will be:
C5 A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4
and then
C4 C5 A1 A2 A3 B1 B2 B3 B4 C1 C2 C3
Notice how the relative positions for the A and B PMU events haven't changed
even after two (or even five) rotations, so they will schedule the events in the
same order for some time. This will skew the multiplexing so that some events
will be scheduled much less often than they should or could be.
What we'd like to have happen is that events for each PMU be rotated in their
own lists. For example, before rotation:
A1 A2 A3
B1 B2 B3 B4
C1 C2 C3 C4 C5
After rotation:
A3 A1 A2
B4 B1 B2 B3
C5 C1 C2 C3 C4
We've got some ideas about how to make this happen, using either separate lists,
or placing them on separate CPUs.
b) Access to some uncore PMUs may be quite slow due to the interconnect
that is used. This can place a burden on the CPU if it is done every system tick.
This can be addressed by keeping a counter on a per-PMU-context basis that
reduces the rate of event rotations. Setting the rotation period to three, for
example, would cause event rotations in that context to happen on every third
tick, instead of every tick. We think that the kernel could measure the amount
of time it takes to do a rotation, and then dynamically decrease the rotation
rate if it's taking too long; "rotation rate throttling", in other words.
----
7. Other issues?
----
This section left blank for now.
----
8. Feedback?
----
I'd appreciate any feedback you might have on this topic. You can contact me
directly at the email address below, or better yet, reply to LKML.
--
Regards,
- Corey
Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
cjashfor@us.ibm.com
^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-01-19 19:41 [RFC] perf_events: support for uncore a.k.a. nest units Corey Ashford
@ 2010-01-20  0:44 ` Andi Kleen
  2010-01-20  1:49 ` Corey Ashford
  2010-01-20 13:34 ` Peter Zijlstra
  [not found] ` <d3f22a1003290213x7d7904an59d50eb6a8616133@mail.gmail.com>
  2 siblings, 1 reply; 55+ messages in thread
From: Andi Kleen @ 2010-01-20 0:44 UTC (permalink / raw)
To: Corey Ashford
Cc: LKML, Ingo Molnar, Andi Kleen, Paul Mackerras, Stephane Eranian,
    Peter Zijlstra, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra,
    Philip Mucci, Maynard Johnson, Carl Love

On Tue, Jan 19, 2010 at 11:41:01AM -0800, Corey Ashford wrote:
> One subject that hasn't been addressed since the introduction of
> perf_events in the Linux kernel is that of support for "uncore" or "nest"
> unit events. Uncore is the term used by the Intel engineers for their
> off-core units but are still on the same die as the cores, and "nest" means
> exactly the same thing for IBM Power processor engineers. I will use the
> term uncore for brevity and because it's in common parlance, but the issues
> and design possibilities below are relevant to both. I will also broaden
> the term by stating that uncore will also refer to PMUs that are completely
> off of the processor chip altogether.

Yes, e.g. chipsets commonly have their own PMUs too.

> The main difference is that uncore events are mostly likely not going to be
> tied to a particular Linux task, or even a CPU context. Uncore units are
> resources that are in some sense system-wide, though, they may not really
> be accessible system-wide in some architectures. In the case of
> accelerators and I/O devices, it's likely they will run asynchronously from
> the cores, and thus keeping track of events on a per-task basis doesn't
> make a lot of sense. The other existing mode in perf_events is a per-CPU
> context, and it turns out that this mode does match up with uncore units
> well, though the choice of which CPU to use to manage that uncore unit is
> going to need to be arch-dependent and may involve other issues as well,
> such as minimizing access latency between the uncore unit and the CPU which
> is managing it.

What the user needs to know is which CPUs are affected by that uncore
event. For example the integrated memory controller counters that count
local accesses should be somehow associated with the local CPUs.

> 4. How do you encode uncore events?
> ----
> Uncore events will need to be encoded in the config field of the
> perf_event_attr struct using the existing PERF_TYPE_RAW encoding. 64 bits
> are available in the config field, and that may be sufficient to support
> events on most systems. However, due to the proliferation and added
> complexity of PMUs we envision, we might want to add another 64-bit config
> (perhaps call it config_extra or config2) field to encode any extra
> attributes that might be needed. The exact encoding used, just as for the
> current encoding for core events, will be on a per-arch and possibly
> per-system basis.

I don't think a raw hex number will scale anywhere. You'll need a human
readable event list / sub event masks with help texts.

Often uncore events have specific restrictions, and that needs
to be enforced somewhere too.

Doing that all in a clean way that is also usable
by programs likely needs a lot more thinking.

> bits field
> ------ -----
> 3..0 PMU number 0-15 /* specifies which of several identical PMUs being
> addressed */
> 7..4 core id 0-15
> 8..8 node id 0-1
> 11..9 chip id 0-7
> 16..12 blade id 0-31
> 23..17 rack id 0-128

Such a compressed addressing scheme doesn't seem very future proof.
e.g. 4 bits for the core is already obsolete (see the "80 core chip"
that was recently announced)

> probably put something together for a particular system.
>
> Addressing Option 2)
>
> Have the kernel create nodes for each uncore PMU in /sys/devices/system or
> other pseudo file system, such as the existing /proc/device-tree on Power
> systems. /sys/devices/system or /proc/device-tree could be explored by the
> user tool, and the user could then specify the path of the requested PMU
> via a string which the kernel could interpret. To be overly simplistic,
> something like "/sys/devices/system/pmus/blade4/cpu0/vectorcopro1". If we
> settled on a common tree root to use, we could specify only the relative
> path name, "blade4/cpu0/vectorcopro1".

That's a more workable scheme, but you still need to find a clean
way to describe topology (see above). The existing examples in sysfs
are unfortunately all clumsy imho.

-Andi

--
ak@linux.intel.com -- Speaking for myself only.
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-01-20  0:44 ` Andi Kleen
@ 2010-01-20  1:49 ` Corey Ashford
  2010-01-20  9:35 ` Andi Kleen
  0 siblings, 1 reply; 55+ messages in thread
From: Corey Ashford @ 2010-01-20 1:49 UTC (permalink / raw)
To: Andi Kleen
Cc: LKML, Ingo Molnar, Paul Mackerras, Stephane Eranian,
    Peter Zijlstra, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra,
    Philip Mucci, Maynard Johnson, Carl Love

On 1/19/2010 4:44 PM, Andi Kleen wrote:
> On Tue, Jan 19, 2010 at 11:41:01AM -0800, Corey Ashford wrote:
>> 4. How do you encode uncore events?
>> ----
>> Uncore events will need to be encoded in the config field of the
>> perf_event_attr struct using the existing PERF_TYPE_RAW encoding. 64 bits
>> are available in the config field, and that may be sufficient to support
>> events on most systems. However, due to the proliferation and added
>> complexity of PMUs we envision, we might want to add another 64-bit config
>> (perhaps call it config_extra or config2) field to encode any extra
>> attributes that might be needed. The exact encoding used, just as for the
>> current encoding for core events, will be on a per-arch and possibly
>> per-system basis.
>
> I don't think a raw hex number will scale anywhere. You'll need a human
> readable event list / sub event masks with help texts.
>
> Often uncore events have specific restrictions, and that needs
> to be enforced somewhere too.
>
> Doing that all in a clean way that is also usable
> by programs likely needs a lot more thinking.

I left out one critical detail here: I had in mind that we'd be using a
library like libpfm for handling the issue of event names + attributes to
raw code translation. In fact, we are using libpfm today for this purpose
in the PAPI/perf_events substrate implementation.

>> bits field
>> ------ -----
>> 3..0 PMU number 0-15 /* specifies which of several identical PMUs being
>> addressed */
>> 7..4 core id 0-15
>> 8..8 node id 0-1
>> 11..9 chip id 0-7
>> 16..12 blade id 0-31
>> 23..17 rack id 0-128
>
> Such a compressed addressing scheme doesn't seem very future proof.
> e.g. core 4 bits for the core is already obsolete (see the "80 core chip" that
> was recently announced)

Agreed. If the designer is very generous with the size of each field, it
could hold up for quite a while, but still there's a problem with relating
these addresses to actual hardware.

>> probably put something together for a particular system.
>>
>> Addressing Option 2)
>>
>> Have the kernel create nodes for each uncore PMU in /sys/devices/system or
>> other pseudo file system, such as the existing /proc/device-tree on Power
>> systems. /sys/devices/system or /proc/device-tree could be explored by the
>> user tool, and the user could then specify the path of the requested PMU
>> via a string which the kernel could interpret. To be overly simplistic,
>> something like "/sys/devices/system/pmus/blade4/cpu0/vectorcopro1". If we
>> settled on a common tree root to use, we could specify only the relative
>> path name, "blade4/cpu0/vectorcopro1".
>
> That's a more workable scheme, but you still need to find a clean
> way to describe topology (see above). The existing examples in sysfs
> are unfortuately all clumpsy imho.

Yes, I agree. Also it's easy to construct a system design that doesn't
have a hierarchical topology. A simple example would be a cluster of 32
nodes, each of which is connected to its 31 neighbors. Perhaps for the
purposes of just enumerating PMUs, a tree might be sufficient, but it's not
clear to me that it is mathematically sufficient for all topologies, not to
mention if it's intuitive enough to use. For example,
highly-interconnected components might require that PMU leaf nodes be
duplicated in multiple branches, i.e. PMU paths might not be unique in some
topologies.

I'm certainly open to better alternatives!

Thanks for your thoughts,
- Corey
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-01-20  1:49 ` Corey Ashford
@ 2010-01-20  9:35 ` Andi Kleen
  2010-01-20 19:28 ` Corey Ashford
  0 siblings, 1 reply; 55+ messages in thread
From: Andi Kleen @ 2010-01-20 9:35 UTC (permalink / raw)
To: Corey Ashford
Cc: Andi Kleen, LKML, Ingo Molnar, Paul Mackerras, Stephane Eranian,
    Peter Zijlstra, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra,
    Philip Mucci, Maynard Johnson, Carl Love

> Yes, I agree. Also it's easy to construct a system design that doesn't
> have a hierarchical topology. A simple example would be a cluster of 32
> nodes, each of which is connected to its 31 neighbors. Perhaps for the

I doubt it's needed or useful to describe all details of an interconnect.

If detailed distance information is needed, a simple table like
the SLIT table exported by ACPI would seem easier to handle.

But at least some degree of locality (e.g. "local memory controller")
would make sense.

> purposes of just enumerating PMUs, a tree might be sufficient, but it's not
> clear to me that it is mathematically sufficient for all topologies, not to
> mention if it's intuitive enough to use. For example,
> highly-interconnected components might require that PMU leaf nodes be
> duplicated in multiple branches, i.e. PMU paths might not be unique in some
> topologies.

We already have cyclical graphs in sysfs using symlinks. I'm not
sure they are all that easy to parse/handle, but at least they
can be described.

-Andi

--
ak@linux.intel.com -- Speaking for myself only.
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-01-20  9:35 ` Andi Kleen
@ 2010-01-20 19:28 ` Corey Ashford
  0 siblings, 0 replies; 55+ messages in thread
From: Corey Ashford @ 2010-01-20 19:28 UTC (permalink / raw)
To: Andi Kleen
Cc: LKML, Ingo Molnar, Paul Mackerras, Stephane Eranian,
    Peter Zijlstra, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra,
    Philip Mucci, Maynard Johnson, Carl Love

On 1/20/2010 1:35 AM, Andi Kleen wrote:
>> Yes, I agree. Also it's easy to construct a system design that doesn't
>> have a hierarchical topology. A simple example would be a cluster of 32
>> nodes, each of which is connected to its 31 neighbors. Perhaps for the
>
> I doubt it's needed or useful to describe all details of an interconnect.

At this point, I don't see a need for that level of detail either.

> If detailed distance information is needed a simple table like
> the SLIT table exported by ACPI would seem easier to handle.

Thanks for the pointer. I didn't know about the ACPI SLIT and SRAT tables
until your post. Having had a quick look at them, I don't think they'd be
that helpful to us, at least at this point.

> But at least some degree of locality (e.g. "local memory controller")
> would make sense.

I think locality could be determined by looking at the device tree. For
example, a memory controller for a particular processor chip would be a
subdirectory of that chip.

>> purposes of just enumerating PMUs, a tree might be sufficient, but it's not
>> clear to me that it is mathematically sufficient for all topologies, not to
>> mention if it's intuitive enough to use. For example,
>> highly-interconnected components might require that PMU leaf nodes be
>> duplicated in multiple branches, i.e. PMU paths might not be unique in some
>> topologies.
>
> We already have cyclical graphs in sysfs using symlinks. I'm not
> sure they are all that easy to parse/handle, but at least they
> can be described.

Good point.

--
Regards,
- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
cjashfor@us.ibm.com
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-01-19 19:41 [RFC] perf_events: support for uncore a.k.a. nest units Corey Ashford
  2010-01-20  0:44 ` Andi Kleen
@ 2010-01-20 13:34 ` Peter Zijlstra
  2010-01-20 21:33 ` Peter Zijlstra
  [not found] ` <d3f22a1003290213x7d7904an59d50eb6a8616133@mail.gmail.com>
  2 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2010-01-20 13:34 UTC (permalink / raw)
To: Corey Ashford
Cc: LKML, Ingo Molnar, Andi Kleen, Paul Mackerras, Stephane Eranian,
    Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci,
    Maynard Johnson, Carl Love

On Tue, 2010-01-19 at 11:41 -0800, Corey Ashford wrote:
> ----
> 3. Why does a CPU need to be assigned to manage a particular uncore unit's events?
> ----
>
> * The control registers of the uncore unit's PMU need to be read and written,
> and that may be possible only from a subset of processors in the system.
> * A processor is needed to rotate the event list on the uncore unit on every
> tick for the purposes of event scheduling.
> * Because of access latency issues, we may want the CPU to be close in locality
> to the PMU.
>
> It seems like a good idea to let the kernel decide which CPU to use to monitor a
> particular uncore event, based on the location of the uncore unit, and possibly
> current system load balance. The user will not want to have to figure out this
> detailed information.

Well, to some extent the user will have to participate. For example which
uncore pmu will be selected depends on the cpu you're attaching the event
to according to the cpu to node map.

Furthermore the intel uncore thing has curious interrupt routing
capabilities which could be tied into this mapping.

> ----
> 4. How do you encode uncore events?
> ----
> Uncore events will need to be encoded in the config field of the perf_event_attr
> struct using the existing PERF_TYPE_RAW encoding. 64 bits are available in the
> config field, and that may be sufficient to support events on most systems.
> However, due to the proliferation and added complexity of PMUs we envision, we
> might want to add another 64-bit config (perhaps call it config_extra or
> config2) field to encode any extra attributes that might be needed. The exact
> encoding used, just as for the current encoding for core events, will be on a
> per-arch and possibly per-system basis.

Lets cross that bridge when we get there.

> ----
> 5. How do you address a particular uncore PMU?
> ----
>
> This one is going to be very system- and arch-dependent, but it seems fairly
> clear that we need some sort of addressing scheme that can be
> system/arch-defined by the kernel.
>
> From a hierarchical perspective, here's an example of possible uncore PMU
> locations in a large system:
>
> 1) Per-core - units that are shared between all hardware threads in a core
> 2) Per-node - units that are shared between all cores in a node
> 3) Per-chip - units that are shared between all nodes in a chip
> 4) Per-blade - units that are shared between all chips on a blade
> 5) Per-rack - units that are shared between all blades in a rack

So how about PERF_TYPE_{CORE,NODE,SOCKET} like things?

> ----
> 6. Event rotation issues with uncore PMUs
> ----
>
> Currently, the perf_events code rotates the set of events assigned to a CPU or
> task on every system tick, so that event scheduling collisions on a PMU are
> mitigated. This turns out to cause problems for uncore units for two reasons -
> inefficiency and CPU load.

Well, if you give these things a cpumask and put them all onto the
context of first cpu of that mask things seem to collect nicely.

> b) Access to some PMU uncore units may be quite slow due to the interconnect
> that is used. This can place a burden on the CPU if it is done every system tick.
>
> This can be addressed by keeping a counter, on a per-PMU context basis that
> reduces the rate of event rotations. Setting the rotation period to three, for
> example, would cause event rotations in that context to happen on every third
> tick, instead of every tick. We think that the kernel could measure the amount
> of time it is taking to do a rotate, and then dynamically decrease the rotation
> rate if it's taking too long; "rotation rate throttling" in other words.

The better solution is to generalize the whole rr on tick scheme (which
has already been discussed).
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-01-20 13:34 ` Peter Zijlstra
@ 2010-01-20 21:33 ` Peter Zijlstra
  2010-01-20 23:23 ` Corey Ashford
  2010-01-21  8:47 ` stephane eranian
  0 siblings, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2010-01-20 21:33 UTC (permalink / raw)
To: Corey Ashford
Cc: LKML, Ingo Molnar, Andi Kleen, Paul Mackerras, Stephane Eranian,
    Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci,
    Maynard Johnson, Carl Love

On Wed, 2010-01-20 at 14:34 +0100, Peter Zijlstra wrote:
> So how about PERF_TYPE_{CORE,NODE,SOCKET} like things?

OK, so I read most of the intel uncore stuff, and it seems to suggest
you need a regular pmu event to receive uncore events (chained setup),
this seems rather retarded since it wastes a perfectly good pmu event
and makes configuring all this more intricate...

Ah well, nothing to be done about that I guess..
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-01-20 21:33 ` Peter Zijlstra
@ 2010-01-20 23:23 ` Corey Ashford
  2010-01-21  7:21 ` Ingo Molnar
  2010-01-21  8:36 ` Peter Zijlstra
  1 sibling, 2 replies; 55+ messages in thread
From: Corey Ashford @ 2010-01-20 23:23 UTC (permalink / raw)
To: Peter Zijlstra
Cc: LKML, Ingo Molnar, Andi Kleen, Paul Mackerras, Stephane Eranian,
    Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci,
    Maynard Johnson, Carl Love

On 1/20/2010 1:33 PM, Peter Zijlstra wrote:
> On Wed, 2010-01-20 at 14:34 +0100, Peter Zijlstra wrote:
>
>> So how about PERF_TYPE_{CORE,NODE,SOCKET} like things?
>
> OK, so I read most of the intel uncore stuff, and it seems to suggest
> you need a regular pmu event to receive uncore events (chained setup),
> this seems rather retarded since it wastes a perfectly good pmu event
> and makes configuring all this more intricate...
>
> A well, nothing to be done about that I guess..

Yes, we have a similar situation where in addition to events that are
counted on core PMU counters, we also have counters that are off-core; in
some cases the counters are in off-core units which take their actual
events from other off-core units, in addition to their own events. So you
can see that this can be almost arbitrarily complex.

As for the PERF_TYPE_{CORE,NODE,SOCKET} idea, that could still work, even
though, for example, a socket event may be counted on a core PMU. Using
more encodings for the type field, as you've suggested, would allow us to
reuse the 64-bit config space multiple times.

Were you thinking that with the type field we'd still re-use the "cpu"
argument for the actual pmu address within the PERF_TYPE_* space? If so,
that's an interesting idea, but I think it still leaves open the problem of
how to actually relate those addresses to the real hardware, especially in
the case of using a hypervisor which has provided you a small subset of the
physical hardware in the system.

I really think we need some sort of data structure which is passed from the
kernel to user space to represent the topology of the system, and give
useful information to be able to identify each PMU node. Whether this is
done with a sysfs-style tree, a table in a file, XML, etc... it doesn't
really matter much, but it needs to be something that can be parsed
relatively easily and *contains just enough information* for the user to be
able to correctly choose PMUs, and for the kernel to be able to relate that
back to actual PMU hardware.

In our case, we are looking at /proc/device-tree, and it actually does
appear to contain enough information for us. However, since
/proc/device-tree is not available anywhere but Power arch
(/proc/device-tree originates from a data structure passed into the OS from
the Open Firmware) we'd like to have a more general approach that can be
used on x86 and other arches.

- Corey
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-01-20 23:23 ` Corey Ashford
@ 2010-01-21  7:21 ` Ingo Molnar
  2010-01-21 19:13 ` Corey Ashford
  0 siblings, 1 reply; 55+ messages in thread
From: Ingo Molnar @ 2010-01-21 7:21 UTC (permalink / raw)
To: Corey Ashford
Cc: Peter Zijlstra, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian,
    Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci,
    Maynard Johnson, Carl Love, Steven Rostedt,
    Arnaldo Carvalho de Melo, Masami Hiramatsu

* Corey Ashford <cjashfor@linux.vnet.ibm.com> wrote:

> I really think we need some sort of data structure which is passed from the
> kernel to user space to represent the topology of the system, and give
> useful information to be able to identify each PMU node. Whether this is
> done with a sysfs-style tree, a table in a file, XML, etc... it doesn't
> really matter much, but it needs to be something that can be parsed
> relatively easily and *contains just enough information* for the user to be
> able to correctly choose PMUs, and for the kernel to be able to relate that
> back to actual PMU hardware.

The right way would be to extend the current event description under
/debug/tracing/events with hardware descriptors and (maybe) to formalise this
into a separate /proc/events/ or into a separate filesystem.

The advantage of this is that in the grand scheme of things we _really_ dont
want to limit performance events to 'hardware' hierarchies, or to
devices/sysfs, some existing /proc scheme, or any other arbitrary (and
fundamentally limiting) object enumeration.

We want a unified, logical enumeration of all events and objects that we care
about from a performance monitoring and analysis point of view, shaped for the
purpose of and parsed by perf user-space.

And since the current event descriptors are already rather rich as they
enumerate all sorts of things:

 - tracepoints
 - hw-breakpoints
 - dynamic probes

etc., and are well used by tooling we should expand those with real hardware
structure.

Thanks,

	Ingo
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
From: Corey Ashford @ 2010-01-21 19:13 UTC
To: Ingo Molnar

On 1/20/2010 11:21 PM, Ingo Molnar wrote:
>
> * Corey Ashford <cjashfor@linux.vnet.ibm.com> wrote:
>
>> I really think we need some sort of data structure which is passed from
>> the kernel to user space to represent the topology of the system [...]
>
> The right way would be to extend the current event description under
> /debug/tracing/events with hardware descriptors and (maybe) to formalise
> this into a separate /proc/events/ or into a separate filesystem.
>
> [...]
>
> We want a unified, logical enumeration of all events and objects that we
> care about from a performance monitoring and analysis point of view, shaped
> for the purpose of and parsed by perf user-space. And since the current
> event descriptors are already rather rich as they enumerate all sorts of
> things:
>
>  - tracepoints
>  - hw-breakpoints
>  - dynamic probes
>
> etc., and are well used by tooling, we should expand those with real
> hardware structure.

This is an intriguing idea; I like the idea of generalizing all of this info
into one structure.

So you think that this structure should contain event info as well? If these
structures are created by the kernel, I think that would necessitate placing
large event tables into the kernel, which is something I think we'd prefer to
avoid because of the amount of memory it would take. Keep in mind that we need
not only event names, but event descriptions, encodings, attributes (e.g. unit
masks), attribute descriptions, etc. I suppose the kernel could read a file
from the file system, and then add this info to the tree, but that just seems
bad. Are there existing places in the kernel where it reads a user space file
to create a user space pseudo filesystem?

I think keeping event naming in user space, and PMU naming in kernel space
might be a better idea: the kernel exposes the available PMUs to user space
via some structure, and a user space library tries to recognize the exposed
PMUs and provide event lists and other needed info. The perf tool would use
this library to be able to list available events to users.

-- 
Regards,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
cjashfor@us.ibm.com
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
From: Corey Ashford @ 2010-01-21 19:28 UTC
To: Ingo Molnar

On 1/21/2010 11:13 AM, Corey Ashford wrote:
> [...]
>
> I think keeping event naming in user space, and PMU naming in kernel space
> might be a better idea: the kernel exposes the available PMUs to user space
> via some structure, and a user space library tries to recognize the exposed
> PMUs and provide event lists and other needed info. The perf tool would use
> this library to be able to list available events to users.

Perhaps another way of handling this would be to have the kernel dynamically
load a specific "PMU kernel module" once it has detected that it has a
particular PMU in the hardware. The module would consist only of a data
structure, and a simple API to access the event data. This way, only the PMUs
that actually exist in the hardware would need to be loaded into memory, and
perhaps then only temporarily (just long enough to create the pseudo fs
nodes).

Still, though, since it's a pseudo fs, all of that event data would be taking
up kernel memory.

Another model, perhaps, would be to actually write this data out to a real
file system upon every boot up, so that it wouldn't need to be held in memory.
That seems rather ugly and time consuming, though.

-- 
Regards,

- Corey
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
From: Ingo Molnar @ 2010-01-27 10:28 UTC
To: Corey Ashford

* Corey Ashford <cjashfor@linux.vnet.ibm.com> wrote:

> [...]
>
> Perhaps another way of handling this would be to have the kernel
> dynamically load a specific "PMU kernel module" once it has detected that
> it has a particular PMU in the hardware. The module would consist only of a
> data structure, and a simple API to access the event data. This way, only
> the PMUs that actually exist in the hardware would need to be loaded into
> memory, and perhaps then only temporarily (just long enough to create the
> pseudo fs nodes).
>
> Still, though, since it's a pseudo fs, all of that event data would be
> taking up kernel memory.
>
> Another model, perhaps, would be to actually write this data out to a real
> file system upon every boot up, so that it wouldn't need to be held in
> memory. That seems rather ugly and time consuming, though.

I don't think memory consumption is a problem at all. The structure of the
monitored hardware/software state is information we _want_ the kernel to
provide, mainly because there's no unified repository for user-space to get
this info from.

If someone doesn't want it on some ultra-embedded box then sure a .config
switch can be provided to allow it to be turned off.

	Ingo
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
From: Corey Ashford @ 2010-01-27 19:50 UTC
To: Ingo Molnar

On 1/27/2010 2:28 AM, Ingo Molnar wrote:
> [...]
>
> I don't think memory consumption is a problem at all. The structure of the
> monitored hardware/software state is information we _want_ the kernel to
> provide, mainly because there's no unified repository for user-space to get
> this info from.
>
> If someone doesn't want it on some ultra-embedded box then sure a .config
> switch can be provided to allow it to be turned off.
>
> Ingo

Ok, just so that we quantify things a bit, let's say I have 20 different
types of PMUs totalling 2000 different events, each of which has a name and
text description, averaging 300 characters. Along with that, there's let's
say 4 64-bit words of metadata per event describing encoding, which
attributes apply to the event, and any other needed info. I don't know how
much memory each pseudo fs node takes up. Let me guess and say 128 bytes for
each event node (the amount taken for the PMU nodes would be negligible
compared with the event nodes).

So that's 2000 * (300 + 32 + 128) bytes ~= 920KB of memory.

Let's assume that the correct event module can be loaded dynamically, so that
we don't need to have all of the possible event sets for a particular arch
kernel build.

Any opinions on whether allocating this amount of kernel memory would be
acceptable? It seems like a lot of kernel memory to me, but I come from an
embedded systems background. Granted, most systems are going to use a
fraction of that amount of memory (<100KB) due to having far fewer PMUs and
therefore fewer distinct event types.

There's at least one more dimension to this. Let's say I have 16 uncore PMUs
all of the same type, each of which has, for example, 8 events. As a very
crude pseudo fs, let's say we have a structure like this:

/sys/devices/pmus/
    uncore_pmu0/
        event0/    (path name to here is the name of the pmu and event)
            description              (file)
            applicable_attributes    (file)
        event1/
            description
            applicable_attributes
        event2/
        ...
        event7/
        ...
    uncore_pmu1/
        event0/
            description
            applicable_attributes
        ...
    ...
    uncore_pmu15/
        ...

Now, you can see that there's a lot of replication here, because the event
descriptions and attributes are the same for each uncore pmu. We can use
symlinks to link them to the same descriptions and attribute data, but these
symlinks take up space too, which can add up if there are a lot of identical
PMUs. So for complex and large systems, we might consume several meg of
memory for the pseudo fs.

Note that I'm taking some liberty with the applicable_attributes file. I know
some attribute info has to be in there, but I don't have any sort of concrete
idea as to how to encode it at this point. The point of this email is to get
an idea as to how much memory the pseudo fs would consume.

-- 
Regards,

- Corey
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
From: Peter Zijlstra @ 2010-01-28 10:57 UTC
To: Corey Ashford

On Wed, 2010-01-27 at 11:50 -0800, Corey Ashford wrote:
> [...]
>
> So that's 2000 * (300 + 32 + 128) bytes ~= 920KB of memory.
>
> [...]
>
> There's at least one more dimension to this. Let's say I have 16 uncore
> PMUs all of the same type, each of which has, for example, 8 events. As a
> very crude pseudo fs, let's say we have a structure like this:
>
> /sys/devices/pmus/
>     uncore_pmu0/
>         event0/    (path name to here is the name of the pmu and event)
>             description              (file)
>             applicable_attributes    (file)
>         ...
>     ...
>     uncore_pmu15/
>         ...

I really don't like this. The cpu->uncore map is fixed by the topology of the
machine, which is already available in /sys some place.

Lets simply use the cpu->node mapping and use PERF_TYPE_NODE{,_RAW} or
something like that. We can start with 2 generic events for that type,
local/remote memory accesses, and take it from there.
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-01-28 10:57 ` Peter Zijlstra @ 2010-01-28 18:00 ` Corey Ashford 2010-01-28 19:06 ` Peter Zijlstra 0 siblings, 1 reply; 55+ messages in thread From: Corey Ashford @ 2010-01-28 18:00 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On 01/28/2010 02:57 AM, Peter Zijlstra wrote: > On Wed, 2010-01-27 at 11:50 -0800, Corey Ashford wrote: >> On 1/27/2010 2:28 AM, Ingo Molnar wrote: >>> >>> * Corey Ashford<cjashfor@linux.vnet.ibm.com> wrote: >>> >>>> On 1/21/2010 11:13 AM, Corey Ashford wrote: >>>>> >>>>> >>>>> On 1/20/2010 11:21 PM, Ingo Molnar wrote: >>>>>> >>>>>> * Corey Ashford<cjashfor@linux.vnet.ibm.com> wrote: >>>>>> >>>>>>> I really think we need some sort of data structure which is passed >>>>>> >from the >>>>>>> kernel to user space to represent the topology of the system, and give >>>>>>> useful information to be able to identify each PMU node. Whether this is >>>>>>> done with a sysfs-style tree, a table in a file, XML, etc... it doesn't >>>>>>> really matter much, but it needs to be something that can be parsed >>>>>>> relatively easily and *contains just enough information* for the user >>>>>>> to be >>>>>>> able to correctly choose PMUs, and for the kernel to be able to >>>>>>> relate that >>>>>>> back to actual PMU hardware. >>>>>> >>>>>> The right way would be to extend the current event description under >>>>>> /debug/tracing/events with hardware descriptors and (maybe) to >>>>>> formalise this >>>>>> into a separate /proc/events/ or into a separate filesystem. 
>>>>>> >>>>>> The advantage of this is that in the grand scheme of things we >>>>>> _really_ dont >>>>>> want to limit performance events to 'hardware' hierarchies, or to >>>>>> devices/sysfs, some existing /proc scheme, or any other arbitrary (and >>>>>> fundamentally limiting) object enumeration. >>>>>> >>>>>> We want a unified, logical enumeration of all events and objects that >>>>>> we care >>>>>> about from a performance monitoring and analysis point of view, shaped >>>>>> for the >>>>>> purpose of and parsed by perf user-space. And since the current event >>>>>> descriptors are already rather rich as they enumerate all sorts of >>>>>> things: >>>>>> >>>>>> - tracepoints >>>>>> - hw-breakpoints >>>>>> - dynamic probes >>>>>> >>>>>> etc., and are well used by tooling we should expand those with real >>>>>> hardware >>>>>> structure. >>>>> >>>>> This is an intriguing idea; I like the idea of generalizing all of this >>>>> info into one structure. >>>>> >>>>> So you think that this structure should contain event info as well? If >>>>> these structures are created by the kernel, I think that would >>>>> necessitate placing large event tables into the kernel, which is >>>>> something I think we'd prefer to avoid because of the amount of memory >>>>> it would take. Keep in mind that we need not only event names, but event >>>>> descriptions, encodings, attributes (e.g. unit masks), attribute >>>>> descriptions, etc. I suppose the kernel could read a file from the file >>>>> system, and then add this info to the tree, but that just seems bad. Are >>>>> there existing places in the kernel where it reads a user space file to >>>>> create a user space pseudo filesystem? 
>>>>> >>>>> I think keeping event naming in user space, and PMU naming in kernel >>>>> space might be a better idea: the kernel exposes the available PMUs to >>>>> user space via some structure, and a user space library tries to >>>>> recognize the exposed PMUs and provide event lists and other needed >>>>> info. The perf tool would use this library to be able to list available >>>>> events to users. >>>>> >>>> >>>> Perhaps another way of handing this would be to have the kernel dynamically >>>> load a specific "PMU kernel module" once it has detected that it has a >>>> particular PMU in the hardware. The module would consist only of a data >>>> structure, and a simple API to access the event data. This way, only only >>>> the PMUs that actually exist in the hardware would need to be loaded into >>>> memory, and perhaps then only temporarily (just long enough to create the >>>> pseudo fs nodes). >>>> >>>> Still, though, since it's a pseudo fs, all of that event data would be >>>> taking up kernel memory. >>>> >>>> Another model, perhaps, would be to actually write this data out to a real >>>> file system upon every boot up, so that it wouldn't need to be held in >>>> memory. That seems rather ugly and time consuming, though. >>> >>> I dont think memory consumption is a problem at all. The structure of the >>> monitored hardware/software state is information we _want_ the kernel to >>> provide, mainly because there's no unified repository for user-space to get >>> this info from. >>> >>> If someone doesnt want it on some ultra-embedded box then sure a .config >>> switch can be provided to allow it to be turned off. >>> >>> Ingo >> >> Ok, just so that we quantify things a bit, let's say I have 20 different types >> of PMUs totalling 2000 different events, each of which has a name and text >> description, averaging 300 characters. 
Along with that, there's let's say 4 >> 64-bit words of metadata per event describing encoding, which attributes apply >> to the event, and any other needed info. I don't know how much memory each >> pseudo fs node takes up. Let me guess and say 128 bytes for each event node >> (the amount taken for the PMU nodes would be negligible compared with the event >> nodes). >> >> So thats 2000 * (300 + 32 + 128) bytes ~= 920KB of memory. >> >> Let's assume that the correct event module can be loaded dynamically, so that we >> don't need to have all of the possible event sets for a particular arch kernel >> build. >> >> Any opinions on whether allocating this amount of kernel memory would be >> acceptable? It seems like a lot of kernel memory to me, but I come from an >> embedded systems background. Granted, most systems are going to use a fraction >> of that amount of memory (<100KB) due to having far fewer PMUs and therefore >> fewer distinct event types. >> >> There's at least one more dimension to this. Let's say I have 16 uncore PMUs >> all of the same type, each of which has, for example 8 events. As a very crude >> pseudo fs, let's say we have a structure like this: >> >> >> /sys/devices/pmus/ >> uncore_pmu0/ >> event0/ (path name to here is the name of the pmu and event) >> description (file) >> applicable_attributes (file) >> event1/ >> description >> applicable_attributes >> event2/ >> ... >> event7/ >> ... >> uncore_pmu1/ >> event0/ >> description >> applicable_attributes >> ... >> ... >> uncore_pmu15/ >> ... > > I really don't like this. The the cpu->uncore map is fixed by the > topology of the machine, which is already available in /sys some place. > > Lets simply use the cpu->node mapping and use PERF_TYPE_NODE{,_RAW} or > something like that. We can start with 2 generic events for that type, > local/remote memory accesses and take it from there. > I don't quite get what you're saying here. 
Perhaps you are thinking that all uncore units are associated with a particular cpu node, or a set of cpu nodes? And that there's only one uncore unit per cpu (or set of cpus) that needs to be addressed, i.e. no ambiguity? That is not going to be the case for all systems. We can have uncore units that are associated with the entire system, for example PMUs in an I/O device. And we can have multiple uncore units of a particular type, for example multiple vector coprocessors, each with its own PMU and each associated with a single cpu or a set of cpus. perf_events needs an addressing scheme that covers these cases. - Corey ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-01-28 18:00 ` Corey Ashford @ 2010-01-28 19:06 ` Peter Zijlstra 2010-01-28 19:44 ` Corey Ashford 2010-01-28 22:08 ` Corey Ashford 0 siblings, 2 replies; 55+ messages in thread From: Peter Zijlstra @ 2010-01-28 19:06 UTC (permalink / raw) To: Corey Ashford Cc: Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On Thu, 2010-01-28 at 10:00 -0800, Corey Ashford wrote: > > I don't quite get what you're saying here. Perhaps you are thinking > that all uncore units are associated with a particular cpu node, or a > set of cpu nodes? And that there's only one uncore unit per cpu (or set > of cpus) that needs to be addressed, i.e. no ambiguity? Well, I was initially thinking of the intel uncore thing which is memory controller, so node, level. But all system topology bound pmus can be done that way. > That is not going to be the case for all systems. We can have uncore > units that are associated with the entire system, Right, but that's simple too. > for example PMUs in an I/O device. > And we can have multiple uncore units of a particular > type, for example multiple vector coprocessors, each with its own PMU, > and are associated with a single cpu or a set of cpus. > > perf_events needs an addressing scheme that covers these cases. You could possible add a u64 pmu_id field to perf_event_attr and use that together with things like: PERF_TYPE_PCI, attr.pmu_id = domain:bus:device:function encoding PERF_TYPE_SPU, attr.pmu_id = spu-id But before we go there the perf core needs to be extended to deal with multiple hardware pmus, something which isn't too hard but we need to be careful not to bloat the normal code paths for these somewhat esoteric use cases. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-01-28 19:06 ` Peter Zijlstra @ 2010-01-28 19:44 ` Corey Ashford 2010-01-28 22:08 ` Corey Ashford 1 sibling, 0 replies; 55+ messages in thread From: Corey Ashford @ 2010-01-28 19:44 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On 1/28/2010 11:06 AM, Peter Zijlstra wrote: > On Thu, 2010-01-28 at 10:00 -0800, Corey Ashford wrote: >> >> I don't quite get what you're saying here. Perhaps you are thinking >> that all uncore units are associated with a particular cpu node, or a >> set of cpu nodes? And that there's only one uncore unit per cpu (or set >> of cpus) that needs to be addressed, i.e. no ambiguity? > > Well, I was initially thinking of the intel uncore thing which is memory > controller, so node, level. > > But all system topology bound pmus can be done that way. > >> That is not going to be the case for all systems. We can have uncore >> units that are associated with the entire system, > > Right, but that's simple too. > >> for example PMUs in an I/O device. > >> And we can have multiple uncore units of a particular >> type, for example multiple vector coprocessors, each with its own PMU, >> and are associated with a single cpu or a set of cpus. >> >> perf_events needs an addressing scheme that covers these cases. > > You could possible add a u64 pmu_id field to perf_event_attr and use > that together with things like: > > PERF_TYPE_PCI, attr.pmu_id = domain:bus:device:function encoding > PERF_TYPE_SPU, attr.pmu_id = spu-id > Thank you for that clarification. One of Ingo's comments was that he wants perf to be able to expose all of the available PMUs via the perf tool. 
That perf should be able to parse some data structure (somewhere) that would contain all of the info the user would need to choose a particular PMU. Do you have some ideas about how that could be accomplished using the above encoding scheme? I can see how it would be fairly easy to come up with a PERF_TYPE_* encoding per-topology, and then interpret all of those bits correctly within the kernel (which is savvy to that topology), but I don't see how there would be a straightforward way to expose that structure to perf. How would perf know which of those encodings apply to the current system, how many PMUs there are of each type, etc.? That's why I'm leaning toward a /sys/devices-style pseudo fs at the moment. If there's a simpler, better way, I'm open to it. > But before we go there the perf core needs to be extended to deal with > multiple hardware pmus, something which isn't too hard but we need to be > careful not to bloat the normal code paths for these somewhat esoteric > use cases. Is this something you are looking into? - Corey ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-01-28 19:06 ` Peter Zijlstra 2010-01-28 19:44 ` Corey Ashford @ 2010-01-28 22:08 ` Corey Ashford 2010-01-29 9:52 ` Peter Zijlstra 1 sibling, 1 reply; 55+ messages in thread From: Corey Ashford @ 2010-01-28 22:08 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On 1/28/2010 11:06 AM, Peter Zijlstra wrote: > On Thu, 2010-01-28 at 10:00 -0800, Corey Ashford wrote: >> >> I don't quite get what you're saying here. Perhaps you are thinking >> that all uncore units are associated with a particular cpu node, or a >> set of cpu nodes? And that there's only one uncore unit per cpu (or set >> of cpus) that needs to be addressed, i.e. no ambiguity? > > Well, I was initially thinking of the intel uncore thing which is memory > controller, so node, level. > > But all system topology bound pmus can be done that way. > >> That is not going to be the case for all systems. We can have uncore >> units that are associated with the entire system, > > Right, but that's simple too. > >> for example PMUs in an I/O device. > >> And we can have multiple uncore units of a particular >> type, for example multiple vector coprocessors, each with its own PMU, >> and are associated with a single cpu or a set of cpus. >> >> perf_events needs an addressing scheme that covers these cases. > > You could possible add a u64 pmu_id field to perf_event_attr and use > that together with things like: > > PERF_TYPE_PCI, attr.pmu_id = domain:bus:device:function encoding > PERF_TYPE_SPU, attr.pmu_id = spu-id > Thank you for the clarification. One of Ingo's comments in this thread was that he wants perf to be able to display to the user the available PMUs along with their respect events. 
That perf would parse some machine-independent data structure (somewhere) to get this info. This same info would provide the user a method of specifying which PMU he wants to address. He'd also like all of the event info data to reside in the same place. I hope I am paraphrasing him correctly. I can see that with the scheme you have proposed above, it would be straightforward to encode PMU ids for a particular new PERF_TYPE_* system topology, but I don't see a clear way of providing perf with enough information to tell it which particular topology is being used, how many units of each PMU type exist, and so on. Do you have any ideas as to how to accomplish this goal with the method you are suggesting? This is one of the reasons why I am leaning toward a /sys/devices-style data structure; the kernel could easily build it based on the pmus that it discovers (through whatever means), and the user can fairly easily choose a pmu from this structure to open, and it's unambiguous to the kernel as to which pmu the user really wants. I am not convinced that this is the right place to put the event info for each PMU. > But before we go there the perf core needs to be extended to deal with > multiple hardware pmus, something which isn't too hard but we need to be > careful not to bloat the normal code paths for these somewhat esoteric > use cases. > Is this something you've looked into? If so, what sort of issues have you discovered? Thanks, - Corey ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-01-28 22:08 ` Corey Ashford @ 2010-01-29 9:52 ` Peter Zijlstra 2010-01-29 23:05 ` Corey Ashford 0 siblings, 1 reply; 55+ messages in thread From: Peter Zijlstra @ 2010-01-29 9:52 UTC (permalink / raw) To: Corey Ashford Cc: Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On Thu, 2010-01-28 at 14:08 -0800, Corey Ashford wrote: > This is one of the reasons why I am leaning toward a /sys/devices-style data > structure; the kernel could easily build it based on the pmus that it discovers > (through whatever means), and the user can fairly easily choose a pmu from this > structure to open, and it's unambiguous to the kernel as to which pmu the user > really wants. Well, the dumb way is simply probing all of them and see who responds. Another might be adding a pmu attribute (showing the pmu-id) to the existing sysfs topology layouts (system topology, pci, spu, are all already available in sysfs iirc). > I am not convinced that this is the right place to put the event info for each PMU. Right, I'm not at all sure the kernel wants to know about any events beyond those needed for pmu scheduling constraints and possible generic event maps. Clearly it needs to know about all software events, but I don't think we need nor want exhaustive hardware event lists in the kernel. > > But before we go there the perf core needs to be extended to deal with > > multiple hardware pmus, something which isn't too hard but we need to be > > careful not to bloat the normal code paths for these somewhat esoteric > > use cases. > > > > Is this something you've looked into? If so, what sort of issues have you > discovered? 
I've poked at it a little yes, while simply abstracting the current hw interface and making it a list of pmu's isn't hard at all, it does add overhead to a few key locations. Another aspect is event scheduling, you'd want to separate the event lists for the various pmus so that the RR thing works as expected, this again adds overhead because you now need to abstract out the event lists as well. The main fast path affected by both these things is the task switch event scheduling where you have to iterate all active events and their pmus. So while the abstraction itself isn't too hard, doing it so as to minimize the bloat on the key paths does make it interesting. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-01-29 9:52 ` Peter Zijlstra @ 2010-01-29 23:05 ` Corey Ashford 2010-01-30 8:42 ` Peter Zijlstra 0 siblings, 1 reply; 55+ messages in thread From: Corey Ashford @ 2010-01-29 23:05 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On 1/29/2010 1:52 AM, Peter Zijlstra wrote: > On Thu, 2010-01-28 at 14:08 -0800, Corey Ashford wrote: > >> This is one of the reasons why I am leaning toward a /sys/devices-style data >> structure; the kernel could easily build it based on the pmus that it discovers >> (through whatever means), and the user can fairly easily choose a pmu from this >> structure to open, and it's unambiguous to the kernel as to which pmu the user >> really wants. > > Well, the dumb way is simply probing all of them and see who responds. That can work, but it's still fuzzy to me how a user would relate a PMU address that he's encoded to some actual device in the system he's using. How would he know that he's addressing the correct device (besides that the PMU type matches), given that we're likely to have hypervisors as middle-men. > Another might be adding a pmu attribute (showing the pmu-id) to the > existing sysfs topology layouts (system topology, pci, spu, are all > already available in sysfs iirc). So you'd read the id from the sysfs topology tree, and then pass that id to the interface? That's an interesting approach that eliminates the need to pass a string pmu path to the kernel. I like this idea, but I need to read more deeply about the topology entries to understand how they work. >> I am not convinced that this is the right place to put the event info for each PMU. 
> > Right, I'm not at all sure the kernel wants to know about any events > beyond those needed for pmu scheduling constraints and possible generic > event maps. > > Clearly it needs to know about all software events, but I don't think we > need nor want exhaustive hardware event lists in the kernel. > >> > But before we go there the perf core needs to be extended to deal with >> > multiple hardware pmus, something which isn't too hard but we need to be >> > careful not to bloat the normal code paths for these somewhat esoteric >> > use cases. >> > >> >> Is this something you've looked into? If so, what sort of issues have you >> discovered? > > I've poked at it a little yes, while simply abstracting the current hw > interface and making it a list of pmu's isn't hard at all, it does add > overhead to a few key locations. > > Another aspect is event scheduling, you'd want to separate the event > lists for the various pmus so that the RR thing works as expected, this > again adds overhead because you now need to abstract out the event lists > as well. > > The main fast path affected by both these things is the task switch > event scheduling where you have to iterate all active events and their > pmus. > > So while the abstraction itself isn't too hard, doing it so as to > minimize the bloat on the key paths does make it interesting. > Interesting. Thanks for your comments. - Corey ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-01-29 23:05 ` Corey Ashford @ 2010-01-30 8:42 ` Peter Zijlstra 2010-02-01 19:39 ` Corey Ashford 0 siblings, 1 reply; 55+ messages in thread From: Peter Zijlstra @ 2010-01-30 8:42 UTC (permalink / raw) To: Corey Ashford Cc: Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On Fri, 2010-01-29 at 15:05 -0800, Corey Ashford wrote: > So you'd read the id from the sysfs topology tree, and then pass that id to the > interface? That's an interesting approach that eliminates the need to pass a > string pmu path to the kernel. No, the attr.pmu_id would reflect the location in the tree (pci location, or spu number), the pmu id reported would identify the kind of pmu driver used for that particular device. I realized this confusion after sending but didn't clarify, we should come up with a good alternative name for either (or both) uses. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-01-30 8:42 ` Peter Zijlstra @ 2010-02-01 19:39 ` Corey Ashford 2010-02-01 19:54 ` Peter Zijlstra 0 siblings, 1 reply; 55+ messages in thread From: Corey Ashford @ 2010-02-01 19:39 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On 1/30/2010 12:42 AM, Peter Zijlstra wrote: > On Fri, 2010-01-29 at 15:05 -0800, Corey Ashford wrote: >> So you'd read the id from the sysfs topology tree, and then pass that id to the >> interface? That's an interesting approach that eliminates the need to pass a >> string pmu path to the kernel. > > No, the attr.pmu_id would reflect the location in the tree (pci > location, or spu number), the pmu id reported would identify the kind of > pmu driver used for that particular device. > > I realized this confusion after sending but didn't clarify, we should > come up with a good alternative name for either (or both) uses. > Ok, just so I'm clear here, is attr.pmu_id a (char *) or some sort of encoded bit field? - Corey ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-02-01 19:39 ` Corey Ashford @ 2010-02-01 19:54 ` Peter Zijlstra 0 siblings, 0 replies; 55+ messages in thread From: Peter Zijlstra @ 2010-02-01 19:54 UTC (permalink / raw) To: Corey Ashford Cc: Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On Mon, 2010-02-01 at 11:39 -0800, Corey Ashford wrote: > > On 1/30/2010 12:42 AM, Peter Zijlstra wrote: > > On Fri, 2010-01-29 at 15:05 -0800, Corey Ashford wrote: > >> So you'd read the id from the sysfs topology tree, and then pass that id to the > >> interface? That's an interesting approach that eliminates the need to pass a > >> string pmu path to the kernel. > > > > No, the attr.pmu_id would reflect the location in the tree (pci > > location, or spu number), the pmu id reported would identify the kind of > > pmu driver used for that particular device. > > > > I realized this confusion after sending but didn't clarify, we should > > come up with a good alternative name for either (or both) uses. > > > > Ok, just so I'm clear here, is attr.pmu_id a (char *) or some sort of encoded > bit field? Right, currently on x86 we have x86_pmu.name which basically tells us what kind of pmu it is, but we really don't export that since its trivial to see that from /proc/cpuinfo, but the ARM people expressed interest in this because decoding the cpuid equivalent on arm is like nasty business. But once we go add other funny pmu's, it becomes interesting to know what kind of pmu it is and tie possible event sets to them. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-01-20 23:23 ` Corey Ashford 2010-01-21 7:21 ` Ingo Molnar @ 2010-01-21 8:36 ` Peter Zijlstra 1 sibling, 0 replies; 55+ messages in thread From: Peter Zijlstra @ 2010-01-21 8:36 UTC (permalink / raw) To: Corey Ashford Cc: LKML, Ingo Molnar, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love On Wed, 2010-01-20 at 15:23 -0800, Corey Ashford wrote: > If so, that's an interesting idea, but I think it still > leaves open the problem of how to actually relate those address to the real > hardware, especially in the case of using a hypervisor which has provided you a > small subset of the physical hardware in the system. Well, I'm tempted to say that's a problem for the virt guys :-) One way one could solve that is by having the topology information include the virt<->phys map, so that you can find the physical node from the virtual cpu number. Going to be interesting though, but then, virt seems to be about creating problems where there were none before. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-01-20 21:33 ` Peter Zijlstra 2010-01-20 23:23 ` Corey Ashford @ 2010-01-21 8:47 ` stephane eranian 2010-01-21 8:59 ` Peter Zijlstra 1 sibling, 1 reply; 55+ messages in thread From: stephane eranian @ 2010-01-21 8:47 UTC (permalink / raw) To: Peter Zijlstra Cc: Corey Ashford, LKML, Ingo Molnar, Andi Kleen, Paul Mackerras, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love On Wed, Jan 20, 2010 at 10:33 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, 2010-01-20 at 14:34 +0100, Peter Zijlstra wrote: > >> So how about PERF_TYPE_{CORE,NODE,SOCKET} like things? > > OK, so I read most of the intel uncore stuff, and it seems to suggest > you need a regular pmu event to receive uncore events (chained setup), > this seems rather retarded since it wastes a perfectly good pmu event > and makes configuring all this more intricate... > I don't think that is correct. You can be using the uncore PMU on Nehalem without any core PMU event. The only thing to realize is that uncore PMU shares the same interrupt vector as core PMU. You need to configure which core the uncore is going to interrupt on. This is done via a bitmask, so you can interrupt more than one core at a time. Several strategies are possible. > A well, nothing to be done about that I guess.. > > > > ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-01-21 8:47 ` stephane eranian @ 2010-01-21 8:59 ` Peter Zijlstra 2010-01-21 9:16 ` stephane eranian 2010-01-21 9:43 ` stephane eranian 0 siblings, 2 replies; 55+ messages in thread From: Peter Zijlstra @ 2010-01-21 8:59 UTC (permalink / raw) To: eranian Cc: Corey Ashford, LKML, Ingo Molnar, Andi Kleen, Paul Mackerras, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love On Thu, 2010-01-21 at 09:47 +0100, stephane eranian wrote: > I don't think that is correct. You can be using the uncore PMU on Nehalem > without any core PMU event. The only thing to realize is that uncore PMU > shares the same interrupt vector as core PMU. You need to configure which > core the uncore is going to interrupt on. This is done via a bitmask, so you > can interrupt more than one core at a time. Several strategies are possible. Ah, sharing the IRQ line is no problem. But from reading I got the impression you need to configure an Offcore counter. See 30.6.2.1: • EN_PMI_COREn (bit n, n = 0, 3 if four cores are present): When set, processor core n is programmed to receive an interrupt signal from any interrupt enabled uncore counter. PMI delivery due to an uncore counter overflow is enabled by setting IA32_DEBUG_CTL.Offcore_PMI_EN to 1. Which seems to indicate a link with the off-core response thing. However I would be very glad to be wrong :-) ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-01-21 8:59 ` Peter Zijlstra @ 2010-01-21 9:16 ` stephane eranian 2010-01-21 9:43 ` stephane eranian 1 sibling, 0 replies; 55+ messages in thread From: stephane eranian @ 2010-01-21 9:16 UTC (permalink / raw) To: Peter Zijlstra Cc: Corey Ashford, LKML, Ingo Molnar, Andi Kleen, Paul Mackerras, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love The offcore_response register is a CORE PMU register. I know the name is confusing. OFFCORE_RESPONSE_0 is not a counter but rather an extension of the filtering capabilities of regular counters. To use offcore_response_0, you will need to program a generic counter (1 config, 1 data) + offcore_response_0. This is a very useful feature to understand memory traffic. There is an associated difficulty though. The offcore_response_0 MSR is shared between HT threads. Thus the kernel needs to arbitrate. That should be taken care of by the event scheduling code which I am working on. In the context of perf_event, you need to find some encoding of the event to pass as RAW. The raw event code + umask is 0x01B7. Then, you need to encode the value for the offcore_response_0 MSR, which is 16 bits on NHM/WSM. You could either stash this into the config field or use an extended field. The kernel would detect 0x01B7 and use the value in the extra config field. As described in vol3b, Intel Westmere adds a second offcore_response counter. It behaves the same way with the same restrictions. The debugctl bit controls uncore activation not offcore. On Thu, Jan 21, 2010 at 9:59 AM, Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, 2010-01-21 at 09:47 +0100, stephane eranian wrote: >> I don't think that is correct. You can be using the uncore PMU on Nehalem >> without any core PMU event. The only thing to realize is that uncore PMU >> shares the same interrupt vector as core PMU. 
You need to configure which >> core the uncore is going to interrupt on. This is done via a bitmask, so you >> can interrupt more than one core at a time. Several strategies are possible. > > Ah, sharing the IRQ line is no problem. But from reading I got the > impression you need to configure an Offcore counter. See 30.6.2.1: > > • EN_PMI_COREn (bit n, n = 0, 3 if four cores are present): When set, processor > core n is programmed to receive an interrupt signal from any interrupt enabled > uncore counter. PMI delivery due to an uncore counter overflow is enabled by > setting IA32_DEBUG_CTL.Offcore_PMI_EN to 1. > > Which seems to indicate a link with the off-core response thing. > > However I would be very glad to be wrong :-) > > ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-01-21 8:59 ` Peter Zijlstra 2010-01-21 9:16 ` stephane eranian @ 2010-01-21 9:43 ` stephane eranian 1 sibling, 0 replies; 55+ messages in thread From: stephane eranian @ 2010-01-21 9:43 UTC (permalink / raw) To: Peter Zijlstra Cc: Corey Ashford, LKML, Ingo Molnar, Andi Kleen, Paul Mackerras, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, eranian On Thu, Jan 21, 2010 at 9:59 AM, Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, 2010-01-21 at 09:47 +0100, stephane eranian wrote: >> I don't think that is correct. You can be using the uncore PMU on Nehalem >> without any core PMU event. The only thing to realize is that uncore PMU >> shares the same interrupt vector as core PMU. You need to configure which >> core the uncore is going to interrupt on. This is done via a bitmask, so you >> can interrupt more than one core at a time. Several strategies are possible. > > Ah, sharing the IRQ line is no problem. But from reading I got the Given the PMU sharing model of perf_events, it seems you may have multiple consumers of uncore PMU at the same time. That means you will need to direct the interrupt onto all the CPU for which you currently have a user. You may have multiple users per CPU, thus you need some reference count to track all of that. The alternative is to systematically broadcast the uncore PMU interrupt. Each core then checks whether or not it has uncore users. Note that all of this is independent of the type of event, i.e., per-thread or system-wide. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units [not found] ` <d3f22a1003290213x7d7904an59d50eb6a8616133@mail.gmail.com> @ 2010-03-30 7:42 ` Lin Ming 2010-03-30 16:49 ` Corey Ashford 0 siblings, 1 reply; 55+ messages in thread From: Lin Ming @ 2010-03-30 7:42 UTC (permalink / raw) To: Corey Ashford Cc: Peter Zijlstra, Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu Hi, Corey How is this going now? Are you still working on this? I'd like to help to add support for uncore, test, write code or anything else. Thanks, Lin Ming > > ----- > Intro > ----- > One subject that hasn't been addressed since the introduction of > perf_events in the Linux kernel is that of support for "uncore" or > "nest" unit events. Uncore is the term used by the Intel engineers > for their off-core units but are still on the same die as the cores, > and "nest" means exactly the same thing for IBM Power processor > engineers. I will use the term uncore for brevity and because it's in > common parlance, but the issues and design possibilities below are > relevant to both. I will also broaden the term by stating that uncore > will also refer to PMUs that are completely off of the processor chip > altogether. > > Contents > -------- > 1. Why support PMUs in uncore units? Is there anything interesting to look at? > 2. How do uncore events differ from core events? > 3. Why does a CPU need to be assigned to manage a particular uncore > unit's events? > 4. How do you encode uncore events? > 5. How do you address a particular uncore PMU? > 6. Event rotation issues with uncore PMUs > 7. Other issues? > 8. Feedback? > > ---- > 1. Why support PMUs in uncore units? Is there anything interesting to look at? 
> ---- > > Today, many x86 chips contain uncore units, and we think that it's > likely that the trend will continue, as more devices - I/O, memory > interfaces, shared caches, accelerators, etc. - are integrated onto > multi-core chips. As these devices become more sophisticated and more > workload is diverted off-core, engineers and performance analysts are > going to want to look at what's happening in these units so that they > can find bottlenecks. > > In addition, we think that even off-chip I/O and interconnect devices > are likely to gain PMUs because engineers will want to find > bottlenecks in their massively parallel systems. > > ---- > 2. How do uncore events differ from core events? > ---- > > The main difference is that uncore events are mostly likely not going > to be tied to a particular Linux task, or even a CPU context. Uncore > units are resources that are in some sense system-wide, though, they > may not really be accessible system-wide in some architectures. In > the case of accelerators and I/O devices, it's likely they will run > asynchronously from the cores, and thus keeping track of events on a > per-task basis doesn't make a lot of sense. The other existing mode > in perf_events is a per-CPU context, and it turns out that this mode > does match up with uncore units well, though the choice of which CPU > to use to manage that uncore unit is going to need to be > arch-dependent and may involve other issues as well, such as > minimizing access latency between the uncore unit and the CPU which is > managing it. > > ---- > 3. Why does a CPU need to be assigned to manage a particular uncore > unit's events? > ---- > > * The control registers of the uncore unit's PMU need to be read and > written, and that may be possible only from a subset of processors in > the system. > * A processor is needed to rotate the event list on the uncore unit on > every tick for the purposes of event scheduling. 
> * Because of access latency issues, we may want the CPU to be close in > locality to the PMU. > > It seems like a good idea to let the kernel decide which CPU to use to > monitor a particular uncore event, based on the location of the uncore > unit, and possibly current system load balance. The user will not > want to have to figure out this detailed information. > > ---- > 4. How do you encode uncore events? > ---- > Uncore events will need to be encoded in the config field of the > perf_event_attr struct using the existing PERF_TYPE_RAW encoding. 64 > bits are available in the config field, and that may be sufficient to > support events on most systems. However, due to the proliferation and > added complexity of PMUs we envision, we might want to add another > 64-bit config (perhaps call it config_extra or config2) field to > encode any extra attributes that might be needed. The exact encoding > used, just as for the current encoding for core events, will be on a > per-arch and possibly per-system basis. > > ---- > 5. How do you address a particular uncore PMU? > ---- > > This one is going to be very system- and arch-dependent, but it seems > fairly clear that we need some sort of addressing scheme that can be > system/arch-defined by the kernel. > > From a hierarchical perspective, here's an example of possible uncore > PMU locations in a large system: > > 1) Per-core - units that are shared between all hardware threads in a core > 2) Per-node - units that are shared between all cores in a node > 3) Per-chip - units that are shared between all nodes in a chip > 4) Per-blade - units that are shared between all chips on a blade > 5) Per-rack - units that are shared between all blades in a rack > > Addressing option 1) > > Reuse the cpu argument: cpu would be interpreted differently if an > uncore unit is specified (via the perf_event_attr struct's config > field). 
>
> For the hypothetical system described above, we'd want to have an
> address that contains enough address bits for each of the above. For
> example:
>
> bits    field
> ------  -----
> 3..0    PMU number 0-15   /* specifies which of several identical
>                              PMUs is being addressed */
> 7..4    core id 0-15
> 8..8    node id 0-1
> 11..9   chip id 0-7
> 16..12  blade id 0-31
> 23..17  rack id 0-127
>
> These fields would be exposed via
> /usr/include/linux/perf_events_uncore_addr.h (for example). How you
> actually assign these numbers to actual hardware is, again,
> system-design dependent, and may be influenced by the use of a
> hypervisor, or other software which allocates resources available to
> the system dynamically.
>
> How does the user discover the mapping between the hardware made
> available to the system and the addresses shown above? Again, this is
> system-dependent, and probably outside the scope of this proposal. In
> other words, I don't know how to do this in a general way, though I
> could probably put something together for a particular system.
>
> Addressing Option 2)
>
> Have the kernel create nodes for each uncore PMU in
> /sys/devices/system or another pseudo file system, such as the
> existing /proc/device-tree on Power systems. /sys/devices/system or
> /proc/device-tree could be explored by the user tool, and the user
> could then specify the path of the requested PMU via a string which
> the kernel could interpret. To be overly simplistic, something like
> "/sys/devices/system/pmus/blade4/cpu0/vectorcopro1". If we settled on
> a common tree root to use, we could specify only the relative path
> name, "blade4/cpu0/vectorcopro1".
>
> One way to provide this extra "PMU path" argument to
> sys_perf_event_open() would be to add a bit to the flags argument
> that says we're adding a PMU path string onto the end of the argument
> list.
>
> This path-string-based addressing option seems to be more flexible in
> the long run, and does not have as serious an issue in mapping PMUs
> to user space; the kernel essentially exposes to user space all of
> the available PMUs for the current partition. This might create more
> work for the kernel side, but should make the system more transparent
> for user-space tools. Another system- or at least arch-dependent tool
> would have to be written for user space to help users navigate the
> device tree to find the PMU they want to use. I don't think it would
> make sense to build that capability into perf, because the software
> would be arch- or system-dependent.
>
> It could be argued that we should use a common user space tree to
> represent PMUs for all architectures and systems, so that the
> arch-independent perf code would be able to display available uncore
> PMUs. That may be a goal that's very hard to achieve because of the
> wide variation in architectures. Any thoughts on that?
>
> ----
> 6. Event rotation issues with uncore PMUs
> ----
>
> Currently, the perf_events code rotates the set of events assigned to
> a CPU or task on every system tick, so that event scheduling
> collisions on a PMU are mitigated. This turns out to cause problems
> for uncore units for two reasons - inefficiency and CPU load.
>
> a) Rotation of a set of events across more than one PMU causes
> inefficient rotation.
>
> Consider the following event list; the letter designates the PMU and
> the number is the event number on that PMU:
>
> A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4 C5
>
> After one rotation, you can see that the event list will be:
>
> C5 A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4
>
> and then:
>
> C4 C5 A1 A2 A3 B1 B2 B3 B4 C1 C2 C3
>
> Notice how the relative positions of the A and B PMU events haven't
> changed even after two (or even five) rotations, so they will schedule
> the events in the same order for some time.
> This will skew the multiplexing so that some events will be scheduled
> much less often than they should or could be.
>
> What we'd like to have happen is for the events of each PMU to be
> rotated in their own lists. For example, before rotation:
>
> A1 A2 A3
> B1 B2 B3 B4
> C1 C2 C3 C4 C5
>
> After rotation:
>
> A3 A1 A2
> B4 B1 B2 B3
> C5 C1 C2 C3 C4
>
> We've got some ideas about how to make this happen, using either
> separate lists, or placing them on separate CPUs.
>
> b) Access to some uncore PMUs may be quite slow due to the
> interconnect that is used. This can place a burden on the CPU if it
> is done on every system tick.
>
> This can be addressed by keeping a counter, on a per-PMU-context
> basis, that reduces the rate of event rotations. Setting the rotation
> period to three, for example, would cause event rotations in that
> context to happen on every third tick, instead of on every tick. We
> think that the kernel could measure the amount of time it is taking
> to do a rotation, and then dynamically decrease the rotation rate if
> it's taking too long; "rotation rate throttling", in other words.
>
> ----
> 7. Other issues?
> ----
>
> This section is left blank for now.
>
> ----
> 8. Feedback?
> ----
>
> I'd appreciate any feedback you might have on this topic. You can
> contact me directly at the email address below, or better yet, reply
> to LKML.
>
> --
> Regards,
>
> - Corey
>
> Corey Ashford
> Software Engineer
> IBM Linux Technology Center, Linux Toolchain
> Beaverton, OR
> 503-578-3507
> cjashfor@us.ibm.com
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-03-30 7:42 ` Lin Ming @ 2010-03-30 16:49 ` Corey Ashford 2010-03-30 17:15 ` Peter Zijlstra 2010-03-30 21:28 ` stephane eranian 0 siblings, 2 replies; 55+ messages in thread From: Corey Ashford @ 2010-03-30 16:49 UTC (permalink / raw) To: Lin Ming Cc: Peter Zijlstra, Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On 03/30/2010 12:42 AM, Lin Ming wrote: > Hi, Corey > > How is this going now? Are you still working on this? > I'd like to help to add support for uncore, test, write code or anything > else. > > Thanks, > Lin Ming I haven't been actively working on adding infrastructure for nest PMUs yet. At the moment we are working on supporting nest events for IBM's Wire-Speed processor, using the current infrastructure, because of the time limitations. Using the existing infrastructure is definitely not ideal, but for this processor, it's workable. There are still a lot of issues to solve for adding this infrastructure: 1) Does perf_events need a new context type (in addition to per-task and per-cpu)? This is partly because we don't want to be mixing the rotation of CPU-events with nest events. Each PMU really ought to have its own event list. 2) How do we deal with accessing PMU's which require slow access methods (e.g. internal serial bus)? The accesses may need to be placed on worker threads so that they don't affect the performance of context switches and system ticks. 3) How exactly do we represent the PMU's in the pseudofs (/sys or /proc)? And how exactly does the user specify the PMU to perf_events? Peter Zijlstra and Stephane Eranian both recommended opening the PMU with open() and then passing the resulting fd in through the perf_event_attr struct. 
4) How do we choose a CPU to do the housekeeping work for a particular
nest PMU? Peter thought that user space should still specify it via the
perf_event_open() cpu parameter, but there's also an argument to be made
for the kernel choosing the best CPU to handle the job, or at least
making it optional for the user to choose the CPU.

I'm sure there are other issues as well. If you'd like to start working
on some (or all!) of these, you are more than welcome to. I think we
need to toss around some more ideas before committing much to code at
this point.

- Corey

^ permalink raw reply	[flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-03-30 16:49 ` Corey Ashford @ 2010-03-30 17:15 ` Peter Zijlstra 2010-03-30 22:12 ` Corey Ashford 2010-04-15 21:16 ` Gary.Mohr 2010-03-30 21:28 ` stephane eranian 1 sibling, 2 replies; 55+ messages in thread From: Peter Zijlstra @ 2010-03-30 17:15 UTC (permalink / raw) To: Corey Ashford Cc: Lin Ming, Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On Tue, 2010-03-30 at 09:49 -0700, Corey Ashford wrote: > On 03/30/2010 12:42 AM, Lin Ming wrote: > > Hi, Corey > > > > How is this going now? Are you still working on this? > > I'd like to help to add support for uncore, test, write code or anything > > else. > > > > Thanks, > > Lin Ming > > I haven't been actively working on adding infrastructure for nest PMUs > yet. At the moment we are working on supporting nest events for IBM's > Wire-Speed processor, using the current infrastructure, because of the > time limitations. Using the existing infrastructure is definitely not > ideal, but for this processor, it's workable. > > There are still a lot of issues to solve for adding this infrastructure: > > 1) Does perf_events need a new context type (in addition to per-task and > per-cpu)? This is partly because we don't want to be mixing the > rotation of CPU-events with nest events. Each PMU really ought to have > its own event list. > > 2) How do we deal with accessing PMU's which require slow access methods > (e.g. internal serial bus)? The accesses may need to be placed on > worker threads so that they don't affect the performance of context > switches and system ticks. > > 3) How exactly do we represent the PMU's in the pseudofs (/sys or > /proc)? And how exactly does the user specify the PMU to perf_events? 
> Peter Zijlstra and Stephane Eranian both recommended opening the PMU > with open() and then passing the resulting fd in through the > perf_event_attr struct. > > 4) How do we choose a CPU to do the housekeeping work for a particular > nest PMU. Peter thought that user space should still specify the it via > open_perf_event() cpu parameter, but there's also an argument to be made > for the kernel choosing the best CPU to handle the job, or at least make > it optional for the user to choose the CPU. > > I'm sure there are other issues as well. If you'd like to start working > on some (or all!) of these, you are more than welcome to. I think we > need to toss around some more ideas before committing much to code at > this point. Right, I've got some definite ideas on how to go here, just need some time to implement them. The first thing that needs to be done is get rid of all the __weak functions (with exception of perf_callchain*, since that really is arch specific). For hw_perf_event_init() we need to create a pmu registration facility and lookup a pmu_id, either passed as an actual id found in sysfs or an open file handle from sysfs (the cpu pmu would be pmu_id 0 for backwards compat). hw_perf_disable/enable() would become struct pmu functions and perf_disable/enable need to become per-pmu, most functions operate on a specific event, for those we know the pmu and hence can call the per-pmu version. (XXX find those sites where this is not true). Then we can move to context, yes I think we want new context for new PMUs, otherwise we get very funny RR interleaving problems. My idea was to move find_get_context() into struct pmu as well, this allows you to have per-pmu contexts. Initially I'd not allow per-pmu-per-task contexts because then things like perf_event_task_sched_out() would get rather complex. For RR we can move away from perf_event_task_tick and let the pmu install a (hr)timer for this on their own. 
I've been planning to implement this for more than a week now, it's just that other stuff keeps getting in the way. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-03-30 17:15 ` Peter Zijlstra @ 2010-03-30 22:12 ` Corey Ashford 2010-03-31 14:01 ` Peter Zijlstra 2010-04-15 21:16 ` Gary.Mohr 1 sibling, 1 reply; 55+ messages in thread From: Corey Ashford @ 2010-03-30 22:12 UTC (permalink / raw) To: Peter Zijlstra Cc: Lin Ming, Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On 3/30/2010 10:15 AM, Peter Zijlstra wrote: -- my comments snipped -- > > Right, I've got some definite ideas on how to go here, just need some > time to implement them. > > The first thing that needs to be done is get rid of all the __weak > functions (with exception of perf_callchain*, since that really is arch > specific). > > For hw_perf_event_init() we need to create a pmu registration facility > and lookup a pmu_id, either passed as an actual id found in sysfs or an > open file handle from sysfs (the cpu pmu would be pmu_id 0 for backwards > compat). > > hw_perf_disable/enable() would become struct pmu functions and > perf_disable/enable need to become per-pmu, most functions operate on a > specific event, for those we know the pmu and hence can call the per-pmu > version. (XXX find those sites where this is not true). This sounds like a good idea. Right now for the Wire-Speed processor, we have a loop that goes through all of the nest PMU's and calls their respective per-pmu functions. > > Then we can move to context, yes I think we want new context for new > PMUs, otherwise we get very funny RR interleaving problems. My idea was > to move find_get_context() into struct pmu as well, this allows you to > have per-pmu contexts. Yes, I think it makes a lot of sense, so that there's not some sort of fixed association of pmu contexts to cpu contexts, for example. 
> Initially I'd not allow per-pmu-per-task contexts > because then things like perf_event_task_sched_out() would get rather > complex. Definitely. I don't think it makes sense to have per-task context on nest/uncore PMUs. At least we haven't found any justification for it. > > For RR we can move away from perf_event_task_tick and let the pmu > install a (hr)timer for this on their own. This is necessary I think, because of the access time for some of the PMU's. I wonder though if it should, perhaps optionally, be off-loaded to a high-priority task to do the switching so that access latency to the PMU can be controlled. As I mentioned when we met, some of the Wire-Speed processor nest PMU control registers are accessed via SCOM, which is an internal, 200 MHz serial bus. We are being quoted ~525 SCOM bus ticks to do a PMU control register access, which comes out to about 2.5 microseconds. If you figure 5 accesses to rotate the events on a PMU, that's a minimum of 12.5 microseconds. > > I've been planning to implement this for more than a week now, its just > that other stuff keeps getting in the way. > Well, it's not as if this is a trivial task either :) - Corey ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-03-30 22:12 ` Corey Ashford @ 2010-03-31 14:01 ` Peter Zijlstra 2010-03-31 14:13 ` stephane eranian ` (2 more replies) 0 siblings, 3 replies; 55+ messages in thread From: Peter Zijlstra @ 2010-03-31 14:01 UTC (permalink / raw) To: Corey Ashford Cc: Lin Ming, Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On Tue, 2010-03-30 at 15:12 -0700, Corey Ashford wrote: > > > Initially I'd not allow per-pmu-per-task contexts > > because then things like perf_event_task_sched_out() would get rather > > complex. > > Definitely. I don't think it makes sense to have per-task context on > nest/uncore PMUs. At least we haven't found any justification for it. For uncore no, but there is also the hw-breakpoint stuff that is being presented as a pmu, for those it would make sense to have a separate per-task context. But doing multiple per-task contexts is something for a next step indeed. > > For RR we can move away from perf_event_task_tick and let the pmu > > install a (hr)timer for this on their own. > > This is necessary I think, because of the access time for some of the PMU's. I > wonder though if it should, perhaps optionally, be off-loaded to a high-priority > task to do the switching so that access latency to the PMU can be controlled. > > As I mentioned when we met, some of the Wire-Speed processor nest PMU control > registers are accessed via SCOM, which is an internal, 200 MHz serial bus. We > are being quoted ~525 SCOM bus ticks to do a PMU control register access, which > comes out to about 2.5 microseconds. If you figure 5 accesses to rotate the > events on a PMU, that's a minimum of 12.5 microseconds. Yeah, you mentioned that.. 
for those things we need some changes anyway, since currently we install per-cpu counters using IPIs and expect the pmu::enable() method to be synchronous (it has a return value). It would be totally unacceptable to do 2.5ms pokes with IRQs disabled. The RR thing would be the easiest to solve, just let the timer wake up a thread instead of doing the work itself, that's fully isolated to how the pmu chooses to implement that. The above mentioned issue however would be much more challenging to fix nicely. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-03-31 14:01 ` Peter Zijlstra @ 2010-03-31 14:13 ` stephane eranian 2010-03-31 15:49 ` Maynard Johnson 2010-03-31 17:50 ` Corey Ashford 2 siblings, 0 replies; 55+ messages in thread From: stephane eranian @ 2010-03-31 14:13 UTC (permalink / raw) To: Peter Zijlstra Cc: Corey Ashford, Lin Ming, Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On Wed, Mar 31, 2010 at 4:01 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, 2010-03-30 at 15:12 -0700, Corey Ashford wrote: >> >> > Initially I'd not allow per-pmu-per-task contexts >> > because then things like perf_event_task_sched_out() would get rather >> > complex. >> >> Definitely. I don't think it makes sense to have per-task context on >> nest/uncore PMUs. At least we haven't found any justification for it. > > For uncore no, but there is also the hw-breakpoint stuff that is being > presented as a pmu, for those it would make sense to have a separate > per-task context. > > But doing multiple per-task contexts is something for a next step > indeed. > >> > For RR we can move away from perf_event_task_tick and let the pmu >> > install a (hr)timer for this on their own. >> >> This is necessary I think, because of the access time for some of the PMU's. I >> wonder though if it should, perhaps optionally, be off-loaded to a high-priority >> task to do the switching so that access latency to the PMU can be controlled. >> >> As I mentioned when we met, some of the Wire-Speed processor nest PMU control >> registers are accessed via SCOM, which is an internal, 200 MHz serial bus. We >> are being quoted ~525 SCOM bus ticks to do a PMU control register access, which >> comes out to about 2.5 microseconds. If you figure 5 accesses to rotate the >> events on a PMU, that's a minimum of 12.5 microseconds. 
> > Yeah, you mentioned that.. for those things we need some changes anyway, > since currently we install per-cpu counters using IPIs and expect the > pmu::enable() method to be synchronous (it has a return value). It would > be totally unacceptable to do 2.5ms pokes with IRQs disabled. > > The RR thing would be the easiest to solve, just let the timer wake up a > thread instead of doing the work itself, that's fully isolated to how > the pmu chooses to implement that. The above mentioned issue however > would be much more challenging to fix nicely. > > Also some of perf_enable()/perf_disable() would have to be per PMU and not global like they are today. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-03-31 14:01 ` Peter Zijlstra 2010-03-31 14:13 ` stephane eranian @ 2010-03-31 15:49 ` Maynard Johnson 2010-03-31 17:50 ` Corey Ashford 2 siblings, 0 replies; 55+ messages in thread From: Maynard Johnson @ 2010-03-31 15:49 UTC (permalink / raw) To: Peter Zijlstra Cc: Arnaldo Carvalho de Melo, Andi Kleen, cel, Corey Ashford, Stephane Eranian, Frederic Weisbecker, LKML, Masami Hiramatsu, Ingo Molnar, Lin Ming, Philip Mucci, Paul Mackerras, Steven Rostedt, Dan Terpstra, Xiao Guangrong Peter Zijlstra <peterz@infradead.org> wrote on 03/31/2010 09:01:56 AM: > On Tue, 2010-03-30 at 15:12 -0700, Corey Ashford wrote: > > > > > Initially I'd not allow per-pmu-per-task contexts > > > because then things like perf_event_task_sched_out() would get rather > > > complex. > > > > Definitely. I don't think it makes sense to have per-task context on > > nest/uncore PMUs. At least we haven't found any justification for it. > > For uncore no, but there is also the hw-breakpoint stuff that is being > presented as a pmu, for those it would make sense to have a separate > per-task context. hw-breakpoint presented as a pmu? hmmmm. IMHO, this is an example where shoehorning something into the perf_events subsystem that logically doesn't belong there just makes for more complexity in the code. > > But doing multiple per-task contexts is something for a next step > indeed. > [snip] ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-03-31 14:01 ` Peter Zijlstra 2010-03-31 14:13 ` stephane eranian 2010-03-31 15:49 ` Maynard Johnson @ 2010-03-31 17:50 ` Corey Ashford 2 siblings, 0 replies; 55+ messages in thread From: Corey Ashford @ 2010-03-31 17:50 UTC (permalink / raw) To: Peter Zijlstra Cc: Lin Ming, Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On 03/31/2010 07:01 AM, Peter Zijlstra wrote: > On Tue, 2010-03-30 at 15:12 -0700, Corey Ashford wrote: >> >>> Initially I'd not allow per-pmu-per-task contexts >>> because then things like perf_event_task_sched_out() would get rather >>> complex. >> >> Definitely. I don't think it makes sense to have per-task context on >> nest/uncore PMUs. At least we haven't found any justification for it. > > For uncore no, but there is also the hw-breakpoint stuff that is being > presented as a pmu, for those it would make sense to have a separate > per-task context. > > But doing multiple per-task contexts is something for a next step > indeed. > >>> For RR we can move away from perf_event_task_tick and let the pmu >>> install a (hr)timer for this on their own. >> >> This is necessary I think, because of the access time for some of the PMU's. I >> wonder though if it should, perhaps optionally, be off-loaded to a high-priority >> task to do the switching so that access latency to the PMU can be controlled. >> >> As I mentioned when we met, some of the Wire-Speed processor nest PMU control >> registers are accessed via SCOM, which is an internal, 200 MHz serial bus. We >> are being quoted ~525 SCOM bus ticks to do a PMU control register access, which >> comes out to about 2.5 microseconds. If you figure 5 accesses to rotate the >> events on a PMU, that's a minimum of 12.5 microseconds. > > Yeah, you mentioned that.. 
for those things we need some changes anyway, > since currently we install per-cpu counters using IPIs and expect the > pmu::enable() method to be synchronous (it has a return value). That's a good point. We hadn't considered this issue. > It would > be totally unacceptable to do 2.5ms pokes with IRQs disabled. Just to be clear, it's 2.5us, not 2.5ms, but I think it's still bad... in our case, it's about 6600 processor clocks per access. > The RR thing would be the easiest to solve, just let the timer wake up a > thread instead of doing the work itself, that's fully isolated to how > the pmu chooses to implement that. The above mentioned issue however > would be much more challenging to fix nicely. It seems like it might need to be done in two phases. IPI request is sent, and then a thread is woken up on the other CPU, it does some work, and then sets a status variable and somehow notifies the caller that the operation has completed. I don't know the kernel's communication mechanisms well enough to know which one is most appropriate - maybe rwsem? - Corey ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-03-30 17:15 ` Peter Zijlstra 2010-03-30 22:12 ` Corey Ashford @ 2010-04-15 21:16 ` Gary.Mohr 2010-04-16 13:24 ` Peter Zijlstra 1 sibling, 1 reply; 55+ messages in thread From: Gary.Mohr @ 2010-04-15 21:16 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Corey Ashford, Stephane Eranian, LKML > On Tue, 2010-03-30 at 09:49 -0700, Corey Ashford wrote: > > Right, I've got some definite ideas on how to go here, just need some > time to implement them. > > The first thing that needs to be done is get rid of all the __weak > functions (with exception of perf_callchain*, since that really is arch > specific). > > For hw_perf_event_init() we need to create a pmu registration facility > and lookup a pmu_id, either passed as an actual id found in sysfs or an > open file handle from sysfs (the cpu pmu would be pmu_id 0 for backwards > compat). > > hw_perf_disable/enable() would become struct pmu functions and > perf_disable/enable need to become per-pmu, most functions operate on a > specific event, for those we know the pmu and hence can call the per-pmu > version. (XXX find those sites where this is not true). > > Then we can move to context, yes I think we want new context for new > PMUs, otherwise we get very funny RR interleaving problems. My idea was > to move find_get_context() into struct pmu as well, this allows you to > have per-pmu contexts. Initially I'd not allow per-pmu-per-task contexts > because then things like perf_event_task_sched_out() would get rather > complex. > > For RR we can move away from perf_event_task_tick and let the pmu > install a (hr)timer for this on their own. > > I've been planning to implement this for more than a week now, its just > that other stuff keeps getting in the way. > Hi Peter, My name is Gary Mohr and I work for Bull Information Systems. I have been following your discussions with Corey (and others) about how to implement support for nest PMU's in the linux kernel. 
My company feels that support for Intel Nehalem uncore events is very
important to our customers. Has the "other stuff" mentioned above quieted
down to allow you to get started on building support for these features?
If development is actually in progress, would you be willing to make a
guess as to which version of the kernel may offer the new capabilities?

As I said, we are interested, so if there is any way we can assist you,
please let us know. We would be happy to take experimental patch sets and
validate, test, and debug any problems we encounter if that would help
your development.

Thanks for your time.

Gary

^ permalink raw reply	[flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-04-15 21:16 ` Gary.Mohr @ 2010-04-16 13:24 ` Peter Zijlstra 2010-04-19 9:08 ` Lin Ming 2010-04-20 11:55 ` Lin Ming 0 siblings, 2 replies; 55+ messages in thread From: Peter Zijlstra @ 2010-04-16 13:24 UTC (permalink / raw) To: Gary.Mohr; +Cc: Corey Ashford, Stephane Eranian, LKML On Thu, 2010-04-15 at 14:16 -0700, Gary.Mohr@Bull.com wrote: > > On Tue, 2010-03-30 at 09:49 -0700, Corey Ashford wrote: > > > > Right, I've got some definite ideas on how to go here, just need some > > time to implement them. > > > > The first thing that needs to be done is get rid of all the __weak > > functions (with exception of perf_callchain*, since that really is arch > > specific). > > > > For hw_perf_event_init() we need to create a pmu registration facility > > and lookup a pmu_id, either passed as an actual id found in sysfs or an > > open file handle from sysfs (the cpu pmu would be pmu_id 0 for backwards > > compat). > > > > hw_perf_disable/enable() would become struct pmu functions and > > perf_disable/enable need to become per-pmu, most functions operate on a > > specific event, for those we know the pmu and hence can call the per-pmu > > version. (XXX find those sites where this is not true). > > > > Then we can move to context, yes I think we want new context for new > > PMUs, otherwise we get very funny RR interleaving problems. My idea was > > to move find_get_context() into struct pmu as well, this allows you to > > have per-pmu contexts. Initially I'd not allow per-pmu-per-task contexts > > because then things like perf_event_task_sched_out() would get rather > > complex. > > > > For RR we can move away from perf_event_task_tick and let the pmu > > install a (hr)timer for this on their own. > > > > I've been planning to implement this for more than a week now, its just > > that other stuff keeps getting in the way. > > > > Hi Peter, > > My name is Gary Mohr and I work for Bull Information Systems. 
I have been > following your discussions with Corey (and others) about how to implement > support for nest PMU's in the linux kernel. > > My company feels that support for Intel Nehalem uncore events is very > important to our customers. Has the "other stuff" mentioned above quited down to > allow you to get started on building support for these features ?? Sadly no. > If development > is actually in progress, would you be willing to make a guess as to which > version of the kernel may offer the new capabilities ?? > > As I said we are interested so if there is any way we can assist you, > please let us know. We would be happy to take experimental patch sets and > validate, test, and debug any problems we encounter if that would help your > development. Supply patches to make the above happen ;-) One thing not on that list, which should happen first I guess, is to remove hw_perf_group_sched_in(). The idea is to add some sort of transactional API to the struct pmu, so that we can delay the schedulability check until commit time (and roll back when it fails). Something as simple as: struct pmu { void start_txn(struct pmu *); void commit_txn(struct pmu *); ,,, }; and then change group_sched_in() to use this instead of hw_perf_group_sched_in(), whose implementations mostly replicate group_sched_in() in various buggy ways anyway. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-04-16 13:24 ` Peter Zijlstra @ 2010-04-19 9:08 ` Lin Ming 2010-04-19 9:27 ` Peter Zijlstra 2010-04-20 11:55 ` Lin Ming 1 sibling, 1 reply; 55+ messages in thread From: Lin Ming @ 2010-04-19 9:08 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Gary.Mohr@Bull.com, Corey Ashford, Stephane Eranian, LKML On Fri, 2010-04-16 at 21:24 +0800, Peter Zijlstra wrote: > On Thu, 2010-04-15 at 14:16 -0700, Gary.Mohr@Bull.com wrote: > > > On Tue, 2010-03-30 at 09:49 -0700, Corey Ashford wrote: > > > > > > Right, I've got some definite ideas on how to go here, just need some > > > time to implement them. > > > > > > The first thing that needs to be done is get rid of all the __weak > > > functions (with exception of perf_callchain*, since that really is arch > > > specific). > > > > > > For hw_perf_event_init() we need to create a pmu registration facility > > > and lookup a pmu_id, either passed as an actual id found in sysfs or an > > > open file handle from sysfs (the cpu pmu would be pmu_id 0 for backwards > > > compat). > > > > > > hw_perf_disable/enable() would become struct pmu functions and > > > perf_disable/enable need to become per-pmu, most functions operate on a > > > specific event, for those we know the pmu and hence can call the per-pmu > > > version. (XXX find those sites where this is not true). > > > > > > Then we can move to context, yes I think we want new context for new > > > PMUs, otherwise we get very funny RR interleaving problems. My idea was > > > to move find_get_context() into struct pmu as well, this allows you to > > > have per-pmu contexts. Initially I'd not allow per-pmu-per-task contexts > > > because then things like perf_event_task_sched_out() would get rather > > > complex. > > > > > > For RR we can move away from perf_event_task_tick and let the pmu > > > install a (hr)timer for this on their own. 
> > > > > > I've been planning to implement this for more than a week now, its just > > > that other stuff keeps getting in the way. > > > > > > > Hi Peter, > > > > My name is Gary Mohr and I work for Bull Information Systems. I have been > > following your discussions with Corey (and others) about how to implement > > support for nest PMU's in the linux kernel. > > > > My company feels that support for Intel Nehalem uncore events is very > > important to our customers. Has the "other stuff" mentioned above quited down to > > allow you to get started on building support for these features ?? > > Sadly no. > > > If development > > is actually in progress, would you be willing to make a guess as to which > > version of the kernel may offer the new capabilities ?? > > > > As I said we are interested so if there is any way we can assist you, > > please let us know. We would be happy to take experimental patch sets and > > validate, test, and debug any problems we encounter if that would help your > > development. > > Supply patches to make the above happen ;-) Hi, I have been also looking at this for some time. I'll send a draft patch later this week to support multiple hw pmus. Lin Ming ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-04-19 9:08 ` Lin Ming @ 2010-04-19 9:27 ` Peter Zijlstra 0 siblings, 0 replies; 55+ messages in thread From: Peter Zijlstra @ 2010-04-19 9:27 UTC (permalink / raw) To: Lin Ming; +Cc: Gary.Mohr@Bull.com, Corey Ashford, Stephane Eranian, LKML On Mon, 2010-04-19 at 17:08 +0800, Lin Ming wrote: > I have been also looking at this for some time. > I'll send a draft patch later this week to support multiple hw pmus. Well, that's going to be a mighty large patch. I'd rather see smaller patches, say one patch per removal of a weak hw_perf interface, etc... ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-04-16 13:24 ` Peter Zijlstra 2010-04-19 9:08 ` Lin Ming @ 2010-04-20 11:55 ` Lin Ming 2010-04-20 12:03 ` Peter Zijlstra 1 sibling, 1 reply; 55+ messages in thread From: Lin Ming @ 2010-04-20 11:55 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Gary.Mohr@Bull.com, Corey Ashford, Stephane Eranian, LKML On Fri, 2010-04-16 at 21:24 +0800, Peter Zijlstra wrote: > On Thu, 2010-04-15 at 14:16 -0700, Gary.Mohr@Bull.com wrote: > > > On Tue, 2010-03-30 at 09:49 -0700, Corey Ashford wrote: > > > > > > Right, I've got some definite ideas on how to go here, just need some > > > time to implement them. > > > > > > The first thing that needs to be done is get rid of all the __weak > > > functions (with exception of perf_callchain*, since that really is arch > > > specific). > > > > > > For hw_perf_event_init() we need to create a pmu registration facility > > > and lookup a pmu_id, either passed as an actual id found in sysfs or an > > > open file handle from sysfs (the cpu pmu would be pmu_id 0 for backwards > > > compat). > > > > > > hw_perf_disable/enable() would become struct pmu functions and > > > perf_disable/enable need to become per-pmu, most functions operate on a > > > specific event, for those we know the pmu and hence can call the per-pmu > > > version. (XXX find those sites where this is not true). > > > > > > Then we can move to context, yes I think we want new context for new > > > PMUs, otherwise we get very funny RR interleaving problems. My idea was > > > to move find_get_context() into struct pmu as well, this allows you to > > > have per-pmu contexts. Initially I'd not allow per-pmu-per-task contexts > > > because then things like perf_event_task_sched_out() would get rather > > > complex. > > > > > > For RR we can move away from perf_event_task_tick and let the pmu > > > install a (hr)timer for this on their own. 
> > > > > > I've been planning to implement this for more than a week now, its just > > > that other stuff keeps getting in the way. > > > > > > > Hi Peter, > > > > My name is Gary Mohr and I work for Bull Information Systems. I have been > > following your discussions with Corey (and others) about how to implement > > support for nest PMU's in the linux kernel. > > > > My company feels that support for Intel Nehalem uncore events is very > > important to our customers. Has the "other stuff" mentioned above quited down to > > allow you to get started on building support for these features ?? > > Sadly no. > > > If development > > is actually in progress, would you be willing to make a guess as to which > > version of the kernel may offer the new capabilities ?? > > > > As I said we are interested so if there is any way we can assist you, > > please let us know. We would be happy to take experimental patch sets and > > validate, test, and debug any problems we encounter if that would help your > > development. > > Supply patches to make the above happen ;-) > > One thing not on that list, which should happen first I guess, is to > remove hw_perf_group_sched_in(). The idea is to add some sort of > transactional API to the struct pmu, so that we can delay the > schedulability check until commit time (and roll back when it fails). > > Something as simple as: > > struct pmu { > void start_txn(struct pmu *); > void commit_txn(struct pmu *); > > ,,, > }; Could you please explain a bit more? Does it mean that "start_txn" perform the schedule events stuff and "commit_txn" perform the assign events stuff? Does "commit time" mean the actual activation in hw_perf_enable? Thanks, Lin Ming > > and then change group_sched_in() to use this instead of > hw_perf_group_sched_in(), whose implementations mostly replicate > group_sched_in() in various buggy ways anyway. > ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-04-20 11:55 ` Lin Ming @ 2010-04-20 12:03 ` Peter Zijlstra 2010-04-21 8:08 ` Lin Ming 0 siblings, 1 reply; 55+ messages in thread From: Peter Zijlstra @ 2010-04-20 12:03 UTC (permalink / raw) To: Lin Ming; +Cc: Gary.Mohr@Bull.com, Corey Ashford, Stephane Eranian, LKML On Tue, 2010-04-20 at 19:55 +0800, Lin Ming wrote: > > One thing not on that list, which should happen first I guess, is to > > remove hw_perf_group_sched_in(). The idea is to add some sort of > > transactional API to the struct pmu, so that we can delay the > > schedulability check until commit time (and roll back when it fails). > > > > Something as simple as: > > > > struct pmu { > > void start_txn(struct pmu *); > > void commit_txn(struct pmu *); > > > > ,,, > > }; > > Could you please explain a bit more? > > Does it mean that "start_txn" perform the schedule events stuff > and "commit_txn" perform the assign events stuff? > > Does "commit time" mean the actual activation in hw_perf_enable? No, the idea behind hw_perf_group_sched_in() is to not perform schedulability tests on each event in the group, but to add the group as a whole and then perform one test. Of course, when that test fails, you'll have to roll back the whole group again. So start_txn (or a better name) would simply toggle a flag in the pmu implementation that will make pmu::enable() not perform the schedulability test. Then commit_txn() will perform the schedulability test (so note the method has to have a !void return value, my mistake in the earlier email). This will allow us to use the regular kernel/perf_event.c::group_sched_in() and all the rollback code. Currently each hw_perf_group_sched_in() implementation duplicates all the rollback code (with various bugs). We must get rid of all weak hw_perf_*() functions before we can properly consider multiple struct pmu implementations. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-04-20 12:03 ` Peter Zijlstra @ 2010-04-21 8:08 ` Lin Ming 2010-04-21 8:32 ` stephane eranian 0 siblings, 1 reply; 55+ messages in thread From: Lin Ming @ 2010-04-21 8:08 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Gary.Mohr@Bull.com, Corey Ashford, Stephane Eranian, LKML On Tue, 2010-04-20 at 20:03 +0800, Peter Zijlstra wrote: > On Tue, 2010-04-20 at 19:55 +0800, Lin Ming wrote: > > > > One thing not on that list, which should happen first I guess, is to > > > remove hw_perf_group_sched_in(). The idea is to add some sort of > > > transactional API to the struct pmu, so that we can delay the > > > schedulability check until commit time (and roll back when it fails). > > > > > > Something as simple as: > > > > > > struct pmu { > > > void start_txn(struct pmu *); > > > void commit_txn(struct pmu *); > > > > > > ,,, > > > }; > > > > Could you please explain a bit more? > > > > Does it mean that "start_txn" perform the schedule events stuff > > and "commit_txn" perform the assign events stuff? > > > > Does "commit time" mean the actual activation in hw_perf_enable? > > No, the idea behind hw_perf_group_sched_in() is to not perform > schedulability tests on each event in the group, but to add the group as > a whole and then perform one test. > > Of course, when that test fails, you'll have to roll-back the whole > group again. > > So start_txn (or a better name) would simply toggle a flag in the pmu > implementation that will make pmu::enable() not perform the > schedulablilty test. > > Then commit_txn() will perform the schedulability test (so note the > method has to have a !void return value, my mistake in the earlier > email). > > This will allow us to use the regular > kernel/perf_event.c::group_sched_in() and all the rollback code. > Currently each hw_perf_group_sched_in() implementation duplicates all > the rolllback code (with various bugs). 
> > > > We must get rid of all weak hw_perf_*() functions before we can properly > consider multiple struct pmu implementations. > Thanks for the clear explanation. Does below patch show what you mean? I only touch the x86 arch code now. --- arch/x86/kernel/cpu/perf_event.c | 161 +++++++++++-------------------------- include/linux/perf_event.h | 10 ++- kernel/perf_event.c | 28 +++---- 3 files changed, 67 insertions(+), 132 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c index 626154a..62aa9a1 100644 --- a/arch/x86/kernel/cpu/perf_event.c +++ b/arch/x86/kernel/cpu/perf_event.c @@ -944,6 +944,9 @@ static int x86_pmu_enable(struct perf_event *event) if (n < 0) return n; + if (!(event->pmu->flag & PERF_EVENT_TRAN_STARTED)) + goto out; + ret = x86_pmu.schedule_events(cpuc, n, assign); if (ret) return ret; @@ -953,6 +956,7 @@ static int x86_pmu_enable(struct perf_event *event) */ memcpy(cpuc->assign, assign, n*sizeof(int)); +out: cpuc->n_events = n; cpuc->n_added += n - n0; @@ -1210,119 +1214,6 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event) return &unconstrained; } -static int x86_event_sched_in(struct perf_event *event, - struct perf_cpu_context *cpuctx) -{ - int ret = 0; - - event->state = PERF_EVENT_STATE_ACTIVE; - event->oncpu = smp_processor_id(); - event->tstamp_running += event->ctx->time - event->tstamp_stopped; - - if (!is_x86_event(event)) - ret = event->pmu->enable(event); - - if (!ret && !is_software_event(event)) - cpuctx->active_oncpu++; - - if (!ret && event->attr.exclusive) - cpuctx->exclusive = 1; - - return ret; -} - -static void x86_event_sched_out(struct perf_event *event, - struct perf_cpu_context *cpuctx) -{ - event->state = PERF_EVENT_STATE_INACTIVE; - event->oncpu = -1; - - if (!is_x86_event(event)) - event->pmu->disable(event); - - event->tstamp_running -= event->ctx->time - event->tstamp_stopped; - - if (!is_software_event(event)) - cpuctx->active_oncpu--; - - if 
(event->attr.exclusive || !cpuctx->active_oncpu) - cpuctx->exclusive = 0; -} - -/* - * Called to enable a whole group of events. - * Returns 1 if the group was enabled, or -EAGAIN if it could not be. - * Assumes the caller has disabled interrupts and has - * frozen the PMU with hw_perf_save_disable. - * - * called with PMU disabled. If successful and return value 1, - * then guaranteed to call perf_enable() and hw_perf_enable() - */ -int hw_perf_group_sched_in(struct perf_event *leader, - struct perf_cpu_context *cpuctx, - struct perf_event_context *ctx) -{ - struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events); - struct perf_event *sub; - int assign[X86_PMC_IDX_MAX]; - int n0, n1, ret; - - if (!x86_pmu_initialized()) - return 0; - - /* n0 = total number of events */ - n0 = collect_events(cpuc, leader, true); - if (n0 < 0) - return n0; - - ret = x86_pmu.schedule_events(cpuc, n0, assign); - if (ret) - return ret; - - ret = x86_event_sched_in(leader, cpuctx); - if (ret) - return ret; - - n1 = 1; - list_for_each_entry(sub, &leader->sibling_list, group_entry) { - if (sub->state > PERF_EVENT_STATE_OFF) { - ret = x86_event_sched_in(sub, cpuctx); - if (ret) - goto undo; - ++n1; - } - } - /* - * copy new assignment, now we know it is possible - * will be used by hw_perf_enable() - */ - memcpy(cpuc->assign, assign, n0*sizeof(int)); - - cpuc->n_events = n0; - cpuc->n_added += n1; - ctx->nr_active += n1; - - /* - * 1 means successful and events are active - * This is not quite true because we defer - * actual activation until hw_perf_enable() but - * this way we* ensure caller won't try to enable - * individual events - */ - return 1; -undo: - x86_event_sched_out(leader, cpuctx); - n0 = 1; - list_for_each_entry(sub, &leader->sibling_list, group_entry) { - if (sub->state == PERF_EVENT_STATE_ACTIVE) { - x86_event_sched_out(sub, cpuctx); - if (++n0 == n1) - break; - } - } - return ret; -} - #include "perf_event_amd.c" #include "perf_event_p6.c" #include "perf_event_p4.c" 
@@ -1454,6 +1345,47 @@ static inline void x86_pmu_read(struct perf_event *event) x86_perf_event_update(event); } +/* + * Set the flag to make pmu::enable() not perform the + * schedulablilty test. + */ +static void x86_pmu_start_txn(struct pmu *pmu) +{ + pmu->flag |= PERF_EVENT_TRAN_STARTED; +} + +static void x86_pmu_stop_txn(struct pmu *pmu) +{ + pmu->flag &= ~PERF_EVENT_TRAN_STARTED; +} + +/* + * Return 0 if commit transaction success + */ +static int x86_pmu_commit_txn(struct pmu *pmu) +{ + struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events); + int assign[X86_PMC_IDX_MAX]; + int n, ret; + + n = cpuc->n_events; + + if (!x86_pmu_initialized()) + return -EAGAIN; + + ret = x86_pmu.schedule_events(cpuc, n, assign); + if (ret) + return ret; + + /* + * copy new assignment, now we know it is possible + * will be used by hw_perf_enable() + */ + memcpy(cpuc->assign, assign, n*sizeof(int)); + + return 0; +} + static const struct pmu pmu = { .enable = x86_pmu_enable, .disable = x86_pmu_disable, @@ -1461,6 +1393,9 @@ static const struct pmu pmu = { .stop = x86_pmu_stop, .read = x86_pmu_read, .unthrottle = x86_pmu_unthrottle, + .start_txn = x86_pmu_start_txn, + .stop_txn = x86_pmu_stop_txn, + .commit_txn = x86_pmu_commit_txn, }; /* diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index bf896d0..93aa8d8 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -524,6 +524,8 @@ struct hw_perf_event { struct perf_event; +#define PERF_EVENT_TRAN_STARTED 1 + /** * struct pmu - generic performance monitoring unit */ @@ -534,6 +536,11 @@ struct pmu { void (*stop) (struct perf_event *event); void (*read) (struct perf_event *event); void (*unthrottle) (struct perf_event *event); + void (*start_txn) (struct pmu *pmu); + void (*stop_txn) (struct pmu *pmu); + int (*commit_txn) (struct pmu *pmu); + + u8 flag; }; /** @@ -799,9 +806,6 @@ extern void perf_disable(void); extern void perf_enable(void); extern int perf_event_task_disable(void); 
extern int perf_event_task_enable(void); -extern int hw_perf_group_sched_in(struct perf_event *group_leader, - struct perf_cpu_context *cpuctx, - struct perf_event_context *ctx); extern void perf_event_update_userpage(struct perf_event *event); extern int perf_event_release_kernel(struct perf_event *event); extern struct perf_event * diff --git a/kernel/perf_event.c b/kernel/perf_event.c index 07b7a43..4537676 100644 --- a/kernel/perf_event.c +++ b/kernel/perf_event.c @@ -83,14 +83,6 @@ extern __weak const struct pmu *hw_perf_event_init(struct perf_event *event) void __weak hw_perf_disable(void) { barrier(); } void __weak hw_perf_enable(void) { barrier(); } -int __weak -hw_perf_group_sched_in(struct perf_event *group_leader, - struct perf_cpu_context *cpuctx, - struct perf_event_context *ctx) -{ - return 0; -} - void __weak perf_event_print_debug(void) { } static DEFINE_PER_CPU(int, perf_disable_count); @@ -642,14 +634,13 @@ group_sched_in(struct perf_event *group_event, struct perf_event_context *ctx) { struct perf_event *event, *partial_group; + struct pmu *pmu = (struct pmu *)group_event->pmu; int ret; if (group_event->state == PERF_EVENT_STATE_OFF) return 0; - ret = hw_perf_group_sched_in(group_event, cpuctx, ctx); - if (ret) - return ret < 0 ? 
ret : 0; + pmu->start_txn(pmu); if (event_sched_in(group_event, cpuctx, ctx)) return -EAGAIN; @@ -664,16 +655,21 @@ group_sched_in(struct perf_event *group_event, } } - return 0; + ret = pmu->commit_txn(pmu); + if (!ret) { + pmu->stop_txn(pmu); + return 0; + } group_error: + pmu->stop_txn(pmu); + /* - * Groups can be scheduled in as one unit only, so undo any - * partial group before returning: + * Commit transaction fails, rollback + * Groups can be scheduled in as one unit only, so undo + * whole group before returning: */ list_for_each_entry(event, &group_event->sibling_list, group_entry) { - if (event == partial_group) - break; event_sched_out(event, cpuctx, ctx); } event_sched_out(group_event, cpuctx, ctx); ^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-04-21 8:08 ` Lin Ming @ 2010-04-21 8:32 ` stephane eranian 2010-04-21 8:39 ` Lin Ming 0 siblings, 1 reply; 55+ messages in thread From: stephane eranian @ 2010-04-21 8:32 UTC (permalink / raw) To: Lin Ming; +Cc: Peter Zijlstra, Gary.Mohr@Bull.com, Corey Ashford, LKML Seems to me that struct pmu is a shared resource across all CPUs. I don't understand why scheduling on one CPU would have to impact all the other CPUs, unless I am missing something here. On Wed, Apr 21, 2010 at 10:08 AM, Lin Ming <ming.m.lin@intel.com> wrote: > On Tue, 2010-04-20 at 20:03 +0800, Peter Zijlstra wrote: >> On Tue, 2010-04-20 at 19:55 +0800, Lin Ming wrote: >> >> > > One thing not on that list, which should happen first I guess, is to >> > > remove hw_perf_group_sched_in(). The idea is to add some sort of >> > > transactional API to the struct pmu, so that we can delay the >> > > schedulability check until commit time (and roll back when it fails). >> > > >> > > Something as simple as: >> > > >> > > struct pmu { >> > > void start_txn(struct pmu *); >> > > void commit_txn(struct pmu *); >> > > >> > > ,,, >> > > }; >> > >> > Could you please explain a bit more? >> > >> > Does it mean that "start_txn" perform the schedule events stuff >> > and "commit_txn" perform the assign events stuff? >> > >> > Does "commit time" mean the actual activation in hw_perf_enable? >> >> No, the idea behind hw_perf_group_sched_in() is to not perform >> schedulability tests on each event in the group, but to add the group as >> a whole and then perform one test. >> >> Of course, when that test fails, you'll have to roll-back the whole >> group again. >> >> So start_txn (or a better name) would simply toggle a flag in the pmu >> implementation that will make pmu::enable() not perform the >> schedulablilty test. 
>> >> Then commit_txn() will perform the schedulability test (so note the >> method has to have a !void return value, my mistake in the earlier >> email). >> >> This will allow us to use the regular >> kernel/perf_event.c::group_sched_in() and all the rollback code. >> Currently each hw_perf_group_sched_in() implementation duplicates all >> the rolllback code (with various bugs). >> >> >> >> We must get rid of all weak hw_perf_*() functions before we can properly >> consider multiple struct pmu implementations. >> > > Thanks for the clear explanation. > > Does below patch show what you mean? > > I only touch the x86 arch code now. > > --- > arch/x86/kernel/cpu/perf_event.c | 161 +++++++++++-------------------------- > include/linux/perf_event.h | 10 ++- > kernel/perf_event.c | 28 +++---- > 3 files changed, 67 insertions(+), 132 deletions(-) > > diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c > index 626154a..62aa9a1 100644 > --- a/arch/x86/kernel/cpu/perf_event.c > +++ b/arch/x86/kernel/cpu/perf_event.c > @@ -944,6 +944,9 @@ static int x86_pmu_enable(struct perf_event *event) > if (n < 0) > return n; > > + if (!(event->pmu->flag & PERF_EVENT_TRAN_STARTED)) > + goto out; > + > ret = x86_pmu.schedule_events(cpuc, n, assign); > if (ret) > return ret; > @@ -953,6 +956,7 @@ static int x86_pmu_enable(struct perf_event *event) > */ > memcpy(cpuc->assign, assign, n*sizeof(int)); > > +out: > cpuc->n_events = n; > cpuc->n_added += n - n0; > > @@ -1210,119 +1214,6 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event) > return &unconstrained; > } > > -static int x86_event_sched_in(struct perf_event *event, > - struct perf_cpu_context *cpuctx) > -{ > - int ret = 0; > - > - event->state = PERF_EVENT_STATE_ACTIVE; > - event->oncpu = smp_processor_id(); > - event->tstamp_running += event->ctx->time - event->tstamp_stopped; > - > - if (!is_x86_event(event)) > - ret = event->pmu->enable(event); > - > - if (!ret && 
!is_software_event(event)) > - cpuctx->active_oncpu++; > - > - if (!ret && event->attr.exclusive) > - cpuctx->exclusive = 1; > - > - return ret; > -} > - > -static void x86_event_sched_out(struct perf_event *event, > - struct perf_cpu_context *cpuctx) > -{ > - event->state = PERF_EVENT_STATE_INACTIVE; > - event->oncpu = -1; > - > - if (!is_x86_event(event)) > - event->pmu->disable(event); > - > - event->tstamp_running -= event->ctx->time - event->tstamp_stopped; > - > - if (!is_software_event(event)) > - cpuctx->active_oncpu--; > - > - if (event->attr.exclusive || !cpuctx->active_oncpu) > - cpuctx->exclusive = 0; > -} > - > -/* > - * Called to enable a whole group of events. > - * Returns 1 if the group was enabled, or -EAGAIN if it could not be. > - * Assumes the caller has disabled interrupts and has > - * frozen the PMU with hw_perf_save_disable. > - * > - * called with PMU disabled. If successful and return value 1, > - * then guaranteed to call perf_enable() and hw_perf_enable() > - */ > -int hw_perf_group_sched_in(struct perf_event *leader, > - struct perf_cpu_context *cpuctx, > - struct perf_event_context *ctx) > -{ > - struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events); > - struct perf_event *sub; > - int assign[X86_PMC_IDX_MAX]; > - int n0, n1, ret; > - > - if (!x86_pmu_initialized()) > - return 0; > - > - /* n0 = total number of events */ > - n0 = collect_events(cpuc, leader, true); > - if (n0 < 0) > - return n0; > - > - ret = x86_pmu.schedule_events(cpuc, n0, assign); > - if (ret) > - return ret; > - > - ret = x86_event_sched_in(leader, cpuctx); > - if (ret) > - return ret; > - > - n1 = 1; > - list_for_each_entry(sub, &leader->sibling_list, group_entry) { > - if (sub->state > PERF_EVENT_STATE_OFF) { > - ret = x86_event_sched_in(sub, cpuctx); > - if (ret) > - goto undo; > - ++n1; > - } > - } > - /* > - * copy new assignment, now we know it is possible > - * will be used by hw_perf_enable() > - */ > - memcpy(cpuc->assign, assign, n0*sizeof(int)); > 
- > - cpuc->n_events = n0; > - cpuc->n_added += n1; > - ctx->nr_active += n1; > - > - /* > - * 1 means successful and events are active > - * This is not quite true because we defer > - * actual activation until hw_perf_enable() but > - * this way we* ensure caller won't try to enable > - * individual events > - */ > - return 1; > -undo: > - x86_event_sched_out(leader, cpuctx); > - n0 = 1; > - list_for_each_entry(sub, &leader->sibling_list, group_entry) { > - if (sub->state == PERF_EVENT_STATE_ACTIVE) { > - x86_event_sched_out(sub, cpuctx); > - if (++n0 == n1) > - break; > - } > - } > - return ret; > -} > - > #include "perf_event_amd.c" > #include "perf_event_p6.c" > #include "perf_event_p4.c" > @@ -1454,6 +1345,47 @@ static inline void x86_pmu_read(struct perf_event *event) > x86_perf_event_update(event); > } > > +/* > + * Set the flag to make pmu::enable() not perform the > + * schedulablilty test. > + */ > +static void x86_pmu_start_txn(struct pmu *pmu) > +{ > + pmu->flag |= PERF_EVENT_TRAN_STARTED; > +} > + > +static void x86_pmu_stop_txn(struct pmu *pmu) > +{ > + pmu->flag &= ~PERF_EVENT_TRAN_STARTED; > +} > + > +/* > + * Return 0 if commit transaction success > + */ > +static int x86_pmu_commit_txn(struct pmu *pmu) > +{ > + struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events); > + int assign[X86_PMC_IDX_MAX]; > + int n, ret; > + > + n = cpuc->n_events; > + > + if (!x86_pmu_initialized()) > + return -EAGAIN; > + > + ret = x86_pmu.schedule_events(cpuc, n, assign); > + if (ret) > + return ret; > + > + /* > + * copy new assignment, now we know it is possible > + * will be used by hw_perf_enable() > + */ > + memcpy(cpuc->assign, assign, n*sizeof(int)); > + > + return 0; > +} > + > static const struct pmu pmu = { > .enable = x86_pmu_enable, > .disable = x86_pmu_disable, > @@ -1461,6 +1393,9 @@ static const struct pmu pmu = { > .stop = x86_pmu_stop, > .read = x86_pmu_read, > .unthrottle = x86_pmu_unthrottle, > + .start_txn = x86_pmu_start_txn, > + .stop_txn = 
x86_pmu_stop_txn, > + .commit_txn = x86_pmu_commit_txn, > }; > > /* > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h > index bf896d0..93aa8d8 100644 > --- a/include/linux/perf_event.h > +++ b/include/linux/perf_event.h > @@ -524,6 +524,8 @@ struct hw_perf_event { > > struct perf_event; > > +#define PERF_EVENT_TRAN_STARTED 1 > + > /** > * struct pmu - generic performance monitoring unit > */ > @@ -534,6 +536,11 @@ struct pmu { > void (*stop) (struct perf_event *event); > void (*read) (struct perf_event *event); > void (*unthrottle) (struct perf_event *event); > + void (*start_txn) (struct pmu *pmu); > + void (*stop_txn) (struct pmu *pmu); > + int (*commit_txn) (struct pmu *pmu); > + > + u8 flag; > }; > > /** > @@ -799,9 +806,6 @@ extern void perf_disable(void); > extern void perf_enable(void); > extern int perf_event_task_disable(void); > extern int perf_event_task_enable(void); > -extern int hw_perf_group_sched_in(struct perf_event *group_leader, > - struct perf_cpu_context *cpuctx, > - struct perf_event_context *ctx); > extern void perf_event_update_userpage(struct perf_event *event); > extern int perf_event_release_kernel(struct perf_event *event); > extern struct perf_event * > diff --git a/kernel/perf_event.c b/kernel/perf_event.c > index 07b7a43..4537676 100644 > --- a/kernel/perf_event.c > +++ b/kernel/perf_event.c > @@ -83,14 +83,6 @@ extern __weak const struct pmu *hw_perf_event_init(struct perf_event *event) > void __weak hw_perf_disable(void) { barrier(); } > void __weak hw_perf_enable(void) { barrier(); } > > -int __weak > -hw_perf_group_sched_in(struct perf_event *group_leader, > - struct perf_cpu_context *cpuctx, > - struct perf_event_context *ctx) > -{ > - return 0; > -} > - > void __weak perf_event_print_debug(void) { } > > static DEFINE_PER_CPU(int, perf_disable_count); > @@ -642,14 +634,13 @@ group_sched_in(struct perf_event *group_event, > struct perf_event_context *ctx) > { > struct perf_event *event, *partial_group; > + 
struct pmu *pmu = (struct pmu *)group_event->pmu; > int ret; > > if (group_event->state == PERF_EVENT_STATE_OFF) > return 0; > > - ret = hw_perf_group_sched_in(group_event, cpuctx, ctx); > - if (ret) > - return ret < 0 ? ret : 0; > + pmu->start_txn(pmu); > > if (event_sched_in(group_event, cpuctx, ctx)) > return -EAGAIN; > @@ -664,16 +655,21 @@ group_sched_in(struct perf_event *group_event, > } > } > > - return 0; > + ret = pmu->commit_txn(pmu); > + if (!ret) { > + pmu->stop_txn(pmu); > + return 0; > + } > > group_error: > + pmu->stop_txn(pmu); > + > /* > - * Groups can be scheduled in as one unit only, so undo any > - * partial group before returning: > + * Commit transaction fails, rollback > + * Groups can be scheduled in as one unit only, so undo > + * whole group before returning: > */ > list_for_each_entry(event, &group_event->sibling_list, group_entry) { > - if (event == partial_group) > - break; > event_sched_out(event, cpuctx, ctx); > } > event_sched_out(group_event, cpuctx, ctx); > > > ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-04-21 8:32 ` stephane eranian @ 2010-04-21 8:39 ` Lin Ming 2010-04-21 8:44 ` stephane eranian 0 siblings, 1 reply; 55+ messages in thread From: Lin Ming @ 2010-04-21 8:39 UTC (permalink / raw) To: eranian@gmail.com; +Cc: Peter Zijlstra, Gary.Mohr@Bull.com, Corey Ashford, LKML On Wed, 2010-04-21 at 16:32 +0800, stephane eranian wrote: > Seems to me that struct pmu is a shared resource across all CPUs. > I don't understand why scheduling on one CPU would have to impact > all the other CPUs, unless I am missing something here. Do you mean the pmu->flag? You are right, pmu->flag should be per cpu data. Will update the patch. Thanks, Lin Ming > > > On Wed, Apr 21, 2010 at 10:08 AM, Lin Ming <ming.m.lin@intel.com> wrote: > > On Tue, 2010-04-20 at 20:03 +0800, Peter Zijlstra wrote: > >> On Tue, 2010-04-20 at 19:55 +0800, Lin Ming wrote: > >> > >> > > One thing not on that list, which should happen first I guess, is to > >> > > remove hw_perf_group_sched_in(). The idea is to add some sort of > >> > > transactional API to the struct pmu, so that we can delay the > >> > > schedulability check until commit time (and roll back when it fails). > >> > > > >> > > Something as simple as: > >> > > > >> > > struct pmu { > >> > > void start_txn(struct pmu *); > >> > > void commit_txn(struct pmu *); > >> > > > >> > > ,,, > >> > > }; > >> > > >> > Could you please explain a bit more? > >> > > >> > Does it mean that "start_txn" perform the schedule events stuff > >> > and "commit_txn" perform the assign events stuff? > >> > > >> > Does "commit time" mean the actual activation in hw_perf_enable? > >> > >> No, the idea behind hw_perf_group_sched_in() is to not perform > >> schedulability tests on each event in the group, but to add the group as > >> a whole and then perform one test. > >> > >> Of course, when that test fails, you'll have to roll-back the whole > >> group again. 
> >> >> So start_txn (or a better name) would simply toggle a flag in the pmu > >> implementation that will make pmu::enable() not perform the > >> schedulability test. > >> > >> Then commit_txn() will perform the schedulability test (so note the > >> method has to have a !void return value, my mistake in the earlier > >> email). > >> > >> This will allow us to use the regular > >> kernel/perf_event.c::group_sched_in() and all the rollback code. > >> Currently each hw_perf_group_sched_in() implementation duplicates all > >> the rollback code (with various bugs). > >> > >> > >> > >> We must get rid of all weak hw_perf_*() functions before we can properly > >> consider multiple struct pmu implementations. > >> > > > > Thanks for the clear explanation. > > > > Does below patch show what you mean? > > > > I only touch the x86 arch code now. > > > > --- > > arch/x86/kernel/cpu/perf_event.c | 161 +++++++++++-------------------------- > > include/linux/perf_event.h | 10 ++- > > kernel/perf_event.c | 28 +++---- > > 3 files changed, 67 insertions(+), 132 deletions(-) > > > > diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c > > index 626154a..62aa9a1 100644 > > --- a/arch/x86/kernel/cpu/perf_event.c > > +++ b/arch/x86/kernel/cpu/perf_event.c > > @@ -944,6 +944,9 @@ static int x86_pmu_enable(struct perf_event *event) > > if (n < 0) > > return n; > > > > + if (!(event->pmu->flag & PERF_EVENT_TRAN_STARTED)) > > + goto out; > > + > > ret = x86_pmu.schedule_events(cpuc, n, assign); > > if (ret) > > return ret; > > @@ -953,6 +956,7 @@ static int x86_pmu_enable(struct perf_event *event) > > */ > > memcpy(cpuc->assign, assign, n*sizeof(int)); > > > > +out: > > cpuc->n_events = n; > > cpuc->n_added += n - n0; > > > > @@ -1210,119 +1214,6 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event) > > return &unconstrained; > > } > > > > -static int x86_event_sched_in(struct perf_event *event, > > - struct perf_cpu_context 
*cpuctx) > > -{ > > - int ret = 0; > > - > > - event->state = PERF_EVENT_STATE_ACTIVE; > > - event->oncpu = smp_processor_id(); > > - event->tstamp_running += event->ctx->time - event->tstamp_stopped; > > - > > - if (!is_x86_event(event)) > > - ret = event->pmu->enable(event); > > - > > - if (!ret && !is_software_event(event)) > > - cpuctx->active_oncpu++; > > - > > - if (!ret && event->attr.exclusive) > > - cpuctx->exclusive = 1; > > - > > - return ret; > > -} > > - > > -static void x86_event_sched_out(struct perf_event *event, > > - struct perf_cpu_context *cpuctx) > > -{ > > - event->state = PERF_EVENT_STATE_INACTIVE; > > - event->oncpu = -1; > > - > > - if (!is_x86_event(event)) > > - event->pmu->disable(event); > > - > > - event->tstamp_running -= event->ctx->time - event->tstamp_stopped; > > - > > - if (!is_software_event(event)) > > - cpuctx->active_oncpu--; > > - > > - if (event->attr.exclusive || !cpuctx->active_oncpu) > > - cpuctx->exclusive = 0; > > -} > > - > > -/* > > - * Called to enable a whole group of events. > > - * Returns 1 if the group was enabled, or -EAGAIN if it could not be. > > - * Assumes the caller has disabled interrupts and has > > - * frozen the PMU with hw_perf_save_disable. > > - * > > - * called with PMU disabled. 
If successful and return value 1, > > - * then guaranteed to call perf_enable() and hw_perf_enable() > > - */ > > -int hw_perf_group_sched_in(struct perf_event *leader, > > - struct perf_cpu_context *cpuctx, > > - struct perf_event_context *ctx) > > -{ > > - struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events); > > - struct perf_event *sub; > > - int assign[X86_PMC_IDX_MAX]; > > - int n0, n1, ret; > > - > > - if (!x86_pmu_initialized()) > > - return 0; > > - > > - /* n0 = total number of events */ > > - n0 = collect_events(cpuc, leader, true); > > - if (n0 < 0) > > - return n0; > > - > > - ret = x86_pmu.schedule_events(cpuc, n0, assign); > > - if (ret) > > - return ret; > > - > > - ret = x86_event_sched_in(leader, cpuctx); > > - if (ret) > > - return ret; > > - > > - n1 = 1; > > - list_for_each_entry(sub, &leader->sibling_list, group_entry) { > > - if (sub->state > PERF_EVENT_STATE_OFF) { > > - ret = x86_event_sched_in(sub, cpuctx); > > - if (ret) > > - goto undo; > > - ++n1; > > - } > > - } > > - /* > > - * copy new assignment, now we know it is possible > > - * will be used by hw_perf_enable() > > - */ > > - memcpy(cpuc->assign, assign, n0*sizeof(int)); > > - > > - cpuc->n_events = n0; > > - cpuc->n_added += n1; > > - ctx->nr_active += n1; > > - > > - /* > > - * 1 means successful and events are active > > - * This is not quite true because we defer > > - * actual activation until hw_perf_enable() but > > - * this way we* ensure caller won't try to enable > > - * individual events > > - */ > > - return 1; > > -undo: > > - x86_event_sched_out(leader, cpuctx); > > - n0 = 1; > > - list_for_each_entry(sub, &leader->sibling_list, group_entry) { > > - if (sub->state == PERF_EVENT_STATE_ACTIVE) { > > - x86_event_sched_out(sub, cpuctx); > > - if (++n0 == n1) > > - break; > > - } > > - } > > - return ret; > > -} > > - > > #include "perf_event_amd.c" > > #include "perf_event_p6.c" > > #include "perf_event_p4.c" > > @@ -1454,6 +1345,47 @@ static inline void 
x86_pmu_read(struct perf_event *event) > > x86_perf_event_update(event); > > } > > > > +/* > > + * Set the flag to make pmu::enable() not perform the > > + * schedulablilty test. > > + */ > > +static void x86_pmu_start_txn(struct pmu *pmu) > > +{ > > + pmu->flag |= PERF_EVENT_TRAN_STARTED; > > +} > > + > > +static void x86_pmu_stop_txn(struct pmu *pmu) > > +{ > > + pmu->flag &= ~PERF_EVENT_TRAN_STARTED; > > +} > > + > > +/* > > + * Return 0 if commit transaction success > > + */ > > +static int x86_pmu_commit_txn(struct pmu *pmu) > > +{ > > + struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events); > > + int assign[X86_PMC_IDX_MAX]; > > + int n, ret; > > + > > + n = cpuc->n_events; > > + > > + if (!x86_pmu_initialized()) > > + return -EAGAIN; > > + > > + ret = x86_pmu.schedule_events(cpuc, n, assign); > > + if (ret) > > + return ret; > > + > > + /* > > + * copy new assignment, now we know it is possible > > + * will be used by hw_perf_enable() > > + */ > > + memcpy(cpuc->assign, assign, n*sizeof(int)); > > + > > + return 0; > > +} > > + > > static const struct pmu pmu = { > > .enable = x86_pmu_enable, > > .disable = x86_pmu_disable, > > @@ -1461,6 +1393,9 @@ static const struct pmu pmu = { > > .stop = x86_pmu_stop, > > .read = x86_pmu_read, > > .unthrottle = x86_pmu_unthrottle, > > + .start_txn = x86_pmu_start_txn, > > + .stop_txn = x86_pmu_stop_txn, > > + .commit_txn = x86_pmu_commit_txn, > > }; > > > > /* > > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h > > index bf896d0..93aa8d8 100644 > > --- a/include/linux/perf_event.h > > +++ b/include/linux/perf_event.h > > @@ -524,6 +524,8 @@ struct hw_perf_event { > > > > struct perf_event; > > > > +#define PERF_EVENT_TRAN_STARTED 1 > > + > > /** > > * struct pmu - generic performance monitoring unit > > */ > > @@ -534,6 +536,11 @@ struct pmu { > > void (*stop) (struct perf_event *event); > > void (*read) (struct perf_event *event); > > void (*unthrottle) (struct perf_event *event); > > + void 
(*start_txn) (struct pmu *pmu); > > + void (*stop_txn) (struct pmu *pmu); > > + int (*commit_txn) (struct pmu *pmu); > > + > > + u8 flag; > > }; > > > > /** > > @@ -799,9 +806,6 @@ extern void perf_disable(void); > > extern void perf_enable(void); > > extern int perf_event_task_disable(void); > > extern int perf_event_task_enable(void); > > -extern int hw_perf_group_sched_in(struct perf_event *group_leader, > > - struct perf_cpu_context *cpuctx, > > - struct perf_event_context *ctx); > > extern void perf_event_update_userpage(struct perf_event *event); > > extern int perf_event_release_kernel(struct perf_event *event); > > extern struct perf_event * > > diff --git a/kernel/perf_event.c b/kernel/perf_event.c > > index 07b7a43..4537676 100644 > > --- a/kernel/perf_event.c > > +++ b/kernel/perf_event.c > > @@ -83,14 +83,6 @@ extern __weak const struct pmu *hw_perf_event_init(struct perf_event *event) > > void __weak hw_perf_disable(void) { barrier(); } > > void __weak hw_perf_enable(void) { barrier(); } > > > > -int __weak > > -hw_perf_group_sched_in(struct perf_event *group_leader, > > - struct perf_cpu_context *cpuctx, > > - struct perf_event_context *ctx) > > -{ > > - return 0; > > -} > > - > > void __weak perf_event_print_debug(void) { } > > > > static DEFINE_PER_CPU(int, perf_disable_count); > > @@ -642,14 +634,13 @@ group_sched_in(struct perf_event *group_event, > > struct perf_event_context *ctx) > > { > > struct perf_event *event, *partial_group; > > + struct pmu *pmu = (struct pmu *)group_event->pmu; > > int ret; > > > > if (group_event->state == PERF_EVENT_STATE_OFF) > > return 0; > > > > - ret = hw_perf_group_sched_in(group_event, cpuctx, ctx); > > - if (ret) > > - return ret < 0 ? 
ret : 0; > > + pmu->start_txn(pmu); > > > > if (event_sched_in(group_event, cpuctx, ctx)) > > return -EAGAIN; > > @@ -664,16 +655,21 @@ group_sched_in(struct perf_event *group_event, > > } > > } > > > > - return 0; > > + ret = pmu->commit_txn(pmu); > > + if (!ret) { > > + pmu->stop_txn(pmu); > > + return 0; > > + } > > > > group_error: > > + pmu->stop_txn(pmu); > > + > > /* > > - * Groups can be scheduled in as one unit only, so undo any > > - * partial group before returning: > > + * Commit transaction fails, rollback > > + * Groups can be scheduled in as one unit only, so undo > > + * whole group before returning: > > */ > > list_for_each_entry(event, &group_event->sibling_list, group_entry) { > > - if (event == partial_group) > > - break; > > event_sched_out(event, cpuctx, ctx); > > } > > event_sched_out(group_event, cpuctx, ctx); > > > > > > ^ permalink raw reply [flat|nested] 55+ messages in thread
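[Editor's note] In the patch, x86_pmu_commit_txn() defers the real work to x86_pmu.schedule_events() and copies the assignment into cpuc->assign only when the test passes. A sketch of that commit-time test, with a naive backtracking search over per-event counter masks standing in for the kernel's constraint scheduler (all names below are hypothetical, not the kernel's):

```c
/* Toy version of the commit-time schedulability test: each event carries
 * a bitmask of counters it may use, and commit succeeds only if every
 * collected event can be assigned a distinct counter.  The backtracking
 * search is an illustrative stand-in for x86_pmu.schedule_events(). */
#include <string.h>

#define NR_COUNTERS 4

/* try to give events i..n-1 distinct counters allowed by their masks */
static int assign_from(const unsigned *mask, int n, int i,
                       unsigned used, int *assign)
{
    int c;

    if (i == n)
        return 0;
    for (c = 0; c < NR_COUNTERS; c++) {
        if ((mask[i] & (1u << c)) && !(used & (1u << c))) {
            assign[i] = c;
            if (!assign_from(mask, n, i + 1, used | (1u << c), assign))
                return 0;
        }
    }
    return -1;      /* no feasible assignment: -EAGAIN in the kernel */
}

/* commit: run the test, publish the assignment only on success,
 * mirroring the memcpy into cpuc->assign in x86_pmu_commit_txn() */
static int toy_commit_txn(const unsigned *mask, int n, int *published)
{
    int tmp[NR_COUNTERS];

    if (assign_from(mask, n, 0, 0, tmp))
        return -1;
    memcpy(published, tmp, n * sizeof(int));
    return 0;
}
```

With masks { 0x3, 0x1 } the first attempt (event 0 on counter 0) dead-ends and the search backtracks, finishing with event 0 on counter 1 and event 1 on counter 0 — the same "test the whole group, publish only on success" shape as the patch.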
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-04-21 8:39 ` Lin Ming @ 2010-04-21 8:44 ` stephane eranian 2010-04-21 9:42 ` Lin Ming 0 siblings, 1 reply; 55+ messages in thread From: stephane eranian @ 2010-04-21 8:44 UTC (permalink / raw) To: Lin Ming; +Cc: Peter Zijlstra, Gary.Mohr@Bull.com, Corey Ashford, LKML On Wed, Apr 21, 2010 at 10:39 AM, Lin Ming <ming.m.lin@intel.com> wrote: > On Wed, 2010-04-21 at 16:32 +0800, stephane eranian wrote: >> Seems to me that struct pmu is a shared resource across all CPUs. >> I don't understand why scheduling on one CPU would have to impact >> all the other CPUs, unless I am missing something here. > > Do you mean the pmu->flag? Yes. > You are right, pmu->flag should be per cpu data. > > Will update the patch. > > Thanks, > Lin Ming > >> >> >> On Wed, Apr 21, 2010 at 10:08 AM, Lin Ming <ming.m.lin@intel.com> wrote: >> > On Tue, 2010-04-20 at 20:03 +0800, Peter Zijlstra wrote: >> >> On Tue, 2010-04-20 at 19:55 +0800, Lin Ming wrote: >> >> >> >> > > One thing not on that list, which should happen first I guess, is to >> >> > > remove hw_perf_group_sched_in(). The idea is to add some sort of >> >> > > transactional API to the struct pmu, so that we can delay the >> >> > > schedulability check until commit time (and roll back when it fails). >> >> > > >> >> > > Something as simple as: >> >> > > >> >> > > struct pmu { >> >> > > void start_txn(struct pmu *); >> >> > > void commit_txn(struct pmu *); >> >> > > >> >> > > ,,, >> >> > > }; >> >> > >> >> > Could you please explain a bit more? >> >> > >> >> > Does it mean that "start_txn" perform the schedule events stuff >> >> > and "commit_txn" perform the assign events stuff? >> >> > >> >> > Does "commit time" mean the actual activation in hw_perf_enable? >> >> >> >> No, the idea behind hw_perf_group_sched_in() is to not perform >> >> schedulability tests on each event in the group, but to add the group as >> >> a whole and then perform one test. 
>> >> >> >> Of course, when that test fails, you'll have to roll-back the whole >> >> group again. >> >> >> >> So start_txn (or a better name) would simply toggle a flag in the pmu >> >> implementation that will make pmu::enable() not perform the >> >> schedulablilty test. >> >> >> >> Then commit_txn() will perform the schedulability test (so note the >> >> method has to have a !void return value, my mistake in the earlier >> >> email). >> >> >> >> This will allow us to use the regular >> >> kernel/perf_event.c::group_sched_in() and all the rollback code. >> >> Currently each hw_perf_group_sched_in() implementation duplicates all >> >> the rolllback code (with various bugs). >> >> >> >> >> >> >> >> We must get rid of all weak hw_perf_*() functions before we can properly >> >> consider multiple struct pmu implementations. >> >> >> > >> > Thanks for the clear explanation. >> > >> > Does below patch show what you mean? >> > >> > I only touch the x86 arch code now. >> > >> > --- >> > arch/x86/kernel/cpu/perf_event.c | 161 +++++++++++-------------------------- >> > include/linux/perf_event.h | 10 ++- >> > kernel/perf_event.c | 28 +++---- >> > 3 files changed, 67 insertions(+), 132 deletions(-) >> > >> > diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c >> > index 626154a..62aa9a1 100644 >> > --- a/arch/x86/kernel/cpu/perf_event.c >> > +++ b/arch/x86/kernel/cpu/perf_event.c >> > @@ -944,6 +944,9 @@ static int x86_pmu_enable(struct perf_event *event) >> > if (n < 0) >> > return n; >> > >> > + if (!(event->pmu->flag & PERF_EVENT_TRAN_STARTED)) >> > + goto out; >> > + >> > ret = x86_pmu.schedule_events(cpuc, n, assign); >> > if (ret) >> > return ret; >> > @@ -953,6 +956,7 @@ static int x86_pmu_enable(struct perf_event *event) >> > */ >> > memcpy(cpuc->assign, assign, n*sizeof(int)); >> > >> > +out: >> > cpuc->n_events = n; >> > cpuc->n_added += n - n0; >> > >> > @@ -1210,119 +1214,6 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, 
struct perf_event *event) >> > return &unconstrained; >> > } >> > >> > -static int x86_event_sched_in(struct perf_event *event, >> > - struct perf_cpu_context *cpuctx) >> > -{ >> > - int ret = 0; >> > - >> > - event->state = PERF_EVENT_STATE_ACTIVE; >> > - event->oncpu = smp_processor_id(); >> > - event->tstamp_running += event->ctx->time - event->tstamp_stopped; >> > - >> > - if (!is_x86_event(event)) >> > - ret = event->pmu->enable(event); >> > - >> > - if (!ret && !is_software_event(event)) >> > - cpuctx->active_oncpu++; >> > - >> > - if (!ret && event->attr.exclusive) >> > - cpuctx->exclusive = 1; >> > - >> > - return ret; >> > -} >> > - >> > -static void x86_event_sched_out(struct perf_event *event, >> > - struct perf_cpu_context *cpuctx) >> > -{ >> > - event->state = PERF_EVENT_STATE_INACTIVE; >> > - event->oncpu = -1; >> > - >> > - if (!is_x86_event(event)) >> > - event->pmu->disable(event); >> > - >> > - event->tstamp_running -= event->ctx->time - event->tstamp_stopped; >> > - >> > - if (!is_software_event(event)) >> > - cpuctx->active_oncpu--; >> > - >> > - if (event->attr.exclusive || !cpuctx->active_oncpu) >> > - cpuctx->exclusive = 0; >> > -} >> > - >> > -/* >> > - * Called to enable a whole group of events. >> > - * Returns 1 if the group was enabled, or -EAGAIN if it could not be. >> > - * Assumes the caller has disabled interrupts and has >> > - * frozen the PMU with hw_perf_save_disable. >> > - * >> > - * called with PMU disabled. 
If successful and return value 1, >> > - * then guaranteed to call perf_enable() and hw_perf_enable() >> > - */ >> > -int hw_perf_group_sched_in(struct perf_event *leader, >> > - struct perf_cpu_context *cpuctx, >> > - struct perf_event_context *ctx) >> > -{ >> > - struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events); >> > - struct perf_event *sub; >> > - int assign[X86_PMC_IDX_MAX]; >> > - int n0, n1, ret; >> > - >> > - if (!x86_pmu_initialized()) >> > - return 0; >> > - >> > - /* n0 = total number of events */ >> > - n0 = collect_events(cpuc, leader, true); >> > - if (n0 < 0) >> > - return n0; >> > - >> > - ret = x86_pmu.schedule_events(cpuc, n0, assign); >> > - if (ret) >> > - return ret; >> > - >> > - ret = x86_event_sched_in(leader, cpuctx); >> > - if (ret) >> > - return ret; >> > - >> > - n1 = 1; >> > - list_for_each_entry(sub, &leader->sibling_list, group_entry) { >> > - if (sub->state > PERF_EVENT_STATE_OFF) { >> > - ret = x86_event_sched_in(sub, cpuctx); >> > - if (ret) >> > - goto undo; >> > - ++n1; >> > - } >> > - } >> > - /* >> > - * copy new assignment, now we know it is possible >> > - * will be used by hw_perf_enable() >> > - */ >> > - memcpy(cpuc->assign, assign, n0*sizeof(int)); >> > - >> > - cpuc->n_events = n0; >> > - cpuc->n_added += n1; >> > - ctx->nr_active += n1; >> > - >> > - /* >> > - * 1 means successful and events are active >> > - * This is not quite true because we defer >> > - * actual activation until hw_perf_enable() but >> > - * this way we* ensure caller won't try to enable >> > - * individual events >> > - */ >> > - return 1; >> > -undo: >> > - x86_event_sched_out(leader, cpuctx); >> > - n0 = 1; >> > - list_for_each_entry(sub, &leader->sibling_list, group_entry) { >> > - if (sub->state == PERF_EVENT_STATE_ACTIVE) { >> > - x86_event_sched_out(sub, cpuctx); >> > - if (++n0 == n1) >> > - break; >> > - } >> > - } >> > - return ret; >> > -} >> > - >> > #include "perf_event_amd.c" >> > #include "perf_event_p6.c" >> > #include 
"perf_event_p4.c" >> > @@ -1454,6 +1345,47 @@ static inline void x86_pmu_read(struct perf_event *event) >> > x86_perf_event_update(event); >> > } >> > >> > +/* >> > + * Set the flag to make pmu::enable() not perform the >> > + * schedulablilty test. >> > + */ >> > +static void x86_pmu_start_txn(struct pmu *pmu) >> > +{ >> > + pmu->flag |= PERF_EVENT_TRAN_STARTED; >> > +} >> > + >> > +static void x86_pmu_stop_txn(struct pmu *pmu) >> > +{ >> > + pmu->flag &= ~PERF_EVENT_TRAN_STARTED; >> > +} >> > + >> > +/* >> > + * Return 0 if commit transaction success >> > + */ >> > +static int x86_pmu_commit_txn(struct pmu *pmu) >> > +{ >> > + struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events); >> > + int assign[X86_PMC_IDX_MAX]; >> > + int n, ret; >> > + >> > + n = cpuc->n_events; >> > + >> > + if (!x86_pmu_initialized()) >> > + return -EAGAIN; >> > + >> > + ret = x86_pmu.schedule_events(cpuc, n, assign); >> > + if (ret) >> > + return ret; >> > + >> > + /* >> > + * copy new assignment, now we know it is possible >> > + * will be used by hw_perf_enable() >> > + */ >> > + memcpy(cpuc->assign, assign, n*sizeof(int)); >> > + >> > + return 0; >> > +} >> > + >> > static const struct pmu pmu = { >> > .enable = x86_pmu_enable, >> > .disable = x86_pmu_disable, >> > @@ -1461,6 +1393,9 @@ static const struct pmu pmu = { >> > .stop = x86_pmu_stop, >> > .read = x86_pmu_read, >> > .unthrottle = x86_pmu_unthrottle, >> > + .start_txn = x86_pmu_start_txn, >> > + .stop_txn = x86_pmu_stop_txn, >> > + .commit_txn = x86_pmu_commit_txn, >> > }; >> > >> > /* >> > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h >> > index bf896d0..93aa8d8 100644 >> > --- a/include/linux/perf_event.h >> > +++ b/include/linux/perf_event.h >> > @@ -524,6 +524,8 @@ struct hw_perf_event { >> > >> > struct perf_event; >> > >> > +#define PERF_EVENT_TRAN_STARTED 1 >> > + >> > /** >> > * struct pmu - generic performance monitoring unit >> > */ >> > @@ -534,6 +536,11 @@ struct pmu { >> > void (*stop) 
(struct perf_event *event); >> > void (*read) (struct perf_event *event); >> > void (*unthrottle) (struct perf_event *event); >> > + void (*start_txn) (struct pmu *pmu); >> > + void (*stop_txn) (struct pmu *pmu); >> > + int (*commit_txn) (struct pmu *pmu); >> > + >> > + u8 flag; >> > }; >> > >> > /** >> > @@ -799,9 +806,6 @@ extern void perf_disable(void); >> > extern void perf_enable(void); >> > extern int perf_event_task_disable(void); >> > extern int perf_event_task_enable(void); >> > -extern int hw_perf_group_sched_in(struct perf_event *group_leader, >> > - struct perf_cpu_context *cpuctx, >> > - struct perf_event_context *ctx); >> > extern void perf_event_update_userpage(struct perf_event *event); >> > extern int perf_event_release_kernel(struct perf_event *event); >> > extern struct perf_event * >> > diff --git a/kernel/perf_event.c b/kernel/perf_event.c >> > index 07b7a43..4537676 100644 >> > --- a/kernel/perf_event.c >> > +++ b/kernel/perf_event.c >> > @@ -83,14 +83,6 @@ extern __weak const struct pmu *hw_perf_event_init(struct perf_event *event) >> > void __weak hw_perf_disable(void) { barrier(); } >> > void __weak hw_perf_enable(void) { barrier(); } >> > >> > -int __weak >> > -hw_perf_group_sched_in(struct perf_event *group_leader, >> > - struct perf_cpu_context *cpuctx, >> > - struct perf_event_context *ctx) >> > -{ >> > - return 0; >> > -} >> > - >> > void __weak perf_event_print_debug(void) { } >> > >> > static DEFINE_PER_CPU(int, perf_disable_count); >> > @@ -642,14 +634,13 @@ group_sched_in(struct perf_event *group_event, >> > struct perf_event_context *ctx) >> > { >> > struct perf_event *event, *partial_group; >> > + struct pmu *pmu = (struct pmu *)group_event->pmu; >> > int ret; >> > >> > if (group_event->state == PERF_EVENT_STATE_OFF) >> > return 0; >> > >> > - ret = hw_perf_group_sched_in(group_event, cpuctx, ctx); >> > - if (ret) >> > - return ret < 0 ? 
ret : 0; >> > + pmu->start_txn(pmu); >> > >> > if (event_sched_in(group_event, cpuctx, ctx)) >> > return -EAGAIN; >> > @@ -664,16 +655,21 @@ group_sched_in(struct perf_event *group_event, >> > } >> > } >> > >> > - return 0; >> > + ret = pmu->commit_txn(pmu); >> > + if (!ret) { >> > + pmu->stop_txn(pmu); >> > + return 0; >> > + } >> > >> > group_error: >> > + pmu->stop_txn(pmu); >> > + >> > /* >> > - * Groups can be scheduled in as one unit only, so undo any >> > - * partial group before returning: >> > + * Commit transaction fails, rollback >> > + * Groups can be scheduled in as one unit only, so undo >> > + * whole group before returning: >> > */ >> > list_for_each_entry(event, &group_event->sibling_list, group_entry) { >> > - if (event == partial_group) >> > - break; >> > event_sched_out(event, cpuctx, ctx); >> > } >> > event_sched_out(group_event, cpuctx, ctx); >> > >> > >> > > > ^ permalink raw reply [flat|nested] 55+ messages in thread
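[Editor's note] Stephane's objection is that struct pmu is one object shared by every CPU, so a single pmu->flag would let a transaction opened on one CPU suppress the schedulability test on all the others. The v2 patch addresses this with alloc_percpu(u8) and per_cpu_ptr(); a userspace sketch of the same idea, with a plain array standing in for the kernel's per-cpu allocator (all names illustrative):

```c
/* Sketch of the per-CPU transaction flag: one byte per CPU, so a
 * transaction opened on CPU 0 is invisible on CPU 1.  The array and
 * explicit cpu argument stand in for alloc_percpu(u8)/per_cpu_ptr()
 * and smp_processor_id(); this is not the kernel API. */
#include <stdlib.h>

#define NR_CPUS     4
#define TXN_STARTED 1

struct toy_pmu {
    unsigned char *flag;        /* ~ alloc_percpu(u8) */
};

/* ~ per_cpu_ptr(pmu->flag, cpu) */
static unsigned char *pmu_flag(struct toy_pmu *pmu, int cpu)
{
    return &pmu->flag[cpu];
}

static void txn_init(struct toy_pmu *pmu)
{
    pmu->flag = calloc(NR_CPUS, 1);     /* zeroed, like alloc_percpu */
}

static void start_txn(struct toy_pmu *pmu, int cpu)
{
    *pmu_flag(pmu, cpu) |= TXN_STARTED;
}

static void stop_txn(struct toy_pmu *pmu, int cpu)
{
    *pmu_flag(pmu, cpu) &= ~TXN_STARTED;
}

static int txn_started(struct toy_pmu *pmu, int cpu)
{
    return *pmu_flag(pmu, cpu) & TXN_STARTED;
}
```

Scheduling on one CPU now toggles only that CPU's byte, which is exactly why the shared u8 in v1 had to become the percpu pointer in v2.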
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-04-21 8:44 ` stephane eranian @ 2010-04-21 9:42 ` Lin Ming 2010-04-21 9:57 ` Peter Zijlstra 0 siblings, 1 reply; 55+ messages in thread From: Lin Ming @ 2010-04-21 9:42 UTC (permalink / raw) To: eranian@gmail.com; +Cc: Peter Zijlstra, Gary.Mohr@Bull.com, Corey Ashford, LKML On Wed, 2010-04-21 at 16:44 +0800, stephane eranian wrote: > On Wed, Apr 21, 2010 at 10:39 AM, Lin Ming <ming.m.lin@intel.com> wrote: > > On Wed, 2010-04-21 at 16:32 +0800, stephane eranian wrote: > >> Seems to me that struct pmu is a shared resource across all CPUs. > >> I don't understand why scheduling on one CPU would have to impact > >> all the other CPUs, unless I am missing something here. > > > > Do you mean the pmu->flag? > > Yes. Thanks for the review. Changes log: v2. pmu->flag should be per cpu (Stephane Eranian) change definition of "const struct pmu" to "struct pmu" since it's not read only now. v1. remove hw_perf_group_sched_in() based on Peter's idea. 
--- arch/x86/kernel/cpu/perf_event.c | 183 ++++++++++++++------------------------ include/linux/perf_event.h | 16 +++- kernel/perf_event.c | 55 ++++++------ 3 files changed, 106 insertions(+), 148 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c index 626154a..1113fd5 100644 --- a/arch/x86/kernel/cpu/perf_event.c +++ b/arch/x86/kernel/cpu/perf_event.c @@ -598,7 +598,7 @@ static void x86_pmu_enable_all(int added) } } -static const struct pmu pmu; +static struct pmu pmu; static inline int is_x86_event(struct perf_event *event) { @@ -935,6 +935,7 @@ static int x86_pmu_enable(struct perf_event *event) struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events); struct hw_perf_event *hwc; int assign[X86_PMC_IDX_MAX]; + u8 *flag; int n, n0, ret; hwc = &event->hw; @@ -944,6 +945,10 @@ static int x86_pmu_enable(struct perf_event *event) if (n < 0) return n; + flag = perf_pmu_flag((struct pmu *)event->pmu); + if (!(*flag & PERF_EVENT_TRAN_STARTED)) + goto out; + ret = x86_pmu.schedule_events(cpuc, n, assign); if (ret) return ret; @@ -953,6 +958,7 @@ static int x86_pmu_enable(struct perf_event *event) */ memcpy(cpuc->assign, assign, n*sizeof(int)); +out: cpuc->n_events = n; cpuc->n_added += n - n0; @@ -1210,119 +1216,6 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event) return &unconstrained; } -static int x86_event_sched_in(struct perf_event *event, - struct perf_cpu_context *cpuctx) -{ - int ret = 0; - - event->state = PERF_EVENT_STATE_ACTIVE; - event->oncpu = smp_processor_id(); - event->tstamp_running += event->ctx->time - event->tstamp_stopped; - - if (!is_x86_event(event)) - ret = event->pmu->enable(event); - - if (!ret && !is_software_event(event)) - cpuctx->active_oncpu++; - - if (!ret && event->attr.exclusive) - cpuctx->exclusive = 1; - - return ret; -} - -static void x86_event_sched_out(struct perf_event *event, - struct perf_cpu_context *cpuctx) -{ - event->state = 
PERF_EVENT_STATE_INACTIVE; - event->oncpu = -1; - - if (!is_x86_event(event)) - event->pmu->disable(event); - - event->tstamp_running -= event->ctx->time - event->tstamp_stopped; - - if (!is_software_event(event)) - cpuctx->active_oncpu--; - - if (event->attr.exclusive || !cpuctx->active_oncpu) - cpuctx->exclusive = 0; -} - -/* - * Called to enable a whole group of events. - * Returns 1 if the group was enabled, or -EAGAIN if it could not be. - * Assumes the caller has disabled interrupts and has - * frozen the PMU with hw_perf_save_disable. - * - * called with PMU disabled. If successful and return value 1, - * then guaranteed to call perf_enable() and hw_perf_enable() - */ -int hw_perf_group_sched_in(struct perf_event *leader, - struct perf_cpu_context *cpuctx, - struct perf_event_context *ctx) -{ - struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events); - struct perf_event *sub; - int assign[X86_PMC_IDX_MAX]; - int n0, n1, ret; - - if (!x86_pmu_initialized()) - return 0; - - /* n0 = total number of events */ - n0 = collect_events(cpuc, leader, true); - if (n0 < 0) - return n0; - - ret = x86_pmu.schedule_events(cpuc, n0, assign); - if (ret) - return ret; - - ret = x86_event_sched_in(leader, cpuctx); - if (ret) - return ret; - - n1 = 1; - list_for_each_entry(sub, &leader->sibling_list, group_entry) { - if (sub->state > PERF_EVENT_STATE_OFF) { - ret = x86_event_sched_in(sub, cpuctx); - if (ret) - goto undo; - ++n1; - } - } - /* - * copy new assignment, now we know it is possible - * will be used by hw_perf_enable() - */ - memcpy(cpuc->assign, assign, n0*sizeof(int)); - - cpuc->n_events = n0; - cpuc->n_added += n1; - ctx->nr_active += n1; - - /* - * 1 means successful and events are active - * This is not quite true because we defer - * actual activation until hw_perf_enable() but - * this way we* ensure caller won't try to enable - * individual events - */ - return 1; -undo: - x86_event_sched_out(leader, cpuctx); - n0 = 1; - list_for_each_entry(sub, 
&leader->sibling_list, group_entry) {
-		if (sub->state == PERF_EVENT_STATE_ACTIVE) {
-			x86_event_sched_out(sub, cpuctx);
-			if (++n0 == n1)
-				break;
-		}
-	}
-	return ret;
-}
-
 #include "perf_event_amd.c"
 #include "perf_event_p6.c"
 #include "perf_event_p4.c"
@@ -1397,6 +1290,14 @@ void __init init_hw_perf_events(void)
 		return;
 	}
 
+	/*
+	 * TBD: will move this to pmu register function
+	 * when multiple hw pmu support is added
+	 */
+	pmu.flag = alloc_percpu(u8);
+	if (!pmu.flag)
+		return;
+
 	pmu_check_apic();
 
 	pr_cont("%s PMU driver.\n", x86_pmu.name);
@@ -1454,13 +1355,61 @@ static inline void x86_pmu_read(struct perf_event *event)
 	x86_perf_event_update(event);
 }
 
-static const struct pmu pmu = {
+/*
+ * Set the flag to make pmu::enable() not perform the
+ * schedulability test.
+ */
+static void x86_pmu_start_txn(struct pmu *pmu)
+{
+	u8 *flag = perf_pmu_flag(pmu);
+
+	*flag |= PERF_EVENT_TRAN_STARTED;
+}
+
+static void x86_pmu_stop_txn(struct pmu *pmu)
+{
+	u8 *flag = perf_pmu_flag(pmu);
+
+	*flag &= ~PERF_EVENT_TRAN_STARTED;
+}
+
+/*
+ * Return 0 if commit transaction success
+ */
+static int x86_pmu_commit_txn(struct pmu *pmu)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	int assign[X86_PMC_IDX_MAX];
+	int n, ret;
+
+	n = cpuc->n_events;
+
+	if (!x86_pmu_initialized())
+		return -EAGAIN;
+
+	ret = x86_pmu.schedule_events(cpuc, n, assign);
+	if (ret)
+		return ret;
+
+	/*
+	 * copy new assignment, now we know it is possible
+	 * will be used by hw_perf_enable()
+	 */
+	memcpy(cpuc->assign, assign, n*sizeof(int));
+
+	return 0;
+}
+
+static struct pmu pmu = {
 	.enable		= x86_pmu_enable,
 	.disable	= x86_pmu_disable,
 	.start		= x86_pmu_start,
 	.stop		= x86_pmu_stop,
 	.read		= x86_pmu_read,
 	.unthrottle	= x86_pmu_unthrottle,
+	.start_txn	= x86_pmu_start_txn,
+	.stop_txn	= x86_pmu_stop_txn,
+	.commit_txn	= x86_pmu_commit_txn,
 };
 
 /*
@@ -1537,9 +1486,9 @@ out:
 	return ret;
 }
 
-const struct pmu *hw_perf_event_init(struct perf_event *event)
+struct pmu *hw_perf_event_init(struct perf_event *event)
 {
-	const struct pmu *tmp;
+	struct pmu *tmp;
 	int err;
 
 	err = __hw_perf_event_init(event);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index bf896d0..9fa3f46 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -524,6 +524,8 @@ struct hw_perf_event {
 
 struct perf_event;
 
+#define PERF_EVENT_TRAN_STARTED 1
+
 /**
  * struct pmu - generic performance monitoring unit
  */
@@ -534,6 +536,12 @@ struct pmu {
 	void (*stop)			(struct perf_event *event);
 	void (*read)			(struct perf_event *event);
 	void (*unthrottle)		(struct perf_event *event);
+	void (*start_txn)		(struct pmu *pmu);
+	void (*stop_txn)		(struct pmu *pmu);
+	int  (*commit_txn)		(struct pmu *pmu);
+
+	/* percpu flag */
+	u8 *flag;
 };
 
 /**
@@ -610,7 +618,7 @@ struct perf_event {
 	int				group_flags;
 	struct perf_event		*group_leader;
 	struct perf_event		*output;
-	const struct pmu		*pmu;
+	struct pmu			*pmu;
 
 	enum perf_event_active_state	state;
 	atomic64_t			count;
@@ -782,7 +790,7 @@ struct perf_output_handle {
  */
 extern int perf_max_events;
 
-extern const struct pmu *hw_perf_event_init(struct perf_event *event);
+extern struct pmu *hw_perf_event_init(struct perf_event *event);
 
 extern void perf_event_task_sched_in(struct task_struct *task);
 extern void perf_event_task_sched_out(struct task_struct *task,
					struct task_struct *next);
@@ -799,9 +807,6 @@ extern void perf_disable(void);
 extern void perf_enable(void);
 extern int perf_event_task_disable(void);
 extern int perf_event_task_enable(void);
-extern int hw_perf_group_sched_in(struct perf_event *group_leader,
-	       struct perf_cpu_context *cpuctx,
-	       struct perf_event_context *ctx);
 extern void perf_event_update_userpage(struct perf_event *event);
 extern int perf_event_release_kernel(struct perf_event *event);
 extern struct perf_event *
@@ -811,6 +816,7 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr,
				perf_overflow_handler_t callback);
 extern u64 perf_event_read_value(struct perf_event *event,
				 u64 *enabled, u64 *running);
+extern u8 *perf_pmu_flag(struct pmu *pmu);
 
 struct perf_sample_data {
	u64 type;
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 07b7a43..4820090 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -75,7 +75,7 @@ static DEFINE_SPINLOCK(perf_resource_lock);
 /*
  * Architecture provided APIs - weak aliases:
  */
-extern __weak const struct pmu *hw_perf_event_init(struct perf_event *event)
+extern __weak struct pmu *hw_perf_event_init(struct perf_event *event)
 {
	return NULL;
 }
@@ -83,15 +83,14 @@ extern __weak const struct pmu *hw_perf_event_init(struct perf_event *event)
 void __weak hw_perf_disable(void)		{ barrier(); }
 void __weak hw_perf_enable(void)		{ barrier(); }
 
-int __weak
-hw_perf_group_sched_in(struct perf_event *group_leader,
-	       struct perf_cpu_context *cpuctx,
-	       struct perf_event_context *ctx)
+void __weak perf_event_print_debug(void)	{ }
+
+u8 *perf_pmu_flag(struct pmu *pmu)
 {
-	return 0;
-}
+	int cpu = smp_processor_id();
 
-void __weak perf_event_print_debug(void)	{ }
+	return per_cpu_ptr(pmu->flag, cpu);
+}
 
 static DEFINE_PER_CPU(int, perf_disable_count);
 
@@ -642,14 +641,13 @@ group_sched_in(struct perf_event *group_event,
	       struct perf_event_context *ctx)
 {
	struct perf_event *event, *partial_group;
+	struct pmu *pmu = group_event->pmu;
	int ret;
 
	if (group_event->state == PERF_EVENT_STATE_OFF)
		return 0;
 
-	ret = hw_perf_group_sched_in(group_event, cpuctx, ctx);
-	if (ret)
-		return ret < 0 ? ret : 0;
+	pmu->start_txn(pmu);
 
	if (event_sched_in(group_event, cpuctx, ctx))
		return -EAGAIN;
@@ -664,16 +662,21 @@ group_sched_in(struct perf_event *group_event,
		}
	}
 
-	return 0;
+	ret = pmu->commit_txn(pmu);
+	if (!ret) {
+		pmu->stop_txn(pmu);
+		return 0;
+	}
 
 group_error:
+	pmu->stop_txn(pmu);
+
	/*
-	 * Groups can be scheduled in as one unit only, so undo any
-	 * partial group before returning:
+	 * Commit transaction fails, rollback
+	 * Groups can be scheduled in as one unit only, so undo
+	 * whole group before returning:
	 */
	list_for_each_entry(event, &group_event->sibling_list, group_entry) {
-		if (event == partial_group)
-			break;
		event_sched_out(event, cpuctx, ctx);
	}
	event_sched_out(group_event, cpuctx, ctx);
@@ -4139,7 +4142,7 @@ static void perf_swevent_disable(struct perf_event *event)
	hlist_del_rcu(&event->hlist_entry);
 }
 
-static const struct pmu perf_ops_generic = {
+static struct pmu perf_ops_generic = {
	.enable		= perf_swevent_enable,
	.disable	= perf_swevent_disable,
	.read		= perf_swevent_read,
@@ -4250,7 +4253,7 @@ static void cpu_clock_perf_event_read(struct perf_event *event)
	cpu_clock_perf_event_update(event);
 }
 
-static const struct pmu perf_ops_cpu_clock = {
+static struct pmu perf_ops_cpu_clock = {
	.enable		= cpu_clock_perf_event_enable,
	.disable	= cpu_clock_perf_event_disable,
	.read		= cpu_clock_perf_event_read,
@@ -4307,7 +4310,7 @@ static void task_clock_perf_event_read(struct perf_event *event)
	task_clock_perf_event_update(event, time);
 }
 
-static const struct pmu perf_ops_task_clock = {
+static struct pmu perf_ops_task_clock = {
	.enable		= task_clock_perf_event_enable,
	.disable	= task_clock_perf_event_disable,
	.read		= task_clock_perf_event_read,
@@ -4448,7 +4451,7 @@ static void tp_perf_event_destroy(struct perf_event *event)
	swevent_hlist_put(event);
 }
 
-static const struct pmu *tp_perf_event_init(struct perf_event *event)
+static struct pmu *tp_perf_event_init(struct perf_event *event)
 {
	int err;
 
@@ -4505,7 +4508,7 @@ static int perf_tp_event_match(struct perf_event *event,
	return 1;
 }
 
-static const struct pmu *tp_perf_event_init(struct perf_event *event)
+static struct pmu *tp_perf_event_init(struct perf_event *event)
 {
	return NULL;
 }
@@ -4527,7 +4530,7 @@ static void bp_perf_event_destroy(struct perf_event *event)
	release_bp_slot(event);
 }
 
-static const struct pmu *bp_perf_event_init(struct perf_event *bp)
+static struct pmu *bp_perf_event_init(struct perf_event *bp)
 {
	int err;
 
@@ -4551,7 +4554,7 @@ void perf_bp_event(struct perf_event *bp, void *data)
	perf_swevent_add(bp, 1, 1, &sample, regs);
 }
 #else
-static const struct pmu *bp_perf_event_init(struct perf_event *bp)
+static struct pmu *bp_perf_event_init(struct perf_event *bp)
 {
	return NULL;
 }
@@ -4573,9 +4576,9 @@ static void sw_perf_event_destroy(struct perf_event *event)
	swevent_hlist_put(event);
 }
 
-static const struct pmu *sw_perf_event_init(struct perf_event *event)
+static struct pmu *sw_perf_event_init(struct perf_event *event)
 {
-	const struct pmu *pmu = NULL;
+	struct pmu *pmu = NULL;
	u64 event_id = event->attr.config;
 
	/*
@@ -4637,7 +4640,7 @@ perf_event_alloc(struct perf_event_attr *attr,
		   perf_overflow_handler_t overflow_handler,
		   gfp_t gfpflags)
 {
-	const struct pmu *pmu;
+	struct pmu *pmu;
	struct perf_event *event;
	struct hw_perf_event *hwc;
	long err;

> > You are right, pmu->flag should be per cpu data.
> >
> > Will update the patch.
> >
> > Thanks,
> > Lin Ming

^ permalink raw reply related	[flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-04-21  9:42           ` Lin Ming
@ 2010-04-21  9:57           ` Peter Zijlstra
  2010-04-21 22:12             ` Lin Ming
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2010-04-21  9:57 UTC (permalink / raw)
To: Lin Ming; +Cc: eranian@gmail.com, Gary.Mohr@Bull.com, Corey Ashford, LKML

On Wed, 2010-04-21 at 17:42 +0800, Lin Ming wrote:
> +	/*
> +	 * TBD: will move this to pmu register function
> +	 * when multiple hw pmu support is added
> +	 */
> +	pmu.flag = alloc_percpu(u8);
> +	if (!pmu.flag)
> +		return;

Why not simply use a field in struct cpu_hw_events?

That's where we track all per-cpu pmu state.

^ permalink raw reply	[flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-04-21  9:57           ` Peter Zijlstra
@ 2010-04-21 22:12           ` Lin Ming
  2010-04-21 14:22             ` Peter Zijlstra
  0 siblings, 1 reply; 55+ messages in thread
From: Lin Ming @ 2010-04-21 22:12 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: eranian@gmail.com, Gary.Mohr@Bull.com, Corey Ashford, LKML

On Wed, 2010-04-21 at 17:57 +0800, Peter Zijlstra wrote:
> On Wed, 2010-04-21 at 17:42 +0800, Lin Ming wrote:
> > +	/*
> > +	 * TBD: will move this to pmu register function
> > +	 * when multiple hw pmu support is added
> > +	 */
> > +	pmu.flag = alloc_percpu(u8);
> > +	if (!pmu.flag)
> > +		return;
>
> Why not simply use a field in struct cpu_hw_events?
>
> That's where we track all per-cpu pmu state.

That's a good idea! Thanks for the review.

Change log:

v3. track per-cpu pmu transaction state in cpu_hw_events (Peter Zijlstra)
    move pmu definition back to const ("struct pmu" -> "const struct pmu")

v2. pmu->flag should be per cpu (Stephane Eranian)
    change definition of "const struct pmu" to "struct pmu" since it's
    not read-only now.

v1. remove hw_perf_group_sched_in() based on Peter's idea.
---
 arch/x86/kernel/cpu/perf_event.c |  167 ++++++++++++-------------------------
 include/linux/perf_event.h       |    8 +-
 kernel/perf_event.c              |   23 +++---
 3 files changed, 69 insertions(+), 129 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 626154a..4a1dc62 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -110,6 +110,8 @@ struct cpu_hw_events {
	u64			tags[X86_PMC_IDX_MAX];
	struct perf_event	*event_list[X86_PMC_IDX_MAX]; /* in enabled order */
 
+	u8			flag;
+
	/*
	 * Intel DebugStore bits
	 */
@@ -944,6 +946,9 @@ static int x86_pmu_enable(struct perf_event *event)
	if (n < 0)
		return n;
 
+	if (cpuc->flag & PERF_EVENT_TRAN_STARTED)
+		goto out;
+
	ret = x86_pmu.schedule_events(cpuc, n, assign);
	if (ret)
		return ret;
@@ -953,6 +958,7 @@ static int x86_pmu_enable(struct perf_event *event)
	 */
	memcpy(cpuc->assign, assign, n*sizeof(int));
 
+out:
	cpuc->n_events = n;
	cpuc->n_added += n - n0;
 
@@ -1210,119 +1216,6 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
	return &unconstrained;
 }
 
-static int x86_event_sched_in(struct perf_event *event,
-			  struct perf_cpu_context *cpuctx)
-{
-	int ret = 0;
-
-	event->state = PERF_EVENT_STATE_ACTIVE;
-	event->oncpu = smp_processor_id();
-	event->tstamp_running += event->ctx->time - event->tstamp_stopped;
-
-	if (!is_x86_event(event))
-		ret = event->pmu->enable(event);
-
-	if (!ret && !is_software_event(event))
-		cpuctx->active_oncpu++;
-
-	if (!ret && event->attr.exclusive)
-		cpuctx->exclusive = 1;
-
-	return ret;
-}
-
-static void x86_event_sched_out(struct perf_event *event,
-			    struct perf_cpu_context *cpuctx)
-{
-	event->state = PERF_EVENT_STATE_INACTIVE;
-	event->oncpu = -1;
-
-	if (!is_x86_event(event))
-		event->pmu->disable(event);
-
-	event->tstamp_running -= event->ctx->time - event->tstamp_stopped;
-
-	if (!is_software_event(event))
-		cpuctx->active_oncpu--;
-
-	if (event->attr.exclusive || !cpuctx->active_oncpu)
-		cpuctx->exclusive = 0;
-}
-
-/*
- * Called to enable a whole group of events.
- * Returns 1 if the group was enabled, or -EAGAIN if it could not be.
- * Assumes the caller has disabled interrupts and has
- * frozen the PMU with hw_perf_save_disable.
- *
- * called with PMU disabled. If successful and return value 1,
- * then guaranteed to call perf_enable() and hw_perf_enable()
- */
-int hw_perf_group_sched_in(struct perf_event *leader,
-	       struct perf_cpu_context *cpuctx,
-	       struct perf_event_context *ctx)
-{
-	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-	struct perf_event *sub;
-	int assign[X86_PMC_IDX_MAX];
-	int n0, n1, ret;
-
-	if (!x86_pmu_initialized())
-		return 0;
-
-	/* n0 = total number of events */
-	n0 = collect_events(cpuc, leader, true);
-	if (n0 < 0)
-		return n0;
-
-	ret = x86_pmu.schedule_events(cpuc, n0, assign);
-	if (ret)
-		return ret;
-
-	ret = x86_event_sched_in(leader, cpuctx);
-	if (ret)
-		return ret;
-
-	n1 = 1;
-	list_for_each_entry(sub, &leader->sibling_list, group_entry) {
-		if (sub->state > PERF_EVENT_STATE_OFF) {
-			ret = x86_event_sched_in(sub, cpuctx);
-			if (ret)
-				goto undo;
-			++n1;
-		}
-	}
-	/*
-	 * copy new assignment, now we know it is possible
-	 * will be used by hw_perf_enable()
-	 */
-	memcpy(cpuc->assign, assign, n0*sizeof(int));
-
-	cpuc->n_events = n0;
-	cpuc->n_added += n1;
-	ctx->nr_active += n1;
-
-	/*
-	 * 1 means successful and events are active
-	 * This is not quite true because we defer
-	 * actual activation until hw_perf_enable() but
-	 * this way we ensure caller won't try to enable
-	 * individual events
-	 */
-	return 1;
-undo:
-	x86_event_sched_out(leader, cpuctx);
-	n0 = 1;
-	list_for_each_entry(sub, &leader->sibling_list, group_entry) {
-		if (sub->state == PERF_EVENT_STATE_ACTIVE) {
-			x86_event_sched_out(sub, cpuctx);
-			if (++n0 == n1)
-				break;
-		}
-	}
-	return ret;
-}
-
 #include "perf_event_amd.c"
 #include "perf_event_p6.c"
 #include "perf_event_p4.c"
@@ -1454,6 +1347,51 @@ static inline void x86_pmu_read(struct perf_event *event)
	x86_perf_event_update(event);
 }
 
+/*
+ * Set the flag to make pmu::enable() not perform the
+ * schedulability test.
+ */
+static void x86_pmu_start_txn(const struct pmu *pmu)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+	cpuc->flag |= PERF_EVENT_TRAN_STARTED;
+}
+
+static void x86_pmu_stop_txn(const struct pmu *pmu)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+	cpuc->flag &= ~PERF_EVENT_TRAN_STARTED;
+}
+
+/*
+ * Return 0 if commit transaction success
+ */
+static int x86_pmu_commit_txn(const struct pmu *pmu)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	int assign[X86_PMC_IDX_MAX];
+	int n, ret;
+
+	n = cpuc->n_events;
+
+	if (!x86_pmu_initialized())
+		return -EAGAIN;
+
+	ret = x86_pmu.schedule_events(cpuc, n, assign);
+	if (ret)
+		return ret;
+
+	/*
+	 * copy new assignment, now we know it is possible
+	 * will be used by hw_perf_enable()
+	 */
+	memcpy(cpuc->assign, assign, n*sizeof(int));
+
+	return 0;
+}
+
 static const struct pmu pmu = {
	.enable		= x86_pmu_enable,
	.disable	= x86_pmu_disable,
@@ -1461,6 +1399,9 @@ static const struct pmu pmu = {
	.stop		= x86_pmu_stop,
	.read		= x86_pmu_read,
	.unthrottle	= x86_pmu_unthrottle,
+	.start_txn	= x86_pmu_start_txn,
+	.stop_txn	= x86_pmu_stop_txn,
+	.commit_txn	= x86_pmu_commit_txn,
 };
 
 /*
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index bf896d0..862b965 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -524,6 +524,8 @@ struct hw_perf_event {
 
 struct perf_event;
 
+#define PERF_EVENT_TRAN_STARTED 1
+
 /**
  * struct pmu - generic performance monitoring unit
  */
@@ -534,6 +536,9 @@ struct pmu {
	void (*stop)			(struct perf_event *event);
	void (*read)			(struct perf_event *event);
	void (*unthrottle)		(struct perf_event *event);
+	void (*start_txn)		(const struct pmu *pmu);
+	void (*stop_txn)		(const struct pmu *pmu);
+	int  (*commit_txn)		(const struct pmu *pmu);
 };
 
 /**
@@ -799,9 +804,6 @@ extern void perf_disable(void);
 extern void perf_enable(void);
 extern int perf_event_task_disable(void);
 extern int perf_event_task_enable(void);
-extern int hw_perf_group_sched_in(struct perf_event *group_leader,
-	       struct perf_cpu_context *cpuctx,
-	       struct perf_event_context *ctx);
 extern void perf_event_update_userpage(struct perf_event *event);
 extern int perf_event_release_kernel(struct perf_event *event);
 extern struct perf_event *
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 07b7a43..1503174 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -83,14 +83,6 @@ extern __weak const struct pmu *hw_perf_event_init(struct perf_event *event)
 void __weak hw_perf_disable(void)		{ barrier(); }
 void __weak hw_perf_enable(void)		{ barrier(); }
 
-int __weak
-hw_perf_group_sched_in(struct perf_event *group_leader,
-	       struct perf_cpu_context *cpuctx,
-	       struct perf_event_context *ctx)
-{
-	return 0;
-}
-
 void __weak perf_event_print_debug(void)	{ }
 
 static DEFINE_PER_CPU(int, perf_disable_count);
 
@@ -641,15 +633,14 @@ group_sched_in(struct perf_event *group_event,
	       struct perf_cpu_context *cpuctx,
	       struct perf_event_context *ctx)
 {
-	struct perf_event *event, *partial_group;
+	struct perf_event *event, *partial_group = NULL;
+	const struct pmu *pmu = group_event->pmu;
	int ret;
 
	if (group_event->state == PERF_EVENT_STATE_OFF)
		return 0;
 
-	ret = hw_perf_group_sched_in(group_event, cpuctx, ctx);
-	if (ret)
-		return ret < 0 ? ret : 0;
+	pmu->start_txn(pmu);
 
	if (event_sched_in(group_event, cpuctx, ctx))
		return -EAGAIN;
@@ -664,9 +655,15 @@ group_sched_in(struct perf_event *group_event,
		}
	}
 
-	return 0;
+	ret = pmu->commit_txn(pmu);
+	if (!ret) {
+		pmu->stop_txn(pmu);
+		return 0;
+	}
 
 group_error:
+	pmu->stop_txn(pmu);
+
	/*
	 * Groups can be scheduled in as one unit only, so undo any
	 * partial group before returning:

^ permalink raw reply related	[flat|nested] 55+ messages in thread
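To make the control flow of the proposed transaction interface easier to follow, here is a minimal userspace model of the all-or-nothing group scheduling above. Everything in it (`mock_pmu`, the counter-capacity check, the helper names) is invented for illustration; only the start_txn/commit_txn/stop_txn shape mirrors the patch, and the kernel code naturally does far more.

```c
#include <assert.h>

#define TRAN_STARTED 1

/* Illustrative stand-in for the per-cpu PMU state in the patch. */
struct mock_pmu {
	unsigned char flag;   /* models cpu_hw_events::flag */
	int n_events;         /* events currently collected */
	int n_counters;       /* models the hardware counter limit */
};

static void start_txn(struct mock_pmu *pmu) { pmu->flag |= TRAN_STARTED; }
static void stop_txn(struct mock_pmu *pmu)  { pmu->flag &= ~TRAN_STARTED; }

/* models x86_pmu.schedule_events(): the single schedulability test,
 * done once for the whole group at commit time */
static int commit_txn(struct mock_pmu *pmu)
{
	return pmu->n_events <= pmu->n_counters ? 0 : -1;
}

/* models pmu->enable() inside a transaction: with TRAN_STARTED set,
 * no per-event schedulability test is done, the event is just added */
static void enable_event(struct mock_pmu *pmu)
{
	pmu->n_events++;
}

/* models group_sched_in(): schedule a whole group or none of it */
static int group_sched_in(struct mock_pmu *pmu, int group_size)
{
	int saved = pmu->n_events;
	int i;

	start_txn(pmu);
	for (i = 0; i < group_size; i++)
		enable_event(pmu);

	if (commit_txn(pmu) == 0) {
		stop_txn(pmu);          /* success: clear the txn state */
		return 0;
	}

	stop_txn(pmu);                  /* rollback must also clear it */
	pmu->n_events = saved;          /* undo the whole group */
	return -1;
}
```

Note how `stop_txn()` appears on both paths; that is exactly the point Peter and Lin Ming debate further down: the rollback path needs to clear the state too, so the clearing cannot simply be folded into `commit_txn()`.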
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-04-21 22:12           ` Lin Ming
@ 2010-04-21 14:22           ` Peter Zijlstra
  2010-04-21 22:38             ` Lin Ming
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2010-04-21 14:22 UTC (permalink / raw)
To: Lin Ming; +Cc: eranian@gmail.com, Gary.Mohr@Bull.com, Corey Ashford, LKML

On Wed, 2010-04-21 at 22:12 +0000, Lin Ming wrote:
> +	ret = pmu->commit_txn(pmu);
> +	if (!ret) {
> +		pmu->stop_txn(pmu);
> +		return 0;
> +	}
>
>  group_error:
> +	pmu->stop_txn(pmu);

If you let commit_txn() also clear the state you can save some logic and
a method.

But yes, this looks good. If you don't remove the weak interface just
yet, you can do a patch per architecture that uses this (at least
powerpc and sparc do), and remove the weak thing at the end once all
users are gone.

^ permalink raw reply	[flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-04-21 14:22           ` Peter Zijlstra
@ 2010-04-21 22:38           ` Lin Ming
  2010-04-21 14:53             ` Peter Zijlstra
  0 siblings, 1 reply; 55+ messages in thread
From: Lin Ming @ 2010-04-21 22:38 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: eranian@gmail.com, Gary.Mohr@Bull.com, Corey Ashford, LKML

On Wed, 2010-04-21 at 22:22 +0800, Peter Zijlstra wrote:
> On Wed, 2010-04-21 at 22:12 +0000, Lin Ming wrote:
> > +	ret = pmu->commit_txn(pmu);
> > +	if (!ret) {
> > +		pmu->stop_txn(pmu);
> > +		return 0;
> > +	}
> >
> >  group_error:
> > +	pmu->stop_txn(pmu);
>
> If you let commit_txn() also clear the state you can save some logic and
> a method.

I add this ->stop_txn(pmu) because the rollback code also needs to clear
the state.

> But yes, this looks good. If you don't remove the weak interface just
> yet, you can do a patch per architecture that uses this (at least
> powerpc and sparc do), and remove the weak thing at the end once all
> users are gone.

OK, I'll do that for powerpc and sparc. I also need to check whether any
other architectures implement the transaction methods, and guard the
calls in group_sched_in accordingly, as below.

group_sched_in(...)
{
	...
	if (pmu->start_txn)
		pmu->start_txn(pmu);
	...
	if (pmu->commit_txn)
		pmu->commit_txn(pmu);
	...
}

Thanks,
Lin Ming

^ permalink raw reply	[flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-04-21 22:38           ` Lin Ming
@ 2010-04-21 14:53           ` Peter Zijlstra
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2010-04-21 14:53 UTC (permalink / raw)
To: Lin Ming; +Cc: eranian@gmail.com, Gary.Mohr@Bull.com, Corey Ashford, LKML

On Wed, 2010-04-21 at 22:38 +0000, Lin Ming wrote:
>
> I add this ->stop_txn(pmu) because the rollback code also needs to clear
> the state.

Ah, yes, because strictly speaking ->enable() could still fail for
another reason. OK.

^ permalink raw reply	[flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-03-30 16:49     ` Corey Ashford
  2010-03-30 17:15       ` Peter Zijlstra
@ 2010-03-30 21:28       ` stephane eranian
  2010-03-30 23:11         ` Corey Ashford
  1 sibling, 1 reply; 55+ messages in thread
From: stephane eranian @ 2010-03-30 21:28 UTC (permalink / raw)
To: Corey Ashford
Cc: Lin Ming, Peter Zijlstra, Ingo Molnar, LKML, Andi Kleen,
	Paul Mackerras, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra,
	Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt,
	Arnaldo Carvalho de Melo, Masami Hiramatsu

On Tue, Mar 30, 2010 at 6:49 PM, Corey Ashford
<cjashfor@linux.vnet.ibm.com> wrote:
> On 03/30/2010 12:42 AM, Lin Ming wrote:
>>
>> Hi, Corey
>>
>> How is this going now? Are you still working on this?
>> I'd like to help to add support for uncore, test, write code or anything
>> else.
>>
>> Thanks,
>> Lin Ming
>
> I haven't been actively working on adding infrastructure for nest PMUs yet.
> At the moment we are working on supporting nest events for IBM's Wire-Speed
> processor, using the current infrastructure, because of time limitations.
> Using the existing infrastructure is definitely not ideal, but for this
> processor, it's workable.
>
> There are still a lot of issues to solve for adding this infrastructure:
>
> 1) Does perf_events need a new context type (in addition to per-task and
> per-cpu)? This is partly because we don't want to be mixing the rotation of
> CPU events with nest events. Each PMU really ought to have its own event
> list.
>
I concur that you don't want to mix events from different PMUs in the same
rotation list. There are some side effects of not doing this with AMD64
Northbridge events when you have multiple concurrent sessions. But that is
a special case, where the "uncore" (nest) PMU is actually controlled via
the core PMU. I think that is okay for now, but you certainly don't want
that for the Nehalem uncore, for instance.

> 2) How do we deal with accessing PMUs which require slow access methods
> (e.g. an internal serial bus)? The accesses may need to be placed on worker
> threads so that they don't affect the performance of context switches and
> system ticks.
>
> 3) How exactly do we represent the PMUs in the pseudo-fs (/sys or /proc)?
> And how exactly does the user specify the PMU to perf_events?
> Peter Zijlstra and Stephane Eranian both recommended opening the PMU with
> open() and then passing the resulting fd in through the perf_event_attr
> struct.
>
> 4) How do we choose a CPU to do the housekeeping work for a particular nest
> PMU? Peter thought that user space should still specify it via the
> perf_event_open() cpu parameter, but there's also an argument to be made
> for the kernel choosing the best CPU to handle the job, or at least making
> it optional for the user to choose the CPU.
>
One of the housekeeping tasks is to handle uncore PMU interrupts, for
instance. That is not a trivial task, given that events are managed
independently and that you could be monitoring per-thread or system-wide.
It may be that some uncore PMUs can only interrupt one core; the Intel
Nehalem uncore can interrupt many at once.

^ permalink raw reply	[flat|nested] 55+ messages in thread
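Point 2) above is essentially a deferred-work problem: keep the slow bus transaction off the hot path and let the requester wait for a completion. The kernel version would use a workqueue; here is a hedged userspace sketch of the same idea using a POSIX thread. `slow_pmu_read()`, `read_uncore_counter()`, and the fixed return value are inventions for this illustration, not anything from the proposed infrastructure.

```c
#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

/* Completion object the requester waits on while the worker
 * performs the slow access. */
struct read_req {
	pthread_mutex_t lock;
	pthread_cond_t  done_cv;
	int             done;
	uint64_t        value;   /* result of the slow read */
};

/* Stands in for a PMU register access over a slow internal serial bus. */
static uint64_t slow_pmu_read(void)
{
	usleep(1000);            /* pretend the bus transaction takes a while */
	return 42;               /* illustrative fixed counter value */
}

/* Worker context: do the slow part, then signal completion. */
static void *worker(void *arg)
{
	struct read_req *req = arg;
	uint64_t v = slow_pmu_read();

	pthread_mutex_lock(&req->lock);
	req->value = v;
	req->done = 1;
	pthread_cond_signal(&req->done_cv);
	pthread_mutex_unlock(&req->lock);
	return NULL;
}

/* Issue the read and block until the worker completes it; the point
 * is that a context switch or tick handler would instead just queue
 * the work and pick the value up later. */
static uint64_t read_uncore_counter(void)
{
	struct read_req req = {
		.lock    = PTHREAD_MUTEX_INITIALIZER,
		.done_cv = PTHREAD_COND_INITIALIZER,
	};
	pthread_t tid;

	pthread_create(&tid, NULL, worker, &req);

	pthread_mutex_lock(&req.lock);
	while (!req.done)
		pthread_cond_wait(&req.done_cv, &req.lock);
	pthread_mutex_unlock(&req.lock);

	pthread_join(tid, NULL);
	return req.value;
}
```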
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-03-30 21:28       ` stephane eranian
@ 2010-03-30 23:11       ` Corey Ashford
  2010-03-31 13:43         ` stephane eranian
  0 siblings, 1 reply; 55+ messages in thread
From: Corey Ashford @ 2010-03-30 23:11 UTC (permalink / raw)
To: eranian
Cc: stephane eranian, Lin Ming, Peter Zijlstra, Ingo Molnar, LKML,
	Andi Kleen, Paul Mackerras, Frederic Weisbecker, Xiao Guangrong,
	Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love,
	Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu

On 3/30/2010 2:28 PM, stephane eranian wrote:
> On Tue, Mar 30, 2010 at 6:49 PM, Corey Ashford
> <cjashfor@linux.vnet.ibm.com> wrote:
>>
>> 4) How do we choose a CPU to do the housekeeping work for a particular
>> nest PMU? Peter thought that user space should still specify it via the
>> perf_event_open() cpu parameter, but there's also an argument to be made
>> for the kernel choosing the best CPU to handle the job, or at least
>> making it optional for the user to choose the CPU.
>>
> One of the housekeeping tasks is to handle uncore PMU interrupts, for
> instance. That is not a trivial task, given that events are managed
> independently and that you could be monitoring per-thread or system-wide.
> It may be that some uncore PMUs can only interrupt one core; the Intel
> Nehalem uncore can interrupt many at once.

That's a good point, and I think it's unreasonable to expect the user to
know exactly how the interrupts are routed from the uncore/nest PMU to
the CPU(s).

Perhaps one way around this would be to return an error if the chosen CPU
isn't fully capable of performing the housekeeping functions for the
requested PMU. But this certainly isn't ideal, because relying on this
mechanism would require that the user (or user tool) figure out which CPU
is fully capable by trial and error.

- Corey

^ permalink raw reply	[flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-03-30 23:11       ` Corey Ashford
@ 2010-03-31 13:43       ` stephane eranian
  0 siblings, 0 replies; 55+ messages in thread
From: stephane eranian @ 2010-03-31 13:43 UTC (permalink / raw)
To: Corey Ashford
Cc: Lin Ming, Peter Zijlstra, Ingo Molnar, LKML, Andi Kleen,
	Paul Mackerras, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra,
	Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt,
	Arnaldo Carvalho de Melo, Masami Hiramatsu

On Wed, Mar 31, 2010 at 1:11 AM, Corey Ashford
<cjashfor@linux.vnet.ibm.com> wrote:
> On 3/30/2010 2:28 PM, stephane eranian wrote:
>>
>> One of the housekeeping tasks is to handle uncore PMU interrupts, for
>> instance. That is not a trivial task, given that events are managed
>> independently and that you could be monitoring per-thread or
>> system-wide. It may be that some uncore PMUs can only interrupt one
>> core; the Intel Nehalem uncore can interrupt many at once.
>
> That's a good point, and I think it's unreasonable to expect the user to
> know exactly how the interrupts are routed from the uncore/nest PMU to
> the CPU(s).
>
> Perhaps one way around this would be to return an error if the chosen CPU
> isn't fully capable of performing the housekeeping functions for the
> requested PMU. But this certainly isn't ideal, because relying on this
> mechanism would require that the user (or user tool) figure out which CPU
> is fully capable by trial and error.
>
I think users should not have to worry about any of this.

It is also fine to restrict uncore monitoring to system-wide mode, i.e.,
not per-thread. But I have also seen people request just that, hoping they
could draw correlations between core and uncore events in one run. I think
this is less critical, though.

In the specific case of a CPU uncore (e.g., the Nehalem uncore), you could
simply identify the uncore PMU with the CPU. For anything else, you would
definitely need the file descriptor. The more uniform approach would have
been to use the file descriptor all along, except for per-thread. But
that's fine, I think.

^ permalink raw reply	[flat|nested] 55+ messages in thread
Thread overview: 55+ messages
2010-01-19 19:41 [RFC] perf_events: support for uncore a.k.a. nest units Corey Ashford
2010-01-20 0:44 ` Andi Kleen
2010-01-20 1:49 ` Corey Ashford
2010-01-20 9:35 ` Andi Kleen
2010-01-20 19:28 ` Corey Ashford
2010-01-20 13:34 ` Peter Zijlstra
2010-01-20 21:33 ` Peter Zijlstra
2010-01-20 23:23 ` Corey Ashford
2010-01-21 7:21 ` Ingo Molnar
2010-01-21 19:13 ` Corey Ashford
2010-01-21 19:28 ` Corey Ashford
2010-01-27 10:28 ` Ingo Molnar
2010-01-27 19:50 ` Corey Ashford
2010-01-28 10:57 ` Peter Zijlstra
2010-01-28 18:00 ` Corey Ashford
2010-01-28 19:06 ` Peter Zijlstra
2010-01-28 19:44 ` Corey Ashford
2010-01-28 22:08 ` Corey Ashford
2010-01-29 9:52 ` Peter Zijlstra
2010-01-29 23:05 ` Corey Ashford
2010-01-30 8:42 ` Peter Zijlstra
2010-02-01 19:39 ` Corey Ashford
2010-02-01 19:54 ` Peter Zijlstra
2010-01-21 8:36 ` Peter Zijlstra
2010-01-21 8:47 ` stephane eranian
2010-01-21 8:59 ` Peter Zijlstra
2010-01-21 9:16 ` stephane eranian
2010-01-21 9:43 ` stephane eranian
[not found] ` <d3f22a1003290213x7d7904an59d50eb6a8616133@mail.gmail.com>
2010-03-30 7:42 ` Lin Ming
2010-03-30 16:49 ` Corey Ashford
2010-03-30 17:15 ` Peter Zijlstra
2010-03-30 22:12 ` Corey Ashford
2010-03-31 14:01 ` Peter Zijlstra
2010-03-31 14:13 ` stephane eranian
2010-03-31 15:49 ` Maynard Johnson
2010-03-31 17:50 ` Corey Ashford
2010-04-15 21:16 ` Gary.Mohr
2010-04-16 13:24 ` Peter Zijlstra
2010-04-19 9:08 ` Lin Ming
2010-04-19 9:27 ` Peter Zijlstra
2010-04-20 11:55 ` Lin Ming
2010-04-20 12:03 ` Peter Zijlstra
2010-04-21 8:08 ` Lin Ming
2010-04-21 8:32 ` stephane eranian
2010-04-21 8:39 ` Lin Ming
2010-04-21 8:44 ` stephane eranian
2010-04-21 9:42 ` Lin Ming
2010-04-21 9:57 ` Peter Zijlstra
2010-04-21 22:12 ` Lin Ming
2010-04-21 14:22 ` Peter Zijlstra
2010-04-21 22:38 ` Lin Ming
2010-04-21 14:53 ` Peter Zijlstra
2010-03-30 21:28 ` stephane eranian
2010-03-30 23:11 ` Corey Ashford
2010-03-31 13:43 ` stephane eranian