* [RFC] perf_events: support for uncore a.k.a. nest units
@ 2010-01-19 19:41 Corey Ashford
2010-01-20 0:44 ` Andi Kleen
` (2 more replies)
0 siblings, 3 replies; 55+ messages in thread
From: Corey Ashford @ 2010-01-19 19:41 UTC (permalink / raw)
To: LKML, Ingo Molnar, Andi Kleen, Paul Mackerras, Stephane Eranian,
Peter Zijlstra, Frederic Weisbecker, Xiao Guangrong
Cc: Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love
-----
Intro
-----
One subject that hasn't been addressed since the introduction of perf_events in
the Linux kernel is support for "uncore" or "nest" unit events. "Uncore" is the
term Intel engineers use for units that are off-core but still on the same die
as the cores; "nest" means exactly the same thing to IBM Power processor
engineers. I will use the term uncore for brevity and because it's in common
parlance, but the issues and design possibilities below are relevant to both. I
will also broaden the term: uncore will also refer to PMUs that are completely
off the processor chip altogether.
Contents
--------
1. Why support PMUs in uncore units? Is there anything interesting to look at?
2. How do uncore events differ from core events?
3. Why does a CPU need to be assigned to manage a particular uncore unit's events?
4. How do you encode uncore events?
5. How do you address a particular uncore PMU?
6. Event rotation issues with uncore PMUs
7. Other issues?
8. Feedback?
----
1. Why support PMUs in uncore units? Is there anything interesting to look at?
----
Today, many x86 chips contain uncore units, and we think that it's likely that
the trend will continue, as more devices - I/O, memory interfaces, shared
caches, accelerators, etc. - are integrated onto multi-core chips. As these
devices become more sophisticated and more workload is diverted off-core,
engineers and performance analysts are going to want to look at what's happening
in these units so that they can find bottlenecks.
In addition, we think that even off-chip I/O and interconnect devices are likely
to gain PMUs because engineers will want to find bottlenecks in their massively
parallel systems.
----
2. How do uncore events differ from core events?
----
The main difference is that uncore events are most likely not going to be tied
to a particular Linux task, or even a CPU context. Uncore units are resources
that are in some sense system-wide, though they may not really be accessible
system-wide on some architectures. In the case of accelerators and I/O devices,
it's likely they will run asynchronously from the cores, and thus keeping track
of events on a per-task basis doesn't make a lot of sense. The other existing
mode in perf_events is a per-CPU context, and it turns out that this mode does
match up well with uncore units, though the choice of which CPU to use to manage
a given uncore unit will need to be arch-dependent and may involve other
issues as well, such as minimizing access latency between the uncore unit and
the CPU which is managing it.
----
3. Why does a CPU need to be assigned to manage a particular uncore unit's events?
----
* The control registers of the uncore unit's PMU need to be read and written,
and that may be possible only from a subset of processors in the system.
* A processor is needed to rotate the event list on the uncore unit on every
tick for the purposes of event scheduling.
* Because of access latency issues, we may want the CPU to be close in locality
to the PMU.
It seems like a good idea to let the kernel decide which CPU to use to monitor a
particular uncore event, based on the location of the uncore unit, and possibly
current system load balance. The user will not want to have to figure out this
detailed information.
----
4. How do you encode uncore events?
----
Uncore events will need to be encoded in the config field of the perf_event_attr
struct using the existing PERF_TYPE_RAW encoding. 64 bits are available in the
config field, and that may be sufficient to support events on most systems.
However, due to the proliferation and added complexity of PMUs we envision, we
might want to add another 64-bit config (perhaps call it config_extra or
config2) field to encode any extra attributes that might be needed. The exact
encoding used, just as for the current encoding for core events, will be on a
per-arch and possibly per-system basis.
----
5. How do you address a particular uncore PMU?
----
This one is going to be very system- and arch-dependent, but it seems fairly
clear that we need some sort of addressing scheme that can be
system/arch-defined by the kernel.
From a hierarchical perspective, here's an example of possible uncore PMU
locations in a large system:
1) Per-core - units that are shared between all hardware threads in a core
2) Per-node - units that are shared between all cores in a node
3) Per-chip - units that are shared between all nodes in a chip
4) Per-blade - units that are shared between all chips on a blade
5) Per-rack - units that are shared between all blades in a rack
Addressing Option 1)
Reuse the cpu argument: cpu would be interpreted differently if an uncore unit
is specified (via the perf_event_attr struct's config field).
For the hypothetical system described above, we'd want to have an address that
contains enough address bits for each of the above. For example:
bits    field
------  -----
3..0    PMU number 0-15  /* specifies which of several identical PMUs is
                            being addressed */
7..4    core id    0-15
8..8    node id    0-1
11..9   chip id    0-7
16..12  blade id   0-31
23..17  rack id    0-127
These fields would be exposed via /usr/include/linux/perf_events_uncore_addr.h
(for example). How you actually assign these numbers to actual hardware is,
again, system-design dependent, and may be influenced by the use of a
hypervisor, or other software which allocates resources available to the system
dynamically.
How does the user discover the mapping between the hardware made available to
the system and the addresses shown above? Again, this is system-dependent, and
probably outside the scope of this proposal. In other words, I don't know how
to do this in a general way, though I could probably put something together for
a particular system.
Addressing Option 2)
Have the kernel create nodes for each uncore PMU in /sys/devices/system or other
pseudo file system, such as the existing /proc/device-tree on Power systems.
/sys/devices/system or /proc/device-tree could be explored by the
user tool, and the user could then specify the path of the requested PMU via a
string which the kernel could interpret. To be overly simplistic, something
like "/sys/devices/system/pmus/blade4/cpu0/vectorcopro1". If we settled on a
common tree root to use, we could specify only the relative path name,
"blade4/cpu0/vectorcopro1".
One way to provide this extra "PMU path" argument to sys_perf_event_open()
would be to add a bit to the flags argument that says we're adding a PMU path
string onto the end of the argument list.
This path-string-based addressing option seems more flexible in the long run,
and does not have as serious an issue in mapping PMUs to user space; the
kernel essentially exposes to user space all of the available PMUs for the
current partition. This might create more work on the kernel side, but should
make the system more transparent for user-space tools. Another system- or at
least arch-dependent tool would have to be written for user space to help users
navigate the device tree to find the PMU they want to use. I don't think it
would make sense to build that capability into perf, because the software would
be arch- or system-dependent.
It could be argued that we should use a common user space tree to represent PMUs
for all architectures and systems, so that the arch-independent perf code would
be able to display available uncore PMUs. That may be a goal that's very hard
to achieve because of the wide variation in architectures. Any thoughts on that?
----
6. Event rotation issues with uncore PMUs
----
Currently, the perf_events code rotates the set of events assigned to a CPU or
task on every system tick, so that event scheduling collisions on a PMU are
mitigated. This turns out to cause problems for uncore units for two reasons:
inefficiency and CPU load.
a) Rotating a single combined list of events that spans more than one PMU is inefficient.
Consider the following event list; the letter designates the PMU and the number
is the event number on that PMU.
A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4 C5
After one rotation, you can see that the event list will be:
C5 A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4
and then
C4 C5 A1 A2 A3 B1 B2 B3 B4 C1 C2 C3
Notice how the relative positions for the A and B PMU events haven't changed
even after two (or even five) rotations, so they will schedule the events in the
same order for some time. This will skew the multiplexing so that some events
will be scheduled much less often than they should or could be.
What we'd like to have happen is that events for each PMU be rotated in their
own lists. For example, before rotation:
A1 A2 A3
B1 B2 B3 B4
C1 C2 C3 C4 C5
After rotation:
A3 A1 A2
B4 B1 B2 B3
C5 C1 C2 C3 C4
We've got some ideas about how to make this happen, using either separate lists,
or placing them on separate CPUs.
b) Access to some uncore PMUs may be quite slow due to the interconnect
that is used. This can place a burden on the CPU if it is done every system tick.
This can be addressed by keeping a counter on a per-PMU-context basis that
reduces the rate of event rotations. Setting the rotation period to three, for
example, would cause event rotations in that context to happen on every third
tick, instead of every tick. We think that the kernel could measure the amount
of time it takes to do a rotation, and then dynamically decrease the rotation
rate if it's taking too long; "rotation rate throttling", in other words.
----
7. Other issues?
----
This section left blank for now.
----
8. Feedback?
----
I'd appreciate any feedback you might have on this topic. You can contact me
directly at the email address below, or better yet, reply to LKML.
--
Regards,
- Corey
Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
cjashfor@us.ibm.com
^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-01-19 19:41 [RFC] perf_events: support for uncore a.k.a. nest units Corey Ashford
@ 2010-01-20  0:44 ` Andi Kleen
  2010-01-20  1:49 ` Corey Ashford
  2010-01-20 13:34 ` Peter Zijlstra
  [not found] ` <d3f22a1003290213x7d7904an59d50eb6a8616133@mail.gmail.com>
  2 siblings, 1 reply; 55+ messages in thread
From: Andi Kleen @ 2010-01-20 0:44 UTC (permalink / raw)
To: Corey Ashford
Cc: LKML, Ingo Molnar, Andi Kleen, Paul Mackerras, Stephane Eranian,
    Peter Zijlstra, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra,
    Philip Mucci, Maynard Johnson, Carl Love

On Tue, Jan 19, 2010 at 11:41:01AM -0800, Corey Ashford wrote:
> One subject that hasn't been addressed since the introduction of
> perf_events in the Linux kernel is that of support for "uncore" or "nest"
> unit events. Uncore is the term used by the Intel engineers for their
> off-core units but are still on the same die as the cores, and "nest" means
> exactly the same thing for IBM Power processor engineers. I will use the
> term uncore for brevity and because it's in common parlance, but the issues
> and design possibilities below are relevant to both. I will also broaden
> the term by stating that uncore will also refer to PMUs that are completely
> off of the processor chip altogether.

Yes, e.g. chipsets commonly have their own PMUs too.

> The main difference is that uncore events are mostly likely not going to be
> tied to a particular Linux task, or even a CPU context. Uncore units are
> resources that are in some sense system-wide, though, they may not really
> be accessible system-wide in some architectures. In the case of
> accelerators and I/O devices, it's likely they will run asynchronously from
> the cores, and thus keeping track of events on a per-task basis doesn't
> make a lot of sense. The other existing mode in perf_events is a per-CPU
> context, and it turns out that this mode does match up with uncore units
> well, though the choice of which CPU to use to manage that uncore unit is
> going to need to be arch-dependent and may involve other issues as well,
> such as minimizing access latency between the uncore unit and the CPU which
> is managing it.

What the user needs to know is which CPUs are affected by that uncore
event. For example the integrated memory controller counters that count
local accesses should be somehow associated with the local CPUs.

> 4. How do you encode uncore events?
> ----
> Uncore events will need to be encoded in the config field of the
> perf_event_attr struct using the existing PERF_TYPE_RAW encoding. 64 bits
> are available in the config field, and that may be sufficient to support
> events on most systems. However, due to the proliferation and added
> complexity of PMUs we envision, we might want to add another 64-bit config
> (perhaps call it config_extra or config2) field to encode any extra
> attributes that might be needed. The exact encoding used, just as for the
> current encoding for core events, will be on a per-arch and possibly
> per-system basis.

I don't think a raw hex number will scale anywhere. You'll need a human
readable event list / sub event masks with help texts.

Often uncore events have specific restrictions, and that needs
to be enforced somewhere too.

Doing that all in a clean way that is also usable
by programs likely needs a lot more thinking.

> bits field
> ------ -----
> 3..0 PMU number 0-15 /* specifies which of several identical PMUs being
> addressed */
> 7..4 core id 0-15
> 8..8 node id 0-1
> 11..9 chip id 0-7
> 16..12 blade id 0-31
> 23..17 rack id 0-128

Such a compressed addressing scheme doesn't seem very future proof.
e.g. 4 bits for the core is already obsolete (see the "80 core chip"
that was recently announced)

> probably put something together for a particular system.
>
> Addressing Option 2)
>
> Have the kernel create nodes for each uncore PMU in /sys/devices/system or
> other pseudo file system, such as the existing /proc/device-tree on Power
> systems. /sys/devices/system or /proc/device-tree could be explored by the
> user tool, and the user could then specify the path of the requested PMU
> via a string which the kernel could interpret. To be overly simplistic,
> something like "/sys/devices/system/pmus/blade4/cpu0/vectorcopro1". If we
> settled on a common tree root to use, we could specify only the relative
> path name, "blade4/cpu0/vectorcopro1".

That's a more workable scheme, but you still need to find a clean
way to describe topology (see above). The existing examples in sysfs
are unfortunately all clumsy imho.

-Andi

--
ak@linux.intel.com -- Speaking for myself only.
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-01-20  0:44 ` Andi Kleen
@ 2010-01-20  1:49 ` Corey Ashford
  2010-01-20  9:35 ` Andi Kleen
  0 siblings, 1 reply; 55+ messages in thread
From: Corey Ashford @ 2010-01-20 1:49 UTC (permalink / raw)
To: Andi Kleen
Cc: LKML, Ingo Molnar, Paul Mackerras, Stephane Eranian,
    Peter Zijlstra, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra,
    Philip Mucci, Maynard Johnson, Carl Love

On 1/19/2010 4:44 PM, Andi Kleen wrote:
> On Tue, Jan 19, 2010 at 11:41:01AM -0800, Corey Ashford wrote:
>> 4. How do you encode uncore events?
>> ----
>> Uncore events will need to be encoded in the config field of the
>> perf_event_attr struct using the existing PERF_TYPE_RAW encoding. 64 bits
>> are available in the config field, and that may be sufficient to support
>> events on most systems. However, due to the proliferation and added
>> complexity of PMUs we envision, we might want to add another 64-bit config
>> (perhaps call it config_extra or config2) field to encode any extra
>> attributes that might be needed. The exact encoding used, just as for the
>> current encoding for core events, will be on a per-arch and possibly
>> per-system basis.
>
> I don't think a raw hex number will scale anywhere. You'll need a human
> readable event list / sub event masks with help texts.
>
> Often uncore events have specific restrictions, and that needs
> to be enforced somewhere too.
>
> Doing that all in a clean way that is also usable
> by programs likely needs a lot more thinking.

I left out one critical detail here: I had in mind that we'd be using a
library like libpfm for handling the issue of event names + attributes to
raw code translation. In fact, we are using libpfm today for this purpose
in the PAPI/perf_events substrate implementation.

>> bits field
>> ------ -----
>> 3..0 PMU number 0-15 /* specifies which of several identical PMUs being
>> addressed */
>> 7..4 core id 0-15
>> 8..8 node id 0-1
>> 11..9 chip id 0-7
>> 16..12 blade id 0-31
>> 23..17 rack id 0-128
>
> Such a compressed addressing scheme doesn't seem very future proof.
> e.g. core 4 bits for the core is already obsolete (see the "80 core chip" that
> was recently announced)

Agreed. If the designer is very generous with the size of each field, it
could hold up for quite a while, but still there's a problem with relating
these addresses to actual hardware.

>> probably put something together for a particular system.
>>
>> Addressing Option 2)
>>
>> Have the kernel create nodes for each uncore PMU in /sys/devices/system or
>> other pseudo file system, such as the existing /proc/device-tree on Power
>> systems. /sys/devices/system or /proc/device-tree could be explored by the
>> user tool, and the user could then specify the path of the requested PMU
>> via a string which the kernel could interpret. To be overly simplistic,
>> something like "/sys/devices/system/pmus/blade4/cpu0/vectorcopro1". If we
>> settled on a common tree root to use, we could specify only the relative
>> path name, "blade4/cpu0/vectorcopro1".
>
> That's a more workable scheme, but you still need to find a clean
> way to describe topology (see above). The existing examples in sysfs
> are unfortuately all clumpsy imho.

Yes, I agree. Also it's easy to construct a system design that doesn't
have a hierarchical topology. A simple example would be a cluster of 32
nodes, each of which is connected to its 31 neighbors. Perhaps for the
purposes of just enumerating PMUs, a tree might be sufficient, but it's not
clear to me that it is mathematically sufficient for all topologies, not to
mention if it's intuitive enough to use. For example,
highly-interconnected components might require that PMU leaf nodes be
duplicated in multiple branches, i.e. PMU paths might not be unique in some
topologies.

I'm certainly open to better alternatives!

Thanks for your thoughts,
- Corey
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-01-20  1:49 ` Corey Ashford
@ 2010-01-20  9:35 ` Andi Kleen
  2010-01-20 19:28 ` Corey Ashford
  0 siblings, 1 reply; 55+ messages in thread
From: Andi Kleen @ 2010-01-20 9:35 UTC (permalink / raw)
To: Corey Ashford
Cc: Andi Kleen, LKML, Ingo Molnar, Paul Mackerras, Stephane Eranian,
    Peter Zijlstra, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra,
    Philip Mucci, Maynard Johnson, Carl Love

> Yes, I agree. Also it's easy to construct a system design that doesn't
> have a hierarchical topology. A simple example would be a cluster of 32
> nodes, each of which is connected to its 31 neighbors. Perhaps for the

I doubt it's needed or useful to describe all details of an interconnect.

If detailed distance information is needed, a simple table like
the SLIT table exported by ACPI would seem easier to handle.

But at least some degree of locality (e.g. "local memory controller")
would make sense.

> purposes of just enumerating PMUs, a tree might be sufficient, but it's not
> clear to me that it is mathematically sufficient for all topologies, not to
> mention if it's intuitive enough to use. For example,
> highly-interconnected components might require that PMU leaf nodes be
> duplicated in multiple branches, i.e. PMU paths might not be unique in some
> topologies.

We already have cyclical graphs in sysfs using symlinks. I'm not
sure they are all that easy to parse/handle, but at least they
can be described.

-Andi

--
ak@linux.intel.com -- Speaking for myself only.
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-01-20  9:35 ` Andi Kleen
@ 2010-01-20 19:28 ` Corey Ashford
  0 siblings, 0 replies; 55+ messages in thread
From: Corey Ashford @ 2010-01-20 19:28 UTC (permalink / raw)
To: Andi Kleen
Cc: LKML, Ingo Molnar, Paul Mackerras, Stephane Eranian,
    Peter Zijlstra, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra,
    Philip Mucci, Maynard Johnson, Carl Love

On 1/20/2010 1:35 AM, Andi Kleen wrote:
>> Yes, I agree. Also it's easy to construct a system design that doesn't
>> have a hierarchical topology. A simple example would be a cluster of 32
>> nodes, each of which is connected to its 31 neighbors. Perhaps for the
>
> I doubt it's needed or useful to describe all details of an interconnect.

At this point, I don't see a need for that level of detail either.

> If detailed distance information is needed a simple table like
> the SLIT table exported by ACPI would seem easier to handle.

Thanks for the pointer. I didn't know about the ACPI SLIT and SRAT tables
until your post. Having had a quick look at them, I don't think they'd be
that helpful to us, at least at this point.

> But at least some degree of locality (e.g. "local memory controller")
> would make sense.

I think locality could be determined by looking at the device tree. For
example, a memory controller for a particular processor chip would be a
subdirectory of that chip.

>> purposes of just enumerating PMUs, a tree might be sufficient, but it's not
>> clear to me that it is mathematically sufficient for all topologies, not to
>> mention if it's intuitive enough to use. For example,
>> highly-interconnected components might require that PMU leaf nodes be
>> duplicated in multiple branches, i.e. PMU paths might not be unique in some
>> topologies.
>
> We already have cyclical graphs in sysfs using symlinks. I'm not
> sure they are all that easy to parse/handle, but at least they
> can be described.

Good point.

--
Regards,
- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
cjashfor@us.ibm.com
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-01-19 19:41 [RFC] perf_events: support for uncore a.k.a. nest units Corey Ashford
  2010-01-20  0:44 ` Andi Kleen
@ 2010-01-20 13:34 ` Peter Zijlstra
  2010-01-20 21:33 ` Peter Zijlstra
  [not found] ` <d3f22a1003290213x7d7904an59d50eb6a8616133@mail.gmail.com>
  2 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2010-01-20 13:34 UTC (permalink / raw)
To: Corey Ashford
Cc: LKML, Ingo Molnar, Andi Kleen, Paul Mackerras, Stephane Eranian,
    Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci,
    Maynard Johnson, Carl Love

On Tue, 2010-01-19 at 11:41 -0800, Corey Ashford wrote:
> ----
> 3. Why does a CPU need to be assigned to manage a particular uncore unit's events?
> ----
>
> * The control registers of the uncore unit's PMU need to be read and written,
> and that may be possible only from a subset of processors in the system.
> * A processor is needed to rotate the event list on the uncore unit on every
> tick for the purposes of event scheduling.
> * Because of access latency issues, we may want the CPU to be close in locality
> to the PMU.
>
> It seems like a good idea to let the kernel decide which CPU to use to monitor a
> particular uncore event, based on the location of the uncore unit, and possibly
> current system load balance. The user will not want to have to figure out this
> detailed information.

Well, to some extent the user will have to participate. For example which
uncore pmu will be selected depends on the cpu you're attaching the event
to according to the cpu to node map.

Furthermore the intel uncore thing has curious interrupt routing
capabilities which could be tied into this mapping.

> ----
> 4. How do you encode uncore events?
> ----
> Uncore events will need to be encoded in the config field of the perf_event_attr
> struct using the existing PERF_TYPE_RAW encoding. 64 bits are available in the
> config field, and that may be sufficient to support events on most systems.
> However, due to the proliferation and added complexity of PMUs we envision, we
> might want to add another 64-bit config (perhaps call it config_extra or
> config2) field to encode any extra attributes that might be needed. The exact
> encoding used, just as for the current encoding for core events, will be on a
> per-arch and possibly per-system basis.

Lets cross that bridge when we get there.

> ----
> 5. How do you address a particular uncore PMU?
> ----
>
> This one is going to be very system- and arch-dependent, but it seems fairly
> clear that we need some sort of addressing scheme that can be
> system/arch-defined by the kernel.
>
> From a hierarchical perspective, here's an example of possible uncore PMU
> locations in a large system:
>
> 1) Per-core - units that are shared between all hardware threads in a core
> 2) Per-node - units that are shared between all cores in a node
> 3) Per-chip - units that are shared between all nodes in a chip
> 4) Per-blade - units that are shared between all chips on a blade
> 5) Per-rack - units that are shared between all blades in a rack

So how about PERF_TYPE_{CORE,NODE,SOCKET} like things?

> ----
> 6. Event rotation issues with uncore PMUs
> ----
>
> Currently, the perf_events code rotates the set of events assigned to a CPU or
> task on every system tick, so that event scheduling collisions on a PMU are
> mitigated. This turns out to cause problems for uncore units for two reasons -
> inefficiency and CPU load.

Well, if you give these things a cpumask and put them all onto the
context of first cpu of that mask things seem to collect nicely.

> b) Access to some PMU uncore units may be quite slow due to the interconnect
> that is used. This can place a burden on the CPU if it is done every system tick.
>
> This can be addressed by keeping a counter, on a per-PMU context basis that
> reduces the rate of event rotations. Setting the rotation period to three, for
> example, would cause event rotations in that context to happen on every third
> tick, instead of every tick. We think that the kernel could measure the amount
> of time it is taking to do a rotate, and then dynamically decrease the rotation
> rate if it's taking too long; "rotation rate throttling" in other words.

The better solution is to generalize the whole rr on tick scheme (which
has already been discussed).
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-01-20 13:34 ` Peter Zijlstra
@ 2010-01-20 21:33 ` Peter Zijlstra
  2010-01-20 23:23 ` Corey Ashford
  2010-01-21  8:47 ` stephane eranian
  0 siblings, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2010-01-20 21:33 UTC (permalink / raw)
To: Corey Ashford
Cc: LKML, Ingo Molnar, Andi Kleen, Paul Mackerras, Stephane Eranian,
    Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci,
    Maynard Johnson, Carl Love

On Wed, 2010-01-20 at 14:34 +0100, Peter Zijlstra wrote:
> So how about PERF_TYPE_{CORE,NODE,SOCKET} like things?

OK, so I read most of the intel uncore stuff, and it seems to suggest
you need a regular pmu event to receive uncore events (chained setup),
this seems rather retarded since it wastes a perfectly good pmu event
and makes configuring all this more intricate...

Ah well, nothing to be done about that I guess..
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-01-20 21:33 ` Peter Zijlstra
@ 2010-01-20 23:23 ` Corey Ashford
  2010-01-21  7:21 ` Ingo Molnar
  2010-01-21  8:36 ` Peter Zijlstra
  1 sibling, 2 replies; 55+ messages in thread
From: Corey Ashford @ 2010-01-20 23:23 UTC (permalink / raw)
To: Peter Zijlstra
Cc: LKML, Ingo Molnar, Andi Kleen, Paul Mackerras, Stephane Eranian,
    Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci,
    Maynard Johnson, Carl Love

On 1/20/2010 1:33 PM, Peter Zijlstra wrote:
> On Wed, 2010-01-20 at 14:34 +0100, Peter Zijlstra wrote:
>
>> So how about PERF_TYPE_{CORE,NODE,SOCKET} like things?
>
> OK, so I read most of the intel uncore stuff, and it seems to suggest
> you need a regular pmu event to receive uncore events (chained setup),
> this seems rather retarded since it wastes a perfectly good pmu event
> and makes configuring all this more intricate...
>
> A well, nothing to be done about that I guess..

Yes, we have a similar situation where in addition to events that are
counted on core PMU counters, we also have counters that are off-core; in
some cases the counters are in off-core units which take their actual
events from other off-core units, in addition to their own events. So you
can see that this can be almost arbitrarily complex.

As for the PERF_TYPE_{CORE,NODE,SOCKET} idea, that could still work, even
though, for example, a socket event may be counted on a core PMU. Using
more encodings for the type field, as you've suggested, would allow us to
reuse the 64-bit config space multiple times.

Were you thinking that with the type field we'd still re-use the "cpu"
argument for the actual pmu address within the PERF_TYPE_* space? If so,
that's an interesting idea, but I think it still leaves open the problem of
how to actually relate those addresses to the real hardware, especially in
the case of using a hypervisor which has provided you a small subset of the
physical hardware in the system.

I really think we need some sort of data structure which is passed from the
kernel to user space to represent the topology of the system, and give
useful information to be able to identify each PMU node. Whether this is
done with a sysfs-style tree, a table in a file, XML, etc... it doesn't
really matter much, but it needs to be something that can be parsed
relatively easily and *contains just enough information* for the user to be
able to correctly choose PMUs, and for the kernel to be able to relate that
back to actual PMU hardware.

In our case, we are looking at /proc/device-tree, and it actually does
appear to contain enough information for us. However, since
/proc/device-tree is not available anywhere but Power arch
(/proc/device-tree originates from a data structure passed into the OS from
the Open Firmware) we'd like to have a more general approach that can be
used on x86 and other arches.

- Corey
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-01-20 23:23 ` Corey Ashford
@ 2010-01-21  7:21 ` Ingo Molnar
  2010-01-21 19:13 ` Corey Ashford
  0 siblings, 1 reply; 55+ messages in thread
From: Ingo Molnar @ 2010-01-21 7:21 UTC (permalink / raw)
To: Corey Ashford
Cc: Peter Zijlstra, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian,
    Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci,
    Maynard Johnson, Carl Love, Steven Rostedt,
    Arnaldo Carvalho de Melo, Masami Hiramatsu

* Corey Ashford <cjashfor@linux.vnet.ibm.com> wrote:

> I really think we need some sort of data structure which is passed from the
> kernel to user space to represent the topology of the system, and give
> useful information to be able to identify each PMU node. Whether this is
> done with a sysfs-style tree, a table in a file, XML, etc... it doesn't
> really matter much, but it needs to be something that can be parsed
> relatively easily and *contains just enough information* for the user to be
> able to correctly choose PMUs, and for the kernel to be able to relate that
> back to actual PMU hardware.

The right way would be to extend the current event description under
/debug/tracing/events with hardware descriptors and (maybe) to formalise this
into a separate /proc/events/ or into a separate filesystem.

The advantage of this is that in the grand scheme of things we _really_ dont
want to limit performance events to 'hardware' hierarchies, or to
devices/sysfs, some existing /proc scheme, or any other arbitrary (and
fundamentally limiting) object enumeration.

We want a unified, logical enumeration of all events and objects that we care
about from a performance monitoring and analysis point of view, shaped for the
purpose of and parsed by perf user-space.

And since the current event descriptors are already rather rich as they
enumerate all sorts of things:

 - tracepoints
 - hw-breakpoints
 - dynamic probes

etc., and are well used by tooling we should expand those with real hardware
structure.

Thanks,

	Ingo
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
From: Corey Ashford @ 2010-01-21 19:13 UTC
To: Ingo Molnar

On 1/20/2010 11:21 PM, Ingo Molnar wrote:
>
> * Corey Ashford <cjashfor@linux.vnet.ibm.com> wrote:
>
>> I really think we need some sort of data structure which is passed from
>> the kernel to user space to represent the topology of the system [...]
>
> The right way would be to extend the current event description under
> /debug/tracing/events with hardware descriptors and (maybe) to formalise
> this into a separate /proc/events/ or into a separate filesystem.
>
> [...]
>
> We want a unified, logical enumeration of all events and objects that we
> care about from a performance monitoring and analysis point of view, shaped
> for the purpose of and parsed by perf user-space. And since the current
> event descriptors are already rather rich as they enumerate all sorts of
> things:
>
>  - tracepoints
>  - hw-breakpoints
>  - dynamic probes
>
> etc., and are well used by tooling, we should expand those with real
> hardware structure.

This is an intriguing idea; I like the idea of generalizing all of this info
into one structure.

So you think that this structure should contain event info as well? If these
structures are created by the kernel, I think that would necessitate placing
large event tables into the kernel, which is something I think we'd prefer to
avoid because of the amount of memory it would take. Keep in mind that we need
not only event names, but event descriptions, encodings, attributes (e.g. unit
masks), attribute descriptions, etc. I suppose the kernel could read a file
from the file system, and then add this info to the tree, but that just seems
bad. Are there existing places in the kernel where it reads a user space file
to create a user space pseudo filesystem?

I think keeping event naming in user space, and PMU naming in kernel space
might be a better idea: the kernel exposes the available PMUs to user space
via some structure, and a user space library tries to recognize the exposed
PMUs and provide event lists and other needed info. The perf tool would use
this library to be able to list available events to users.

-- 
Regards,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
cjashfor@us.ibm.com
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
From: Corey Ashford @ 2010-01-21 19:28 UTC
To: Ingo Molnar

On 1/21/2010 11:13 AM, Corey Ashford wrote:
> [...]
>
> I think keeping event naming in user space, and PMU naming in kernel space
> might be a better idea: the kernel exposes the available PMUs to user space
> via some structure, and a user space library tries to recognize the exposed
> PMUs and provide event lists and other needed info. The perf tool would use
> this library to be able to list available events to users.

Perhaps another way of handling this would be to have the kernel dynamically
load a specific "PMU kernel module" once it has detected that it has a
particular PMU in the hardware. The module would consist only of a data
structure, and a simple API to access the event data. This way, only the PMUs
that actually exist in the hardware would need to be loaded into memory, and
perhaps then only temporarily (just long enough to create the pseudo fs
nodes).

Still, though, since it's a pseudo fs, all of that event data would be taking
up kernel memory.

Another model, perhaps, would be to actually write this data out to a real
file system upon every boot up, so that it wouldn't need to be held in memory.
That seems rather ugly and time consuming, though.

-- 
Regards,

- Corey
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
From: Ingo Molnar @ 2010-01-27 10:28 UTC
To: Corey Ashford

* Corey Ashford <cjashfor@linux.vnet.ibm.com> wrote:

> [...]
>
> Perhaps another way of handling this would be to have the kernel
> dynamically load a specific "PMU kernel module" once it has detected that
> it has a particular PMU in the hardware. The module would consist only of a
> data structure, and a simple API to access the event data. This way, only
> the PMUs that actually exist in the hardware would need to be loaded into
> memory, and perhaps then only temporarily (just long enough to create the
> pseudo fs nodes).
>
> Still, though, since it's a pseudo fs, all of that event data would be
> taking up kernel memory.
>
> Another model, perhaps, would be to actually write this data out to a real
> file system upon every boot up, so that it wouldn't need to be held in
> memory. That seems rather ugly and time consuming, though.

I don't think memory consumption is a problem at all. The structure of the
monitored hardware/software state is information we _want_ the kernel to
provide, mainly because there's no unified repository for user-space to get
this info from.

If someone doesn't want it on some ultra-embedded box then sure a .config
switch can be provided to allow it to be turned off.

	Ingo
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
From: Corey Ashford @ 2010-01-27 19:50 UTC
To: Ingo Molnar

On 1/27/2010 2:28 AM, Ingo Molnar wrote:
> [...]
>
> I don't think memory consumption is a problem at all. The structure of the
> monitored hardware/software state is information we _want_ the kernel to
> provide, mainly because there's no unified repository for user-space to get
> this info from.
>
> If someone doesn't want it on some ultra-embedded box then sure a .config
> switch can be provided to allow it to be turned off.
>
> Ingo

Ok, just so that we quantify things a bit, let's say I have 20 different
types of PMUs totalling 2000 different events, each of which has a name and
text description, averaging 300 characters. Along with that, there's let's
say 4 64-bit words of metadata per event describing encoding, which
attributes apply to the event, and any other needed info. I don't know how
much memory each pseudo fs node takes up. Let me guess and say 128 bytes for
each event node (the amount taken for the PMU nodes would be negligible
compared with the event nodes).

So that's 2000 * (300 + 32 + 128) bytes ~= 920KB of memory.

Let's assume that the correct event module can be loaded dynamically, so that
we don't need to have all of the possible event sets for a particular arch
kernel build.

Any opinions on whether allocating this amount of kernel memory would be
acceptable? It seems like a lot of kernel memory to me, but I come from an
embedded systems background. Granted, most systems are going to use a
fraction of that amount of memory (<100KB) due to having far fewer PMUs and
therefore fewer distinct event types.

There's at least one more dimension to this. Let's say I have 16 uncore PMUs
all of the same type, each of which has, for example, 8 events. As a very
crude pseudo fs, let's say we have a structure like this:

/sys/devices/pmus/
    uncore_pmu0/
        event0/    (path name to here is the name of the pmu and event)
            description              (file)
            applicable_attributes    (file)
        event1/
            description
            applicable_attributes
        event2/
        ...
        event7/
        ...
    uncore_pmu1/
        event0/
            description
            applicable_attributes
        ...
    ...
    uncore_pmu15/
        ...

Now, you can see that there's a lot of replication here, because the event
descriptions and attributes are the same for each uncore pmu. We can use
symlinks to link them to the same descriptions and attribute data, but these
symlinks take up space too, which can add up if there are a lot of identical
PMUs. So for complex and large systems, we might consume several meg of
memory for the pseudo fs.

Note that I'm taking some liberty with the applicable_attributes file. I know
some attribute info has to be in there, but I don't have any sort of concrete
idea as to how to encode it at this point. The point of this email is to get
an idea as to how much memory the pseudo fs would consume.

-- 
Regards,

- Corey
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
From: Peter Zijlstra @ 2010-01-28 10:57 UTC
To: Corey Ashford

On Wed, 2010-01-27 at 11:50 -0800, Corey Ashford wrote:
> [...]
>
> So that's 2000 * (300 + 32 + 128) bytes ~= 920KB of memory.
>
> [...]
>
> There's at least one more dimension to this. Let's say I have 16 uncore
> PMUs all of the same type, each of which has, for example, 8 events. As a
> very crude pseudo fs, let's say we have a structure like this:
>
> /sys/devices/pmus/
>     uncore_pmu0/
>         event0/    (path name to here is the name of the pmu and event)
>             description              (file)
>             applicable_attributes    (file)
>         ...
>     ...
>     uncore_pmu15/
>         ...

I really don't like this. The cpu->uncore map is fixed by the topology of the
machine, which is already available in /sys some place.

Lets simply use the cpu->node mapping and use PERF_TYPE_NODE{,_RAW} or
something like that. We can start with 2 generic events for that type,
local/remote memory accesses, and take it from there.
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-01-28 10:57 ` Peter Zijlstra @ 2010-01-28 18:00 ` Corey Ashford 2010-01-28 19:06 ` Peter Zijlstra 0 siblings, 1 reply; 55+ messages in thread From: Corey Ashford @ 2010-01-28 18:00 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On 01/28/2010 02:57 AM, Peter Zijlstra wrote: > On Wed, 2010-01-27 at 11:50 -0800, Corey Ashford wrote: >> On 1/27/2010 2:28 AM, Ingo Molnar wrote: >>> >>> * Corey Ashford<cjashfor@linux.vnet.ibm.com> wrote: >>> >>>> On 1/21/2010 11:13 AM, Corey Ashford wrote: >>>>> >>>>> >>>>> On 1/20/2010 11:21 PM, Ingo Molnar wrote: >>>>>> >>>>>> * Corey Ashford<cjashfor@linux.vnet.ibm.com> wrote: >>>>>> >>>>>>> I really think we need some sort of data structure which is passed >>>>>> >from the >>>>>>> kernel to user space to represent the topology of the system, and give >>>>>>> useful information to be able to identify each PMU node. Whether this is >>>>>>> done with a sysfs-style tree, a table in a file, XML, etc... it doesn't >>>>>>> really matter much, but it needs to be something that can be parsed >>>>>>> relatively easily and *contains just enough information* for the user >>>>>>> to be >>>>>>> able to correctly choose PMUs, and for the kernel to be able to >>>>>>> relate that >>>>>>> back to actual PMU hardware. >>>>>> >>>>>> The right way would be to extend the current event description under >>>>>> /debug/tracing/events with hardware descriptors and (maybe) to >>>>>> formalise this >>>>>> into a separate /proc/events/ or into a separate filesystem. 
>>>>>> >>>>>> The advantage of this is that in the grand scheme of things we >>>>>> _really_ dont >>>>>> want to limit performance events to 'hardware' hierarchies, or to >>>>>> devices/sysfs, some existing /proc scheme, or any other arbitrary (and >>>>>> fundamentally limiting) object enumeration. >>>>>> >>>>>> We want a unified, logical enumeration of all events and objects that >>>>>> we care >>>>>> about from a performance monitoring and analysis point of view, shaped >>>>>> for the >>>>>> purpose of and parsed by perf user-space. And since the current event >>>>>> descriptors are already rather rich as they enumerate all sorts of >>>>>> things: >>>>>> >>>>>> - tracepoints >>>>>> - hw-breakpoints >>>>>> - dynamic probes >>>>>> >>>>>> etc., and are well used by tooling we should expand those with real >>>>>> hardware >>>>>> structure. >>>>> >>>>> This is an intriguing idea; I like the idea of generalizing all of this >>>>> info into one structure. >>>>> >>>>> So you think that this structure should contain event info as well? If >>>>> these structures are created by the kernel, I think that would >>>>> necessitate placing large event tables into the kernel, which is >>>>> something I think we'd prefer to avoid because of the amount of memory >>>>> it would take. Keep in mind that we need not only event names, but event >>>>> descriptions, encodings, attributes (e.g. unit masks), attribute >>>>> descriptions, etc. I suppose the kernel could read a file from the file >>>>> system, and then add this info to the tree, but that just seems bad. Are >>>>> there existing places in the kernel where it reads a user space file to >>>>> create a user space pseudo filesystem? 
>>>>> >>>>> I think keeping event naming in user space, and PMU naming in kernel >>>>> space might be a better idea: the kernel exposes the available PMUs to >>>>> user space via some structure, and a user space library tries to >>>>> recognize the exposed PMUs and provide event lists and other needed >>>>> info. The perf tool would use this library to be able to list available >>>>> events to users. >>>>> >>>> >>>> Perhaps another way of handing this would be to have the kernel dynamically >>>> load a specific "PMU kernel module" once it has detected that it has a >>>> particular PMU in the hardware. The module would consist only of a data >>>> structure, and a simple API to access the event data. This way, only only >>>> the PMUs that actually exist in the hardware would need to be loaded into >>>> memory, and perhaps then only temporarily (just long enough to create the >>>> pseudo fs nodes). >>>> >>>> Still, though, since it's a pseudo fs, all of that event data would be >>>> taking up kernel memory. >>>> >>>> Another model, perhaps, would be to actually write this data out to a real >>>> file system upon every boot up, so that it wouldn't need to be held in >>>> memory. That seems rather ugly and time consuming, though. >>> >>> I dont think memory consumption is a problem at all. The structure of the >>> monitored hardware/software state is information we _want_ the kernel to >>> provide, mainly because there's no unified repository for user-space to get >>> this info from. >>> >>> If someone doesnt want it on some ultra-embedded box then sure a .config >>> switch can be provided to allow it to be turned off. >>> >>> Ingo >> >> Ok, just so that we quantify things a bit, let's say I have 20 different types >> of PMUs totalling 2000 different events, each of which has a name and text >> description, averaging 300 characters. 
Along with that, there's let's say 4 >> 64-bit words of metadata per event describing encoding, which attributes apply >> to the event, and any other needed info. I don't know how much memory each >> pseudo fs node takes up. Let me guess and say 128 bytes for each event node >> (the amount taken for the PMU nodes would be negligible compared with the event >> nodes). >> >> So thats 2000 * (300 + 32 + 128) bytes ~= 920KB of memory. >> >> Let's assume that the correct event module can be loaded dynamically, so that we >> don't need to have all of the possible event sets for a particular arch kernel >> build. >> >> Any opinions on whether allocating this amount of kernel memory would be >> acceptable? It seems like a lot of kernel memory to me, but I come from an >> embedded systems background. Granted, most systems are going to use a fraction >> of that amount of memory (<100KB) due to having far fewer PMUs and therefore >> fewer distinct event types. >> >> There's at least one more dimension to this. Let's say I have 16 uncore PMUs >> all of the same type, each of which has, for example 8 events. As a very crude >> pseudo fs, let's say we have a structure like this: >> >> >> /sys/devices/pmus/ >> uncore_pmu0/ >> event0/ (path name to here is the name of the pmu and event) >> description (file) >> applicable_attributes (file) >> event1/ >> description >> applicable_attributes >> event2/ >> ... >> event7/ >> ... >> uncore_pmu1/ >> event0/ >> description >> applicable_attributes >> ... >> ... >> uncore_pmu15/ >> ... > > I really don't like this. The the cpu->uncore map is fixed by the > topology of the machine, which is already available in /sys some place. > > Lets simply use the cpu->node mapping and use PERF_TYPE_NODE{,_RAW} or > something like that. We can start with 2 generic events for that type, > local/remote memory accesses and take it from there. > I don't quite get what you're saying here. 
Perhaps you are thinking that all uncore units are associated with a particular cpu node, or a set of cpu nodes? And that there's only one uncore unit per cpu (or set of cpus) that needs to be addressed, i.e. no ambiguity? That is not going to be the case for all systems. We can have uncore units that are associated with the entire system, for example PMUs in an I/O device. And we can have multiple uncore units of a particular type, for example multiple vector coprocessors, each with its own PMU and each associated with a single cpu or a set of cpus. perf_events needs an addressing scheme that covers these cases. - Corey ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-01-28 18:00 ` Corey Ashford @ 2010-01-28 19:06 ` Peter Zijlstra 2010-01-28 19:44 ` Corey Ashford 2010-01-28 22:08 ` Corey Ashford 0 siblings, 2 replies; 55+ messages in thread From: Peter Zijlstra @ 2010-01-28 19:06 UTC (permalink / raw) To: Corey Ashford Cc: Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On Thu, 2010-01-28 at 10:00 -0800, Corey Ashford wrote: > > I don't quite get what you're saying here. Perhaps you are thinking > that all uncore units are associated with a particular cpu node, or a > set of cpu nodes? And that there's only one uncore unit per cpu (or set > of cpus) that needs to be addressed, i.e. no ambiguity? Well, I was initially thinking of the intel uncore thing which is memory controller, so node, level. But all system topology bound pmus can be done that way. > That is not going to be the case for all systems. We can have uncore > units that are associated with the entire system, Right, but that's simple too. > for example PMUs in an I/O device. > And we can have multiple uncore units of a particular > type, for example multiple vector coprocessors, each with its own PMU, > and are associated with a single cpu or a set of cpus. > > perf_events needs an addressing scheme that covers these cases. You could possible add a u64 pmu_id field to perf_event_attr and use that together with things like: PERF_TYPE_PCI, attr.pmu_id = domain:bus:device:function encoding PERF_TYPE_SPU, attr.pmu_id = spu-id But before we go there the perf core needs to be extended to deal with multiple hardware pmus, something which isn't too hard but we need to be careful not to bloat the normal code paths for these somewhat esoteric use cases. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-01-28 19:06 ` Peter Zijlstra @ 2010-01-28 19:44 ` Corey Ashford 2010-01-28 22:08 ` Corey Ashford 1 sibling, 0 replies; 55+ messages in thread From: Corey Ashford @ 2010-01-28 19:44 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On 1/28/2010 11:06 AM, Peter Zijlstra wrote: > On Thu, 2010-01-28 at 10:00 -0800, Corey Ashford wrote: >> >> I don't quite get what you're saying here. Perhaps you are thinking >> that all uncore units are associated with a particular cpu node, or a >> set of cpu nodes? And that there's only one uncore unit per cpu (or set >> of cpus) that needs to be addressed, i.e. no ambiguity? > > Well, I was initially thinking of the intel uncore thing which is memory > controller, so node, level. > > But all system topology bound pmus can be done that way. > >> That is not going to be the case for all systems. We can have uncore >> units that are associated with the entire system, > > Right, but that's simple too. > >> for example PMUs in an I/O device. > >> And we can have multiple uncore units of a particular >> type, for example multiple vector coprocessors, each with its own PMU, >> and are associated with a single cpu or a set of cpus. >> >> perf_events needs an addressing scheme that covers these cases. > > You could possible add a u64 pmu_id field to perf_event_attr and use > that together with things like: > > PERF_TYPE_PCI, attr.pmu_id = domain:bus:device:function encoding > PERF_TYPE_SPU, attr.pmu_id = spu-id > Thank you for that clarification. One of Ingo's comments was that he wants perf to be able to expose all of the available PMUs via the perf tool. 
That perf should be able to parse some data structure (somewhere) that would contain all of the info the user would need to choose a particular PMU. Do you have some ideas about how that could be accomplished using the above encoding scheme? I can see how it would be fairly easy to come up with a PERF_TYPE_* encoding per-topology, and then interpret all of those bits correctly within the kernel (which is savvy to that topology), but I don't see how there would be a straightforward way to expose that structure to perf. How would perf know which of those encodings apply to the current system, how many PMUs there are of each type, etc.? That's why I'm leaning toward a /sys/devices-style pseudo fs at the moment. If there's a simpler, better way, I'm open to it. > But before we go there the perf core needs to be extended to deal with > multiple hardware pmus, something which isn't too hard but we need to be > careful not to bloat the normal code paths for these somewhat esoteric > use cases. Is this something you are looking into? - Corey ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-01-28 19:06 ` Peter Zijlstra 2010-01-28 19:44 ` Corey Ashford @ 2010-01-28 22:08 ` Corey Ashford 2010-01-29 9:52 ` Peter Zijlstra 1 sibling, 1 reply; 55+ messages in thread From: Corey Ashford @ 2010-01-28 22:08 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On 1/28/2010 11:06 AM, Peter Zijlstra wrote: > On Thu, 2010-01-28 at 10:00 -0800, Corey Ashford wrote: >> >> I don't quite get what you're saying here. Perhaps you are thinking >> that all uncore units are associated with a particular cpu node, or a >> set of cpu nodes? And that there's only one uncore unit per cpu (or set >> of cpus) that needs to be addressed, i.e. no ambiguity? > > Well, I was initially thinking of the intel uncore thing which is memory > controller, so node, level. > > But all system topology bound pmus can be done that way. > >> That is not going to be the case for all systems. We can have uncore >> units that are associated with the entire system, > > Right, but that's simple too. > >> for example PMUs in an I/O device. > >> And we can have multiple uncore units of a particular >> type, for example multiple vector coprocessors, each with its own PMU, >> and are associated with a single cpu or a set of cpus. >> >> perf_events needs an addressing scheme that covers these cases. > > You could possible add a u64 pmu_id field to perf_event_attr and use > that together with things like: > > PERF_TYPE_PCI, attr.pmu_id = domain:bus:device:function encoding > PERF_TYPE_SPU, attr.pmu_id = spu-id > Thank you for the clarification. One of Ingo's comments in this thread was that he wants perf to be able to display to the user the available PMUs along with their respect events. 
That perf would parse some machine-independent data structure (somewhere) to get this info. This same info would provide the user a method of specifying which PMU he wants to address. He'd also like all of the event info data to reside in the same place. I hope I am paraphrasing him correctly. I can see that with the scheme you have proposed above, it would be straightforward to encode PMU ids for a particular new PERF_TYPE_* system topology, but I don't see a clear way of providing perf with enough information to tell it which particular topology is being used, how many units of each PMU type exist, and so on. Do you have any ideas as to how to accomplish this goal with the method you are suggesting? This is one of the reasons why I am leaning toward a /sys/devices-style data structure; the kernel could easily build it based on the pmus that it discovers (through whatever means), and the user can fairly easily choose a pmu from this structure to open, and it's unambiguous to the kernel as to which pmu the user really wants. I am not convinced that this is the right place to put the event info for each PMU. > But before we go there the perf core needs to be extended to deal with > multiple hardware pmus, something which isn't too hard but we need to be > careful not to bloat the normal code paths for these somewhat esoteric > use cases. > Is this something you've looked into? If so, what sort of issues have you discovered? Thanks, - Corey ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-01-28 22:08 ` Corey Ashford @ 2010-01-29 9:52 ` Peter Zijlstra 2010-01-29 23:05 ` Corey Ashford 0 siblings, 1 reply; 55+ messages in thread From: Peter Zijlstra @ 2010-01-29 9:52 UTC (permalink / raw) To: Corey Ashford Cc: Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On Thu, 2010-01-28 at 14:08 -0800, Corey Ashford wrote: > This is one of the reasons why I am leaning toward a /sys/devices-style data > structure; the kernel could easily build it based on the pmus that it discovers > (through whatever means), and the user can fairly easily choose a pmu from this > structure to open, and it's unambiguous to the kernel as to which pmu the user > really wants. Well, the dumb way is simply probing all of them and see who responds. Another might be adding a pmu attribute (showing the pmu-id) to the existing sysfs topology layouts (system topology, pci, spu, are all already available in sysfs iirc). > I am not convinced that this is the right place to put the event info for each PMU. Right, I'm not at all sure the kernel wants to know about any events beyond those needed for pmu scheduling constraints and possible generic event maps. Clearly it needs to know about all software events, but I don't think we need nor want exhaustive hardware event lists in the kernel. > > But before we go there the perf core needs to be extended to deal with > > multiple hardware pmus, something which isn't too hard but we need to be > > careful not to bloat the normal code paths for these somewhat esoteric > > use cases. > > > > Is this something you've looked into? If so, what sort of issues have you > discovered? 
I've poked at it a little yes, while simply abstracting the current hw interface and making it a list of pmu's isn't hard at all, it does add overhead to a few key locations. Another aspect is event scheduling, you'd want to separate the event lists for the various pmus so that the RR thing works as expected, this again adds overhead because you now need to abstract out the event lists as well. The main fast path affected by both these things is the task switch event scheduling where you have to iterate all active events and their pmus. So while the abstraction itself isn't too hard, doing it so as to minimize the bloat on the key paths does make it interesting. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-01-29 9:52 ` Peter Zijlstra @ 2010-01-29 23:05 ` Corey Ashford 2010-01-30 8:42 ` Peter Zijlstra 0 siblings, 1 reply; 55+ messages in thread From: Corey Ashford @ 2010-01-29 23:05 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On 1/29/2010 1:52 AM, Peter Zijlstra wrote: > On Thu, 2010-01-28 at 14:08 -0800, Corey Ashford wrote: > >> This is one of the reasons why I am leaning toward a /sys/devices-style data >> structure; the kernel could easily build it based on the pmus that it discovers >> (through whatever means), and the user can fairly easily choose a pmu from this >> structure to open, and it's unambiguous to the kernel as to which pmu the user >> really wants. > > Well, the dumb way is simply probing all of them and see who responds. That can work, but it's still fuzzy to me how a user would relate a PMU address that he's encoded to some actual device in the system he's using. How would he know that he's addressing the correct device (besides that the PMU type matches), given that we're likely to have hypervisors as middle-men. > Another might be adding a pmu attribute (showing the pmu-id) to the > existing sysfs topology layouts (system topology, pci, spu, are all > already available in sysfs iirc). So you'd read the id from the sysfs topology tree, and then pass that id to the interface? That's an interesting approach that eliminates the need to pass a string pmu path to the kernel. I like this idea, but I need to read more deeply about the topology entries to understand how they work. >> I am not convinced that this is the right place to put the event info for each PMU. 
> > Right, I'm not at all sure the kernel wants to know about any events > beyond those needed for pmu scheduling constraints and possible generic > event maps. > > Clearly it needs to know about all software events, but I don't think we > need nor want exhaustive hardware event lists in the kernel. > >> > But before we go there the perf core needs to be extended to deal with >> > multiple hardware pmus, something which isn't too hard but we need to be >> > careful not to bloat the normal code paths for these somewhat esoteric >> > use cases. >> > >> >> Is this something you've looked into? If so, what sort of issues have you >> discovered? > > I've poked at it a little yes, while simply abstracting the current hw > interface and making it a list of pmu's isn't hard at all, it does add > overhead to a few key locations. > > Another aspect is event scheduling, you'd want to separate the event > lists for the various pmus so that the RR thing works as expected, this > again adds overhead because you now need to abstract out the event lists > as well. > > The main fast path affected by both these things is the task switch > event scheduling where you have to iterate all active events and their > pmus. > > So while the abstraction itself isn't too hard, doing it so as to > minimize the bloat on the key paths does make it interesting. > Interesting. Thanks for your comments. - Corey ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-01-29 23:05 ` Corey Ashford @ 2010-01-30 8:42 ` Peter Zijlstra 2010-02-01 19:39 ` Corey Ashford 0 siblings, 1 reply; 55+ messages in thread From: Peter Zijlstra @ 2010-01-30 8:42 UTC (permalink / raw) To: Corey Ashford Cc: Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On Fri, 2010-01-29 at 15:05 -0800, Corey Ashford wrote: > So you'd read the id from the sysfs topology tree, and then pass that id to the > interface? That's an interesting approach that eliminates the need to pass a > string pmu path to the kernel. No, the attr.pmu_id would reflect the location in the tree (pci location, or spu number), the pmu id reported would identify the kind of pmu driver used for that particular device. I realized this confusion after sending but didn't clarify, we should come up with a good alternative name for either (or both) uses. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-01-30 8:42 ` Peter Zijlstra @ 2010-02-01 19:39 ` Corey Ashford 2010-02-01 19:54 ` Peter Zijlstra 0 siblings, 1 reply; 55+ messages in thread From: Corey Ashford @ 2010-02-01 19:39 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On 1/30/2010 12:42 AM, Peter Zijlstra wrote: > On Fri, 2010-01-29 at 15:05 -0800, Corey Ashford wrote: >> So you'd read the id from the sysfs topology tree, and then pass that id to the >> interface? That's an interesting approach that eliminates the need to pass a >> string pmu path to the kernel. > > No, the attr.pmu_id would reflect the location in the tree (pci > location, or spu number), the pmu id reported would identify the kind of > pmu driver used for that particular device. > > I realized this confusion after sending but didn't clarify, we should > come up with a good alternative name for either (or both) uses. > Ok, just so I'm clear here, is attr.pmu_id a (char *) or some sort of encoded bit field? - Corey ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-02-01 19:39 ` Corey Ashford @ 2010-02-01 19:54 ` Peter Zijlstra 0 siblings, 0 replies; 55+ messages in thread From: Peter Zijlstra @ 2010-02-01 19:54 UTC (permalink / raw) To: Corey Ashford Cc: Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On Mon, 2010-02-01 at 11:39 -0800, Corey Ashford wrote: > > On 1/30/2010 12:42 AM, Peter Zijlstra wrote: > > On Fri, 2010-01-29 at 15:05 -0800, Corey Ashford wrote: > >> So you'd read the id from the sysfs topology tree, and then pass that id to the > >> interface? That's an interesting approach that eliminates the need to pass a > >> string pmu path to the kernel. > > > > No, the attr.pmu_id would reflect the location in the tree (pci > > location, or spu number), the pmu id reported would identify the kind of > > pmu driver used for that particular device. > > > > I realized this confusion after sending but didn't clarify, we should > > come up with a good alternative name for either (or both) uses. > > > > Ok, just so I'm clear here, is attr.pmu_id a (char *) or some sort of encoded > bit field? Right, currently on x86 we have x86_pmu.name which basically tells us what kind of pmu it is, but we really don't export that since its trivial to see that from /proc/cpuinfo, but the ARM people expressed interest in this because decoding the cpuid equivalent on arm is like nasty business. But once we go add other funny pmu's, it becomes interesting to know what kind of pmu it is and tie possible event sets to them. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-01-20 23:23 ` Corey Ashford 2010-01-21 7:21 ` Ingo Molnar @ 2010-01-21 8:36 ` Peter Zijlstra 1 sibling, 0 replies; 55+ messages in thread From: Peter Zijlstra @ 2010-01-21 8:36 UTC (permalink / raw) To: Corey Ashford Cc: LKML, Ingo Molnar, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love On Wed, 2010-01-20 at 15:23 -0800, Corey Ashford wrote: > If so, that's an interesting idea, but I think it still > leaves open the problem of how to actually relate those address to the real > hardware, especially in the case of using a hypervisor which has provided you a > small subset of the physical hardware in the system. Well, I'm tempted to say that's a problem for the virt guys :-) One way one could solve that is by having the topology information include the virt<->phys map, so that you can find the physical node from the virtual cpu number. Going to be interesting though, but then, virt seems to be about creating problems where there were none before. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-01-20 21:33 ` Peter Zijlstra 2010-01-20 23:23 ` Corey Ashford @ 2010-01-21 8:47 ` stephane eranian 2010-01-21 8:59 ` Peter Zijlstra 1 sibling, 1 reply; 55+ messages in thread From: stephane eranian @ 2010-01-21 8:47 UTC (permalink / raw) To: Peter Zijlstra Cc: Corey Ashford, LKML, Ingo Molnar, Andi Kleen, Paul Mackerras, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love On Wed, Jan 20, 2010 at 10:33 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, 2010-01-20 at 14:34 +0100, Peter Zijlstra wrote: > >> So how about PERF_TYPE_{CORE,NODE,SOCKET} like things? > > OK, so I read most of the intel uncore stuff, and it seems to suggest > you need a regular pmu event to receive uncore events (chained setup), > this seems rather retarded since it wastes a perfectly good pmu event > and makes configuring all this more intricate... > I don't think that is correct. You can be using the uncore PMU on Nehalem without any core PMU event. The only thing to realize is that uncore PMU shares the same interrupt vector as core PMU. You need to configure which core the uncore is going to interrupt on. This is done via a bitmask, so you can interrupt more than one core at a time. Several strategies are possible. > A well, nothing to be done about that I guess.. > > > > ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-01-21 8:47 ` stephane eranian @ 2010-01-21 8:59 ` Peter Zijlstra 2010-01-21 9:16 ` stephane eranian 2010-01-21 9:43 ` stephane eranian 0 siblings, 2 replies; 55+ messages in thread From: Peter Zijlstra @ 2010-01-21 8:59 UTC (permalink / raw) To: eranian Cc: Corey Ashford, LKML, Ingo Molnar, Andi Kleen, Paul Mackerras, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love On Thu, 2010-01-21 at 09:47 +0100, stephane eranian wrote: > I don't think that is correct. You can be using the uncore PMU on Nehalem > without any core PMU event. The only thing to realize is that uncore PMU > shares the same interrupt vector as core PMU. You need to configure which > core the uncore is going to interrupt on. This is done via a bitmask, so you > can interrupt more than one core at a time. Several strategies are possible. Ah, sharing the IRQ line is no problem. But from reading I got the impression you need to configure an Offcore counter. See 30.6.2.1: • EN_PMI_COREn (bit n, n = 0, 3 if four cores are present): When set, processor core n is programmed to receive an interrupt signal from any interrupt enabled uncore counter. PMI delivery due to an uncore counter overflow is enabled by setting IA32_DEBUG_CTL.Offcore_PMI_EN to 1. Which seems to indicate a link with the off-core response thing. However I would be very glad to be wrong :-) ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-01-21 8:59 ` Peter Zijlstra @ 2010-01-21 9:16 ` stephane eranian 2010-01-21 9:43 ` stephane eranian 1 sibling, 0 replies; 55+ messages in thread From: stephane eranian @ 2010-01-21 9:16 UTC (permalink / raw) To: Peter Zijlstra Cc: Corey Ashford, LKML, Ingo Molnar, Andi Kleen, Paul Mackerras, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love The offcore_response register is a CORE PMU register. I know the name is confusing. OFFCORE_RESPONSE_0 is not a counter but rather an extension of the filtering capabilities of regular counters. To use offcore_response_0, you will need to program a generic counter (1 config, 1 data) + offcore_response_0. This is a very useful feature to understand memory traffic. There is an associated difficulty though. The offcore_response_0 MSR is shared between HT threads. Thus the kernel needs to arbitrate. That should be taken care of by the event scheduling code which I am working on. In the context of perf_event, you need to find some encoding of the event to pass as RAW. The raw event code + umask is 0x01B7. Then, you need to encode the value for the offcore_response_0 MSR, which is 16 bits on NHM/WSM. You could either stash this into the config field or use an extended field. The kernel would detect 0x01B7 and use the value in the extra config field. As described in vol3b, Intel Westmere adds a second offcore_response counter. It behaves the same way with the same restrictions. The debugctl bit controls uncore activation not offcore. On Thu, Jan 21, 2010 at 9:59 AM, Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, 2010-01-21 at 09:47 +0100, stephane eranian wrote: >> I don't think that is correct. You can be using the uncore PMU on Nehalem >> without any core PMU event. The only thing to realize is that uncore PMU >> shares the same interrupt vector as core PMU. 
You need to configure which >> core the uncore is going to interrupt on. This is done via a bitmask, so you >> can interrupt more than one core at a time. Several strategies are possible. > > Ah, sharing the IRQ line is no problem. But from reading I got the > impression you need to configure an Offcore counter. See 30.6.2.1: > > • EN_PMI_COREn (bit n, n = 0, 3 if four cores are present): When set, processor > core n is programmed to receive an interrupt signal from any interrupt enabled > uncore counter. PMI delivery due to an uncore counter overflow is enabled by > setting IA32_DEBUG_CTL.Offcore_PMI_EN to 1. > > Which seems to indicate a link with the off-core response thing. > > However I would be very glad to be wrong :-) > > ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-01-21 8:59 ` Peter Zijlstra 2010-01-21 9:16 ` stephane eranian @ 2010-01-21 9:43 ` stephane eranian 1 sibling, 0 replies; 55+ messages in thread From: stephane eranian @ 2010-01-21 9:43 UTC (permalink / raw) To: Peter Zijlstra Cc: Corey Ashford, LKML, Ingo Molnar, Andi Kleen, Paul Mackerras, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, eranian On Thu, Jan 21, 2010 at 9:59 AM, Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, 2010-01-21 at 09:47 +0100, stephane eranian wrote: >> I don't think that is correct. You can be using the uncore PMU on Nehalem >> without any core PMU event. The only thing to realize is that uncore PMU >> shares the same interrupt vector as core PMU. You need to configure which >> core the uncore is going to interrupt on. This is done via a bitmask, so you >> can interrupt more than one core at a time. Several strategies are possible. > > Ah, sharing the IRQ line is no problem. But from reading I got the Given the PMU sharing model of perf_events, it seems you may have multiple consumers of uncore PMU at the same time. That means you will need to direct the interrupt onto all the CPU for which you currently have a user. You may have multiple users per CPU, thus you need some reference count to track all of that. The alternative is to systematically broadcast the uncore PMU interrupt. Each core then checks whether or not it has uncore users. Note that all of this is independent of the type of event, i.e., per-thread or system-wide. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units [not found] ` <d3f22a1003290213x7d7904an59d50eb6a8616133@mail.gmail.com> @ 2010-03-30 7:42 ` Lin Ming 2010-03-30 16:49 ` Corey Ashford 0 siblings, 1 reply; 55+ messages in thread From: Lin Ming @ 2010-03-30 7:42 UTC (permalink / raw) To: Corey Ashford Cc: Peter Zijlstra, Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu Hi, Corey How is this going now? Are you still working on this? I'd like to help to add support for uncore, test, write code or anything else. Thanks, Lin Ming > > ----- > Intro > ----- > One subject that hasn't been addressed since the introduction of > perf_events in the Linux kernel is that of support for "uncore" or > "nest" unit events. Uncore is the term used by the Intel engineers > for their off-core units but are still on the same die as the cores, > and "nest" means exactly the same thing for IBM Power processor > engineers. I will use the term uncore for brevity and because it's in > common parlance, but the issues and design possibilities below are > relevant to both. I will also broaden the term by stating that uncore > will also refer to PMUs that are completely off of the processor chip > altogether. > > Contents > -------- > 1. Why support PMUs in uncore units? Is there anything interesting to look at? > 2. How do uncore events differ from core events? > 3. Why does a CPU need to be assigned to manage a particular uncore > unit's events? > 4. How do you encode uncore events? > 5. How do you address a particular uncore PMU? > 6. Event rotation issues with uncore PMUs > 7. Other issues? > 8. Feedback? > > ---- > 1. Why support PMUs in uncore units? Is there anything interesting to look at? 
> ---- > > Today, many x86 chips contain uncore units, and we think that it's > likely that the trend will continue, as more devices - I/O, memory > interfaces, shared caches, accelerators, etc. - are integrated onto > multi-core chips. As these devices become more sophisticated and more > workload is diverted off-core, engineers and performance analysts are > going to want to look at what's happening in these units so that they > can find bottlenecks. > > In addition, we think that even off-chip I/O and interconnect devices > are likely to gain PMUs because engineers will want to find > bottlenecks in their massively parallel systems. > > ---- > 2. How do uncore events differ from core events? > ---- > > The main difference is that uncore events are mostly likely not going > to be tied to a particular Linux task, or even a CPU context. Uncore > units are resources that are in some sense system-wide, though, they > may not really be accessible system-wide in some architectures. In > the case of accelerators and I/O devices, it's likely they will run > asynchronously from the cores, and thus keeping track of events on a > per-task basis doesn't make a lot of sense. The other existing mode > in perf_events is a per-CPU context, and it turns out that this mode > does match up with uncore units well, though the choice of which CPU > to use to manage that uncore unit is going to need to be > arch-dependent and may involve other issues as well, such as > minimizing access latency between the uncore unit and the CPU which is > managing it. > > ---- > 3. Why does a CPU need to be assigned to manage a particular uncore > unit's events? > ---- > > * The control registers of the uncore unit's PMU need to be read and > written, and that may be possible only from a subset of processors in > the system. > * A processor is needed to rotate the event list on the uncore unit on > every tick for the purposes of event scheduling. 
> * Because of access latency issues, we may want the CPU to be close in > locality to the PMU. > > It seems like a good idea to let the kernel decide which CPU to use to > monitor a particular uncore event, based on the location of the uncore > unit, and possibly current system load balance. The user will not > want to have to figure out this detailed information. > > ---- > 4. How do you encode uncore events? > ---- > Uncore events will need to be encoded in the config field of the > perf_event_attr struct using the existing PERF_TYPE_RAW encoding. 64 > bits are available in the config field, and that may be sufficient to > support events on most systems. However, due to the proliferation and > added complexity of PMUs we envision, we might want to add another > 64-bit config (perhaps call it config_extra or config2) field to > encode any extra attributes that might be needed. The exact encoding > used, just as for the current encoding for core events, will be on a > per-arch and possibly per-system basis. > > ---- > 5. How do you address a particular uncore PMU? > ---- > > This one is going to be very system- and arch-dependent, but it seems > fairly clear that we need some sort of addressing scheme that can be > system/arch-defined by the kernel. > > From a hierarchical perspective, here's an example of possible uncore > PMU locations in a large system: > > 1) Per-core - units that are shared between all hardware threads in a core > 2) Per-node - units that are shared between all cores in a node > 3) Per-chip - units that are shared between all nodes in a chip > 4) Per-blade - units that are shared between all chips on a blade > 5) Per-rack - units that are shared between all blades in a rack > > Addressing option 1) > > Reuse the cpu argument: cpu would be interpreted differently if an > uncore unit is specified (via the perf_event_attr struct's config > field). 
>
> For the hypothetical system described above, we'd want to have an
> address that contains enough address bits for each of the above. For
> example:
>
> bits    field
> ------  -----
> 3..0    PMU number 0-15   /* specifies which of several identical
>                              PMUs is being addressed */
> 7..4    core id 0-15
> 8..8    node id 0-1
> 11..9   chip id 0-7
> 16..12  blade id 0-31
> 23..17  rack id 0-127
>
> These fields would be exposed via
> /usr/include/linux/perf_events_uncore_addr.h (for example). How you
> actually assign these numbers to actual hardware is, again,
> system-design dependent, and may be influenced by the use of a
> hypervisor, or other software which allocates resources available to
> the system dynamically.
>
> How does the user discover the mapping between the hardware made
> available to the system and the addresses shown above? Again, this is
> system-dependent, and probably outside the scope of this proposal. In
> other words, I don't know how to do this in a general way, though I
> could probably put something together for a particular system.
>
> Addressing Option 2)
>
> Have the kernel create nodes for each uncore PMU in
> /sys/devices/system or another pseudo file system, such as the
> existing /proc/device-tree on Power systems. /sys/devices/system or
> /proc/device-tree could be explored by the user tool, and the user
> could then specify the path of the requested PMU via a string which
> the kernel could interpret. To be overly simplistic, something like
> "/sys/devices/system/pmus/blade4/cpu0/vectorcopro1". If we settled on
> a common tree root to use, we could specify only the relative path
> name, "blade4/cpu0/vectorcopro1".
>
> One way to provide this extra "PMU path" argument to
> sys_perf_event_open() would be to add a bit to the flags argument
> that says we're adding a PMU path string onto the end of the argument
> list.
>
> This path-string-based addressing option seems to be more flexible in
> the long run, and does not have as serious an issue in mapping PMUs
> to user space; the kernel essentially exposes to user space all of
> the available PMUs for the current partition. This might create more
> work for the kernel side, but should make the system more transparent
> for user-space tools. Another system- or at least arch-dependent tool
> would have to be written for user space to help users navigate the
> device tree to find the PMU they want to use. I don't think it would
> make sense to build that capability into perf, because the software
> would be arch- or system-dependent.
>
> It could be argued that we should use a common user space tree to
> represent PMUs for all architectures and systems, so that the
> arch-independent perf code would be able to display available uncore
> PMUs. That may be a goal that's very hard to achieve because of the
> wide variation in architectures. Any thoughts on that?
>
> ----
> 6. Event rotation issues with uncore PMUs
> ----
>
> Currently, the perf_events code rotates the set of events assigned to
> a CPU or task on every system tick, so that event scheduling
> collisions on a PMU are mitigated. This turns out to cause problems
> for uncore units for two reasons - inefficiency and CPU load.
>
> a) Rotation of a set of events across more than one PMU causes
> inefficient rotation.
>
> Consider the following event list; the letter designates the PMU and
> the number is the event number on that PMU:
>
> A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4 C5
>
> After one rotation, you can see that the event list will be:
>
> C5 A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4
>
> and then:
>
> C4 C5 A1 A2 A3 B1 B2 B3 B4 C1 C2 C3
>
> Notice how the relative positions of the A and B PMU events haven't
> changed even after two (or even five) rotations, so they will schedule
> the events in the same order for some time.
> This will skew the multiplexing so that some events will be scheduled
> much less often than they should or could be.
>
> What we'd like to have happen is for the events of each PMU to be
> rotated in their own lists. For example, before rotation:
>
> A1 A2 A3
> B1 B2 B3 B4
> C1 C2 C3 C4 C5
>
> After rotation:
>
> A3 A1 A2
> B4 B1 B2 B3
> C5 C1 C2 C3 C4
>
> We've got some ideas about how to make this happen, using either
> separate lists, or placing them on separate CPUs.
>
> b) Access to some uncore PMUs may be quite slow due to the
> interconnect that is used. This can place a burden on the CPU if it
> is done on every system tick.
>
> This can be addressed by keeping a counter, on a per-PMU-context
> basis, that reduces the rate of event rotations. Setting the rotation
> period to three, for example, would cause event rotations in that
> context to happen on every third tick, instead of on every tick. We
> think that the kernel could measure the amount of time it is taking
> to do a rotation, and then dynamically decrease the rotation rate if
> it's taking too long; "rotation rate throttling", in other words.
>
> ----
> 7. Other issues?
> ----
>
> This section is left blank for now.
>
> ----
> 8. Feedback?
> ----
>
> I'd appreciate any feedback you might have on this topic. You can
> contact me directly at the email address below, or better yet, reply
> to LKML.
>
> --
> Regards,
>
> - Corey
>
> Corey Ashford
> Software Engineer
> IBM Linux Technology Center, Linux Toolchain
> Beaverton, OR
> 503-578-3507
> cjashfor@us.ibm.com
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-03-30 7:42 ` Lin Ming @ 2010-03-30 16:49 ` Corey Ashford 2010-03-30 17:15 ` Peter Zijlstra 2010-03-30 21:28 ` stephane eranian 0 siblings, 2 replies; 55+ messages in thread From: Corey Ashford @ 2010-03-30 16:49 UTC (permalink / raw) To: Lin Ming Cc: Peter Zijlstra, Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On 03/30/2010 12:42 AM, Lin Ming wrote: > Hi, Corey > > How is this going now? Are you still working on this? > I'd like to help to add support for uncore, test, write code or anything > else. > > Thanks, > Lin Ming I haven't been actively working on adding infrastructure for nest PMUs yet. At the moment we are working on supporting nest events for IBM's Wire-Speed processor, using the current infrastructure, because of the time limitations. Using the existing infrastructure is definitely not ideal, but for this processor, it's workable. There are still a lot of issues to solve for adding this infrastructure: 1) Does perf_events need a new context type (in addition to per-task and per-cpu)? This is partly because we don't want to be mixing the rotation of CPU-events with nest events. Each PMU really ought to have its own event list. 2) How do we deal with accessing PMU's which require slow access methods (e.g. internal serial bus)? The accesses may need to be placed on worker threads so that they don't affect the performance of context switches and system ticks. 3) How exactly do we represent the PMU's in the pseudofs (/sys or /proc)? And how exactly does the user specify the PMU to perf_events? Peter Zijlstra and Stephane Eranian both recommended opening the PMU with open() and then passing the resulting fd in through the perf_event_attr struct. 
4) How do we choose a CPU to do the housekeeping work for a particular
nest PMU? Peter thought that user space should still specify it via the
perf_event_open() cpu parameter, but there's also an argument to be made
for the kernel choosing the best CPU to handle the job, or at least
making it optional for the user to choose the CPU.

I'm sure there are other issues as well. If you'd like to start working
on some (or all!) of these, you are more than welcome to. I think we
need to toss around some more ideas before committing much to code at
this point.

- Corey

^ permalink raw reply	[flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-03-30 16:49 ` Corey Ashford @ 2010-03-30 17:15 ` Peter Zijlstra 2010-03-30 22:12 ` Corey Ashford 2010-04-15 21:16 ` Gary.Mohr 2010-03-30 21:28 ` stephane eranian 1 sibling, 2 replies; 55+ messages in thread From: Peter Zijlstra @ 2010-03-30 17:15 UTC (permalink / raw) To: Corey Ashford Cc: Lin Ming, Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On Tue, 2010-03-30 at 09:49 -0700, Corey Ashford wrote: > On 03/30/2010 12:42 AM, Lin Ming wrote: > > Hi, Corey > > > > How is this going now? Are you still working on this? > > I'd like to help to add support for uncore, test, write code or anything > > else. > > > > Thanks, > > Lin Ming > > I haven't been actively working on adding infrastructure for nest PMUs > yet. At the moment we are working on supporting nest events for IBM's > Wire-Speed processor, using the current infrastructure, because of the > time limitations. Using the existing infrastructure is definitely not > ideal, but for this processor, it's workable. > > There are still a lot of issues to solve for adding this infrastructure: > > 1) Does perf_events need a new context type (in addition to per-task and > per-cpu)? This is partly because we don't want to be mixing the > rotation of CPU-events with nest events. Each PMU really ought to have > its own event list. > > 2) How do we deal with accessing PMU's which require slow access methods > (e.g. internal serial bus)? The accesses may need to be placed on > worker threads so that they don't affect the performance of context > switches and system ticks. > > 3) How exactly do we represent the PMU's in the pseudofs (/sys or > /proc)? And how exactly does the user specify the PMU to perf_events? 
> Peter Zijlstra and Stephane Eranian both recommended opening the PMU > with open() and then passing the resulting fd in through the > perf_event_attr struct. > > 4) How do we choose a CPU to do the housekeeping work for a particular > nest PMU. Peter thought that user space should still specify the it via > open_perf_event() cpu parameter, but there's also an argument to be made > for the kernel choosing the best CPU to handle the job, or at least make > it optional for the user to choose the CPU. > > I'm sure there are other issues as well. If you'd like to start working > on some (or all!) of these, you are more than welcome to. I think we > need to toss around some more ideas before committing much to code at > this point. Right, I've got some definite ideas on how to go here, just need some time to implement them. The first thing that needs to be done is get rid of all the __weak functions (with exception of perf_callchain*, since that really is arch specific). For hw_perf_event_init() we need to create a pmu registration facility and lookup a pmu_id, either passed as an actual id found in sysfs or an open file handle from sysfs (the cpu pmu would be pmu_id 0 for backwards compat). hw_perf_disable/enable() would become struct pmu functions and perf_disable/enable need to become per-pmu, most functions operate on a specific event, for those we know the pmu and hence can call the per-pmu version. (XXX find those sites where this is not true). Then we can move to context, yes I think we want new context for new PMUs, otherwise we get very funny RR interleaving problems. My idea was to move find_get_context() into struct pmu as well, this allows you to have per-pmu contexts. Initially I'd not allow per-pmu-per-task contexts because then things like perf_event_task_sched_out() would get rather complex. For RR we can move away from perf_event_task_tick and let the pmu install a (hr)timer for this on their own. 
I've been planning to implement this for more than a week now, it's just that other stuff keeps getting in the way. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-03-30 17:15 ` Peter Zijlstra @ 2010-03-30 22:12 ` Corey Ashford 2010-03-31 14:01 ` Peter Zijlstra 2010-04-15 21:16 ` Gary.Mohr 1 sibling, 1 reply; 55+ messages in thread From: Corey Ashford @ 2010-03-30 22:12 UTC (permalink / raw) To: Peter Zijlstra Cc: Lin Ming, Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On 3/30/2010 10:15 AM, Peter Zijlstra wrote: -- my comments snipped -- > > Right, I've got some definite ideas on how to go here, just need some > time to implement them. > > The first thing that needs to be done is get rid of all the __weak > functions (with exception of perf_callchain*, since that really is arch > specific). > > For hw_perf_event_init() we need to create a pmu registration facility > and lookup a pmu_id, either passed as an actual id found in sysfs or an > open file handle from sysfs (the cpu pmu would be pmu_id 0 for backwards > compat). > > hw_perf_disable/enable() would become struct pmu functions and > perf_disable/enable need to become per-pmu, most functions operate on a > specific event, for those we know the pmu and hence can call the per-pmu > version. (XXX find those sites where this is not true). This sounds like a good idea. Right now for the Wire-Speed processor, we have a loop that goes through all of the nest PMU's and calls their respective per-pmu functions. > > Then we can move to context, yes I think we want new context for new > PMUs, otherwise we get very funny RR interleaving problems. My idea was > to move find_get_context() into struct pmu as well, this allows you to > have per-pmu contexts. Yes, I think it makes a lot of sense, so that there's not some sort of fixed association of pmu contexts to cpu contexts, for example. 
> Initially I'd not allow per-pmu-per-task contexts > because then things like perf_event_task_sched_out() would get rather > complex. Definitely. I don't think it makes sense to have per-task context on nest/uncore PMUs. At least we haven't found any justification for it. > > For RR we can move away from perf_event_task_tick and let the pmu > install a (hr)timer for this on their own. This is necessary I think, because of the access time for some of the PMU's. I wonder though if it should, perhaps optionally, be off-loaded to a high-priority task to do the switching so that access latency to the PMU can be controlled. As I mentioned when we met, some of the Wire-Speed processor nest PMU control registers are accessed via SCOM, which is an internal, 200 MHz serial bus. We are being quoted ~525 SCOM bus ticks to do a PMU control register access, which comes out to about 2.5 microseconds. If you figure 5 accesses to rotate the events on a PMU, that's a minimum of 12.5 microseconds. > > I've been planning to implement this for more than a week now, its just > that other stuff keeps getting in the way. > Well, it's not as if this is a trivial task either :) - Corey ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-03-30 22:12 ` Corey Ashford @ 2010-03-31 14:01 ` Peter Zijlstra 2010-03-31 14:13 ` stephane eranian ` (2 more replies) 0 siblings, 3 replies; 55+ messages in thread From: Peter Zijlstra @ 2010-03-31 14:01 UTC (permalink / raw) To: Corey Ashford Cc: Lin Ming, Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On Tue, 2010-03-30 at 15:12 -0700, Corey Ashford wrote: > > > Initially I'd not allow per-pmu-per-task contexts > > because then things like perf_event_task_sched_out() would get rather > > complex. > > Definitely. I don't think it makes sense to have per-task context on > nest/uncore PMUs. At least we haven't found any justification for it. For uncore no, but there is also the hw-breakpoint stuff that is being presented as a pmu, for those it would make sense to have a separate per-task context. But doing multiple per-task contexts is something for a next step indeed. > > For RR we can move away from perf_event_task_tick and let the pmu > > install a (hr)timer for this on their own. > > This is necessary I think, because of the access time for some of the PMU's. I > wonder though if it should, perhaps optionally, be off-loaded to a high-priority > task to do the switching so that access latency to the PMU can be controlled. > > As I mentioned when we met, some of the Wire-Speed processor nest PMU control > registers are accessed via SCOM, which is an internal, 200 MHz serial bus. We > are being quoted ~525 SCOM bus ticks to do a PMU control register access, which > comes out to about 2.5 microseconds. If you figure 5 accesses to rotate the > events on a PMU, that's a minimum of 12.5 microseconds. Yeah, you mentioned that.. 
for those things we need some changes anyway, since currently we install per-cpu counters using IPIs and expect the pmu::enable() method to be synchronous (it has a return value). It would be totally unacceptable to do 2.5ms pokes with IRQs disabled. The RR thing would be the easiest to solve, just let the timer wake up a thread instead of doing the work itself, that's fully isolated to how the pmu chooses to implement that. The above mentioned issue however would be much more challenging to fix nicely. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-03-31 14:01 ` Peter Zijlstra @ 2010-03-31 14:13 ` stephane eranian 2010-03-31 15:49 ` Maynard Johnson 2010-03-31 17:50 ` Corey Ashford 2 siblings, 0 replies; 55+ messages in thread From: stephane eranian @ 2010-03-31 14:13 UTC (permalink / raw) To: Peter Zijlstra Cc: Corey Ashford, Lin Ming, Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On Wed, Mar 31, 2010 at 4:01 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, 2010-03-30 at 15:12 -0700, Corey Ashford wrote: >> >> > Initially I'd not allow per-pmu-per-task contexts >> > because then things like perf_event_task_sched_out() would get rather >> > complex. >> >> Definitely. I don't think it makes sense to have per-task context on >> nest/uncore PMUs. At least we haven't found any justification for it. > > For uncore no, but there is also the hw-breakpoint stuff that is being > presented as a pmu, for those it would make sense to have a separate > per-task context. > > But doing multiple per-task contexts is something for a next step > indeed. > >> > For RR we can move away from perf_event_task_tick and let the pmu >> > install a (hr)timer for this on their own. >> >> This is necessary I think, because of the access time for some of the PMU's. I >> wonder though if it should, perhaps optionally, be off-loaded to a high-priority >> task to do the switching so that access latency to the PMU can be controlled. >> >> As I mentioned when we met, some of the Wire-Speed processor nest PMU control >> registers are accessed via SCOM, which is an internal, 200 MHz serial bus. We >> are being quoted ~525 SCOM bus ticks to do a PMU control register access, which >> comes out to about 2.5 microseconds. If you figure 5 accesses to rotate the >> events on a PMU, that's a minimum of 12.5 microseconds. 
> > Yeah, you mentioned that.. for those things we need some changes anyway, > since currently we install per-cpu counters using IPIs and expect the > pmu::enable() method to be synchronous (it has a return value). It would > be totally unacceptable to do 2.5ms pokes with IRQs disabled. > > The RR thing would be the easiest to solve, just let the timer wake up a > thread instead of doing the work itself, that's fully isolated to how > the pmu chooses to implement that. The above mentioned issue however > would be much more challenging to fix nicely. > > Also some of perf_enable()/perf_disable() would have to be per PMU and not global like they are today. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-03-31 14:01 ` Peter Zijlstra 2010-03-31 14:13 ` stephane eranian @ 2010-03-31 15:49 ` Maynard Johnson 2010-03-31 17:50 ` Corey Ashford 2 siblings, 0 replies; 55+ messages in thread From: Maynard Johnson @ 2010-03-31 15:49 UTC (permalink / raw) To: Peter Zijlstra Cc: Arnaldo Carvalho de Melo, Andi Kleen, cel, Corey Ashford, Stephane Eranian, Frederic Weisbecker, LKML, Masami Hiramatsu, Ingo Molnar, Lin Ming, Philip Mucci, Paul Mackerras, Steven Rostedt, Dan Terpstra, Xiao Guangrong Peter Zijlstra <peterz@infradead.org> wrote on 03/31/2010 09:01:56 AM: > On Tue, 2010-03-30 at 15:12 -0700, Corey Ashford wrote: > > > > > Initially I'd not allow per-pmu-per-task contexts > > > because then things like perf_event_task_sched_out() would get rather > > > complex. > > > > Definitely. I don't think it makes sense to have per-task context on > > nest/uncore PMUs. At least we haven't found any justification for it. > > For uncore no, but there is also the hw-breakpoint stuff that is being > presented as a pmu, for those it would make sense to have a separate > per-task context. hw-breakpoint presented as a pmu? hmmmm. IMHO, this is an example where shoehorning something into the perf_events subsystem that logically doesn't belong there just makes for more complexity in the code. > > But doing multiple per-task contexts is something for a next step > indeed. > [snip] ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-03-31 14:01 ` Peter Zijlstra 2010-03-31 14:13 ` stephane eranian 2010-03-31 15:49 ` Maynard Johnson @ 2010-03-31 17:50 ` Corey Ashford 2 siblings, 0 replies; 55+ messages in thread From: Corey Ashford @ 2010-03-31 17:50 UTC (permalink / raw) To: Peter Zijlstra Cc: Lin Ming, Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu On 03/31/2010 07:01 AM, Peter Zijlstra wrote: > On Tue, 2010-03-30 at 15:12 -0700, Corey Ashford wrote: >> >>> Initially I'd not allow per-pmu-per-task contexts >>> because then things like perf_event_task_sched_out() would get rather >>> complex. >> >> Definitely. I don't think it makes sense to have per-task context on >> nest/uncore PMUs. At least we haven't found any justification for it. > > For uncore no, but there is also the hw-breakpoint stuff that is being > presented as a pmu, for those it would make sense to have a separate > per-task context. > > But doing multiple per-task contexts is something for a next step > indeed. > >>> For RR we can move away from perf_event_task_tick and let the pmu >>> install a (hr)timer for this on their own. >> >> This is necessary I think, because of the access time for some of the PMU's. I >> wonder though if it should, perhaps optionally, be off-loaded to a high-priority >> task to do the switching so that access latency to the PMU can be controlled. >> >> As I mentioned when we met, some of the Wire-Speed processor nest PMU control >> registers are accessed via SCOM, which is an internal, 200 MHz serial bus. We >> are being quoted ~525 SCOM bus ticks to do a PMU control register access, which >> comes out to about 2.5 microseconds. If you figure 5 accesses to rotate the >> events on a PMU, that's a minimum of 12.5 microseconds. > > Yeah, you mentioned that.. 
for those things we need some changes anyway, > since currently we install per-cpu counters using IPIs and expect the > pmu::enable() method to be synchronous (it has a return value). That's a good point. We hadn't considered this issue. > It would > be totally unacceptable to do 2.5ms pokes with IRQs disabled. Just to be clear, it's 2.5us, not 2.5ms, but I think it's still bad... in our case, it's about 6600 processor clocks per access. > The RR thing would be the easiest to solve, just let the timer wake up a > thread instead of doing the work itself, that's fully isolated to how > the pmu chooses to implement that. The above mentioned issue however > would be much more challenging to fix nicely. It seems like it might need to be done in two phases. IPI request is sent, and then a thread is woken up on the other CPU, it does some work, and then sets a status variable and somehow notifies the caller that the operation has completed. I don't know the kernel's communication mechanisms well enough to know which one is most appropriate - maybe rwsem? - Corey ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-03-30 17:15 ` Peter Zijlstra 2010-03-30 22:12 ` Corey Ashford @ 2010-04-15 21:16 ` Gary.Mohr 2010-04-16 13:24 ` Peter Zijlstra 1 sibling, 1 reply; 55+ messages in thread From: Gary.Mohr @ 2010-04-15 21:16 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Corey Ashford, Stephane Eranian, LKML > On Tue, 2010-03-30 at 09:49 -0700, Corey Ashford wrote: > > Right, I've got some definite ideas on how to go here, just need some > time to implement them. > > The first thing that needs to be done is get rid of all the __weak > functions (with exception of perf_callchain*, since that really is arch > specific). > > For hw_perf_event_init() we need to create a pmu registration facility > and lookup a pmu_id, either passed as an actual id found in sysfs or an > open file handle from sysfs (the cpu pmu would be pmu_id 0 for backwards > compat). > > hw_perf_disable/enable() would become struct pmu functions and > perf_disable/enable need to become per-pmu, most functions operate on a > specific event, for those we know the pmu and hence can call the per-pmu > version. (XXX find those sites where this is not true). > > Then we can move to context, yes I think we want new context for new > PMUs, otherwise we get very funny RR interleaving problems. My idea was > to move find_get_context() into struct pmu as well, this allows you to > have per-pmu contexts. Initially I'd not allow per-pmu-per-task contexts > because then things like perf_event_task_sched_out() would get rather > complex. > > For RR we can move away from perf_event_task_tick and let the pmu > install a (hr)timer for this on their own. > > I've been planning to implement this for more than a week now, its just > that other stuff keeps getting in the way. > Hi Peter, My name is Gary Mohr and I work for Bull Information Systems. I have been following your discussions with Corey (and others) about how to implement support for nest PMU's in the linux kernel. 
My company feels that support for Intel Nehalem uncore events is very
important to our customers. Has the "other stuff" mentioned above quieted
down to allow you to get started on building support for these features?
If development is actually in progress, would you be willing to make a
guess as to which version of the kernel may offer the new capabilities?

As I said, we are interested, so if there is any way we can assist you,
please let us know. We would be happy to take experimental patch sets and
validate, test, and debug any problems we encounter if that would help
your development.

Thanks for your time.

Gary

^ permalink raw reply	[flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-04-15 21:16 ` Gary.Mohr @ 2010-04-16 13:24 ` Peter Zijlstra 2010-04-19 9:08 ` Lin Ming 2010-04-20 11:55 ` Lin Ming 0 siblings, 2 replies; 55+ messages in thread From: Peter Zijlstra @ 2010-04-16 13:24 UTC (permalink / raw) To: Gary.Mohr; +Cc: Corey Ashford, Stephane Eranian, LKML On Thu, 2010-04-15 at 14:16 -0700, Gary.Mohr@Bull.com wrote: > > On Tue, 2010-03-30 at 09:49 -0700, Corey Ashford wrote: > > > > Right, I've got some definite ideas on how to go here, just need some > > time to implement them. > > > > The first thing that needs to be done is get rid of all the __weak > > functions (with exception of perf_callchain*, since that really is arch > > specific). > > > > For hw_perf_event_init() we need to create a pmu registration facility > > and lookup a pmu_id, either passed as an actual id found in sysfs or an > > open file handle from sysfs (the cpu pmu would be pmu_id 0 for backwards > > compat). > > > > hw_perf_disable/enable() would become struct pmu functions and > > perf_disable/enable need to become per-pmu, most functions operate on a > > specific event, for those we know the pmu and hence can call the per-pmu > > version. (XXX find those sites where this is not true). > > > > Then we can move to context, yes I think we want new context for new > > PMUs, otherwise we get very funny RR interleaving problems. My idea was > > to move find_get_context() into struct pmu as well, this allows you to > > have per-pmu contexts. Initially I'd not allow per-pmu-per-task contexts > > because then things like perf_event_task_sched_out() would get rather > > complex. > > > > For RR we can move away from perf_event_task_tick and let the pmu > > install a (hr)timer for this on their own. > > > > I've been planning to implement this for more than a week now, its just > > that other stuff keeps getting in the way. > > > > Hi Peter, > > My name is Gary Mohr and I work for Bull Information Systems. 
I have been > following your discussions with Corey (and others) about how to implement > support for nest PMU's in the linux kernel. > > My company feels that support for Intel Nehalem uncore events is very > important to our customers. Has the "other stuff" mentioned above quited down to > allow you to get started on building support for these features ?? Sadly no. > If development > is actually in progress, would you be willing to make a guess as to which > version of the kernel may offer the new capabilities ?? > > As I said we are interested so if there is any way we can assist you, > please let us know. We would be happy to take experimental patch sets and > validate, test, and debug any problems we encounter if that would help your > development. Supply patches to make the above happen ;-) One thing not on that list, which should happen first I guess, is to remove hw_perf_group_sched_in(). The idea is to add some sort of transactional API to the struct pmu, so that we can delay the schedulability check until commit time (and roll back when it fails). Something as simple as: struct pmu { void start_txn(struct pmu *); void commit_txn(struct pmu *); ,,, }; and then change group_sched_in() to use this instead of hw_perf_group_sched_in(), whose implementations mostly replicate group_sched_in() in various buggy ways anyway. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-04-16 13:24 ` Peter Zijlstra @ 2010-04-19 9:08 ` Lin Ming 2010-04-19 9:27 ` Peter Zijlstra 2010-04-20 11:55 ` Lin Ming 1 sibling, 1 reply; 55+ messages in thread From: Lin Ming @ 2010-04-19 9:08 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Gary.Mohr@Bull.com, Corey Ashford, Stephane Eranian, LKML On Fri, 2010-04-16 at 21:24 +0800, Peter Zijlstra wrote: > On Thu, 2010-04-15 at 14:16 -0700, Gary.Mohr@Bull.com wrote: > > > On Tue, 2010-03-30 at 09:49 -0700, Corey Ashford wrote: > > > > > > Right, I've got some definite ideas on how to go here, just need some > > > time to implement them. > > > > > > The first thing that needs to be done is get rid of all the __weak > > > functions (with exception of perf_callchain*, since that really is arch > > > specific). > > > > > > For hw_perf_event_init() we need to create a pmu registration facility > > > and lookup a pmu_id, either passed as an actual id found in sysfs or an > > > open file handle from sysfs (the cpu pmu would be pmu_id 0 for backwards > > > compat). > > > > > > hw_perf_disable/enable() would become struct pmu functions and > > > perf_disable/enable need to become per-pmu, most functions operate on a > > > specific event, for those we know the pmu and hence can call the per-pmu > > > version. (XXX find those sites where this is not true). > > > > > > Then we can move to context, yes I think we want new context for new > > > PMUs, otherwise we get very funny RR interleaving problems. My idea was > > > to move find_get_context() into struct pmu as well, this allows you to > > > have per-pmu contexts. Initially I'd not allow per-pmu-per-task contexts > > > because then things like perf_event_task_sched_out() would get rather > > > complex. > > > > > > For RR we can move away from perf_event_task_tick and let the pmu > > > install a (hr)timer for this on their own. 
> > > > > > I've been planning to implement this for more than a week now, its just > > > that other stuff keeps getting in the way. > > > > > > > Hi Peter, > > > > My name is Gary Mohr and I work for Bull Information Systems. I have been > > following your discussions with Corey (and others) about how to implement > > support for nest PMU's in the linux kernel. > > > > My company feels that support for Intel Nehalem uncore events is very > > important to our customers. Has the "other stuff" mentioned above quited down to > > allow you to get started on building support for these features ?? > > Sadly no. > > > If development > > is actually in progress, would you be willing to make a guess as to which > > version of the kernel may offer the new capabilities ?? > > > > As I said we are interested so if there is any way we can assist you, > > please let us know. We would be happy to take experimental patch sets and > > validate, test, and debug any problems we encounter if that would help your > > development. > > Supply patches to make the above happen ;-) Hi, I have been also looking at this for some time. I'll send a draft patch later this week to support multiple hw pmus. Lin Ming ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-04-19 9:08 ` Lin Ming @ 2010-04-19 9:27 ` Peter Zijlstra 0 siblings, 0 replies; 55+ messages in thread From: Peter Zijlstra @ 2010-04-19 9:27 UTC (permalink / raw) To: Lin Ming; +Cc: Gary.Mohr@Bull.com, Corey Ashford, Stephane Eranian, LKML On Mon, 2010-04-19 at 17:08 +0800, Lin Ming wrote: > I have been also looking at this for some time. > I'll send a draft patch later this week to support multiple hw pmus. Well, that's going to be a mighty large patch. I'd rather see smaller patches, say one patch per removal of a weak hw_perf interface, etc... ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-04-16 13:24 ` Peter Zijlstra 2010-04-19 9:08 ` Lin Ming @ 2010-04-20 11:55 ` Lin Ming 2010-04-20 12:03 ` Peter Zijlstra 1 sibling, 1 reply; 55+ messages in thread From: Lin Ming @ 2010-04-20 11:55 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Gary.Mohr@Bull.com, Corey Ashford, Stephane Eranian, LKML On Fri, 2010-04-16 at 21:24 +0800, Peter Zijlstra wrote: > On Thu, 2010-04-15 at 14:16 -0700, Gary.Mohr@Bull.com wrote: > > > On Tue, 2010-03-30 at 09:49 -0700, Corey Ashford wrote: > > > > > > Right, I've got some definite ideas on how to go here, just need some > > > time to implement them. > > > > > > The first thing that needs to be done is get rid of all the __weak > > > functions (with exception of perf_callchain*, since that really is arch > > > specific). > > > > > > For hw_perf_event_init() we need to create a pmu registration facility > > > and lookup a pmu_id, either passed as an actual id found in sysfs or an > > > open file handle from sysfs (the cpu pmu would be pmu_id 0 for backwards > > > compat). > > > > > > hw_perf_disable/enable() would become struct pmu functions and > > > perf_disable/enable need to become per-pmu, most functions operate on a > > > specific event, for those we know the pmu and hence can call the per-pmu > > > version. (XXX find those sites where this is not true). > > > > > > Then we can move to context, yes I think we want new context for new > > > PMUs, otherwise we get very funny RR interleaving problems. My idea was > > > to move find_get_context() into struct pmu as well, this allows you to > > > have per-pmu contexts. Initially I'd not allow per-pmu-per-task contexts > > > because then things like perf_event_task_sched_out() would get rather > > > complex. > > > > > > For RR we can move away from perf_event_task_tick and let the pmu > > > install a (hr)timer for this on their own. 
> > > > > > I've been planning to implement this for more than a week now, its just > > > that other stuff keeps getting in the way. > > > > > > > Hi Peter, > > > > My name is Gary Mohr and I work for Bull Information Systems. I have been > > following your discussions with Corey (and others) about how to implement > > support for nest PMU's in the linux kernel. > > > > My company feels that support for Intel Nehalem uncore events is very > > important to our customers. Has the "other stuff" mentioned above quited down to > > allow you to get started on building support for these features ?? > > Sadly no. > > > If development > > is actually in progress, would you be willing to make a guess as to which > > version of the kernel may offer the new capabilities ?? > > > > As I said we are interested so if there is any way we can assist you, > > please let us know. We would be happy to take experimental patch sets and > > validate, test, and debug any problems we encounter if that would help your > > development. > > Supply patches to make the above happen ;-) > > One thing not on that list, which should happen first I guess, is to > remove hw_perf_group_sched_in(). The idea is to add some sort of > transactional API to the struct pmu, so that we can delay the > schedulability check until commit time (and roll back when it fails). > > Something as simple as: > > struct pmu { > void start_txn(struct pmu *); > void commit_txn(struct pmu *); > > ,,, > }; Could you please explain a bit more? Does it mean that "start_txn" perform the schedule events stuff and "commit_txn" perform the assign events stuff? Does "commit time" mean the actual activation in hw_perf_enable? Thanks, Lin Ming > > and then change group_sched_in() to use this instead of > hw_perf_group_sched_in(), whose implementations mostly replicate > group_sched_in() in various buggy ways anyway. > ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-04-20 11:55 ` Lin Ming @ 2010-04-20 12:03 ` Peter Zijlstra 2010-04-21 8:08 ` Lin Ming 0 siblings, 1 reply; 55+ messages in thread From: Peter Zijlstra @ 2010-04-20 12:03 UTC (permalink / raw) To: Lin Ming; +Cc: Gary.Mohr@Bull.com, Corey Ashford, Stephane Eranian, LKML On Tue, 2010-04-20 at 19:55 +0800, Lin Ming wrote: > > One thing not on that list, which should happen first I guess, is to > > remove hw_perf_group_sched_in(). The idea is to add some sort of > > transactional API to the struct pmu, so that we can delay the > > schedulability check until commit time (and roll back when it fails). > > > > Something as simple as: > > > > struct pmu { > > void start_txn(struct pmu *); > > void commit_txn(struct pmu *); > > > > ,,, > > }; > > Could you please explain a bit more? > > Does it mean that "start_txn" perform the schedule events stuff > and "commit_txn" perform the assign events stuff? > > Does "commit time" mean the actual activation in hw_perf_enable? No, the idea behind hw_perf_group_sched_in() is to not perform schedulability tests on each event in the group, but to add the group as a whole and then perform one test. Of course, when that test fails, you'll have to roll back the whole group again. So start_txn (or a better name) would simply toggle a flag in the pmu implementation that will make pmu::enable() not perform the schedulability test. Then commit_txn() will perform the schedulability test (so note the method has to have a !void return value, my mistake in the earlier email). This will allow us to use the regular kernel/perf_event.c::group_sched_in() and all the rollback code. Currently each hw_perf_group_sched_in() implementation duplicates all the rollback code (with various bugs). We must get rid of all weak hw_perf_*() functions before we can properly consider multiple struct pmu implementations. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-04-20 12:03 ` Peter Zijlstra @ 2010-04-21 8:08 ` Lin Ming 2010-04-21 8:32 ` stephane eranian 0 siblings, 1 reply; 55+ messages in thread From: Lin Ming @ 2010-04-21 8:08 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Gary.Mohr@Bull.com, Corey Ashford, Stephane Eranian, LKML On Tue, 2010-04-20 at 20:03 +0800, Peter Zijlstra wrote: > On Tue, 2010-04-20 at 19:55 +0800, Lin Ming wrote: > > > > One thing not on that list, which should happen first I guess, is to > > > remove hw_perf_group_sched_in(). The idea is to add some sort of > > > transactional API to the struct pmu, so that we can delay the > > > schedulability check until commit time (and roll back when it fails). > > > > > > Something as simple as: > > > > > > struct pmu { > > > void start_txn(struct pmu *); > > > void commit_txn(struct pmu *); > > > > > > ,,, > > > }; > > > > Could you please explain a bit more? > > > > Does it mean that "start_txn" perform the schedule events stuff > > and "commit_txn" perform the assign events stuff? > > > > Does "commit time" mean the actual activation in hw_perf_enable? > > No, the idea behind hw_perf_group_sched_in() is to not perform > schedulability tests on each event in the group, but to add the group as > a whole and then perform one test. > > Of course, when that test fails, you'll have to roll-back the whole > group again. > > So start_txn (or a better name) would simply toggle a flag in the pmu > implementation that will make pmu::enable() not perform the > schedulablilty test. > > Then commit_txn() will perform the schedulability test (so note the > method has to have a !void return value, my mistake in the earlier > email). > > This will allow us to use the regular > kernel/perf_event.c::group_sched_in() and all the rollback code. > Currently each hw_perf_group_sched_in() implementation duplicates all > the rolllback code (with various bugs). 
> > > > We must get rid of all weak hw_perf_*() functions before we can properly > consider multiple struct pmu implementations. > Thanks for the clear explanation. Does below patch show what you mean? I only touch the x86 arch code now. --- arch/x86/kernel/cpu/perf_event.c | 161 +++++++++++-------------------------- include/linux/perf_event.h | 10 ++- kernel/perf_event.c | 28 +++---- 3 files changed, 67 insertions(+), 132 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c index 626154a..62aa9a1 100644 --- a/arch/x86/kernel/cpu/perf_event.c +++ b/arch/x86/kernel/cpu/perf_event.c @@ -944,6 +944,9 @@ static int x86_pmu_enable(struct perf_event *event) if (n < 0) return n; + if (!(event->pmu->flag & PERF_EVENT_TRAN_STARTED)) + goto out; + ret = x86_pmu.schedule_events(cpuc, n, assign); if (ret) return ret; @@ -953,6 +956,7 @@ static int x86_pmu_enable(struct perf_event *event) */ memcpy(cpuc->assign, assign, n*sizeof(int)); +out: cpuc->n_events = n; cpuc->n_added += n - n0; @@ -1210,119 +1214,6 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event) return &unconstrained; } -static int x86_event_sched_in(struct perf_event *event, - struct perf_cpu_context *cpuctx) -{ - int ret = 0; - - event->state = PERF_EVENT_STATE_ACTIVE; - event->oncpu = smp_processor_id(); - event->tstamp_running += event->ctx->time - event->tstamp_stopped; - - if (!is_x86_event(event)) - ret = event->pmu->enable(event); - - if (!ret && !is_software_event(event)) - cpuctx->active_oncpu++; - - if (!ret && event->attr.exclusive) - cpuctx->exclusive = 1; - - return ret; -} - -static void x86_event_sched_out(struct perf_event *event, - struct perf_cpu_context *cpuctx) -{ - event->state = PERF_EVENT_STATE_INACTIVE; - event->oncpu = -1; - - if (!is_x86_event(event)) - event->pmu->disable(event); - - event->tstamp_running -= event->ctx->time - event->tstamp_stopped; - - if (!is_software_event(event)) - cpuctx->active_oncpu--; - - if 
(event->attr.exclusive || !cpuctx->active_oncpu) - cpuctx->exclusive = 0; -} - -/* - * Called to enable a whole group of events. - * Returns 1 if the group was enabled, or -EAGAIN if it could not be. - * Assumes the caller has disabled interrupts and has - * frozen the PMU with hw_perf_save_disable. - * - * called with PMU disabled. If successful and return value 1, - * then guaranteed to call perf_enable() and hw_perf_enable() - */ -int hw_perf_group_sched_in(struct perf_event *leader, - struct perf_cpu_context *cpuctx, - struct perf_event_context *ctx) -{ - struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events); - struct perf_event *sub; - int assign[X86_PMC_IDX_MAX]; - int n0, n1, ret; - - if (!x86_pmu_initialized()) - return 0; - - /* n0 = total number of events */ - n0 = collect_events(cpuc, leader, true); - if (n0 < 0) - return n0; - - ret = x86_pmu.schedule_events(cpuc, n0, assign); - if (ret) - return ret; - - ret = x86_event_sched_in(leader, cpuctx); - if (ret) - return ret; - - n1 = 1; - list_for_each_entry(sub, &leader->sibling_list, group_entry) { - if (sub->state > PERF_EVENT_STATE_OFF) { - ret = x86_event_sched_in(sub, cpuctx); - if (ret) - goto undo; - ++n1; - } - } - /* - * copy new assignment, now we know it is possible - * will be used by hw_perf_enable() - */ - memcpy(cpuc->assign, assign, n0*sizeof(int)); - - cpuc->n_events = n0; - cpuc->n_added += n1; - ctx->nr_active += n1; - - /* - * 1 means successful and events are active - * This is not quite true because we defer - * actual activation until hw_perf_enable() but - * this way we* ensure caller won't try to enable - * individual events - */ - return 1; -undo: - x86_event_sched_out(leader, cpuctx); - n0 = 1; - list_for_each_entry(sub, &leader->sibling_list, group_entry) { - if (sub->state == PERF_EVENT_STATE_ACTIVE) { - x86_event_sched_out(sub, cpuctx); - if (++n0 == n1) - break; - } - } - return ret; -} - #include "perf_event_amd.c" #include "perf_event_p6.c" #include "perf_event_p4.c" 
@@ -1454,6 +1345,47 @@ static inline void x86_pmu_read(struct perf_event *event) x86_perf_event_update(event); } +/* + * Set the flag to make pmu::enable() not perform the + * schedulablilty test. + */ +static void x86_pmu_start_txn(struct pmu *pmu) +{ + pmu->flag |= PERF_EVENT_TRAN_STARTED; +} + +static void x86_pmu_stop_txn(struct pmu *pmu) +{ + pmu->flag &= ~PERF_EVENT_TRAN_STARTED; +} + +/* + * Return 0 if commit transaction success + */ +static int x86_pmu_commit_txn(struct pmu *pmu) +{ + struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events); + int assign[X86_PMC_IDX_MAX]; + int n, ret; + + n = cpuc->n_events; + + if (!x86_pmu_initialized()) + return -EAGAIN; + + ret = x86_pmu.schedule_events(cpuc, n, assign); + if (ret) + return ret; + + /* + * copy new assignment, now we know it is possible + * will be used by hw_perf_enable() + */ + memcpy(cpuc->assign, assign, n*sizeof(int)); + + return 0; +} + static const struct pmu pmu = { .enable = x86_pmu_enable, .disable = x86_pmu_disable, @@ -1461,6 +1393,9 @@ static const struct pmu pmu = { .stop = x86_pmu_stop, .read = x86_pmu_read, .unthrottle = x86_pmu_unthrottle, + .start_txn = x86_pmu_start_txn, + .stop_txn = x86_pmu_stop_txn, + .commit_txn = x86_pmu_commit_txn, }; /* diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index bf896d0..93aa8d8 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -524,6 +524,8 @@ struct hw_perf_event { struct perf_event; +#define PERF_EVENT_TRAN_STARTED 1 + /** * struct pmu - generic performance monitoring unit */ @@ -534,6 +536,11 @@ struct pmu { void (*stop) (struct perf_event *event); void (*read) (struct perf_event *event); void (*unthrottle) (struct perf_event *event); + void (*start_txn) (struct pmu *pmu); + void (*stop_txn) (struct pmu *pmu); + int (*commit_txn) (struct pmu *pmu); + + u8 flag; }; /** @@ -799,9 +806,6 @@ extern void perf_disable(void); extern void perf_enable(void); extern int perf_event_task_disable(void); 
extern int perf_event_task_enable(void); -extern int hw_perf_group_sched_in(struct perf_event *group_leader, - struct perf_cpu_context *cpuctx, - struct perf_event_context *ctx); extern void perf_event_update_userpage(struct perf_event *event); extern int perf_event_release_kernel(struct perf_event *event); extern struct perf_event * diff --git a/kernel/perf_event.c b/kernel/perf_event.c index 07b7a43..4537676 100644 --- a/kernel/perf_event.c +++ b/kernel/perf_event.c @@ -83,14 +83,6 @@ extern __weak const struct pmu *hw_perf_event_init(struct perf_event *event) void __weak hw_perf_disable(void) { barrier(); } void __weak hw_perf_enable(void) { barrier(); } -int __weak -hw_perf_group_sched_in(struct perf_event *group_leader, - struct perf_cpu_context *cpuctx, - struct perf_event_context *ctx) -{ - return 0; -} - void __weak perf_event_print_debug(void) { } static DEFINE_PER_CPU(int, perf_disable_count); @@ -642,14 +634,13 @@ group_sched_in(struct perf_event *group_event, struct perf_event_context *ctx) { struct perf_event *event, *partial_group; + struct pmu *pmu = (struct pmu *)group_event->pmu; int ret; if (group_event->state == PERF_EVENT_STATE_OFF) return 0; - ret = hw_perf_group_sched_in(group_event, cpuctx, ctx); - if (ret) - return ret < 0 ? 
ret : 0; + pmu->start_txn(pmu); if (event_sched_in(group_event, cpuctx, ctx)) return -EAGAIN; @@ -664,16 +655,21 @@ group_sched_in(struct perf_event *group_event, } } - return 0; + ret = pmu->commit_txn(pmu); + if (!ret) { + pmu->stop_txn(pmu); + return 0; + } group_error: + pmu->stop_txn(pmu); + /* - * Groups can be scheduled in as one unit only, so undo any - * partial group before returning: + * Commit transaction fails, rollback + * Groups can be scheduled in as one unit only, so undo + * whole group before returning: */ list_for_each_entry(event, &group_event->sibling_list, group_entry) { - if (event == partial_group) - break; event_sched_out(event, cpuctx, ctx); } event_sched_out(group_event, cpuctx, ctx); ^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-04-21 8:08 ` Lin Ming @ 2010-04-21 8:32 ` stephane eranian 2010-04-21 8:39 ` Lin Ming 0 siblings, 1 reply; 55+ messages in thread From: stephane eranian @ 2010-04-21 8:32 UTC (permalink / raw) To: Lin Ming; +Cc: Peter Zijlstra, Gary.Mohr@Bull.com, Corey Ashford, LKML Seems to me that struct pmu is a shared resource across all CPUs. I don't understand why scheduling on one CPU would have to impact all the other CPUs, unless I am missing something here. On Wed, Apr 21, 2010 at 10:08 AM, Lin Ming <ming.m.lin@intel.com> wrote: > On Tue, 2010-04-20 at 20:03 +0800, Peter Zijlstra wrote: >> On Tue, 2010-04-20 at 19:55 +0800, Lin Ming wrote: >> >> > > One thing not on that list, which should happen first I guess, is to >> > > remove hw_perf_group_sched_in(). The idea is to add some sort of >> > > transactional API to the struct pmu, so that we can delay the >> > > schedulability check until commit time (and roll back when it fails). >> > > >> > > Something as simple as: >> > > >> > > struct pmu { >> > > void start_txn(struct pmu *); >> > > void commit_txn(struct pmu *); >> > > >> > > ,,, >> > > }; >> > >> > Could you please explain a bit more? >> > >> > Does it mean that "start_txn" perform the schedule events stuff >> > and "commit_txn" perform the assign events stuff? >> > >> > Does "commit time" mean the actual activation in hw_perf_enable? >> >> No, the idea behind hw_perf_group_sched_in() is to not perform >> schedulability tests on each event in the group, but to add the group as >> a whole and then perform one test. >> >> Of course, when that test fails, you'll have to roll-back the whole >> group again. >> >> So start_txn (or a better name) would simply toggle a flag in the pmu >> implementation that will make pmu::enable() not perform the >> schedulablilty test. 
>> >> Then commit_txn() will perform the schedulability test (so note the >> method has to have a !void return value, my mistake in the earlier >> email). >> >> This will allow us to use the regular >> kernel/perf_event.c::group_sched_in() and all the rollback code. >> Currently each hw_perf_group_sched_in() implementation duplicates all >> the rolllback code (with various bugs). >> >> >> >> We must get rid of all weak hw_perf_*() functions before we can properly >> consider multiple struct pmu implementations. >> > > Thanks for the clear explanation. > > Does below patch show what you mean? > > I only touch the x86 arch code now. > > --- > arch/x86/kernel/cpu/perf_event.c | 161 +++++++++++-------------------------- > include/linux/perf_event.h | 10 ++- > kernel/perf_event.c | 28 +++---- > 3 files changed, 67 insertions(+), 132 deletions(-) > > diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c > index 626154a..62aa9a1 100644 > --- a/arch/x86/kernel/cpu/perf_event.c > +++ b/arch/x86/kernel/cpu/perf_event.c > @@ -944,6 +944,9 @@ static int x86_pmu_enable(struct perf_event *event) > if (n < 0) > return n; > > + if (!(event->pmu->flag & PERF_EVENT_TRAN_STARTED)) > + goto out; > + > ret = x86_pmu.schedule_events(cpuc, n, assign); > if (ret) > return ret; > @@ -953,6 +956,7 @@ static int x86_pmu_enable(struct perf_event *event) > */ > memcpy(cpuc->assign, assign, n*sizeof(int)); > > +out: > cpuc->n_events = n; > cpuc->n_added += n - n0; > > @@ -1210,119 +1214,6 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event) > return &unconstrained; > } > > -static int x86_event_sched_in(struct perf_event *event, > - struct perf_cpu_context *cpuctx) > -{ > - int ret = 0; > - > - event->state = PERF_EVENT_STATE_ACTIVE; > - event->oncpu = smp_processor_id(); > - event->tstamp_running += event->ctx->time - event->tstamp_stopped; > - > - if (!is_x86_event(event)) > - ret = event->pmu->enable(event); > - > - if (!ret && 
!is_software_event(event)) > - cpuctx->active_oncpu++; > - > - if (!ret && event->attr.exclusive) > - cpuctx->exclusive = 1; > - > - return ret; > -} > - > -static void x86_event_sched_out(struct perf_event *event, > - struct perf_cpu_context *cpuctx) > -{ > - event->state = PERF_EVENT_STATE_INACTIVE; > - event->oncpu = -1; > - > - if (!is_x86_event(event)) > - event->pmu->disable(event); > - > - event->tstamp_running -= event->ctx->time - event->tstamp_stopped; > - > - if (!is_software_event(event)) > - cpuctx->active_oncpu--; > - > - if (event->attr.exclusive || !cpuctx->active_oncpu) > - cpuctx->exclusive = 0; > -} > - > -/* > - * Called to enable a whole group of events. > - * Returns 1 if the group was enabled, or -EAGAIN if it could not be. > - * Assumes the caller has disabled interrupts and has > - * frozen the PMU with hw_perf_save_disable. > - * > - * called with PMU disabled. If successful and return value 1, > - * then guaranteed to call perf_enable() and hw_perf_enable() > - */ > -int hw_perf_group_sched_in(struct perf_event *leader, > - struct perf_cpu_context *cpuctx, > - struct perf_event_context *ctx) > -{ > - struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events); > - struct perf_event *sub; > - int assign[X86_PMC_IDX_MAX]; > - int n0, n1, ret; > - > - if (!x86_pmu_initialized()) > - return 0; > - > - /* n0 = total number of events */ > - n0 = collect_events(cpuc, leader, true); > - if (n0 < 0) > - return n0; > - > - ret = x86_pmu.schedule_events(cpuc, n0, assign); > - if (ret) > - return ret; > - > - ret = x86_event_sched_in(leader, cpuctx); > - if (ret) > - return ret; > - > - n1 = 1; > - list_for_each_entry(sub, &leader->sibling_list, group_entry) { > - if (sub->state > PERF_EVENT_STATE_OFF) { > - ret = x86_event_sched_in(sub, cpuctx); > - if (ret) > - goto undo; > - ++n1; > - } > - } > - /* > - * copy new assignment, now we know it is possible > - * will be used by hw_perf_enable() > - */ > - memcpy(cpuc->assign, assign, n0*sizeof(int)); > 
- > - cpuc->n_events = n0; > - cpuc->n_added += n1; > - ctx->nr_active += n1; > - > - /* > - * 1 means successful and events are active > - * This is not quite true because we defer > - * actual activation until hw_perf_enable() but > - * this way we* ensure caller won't try to enable > - * individual events > - */ > - return 1; > -undo: > - x86_event_sched_out(leader, cpuctx); > - n0 = 1; > - list_for_each_entry(sub, &leader->sibling_list, group_entry) { > - if (sub->state == PERF_EVENT_STATE_ACTIVE) { > - x86_event_sched_out(sub, cpuctx); > - if (++n0 == n1) > - break; > - } > - } > - return ret; > -} > - > #include "perf_event_amd.c" > #include "perf_event_p6.c" > #include "perf_event_p4.c" > @@ -1454,6 +1345,47 @@ static inline void x86_pmu_read(struct perf_event *event) > x86_perf_event_update(event); > } > > +/* > + * Set the flag to make pmu::enable() not perform the > + * schedulablilty test. > + */ > +static void x86_pmu_start_txn(struct pmu *pmu) > +{ > + pmu->flag |= PERF_EVENT_TRAN_STARTED; > +} > + > +static void x86_pmu_stop_txn(struct pmu *pmu) > +{ > + pmu->flag &= ~PERF_EVENT_TRAN_STARTED; > +} > + > +/* > + * Return 0 if commit transaction success > + */ > +static int x86_pmu_commit_txn(struct pmu *pmu) > +{ > + struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events); > + int assign[X86_PMC_IDX_MAX]; > + int n, ret; > + > + n = cpuc->n_events; > + > + if (!x86_pmu_initialized()) > + return -EAGAIN; > + > + ret = x86_pmu.schedule_events(cpuc, n, assign); > + if (ret) > + return ret; > + > + /* > + * copy new assignment, now we know it is possible > + * will be used by hw_perf_enable() > + */ > + memcpy(cpuc->assign, assign, n*sizeof(int)); > + > + return 0; > +} > + > static const struct pmu pmu = { > .enable = x86_pmu_enable, > .disable = x86_pmu_disable, > @@ -1461,6 +1393,9 @@ static const struct pmu pmu = { > .stop = x86_pmu_stop, > .read = x86_pmu_read, > .unthrottle = x86_pmu_unthrottle, > + .start_txn = x86_pmu_start_txn, > + .stop_txn = 
x86_pmu_stop_txn, > + .commit_txn = x86_pmu_commit_txn, > }; > > /* > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h > index bf896d0..93aa8d8 100644 > --- a/include/linux/perf_event.h > +++ b/include/linux/perf_event.h > @@ -524,6 +524,8 @@ struct hw_perf_event { > > struct perf_event; > > +#define PERF_EVENT_TRAN_STARTED 1 > + > /** > * struct pmu - generic performance monitoring unit > */ > @@ -534,6 +536,11 @@ struct pmu { > void (*stop) (struct perf_event *event); > void (*read) (struct perf_event *event); > void (*unthrottle) (struct perf_event *event); > + void (*start_txn) (struct pmu *pmu); > + void (*stop_txn) (struct pmu *pmu); > + int (*commit_txn) (struct pmu *pmu); > + > + u8 flag; > }; > > /** > @@ -799,9 +806,6 @@ extern void perf_disable(void); > extern void perf_enable(void); > extern int perf_event_task_disable(void); > extern int perf_event_task_enable(void); > -extern int hw_perf_group_sched_in(struct perf_event *group_leader, > - struct perf_cpu_context *cpuctx, > - struct perf_event_context *ctx); > extern void perf_event_update_userpage(struct perf_event *event); > extern int perf_event_release_kernel(struct perf_event *event); > extern struct perf_event * > diff --git a/kernel/perf_event.c b/kernel/perf_event.c > index 07b7a43..4537676 100644 > --- a/kernel/perf_event.c > +++ b/kernel/perf_event.c > @@ -83,14 +83,6 @@ extern __weak const struct pmu *hw_perf_event_init(struct perf_event *event) > void __weak hw_perf_disable(void) { barrier(); } > void __weak hw_perf_enable(void) { barrier(); } > > -int __weak > -hw_perf_group_sched_in(struct perf_event *group_leader, > - struct perf_cpu_context *cpuctx, > - struct perf_event_context *ctx) > -{ > - return 0; > -} > - > void __weak perf_event_print_debug(void) { } > > static DEFINE_PER_CPU(int, perf_disable_count); > @@ -642,14 +634,13 @@ group_sched_in(struct perf_event *group_event, > struct perf_event_context *ctx) > { > struct perf_event *event, *partial_group; > + 
struct pmu *pmu = (struct pmu *)group_event->pmu; > int ret; > > if (group_event->state == PERF_EVENT_STATE_OFF) > return 0; > > - ret = hw_perf_group_sched_in(group_event, cpuctx, ctx); > - if (ret) > - return ret < 0 ? ret : 0; > + pmu->start_txn(pmu); > > if (event_sched_in(group_event, cpuctx, ctx)) > return -EAGAIN; > @@ -664,16 +655,21 @@ group_sched_in(struct perf_event *group_event, > } > } > > - return 0; > + ret = pmu->commit_txn(pmu); > + if (!ret) { > + pmu->stop_txn(pmu); > + return 0; > + } > > group_error: > + pmu->stop_txn(pmu); > + > /* > - * Groups can be scheduled in as one unit only, so undo any > - * partial group before returning: > + * Commit transaction fails, rollback > + * Groups can be scheduled in as one unit only, so undo > + * whole group before returning: > */ > list_for_each_entry(event, &group_event->sibling_list, group_entry) { > - if (event == partial_group) > - break; > event_sched_out(event, cpuctx, ctx); > } > event_sched_out(group_event, cpuctx, ctx); > > > ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-04-21 8:32 ` stephane eranian @ 2010-04-21 8:39 ` Lin Ming 2010-04-21 8:44 ` stephane eranian 0 siblings, 1 reply; 55+ messages in thread From: Lin Ming @ 2010-04-21 8:39 UTC (permalink / raw) To: eranian@gmail.com; +Cc: Peter Zijlstra, Gary.Mohr@Bull.com, Corey Ashford, LKML On Wed, 2010-04-21 at 16:32 +0800, stephane eranian wrote: > Seems to me that struct pmu is a shared resource across all CPUs. > I don't understand why scheduling on one CPU would have to impact > all the other CPUs, unless I am missing something here. Do you mean the pmu->flag? You are right, pmu->flag should be per cpu data. Will update the patch. Thanks, Lin Ming > > > On Wed, Apr 21, 2010 at 10:08 AM, Lin Ming <ming.m.lin@intel.com> wrote: > > On Tue, 2010-04-20 at 20:03 +0800, Peter Zijlstra wrote: > >> On Tue, 2010-04-20 at 19:55 +0800, Lin Ming wrote: > >> > >> > > One thing not on that list, which should happen first I guess, is to > >> > > remove hw_perf_group_sched_in(). The idea is to add some sort of > >> > > transactional API to the struct pmu, so that we can delay the > >> > > schedulability check until commit time (and roll back when it fails). > >> > > > >> > > Something as simple as: > >> > > > >> > > struct pmu { > >> > > void start_txn(struct pmu *); > >> > > void commit_txn(struct pmu *); > >> > > > >> > > ,,, > >> > > }; > >> > > >> > Could you please explain a bit more? > >> > > >> > Does it mean that "start_txn" perform the schedule events stuff > >> > and "commit_txn" perform the assign events stuff? > >> > > >> > Does "commit time" mean the actual activation in hw_perf_enable? > >> > >> No, the idea behind hw_perf_group_sched_in() is to not perform > >> schedulability tests on each event in the group, but to add the group as > >> a whole and then perform one test. > >> > >> Of course, when that test fails, you'll have to roll-back the whole > >> group again. 
> >> >> So start_txn (or a better name) would simply toggle a flag in the pmu > >> implementation that will make pmu::enable() not perform the > >> schedulability test. > >> > >> Then commit_txn() will perform the schedulability test (so note the > >> method has to have a !void return value, my mistake in the earlier > >> email). > >> > >> This will allow us to use the regular > >> kernel/perf_event.c::group_sched_in() and all the rollback code. > >> Currently each hw_perf_group_sched_in() implementation duplicates all > >> the rollback code (with various bugs). > >> > >> > >> > >> We must get rid of all weak hw_perf_*() functions before we can properly > >> consider multiple struct pmu implementations. > >> > > > > Thanks for the clear explanation. > > > > Does below patch show what you mean? > > > > I only touch the x86 arch code now. > > > > --- > > arch/x86/kernel/cpu/perf_event.c | 161 +++++++++++-------------------------- > > include/linux/perf_event.h | 10 ++- > > kernel/perf_event.c | 28 +++---- > > 3 files changed, 67 insertions(+), 132 deletions(-) > > > > diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c > > index 626154a..62aa9a1 100644 > > --- a/arch/x86/kernel/cpu/perf_event.c > > +++ b/arch/x86/kernel/cpu/perf_event.c > > @@ -944,6 +944,9 @@ static int x86_pmu_enable(struct perf_event *event) > > if (n < 0) > > return n; > > > > + if (!(event->pmu->flag & PERF_EVENT_TRAN_STARTED)) > > + goto out; > > + > > ret = x86_pmu.schedule_events(cpuc, n, assign); > > if (ret) > > return ret; > > @@ -953,6 +956,7 @@ static int x86_pmu_enable(struct perf_event *event) > > */ > > memcpy(cpuc->assign, assign, n*sizeof(int)); > > > > +out: > > cpuc->n_events = n; > > cpuc->n_added += n - n0; > > > > @@ -1210,119 +1214,6 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event) > > return &unconstrained; > > } > > > > -static int x86_event_sched_in(struct perf_event *event, > > - struct perf_cpu_context 
*cpuctx) > > -{ > > - int ret = 0; > > - > > - event->state = PERF_EVENT_STATE_ACTIVE; > > - event->oncpu = smp_processor_id(); > > - event->tstamp_running += event->ctx->time - event->tstamp_stopped; > > - > > - if (!is_x86_event(event)) > > - ret = event->pmu->enable(event); > > - > > - if (!ret && !is_software_event(event)) > > - cpuctx->active_oncpu++; > > - > > - if (!ret && event->attr.exclusive) > > - cpuctx->exclusive = 1; > > - > > - return ret; > > -} > > - > > -static void x86_event_sched_out(struct perf_event *event, > > - struct perf_cpu_context *cpuctx) > > -{ > > - event->state = PERF_EVENT_STATE_INACTIVE; > > - event->oncpu = -1; > > - > > - if (!is_x86_event(event)) > > - event->pmu->disable(event); > > - > > - event->tstamp_running -= event->ctx->time - event->tstamp_stopped; > > - > > - if (!is_software_event(event)) > > - cpuctx->active_oncpu--; > > - > > - if (event->attr.exclusive || !cpuctx->active_oncpu) > > - cpuctx->exclusive = 0; > > -} > > - > > -/* > > - * Called to enable a whole group of events. > > - * Returns 1 if the group was enabled, or -EAGAIN if it could not be. > > - * Assumes the caller has disabled interrupts and has > > - * frozen the PMU with hw_perf_save_disable. > > - * > > - * called with PMU disabled. 
If successful and return value 1, > > - * then guaranteed to call perf_enable() and hw_perf_enable() > > - */ > > -int hw_perf_group_sched_in(struct perf_event *leader, > > - struct perf_cpu_context *cpuctx, > > - struct perf_event_context *ctx) > > -{ > > - struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events); > > - struct perf_event *sub; > > - int assign[X86_PMC_IDX_MAX]; > > - int n0, n1, ret; > > - > > - if (!x86_pmu_initialized()) > > - return 0; > > - > > - /* n0 = total number of events */ > > - n0 = collect_events(cpuc, leader, true); > > - if (n0 < 0) > > - return n0; > > - > > - ret = x86_pmu.schedule_events(cpuc, n0, assign); > > - if (ret) > > - return ret; > > - > > - ret = x86_event_sched_in(leader, cpuctx); > > - if (ret) > > - return ret; > > - > > - n1 = 1; > > - list_for_each_entry(sub, &leader->sibling_list, group_entry) { > > - if (sub->state > PERF_EVENT_STATE_OFF) { > > - ret = x86_event_sched_in(sub, cpuctx); > > - if (ret) > > - goto undo; > > - ++n1; > > - } > > - } > > - /* > > - * copy new assignment, now we know it is possible > > - * will be used by hw_perf_enable() > > - */ > > - memcpy(cpuc->assign, assign, n0*sizeof(int)); > > - > > - cpuc->n_events = n0; > > - cpuc->n_added += n1; > > - ctx->nr_active += n1; > > - > > - /* > > - * 1 means successful and events are active > > - * This is not quite true because we defer > > - * actual activation until hw_perf_enable() but > > - * this way we* ensure caller won't try to enable > > - * individual events > > - */ > > - return 1; > > -undo: > > - x86_event_sched_out(leader, cpuctx); > > - n0 = 1; > > - list_for_each_entry(sub, &leader->sibling_list, group_entry) { > > - if (sub->state == PERF_EVENT_STATE_ACTIVE) { > > - x86_event_sched_out(sub, cpuctx); > > - if (++n0 == n1) > > - break; > > - } > > - } > > - return ret; > > -} > > - > > #include "perf_event_amd.c" > > #include "perf_event_p6.c" > > #include "perf_event_p4.c" > > @@ -1454,6 +1345,47 @@ static inline void 
x86_pmu_read(struct perf_event *event) > > x86_perf_event_update(event); > > } > > > > +/* > > + * Set the flag to make pmu::enable() not perform the > > + * schedulablilty test. > > + */ > > +static void x86_pmu_start_txn(struct pmu *pmu) > > +{ > > + pmu->flag |= PERF_EVENT_TRAN_STARTED; > > +} > > + > > +static void x86_pmu_stop_txn(struct pmu *pmu) > > +{ > > + pmu->flag &= ~PERF_EVENT_TRAN_STARTED; > > +} > > + > > +/* > > + * Return 0 if commit transaction success > > + */ > > +static int x86_pmu_commit_txn(struct pmu *pmu) > > +{ > > + struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events); > > + int assign[X86_PMC_IDX_MAX]; > > + int n, ret; > > + > > + n = cpuc->n_events; > > + > > + if (!x86_pmu_initialized()) > > + return -EAGAIN; > > + > > + ret = x86_pmu.schedule_events(cpuc, n, assign); > > + if (ret) > > + return ret; > > + > > + /* > > + * copy new assignment, now we know it is possible > > + * will be used by hw_perf_enable() > > + */ > > + memcpy(cpuc->assign, assign, n*sizeof(int)); > > + > > + return 0; > > +} > > + > > static const struct pmu pmu = { > > .enable = x86_pmu_enable, > > .disable = x86_pmu_disable, > > @@ -1461,6 +1393,9 @@ static const struct pmu pmu = { > > .stop = x86_pmu_stop, > > .read = x86_pmu_read, > > .unthrottle = x86_pmu_unthrottle, > > + .start_txn = x86_pmu_start_txn, > > + .stop_txn = x86_pmu_stop_txn, > > + .commit_txn = x86_pmu_commit_txn, > > }; > > > > /* > > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h > > index bf896d0..93aa8d8 100644 > > --- a/include/linux/perf_event.h > > +++ b/include/linux/perf_event.h > > @@ -524,6 +524,8 @@ struct hw_perf_event { > > > > struct perf_event; > > > > +#define PERF_EVENT_TRAN_STARTED 1 > > + > > /** > > * struct pmu - generic performance monitoring unit > > */ > > @@ -534,6 +536,11 @@ struct pmu { > > void (*stop) (struct perf_event *event); > > void (*read) (struct perf_event *event); > > void (*unthrottle) (struct perf_event *event); > > + void 
(*start_txn) (struct pmu *pmu); > > + void (*stop_txn) (struct pmu *pmu); > > + int (*commit_txn) (struct pmu *pmu); > > + > > + u8 flag; > > }; > > > > /** > > @@ -799,9 +806,6 @@ extern void perf_disable(void); > > extern void perf_enable(void); > > extern int perf_event_task_disable(void); > > extern int perf_event_task_enable(void); > > -extern int hw_perf_group_sched_in(struct perf_event *group_leader, > > - struct perf_cpu_context *cpuctx, > > - struct perf_event_context *ctx); > > extern void perf_event_update_userpage(struct perf_event *event); > > extern int perf_event_release_kernel(struct perf_event *event); > > extern struct perf_event * > > diff --git a/kernel/perf_event.c b/kernel/perf_event.c > > index 07b7a43..4537676 100644 > > --- a/kernel/perf_event.c > > +++ b/kernel/perf_event.c > > @@ -83,14 +83,6 @@ extern __weak const struct pmu *hw_perf_event_init(struct perf_event *event) > > void __weak hw_perf_disable(void) { barrier(); } > > void __weak hw_perf_enable(void) { barrier(); } > > > > -int __weak > > -hw_perf_group_sched_in(struct perf_event *group_leader, > > - struct perf_cpu_context *cpuctx, > > - struct perf_event_context *ctx) > > -{ > > - return 0; > > -} > > - > > void __weak perf_event_print_debug(void) { } > > > > static DEFINE_PER_CPU(int, perf_disable_count); > > @@ -642,14 +634,13 @@ group_sched_in(struct perf_event *group_event, > > struct perf_event_context *ctx) > > { > > struct perf_event *event, *partial_group; > > + struct pmu *pmu = (struct pmu *)group_event->pmu; > > int ret; > > > > if (group_event->state == PERF_EVENT_STATE_OFF) > > return 0; > > > > - ret = hw_perf_group_sched_in(group_event, cpuctx, ctx); > > - if (ret) > > - return ret < 0 ? 
ret : 0; > > + pmu->start_txn(pmu); > > > > if (event_sched_in(group_event, cpuctx, ctx)) > > return -EAGAIN; > > @@ -664,16 +655,21 @@ group_sched_in(struct perf_event *group_event, > > } > > } > > > > - return 0; > > + ret = pmu->commit_txn(pmu); > > + if (!ret) { > > + pmu->stop_txn(pmu); > > + return 0; > > + } > > > > group_error: > > + pmu->stop_txn(pmu); > > + > > /* > > - * Groups can be scheduled in as one unit only, so undo any > > - * partial group before returning: > > + * Commit transaction fails, rollback > > + * Groups can be scheduled in as one unit only, so undo > > + * whole group before returning: > > */ > > list_for_each_entry(event, &group_event->sibling_list, group_entry) { > > - if (event == partial_group) > > - break; > > event_sched_out(event, cpuctx, ctx); > > } > > event_sched_out(group_event, cpuctx, ctx); > > > > > > ^ permalink raw reply [flat|nested] 55+ messages in thread
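[Editor's note] In the patch, x86_pmu_commit_txn() defers the real work to x86_pmu.schedule_events() and copies the assignment into cpuc->assign only when the test passes. A sketch of that commit-time test, with a naive backtracking search over per-event counter masks standing in for the kernel's constraint scheduler (all names below are hypothetical, not the kernel's):

```c
/* Toy version of the commit-time schedulability test: each event carries
 * a bitmask of counters it may use, and commit succeeds only if every
 * collected event can be assigned a distinct counter.  The backtracking
 * search is an illustrative stand-in for x86_pmu.schedule_events(). */
#include <string.h>

#define NR_COUNTERS 4

/* try to give events i..n-1 distinct counters allowed by their masks */
static int assign_from(const unsigned *mask, int n, int i,
                       unsigned used, int *assign)
{
    int c;

    if (i == n)
        return 0;
    for (c = 0; c < NR_COUNTERS; c++) {
        if ((mask[i] & (1u << c)) && !(used & (1u << c))) {
            assign[i] = c;
            if (!assign_from(mask, n, i + 1, used | (1u << c), assign))
                return 0;
        }
    }
    return -1;      /* no feasible assignment: -EAGAIN in the kernel */
}

/* commit: run the test, publish the assignment only on success,
 * mirroring the memcpy into cpuc->assign in x86_pmu_commit_txn() */
static int toy_commit_txn(const unsigned *mask, int n, int *published)
{
    int tmp[NR_COUNTERS];

    if (assign_from(mask, n, 0, 0, tmp))
        return -1;
    memcpy(published, tmp, n * sizeof(int));
    return 0;
}
```

With masks { 0x3, 0x1 } the first attempt (event 0 on counter 0) dead-ends and the search backtracks, finishing with event 0 on counter 1 and event 1 on counter 0 — the same "test the whole group, publish only on success" shape as the patch.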
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-04-21 8:39 ` Lin Ming @ 2010-04-21 8:44 ` stephane eranian 2010-04-21 9:42 ` Lin Ming 0 siblings, 1 reply; 55+ messages in thread From: stephane eranian @ 2010-04-21 8:44 UTC (permalink / raw) To: Lin Ming; +Cc: Peter Zijlstra, Gary.Mohr@Bull.com, Corey Ashford, LKML On Wed, Apr 21, 2010 at 10:39 AM, Lin Ming <ming.m.lin@intel.com> wrote: > On Wed, 2010-04-21 at 16:32 +0800, stephane eranian wrote: >> Seems to me that struct pmu is a shared resource across all CPUs. >> I don't understand why scheduling on one CPU would have to impact >> all the other CPUs, unless I am missing something here. > > Do you mean the pmu->flag? Yes. > You are right, pmu->flag should be per cpu data. > > Will update the patch. > > Thanks, > Lin Ming > >> >> >> On Wed, Apr 21, 2010 at 10:08 AM, Lin Ming <ming.m.lin@intel.com> wrote: >> > On Tue, 2010-04-20 at 20:03 +0800, Peter Zijlstra wrote: >> >> On Tue, 2010-04-20 at 19:55 +0800, Lin Ming wrote: >> >> >> >> > > One thing not on that list, which should happen first I guess, is to >> >> > > remove hw_perf_group_sched_in(). The idea is to add some sort of >> >> > > transactional API to the struct pmu, so that we can delay the >> >> > > schedulability check until commit time (and roll back when it fails). >> >> > > >> >> > > Something as simple as: >> >> > > >> >> > > struct pmu { >> >> > > void start_txn(struct pmu *); >> >> > > void commit_txn(struct pmu *); >> >> > > >> >> > > ,,, >> >> > > }; >> >> > >> >> > Could you please explain a bit more? >> >> > >> >> > Does it mean that "start_txn" perform the schedule events stuff >> >> > and "commit_txn" perform the assign events stuff? >> >> > >> >> > Does "commit time" mean the actual activation in hw_perf_enable? >> >> >> >> No, the idea behind hw_perf_group_sched_in() is to not perform >> >> schedulability tests on each event in the group, but to add the group as >> >> a whole and then perform one test. 
>> >> >> >> Of course, when that test fails, you'll have to roll-back the whole >> >> group again. >> >> >> >> So start_txn (or a better name) would simply toggle a flag in the pmu >> >> implementation that will make pmu::enable() not perform the >> >> schedulablilty test. >> >> >> >> Then commit_txn() will perform the schedulability test (so note the >> >> method has to have a !void return value, my mistake in the earlier >> >> email). >> >> >> >> This will allow us to use the regular >> >> kernel/perf_event.c::group_sched_in() and all the rollback code. >> >> Currently each hw_perf_group_sched_in() implementation duplicates all >> >> the rolllback code (with various bugs). >> >> >> >> >> >> >> >> We must get rid of all weak hw_perf_*() functions before we can properly >> >> consider multiple struct pmu implementations. >> >> >> > >> > Thanks for the clear explanation. >> > >> > Does below patch show what you mean? >> > >> > I only touch the x86 arch code now. >> > >> > --- >> > arch/x86/kernel/cpu/perf_event.c | 161 +++++++++++-------------------------- >> > include/linux/perf_event.h | 10 ++- >> > kernel/perf_event.c | 28 +++---- >> > 3 files changed, 67 insertions(+), 132 deletions(-) >> > >> > diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c >> > index 626154a..62aa9a1 100644 >> > --- a/arch/x86/kernel/cpu/perf_event.c >> > +++ b/arch/x86/kernel/cpu/perf_event.c >> > @@ -944,6 +944,9 @@ static int x86_pmu_enable(struct perf_event *event) >> > if (n < 0) >> > return n; >> > >> > + if (!(event->pmu->flag & PERF_EVENT_TRAN_STARTED)) >> > + goto out; >> > + >> > ret = x86_pmu.schedule_events(cpuc, n, assign); >> > if (ret) >> > return ret; >> > @@ -953,6 +956,7 @@ static int x86_pmu_enable(struct perf_event *event) >> > */ >> > memcpy(cpuc->assign, assign, n*sizeof(int)); >> > >> > +out: >> > cpuc->n_events = n; >> > cpuc->n_added += n - n0; >> > >> > @@ -1210,119 +1214,6 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, 
struct perf_event *event) >> > return &unconstrained; >> > } >> > >> > -static int x86_event_sched_in(struct perf_event *event, >> > - struct perf_cpu_context *cpuctx) >> > -{ >> > - int ret = 0; >> > - >> > - event->state = PERF_EVENT_STATE_ACTIVE; >> > - event->oncpu = smp_processor_id(); >> > - event->tstamp_running += event->ctx->time - event->tstamp_stopped; >> > - >> > - if (!is_x86_event(event)) >> > - ret = event->pmu->enable(event); >> > - >> > - if (!ret && !is_software_event(event)) >> > - cpuctx->active_oncpu++; >> > - >> > - if (!ret && event->attr.exclusive) >> > - cpuctx->exclusive = 1; >> > - >> > - return ret; >> > -} >> > - >> > -static void x86_event_sched_out(struct perf_event *event, >> > - struct perf_cpu_context *cpuctx) >> > -{ >> > - event->state = PERF_EVENT_STATE_INACTIVE; >> > - event->oncpu = -1; >> > - >> > - if (!is_x86_event(event)) >> > - event->pmu->disable(event); >> > - >> > - event->tstamp_running -= event->ctx->time - event->tstamp_stopped; >> > - >> > - if (!is_software_event(event)) >> > - cpuctx->active_oncpu--; >> > - >> > - if (event->attr.exclusive || !cpuctx->active_oncpu) >> > - cpuctx->exclusive = 0; >> > -} >> > - >> > -/* >> > - * Called to enable a whole group of events. >> > - * Returns 1 if the group was enabled, or -EAGAIN if it could not be. >> > - * Assumes the caller has disabled interrupts and has >> > - * frozen the PMU with hw_perf_save_disable. >> > - * >> > - * called with PMU disabled. 
If successful and return value 1, >> > - * then guaranteed to call perf_enable() and hw_perf_enable() >> > - */ >> > -int hw_perf_group_sched_in(struct perf_event *leader, >> > - struct perf_cpu_context *cpuctx, >> > - struct perf_event_context *ctx) >> > -{ >> > - struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events); >> > - struct perf_event *sub; >> > - int assign[X86_PMC_IDX_MAX]; >> > - int n0, n1, ret; >> > - >> > - if (!x86_pmu_initialized()) >> > - return 0; >> > - >> > - /* n0 = total number of events */ >> > - n0 = collect_events(cpuc, leader, true); >> > - if (n0 < 0) >> > - return n0; >> > - >> > - ret = x86_pmu.schedule_events(cpuc, n0, assign); >> > - if (ret) >> > - return ret; >> > - >> > - ret = x86_event_sched_in(leader, cpuctx); >> > - if (ret) >> > - return ret; >> > - >> > - n1 = 1; >> > - list_for_each_entry(sub, &leader->sibling_list, group_entry) { >> > - if (sub->state > PERF_EVENT_STATE_OFF) { >> > - ret = x86_event_sched_in(sub, cpuctx); >> > - if (ret) >> > - goto undo; >> > - ++n1; >> > - } >> > - } >> > - /* >> > - * copy new assignment, now we know it is possible >> > - * will be used by hw_perf_enable() >> > - */ >> > - memcpy(cpuc->assign, assign, n0*sizeof(int)); >> > - >> > - cpuc->n_events = n0; >> > - cpuc->n_added += n1; >> > - ctx->nr_active += n1; >> > - >> > - /* >> > - * 1 means successful and events are active >> > - * This is not quite true because we defer >> > - * actual activation until hw_perf_enable() but >> > - * this way we* ensure caller won't try to enable >> > - * individual events >> > - */ >> > - return 1; >> > -undo: >> > - x86_event_sched_out(leader, cpuctx); >> > - n0 = 1; >> > - list_for_each_entry(sub, &leader->sibling_list, group_entry) { >> > - if (sub->state == PERF_EVENT_STATE_ACTIVE) { >> > - x86_event_sched_out(sub, cpuctx); >> > - if (++n0 == n1) >> > - break; >> > - } >> > - } >> > - return ret; >> > -} >> > - >> > #include "perf_event_amd.c" >> > #include "perf_event_p6.c" >> > #include 
"perf_event_p4.c" >> > @@ -1454,6 +1345,47 @@ static inline void x86_pmu_read(struct perf_event *event) >> > x86_perf_event_update(event); >> > } >> > >> > +/* >> > + * Set the flag to make pmu::enable() not perform the >> > + * schedulablilty test. >> > + */ >> > +static void x86_pmu_start_txn(struct pmu *pmu) >> > +{ >> > + pmu->flag |= PERF_EVENT_TRAN_STARTED; >> > +} >> > + >> > +static void x86_pmu_stop_txn(struct pmu *pmu) >> > +{ >> > + pmu->flag &= ~PERF_EVENT_TRAN_STARTED; >> > +} >> > + >> > +/* >> > + * Return 0 if commit transaction success >> > + */ >> > +static int x86_pmu_commit_txn(struct pmu *pmu) >> > +{ >> > + struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events); >> > + int assign[X86_PMC_IDX_MAX]; >> > + int n, ret; >> > + >> > + n = cpuc->n_events; >> > + >> > + if (!x86_pmu_initialized()) >> > + return -EAGAIN; >> > + >> > + ret = x86_pmu.schedule_events(cpuc, n, assign); >> > + if (ret) >> > + return ret; >> > + >> > + /* >> > + * copy new assignment, now we know it is possible >> > + * will be used by hw_perf_enable() >> > + */ >> > + memcpy(cpuc->assign, assign, n*sizeof(int)); >> > + >> > + return 0; >> > +} >> > + >> > static const struct pmu pmu = { >> > .enable = x86_pmu_enable, >> > .disable = x86_pmu_disable, >> > @@ -1461,6 +1393,9 @@ static const struct pmu pmu = { >> > .stop = x86_pmu_stop, >> > .read = x86_pmu_read, >> > .unthrottle = x86_pmu_unthrottle, >> > + .start_txn = x86_pmu_start_txn, >> > + .stop_txn = x86_pmu_stop_txn, >> > + .commit_txn = x86_pmu_commit_txn, >> > }; >> > >> > /* >> > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h >> > index bf896d0..93aa8d8 100644 >> > --- a/include/linux/perf_event.h >> > +++ b/include/linux/perf_event.h >> > @@ -524,6 +524,8 @@ struct hw_perf_event { >> > >> > struct perf_event; >> > >> > +#define PERF_EVENT_TRAN_STARTED 1 >> > + >> > /** >> > * struct pmu - generic performance monitoring unit >> > */ >> > @@ -534,6 +536,11 @@ struct pmu { >> > void (*stop) 
(struct perf_event *event); >> > void (*read) (struct perf_event *event); >> > void (*unthrottle) (struct perf_event *event); >> > + void (*start_txn) (struct pmu *pmu); >> > + void (*stop_txn) (struct pmu *pmu); >> > + int (*commit_txn) (struct pmu *pmu); >> > + >> > + u8 flag; >> > }; >> > >> > /** >> > @@ -799,9 +806,6 @@ extern void perf_disable(void); >> > extern void perf_enable(void); >> > extern int perf_event_task_disable(void); >> > extern int perf_event_task_enable(void); >> > -extern int hw_perf_group_sched_in(struct perf_event *group_leader, >> > - struct perf_cpu_context *cpuctx, >> > - struct perf_event_context *ctx); >> > extern void perf_event_update_userpage(struct perf_event *event); >> > extern int perf_event_release_kernel(struct perf_event *event); >> > extern struct perf_event * >> > diff --git a/kernel/perf_event.c b/kernel/perf_event.c >> > index 07b7a43..4537676 100644 >> > --- a/kernel/perf_event.c >> > +++ b/kernel/perf_event.c >> > @@ -83,14 +83,6 @@ extern __weak const struct pmu *hw_perf_event_init(struct perf_event *event) >> > void __weak hw_perf_disable(void) { barrier(); } >> > void __weak hw_perf_enable(void) { barrier(); } >> > >> > -int __weak >> > -hw_perf_group_sched_in(struct perf_event *group_leader, >> > - struct perf_cpu_context *cpuctx, >> > - struct perf_event_context *ctx) >> > -{ >> > - return 0; >> > -} >> > - >> > void __weak perf_event_print_debug(void) { } >> > >> > static DEFINE_PER_CPU(int, perf_disable_count); >> > @@ -642,14 +634,13 @@ group_sched_in(struct perf_event *group_event, >> > struct perf_event_context *ctx) >> > { >> > struct perf_event *event, *partial_group; >> > + struct pmu *pmu = (struct pmu *)group_event->pmu; >> > int ret; >> > >> > if (group_event->state == PERF_EVENT_STATE_OFF) >> > return 0; >> > >> > - ret = hw_perf_group_sched_in(group_event, cpuctx, ctx); >> > - if (ret) >> > - return ret < 0 ? 
ret : 0; >> > + pmu->start_txn(pmu); >> > >> > if (event_sched_in(group_event, cpuctx, ctx)) >> > return -EAGAIN; >> > @@ -664,16 +655,21 @@ group_sched_in(struct perf_event *group_event, >> > } >> > } >> > >> > - return 0; >> > + ret = pmu->commit_txn(pmu); >> > + if (!ret) { >> > + pmu->stop_txn(pmu); >> > + return 0; >> > + } >> > >> > group_error: >> > + pmu->stop_txn(pmu); >> > + >> > /* >> > - * Groups can be scheduled in as one unit only, so undo any >> > - * partial group before returning: >> > + * Commit transaction fails, rollback >> > + * Groups can be scheduled in as one unit only, so undo >> > + * whole group before returning: >> > */ >> > list_for_each_entry(event, &group_event->sibling_list, group_entry) { >> > - if (event == partial_group) >> > - break; >> > event_sched_out(event, cpuctx, ctx); >> > } >> > event_sched_out(group_event, cpuctx, ctx); >> > >> > >> > > > ^ permalink raw reply [flat|nested] 55+ messages in thread
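[Editor's note] Stephane's objection is that struct pmu is one object shared by every CPU, so a single pmu->flag would let a transaction opened on one CPU suppress the schedulability test on all the others. The v2 patch addresses this with alloc_percpu(u8) and per_cpu_ptr(); a userspace sketch of the same idea, with a plain array standing in for the kernel's per-cpu allocator (all names illustrative):

```c
/* Sketch of the per-CPU transaction flag: one byte per CPU, so a
 * transaction opened on CPU 0 is invisible on CPU 1.  The array and
 * explicit cpu argument stand in for alloc_percpu(u8)/per_cpu_ptr()
 * and smp_processor_id(); this is not the kernel API. */
#include <stdlib.h>

#define NR_CPUS     4
#define TXN_STARTED 1

struct toy_pmu {
    unsigned char *flag;        /* ~ alloc_percpu(u8) */
};

/* ~ per_cpu_ptr(pmu->flag, cpu) */
static unsigned char *pmu_flag(struct toy_pmu *pmu, int cpu)
{
    return &pmu->flag[cpu];
}

static void txn_init(struct toy_pmu *pmu)
{
    pmu->flag = calloc(NR_CPUS, 1);     /* zeroed, like alloc_percpu */
}

static void start_txn(struct toy_pmu *pmu, int cpu)
{
    *pmu_flag(pmu, cpu) |= TXN_STARTED;
}

static void stop_txn(struct toy_pmu *pmu, int cpu)
{
    *pmu_flag(pmu, cpu) &= ~TXN_STARTED;
}

static int txn_started(struct toy_pmu *pmu, int cpu)
{
    return *pmu_flag(pmu, cpu) & TXN_STARTED;
}
```

Scheduling on one CPU now toggles only that CPU's byte, which is exactly why the shared u8 in v1 had to become the percpu pointer in v2.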
* Re: [RFC] perf_events: support for uncore a.k.a. nest units 2010-04-21 8:44 ` stephane eranian @ 2010-04-21 9:42 ` Lin Ming 2010-04-21 9:57 ` Peter Zijlstra 0 siblings, 1 reply; 55+ messages in thread From: Lin Ming @ 2010-04-21 9:42 UTC (permalink / raw) To: eranian@gmail.com; +Cc: Peter Zijlstra, Gary.Mohr@Bull.com, Corey Ashford, LKML On Wed, 2010-04-21 at 16:44 +0800, stephane eranian wrote: > On Wed, Apr 21, 2010 at 10:39 AM, Lin Ming <ming.m.lin@intel.com> wrote: > > On Wed, 2010-04-21 at 16:32 +0800, stephane eranian wrote: > >> Seems to me that struct pmu is a shared resource across all CPUs. > >> I don't understand why scheduling on one CPU would have to impact > >> all the other CPUs, unless I am missing something here. > > > > Do you mean the pmu->flag? > > Yes. Thanks for the review. Changes log: v2. pmu->flag should be per cpu (Stephane Eranian) change definition of "const struct pmu" to "struct pmu" since it's not read only now. v1. remove hw_perf_group_sched_in() based on Peter's idea. 
--- arch/x86/kernel/cpu/perf_event.c | 183 ++++++++++++++------------------------ include/linux/perf_event.h | 16 +++- kernel/perf_event.c | 55 ++++++------ 3 files changed, 106 insertions(+), 148 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c index 626154a..1113fd5 100644 --- a/arch/x86/kernel/cpu/perf_event.c +++ b/arch/x86/kernel/cpu/perf_event.c @@ -598,7 +598,7 @@ static void x86_pmu_enable_all(int added) } } -static const struct pmu pmu; +static struct pmu pmu; static inline int is_x86_event(struct perf_event *event) { @@ -935,6 +935,7 @@ static int x86_pmu_enable(struct perf_event *event) struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events); struct hw_perf_event *hwc; int assign[X86_PMC_IDX_MAX]; + u8 *flag; int n, n0, ret; hwc = &event->hw; @@ -944,6 +945,10 @@ static int x86_pmu_enable(struct perf_event *event) if (n < 0) return n; + flag = perf_pmu_flag((struct pmu *)event->pmu); + if (!(*flag & PERF_EVENT_TRAN_STARTED)) + goto out; + ret = x86_pmu.schedule_events(cpuc, n, assign); if (ret) return ret; @@ -953,6 +958,7 @@ static int x86_pmu_enable(struct perf_event *event) */ memcpy(cpuc->assign, assign, n*sizeof(int)); +out: cpuc->n_events = n; cpuc->n_added += n - n0; @@ -1210,119 +1216,6 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event) return &unconstrained; } -static int x86_event_sched_in(struct perf_event *event, - struct perf_cpu_context *cpuctx) -{ - int ret = 0; - - event->state = PERF_EVENT_STATE_ACTIVE; - event->oncpu = smp_processor_id(); - event->tstamp_running += event->ctx->time - event->tstamp_stopped; - - if (!is_x86_event(event)) - ret = event->pmu->enable(event); - - if (!ret && !is_software_event(event)) - cpuctx->active_oncpu++; - - if (!ret && event->attr.exclusive) - cpuctx->exclusive = 1; - - return ret; -} - -static void x86_event_sched_out(struct perf_event *event, - struct perf_cpu_context *cpuctx) -{ - event->state = 
PERF_EVENT_STATE_INACTIVE; - event->oncpu = -1; - - if (!is_x86_event(event)) - event->pmu->disable(event); - - event->tstamp_running -= event->ctx->time - event->tstamp_stopped; - - if (!is_software_event(event)) - cpuctx->active_oncpu--; - - if (event->attr.exclusive || !cpuctx->active_oncpu) - cpuctx->exclusive = 0; -} - -/* - * Called to enable a whole group of events. - * Returns 1 if the group was enabled, or -EAGAIN if it could not be. - * Assumes the caller has disabled interrupts and has - * frozen the PMU with hw_perf_save_disable. - * - * called with PMU disabled. If successful and return value 1, - * then guaranteed to call perf_enable() and hw_perf_enable() - */ -int hw_perf_group_sched_in(struct perf_event *leader, - struct perf_cpu_context *cpuctx, - struct perf_event_context *ctx) -{ - struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events); - struct perf_event *sub; - int assign[X86_PMC_IDX_MAX]; - int n0, n1, ret; - - if (!x86_pmu_initialized()) - return 0; - - /* n0 = total number of events */ - n0 = collect_events(cpuc, leader, true); - if (n0 < 0) - return n0; - - ret = x86_pmu.schedule_events(cpuc, n0, assign); - if (ret) - return ret; - - ret = x86_event_sched_in(leader, cpuctx); - if (ret) - return ret; - - n1 = 1; - list_for_each_entry(sub, &leader->sibling_list, group_entry) { - if (sub->state > PERF_EVENT_STATE_OFF) { - ret = x86_event_sched_in(sub, cpuctx); - if (ret) - goto undo; - ++n1; - } - } - /* - * copy new assignment, now we know it is possible - * will be used by hw_perf_enable() - */ - memcpy(cpuc->assign, assign, n0*sizeof(int)); - - cpuc->n_events = n0; - cpuc->n_added += n1; - ctx->nr_active += n1; - - /* - * 1 means successful and events are active - * This is not quite true because we defer - * actual activation until hw_perf_enable() but - * this way we* ensure caller won't try to enable - * individual events - */ - return 1; -undo: - x86_event_sched_out(leader, cpuctx); - n0 = 1; - list_for_each_entry(sub, 
&leader->sibling_list, group_entry) {
-		if (sub->state == PERF_EVENT_STATE_ACTIVE) {
-			x86_event_sched_out(sub, cpuctx);
-			if (++n0 == n1)
-				break;
-		}
-	}
-	return ret;
-}
-
 #include "perf_event_amd.c"
 #include "perf_event_p6.c"
 #include "perf_event_p4.c"
@@ -1397,6 +1290,14 @@ void __init init_hw_perf_events(void)
 		return;
 	}
 
+	/*
+	 * TBD: will move this to pmu register function
+	 * when multiple hw pmu support is added
+	 */
+	pmu.flag = alloc_percpu(u8);
+	if (!pmu.flag)
+		return;
+
 	pmu_check_apic();
 
 	pr_cont("%s PMU driver.\n", x86_pmu.name);
@@ -1454,13 +1355,61 @@ static inline void x86_pmu_read(struct perf_event *event)
 	x86_perf_event_update(event);
 }
 
-static const struct pmu pmu = {
+/*
+ * Set the flag to make pmu::enable() not perform the
+ * schedulability test.
+ */
+static void x86_pmu_start_txn(struct pmu *pmu)
+{
+	u8 *flag = perf_pmu_flag(pmu);
+
+	*flag |= PERF_EVENT_TRAN_STARTED;
+}
+
+static void x86_pmu_stop_txn(struct pmu *pmu)
+{
+	u8 *flag = perf_pmu_flag(pmu);
+
+	*flag &= ~PERF_EVENT_TRAN_STARTED;
+}
+
+/*
+ * Return 0 if commit transaction success
+ */
+static int x86_pmu_commit_txn(struct pmu *pmu)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	int assign[X86_PMC_IDX_MAX];
+	int n, ret;
+
+	n = cpuc->n_events;
+
+	if (!x86_pmu_initialized())
+		return -EAGAIN;
+
+	ret = x86_pmu.schedule_events(cpuc, n, assign);
+	if (ret)
+		return ret;
+
+	/*
+	 * copy new assignment, now we know it is possible
+	 * will be used by hw_perf_enable()
+	 */
+	memcpy(cpuc->assign, assign, n*sizeof(int));
+
+	return 0;
+}
+
+static struct pmu pmu = {
 	.enable		= x86_pmu_enable,
 	.disable	= x86_pmu_disable,
 	.start		= x86_pmu_start,
 	.stop		= x86_pmu_stop,
 	.read		= x86_pmu_read,
 	.unthrottle	= x86_pmu_unthrottle,
+	.start_txn	= x86_pmu_start_txn,
+	.stop_txn	= x86_pmu_stop_txn,
+	.commit_txn	= x86_pmu_commit_txn,
 };
 
 /*
@@ -1537,9 +1486,9 @@ out:
 	return ret;
 }
 
-const struct pmu *hw_perf_event_init(struct perf_event *event)
+struct pmu *hw_perf_event_init(struct perf_event *event)
 {
-	const struct pmu *tmp;
+	struct pmu *tmp;
 	int err;
 
 	err = __hw_perf_event_init(event);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index bf896d0..9fa3f46 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -524,6 +524,8 @@ struct hw_perf_event {
 
 struct perf_event;
 
+#define PERF_EVENT_TRAN_STARTED 1
+
 /**
  * struct pmu - generic performance monitoring unit
  */
@@ -534,6 +536,12 @@ struct pmu {
 	void (*stop)			(struct perf_event *event);
 	void (*read)			(struct perf_event *event);
 	void (*unthrottle)		(struct perf_event *event);
+	void (*start_txn)		(struct pmu *pmu);
+	void (*stop_txn)		(struct pmu *pmu);
+	int  (*commit_txn)		(struct pmu *pmu);
+
+	/* percpu flag */
+	u8 *flag;
 };
 
 /**
@@ -610,7 +618,7 @@ struct perf_event {
 	int				group_flags;
 	struct perf_event		*group_leader;
 	struct perf_event		*output;
-	const struct pmu		*pmu;
+	struct pmu			*pmu;
 
 	enum perf_event_active_state	state;
 	atomic64_t			count;
@@ -782,7 +790,7 @@ struct perf_output_handle {
  */
 extern int perf_max_events;
 
-extern const struct pmu *hw_perf_event_init(struct perf_event *event);
+extern struct pmu *hw_perf_event_init(struct perf_event *event);
 
 extern void perf_event_task_sched_in(struct task_struct *task);
 extern void perf_event_task_sched_out(struct task_struct *task,
					struct task_struct *next);
@@ -799,9 +807,6 @@ extern void perf_disable(void);
 extern void perf_enable(void);
 extern int perf_event_task_disable(void);
 extern int perf_event_task_enable(void);
-extern int hw_perf_group_sched_in(struct perf_event *group_leader,
-	       struct perf_cpu_context *cpuctx,
-	       struct perf_event_context *ctx);
 extern void perf_event_update_userpage(struct perf_event *event);
 extern int perf_event_release_kernel(struct perf_event *event);
 extern struct perf_event *
@@ -811,6 +816,7 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr,
				perf_overflow_handler_t callback);
 extern u64 perf_event_read_value(struct perf_event *event,
				 u64 *enabled, u64 *running);
+extern u8 *perf_pmu_flag(struct pmu *pmu);
 
 struct perf_sample_data {
	u64 type;
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 07b7a43..4820090 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -75,7 +75,7 @@ static DEFINE_SPINLOCK(perf_resource_lock);
 /*
  * Architecture provided APIs - weak aliases:
  */
-extern __weak const struct pmu *hw_perf_event_init(struct perf_event *event)
+extern __weak struct pmu *hw_perf_event_init(struct perf_event *event)
 {
	return NULL;
 }
@@ -83,15 +83,14 @@ extern __weak const struct pmu *hw_perf_event_init(struct perf_event *event)
 void __weak hw_perf_disable(void)		{ barrier(); }
 void __weak hw_perf_enable(void)		{ barrier(); }
 
-int __weak
-hw_perf_group_sched_in(struct perf_event *group_leader,
-	       struct perf_cpu_context *cpuctx,
-	       struct perf_event_context *ctx)
+void __weak perf_event_print_debug(void)	{ }
+
+u8 *perf_pmu_flag(struct pmu *pmu)
 {
-	return 0;
-}
+	int cpu = smp_processor_id();
 
-void __weak perf_event_print_debug(void)	{ }
+	return per_cpu_ptr(pmu->flag, cpu);
+}
 
 static DEFINE_PER_CPU(int, perf_disable_count);
 
@@ -642,14 +641,13 @@ group_sched_in(struct perf_event *group_event,
	       struct perf_event_context *ctx)
 {
	struct perf_event *event, *partial_group;
+	struct pmu *pmu = group_event->pmu;
	int ret;
 
	if (group_event->state == PERF_EVENT_STATE_OFF)
		return 0;
 
-	ret = hw_perf_group_sched_in(group_event, cpuctx, ctx);
-	if (ret)
-		return ret < 0 ? ret : 0;
+	pmu->start_txn(pmu);
 
	if (event_sched_in(group_event, cpuctx, ctx))
		return -EAGAIN;
@@ -664,16 +662,21 @@ group_sched_in(struct perf_event *group_event,
		}
	}
 
-	return 0;
+	ret = pmu->commit_txn(pmu);
+	if (!ret) {
+		pmu->stop_txn(pmu);
+		return 0;
+	}
 
 group_error:
+	pmu->stop_txn(pmu);
+
	/*
-	 * Groups can be scheduled in as one unit only, so undo any
-	 * partial group before returning:
+	 * Commit transaction fails, rollback
+	 * Groups can be scheduled in as one unit only, so undo
+	 * whole group before returning:
	 */
	list_for_each_entry(event, &group_event->sibling_list, group_entry) {
-		if (event == partial_group)
-			break;
		event_sched_out(event, cpuctx, ctx);
	}
	event_sched_out(group_event, cpuctx, ctx);
@@ -4139,7 +4142,7 @@ static void perf_swevent_disable(struct perf_event *event)
	hlist_del_rcu(&event->hlist_entry);
 }
 
-static const struct pmu perf_ops_generic = {
+static struct pmu perf_ops_generic = {
	.enable		= perf_swevent_enable,
	.disable	= perf_swevent_disable,
	.read		= perf_swevent_read,
@@ -4250,7 +4253,7 @@ static void cpu_clock_perf_event_read(struct perf_event *event)
	cpu_clock_perf_event_update(event);
 }
 
-static const struct pmu perf_ops_cpu_clock = {
+static struct pmu perf_ops_cpu_clock = {
	.enable		= cpu_clock_perf_event_enable,
	.disable	= cpu_clock_perf_event_disable,
	.read		= cpu_clock_perf_event_read,
@@ -4307,7 +4310,7 @@ static void task_clock_perf_event_read(struct perf_event *event)
	task_clock_perf_event_update(event, time);
 }
 
-static const struct pmu perf_ops_task_clock = {
+static struct pmu perf_ops_task_clock = {
	.enable		= task_clock_perf_event_enable,
	.disable	= task_clock_perf_event_disable,
	.read		= task_clock_perf_event_read,
@@ -4448,7 +4451,7 @@ static void tp_perf_event_destroy(struct perf_event *event)
	swevent_hlist_put(event);
 }
 
-static const struct pmu *tp_perf_event_init(struct perf_event *event)
+static struct pmu *tp_perf_event_init(struct perf_event *event)
 {
	int err;
 
@@ -4505,7 +4508,7 @@ static int perf_tp_event_match(struct perf_event *event,
	return 1;
 }
 
-static const struct pmu *tp_perf_event_init(struct perf_event *event)
+static struct pmu *tp_perf_event_init(struct perf_event *event)
 {
	return NULL;
 }
@@ -4527,7 +4530,7 @@ static void bp_perf_event_destroy(struct perf_event *event)
	release_bp_slot(event);
 }
 
-static const struct pmu *bp_perf_event_init(struct perf_event *bp)
+static struct pmu *bp_perf_event_init(struct perf_event *bp)
 {
	int err;
 
@@ -4551,7 +4554,7 @@ void perf_bp_event(struct perf_event *bp, void *data)
	perf_swevent_add(bp, 1, 1, &sample, regs);
 }
 #else
-static const struct pmu *bp_perf_event_init(struct perf_event *bp)
+static struct pmu *bp_perf_event_init(struct perf_event *bp)
 {
	return NULL;
 }
@@ -4573,9 +4576,9 @@ static void sw_perf_event_destroy(struct perf_event *event)
	swevent_hlist_put(event);
 }
 
-static const struct pmu *sw_perf_event_init(struct perf_event *event)
+static struct pmu *sw_perf_event_init(struct perf_event *event)
 {
-	const struct pmu *pmu = NULL;
+	struct pmu *pmu = NULL;
	u64 event_id = event->attr.config;
 
	/*
@@ -4637,7 +4640,7 @@ perf_event_alloc(struct perf_event_attr *attr,
		   perf_overflow_handler_t overflow_handler,
		   gfp_t gfpflags)
 {
-	const struct pmu *pmu;
+	struct pmu *pmu;
	struct perf_event *event;
	struct hw_perf_event *hwc;
	long err;

> > You are right, pmu->flag should be per cpu data.
> >
> > Will update the patch.
> >
> > Thanks,
> > Lin Ming

^ permalink raw reply related	[flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-04-21  9:42           ` Lin Ming
@ 2010-04-21  9:57           ` Peter Zijlstra
  2010-04-21 22:12             ` Lin Ming
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2010-04-21  9:57 UTC (permalink / raw)
To: Lin Ming; +Cc: eranian@gmail.com, Gary.Mohr@Bull.com, Corey Ashford, LKML

On Wed, 2010-04-21 at 17:42 +0800, Lin Ming wrote:
> +	/*
> +	 * TBD: will move this to pmu register function
> +	 * when multiple hw pmu support is added
> +	 */
> +	pmu.flag = alloc_percpu(u8);
> +	if (!pmu.flag)
> +		return;

Why not simply use a field in struct cpu_hw_events?

That's where we track all per-cpu pmu state.

^ permalink raw reply	[flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-04-21  9:57           ` Peter Zijlstra
@ 2010-04-21 22:12           ` Lin Ming
  2010-04-21 14:22             ` Peter Zijlstra
  0 siblings, 1 reply; 55+ messages in thread
From: Lin Ming @ 2010-04-21 22:12 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: eranian@gmail.com, Gary.Mohr@Bull.com, Corey Ashford, LKML

On Wed, 2010-04-21 at 17:57 +0800, Peter Zijlstra wrote:
> On Wed, 2010-04-21 at 17:42 +0800, Lin Ming wrote:
> > +	/*
> > +	 * TBD: will move this to pmu register function
> > +	 * when multiple hw pmu support is added
> > +	 */
> > +	pmu.flag = alloc_percpu(u8);
> > +	if (!pmu.flag)
> > +		return;
>
> Why not simply use a field in struct cpu_hw_events?
>
> That's where we track all per-cpu pmu state.

That's a good idea! Thanks for the review.

Change log:

v3. track per-cpu pmu transaction state in cpu_hw_events (Peter Zijlstra)
    move pmu definition back to const ("struct pmu" -> "const struct pmu")

v2. pmu->flag should be per cpu (Stephane Eranian)
    change definition of "const struct pmu" to "struct pmu" since it's
    not read-only now.

v1. remove hw_perf_group_sched_in() based on Peter's idea.
---
 arch/x86/kernel/cpu/perf_event.c |  167 ++++++++++++-------------------------
 include/linux/perf_event.h       |    8 +-
 kernel/perf_event.c              |   23 +++---
 3 files changed, 69 insertions(+), 129 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 626154a..4a1dc62 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -110,6 +110,8 @@ struct cpu_hw_events {
	u64			tags[X86_PMC_IDX_MAX];
	struct perf_event	*event_list[X86_PMC_IDX_MAX]; /* in enabled order */
 
+	u8			flag;
+
	/*
	 * Intel DebugStore bits
	 */
@@ -944,6 +946,9 @@ static int x86_pmu_enable(struct perf_event *event)
	if (n < 0)
		return n;
 
+	if (cpuc->flag & PERF_EVENT_TRAN_STARTED)
+		goto out;
+
	ret = x86_pmu.schedule_events(cpuc, n, assign);
	if (ret)
		return ret;
@@ -953,6 +958,7 @@ static int x86_pmu_enable(struct perf_event *event)
	 */
	memcpy(cpuc->assign, assign, n*sizeof(int));
 
+out:
	cpuc->n_events = n;
	cpuc->n_added += n - n0;
 
@@ -1210,119 +1216,6 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
	return &unconstrained;
 }
 
-static int x86_event_sched_in(struct perf_event *event,
-			  struct perf_cpu_context *cpuctx)
-{
-	int ret = 0;
-
-	event->state = PERF_EVENT_STATE_ACTIVE;
-	event->oncpu = smp_processor_id();
-	event->tstamp_running += event->ctx->time - event->tstamp_stopped;
-
-	if (!is_x86_event(event))
-		ret = event->pmu->enable(event);
-
-	if (!ret && !is_software_event(event))
-		cpuctx->active_oncpu++;
-
-	if (!ret && event->attr.exclusive)
-		cpuctx->exclusive = 1;
-
-	return ret;
-}
-
-static void x86_event_sched_out(struct perf_event *event,
-			    struct perf_cpu_context *cpuctx)
-{
-	event->state = PERF_EVENT_STATE_INACTIVE;
-	event->oncpu = -1;
-
-	if (!is_x86_event(event))
-		event->pmu->disable(event);
-
-	event->tstamp_running -= event->ctx->time - event->tstamp_stopped;
-
-	if (!is_software_event(event))
-		cpuctx->active_oncpu--;
-
-	if (event->attr.exclusive || !cpuctx->active_oncpu)
-		cpuctx->exclusive = 0;
-}
-
-/*
- * Called to enable a whole group of events.
- * Returns 1 if the group was enabled, or -EAGAIN if it could not be.
- * Assumes the caller has disabled interrupts and has
- * frozen the PMU with hw_perf_save_disable.
- *
- * called with PMU disabled. If successful and return value 1,
- * then guaranteed to call perf_enable() and hw_perf_enable()
- */
-int hw_perf_group_sched_in(struct perf_event *leader,
-	       struct perf_cpu_context *cpuctx,
-	       struct perf_event_context *ctx)
-{
-	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-	struct perf_event *sub;
-	int assign[X86_PMC_IDX_MAX];
-	int n0, n1, ret;
-
-	if (!x86_pmu_initialized())
-		return 0;
-
-	/* n0 = total number of events */
-	n0 = collect_events(cpuc, leader, true);
-	if (n0 < 0)
-		return n0;
-
-	ret = x86_pmu.schedule_events(cpuc, n0, assign);
-	if (ret)
-		return ret;
-
-	ret = x86_event_sched_in(leader, cpuctx);
-	if (ret)
-		return ret;
-
-	n1 = 1;
-	list_for_each_entry(sub, &leader->sibling_list, group_entry) {
-		if (sub->state > PERF_EVENT_STATE_OFF) {
-			ret = x86_event_sched_in(sub, cpuctx);
-			if (ret)
-				goto undo;
-			++n1;
-		}
-	}
-	/*
-	 * copy new assignment, now we know it is possible
-	 * will be used by hw_perf_enable()
-	 */
-	memcpy(cpuc->assign, assign, n0*sizeof(int));
-
-	cpuc->n_events = n0;
-	cpuc->n_added += n1;
-	ctx->nr_active += n1;
-
-	/*
-	 * 1 means successful and events are active
-	 * This is not quite true because we defer
-	 * actual activation until hw_perf_enable() but
-	 * this way we ensure caller won't try to enable
-	 * individual events
-	 */
-	return 1;
-undo:
-	x86_event_sched_out(leader, cpuctx);
-	n0 = 1;
-	list_for_each_entry(sub, &leader->sibling_list, group_entry) {
-		if (sub->state == PERF_EVENT_STATE_ACTIVE) {
-			x86_event_sched_out(sub, cpuctx);
-			if (++n0 == n1)
-				break;
-		}
-	}
-	return ret;
-}
-
 #include "perf_event_amd.c"
 #include "perf_event_p6.c"
 #include "perf_event_p4.c"
@@ -1454,6 +1347,51 @@ static inline void x86_pmu_read(struct perf_event *event)
	x86_perf_event_update(event);
 }
 
+/*
+ * Set the flag to make pmu::enable() not perform the
+ * schedulability test.
+ */
+static void x86_pmu_start_txn(const struct pmu *pmu)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+	cpuc->flag |= PERF_EVENT_TRAN_STARTED;
+}
+
+static void x86_pmu_stop_txn(const struct pmu *pmu)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+	cpuc->flag &= ~PERF_EVENT_TRAN_STARTED;
+}
+
+/*
+ * Return 0 if commit transaction success
+ */
+static int x86_pmu_commit_txn(const struct pmu *pmu)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	int assign[X86_PMC_IDX_MAX];
+	int n, ret;
+
+	n = cpuc->n_events;
+
+	if (!x86_pmu_initialized())
+		return -EAGAIN;
+
+	ret = x86_pmu.schedule_events(cpuc, n, assign);
+	if (ret)
+		return ret;
+
+	/*
+	 * copy new assignment, now we know it is possible
+	 * will be used by hw_perf_enable()
+	 */
+	memcpy(cpuc->assign, assign, n*sizeof(int));
+
+	return 0;
+}
+
 static const struct pmu pmu = {
	.enable		= x86_pmu_enable,
	.disable	= x86_pmu_disable,
@@ -1461,6 +1399,9 @@ static const struct pmu pmu = {
	.stop		= x86_pmu_stop,
	.read		= x86_pmu_read,
	.unthrottle	= x86_pmu_unthrottle,
+	.start_txn	= x86_pmu_start_txn,
+	.stop_txn	= x86_pmu_stop_txn,
+	.commit_txn	= x86_pmu_commit_txn,
 };
 
 /*
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index bf896d0..862b965 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -524,6 +524,8 @@ struct hw_perf_event {
 
 struct perf_event;
 
+#define PERF_EVENT_TRAN_STARTED 1
+
 /**
  * struct pmu - generic performance monitoring unit
  */
@@ -534,6 +536,9 @@ struct pmu {
	void (*stop)			(struct perf_event *event);
	void (*read)			(struct perf_event *event);
	void (*unthrottle)		(struct perf_event *event);
+	void (*start_txn)		(const struct pmu *pmu);
+	void (*stop_txn)		(const struct pmu *pmu);
+	int  (*commit_txn)		(const struct pmu *pmu);
 };
 
 /**
@@ -799,9 +804,6 @@ extern void perf_disable(void);
 extern void perf_enable(void);
 extern int perf_event_task_disable(void);
 extern int perf_event_task_enable(void);
-extern int hw_perf_group_sched_in(struct perf_event *group_leader,
-	       struct perf_cpu_context *cpuctx,
-	       struct perf_event_context *ctx);
 extern void perf_event_update_userpage(struct perf_event *event);
 extern int perf_event_release_kernel(struct perf_event *event);
 extern struct perf_event *
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 07b7a43..1503174 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -83,14 +83,6 @@ extern __weak const struct pmu *hw_perf_event_init(struct perf_event *event)
 void __weak hw_perf_disable(void)		{ barrier(); }
 void __weak hw_perf_enable(void)		{ barrier(); }
 
-int __weak
-hw_perf_group_sched_in(struct perf_event *group_leader,
-	       struct perf_cpu_context *cpuctx,
-	       struct perf_event_context *ctx)
-{
-	return 0;
-}
-
 void __weak perf_event_print_debug(void)	{ }
 
 static DEFINE_PER_CPU(int, perf_disable_count);
 
@@ -641,15 +633,14 @@ group_sched_in(struct perf_event *group_event,
	       struct perf_cpu_context *cpuctx,
	       struct perf_event_context *ctx)
 {
-	struct perf_event *event, *partial_group;
+	struct perf_event *event, *partial_group = NULL;
+	const struct pmu *pmu = group_event->pmu;
	int ret;
 
	if (group_event->state == PERF_EVENT_STATE_OFF)
		return 0;
 
-	ret = hw_perf_group_sched_in(group_event, cpuctx, ctx);
-	if (ret)
-		return ret < 0 ? ret : 0;
+	pmu->start_txn(pmu);
 
	if (event_sched_in(group_event, cpuctx, ctx))
		return -EAGAIN;
@@ -664,9 +655,15 @@ group_sched_in(struct perf_event *group_event,
		}
	}
 
-	return 0;
+	ret = pmu->commit_txn(pmu);
+	if (!ret) {
+		pmu->stop_txn(pmu);
+		return 0;
+	}
 
 group_error:
+	pmu->stop_txn(pmu);
+
	/*
	 * Groups can be scheduled in as one unit only, so undo any
	 * partial group before returning:

^ permalink raw reply related	[flat|nested] 55+ messages in thread
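To make the control flow of the proposed transaction interface easier to follow, here is a minimal userspace model of the all-or-nothing group scheduling above. Everything in it (`mock_pmu`, the counter-capacity check, the helper names) is invented for illustration; only the start_txn/commit_txn/stop_txn shape mirrors the patch, and the kernel code naturally does far more.

```c
#include <assert.h>

#define TRAN_STARTED 1

/* Illustrative stand-in for the per-cpu PMU state in the patch. */
struct mock_pmu {
	unsigned char flag;   /* models cpu_hw_events::flag */
	int n_events;         /* events currently collected */
	int n_counters;       /* models the hardware counter limit */
};

static void start_txn(struct mock_pmu *pmu) { pmu->flag |= TRAN_STARTED; }
static void stop_txn(struct mock_pmu *pmu)  { pmu->flag &= ~TRAN_STARTED; }

/* models x86_pmu.schedule_events(): the single schedulability test,
 * done once for the whole group at commit time */
static int commit_txn(struct mock_pmu *pmu)
{
	return pmu->n_events <= pmu->n_counters ? 0 : -1;
}

/* models pmu->enable() inside a transaction: with TRAN_STARTED set,
 * no per-event schedulability test is done, the event is just added */
static void enable_event(struct mock_pmu *pmu)
{
	pmu->n_events++;
}

/* models group_sched_in(): schedule a whole group or none of it */
static int group_sched_in(struct mock_pmu *pmu, int group_size)
{
	int saved = pmu->n_events;
	int i;

	start_txn(pmu);
	for (i = 0; i < group_size; i++)
		enable_event(pmu);

	if (commit_txn(pmu) == 0) {
		stop_txn(pmu);          /* success: clear the txn state */
		return 0;
	}

	stop_txn(pmu);                  /* rollback must also clear it */
	pmu->n_events = saved;          /* undo the whole group */
	return -1;
}
```

Note how `stop_txn()` appears on both paths; that is exactly the point Peter and Lin Ming debate further down: the rollback path needs to clear the state too, so the clearing cannot simply be folded into `commit_txn()`.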
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-04-21 22:12           ` Lin Ming
@ 2010-04-21 14:22           ` Peter Zijlstra
  2010-04-21 22:38             ` Lin Ming
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2010-04-21 14:22 UTC (permalink / raw)
To: Lin Ming; +Cc: eranian@gmail.com, Gary.Mohr@Bull.com, Corey Ashford, LKML

On Wed, 2010-04-21 at 22:12 +0000, Lin Ming wrote:
> +	ret = pmu->commit_txn(pmu);
> +	if (!ret) {
> +		pmu->stop_txn(pmu);
> +		return 0;
> +	}
>
>  group_error:
> +	pmu->stop_txn(pmu);

If you let commit_txn() also clear the state you can save some logic and
a method.

But yes, this looks good. If you don't remove the weak interface just
yet, you can do a patch per architecture that uses this (at least
powerpc and sparc do), and remove the weak thing at the end once all
users are gone.

^ permalink raw reply	[flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-04-21 14:22           ` Peter Zijlstra
@ 2010-04-21 22:38           ` Lin Ming
  2010-04-21 14:53             ` Peter Zijlstra
  0 siblings, 1 reply; 55+ messages in thread
From: Lin Ming @ 2010-04-21 22:38 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: eranian@gmail.com, Gary.Mohr@Bull.com, Corey Ashford, LKML

On Wed, 2010-04-21 at 22:22 +0800, Peter Zijlstra wrote:
> On Wed, 2010-04-21 at 22:12 +0000, Lin Ming wrote:
> > +	ret = pmu->commit_txn(pmu);
> > +	if (!ret) {
> > +		pmu->stop_txn(pmu);
> > +		return 0;
> > +	}
> >
> >  group_error:
> > +	pmu->stop_txn(pmu);
>
> If you let commit_txn() also clear the state you can save some logic and
> a method.

I add this ->stop_txn(pmu) because the rollback code also needs to clear
the state.

> But yes, this looks good. If you don't remove the weak interface just
> yet, you can do a patch per architecture that uses this (at least
> powerpc and sparc do), and remove the weak thing at the end once all
> users are gone.

OK, I'll do that for powerpc and sparc. I also need to check whether any
other architectures implement the transaction methods, and guard the
calls in group_sched_in accordingly, as below.

group_sched_in(...)
{
	...
	if (pmu->start_txn)
		pmu->start_txn(pmu);
	...
	if (pmu->commit_txn)
		pmu->commit_txn(pmu);
	...
}

Thanks,
Lin Ming

^ permalink raw reply	[flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-04-21 22:38           ` Lin Ming
@ 2010-04-21 14:53           ` Peter Zijlstra
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2010-04-21 14:53 UTC (permalink / raw)
To: Lin Ming; +Cc: eranian@gmail.com, Gary.Mohr@Bull.com, Corey Ashford, LKML

On Wed, 2010-04-21 at 22:38 +0000, Lin Ming wrote:
>
> I add this ->stop_txn(pmu) because the rollback code also needs to clear
> the state.

Ah, yes, because strictly speaking ->enable() could still fail for
another reason. OK.

^ permalink raw reply	[flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-03-30 16:49     ` Corey Ashford
  2010-03-30 17:15       ` Peter Zijlstra
@ 2010-03-30 21:28       ` stephane eranian
  2010-03-30 23:11         ` Corey Ashford
  1 sibling, 1 reply; 55+ messages in thread
From: stephane eranian @ 2010-03-30 21:28 UTC (permalink / raw)
To: Corey Ashford
Cc: Lin Ming, Peter Zijlstra, Ingo Molnar, LKML, Andi Kleen,
	Paul Mackerras, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra,
	Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt,
	Arnaldo Carvalho de Melo, Masami Hiramatsu

On Tue, Mar 30, 2010 at 6:49 PM, Corey Ashford
<cjashfor@linux.vnet.ibm.com> wrote:
> On 03/30/2010 12:42 AM, Lin Ming wrote:
>>
>> Hi, Corey
>>
>> How is this going now? Are you still working on this?
>> I'd like to help to add support for uncore, test, write code or anything
>> else.
>>
>> Thanks,
>> Lin Ming
>
> I haven't been actively working on adding infrastructure for nest PMUs yet.
> At the moment we are working on supporting nest events for IBM's Wire-Speed
> processor, using the current infrastructure, because of time limitations.
> Using the existing infrastructure is definitely not ideal, but for this
> processor, it's workable.
>
> There are still a lot of issues to solve for adding this infrastructure:
>
> 1) Does perf_events need a new context type (in addition to per-task and
> per-cpu)? This is partly because we don't want to be mixing the rotation of
> CPU events with nest events. Each PMU really ought to have its own event
> list.
>
I concur that you don't want to mix events from different PMUs in the same
rotation list. There are some side effects of not doing this with AMD64
Northbridge events when you have multiple concurrent sessions. But that is
a special case, where the "uncore" (nest) PMU is actually controlled via
the core PMU. I think that is okay for now, but you certainly don't want
that for the Nehalem uncore, for instance.

> 2) How do we deal with accessing PMUs which require slow access methods
> (e.g. an internal serial bus)? The accesses may need to be placed on worker
> threads so that they don't affect the performance of context switches and
> system ticks.
>
> 3) How exactly do we represent the PMUs in the pseudo-fs (/sys or /proc)?
> And how exactly does the user specify the PMU to perf_events?
> Peter Zijlstra and Stephane Eranian both recommended opening the PMU with
> open() and then passing the resulting fd in through the perf_event_attr
> struct.
>
> 4) How do we choose a CPU to do the housekeeping work for a particular nest
> PMU? Peter thought that user space should still specify it via the
> perf_event_open() cpu parameter, but there's also an argument to be made
> for the kernel choosing the best CPU to handle the job, or at least making
> it optional for the user to choose the CPU.
>
One of the housekeeping tasks is to handle uncore PMU interrupts, for
instance. That is not a trivial task, given that events are managed
independently and that you could be monitoring per-thread or system-wide.
It may be that some uncore PMUs can only interrupt one core; the Intel
Nehalem uncore can interrupt many at once.

^ permalink raw reply	[flat|nested] 55+ messages in thread
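Point 2) above is essentially a deferred-work problem: keep the slow bus transaction off the hot path and let the requester wait for a completion. The kernel version would use a workqueue; here is a hedged userspace sketch of the same idea using a POSIX thread. `slow_pmu_read()`, `read_uncore_counter()`, and the fixed return value are inventions for this illustration, not anything from the proposed infrastructure.

```c
#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

/* Completion object the requester waits on while the worker
 * performs the slow access. */
struct read_req {
	pthread_mutex_t lock;
	pthread_cond_t  done_cv;
	int             done;
	uint64_t        value;   /* result of the slow read */
};

/* Stands in for a PMU register access over a slow internal serial bus. */
static uint64_t slow_pmu_read(void)
{
	usleep(1000);            /* pretend the bus transaction takes a while */
	return 42;               /* illustrative fixed counter value */
}

/* Worker context: do the slow part, then signal completion. */
static void *worker(void *arg)
{
	struct read_req *req = arg;
	uint64_t v = slow_pmu_read();

	pthread_mutex_lock(&req->lock);
	req->value = v;
	req->done = 1;
	pthread_cond_signal(&req->done_cv);
	pthread_mutex_unlock(&req->lock);
	return NULL;
}

/* Issue the read and block until the worker completes it; the point
 * is that a context switch or tick handler would instead just queue
 * the work and pick the value up later. */
static uint64_t read_uncore_counter(void)
{
	struct read_req req = {
		.lock    = PTHREAD_MUTEX_INITIALIZER,
		.done_cv = PTHREAD_COND_INITIALIZER,
	};
	pthread_t tid;

	pthread_create(&tid, NULL, worker, &req);

	pthread_mutex_lock(&req.lock);
	while (!req.done)
		pthread_cond_wait(&req.done_cv, &req.lock);
	pthread_mutex_unlock(&req.lock);

	pthread_join(tid, NULL);
	return req.value;
}
```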
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-03-30 21:28       ` stephane eranian
@ 2010-03-30 23:11       ` Corey Ashford
  2010-03-31 13:43         ` stephane eranian
  0 siblings, 1 reply; 55+ messages in thread
From: Corey Ashford @ 2010-03-30 23:11 UTC (permalink / raw)
To: eranian
Cc: stephane eranian, Lin Ming, Peter Zijlstra, Ingo Molnar, LKML,
	Andi Kleen, Paul Mackerras, Frederic Weisbecker, Xiao Guangrong,
	Dan Terpstra, Philip Mucci, Maynard Johnson, Carl Love,
	Steven Rostedt, Arnaldo Carvalho de Melo, Masami Hiramatsu

On 3/30/2010 2:28 PM, stephane eranian wrote:
> On Tue, Mar 30, 2010 at 6:49 PM, Corey Ashford
> <cjashfor@linux.vnet.ibm.com> wrote:
>>
>> 4) How do we choose a CPU to do the housekeeping work for a particular
>> nest PMU? Peter thought that user space should still specify it via the
>> perf_event_open() cpu parameter, but there's also an argument to be made
>> for the kernel choosing the best CPU to handle the job, or at least
>> making it optional for the user to choose the CPU.
>>
> One of the housekeeping tasks is to handle uncore PMU interrupts, for
> instance. That is not a trivial task, given that events are managed
> independently and that you could be monitoring per-thread or system-wide.
> It may be that some uncore PMUs can only interrupt one core; the Intel
> Nehalem uncore can interrupt many at once.

That's a good point, and I think it's unreasonable to expect the user to
know exactly how the interrupts are routed from the uncore/nest PMU to
the CPU(s).

Perhaps one way around this would be to return an error if the chosen CPU
isn't fully capable of performing the housekeeping functions for the
requested PMU. But this certainly isn't ideal, because relying on this
mechanism would require that the user (or user tool) figure out which CPU
is fully capable by trial and error.

- Corey

^ permalink raw reply	[flat|nested] 55+ messages in thread
* Re: [RFC] perf_events: support for uncore a.k.a. nest units
  2010-03-30 23:11       ` Corey Ashford
@ 2010-03-31 13:43       ` stephane eranian
  0 siblings, 0 replies; 55+ messages in thread
From: stephane eranian @ 2010-03-31 13:43 UTC (permalink / raw)
To: Corey Ashford
Cc: Lin Ming, Peter Zijlstra, Ingo Molnar, LKML, Andi Kleen,
	Paul Mackerras, Frederic Weisbecker, Xiao Guangrong, Dan Terpstra,
	Philip Mucci, Maynard Johnson, Carl Love, Steven Rostedt,
	Arnaldo Carvalho de Melo, Masami Hiramatsu

On Wed, Mar 31, 2010 at 1:11 AM, Corey Ashford
<cjashfor@linux.vnet.ibm.com> wrote:
> On 3/30/2010 2:28 PM, stephane eranian wrote:
>>
>> One of the housekeeping tasks is to handle uncore PMU interrupts, for
>> instance. That is not a trivial task, given that events are managed
>> independently and that you could be monitoring per-thread or
>> system-wide. It may be that some uncore PMUs can only interrupt one
>> core; the Intel Nehalem uncore can interrupt many at once.
>
> That's a good point, and I think it's unreasonable to expect the user to
> know exactly how the interrupts are routed from the uncore/nest PMU to
> the CPU(s).
>
> Perhaps one way around this would be to return an error if the chosen CPU
> isn't fully capable of performing the housekeeping functions for the
> requested PMU. But this certainly isn't ideal, because relying on this
> mechanism would require that the user (or user tool) figure out which CPU
> is fully capable by trial and error.
>
I think users should not have to worry about any of this.

It is also fine to restrict uncore monitoring to system-wide mode, i.e.,
not per-thread. But I have also seen people request just that, hoping they
could draw correlations between core and uncore events in one run. I think
this is less critical, though.

In the specific case of a CPU uncore (e.g., the Nehalem uncore), you could
simply identify the uncore PMU with the CPU. For anything else, you would
definitely need the file descriptor. The more uniform approach would have
been to use the file descriptor all along, except for per-thread. But
that's fine, I think.

^ permalink raw reply	[flat|nested] 55+ messages in thread
Thread overview: 55+ messages
2010-01-19 19:41 [RFC] perf_events: support for uncore a.k.a. nest units Corey Ashford
2010-01-20 0:44 ` Andi Kleen
2010-01-20 1:49 ` Corey Ashford
2010-01-20 9:35 ` Andi Kleen
2010-01-20 19:28 ` Corey Ashford
2010-01-20 13:34 ` Peter Zijlstra
2010-01-20 21:33 ` Peter Zijlstra
2010-01-20 23:23 ` Corey Ashford
2010-01-21 7:21 ` Ingo Molnar
2010-01-21 19:13 ` Corey Ashford
2010-01-21 19:28 ` Corey Ashford
2010-01-27 10:28 ` Ingo Molnar
2010-01-27 19:50 ` Corey Ashford
2010-01-28 10:57 ` Peter Zijlstra
2010-01-28 18:00 ` Corey Ashford
2010-01-28 19:06 ` Peter Zijlstra
2010-01-28 19:44 ` Corey Ashford
2010-01-28 22:08 ` Corey Ashford
2010-01-29 9:52 ` Peter Zijlstra
2010-01-29 23:05 ` Corey Ashford
2010-01-30 8:42 ` Peter Zijlstra
2010-02-01 19:39 ` Corey Ashford
2010-02-01 19:54 ` Peter Zijlstra
2010-01-21 8:36 ` Peter Zijlstra
2010-01-21 8:47 ` stephane eranian
2010-01-21 8:59 ` Peter Zijlstra
2010-01-21 9:16 ` stephane eranian
2010-01-21 9:43 ` stephane eranian
[not found] ` <d3f22a1003290213x7d7904an59d50eb6a8616133@mail.gmail.com>
2010-03-30 7:42 ` Lin Ming
2010-03-30 16:49 ` Corey Ashford
2010-03-30 17:15 ` Peter Zijlstra
2010-03-30 22:12 ` Corey Ashford
2010-03-31 14:01 ` Peter Zijlstra
2010-03-31 14:13 ` stephane eranian
2010-03-31 15:49 ` Maynard Johnson
2010-03-31 17:50 ` Corey Ashford
2010-04-15 21:16 ` Gary.Mohr
2010-04-16 13:24 ` Peter Zijlstra
2010-04-19 9:08 ` Lin Ming
2010-04-19 9:27 ` Peter Zijlstra
2010-04-20 11:55 ` Lin Ming
2010-04-20 12:03 ` Peter Zijlstra
2010-04-21 8:08 ` Lin Ming
2010-04-21 8:32 ` stephane eranian
2010-04-21 8:39 ` Lin Ming
2010-04-21 8:44 ` stephane eranian
2010-04-21 9:42 ` Lin Ming
2010-04-21 9:57 ` Peter Zijlstra
2010-04-21 22:12 ` Lin Ming
2010-04-21 14:22 ` Peter Zijlstra
2010-04-21 22:38 ` Lin Ming
2010-04-21 14:53 ` Peter Zijlstra
2010-03-30 21:28 ` stephane eranian
2010-03-30 23:11 ` Corey Ashford
2010-03-31 13:43 ` stephane eranian