* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-10 11:27 ` Peter Zijlstra
@ 2010-05-10 11:36 ` Peter Zijlstra
2010-05-10 11:48 ` Ingo Molnar
2010-05-10 11:39 ` Russell King
` (4 subsequent siblings)
5 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2010-05-10 11:36 UTC (permalink / raw)
To: Lin Ming
Cc: Ingo Molnar, Frederic Weisbecker, eranian@gmail.com,
Gary.Mohr@Bull.com, Corey Ashford, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Russell King,
Paul Mundt, lkml
On Mon, 2010-05-10 at 13:27 +0200, Peter Zijlstra wrote:
> On Mon, 2010-05-10 at 18:26 +0800, Lin Ming wrote:
>
> > > No, I'm assuming there is only 1 PMU per CPU. Corey is the expert on
> > > crazy hardware though, but I think the sanest way is to extend the CPU
> > > topology if there's more structure to it.
> >
> > But our goal is to support multiple pmus, don't we need to assume there
> > are more than 1 PMU per CPU?
>
> No, because as I said, then it's ambiguous what pmu you want. If you have
> that, you need to extend your topology information.
>
> Anyway, I talked with Ingo on this and he'd like to see this somewhat
> extended.
>
> Instead of a pmu_id field, which we pass into a new
> perf_event_attr::pmu_id field, how about creating an event_source sysfs
> class. Then each class can have an event_source_id and a hierarchy of
> 'generic' events.
>
> We'd start using the PERF_TYPE_ space for this and express the
> PERF_COUNT_ space in the event attributes found inside that class.
>
> That way we can include all the existing event enumerations into this as
> well.
>
> This way we can create:
>
> /sys/devices/system/cpu/cpuN/cpu_hardware_events
>                              cpu_hardware_events/event_source_id
>                              cpu_hardware_events/cpu_cycles
>                              cpu_hardware_events/instructions
>                                                 /...
>
> /sys/devices/system/cpu/cpuN/cpu_raw_events
>                              cpu_raw_events/event_source_id
>
>
> These would match the current PERF_TYPE_* values for compatibility
>
> For new PMUs we can start a dynamic range of PERF_TYPE_ (say at 64k but
> that's not ABI and can be changed at any time, we've got u32 to play
> with).
>
> For uncore this would result in:
>
> /sys/devices/system/node/nodeN/node_raw_events
>                                node_raw_events/event_source_id
>
> and maybe:
>
> /sys/devices/system/node/nodeN/node_events
>                                node_events/event_source_id
>                                node_events/local_misses
>                                           /local_hits
>                                           /remote_misses
>                                           /remote_hits
>                                           /...
>
>
> The software events and tracepoints and kprobes stuff we could hang off
> of /sys/kernel/ or something

The GPU folks would hang it off of the drm class or maybe next to it in
the PCI space.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-10 11:36 ` Peter Zijlstra
@ 2010-05-10 11:48 ` Ingo Molnar
0 siblings, 0 replies; 51+ messages in thread
From: Ingo Molnar @ 2010-05-10 11:48 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Lin Ming, Frederic Weisbecker, eranian@gmail.com,
Gary.Mohr@Bull.com, Corey Ashford, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Russell King,
Paul Mundt, lkml, Arnaldo Carvalho de Melo, Arjan van de Ven
* Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2010-05-10 at 13:27 +0200, Peter Zijlstra wrote:
> > On Mon, 2010-05-10 at 18:26 +0800, Lin Ming wrote:
> >
> > > > No, I'm assuming there is only 1 PMU per CPU. Corey is the expert on
> > > > crazy hardware though, but I think the sanest way is to extend the CPU
> > > > topology if there's more structure to it.
> > >
> > > But our goal is to support multiple pmus, don't we need to assume there
> > > are more than 1 PMU per CPU?
> >
> > No, because as I said, then it's ambiguous what pmu you want. If you have
> > that, you need to extend your topology information.
> >
> > Anyway, I talked with Ingo on this and he'd like to see this somewhat
> > extended.
> >
> > Instead of a pmu_id field, which we pass into a new
> > perf_event_attr::pmu_id field, how about creating an event_source sysfs
> > class. Then each class can have an event_source_id and a hierarchy of
> > 'generic' events.
> >
> > We'd start using the PERF_TYPE_ space for this and express the
> > PERF_COUNT_ space in the event attributes found inside that class.
> >
> > That way we can include all the existing event enumerations into this as
> > well.
> >
> > This way we can create:
> >
> > /sys/devices/system/cpu/cpuN/cpu_hardware_events
> > cpu_hardware_events/event_source_id
> > cpu_hardware_events/cpu_cycles
> > cpu_hardware_events/instructions
> > /...
> >
> > /sys/devices/system/cpu/cpuN/cpu_raw_events
> > cpu_raw_events/event_source_id
> >
> >
> > These would match the current PERF_TYPE_* values for compatibility
> >
> > For new PMUs we can start a dynamic range of PERF_TYPE_ (say at 64k but
> > that's not ABI and can be changed at any time, we've got u32 to play
> > with).
> >
> > For uncore this would result in:
> >
> > /sys/devices/system/node/nodeN/node_raw_events
> > node_raw_events/event_source_id
> >
> > and maybe:
> >
> > /sys/devices/system/node/nodeN/node_events
> > node_events/event_source_id
> > node_events/local_misses
> > /local_hits
> > /remote_misses
> > /remote_hits
> > /...
> >
> >
> > The software events and tracepoints and kprobes stuff we could hang off
> > of /sys/kernel/ or something
>
> The GPU folks would hang it off of the drm class or maybe next to it in
> the PCI space.

It could conceivably be in multiple places as well - a given event makes sense
to enumerate in multiple places.

( For example an 'interrupt' might show up in a given GPU - but it can also
  show up amongst the IRQ tracepoints - or something like that. )

But by far the most common case would be for an event source to be attached to
one particular place in the sysfs topology.

Note how naturally this scheme extends to all things hardware topology - which
is already enumerated in sysfs. It also extends to all things software events
in a pretty natural way via /sys/kernel/mm/.

Plus we want to move out the /sys/kernel/debug/ hacks for kprobes and
tracepoints into this space as well. (possibly do it with hw-breakpoints as
well by attaching them to the CPU directory - for completeness)

That way /sys/class/event_source/ would provide an enumeration of all events
to 'perf list' and would automatically be usable by all the perf tooling.

Thanks,
Ingo
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-10 11:27 ` Peter Zijlstra
2010-05-10 11:36 ` Peter Zijlstra
@ 2010-05-10 11:39 ` Russell King
2010-05-10 11:42 ` Peter Zijlstra
` (3 subsequent siblings)
5 siblings, 0 replies; 51+ messages in thread
From: Russell King @ 2010-05-10 11:39 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Lin Ming, Ingo Molnar, Frederic Weisbecker, eranian@gmail.com,
Gary.Mohr@Bull.com, Corey Ashford, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Paul Mundt, lkml

We really need to loop the ARM PMU guys in on this topic. ARM PMU is
not something I know very much about, and I really don't understand
why I'm on the CC list and not the ARM PMU folk.

I may be mis-stating a problem, but I believe that we have issues with
exporting event IDs in that they're SoC implementation specific - but
as I say, the ARM PMU guys know the details, not me.

I strongly suggest no further discussion without at least looping Will
Deacon in on this.

--
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-10 11:27 ` Peter Zijlstra
2010-05-10 11:36 ` Peter Zijlstra
2010-05-10 11:39 ` Russell King
@ 2010-05-10 11:42 ` Peter Zijlstra
2010-05-10 20:25 ` Will Deacon
2010-05-10 11:43 ` Ingo Molnar
` (2 subsequent siblings)
5 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2010-05-10 11:42 UTC (permalink / raw)
To: Lin Ming
Cc: Ingo Molnar, Frederic Weisbecker, eranian@gmail.com,
Gary.Mohr@Bull.com, Corey Ashford, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Russell King,
Paul Mundt, lkml, Will Deacon
(Added Will Deacon)
On Mon, 2010-05-10 at 13:27 +0200, Peter Zijlstra wrote:
> On Mon, 2010-05-10 at 18:26 +0800, Lin Ming wrote:
>
> > > No, I'm assuming there is only 1 PMU per CPU. Corey is the expert on
> > > crazy hardware though, but I think the sanest way is to extend the CPU
> > > topology if there's more structure to it.
> >
> > But our goal is to support multiple pmus, don't we need to assume there
> > are more than 1 PMU per CPU?
>
> No, because as I said, then it's ambiguous what pmu you want. If you have
> that, you need to extend your topology information.
>
> Anyway, I talked with Ingo on this and he'd like to see this somewhat
> extended.
>
> Instead of a pmu_id field, which we pass into a new
> perf_event_attr::pmu_id field, how about creating an event_source sysfs
> class. Then each class can have an event_source_id and a hierarchy of
> 'generic' events.
>
> We'd start using the PERF_TYPE_ space for this and express the
> PERF_COUNT_ space in the event attributes found inside that class.
>
> That way we can include all the existing event enumerations into this as
> well.
>
> This way we can create:
>
> /sys/devices/system/cpu/cpuN/cpu_hardware_events
> cpu_hardware_events/event_source_id
> cpu_hardware_events/cpu_cycles
> cpu_hardware_events/instructions
> /...
>
> /sys/devices/system/cpu/cpuN/cpu_raw_events
> cpu_raw_events/event_source_id
>
>
> These would match the current PERF_TYPE_* values for compatibility
>
> For new PMUs we can start a dynamic range of PERF_TYPE_ (say at 64k but
> that's not ABI and can be changed at any time, we've got u32 to play
> with).
>
> For uncore this would result in:
>
> /sys/devices/system/node/nodeN/node_raw_events
> node_raw_events/event_source_id
>
> and maybe:
>
> /sys/devices/system/node/nodeN/node_events
> node_events/event_source_id
> node_events/local_misses
> /local_hits
> /remote_misses
> /remote_hits
> /...
>
>
> The software events and tracepoints and kprobes stuff we could hang off
> of /sys/kernel/ or something
>
> So your registration would indeed look like something:
>
> perf_event_register_pmu(struct pmu *pmu, int type),
>
> where type would normally be -1 (dynamic) but would be PERF_TYPE_ for
> those already laid down in ABI.
>
> This approach will also give us a good overview
> in /sys/class/event_source/, which will be a flat listing of all
> existing event sources.

So Russell reminded me that the ARM people have the problem that
their /proc/cpuinfo isn't specific enough to map to a unique event map.

Whilst extending ARM /proc/cpuinfo seems like a sensible option it will
not cover anything but cpu events.

So in that vein (and to avoid exhaustive in-kernel event lists for no
good reason), it might make sense to also add some event_source
attributes that identify the thing, maybe an event_source_name or
event_source_driver field that would allow unique maps to exhaustive
event lists.
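
The registration rule quoted above (a fixed PERF_TYPE_ value for sources already in the ABI, -1 for a dynamically assigned id) could be sketched roughly as below. This is only a userspace illustration, not code from the patch series: the function name, the constants and the 64k base are assumptions taken from the "say at 64k" remark, and a real kernel implementation would need locking.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical values: the thread only says "say at 64k", and that it is not ABI. */
#define SKETCH_DYNAMIC_TYPE_BASE 65536u
#define SKETCH_TYPE_DYNAMIC      (-1)

/* Next free id in the dynamic range (no locking in this sketch). */
static uint32_t sketch_next_dynamic_type = SKETCH_DYNAMIC_TYPE_BASE;

/*
 * Pick the event_source_id for a newly registered PMU.
 *   type == -1 : allocate the next id from the dynamic range
 *   type >= 0  : keep the fixed value already laid down in the ABI
 * Returns the id, or -1 if the request is invalid or the range is exhausted.
 */
int64_t sketch_assign_event_source_id(int type)
{
    assert(sketch_next_dynamic_type >= SKETCH_DYNAMIC_TYPE_BASE);

    if (type == SKETCH_TYPE_DYNAMIC) {
        if (sketch_next_dynamic_type == UINT32_MAX)
            return -1; /* dynamic range exhausted */
        return (int64_t)sketch_next_dynamic_type++;
    }
    if (type < 0 || (uint32_t)type >= SKETCH_DYNAMIC_TYPE_BASE)
        return -1; /* fixed ids must stay below the dynamic range */
    return type;
}
```

Keeping the fixed ids strictly below the dynamic base means existing PERF_TYPE_* users would see no change, while new PMUs can only be addressed through whatever id is exported in sysfs.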
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-10 11:42 ` Peter Zijlstra
@ 2010-05-10 20:25 ` Will Deacon
2010-05-11 6:34 ` Peter Zijlstra
0 siblings, 1 reply; 51+ messages in thread
From: Will Deacon @ 2010-05-10 20:25 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Lin Ming, Ingo Molnar, Frederic Weisbecker, eranian,
Gary.Mohr@Bull.com, Corey Ashford, arjan, Zhang, Yanmin,
Paul Mackerras, David S. Miller, Russell King, Paul Mundt, lkml
Hi Peter,
Thanks for CC'ing me on this.
On Mon, 2010-05-10 at 12:42 +0100, Peter Zijlstra wrote:
> (Added Will Deacon)
>
<snip>
> > We'd start using the PERF_TYPE_ space for this and express the
> > PERF_COUNT_ space in the event attributes found inside that class.
> >
> > That way we can include all the existing event enumerations into this as
> > well.
> >
> > This way we can create:
> >
> > /sys/devices/system/cpu/cpuN/cpu_hardware_events
> > cpu_hardware_events/event_source_id
> > cpu_hardware_events/cpu_cycles
> > cpu_hardware_events/instructions
> > /...
> >
> > /sys/devices/system/cpu/cpuN/cpu_raw_events
> > cpu_raw_events/event_source_id
> >
> >
> > These would match the current PERF_TYPE_* values for compatibility
> >
> > For new PMUs we can start a dynamic range of PERF_TYPE_ (say at 64k but
> > that's not ABI and can be changed at any time, we've got u32 to play
> > with).
> >
> > For uncore this would result in:
> >
> > /sys/devices/system/node/nodeN/node_raw_events
> > node_raw_events/event_source_id
> >
> > and maybe:
> >
> > /sys/devices/system/node/nodeN/node_events
> > node_events/event_source_id
> > node_events/local_misses
> > /local_hits
> > /remote_misses
> > /remote_hits
> > /...
> >

I like this approach, but it only makes sense for ARM if we describe the
common subset of events [that is, those associated with a PERF_COUNT_]
because, although the ARMv7 architecture does define some common event
numberings between cores, implementors are at liberty to extend the
event space as they wish. These extensions are much better off being
handled in userspace.

Another interesting ARM-ism is the vast potential for uncore event
sources in SoC devices. The `node' terminology seems a bit confusing
here, as there may be counters situated at various points of a bus
hierarchy which monitor various types of transactions for example. I
suppose these could live under /sys/devices/system/bus/... ? I would
expect these kinds of counters to be controlled via raw events because
having a list of discrete events doesn't really make sense [for example,
if I want to count all bursts of a given size, I can encode the burst
size into the event number].

> So Russell reminded me that the ARM people have the problem that
> their /proc/cpuinfo isn't specific enough to map to a unique event map.
> Whilst extending ARM /proc/cpuinfo seems like a sensible option it will
> not cover anything but cpu events.
>
> So in that trend (and to avoid exhaustive in kernel event lists for no
> good reason), it might make sense to also add some event_source
> attributes that identify the thing, maybe a event_source_name or
> event_source_driver field that would allow unique maps to exhaustive
> event lists.
The ARM perf events backend already has an ID field in the pmu struct
which can be used to lookup a string describing what the event source
is. Exporting this via sysfs will make it much easier for userspace [and
is basically how OProfile deals with PMU identification].
Cheers,
Will
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-10 20:25 ` Will Deacon
@ 2010-05-11 6:34 ` Peter Zijlstra
0 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2010-05-11 6:34 UTC (permalink / raw)
To: Will Deacon
Cc: Lin Ming, Ingo Molnar, Frederic Weisbecker, eranian,
Gary.Mohr@Bull.com, Corey Ashford, arjan, Zhang, Yanmin,
Paul Mackerras, David S. Miller, Russell King, Paul Mundt, lkml
On Mon, 2010-05-10 at 21:25 +0100, Will Deacon wrote:
> Another interesting ARM-ism is the vast potential for uncore event
> sources in SoC devices. The `node' terminology seems a bit confusing
> here, as there may be counters situated at various points of a bus
> hierarchy which monitor various types of transactions for example. I
> suppose these could live under /sys/devices/system/bus/... ?

Right, there are more such systems, and yes, hooking them into the
appropriate machine topology like PCI busses is exactly the intent.

I already mentioned the GPU PMUs living in the appropriate PCI device.
But yes, there are far more exotic configurations out there.

> I would
> expect these kind of counters to be controlled via raw events because
> having a list of discrete events doesn't really make sense [for example,
> if I want to count all bursts of a given size, I can encode the burst
> size into the event number].
Raw is fine, it's only once these things start to converge and show
common traits that adding a 'generic' event class becomes useful, and
this is something that can always be done later.
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-10 11:27 ` Peter Zijlstra
` (2 preceding siblings ...)
2010-05-10 11:42 ` Peter Zijlstra
@ 2010-05-10 11:43 ` Ingo Molnar
2010-05-10 11:49 ` Peter Zijlstra
2010-05-11 14:15 ` Borislav Petkov
2010-05-10 23:54 ` Corey Ashford
2010-05-11 2:43 ` Lin Ming
5 siblings, 2 replies; 51+ messages in thread
From: Ingo Molnar @ 2010-05-10 11:43 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Lin Ming, Frederic Weisbecker, eranian@gmail.com,
Gary.Mohr@Bull.com, Corey Ashford, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Russell King,
Paul Mundt, lkml, Arnaldo Carvalho de Melo
* Peter Zijlstra <peterz@infradead.org> wrote:
> [...]
>
> This way we can create:
>
> /sys/devices/system/cpu/cpuN/cpu_hardware_events
> cpu_hardware_events/event_source_id
> cpu_hardware_events/cpu_cycles
> cpu_hardware_events/instructions
> /...
>
> /sys/devices/system/cpu/cpuN/cpu_raw_events
> cpu_raw_events/event_source_id
>
>
> These would match the current PERF_TYPE_* values for compatibility
>
> For new PMUs we can start a dynamic range of PERF_TYPE_ (say at 64k but
> that's not ABI and can be changed at any time, we've got u32 to play
> with).
>
> For uncore this would result in:
>
> /sys/devices/system/node/nodeN/node_raw_events
> node_raw_events/event_source_id
>
> and maybe:
>
> /sys/devices/system/node/nodeN/node_events
> node_events/event_source_id
> node_events/local_misses
> /local_hits
> /remote_misses
> /remote_hits
> /...
>
> The software events and tracepoints and kprobes stuff we could hang off
> of /sys/kernel/ or something

Yeah, we really want a mechanism like this in place instead of continuing with
the somewhat ad-hoc extensions to the event enumeration space.

One detail: i think we want one more level. Instead of:

  /sys/devices/system/node/nodeN/node_events
                                 node_events/event_source_id
                                 node_events/local_misses
                                            /local_hits
                                            /remote_misses
                                            /remote_hits
                                            /...

We want the individual events to be a directory, containing the event_id:

  /sys/devices/system/node/nodeN/node_events
                                 node_events/event_source_id
                                 node_events/local_misses/event_id
                                            /local_hits/event_id
                                            /remote_misses/event_id
                                            /remote_hits/event_id
                                            /...

The reason is that we want to keep our options open to add more attributes to
individual events. (In fact extended attributes already exist for certain
event classes - such as the 'format' info for tracepoints.)
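
With one directory per event, a tool could enumerate an event source by walking its directory and treating every entry that holds an event_id file as an event. The sketch below assumes exactly that layout; the function name is hypothetical, and the file format (one integer, decimal or 0x-prefixed hex) is an assumption since the thread does not specify it.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * List the events of one event source directory (e.g. .../node_events).
 * An entry counts as an event if it contains a parsable "event_id" file.
 * Prints "name (event_id N)" to 'out' when 'out' is non-NULL.
 * Returns the number of events found, or -1 if the directory cannot be opened.
 */
int sketch_list_events(const char *source_dir, FILE *out)
{
    DIR *dir;
    struct dirent *ent;
    int count = 0;

    assert(source_dir != NULL);
    dir = opendir(source_dir);
    if (!dir)
        return -1;

    while ((ent = readdir(dir)) != NULL) {
        char path[4096], buf[64], *end;
        unsigned long long id;
        FILE *f;

        if (ent->d_name[0] == '.')
            continue; /* skip ".", ".." and hidden entries */

        snprintf(path, sizeof(path), "%s/%s/event_id", source_dir, ent->d_name);
        f = fopen(path, "r");
        if (!f)
            continue; /* plain files such as event_source_id are not events */

        if (fgets(buf, sizeof(buf), f)) {
            id = strtoull(buf, &end, 0); /* base 0: decimal or 0x-prefixed hex */
            if (end != buf) {
                if (out)
                    fprintf(out, "%s (event_id %llu)\n", ent->d_name, id);
                count++;
            }
        }
        fclose(f);
    }
    closedir(dir);
    return count;
}
```

Because only the presence of event_id is checked, further per-event attribute files can be added later without breaking this kind of enumeration.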
Thanks,
Ingo
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-10 11:43 ` Ingo Molnar
@ 2010-05-10 11:49 ` Peter Zijlstra
2010-05-10 11:53 ` Ingo Molnar
2010-05-11 14:15 ` Borislav Petkov
1 sibling, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2010-05-10 11:49 UTC (permalink / raw)
To: Ingo Molnar
Cc: Lin Ming, Frederic Weisbecker, eranian@gmail.com,
Gary.Mohr@Bull.com, Corey Ashford, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Russell King,
Paul Mundt, lkml, Arnaldo Carvalho de Melo, Will Deacon
On Mon, 2010-05-10 at 13:43 +0200, Ingo Molnar wrote:
>
> Yeah, we really want a mechanism like this in place instead of continuing with
> the somewhat ad-hoc extensions to the event enumeration space.
>
> One detail: i think we want one more level. Instead of:
>
> /sys/devices/system/node/nodeN/node_events
> node_events/event_source_id
> node_events/local_misses
> /local_hits
> /remote_misses
> /remote_hits
> /...
>
> We want the individual events to be a directory, containing the event_id:
>
> /sys/devices/system/node/nodeN/node_events
> node_events/event_source_id
> node_events/local_misses/event_id
> /local_hits/event_id
> /remote_misses/event_id
> /remote_hits/event_id
> /...
>
> The reason is that we want to keep our options open to add more attributes to
> individual events. (In fact extended attributes already exist for certain
> event classes - such as the 'format' info for tracepoints.)

Sure, sounds like a sensible suggestion.

One thing I'd also like to clarify is that !raw events should not be
exhaustive hardware event lists, those are best left for userspace, but
instead are generally useful events that can be expected to be
implemented by any hardware of that particular class.

So a GPU might have things like 'vsync' and 'cmd_pipeline_stall' or
whatever is a generic GPU feature, but not very implementation specific
things that the next generation of hardware won't ever have.

* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-10 11:49 ` Peter Zijlstra
@ 2010-05-10 11:53 ` Ingo Molnar
2010-05-10 23:13 ` Corey Ashford
0 siblings, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 2010-05-10 11:53 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Lin Ming, Frederic Weisbecker, eranian@gmail.com,
Gary.Mohr@Bull.com, Corey Ashford, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Russell King,
Paul Mundt, lkml, Arnaldo Carvalho de Melo, Will Deacon
* Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2010-05-10 at 13:43 +0200, Ingo Molnar wrote:
> >
> > Yeah, we really want a mechanism like this in place instead of continuing with
> > the somewhat ad-hoc extensions to the event enumeration space.
> >
> > One detail: i think we want one more level. Instead of:
> >
> > /sys/devices/system/node/nodeN/node_events
> > node_events/event_source_id
> > node_events/local_misses
> > /local_hits
> > /remote_misses
> > /remote_hits
> > /...
> >
> > We want the individual events to be a directory, containing the event_id:
> >
> > /sys/devices/system/node/nodeN/node_events
> > node_events/event_source_id
> > node_events/local_misses/event_id
> > /local_hits/event_id
> > /remote_misses/event_id
> > /remote_hits/event_id
> > /...
> >
> > The reason is that we want to keep our options open to add more attributes to
> > individual events. (In fact extended attributes already exist for certain
> > event classes - such as the 'format' info for tracepoints.)
>
> Sure, sounds like a sensible suggestion.
>
> One thing I'd also like to clarify is that !raw events should not be
> exhaustive hardware event lists, those are best left for userspace, but
> instead are generally useful events that can be expected to be implemented
> by any hardware of that particular class.
>
> So a GPU might have things like 'vsync' and 'cmd_pipeline_stall' or whatever
> is a generic GPU feature, but not very implementation specific things that
> the next generation of hardware won't ever have.
Definitely so.
Ingo
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-10 11:53 ` Ingo Molnar
@ 2010-05-10 23:13 ` Corey Ashford
2010-05-11 6:46 ` Peter Zijlstra
2010-05-11 10:09 ` stephane eranian
0 siblings, 2 replies; 51+ messages in thread
From: Corey Ashford @ 2010-05-10 23:13 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: Lin Ming, Frederic Weisbecker, eranian@gmail.com,
Gary.Mohr@Bull.com, arjan@linux.intel.com, Zhang, Yanmin,
Paul Mackerras, David S. Miller, Russell King, Paul Mundt, lkml,
Arnaldo Carvalho de Melo, Will Deacon, Maynard Johnson, Carl Love,
Paul Mackerras
On 5/10/2010 4:53 AM, Ingo Molnar wrote:
>
> * Peter Zijlstra <peterz@infradead.org> wrote:
>
>> On Mon, 2010-05-10 at 13:43 +0200, Ingo Molnar wrote:
>>>
>>> Yeah, we really want a mechanism like this in place instead of continuing with
>>> the somewhat ad-hoc extensions to the event enumeration space.
>>>
>>> One detail: i think we want one more level. Instead of:
>>>
>>> /sys/devices/system/node/nodeN/node_events
>>> node_events/event_source_id
>>> node_events/local_misses
>>> /local_hits
>>> /remote_misses
>>> /remote_hits
>>> /...
>>>
>>> We want the individual events to be a directory, containing the event_id:
>>>
>>> /sys/devices/system/node/nodeN/node_events
>>> node_events/event_source_id
>>> node_events/local_misses/event_id
>>> /local_hits/event_id
>>> /remote_misses/event_id
>>> /remote_hits/event_id
>>> /...
>>>
>>> The reason is that we want to keep our options open to add more attributes to
>>> individual events. (In fact extended attributes already exist for certain
>>> event classes - such as the 'format' info for tracepoints.)

Having extra fields for each event would allow us to describe hardware-specific event attributes. For example:

/sys/devices/system/node/nodeN/node_events
                               node_events/event_source_id
                               node_events/local_misses/event_id
                                          /local_hits/event_id
                                          /crypto_datamover        <- specific node PMU
                                            /marked_crb_rcv_des
                                              /event_id
                                              /attrib
                                                /lpid              <- attribute name
                                                /lpid/type         <- type of attribute (boolean, integer, etc.)
                                                /lpid/min          <- min value of int attribute
                                                /lpid/max          <- max value of int attribute
                                                /lpid/bit_offset   <- amount to shift attribute value before OR'ing into the raw event code
                                                /marking_mode      <- attribute name
                                                /marking_mode/type
                                                /...

Of course, these nodes would need to be replicated for each event that needs them or other attributes.
>>
>> Sure, sounds like a sensible suggestion.
>>
>> One thing I'd also like to clarify is that !raw events should not be
>> exhaustive hardware event lists, those are best left for userspace, but
>> instead are generally useful events that can be expected to be implemented
>> by any hardware of that particular class.
Why exactly is this? I got the impression this was something you and Ingo wanted earlier. As big of an impact as it will be, it would be nice to unify the two event spaces (generic and raw) into one space that can be explored by a user space tool (or even crudely by /bin/ls).
>>
>> So a GPU might have things like 'vsync' and 'cmd_pipeline_stall' or whatever
>> is a generic GPU feature, but not very implementation specific things that
>> the next generation of hardware won't ever have.
>
> Definitely so.
>
> Ingo
Hi Ingo,

In the past, you said that you didn't want to have anything in user space enumerate the raw hardware events that are supported by the kernel. So does the above represent a re-thinking of that position?

We'd like to have the capability of hardware-specific symbolic event names in the perf tool by some mechanism, unified or otherwise. For the IBM Wire-Speed processor, we are currently not able to use the perf tool because of its lack of symbolic raw event name support.

In the meantime, we are using a pair of demo programs from Stephane Eranian's libpfm4 source tree called "task" and "syst". These tools use the symbolic event names provided by libpfm4, and use the kernel support from perf_events.

Regards,
- Corey
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-10 23:13 ` Corey Ashford
@ 2010-05-11 6:46 ` Peter Zijlstra
2010-05-11 7:21 ` Ingo Molnar
2010-05-11 10:09 ` stephane eranian
1 sibling, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2010-05-11 6:46 UTC (permalink / raw)
To: Corey Ashford
Cc: Ingo Molnar, Lin Ming, Frederic Weisbecker, eranian@gmail.com,
Gary.Mohr@Bull.com, arjan@linux.intel.com, Zhang, Yanmin,
Paul Mackerras, David S. Miller, Russell King, Paul Mundt, lkml,
Arnaldo Carvalho de Melo, Will Deacon, Maynard Johnson, Carl Love
On Mon, 2010-05-10 at 16:13 -0700, Corey Ashford wrote:
> Having extra fields for each event would allow us to describe hardware-specific event attributes. For example:
> /sys/devices/system/node/nodeN/node_events
> node_events/event_source_id
> node_events/local_misses/event_id
> /local_hits/event_id
> /crypto_datamover <- specific node PMU
> /marked_crb_rcv_des
> /event_id
> /attrib
> /lpid <- attribute name
> /lpid/type <- type of attribute (boolean, integer, etc.)
> /lpid/min <- min value of int attribute
> /lpid/max <- max value of int attribute
> /lpid/bit_offset <- amount to shift attribute value before OR'ing into the raw event code
> /marking_mode <- attribute name
> /marking_mode/type
> /...
>
> Of course, these nodes would need to be replicated for each event that needs them or other attributes.
I would suggest having a 1:n pmu:event_source ratio, not the other way
around. So in your example above the crypto_datamover PMU thingy would
get its own event_source(s), which get placed at whatever place in the
machine topology they live. If that happens to be at the node level,
place them there.
> >> Sure, sounds like a sensible suggestion.
> >>
> >> One thing I'd also like to clarify is that !raw events should not be
> >> exhaustive hardware event lists, those are best left for userspace, but
> >> instead are generally useful events that can be expected to be implemented
> >> by any hardware of that particular class.
>
> Why exactly is this? I got the impression this was something you and
> Ingo wanted earlier. As big of an impact as it will be, it would be
> nice to unify the two event spaces (generic and raw) into one space
> that can be explored by a user space tool (or even crudely
> by /bin/ls).

No, we want generic event spaces that have events that are generally
applicable to a whole class of PMUs, so say to all CPU PMUs, or all GPU
PMUs or all PCI-bridge PMUs.

What we do not want are exhaustive event lists for specific PMU
implementations, those are best left for userspace.

I'll let Ingo answer your other question.
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-11 6:46 ` Peter Zijlstra
@ 2010-05-11 7:21 ` Ingo Molnar
2010-05-11 8:20 ` Lin Ming
0 siblings, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 2010-05-11 7:21 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Corey Ashford, Lin Ming, Frederic Weisbecker, eranian@gmail.com,
Gary.Mohr@Bull.com, arjan@linux.intel.com, Zhang, Yanmin,
Paul Mackerras, David S. Miller, Russell King, Paul Mundt, lkml,
Arnaldo Carvalho de Melo, Will Deacon, Maynard Johnson, Carl Love
* Peter Zijlstra <peterz@infradead.org> wrote:
> What we do not want are exhaustive event lists for specific PMU
> implementations, those are best left for userspace.

I'd refine this the following way:

 - We mandate proper in-kernel enumeration of all things event sources, for
   example /sys/devices/system/node/nodeN/node_events. Obviously an event source
   needs to be addressable for it to be useful to userspace.

 - We want generalized events expressed in those event containers that
   are used commonly. Whatever people find useful we can enumerate and what
   is enumerated is an ABI.

 - The 'rest' can go into /sys/devices/system/node/nodeN/node_events/raw_event/.
   These will never be guaranteed in an ABI way really (although will work in
   some cases) - those using raw event codes are really up to themselves and
   if it ever gets in the way of proper, more expressive
   enumeration/generalization it will have to yield.

These are the ground rules as i see them.

Ingo
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-11 7:21 ` Ingo Molnar
@ 2010-05-11 8:20 ` Lin Ming
2010-05-11 8:50 ` Peter Zijlstra
0 siblings, 1 reply; 51+ messages in thread
From: Lin Ming @ 2010-05-11 8:20 UTC (permalink / raw)
To: Ingo Molnar
Cc: Peter Zijlstra, Corey Ashford, Frederic Weisbecker,
eranian@gmail.com, Gary.Mohr@Bull.com, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Russell King,
Paul Mundt, lkml, Arnaldo Carvalho de Melo, Will Deacon,
Maynard Johnson, Carl Love
On Tue, 2010-05-11 at 15:21 +0800, Ingo Molnar wrote:
> * Peter Zijlstra <peterz@infradead.org> wrote:
>
> > What we do not want are exhaustive event lists for specific PMU
> > implementations, those are best left for userspace.
>
> I'd refine this the following way:
>
> - We mandate proper in-kernel enumeration of all things event sources, for
> example /sys/devices/system/node/nodeN/node_events. Obviously an event source
> needs to be addressable for it to be useful to userspace.
>
> - We want generalized events expressed in those event containers that
> are used commonly. Whatever people find useful we can enumerate and what
> is enumerated is an ABI.
>
> - The 'rest' can go into /sys/devices/system/node/nodeN/node_events/raw_event/.
> These will never be guaranteed in an ABI way really (although will work in
> some cases) - those using raw event codes are really up to themselves and
> if it ever gets in the way of proper, more expressive
> enumeration/generalization it will have to yield.
>
> These are the ground rules as i see them.
>
> Ingo
How will this sysfs interface be used by a userspace tool?
/sys/devices/system/node/nodeN/node_events
node_events/event_source_id
node_events/local_misses/event_id
/local_hits/event_id
/remote_misses/event_id
/remote_hits/event_id
For example, to monitor node event local_misses on node 0, does it work
as below?
1. perf top -e local_misses -n 0 (-n 0 means node 0)
2. read /sys/devices/system/node/node0/node_events/event_source_id to
get the pmu_id
3. read /sys/devices/system/node/node0/node_events/local_misses/event_id
to get the event_id
4. event_attr::pmu_id=pmu_id, event::config=event_id
5. other setting...
6. call syscall perf_event_open(....)
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-11 8:20 ` Lin Ming
@ 2010-05-11 8:50 ` Peter Zijlstra
2010-05-11 9:03 ` Lin Ming
0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2010-05-11 8:50 UTC (permalink / raw)
To: Lin Ming
Cc: Ingo Molnar, Corey Ashford, Frederic Weisbecker,
eranian@gmail.com, Gary.Mohr@Bull.com, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Russell King,
Paul Mundt, lkml, Arnaldo Carvalho de Melo, Will Deacon,
Maynard Johnson, Carl Love
On Tue, 2010-05-11 at 16:20 +0800, Lin Ming wrote:
> How will this sysfs interface be used for userspace tool?
>
> /sys/devices/system/node/nodeN/node_events
> node_events/event_source_id
>
> node_events/local_misses/event_id
> /local_hits/event_id
> /remote_misses/event_id
> /remote_hits/event_id
>
> For example, to monitor node event local_misses on node 0, does it work
> as below?
>
> 1. perf top -e local_misses -n 0 (-n 0 means node 0)
>
> 2. read /sys/devices/system/node/node0/node_events/event_source_id to
> get the pmu_id
>
> 3. read /sys/devices/system/node/node0/node_events/local_misses/event_id
> to get the event_id
>
> 4. event_attr::pmu_id=pmu_id, event::config=event_id
>
> 5. other setting...
>
> 6. call syscall perf_event_open(....)
No, you'll use event_source_id as perf_event_attr::type, use event_id as
perf_event_attr::config and then use a cpu-wide counter on a cpu
contained in node0.
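A minimal userspace sketch of the flow described above, assuming the proposed layout existed. The sysfs files (event_source_id, local_misses/event_id) are only the layout under discussion in this thread, not an existing kernel interface, and the helper names read_id(), fill_attr() and open_cpu_wide() are made up for illustration. Only struct perf_event_attr and the perf_event_open syscall are real.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Read one unsigned integer from a sysfs-style attribute file.
 * Returns 0 on success, -1 if the file is missing or unparsable. */
int read_id(const char *path, uint64_t *out)
{
	FILE *f = fopen(path, "r");
	unsigned long long v;
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%llu", &v) == 1);
	fclose(f);
	if (!ok)
		return -1;
	*out = v;
	return 0;
}

/* event_source_id goes into attr->type, event_id into attr->config. */
void fill_attr(struct perf_event_attr *attr, uint64_t source_id,
	       uint64_t event_id)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = (uint32_t)source_id;
	attr->config = event_id;
}

/* cpu-wide counter: pid == -1, cpu == some CPU contained in the node.
 * Returns the new event fd, or -1 with errno set. */
long open_cpu_wide(struct perf_event_attr *attr, int cpu)
{
	return syscall(SYS_perf_event_open, attr, (pid_t)-1, cpu, -1, 0UL);
}
```

A tool would call read_id() once on the node's event_source_id file and once on local_misses/event_id, pass both values to fill_attr(), and then call open_cpu_wide() with any CPU that belongs to node0. No pmu_id field is involved.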
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-11 8:50 ` Peter Zijlstra
@ 2010-05-11 9:03 ` Lin Ming
2010-05-11 9:05 ` Lin Ming
2010-05-11 9:12 ` Peter Zijlstra
0 siblings, 2 replies; 51+ messages in thread
From: Lin Ming @ 2010-05-11 9:03 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Corey Ashford, Frederic Weisbecker,
eranian@gmail.com, Gary.Mohr@Bull.com, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Russell King,
Paul Mundt, lkml, Arnaldo Carvalho de Melo, Will Deacon,
Maynard Johnson, Carl Love
On Tue, 2010-05-11 at 16:50 +0800, Peter Zijlstra wrote:
> On Tue, 2010-05-11 at 16:20 +0800, Lin Ming wrote:
>
> > How will this sysfs interface be used for userspace tool?
> >
> > /sys/devices/system/node/nodeN/node_events
> > node_events/event_source_id
> >
> > node_events/local_misses/event_id
> > /local_hits/event_id
> > /remote_misses/event_id
> > /remote_hits/event_id
> >
> > For example, to monitor node event local_misses on node 0, does it work
> > as below?
> >
> > 1. perf top -e local_misses -n 0 (-n 0 means node 0)
> >
> > 2. read /sys/devices/system/node/node0/node_events/event_source_id to
> > get the pmu_id
> >
> > 3. read /sys/devices/system/node/node0/node_events/local_misses/event_id
> > to get the event_id
> >
> > 4. event_attr::pmu_id=pmu_id, event::config=event_id
> >
> > 5. other setting...
> >
> > 6. call syscall perf_event_open(....)
>
> No, you'll use event_source_id as perf_event_attr::type, use event_id as
> perf_event_attr::config and then use a cpu-wide counter on a cpu
> contained in node0.
Is event_source_id a link to event_source class?
For example, 2 event sources on Nehalem
/sys/class/event_sources/core_pmu
/sys/class/event_sources/uncore_pmu
Then,
/sys/devices/system/node/nodeN/node_events/event_source_id is a link
to /sys/class/event_sources/uncore_pmu.
/sys/devices/system/cpu/cpuN/cpu_hardware_events/event_source_id is a
link to /sys/class/event_sources/core_pmu.
And, the event_source_id
#cat /sys/class/event_sources/core_pmu
0
#cat /sys/class/event_sources/uncore_pmu
1
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-11 9:03 ` Lin Ming
@ 2010-05-11 9:05 ` Lin Ming
2010-05-11 9:12 ` Peter Zijlstra
1 sibling, 0 replies; 51+ messages in thread
From: Lin Ming @ 2010-05-11 9:05 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Corey Ashford, Frederic Weisbecker,
eranian@gmail.com, Gary.Mohr@Bull.com, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Russell King,
Paul Mundt, lkml, Arnaldo Carvalho de Melo, Will Deacon,
Maynard Johnson, Carl Love
On Tue, 2010-05-11 at 17:03 +0800, Lin Ming wrote:
> On Tue, 2010-05-11 at 16:50 +0800, Peter Zijlstra wrote:
> > On Tue, 2010-05-11 at 16:20 +0800, Lin Ming wrote:
> >
> > > How will this sysfs interface be used for userspace tool?
> > >
> > > /sys/devices/system/node/nodeN/node_events
> > > node_events/event_source_id
> > >
> > > node_events/local_misses/event_id
> > > /local_hits/event_id
> > > /remote_misses/event_id
> > > /remote_hits/event_id
> > >
> > > For example, to monitor node event local_misses on node 0, does it work
> > > as below?
> > >
> > > 1. perf top -e local_misses -n 0 (-n 0 means node 0)
> > >
> > > 2. read /sys/devices/system/node/node0/node_events/event_source_id to
> > > get the pmu_id
> > >
> > > 3. read /sys/devices/system/node/node0/node_events/local_misses/event_id
> > > to get the event_id
> > >
> > > 4. event_attr::pmu_id=pmu_id, event::config=event_id
> > >
> > > 5. other setting...
> > >
> > > 6. call syscall perf_event_open(....)
> >
> > No, you'll use event_source_id as perf_event_attr::type, use event_id as
> > perf_event_attr::config and then use a cpu-wide counter on a cpu
> > contained in node0.
>
> Is event_source_id a link to event_source class?
>
> For example, 2 event sources on Nehalem
> /sys/class/event_sources/core_pmu
> /sys/class/event_sources/uncore_pmu
>
> Then,
> /sys/devices/system/node/nodeN/node_events/event_source_id is a link
> to /sys/class/event_sources/uncore_pmu.
>
> /sys/devices/system/cpu/cpuN/cpu_hardware_events/event_source_id is a
> link to /sys/class/event_sources/core_pmu.
>
> And, the event_source_id
> #cat /sys/class/event_sources/core_pmu
> 0
>
> #cat /sys/class/event_sources/uncore_pmu
> 1
Oh, no.
/sys/class/event_sources/uncore_pmu is not 1, but other dynamic value.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-11 9:03 ` Lin Ming
2010-05-11 9:05 ` Lin Ming
@ 2010-05-11 9:12 ` Peter Zijlstra
2010-05-11 9:18 ` Ingo Molnar
2010-05-11 9:40 ` Lin Ming
1 sibling, 2 replies; 51+ messages in thread
From: Peter Zijlstra @ 2010-05-11 9:12 UTC (permalink / raw)
To: Lin Ming
Cc: Ingo Molnar, Corey Ashford, Frederic Weisbecker,
eranian@gmail.com, Gary.Mohr@Bull.com, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Russell King,
Paul Mundt, lkml, Arnaldo Carvalho de Melo, Will Deacon,
Maynard Johnson, Carl Love
On Tue, 2010-05-11 at 17:03 +0800, Lin Ming wrote:
> On Tue, 2010-05-11 at 16:50 +0800, Peter Zijlstra wrote:
> > On Tue, 2010-05-11 at 16:20 +0800, Lin Ming wrote:
> >
> > > How will this sysfs interface be used for userspace tool?
> > >
> > > /sys/devices/system/node/nodeN/node_events
> > > node_events/event_source_id
> > >
> > > node_events/local_misses/event_id
> > > /local_hits/event_id
> > > /remote_misses/event_id
> > > /remote_hits/event_id
> > >
> > > For example, to monitor node event local_misses on node 0, does it work
> > > as below?
> > >
> > > 1. perf top -e local_misses -n 0 (-n 0 means node 0)
> > >
> > > 2. read /sys/devices/system/node/node0/node_events/event_source_id to
> > > get the pmu_id
> > >
> > > 3. read /sys/devices/system/node/node0/node_events/local_misses/event_id
> > > to get the event_id
> > >
> > > 4. event_attr::pmu_id=pmu_id, event::config=event_id
> > >
> > > 5. other setting...
> > >
> > > 6. call syscall perf_event_open(....)
> >
> > No, you'll use event_source_id as perf_event_attr::type, use event_id as
> > perf_event_attr::config and then use a cpu-wide counter on a cpu
> > contained in node0.
>
> Is event_source_id a link to event_source class?
No, it's an attribute of said class.
> For example, 2 event sources on Nehalem
> /sys/class/event_sources/core_pmu
> /sys/class/event_sources/uncore_pmu
>
> Then,
> /sys/devices/system/node/nodeN/node_events/event_source_id is a link
> to /sys/class/event_sources/uncore_pmu.
> /sys/devices/system/cpu/cpuN/cpu_hardware_events/event_source_id is a
> link to /sys/class/event_sources/core_pmu.
The other way around, look in /sys/class/*/, they're all symlinks.
> And, the event_source_id
> #cat /sys/class/event_sources/core_pmu
> 0
> #cat /sys/class/event_sources/uncore_pmu
> 1
You can't cat a directory. You'd cat something
like: /sys/class/event_sources/core_pmu/event_source_id
And they would contain PERF_TYPE_* like things.
So for the current CPU PMUs we'd already create 3 event classes,
cpu_hw_events, cpu_hw_cache_events, cpu_raw_events, with resp.
event_source_id of 0, 3 and 4.
The new PMUs will use a dynamic range that starts at PERF_TYPE_MAX.
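To make the numbering concrete, here is a small sketch that maps the three class names above to their existing PERF_TYPE_* values and tests whether an id falls in the proposed dynamic range. The PERF_TYPE_* constants come from linux/perf_event.h. The function names legacy_class_id() and source_id_is_dynamic() are hypothetical, and the dynamic range itself is only the proposal made in this mail.

```c
#include <linux/perf_event.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Fixed ids for the three existing CPU event classes named above.
 * Returns -1 for any other (presumably dynamically numbered) class. */
int legacy_class_id(const char *name)
{
	if (strcmp(name, "cpu_hw_events") == 0)
		return PERF_TYPE_HARDWARE;	/* 0 */
	if (strcmp(name, "cpu_hw_cache_events") == 0)
		return PERF_TYPE_HW_CACHE;	/* 3 */
	if (strcmp(name, "cpu_raw_events") == 0)
		return PERF_TYPE_RAW;		/* 4 */
	return -1;
}

/* Under the proposal, ids at or above PERF_TYPE_MAX are handed out
 * dynamically to newly registered PMUs. */
bool source_id_is_dynamic(uint32_t id)
{
	return id >= (uint32_t)PERF_TYPE_MAX;
}
```

The values 1 and 2 are skipped for the CPU PMU because they already belong to PERF_TYPE_SOFTWARE and PERF_TYPE_TRACEPOINT.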
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-11 9:12 ` Peter Zijlstra
@ 2010-05-11 9:18 ` Ingo Molnar
2010-05-11 9:24 ` Peter Zijlstra
2010-05-13 8:28 ` Lin Ming
2010-05-11 9:40 ` Lin Ming
1 sibling, 2 replies; 51+ messages in thread
From: Ingo Molnar @ 2010-05-11 9:18 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Lin Ming, Corey Ashford, Frederic Weisbecker, eranian@gmail.com,
Gary.Mohr@Bull.com, arjan@linux.intel.com, Zhang, Yanmin,
Paul Mackerras, David S. Miller, Russell King, Paul Mundt, lkml,
Arnaldo Carvalho de Melo, Will Deacon, Maynard Johnson, Carl Love
* Peter Zijlstra <peterz@infradead.org> wrote:
> The new PMUs will use a dynamic range that starts at PERF_TYPE_MAX.
I don't think we should use a dynamic range of event sources - it's a
completely useless indirection that has no meaning to humans.
As far as machine interfaces go a much cleaner approach would be to allow an
open fd to a sysfs file to be passed to sys_perf_event_open() - this would
identify the event source. This needs a small extension of the ABI but we
could thus get rid of the 'type' enumeration altogether and express _all_
event sources via fds to sysfs files.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-11 9:18 ` Ingo Molnar
@ 2010-05-11 9:24 ` Peter Zijlstra
2010-05-11 9:31 ` Ingo Molnar
2010-05-13 8:28 ` Lin Ming
1 sibling, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2010-05-11 9:24 UTC (permalink / raw)
To: Ingo Molnar
Cc: Lin Ming, Corey Ashford, Frederic Weisbecker, eranian@gmail.com,
Gary.Mohr@Bull.com, arjan@linux.intel.com, Zhang, Yanmin,
Paul Mackerras, David S. Miller, Russell King, Paul Mundt, lkml,
Arnaldo Carvalho de Melo, Will Deacon, Maynard Johnson, Carl Love
On Tue, 2010-05-11 at 11:18 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <peterz@infradead.org> wrote:
>
> > The new PMUs will use a dynamic range that starts at PERF_TYPE_MAX.
>
> I dont think we should use a dynamic range of event sources - it's a
> completely useless indirection that has no meaning to humans.
>
> As far as machine interfaces go a much cleaner approach would be to allow an
> open fd to a sysfs file to be passed to sys_perf_event_open() - this would
> identify the event source. This needs a small extension of the ABI but we
> could thus get rid of the 'type' enumeration altogether and express _all_
> event sources via fds to sysfs files.
Whatever, that's almost identical. What we can do is reserve
perf_event_attr::type with bit 31 set for fd's and use the fd->file
lookup instead of the type->pmu lookup.
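The bit-31 idea can be sketched as a pair of encode/decode helpers. This is only an illustration of the proposal: the macro and function names are hypothetical, and nothing here is an existing ABI. One detail worth noting is that the flag is built from an unsigned constant, because shifting a signed 1 into bit 31 of a 32-bit int is undefined behaviour in C.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag: when set, the low 31 bits of 'type' carry an fd. */
#define PERF_TYPE_FD_FLAG (UINT32_C(1) << 31)

/* Encode a non-negative fd into a perf_event_attr::type value. */
uint32_t type_from_fd(int fd)
{
	return PERF_TYPE_FD_FLAG | (uint32_t)fd;
}

/* Decide which lookup applies: fd->file (true) or type->pmu (false). */
bool type_is_fd(uint32_t type)
{
	return (type & PERF_TYPE_FD_FLAG) != 0;
}

/* Recover the fd from an encoded type value. */
int type_to_fd(uint32_t type)
{
	return (int)(type & ~PERF_TYPE_FD_FLAG);
}
```

The existing static PERF_TYPE_* values all have bit 31 clear, so they would keep taking the type->pmu path unchanged.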
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-11 9:24 ` Peter Zijlstra
@ 2010-05-11 9:31 ` Ingo Molnar
2010-05-11 10:28 ` Lin Ming
0 siblings, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 2010-05-11 9:31 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Lin Ming, Corey Ashford, Frederic Weisbecker, eranian@gmail.com,
Gary.Mohr@Bull.com, arjan@linux.intel.com, Zhang, Yanmin,
Paul Mackerras, David S. Miller, Russell King, Paul Mundt, lkml,
Arnaldo Carvalho de Melo, Will Deacon, Maynard Johnson, Carl Love
* Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2010-05-11 at 11:18 +0200, Ingo Molnar wrote:
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > The new PMUs will use a dynamic range that starts at PERF_TYPE_MAX.
> >
> > I dont think we should use a dynamic range of event sources - it's a
> > completely useless indirection that has no meaning to humans.
> >
> > As far as machine interfaces go a much cleaner approach would be to allow an
> > open fd to a sysfs file to be passed to sys_perf_event_open() - this would
> > identify the event source. This needs a small extension of the ABI but we
> > could thus get rid of the 'type' enumeration altogether and express _all_
> > event sources via fds to sysfs files.
>
> Whatever, that's almost identical. [...]
It's not identical: we don't expose our mapping structure externally and
don't have to have some dynamic type ID allocation layer/mechanism. Also, using
fds is an elegant, Linuxish way of expressing some object's identity and
passing it along.
( It also removes the possibility to intentionally or accidentally have type
IDs that are not reachable via sysfs and vice versa. )
> [...] What we can do is reserve perf_event_attr::type with bit 31 set for
> fd's and use the fd->file lookup instead of the type->pmu lookup.
Yeah, that sounds good.
Ingo
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-11 9:31 ` Ingo Molnar
@ 2010-05-11 10:28 ` Lin Ming
0 siblings, 0 replies; 51+ messages in thread
From: Lin Ming @ 2010-05-11 10:28 UTC (permalink / raw)
To: Ingo Molnar
Cc: Peter Zijlstra, Corey Ashford, Frederic Weisbecker,
eranian@gmail.com, Gary.Mohr@Bull.com, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Russell King,
Paul Mundt, lkml, Arnaldo Carvalho de Melo, Will Deacon,
Maynard Johnson, Carl Love
On Tue, 2010-05-11 at 17:31 +0800, Ingo Molnar wrote:
> * Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Tue, 2010-05-11 at 11:18 +0200, Ingo Molnar wrote:
> > > * Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > > The new PMUs will use a dynamic range that starts at PERF_TYPE_MAX.
> > >
> > > I dont think we should use a dynamic range of event sources - it's a
> > > completely useless indirection that has no meaning to humans.
> > >
> > > As far as machine interfaces go a much cleaner approach would be to allow an
> > > open fd to a sysfs file to be passed to sys_perf_event_open() - this would
> > > identify the event source. This needs a small extension of the ABI but we
> > > could thus get rid of the 'type' enumeration altogether and express _all_
> > > event sources via fds to sysfs files.
> >
> > Whatever, that's almost identical. [...]
>
> It's not identical: as we dont expose our mapping structure externally and
> dont have to have some dynamic type ID allocation layer/mechanism. Also, using
> fds is an elegant, Linuxish way of expressing some object's identity and
> passing it along.
I don't understand this well enough.
Does the pmu info need to be attached to the sysfs file?
For example,
Does "perf top -e cycles" work as below?
1. get the pmu file of cycles events,
ie, /sys/devices/system/cpu/cpu_hw_events/event_source_id
2. fd = open("/sys/devices/system/cpu/cpu_hw_events/event_source_id")
3. perf_event_attr::type = fd | (1<<31)
4. In kernel, perf_event_lookup_pmu() finds pmu by fd -> sys file -> pmu
>
> ( It also removes the possibility to intentionally or accidentally have type
> IDs that are not reachable via the sysfs and vica verse. )
>
> > [...] What we can do is reserve perf_event_attr::type with bit 31 set for
> > fd's and use the fd->file lookup instead of the type->pmu lookup.
>
> Yeah, that sounds good.
>
> Ingo
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-11 9:18 ` Ingo Molnar
2010-05-11 9:24 ` Peter Zijlstra
@ 2010-05-13 8:28 ` Lin Ming
2010-05-13 8:38 ` Ingo Molnar
1 sibling, 1 reply; 51+ messages in thread
From: Lin Ming @ 2010-05-13 8:28 UTC (permalink / raw)
To: Ingo Molnar
Cc: greg, Peter Zijlstra, Corey Ashford, Frederic Weisbecker,
eranian@gmail.com, Gary.Mohr@Bull.com, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Russell King,
Paul Mundt, lkml, Arnaldo Carvalho de Melo, Will Deacon,
Maynard Johnson, Carl Love
On Tue, 2010-05-11 at 17:18 +0800, Ingo Molnar wrote:
> * Peter Zijlstra <peterz@infradead.org> wrote:
>
> > The new PMUs will use a dynamic range that starts at PERF_TYPE_MAX.
>
> I dont think we should use a dynamic range of event sources - it's a
> completely useless indirection that has no meaning to humans.
>
> As far as machine interfaces go a much cleaner approach would be to allow an
> open fd to a sysfs file to be passed to sys_perf_event_open() - this would
> identify the event source. This needs a small extension of the ABI but we
> could thus get rid of the 'type' enumeration altogether and express _all_
> event sources via fds to sysfs files.
I still don't understand this sys_fd -> pmu lookup, would you please
explain it in more detail?
struct pmu {
kobject kobj;
...
};
What I can imagine is,
1. In userspace, sys_fd =
open("/sys/devices/system/cpu/event_source", ..), then sys_fd is passed
to sys_perf_event_open()
2. In kernel, sys_file = <find the sys file structure with sys_fd>
3. kobject = <retrieve the kobject from sys_file>
4. pmu = container_of(kobject, struct pmu, kobj)
If my understanding is correct, then step 3 above seems strange. It's
not the typical usage of sys file.
Thanks,
Lin Ming
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-13 8:28 ` Lin Ming
@ 2010-05-13 8:38 ` Ingo Molnar
2010-05-13 9:22 ` Lin Ming
0 siblings, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 2010-05-13 8:38 UTC (permalink / raw)
To: Lin Ming
Cc: greg, Peter Zijlstra, Corey Ashford, Frederic Weisbecker,
eranian@gmail.com, Gary.Mohr@Bull.com, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Russell King,
Paul Mundt, lkml, Arnaldo Carvalho de Melo, Will Deacon,
Maynard Johnson, Carl Love
* Lin Ming <ming.m.lin@intel.com> wrote:
> On Tue, 2010-05-11 at 17:18 +0800, Ingo Molnar wrote:
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > The new PMUs will use a dynamic range that starts at PERF_TYPE_MAX.
> >
> > I dont think we should use a dynamic range of event sources - it's a
> > completely useless indirection that has no meaning to humans.
> >
> > As far as machine interfaces go a much cleaner approach would be to allow an
> > open fd to a sysfs file to be passed to sys_perf_event_open() - this would
> > identify the event source. This needs a small extension of the ABI but we
> > could thus get rid of the 'type' enumeration altogether and express _all_
> > event sources via fds to sysfs files.
>
> I still don't understand this sys_fd -> pmu lookup, would you please
> explain it more detail?
>
> struct pmu {
> kobject kobj;
> ...
> };
>
> What I can imagine is,
>
> 1. In userspace, sys_fd =
> open("/sys/devices/system/cpu/event_source", ..), then sys_fd is passed
> to sys_perf_event_open()
Yes, open() an event_source - or rather an event itself. For raw events there
has to be a separate event entry that can be opened.
I.e. we'd have a layout like:
/sys/devices/system/cpu/events/cycles/id
/sys/devices/system/cpu/events/instructions/id
/sys/devices/system/cpu/events/raw/id
By making each event category a directory we gain the flexibility of
integrating tracepoints as well, for example:
/sys/kernel/sched/events/wakeup/id
/sys/kernel/sched/events/wakeup/format
Where 'format' describes the event record layout:
# cat /debug/tracing/events/sched/sched_wakeup/format
name: sched_wakeup
ID: 59
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:int common_lock_depth; offset:8; size:4; signed:1;
field:char comm[TASK_COMM_LEN]; offset:12; size:16; signed:1;
field:pid_t pid; offset:28; size:4; signed:1;
field:int prio; offset:32; size:4; signed:1;
field:int success; offset:36; size:4; signed:1;
field:int target_cpu; offset:40; size:4; signed:1;
print fmt: "comm=%s pid=%d prio=%d success=%d target_cpu=%03d", REC->comm, REC->pid, REC->prio, REC->success, REC->target_cpu
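For a tool that consumes such a 'format' file, each "field:" line can be split with a single sscanf() call. The sketch below is an assumption about how a parser might look, not the code perf uses; struct field_desc and parse_field() are hypothetical names.

```c
#include <stdio.h>
#include <string.h>

/* One parsed "field:" line of a format file. */
struct field_desc {
	char decl[64];	/* C declaration, e.g. "int common_pid" */
	int offset;	/* byte offset inside the event record */
	int size;	/* size in bytes */
	int is_signed;	/* 1 if the field is signed, else 0 */
};

/* Parse "field:<decl>; offset:N; size:N; signed:N;".
 * Returns 0 on success, -1 if the line is not a complete field line. */
int parse_field(const char *line, struct field_desc *fd)
{
	const char *p = strstr(line, "field:");

	if (!p)
		return -1;
	if (sscanf(p, "field:%63[^;]; offset:%d; size:%d; signed:%d;",
		   fd->decl, &fd->offset, &fd->size, &fd->is_signed) != 4)
		return -1;
	return 0;
}
```

Lines such as "name:", "ID:" and "print fmt:" contain no "field:" token, so parse_field() rejects them and the caller can handle them separately.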
> 2. In kernel, sys_file = <find the sys file structure with sys_fd>
>
> 3. kobject = <retrieve the kobject from sys_file>
>
> 4. pmu = container_of(kobject, struct pmu, kobj)
>
> If my understanding is correct, then step 3 above seems strange. It's
> not the typical usage of sys file.
I don't think it's strange - we demux from the generic sysfs object to the more
specific perf events related object. This is similar to how driver-specific sysfs
functionality does the demux as well.
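The demux step (step 4 in the quoted list) is the usual container_of() pattern. Below is a userspace sketch with stand-in types: struct kobject here is a one-field mock, not the kernel structure, the container_of() macro is a simplified version of the kernel one, and kobj_to_pmu() is a hypothetical helper name.

```c
#include <stddef.h>

/* Stand-in for the kernel's struct kobject (illustration only). */
struct kobject {
	const char *name;
};

/* A pmu that embeds its kobject, as in the struct quoted above. */
struct pmu {
	int type;
	struct kobject kobj;
};

/* Simplified container_of(): step back from a member pointer to the
 * start of the structure that contains it. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Generic sysfs object -> perf-specific object. */
struct pmu *kobj_to_pmu(struct kobject *kobj)
{
	return container_of(kobj, struct pmu, kobj);
}
```

Because the kobject is embedded rather than pointed to, the conversion is plain pointer arithmetic and needs no lookup table.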
Ingo
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-13 8:38 ` Ingo Molnar
@ 2010-05-13 9:22 ` Lin Ming
0 siblings, 0 replies; 51+ messages in thread
From: Lin Ming @ 2010-05-13 9:22 UTC (permalink / raw)
To: Ingo Molnar
Cc: greg@kroah.com, Peter Zijlstra, Corey Ashford,
Frederic Weisbecker, eranian@gmail.com, Gary.Mohr@Bull.com,
arjan@linux.intel.com, Zhang, Yanmin, Paul Mackerras,
David S. Miller, Russell King, Paul Mundt, lkml,
Arnaldo Carvalho de Melo, Will Deacon, Maynard Johnson, Carl Love
On Thu, 2010-05-13 at 16:38 +0800, Ingo Molnar wrote:
> * Lin Ming <ming.m.lin@intel.com> wrote:
>
> > On Tue, 2010-05-11 at 17:18 +0800, Ingo Molnar wrote:
> > > * Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > > The new PMUs will use a dynamic range that starts at PERF_TYPE_MAX.
> > >
> > > I dont think we should use a dynamic range of event sources - it's a
> > > completely useless indirection that has no meaning to humans.
> > >
> > > As far as machine interfaces go a much cleaner approach would be to allow an
> > > open fd to a sysfs file to be passed to sys_perf_event_open() - this would
> > > identify the event source. This needs a small extension of the ABI but we
> > > could thus get rid of the 'type' enumeration altogether and express _all_
> > > event sources via fds to sysfs files.
> >
> > I still don't understand this sys_fd -> pmu lookup, would you please
> > explain it more detail?
> >
> > struct pmu {
> > kobject kobj;
> > ...
> > };
> >
> > What I can imagine is,
> >
> > 1. In userspace, sys_fd =
> > open("/sys/devices/system/cpu/event_source", ..), then sys_fd is passed
> > to sys_perf_event_open()
>
> Yes, open() an event_source - or rather an event itself. For raw events there
> has to be a separate event entry that can be opened.
>
> I.e. we'd have a layout like:
>
> /sys/devices/system/cpu/events/cycles/id
> /sys/devices/system/cpu/events/instructions/id
> /sys/devices/system/cpu/events/raw/id
>
> By making each event category a directory we gain the flexibility of
> integrating tracepoints as well, for example:
>
> /sys/kernel/sched/events/wakeup/id
> /sys/kernel/sched/events/wakeup/format
>
> Where 'format' describes the event record layout:
>
> # cat /debug/tracing/events/sched/sched_wakeup/format
> name: sched_wakeup
> ID: 59
> format:
> field:unsigned short common_type; offset:0; size:2; signed:0;
> field:unsigned char common_flags; offset:2; size:1; signed:0;
> field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
> field:int common_pid; offset:4; size:4; signed:1;
> field:int common_lock_depth; offset:8; size:4; signed:1;
>
> field:char comm[TASK_COMM_LEN]; offset:12; size:16; signed:1;
> field:pid_t pid; offset:28; size:4; signed:1;
> field:int prio; offset:32; size:4; signed:1;
> field:int success; offset:36; size:4; signed:1;
> field:int target_cpu; offset:40; size:4; signed:1;
>
> print fmt: "comm=%s pid=%d prio=%d success=%d target_cpu=%03d", REC->comm, REC->pid, REC->prio, REC->success, REC->target_cpu
>
> > 2. In kernel, sys_file = <find the sys file structure with sys_fd>
> >
> > 3. kobject = <retrieve the kobject from sys_file>
> >
> > 4. pmu = container_of(kobject, struct pmu, kobj)
> >
> > If my understanding is correct, then step 3 above seems strange. It's
> > not the typical usage of sys file.
>
> I dont think it's stange - we demux from the generic sysfs object to the more
> specific perf events related object. This is similar how driver specific sysfs
> functionality does the demux as well.
OK, thanks for the explanation.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-11 9:12 ` Peter Zijlstra
2010-05-11 9:18 ` Ingo Molnar
@ 2010-05-11 9:40 ` Lin Ming
2010-05-11 9:48 ` Peter Zijlstra
1 sibling, 1 reply; 51+ messages in thread
From: Lin Ming @ 2010-05-11 9:40 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Corey Ashford, Frederic Weisbecker,
eranian@gmail.com, Gary.Mohr@Bull.com, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Russell King,
Paul Mundt, lkml, Arnaldo Carvalho de Melo, Will Deacon,
Maynard Johnson, Carl Love
On Tue, 2010-05-11 at 17:12 +0800, Peter Zijlstra wrote:
> On Tue, 2010-05-11 at 17:03 +0800, Lin Ming wrote:
> > On Tue, 2010-05-11 at 16:50 +0800, Peter Zijlstra wrote:
> > > On Tue, 2010-05-11 at 16:20 +0800, Lin Ming wrote:
> > >
> > > > How will this sysfs interface be used for userspace tool?
> > > >
> > > > /sys/devices/system/node/nodeN/node_events
> > > > node_events/event_source_id
> > > >
> > > > node_events/local_misses/event_id
> > > > /local_hits/event_id
> > > > /remote_misses/event_id
> > > > /remote_hits/event_id
> > > >
> > > > For example, to monitor node event local_misses on node 0, does it work
> > > > as below?
> > > >
> > > > 1. perf top -e local_misses -n 0 (-n 0 means node 0)
> > > >
> > > > 2. read /sys/devices/system/node/node0/node_events/event_source_id to
> > > > get the pmu_id
> > > >
> > > > 3. read /sys/devices/system/node/node0/node_events/local_misses/event_id
> > > > to get the event_id
> > > >
> > > > 4. event_attr::pmu_id=pmu_id, event::config=event_id
> > > >
> > > > 5. other setting...
> > > >
> > > > 6. call syscall perf_event_open(....)
> > >
> > > No, you'll use event_source_id as perf_event_attr::type, use event_id as
> > > perf_event_attr::config and then use a cpu-wide counter on a cpu
> > > contained in node0.
> >
> > Is event_source_id a link to event_source class?
>
> No its an attribute of said class.
>
> > For example, 2 event sources on Nehalem
> > /sys/class/event_sources/core_pmu
> > /sys/class/event_sources/uncore_pmu
> >
> > Then,
> > /sys/devices/system/node/nodeN/node_events/event_source_id is a link
> > to /sys/class/event_sources/uncore_pmu.
>
> > /sys/devices/system/cpu/cpuN/cpu_hardware_events/event_source_id is a
> > link to /sys/class/event_sources/core_pmu.
>
> The other way around, look in /sys/class/*/, they're all symlinks.
/sys/devices/system/cpu/cpu0/cpu_hw_events/*
/sys/devices/system/cpu/cpu0/cpu_hw_cache_events/*
/sys/devices/system/cpu/cpu0/cpu_raw_events/*
....
....
/sys/devices/system/cpu/cpuN/cpu_hw_events/*
/sys/devices/system/cpu/cpuN/cpu_hw_cache_events/*
/sys/devices/system/cpu/cpuN/cpu_raw_events/*
Does /sys/class/event_sources/* look like,
/sys/class/event_sources/cpu_hw_events0
-> /sys/devices/system/cpu/cpu0/cpu_hw_events
...
/sys/class/event_sources/cpu_hw_eventsN
-> /sys/devices/system/cpu/cpuN/cpu_hw_events
/sys/class/event_sources/cpu_hw_cache_events0
-> /sys/devices/system/cpu/cpu0/cpu_hw_cache_events
...
/sys/class/event_sources/cpu_hw_cache_eventsN
-> /sys/devices/system/cpu/cpuN/cpu_hw_cache_events
?
>
> > And, the event_source_id
> > #cat /sys/class/event_sources/core_pmu
> > 0
>
> > #cat /sys/class/event_sources/uncore_pmu
> > 1
>
> You can't cat a directory. You'd cat something
> like: /sys/class/event_sources/core_pmu/event_source_id
>
> And they would contain PERF_TYPE_* like things.
>
> So for the current CPU PMUs we'd already create 3 event classes,
> cpu_hw_events, cpu_hw_cache_events, cpu_raw_events, with resp.
> event_source_id of 0, 3 and 4.
>
> The new PMUs will use a dynamic range that starts at PERF_TYPE_MAX.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-11 9:40 ` Lin Ming
@ 2010-05-11 9:48 ` Peter Zijlstra
2010-05-11 9:53 ` Lin Ming
` (2 more replies)
0 siblings, 3 replies; 51+ messages in thread
From: Peter Zijlstra @ 2010-05-11 9:48 UTC (permalink / raw)
To: Lin Ming
Cc: Ingo Molnar, Corey Ashford, Frederic Weisbecker,
eranian@gmail.com, Gary.Mohr@Bull.com, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Russell King,
Paul Mundt, lkml, Arnaldo Carvalho de Melo, Will Deacon,
Maynard Johnson, Carl Love, greg@kroah.com, Kay Sievers
On Tue, 2010-05-11 at 17:40 +0800, Lin Ming wrote:
> /sys/devices/system/cpu/cpu0/cpu_hw_events/*
> /sys/devices/system/cpu/cpu0/cpu_hw_cache_events/*
> /sys/devices/system/cpu/cpu0/cpu_raw_events/*
> ....
> ....
> /sys/devices/system/cpu/cpuN/cpu_hw_events/*
> /sys/devices/system/cpu/cpuN/cpu_hw_cache_events/*
> /sys/devices/system/cpu/cpuN/cpu_raw_events/*
>
> Is /sys/class/event_sources/* looks like,
>
> /sys/class/event_sources/cpu_hw_events0
> -> /sys/devices/system/cpu/cpu0/cpu_hw_events
> ...
> /sys/class/event_sources/cpu_hw_eventsN
> -> /sys/devices/system/cpu/cpuN/cpu_hw_events
>
> /sys/class/event_sources/cpu_hw_cache_events0
> -> /sys/devices/system/cpu/cpu0/cpu_hw_events
> ...
> /sys/class/event_sources/cpu_hw_cache_eventsN
> -> /sys/devices/system/cpu/cpuN/cpu_hw_events
Hmm, good question.
No, all the cpus would have the same event sources. I'm not sure if we
can make sysfs understand that though (added GregKH and Kay to CC).
Possibly we'd have to place them at the cpu level, like:
/sys/devices/system/cpu/cpu_*_events/
and have links like:
/sys/devices/system/cpu/cpuN/cpu_*_events ->
/sys/devices/system/cpu/cpu_*_events/
as well as
/sys/class/event_sources/cpu_*_events ->
/sys/devices/system/cpu/cpu_*_events/
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-11 9:48 ` Peter Zijlstra
@ 2010-05-11 9:53 ` Lin Ming
2010-05-11 15:17 ` Greg KH
2010-05-12 5:51 ` Paul Mundt
2 siblings, 0 replies; 51+ messages in thread
From: Lin Ming @ 2010-05-11 9:53 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Corey Ashford, Frederic Weisbecker,
eranian@gmail.com, Gary.Mohr@Bull.com, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Russell King,
Paul Mundt, lkml, Arnaldo Carvalho de Melo, Will Deacon,
Maynard Johnson, Carl Love, greg@kroah.com, Kay Sievers
On Tue, 2010-05-11 at 17:48 +0800, Peter Zijlstra wrote:
> On Tue, 2010-05-11 at 17:40 +0800, Lin Ming wrote:
> > /sys/devices/system/cpu/cpu0/cpu_hw_events/*
> > /sys/devices/system/cpu/cpu0/cpu_hw_cache_events/*
> > /sys/devices/system/cpu/cpu0/cpu_raw_events/*
> > ....
> > ....
> > /sys/devices/system/cpu/cpuN/cpu_hw_events/*
> > /sys/devices/system/cpu/cpuN/cpu_hw_cache_events/*
> > /sys/devices/system/cpu/cpuN/cpu_raw_events/*
> >
> > Is /sys/class/event_sources/* looks like,
> >
> > /sys/class/event_sources/cpu_hw_events0
> > -> /sys/devices/system/cpu/cpu0/cpu_hw_events
> > ...
> > /sys/class/event_sources/cpu_hw_eventsN
> > -> /sys/devices/system/cpu/cpuN/cpu_hw_events
> >
> > /sys/class/event_sources/cpu_hw_cache_events0
> > -> /sys/devices/system/cpu/cpu0/cpu_hw_events
> > ...
> > /sys/class/event_sources/cpu_hw_cache_eventsN
> > -> /sys/devices/system/cpu/cpuN/cpu_hw_events
>
> Hmm, good question.
>
> No all the cpus would have the same event sources. I'm not sure if we
> can make sysfs understand that though (added GregKH and Kay to CC).
>
> Possibly we'd have to place them at the cpu level, like:
>
> /sys/devices/system/cpu/cpu_*_events/
>
> and have links like:
>
> /sys/devices/system/cpu/cpuN/cpu_*_events ->
> /sys/devices/system/cpu/cpu_*_events/
>
> as well as
>
> /sys/class/event_sources/cpu_*_events ->
> /sys/devices/system/cpu/cpu_*_events/
Ah, I like this idea.
Thanks.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-11 9:48 ` Peter Zijlstra
2010-05-11 9:53 ` Lin Ming
@ 2010-05-11 15:17 ` Greg KH
2010-05-12 5:51 ` Paul Mundt
2 siblings, 0 replies; 51+ messages in thread
From: Greg KH @ 2010-05-11 15:17 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Lin Ming, Ingo Molnar, Corey Ashford, Frederic Weisbecker,
eranian@gmail.com, Gary.Mohr@Bull.com, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Russell King,
Paul Mundt, lkml, Arnaldo Carvalho de Melo, Will Deacon,
Maynard Johnson, Carl Love, Kay Sievers
On Tue, May 11, 2010 at 11:48:42AM +0200, Peter Zijlstra wrote:
> On Tue, 2010-05-11 at 17:40 +0800, Lin Ming wrote:
> > /sys/devices/system/cpu/cpu0/cpu_hw_events/*
> > /sys/devices/system/cpu/cpu0/cpu_hw_cache_events/*
> > /sys/devices/system/cpu/cpu0/cpu_raw_events/*
> > ....
> > ....
> > /sys/devices/system/cpu/cpuN/cpu_hw_events/*
> > /sys/devices/system/cpu/cpuN/cpu_hw_cache_events/*
> > /sys/devices/system/cpu/cpuN/cpu_raw_events/*
> >
> > Is /sys/class/event_sources/* looks like,
> >
> > /sys/class/event_sources/cpu_hw_events0
> > -> /sys/devices/system/cpu/cpu0/cpu_hw_events
> > ...
> > /sys/class/event_sources/cpu_hw_eventsN
> > -> /sys/devices/system/cpu/cpuN/cpu_hw_events
> >
> > /sys/class/event_sources/cpu_hw_cache_events0
> > -> /sys/devices/system/cpu/cpu0/cpu_hw_events
> > ...
> > /sys/class/event_sources/cpu_hw_cache_eventsN
> > -> /sys/devices/system/cpu/cpuN/cpu_hw_events
>
> Hmm, good question.
>
> No all the cpus would have the same event sources. I'm not sure if we
> can make sysfs understand that though (added GregKH and Kay to CC).
>
> Possibly we'd have to place them at the cpu level, like:
>
> /sys/devices/system/cpu/cpu_*_events/
The problem with this is that everything under /sys/devices/system/ is
one of the horrid sysdev structures, which don't play nice (or at all)
with the rest of the driver/device model.
If we fix them up to finally work properly like real devices then we
could do this:
> and have links like:
>
> /sys/devices/system/cpu/cpuN/cpu_*_events ->
> /sys/devices/system/cpu/cpu_*_events/
>
> as well as
>
> /sys/class/event_sources/cpu_*_events ->
> /sys/devices/system/cpu/cpu_*_events/
Like you want to have done.
{sigh}
I guess I'll finally have to start working on this. I had some
conversations at the last Collab Summit with someone on how to properly
fix this all up; I'll go dig for that note and start in on it.
thanks,
greg k-h
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-11 9:48 ` Peter Zijlstra
2010-05-11 9:53 ` Lin Ming
2010-05-11 15:17 ` Greg KH
@ 2010-05-12 5:51 ` Paul Mundt
2010-05-12 8:37 ` Peter Zijlstra
2 siblings, 1 reply; 51+ messages in thread
From: Paul Mundt @ 2010-05-12 5:51 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Lin Ming, Ingo Molnar, Corey Ashford, Frederic Weisbecker,
eranian@gmail.com, Gary.Mohr@Bull.com, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Russell King,
lkml, Arnaldo Carvalho de Melo, Will Deacon, Maynard Johnson,
Carl Love, greg@kroah.com, Kay Sievers
On Tue, May 11, 2010 at 11:48:42AM +0200, Peter Zijlstra wrote:
> On Tue, 2010-05-11 at 17:40 +0800, Lin Ming wrote:
> > Is /sys/class/event_sources/* looks like,
> >
> > /sys/class/event_sources/cpu_hw_events0
> > -> /sys/devices/system/cpu/cpu0/cpu_hw_events
> > ...
> > /sys/class/event_sources/cpu_hw_eventsN
> > -> /sys/devices/system/cpu/cpuN/cpu_hw_events
> >
> > /sys/class/event_sources/cpu_hw_cache_events0
> > -> /sys/devices/system/cpu/cpu0/cpu_hw_events
> > ...
> > /sys/class/event_sources/cpu_hw_cache_eventsN
> > -> /sys/devices/system/cpu/cpuN/cpu_hw_events
>
> Hmm, good question.
>
> No all the cpus would have the same event sources. I'm not sure if we
> can make sysfs understand that though (added GregKH and Kay to CC).
>
This is something I've been thinking about, too. On SH we have a
large set of perf counter events that are entirely dependent on the
configuration of the CPU they're on, with no requirement that these
configurations are identical on all CPUs in an SMP configuration.
As an example, it's possible to halve the L1 dcache and use that part of
it as a small and fast memory which has completely different events
associated with it from the regular L1 dcache events. These events would
be invalid on a CPU that was running with all cache ways enabled but
might also be valid on other CPUs that bolt these events to an extra SRAM
outside of the cache topology completely.
In any event, the events are at least consistent across all CPUs; it's
only which ones are valid on a given CPU at a given time that can change.
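
One way to picture "same events everywhere, validity differs per CPU" is a shared event list plus a per-CPU bitmask. This is a toy model only: the event names and the `CpuEventValidity` class are hypothetical and do not come from the SH code.

```python
# Hypothetical event names, shared by all CPUs.
EVENTS = ["l1d_hit", "l1d_miss", "fast_mem_access"]


class CpuEventValidity:
    """Track which of a shared set of events is currently valid per CPU."""

    def __init__(self, ncpus, events):
        self.index = {name: i for i, name in enumerate(events)}
        self.masks = [0] * ncpus

    def set_valid(self, cpu, names):
        # Replace the CPU's mask, e.g. after its cache is reconfigured.
        self.masks[cpu] = sum(1 << self.index[n] for n in set(names))

    def is_valid(self, cpu, name):
        return bool((self.masks[cpu] >> self.index[name]) & 1)
```

Because `set_valid` replaces the whole mask, the same model covers a boot-time configuration and a later runtime change.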
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-12 5:51 ` Paul Mundt
@ 2010-05-12 8:37 ` Peter Zijlstra
2010-05-14 7:04 ` Paul Mundt
0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2010-05-12 8:37 UTC (permalink / raw)
To: Paul Mundt
Cc: Lin Ming, Ingo Molnar, Corey Ashford, Frederic Weisbecker,
eranian@gmail.com, Gary.Mohr@Bull.com, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Russell King,
lkml, Arnaldo Carvalho de Melo, Will Deacon, Maynard Johnson,
Carl Love, greg@kroah.com, Kay Sievers
On Wed, 2010-05-12 at 14:51 +0900, Paul Mundt wrote:
> On Tue, May 11, 2010 at 11:48:42AM +0200, Peter Zijlstra wrote:
> > No all the cpus would have the same event sources. I'm not sure if we
> > can make sysfs understand that though (added GregKH and Kay to CC).
> >
> This is something I've been thinking about, too. On SH we have a
> large set of perf counter events that are entirely dependent on the
> configuration of the CPU they're on, with no requirement that these
> configurations are identical on all CPUs in an SMP configuration.
>
> As an example, it's possible to halve the L1 dcache and use that part of
> it as a small and fast memory which has completely different events
> associated with it from the regular L1 dcache events. These events would
> be invalid on a CPU that was running with all cache ways enabled but
> might also be valid on other CPUs that bolt these events to an extra SRAM
> outside of the cache topology completely.
>
> In any event, the events are at least consistent across all CPUs, it's
> only which ones are valid on a given CPU at a given time that can change.
So you're running with asymmetric SMP systems? I really hadn't
considered that. Will this change at runtime or is it a system boot time
thing?
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-12 8:37 ` Peter Zijlstra
@ 2010-05-14 7:04 ` Paul Mundt
0 siblings, 0 replies; 51+ messages in thread
From: Paul Mundt @ 2010-05-14 7:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Lin Ming, Ingo Molnar, Corey Ashford, Frederic Weisbecker,
eranian@gmail.com, Gary.Mohr@Bull.com, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Russell King,
lkml, Arnaldo Carvalho de Melo, Will Deacon, Maynard Johnson,
Carl Love, greg@kroah.com, Kay Sievers
On Wed, May 12, 2010 at 10:37:23AM +0200, Peter Zijlstra wrote:
> On Wed, 2010-05-12 at 14:51 +0900, Paul Mundt wrote:
> > On Tue, May 11, 2010 at 11:48:42AM +0200, Peter Zijlstra wrote:
>
> > > No all the cpus would have the same event sources. I'm not sure if we
> > > can make sysfs understand that though (added GregKH and Kay to CC).
> > >
> > This is something I've been thinking about, too. On SH we have a
> > large set of perf counter events that are entirely dependent on the
> > configuration of the CPU they're on, with no requirement that these
> > configurations are identical on all CPUs in an SMP configuration.
> >
> > As an example, it's possible to halve the L1 dcache and use that part of
> > it as a small and fast memory which has completely different events
> > associated with it from the regular L1 dcache events. These events would
> > be invalid on a CPU that was running with all cache ways enabled but
> > might also be valid on other CPUs that bolt these events to an extra SRAM
> > outside of the cache topology completely.
> >
> > In any event, the events are at least consistent across all CPUs, it's
> > only which ones are valid on a given CPU at a given time that can change.
>
> So you're running with asymmetric SMP systems? I really hadn't
> considered that. Will this change at runtime or is it a system boot time
> thing?
At the moment it's a boot time thing, but we're moving towards runtime
switching via CPU hotplug (which at the moment we primarily use for
runtime power management). This has specifically been a recurring
requirement from some of our automotive customers, so it's gradually
becoming more prevalent.
We also have the multi-core case where multiple architectures are
combined but we still have memory-mapped access to the slave CPU's
performance counters. The SH-Mobile G series behaves this way: it has
both an ARM and an SH core, and it doesn't really matter which one is
running the primary Linux kernel. The slave, on the other hand, might be
running Linux, or it may be running a fixed application that depends on
control and input from the primary Linux-running MPU. Presently we just
tie in through the hardware debug interfaces for monitoring and
controlling the secondary counters, but being able to make this sort of
thing workload granular via perf would obviously be a huge benefit.
Supporting these sorts of configurations is going to take a bit of doing
though, especially on the topology side.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-10 23:13 ` Corey Ashford
2010-05-11 6:46 ` Peter Zijlstra
@ 2010-05-11 10:09 ` stephane eranian
1 sibling, 0 replies; 51+ messages in thread
From: stephane eranian @ 2010-05-11 10:09 UTC (permalink / raw)
To: Corey Ashford
Cc: Ingo Molnar, Peter Zijlstra, Lin Ming, Frederic Weisbecker,
Gary.Mohr@Bull.com, arjan@linux.intel.com, Zhang, Yanmin,
Paul Mackerras, David S. Miller, Russell King, Paul Mundt, lkml,
Arnaldo Carvalho de Melo, Will Deacon, Maynard Johnson, Carl Love
Corey,
On Tue, May 11, 2010 at 1:13 AM, Corey Ashford
<cjashfor@linux.vnet.ibm.com> wrote:
> We'd like to have the capability of hardware-specific symbolic event names in the perf tool by some mechanism, unified or otherwise. Right now, for the IBM Wire-Speed processor, we are currently not able to use the perf tool because of its lack of symbolic raw event name support.
>
> In the mean time, we are using a pair of demo programs from Stephane Eranian's libpfm4 source tree called "task" and "syst". These tools use the symbolic event names provided by libpfm4, and use the kernel support from perf_events.
>
I will soon post a patch to make use of libpfm4 in perf, thereby
giving access to all events, unit masks, and filters using symbolic
names.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-10 11:43 ` Ingo Molnar
2010-05-10 11:49 ` Peter Zijlstra
@ 2010-05-11 14:15 ` Borislav Petkov
2010-05-11 14:25 ` Peter Zijlstra
1 sibling, 1 reply; 51+ messages in thread
From: Borislav Petkov @ 2010-05-11 14:15 UTC (permalink / raw)
To: Ingo Molnar
Cc: Peter Zijlstra, Lin Ming, Frederic Weisbecker, eranian@gmail.com,
Gary.Mohr@Bull.com, Corey Ashford, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Russell King,
Paul Mundt, lkml, Arnaldo Carvalho de Melo
From: Ingo Molnar <mingo@elte.hu>
Date: Mon, May 10, 2010 at 01:43:11PM +0200
Hi all,
> Yeah, we really want a mechanism like this in place instead of continuing with
> the somewhat ad-hoc extensions to the event enumeration space.
>
> One detail: i think we want one more level. Instead of:
>
> /sys/devices/system/node/nodeN/node_events
> node_events/event_source_id
> node_events/local_misses
> /local_hits
> /remote_misses
> /remote_hits
> /...
>
> We want the individual events to be a directory, containing the event_id:
>
> /sys/devices/system/node/nodeN/node_events
> node_events/event_source_id
> node_events/local_misses/event_id
> /local_hits/event_id
> /remote_misses/event_id
> /remote_hits/event_id
> /...
>
> The reason is that we want to keep our options open to add more attributes to
> individual events. (In fact extended attributes already exist for certain
> event classes - such as the 'format' info for tracepoints.)
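
As a rough illustration of how a tool might consume the per-event directories described above: it would read `event_source_id` for `perf_event_attr.type` and the event's `event_id` for `perf_event_attr.config`. The sketch below builds a stand-in tree in a scratch directory. The helper names are invented, and the assumption that each file holds a single decimal or 0x-prefixed integer is mine, not something the thread specifies.

```python
import os
import tempfile


def read_int(path):
    """Read one integer (decimal or 0x-prefixed hex) from a sysfs-style file."""
    with open(path) as f:
        return int(f.read().strip(), 0)


def lookup_event(source_dir, event_name):
    """Map an event directory to a (type, config) pair for perf_event_attr."""
    source_id = read_int(os.path.join(source_dir, "event_source_id"))
    event_id = read_int(os.path.join(source_dir, event_name, "event_id"))
    return source_id, event_id


def make_fake_source(root, source_id, events):
    """Build a stand-in for .../nodeN/node_events with the proposed layout."""
    source_dir = os.path.join(root, "node_events")
    os.makedirs(source_dir)
    with open(os.path.join(source_dir, "event_source_id"), "w") as f:
        f.write(f"{source_id}\n")
    for name, event_id in events.items():
        os.makedirs(os.path.join(source_dir, name))
        with open(os.path.join(source_dir, name, "event_id"), "w") as f:
            f.write(f"{event_id}\n")
    return source_dir
```

Making each event a directory means further attribute files can be added next to `event_id` later without changing this lookup.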
OK, what you guys have so far sounds fine. Here's some more stuff
we should be considering when using the tracepoints (and their
representation in sysfs or whatever) for error reporting.
All the error reporting is done using MCEs so the
MCE should be a raw per cpu event somewhere under
/sys/devices/system/cpu/cpuN/events/raw_cpu_events/ or whatever works for
you.
Another point I have is that MCEs don't need pmus so we should consider
having the ability to decouple events from pmus.
What you basically want to have is a tracepoint which is "persistent,"
as Ingo suggested earlier, and it buffers MCEs occurring at any time
into a ring buffer until a userspace daemon or similar sucks that data
out for processing (critical stuff is handled differently, of course).
And this should work on any x86 hw supporting MCA without hw perf
monitoring features.
Also, we might think about using some of the MCE fields in sysfs for
hardware error injection, similar to how EDAC injects DRAM ECC errors.
This should be straightforward using one attribute like
/sys/devices/system/cpu/cpuN/events/raw_cpu_events/mce/inject_ecc
or similar.
This is mostly what I can come up with now...
--
Regards/Gruss,
Boris.
--
Advanced Micro Devices, Inc.
Operating Systems Research Center
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-11 14:15 ` Borislav Petkov
@ 2010-05-11 14:25 ` Peter Zijlstra
2010-05-11 15:37 ` Borislav Petkov
0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2010-05-11 14:25 UTC (permalink / raw)
To: Borislav Petkov
Cc: Ingo Molnar, Lin Ming, Frederic Weisbecker, eranian@gmail.com,
Gary.Mohr@Bull.com, Corey Ashford, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Russell King,
Paul Mundt, lkml, Arnaldo Carvalho de Melo
On Tue, 2010-05-11 at 16:15 +0200, Borislav Petkov wrote:
> Another point I have is that MCEs don't need pmus so we should consider
> having the ability to decouple events from pmus.
Strictly speaking tracepoints are software events, which run off of a
software 'pmu'. So no, we can't decouple, they need a 'pmu' context.
> What you basically want to have is a tracepoint which is "persistent,"
> as Ingo suggested earlier, and it buffers MCEs occurring at any time
> into a ring buffer until a userspace daemon or similar sucks that data
> out for processing (critical stuff is handled differently, of course).
> And this should work on any x86 hw supporting MCA without hw perf
> monitoring features.
Try building x86 without perf hw support :/
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-11 14:25 ` Peter Zijlstra
@ 2010-05-11 15:37 ` Borislav Petkov
2010-05-11 15:46 ` Peter Zijlstra
0 siblings, 1 reply; 51+ messages in thread
From: Borislav Petkov @ 2010-05-11 15:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Lin Ming, Frederic Weisbecker, eranian@gmail.com,
Gary.Mohr@Bull.com, Corey Ashford, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Russell King,
Paul Mundt, lkml, Arnaldo Carvalho de Melo
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue, May 11, 2010 at 10:25:15AM -0400
> On Tue, 2010-05-11 at 16:15 +0200, Borislav Petkov wrote:
> > Another point I have is that MCEs don't need pmus so we should consider
> > having the ability to decouple events from pmus.
>
> Strictly speaking tracepoints are software events, which run off of a
> software 'pmu'. So no, we can't decouple, they need a 'pmu' context.
We could make this configurable depending on the severity of the error.
I'm guessing that for further event handling through the perf
infrastructure we cannot run without a sw pmu context, but on critical
conditions we need to run as fast and as sparingly as possible. So I'm
thinking of maybe adding some specially tailored callbacks to the MCE
tracepoint trace_mce_record, as Steven suggested.
> > What you basically want to have is a tracepoint which is "persistent,"
> > as Ingo suggested earlier, and it buffers MCEs occurring at any time
> > into a ring buffer until a userspace daemon or similar sucks that data
> > out for processing (critical stuff is handled differently, of course).
> > And this should work on any x86 hw supporting MCA without hw perf
> > monitoring features.
>
> Try building x86 without perf hw support :/
I dunno, maybe decoupling wouldn't be necessary per se but I was simply
pointing out that MCEs shouldn't necessarily depend on the presence of
hardware performance counters.
--
Regards/Gruss,
Boris.
--
Advanced Micro Devices, Inc.
Operating Systems Research Center
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-11 15:37 ` Borislav Petkov
@ 2010-05-11 15:46 ` Peter Zijlstra
0 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2010-05-11 15:46 UTC (permalink / raw)
To: Borislav Petkov
Cc: Ingo Molnar, Lin Ming, Frederic Weisbecker, eranian@gmail.com,
Gary.Mohr@Bull.com, Corey Ashford, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Russell King,
Paul Mundt, lkml, Arnaldo Carvalho de Melo
On Tue, 2010-05-11 at 17:37 +0200, Borislav Petkov wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Tue, May 11, 2010 at 10:25:15AM -0400
>
> > On Tue, 2010-05-11 at 16:15 +0200, Borislav Petkov wrote:
> > > Another point I have is that MCEs don't need pmus so we should consider
> > > having the ability to decouple events from pmus.
> >
> > Strictly speaking tracepoints are software events, which run off of a
> > software 'pmu'. So no, we can't decouple, they need a 'pmu' context.
>
> We could make this configurable depending on the severity of the error.
> I'm guessing for further event handling through the perf infrastructure
> we cannot run without a sw pmu context but on critical conditions
> we need to run as fast and as sparingly as possible so I'm thinking
> maybe adding some specially tailored callbacks to the MCE tracepoint
> trace_mce_record, as Steven suggested.
Well, all the tracepoint stuff should already be NMI-safe (all of perf
events needs to be because the PMI is an NMI) and I think perf as a
whole would like to run as fast as possible, so I don't yet see the need
for special purpose hooks (which I'll try to resist as much as
possible).
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-10 11:27 ` Peter Zijlstra
` (3 preceding siblings ...)
2010-05-10 11:43 ` Ingo Molnar
@ 2010-05-10 23:54 ` Corey Ashford
2010-05-11 6:50 ` Peter Zijlstra
2010-05-11 2:43 ` Lin Ming
5 siblings, 1 reply; 51+ messages in thread
From: Corey Ashford @ 2010-05-10 23:54 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Lin Ming, Ingo Molnar, Frederic Weisbecker, eranian@gmail.com,
Gary.Mohr@Bull.com, arjan@linux.intel.com, Zhang, Yanmin,
Paul Mackerras, David S. Miller, Russell King, Paul Mundt, lkml
On 5/10/2010 4:27 AM, Peter Zijlstra wrote:
> On Mon, 2010-05-10 at 18:26 +0800, Lin Ming wrote:
>
>>> No, I'm assuming there is only 1 PMU per CPU. Corey is the expert on
>>> crazy hardware though,
:-)
>>> but I think the sanest way is to extend the CPU
>>> topology if there's more structure to it.
>>
>> But our goal is to support multiple pmus, don't we need to assume there
>> are more than 1 PMU per CPU?
>
> No, because as I said, then its ambiguous what pmu you want. If you have
> that, you need to extend your topology information.
>
> Anyway, I talked with Ingo on this and he'd like to see this somewhat
> extended.
>
> Instead of a pmu_id field, which we pass into a new
> perf_event_attr::pmu_id field, how about creating an event_source sysfs
> class. Then each class can have an event_source_id and a hierarchy of
> 'generic' events.
>
> We'd start using the PERF_TYPE_ space for this and express the
> PERF_COUNT_ space in the event attributes found inside that class.
>
> That way we can include all the existing event enumerations into this as
> well.
>
> This way we can create:
>
> /sys/devices/system/cpu/cpuN/cpu_hardware_events
> cpu_hardware_events/event_source_id
> cpu_hardware_events/cpu_cycles
> cpu_hardware_events/instructions
> /...
>
> /sys/devices/system/cpu/cpuN/cpu_raw_events
> cpu_raw_events/event_source_id
>
>
> These would match the current PERF_TYPE_* values for compatibility
>
> For new PMUs we can start a dynamic range of PERF_TYPE_ (say at 64k but
> that's not ABI and can be changed at any time, we've got u32 to play
> with).
>
> For uncore this would result in:
>
> /sys/devices/system/node/nodeN/node_raw_events
> node_raw_events/event_source_id
>
> and maybe:
>
> /sys/devices/system/node/nodeN/node_events
> node_events/event_source_id
> node_events/local_misses
> /local_hits
> /remote_misses
> /remote_hits
> /...
Just to give a concrete example, the IBM Wire-Speed Processor has four
AT-"nodes" per chip, each containing four PowerPC cores.
Those four nodes together share a number of nest PMU accelerators, I/O
devices, buses etc. which each have their own PMUs. Further adding to
the structure is that some of the nodes are replicated. For example,
we have two memory controllers, each with a pair of PMUs.
/sys/devices/system/node/node0/
        mem_ctlr0/
                event_source_id
                events/
                        partial_cacheline_read_retried/
                        partial_cacheline_write_retried/
                        ...
        mem_ctlr1/
                event_source_id
                events/
                        partial_cacheline_read_retried/
                        ...
So it's a bit ugly to replicate the event information across identical
pmus, but that can be done via links, without too much memory cost, I
assume.
Does this seem workable?
--
Regards,
- Corey
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-10 23:54 ` Corey Ashford
@ 2010-05-11 6:50 ` Peter Zijlstra
0 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2010-05-11 6:50 UTC (permalink / raw)
To: Corey Ashford
Cc: Lin Ming, Ingo Molnar, Frederic Weisbecker, eranian@gmail.com,
Gary.Mohr@Bull.com, arjan@linux.intel.com, Zhang, Yanmin,
Paul Mackerras, David S. Miller, Russell King, Paul Mundt, lkml
On Mon, 2010-05-10 at 16:54 -0700, Corey Ashford wrote:
>
> Just to give a concrete example, the IBM Wire-Speed Processor has four
> AT-"nodes" per chip, each containing four PowerPC cores.
>
> Those four nodes together share a number of nest PMU accelerators, I/O
> devices, buses etc. which each have their own PMUs. Further adding to
> the structure is that some of the nodes are replicated. For example,
> we have two memory controllers, each with a pair of PMUs.
>
> /sys/devices/system/node/node0/mem_ctlr0/
> event_source_id
> events/
> partial_cacheline_read_retried/
> partial_cacheline_write_retried/
> ...
> mem_ctlr1/
> event_source_id
> events/
> partial_cacheline_read_retried/
> ...
>
> So it's a bit ugly to replicate the event information across identical
> pmus, but that can be done via links, without too much memory cost, I
> assume.
>
> Does this seem workable?
If you really have two memory controllers per node, I guess so. Sounds
strange to me though, typically a memory controller is the node
boundary.
But like I said in the other email, use a 1:n pmu:event_source ratio and
simply stick them in the machine/device topology wherever they belong.
If that ends up with multiple PMUs at one particular level, so be it.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-10 11:27 ` Peter Zijlstra
` (4 preceding siblings ...)
2010-05-10 23:54 ` Corey Ashford
@ 2010-05-11 2:43 ` Lin Ming
2010-05-11 6:35 ` Peter Zijlstra
5 siblings, 1 reply; 51+ messages in thread
From: Lin Ming @ 2010-05-11 2:43 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Frederic Weisbecker, eranian@gmail.com,
Gary.Mohr@Bull.com, Corey Ashford, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Russell King,
Paul Mundt, lkml
On Mon, 2010-05-10 at 19:27 +0800, Peter Zijlstra wrote:
> On Mon, 2010-05-10 at 18:26 +0800, Lin Ming wrote:
>
> > > No, I'm assuming there is only 1 PMU per CPU. Corey is the expert on
> > > crazy hardware though, but I think the sanest way is to extend the CPU
> > > topology if there's more structure to it.
> >
> > But our goal is to support multiple pmus, don't we need to assume there
> > are more than 1 PMU per CPU?
>
> No, because as I said, then its ambiguous what pmu you want. If you have
> that, you need to extend your topology information.
>
> Anyway, I talked with Ingo on this and he'd like to see this somewhat
> extended.
>
> Instead of a pmu_id field, which we pass into a new
> perf_event_attr::pmu_id field, how about creating an event_source sysfs
> class. Then each class can have an event_source_id and a hierarchy of
> 'generic' events.
>
> We'd start using the PERF_TYPE_ space for this and express the
> PERF_COUNT_ space in the event attributes found inside that class.
>
> That way we can include all the existing event enumerations into this as
> well.
>
> This way we can create:
>
> /sys/devices/system/cpu/cpuN/cpu_hardware_events
> cpu_hardware_events/event_source_id
> cpu_hardware_events/cpu_cycles
> cpu_hardware_events/instructions
> /...
>
> /sys/devices/system/cpu/cpuN/cpu_raw_events
> cpu_raw_events/event_source_id
>
>
> These would match the current PERF_TYPE_* values for compatibility
>
> For new PMUs we can start a dynamic range of PERF_TYPE_ (say at 64k but
> that's not ABI and can be changed at any time, we've got u32 to play
> with).
>
> For uncore this would result in:
>
> /sys/devices/system/node/nodeN/node_raw_events
> node_raw_events/event_source_id
>
> and maybe:
>
> /sys/devices/system/node/nodeN/node_events
> node_events/event_source_id
> node_events/local_misses
> /local_hits
> /remote_misses
> /remote_hits
> /...
>
>
> The software events and tracepoints and kprobes stuff we could hang off
> of /sys/kernel/ or something
>
> So your registration would indeed look like something:
>
> perf_event_register_pmu(struct pmu *pmu, int type),
>
> where type would normally be -1 (dynamic) but would be PERF_TYPE_ for
> those already laid down in ABI.
>
> This approach will also give us a good overview
> in /sys/class/event_source/, which will be a flat listing of all
> existing event sources.
>
> Does this make sense?
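
To make the registration idea quoted above concrete, here is a toy user-space model of the id allocation: fixed ids for the existing PERF_TYPE_* values, and a dynamic range for everything registered with -1. It is a sketch, not the kernel interface. The `EventSourceRegistry` class is hypothetical, and the 64k base is taken from the "say at 64k" remark, which the proposal explicitly says is not ABI.

```python
# Existing ABI values from include/linux/perf_event.h.
PERF_TYPE_HARDWARE = 0
PERF_TYPE_RAW = 4

DYNAMIC_TYPE_BASE = 64 * 1024  # "say at 64k"; not ABI in the proposal
U32_MAX = 2**32 - 1


class EventSourceRegistry:
    """Toy model of perf_event_register_pmu(pmu, type) id allocation."""

    def __init__(self):
        self._by_type = {}
        self._next_dynamic = DYNAMIC_TYPE_BASE

    def register(self, name, type_id=-1):
        """Register an event source; type_id == -1 asks for a dynamic id."""
        if type_id == -1:
            type_id = self._next_dynamic
            if type_id > U32_MAX:
                raise OverflowError("dynamic type space exhausted")
            self._next_dynamic += 1
        elif not 0 <= type_id < DYNAMIC_TYPE_BASE:
            raise ValueError("fixed type ids must lie below the dynamic range")
        if type_id in self._by_type:
            raise ValueError(f"type {type_id} already registered")
        self._by_type[type_id] = name
        return type_id

    def flat_listing(self):
        """Analogue of the flat /sys/class/event_source/ overview."""
        return sorted(self._by_type.items())
```

Keeping the fixed ids strictly below the dynamic base means the two ranges cannot collide, which is what lets the existing PERF_TYPE_* values stay stable for compatibility.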
Thanks for the idea.
Give me some time to get a clear understanding of the ideas from you and
others.
And then I'll work out a patch as soon as possible.
Lin Ming
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC][PATCH 3/9] perf: export registerred pmus via sysfs
2010-05-11 2:43 ` Lin Ming
@ 2010-05-11 6:35 ` Peter Zijlstra
0 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2010-05-11 6:35 UTC (permalink / raw)
To: Lin Ming
Cc: Ingo Molnar, Frederic Weisbecker, eranian@gmail.com,
Gary.Mohr@Bull.com, Corey Ashford, arjan@linux.intel.com,
Zhang, Yanmin, Paul Mackerras, David S. Miller, Russell King,
Paul Mundt, lkml
On Tue, 2010-05-11 at 10:43 +0800, Lin Ming wrote:
>
> Give me some time to get a clear understanding of the ideas from you and
> others.
Of course, no hurry!
> And then I'll work out a patch as soon as possible.
Thanks for working on this!
^ permalink raw reply [flat|nested] 51+ messages in thread