comments on Performance Counters for Linux (PCL)

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* comments on Performance Counters for Linux (PCL)
@ 2009-05-28 14:58 stephane eranian
  2009-05-28 16:25 ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: stephane eranian @ 2009-05-28 14:58 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Thomas Gleixner, Ingo Molnar, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

Hi,

The following sections are some preliminary comments concerning the
Performance Counter for Linux (PCL) API and implementation proposal
currently in development.

S.Eranian
eranian@gmail.com

I/ General API comments

  1/ Data structures

     * struct perf_counter_hw_event

     - I think this structure will be used to enable non-counting features,
       e.g. IBS. Name is awkward. It is used to enable non-hardware events
       (sw events). Why not call it: struct perf_event

     - uint64_t config

       Why use a single field to encode event type and its encoding? By design,
       the syscall is not in the critical path. Why not spell things out
       clearly: int type, uint64_t code.

     - uint64_t irq_period

       IRQ is an x86 related name. Why not use smpl_period instead?

     - uint32_t record_type

       This field is a bitmask. I believe 32-bit is too small to accommodate
       future record formats.

     - uint32_t read_format

       Ditto.

     - uint64_t nmi

       This is an X86-only feature. Why make this visible in a generic API?

       What is the semantic of this?

       I cannot have one counter use NMI and another not use NMI or are you
       planning on switching the interrupt vector when you change event groups?

       Why do I need to be a priviledged user to enable NMI? Especially given
       that:
               - non-privileged users can monitor at privilege level 0 (kernel).
               - there is interrupt throttling

     - uint64_t exclude_*

       It seems those fields were added to support the generic HW events. But
       I think they are confusing and their semantic is not quite clear.

       Furthermore, aren't they irrelevant for the SW events?

       What is the meaning of exclude_user? Which priv levels are actually
       excluded?

       Take Itanium, it has 4 priv levels and the PMU counters can monitor at
       any priv levels or combination thereof?

       When programming raw HW events, the priv level filtering is typically
       already included. Which setting has priority, raw encoding or the
       exclude_*?

       Looking at the existing X86 implementation, it seems exclude_* can
       override whatever is set in the raw event code.

       For any events, but in particular, SW events, why not encode this in
       the config field, like it is for a raw HW event?

     - mmap, munmap, comm

       It is not clear to me why those fields are defined here rather than as
       PERF_RECORD_*. They are stored in the event buffer only. They are only
       useful when sampling.

       It is not clear why you have mmap and munmap as separate options.
       What's the point of munmap-only notification?

     * enum perf_event_types vs. enum perf_event_type

       Both names are too close to each other, yet they define unrelated data
       structures. This is very confusing.

     * struct perf_counter_mmap_page

       The definition of data_head precludes sampling buffers bigger that 4GB.

       Does that makes sense on TB machines?

       Given there is only one counter per-page, there is an awful lot of
       precious RLIMIT_MEMLOCK space wasted for this.

       Typically, if you are self-sampling, you are not going to read the
       current value of the sampling period. That re-mapping trick is only
       useful when counting.

       Why not make these two separate mappings (using the mmap offset as
       the indicator)?

       With this approach, you would get one page back per sampling period
       and that page could then be used for the actual samples.

 2/ System calls

     * ioctl()

       You have defined 3 ioctls() so far to operate on an existing event.
       I was under the impression that ioctl() should not be used except for
       drivers.

     * prctl()

       The API is event-based. Each event gets a file descriptor. Events are
       therefore managed individually. Thus, to enable/disable, you need to
       enable/disable each one separately.

       The use of prctl() breaks this design choice. It is not clear what you
       are actually enabling. It looks like you are enabling all the counters
       attached to the thread. This is incorrect. With your implementation,
       the PMU can be shared between competing users. In particular, multiple
       tools may be monitoring the same thread. Now, imagine, a tool is
       monitoring a self-monitoring thread which happens to start/stop its
       measurement using prctl(). Then, that would also start/stop the
       measurement of the external tool. I have verified that this is what is
       actually happening.

       I believe this call is bogus and it should be eliminated. The interface
       is exposing events individually therefore they should be controlled
       individually.

 3/ Counter width

       It is not clear whether or not the API exposes counters as 64-bit wide
       on PMUs which do not implement 64-bit wide counters.

       Both irq_period and read() return 64-bit integers. However, it appears
       that the implementation is not using all the bits. In fact, on X86, it
       appears the irq_period is truncated silently. I believe this is not
       correct. If the period is not valid, an error should be returned.
       Otherwise, the tool will be getting samples at a rate different than
       what it requested.

       I would assume that on the read() side, counts are accumulated as
       64-bit integers. But if it is the case, then it seems there is an
       asymmetry between period and counts.

       Given that your API is high level, I don't think tools should have to
       worry about the actual width of a counter. This is especially true
       because they don't know which counters the event is going to go into
       and if I recall correctly, on some PMU models, different counters can
       have different width (Power, I think).

       It is rather convenient for tools to always manipulate counters as
       64-bit integers. You should provide a consistent view between counts
       and periods.

 4/ Grouping

       By design, an event can only be part of one group at a time. Events in
       a group are guaranteed to be active on the PMU at the same time. That
       means a group cannot have more events than there are available counters
       on the PMU. Tools may want to know the number of counters available in
       order to group their events accordingly, such that reliable ratios
       could be computed. It seems the only way to know this is by trial and
       error. This is not practical.

 5/ Multiplexing and scaling

       The PMU can be shared by multiple programs each controlling a variable
       number of events. Multiplexing occurs by default unless pinned is
       requested. The exclusive option only guarantees the group does not
       share the PMU with other groups while it is active, at least this is
       my understanding.

       By default, you may be multiplexed and if that happens you cannot know
       unless you request the timing information as part of the read_format.
       Without it, and if multiplexing has occurred, bogus counts may be
       returned with no indication whatsoever.

       To avoid returning misleading information, it seems like the API should
       refuse to open a non-pinned event which does not have
       PERF_FORMAT_TOTAL_TIME_ENABLED|PERF_FORMAT_TOTAL_TIME_RUNNING in the
       read_format. This would avoid a lot of confusion down the road.

 7/ Multiplexing and system-wide

       Multiplexing is time-based and it is hooked into the timer tick. At
       every tick, the kernel tries to schedule another group of events.

       In tickless kernels if a CPU is idle, no timer tick is generated,
       therefore no multiplexing occurs. This is incorrect. It's not because
       the CPU is idle, that there aren't any interesting PMU events to measure.
       Parts of the CPU may still be active, e.g., caches and buses. And thus,
       it is expected that multiplexing still happens.

       You need to hook up the timer source for multiplexing to something else
       which is not affected by tickless.

 8/ Controlling group multiplexing

       Although, multiplexing is somehow exposed to user via the timing
       information.  I believe there is not enough control. I know of advanced
       monitoring tools which needs to measure over a dozen events in one
       monitoring session. Given that the underlying PMU does not have enough
       counters OR that certain events cannot be measured together, it is
       necessary to split the events into groups and multiplex them. Events
       are not grouped at random AND groups are not ordered at random either.
       The sequence of groups is carefully chosen such that related events are
       in neighboring groups such that they measure similar parts of the
       execution.  This way you can mitigate the fluctuations introduced by
       multiplexing and compare ratios. In other words, some tools may want to
       control the order in which groups are scheduled on the PMU.

       The exclusive flag ensures correct grouping. But there is nothing to
       control ordering of groups.  That is a problem for some tools. Groups
       from different 'session' may be interleaved and break the continuity of
        measurement.

       The group ordering has to be controllable from the tools OR must be
       fully specified by the API. But it should not be a property of the
       implementation. The API could for instance specify that groups are
       scheduled in increasing order of the group leaders' file descriptor.
       There needs to be some way of preventing interleaving of groups from
       different 'sessions'.

 9/ Event buffer

       There is a kernel level event buffer which can be re-mapped read-only at
       the user level via mmap(). The buffer must be a multiple of page size
       and must be at least 2-page long. The First page is used for the
       counter re-mapping and buffer header, the second for the actual event
       buffer.

       The buffer is managed as a cyclic buffer. That means there is a
       continuous race between the tool and the kernel. The tool must parse
       the buffer faster than the kernel can fill it out. It is important to
       realize that the race continues even when monitoring is stopped, as non
       PMU-based infos keep being stored, such as mmap, munmap. This is
       expected because it is not possible to lose mapping information
       otherwise invalid correlation of samples may happen.

       However, there is currently no reliable way of figuring out whether or
       not the buffer has wrapped around since the last scan by the tool. Just
       checking the current position or estimating the space left is not good
       enough. There ought to be an overflow counter of some sort indicating
       the number of times the head wrapped around.

  10/ Group event buffer entry

       This is activated by setting the PERF_RECORD_GROUP in the record_type
       field.  With this bit set, the values of the other members of the
       group are stored sequentially in the buffer. To help figure out which
       value corresponds to which event, the current implementation also
       stores the raw encoding of the event.

       The event encoding does not help figure out which event the value refers
       to. There can be multiple events with the same code. This does fit the
       API model where events are identified by file descriptors.

       The file descriptor must be provided and not the raw encoding.

  11/ reserve_percpu

       There are more than counters on many PMU models. Counters are not
       symmetrical even on X86.

       What does this API actually guarantees in terms on what events a tool
       will be able to measure with the reserved counters?

II/ X86 comments

  Mostly implementation related comments in this section.

  1/ Fixed counter and event on Intel

       You cannot simply fall back to generic counters if you cannot find
       a fixed counter. There are model-specific bugs, for instance
       UNHALTED_REFERENCE_CYCLES (0x013c), does not measure the same thing on
       Nehalem when it is used in fixed counter 2 or a generic counter. The
       same is true on Core.

       You cannot simply look at the event field code to determine whether
       this is an event supported by a fixed counters. You must look at the
       other fields such as edge, invert, cnt-mask. If those are present then
       you have to fall back to using a generic counter as fixed counters only
       support priv level filtering. As indicated above, though, the
       programming UNHALTED_REFERENCE_CYCLES on a generic counter does not
       count the same thing, therefore you need to fail is filters other than
       priv levels are present on this event.

  2/ Event knowledge missing

       There are constraints and bugs on some events in Intel Core and Nehalem.
       In your model, those need to be taken care of by the kernel. Should the
       kernel make the wrong decision, there would be no work-around for user
       tools. Take the example I outlined just above with Intel fixed counters.

       Constraints do exist on AMD64 processors as well.

  3/ Interrupt throttling

       There is apparently no way for a system admin to set the threshold. It
       is hardcoded.

       Throttling occurs without the tool(s) knowing. I think this is a problem.

   4/ NMI

       Why restrict NMI to privileged users when you have throttling to protect
       against interrupt flooding?

       Are you trying to restrict non privileged users from getting sampling
       inside kernel critical sections?

III/ Requests

  1/ Sampling period change

       As it stands today, it seems there is no way to change a period but to
       close() the event file descriptor and start over. When you close the
       group leader, it is not clear to me what happens to the remaining events.

       I know of tools which want to adjust the sampling period based on the
       number of samples they get per second.

       By design, your perf_counter_open() should not really be in the
       critical path, e.g., when you are processing samples from the event
       buffer. Thus, I think it would be good to have a dedicated call to
       allow changing the period.

  2/ Sampling period randomization

       It is our experience (on Itanium, for instance), that for certain
       sampling measurements, it is beneficial to randomize the sampling
       period a bit. This is in particular the case when sampling on an
       event that happens very frequently and which is not related to
       timing, e.g., branch_instructions_retired. Randomization helps mitigate
       the bias. You do not need anything sophisticated. But when you are using
       a kernel-level sampling buffer, you need to have to kernel randomize.
       Randomization needs to be supported per event.

  3/ Group multiplexing ordering

       As mentioned above, the ordering of group multiplexing for one process
       needs to be either specified by the API or controllable by users.

IV/ Open questions

  1/ Support for model-specific uncore PMU monitoring capabilities

       Recent processors have multiple PMUs. Typically one per core and but
       also one at the socket level, e.g., Intel Nehalem. It is expected that
       this API will provide access to these PMU as well.

       It seems like with the current API, raw events for those PMUs would need
       a new architecture-specific type as the event encoding by itself may
       not be enough to disambiguate between a core and uncore PMU event.

       How are those events going to be supported?

  2/ Features impacting all counters

       On some PMU models, e.g., Itanium, they are certain features which have
       an influence on all counters that are active. For instance, there is a
       way to restrict monitoring to a range of continuous code or data
       addresses using both some PMU registers and the debug registers.

       Given that the API exposes events (counters) as independent of each
       other, I wonder how range restriction could be implemented.

       Similarly, on Itanium, there are global behaviors. For instance, on
       counter overflow the entire PMU freezes all at once. That seems to be
       contradictory with the design of the API which creates the illusion of
       independence.

       What solutions do you propose?

  3/ AMD IBS

       How is AMD IBS going to be implemented?

       IBS has two separate sets of registers. One to capture fetch related
       data and another one to capture instruction execution data. For each,
       there is one config register but multiple data registers. In each mode,
       there is a specific sampling period and IBS can interrupt.

       It looks like you could define two pseudo events or event types and then
       define a new record_format and read_format.  That formats would only be
       valid for an IBS event.

       Is that how you intend to support IBS?

  4/ Intel PEBS

       Since Netburst-based processors, Intel PMUs support a hardware sampling
       buffer mechanism called PEBS.

       PEBS really became useful with Nehalem.

       Not all events support PEBS. Up until Nehalem, only one counter supported
       PEBS (PMC0). The format of the hardware buffer has changed between Core
       and Nehalem. It is not yet architected, thus it can still evolve with
       future PMU models.

       On Nehalem, there is a new PEBS-based feature called Load Latency
       Filtering which captures where data cache misses occur
       (similar to Itanium D-EAR). Activating this feature requires setting a
       latency threshold hosted in a separate PMU MSR.

       On Nehalem, given that all 4 generic counters support PEBS, the
       sampling buffer may contain samples generated by any of the 4 counters.
       The buffer includes a bitmask of registers to determine the source
       of the samples. Multiple bits may be set in the bitmask.

       How PEBS will be supported for this new API?

  5/ Intel Last Branch Record (LBR)

       Intel processors since Netburst have a cyclic buffer hosted in
       registers which can record taken branches. Each taken branch is stored
       into a pair of LBR registers (source, destination). Up until Nehalem,
       there was not filtering capabilities for LBR. LBR is not an architected
       PMU feature.

       There is no counter associated with LBR. Nehalem has a LBR_SELECT MSR.
       However there are some constraints on it given it is shared by threads.

       LBR is only useful when sampling and therefore must be combined with a
       counter. LBR must also be configured to freeze on PMU interrupt.

       How is LBR going to be supported?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: comments on Performance Counters for Linux (PCL)
  2009-05-28 14:58 comments on Performance Counters for Linux (PCL) stephane eranian
@ 2009-05-28 16:25 ` Peter Zijlstra
  2009-05-28 16:47   ` Peter Zijlstra
                     ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Peter Zijlstra @ 2009-05-28 16:25 UTC (permalink / raw)
  To: eranian
  Cc: LKML, Andrew Morton, Thomas Gleixner, Ingo Molnar, Robert Richter,
	Paul Mackerras, Andi Kleen, Maynard Johnson, Carl Love,
	Corey J Ashford, Philip Mucci, Dan Terpstra, perfmon2-devel

On Thu, 2009-05-28 at 16:58 +0200, stephane eranian wrote:
> Hi,
> 
> The following sections are some preliminary comments concerning the
> Performance Counter for Linux (PCL) API and implementation proposal
> currently in development.

Thanks for your input, some answers below.

~hth

> I/ General API comments
> 
>   1/ Data structures
> 
>      * struct perf_counter_hw_event
> 
>      - I think this structure will be used to enable non-counting features,
>        e.g. IBS. Name is awkward. It is used to enable non-hardware events
>        (sw events). Why not call it: struct perf_event

Sure a name change might be a good idea.

>      - uint64_t config
> 
>        Why use a single field to encode event type and its encoding? By design,
>        the syscall is not in the critical path. Why not spell things out
>        clearly: int type, uint64_t code.

Because its convenient to be able to identify a counter in a single u64.

>      - uint64_t irq_period
> 
>        IRQ is an x86 related name. Why not use smpl_period instead?

don't really care, but IRQ seems used throughout linux, we could name
the thing interrupt or sample period.

>      - uint32_t record_type
> 
>        This field is a bitmask. I believe 32-bit is too small to accommodate
>        future record formats.

It currently controls 8 aspects of the overflow entry, do you really
forsee the need for more than 32?

>      - uint32_t read_format
> 
>        Ditto.

I guess it doesn't hurt extending them..

>      - uint64_t nmi

Its no a u64, but a single bit, also, its dead and about to be removed.

>      - uint64_t exclude_*
> 
>        It seems those fields were added to support the generic HW events. But
>        I think they are confusing and their semantic is not quite clear.
> 
>        Furthermore, aren't they irrelevant for the SW events?

They are.

>        What is the meaning of exclude_user? Which priv levels are actually
>        excluded?

userspace

>        Take Itanium, it has 4 priv levels and the PMU counters can monitor at
>        any priv levels or combination thereof?

x86 has more priv rings too, but linux only uses 2, kernel (ring 0) and
user (ring 2 iirc). Does ia64 expose more than 2 priv levels in linux?

>        When programming raw HW events, the priv level filtering is typically
>        already included. Which setting has priority, raw encoding or the
>        exclude_*?
> 
>        Looking at the existing X86 implementation, it seems exclude_* can
>        override whatever is set in the raw event code.
> 
>        For any events, but in particular, SW events, why not encode this in
>        the config field, like it is for a raw HW event?

Because config is about what we count, this is about where we count.
Doesn't seem strange to separate these two.

>      - mmap, munmap, comm
> 
>        It is not clear to me why those fields are defined here rather than as
>        PERF_RECORD_*. They are stored in the event buffer only. They are only
>        useful when sampling.
> 
>        It is not clear why you have mmap and munmap as separate options.
>        What's the point of munmap-only notification?

You only need one, either mmap or munmap, its up to the application
which one it prefers, both work too.

They are not part of PERF_RECORD_ because MMAP/MUNMAP/COMM events are
unrelated to counter overflow.

>      * enum perf_event_types vs. enum perf_event_type
> 
>        Both names are too close to each other, yet they define unrelated data
>        structures. This is very confusing.

OK.

>      * struct perf_counter_mmap_page
> 
>        The definition of data_head precludes sampling buffers bigger that 4GB.
> 
>        Does that makes sense on TB machines?

Dunno, 4G seems an awefull lot to me -- but if someone really wants more
I guess we can grow the field.

>        Given there is only one counter per-page, there is an awful lot of
>        precious RLIMIT_MEMLOCK space wasted for this.
> 
>        Typically, if you are self-sampling, you are not going to read the
>        current value of the sampling period. That re-mapping trick is only
>        useful when counting.
> 
>        Why not make these two separate mappings (using the mmap offset as
>        the indicator)?
> 
>        With this approach, you would get one page back per sampling period
>        and that page could then be used for the actual samples.

Not quite, you still need a place for the data_head.

>  2/ System calls
> 
>      * ioctl()
> 
>        You have defined 3 ioctls() so far to operate on an existing event.
>        I was under the impression that ioctl() should not be used except for
>        drivers.

4 actually.

>      * prctl()
> 
>        The API is event-based. Each event gets a file descriptor. Events are
>        therefore managed individually. Thus, to enable/disable, you need to
>        enable/disable each one separately.
> 
>        The use of prctl() breaks this design choice. It is not clear what you
>        are actually enabling. It looks like you are enabling all the counters
>        attached to the thread. This is incorrect. With your implementation,
>        the PMU can be shared between competing users. In particular, multiple
>        tools may be monitoring the same thread. Now, imagine, a tool is
>        monitoring a self-monitoring thread which happens to start/stop its
>        measurement using prctl(). Then, that would also start/stop the
>        measurement of the external tool. I have verified that this is what is
>        actually happening.

Recently changed that, it enables/disables all counters created by the
task calling prctl().

>        I believe this call is bogus and it should be eliminated. The interface
>        is exposing events individually therefore they should be controlled
>        individually.

Bogus maybe not, superfluous, yeah, its a simpler way than iterating all
the fds you just created, saves a few syscalls.

If that justifies its existance,. I dunno.

>  3/ Counter width
> 
>        It is not clear whether or not the API exposes counters as 64-bit wide
>        on PMUs which do not implement 64-bit wide counters.
> 
>        Both irq_period and read() return 64-bit integers. However, it appears
>        that the implementation is not using all the bits. In fact, on X86, it
>        appears the irq_period is truncated silently. I believe this is not
>        correct. If the period is not valid, an error should be returned.
>        Otherwise, the tool will be getting samples at a rate different than
>        what it requested.

Sure, fail creation when the specified period is larger than the
supported counter width -- patch welcome.

>        I would assume that on the read() side, counts are accumulated as
>        64-bit integers. But if it is the case, then it seems there is an
>        asymmetry between period and counts.
> 
>        Given that your API is high level, I don't think tools should have to
>        worry about the actual width of a counter. This is especially true
>        because they don't know which counters the event is going to go into
>        and if I recall correctly, on some PMU models, different counters can
>        have different width (Power, I think).
> 
>        It is rather convenient for tools to always manipulate counters as
>        64-bit integers. You should provide a consistent view between counts
>        and periods.

So you're suggesting to artificually strech periods by say composing a
single overflow from smaller ones, ignoring the intermediate overflow
events?

That sounds doable, again, patch welcome.

>  4/ Grouping
> 
>        By design, an event can only be part of one group at a time. Events in
>        a group are guaranteed to be active on the PMU at the same time. That
>        means a group cannot have more events than there are available counters
>        on the PMU. Tools may want to know the number of counters available in
>        order to group their events accordingly, such that reliable ratios
>        could be computed. It seems the only way to know this is by trial and
>        error. This is not practical.

Got a proposal to ammend this?

>  5/ Multiplexing and scaling
> 
>        The PMU can be shared by multiple programs each controlling a variable
>        number of events. Multiplexing occurs by default unless pinned is
>        requested. The exclusive option only guarantees the group does not
>        share the PMU with other groups while it is active, at least this is
>        my understanding.

We have pinned and exclusive. pinned means always on the PMU, exclusive
means when on the PMU no-one else can be.

>        By default, you may be multiplexed and if that happens you cannot know
>        unless you request the timing information as part of the read_format.
>        Without it, and if multiplexing has occurred, bogus counts may be
>        returned with no indication whatsoever.

I don't see the problem, you knew they could get multiplexes, yet you
didn't ask for the information needed to extrapolate the information,
sounds like you get what you aksed for.

>        To avoid returning misleading information, it seems like the API should
>        refuse to open a non-pinned event which does not have
>        PERF_FORMAT_TOTAL_TIME_ENABLED|PERF_FORMAT_TOTAL_TIME_RUNNING in the
>        read_format. This would avoid a lot of confusion down the road.

I prefer to give people rope and tell them how to tie the knot.

>  7/ Multiplexing and system-wide
> 
>        Multiplexing is time-based and it is hooked into the timer tick. At
>        every tick, the kernel tries to schedule another group of events.
> 
>        In tickless kernels if a CPU is idle, no timer tick is generated,
>        therefore no multiplexing occurs. This is incorrect. It's not because
>        the CPU is idle, that there aren't any interesting PMU events to measure.
>        Parts of the CPU may still be active, e.g., caches and buses. And thus,
>        it is expected that multiplexing still happens.
> 
>        You need to hook up the timer source for multiplexing to something else
>        which is not affected by tickless.

Or inhibit nohz when there are active counters, but good point.

>  8/ Controlling group multiplexing
> 
>        Although, multiplexing is somehow exposed to user via the timing
>        information.  I believe there is not enough control. I know of advanced
>        monitoring tools which needs to measure over a dozen events in one
>        monitoring session. Given that the underlying PMU does not have enough
>        counters OR that certain events cannot be measured together, it is
>        necessary to split the events into groups and multiplex them. Events
>        are not grouped at random AND groups are not ordered at random either.
>        The sequence of groups is carefully chosen such that related events are
>        in neighboring groups such that they measure similar parts of the
>        execution.  This way you can mitigate the fluctuations introduced by
>        multiplexing and compare ratios. In other words, some tools may want to
>        control the order in which groups are scheduled on the PMU.

Current code RR groups in the order they are created, is more control
needed?

>        The exclusive flag ensures correct grouping. But there is nothing to
>        control ordering of groups.  That is a problem for some tools. Groups
>        from different 'session' may be interleaved and break the continuity of
>         measurement.

I'm not sure I understand the 'session' bit.

>        The group ordering has to be controllable from the tools OR must be
>        fully specified by the API. But it should not be a property of the
>        implementation. The API could for instance specify that groups are
>        scheduled in increasing order of the group leaders' file descriptor.
>        There needs to be some way of preventing interleaving of groups from
>        different 'sessions'.

If we specify we maintain order of creation, would that suffice?

>  9/ Event buffer
> 
>        There is a kernel level event buffer which can be re-mapped read-only at
>        the user level via mmap(). The buffer must be a multiple of page size

2^n actually

>        and must be at least 2-page long. The First page is used for the
>        counter re-mapping and buffer header, the second for the actual event
>        buffer.

I think a single data page is valid too (2^0=1).

>        The buffer is managed as a cyclic buffer. That means there is a
>        continuous race between the tool and the kernel. The tool must parse
>        the buffer faster than the kernel can fill it out. It is important to
>        realize that the race continues even when monitoring is stopped, as non
>        PMU-based infos keep being stored, such as mmap, munmap. This is
>        expected because it is not possible to lose mapping information
>        otherwise invalid correlation of samples may happen.
> 
>        However, there is currently no reliable way of figuring out whether or
>        not the buffer has wrapped around since the last scan by the tool. Just
>        checking the current position or estimating the space left is not good
>        enough. There ought to be an overflow counter of some sort indicating
>        the number of times the head wrapped around.

data_head wraps on u32, that is, it increments past the mmap'ed data
window, and only the lower n+page_shift bits signify the position in the
buffer (hence the 2^n pages requirement).

This means that if userspace remembers the last read position, and all
the data was overwritten since last time, we'd be able to tell.

Suppose we have mapped 4 pages (of page size 4k), that means our buffer
position would be the lower 14 bits of data_head.

Now say the last observed position was:

  0x00003458 (& 0x00003FFF == 0x3458)

and the next observed position is:

  0x00013458 (& 0x00003FFF == 0x3458)

We'd still be able to tell we overflowed 8 times.

Does this suffice?

[ assuming you map less than 4G of course, otherwise there's no spare
  bits left ]

>   10/ Group event buffer entry
> 
>        This is activated by setting the PERF_RECORD_GROUP in the record_type
>        field.  With this bit set, the values of the other members of the
>        group are stored sequentially in the buffer. To help figure out which
>        value corresponds to which event, the current implementation also
>        stores the raw encoding of the event.
> 
>        The event encoding does not help figure out which event the value refers
>        to. There can be multiple events with the same code. This does fit the
>        API model where events are identified by file descriptors.
> 
>        The file descriptor must be provided and not the raw encoding.

OK, sounds sensible.

>   11/ reserve_percpu
> 
>        There are more than counters on many PMU models. Counters are not
>        symmetrical even on X86.
> 
>        What does this API actually guarantees in terms on what events a tool
>        will be able to measure with the reserved counters?
> 
> II/ X86 comments
> 
>   Mostly implementation related comments in this section.
> 
>   1/ Fixed counter and event on Intel
> 
>        You cannot simply fall back to generic counters if you cannot find
>        a fixed counter. There are model-specific bugs, for instance
>        UNHALTED_REFERENCE_CYCLES (0x013c), does not measure the same thing on
>        Nehalem when it is used in fixed counter 2 or a generic counter. The
>        same is true on Core.
> 
>        You cannot simply look at the event field code to determine whether
>        this is an event supported by a fixed counters. You must look at the
>        other fields such as edge, invert, cnt-mask. If those are present then
>        you have to fall back to using a generic counter as fixed counters only
>        support priv level filtering. As indicated above, though, the
>        programming UNHALTED_REFERENCE_CYCLES on a generic counter does not
>        count the same thing, therefore you need to fail is filters other than
>        priv levels are present on this event.
> 
>   2/ Event knowledge missing
> 
>        There are constraints and bugs on some events in Intel Core and Nehalem.
>        In your model, those need to be taken care of by the kernel. Should the
>        kernel make the wrong decision, there would be no work-around for user
>        tools. Take the example I outlined just above with Intel fixed counters.
> 
>        Constraints do exist on AMD64 processors as well..

Good thing updating the kernel is so easy ;-)

>   3/ Interrupt throttling
> 
>        There is apparently no way for a system admin to set the threshold. It
>        is hardcoded.
> 
>        Throttling occurs without the tool(s) knowing. I think this is a problem.

Fixed, it has a sysctl now, is in generic code and emits timestamped
throttle/unthrottle events to the data stream, Power also implemented
the needed bits.

>    4/ NMI
> 
>        Why restrict NMI to privileged users when you have throttling to protect
>        against interrupt flooding?
> 
>        Are you trying to restrict non privileged users from getting sampling
>        inside kernel critical sections?

On its way to being fixed. Everything is NMI now only the code needs
cleaning up.

> III/ Requests
> 
>   1/ Sampling period change
> 
>        As it stands today, it seems there is no way to change a period but to
>        close() the event file descriptor and start over.. When you close the
>        group leader, it is not clear to me what happens to the remaining events.

The other events will be 'promoted' to individual counters and continue
on until their fd is closed too.

>        I know of tools which want to adjust the sampling period based on the
>        number of samples they get per second.

I recently implemented dynamic period stuff, it adjusts the period every
tick so as to strive for a given target frequency.

>        By design, your perf_counter_open() should not really be in the
>        critical path, e.g., when you are processing samples from the event
>        buffer. Thus, I think it would be good to have a dedicated call to
>        allow changing the period.

Yet another ioctl() ?

>   2/ Sampling period randomization
> 
>        It is our experience (on Itanium, for instance), that for certain
>        sampling measurements, it is beneficial to randomize the sampling
>        period a bit. This is in particular the case when sampling on an
>        event that happens very frequently and which is not related to
>        timing, e.g., branch_instructions_retired. Randomization helps mitigate
>        the bias. You do not need anything sophisticated.. But when you are using
>        a kernel-level sampling buffer, you need to have to kernel randomize.
>        Randomization needs to be supported per event.

Corey raised this a while back, I asked what kind of parameters were
needed and if a specific (p)RNG was specified.

Is something with an (avg,std) good enough? Do you have an
implementation that I can borrow, or even better a patch? :-)

>   3/ Group multiplexing ordering
> 
>        As mentioned above, the ordering of group multiplexing for one process
>        needs to be either specified by the API or controllable by users.

Creation order.

> IV/ Open questions
> 
>   1/ Support for model-specific uncore PMU monitoring capabilities
> 
>        Recent processors have multiple PMUs. Typically one per core and but
>        also one at the socket level, e.g., Intel Nehalem. It is expected that
>        this API will provide access to these PMU as well.
> 
>        It seems like with the current API, raw events for those PMUs would need
>        a new architecture-specific type as the event encoding by itself may
>        not be enough to disambiguate between a core and uncore PMU event.
> 
>        How are those events going to be supported?

/me goes poke at the docs... and finds MSR_OFFCORE_RSP_0. Not sure I
quite get what they're saying though, but yes

>   2/ Features impacting all counters
> 
>        On some PMU models, e.g., Itanium, they are certain features which have
>        an influence on all counters that are active. For instance, there is a
>        way to restrict monitoring to a range of continuous code or data
>        addresses using both some PMU registers and the debug registers.
> 
>        Given that the API exposes events (counters) as independent of each
>        other, I wonder how range restriction could be implemented.

Setting them per counter and when scheduling the counters check for
compatibility and stop adding counters to the pmu if the next counter is
incompatible.

>        Similarly, on Itanium, there are global behaviors. For instance, on
>        counter overflow the entire PMU freezes all at once. That seems to be
>        contradictory with the design of the API which creates the illusion of
>        independence.

Simply take the interrupt, deal with the overflow, and continue, its not
like the hardware can do any better, can it?

>   3/ AMD IBS
> 
>        How is AMD IBS going to be implemented?
> 
>        IBS has two separate sets of registers. One to capture fetch related
>        data and another one to capture instruction execution data. For each,
>        there is one config register but multiple data registers. In each mode,
>        there is a specific sampling period and IBS can interrupt.
> 
>        It looks like you could define two pseudo events or event types and then
>        define a new record_format and read_format.  That formats would only be
>        valid for an IBS event.
> 
>        Is that how you intend to support IBS?

I can't seem to locate the IBS stuff in the documentation currently, and
I must admit I've not yet looked into it, others might have.

>   4/ Intel PEBS
> 
>        Since Netburst-based processors, Intel PMUs support a hardware sampling
>        buffer mechanism called PEBS.
> 
>        PEBS really became useful with Nehalem.
> 
>        Not all events support PEBS. Up until Nehalem, only one counter supported
>        PEBS (PMC0). The format of the hardware buffer has changed between Core
>        and Nehalem. It is not yet architected, thus it can still evolve with
>        future PMU models.
> 
>        On Nehalem, there is a new PEBS-based feature called Load Latency
>        Filtering which captures where data cache misses occur
>        (similar to Itanium D-EAR). Activating this feature requires setting a
>        latency threshold hosted in a separate PMU MSR.
> 
>        On Nehalem, given that all 4 generic counters support PEBS, the
>        sampling buffer may contain samples generated by any of the 4 counters.
>        The buffer includes a bitmask of registers to determine the source
>        of the samples. Multiple bits may be set in the bitmask.
> 
> 
>        How PEBS will be supported for this new API?

Something like that yes, we'd have to read the PEBS data and demux to
the per counter data streams.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: comments on Performance Counters for Linux (PCL)
  2009-05-28 16:25 ` Peter Zijlstra
@ 2009-05-28 16:47   ` Peter Zijlstra
  2009-05-28 21:06   ` [perfmon2] " Corey Ashford
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2009-05-28 16:47 UTC (permalink / raw)
  To: eranian
  Cc: LKML, Andrew Morton, Thomas Gleixner, Ingo Molnar, Robert Richter,
	Paul Mackerras, Andi Kleen, Maynard Johnson, Carl Love,
	Corey J Ashford, Philip Mucci, Dan Terpstra, perfmon2-devel

On Thu, 2009-05-28 at 18:25 +0200, Peter Zijlstra wrote:
> >      - uint64_t exclude_*
> > 
> >        It seems those fields were added to support the generic HW events. But
> >        I think they are confusing and their semantic is not quite clear.
> > 
> >        Furthermore, aren't they irrelevant for the SW events?
> 
> They are.

It seems I can't read. They are _not_ irrelevant, sw events do honour
them. We can for instance distinguish between the kernel causing a
pagefualt and userspace doing so.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [perfmon2] comments on Performance Counters for Linux (PCL)
  2009-05-28 16:25 ` Peter Zijlstra
  2009-05-28 16:47   ` Peter Zijlstra
@ 2009-05-28 21:06   ` Corey Ashford
  2009-05-28 21:35     ` Ingo Molnar
  2009-05-29  6:51   ` Andi Kleen
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Corey Ashford @ 2009-05-28 21:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: eranian, Thomas Gleixner, Philip Mucci, LKML, Andi Kleen,
	Paul Mackerras, Maynard Johnson, Andrew Morton, Ingo Molnar,
	perfmon2-devel

Just a few comments below on some excerpts from this very good discussion.

Peter Zijlstra wrote:
> On Thu, 2009-05-28 at 16:58 +0200, stephane eranian wrote:
>>      - uint64_t irq_period
>>
>>        IRQ is an x86 related name. Why not use smpl_period instead?
> 
> don't really care, but IRQ seems used throughout linux, we could name
> the thing interrupt or sample period.

I agree with Stephane, the name irq_period struck me as somewhat strange for 
what it does.  sample_period would be much better.

> 
>>      - uint32_t record_type
>>
>>        This field is a bitmask. I believe 32-bit is too small to accommodate
>>        future record formats.
> 
> It currently controls 8 aspects of the overflow entry, do you really
> forsee the need for more than 32?

record_type is probably not the best name for this either.  Maybe 
"record_layout" or "sample_layout" or "sample_format" (to go along with read_format)


>>        I would assume that on the read() side, counts are accumulated as
>>        64-bit integers. But if it is the case, then it seems there is an
>>        asymmetry between period and counts.
>>
>>        Given that your API is high level, I don't think tools should have to
>>        worry about the actual width of a counter. This is especially true
>>        because they don't know which counters the event is going to go into
>>        and if I recall correctly, on some PMU models, different counters can
>>        have different width (Power, I think).
>>
>>        It is rather convenient for tools to always manipulate counters as
>>        64-bit integers. You should provide a consistent view between counts
>>        and periods.
> 
> So you're suggesting to artificually strech periods by say composing a
> single overflow from smaller ones, ignoring the intermediate overflow
> events?
> 
> That sounds doable, again, patch welcome.

I definitely agree with Stephane's point on this one.  I had assumed that long 
irq_periods (longer than the width of the counter) would be synthesized as you 
suggest.  If this is not the case, PCL should be changed so that it does,  -or- 
at a minimum, the user should get an error back stating that the period is too 
long for the hardware counter.

>>  4/ Grouping
>>
>>        By design, an event can only be part of one group at a time. Events in
>>        a group are guaranteed to be active on the PMU at the same time. That
>>        means a group cannot have more events than there are available counters
>>        on the PMU. Tools may want to know the number of counters available in
>>        order to group their events accordingly, such that reliable ratios
>>        could be computed. It seems the only way to know this is by trial and
>>        error. This is not practical.
> 
> Got a proposal to ammend this?

I think counters in a group are guaranteed to be active at the same time iff the 
pinned bit is set for that group, right?

I don't get the problem with reliable ratios here.  If each counter has its own 
time values, time enabled vs. time on counter, reliable ratios should always be 
available.

> 
>>  5/ Multiplexing and scaling
>>
>>        The PMU can be shared by multiple programs each controlling a variable
>>        number of events. Multiplexing occurs by default unless pinned is
>>        requested. The exclusive option only guarantees the group does not
>>        share the PMU with other groups while it is active, at least this is
>>        my understanding.
> 
> We have pinned and exclusive. pinned means always on the PMU, exclusive
> means when on the PMU no-one else can be.

The use of the exclusive bit has been unclear to me.  Let's say I have 4 
hardware counters, and two groups of two events each.  As long as there's no 
interference from one group to the other, is there a reason I'd want the 
"exclusive" bit on?

Is it used only in the case where the kernel would otherwise not be able to 
schedule both groups onto counters at the same time and you want to ensure that 
your group doesn't get preempted by another group waiting to get onto the PMU?

>> III/ Requests
>>   2/ Sampling period randomization
>>
>>        It is our experience (on Itanium, for instance), that for certain
>>        sampling measurements, it is beneficial to randomize the sampling
>>        period a bit. This is in particular the case when sampling on an
>>        event that happens very frequently and which is not related to
>>        timing, e.g., branch_instructions_retired. Randomization helps mitigate
>>        the bias. You do not need anything sophisticated.. But when you are using
>>        a kernel-level sampling buffer, you need to have to kernel randomize.
>>        Randomization needs to be supported per event.
> 
> Corey raised this a while back, I asked what kind of parameters were
> needed and if a specific (p)RNG was specified.
> 
> Is something with an (avg,std) good enough? Do you have an
> implementation that I can borrow, or even better a patch? :-)

For how it's done in perfmon2, take a look at Section 3.4.2 (page 74) of 
http://www.hpl.hp.com/techreports/2004/HPL-2004-200R1.pdf

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
cjashfor@us.ibm.com


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [perfmon2] comments on Performance Counters for Linux (PCL)
  2009-05-28 21:06   ` [perfmon2] " Corey Ashford
@ 2009-05-28 21:35     ` Ingo Molnar
  2009-05-28 23:24       ` Paul Mackerras
  0 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2009-05-28 21:35 UTC (permalink / raw)
  To: Corey Ashford
  Cc: Peter Zijlstra, eranian, Thomas Gleixner, Philip Mucci, LKML,
	Andi Kleen, Paul Mackerras, Maynard Johnson, Andrew Morton,
	perfmon2-devel


* Corey Ashford <cjashfor@linux.vnet.ibm.com> wrote:

> Just a few comments below on some excerpts from this very good discussion.
>
> Peter Zijlstra wrote:
>> On Thu, 2009-05-28 at 16:58 +0200, stephane eranian wrote:
>>>      - uint64_t irq_period
>>>
>>>        IRQ is an x86 related name. Why not use smpl_period instead?

irq is not an x86 related name at all. There's thousands of uses of 
it even in arch/powerpc:

  earth4:~/tip> git grep -i irq arch/powerpc/ | wc -l
  6441

>>
>> don't really care, but IRQ seems used throughout linux, we could 
>> name the thing interrupt or sample period.
>
> I agree with Stephane, the name irq_period struck me as somewhat 
> strange for what it does.  sample_period would be much better.

sample_period would be fine - but smpl_period definitely not ;-)

>>>      - uint32_t record_type
>>>
>>>        This field is a bitmask. I believe 32-bit is too small to 
>>>        accommodate future record formats.
>>
>> It currently controls 8 aspects of the overflow entry, do you 
>> really forsee the need for more than 32?
>
> record_type is probably not the best name for this either.  Maybe 
> "record_layout" or "sample_layout" or "sample_format" (to go along 
> with read_format)

'record' is pretty established for this - so record_layout would be 
fine. Peter?

>>>        I would assume that on the read() side, counts are accumulated as
>>>        64-bit integers. But if it is the case, then it seems there is an
>>>        asymmetry between period and counts.
>>>
>>>        Given that your API is high level, I don't think tools should have to
>>>        worry about the actual width of a counter. This is especially true
>>>        because they don't know which counters the event is going to go into
>>>        and if I recall correctly, on some PMU models, different counters can
>>>        have different width (Power, I think).
>>>
>>>        It is rather convenient for tools to always manipulate counters as
>>>        64-bit integers. You should provide a consistent view between counts
>>>        and periods.
>>
>> So you're suggesting to artificually strech periods by say 
>> composing a single overflow from smaller ones, ignoring the 
>> intermediate overflow events?
>>
>> That sounds doable, again, patch welcome.
>
> I definitely agree with Stephane's point on this one.  I had 
> assumed that long irq_periods (longer than the width of the 
> counter) would be synthesized as you suggest.  If this is not the 
> case, PCL should be changed so that it does, -or- at a minimum, 
> the user should get an error back stating that the period is too 
> long for the hardware counter.

this looks somewhat academic - at least on x86, even the fastest 
events (say cycles) with a 32 bit overflow means one event per 
second on 4GB. That's not a significant event count in practice. 
What's the minimum width we are talking about on Power?

	Ingo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [perfmon2] comments on Performance Counters for Linux (PCL)
  2009-05-28 21:35     ` Ingo Molnar
@ 2009-05-28 23:24       ` Paul Mackerras
  2009-05-29  7:19         ` Ingo Molnar
  2009-05-29 10:50         ` stephane eranian
  0 siblings, 2 replies; 22+ messages in thread
From: Paul Mackerras @ 2009-05-28 23:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Corey Ashford, Peter Zijlstra, eranian, Thomas Gleixner,
	Philip Mucci, LKML, Andi Kleen, Maynard Johnson, Andrew Morton,
	perfmon2-devel

Ingo Molnar writes:

> * Corey Ashford <cjashfor@linux.vnet.ibm.com> wrote:
> >> So you're suggesting to artificually strech periods by say 
> >> composing a single overflow from smaller ones, ignoring the 
> >> intermediate overflow events?
> >>
> >> That sounds doable, again, patch welcome.
> >
> > I definitely agree with Stephane's point on this one.  I had 
> > assumed that long irq_periods (longer than the width of the 
> > counter) would be synthesized as you suggest.  If this is not the 
> > case, PCL should be changed so that it does, -or- at a minimum, 
> > the user should get an error back stating that the period is too 
> > long for the hardware counter.
> 
> this looks somewhat academic - at least on x86, even the fastest 
> events (say cycles) with a 32 bit overflow means one event per 
> second on 4GB. That's not a significant event count in practice. 
> What's the minimum width we are talking about on Power?

32 bits, but since the top bit is effectively a level-sensitive
interrupt request, the maximum period in hardware is 2^31 counts.

However, I already support 64-bit interrupt periods (well, 63-bit
actually) on powerpc by only calling perf_counter_overflow() when
counter->hw.period_left becomes <= 0, and arranging to set the
hardware counter to 0 if counter->hw.period_left is >= 0x80000000.
It's a tiny amount of code to handle it, really.

Paul.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [perfmon2] comments on Performance Counters for Linux (PCL)
  2009-05-28 23:24       ` Paul Mackerras
@ 2009-05-29  7:19         ` Ingo Molnar
  2009-05-29  8:21           ` Geert Uytterhoeven
  2009-05-29 10:50         ` stephane eranian
  1 sibling, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2009-05-29  7:19 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Corey Ashford, Peter Zijlstra, eranian, Thomas Gleixner,
	Philip Mucci, LKML, Andi Kleen, Maynard Johnson, Andrew Morton,
	perfmon2-devel


* Paul Mackerras <paulus@samba.org> wrote:

> Ingo Molnar writes:
> 
> > * Corey Ashford <cjashfor@linux.vnet.ibm.com> wrote:
> > >> So you're suggesting to artificually strech periods by say 
> > >> composing a single overflow from smaller ones, ignoring the 
> > >> intermediate overflow events?
> > >>
> > >> That sounds doable, again, patch welcome.
> > >
> > > I definitely agree with Stephane's point on this one.  I had 
> > > assumed that long irq_periods (longer than the width of the 
> > > counter) would be synthesized as you suggest.  If this is not the 
> > > case, PCL should be changed so that it does, -or- at a minimum, 
> > > the user should get an error back stating that the period is too 
> > > long for the hardware counter.
> > 
> > this looks somewhat academic - at least on x86, even the fastest 
> > events (say cycles) with a 32 bit overflow means one event per 
> > second on 4GB. That's not a significant event count in practice. 
> > What's the minimum width we are talking about on Power?
> 
> 32 bits, but since the top bit is effectively a level-sensitive 
> interrupt request, the maximum period in hardware is 2^31 counts.
> 
> However, I already support 64-bit interrupt periods (well, 63-bit 
> actually) on powerpc by only calling perf_counter_overflow() when 
> counter->hw.period_left becomes <= 0, and arranging to set the 
> hardware counter to 0 if counter->hw.period_left is >= 0x80000000. 
> It's a tiny amount of code to handle it, really.

No argument about that - just wanted to know whether there's any 
real practical effect beyond the nitpicking factor ;-)

	Ingo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [perfmon2] comments on Performance Counters for Linux (PCL)
  2009-05-29  7:19         ` Ingo Molnar
@ 2009-05-29  8:21           ` Geert Uytterhoeven
  2009-05-29 14:42             ` Carl Love
  0 siblings, 1 reply; 22+ messages in thread
From: Geert Uytterhoeven @ 2009-05-29  8:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul Mackerras, Peter Zijlstra, Philip Mucci, LKML, Andrew Morton,
	Andi Kleen, Maynard Johnson, eranian, Thomas Gleixner,
	perfmon2-devel

On Fri, 29 May 2009, Ingo Molnar wrote:
> * Paul Mackerras <paulus@samba.org> wrote:
> > Ingo Molnar writes:
> > > * Corey Ashford <cjashfor@linux.vnet.ibm.com> wrote:
> > > >> So you're suggesting to artificually strech periods by say 
> > > >> composing a single overflow from smaller ones, ignoring the 
> > > >> intermediate overflow events?
> > > >>
> > > >> That sounds doable, again, patch welcome.
> > > >
> > > > I definitely agree with Stephane's point on this one.  I had 
> > > > assumed that long irq_periods (longer than the width of the 
> > > > counter) would be synthesized as you suggest.  If this is not the 
> > > > case, PCL should be changed so that it does, -or- at a minimum, 
> > > > the user should get an error back stating that the period is too 
> > > > long for the hardware counter.
> > > 
> > > this looks somewhat academic - at least on x86, even the fastest 
> > > events (say cycles) with a 32 bit overflow means one event per 
> > > second on 4GB. That's not a significant event count in practice. 
> > > What's the minimum width we are talking about on Power?
> > 
> > 32 bits, but since the top bit is effectively a level-sensitive 
> > interrupt request, the maximum period in hardware is 2^31 counts.
> > 
> > However, I already support 64-bit interrupt periods (well, 63-bit 
> > actually) on powerpc by only calling perf_counter_overflow() when 
> > counter->hw.period_left becomes <= 0, and arranging to set the 
> > hardware counter to 0 if counter->hw.period_left is >= 0x80000000. 
> > It's a tiny amount of code to handle it, really.
> 
> No argument about that - just wanted to know whether there's any 
> real practical effect beyond the nitpicking factor ;-)

I never really dived into this stuff, but ISTR there are some 16-bit counters
on CELL? Is that correct?

With kind regards,

Geert Uytterhoeven
Software Architect
Techsoft Centre

Technology and Software Centre Europe
The Corporate Village · Da Vincilaan 7-D1 · B-1935 Zaventem · Belgium

Phone:    +32 (0)2 700 8453
Fax:      +32 (0)2 700 8622
E-mail:   Geert.Uytterhoeven@sonycom.com
Internet: http://www.sony-europe.com/

A division of Sony Europe (Belgium) N.V.
VAT BE 0413.825.160 · RPR Brussels
Fortis · BIC GEBABEBB · IBAN BE41293037680010

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [perfmon2] comments on Performance Counters for Linux (PCL)
  2009-05-29  8:21           ` Geert Uytterhoeven
@ 2009-05-29 14:42             ` Carl Love
  2009-05-30 19:16               ` Ingo Molnar
  0 siblings, 1 reply; 22+ messages in thread
From: Carl Love @ 2009-05-29 14:42 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Philip Mucci, LKML,
	Andi Kleen, Paul Mackerras, Maynard Johnson, Andrew Morton,
	eranian, perfmon2-devel


On Fri, 2009-05-29 at 10:21 +0200, Geert Uytterhoeven wrote:
> On Fri, 29 May 2009, Ingo Molnar wrote:
> > * Paul Mackerras <paulus@samba.org> wrote:
> > > Ingo Molnar writes:
> > > > * Corey Ashford <cjashfor@linux.vnet.ibm.com> wrote:
> > > > >> So you're suggesting to artificually strech periods by say 
> > > > >> composing a single overflow from smaller ones, ignoring the 
> > > > >> intermediate overflow events?
> > > > >>
> > > > >> That sounds doable, again, patch welcome.
> > > > >
> > > > > I definitely agree with Stephane's point on this one.  I had 
> > > > > assumed that long irq_periods (longer than the width of the 
> > > > > counter) would be synthesized as you suggest.  If this is not the 
> > > > > case, PCL should be changed so that it does, -or- at a minimum, 
> > > > > the user should get an error back stating that the period is too 
> > > > > long for the hardware counter.
> > > > 
> > > > this looks somewhat academic - at least on x86, even the fastest 
> > > > events (say cycles) with a 32 bit overflow means one event per 
> > > > second on 4GB. That's not a significant event count in practice. 
> > > > What's the minimum width we are talking about on Power?
> > > 
> > > 32 bits, but since the top bit is effectively a level-sensitive 
> > > interrupt request, the maximum period in hardware is 2^31 counts.
> > > 
> > > However, I already support 64-bit interrupt periods (well, 63-bit 
> > > actually) on powerpc by only calling perf_counter_overflow() when 
> > > counter->hw.period_left becomes <= 0, and arranging to set the 
> > > hardware counter to 0 if counter->hw.period_left is >= 0x80000000. 
> > > It's a tiny amount of code to handle it, really.
> > 
> > No argument about that - just wanted to know whether there's any 
> > real practical effect beyond the nitpicking factor ;-)
> 
> I never really dived into this stuff, but ISTR there are some 16-bit counters
> on CELL? Is that correct?

FYI, the counters on CELL are configurable.  You can have up to eight 16
bit count counters or you can combine counters i and i=1 (where
i=0,2,4,6) into 32 bit counters.  This allows you to have some 16 and
some 32 bit counters at the same time.  

> 
> With kind regards,
> 
> Geert Uytterhoeven
> Software Architect
> Techsoft Centre
> 
> Technology and Software Centre Europe
> The Corporate Village · Da Vincilaan 7-D1 · B-1935 Zaventem · Belgium
> 
> Phone:    +32 (0)2 700 8453
> Fax:      +32 (0)2 700 8622
> E-mail:   Geert.Uytterhoeven@sonycom.com
> Internet: http://www.sony-europe.com/
> 
> A division of Sony Europe (Belgium) N.V.
> VAT BE 0413.825.160 · RPR Brussels
> Fortis · BIC GEBABEBB · IBAN BE41293037680010
> 
> ------------------------------------------------------------------------------
> Register Now for Creativity and Technology (CaT), June 3rd, NYC. CaT 
> is a gathering of tech-side developers & brand creativity professionals. Meet
> the minds behind Google Creative Lab, Visual Complexity, Processing, & 
> iPhoneDevCamp as they present alongside digital heavyweights like Barbarian 
> Group, R/GA, & Big Spaceship. http://p.sf.net/sfu/creativitycat-com 
> _______________________________________________
> perfmon2-devel mailing list
> perfmon2-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/perfmon2-devel


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [perfmon2] comments on Performance Counters for Linux (PCL)
  2009-05-29 14:42             ` Carl Love
@ 2009-05-30 19:16               ` Ingo Molnar
  0 siblings, 0 replies; 22+ messages in thread
From: Ingo Molnar @ 2009-05-30 19:16 UTC (permalink / raw)
  To: Carl Love
  Cc: Geert Uytterhoeven, Thomas Gleixner, Peter Zijlstra, Philip Mucci,
	LKML, Andi Kleen, Paul Mackerras, Maynard Johnson, Andrew Morton,
	eranian, perfmon2-devel


* Carl Love <cel@us.ibm.com> wrote:

> 
> On Fri, 2009-05-29 at 10:21 +0200, Geert Uytterhoeven wrote:
> > On Fri, 29 May 2009, Ingo Molnar wrote:
> > > * Paul Mackerras <paulus@samba.org> wrote:
> > > > Ingo Molnar writes:
> > > > > * Corey Ashford <cjashfor@linux.vnet.ibm.com> wrote:
> > > > > >> So you're suggesting to artificually strech periods by say 
> > > > > >> composing a single overflow from smaller ones, ignoring the 
> > > > > >> intermediate overflow events?
> > > > > >>
> > > > > >> That sounds doable, again, patch welcome.
> > > > > >
> > > > > > I definitely agree with Stephane's point on this one.  I had 
> > > > > > assumed that long irq_periods (longer than the width of the 
> > > > > > counter) would be synthesized as you suggest.  If this is not the 
> > > > > > case, PCL should be changed so that it does, -or- at a minimum, 
> > > > > > the user should get an error back stating that the period is too 
> > > > > > long for the hardware counter.
> > > > > 
> > > > > this looks somewhat academic - at least on x86, even the fastest 
> > > > > events (say cycles) with a 32 bit overflow means one event per 
> > > > > second on 4GB. That's not a significant event count in practice. 
> > > > > What's the minimum width we are talking about on Power?
> > > > 
> > > > 32 bits, but since the top bit is effectively a level-sensitive 
> > > > interrupt request, the maximum period in hardware is 2^31 counts.
> > > > 
> > > > However, I already support 64-bit interrupt periods (well, 63-bit 
> > > > actually) on powerpc by only calling perf_counter_overflow() when 
> > > > counter->hw.period_left becomes <= 0, and arranging to set the 
> > > > hardware counter to 0 if counter->hw.period_left is >= 0x80000000. 
> > > > It's a tiny amount of code to handle it, really.
> > > 
> > > No argument about that - just wanted to know whether there's any 
> > > real practical effect beyond the nitpicking factor ;-)
> > 
> > I never really dived into this stuff, but ISTR there are some 16-bit counters
> > on CELL? Is that correct?
> 
> FYI, the counters on CELL are configurable.  You can have up to 
> eight 16 bit count counters or you can combine counters i and i=1 
> (where i=0,2,4,6) into 32 bit counters.  This allows you to have 
> some 16 and some 32 bit counters at the same time.

If 16-bit counters are exposed then this can be solved like the 
PowerPC perfcounters code does it: by not propagating 'early' IRQs 
back to the generic layer but continuing it until the real threshold 
has been reached.

	Ingo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [perfmon2] comments on Performance Counters for Linux (PCL)
  2009-05-28 23:24       ` Paul Mackerras
  2009-05-29  7:19         ` Ingo Molnar
@ 2009-05-29 10:50         ` stephane eranian
  2009-05-29 10:53           ` stephane eranian
  1 sibling, 1 reply; 22+ messages in thread
From: stephane eranian @ 2009-05-29 10:50 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Ingo Molnar, Corey Ashford, Peter Zijlstra, Thomas Gleixner,
	Philip Mucci, LKML, Andi Kleen, Maynard Johnson, Andrew Morton,
	perfmon2-devel

Hi,

On Fri, May 29, 2009 at 1:24 AM, Paul Mackerras <paulus@samba.org> wrote:
> Ingo Molnar writes:
>
>> * Corey Ashford <cjashfor@linux.vnet.ibm.com> wrote:
>> >> So you're suggesting to artificually strech periods by say
>> >> composing a single overflow from smaller ones, ignoring the
>> >> intermediate overflow events?
>> >>
>> >> That sounds doable, again, patch welcome.
>> >
>> > I definitely agree with Stephane's point on this one.  I had
>> > assumed that long irq_periods (longer than the width of the
>> > counter) would be synthesized as you suggest.  If this is not the
>> > case, PCL should be changed so that it does, -or- at a minimum,
>> > the user should get an error back stating that the period is too
>> > long for the hardware counter.
>>
>> this looks somewhat academic - at least on x86, even the fastest
>> events (say cycles) with a 32 bit overflow means one event per
>> second on 4GB. That's not a significant event count in practice.
>> What's the minimum width we are talking about on Power?
>
> 32 bits, but since the top bit is effectively a level-sensitive
> interrupt request, the maximum period in hardware is 2^31 counts.
>
This is exactly the same on Intel X86: 31 bits.
AMD is different.

Unfortunately, the Intel document is very obscure about this restriction.

Let's take Core. 40-bit wide counters. If you read with RDPMC
you get 40-bit worth of data, same with RDMSR.

If you write with WRMSR, you can only modify the bottom 32
bits. But bit 31 is the sign bit. Bit 32-39 are sign extension
of bit 31.

To trigger an overflow, bit 31 must be set, i.e., interrupt is triggered
when 40 bits transition back to 0.

Therefore your actual degree of freedom for the sampling period
is 31 bits.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [perfmon2] comments on Performance Counters for Linux (PCL)
  2009-05-29 10:50         ` stephane eranian
@ 2009-05-29 10:53           ` stephane eranian
  0 siblings, 0 replies; 22+ messages in thread
From: stephane eranian @ 2009-05-29 10:53 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Ingo Molnar, Corey Ashford, Peter Zijlstra, Thomas Gleixner,
	Philip Mucci, LKML, Andi Kleen, Maynard Johnson, Andrew Morton,
	perfmon2-devel

On Fri, May 29, 2009 at 12:50 PM, stephane eranian
<eranian@googlemail.com> wrote:
> Hi,
>
> On Fri, May 29, 2009 at 1:24 AM, Paul Mackerras <paulus@samba.org> wrote:
>> Ingo Molnar writes:
>>
>>> * Corey Ashford <cjashfor@linux.vnet.ibm.com> wrote:
>>> >> So you're suggesting to artificually strech periods by say
>>> >> composing a single overflow from smaller ones, ignoring the
>>> >> intermediate overflow events?
>>> >>
>>> >> That sounds doable, again, patch welcome.
>>> >
>>> > I definitely agree with Stephane's point on this one.  I had
>>> > assumed that long irq_periods (longer than the width of the
>>> > counter) would be synthesized as you suggest.  If this is not the
>>> > case, PCL should be changed so that it does, -or- at a minimum,
>>> > the user should get an error back stating that the period is too
>>> > long for the hardware counter.
>>>
>>> this looks somewhat academic - at least on x86, even the fastest
>>> events (say cycles) with a 32 bit overflow means one event per
>>> second on 4GB. That's not a significant event count in practice.
>>> What's the minimum width we are talking about on Power?
>>
>> 32 bits, but since the top bit is effectively a level-sensitive
>> interrupt request, the maximum period in hardware is 2^31 counts.
>>
> This is exactly the same on Intel X86: 31 bits.
> AMD is different.
>
> Unfortunately, the Intel document is very obscure about this restriction.
>
> Let's take Core. 40-bit wide counters. If you read with RDPMC
> you get 40-bit worth of data, same with RDMSR.
>
> If you write with WRMSR, you can only modify the bottom 32
> bits. But bit 31 is the sign bit. Bit 32-39 are sign extension
> of bit 31.
>
Let me rephrase this a bit. You can actually modify all bits
with WRMSR. but the sign extension determines the value
of bits 32-39 based on bit 31.

Forgot to mention that since Penryn, if you try to set bits 40-63
with WRMSR, you get a fault.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: comments on Performance Counters for Linux (PCL)
  2009-05-28 16:25 ` Peter Zijlstra
  2009-05-28 16:47   ` Peter Zijlstra
  2009-05-28 21:06   ` [perfmon2] " Corey Ashford
@ 2009-05-29  6:51   ` Andi Kleen
  2009-05-29  8:19     ` Peter Zijlstra
  2009-05-29 10:43   ` stephane eranian
  2009-05-29 20:32   ` Peter Zijlstra
  4 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2009-05-29  6:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: eranian, LKML, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Robert Richter, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

> >        Take Itanium, it has 4 priv levels and the PMU counters can monitor at
> >        any priv levels or combination thereof?
> 
> x86 has more priv rings too, but linux only uses 2, kernel (ring 0) and
> user (ring 2 iirc). Does ia64 expose more than 2 priv levels in linux?

In a hardware virtualized environment x86 Linux uses at least four.

And yes ia64 also exposes more I believe, e.g. it has the fancy
"system calls that run in user context"

-Andi


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: comments on Performance Counters for Linux (PCL)
  2009-05-29  6:51   ` Andi Kleen
@ 2009-05-29  8:19     ` Peter Zijlstra
  2009-05-29  8:53       ` Andi Kleen
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2009-05-29  8:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: eranian, LKML, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Robert Richter, Paul Mackerras, Maynard Johnson, Carl Love,
	Corey J Ashford, Philip Mucci, Dan Terpstra, perfmon2-devel

On Fri, 2009-05-29 at 08:51 +0200, Andi Kleen wrote:
> > >        Take Itanium, it has 4 priv levels and the PMU counters can monitor at
> > >        any priv levels or combination thereof?
> > 
> > x86 has more priv rings too, but linux only uses 2, kernel (ring 0) and
> > user (ring 2 iirc). Does ia64 expose more than 2 priv levels in linux?
> 
> In a hardware virtualized environment x86 Linux uses at least four.

You're talking about VT stuff, right? Does that virtualize the PMU too?
otherwise the point it moot since it will have to arbitrate the PMU and
the guest will likely never see the HV.

Or are you talking about Xen like horrors where they run 'guest' dom0
and domu in separate rings? That would still need to arbitrate the PMU
between guests.

> And yes ia64 also exposes more I believe, e.g. it has the fancy
> "system calls that run in user context"

Happen to have more details? If they really run in user context their
priv level would be the same right, otherwise they run in something
weird, not quite user not quite kernel. Now that might be the case, but
I'm utterly ignorant on itanic.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: comments on Performance Counters for Linux (PCL)
  2009-05-29  8:19     ` Peter Zijlstra
@ 2009-05-29  8:53       ` Andi Kleen
  2009-05-29 16:13         ` [perfmon2] " Luck, Tony
  0 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2009-05-29  8:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, eranian, LKML, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Robert Richter, Paul Mackerras, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

> You're talking about VT stuff, right? 

Yes.

> Does that virtualize the PMU too?

A hypervisor can in software, but another common case is a global profiler on 
the host to be able to profile all guests too. e.g. xenoprofile is able to do 
that.

If you want a full system picture with virtualization that's pretty much 
required.

For that you need a concept of more priviledge levels (host user, host sys,
guest user, guest sys)

> Happen to have more details? If they really run in user context their
> priv level would be the same right, otherwise they run in something
> weird, not quite user not quite kernel. Now that might be the case, but
> I'm utterly ignorant on itanic.

See Documentation/ia64/fsys.txt

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [perfmon2] comments on Performance Counters for Linux (PCL)
  2009-05-29  8:53       ` Andi Kleen
@ 2009-05-29 16:13         ` Luck, Tony
  0 siblings, 0 replies; 22+ messages in thread
From: Luck, Tony @ 2009-05-29 16:13 UTC (permalink / raw)
  To: Andi Kleen, Peter Zijlstra
  Cc: Andrew Morton, Philip Mucci, LKML, Maynard Johnson,
	Thomas Gleixner, Paul Mackerras, Ingo Molnar, eranian@gmail.com,
	perfmon2-devel

>> Happen to have more details? If they really run in user context their
>> priv level would be the same right, otherwise they run in something
>> weird, not quite user not quite kernel. Now that might be the case, but
>> I'm utterly ignorant on itanic.
>
> See Documentation/ia64/fsys.txt

Summary is that the fast system calls run in ring 0 ... but since they
didn't save all the user state (in order to make them fast!) there are
restrictions on what things they can do.  In particular they must not
trap.

-Tony

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: comments on Performance Counters for Linux (PCL)
  2009-05-28 16:25 ` Peter Zijlstra
                     ` (2 preceding siblings ...)
  2009-05-29  6:51   ` Andi Kleen
@ 2009-05-29 10:43   ` stephane eranian
  2009-05-29 15:02     ` Peter Zijlstra
  2009-05-29 20:32   ` Peter Zijlstra
  4 siblings, 1 reply; 22+ messages in thread
From: stephane eranian @ 2009-05-29 10:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Thomas Gleixner, Ingo Molnar, Robert Richter,
	Paul Mackerras, Andi Kleen, Maynard Johnson, Carl Love,
	Corey J Ashford, Philip Mucci, Dan Terpstra, perfmon2-devel

Hi,

On Thu, May 28, 2009 at 6:25 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>> I/ General API comments
>>
>>   1/ Data structures
>>
>>      * struct perf_counter_hw_event
>>
>>      - I think this structure will be used to enable non-counting features,
>>        e.g. IBS. Name is awkward. It is used to enable non-hardware events
>>        (sw events). Why not call it: struct perf_event
>
> Sure a name change might be a good idea.
>
In that case, the open syscall should also be changed to something
more generic: perf_open()


>>      - uint64_t irq_period
>>
>>        IRQ is an x86 related name. Why not use smpl_period instead?
>
> don't really care, but IRQ seems used throughout linux, we could name
> the thing interrupt or sample period.
>
IRQ is well understood by kernel people and I agree it is not X86 specific.
But we are talking about user level developers, some not really software
engineers either, e.g., physicists.


>>      - uint32_t record_type
>>
>>        This field is a bitmask. I believe 32-bit is too small to accommodate
>>        future record formats.
>
> It currently controls 8 aspects of the overflow entry, do you really
> forsee the need for more than 32?
>
Again, given the perf_open() syscall is not on a critical path, it does not
hurt to pass a bigger struct and have provision for future extensions.

>>      - uint32_t read_format
>>
>>        Ditto.
>
> I guess it doesn't hurt extending them..

Exactly.

>>      - uint64_t exclude_*
>
>>        What is the meaning of exclude_user? Which priv levels are actually
>>        excluded?
>
> userspace
>
This is a fuzzy notion.

>>        Take Itanium, it has 4 priv levels and the PMU counters can monitor at
>>        any priv levels or combination thereof?
>
> x86 has more priv rings too, but linux only uses 2, kernel (ring 0) and
> user (ring 2 iirc). Does ia64 expose more than 2 priv levels in linux?
>
X86 has priv level 0,1,2,3. The issue, though, it that the X86 PMU only
distinguishes 2 coarse levels: OS, USR. Where OS=0, USR=1,2,3.

IA64 has also 4 levels, but the difference is that the PMU can filter on
all 4 levels independently. The question is then, what does exclude_user
actually encompass there?

And then, there is VT on X86 and IA64...
AMD64 PMU as of family 10h has host and guest filters in the
PERFEVTSEL registers.


>>        When programming raw HW events, the priv level filtering is typically
>>        already included. Which setting has priority, raw encoding or the
>>        exclude_*?
>>
>>        Looking at the existing X86 implementation, it seems exclude_* can
>>        override whatever is set in the raw event code.
>>
>>        For any events, but in particular, SW events, why not encode this in
>>        the config field, like it is for a raw HW event?
>
> Because config is about what we count, this is about where we count.
> Doesn't seem strange to separate these two.
>
For a monitor tool, this means it may need to do the work twice.

Imagine, I encode events using strings: INST_RETIRED:u=1:k=1.
This means measure INST_RETIRED, at user level and kernel level.
You typically pass this to a helper library and you get back the raw
event code, which includes the priv level mask. If that library is generic
and does not know about PCL, then the tool needs to extract either
from the raw code or the string, the priv level information to set the
exclude_* fields accordingly. The alternative is to make the library
PCL-aware and have it set the perf_event structure directly.


>>      * struct perf_counter_mmap_page
>>        Given there is only one counter per-page, there is an awful lot of
>>        precious RLIMIT_MEMLOCK space wasted for this.
>>
>>        Typically, if you are self-sampling, you are not going to read the
>>        current value of the sampling period. That re-mapping trick is only
>>        useful when counting.
>>
>>        Why not make these two separate mappings (using the mmap offset as
>>        the indicator)?
>>
>>        With this approach, you would get one page back per sampling period
>>        and that page could then be used for the actual samples.
>
> Not quite, you still need a place for the data_head.
>
You could put it at the beginning of the actual buffer. But then, I
suspect it will
break the logic you have in data_head (explained below).


>>  2/ System calls
>>
>>      * ioctl()
>>
>>        You have defined 3 ioctls() so far to operate on an existing event.
>>        I was under the impression that ioctl() should not be used except for
>>        drivers.
>
> 4 actually.
>
Why not use 4 new syscalls instead of using ioctl().

>>      * prctl()
>>
>>        The API is event-based. Each event gets a file descriptor. Events are
>>        therefore managed individually. Thus, to enable/disable, you need to
>>        enable/disable each one separately.
>>
>>        The use of prctl() breaks this design choice. It is not clear what you
>>        are actually enabling. It looks like you are enabling all the counters
>>        attached to the thread. This is incorrect. With your implementation,
>>        the PMU can be shared between competing users. In particular, multiple
>>        tools may be monitoring the same thread. Now, imagine, a tool is
>>        monitoring a self-monitoring thread which happens to start/stop its
>>        measurement using prctl(). Then, that would also start/stop the
>>        measurement of the external tool. I have verified that this is what is
>>        actually happening.
>
> Recently changed that, it enables/disables all counters created by the
> task calling prctl().
>
And attached that what? Itself or anything?

>>        I believe this call is bogus and it should be eliminated. The interface
>>        is exposing events individually therefore they should be controlled
>>        individually.
>
> Bogus maybe not, superfluous, yeah, its a simpler way than iterating all
> the fds you just created, saves a few syscalls.
>
Well, my point was that it does not fit well with your file descriptor oriented
API.


>
>>  3/ Counter width
>>
>>        It is not clear whether or not the API exposes counters as 64-bit wide
>>        on PMUs which do not implement 64-bit wide counters.
>>
>>        Both irq_period and read() return 64-bit integers. However, it appears
>>        that the implementation is not using all the bits. In fact, on X86, it
>>        appears the irq_period is truncated silently. I believe this is not
>>        correct. If the period is not valid, an error should be returned.
>>        Otherwise, the tool will be getting samples at a rate different than
>>        what it requested.
>
> Sure, fail creation when the specified period is larger than the
> supported counter width -- patch welcome.
>
Yes, but then that means tools need to know on which counter
the event is going to be programmed. Not all counters may have the
same width.

>>        I would assume that on the read() side, counts are accumulated as
>>        64-bit integers. But if it is the case, then it seems there is an
>>        asymmetry between period and counts.
>>
>>        Given that your API is high level, I don't think tools should have to
>>        worry about the actual width of a counter. This is especially true
>>        because they don't know which counters the event is going to go into
>>        and if I recall correctly, on some PMU models, different counters can
>>        have different width (Power, I think).
>>
>>        It is rather convenient for tools to always manipulate counters as
>>        64-bit integers. You should provide a consistent view between counts
>>        and periods.
>
> So you're suggesting to artificually strech periods by say composing a
> single overflow from smaller ones, ignoring the intermediate overflow
> events?
>
Yes, you emulate actual 64-bit wide counters. In the case of perfmon,
there is no notion of sampling period. All counters are exposed as 64-bit
wide. You can write any value you want into a counter. If you want a period p,
then you program the counter to -p. The period p may be larger than the width
of the actual counter. That means you will get intermediate overflows. A final
overflow will make the 64-bit value wrap around and that's when you
record a sample.

>>  4/ Grouping
>>
>>        By design, an event can only be part of one group at a time. Events in
>>        a group are guaranteed to be active on the PMU at the same time. That
>>        means a group cannot have more events than there are available counters
>>        on the PMU. Tools may want to know the number of counters available in
>>        order to group their events accordingly, such that reliable ratios
>>        could be computed. It seems the only way to know this is by trial and
>>        error. This is not practical.
>
> Got a proposal to ammend this?
>
Either add a syscall for that, or better, expose this via sysfs.

>>  5/ Multiplexing and scaling
>>
>>        The PMU can be shared by multiple programs each controlling a variable
>>        number of events. Multiplexing occurs by default unless pinned is
>>        requested. The exclusive option only guarantees the group does not
>>        share the PMU with other groups while it is active, at least this is
>>        my understanding.
>
> We have pinned and exclusive. pinned means always on the PMU, exclusive
> means when on the PMU no-one else can be.
>
exclusive: no sharing even if the group does not use all the counters
AND they are
                other events waiting for the resource. Right?

>>        By default, you may be multiplexed and if that happens you cannot know
>>        unless you request the timing information as part of the read_format.
>>        Without it, and if multiplexing has occurred, bogus counts may be
>>        returned with no indication whatsoever.
>
> I don't see the problem, you knew they could get multiplexes, yet you
> didn't ask for the information needed to extrapolate the information,
> sounds like you get what you aksed for.
>
The API specification must then clearly say: events and groups of events
are multiplexed by default. Scaling is not done automatically.

>>        To avoid returning misleading information, it seems like the API should
>>        refuse to open a non-pinned event which does not have
>>        PERF_FORMAT_TOTAL_TIME_ENABLED|PERF_FORMAT_TOTAL_TIME_RUNNING in the
>>        read_format. This would avoid a lot of confusion down the road.
>
> I prefer to give people rope and tell them how to tie the knot.
>
This is a case of silent error. I suspect many people will fall into that trap.
Need to make sure documentation warns about that.

>>  7/ Multiplexing and system-wide
>>
>>        Multiplexing is time-based and it is hooked into the timer tick. At
>>        every tick, the kernel tries to schedule another group of events.
>>
>>        In tickless kernels if a CPU is idle, no timer tick is generated,
>>        therefore no multiplexing occurs. This is incorrect. It's not because
>>        the CPU is idle, that there aren't any interesting PMU events to measure.
>>        Parts of the CPU may still be active, e.g., caches and buses. And thus,
>>        it is expected that multiplexing still happens.
>>
>>        You need to hook up the timer source for multiplexing to something else
>>        which is not affected by tickless.
>
> Or inhibit nohz when there are active counters, but good point.
>
Don't want do use nohz because you would be modifying the system
you're trying to monitor.

>>  8/ Controlling group multiplexing
>>
>>        Although, multiplexing is somehow exposed to user via the timing
>>        information.  I believe there is not enough control. I know of advanced
>>        monitoring tools which needs to measure over a dozen events in one
>>        monitoring session. Given that the underlying PMU does not have enough
>>        counters OR that certain events cannot be measured together, it is
>>        necessary to split the events into groups and multiplex them. Events
>>        are not grouped at random AND groups are not ordered at random either.
>>        The sequence of groups is carefully chosen such that related events are
>>        in neighboring groups such that they measure similar parts of the
>>        execution.  This way you can mitigate the fluctuations introduced by
>>        multiplexing and compare ratios. In other words, some tools may want to
>>        control the order in which groups are scheduled on the PMU.
>
> Current code RR groups in the order they are created, is more control
> needed?
>
I understand the creation order in the case of a single tool.

My point was more in the case of multiple groups from multiple tools competing.
Imagine, Tool A and B want to monitor thread T. A has 3 groups, B 2 groups.
imagine all groups are exclusive. Does this mean that all groups of A will be
multiplexed and THEN all groups of B, or can they be interleaved, e.g. 1 group
from A, followed by 1 group from B?

This behavior has to be clearly spelled out by the API.


>>  9/ Event buffer
>>
>>        There is a kernel level event buffer which can be re-mapped read-only at
>>        the user level via mmap(). The buffer must be a multiple of page size
>
> 2^n actually
>
Yes. But this is rounded-up to pages because of remapping. So better make use
of the full space.


>>        and must be at least 2-page long. The First page is used for the
>>        counter re-mapping and buffer header, the second for the actual event
>>        buffer.
>
> I think a single data page is valid too (2^0=1).
>
I have not tried that yet.

> Suppose we have mapped 4 pages (of page size 4k), that means our buffer
> position would be the lower 14 bits of data_head.
>
> Now say the last observed position was:
>
>  0x00003458 (& 0x00003FFF == 0x3458)
>
> and the next observed position is:
>
>  0x00013458 (& 0x00003FFF == 0x3458)
>
> We'd still be able to tell we overflowed 8 times.
>
Isn't it 4 times?

> Does this suffice?
>
Should work, assuming you have some bits left for the overflow.
That means you cannot actually go to 4GB of space unless you
know you cannot lose the race with the kernel.

>>   11/ reserve_percpu
>>
>>        There are more than counters on many PMU models. Counters are not
>>        symmetrical even on X86.
>>
>>        What does this API actually guarantees in terms on what events a tool
>>        will be able to measure with the reserved counters?
>>
>> II/ X86 comments
>>
>>   Mostly implementation related comments in this section.
>>
>>   1/ Fixed counter and event on Intel
>>
>>        You cannot simply fall back to generic counters if you cannot find
>>        a fixed counter. There are model-specific bugs, for instance
>>        UNHALTED_REFERENCE_CYCLES (0x013c), does not measure the same thing on
>>        Nehalem when it is used in fixed counter 2 or a generic counter. The
>>        same is true on Core.
>>
>>        You cannot simply look at the event field code to determine whether
>>        this is an event supported by a fixed counters. You must look at the
>>        other fields such as edge, invert, cnt-mask. If those are present then
>>        you have to fall back to using a generic counter as fixed counters only
>>        support priv level filtering. As indicated above, though, the
>>        programming UNHALTED_REFERENCE_CYCLES on a generic counter does not
>>        count the same thing, therefore you need to fail is filters other than
>>        priv levels are present on this event.
>>
>>   2/ Event knowledge missing
>>
>>        There are constraints and bugs on some events in Intel Core and Nehalem.
>>        In your model, those need to be taken care of by the kernel. Should the
>>        kernel make the wrong decision, there would be no work-around for user
>>        tools. Take the example I outlined just above with Intel fixed counters.
>>
>>        Constraints do exist on AMD64 processors as well..
>
> Good thing updating the kernel is so easy ;-)
>

Not once this is in production though.


>>   3/ Interrupt throttling
>>
>>        There is apparently no way for a system admin to set the threshold. It
>>        is hardcoded.
>>
>>        Throttling occurs without the tool(s) knowing. I think this is a problem.
>
> Fixed, it has a sysctl now, is in generic code and emits timestamped
> throttle/unthrottle events to the data stream, Power also implemented
> the needed bits.
>
Good.

>> III/ Requests
>>
>>   1/ Sampling period change
>>
>>        As it stands today, it seems there is no way to change a period but to
>>        close() the event file descriptor and start over.. When you close the
>>        group leader, it is not clear to me what happens to the remaining events.
>
> The other events will be 'promoted' to individual counters and continue
> on until their fd is closed too.
>
So you'd better start from scratch because you will lose group sampling.

>>        I know of tools which want to adjust the sampling period based on the
>>        number of samples they get per second.
>
> I recently implemented dynamic period stuff, it adjusts the period every
> tick so as to strive for a given target frequency.
>
I am wondering is the tool shouldn't be in charge of that rather than
the kernel.
At least it would give it more control about what is happening and when.

>>        By design, your perf_counter_open() should not really be in the
>>        critical path, e.g., when you are processing samples from the event
>>        buffer. Thus, I think it would be good to have a dedicated call to
>>        allow changing the period.
>
> Yet another ioctl() ?
>
I would say yet another syscall.

>>   2/ Sampling period randomization
>>
>>        It is our experience (on Itanium, for instance), that for certain
>>        sampling measurements, it is beneficial to randomize the sampling
>>        period a bit. This is in particular the case when sampling on an
>>        event that happens very frequently and which is not related to
>>        timing, e.g., branch_instructions_retired. Randomization helps mitigate
>>        the bias. You do not need anything sophisticated.. But when you are using
>>        a kernel-level sampling buffer, you need to have to kernel randomize.
>>        Randomization needs to be supported per event.
>
> Corey raised this a while back, I asked what kind of parameters were
> needed and if a specific (p)RNG was specified.
>
> Is something with an (avg,std) good enough? Do you have an
> implementation that I can borrow, or even better a patch? :-)

I think all you need is a bitmask to control the range of variation of the
period. As I said, the randomizer does not need to be sophisticated.
In perfmon we originally used the Carta random number generator.
But nowadays, we use the existing random32() kernel function.

>
>> IV/ Open questions
>>
>>   1/ Support for model-specific uncore PMU monitoring capabilities
>>
>>        Recent processors have multiple PMUs. Typically one per core and but
>>        also one at the socket level, e.g., Intel Nehalem. It is expected that
>>        this API will provide access to these PMU as well.
>>
>>        It seems like with the current API, raw events for those PMUs would need
>>        a new architecture-specific type as the event encoding by itself may
>>        not be enough to disambiguate between a core and uncore PMU event.
>>
>>        How are those events going to be supported?
>
> /me goes poke at the docs... and finds MSR_OFFCORE_RSP_0. Not sure I
> quite get what they're saying though, but yes
>
This one is not uncore, it's core. Name is confusing. The uncore is all the UNC_
stuff. See Vol3b section 18.17.2.

>>   2/ Features impacting all counters
>>
>>        On some PMU models, e.g., Itanium, they are certain features which have
>>        an influence on all counters that are active. For instance, there is a
>>        way to restrict monitoring to a range of continuous code or data
>>        addresses using both some PMU registers and the debug registers.
>>
>>        Given that the API exposes events (counters) as independent of each
>>        other, I wonder how range restriction could be implemented.
>
> Setting them per counter and when scheduling the counters check for
> compatibility and stop adding counters to the pmu if the next counter is
> incompatible.
>
How would you pass the code range addresses per-counter?
Suppose I want to monitor CYCLES between 0x100000-0x200000.
Range is specified using debug registers.


>>        Similarly, on Itanium, there are global behaviors. For instance, on
>>        counter overflow the entire PMU freezes all at once. That seems to be
>>        contradictory with the design of the API which creates the illusion of
>>        independence.
>
> Simply take the interrupt, deal with the overflow, and continue, its not
> like the hardware can do any better, can it?
>
Hardware cannot do more. That means other unrelated counters which happen
to have been scheduled at the same time will be blindspotted.

I suspect that for Itanium, the better way is to refuse to co-schedule events
from different groups, i.e., always run in exclusive mode.

>>   3/ AMD IBS
>>
>>        How is AMD IBS going to be implemented?
>>
>>        IBS has two separate sets of registers. One to capture fetch related
>>        data and another one to capture instruction execution data. For each,
>>        there is one config register but multiple data registers. In each mode,
>>        there is a specific sampling period and IBS can interrupt.
>>
>>        It looks like you could define two pseudo events or event types and then
>>        define a new record_format and read_format.  That formats would only be
>>        valid for an IBS event.
>>
>>        Is that how you intend to support IBS?
>
> I can't seem to locate the IBS stuff in the documentation currently, and
> I must admit I've not yet looked into it, others might have.
>
AMD BIOS and Kernel Developer's Guide (BKDG) for Family 10h, section 3.13.
You have the register descriptions.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: comments on Performance Counters for Linux (PCL)
  2009-05-29 10:43   ` stephane eranian
@ 2009-05-29 15:02     ` Peter Zijlstra
  2009-05-29 15:59       ` stephane eranian
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2009-05-29 15:02 UTC (permalink / raw)
  To: eranian
  Cc: LKML, Andrew Morton, Thomas Gleixner, Ingo Molnar, Robert Richter,
	Paul Mackerras, Andi Kleen, Maynard Johnson, Carl Love,
	Corey J Ashford, Philip Mucci, Dan Terpstra, perfmon2-devel

On Fri, 2009-05-29 at 12:43 +0200, stephane eranian wrote:
> > I can't seem to locate the IBS stuff in the documentation currently, and
> > I must admit I've not yet looked into it, others might have.
> >
> AMD BIOS and Kernel Developer's Guide (BKDG) for Family 10h, section 3.13.
> You have the register descriptions.

Gah, I only downloaded BKDG for Family 11h under the assumption it would
include everything 10h had and then some,.. how ignorant of me :(


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: comments on Performance Counters for Linux (PCL)
  2009-05-29 15:02     ` Peter Zijlstra
@ 2009-05-29 15:59       ` stephane eranian
  0 siblings, 0 replies; 22+ messages in thread
From: stephane eranian @ 2009-05-29 15:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Thomas Gleixner, Ingo Molnar, Robert Richter,
	Paul Mackerras, Andi Kleen, Maynard Johnson, Carl Love,
	Corey J Ashford, Philip Mucci, Dan Terpstra, perfmon2-devel

On Fri, May 29, 2009 at 5:02 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Fri, 2009-05-29 at 12:43 +0200, stephane eranian wrote:
>> > I can't seem to locate the IBS stuff in the documentation currently, and
>> > I must admit I've not yet looked into it, others might have.
>> >
>> AMD BIOS and Kernel Developer's Guide (BKDG) for Family 10h, section 3.13.
>> You have the register descriptions.
>
> Gah, I only downloaded BKDG for Family 11h under the assumption it would
> include everything 10h had and then some,.. how ignorant of me :(
>
11h is the mobile series, it does not have IBS if I recall.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: comments on Performance Counters for Linux (PCL)
  2009-05-28 16:25 ` Peter Zijlstra
                     ` (3 preceding siblings ...)
  2009-05-29 10:43   ` stephane eranian
@ 2009-05-29 20:32   ` Peter Zijlstra
  2009-05-29 20:49     ` stephane eranian
  4 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2009-05-29 20:32 UTC (permalink / raw)
  To: eranian
  Cc: LKML, Andrew Morton, Thomas Gleixner, Ingo Molnar, Robert Richter,
	Paul Mackerras, Andi Kleen, Maynard Johnson, Carl Love,
	Corey J Ashford, Philip Mucci, Dan Terpstra, perfmon2-devel

On Thu, 2009-05-28 at 18:25 +0200, Peter Zijlstra wrote:
> >   10/ Group event buffer entry
> > 
> >        This is activated by setting the PERF_RECORD_GROUP in the record_type
> >        field.  With this bit set, the values of the other members of the
> >        group are stored sequentially in the buffer. To help figure out which
> >        value corresponds to which event, the current implementation also
> >        stores the raw encoding of the event.
> > 
> >        The event encoding does not help figure out which event the value refers
> >        to. There can be multiple events with the same code. This does fit the
> >        API model where events are identified by file descriptors.
> > 
> >        The file descriptor must be provided and not the raw encoding.
> 
> OK, sounds sensible.

This can't actually be done, fds can change, and there is no struct
file* to fd map.

If the config isn't good enough, the best we could do is something
unique per instance.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: comments on Performance Counters for Linux (PCL)
  2009-05-29 20:32   ` Peter Zijlstra
@ 2009-05-29 20:49     ` stephane eranian
  0 siblings, 0 replies; 22+ messages in thread
From: stephane eranian @ 2009-05-29 20:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Thomas Gleixner, Ingo Molnar, Robert Richter,
	Paul Mackerras, Andi Kleen, Maynard Johnson, Carl Love,
	Corey J Ashford, Philip Mucci, Dan Terpstra, perfmon2-devel

On Fri, May 29, 2009 at 10:32 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2009-05-28 at 18:25 +0200, Peter Zijlstra wrote:
>> >   10/ Group event buffer entry
>> >
>> >        This is activated by setting the PERF_RECORD_GROUP in the record_type
>> >        field.  With this bit set, the values of the other members of the
>> >        group are stored sequentially in the buffer. To help figure out which
>> >        value corresponds to which event, the current implementation also
>> >        stores the raw encoding of the event.
>> >
>> >        The event encoding does not help figure out which event the value refers
>> >        to. There can be multiple events with the same code. This does fit the
>> >        API model where events are identified by file descriptors.
>> >
>> >        The file descriptor must be provided and not the raw encoding.
>>
>> OK, sounds sensible.
>
> This can't actually be done, fds can change, and there is no struct
> file* to fd map.

The API must define an order in which values are stored. Otherwise
how would the application figure out what they correspond to in the
group.

Explain to me what you mean by fds can change. Are you thinking
about dup()+close() for instance?

> If the config isn't good enough, the best we could do is something
> unique per instance.
>
Config isn't good enough for sure. For instance, I could be sampling
the same generic PERF_CPU_COUNT_CYCLES twice in the same
group, once at the user level only (exclude_kernel=1), and once at
the kernel level only (exclude_user=1), but the two config values
would be identical.

If there is no backmapping, then you could get by perhaps by having
the API guarantee that values are stored in the order events were
originally chained in the group. If that does not work, then you need
to emit a unique cookie per fd and let the application maintain a mapping
table.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* comments on Performance Counters for Linux (PCL)
@ 2009-05-28 14:53 eranian
  0 siblings, 0 replies; 22+ messages in thread
From: eranian @ 2009-05-28 14:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, tglx, mingo, robert.richter, a.p.zijlstra, paulus, andi,
	mpjohn, carll, cjashfor, mucci, terpstra, perfmon2-devel

Hi,

The following sections are some preliminary comments concerning the
Performance Counter for Linux (PCL) API and implementation proposal
currently in development.

S.Eranian
eranian@gmail.com

I/ General API comments

   1/ Data structures

      * struct perf_counter_hw_event

      - I think this structure will be used to enable non-counting features,
	e.g. IBS. Name is awkward. It is used to enable non-hardware events
	(sw events). Why not call it: struct perf_event

      - uint64_t config

	Why use a single field to encode event type and its encoding? By design,
	the syscall is not in the critical path. Why not spell things out
	clearly: int type, uint64_t code.

      - uint64_t irq_period

        IRQ is an x86 related name. Why not use smpl_period instead?

      - uint32_t record_type

        This field is a bitmask. I believe 32-bit is too small to accommodate
	future record formats.

      - uint32_t read_format

        Ditto.

      - uint64_t nmi

        This is an X86-only feature. Why make this visible in a generic API?

	What is the semantic of this?

	I cannot have one counter use NMI and another not use NMI or are you
	planning on switching the interrupt vector when you change event groups?

	Why do I need to be a priviledged user to enable NMI? Especially given
	that:
		- non-privileged users can monitor at privilege level 0 (kernel).
		- there is interrupt throttling

      - uint64_t exclude_*

        It seems those fields were added to support the generic HW events. But
	I think they are confusing and their semantic is not quite clear.

	Furthermore, aren't they irrelevant for the SW events?

	What is the meaning of exclude_user? Which priv levels are actually
	excluded?

        Take Itanium, it has 4 priv levels and the PMU counters can monitor at
	any priv levels or combination thereof?

	When programming raw HW events, the priv level filtering is typically
	already included. Which setting has priority, raw encoding or the
	exclude_*?

	Looking at the existing X86 implementation, it seems exclude_* can
	override whatever is set in the raw event code.

	For any events, but in particular, SW events, why not encode this in
	the config field, like it is for a raw HW event?

      - mmap, munmap, comm

        It is not clear to me why those fields are defined here rather than as
	PERF_RECORD_*. They are stored in the event buffer only. They are only
	useful when sampling.

        It is not clear why you have mmap and munmap as separate options.
	What's the point of munmap-only notification?

      * enum perf_event_types vs. enum perf_event_type

	Both names are too close to each other, yet they define unrelated data
	structures. This is very confusing.

      * struct perf_counter_mmap_page

	The definition of data_head precludes sampling buffers bigger that 4GB.

	Does that makes sense on TB machines?

	Given there is only one counter per-page, there is an awful lot of
	precious RLIMIT_MEMLOCK space wasted for this.

	Typically, if you are self-sampling, you are not going to read the
	current value of the sampling period. That re-mapping trick is only
	useful when counting.

	Why not make these two separate mappings (using the mmap offset as
	the indicator)?

	With this approach, you would get one page back per sampling period
	and that page could then be used for the actual samples.

  2/ System calls

      * ioctl()

	You have defined 3 ioctls() so far to operate on an existing event.
	I was under the impression that ioctl() should not be used except for
	drivers.

      * prctl()

	The API is event-based. Each event gets a file descriptor. Events are
	therefore managed individually. Thus, to enable/disable, you need to
	enable/disable each one separately.

	The use of prctl() breaks this design choice. It is not clear what you
	are actually enabling. It looks like you are enabling all the counters
	attached to the thread. This is incorrect. With your implementation,
	the PMU can be shared between competing users. In particular, multiple
	tools may be monitoring the same thread. Now, imagine, a tool is
	monitoring a self-monitoring thread which happens to start/stop its
	measurement using prctl(). Then, that would also start/stop the
	measurement of the external tool. I have verified that this is what is
	actually happening.

	I believe this call is bogus and it should be eliminated. The interface
	is exposing events individually therefore they should be controlled
	individually.

  3/ Counter width

	It is not clear whether or not the API exposes counters as 64-bit wide
	on PMUs which do not implement 64-bit wide counters.

	Both irq_period and read() return 64-bit integers. However, it appears
	that the implementation is not using all the bits. In fact, on X86, it
	appears the irq_period is truncated silently. I believe this is not
	correct. If the period is not valid, an error should be returned.
	Otherwise, the tool will be getting samples at a rate different than
	what it requested. 

	I would assume that on the read() side, counts are accumulated as
	64-bit integers. But if it is the case, then it seems there is an
	asymmetry between period and counts.

	Given that your API is high level, I don't think tools should have to
	worry about the actual width of a counter. This is especially true
	because they don't know which counters the event is going to go into
	and if I recall correctly, on some PMU models, different counters can
	have different width (Power, I think).

	It is rather convenient for tools to always manipulate counters as
	64-bit integers. You should provide a consistent view between counts
	and periods.

  4/ Grouping

	By design, an event can only be part of one group at a time. Events in
	a group are guaranteed to be active on the PMU at the same time. That
	means a group cannot have more events than there are available counters
	on the PMU. Tools may want to know the number of counters available in
	order to group their events accordingly, such that reliable ratios
	could be computed. It seems the only way to know this is by trial and
	error. This is not practical.

  5/ Multiplexing and scaling

	The PMU can be shared by multiple programs each controlling a variable
	number of events. Multiplexing occurs by default unless pinned is
	requested. The exclusive option only guarantees the group does not
	share the PMU with other groups while it is active, at least this is
	my understanding.

	By default, you may be multiplexed and if that happens you cannot know
	unless you request the timing information as part of the read_format.
	Without it, and if multiplexing has occurred, bogus counts may be
	returned with no indication whatsoever.

	To avoid returning misleading information, it seems like the API should
	refuse to open a non-pinned event which does not have
	PERF_FORMAT_TOTAL_TIME_ENABLED|PERF_FORMAT_TOTAL_TIME_RUNNING in the
	read_format. This would avoid a lot of confusion down the road. 

  7/ Multiplexing and system-wide

	Multiplexing is time-based and it is hooked into the timer tick. At
	every tick, the kernel tries to schedule another group of events.

	In tickless kernels if a CPU is idle, no timer tick is generated,
	therefore no multiplexing occurs. This is incorrect. It's not because
	the CPU is idle, that there aren't any interesting PMU events to measure.
	Parts of the CPU may still be active, e.g., caches and buses. And thus,
	it is expected that multiplexing still happens.

	You need to hook up the timer source for multiplexing to something else
	which is not affected by tickless.

  8/ Controlling group multiplexing

	Although, multiplexing is somehow exposed to user via the timing
	information.  I believe there is not enough control. I know of advanced
	monitoring tools which needs to measure over a dozen events in one
	monitoring session. Given that the underlying PMU does not have enough
	counters OR that certain events cannot be measured together, it is
	necessary to split the events into groups and multiplex them. Events
	are not grouped at random AND groups are not ordered at random either.
	The sequence of groups is carefully chosen such that related events are
	in neighboring groups such that they measure similar parts of the
	execution.  This way you can mitigate the fluctuations introduced by
	multiplexing and compare ratios. In other words, some tools may want to
	control the order in which groups are scheduled on the PMU.

	The exclusive flag ensures correct grouping. But there is nothing to
	control ordering of groups.  That is a problem for some tools. Groups
	from different 'session' may be interleaved and break the continuity of
	 measurement.

	The group ordering has to be controllable from the tools OR must be
	fully specified by the API. But it should not be a property of the
	implementation. The API could for instance specify that groups are
	scheduled in increasing order of the group leaders' file descriptor.
	There needs to be some way of preventing interleaving of groups from
	different 'sessions'.

  9/ Event buffer

	There is a kernel level event buffer which can be re-mapped read-only at
	the user level via mmap(). The buffer must be a multiple of page size
	and must be at least 2-page long. The First page is used for the
	counter re-mapping and buffer header, the second for the actual event
	buffer.

	The buffer is managed as a cyclic buffer. That means there is a
	continuous race between the tool and the kernel. The tool must parse
	the buffer faster than the kernel can fill it out. It is important to
	realize that the race continues even when monitoring is stopped, as non
	PMU-based infos keep being stored, such as mmap, munmap. This is
	expected because it is not possible to lose mapping information
	otherwise invalid correlation of samples may happen.

	However, there is currently no reliable way of figuring out whether or
	not the buffer has wrapped around since the last scan by the tool. Just
	checking the current position or estimating the space left is not good
	enough. There ought to be an overflow counter of some sort indicating
	the number of times the head wrapped around.

   10/ Group event buffer entry

	This is activated by setting the PERF_RECORD_GROUP in the record_type
	field.  With this bit set, the values of the other members of the
	group are stored sequentially in the buffer. To help figure out which
	value corresponds to which event, the current implementation also
	stores the raw encoding of the event.

	The event encoding does not help figure out which event the value refers
	to. There can be multiple events with the same code. This does fit the
	API model where events are identified by file descriptors.

	The file descriptor must be provided and not the raw encoding.

   11/ reserve_percpu

	There are more than counters on many PMU models. Counters are not
	symmetrical even on X86.

	What does this API actually guarantees in terms on what events a tool
	will be able to measure with the reserved counters?

II/ X86 comments

   Mostly implementation related comments in this section.

   1/ Fixed counter and event on Intel

	You cannot simply fall back to generic counters if you cannot find
	a fixed counter. There are model-specific bugs, for instance
	UNHALTED_REFERENCE_CYCLES (0x013c), does not measure the same thing on
	Nehalem when it is used in fixed counter 2 or a generic counter. The
	same is true on Core.

	You cannot simply look at the event field code to determine whether
	this is an event supported by a fixed counters. You must look at the
	other fields such as edge, invert, cnt-mask. If those are present then
	you have to fall back to using a generic counter as fixed counters only
	support priv level filtering. As indicated above, though, the
	programming UNHALTED_REFERENCE_CYCLES on a generic counter does not
	count the same thing, therefore you need to fail is filters other than
	priv levels are present on this event.

   2/ Event knowledge missing

	There are constraints and bugs on some events in Intel Core and Nehalem.
	In your model, those need to be taken care of by the kernel. Should the
	kernel make the wrong decision, there would be no work-around for user
	tools. Take the example I outlined just above with Intel fixed counters.

	Constraints do exist on AMD64 processors as well.

   3/ Interrupt throttling

	There is apparently no way for a system admin to set the threshold. It
	is hardcoded.

	Throttling occurs without the tool(s) knowing. I think this is a problem.

    4/ NMI

	Why restrict NMI to privileged users when you have throttling to protect
	against interrupt flooding?

	Are you trying to restrict non privileged users from getting sampling
	inside kernel critical sections?

III/ Requests

   1/ Sampling period change

	As it stands today, it seems there is no way to change a period but to
	close() the event file descriptor and start over. When you close the
	group leader, it is not clear to me what happens to the remaining events.

	I know of tools which want to adjust the sampling period based on the
	number of samples they get per second.

	By design, your perf_counter_open() should not really be in the
	critical path, e.g., when you are processing samples from the event
	buffer. Thus, I think it would be good to have a dedicated call to
	allow changing the period.

   2/ Sampling period randomization

	It is our experience (on Itanium, for instance), that for certain
	sampling measurements, it is beneficial to randomize the sampling
	period a bit. This is in particular the case when sampling on an
	event that happens very frequently and which is not related to
	timing, e.g., branch_instructions_retired. Randomization helps mitigate
	the bias. You do not need anything sophisticated. But when you are using
	a kernel-level sampling buffer, you need to have to kernel randomize.
	Randomization needs to be supported per event.

   3/ Group multiplexing ordering

	As mentioned above, the ordering of group multiplexing for one process
	needs to be either specified by the API or controllable by users.

IV/ Open questions

   1/ Support for model-specific uncore PMU monitoring capabilities

	Recent processors have multiple PMUs. Typically one per core and but
	also one at the socket level, e.g., Intel Nehalem. It is expected that
	this API will provide access to these PMU as well.

	It seems like with the current API, raw events for those PMUs would need
	a new architecture-specific type as the event encoding by itself may
	not be enough to disambiguate between a core and uncore PMU event.

	How are those events going to be supported?

   2/ Features impacting all counters

	On some PMU models, e.g., Itanium, they are certain features which have
	an influence on all counters that are active. For instance, there is a
	way to restrict monitoring to a range of continuous code or data
	addresses using both some PMU registers and the debug registers.

	Given that the API exposes events (counters) as independent of each
	other, I wonder how range restriction could be implemented.

	Similarly, on Itanium, there are global behaviors. For instance, on
	counter overflow the entire PMU freezes all at once. That seems to be
	contradictory with the design of the API which creates the illusion of
	independence.

	What solutions do you propose?

   3/ AMD IBS

	How is AMD IBS going to be implemented?

	IBS has two separate sets of registers. One to capture fetch related
	data and another one to capture instruction execution data. For each,
	there is one config register but multiple data registers. In each mode,
	there is a specific sampling period and IBS can interrupt.

	It looks like you could define two pseudo events or event types and then
	define a new record_format and read_format.  That formats would only be
	valid for an IBS event.

	Is that how you intend to support IBS?

   4/ Intel PEBS    

	Since Netburst-based processors, Intel PMUs support a hardware sampling
	buffer mechanism called PEBS.

	PEBS really became useful with Nehalem.

	Not all events support PEBS. Up until Nehalem, only one counter supported
	PEBS (PMC0). The format of the hardware buffer has changed between Core
	and Nehalem. It is not yet architected, thus it can still evolve with
	future PMU models.

	On Nehalem, there is a new PEBS-based feature called Load Latency
	Filtering which captures where data cache misses occur
	(similar to Itanium D-EAR). Activating this feature requires setting a
	latency threshold hosted in a separate PMU MSR.

	On Nehalem, given that all 4 generic counters support PEBS, the
	sampling buffer may contain samples generated by any of the 4 counters.
	The buffer includes a bitmask of registers to determine the source
	of the samples. Multiple bits may be set in the bitmask.


	How PEBS will be supported for this new API?

   5/ Intel Last Branch Record (LBR)

	Intel processors since Netburst have a cyclic buffer hosted in
	registers which can record taken branches. Each taken branch is stored
	into a pair of LBR registers (source, destination). Up until Nehalem,
	there was not filtering capabilities for LBR. LBR is not an architected
	PMU feature.  

	There is no counter associated with LBR. Nehalem has a LBR_SELECT MSR.
	However there are some constraints on it given it is shared by threads.

	LBR is only useful when sampling and therefore must be combined with a
	counter. LBR must also be configured to freeze on PMU interrupt.

	How is LBR going to be supported?


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2009-05-30 19:17 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-05-28 14:58 comments on Performance Counters for Linux (PCL) stephane eranian
2009-05-28 16:25 ` Peter Zijlstra
2009-05-28 16:47   ` Peter Zijlstra
2009-05-28 21:06   ` [perfmon2] " Corey Ashford
2009-05-28 21:35     ` Ingo Molnar
2009-05-28 23:24       ` Paul Mackerras
2009-05-29  7:19         ` Ingo Molnar
2009-05-29  8:21           ` Geert Uytterhoeven
2009-05-29 14:42             ` Carl Love
2009-05-30 19:16               ` Ingo Molnar
2009-05-29 10:50         ` stephane eranian
2009-05-29 10:53           ` stephane eranian
2009-05-29  6:51   ` Andi Kleen
2009-05-29  8:19     ` Peter Zijlstra
2009-05-29  8:53       ` Andi Kleen
2009-05-29 16:13         ` [perfmon2] " Luck, Tony
2009-05-29 10:43   ` stephane eranian
2009-05-29 15:02     ` Peter Zijlstra
2009-05-29 15:59       ` stephane eranian
2009-05-29 20:32   ` Peter Zijlstra
2009-05-29 20:49     ` stephane eranian
  -- strict thread matches above, loose matches on Subject: below --
2009-05-28 14:53 eranian

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox