public inbox for linux-kernel@vger.kernel.org
* v2 of comments on Performance Counters for Linux (PCL)
@ 2009-06-16 17:42 stephane eranian
  2009-06-22 11:48 ` Ingo Molnar
                   ` (19 more replies)
  0 siblings, 20 replies; 69+ messages in thread
From: stephane eranian @ 2009-06-16 17:42 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Thomas Gleixner, Ingo Molnar, Robert Richter,
	Peter Zijlstra, Paul Mackerras, Andi Kleen, Maynard Johnson,
	Carl Love, Corey J Ashford, Philip Mucci, Dan Terpstra,
	perfmon2-devel

Hi,

Here is an updated version of my comments on PCL. Compared to the previous
version, I have removed all the issues that were fixed or clarified. I have
kept all the issues and open questions which I think are not yet resolved,
and I have added a few more.


I/ General API comments

 1/ System calls

     * ioctl()

       You have defined 5 ioctls() so far to operate on an existing event.
       I was under the impression that ioctl() should not be used except for
       drivers.

       How do you justify your usage of ioctl() in this context?

 2/ Grouping

       By design, an event can only be part of one group at a time. Events in
       a group are guaranteed to be active on the PMU at the same time. That
       means a group cannot have more events than there are available counters
       on the PMU. Tools may want to know the number of counters available in
       order to group their events accordingly, such that reliable ratios
       could be computed. It seems the only way to know this is by trial and
       error. This is not practical.
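
       To make this concrete, the only discovery procedure available to a
       tool today amounts to the probing sketched below, a pure simulation
       with hypothetical names: in reality try_group_of() would issue the
       event-creation syscalls and verify the group can be scheduled.

```c
#include <stdbool.h>

/* Stand-in for "create a group of n events and check it ever gets
 * scheduled"; in reality this would be n event-creation syscalls plus a
 * read-back. `pmu_counters` models the undisclosed hardware limit. */
static bool try_group_of(int n, int pmu_counters)
{
	return n <= pmu_counters;
}

/* Grow the group until it no longer fits: the only discovery method the
 * API leaves to tools today. */
int probe_num_counters(int pmu_counters)
{
	int n = 1;
	while (try_group_of(n + 1, pmu_counters))
		n++;
	return n;
}
```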

 3/ Multiplexing and system-wide

       Multiplexing is time-based and it is hooked into the timer tick. At
       every tick, the kernel tries to schedule another set of event groups.

       In tickless kernels, if a CPU is idle no timer tick is generated and
       therefore no multiplexing occurs. This is incorrect: the fact that
       the CPU is idle does not mean there are no interesting PMU events to
       measure. Parts of the CPU may still be active, e.g., caches and
       buses. Thus, it is expected that multiplexing still happens.

       You need to hook up the timer source for multiplexing to something else
       which is not affected by tickless. You cannot simply disable tickless
       during a measurement because you would not be measuring the system as
       it actually behaves.

  4/ Controlling group multiplexing

       Although multiplexing is exposed to users via the timing information,
       tools do not necessarily group events at random. Nor do they order
       groups at random.

       I know of tools which craft the sequence of groups carefully, placing
       related events in neighboring groups so that they measure similar
       parts of the execution. This way, you can mitigate the fluctuations
       introduced by multiplexing. In other words, some tools may want to
       control the order in which groups are scheduled on the PMU.

       You mentioned that groups are multiplexed in creation order. But which
       creation order? As far as I know, multiple distinct tools may be
       attaching to the same thread at the same time and their groups may be
       interleaved in the list. Therefore, I believe 'creation order' refers
       to the global group creation order which is only visible to the kernel.
       Each tool may see a different order. Let's take an example.

       Tool A creates groups G1, G2, G3 and attaches them to thread T0. At
       the same time tool B creates groups G4, G5. The actual global order
       may be: G1, G4, G2, G5, G3. This is what the kernel is going to
       multiplex.
       Each group will be multiplexed in the right order from the point of view
       of each tool. But there will be gaps. It would be nice to have a way
       to ensure that the sequence is either: G1, G2, G3, G4, G5 or G4, G5,
       G1, G2, G3. In other words, avoid the interleaving.

  5/ Mmaped count

       It is possible to read counts directly from user space for
       self-monitoring threads. This leverages a HW capability present on
       some processors. On x86, this is possible via RDPMC.

       The full 64-bit count is constructed by combining the hardware value
       extracted with an assembly instruction and a base value made
       available through the mmap. There is an atomic generation count
       available to deal with the race condition.

       I believe there is a problem with this approach given that the PMU
       is shared and that events can be multiplexed. That means that even
       though you are self-monitoring, events get replaced on the PMU. The
       assembly instruction is unaware of that; it reads a register, not
       an event.

       On x86, assume event A is hosted in counter 0, thus you need RDPMC(0)
       to extract the count. But then, the event is replaced by another one
       which reuses counter 0. At the user level, you will still use RDPMC(0)
       but it will read the HW value from a different event and combine it
       with a base count from another one.

       To avoid this, you need to pin the event so it stays in the PMU at
       all times. Now, here is something unclear to me. Pinning does not
       mean stay in the SAME register, it means the event stays on the PMU
       but it can possibly change register. To prevent that, I believe you need
       to also set exclusive so that no other group can be scheduled, and thus
       possibly use the same counter.

       Looks like this is the only way you can make this actually work.
       Not setting pinned+exclusive is another pitfall many people will
       fall into.
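
       For reference, my understanding of the user-level read sequence is
       sketched below as a simulation: the field names are modeled on the
       mmap'ed counter page but are my assumptions rather than the exact
       ABI, and RDPMC is replaced by a software stub.

```c
#include <stdint.h>

/* Hypothetical layout of the mmap'ed counter page (names are my
 * assumptions, not the exact ABI). */
struct counter_page {
	volatile uint32_t lock;    /* generation count, bumped by the kernel */
	volatile uint32_t index;   /* HW counter hosting the event + 1, 0 = none */
	volatile int64_t  offset;  /* base value to add to the HW count */
};

uint64_t fake_hw[4];                      /* stand-in for the HW counters */
static uint64_t fake_rdpmc(uint32_t i)    /* real code: inline asm RDPMC  */
{
	return fake_hw[i];
}

uint64_t read_count(struct counter_page *pg)
{
	uint32_t seq, idx;
	int64_t off;
	uint64_t hw;

	do {
		seq = pg->lock;
		idx = pg->index;
		off = pg->offset;
		hw  = idx ? fake_rdpmc(idx - 1) : 0;
		/* If the kernel moved or replaced the event right here
		 * without the generation count being observed to change,
		 * RDPMC samples the wrong counter -- hence the need for
		 * pinned+exclusive. */
	} while (pg->lock != seq);

	return (uint64_t)(off + (int64_t)hw);
}
```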

  6/ Group scheduling

       Looking at the existing code, it seems to me there is a risk of
       starvation for groups, i.e., groups never scheduled on the PMU.

       My understanding of the scheduling algorithm is:

                - first try to schedule pinned groups. If a pinned group
                 fails, put it in error mode. read() will fail until the
                 group gets another chance at being scheduled.

               - then try to schedule the remaining groups. If a group fails
                 just skip it.

       If the group list does not change, then certain groups may always fail.
       However, the ordering of the list changes because at every tick, it is
       rotated. The head becomes the tail. Therefore, each group eventually gets
       the first position and therefore gets the full PMU to assign its events.

       This works as long as there is a guarantee the list will ALWAYS
       rotate. If a thread does not run long enough for a tick, it may
       never rotate.
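
       The starvation concern can be illustrated with a toy model (my own
       simulation, not kernel code): three groups of 3 events each on a
       4-counter PMU, scheduled greedily from the head of the list.

```c
#include <stdbool.h>

#define NCTRS 4  /* assumed number of counters on the toy PMU */

struct grp { int id, width; };  /* width = counters the group needs */

/* one tick: schedule greedily from the head, skipping groups that do
 * not fit -- my reading of the non-pinned scheduling pass */
static void tick(const struct grp *g, int n, bool *ran)
{
	int free = NCTRS;
	for (int i = 0; i < n; i++)
		if (g[i].width <= free) {
			free -= g[i].width;
			ran[g[i].id] = true;
		}
}

static void rotate(struct grp *g, int n)  /* head becomes tail */
{
	struct grp head = g[0];
	for (int i = 0; i < n - 1; i++)
		g[i] = g[i + 1];
	g[n - 1] = head;
}

/* true iff every group gets scheduled at least once within `ticks` ticks */
bool all_get_scheduled(struct grp *g, int n, int ticks, bool do_rotate)
{
	bool ran[16] = { false };
	for (int t = 0; t < ticks; t++) {
		tick(g, n, ran);
		if (do_rotate)
			rotate(g, n);
	}
	for (int i = 0; i < n; i++)
		if (!ran[i])
			return false;
	return true;
}
```

       Without rotation only the head group ever runs; with rotation all
       three run within three ticks. The guarantee evaporates as soon as
       the thread never runs long enough to take a tick.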

  7/ Group validity checking

       At the user level, an application is only concerned with events and
       grouping of those events. The assignment logic is performed by the
       kernel.

       For a group to be scheduled, all its events must be compatible with
       each other, otherwise the group will never be scheduled. It is not
       clear to me when that sanity check is performed if I create the
       group in the stopped state.

       If the group goes all the way to scheduling, it will never be
       scheduled. Counts will be zero and the users will have no idea why.
       If the group is put in error state, read() will not be possible.
       But again, how will the user know why?


  8/ Generalized cache events

      In recent days, you have added support for what you call
      'generalized cache events'.

      The log defines:
               new event type: PERF_TYPE_HW_CACHE

               This is a 3-dimensional space:
               { L1-D, L1-I, L2, ITLB, DTLB, BPU } x
               { load, store, prefetch } x
               { accesses, misses }

      Those generic events are then mapped by the kernel onto actual PMU
      events, if possible.

      I don't see any justification for adding this, especially in the
      kernel.

      What's the motivation and goal of this?

      If you define generic events, you need to provide a clear definition
      of what they are actually measuring. This is especially true for
      caches, because there are many cache events and many different
      behaviors.

      If the goal is to make comparisons easier, I believe this is doomed
      to fail, because different caches behave differently and events
      capture different subtle things, e.g., HW prefetch vs. SW prefetch.
      If I need to know the mapping to actually understand what a generic
      event is counting, then this whole feature is useless.
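
      For reference, the 3-dimensional space presumably collapses into
      attr.config with one byte per dimension. The sketch below mirrors
      that reading; the enum values follow the order in the commit log,
      but the exact packing is my assumption, not a verified ABI.

```c
#include <stdint.h>

/* Enum values mirror the order given in the commit log; the packing of
 * the three dimensions into attr.config (one byte each) is an assumed
 * encoding, not a verified ABI. */
enum cache_id     { C_L1D, C_L1I, C_L2, C_ITLB, C_DTLB, C_BPU };
enum cache_op     { OP_LOAD, OP_STORE, OP_PREFETCH };
enum cache_result { RES_ACCESS, RES_MISS };

uint64_t hw_cache_config(enum cache_id id, enum cache_op op,
                         enum cache_result res)
{
	return (uint64_t)id | ((uint64_t)op << 8) | ((uint64_t)res << 16);
}
```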

  9/ Group reading

      It is possible to start/stop an event group simply via ioctl() on
      the group leader. However, it is not possible to read all the counts
      with a single read() system call. That seems odd. Furthermore, I
      believe you want reads to be as atomic as possible.
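
      What I would like to see is a single read() returning something like
      the record below for the whole group. The layout and names here are
      purely my suggestion, not an existing interface.

```c
#include <stdint.h>

/* One possible layout for an atomic whole-group read(): a member count
 * followed by one value per event. This layout is purely a suggestion. */
struct group_read {
	uint64_t nr;        /* number of events in the group     */
	uint64_t values[];  /* one count per event, leader first */
};

/* trivial consumer of such a record: sum of all member counts */
uint64_t group_total(const struct group_read *r)
{
	uint64_t sum = 0;
	for (uint64_t i = 0; i < r->nr; i++)
		sum += r->values[i];
	return sum;
}
```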

  10/ Event buffer minimal useful size

      As it stands, the buffer header occupies the first page, even though
      the buffer header struct is 32 bytes long. That's a lot of precious
      RLIMIT_MEMLOCK memory wasted.

      The actual buffer (data) starts at the next page (from builtin-top.c):

       static void mmap_read_counter(struct mmap_data *md)
       {
               unsigned int head = mmap_read_head(md);
               unsigned int old = md->prev;
               unsigned char *data = md->base + page_size;


       Given that buffer "full" notifications are sent on page-crossing
       boundaries, if the actual buffer payload size is 1 page, you are
       guaranteed to have your samples overwritten.

       This leads me to believe that the minimal buffer size to get useful
       data is 3 pages. This is per event group per thread. That puts a
       lot of pressure on RLIMIT_MEMLOCK, which is usually set fairly low
       by distros.

   11/ Missing definitions for generic hardware events

       As soon as you define generic events, you need to provide a clear
       and precise definition as to what they measure. This is crucial to
       make them useful. I have not seen such a definition yet.

II/ X86 comments

  1/ Fixed counters on Intel

       You cannot simply fall back to generic counters if you cannot find
       a fixed counter. There are model-specific bugs, for instance
       UNHALTED_REFERENCE_CYCLES (0x013c), does not measure the same thing on
       Nehalem when it is used in fixed counter 2 or a generic counter. The
       same is true on Core.

       You cannot simply look at the event field code to determine whether
       this is an event supported by a fixed counter. You must look at the
       other fields such as edge, invert, cnt-mask. If those are present then
       you have to fall back to using a generic counter as fixed counters only
       support priv level filtering. As indicated above, though, programming
       UNHALTED_REFERENCE_CYCLES on a generic counter does not count the same
       thing, therefore you need to fail if filters other than priv levels are
       present on this event.
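
       A concrete check along those lines could look like the following
       (the PERFEVTSEL bit positions are the architectural ones; the
       helper name is mine):

```c
#include <stdbool.h>
#include <stdint.h>

/* x86 PERFEVTSEL bit positions (architectural) */
#define EVSEL_EDGE  (1ULL << 18)
#define EVSEL_INV   (1ULL << 23)
#define EVSEL_CMASK (0xffULL << 24)

/* Fixed counters only support priv-level filtering, so any of these
 * fields forces the event onto a generic counter (and, for events such
 * as UNHALTED_REFERENCE_CYCLES, should instead cause a failure). */
bool needs_generic_counter(uint64_t evsel)
{
	return (evsel & (EVSEL_EDGE | EVSEL_INV | EVSEL_CMASK)) != 0;
}
```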

  2/ Event knowledge missing

       There are constraints on events in Intel processors. Different
       constraints also exist on AMD64 processors, especially with
       uncore-related events.

       In your model, those need to be taken care of by the kernel. Should
       the kernel make the wrong decision, there would be no work-around
       for user tools. Take the example I outlined just above with Intel
       fixed counters.

       The current code base does not have any constrained-event support,
       therefore bogus counts may be returned depending on the event
       measured.

III/ Requests

  1/ Sampling period randomization

       It is our experience (on Itanium, for instance) that for certain
       sampling measurements, it is beneficial to randomize the sampling
       period a bit. This is in particular the case when sampling on an
       event that happens very frequently and which is not related to
       timing, e.g., branch_instructions_retired. Randomization helps mitigate
       the bias. You do not need something sophisticated. But when you are using
       a kernel-level sampling buffer, you need to have the kernel randomize.
       Randomization needs to be supported per event.
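
       Nothing sophisticated is needed. For instance, replacing the low
       bits of the period with output from a cheap PRNG on each overflow
       would do; the sketch below is illustrative only (the per-event
       `mask` knob is a hypothetical parameter), not actual kernel code.

```c
#include <stdint.h>

/* Cheap xorshift PRNG; anything usable from interrupt context would do. */
static uint64_t rnd_state = 0x9e3779b97f4a7c15ULL;

static uint64_t xorshift64(void)
{
	rnd_state ^= rnd_state << 13;
	rnd_state ^= rnd_state >> 7;
	rnd_state ^= rnd_state << 17;
	return rnd_state;
}

/* Next period = base period with its low bits (selected by the per-event
 * `mask` parameter, a hypothetical knob) replaced by random bits. */
uint64_t next_sample_period(uint64_t base, uint64_t mask)
{
	return (base & ~mask) | (xorshift64() & mask);
}
```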

IV/ Open questions

  1/ Support for model-specific uncore PMU monitoring capabilities

       Recent processors have multiple PMUs: typically one per core, but
       also one at the socket level, e.g., Intel Nehalem. It is expected
       that this API will provide access to these PMUs as well.

       It seems like with the current API, raw events for those PMUs would
       need a new architecture-specific type, as the event encoding by
       itself may not be enough to disambiguate between a core and an
       uncore PMU event.

       How are those events going to be supported?

  2/ Features impacting all counters

       On some PMU models, e.g., Itanium, there are certain features which
       have an influence on all active counters. For instance, there is a
       way to restrict monitoring to a range of contiguous code or data
       addresses using both some PMU registers and the debug registers.

       Given that the API exposes events (counters) as independent of each
       other, I wonder how range restriction could be implemented.

       Similarly, on Itanium, there are global behaviors. For instance, on
       counter overflow the entire PMU freezes all at once. That seems to be
       contradictory with the design of the API which creates the illusion of
       independence.

       What solutions do you propose?

  3/ AMD IBS

       How is AMD IBS going to be implemented?

       IBS has two separate sets of registers. One to capture fetch related
       data and another one to capture instruction execution data. For each,
       there is one config register but multiple data registers. In each mode,
       there is a specific sampling period and IBS can interrupt.

       It looks like you could define two pseudo events or event types and
       then define a new record_format and read_format. Those formats
       would only be valid for an IBS event.

       Is that how you intend to support IBS?

  4/ Intel PEBS

       Since Netburst-based processors, Intel PMUs support a hardware sampling
       buffer mechanism called PEBS.

       PEBS really became useful with Nehalem.

       Not all events support PEBS. Up until Nehalem, only one counter supported
       PEBS (PMC0). The format of the hardware buffer has changed between Core
       and Nehalem. It is not yet architected, thus it can still evolve with
       future PMU models.

       On Nehalem, there is a new PEBS-based feature called Load Latency
       Filtering which captures where data cache misses occur
       (similar to Itanium D-EAR). Activating this feature requires setting a
       latency threshold hosted in a separate PMU MSR.

       On Nehalem, given that all 4 generic counters support PEBS, the
       sampling buffer may contain samples generated by any of the 4 counters.
       The buffer includes a bitmask of registers to determine the source
       of the samples. Multiple bits may be set in the bitmask.


       How will PEBS be supported by this new API?

  5/ Intel Last Branch Record (LBR)

       Intel processors since Netburst have a cyclic buffer hosted in
       registers which can record taken branches. Each taken branch is
       stored into a pair of LBR registers (source, destination). Up until
       Nehalem, there were no filtering capabilities for LBR. LBR is not
       an architected PMU feature.

       There is no counter associated with LBR. Nehalem has a LBR_SELECT
       MSR. However, there are some constraints on it given that it is
       shared between threads.

       LBR is only useful when sampling and therefore must be combined with a
       counter. LBR must also be configured to freeze on PMU interrupt.

       How is LBR going to be supported?

* Re: [perfmon2] IV.3 - AMD IBS
@ 2009-06-25 11:28 stephane eranian
  0 siblings, 0 replies; 69+ messages in thread
From: stephane eranian @ 2009-06-25 11:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Drongowski, Paul, Peter Zijlstra, Rob Fowler, Philip Mucci, LKML,
	Andi Kleen, Paul Mackerras, Maynard Johnson, Andrew Morton,
	Thomas Gleixner, perfmon2-devel

Hi,

On Tue, Jun 23, 2009 at 4:55 PM, Ingo Molnar<mingo@elte.hu> wrote:
>
> The 20 bits delay is in cycles, right? So this in itself still lends
> itself to be transparently provided as a PERF_COUNT_HW_CPU_CYCLES
> counter.
>

I do not believe you can use IBS as a better substitute for either CYCLES
or INSTRUCTIONS sampling. IBS simply does not operate in the same way.

But instead of me arguing with you guys for a long time, I have asked someone
at AMD who knows more than me about IBS. Paul posted his answer only on
the perfmon2 mailing list, I have forwarded it below.

You will also note that he is providing another example as to why support for
software sampling period randomization is useful.

I would like to thank Paul for spending time providing a lot of useful details
about IBS.

I am hoping this can clarify things.

On Wed, Jun 24, 2009 at 8:20 PM, Drongowski, Paul<paul.drongowski@amd.com> wrote:
>
> Hi --
>
> I'm sorry to be joining this discussion so late. A few of my
> colleagues pointed me toward the current thread on IBS and I've tried
> to catch up by reading the archives. A short self-introduction: I'm a
> member of the AMD CodeAnalyst team, Ravi Bhargava and I wrote Appendix G
> (concerning IBS) of the AMD Software Optimization Guide for AMD
> Family 10h Processors and at one point in my life, I worked on DCPI
> (using ProfileMe).
>
> First off, Stephane and Rob have done a good job representing IBS and
> also ProfileMe. Thanks, guys!
>
> Rather than grossly disturb the current discussion, I'd like to offer
> a few points of clarification and maybe a little useful history.
>
> Peter's observation that IBS is a "mismatch with the traditional one
> value per counter thing" is quite apt. IBS has similarities to
> ProfileMe. Stephane's citation of the Itanium Data-EAR and
> Instruction-EAR are also very relevant as examples of profile data
> that do not fit with the "one value per counter thing."
>
> IBS Fetch.
>
>    IBS fetch sampling does not exactly sample x86 instructions. The
>    current fetch counter counts fetch operations, where a fetch
>    operation may be a 32-byte fetch block (on AMD Family 10h) or it may
>    be a fetch operation initiated by a redirection such as a branch.
>    A fetch block is 32 bytes of instruction information which is
>    sent to the instruction decoder. The fetch address that is reported
>    may either be the start of a valid x86 instruction or the start of
>    a fetch block. In the second case, the address may be in the middle
>    of an x86 instruction.
>
>    IBS fetch sampling produces a number of event flags (e.g.,
>    instruction cache miss), but it also produces the latency (in
>    cycles) of the fetch operation. The latencies can be accumulated in
>    either descriptive statistics, or better, in a histogram since
>    descriptive statistics don't really show where an access is hitting
>    in the memory hierarchy. BTW, even though an IBS fetch sample may be
>    reported, the decoder may not use the instruction bytes due to a
>    late arriving redirection.
>
> IBS Op.
>
>    IBS op sampling does not sample x86 instructions. It samples the
>    ops which are issued from x86 instructions. Some x86 instructions
>    issue more than one op. Microcoded instructions are particularly
>    thorny as a single REP MOV may issue many ops, thereby affecting
>    the number of samples that fall on them (i.e., disproportionate to
>    the execution frequency of the surrounding basic block). The number
>    of ops issued is data dependent and is unpredictable. Appendix C
>    of the Software Optimization Guide lists the number of ops issued
>    from x86 instructions (one, two or many).
>
>    Beginning with AMD Family 10h RevC, there are two op selection
>    (counting) modes for IBS: cycles-counting and dispatched op
>    counting.
>
>    Cycles-counting is _not_ equivalent to CPU_CLK_UNHALTED -- it is
>    not a precise version of the performance monitoring counter (PMC)
>    event (event select 0x076). In cycles-mode, when the current count
>    reaches the max count, the next available dispatch group of ops is
>    selected and a secondary mechanism selects an op within the dispatch
>    group. The dispatch group may contain one, two or three ops. If you
>    smell a rat, you're right. The secondary scheme negatively affects
>    the desired pseudo-random selection scheme. Also, if a dispatch
>    group is not available, the sample is skipped and the counting
>    process is reset.
>
>    Further, cycles-mode selection is affected by pipeline stalls. This
>    affects the distribution of IBS op samples taken in cycles-mode.
>    With cycles-mode, one instruction may have more data cache miss
>    events, but the underlying sampling basis is so skewed that the
>    comparison is not meaningful. IBS op samples are generated only for
>    ops that retire; tagged ops on a "wrong path" are flushed without
>    producing a sample. Overall, I cannot personally say that IBS
>    cycles-mode produces a precise equivalent to CPU_CLK_UNHALTED. I
>    cannot endorse or recommend its use in this way.
>
>    Given these issues, dispatched op counting was added in RevC. This
>    mode is the _preferred_ mode. Ops are counted as they are dispatched
>    and the op that triggers the max count threshold is selected and
>    tagged. Dispatched op mode produces a distribution of op samples
>    that reflects the execution frequency of instructions/basic blocks.
>    DirectPath Double and VectorPath (microcoded) x86 instructions which
>    issue more than one op will still be oversampled, however. The
>    distribution is important because it allows meaningful comparison of
>    event counts between instructions.
>
>    Even though the distribution of samples in dispatched op mode
>    reflects execution frequency, it is not a substitute for
>    RETIRED_INSTRUCTIONS (event select 0x0c0). The number of IBS op
>    samples in some workloads, especially those with certain kinds of
>    stack access and microcoded instructions, diverges greatly from
>    RETIRED_INSTRUCTIONS.
>
>    IBS is what it is.
>
> IBS derived events
>
>    Since ProfileMe and Data EAR didn't exactly take the world by storm,
>    (oh, yeah, I worked with HP Caliper on Itanium for a while, too ;-),
>    profiling infrastructures like OProfile and CodeAnalyst are largely
>    based on the PMC sampling model.
>
>    In order to get IBS into practice as quickly as possible, we defined
>    IBS derived events. This allowed us to implement basic support for
>    IBS in both OProfile and CodeAnalyst without major changes in
>    infrastructure. I should note that translation from raw IBS bits to
>    derived events is and was always intended to be performed by user
>    space tools. I personally believe that translation should not be
>    performed in the kernel -- kernel support should be simple and
>    lightweight.
>
>    An IBS op sample is a small "packet" of profile data:
>
>        A bunch of event flags (data cache miss, etc.)
>        Tag-to-retire time (cycles)
>        Completion-to-retire (cycles)
>        DC miss latency (cycles)
>        DC miss addresses (64-bit virtual and physical addresses)
>
>    These entities can be used to compute latency distributions,
>    memory access maps, etc. IBS enables new kinds of analysis such
>    as data-centric profiling that identifies hot data regions (which
>    could be used to tune data layout in a NUMA environment).
>
>    Quite frankly, at this juncture, I find the derived event model to
>    be too limiting. DCPI had a much different way of organizing
>    ProfileMe data that allowed flexible formulation of queries during
>    post-processing -- something that cannot be done with the derived
>    event approach.
>
>    Further, the organization and use of DC miss addresses is open for
>    investigation. I would _love_ to encourage someone (anyone? anyone?)
>    to take up this investigation. There may also be unforeseen uses --
>    perhaps driving compile-time optimizations. The existing derived
>    events do not adequately support new applications of IBS data. Thus,
>    I would encourage kernel-level support that passes IBS data along
>    without modification.
>
> Filtering.
>
>    After our initial experience with IBS, we see the need for
>    filtering. One approach is to collect and report only those IBS
>    register values that are needed to support a certain kind of
>    analysis. For example, if the DC miss addresses are not needed, why
>    collect them? Suravee and Robert Richter (both terrific colleagues)
>    have been investigating this, so I will defer to their analysis and
>    comments.
>
> Software randomization.
>
>    We've found that software randomization of the sampling period
>    and/or current count is needed to avoid certain situations where the
>    pipeline and the sampling process get into a periodic hard-loop that
>    affects the distribution of IBS op samples. BTW, forcing those low
>    order four bits to zero occasionally has a negative effect on op
>    distribution.
>
> IBS future extensions
>
>    Of course, I can't discuss specific new features. However, here are
>    some possible variations:
>
>       * The current count and max count values may become longer.
>       * New event flags may be added.
>       * Existing event flags may be left out (i.e., not implemented
>         in a family or model)
>       * New ancillary data (like DC miss latency or DC miss address)
>         may be added.
>
>    It may be necessary to collect new 64-bit values that do not contain
>    event flags, for example.
>
> Thanks for enduring this long-winded message. I hope that I've
> communicated some information and requirements, and I'll be more than
> happy to answer questions about IBS (or get the answers).
>
> -- pj
>
> Dr. Paul Drongowski
> AMD CodeAnalyst team
> Boston Design Center
>
> -------------------------
> The information presented in this reply is for informational purposes
> only and may contain technical inaccuracies, omissions and
> typographical errors. Links to third party sites are for convenience
> only, and no endorsement is implied.
>


end of thread, other threads:[~2009-08-03 14:22 UTC | newest]

Thread overview: 69+ messages
2009-06-16 17:42 v2 of comments on Performance Counters for Linux (PCL) stephane eranian
2009-06-22 11:48 ` Ingo Molnar
2009-06-22 11:49 ` I.1 - System calls - ioctl Ingo Molnar
2009-06-22 12:58   ` Christoph Hellwig
2009-06-22 13:56     ` Ingo Molnar
2009-06-22 17:41       ` Arnd Bergmann
2009-07-13 10:53     ` Peter Zijlstra
2009-07-13 17:30       ` [perfmon2] " Arnd Bergmann
2009-07-13 17:34         ` Peter Zijlstra
2009-07-13 17:53           ` Arnd Bergmann
2009-07-14 13:51       ` Christoph Hellwig
2009-07-30 13:58       ` stephane eranian
2009-07-30 14:13         ` Peter Zijlstra
2009-07-30 16:17           ` stephane eranian
2009-07-30 16:40             ` Arnd Bergmann
2009-07-30 16:53               ` stephane eranian
2009-07-30 17:20                 ` Arnd Bergmann
2009-08-03 14:22                   ` Peter Zijlstra
2009-06-22 11:50 ` I.2 - Grouping Ingo Molnar
2009-06-22 19:45   ` stephane eranian
2009-06-22 22:04     ` Corey Ashford
2009-06-23 17:51       ` stephane eranian
2009-06-22 21:38   ` Corey Ashford
2009-06-23  5:16   ` Paul Mackerras
2009-06-23  7:36     ` stephane eranian
2009-06-23  8:26       ` Paul Mackerras
2009-06-23  8:30         ` stephane eranian
2009-06-23 16:24           ` Corey Ashford
2009-06-22 11:51 ` I.3 - Multiplexing and system-wide Ingo Molnar
2009-06-22 11:51 ` I.4 - Controlling group multiplexing Ingo Molnar
2009-06-22 11:52 ` I.5 - Mmaped count Ingo Molnar
2009-06-22 12:25   ` stephane eranian
2009-06-22 12:35     ` Peter Zijlstra
2009-06-22 12:54       ` stephane eranian
2009-06-22 14:39         ` Peter Zijlstra
2009-06-23  0:41         ` Paul Mackerras
2009-06-23  0:39       ` Paul Mackerras
2009-06-23  6:13         ` Peter Zijlstra
2009-06-23  7:40         ` stephane eranian
2009-06-23  0:33     ` Paul Mackerras
2009-06-22 11:53 ` I.6 - Group scheduling Ingo Molnar
2009-06-22 11:54 ` I.7 - Group validity checking Ingo Molnar
2009-06-22 11:54 ` I.8 - Generalized cache events Ingo Molnar
2009-06-22 11:55 ` I.9 - Group reading Ingo Molnar
2009-06-22 11:55 ` I.10 - Event buffer minimal useful size Ingo Molnar
2009-06-22 11:56 ` I.11 - Missing definitions for generic events Ingo Molnar
2009-06-22 14:54   ` stephane eranian
2009-06-22 11:57 ` II.1 - Fixed counters on Intel Ingo Molnar
2009-06-22 14:27   ` stephane eranian
2009-06-22 11:57 ` II.2 - Event knowledge missing Ingo Molnar
2009-06-23 13:18   ` stephane eranian
2009-06-22 11:58 ` III.1 - Sampling period randomization Ingo Molnar
2009-06-22 11:58 ` IV.1 - Support for model-specific uncore PMU Ingo Molnar
2009-06-22 11:59 ` IV.2 - Features impacting all counters Ingo Molnar
2009-06-22 12:00 ` IV.3 - AMD IBS Ingo Molnar
2009-06-22 14:08   ` [perfmon2] " Rob Fowler
2009-06-22 17:58     ` Maynard Johnson
2009-06-23  6:19     ` Peter Zijlstra
2009-06-23  8:19       ` stephane eranian
2009-06-23 14:05         ` Ingo Molnar
2009-06-23 14:25           ` stephane eranian
2009-06-23 14:55             ` Ingo Molnar
2009-06-23 14:40       ` Rob Fowler
2009-06-22 19:17   ` stephane eranian
2009-06-22 12:00 ` IV.4 - Intel PEBS Ingo Molnar
2009-06-22 12:16   ` Andi Kleen
2009-06-22 12:01 ` IV.5 - Intel Last Branch Record (LBR) Ingo Molnar
2009-06-22 20:02   ` stephane eranian
  -- strict thread matches above, loose matches on Subject: below --
2009-06-25 11:28 [perfmon2] IV.3 - AMD IBS stephane eranian
