* HW perf. events arch implementation
@ 2010-02-24 1:35 Michael Cree
From: Michael Cree @ 2010-02-24 1:35 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-alpha, Ingo Molnar, Peter Zijlstra
I am trying to implement arch specific code on the Alpha for hardware
performance events (yeah, I'm probably a little bit loopy and unsound
of mind pursuing this on an end-of-line platform, but it's a way in to
learn a little bit of kernel programming and it scratches an itch).
I have taken a look at the code in the x86, sparc and ppc
implementations and tried to drum up an Alpha implementation for the
EV67/7/79 cpus, but it ain't working and is producing obviously
erroneous counts. Part of the problem is that I don't understand
under what conditions, and with what assumptions, the performance
event subsystem is calling into the architecture specific code. Is
there any documentation available that describes the architecture
specific interface?
The Alpha CPUs of interest have two 20-bit performance monitoring
counters that can count cycles, instructions, Bcache misses and Mbox
replays (but not all combinations of those). For round numbers
consider a 1GHz CPU, with a theoretical maximal sustained throughput
of four instructions per cycle, then a single performance counter
could potentially generate 4000 interrupts per second to signal
counter overflow when counting instructions.
The x86, sparc and PPC implementations seem to me to assume that calls
to read back the counters occur more frequently than performance
counter overflow interrupts, and that the highest bit of the counter
can safely be used to detect overflow. (Am I correct?) That is
likely not to be true of the Alpha because of the small width of the
counter. Is there someone who would be happy to give me, a kernel
newbie who probably doesn't even make the grade of neophyte, a bit of
direction on this?
Also, the Alpha CPUs have an interesting mode whereby one programmes
up one counter with a specified (or random) value that specifies a
future instruction to profile. The CPU runs for that number of
instructions/cycles, then a short monitoring window (of a few cycles)
is opened about the profiled instruction and when completed an
interrupt is generated. One can then read back a whole lot of
information about the pipeline at the time of the profiled
instruction. This can be used for statistical sampling. Does the
performance events subsystem support monitoring with such a mode?
Cheers
Michael.
* Re: HW perf. events arch implementation
From: Peter Zijlstra @ 2010-03-05 22:20 UTC (permalink / raw)
To: Michael Cree; +Cc: linux-kernel, linux-alpha, Ingo Molnar
On Wed, 2010-02-24 at 14:35 +1300, Michael Cree wrote:
> I am trying to implement arch specific code on the Alpha for hardware
> performance events (yeah, I'm probably a little bit loopy and unsound
> of mind pursuing this on an end-of-line platform, but it's a way in to
> learn a little bit of kernel programming and it scratches an itch).
>
> I have taken a look at the code in the x86, sparc and ppc
> implementations and tried to drum up an Alpha implementation for the
> EV67/7/79 cpus, but it ain't working and is producing obviously
> erroneous counts. Part of the problem is that I don't understand
> under what conditions, and with what assumptions, the performance
> event subsystem is calling into the architecture specific code. Is
> there any documentation available that describes the architecture
> specific interface?
>
> The Alpha CPUs of interest have two 20-bit performance monitoring
> counters that can count cycles, instructions, Bcache misses and Mbox
> replays (but not all combinations of those). For round numbers
> consider a 1GHz CPU, with a theoretical maximal sustained throughput
> of four instructions per cycle, then a single performance counter
> could potentially generate 4000 interrupts per second to signal
> counter overflow when counting instructions.
>
> The x86, sparc and PPC implementations seem to me to assume that calls
> to read back the counters occur more frequently than performance
> counter overflow interrupts, and that the highest bit of the counter
> can safely be used to detect overflow. (Am I correct?) That is
> likely not to be true of the Alpha because of the small width of the
> counter. Is there someone who would be happy to give me, a kernel
> newbie who probably doesn't even make the grade of neophyte, a bit of
> direction on this?
Right, so the architecture interface is twofold: a struct pmu and a
bunch of weak hw_perf_*() functions.
I'm trying to move away from the hw_perf*() functions, but for now
they're there and are useful for a number of things.
We have:
hw_perf_event_init();
hw_perf_disable();
hw_perf_enable();
hw_perf_group_sched_in();
hw_perf_event_init() is called when we are creating a counter of type
PERF_TYPE_RAW, PERF_TYPE_HARDWARE or PERF_TYPE_HW_CACHE, it will return
a struct pmu for that event.
hw_perf_disable()/hw_perf_enable() are like
local_irq_disable()/local_irq_enable() but for the Performance Monitor
Interrupt (PMI), which might be an NMI, so we need to disable it in some
arch-specific way -- these basically freeze/unfreeze the PMU.
hw_perf_group_sched_in() is a bit of a nightmare and a source of bugs
and I really should get around to killing it off, but for now it is used
to optimize multiple pmu->enable() calls.
Then we have struct pmu, it has the following members:
enable()
disable()
start()
stop()
read()
unthrottle()
->enable() will try to program the event onto the hardware and return 0
on success, if however it cannot, due to there not being a suitable
counter available, it shall return an error.
->disable() will remove the event from the hardware and release all
resources that were acquired by ->enable().
->start() will undo ->stop().
->stop() will stop the counter but not release any resources that might
have been acquired by ->enable().
->read() will read the hardware counter and fold the delta into
event->count.
->unthrottle(), when present, will undo whatever is done to stop the PMI
from triggering after perf_event_overflow() returns !0. That is, we have
sysctl_perf_event_sample_rate and we try to ensure the PMI doesn't
exceed that, if it does perf_event_overflow() will return !0 and the
arch code is supposed to inhibit it from firing again until
->unthrottle() is called. This prevents users from accidentally
live-locking the system by requesting a PMI on every completed
instruction ;-)
[ ->start/->stop are a way to reprogram the hardware without releasing
constraint reservations; this is useful when you change the sample
period. ]
As to your counter width, if you have a special overflow bit in a
separate register then you can possibly use that, but otherwise you need
it to keep your count straight. The PMI will happen _after_ the
overflow, at which point you need to fold back the counter delta into
your event->count, if it just overflowed that's bound to be a very small
delta -- I guess you can always add the max value on PMI, but that might
be racy, esp. in the presence of ->read() calls.
Also, if you have multiple registers sharing a PMI you need to be able
to tell which register overflowed and caused the PMI.
> Also, the Alpha CPUs have an interesting mode whereby one programmes
> up one counter with a specified (or random) value that specifies a
> future instruction to profile. The CPU runs for that number of
> instructions/cycles, then a short monitoring window (of a few cycles)
> is opened about the profiled instruction and when completed an
> interrupt is generated. One can then read back a whole lot of
> information about the pipeline at the time of the profiled
> instruction. This can be used for statistical sampling. Does the
> performance events subsystem support monitoring with such a mode?
That sounds like AMD IBS, which I've been told is based on the Alpha
PMU. We currently do not have AMD IBS support.
AMD has two IBS counters, one does instructions and one does fetches, I
think Robert was going to support these by modeling them as fixed
purpose counters and provide the extra information through
PERF_SAMPLE_RAW until we can come up with a saner model.
A potentially saner model is to add non-sampling counters into its
group, which are used to represent these other aspects of the unit.
* Re: HW perf. events arch implementation
From: Michael Cree @ 2010-03-23 20:16 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, linux-alpha, Ingo Molnar
On 6/03/2010, at 11:20 AM, Peter Zijlstra wrote:
> On Wed, 2010-02-24 at 14:35 +1300, Michael Cree wrote:
>> I am trying to implement arch specific code on the Alpha for hardware
>> performance events
> Right, so the architecture interface is twofold: a struct pmu and a
> bunch of weak hw_perf_*() functions.
Thanks for the description of the hw perf arch interface.
I have now identified why my code on the Alpha didn't work. I had
borrowed the code for reading the hw counters and calculating deltas
from the sparc/x86 code, but I now realise that code is based on the
assumption that the hw counters are read frequently enough that the
counter cannot count more than half of its maximum period between
updates. That assumption is violated on the Alpha (which has 20 bit
counters) and I was occasionally getting negative deltas.
Your comment following is relevant:
> As to your counter width, if you have a special overflow bit in a
> separate register then you can possibly use that, but otherwise you
> need
> it to keep your count straight.
On the Alpha the identity of the counter that caused the overflow is
passed to the interrupt routine, so I have used that to keep the count
straight.
It also appears that the interrupt routine, when it calls
perf_event_overflow(), can incur quite a bit of execution time, so I
disable the hw counters for the duration of the interrupt routine.
I therefore will complete the coding for measuring cycles,
instructions, cache misses and mbox replays (the last one I presume
will have to be coded as a RAW event) on the EV67 CPUs and later. At
this stage I will probably keep it simple and implement counting one
hw event at a time only. Might be able to submit something for review
within the week; but it could be three or four weeks in coming since I
go on holiday for 12 days about Easter.
I have no plans to implement hw perf events on Alpha CPUs older than
the EV67 unless someone jumps up and down and says they really need it!
>> Also, the Alpha CPUs have an interesting mode whereby one programmes
>> up one counter with a specified (or random) value that specifies a
>> future instruction to profile. The CPU runs for that number of
>> instructions/cycles, then a short monitoring window (of a few cycles)
>> is opened about the profiled instruction and when completed an
>> interrupt is generated. One can then read back a whole lot of
>> information about the pipeline at the time of the profiled
>> instruction. This can be used for statistical sampling. Does the
>> performance events subsystem support monitoring with such a mode?
>
> That sounds like AMD IBS, which I've been told is based on the Alpha
> PMU. We currently do not have AMD IBS support.
I will therefore leave implementing that mode for the future.
Cheers
Michael.