* [PATCH 1/4] irq: Support to compute context on top of a given preempt_count offset
2010-05-21 14:05 [PATCH 0/4] perf: Tasks and irq exclusion Frederic Weisbecker
@ 2010-05-21 14:05 ` Frederic Weisbecker
2010-05-21 14:05 ` [PATCH 2/4] perf: Add exclude_task perf event attribute Frederic Weisbecker
` (3 subsequent siblings)
4 siblings, 0 replies; 15+ messages in thread
From: Frederic Weisbecker @ 2010-05-21 14:05 UTC (permalink / raw)
To: Ingo Molnar
Cc: LKML, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra,
Arnaldo Carvalho de Melo, Paul Mackerras
This brings support for computing irq/preempt contexts on top of
a given preempt_count offset instead of the current one.
This is going to be useful for perf, which needs to know the
preempt_count() of the context that an event has interrupted.
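The idea can be sketched in plain userspace C. The constants below only mirror the bit layout documented in hardirq.h of that era (bits 0-7 preempt, 8-15 softirq, 16-25 hardirq, 26 NMI); this is an illustrative model, not the kernel's code:

```c
#include <assert.h>

/* Bit layout mirroring include/linux/hardirq.h of that era */
#define PREEMPT_BITS	8
#define SOFTIRQ_BITS	8
#define HARDIRQ_BITS	10
#define NMI_BITS	1

#define PREEMPT_SHIFT	0
#define SOFTIRQ_SHIFT	(PREEMPT_SHIFT + PREEMPT_BITS)
#define HARDIRQ_SHIFT	(SOFTIRQ_SHIFT + SOFTIRQ_BITS)
#define NMI_SHIFT	(HARDIRQ_SHIFT + HARDIRQ_BITS)

#define __IRQ_MASK(x)	((1UL << (x)) - 1)
#define SOFTIRQ_MASK	(__IRQ_MASK(SOFTIRQ_BITS) << SOFTIRQ_SHIFT)
#define HARDIRQ_MASK	(__IRQ_MASK(HARDIRQ_BITS) << HARDIRQ_SHIFT)
#define NMI_MASK	(__IRQ_MASK(NMI_BITS) << NMI_SHIFT)
#define IRQ_MASK	(SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK)

#define SOFTIRQ_OFFSET	(1UL << SOFTIRQ_SHIFT)
#define HARDIRQ_OFFSET	(1UL << HARDIRQ_SHIFT)

/* The patch: context tests parameterized on an arbitrary count
 * instead of being hardwired to preempt_count() */
#define hardirq_count_offset(c)	((c) & HARDIRQ_MASK)
#define softirq_count_offset(c)	((c) & SOFTIRQ_MASK)
#define irq_count_offset(c)	((c) & IRQ_MASK)
#define in_interrupt_offset(c)	(irq_count_offset(c))
```

A handler that knows it runs at HARDIRQ_OFFSET can then test the context it interrupted with in_interrupt_offset(preempt_count() - HARDIRQ_OFFSET).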
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
---
include/linux/hardirq.h | 17 +++++++++++++----
1 files changed, 13 insertions(+), 4 deletions(-)
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index d5b3876..0195547 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -58,6 +58,7 @@
#define SOFTIRQ_MASK (__IRQ_MASK(SOFTIRQ_BITS) << SOFTIRQ_SHIFT)
#define HARDIRQ_MASK (__IRQ_MASK(HARDIRQ_BITS) << HARDIRQ_SHIFT)
#define NMI_MASK (__IRQ_MASK(NMI_BITS) << NMI_SHIFT)
+#define IRQ_MASK (SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK)
#define PREEMPT_OFFSET (1UL << PREEMPT_SHIFT)
#define SOFTIRQ_OFFSET (1UL << SOFTIRQ_SHIFT)
@@ -74,15 +75,23 @@
#error PREEMPT_ACTIVE is too low!
#endif
-#define hardirq_count() (preempt_count() & HARDIRQ_MASK)
-#define softirq_count() (preempt_count() & SOFTIRQ_MASK)
-#define irq_count() (preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK \
- | NMI_MASK))
+#define hardirq_count_offset(offset) ((offset) & HARDIRQ_MASK)
+#define softirq_count_offset(offset) ((offset) & SOFTIRQ_MASK)
+#define irq_count_offset(offset) ((offset) & IRQ_MASK)
+
+#define hardirq_count() hardirq_count_offset(preempt_count())
+#define softirq_count() softirq_count_offset(preempt_count())
+#define irq_count() irq_count_offset(preempt_count())
/*
* Are we doing bottom half or hardware interrupt processing?
* Are we in a softirq context? Interrupt context?
*/
+#define in_irq_offset(offset) (hardirq_count_offset(offset))
+#define in_softirq_offset(offset) (softirq_count_offset(offset))
+#define in_interrupt_offset(offset) (irq_count_offset(offset))
+
+
#define in_irq() (hardirq_count())
#define in_softirq() (softirq_count())
#define in_interrupt() (irq_count())
--
1.6.2.3
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 2/4] perf: Add exclude_task perf event attribute
2010-05-21 14:05 [PATCH 0/4] perf: Tasks and irq exclusion Frederic Weisbecker
2010-05-21 14:05 ` [PATCH 1/4] irq: Support to compute context on top of a given preempt_count offset Frederic Weisbecker
@ 2010-05-21 14:05 ` Frederic Weisbecker
2010-05-25 1:43 ` Paul Mackerras
2010-05-21 14:05 ` [PATCH 3/4] perf: Support for irq exclusion Frederic Weisbecker
` (2 subsequent siblings)
4 siblings, 1 reply; 15+ messages in thread
From: Frederic Weisbecker @ 2010-05-21 14:05 UTC (permalink / raw)
To: Ingo Molnar
Cc: LKML, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra,
Arnaldo Carvalho de Melo, Paul Mackerras
Excluding task context is useful when you want to trace only
hardirqs and softirqs.
For this we use a new generic perf_exclude_event() (the previous
one being renamed to perf_exclude_swevent) to which you can pass
the preempt_count offset at which your events trigger.
Computing preempt_count() - offset gives us the preempt_count() of
the context that the event has interrupted, on top of which we
can filter out the non-irq contexts.
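The test at the heart of the patch reduces to one subtraction and one mask. A hedged userspace model (the struct is a stand-in for perf_event_attr, and the masks only mirror hardirq.h; none of the names below are the kernel's):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit layout, mirroring hardirq.h */
#define SOFTIRQ_SHIFT	8
#define HARDIRQ_SHIFT	16
#define SOFTIRQ_MASK	(0xffUL << SOFTIRQ_SHIFT)
#define HARDIRQ_MASK	(0x3ffUL << HARDIRQ_SHIFT)
#define NMI_MASK	(1UL << 26)
#define IRQ_MASK	(SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK)
#define SOFTIRQ_OFFSET	(1UL << SOFTIRQ_SHIFT)
#define HARDIRQ_OFFSET	(1UL << HARDIRQ_SHIFT)

/* Hypothetical reduction of perf_event_attr to the one bit involved */
struct fake_attr { unsigned exclude_task : 1; };

/* Model of the patch's perf_exclude_event(): 'offset' is what the
 * event's own context (hardirq, possibly NMI) added to preempt_count;
 * subtracting it exposes the context the event interrupted. */
static bool exclude_event(const struct fake_attr *attr,
			  unsigned long preempt_count, unsigned long offset)
{
	if (!((preempt_count - offset) & IRQ_MASK))
		return attr->exclude_task;	/* interrupted plain task context */
	return false;
}
```

So a PMU interrupt that landed on top of task context is filtered when exclude_task is set, while the same interrupt landing on top of a softirq is kept.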
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
---
include/linux/perf_event.h | 3 ++-
kernel/perf_event.c | 41 ++++++++++++++++++++++++++++++++++-------
2 files changed, 36 insertions(+), 8 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fe50347..d939fc7 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -214,8 +214,9 @@ struct perf_event_attr {
* See also PERF_RECORD_MISC_EXACT_IP
*/
precise_ip : 2, /* skid constraint */
+ exclude_task : 1, /* don't count task context */
- __reserved_1 : 47;
+ __reserved_1 : 46;
union {
__u32 wakeup_events; /* wakeup every n events */
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 45b7aec..ab96411 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -3907,10 +3907,33 @@ static int __perf_event_overflow(struct perf_event *event, int nmi,
return ret;
}
+static bool
+perf_exclude_event(struct perf_event *event, int offset)
+{
+ if (!in_interrupt_offset(preempt_count() - offset)) {
+ if (event->attr.exclude_task)
+ return true;
+ }
+
+ return false;
+}
+
int perf_event_overflow(struct perf_event *event, int nmi,
struct perf_sample_data *data,
struct pt_regs *regs)
{
+ int offset = HARDIRQ_OFFSET;
+
+ if (nmi)
+ offset += NMI_OFFSET;
+ /*
+ * Hardware events trigger from NMI or plain hardirq context, whose
+ * offset we need to subtract to get the preempt_count() of the
+ * interrupted context.
+ */
+ if (perf_exclude_event(event, offset))
+ return 0;
+
return __perf_event_overflow(event, nmi, 1, data, regs);
}
@@ -4008,8 +4031,8 @@ static void perf_swevent_add(struct perf_event *event, u64 nr,
static int perf_tp_event_match(struct perf_event *event,
struct perf_sample_data *data);
-static int perf_exclude_event(struct perf_event *event,
- struct pt_regs *regs)
+static int perf_exclude_swevent(struct perf_event *event,
+ struct pt_regs *regs, int offset)
{
if (regs) {
if (event->attr.exclude_user && user_mode(regs))
@@ -4019,7 +4042,7 @@ static int perf_exclude_event(struct perf_event *event,
return 1;
}
- return 0;
+ return perf_exclude_event(event, offset);
}
static int perf_swevent_match(struct perf_event *event,
@@ -4034,7 +4057,7 @@ static int perf_swevent_match(struct perf_event *event,
if (event->attr.config != event_id)
return 0;
- if (perf_exclude_event(event, regs))
+ if (perf_exclude_swevent(event, regs, 0))
return 0;
if (event->attr.type == PERF_TYPE_TRACEPOINT &&
@@ -4230,9 +4253,13 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
data.period = event->hw.last_period;
regs = get_irq_regs();
- if (regs && !perf_exclude_event(event, regs)) {
+ /*
+ * We are in hardirq context, so we need to subtract our preempt
+ * offset to retrieve the interrupted context
+ */
+ if (regs && !perf_exclude_swevent(event, regs, HARDIRQ_OFFSET)) {
if (!(event->attr.exclude_idle && current->pid == 0))
- if (perf_event_overflow(event, 0, &data, regs))
+ if (__perf_event_overflow(event, 0, 1, &data, regs))
ret = HRTIMER_NORESTART;
}
@@ -4624,7 +4651,7 @@ void perf_bp_event(struct perf_event *bp, void *data)
perf_sample_data_init(&sample, bp->attr.bp_addr);
- if (!perf_exclude_event(bp, regs))
+ if (!perf_exclude_swevent(bp, regs, 0))
perf_swevent_add(bp, 1, 1, &sample, regs);
}
#else
--
1.6.2.3
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 2/4] perf: Add exclude_task perf event attribute
2010-05-21 14:05 ` [PATCH 2/4] perf: Add exclude_task perf event attribute Frederic Weisbecker
@ 2010-05-25 1:43 ` Paul Mackerras
2010-05-25 6:58 ` Peter Zijlstra
0 siblings, 1 reply; 15+ messages in thread
From: Paul Mackerras @ 2010-05-25 1:43 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Ingo Molnar, LKML, Peter Zijlstra, Arnaldo Carvalho de Melo
On Fri, May 21, 2010 at 04:05:13PM +0200, Frederic Weisbecker wrote:
> Excluding is useful when you want to trace only hard and softirqs.
>
> For this we use a new generic perf_exclude_event() (the previous
> one beeing turned into perf_exclude_swevent) to which you can pass
> the preemption offset to which your events trigger.
>
> Computing preempt_count() - offset gives us the preempt_count() of
> the context that the event has interrupted, on top of which we
> can filter the non-irq contexts.
How does this work for hardware events when we are sampling and
getting an interrupt every N events? It seems like the hardware is
still counting all events and interrupting every N events, but we are
only recording a sample if the interrupt occurred in the context we
want. In other words the context of the Nth event is considered to be
the context for the N-1 events preceding that, which seems a pretty
poor approximation.
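The approximation can be made concrete with a small simulation (entirely hypothetical numbers: a workload where every 4th event really occurs in irq context, sampled with a period of 10):

```c
#include <assert.h>
#include <stdbool.h>

#define PERIOD	10	/* PMU interrupt fires every PERIOD events */
#define NEVENTS	100

/* Hypothetical workload: every 4th event occurs in irq context */
static bool event_in_irq(int i)
{
	return i % 4 == 0;
}

/* The hardware counts every event but only interrupts every PERIOD
 * events; whatever context the PMU interrupt lands in gets credited
 * with the whole period. */
static void simulate(int *true_irq, int *attributed_irq)
{
	int pending = 0;

	*true_irq = *attributed_irq = 0;
	for (int i = 0; i < NEVENTS; i++) {
		if (event_in_irq(i))
			(*true_irq)++;
		if (++pending == PERIOD) {
			if (event_in_irq(i))	/* context of the Nth event only */
				*attributed_irq += PERIOD;
			pending = 0;
		}
	}
}
```

Here 25 of the 100 events really occur in irq context, yet with this phase none of the sampling interrupts do, so irq-only filtering would attribute zero events to irqs.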
Also, for hardware events, if we are counting rather than sampling,
the exclude_task bit will have no effect. So perhaps in that case the
perf_event_open should fail rather than appear to succeed but give
wrong data.
Paul.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/4] perf: Add exclude_task perf event attribute
2010-05-25 1:43 ` Paul Mackerras
@ 2010-05-25 6:58 ` Peter Zijlstra
2010-05-25 10:06 ` Frederic Weisbecker
2010-06-07 1:38 ` Frederic Weisbecker
0 siblings, 2 replies; 15+ messages in thread
From: Peter Zijlstra @ 2010-05-25 6:58 UTC (permalink / raw)
To: Paul Mackerras
Cc: Frederic Weisbecker, Ingo Molnar, LKML, Arnaldo Carvalho de Melo
On Tue, 2010-05-25 at 11:43 +1000, Paul Mackerras wrote:
> On Fri, May 21, 2010 at 04:05:13PM +0200, Frederic Weisbecker wrote:
>
> > Excluding is useful when you want to trace only hard and softirqs.
> >
> > For this we use a new generic perf_exclude_event() (the previous
> > one beeing turned into perf_exclude_swevent) to which you can pass
> > the preemption offset to which your events trigger.
> >
> > Computing preempt_count() - offset gives us the preempt_count() of
> > the context that the event has interrupted, on top of which we
> > can filter the non-irq contexts.
>
> How does this work for hardware events when we are sampling and
> getting an interrupt every N events? It seems like the hardware is
> still counting all events and interrupting every N events, but we are
> only recording a sample if the interrupt occurred in the context we
> want. In other words the context of the Nth event is considered to be
> the context for the N-1 events preceding that, which seems a pretty
> poor approximation.
>
> Also, for hardware events, if we are counting rather than sampling,
> the exclude_task bit will have no effect. So perhaps in that case the
> perf_event_open should fail rather than appear to succeed but give
> wrong data.
Right, so for hardware event we'd need to go with those irq_{enter,exit}
hooks and either fully disable the call, or do as Ingo suggested, read
the count delta and add that to period_left, so that we'll delay the
sample (and subtract from ->count, which is I think the trickiest bit as
it'll generate a non-monotonic ->count).
So I prefer the disable/enable from irq_enter/exit, however I also
suspect that that is by far the most expensive option.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/4] perf: Add exclude_task perf event attribute
2010-05-25 6:58 ` Peter Zijlstra
@ 2010-05-25 10:06 ` Frederic Weisbecker
2010-06-07 1:38 ` Frederic Weisbecker
1 sibling, 0 replies; 15+ messages in thread
From: Frederic Weisbecker @ 2010-05-25 10:06 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paul Mackerras, Ingo Molnar, LKML, Arnaldo Carvalho de Melo
On Tue, May 25, 2010 at 08:58:08AM +0200, Peter Zijlstra wrote:
> On Tue, 2010-05-25 at 11:43 +1000, Paul Mackerras wrote:
> > On Fri, May 21, 2010 at 04:05:13PM +0200, Frederic Weisbecker wrote:
> >
> > > Excluding is useful when you want to trace only hard and softirqs.
> > >
> > > For this we use a new generic perf_exclude_event() (the previous
> > > one beeing turned into perf_exclude_swevent) to which you can pass
> > > the preemption offset to which your events trigger.
> > >
> > > Computing preempt_count() - offset gives us the preempt_count() of
> > > the context that the event has interrupted, on top of which we
> > > can filter the non-irq contexts.
> >
> > How does this work for hardware events when we are sampling and
> > getting an interrupt every N events? It seems like the hardware is
> > still counting all events and interrupting every N events, but we are
> > only recording a sample if the interrupt occurred in the context we
> > want. In other words the context of the Nth event is considered to be
> > the context for the N-1 events preceding that, which seems a pretty
> > poor approximation.
> >
> > Also, for hardware events, if we are counting rather than sampling,
> > the exclude_task bit will have no effect. So perhaps in that case the
> > perf_event_open should fail rather than appear to succeed but give
> > wrong data.
>
> Right, so for hardware event we'd need to go with those irq_{enter,exit}
> hooks and either fully disable the call, or do as Ingo suggested, read
> the count delta and add that to period_left, so that we'll delay the
> sample (and subtract from ->count, which is I think the trickiest bit as
> it'll generate a non-monotonic ->count).
>
> So I prefer the disable/enable from irq_enter/exit, however I also
> suspect that that is by far the most expensive option.
Ingo proposed another trick to me while we were discussing other details:
keeping a per-context count instead of a single global one.
So instead of a single event->count, we could have event->task_count,
event->softirq_count and event->hardirq_count. On a (non-nested) irq_enter()
we read the count register, compute the difference on irq_exit(), and add
the result to event->hardirq_count (with similar tricks for the task and
softirq counts).
When we want the total, we just compute the sum according to the
exclude_* options we have.
That still requires keeping the sample proxy, though. And the samples will
stay a bit out of sync, as the interrupt period won't be paused when we
enter a filtered context, something that could only be solved with a round
of ->stop(). But as you said, I really suspect that is not viable.
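The per-context accounting idea can be sketched like this (a hypothetical userspace model: the struct, the simulated free-running counter value, and the helper names are all illustrative, not the proposed kernel API):

```c
#include <assert.h>

/* Hypothetical per-context split of event->count: flush the hardware
 * counter on context transitions and credit the delta to the context
 * being left. */
enum ctx { CTX_TASK, CTX_SOFTIRQ, CTX_HARDIRQ, CTX_MAX };

struct split_event {
	unsigned long long count[CTX_MAX];
	unsigned long long hw_last;	/* hw counter at the last flush */
	enum ctx cur;
};

/* Called on context transitions (e.g. non-nested irq_enter/irq_exit):
 * read the free-running hw counter and credit the delta to the
 * context we are leaving. */
static void flush(struct split_event *e, unsigned long long hw_now,
		  enum ctx next)
{
	e->count[e->cur] += hw_now - e->hw_last;
	e->hw_last = hw_now;
	e->cur = next;
}

/* Total count, honoring the exclude_* attributes */
static unsigned long long total(const struct split_event *e,
				int excl_task, int excl_si, int excl_hi)
{
	return (excl_task ? 0 : e->count[CTX_TASK]) +
	       (excl_si   ? 0 : e->count[CTX_SOFTIRQ]) +
	       (excl_hi   ? 0 : e->count[CTX_HARDIRQ]);
}
```

For example, 100 events counted in task context, 30 more across a hardirq, then 20 more in task context end up as count[CTX_TASK] == 120 and count[CTX_HARDIRQ] == 30, and excluding hardirqs from the sum yields 120.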
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/4] perf: Add exclude_task perf event attribute
2010-05-25 6:58 ` Peter Zijlstra
2010-05-25 10:06 ` Frederic Weisbecker
@ 2010-06-07 1:38 ` Frederic Weisbecker
2010-06-08 18:59 ` Ingo Molnar
1 sibling, 1 reply; 15+ messages in thread
From: Frederic Weisbecker @ 2010-06-07 1:38 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: Paul Mackerras, LKML, Arnaldo Carvalho de Melo, Stephane Eranian
On Tue, May 25, 2010 at 08:58:08AM +0200, Peter Zijlstra wrote:
> On Tue, 2010-05-25 at 11:43 +1000, Paul Mackerras wrote:
> > On Fri, May 21, 2010 at 04:05:13PM +0200, Frederic Weisbecker wrote:
> >
> > > Excluding is useful when you want to trace only hard and softirqs.
> > >
> > > For this we use a new generic perf_exclude_event() (the previous
> > > one beeing turned into perf_exclude_swevent) to which you can pass
> > > the preemption offset to which your events trigger.
> > >
> > > Computing preempt_count() - offset gives us the preempt_count() of
> > > the context that the event has interrupted, on top of which we
> > > can filter the non-irq contexts.
> >
> > How does this work for hardware events when we are sampling and
> > getting an interrupt every N events? It seems like the hardware is
> > still counting all events and interrupting every N events, but we are
> > only recording a sample if the interrupt occurred in the context we
> > want. In other words the context of the Nth event is considered to be
> > the context for the N-1 events preceding that, which seems a pretty
> > poor approximation.
> >
> > Also, for hardware events, if we are counting rather than sampling,
> > the exclude_task bit will have no effect. So perhaps in that case the
> > perf_event_open should fail rather than appear to succeed but give
> > wrong data.
>
> Right, so for hardware event we'd need to go with those irq_{enter,exit}
> hooks and either fully disable the call, or do as Ingo suggested, read
> the count delta and add that to period_left, so that we'll delay the
> sample (and subtract from ->count, which is I think the trickiest bit as
> it'll generate a non-monotonic ->count).
>
> So I prefer the disable/enable from irq_enter/exit, however I also
> suspect that that is by far the most expensive option.
Playing with that, it's easy to contain the counting to the filtered
contexts: I can just flush (event->read()) when we enter/exit a context,
but filter the update of event->count depending on the exclude_* bits.
There are several problems with that though:
- overflow interrupts continue; we can block them, but still...
- periods become randomly out of sync as the interrupts happen. We
  could save period_left on context entry to solve this.
It would certainly be easier and clearer to use stop/start on context
enter/exit.
And the only thing that seems to happen in these paths is a write
to the event config register.
Is that what is going to be too slow?
If you compare it to all the reads on the counter, and the interrupts
that would still need to be serviced and filtered with the other
solution, maybe the stop/start solution is better after all.
How much time, approximately, does it take to write to this config register?
Thanks.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/4] perf: Add exclude_task perf event attribute
2010-06-07 1:38 ` Frederic Weisbecker
@ 2010-06-08 18:59 ` Ingo Molnar
2010-06-08 19:02 ` Frederic Weisbecker
0 siblings, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2010-06-08 18:59 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Peter Zijlstra, Paul Mackerras, LKML, Arnaldo Carvalho de Melo,
Stephane Eranian
* Frederic Weisbecker <fweisbec@gmail.com> wrote:
> On Tue, May 25, 2010 at 08:58:08AM +0200, Peter Zijlstra wrote:
> > On Tue, 2010-05-25 at 11:43 +1000, Paul Mackerras wrote:
> > > On Fri, May 21, 2010 at 04:05:13PM +0200, Frederic Weisbecker wrote:
> > >
> > > > Excluding is useful when you want to trace only hard and softirqs.
> > > >
> > > > For this we use a new generic perf_exclude_event() (the previous
> > > > one beeing turned into perf_exclude_swevent) to which you can pass
> > > > the preemption offset to which your events trigger.
> > > >
> > > > Computing preempt_count() - offset gives us the preempt_count() of
> > > > the context that the event has interrupted, on top of which we
> > > > can filter the non-irq contexts.
> > >
> > > How does this work for hardware events when we are sampling and
> > > getting an interrupt every N events? It seems like the hardware is
> > > still counting all events and interrupting every N events, but we are
> > > only recording a sample if the interrupt occurred in the context we
> > > want. In other words the context of the Nth event is considered to be
> > > the context for the N-1 events preceding that, which seems a pretty
> > > poor approximation.
> > >
> > > Also, for hardware events, if we are counting rather than sampling,
> > > the exclude_task bit will have no effect. So perhaps in that case the
> > > perf_event_open should fail rather than appear to succeed but give
> > > wrong data.
> >
> > Right, so for hardware event we'd need to go with those irq_{enter,exit}
> > hooks and either fully disable the call, or do as Ingo suggested, read
> > the count delta and add that to period_left, so that we'll delay the
> > sample (and subtract from ->count, which is I think the trickiest bit as
> > it'll generate a non-monotonic ->count).
> >
> > So I prefer the disable/enable from irq_enter/exit, however I also
> > suspect that that is by far the most expensive option.
>
>
> Playing with that, it's easy to contain the counting on the filtered
> contexts: I can just flush (event->read()) when we enter/exit a context
> but filter the update of event->count depending on exclude_* things.
>
> There are several problems with that though:
>
> - overflow interrupts continue, we can block them, but still...
> - periods become randomly async as the interrupts happen. We
> could save the period_left on context enter to solve this
>
>
> It would be certainly easier and clearer to use stop/start things on context
> enter/exit.
>
> And the only thing that seem to happen in these paths is a write
> to the event config register.
> Is it what is going to be too slow?
> If you compare that to all the reads on the counter,
> the interrupts that still need to be serviced and filtered with the
> other solution, may be the stop/start solution is eventually better
> in contrast.
>
> How much time approximately does it take to write in this config register?
it should be fast enough. I think we should first go for a good, high-quality
implementation that has a correct model for collecting information - and then,
if in practice there's any significant slowdown, we could perhaps add a
speedup that cuts corners.
If we first cut corners we'll never be able to fully trust the info, and we'll
never know how it would all have played out via the disable/enable method.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/4] perf: Add exclude_task perf event attribute
2010-06-08 18:59 ` Ingo Molnar
@ 2010-06-08 19:02 ` Frederic Weisbecker
0 siblings, 0 replies; 15+ messages in thread
From: Frederic Weisbecker @ 2010-06-08 19:02 UTC (permalink / raw)
To: Ingo Molnar
Cc: Peter Zijlstra, Paul Mackerras, LKML, Arnaldo Carvalho de Melo,
Stephane Eranian
On Tue, Jun 08, 2010 at 08:59:17PM +0200, Ingo Molnar wrote:
>
> * Frederic Weisbecker <fweisbec@gmail.com> wrote:
>
> > On Tue, May 25, 2010 at 08:58:08AM +0200, Peter Zijlstra wrote:
> > > On Tue, 2010-05-25 at 11:43 +1000, Paul Mackerras wrote:
> > > > On Fri, May 21, 2010 at 04:05:13PM +0200, Frederic Weisbecker wrote:
> > > >
> > > > > Excluding is useful when you want to trace only hard and softirqs.
> > > > >
> > > > > For this we use a new generic perf_exclude_event() (the previous
> > > > > one beeing turned into perf_exclude_swevent) to which you can pass
> > > > > the preemption offset to which your events trigger.
> > > > >
> > > > > Computing preempt_count() - offset gives us the preempt_count() of
> > > > > the context that the event has interrupted, on top of which we
> > > > > can filter the non-irq contexts.
> > > >
> > > > How does this work for hardware events when we are sampling and
> > > > getting an interrupt every N events? It seems like the hardware is
> > > > still counting all events and interrupting every N events, but we are
> > > > only recording a sample if the interrupt occurred in the context we
> > > > want. In other words the context of the Nth event is considered to be
> > > > the context for the N-1 events preceding that, which seems a pretty
> > > > poor approximation.
> > > >
> > > > Also, for hardware events, if we are counting rather than sampling,
> > > > the exclude_task bit will have no effect. So perhaps in that case the
> > > > perf_event_open should fail rather than appear to succeed but give
> > > > wrong data.
> > >
> > > Right, so for hardware event we'd need to go with those irq_{enter,exit}
> > > hooks and either fully disable the call, or do as Ingo suggested, read
> > > the count delta and add that to period_left, so that we'll delay the
> > > sample (and subtract from ->count, which is I think the trickiest bit as
> > > it'll generate a non-monotonic ->count).
> > >
> > > So I prefer the disable/enable from irq_enter/exit, however I also
> > > suspect that that is by far the most expensive option.
> >
> >
> > Playing with that, it's easy to contain the counting on the filtered
> > contexts: I can just flush (event->read()) when we enter/exit a context
> > but filter the update of event->count depending on exclude_* things.
> >
> > There are several problems with that though:
> >
> > - overflow interrupts continue, we can block them, but still...
> > - periods become randomly async as the interrupts happen. We
> > could save the period_left on context enter to solve this
> >
> >
> > It would be certainly easier and clearer to use stop/start things on context
> > enter/exit.
> >
> > And the only thing that seem to happen in these paths is a write
> > to the event config register.
> > Is it what is going to be too slow?
> > If you compare that to all the reads on the counter,
> > the interrupts that still need to be serviced and filtered with the
> > other solution, may be the stop/start solution is eventually better
> > in contrast.
> >
> > How much time approximately does it take to write in this config register?
>
> it should be fast enough. I think we should first go for a good, high-quality
> implementation that has a correct model for collecting information - and then,
> if in practice there's any significant slowdown, we could perhaps add a
> speedup that cuts corners.
>
> If we first cut corners we'll never be able to fully trust the info, and we'll
> never know how it would all have played out via the disable/enable method.
>
> Thanks,
>
> Ingo
All agreed, I'm taking that direction then.
Thanks.
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 3/4] perf: Support for irq exclusion
2010-05-21 14:05 [PATCH 0/4] perf: Tasks and irq exclusion Frederic Weisbecker
2010-05-21 14:05 ` [PATCH 1/4] irq: Support to compute context on top of a given preempt_count offset Frederic Weisbecker
2010-05-21 14:05 ` [PATCH 2/4] perf: Add exclude_task perf event attribute Frederic Weisbecker
@ 2010-05-21 14:05 ` Frederic Weisbecker
2010-05-21 14:06 ` Frederic Weisbecker
2010-05-21 14:05 ` [PATCH 4/4] perf: Support for task/softirq/hardirq exclusion on tools Frederic Weisbecker
2010-05-21 15:12 ` [PATCH 0/4] perf: Precise task / softirq / hardirq filtered stats/profiles Ingo Molnar
4 siblings, 1 reply; 15+ messages in thread
From: Frederic Weisbecker @ 2010-05-21 14:05 UTC (permalink / raw)
To: Ingo Molnar
Cc: LKML, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra,
Arnaldo Carvalho de Melo, Paul Mackerras
Provide exclude_softirq and exclude_hardirq support in the perf
event attributes. This brings the final pieces needed to profile
or trace at any desired context granularity.
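With all three bits in place, the exclusion decision becomes a three-way classification of the interrupted context. A hedged userspace sketch (the struct and masks below are illustrative stand-ins, mirroring hardirq.h rather than reproducing it):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit layout, mirroring hardirq.h */
#define SOFTIRQ_SHIFT	8
#define HARDIRQ_SHIFT	16
#define SOFTIRQ_MASK	(0xffUL << SOFTIRQ_SHIFT)
#define HARDIRQ_MASK	(0x3ffUL << HARDIRQ_SHIFT)
#define NMI_MASK	(1UL << 26)
#define IRQ_MASK	(SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK)
#define SOFTIRQ_OFFSET	(1UL << SOFTIRQ_SHIFT)
#define HARDIRQ_OFFSET	(1UL << HARDIRQ_SHIFT)

/* Stand-in for the three attribute bits */
struct fake_attr {
	unsigned exclude_task : 1;
	unsigned exclude_softirq : 1;
	unsigned exclude_hardirq : 1;
};

/* Three-way decision: classify the interrupted context and test the
 * matching exclude_* bit. Hardirq takes precedence over softirq, as
 * with in_irq()/in_softirq(). */
static bool exclude_event(const struct fake_attr *attr,
			  unsigned long count, unsigned long offset)
{
	unsigned long interrupted = count - offset;

	if (!(interrupted & IRQ_MASK))
		return attr->exclude_task;
	if (interrupted & HARDIRQ_MASK)
		return attr->exclude_hardirq;
	if (interrupted & SOFTIRQ_MASK)
		return attr->exclude_softirq;
	return false;	/* NMI-only context: not filtered here */
}
```

A nested PMU interrupt over another hardirq is filtered when exclude_hardirq is set, while the same interrupt over a softirq is only filtered when exclude_softirq is.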
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
---
include/linux/perf_event.h | 8 +++++---
kernel/perf_event.c | 42 +++++++++++++++++++++++++++++++-----------
2 files changed, 36 insertions(+), 14 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d939fc7..ca55ec5 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -213,10 +213,12 @@ struct perf_event_attr {
*
* See also PERF_RECORD_MISC_EXACT_IP
*/
- precise_ip : 2, /* skid constraint */
- exclude_task : 1, /* don't count task context */
+ precise_ip : 2, /* skid constraint */
+ exclude_task : 1, /* don't count task context */
+ exclude_softirq : 1, /* don't count softirq */
+ exclude_hardirq : 1, /* don't count hardirq */
- __reserved_1 : 46;
+ __reserved_1 : 45;
union {
__u32 wakeup_events; /* wakeup every n events */
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index ab96411..03c17b2 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -3837,16 +3837,9 @@ static void perf_log_throttle(struct perf_event *event, int enable)
perf_output_end(&handle);
}
-/*
- * Generic event overflow handling, sampling.
- */
-
-static int __perf_event_overflow(struct perf_event *event, int nmi,
- int throttle, struct perf_sample_data *data,
- struct pt_regs *regs)
+static int perf_event_unthrottle(struct perf_event *event,
+ struct hw_perf_event *hwc, int throttle)
{
- int events = atomic_read(&event->event_limit);
- struct hw_perf_event *hwc = &event->hw;
int ret = 0;
throttle = (throttle && event->pmu->unthrottle != NULL);
@@ -3872,6 +3865,23 @@ static int __perf_event_overflow(struct perf_event *event, int nmi,
}
}
+ return ret;
+}
+
+/*
+ * Generic event overflow handling, sampling.
+ */
+
+static int __perf_event_overflow(struct perf_event *event, int nmi,
+ int throttle, struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ int events = atomic_read(&event->event_limit);
+ struct hw_perf_event *hwc = &event->hw;
+ int ret;
+
+ ret = perf_event_unthrottle(event, &event->hw, throttle);
+
if (event->attr.freq) {
u64 now = perf_clock();
s64 delta = now - hwc->freq_time_stamp;
@@ -3910,9 +3920,19 @@ static int __perf_event_overflow(struct perf_event *event, int nmi,
static bool
perf_exclude_event(struct perf_event *event, int offset)
{
- if (!in_interrupt_offset(preempt_count() - offset)) {
+ int preempt_offset = preempt_count() - offset;
+
+ if (!in_interrupt_offset(preempt_offset)) {
if (event->attr.exclude_task)
return true;
+ } else {
+ if (in_irq_offset(preempt_offset)) {
+ if (event->attr.exclude_hardirq)
+ return true;
+ } else if (in_softirq_offset(preempt_offset)) {
+ if (event->attr.exclude_softirq)
+ return true;
+ }
}
return false;
@@ -3932,7 +3952,7 @@ int perf_event_overflow(struct perf_event *event, int nmi,
* context.
*/
if (perf_exclude_event(event, offset))
- return 0;
+ return perf_event_unthrottle(event, &event->hw, 1);
return __perf_event_overflow(event, nmi, 1, data, regs);
}
--
1.6.2.3
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 3/4] perf: Support for irq exclusion
2010-05-21 14:05 ` [PATCH 3/4] perf: Support for irq exclusion Frederic Weisbecker
@ 2010-05-21 14:06 ` Frederic Weisbecker
0 siblings, 0 replies; 15+ messages in thread
From: Frederic Weisbecker @ 2010-05-21 14:06 UTC (permalink / raw)
To: Ingo Molnar
Cc: LKML, Peter Zijlstra, Arnaldo Carvalho de Melo, Paul Mackerras
On Fri, May 21, 2010 at 04:05:14PM +0200, Frederic Weisbecker wrote:
> Provide exclude_softirq and exclude_hardirq support in perf
> event attributes. This brings the final pieces to subscribe
> to any desired context granularity of profiling or tracing.
>
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> Cc: Paul Mackerras <paulus@samba.org>
> ---
> include/linux/perf_event.h | 8 +++++---
> kernel/perf_event.c | 42 +++++++++++++++++++++++++++++++-----------
> 2 files changed, 36 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index d939fc7..ca55ec5 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -213,10 +213,12 @@ struct perf_event_attr {
> *
> * See also PERF_RECORD_MISC_EXACT_IP
> */
> - precise_ip : 2, /* skid constraint */
> - exclude_task : 1, /* don't count task context */
> + precise_ip : 2, /* skid constraint */
> + exclude_task : 1, /* don't count task context */
> + exclude_softirq : 1, /* don't count softirq */
> + exclude_hardirq : 1, /* don't count hardirq */
>
> - __reserved_1 : 46;
> + __reserved_1 : 45;
Should be 44.
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 4/4] perf: Support for task/softirq/hardirq exclusion on tools
2010-05-21 14:05 [PATCH 0/4] perf: Tasks and irq exclusion Frederic Weisbecker
` (2 preceding siblings ...)
2010-05-21 14:05 ` [PATCH 3/4] perf: Support for irq exclusion Frederic Weisbecker
@ 2010-05-21 14:05 ` Frederic Weisbecker
2010-05-21 15:12 ` [PATCH 0/4] perf: Precise task / softirq / hardirq filtered stats/profiles Ingo Molnar
4 siblings, 0 replies; 15+ messages in thread
From: Frederic Weisbecker @ 2010-05-21 14:05 UTC (permalink / raw)
To: Ingo Molnar
Cc: LKML, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra,
Arnaldo Carvalho de Melo, Paul Mackerras
This brings the following new modifier flags to perf events:
- t = Profile task context
- s = Profile softirq context
- i = Profile hardirq context
Example:
perf record -a -g -e cycles:i ls -R /usr > /dev/null
3.11% ls [kernel.kallsyms] [k] __lock_acquire
|
--- __lock_acquire
|
|--95.83%-- lock_acquire
| _raw_spin_lock
| |
| |--30.43%-- perf_ctx_adjust_freq
| | perf_event_task_tick
| | scheduler_tick
| | update_process_times
| | tick_sched_timer
| | __run_hrtimer
| | hrtimer_interrupt
| | smp_apic_timer_interrupt
| | apic_timer_interrupt
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
---
tools/perf/util/parse-events.c | 37 ++++++++++++++++++++++++++-----------
1 files changed, 26 insertions(+), 11 deletions(-)
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 9bf0f40..7a18e71 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -688,24 +688,36 @@ static enum event_result
parse_event_modifier(const char **strp, struct perf_event_attr *attr)
{
const char *str = *strp;
- int exclude = 0;
- int eu = 0, ek = 0, eh = 0, precise = 0;
+ int exclude_ring = 0, exclude_context = 0;
+ int eu = 0, ek = 0, eh = 0, et = 0, es = 0, ei = 0, precise = 0;
if (*str++ != ':')
return 0;
while (*str) {
if (*str == 'u') {
- if (!exclude)
- exclude = eu = ek = eh = 1;
+ if (!exclude_ring)
+ exclude_ring = eu = ek = eh = 1;
eu = 0;
} else if (*str == 'k') {
- if (!exclude)
- exclude = eu = ek = eh = 1;
+ if (!exclude_ring)
+ exclude_ring = eu = ek = eh = 1;
ek = 0;
} else if (*str == 'h') {
- if (!exclude)
- exclude = eu = ek = eh = 1;
+ if (!exclude_ring)
+ exclude_ring = eu = ek = eh = 1;
eh = 0;
+ } else if (*str == 't') {
+ if (!exclude_context)
+ exclude_context = et = es = ei = 1;
+ et = 0;
+ } else if (*str == 's') {
+ if (!exclude_context)
+ exclude_context = et = es = ei = 1;
+ es = 0;
+ } else if (*str == 'i') {
+ if (!exclude_context)
+ exclude_context = et = es = ei = 1;
+ ei = 0;
} else if (*str == 'p') {
precise++;
} else
@@ -715,9 +727,12 @@ parse_event_modifier(const char **strp, struct perf_event_attr *attr)
}
if (str >= *strp + 2) {
*strp = str;
- attr->exclude_user = eu;
- attr->exclude_kernel = ek;
- attr->exclude_hv = eh;
+ attr->exclude_user = eu;
+ attr->exclude_kernel = ek;
+ attr->exclude_hv = eh;
+ attr->exclude_task = et;
+ attr->exclude_softirq = es;
+ attr->exclude_hardirq = ei;
attr->precise_ip = precise;
return 1;
}
--
1.6.2.3
* Re: [PATCH 0/4] perf: Precise task / softirq / hardirq filtered stats/profiles
2010-05-21 14:05 [PATCH 0/4] perf: Tasks and irq exclusion Frederic Weisbecker
` (3 preceding siblings ...)
2010-05-21 14:05 ` [PATCH 4/4] perf: Support for task/softirq/hardirq exclusion on tools Frederic Weisbecker
@ 2010-05-21 15:12 ` Ingo Molnar
2010-05-21 16:15 ` Peter Zijlstra
4 siblings, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2010-05-21 15:12 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Peter Zijlstra, Arnaldo Carvalho de Melo, Paul Mackerras
* Frederic Weisbecker <fweisbec@gmail.com> wrote:
> Hi,
>
> The new task and irq exclusion handling lets you
> confine tracing and profiling to just about any
> context you want.
I fixed the subject line ;-)
'exclusion' is the ABI detail. The feature your patches
implement is to allow 'softirq limited' or 'task-context
limited' or 'hardirq' profiling - which is way cool.
One thing i'd like to see in this feature is for it to
work on pure event counting - i.e. 'perf stat' as well.
This would allow some _very_ precise stats, without IRQ
noise. For example, today we have this kind of noise
in instruction counting:
$ for ((i=0;i<10;i++)); do perf stat -e instructions /bin/true 2>&1 | grep instructions; done
217161 instructions # 0,000 IPC
218591 instructions # 0,000 IPC
223268 instructions # 0,000 IPC
217112 instructions # 0,000 IPC
219392 instructions # 0,000 IPC
216801 instructions # 0,000 IPC
217501 instructions # 0,000 IPC
218565 instructions # 0,000 IPC
218682 instructions # 0,000 IPC
218523 instructions # 0,000 IPC
It isn't all that bad at ~2% jitter, but many improvements
we are working on in the kernel are much smaller than 1%.
If we extended your feature to perf stat, we might be able
to get a lot more precise measurements in terms of kernel
optimizations (and kernel bloat).
I'm really curious how accurate your scheme could become
that way. From the above 'few thousand instructions' of
noise we might be able to get down to 'hundreds of
instructions' of noise? If so then it would allow us to
measure micro-optimizations in a radically more precise
way.
Thanks,
Ingo
* Re: [PATCH 0/4] perf: Precise task / softirq / hardirq filtered stats/profiles
2010-05-21 15:12 ` [PATCH 0/4] perf: Precise task / softirq / hardirq filtered stats/profiles Ingo Molnar
@ 2010-05-21 16:15 ` Peter Zijlstra
2010-05-21 18:36 ` Ingo Molnar
0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2010-05-21 16:15 UTC (permalink / raw)
To: Ingo Molnar
Cc: Frederic Weisbecker, LKML, Arnaldo Carvalho de Melo,
Paul Mackerras
On Fri, 2010-05-21 at 17:12 +0200, Ingo Molnar wrote:
> 'exclusion' is the ABI detail. The feature your patches
> implement are to allow 'softirq limited' or 'task-context
> limited' or 'hardirq profiling' - which is way cool.
>
> One thing i'd like to see in this feature is for it to
> work on pure event counting - i.e. 'perf stat' as well.
It's not really exclusion, all it does is discard samples when in the
wrong context (which happens to work reasonably well for all the
swevents, except for the timer ones).
If you really want to do exclusion you have to disable/enable on *IRQ
entry/exit, but I guess that gets to be prohibitive on costs.
Implementing it shouldn't be too hard, just add some hooks to
irq_enter() irq_exit() and __do_softirq(). Each such hook should loop
over all active events and call ->stop/->start.
The only real problem would be poking at the hrtimer events from an
hrtimer interrupt :-)
* Re: [PATCH 0/4] perf: Precise task / softirq / hardirq filtered stats/profiles
2010-05-21 16:15 ` Peter Zijlstra
@ 2010-05-21 18:36 ` Ingo Molnar
0 siblings, 0 replies; 15+ messages in thread
From: Ingo Molnar @ 2010-05-21 18:36 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Frederic Weisbecker, LKML, Arnaldo Carvalho de Melo,
Paul Mackerras
* Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2010-05-21 at 17:12 +0200, Ingo Molnar wrote:
> > 'exclusion' is the ABI detail. The feature your patches
> > implement are to allow 'softirq limited' or 'task-context
> > limited' or 'hardirq profiling' - which is way cool.
> >
> > One thing i'd like to see in this feature is for it to
> > work on pure event counting - i.e. 'perf stat' as well.
>
> Its not really exclusion, all it does is discard samples
> when in the wrong context (which happens to work
> reasonably well for all the swevents, except for the
> timer ones).
>
> If you really want to do exclusion you have to
> disable/enable on *IRQ entry/exit, but I guess that gets
> to be prohibitive on costs.
Yeah, i know - this is what i tried to allude to in the
other part of my reply:
> > If we extended your feature to perf stat, we might be
> > able to get a lot more precise measurements in terms
> > of kernel optimizations (and kernel bloat).
Right, so there's two ways to do it: one is the
disable/enable you mention, the other would be to
save the count, read it again later and subtract the delta.
( the RDPMC based delta method can be made to work for
sampling as well, even if the NMI hits in the middle of
the softirq or hardirq. )
Two reads might be cheaper than a disable+enable.
Especially if it's done using RDPMC.
We should do it like that, not by discarding samples, and
overhead should be OK as long as we don't do the
disable/enable (or delta read) if the feature is off.
If a simple enable/disable or read/read costs too much
then we need to prod hw makers about it. But it should be
OK i think.
Ingo