public inbox for linux-kernel@vger.kernel.org
* [PATCH] perf: Fix inconsistency between IP and callchain sampling
@ 2010-01-18  5:47 Anton Blanchard
  2010-01-18 10:43 ` Frederic Weisbecker
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Anton Blanchard @ 2010-01-18  5:47 UTC (permalink / raw)
  To: Peter Zijlstra, Paul Mackerras, Ingo Molnar
  Cc: Benjamin Herrenschmidt, Paul Mundt, Frederic Weisbecker,
	linux-kernel


When running perf across all cpus with backtracing (-a -g), sometimes we
get samples without associated backtraces:

    23.44%         init  [kernel]                     [k] restore
    11.46%         init                       eeba0c  [k] 0x00000000eeba0c
     6.77%      swapper  [kernel]                     [k] .perf_ctx_adjust_freq
     5.73%         init  [kernel]                     [k] .__trace_hcall_entry
     4.69%         perf  libc-2.9.so                  [.] 0x0000000006bb8c
                       |          
                       |--11.11%-- 0xfffa941bbbc

It turns out the backtrace code has a check for the idle task and the IP
sampling does not. This creates problems when profiling an interrupt
heavy workload (in my case 10Gbit ethernet) since we get no backtraces
for interrupts received while idle (ie most of the workload).

Right now x86 and sh check that current is not NULL, which should never
happen so remove that too.

Signed-off-by: Anton Blanchard <anton@samba.org>
---

The exclusion of idle tasks should be in the common perf events code,
perhaps keying off the exclude_idle field. It should also ensure that
we weren't in an interrupt at the time.

I also notice this:

        if (is_user && current->state != TASK_RUNNING)

But I'm not exactly sure what that will catch. When would we get a userspace
sample from something that isn't running?

Index: linux.trees.git/arch/powerpc/kernel/perf_callchain.c
===================================================================
--- linux.trees.git.orig/arch/powerpc/kernel/perf_callchain.c	2010-01-18 16:10:10.000000000 +1100
+++ linux.trees.git/arch/powerpc/kernel/perf_callchain.c	2010-01-18 16:10:17.000000000 +1100
@@ -495,9 +495,6 @@ struct perf_callchain_entry *perf_callch
 
 	entry->nr = 0;
 
-	if (current->pid == 0)		/* idle task? */
-		return entry;
-
 	if (!user_mode(regs)) {
 		perf_callchain_kernel(regs, entry);
 		if (current->mm)
Index: linux.trees.git/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/cpu/perf_event.c	2010-01-18 16:10:36.000000000 +1100
+++ linux.trees.git/arch/x86/kernel/cpu/perf_event.c	2010-01-18 16:17:33.000000000 +1100
@@ -2425,9 +2425,6 @@ perf_do_callchain(struct pt_regs *regs, 
 
 	is_user = user_mode(regs);
 
-	if (!current || current->pid == 0)
-		return;
-
 	if (is_user && current->state != TASK_RUNNING)
 		return;
 
Index: linux.trees.git/arch/sh/kernel/perf_callchain.c
===================================================================
--- linux.trees.git.orig/arch/sh/kernel/perf_callchain.c	2010-01-18 16:18:24.000000000 +1100
+++ linux.trees.git/arch/sh/kernel/perf_callchain.c	2010-01-18 16:18:37.000000000 +1100
@@ -68,9 +68,6 @@ perf_do_callchain(struct pt_regs *regs, 
 
 	is_user = user_mode(regs);
 
-	if (!current || current->pid == 0)
-		return;
-
 	if (is_user && current->state != TASK_RUNNING)
 		return;
 


* Re: [PATCH] perf: Fix inconsistency between IP and callchain sampling
  2010-01-18  5:47 [PATCH] perf: Fix inconsistency between IP and callchain sampling Anton Blanchard
@ 2010-01-18 10:43 ` Frederic Weisbecker
  2010-01-22  2:04   ` Anton Blanchard
  2010-01-22  7:25   ` Ingo Molnar
  2010-01-21 13:20 ` Frederic Weisbecker
  2010-01-29  9:24 ` [tip:perf/core] " tip-bot for Anton Blanchard
  2 siblings, 2 replies; 6+ messages in thread
From: Frederic Weisbecker @ 2010-01-18 10:43 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mundt, linux-kernel

On Mon, Jan 18, 2010 at 04:47:07PM +1100, Anton Blanchard wrote:
> 
> When running perf across all cpus with backtracing (-a -g), sometimes we
> get samples without associated backtraces:
> 
>     23.44%         init  [kernel]                     [k] restore
>     11.46%         init                       eeba0c  [k] 0x00000000eeba0c
>      6.77%      swapper  [kernel]                     [k] .perf_ctx_adjust_freq
>      5.73%         init  [kernel]                     [k] .__trace_hcall_entry
>      4.69%         perf  libc-2.9.so                  [.] 0x0000000006bb8c
>                        |          
>                        |--11.11%-- 0xfffa941bbbc
> 
> It turns out the backtrace code has a check for the idle task and the IP
> sampling does not. This creates problems when profiling an interrupt
> heavy workload (in my case 10Gbit ethernet) since we get no backtraces
> for interrupts received while idle (ie most of the workload).


Agreed, the arch backtrace code is not well suited to decide this.
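
A minimal userspace sketch of the mismatch described above, not the
kernel code: the IP is always recorded, while the callchain is dropped
for pid 0 when the arch-level check is present. All model_* names and
the skip_idle flag are hypothetical and exist only for illustration.

```c
#include <stddef.h>

#define MODEL_MAX_DEPTH 4

/* Hypothetical stand-in for one recorded sample. */
struct model_sample {
	unsigned long ip;			/* sampled instruction pointer */
	size_t nr;				/* number of callchain entries */
	unsigned long chain[MODEL_MAX_DEPTH];
};

/*
 * Fill the callchain. With skip_idle set, this mimics the arch check
 * that the patch removes: the idle task (pid 0) gets an empty chain.
 */
static void model_callchain(struct model_sample *s, int pid,
			    const unsigned long *frames, size_t nr_frames,
			    int skip_idle)
{
	size_t i;

	s->nr = 0;
	if (skip_idle && pid == 0)
		return;
	for (i = 0; i < nr_frames && i < MODEL_MAX_DEPTH; i++)
		s->chain[s->nr++] = frames[i];
}

/* The IP path has no idle check, so the IP is stored unconditionally. */
static void model_take_sample(struct model_sample *s, int pid,
			      unsigned long ip,
			      const unsigned long *frames, size_t nr_frames,
			      int skip_idle)
{
	s->ip = ip;
	model_callchain(s, pid, frames, nr_frames, skip_idle);
}
```

With skip_idle set and pid 0, the sample carries an IP but zero
callchain entries, which is the shape of the output reported above.
With skip_idle cleared, both paths agree.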


> 
> Right now x86 and sh check that current is not NULL, which should never
> happen so remove that too.


Yeah. Unless we can have backtraces in pretty rare places
where current is unavailable. But I guess not.


> 
> Signed-off-by: Anton Blanchard <anton@samba.org>


Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>


> The exclusion of idle tasks should be in the common perf events code,
> perhaps keying off the exclude_idle field. It should also ensure that
> we weren't in an interrupt at the time.


We have exclude_idle, but it only affects cpu clock events:

if (regs) {
	if (!(event->attr.exclude_idle && current->pid == 0))
		if (perf_event_overflow(event, 0, &data, regs))
			ret = HRTIMER_NORESTART;
}


I think the exclude_idle check should move into perf_event_overflow(),
to enforce its semantics and apply it to every software event.
I'm preparing a patch for that.
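
As a rough illustration only (this is not the patch mentioned above),
the idea of hoisting the check into the common overflow path can be
modelled in userspace as follows. The model_* types and functions are
hypothetical; only the exclude_idle field and the pid == 0 test come
from the snippet quoted earlier.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct model_attr {
	bool exclude_idle;	/* mirrors perf_event_attr:exclude_idle */
};

struct model_task {
	int pid;		/* pid 0 is the idle task */
};

/* True when the sample should be dropped before it is recorded. */
static bool model_skip_sample(const struct model_attr *attr,
			      const struct model_task *curr)
{
	return attr->exclude_idle && curr->pid == 0;
}

/*
 * Model of an overflow handler with the check hoisted into it, so
 * every caller gets the same filtering.
 * Returns 1 if a sample was recorded, 0 if it was filtered out.
 */
static int model_event_overflow(const struct model_attr *attr,
				const struct model_task *curr,
				unsigned long *nr_samples)
{
	if (model_skip_sample(attr, curr))
		return 0;
	(*nr_samples)++;
	return 1;
}
```

Placing the test in one common function means each event type no
longer needs its own copy of the condition.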

(.. Even better would have been to schedule out exclude_idle events when we
enter idle. But currently this is a single event attribute, not a
group attribute, which would make such individual scheduling game a
bit insane. My guess is that it should have been a group attribute,
to keep the group counting consistent, so that its scope could have
been broader, to the point of deactivating hardware events on idle, etc...
But now the ABI is fixed.. )


Concerning interrupts that happen in idle, I think we should filter
these if exclude_idle = 1. That looks more like something a user may
want: if we don't want to profile idle, neither do we want to be
encumbered with the interrupts that occur inside it. Conversely, if
someone wants a fine-grained profile, let's get idle and its interrupts.

What do you guys think about that?


> 
> I also notice this:
> 
>         if (is_user && current->state != TASK_RUNNING)
> 
> But I'm not exactly sure what that will catch. When would we get a userspace
> sample from something that isn't running?


Not sure either...

Thanks.



* Re: [PATCH] perf: Fix inconsistency between IP and callchain sampling
  2010-01-18  5:47 [PATCH] perf: Fix inconsistency between IP and callchain sampling Anton Blanchard
  2010-01-18 10:43 ` Frederic Weisbecker
@ 2010-01-21 13:20 ` Frederic Weisbecker
  2010-01-29  9:24 ` [tip:perf/core] " tip-bot for Anton Blanchard
  2 siblings, 0 replies; 6+ messages in thread
From: Frederic Weisbecker @ 2010-01-21 13:20 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mundt, linux-kernel

On Mon, Jan 18, 2010 at 04:47:07PM +1100, Anton Blanchard wrote:
> 
> When running perf across all cpus with backtracing (-a -g), sometimes we
> get samples without associated backtraces:
> 
>     23.44%         init  [kernel]                     [k] restore
>     11.46%         init                       eeba0c  [k] 0x00000000eeba0c
>      6.77%      swapper  [kernel]                     [k] .perf_ctx_adjust_freq
>      5.73%         init  [kernel]                     [k] .__trace_hcall_entry
>      4.69%         perf  libc-2.9.so                  [.] 0x0000000006bb8c
>                        |          
>                        |--11.11%-- 0xfffa941bbbc
> 
> It turns out the backtrace code has a check for the idle task and the IP
> sampling does not. This creates problems when profiling an interrupt
> heavy workload (in my case 10Gbit ethernet) since we get no backtraces
> for interrupts received while idle (ie most of the workload).
> 
> Right now x86 and sh check that current is not NULL, which should never
> happen so remove that too.
> 
> Signed-off-by: Anton Blanchard <anton@samba.org>



I'm queuing it. Thanks.



* Re: [PATCH] perf: Fix inconsistency between IP and callchain sampling
  2010-01-18 10:43 ` Frederic Weisbecker
@ 2010-01-22  2:04   ` Anton Blanchard
  2010-01-22  7:25   ` Ingo Molnar
  1 sibling, 0 replies; 6+ messages in thread
From: Anton Blanchard @ 2010-01-22  2:04 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mundt, linux-kernel

 
Hi Frederic,

> Concerning interrupts that happen in idle, I think we should filter
> these if exclude_idle = 1. That looks more like something a user may
> want: if we don't want to profile idle, neither do we want to be
> encumbered with the interrupts that occur inside it. Conversely, if
> someone wants a fine-grained profile, let's get idle and its interrupts.
> 
> What do you guys think about that?

Yeah that sounds reasonable to me. Thanks for looking into it!

Anton


* Re: [PATCH] perf: Fix inconsistency between IP and callchain sampling
  2010-01-18 10:43 ` Frederic Weisbecker
  2010-01-22  2:04   ` Anton Blanchard
@ 2010-01-22  7:25   ` Ingo Molnar
  1 sibling, 0 replies; 6+ messages in thread
From: Ingo Molnar @ 2010-01-22  7:25 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Anton Blanchard, Peter Zijlstra, Paul Mackerras,
	Benjamin Herrenschmidt, Paul Mundt, linux-kernel,
	Arnaldo Carvalho de Melo


* Frederic Weisbecker <fweisbec@gmail.com> wrote:

> Concerning interrupts that happen in idle, I think we should filter these if
> exclude_idle = 1. That looks more like something a user may want: if we don't
> want to profile idle, neither do we want to be encumbered with the interrupts
> that occur inside it. Conversely, if someone wants a fine-grained profile,
> let's get idle and its interrupts.
> 
> What do you guys think about that?

Another, related thing i'd _love_ to see implemented is per IRQ level 
filtering of both samples and statistics.

This would allow two nice things:

 - the reporting of IRQ (and softirq/tasklet) contexts as separate entities by
   perf report

 - a 'no IRQ related noise' mode for per-task perf stat, like the
   user/kernel/hypervisor bits already do, just extended to irqs as well. This 
   would make 'perf stat --repeat 10 /bin/true' _much_ less noisy, and could 
   be used even more to assess the performance impact of kernel patches. 
   Currently IRQs that hit task execution get added to the task's overhead, 
   which makes the numbers both skewed and noisier.

(The do_IRQ() callbacks are needed because most PMUs cannot stop counters when
we enter/exit IRQ/softirq state, so we have to turn the counters off/on at irq
entry and exit.)
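
A hedged userspace sketch of that software on/off idea, assuming a
simple nesting depth is enough to tell task context from IRQ context.
Nothing here is an existing kernel interface: every model_* name is
hypothetical, and a real implementation would hook the actual IRQ
entry and exit paths.

```c
/* Hypothetical model of one counter split by execution context. */
struct model_counter {
	unsigned long long task_events;	/* counted in task context */
	unsigned long long irq_events;	/* counted in IRQ context */
	unsigned int irq_depth;		/* nesting depth of IRQ contexts */
};

/* Called on IRQ entry: task-context counting is switched off. */
static void model_irq_enter(struct model_counter *c)
{
	c->irq_depth++;
}

/* Called on IRQ exit: counting resumes once the depth is back to 0. */
static void model_irq_exit(struct model_counter *c)
{
	if (c->irq_depth > 0)
		c->irq_depth--;
}

/* Account n events to whichever context is currently active. */
static void model_count(struct model_counter *c, unsigned long long n)
{
	if (c->irq_depth > 0)
		c->irq_events += n;
	else
		c->task_events += n;
}
```

Keeping the two totals apart is what would let a report show IRQ
contexts separately, or leave them out of a task's numbers.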

	Ingo


* [tip:perf/core] perf: Fix inconsistency between IP and callchain sampling
  2010-01-18  5:47 [PATCH] perf: Fix inconsistency between IP and callchain sampling Anton Blanchard
  2010-01-18 10:43 ` Frederic Weisbecker
  2010-01-21 13:20 ` Frederic Weisbecker
@ 2010-01-29  9:24 ` tip-bot for Anton Blanchard
  2 siblings, 0 replies; 6+ messages in thread
From: tip-bot for Anton Blanchard @ 2010-01-29  9:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, paulus, anton, hpa, mingo, a.p.zijlstra, lethal,
	fweisbec, benh, tglx, mingo

Commit-ID:  339ce1a4dc2ca26444c4f65c31b71a5056f3bb0b
Gitweb:     http://git.kernel.org/tip/339ce1a4dc2ca26444c4f65c31b71a5056f3bb0b
Author:     Anton Blanchard <anton@samba.org>
AuthorDate: Mon, 18 Jan 2010 16:47:07 +1100
Committer:  Frederic Weisbecker <fweisbec@gmail.com>
CommitDate: Thu, 28 Jan 2010 14:31:20 +0100

perf: Fix inconsistency between IP and callchain sampling

When running perf across all cpus with backtracing (-a -g), sometimes we
get samples without associated backtraces:

    23.44%         init  [kernel]                     [k] restore
    11.46%         init                       eeba0c  [k] 0x00000000eeba0c
     6.77%      swapper  [kernel]                     [k] .perf_ctx_adjust_freq
     5.73%         init  [kernel]                     [k] .__trace_hcall_entry
     4.69%         perf  libc-2.9.so                  [.] 0x0000000006bb8c
                       |
                       |--11.11%-- 0xfffa941bbbc

It turns out the backtrace code has a check for the idle task and the IP
sampling does not. This creates problems when profiling an interrupt
heavy workload (in my case 10Gbit ethernet) since we get no backtraces
for interrupts received while idle (ie most of the workload).

Right now x86 and sh check that current is not NULL, which should never
happen so remove that too.

Idle task's exclusion must be performed from the core code, on top
of perf_event_attr:exclude_idle.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
LKML-Reference: <20100118054707.GT12666@kryten>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 arch/powerpc/kernel/perf_callchain.c |    3 ---
 arch/sh/kernel/perf_callchain.c      |    3 ---
 arch/x86/kernel/cpu/perf_event.c     |    3 ---
 3 files changed, 0 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kernel/perf_callchain.c b/arch/powerpc/kernel/perf_callchain.c
index a3c11ca..95ad9da 100644
--- a/arch/powerpc/kernel/perf_callchain.c
+++ b/arch/powerpc/kernel/perf_callchain.c
@@ -495,9 +495,6 @@ struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
 
 	entry->nr = 0;
 
-	if (current->pid == 0)		/* idle task? */
-		return entry;
-
 	if (!user_mode(regs)) {
 		perf_callchain_kernel(regs, entry);
 		if (current->mm)
diff --git a/arch/sh/kernel/perf_callchain.c b/arch/sh/kernel/perf_callchain.c
index 24ea837..a9dd3ab 100644
--- a/arch/sh/kernel/perf_callchain.c
+++ b/arch/sh/kernel/perf_callchain.c
@@ -68,9 +68,6 @@ perf_do_callchain(struct pt_regs *regs, struct perf_callchain_entry *entry)
 
 	is_user = user_mode(regs);
 
-	if (!current || current->pid == 0)
-		return;
-
 	if (is_user && current->state != TASK_RUNNING)
 		return;
 
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index b1bb8c5..ed1998b 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -2425,9 +2425,6 @@ perf_do_callchain(struct pt_regs *regs, struct perf_callchain_entry *entry)
 
 	is_user = user_mode(regs);
 
-	if (!current || current->pid == 0)
-		return;
-
 	if (is_user && current->state != TASK_RUNNING)
 		return;
 

