linuxppc-dev.lists.ozlabs.org archive mirror
* [PATCH] powerpc/perf_events: Implement perf_arch_fetch_caller_regs for powerpc
@ 2010-03-15  5:46 Paul Mackerras
  2010-03-15 17:36 ` Michael Neuling
  2010-03-15 21:04 ` Frederic Weisbecker
  0 siblings, 2 replies; 5+ messages in thread
From: Paul Mackerras @ 2010-03-15  5:46 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Frederic Weisbecker, benh
  Cc: linuxppc-dev, linux-kernel, anton

This implements a powerpc version of perf_arch_fetch_caller_regs.
It's implemented in assembly because that way we can be sure there
isn't a stack frame for perf_arch_fetch_caller_regs.  If it was in
C, gcc might or might not create a stack frame for it, which would
affect the number of levels we have to skip.  It's not ifdef'd
because it is only 14 instructions long.

With this, we see results from perf record -e lock:lock_acquire like
this:

# Samples: 24878
#
# Overhead         Command      Shared Object  Symbol
# ........  ..............  .................  ......
#
    14.99%            perf  [kernel.kallsyms]  [k] ._raw_spin_lock
                      |
                      --- ._raw_spin_lock
                         |          
                         |--25.00%-- .alloc_fd
                         |          (nil)
                         |          |          
                         |          |--50.00%-- .anon_inode_getfd
                         |          |          .sys_perf_event_open
                         |          |          syscall_exit
                         |          |          syscall
                         |          |          create_counter
                         |          |          __cmd_record
                         |          |          run_builtin
                         |          |          main
                         |          |          0xfd2e704
                         |          |          0xfd2e8c0
                         |          |          (nil)

... etc.

Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/include/asm/asm-compat.h |    2 ++
 arch/powerpc/kernel/misc.S            |   20 ++++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/arch/powerpc/include/asm/asm-compat.h b/arch/powerpc/include/asm/asm-compat.h
index c1b475a..a9b91ed 100644
--- a/arch/powerpc/include/asm/asm-compat.h
+++ b/arch/powerpc/include/asm/asm-compat.h
@@ -28,6 +28,7 @@
 #define PPC_LLARX(t, a, b, eh)	PPC_LDARX(t, a, b, eh)
 #define PPC_STLCX	stringify_in_c(stdcx.)
 #define PPC_CNTLZL	stringify_in_c(cntlzd)
+#define PPC_LR_STKOFF	16
 
 /* Move to CR, single-entry optimized version. Only available
  * on POWER4 and later.
@@ -51,6 +52,7 @@
 #define PPC_STLCX	stringify_in_c(stwcx.)
 #define PPC_CNTLZL	stringify_in_c(cntlzw)
 #define PPC_MTOCRF	stringify_in_c(mtcrf)
+#define PPC_LR_STKOFF	4
 
 #endif
 
diff --git a/arch/powerpc/kernel/misc.S b/arch/powerpc/kernel/misc.S
index 2d29752..4459500 100644
--- a/arch/powerpc/kernel/misc.S
+++ b/arch/powerpc/kernel/misc.S
@@ -127,3 +127,23 @@ _GLOBAL(__setup_cpu_power7)
 _GLOBAL(__restore_cpu_power7)
 	/* place holder */
 	blr
+
+/*
+ * Get a minimal set of registers for our caller's nth caller.
+ * r3 = regs pointer, r5 = n.
+ */
+_GLOBAL(perf_arch_fetch_caller_regs)
+	mr	r6,r1
+	cmpwi	r5,0
+	mflr	r4
+	ble	2f
+	mtctr	r5
+1:	PPC_LL	r6,0(r6)
+	bdnz	1b
+	PPC_LL	r4,PPC_LR_STKOFF(r6)
+2:	PPC_LL	r7,0(r6)
+	PPC_LL	r7,PPC_LR_STKOFF(r7)
+	PPC_STL	r6,GPR1-STACK_FRAME_OVERHEAD(r3)
+	PPC_STL	r4,_NIP-STACK_FRAME_OVERHEAD(r3)
+	PPC_STL	r7,_LINK-STACK_FRAME_OVERHEAD(r3)
+	blr
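
The frame walk in the assembly above can be modelled on a host machine.
The sketch below is not kernel code: it assumes a frame is an array of
words with the back chain in word 0 and the saved LR at byte offset
PPC_LR_STKOFF (16 with 8-byte words, 4 with 4-byte words), and the names
regs_sketch, LR_SLOT_SKETCH and fetch_caller_regs_sketch are
hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Word index of the LR save slot in this model: byte offset 16 with
 * 8-byte words, byte offset 4 with 4-byte words (cf. PPC_LR_STKOFF).
 */
#define LR_SLOT_SKETCH (sizeof(uintptr_t) == 8 ? 2 : 1)

/* Hypothetical stand-in for the three pt_regs fields the patch fills in. */
struct regs_sketch {
	uintptr_t gpr1;	/* stack pointer of the selected frame */
	uintptr_t nip;	/* instruction pointer for that frame */
	uintptr_t link;	/* return address of that frame's function */
};

/*
 * sp models r1 on entry (the caller's frame, since the assembly routine
 * creates no frame of its own), lr models the link register, and n is
 * the number of frames to skip.
 */
void fetch_caller_regs_sketch(struct regs_sketch *regs,
			      const uintptr_t *sp, uintptr_t lr, int n)
{
	const uintptr_t *frame = sp;
	const uintptr_t *parent;
	uintptr_t nip = lr;
	int i;

	assert(regs && sp);
	if (n > 0) {
		/* Follow the back chain n times (the mtctr/bdnz loop). */
		for (i = 0; i < n; i++)
			frame = (const uintptr_t *)frame[0];
		/* The LR saved in this frame becomes the reported nip. */
		nip = frame[LR_SLOT_SKETCH];
	}
	/* The link value comes from the LR slot one frame further up. */
	parent = (const uintptr_t *)frame[0];
	regs->gpr1 = (uintptr_t)frame;
	regs->nip = nip;
	regs->link = parent[LR_SLOT_SKETCH];
}
```

With n = 0 the model reports the caller's own stack pointer and the
incoming LR as nip; with n > 0 both sp and nip come from the frame
reached after n back-chain hops. In either case the walk dereferences
one frame beyond the selected one to obtain the link value.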


* Re: [PATCH] powerpc/perf_events: Implement perf_arch_fetch_caller_regs for powerpc
  2010-03-15  5:46 [PATCH] powerpc/perf_events: Implement perf_arch_fetch_caller_regs for powerpc Paul Mackerras
@ 2010-03-15 17:36 ` Michael Neuling
  2010-03-15 21:04 ` Frederic Weisbecker
  1 sibling, 0 replies; 5+ messages in thread
From: Michael Neuling @ 2010-03-15 17:36 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Peter Zijlstra, Frederic Weisbecker, linux-kernel, linuxppc-dev,
	anton, Ingo Molnar

> This implements a powerpc version of perf_arch_fetch_caller_regs.
> It's implemented in assembly because that way we can be sure there
> isn't a stack frame for perf_arch_fetch_caller_regs.  If it was in
> C, gcc might or might not create a stack frame for it, which would
> affect the number of levels we have to skip.  

Should we put this comment in the code as well?

Mikey

> It's not ifdef'd
> because it is only 14 instructions long.
> 
> With this, we see results from perf record -e lock:lock_acquire like
> this:
> 
> # Samples: 24878
> #
> # Overhead         Command      Shared Object  Symbol
> # ........  ..............  .................  ......
> #
>     14.99%            perf  [kernel.kallsyms]  [k] ._raw_spin_lock
>                       |
>                       --- ._raw_spin_lock
>                          |          
>                          |--25.00%-- .alloc_fd
>                          |          (nil)
>                          |          |          
>                          |          |--50.00%-- .anon_inode_getfd
>                          |          |          .sys_perf_event_open
>                          |          |          syscall_exit
>                          |          |          syscall
>                          |          |          create_counter
>                          |          |          __cmd_record
>                          |          |          run_builtin
>                          |          |          main
>                          |          |          0xfd2e704
>                          |          |          0xfd2e8c0
>                          |          |          (nil)
> 
> ... etc.
> 
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> ---
>  arch/powerpc/include/asm/asm-compat.h |    2 ++
>  arch/powerpc/kernel/misc.S            |   20 ++++++++++++++++++++
>  2 files changed, 22 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/asm-compat.h b/arch/powerpc/include/asm/asm-compat.h
> index c1b475a..a9b91ed 100644
> --- a/arch/powerpc/include/asm/asm-compat.h
> +++ b/arch/powerpc/include/asm/asm-compat.h
> @@ -28,6 +28,7 @@
>  #define PPC_LLARX(t, a, b, eh)	PPC_LDARX(t, a, b, eh)
>  #define PPC_STLCX	stringify_in_c(stdcx.)
>  #define PPC_CNTLZL	stringify_in_c(cntlzd)
> +#define PPC_LR_STKOFF	16
>  
>  /* Move to CR, single-entry optimized version. Only available
>   * on POWER4 and later.
> @@ -51,6 +52,7 @@
>  #define PPC_STLCX	stringify_in_c(stwcx.)
>  #define PPC_CNTLZL	stringify_in_c(cntlzw)
>  #define PPC_MTOCRF	stringify_in_c(mtcrf)
> +#define PPC_LR_STKOFF	4
>  
>  #endif
>  
> diff --git a/arch/powerpc/kernel/misc.S b/arch/powerpc/kernel/misc.S
> index 2d29752..4459500 100644
> --- a/arch/powerpc/kernel/misc.S
> +++ b/arch/powerpc/kernel/misc.S
> @@ -127,3 +127,23 @@ _GLOBAL(__setup_cpu_power7)
>  _GLOBAL(__restore_cpu_power7)
>  	/* place holder */
>  	blr
> +
> +/*
> + * Get a minimal set of registers for our caller's nth caller.
> + * r3 = regs pointer, r5 = n.
> + */
> +_GLOBAL(perf_arch_fetch_caller_regs)
> +	mr	r6,r1
> +	cmpwi	r5,0
> +	mflr	r4
> +	ble	2f
> +	mtctr	r5
> +1:	PPC_LL	r6,0(r6)
> +	bdnz	1b
> +	PPC_LL	r4,PPC_LR_STKOFF(r6)
> +2:	PPC_LL	r7,0(r6)
> +	PPC_LL	r7,PPC_LR_STKOFF(r7)
> +	PPC_STL	r6,GPR1-STACK_FRAME_OVERHEAD(r3)
> +	PPC_STL	r4,_NIP-STACK_FRAME_OVERHEAD(r3)
> +	PPC_STL	r7,_LINK-STACK_FRAME_OVERHEAD(r3)
> +	blr


* Re: [PATCH] powerpc/perf_events: Implement perf_arch_fetch_caller_regs for powerpc
  2010-03-15  5:46 [PATCH] powerpc/perf_events: Implement perf_arch_fetch_caller_regs for powerpc Paul Mackerras
  2010-03-15 17:36 ` Michael Neuling
@ 2010-03-15 21:04 ` Frederic Weisbecker
  2010-03-16  3:22   ` Paul Mackerras
  1 sibling, 1 reply; 5+ messages in thread
From: Frederic Weisbecker @ 2010-03-15 21:04 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Peter Zijlstra, linux-kernel, linuxppc-dev, anton, Ingo Molnar

On Mon, Mar 15, 2010 at 04:46:15PM +1100, Paul Mackerras wrote:
> This implements a powerpc version of perf_arch_fetch_caller_regs.
> It's implemented in assembly because that way we can be sure there
> isn't a stack frame for perf_arch_fetch_caller_regs.  If it was in
> C, gcc might or might not create a stack frame for it, which would
> affect the number of levels we have to skip.  It's not ifdef'd
> because it is only 14 instructions long.
> 
> With this, we see results from perf record -e lock:lock_acquire like
> this:
> 
> # Samples: 24878
> #
> # Overhead         Command      Shared Object  Symbol
> # ........  ..............  .................  ......
> #
>     14.99%            perf  [kernel.kallsyms]  [k] ._raw_spin_lock
>                       |
>                       --- ._raw_spin_lock
>                          |          
>                          |--25.00%-- .alloc_fd
>                          |          (nil)
>                          |          |          
>                          |          |--50.00%-- .anon_inode_getfd
>                          |          |          .sys_perf_event_open
>                          |          |          syscall_exit
>                          |          |          syscall
>                          |          |          create_counter
>                          |          |          __cmd_record
>                          |          |          run_builtin
>                          |          |          main
>                          |          |          0xfd2e704
>                          |          |          0xfd2e8c0
>                          |          |          (nil)
> 
> ... etc.
> 
> Signed-off-by: Paul Mackerras <paulus@samba.org>


Cool!



> ---
>  arch/powerpc/include/asm/asm-compat.h |    2 ++
>  arch/powerpc/kernel/misc.S            |   20 ++++++++++++++++++++
>  2 files changed, 22 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/asm-compat.h b/arch/powerpc/include/asm/asm-compat.h
> index c1b475a..a9b91ed 100644
> --- a/arch/powerpc/include/asm/asm-compat.h
> +++ b/arch/powerpc/include/asm/asm-compat.h
> @@ -28,6 +28,7 @@
>  #define PPC_LLARX(t, a, b, eh)	PPC_LDARX(t, a, b, eh)
>  #define PPC_STLCX	stringify_in_c(stdcx.)
>  #define PPC_CNTLZL	stringify_in_c(cntlzd)
> +#define PPC_LR_STKOFF	16
>  
>  /* Move to CR, single-entry optimized version. Only available
>   * on POWER4 and later.
> @@ -51,6 +52,7 @@
>  #define PPC_STLCX	stringify_in_c(stwcx.)
>  #define PPC_CNTLZL	stringify_in_c(cntlzw)
>  #define PPC_MTOCRF	stringify_in_c(mtcrf)
> +#define PPC_LR_STKOFF	4
>  
>  #endif
>  
> diff --git a/arch/powerpc/kernel/misc.S b/arch/powerpc/kernel/misc.S
> index 2d29752..4459500 100644
> --- a/arch/powerpc/kernel/misc.S
> +++ b/arch/powerpc/kernel/misc.S
> @@ -127,3 +127,23 @@ _GLOBAL(__setup_cpu_power7)
>  _GLOBAL(__restore_cpu_power7)
>  	/* place holder */
>  	blr
> +
> +/*
> + * Get a minimal set of registers for our caller's nth caller.
> + * r3 = regs pointer, r5 = n.
> + */
> +_GLOBAL(perf_arch_fetch_caller_regs)
> +	mr	r6,r1
> +	cmpwi	r5,0
> +	mflr	r4
> +	ble	2f
> +	mtctr	r5
> +1:	PPC_LL	r6,0(r6)
> +	bdnz	1b
> +	PPC_LL	r4,PPC_LR_STKOFF(r6)
> +2:	PPC_LL	r7,0(r6)
> +	PPC_LL	r7,PPC_LR_STKOFF(r7)
> +	PPC_STL	r6,GPR1-STACK_FRAME_OVERHEAD(r3)
> +	PPC_STL	r4,_NIP-STACK_FRAME_OVERHEAD(r3)
> +	PPC_STL	r7,_LINK-STACK_FRAME_OVERHEAD(r3)
> +	blr


* Re: [PATCH] powerpc/perf_events: Implement perf_arch_fetch_caller_regs for powerpc
  2010-03-15 21:04 ` Frederic Weisbecker
@ 2010-03-16  3:22   ` Paul Mackerras
  2010-03-16 20:56     ` Frederic Weisbecker
  0 siblings, 1 reply; 5+ messages in thread
From: Paul Mackerras @ 2010-03-16  3:22 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, linux-kernel, linuxppc-dev, anton, Ingo Molnar

On Mon, Mar 15, 2010 at 10:04:54PM +0100, Frederic Weisbecker wrote:
> On Mon, Mar 15, 2010 at 04:46:15PM +1100, Paul Mackerras wrote:

> >     14.99%            perf  [kernel.kallsyms]  [k] ._raw_spin_lock
> >                       |
> >                       --- ._raw_spin_lock
> >                          |          
> >                          |--25.00%-- .alloc_fd
> >                          |          (nil)
> >                          |          |          
> >                          |          |--50.00%-- .anon_inode_getfd
> >                          |          |          .sys_perf_event_open
> >                          |          |          syscall_exit
> >                          |          |          syscall
> >                          |          |          create_counter
> >                          |          |          __cmd_record
> >                          |          |          run_builtin
> >                          |          |          main
> >                          |          |          0xfd2e704
> >                          |          |          0xfd2e8c0
> >                          |          |          (nil)
> > 
> > ... etc.
> > 
> > Signed-off-by: Paul Mackerras <paulus@samba.org>
> 
> 
> Cool!

By the way, I notice that gcc tends to inline the tracing functions,
which means that by going up 2 stack frames we miss some of the
functions.  For example, for the lock:lock_acquire event, we have
_raw_spin_lock() -> lock_acquire() -> trace_lock_acquire() ->
perf_trace_lock_acquire() -> perf_trace_templ_lock_acquire() ->
perf_fetch_caller_regs() -> perf_arch_fetch_caller_regs().

But in the ppc64 kernel binary I just built, gcc inlined
trace_lock_acquire in lock_acquire, and perf_trace_templ_lock_acquire
in perf_trace_lock_acquire.  Given that perf_fetch_caller_regs is
explicitly inlined, going up two levels from perf_fetch_caller_regs
gets us to _raw_spin_lock, whereas I think you intended it to get us
to trace_lock_acquire.  I'm not sure what to do about that - any
thoughts?

Paul.
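
The effect described above can be reproduced with a small host-side
model. This is only a sketch: call_sketch and frame_after_skip_sketch
are hypothetical names, and the model assumes that a function marked as
inlined simply contributes no stack frame of its own. The frameless
perf_arch_fetch_caller_regs itself is left out of the chain.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One entry in a call chain, listed outermost caller first. */
struct call_sketch {
	const char *name;
	int inlined;	/* nonzero: merged into its caller, no own frame */
};

/*
 * Walk from the innermost function outwards, counting only functions
 * that own a stack frame, and return the name of the frame reached
 * after skipping `skip` frames. Returns NULL if the chain is too short.
 */
const char *frame_after_skip_sketch(const struct call_sketch *chain,
				    size_t len, int skip)
{
	size_t i;

	for (i = len; i-- > 0; ) {
		if (chain[i].inlined)
			continue;	/* no frame to count here */
		if (skip-- == 0)
			return chain[i].name;
	}
	return NULL;
}
```

Fed the lock:lock_acquire chain with only perf_fetch_caller_regs marked
inlined, a skip of 2 lands on trace_lock_acquire. Marking
trace_lock_acquire and perf_trace_templ_lock_acquire as inlined too, as
in the ppc64 build described above, makes the same skip land on
_raw_spin_lock.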


* Re: [PATCH] powerpc/perf_events: Implement perf_arch_fetch_caller_regs for powerpc
  2010-03-16  3:22   ` Paul Mackerras
@ 2010-03-16 20:56     ` Frederic Weisbecker
  0 siblings, 0 replies; 5+ messages in thread
From: Frederic Weisbecker @ 2010-03-16 20:56 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Peter Zijlstra, linux-kernel, linuxppc-dev, anton, Ingo Molnar

On Tue, Mar 16, 2010 at 02:22:13PM +1100, Paul Mackerras wrote:
> On Mon, Mar 15, 2010 at 10:04:54PM +0100, Frederic Weisbecker wrote:
> > On Mon, Mar 15, 2010 at 04:46:15PM +1100, Paul Mackerras wrote:
> 
> > >     14.99%            perf  [kernel.kallsyms]  [k] ._raw_spin_lock
> > >                       |
> > >                       --- ._raw_spin_lock
> > >                          |          
> > >                          |--25.00%-- .alloc_fd
> > >                          |          (nil)
> > >                          |          |          
> > >                          |          |--50.00%-- .anon_inode_getfd
> > >                          |          |          .sys_perf_event_open
> > >                          |          |          syscall_exit
> > >                          |          |          syscall
> > >                          |          |          create_counter
> > >                          |          |          __cmd_record
> > >                          |          |          run_builtin
> > >                          |          |          main
> > >                          |          |          0xfd2e704
> > >                          |          |          0xfd2e8c0
> > >                          |          |          (nil)
> > > 
> > > ... etc.
> > > 
> > > Signed-off-by: Paul Mackerras <paulus@samba.org>
> > 
> > 
> > Cool!
> 
> By the way, I notice that gcc tends to inline the tracing functions,
> which means that by going up 2 stack frames we miss some of the
> functions.  For example, for the lock:lock_acquire event, we have
> _raw_spin_lock() -> lock_acquire() -> trace_lock_acquire() ->
> perf_trace_lock_acquire() -> perf_trace_templ_lock_acquire() ->
> perf_fetch_caller_regs() -> perf_arch_fetch_caller_regs().
> 
> But in the ppc64 kernel binary I just built, gcc inlined
> trace_lock_acquire in lock_acquire, and perf_trace_templ_lock_acquire
> in perf_trace_lock_acquire.  Given that perf_fetch_caller_regs is
> explicitly inlined, going up two levels from perf_fetch_caller_regs
> gets us to _raw_spin_lock, whereas I think you intended it to get us
> to trace_lock_acquire.  I'm not sure what to do about that - any
> thoughts?



Yeah, I've seen this as well. The main problem is that
perf_trace_templ_lock_acquire may or may not be inlined.

It is used for trace events that use the TRACE_EVENT_CLASS
thing. We define a pattern of event structure that is shared
among several events.

For example, event A and event B share perf_trace_templ_foo.
Each has its own perf_trace_blah, but both of those call the same
perf_trace_templ_foo(), so in this case it won't be inlined.

Events that don't share a pattern will have their
perf_trace_templ inlined, because there will be an exclusive 1:1
relationship between both.

The rewind of 2 is well suited for events sharing a pattern, ip
will match the right event source, and not one of its callers.

Unfortunately, the others are less lucky.
I haven't worried much about this so far because it has had no bad
effect on lock events. Quite the opposite, actually: it's not very
interesting to have lock_acquire as the event source unless you have
a callchain.

If you have no callchain, you'll see a lot of entries like this in
perf report:

sym1	lock_acquire
sym2	lock_acquire
sym3	lock_acquire

What you want here is the function that called lock_acquire.

But if you have a callchain it's fine, because you have the nature
of the event (lock_acquire) and the origin as well.

That said, lock events are an exception where the mistake
has a lucky result. Other inlined events are harmed as we lose
their most important caller. So I'm going to fix that.

I can just fetch the regs from perf_trace_foo() and pass them
to perf_trace_templ_foo(), and that should do it.

Thanks.
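
The shape of that fix might look roughly like the following. This is a
host-side sketch under stated assumptions, not the actual kernel change:
regs_sketch, last_sample_sketch and the *_sketch functions are
hypothetical, and recording __func__ stands in for capturing real
registers. The point is only that the per-event wrapper captures the
context and hands it to the shared template, so the capture no longer
depends on whether the template is inlined.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for captured registers. */
struct regs_sketch {
	const char *captured_in;	/* function that did the capture */
};

/* Last sample seen by the shared template, kept for inspection. */
struct regs_sketch last_sample_sketch;

/* Shared template: uses the regs it is given, captures nothing itself. */
void perf_trace_templ_foo_sketch(const struct regs_sketch *regs)
{
	last_sample_sketch = *regs;
}

/* Per-event wrapper for event A: capture here, then pass down. */
void perf_trace_a_sketch(void)
{
	struct regs_sketch regs = { .captured_in = __func__ };

	perf_trace_templ_foo_sketch(&regs);
}

/* Per-event wrapper for event B: same template, its own capture. */
void perf_trace_b_sketch(void)
{
	struct regs_sketch regs = { .captured_in = __func__ };

	perf_trace_templ_foo_sketch(&regs);
}
```

Both wrappers share one template, yet each sample records the wrapper
that captured it, whether or not the compiler inlines the template.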

