PowerPC ftrace function trace optimisation

linuxppc-dev.lists.ozlabs.org archive mirror

* PowerPC ftrace function trace optimisation
@ 2010-04-29  0:51 Anton Blanchard
  2010-04-29  1:02 ` Benjamin Herrenschmidt
  2010-04-29  1:55 ` Steven Rostedt
  0 siblings, 2 replies; 6+ messages in thread
From: Anton Blanchard @ 2010-04-29  0:51 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: paulus, imunsie, rostedt, amodra

Hi,

Alan Modra pointed out that he added an option to PowerPC gcc years ago
specifically for us to do lightweight mcount profiling.

The normal PowerPC gcc mcount stuff forces a stack spill and gets itself
tangled up in the function prolog, making it impossible to nop out easily:

# gcc -pg:

0000000000000000 <.foo>:
   0:   7c 08 02 a6     mflr    r0		<--- shared stack spill code
   4:   f8 01 00 10     std     r0,16(r1)	<--|
   8:   f8 21 ff 91     stdu    r1,-112(r1)	<--+
   c:   48 00 00 01     bl      c <.foo+0xc>	<--- call to mcount
  10:   60 00 00 00     nop

  14:   e9 22 00 00     ld      r9,0(r2)
  18:   e8 69 00 02     lwa     r3,0(r9)
  1c:   38 21 00 70     addi    r1,r1,112
  20:   e8 01 00 10     ld      r0,16(r1)
  24:   7c 08 03 a6     mtlr    r0
  28:   4e 80 00 20     blr

The option Alan added reduces the footprint to 3 instructions which can
be nopped out completely. The rest of the function does not rely on the
first three instructions. No stack spill is forced either:

# gcc -pg -mprofile-kernel

0000000000000000 <.foo>:
   0:   7c 08 02 a6     mflr    r0
   4:   f8 01 00 10     std     r0,16(r1)
   8:   48 00 00 01     bl      8 <.foo+0x8>	<--- call to mcount

   c:   7c 08 02 a6     mflr    r0
  10:   f8 01 00 10     std     r0,16(r1)
  14:   f8 21 ff d1     stdu    r1,-48(r1)
  18:   e9 22 00 00     ld      r9,0(r2)
  1c:   e8 69 00 02     lwa     r3,0(r9)
  20:   38 21 00 30     addi    r1,r1,48
  24:   e8 01 00 10     ld      r0,16(r1)
  28:   7c 08 03 a6     mtlr    r0
  2c:   4e 80 00 20     blr

This means we could support ftrace function trace with very little overhead.
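
As a rough illustration of "nopping out" the three-instruction sequence, the sketch below works on a list of 32-bit instruction words taken from the -mprofile-kernel listing above. The helper names (is_bl, nop_out_mcount) are hypothetical and this is not how the kernel's ftrace code is written; it only shows that the sequence can be recognised and replaced without touching the rest of the function.

```python
# Illustrative sketch only: helper names are assumptions, not kernel code.
NOP = 0x60000000           # the PowerPC nop encoding (60 00 00 00 in the listing)
MFLR_R0 = 0x7C0802A6       # mflr r0
STD_R0_16_R1 = 0xF8010010  # std r0,16(r1)


def is_bl(insn):
    """True for an I-form branch (primary opcode 18) with AA=0 and LK=1."""
    return (insn >> 26) == 18 and (insn & 0x3) == 0x1


def nop_out_mcount(words):
    """Return a copy of `words` with a leading mflr/std/bl sequence nopped out.

    If the first three instructions do not match the -mprofile-kernel
    pattern, the copy is returned unchanged.
    """
    out = list(words)
    if (len(out) >= 3 and out[0] == MFLR_R0
            and out[1] == STD_R0_16_R1 and is_bl(out[2])):
        out[0:3] = [NOP, NOP, NOP]
    return out


# First six instruction words of .foo from the -mprofile-kernel listing.
foo = [0x7C0802A6, 0xF8010010, 0x48000001,
       0x7C0802A6, 0xF8010010, 0xF821FFD1]
patched = nop_out_mcount(foo)

# First four instruction words of .foo from the plain -pg listing; the
# third word is the stdu, so the pattern does not match.
plain_pg = [0x7C0802A6, 0xF8010010, 0xF821FF91, 0x48000001]
unpatched = nop_out_mcount(plain_pg)
```

With plain -pg the same match fails because the bl sits after the shared stack spill code, which is the entanglement described earlier.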

In fact if we are careful when switching to the new mcount ABI and don't
rely on the store of r0, we could probably optimise this even further in a
future gcc and remove the store completely. mcount would be 2 instructions:

   mflr    r0              
   bl      8 <.foo+0x8>

Anton


* Re: PowerPC ftrace function trace optimisation
  2010-04-29  0:51 PowerPC ftrace function trace optimisation Anton Blanchard
@ 2010-04-29  1:02 ` Benjamin Herrenschmidt
  2010-04-29  1:08   ` Benjamin Herrenschmidt
  2010-04-29  1:22   ` Alan Modra
  2010-04-29  1:55 ` Steven Rostedt
  1 sibling, 2 replies; 6+ messages in thread
From: Benjamin Herrenschmidt @ 2010-04-29  1:02 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: linuxppc-dev, paulus, imunsie, rostedt, amodra


> The option Alan added reduces the footprint to 3 instructions which can
> be noped out completely. The rest of the function does not rely on the first
> three instructions. No stack spill is forced either:
> 
> # gcc -pg -mprofile-kernel

From a quick test it appears that this only works with -m64, not -m32.
Alan, is that correct? Any chance you can fix that in future gcc
versions?

Also, should we implement support for both types of mcount, or only
allow enabling ftrace with gccs that support this?

Cheers,
Ben.


* Re: PowerPC ftrace function trace optimisation
  2010-04-29  1:02 ` Benjamin Herrenschmidt
@ 2010-04-29  1:08   ` Benjamin Herrenschmidt
  2010-04-29  1:22   ` Alan Modra
  1 sibling, 0 replies; 6+ messages in thread
From: Benjamin Herrenschmidt @ 2010-04-29  1:08 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: linuxppc-dev, paulus, imunsie, rostedt, amodra

On Thu, 2010-04-29 at 11:02 +1000, Benjamin Herrenschmidt wrote:
> > The option Alan added reduces the footprint to 3 instructions which can
> > be noped out completely. The rest of the function does not rely on the first
> > three instructions. No stack spill is forced either:
> > 
> > # gcc -pg -mprofile-kernel
> 
> From a quick test it appears that this only works with -m64, not -m32.
> Alan is that correct ? Any chance you can fix that in future gcc
> versions ?
> 
> Also should we implement support for both type of mcounts or just only
> allow enabling of ftrace with gcc's that support this ?

Also, Anton noticed:

> Cheers,
> Ben.
> 
> > 0000000000000000 <.foo>:
> >    0:   7c 08 02 a6     mflr    r0
> >    4:   f8 01 00 10     std     r0,16(r1)

The std is not useful here. We can do it inside mcount.

> >    8:   48 00 00 01     bl      8 <.foo+0x8>	<--- call to mcount

And I noticed:

 
> >    c:   7c 08 02 a6     mflr    r0

I'm happy to guarantee that mcount does the above.

> >   10:   f8 01 00 10     std     r0,16(r1)

And maybe that one too.

However, I understand if it's easier not to change the prolog codegen
(the 2 insns above) and just stick to adding a 2- or 3-instruction
boilerplate at the top.

Cheers,
Ben.


* Re: PowerPC ftrace function trace optimisation
  2010-04-29  1:02 ` Benjamin Herrenschmidt
  2010-04-29  1:08   ` Benjamin Herrenschmidt
@ 2010-04-29  1:22   ` Alan Modra
  1 sibling, 0 replies; 6+ messages in thread
From: Alan Modra @ 2010-04-29  1:22 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linuxppc-dev, paulus, Anton Blanchard, rostedt, imunsie

On Thu, Apr 29, 2010 at 11:02:47AM +1000, Benjamin Herrenschmidt wrote:
> From a quick test it appears that this only works with -m64, not -m32.
> Alan is that correct ?

Yes.

> Any chance you can fix that in future gcc versions ?

No need really.  32-bit _mcount calls happen before the prologue
anyway.

> > In fact if we are careful when switching to the new mcount ABI and don't
> > rely on the store of r0, we could probably optimise this even further in a
> > future gcc and remove the store completely. mcount would be 2 instructions:
> > 
> >    mflr    r0              
> >    bl      8 <.foo+0x8>

Yeah.  Also, I should have used a different name for this mcount from
the standard 64-bit mcount.

-- 
Alan Modra
Australia Development Lab, IBM


* Re: PowerPC ftrace function trace optimisation
  2010-04-29  0:51 PowerPC ftrace function trace optimisation Anton Blanchard
  2010-04-29  1:02 ` Benjamin Herrenschmidt
@ 2010-04-29  1:55 ` Steven Rostedt
  2010-04-29  2:10   ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 6+ messages in thread
From: Steven Rostedt @ 2010-04-29  1:55 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: linuxppc-dev, imunsie, paulus, amodra

On Thu, 2010-04-29 at 10:51 +1000, Anton Blanchard wrote:
> Hi,

> # gcc -pg -mprofile-kernel
> 
> 0000000000000000 <.foo>:
>    0:   7c 08 02 a6     mflr    r0
>    4:   f8 01 00 10     std     r0,16(r1)
>    8:   48 00 00 01     bl      8 <.foo+0x8>	<--- call to mcount
> 
>    c:   7c 08 02 a6     mflr    r0

Why the extra mflr? Can't we just make it a requirement that mcount
returns with r0 back to what it was?

-- Steve


* Re: PowerPC ftrace function trace optimisation
  2010-04-29  1:55 ` Steven Rostedt
@ 2010-04-29  2:10   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 6+ messages in thread
From: Benjamin Herrenschmidt @ 2010-04-29  2:10 UTC (permalink / raw)
  To: rostedt; +Cc: linuxppc-dev, paulus, Anton Blanchard, imunsie, amodra

On Wed, 2010-04-28 at 21:55 -0400, Steven Rostedt wrote:
> On Thu, 2010-04-29 at 10:51 +1000, Anton Blanchard wrote:
> > Hi,
> 
> > # gcc -pg -mprofile-kernel
> > 
> > 0000000000000000 <.foo>:
> >    0:   7c 08 02 a6     mflr    r0
> >    4:   f8 01 00 10     std     r0,16(r1)
> >    8:   48 00 00 01     bl      8 <.foo+0x8>	<--- call to mcount
> > 
> >    c:   7c 08 02 a6     mflr    r0
> 
> Why the extra mflr? Can't we just make it a requirement that mcount
> returns with r0 back to what it was?

Well, we can't just change that now; it's been in for long enough.

We might be able to get a new option later on that makes it more
efficient though (for example removing the std), but let's see what we
can do with what we have.

The extra mflr makes sense if you consider that the option just
prepends a pre-canned set of instructions and doesn't actually touch
the prolog generation. It might be possible to do a hack to make the
prolog aware that LR is already in r0, but let's look at that after
we've verified we can get the existing stuff working :-)

Another idea Alan had is that if we had a list of call sites, then
instead of NOP'ing we could change the branches at all call sites to
skip the 3-instruction mcount prolog :-)

Now, we do store the relocs with the kernel image when using
CONFIG_RELOCATABLE, though we might want to 'sort' them a bit to easily
find callers from call sites, but it's something to also consider.
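
To make the call-site idea a bit more concrete, here is a minimal sketch of retargeting a single relative branch so that it lands 12 bytes (three instructions) past the function entry. The function names and the sample branch words are made up for illustration; finding the call sites from the relocs is not shown, and nothing here is existing kernel code.

```python
# Illustrative sketch only: names and sample values are assumptions.
MCOUNT_SEQ_BYTES = 12  # three 4-byte instructions: mflr, std, bl


def decode_branch_offset(insn):
    """Return the signed byte offset of an I-form relative branch (b or bl)."""
    if (insn >> 26) != 18 or (insn & 0x2):
        raise ValueError("not a relative I-form branch")
    off = insn & 0x03FFFFFC          # LI field, already shifted left by 2
    if off & 0x02000000:             # sign bit of the 26-bit displacement
        off -= 0x04000000
    return off


def retarget_branch(insn, delta):
    """Return `insn` with its displacement moved by `delta` bytes."""
    new_off = decode_branch_offset(insn) + delta
    if new_off & 0x3 or not -0x02000000 <= new_off <= 0x01FFFFFC:
        raise ValueError("displacement misaligned or out of range")
    # Keep the opcode, AA and LK bits; replace only the LI field.
    return (insn & 0xFC000003) | (new_off & 0x03FFFFFC)


# A made-up forward call: bl .+0x100, then the same call skipping the prolog.
forward = retarget_branch(0x48000101, MCOUNT_SEQ_BYTES)

# A made-up backward call: bl .-0x20.
backward = retarget_branch(0x4BFFFFE1, MCOUNT_SEQ_BYTES)
```

Only direct bl call sites can be handled this way; calls through function pointers would still enter at the start of the function and run the three instructions.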

Cheers,
Ben.
