* PowerPC ftrace function trace optimisation
@ 2010-04-29 0:51 Anton Blanchard
2010-04-29 1:02 ` Benjamin Herrenschmidt
2010-04-29 1:55 ` Steven Rostedt
0 siblings, 2 replies; 6+ messages in thread
From: Anton Blanchard @ 2010-04-29 0:51 UTC (permalink / raw)
To: linuxppc-dev; +Cc: paulus, imunsie, rostedt, amodra
Hi,
Alan Modra pointed out that he added an option to PowerPC gcc years ago
specifically for us to do lightweight mcount profiling.
The normal PowerPC gcc mcount stuff forces a stack spill and gets itself
tangled up in the function prolog, making it impossible to nop out easily:
# gcc -pg:
0000000000000000 <.foo>:
0: 7c 08 02 a6 mflr r0 <--- shared stack spill code
4: f8 01 00 10 std r0,16(r1) <--|
8: f8 21 ff 91 stdu r1,-112(r1) <--+
c: 48 00 00 01 bl c <.foo+0xc> <--- call to mcount
10: 60 00 00 00 nop
14: e9 22 00 00 ld r9,0(r2)
18: e8 69 00 02 lwa r3,0(r9)
1c: 38 21 00 70 addi r1,r1,112
20: e8 01 00 10 ld r0,16(r1)
24: 7c 08 03 a6 mtlr r0
28: 4e 80 00 20 blr
The option Alan added reduces the footprint to 3 instructions which can
be nopped out completely. The rest of the function does not rely on the first
three instructions. No stack spill is forced either:
# gcc -pg -mprofile-kernel
0000000000000000 <.foo>:
0: 7c 08 02 a6 mflr r0
4: f8 01 00 10 std r0,16(r1)
8: 48 00 00 01 bl 8 <.foo+0x8> <--- call to mcount
c: 7c 08 02 a6 mflr r0
10: f8 01 00 10 std r0,16(r1)
14: f8 21 ff d1 stdu r1,-48(r1)
18: e9 22 00 00 ld r9,0(r2)
1c: e8 69 00 02 lwa r3,0(r9)
20: 38 21 00 30 addi r1,r1,48
24: e8 01 00 10 ld r0,16(r1)
28: 7c 08 03 a6 mtlr r0
2c: 4e 80 00 20 blr
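To make the "nop out completely" point concrete, here is a minimal sketch that patches the first three instruction words of the -mprofile-kernel listing above with nops. The instruction bytes are copied from the listing; the helper name nop_out_mcount is made up for illustration, and the sketch operates on a byte string only. It says nothing about how a running kernel would have to patch live text safely.

```python
import struct

# PowerPC "nop" encoding, as seen at offset 0x10 in the plain -pg listing.
NOP = 0x60000000

# First five instruction words of .foo built with -pg -mprofile-kernel,
# copied from the listing (big-endian, unrelocated bl).
FOO = bytes.fromhex(
    "7c0802a6"  # mflr r0
    "f8010010"  # std  r0,16(r1)
    "48000001"  # bl   <mcount>
    "7c0802a6"  # mflr r0          (start of the real prolog)
    "f8010010"  # std  r0,16(r1)
)


def nop_out_mcount(code: bytes, n_insns: int = 3) -> bytes:
    """Return a copy of `code` with its first n_insns words replaced by nops.

    Hypothetical helper: assumes `code` starts at the function entry and
    that the mcount sequence is exactly n_insns instructions long.
    """
    patched = struct.pack(">I", NOP) * n_insns
    return patched + code[4 * n_insns:]


patched = nop_out_mcount(FOO)
words = struct.unpack(">5I", patched)
print([f"{w:08x}" for w in words])
```

Because the real prolog starting at offset 0xc reloads LR into r0 itself, the patched function does not depend on anything the three replaced instructions did.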
This means we could support ftrace function trace with very little overhead.
In fact if we are careful when switching to the new mcount ABI and don't
rely on the store of r0, we could probably optimise this even further in a
future gcc and remove the store completely. mcount would be 2 instructions:
mflr r0
bl 8 <.foo+0x8>
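As a rough illustration of the two variants, the sketch below classifies the start of a function as the current 3-instruction sequence, the proposed 2-instruction sequence, or neither. The encodings for mflr r0 and std r0,16(r1) are taken from the listings above; the bl test checks the primary opcode (18) with AA=0 and LK=1. The function name mcount_prologue_len is hypothetical, and the 2-instruction form is only the proposal discussed here, not something an existing gcc is known to emit.

```python
MFLR_R0 = 0x7C0802A6        # mflr r0
STD_R0_16_R1 = 0xF8010010   # std r0,16(r1)


def is_bl(word: int) -> bool:
    """True for a relative branch-and-link: opcode 18, AA=0, LK=1."""
    return (word & 0xFC000003) == 0x48000001


def mcount_prologue_len(words) -> int:
    """Return 3, 2 or 0 depending on which mcount sequence starts `words`.

    Hypothetical helper; it does not check that the bl really targets mcount.
    """
    if (len(words) >= 3 and words[0] == MFLR_R0
            and words[1] == STD_R0_16_R1 and is_bl(words[2])):
        return 3
    if len(words) >= 2 and words[0] == MFLR_R0 and is_bl(words[1]):
        return 2
    return 0


current = [0x7C0802A6, 0xF8010010, 0x48000001, 0x7C0802A6]
proposed = [0x7C0802A6, 0x48000001]
plain_pg = [0x7C0802A6, 0xF8010010, 0xF821FF91, 0x48000001]

print(mcount_prologue_len(current),
      mcount_prologue_len(proposed),
      mcount_prologue_len(plain_pg))
```

The plain -pg case returns 0 because the stdu sits between the LR save and the bl, which is exactly the tangling with the prolog described earlier.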
Anton
* Re: PowerPC ftrace function trace optimisation
2010-04-29 0:51 PowerPC ftrace function trace optimisation Anton Blanchard
@ 2010-04-29 1:02 ` Benjamin Herrenschmidt
2010-04-29 1:08 ` Benjamin Herrenschmidt
2010-04-29 1:22 ` Alan Modra
2010-04-29 1:55 ` Steven Rostedt
1 sibling, 2 replies; 6+ messages in thread
From: Benjamin Herrenschmidt @ 2010-04-29 1:02 UTC (permalink / raw)
To: Anton Blanchard; +Cc: linuxppc-dev, paulus, imunsie, rostedt, amodra
> The option Alan added reduces the footprint to 3 instructions which can
> be noped out completely. The rest of the function does not rely on the first
> three instructions. No stack spill is forced either:
>
> # gcc -pg -mprofile-kernel
From a quick test it appears that this only works with -m64, not -m32.
Alan, is that correct? Any chance you can fix that in future gcc
versions?
Also, should we implement support for both types of mcount, or only
allow enabling ftrace with gcc versions that support this?
Cheers,
Ben.
> 0000000000000000 <.foo>:
> 0: 7c 08 02 a6 mflr r0
> 4: f8 01 00 10 std r0,16(r1)
> 8: 48 00 00 01 bl 8 <.foo+0x8> <--- call to mcount
>
> c: 7c 08 02 a6 mflr r0
> 10: f8 01 00 10 std r0,16(r1)
> 14: f8 21 ff d1 stdu r1,-48(r1)
> 18: e9 22 00 00 ld r9,0(r2)
> 1c: e8 69 00 02 lwa r3,0(r9)
> 20: 38 21 00 30 addi r1,r1,48
> 24: e8 01 00 10 ld r0,16(r1)
> 28: 7c 08 03 a6 mtlr r0
> 2c: 4e 80 00 20 blr
>
>
> This mean we could support ftrace function trace with very little overhead.
>
> In fact if we are careful when switching to the new mcount ABI and don't
> rely on the store of r0, we could probably optimise this even further in a
> future gcc and remove the store completely. mcount would be 2 instructions:
>
> mflr r0
> bl 8 <.foo+0x8>
>
> Anton
* Re: PowerPC ftrace function trace optimisation
2010-04-29 1:02 ` Benjamin Herrenschmidt
@ 2010-04-29 1:08 ` Benjamin Herrenschmidt
2010-04-29 1:22 ` Alan Modra
1 sibling, 0 replies; 6+ messages in thread
From: Benjamin Herrenschmidt @ 2010-04-29 1:08 UTC (permalink / raw)
To: Anton Blanchard; +Cc: linuxppc-dev, paulus, imunsie, rostedt, amodra
On Thu, 2010-04-29 at 11:02 +1000, Benjamin Herrenschmidt wrote:
> > The option Alan added reduces the footprint to 3 instructions which can
> > be noped out completely. The rest of the function does not rely on the first
> > three instructions. No stack spill is forced either:
> >
> > # gcc -pg -mprofile-kernel
>
> From a quick test it appears that this only works with -m64, not -m32.
> Alan is that correct ? Any chance you can fix that in future gcc
> versions ?
>
> Also should we implement support for both type of mcounts or just only
> allow enabling of ftrace with gcc's that support this ?
Also, Anton noticed:
> Cheers,
> Ben.
>
> > 0000000000000000 <.foo>:
> > 0: 7c 08 02 a6 mflr r0
> > 4: f8 01 00 10 std r0,16(r1)
The std is not useful here. We can do it inside mcount.
> > 8: 48 00 00 01 bl 8 <.foo+0x8> <--- call to mcount
And I noticed:
> > c: 7c 08 02 a6 mflr r0
I'm happy to guarantee that mcount does the above.
> > 10: f8 01 00 10 std r0,16(r1)
And maybe that one too.
However, I understand if it's easier not to change the prolog codegen
(the 2 insns above) and just stick to adding a 2- or 3-instruction
boilerplate at the top.
Cheers,
Ben.
> > 14: f8 21 ff d1 stdu r1,-48(r1)
> > 18: e9 22 00 00 ld r9,0(r2)
> > 1c: e8 69 00 02 lwa r3,0(r9)
> > 20: 38 21 00 30 addi r1,r1,48
> > 24: e8 01 00 10 ld r0,16(r1)
> > 28: 7c 08 03 a6 mtlr r0
> > 2c: 4e 80 00 20 blr
> >
> >
> > This mean we could support ftrace function trace with very little overhead.
> >
> > In fact if we are careful when switching to the new mcount ABI and don't
> > rely on the store of r0, we could probably optimise this even further in a
> > future gcc and remove the store completely. mcount would be 2 instructions:
> >
> > mflr r0
> > bl 8 <.foo+0x8>
> >
> > Anton
>
* Re: PowerPC ftrace function trace optimisation
2010-04-29 1:02 ` Benjamin Herrenschmidt
2010-04-29 1:08 ` Benjamin Herrenschmidt
@ 2010-04-29 1:22 ` Alan Modra
1 sibling, 0 replies; 6+ messages in thread
From: Alan Modra @ 2010-04-29 1:22 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: linuxppc-dev, paulus, Anton Blanchard, rostedt, imunsie
On Thu, Apr 29, 2010 at 11:02:47AM +1000, Benjamin Herrenschmidt wrote:
> From a quick test it appears that this only works with -m64, not -m32.
> Alan is that correct ?
Yes.
> Any chance you can fix that in future gcc versions ?
No need really. 32-bit _mcount calls happen before the prologue
anyway.
> > In fact if we are careful when switching to the new mcount ABI and don't
> > rely on the store of r0, we could probably optimise this even further in a
> > future gcc and remove the store completely. mcount would be 2 instructions:
> >
> > mflr r0
> > bl 8 <.foo+0x8>
Yeah. Also, I should have used a different name for this mcount from
the standard 64-bit mcount.
--
Alan Modra
Australia Development Lab, IBM
* Re: PowerPC ftrace function trace optimisation
2010-04-29 0:51 PowerPC ftrace function trace optimisation Anton Blanchard
2010-04-29 1:02 ` Benjamin Herrenschmidt
@ 2010-04-29 1:55 ` Steven Rostedt
2010-04-29 2:10 ` Benjamin Herrenschmidt
1 sibling, 1 reply; 6+ messages in thread
From: Steven Rostedt @ 2010-04-29 1:55 UTC (permalink / raw)
To: Anton Blanchard; +Cc: linuxppc-dev, imunsie, paulus, amodra
On Thu, 2010-04-29 at 10:51 +1000, Anton Blanchard wrote:
> Hi,
> # gcc -pg -mprofile-kernel
>
> 0000000000000000 <.foo>:
> 0: 7c 08 02 a6 mflr r0
> 4: f8 01 00 10 std r0,16(r1)
> 8: 48 00 00 01 bl 8 <.foo+0x8> <--- call to mcount
>
> c: 7c 08 02 a6 mflr r0
Why the extra mflr? Can't we just make it a requirement that mcount
returns with r0 back to what it was?
-- Steve
> 10: f8 01 00 10 std r0,16(r1)
> 14: f8 21 ff d1 stdu r1,-48(r1)
> 18: e9 22 00 00 ld r9,0(r2)
> 1c: e8 69 00 02 lwa r3,0(r9)
> 20: 38 21 00 30 addi r1,r1,48
> 24: e8 01 00 10 ld r0,16(r1)
> 28: 7c 08 03 a6 mtlr r0
> 2c: 4e 80 00 20 blr
>
>
> This mean we could support ftrace function trace with very little overhead.
>
> In fact if we are careful when switching to the new mcount ABI and don't
> rely on the store of r0, we could probably optimise this even further in a
> future gcc and remove the store completely. mcount would be 2 instructions:
>
> mflr r0
> bl 8 <.foo+0x8>
>
> Anton
* Re: PowerPC ftrace function trace optimisation
2010-04-29 1:55 ` Steven Rostedt
@ 2010-04-29 2:10 ` Benjamin Herrenschmidt
0 siblings, 0 replies; 6+ messages in thread
From: Benjamin Herrenschmidt @ 2010-04-29 2:10 UTC (permalink / raw)
To: rostedt; +Cc: linuxppc-dev, paulus, Anton Blanchard, imunsie, amodra
On Wed, 2010-04-28 at 21:55 -0400, Steven Rostedt wrote:
> On Thu, 2010-04-29 at 10:51 +1000, Anton Blanchard wrote:
> > Hi,
>
> > # gcc -pg -mprofile-kernel
> >
> > 0000000000000000 <.foo>:
> > 0: 7c 08 02 a6 mflr r0
> > 4: f8 01 00 10 std r0,16(r1)
> > 8: 48 00 00 01 bl 8 <.foo+0x8> <--- call to mcount
> >
> > c: 7c 08 02 a6 mflr r0
>
> Why the extra mflr? Can't we just make it a requirement that mcount
> returns with r0 back to what it was?
Well, we can't just change that now, it's been in for long enough.
We might be able to get a new option later on that makes it more
efficient tho (for example removing the std), but let's see what we can
do with what we have.
The extra mflr makes sense if you consider that the option just
prepends a pre-canned set of instructions and doesn't actually change
anything in the prolog generation. It might be possible to do a hack to
make the prolog aware that LR is already in r0, but let's look at that
after we've verified we can get the existing stuff working :-)
Another idea Alan had is that if we had a list of call sites, then
instead of NOP'ing we could change the branches at all call sites to
skip the 3-instruction mcount prolog :-)
Now, we do store the relocs with the kernel image when using
CONFIG_RELOCATABLE, though we might want to 'sort' them a bit to easily
find callers from call sites, but it's something to also consider.
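A minimal sketch of what retargeting a call site might look like, assuming the call is a direct relative bl (I-form: opcode 18, a 24-bit word displacement, AA=0, LK=1) and that the callee starts with the 3-instruction sequence. The helper names and the example addresses are invented for illustration; calls through function pointers or linker stubs are not covered, and nothing here deals with locating call sites from the stored relocs.

```python
MCOUNT_PROLOGUE_BYTES = 3 * 4  # mflr r0; std r0,16(r1); bl mcount


def decode_bl_target(word: int, pc: int) -> int:
    """Return the absolute target of a relative `bl` located at `pc`."""
    assert (word & 0xFC000003) == 0x48000001, "not a relative bl"
    disp = word & 0x03FFFFFC
    if disp & 0x02000000:        # sign-extend the 26-bit displacement
        disp -= 0x04000000
    return pc + disp


def encode_bl(pc: int, target: int) -> int:
    """Encode a relative `bl` at `pc` branching to `target`."""
    disp = target - pc
    assert disp % 4 == 0, "target must be word aligned"
    assert -0x02000000 <= disp < 0x02000000, "target out of bl range"
    return 0x48000001 | (disp & 0x03FFFFFC)


def skip_mcount(word: int, pc: int) -> int:
    """Rewrite a call so it lands just past the callee's mcount sequence."""
    target = decode_bl_target(word, pc)
    return encode_bl(pc, target + MCOUNT_PROLOGUE_BYTES)


# Made-up addresses: a call at 0x1000 to a function at 0x2000.
call = encode_bl(0x1000, 0x2000)
patched_call = skip_mcount(call, 0x1000)
print(f"{call:08x} -> {patched_call:08x}")
```

Compared with nopping, this leaves the callee's text untouched and moves the cost to finding and patching every caller, which is where a sorted list of relocs would come in.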
Cheers,
Ben.
> -- Steve
>
> > 10: f8 01 00 10 std r0,16(r1)
> > 14: f8 21 ff d1 stdu r1,-48(r1)
> > 18: e9 22 00 00 ld r9,0(r2)
> > 1c: e8 69 00 02 lwa r3,0(r9)
> > 20: 38 21 00 30 addi r1,r1,48
> > 24: e8 01 00 10 ld r0,16(r1)
> > 28: 7c 08 03 a6 mtlr r0
> > 2c: 4e 80 00 20 blr
> >
> >
> > This mean we could support ftrace function trace with very little overhead.
> >
> > In fact if we are careful when switching to the new mcount ABI and don't
> > rely on the store of r0, we could probably optimise this even further in a
> > future gcc and remove the store completely. mcount would be 2 instructions:
> >
> > mflr r0
> > bl 8 <.foo+0x8>
> >
> > Anton
>