public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/5] ftrace: to kill a daemon
@ 2008-08-07 18:20 Steven Rostedt
  2008-08-07 18:47 ` Mathieu Desnoyers
                   ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Steven Rostedt @ 2008-08-07 18:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	Linus Torvalds, David Miller, Mathieu Desnoyers, Roland McGrath,
	Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams


One of the things that bothered me about the latest ftrace code was
this annoying daemon that would wake up once a second to see if it
had work to do. If it did not, it would go to sleep, otherwise it would
do its work and then go to sleep.

You see, the reason for this is that for ftrace to maintain performance
when configured in but disabled, it would need to change all the
locations that called "mcount" (enabled with the gcc -pg option) into
nops.  The "-pg" option in gcc sets up a function profiler that calls
this function named "mcount". If you simply have "mcount" return, it will
still add a 15 to 18% performance overhead. Changing all the calls to
nops moved the overhead into the noise.

To get rid of this, I had the mcount code record the location that called
it. Later, the "ftraced" daemon would wake up and look to see if
any new functions were recorded. If so, it would call kstop_machine
and convert the calls to "nops". We needed kstop_machine because bad
things happen on SMP if you modify code that happens to be in the
instruction cache of another CPU.

This "ftraced" kernel thread would be a happy little worker, but it caused
some pains. One is that it woke up once a second, and Ted Ts'o got mad
at me because it would show up in PowerTop. I could easily make the
default 5 seconds, and even have it runtime configurable, with a trivial
patch. I have not gotten around to doing that yet.

The other annoying thing, and this one bothers me the most, is that we
cannot have this enabled on a production -rt kernel.  The latency caused
by kstop_machine when doing work is lost in the noise on a non-rt
kernel, but it can be up to 800 microseconds, and that would kill
the -rt kernel.  The reason this bothered me the most is that -rt is
where ftrace came from, and ftraced was not treating its motherland very well.

Along came Gregory Haskins, who was bickering about having ftrace enabled
on a production -rt kernel. I told him the reasons that this would be bad
and then he started thinking out loud, and suggesting wild ideas, like
patching gcc!

Since I have recently seen "The Dark Knight", Gregory's comments put me
into an "evil" mood.  I then thought of the idea of using the relocation
entries of the mcount call sites in a prelinked object file to create a
separate section with a list of these sites. On boot up, we record them
and change them into nops.

That's it! No kstop_machine for turning them into nops. We would only need
stop_machine to enable or disable tracing, but a user who is not tracing
will never have to deal with this annoying "ftraced" kernel thread waking
up every second, or with it ever running kstop_machine.

What's more, this means we can enable it on a production -rt kernel!

Now, this was no easy task. We needed to add a section to every object
file with a list of pointers to the call sites to mcount. The idea I came
up with was to make a tmp.s file for every object just after it is compiled.
This tmp.s would then be compiled and relinked into the original object.
The tmp.s file would have something like:

  .section __mcount_loc,"a",@progbits
  .quad location_of_mcount1
  .quad location_of_mcount2
  (etc)

By running objdump on the object file, we can find the offsets within
each section at which mcount is called.

For example, looking at hrtimer.o:

Disassembly of section .text:

0000000000000000 <hrtimer_init_sleeper>:
       0:       55                      push   %rbp
       1:       48 89 e5                mov    %rsp,%rbp
       4:       e8 00 00 00 00          callq  9 <hrtimer_init_sleeper+0x9>
                        5: R_X86_64_PC32        mcount+0xfffffffffffffffc
[...]

The '5' in '5: R_X86_64_PC32' is the offset at which the mcount relocation
is applied for the call site. This offset is relative to the .text section,
not necessarily to the function. If we look further we see:

000000000000001e <ktime_add_safe>:
      1e:       55                      push   %rbp
      1f:       48 89 e5                mov    %rsp,%rbp
      22:       e8 00 00 00 00          callq  27 <ktime_add_safe+0x9>
                        23: R_X86_64_PC32       mcount+0xfffffffffffffffc


This mcount call site is at offset 0x23 from the start of the .text
section, and obviously not from the start of ktime_add_safe.
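Mechanically, pulling those call-site offsets out of the objdump output is
a small text-filtering job. Here is a rough sketch, assuming the
"objdump -dr" line format shown above; this is only an illustration of the
idea, not the script the patch set actually uses:

```shell
# Print the section-relative offset of every mcount relocation found
# in "objdump -dr" output read on stdin. For the hrtimer.o example
# above this prints "5" and "23" (hex offsets, colon stripped).
mcount_offsets() {
    awk '/R_X86_64_PC32[[:space:]]+mcount/ { sub(/:$/, "", $1); print $1 }'
}
```

Usage would be something like `objdump -dr hrtimer.o | mcount_offsets`.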

If we make a tmp.s that has the following:

  .section __mcount_loc,"a",@progbits
  .quad hrtimer_init_sleeper + 0x5
  .quad hrtimer_init_sleeper + 0x23

We have a section with the locations of these two call sites. After the final
linking, they will point to the actual addresses used.

All that would need to be done is:

gcc -c tmp.s -o tmp.o
ld -r tmp.o hrtimer.o -o tmp_hrtimer.o
mv tmp_hrtimer.o hrtimer.o

Easy as that! Not quite.  What happens if that first function in the
section is a static function? That is, the symbol for the function
is local to the object. If for some reason hrtimer_init_sleeper is static,
tmp_hrtimer.o would end up with two symbols named hrtimer_init_sleeper:
the local definition, and the unresolved global reference from tmp.o,
which a local symbol cannot satisfy.

But we can be even more evil with this idea. We can do crazy things
with objcopy to solve it for us.

  objcopy --globalize-symbol hrtimer_init_sleeper hrtimer.o tmp_hrtimer.o

Now the hrtimer_init_sleeper would be global for linking.

  ld -r tmp_hrtimer.o tmp.o -o tmp2_hrtimer.o

Now the tmp.o could use the same global hrtimer_init_sleeper symbol.
But tmp2_hrtimer.o now carries the symbols from both tmp.o and
tmp_hrtimer.o, and we can't just blindly leave local symbols converted
to globals.

The solution is simply to put it back to local.

  objcopy --localize-symbol hrtimer_init_sleeper tmp2_hrtimer.o hrtimer.o

Now our hrtimer.o file has our __mcount_loc section and the
reference to hrtimer_init_sleeper will be resolved.

This is a bit complex to do in shell scripting and Makefiles, so I wrote
a well-documented Perl script, recordmcount.pl, that does all of the
above in one place.
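As a summary of the sequence above, here is a hedged sketch of the
per-object steps in shell form. It is a dry run that only prints the
commands it would execute; the function name and argument layout are
illustrative, not recordmcount.pl's actual interface:

```shell
# Sketch of the per-object steps: globalize the (possibly static)
# reference symbol, link in the generated __mcount_loc list, then
# localize the symbol again. Dry run: echoes each command rather
# than executing it, so it needs no toolchain to demonstrate.
record_mcount() {
    obj=$1      # e.g. hrtimer.o
    sym=$2      # first function in the section, used as reference
    list=$3     # generated .s file holding the __mcount_loc entries
    echo "gcc -c $list -o tmp.o"
    echo "objcopy --globalize-symbol $sym $obj tmp_$obj"
    echo "ld -r tmp_$obj tmp.o -o tmp2_$obj"
    echo "objcopy --localize-symbol $sym tmp2_$obj $obj"
}
```

Running `record_mcount hrtimer.o hrtimer_init_sleeper tmp.s` prints the
same four commands walked through above.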

With this new update, we can work to kill that kernel thread "ftraced"!

This patch set ports the new scheme to x86_64 and i386; the other archs
will still use the daemon until they are converted over.

I tested this on both x86_64 and i386 with and without CONFIG_RELOCATE
set.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-07 18:20 Steven Rostedt
@ 2008-08-07 18:47 ` Mathieu Desnoyers
  2008-08-07 20:42   ` Steven Rostedt
  2008-08-07 21:11 ` Jeremy Fitzhardinge
  2008-08-09  9:48 ` Abhishek Sagar
  2 siblings, 1 reply; 56+ messages in thread
From: Mathieu Desnoyers @ 2008-08-07 18:47 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, Linus Torvalds, David Miller, Roland McGrath,
	Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> One of the things that bothered me about the latest ftrace code was
> this annoying daemon that would wake up once a second to see if it
> had work to do. If it did not, it would go to sleep, otherwise it would
> do its work and then go to sleep.
> 
> You see, the reason for this is that for ftrace to maintain performance
> when configured in but disabled, it would need to change all the
> locations that called "mcount" (enabled with the gcc -pg option) into
> nops.  The "-pg" option in gcc sets up a function profiler to call this
> function called "mcount". If you simply have "mcount" return, it will
> still add 15 to 18% overhead in performance. Changing all the calls to
> nops moved the overhead into noise.
> 
> To get rid of this, I had the mcount code record the location that called
> it. Later, the "ftraced" daemon would wake up and look to see if
> any new functions were recorded. If so, it would call kstop_machine
> and convert the calls to "nops". We needed kstop_machine because bad
> things happen on SMP if you modify code that happens to be in the
> instruction cache of another CPU.
> 
> This "ftraced" kernel thread would be a happy little worker, but it caused
> some pains. One, is that it woke up once a second, and Ted Tso got mad
> at me because it would show up on PowerTop. I could easily make the
> default 5 seconds, and even have it runtime configurable, with a trivial
> patch. I have not got around to doing that yet.
> 
> The other annoying thing, and this one bothers me the most, is that we
> can not have this enabled on a production -rt kernel.  The latency caused
> by the kstop_machine when doing work is lost in the noise on a non-rt
> kernel, but it can be up to 800 microseconds, and that would kill
> the -rt kernel.  The reason this bothered me the most, is that -rt is
> where it came from, and ftraced was not treating its motherland very well.
> 

Hi Steven,

If you really want to stop using stop_machine, I think you should have a
look at my immediate values infrastructure:

see:
http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=blob;f=arch/x86/kernel/immediate.c;h=0958c02b49eed3bbc00bdc5aceeee17a6d0c7ab4;hb=HEAD

particularly replace_instruction_safe(), which uses a temporary
breakpoint to safely replace an instruction on a live SMP system without
using stop_machine. Feel free to try it in ftrace. :)

You may need to tweak the imv_notifier(), which is responsible for
executing the breakpoint. The only special thing this handler has to
know is the non-relocatable instructions you might want to insert (it
only cares about the new instructions, not the old ones being replaced).
The current code deals with 5-byte jumps. Note that the instruction is
executed in the breakpoint handler with preemption disabled, which might
not be well suited for a call instruction.

I would recommend patching in a 5-byte jmp with a 0 offset
(e9 00 00 00 00) when disabled (this is a single 5-byte instruction,
which makes sure no instruction pointer can iret into the middle of it).

When enabling the site, you could patch-in the original call, but you
should tweak the imv_notifier() so that it uses the jmp offset 0 in the
bypass instead of the function call, because preemption would otherwise
be disabled around the call when executed in the breakpoint bypass.

Therefore, simply statically forcing the bypass code to e9 00 00 00 00
in all cases (effectively a nop) would do the job for your needs.

Mathieu

> Along came Gregory Haskins, who was bickering about having ftrace enabled
> on a production -rt kernel. I told him the reasons that this would be bad
> and then he started thinking out loud, and suggesting wild ideas, like
> patching gcc!
> 
> Since I have recently seen "The Dark Knight", Gregory's comments put me
> into an "evil" mood.  I then thought of the idea about using the
> relocation entries of the mcount call sites, in a prelinked object file,
> and create a separate section with a list of these sites. On boot up,
> record them and change them into nops.
> 
> That's it! No kstop_machine for turning them into nops. We would only need
> stop_machine to enable or disable tracing, but a user not tracing will not have
> to deal with this annoying "ftraced" kernel thread waking up every second
> or ever running kstop_machine.
> 
> What's more, this means we can enable it on a production -rt kernel!
> 
> Now, this was no easy task. We needed to add a section to every object
> file with a list of pointers to the call sites to mcount. The idea I came
> up with was to make a tmp.s file for every object just after it is compiled.
> This tmp.s would then be compiled and relinked into the original object.
> The tmp.s file would have something like:
> 
>   .section __mcount_loc,"a",@progbits
>   .quad location_of_mcount1
>   .quad location_of_mcount2
>   (etc)
> 
> By running objdump on the object file we can find the offsets into the
> sections that the functions are called.
> 
> For example, looking at hrtimer.o:
> 
> Disassembly of section .text:
> 
> 0000000000000000 <hrtimer_init_sleeper>:
>        0:       55                      push   %rbp
>        1:       48 89 e5                mov    %rsp,%rbp
>        4:       e8 00 00 00 00          callq  9 <hrtimer_init_sleeper+0x9>
>                         5: R_X86_64_PC32        mcount+0xfffffffffffffffc
> [...]
> 
> the '5' in the '5: R_X86_64_PC32' is the offset that the mcount relocation
> is to be done for the call site. This offset is from the .text section,
> and not necessarily, from the function. If we look further we see:
> 
> 000000000000001e <ktime_add_safe>:
>       1e:       55                      push   %rbp
>       1f:       48 89 e5                mov    %rsp,%rbp
>       22:       e8 00 00 00 00          callq  27 <ktime_add_safe+0x9>
>                         23: R_X86_64_PC32       mcount+0xfffffffffffffffc
> 
> 
> This mcount call site is 0x23 from the .text section, and obviously
> not from the ktime_add_safe.
> 
> If we make a tmp.s that has the following:
> 
>   .section __mcount_loc,"a",@progbits
>   .quad hrtimer_init_sleeper + 0x5
>   .quad hrtimer_init_sleeper + 0x23
> 
> We have a section with the locations of these two call sites. After the final
> linking, they will point to the actual address used.
> 
> All that would need to be done is:
> 
> gcc -c tmp.s -o tmp.o
> ld -r tmp.o hrtimer.o -o tmp_hrtime.o
> mv tmp_hrtimer.o hrtimer.o
> 
> Easy as that! Not quite.  What happens if that first function in the
> section is a static function? That is, the symbol for the function
> is local to the object. If for some reason hrtimer_init_sleeper is static,
> the tmp_hrtimer.o would have two symbols for hrtimer_init_sleeper.
> One local and one global.
> 
> But we can be even more evil with this idea. We can do crazy things
> with objcopy to solve it for us.
> 
>   objcopy --globalize-symbol hrtimer_init_sleeper hrtimer.o tmp_hrtimer.o
> 
> Now the hrtimer_init_sleeper would be global for linking.
> 
>   ld -r tmp_hrtimer.o tmp.o -o tmp2_hrtimer.o
> 
> Now the tmp.o could use the same global hrtimer_init_sleeper symbol.
> But we have tmp2_hritmer.o that has the tmp.o and tmp_hrtimer.o symbols,
> but we cant just blindly convert local symbols to globals.
> 
> The solution is simply put it back to local.
> 
>   objcopy --localize-symbol hrtimer_init_sleeper tmp2_hrtimer.o hrtimer.o
> 
> Now our hrtimer.o file has our __mcount_loc section and the
> reference to hrtimer_init_sleeper will be resolved.
> 
> This is a bit complex to do in shell scripting and Makefiles, so I wrote
> a well documented recordmcount.pl perl script, that will do the above
> all in one place.
> 
> With this new update, we can work to kill that kernel thread "ftraced"!
> 
> This patch set ports to x86_64 and i386, the other archs will still use
> the daemon until they are converted over.
> 
> I tested this on both x86_64 and i386 with and without CONFIG_RELOCATE
> set.
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-07 18:47 ` Mathieu Desnoyers
@ 2008-08-07 20:42   ` Steven Rostedt
  2008-08-08 17:22     ` Mathieu Desnoyers
  0 siblings, 1 reply; 56+ messages in thread
From: Steven Rostedt @ 2008-08-07 20:42 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, Linus Torvalds, David Miller, Roland McGrath,
	Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams


On Thu, 7 Aug 2008, Mathieu Desnoyers wrote:

> 
> Hi Steven,
> 
> If you really want to stop using stop_machine, I think you should have a
> look at my immediate values infrastructure :
> 
> see:
> http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=blob;f=arch/x86/kernel/immediate.c;h=0958c02b49eed3bbc00bdc5aceeee17a6d0c7ab4;hb=HEAD
> 
> particularly replace_instruction_safe(), which uses a temporary
> breakpoint to safely replace an instruction on a live SMP system without
> using stop_machine. Feel free to try it in ftrace. :)
> 
> You may need to tweak the imv_notifier(), which is responsible for
> executing the breakpoint. The only special thing this handler has to
> know is the non-relocatable instructions you might want to insert (it
> only cares about the new instructions, not the old ones being replaced).
> The current code deals with 5-bytes jumps. Note that the instruction is
> executed in the breakpoint handler with preemption disabled, which might
> not be well suited for a call instruction.
> 
> I would recommend to patch in a 5-bytes jmp with 0 offset
> (e9 00 00 00 00) when disabled (this makes a single 5-bytes instruction
> and thus makes sure no instruction pointer can iret in the middle).
> 
> When enabling the site, you could patch-in the original call, but you
> should tweak the imv_notifier() so that it uses the jmp offset 0 in the
> bypass instead of the function call, because preemption would otherwise
> be disabled around the call when executed in the breakpoint bypass.
> 
> Therefore, simply statically forcing the bypass code to e9 00 00 00 00
> in all the cases (a nop) would do the job for your needs.

I originally used jumps instead of nops, but unfortunately, they actually 
hurt performance more than adding nops. Ingo told me it was probably due
to using up the jump predictions of the CPU.

-- Steve



* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-07 18:20 Steven Rostedt
  2008-08-07 18:47 ` Mathieu Desnoyers
@ 2008-08-07 21:11 ` Jeremy Fitzhardinge
  2008-08-07 21:29   ` Steven Rostedt
  2008-08-09  9:48 ` Abhishek Sagar
  2 siblings, 1 reply; 56+ messages in thread
From: Jeremy Fitzhardinge @ 2008-08-07 21:11 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, Linus Torvalds, David Miller, Mathieu Desnoyers,
	Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins,
	Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves,
	Clark Williams, Sam Ravnborg, Eric W. Biederman

Steven Rostedt wrote:
> Now, this was no easy task. We needed to add a section to every object
> file with a list of pointers to the call sites to mcount. The idea I came
> up with was to make a tmp.s file for every object just after it is compiled.
> This tmp.s would then be compiled and relinked into the original object.
> The tmp.s file would have something like:
>
>   .section __mcount_loc,"a",@progbits
>   .quad location_of_mcount1
>   .quad location_of_mcount2
>   (etc)
>   

I have a few concerns about this scheme:

One is that you assume that all text is in a .text section, and that the 
offsets you compute on a per-section basis are going to be valid in the 
final kernel image.  At the very least you should make each offset 
relative to its own function rather than inter-function.

But it seems pretty fragile overall.  Things like -ffunction-sections 
and section gc will invalidate the table you build up.

It seems to me that you can achieve the same effect by:

   1. link vmlinux with "ld -q", which will leave the relocation
      information intact in the final object
   2. use "readelf -r" to extract the relocation info, find the
      references to mcount and build a table
   3. link that table into the kernel image
   4. repeat to get a stable result
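Step 2 of the list above is again mostly text filtering. A minimal
sketch, assuming a typical "readelf -r" column layout (offset, info,
type, symbol value, symbol name); the helper name is made up for
illustration:

```shell
# Print the offset of every relocation against the "mcount" symbol
# found in "readelf -r" output read on stdin. Assumes the usual
# five-column relocation line; header lines fall through the filter.
mcount_relocs() {
    awk '$3 ~ /^R_X86_64_/ && $5 == "mcount" { print $1 }'
}
```

Usage would be roughly `ld -q ... -o vmlinux && readelf -r vmlinux | mcount_relocs`.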

Note that steps 2-4 are identical to kallsyms, both in intent and 
detail.  The only difference is the precise table you build and the 
command you use to extract the info from the kernel.  From a quick peek 
at the way Makefile implements kallsyms, it looks to me like it would be 
fairly easy to generalize so that you can do the mcount reloc processing 
in the same relink passes as kallsyms with minimal overhead on the 
kernel build time.

It just seems incredibly fiddly to be doing all this per-object.

    J

> By running objdump on the object file we can find the offsets into the
> sections that the functions are called.
>
> For example, looking at hrtimer.o:
>
> Disassembly of section .text:
>
> 0000000000000000 <hrtimer_init_sleeper>:
>        0:       55                      push   %rbp
>        1:       48 89 e5                mov    %rsp,%rbp
>        4:       e8 00 00 00 00          callq  9 <hrtimer_init_sleeper+0x9>
>                         5: R_X86_64_PC32        mcount+0xfffffffffffffffc
> [...]
>
> the '5' in the '5: R_X86_64_PC32' is the offset that the mcount relocation
> is to be done for the call site. This offset is from the .text section,
> and not necessarily, from the function. If we look further we see:
>
> 000000000000001e <ktime_add_safe>:
>       1e:       55                      push   %rbp
>       1f:       48 89 e5                mov    %rsp,%rbp
>       22:       e8 00 00 00 00          callq  27 <ktime_add_safe+0x9>
>                         23: R_X86_64_PC32       mcount+0xfffffffffffffffc
>
>
> This mcount call site is 0x23 from the .text section, and obviously
> not from the ktime_add_safe.
>
> If we make a tmp.s that has the following:
>
>   .section __mcount_loc,"a",@progbits
>   .quad hrtimer_init_sleeper + 0x5
>   .quad hrtimer_init_sleeper + 0x23
>
> We have a section with the locations of these two call sites. After the final
> linking, they will point to the actual address used.
>
> All that would need to be done is:
>
> gcc -c tmp.s -o tmp.o
> ld -r tmp.o hrtimer.o -o tmp_hrtime.o
> mv tmp_hrtimer.o hrtimer.o
>
> Easy as that! Not quite.  What happens if that first function in the
> section is a static function? That is, the symbol for the function
> is local to the object. If for some reason hrtimer_init_sleeper is static,
> the tmp_hrtimer.o would have two symbols for hrtimer_init_sleeper.
> One local and one global.
>
> But we can be even more evil with this idea. We can do crazy things
> with objcopy to solve it for us.
>
>   objcopy --globalize-symbol hrtimer_init_sleeper hrtimer.o tmp_hrtimer.o
>
> Now the hrtimer_init_sleeper would be global for linking.
>
>   ld -r tmp_hrtimer.o tmp.o -o tmp2_hrtimer.o
>
> Now the tmp.o could use the same global hrtimer_init_sleeper symbol.
> But we have tmp2_hritmer.o that has the tmp.o and tmp_hrtimer.o symbols,
> but we cant just blindly convert local symbols to globals.
>
> The solution is simply put it back to local.
>
>   objcopy --localize-symbol hrtimer_init_sleeper tmp2_hrtimer.o hrtimer.o
>
> Now our hrtimer.o file has our __mcount_loc section and the
> reference to hrtimer_init_sleeper will be resolved.
>
> This is a bit complex to do in shell scripting and Makefiles, so I wrote
> a well documented recordmcount.pl perl script, that will do the above
> all in one place.
>
> With this new update, we can work to kill that kernel thread "ftraced"!
>
> This patch set ports to x86_64 and i386, the other archs will still use
> the daemon until they are converted over.
>
> I tested this on both x86_64 and i386 with and without CONFIG_RELOCATE
> set.
>
>   



* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-07 21:28 ` [PATCH 0/5] ftrace: to kill a daemon Bodo Eggert
@ 2008-08-07 21:24   ` Jeremy Fitzhardinge
  2008-08-07 21:35   ` Steven Rostedt
  1 sibling, 0 replies; 56+ messages in thread
From: Jeremy Fitzhardinge @ 2008-08-07 21:24 UTC (permalink / raw)
  To: 7eggert
  Cc: Steven Rostedt, linux-kernel, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller,
	Mathieu Desnoyers, Roland McGrath, Ulrich Drepper, Rusty Russell,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

Bodo Eggert wrote:
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
>   
>> The idea I came
>> up with was to make a tmp.s file for every object just after it is compiled.
>>     
>
> make -j would cause these files to overwrite each other, wouldn't it?
>   

I'd assume the name is actually derived from the real filename...

    J



* Re: [PATCH 0/5] ftrace: to kill a daemon
       [not found] <aYipy-5FM-9@gated-at.bofh.it>
@ 2008-08-07 21:28 ` Bodo Eggert
  2008-08-07 21:24   ` Jeremy Fitzhardinge
  2008-08-07 21:35   ` Steven Rostedt
       [not found] ` <aYiIP-6bN-11@gated-at.bofh.it>
  1 sibling, 2 replies; 56+ messages in thread
From: Bodo Eggert @ 2008-08-07 21:28 UTC (permalink / raw)
  To: Steven Rostedt, linux-kernel, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller,
	Mathieu Desnoyers, Roland McGrath, Ulrich Drepper, Rusty Russell,
	Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

Steven Rostedt <rostedt@goodmis.org> wrote:

> The idea I came
> up with was to make a tmp.s file for every object just after it is compiled.

make -j would cause these files to overwrite each other, wouldn't it?




* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-07 21:11 ` Jeremy Fitzhardinge
@ 2008-08-07 21:29   ` Steven Rostedt
  2008-08-07 22:26     ` Roland McGrath
  0 siblings, 1 reply; 56+ messages in thread
From: Steven Rostedt @ 2008-08-07 21:29 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, Linus Torvalds, David Miller, Mathieu Desnoyers,
	Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins,
	Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves,
	Clark Williams, Sam Ravnborg, Eric W. Biederman


On Thu, 7 Aug 2008, Jeremy Fitzhardinge wrote:

> Steven Rostedt wrote:
> > Now, this was no easy task. We needed to add a section to every object
> > file with a list of pointers to the call sites to mcount. The idea I came
> > up with was to make a tmp.s file for every object just after it is compiled.
> > This tmp.s would then be compiled and relinked into the original object.
> > The tmp.s file would have something like:
> > 
> >   .section __mcount_loc,"a",@progbits
> >   .quad location_of_mcount1
> >   .quad location_of_mcount2
> >   (etc)
> >   
> 
> I have  a few concerns about this scheme:
> 
> One is that you assume that all text is in a .text section, and that the
> offsets you compute on a per-section basis are going to be valid in the final
> kernel image.  At the very least you should make each offset relative to its
> own function rather than inter-function.

I assume neither ;-)

This works with .text.sched .text.init .text.cpu etc. I use the first 
function in each section as my reference point.  This is the key.

> 
> But it seems pretty fragile overall.  Things like -ffunction-sections and
> section gc will invalidate the table you build up.

I don't see how. I'm referencing a function as my pointer. If this was
true, then I wouldn't be able to do

  static void my_func(void) { [...] }
  struct func_struct {
     void (*func)(void);
  } f = { my_func };

My references end up exactly the same as the "f.func" reference does.
If my references are broken, so is f.func. Which would be bad, to
say the least.

> 
> It seems to me that you can acheive the same effect by:
> 
>   1. link vmlinux with "ld -q", which will leave the relocation
>      information intact in the final object
>   2. use "readelf -r" to extract the relocation info, find the
>      references to mcount and build a table
>   3. link that table into the kernel image
>   4. repeat to get a stable result

This doesn't seem any less complex than what I did. With this, I would
have to come up with another way to handle modules, which would make
things a lot more complex.

> 
> Note that steps 2-4 are identical to kallsyms, both in intent and detail.  The
> only difference is the precise table you build and the command you use to
> extract the info from the kernel.  From a quick peek at the way Makefile
> implements kallsyms, it looks to me like it would be fairly easy to generalize
> so that you can do the mcount reloc processing in the same relink passes as
> kallsyms with minimal overhead on the kernel build time.

I could work on this. But again, this will put more work into module.c 
that I would like to avoid.

> 
> It just seems incredibly fiddly to be doing all this per-object.

Actually, this isn't that much fiddling. What I did was simply make a list
of pointers to the call sites to mcount, using the first function of each
section as the offset reference. I relinked the list in, making it just
like a list built within C.  All relocations and magic after this are
handled just fine.

-- Steve



* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-07 21:28 ` [PATCH 0/5] ftrace: to kill a daemon Bodo Eggert
  2008-08-07 21:24   ` Jeremy Fitzhardinge
@ 2008-08-07 21:35   ` Steven Rostedt
  1 sibling, 0 replies; 56+ messages in thread
From: Steven Rostedt @ 2008-08-07 21:35 UTC (permalink / raw)
  To: Bodo Eggert
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, Linus Torvalds, David Miller, Mathieu Desnoyers,
	Roland McGrath, Ulrich Drepper, Rusty Russell,
	Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams


On Thu, 7 Aug 2008, Bodo Eggert wrote:

> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > The idea I came
> > up with was to make a tmp.s file for every object just after it is compiled.
> 
> make -j would cause these files to overwrite each other, wouldn't it?

I simplified my explanation. I really used something like this:

for init/main.o:

init/.tmp_mc_main.s  : for the list file
init/.tmp_mc_main.o  : for the list compiled object.
init/.tmp_gl_main.o  : for the object converted static to global.
init/.tmp_mx_main.o  : for the merged list and global object.


The tmp.s was to keep the focus on the concept. But the actual 
implementation handles multi-threaded builds.
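The naming scheme can be sketched as a tiny helper that derives every
temp name from the object's own path, which is what makes it safe under
make -j. The function below is a guess at the logic, only reproducing
the names listed above:

```shell
# Derive the four per-object temp file names from the object path,
# e.g. init/main.o -> init/.tmp_mc_main.s, init/.tmp_mc_main.o, etc.
# Assumes the object path contains a directory component.
mc_names() {
    obj=$1
    dir=${obj%/*}                       # e.g. init
    base=${obj##*/}; base=${base%.o}    # e.g. main
    printf '%s\n' "$dir/.tmp_mc_$base.s" "$dir/.tmp_mc_$base.o" \
                  "$dir/.tmp_gl_$base.o" "$dir/.tmp_mx_$base.o"
}
```

Because every name embeds the source object's directory and basename,
two parallel jobs can never clobber each other's files.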

I use distcc and build with -j40. I had no issues.

Thanks,

-- Steve



* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-07 21:29   ` Steven Rostedt
@ 2008-08-07 22:26     ` Roland McGrath
  2008-08-08  1:21       ` Steven Rostedt
  2008-08-08  4:54       ` Sam Ravnborg
  0 siblings, 2 replies; 56+ messages in thread
From: Roland McGrath @ 2008-08-07 22:26 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jeremy Fitzhardinge, linux-kernel, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller,
	Mathieu Desnoyers, Ulrich Drepper, Rusty Russell, Gregory Haskins,
	Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves,
	Clark Williams, Sam Ravnborg, Eric W. Biederman

> This doesn't seem any less complex than what I did. With this, I would 
> have to come up with another way to handle modules. This will make
> things a lot more complex.

The scheme you've implemented can apply fine to a .ko file after it's made.
They are just .o's really.  It is presumably faster to do one step per
final .ko rather than many tiny ones (per source file).  

That might be a benefit to doing it all at the end for vmlinux too.  I
think the best way would be to have a vmlinux.o that we actually use in the
link, rather than just analyzing and discarding.  Then you can just do your
existing hack on vmlinux.o before it's linked into vmlinux.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-07 22:26     ` Roland McGrath
@ 2008-08-08  1:21       ` Steven Rostedt
  2008-08-08  1:24         ` Steven Rostedt
                           ` (2 more replies)
  2008-08-08  4:54       ` Sam Ravnborg
  1 sibling, 3 replies; 56+ messages in thread
From: Steven Rostedt @ 2008-08-08  1:21 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Jeremy Fitzhardinge, linux-kernel, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller,
	Mathieu Desnoyers, Ulrich Drepper, Rusty Russell, Gregory Haskins,
	Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves,
	Clark Williams, Sam Ravnborg, Eric W. Biederman


On Thu, 7 Aug 2008, Roland McGrath wrote:

> > This doesn't seem any less complex than what I did. With this, I would 
> > have to come up with another way to handle modules. This will make
> > things a lot more complex.
> 
> The scheme you've implemented can apply fine to a .ko file after it's made.
> They are just .o's really.  It is presumably faster to do one step per
> final .ko rather than many tiny ones (per source file).  
> 
> That might be a benefit to doing it all at the end for vmlinux too.  I
> think the best way would be to have a vmlinux.o that we actually use in the
> link, rather than just analyzing and discarding.  Then you can just do your
> existing hack on vmlinux.o before it's linked into vmlinux.

OK, I am playing with this. And you are right, it does work with the vmlinux.o 
file.  It adds almost another minute to the build any time something is 
changed, whereas the per-object way may take longer doing a full build, but 
does not add that extra minute to every build.

Which would people prefer?  A little longer full build, doing the 
modification on every object, or a shorter full build, where every 
build takes that extra minute?

Note, if you have DYNAMIC_FTRACE disabled, this is a nop, and will _not_ 
affect you.

-- Steve


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-08  1:21       ` Steven Rostedt
@ 2008-08-08  1:24         ` Steven Rostedt
  2008-08-08  1:56         ` Steven Rostedt
  2008-08-08  7:22         ` Peter Zijlstra
  2 siblings, 0 replies; 56+ messages in thread
From: Steven Rostedt @ 2008-08-08  1:24 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Jeremy Fitzhardinge, linux-kernel, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller,
	Mathieu Desnoyers, Ulrich Drepper, Rusty Russell, Gregory Haskins,
	Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves,
	Clark Williams, Sam Ravnborg, Eric W. Biederman


On Thu, 7 Aug 2008, Steven Rostedt wrote:
> On Thu, 7 Aug 2008, Roland McGrath wrote:
> > 
> > The scheme you've implemented can apply fine to a .ko file after it's made.
> > They are just .o's really.  It is presumably faster to do one step per
> > final .ko rather than many tiny ones (per source file).  
> > 
> > That might be a benefit to doing it all at the end for vmlinux too.  I
> > think the best way would be to have a vmlinux.o that we actually use in the
> > link, rather than just analyzing and discarding.  Then you can just do your
> > existing hack on vmlinux.o before it's linked into vmlinux.
> 
> OK, I am playing with this. And you are right, it does work with the vmlinux.o 
> file.  It adds almost another minute to the build any time something is 
> changed, whereas the per-object way may take longer doing a full build, but 
> does not add that extra minute to every build.
> 
> Which would people prefer?  A little longer full build, doing the 
> modification on every object, or a shorter full build, where every 
> build takes that extra minute?
> 
> Note, if you have DYNAMIC_FTRACE disabled, this is a nop, and will _not_ 
> affect you.

Another advantage of doing it at the end on the vmlinux.o is that you do 
not need to recompile the entire kernel when you change the config option. 
But you do need to recompile all modules.

-- Steve


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-08  1:21       ` Steven Rostedt
  2008-08-08  1:24         ` Steven Rostedt
@ 2008-08-08  1:56         ` Steven Rostedt
  2008-08-08  7:22         ` Peter Zijlstra
  2 siblings, 0 replies; 56+ messages in thread
From: Steven Rostedt @ 2008-08-08  1:56 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Jeremy Fitzhardinge, linux-kernel, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller,
	Mathieu Desnoyers, Ulrich Drepper, Rusty Russell, Gregory Haskins,
	Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves,
	Clark Williams, Sam Ravnborg, Eric W. Biederman



On Thu, 7 Aug 2008, Steven Rostedt wrote:

> 
> On Thu, 7 Aug 2008, Roland McGrath wrote:
> 
> > > This doesn't seem any less complex than what I did. With this, I would 
> > > have to come up with another way to handle modules. This will make
> > > things a lot more complex.
> > 
> > The scheme you've implemented can apply fine to a .ko file after it's made.
> > They are just .o's really.  It is presumably faster to do one step per
> > final .ko rather than many tiny ones (per source file).  
> > 
> > That might be a benefit to doing it all at the end for vmlinux too.  I
> > think the best way would be to have a vmlinux.o that we actually use in the
> > link, rather than just analyzing and discarding.  Then you can just do your
> > existing hack on vmlinux.o before it's linked into vmlinux.
> 
> OK, I am playing with this. And you are right, it does work with the vmlinux.o 
> file.  It adds almost another minute to the build any time something is 
> changed, whereas the per-object way may take longer doing a full build, but 
> does not add that extra minute to every build.

I spoke too soon. I was only looking at the result of the vmlinux.o. But it 
seems that kallsyms skips using this file and relinks everything itself.
Thus the __mcount_loc I added never made it into the vmlinux. The more I 
try to get this method to work, the more complex it becomes. And I still 
have not changed recordmcount.pl at all. Which means, by trying to do it 
against the vmlinux.o, I'm actually making it more complex than it already 
is.

Right now, doing it to every object seems to be the safest approach.

-- Steve



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-07 22:26     ` Roland McGrath
  2008-08-08  1:21       ` Steven Rostedt
@ 2008-08-08  4:54       ` Sam Ravnborg
  1 sibling, 0 replies; 56+ messages in thread
From: Sam Ravnborg @ 2008-08-08  4:54 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Steven Rostedt, Jeremy Fitzhardinge, linux-kernel, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, Andrew Morton, Linus Torvalds,
	David Miller, Mathieu Desnoyers, Ulrich Drepper, Rusty Russell,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams, Eric W. Biederman

On Thu, Aug 07, 2008 at 03:26:42PM -0700, Roland McGrath wrote:
> > This doesn't seem any less complex than what I did. With this, I would 
> > have to come up with another way to handle modules. This will make
> > things a lot more complex.
> 
> The scheme you've implemented can apply fine to a .ko file after it's made.
> They are just .o's really.  It is presumably faster to do one step per
> final .ko rather than many tiny ones (per source file).  
> 
> That might be a benefit to doing it all at the end for vmlinux too.  I
> think the best way would be to have a vmlinux.o that we actually use in the
> link, rather than just analyzing and discarding.  Then you can just do your
> existing hack on vmlinux.o before it's linked into vmlinux.
I have patches to actually use vmlinux.o as part of the link process,
but as I hit some subtle issues with um and sparc, they are not
yet ready. I plan to dust them off again and fix up the sparc and
um issues, but that is not this month.

	Sam

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-08  1:21       ` Steven Rostedt
  2008-08-08  1:24         ` Steven Rostedt
  2008-08-08  1:56         ` Steven Rostedt
@ 2008-08-08  7:22         ` Peter Zijlstra
  2008-08-08 11:31           ` Steven Rostedt
  2 siblings, 1 reply; 56+ messages in thread
From: Peter Zijlstra @ 2008-08-08  7:22 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Roland McGrath, Jeremy Fitzhardinge, linux-kernel, Ingo Molnar,
	Thomas Gleixner, Andrew Morton, Linus Torvalds, David Miller,
	Mathieu Desnoyers, Ulrich Drepper, Rusty Russell, Gregory Haskins,
	Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves,
	Clark Williams, Sam Ravnborg, Eric W. Biederman

On Thu, 2008-08-07 at 21:21 -0400, Steven Rostedt wrote:

> Which would people prefer?  A little longer full build, doing the 
> modification on every object, or a shorter full build, where every 
> build takes that extra minute?

I have a very strong preference to do it per obj rather than at the end,
that way it can be distributed and make these fancy smp boxen useful.

Bonus points if you can make distcc dig it.

The vmlinux link path is the thing that is killing build performance,
anything to shorten that is golden, anything increasing it needs _very_
good justification.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-08  7:22         ` Peter Zijlstra
@ 2008-08-08 11:31           ` Steven Rostedt
  0 siblings, 0 replies; 56+ messages in thread
From: Steven Rostedt @ 2008-08-08 11:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Roland McGrath, Jeremy Fitzhardinge, linux-kernel, Ingo Molnar,
	Thomas Gleixner, Andrew Morton, Linus Torvalds, David Miller,
	Mathieu Desnoyers, Ulrich Drepper, Rusty Russell, Gregory Haskins,
	Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves,
	Clark Williams, Sam Ravnborg, Eric W. Biederman


On Fri, 8 Aug 2008, Peter Zijlstra wrote:

> On Thu, 2008-08-07 at 21:21 -0400, Steven Rostedt wrote:
> 
> > Which would people prefer?  A little longer full build, doing the 
> > modification on every object, or a shorter full build, where every 
> > build takes that extra minute?
> 
> I have a very strong preference to do it per obj rather than at the end,
> that way it can be distributed and make these fancy smp boxen useful.
> 
> Bonus points if you can make distcc dig it.

Yes, the CC used in the script does indeed use distcc (if you are using 
it).

;-)

-- Steve

> 
> The vmlinux link path is the thing that is killing build performance,
> anything to shorten that is golden, anything increasing it needs _very_
> good justification.
> 
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-07 20:42   ` Steven Rostedt
@ 2008-08-08 17:22     ` Mathieu Desnoyers
  2008-08-08 17:36       ` Steven Rostedt
  0 siblings, 1 reply; 56+ messages in thread
From: Mathieu Desnoyers @ 2008-08-08 17:22 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, Linus Torvalds, David Miller, Roland McGrath,
	Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> On Thu, 7 Aug 2008, Mathieu Desnoyers wrote:
> 
> > 
> > Hi Steven,
> > 
> > If you really want to stop using stop_machine, I think you should have a
> > look at my immediate values infrastructure :
> > 
> > see:
> > http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=blob;f=arch/x86/kernel/immediate.c;h=0958c02b49eed3bbc00bdc5aceeee17a6d0c7ab4;hb=HEAD
> > 
> > particularly replace_instruction_safe(), which uses a temporary
> > breakpoint to safely replace an instruction on a live SMP system without
> > using stop_machine. Feel free to try it in ftrace. :)
> > 
> > You may need to tweak the imv_notifier(), which is responsible for
> > executing the breakpoint. The only special thing this handler has to
> > know is the non-relocatable instructions you might want to insert (it
> > only cares about the new instructions, not the old ones being replaced).
> > The current code deals with 5-bytes jumps. Note that the instruction is
> > executed in the breakpoint handler with preemption disabled, which might
> > not be well suited for a call instruction.
> > 
> > I would recommend to patch in a 5-bytes jmp with 0 offset
> > (e9 00 00 00 00) when disabled (this makes a single 5-bytes instruction
> > and thus makes sure no instruction pointer can iret in the middle).
> > 
> > When enabling the site, you could patch-in the original call, but you
> > should tweak the imv_notifier() so that it uses the jmp offset 0 in the
> > bypass instead of the function call, because preemption would otherwise
> > be disabled around the call when executed in the breakpoint bypass.
> > 
> > Therefore, simply statically forcing the bypass code to e9 00 00 00 00
> > in all the cases (a nop) would do the job for your needs.
> 
> I originally used jumps instead of nops, but unfortunately, they actually 
> hurt performance more than adding nops. Ingo told me it was probably due
> to using up the jump predictions of the CPU.
> 

Hrm, are you sure you use a single 5-bytes nop instruction then, or do
you use a mix of various nop sizes (add_nops) on some architectures ?

You can consume the branch prediction buffers for conditional branches,
but I doubt static jumps have this impact ? I don't see what "jump
predictions" you are referring to here exactly.

Mathieu

> -- Steve
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-08 17:22     ` Mathieu Desnoyers
@ 2008-08-08 17:36       ` Steven Rostedt
  2008-08-08 17:46         ` Mathieu Desnoyers
  0 siblings, 1 reply; 56+ messages in thread
From: Steven Rostedt @ 2008-08-08 17:36 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, Linus Torvalds, David Miller, Roland McGrath,
	Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams


On Fri, 8 Aug 2008, Mathieu Desnoyers wrote:
> * Steven Rostedt (rostedt@goodmis.org) wrote:
> > 
> > I originally used jumps instead of nops, but unfortunately, they actually 
> > hurt performance more than adding nops. Ingo told me it was probably due
> > to using up the jump predictions of the CPU.
> > 
> 
> Hrm, are you sure you use a single 5-bytes nop instruction then, or do
> you use a mix of various nop sizes (add_nops) on some architectures ?

I use (for x86) what is in include/asm-x86/nops.h depending on what the
cpuid gives us.

> 
> You can consume the branch prediction buffers for conditional branches,
> but I doubt static jumps have this impact ? I don't see what "jump
> predictions" you are referring to here exactly.

I don't know the details, but we definitely saw a drop in performance 
between using nops and static jumps.

-- Steve


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-08 17:36       ` Steven Rostedt
@ 2008-08-08 17:46         ` Mathieu Desnoyers
  2008-08-08 18:13           ` Steven Rostedt
  0 siblings, 1 reply; 56+ messages in thread
From: Mathieu Desnoyers @ 2008-08-08 17:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, Linus Torvalds, David Miller, Roland McGrath,
	Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> On Fri, 8 Aug 2008, Mathieu Desnoyers wrote:
> > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > 
> > > I originally used jumps instead of nops, but unfortunately, they actually 
> > > hurt performance more than adding nops. Ingo told me it was probably due
> > > to using up the jump predictions of the CPU.
> > > 
> > 
> > Hrm, are you sure you use a single 5-bytes nop instruction then, or do
> > you use a mix of various nop sizes (add_nops) on some architectures ?
> 
> I use (for x86) what is in include/asm-x86/nops.h depending on what the
> cpuid gives us.
> 

That's bad :

#define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4

#define K8_NOP5 K8_NOP3 K8_NOP2

#define K7_NOP5 K7_NOP4 ASM_NOP1

So, when you try, later, to replace these instructions with a single
5-bytes instruction, a preempted thread could iret in the middle of your
5-bytes insn and cause an illegal instruction ?


> > 
> > You can consume the branch prediction buffers for conditional branches,
> > but I doubt static jumps have this impact ? I don't see what "jump
> > predictions" you are referring to here exactly.
> 
> I don't know the details, but we definitely saw a drop in performance 
> between using nops and static jumps.
> 

Generated by replacing all the call by 5-bytes jumps e9 00 00 00 00
instead of the 5-bytes add_nops ? On which architectures ?

Mathieu


> -- Steve
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-08 17:46         ` Mathieu Desnoyers
@ 2008-08-08 18:13           ` Steven Rostedt
  2008-08-08 18:15             ` Peter Zijlstra
  2008-08-08 18:21             ` Mathieu Desnoyers
  0 siblings, 2 replies; 56+ messages in thread
From: Steven Rostedt @ 2008-08-08 18:13 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, Linus Torvalds, David Miller, Roland McGrath,
	Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams


On Fri, 8 Aug 2008, Mathieu Desnoyers wrote:
> * Steven Rostedt (rostedt@goodmis.org) wrote:
> > 
> > On Fri, 8 Aug 2008, Mathieu Desnoyers wrote:
> > > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > > 
> > > > I originally used jumps instead of nops, but unfortunately, they actually 
> > > > hurt performance more than adding nops. Ingo told me it was probably due
> > > > to using up the jump predictions of the CPU.
> > > > 
> > > 
> > > Hrm, are you sure you use a single 5-bytes nop instruction then, or do
> > > you use a mix of various nop sizes (add_nops) on some architectures ?
> > 
> > I use (for x86) what is in include/asm-x86/nops.h depending on what the
> > cpuid gives us.
> > 
> 
> That's bad :
> 
> #define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4
> 
> #define K8_NOP5 K8_NOP3 K8_NOP2
> 
> #define K7_NOP5 K7_NOP4 ASM_NOP1
> 
> So, when you try, later, to replace these instructions with a single
> 5-bytes instruction, a preempted thread could iret in the middle of your
> 5-bytes insn and cause an illegal instruction ?

That's why I use kstop_machine.

> 
> 
> > > 
> > > You can consume the branch prediction buffers for conditional branches,
> > > but I doubt static jumps have this impact ? I don't see what "jump
> > > predictions" you are referring to here exactly.
> > 
> > I don't know the details, but we definitely saw a drop in performance 
> > between using nops and static jumps.
> > 
> 
> Generated by replacing all the call by 5-bytes jumps e9 00 00 00 00
> instead of the 5-bytes add_nops ? On which architectures ?
> 

I ran this on my Dell (Intel Xeon), which IIRC did show the performance 
degradation. I unfortunately don't have the time to redo those tests, but 
you are welcome to.

Just look at arch/x86/kernel/ftrace.c and replace the nop with the jump.
In fact, the comments in that file still say it is a jmp. Remember, my 
first go was to use the jmp.

-- Steve


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-08 18:13           ` Steven Rostedt
@ 2008-08-08 18:15             ` Peter Zijlstra
  2008-08-08 18:21             ` Mathieu Desnoyers
  1 sibling, 0 replies; 56+ messages in thread
From: Peter Zijlstra @ 2008-08-08 18:15 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, linux-kernel, Ingo Molnar, Thomas Gleixner,
	Andrew Morton, Linus Torvalds, David Miller, Roland McGrath,
	Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

On Fri, 2008-08-08 at 14:13 -0400, Steven Rostedt wrote:
> On Fri, 8 Aug 2008, Mathieu Desnoyers wrote:
> > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > 
> > > On Fri, 8 Aug 2008, Mathieu Desnoyers wrote:
> > > > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > > > 
> > > > > I originally used jumps instead of nops, but unfortunately, they actually 
> > > > > hurt performance more than adding nops. Ingo told me it was probably due
> > > > > to using up the jump predictions of the CPU.
> > > > > 
> > > > 
> > > > Hrm, are you sure you use a single 5-bytes nop instruction then, or do
> > > > you use a mix of various nop sizes (add_nops) on some architectures ?
> > > 
> > > I use (for x86) what is in include/asm-x86/nops.h depending on what the
> > > cpuid gives us.
> > > 
> > 
> > That's bad :
> > 
> > #define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4
> > 
> > #define K8_NOP5 K8_NOP3 K8_NOP2
> > 
> > #define K7_NOP5 K7_NOP4 ASM_NOP1
> > 
> > So, when you try, later, to replace these instructions with a single
> > 5-bytes instruction, a preempted thread could iret in the middle of your
> > 5-bytes insn and cause an illegal instruction ?
> 
> That's why I use kstop_machine.
> 
> > 
> > 
> > > > 
> > > > You can consume the branch prediction buffers for conditional branches,
> > > > but I doubt static jumps have this impact ? I don't see what "jump
> > > > predictions" you are referring to here exactly.
> > > 
> > > I don't know the details, but we definitely saw a drop in performance 
> > > between using nops and static jumps.
> > > 
> > 
> > Generated by replacing all the call by 5-bytes jumps e9 00 00 00 00
> > instead of the 5-bytes add_nops ? On which architectures ?
> > 
> 
> I ran this on my Dell (Intel Xeon), which IIRC did show the performance 
> degradation. I unfortunately don't have the time to redo those tests, but 
> you are welcome to.
> 
> Just look at arch/x86/kernel/ftrace.c and replace the nop with the jump.
> In fact, the comments in that file still say it is a jmp. Remember, my 
> first go was to use the jmp.

5 single byte nops?


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-08 18:13           ` Steven Rostedt
  2008-08-08 18:15             ` Peter Zijlstra
@ 2008-08-08 18:21             ` Mathieu Desnoyers
  2008-08-08 18:41               ` Steven Rostedt
  1 sibling, 1 reply; 56+ messages in thread
From: Mathieu Desnoyers @ 2008-08-08 18:21 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, Linus Torvalds, David Miller, Roland McGrath,
	Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> On Fri, 8 Aug 2008, Mathieu Desnoyers wrote:
> > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > 
> > > On Fri, 8 Aug 2008, Mathieu Desnoyers wrote:
> > > > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > > > 
> > > > > I originally used jumps instead of nops, but unfortunately, they actually 
> > > > > hurt performance more than adding nops. Ingo told me it was probably due
> > > > > to using up the jump predictions of the CPU.
> > > > > 
> > > > 
> > > > Hrm, are you sure you use a single 5-bytes nop instruction then, or do
> > > > you use a mix of various nop sizes (add_nops) on some architectures ?
> > > 
> > > I use (for x86) what is in include/asm-x86/nops.h depending on what the
> > > cpuid gives us.
> > > 
> > 
> > That's bad :
> > 
> > #define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4
> > 
> > #define K8_NOP5 K8_NOP3 K8_NOP2
> > 
> > #define K7_NOP5 K7_NOP4 ASM_NOP1
> > 
> > So, when you try, later, to replace these instructions with a single
> > 5-bytes instruction, a preempted thread could iret in the middle of your
> > 5-bytes insn and cause an illegal instruction ?
> 
> That's why I use kstop_machine.
> 

kstop_machine does not guarantee that you won't have _any_ thread
preempted with its IP pointing exactly in the middle of your instructions
_before_ the modification, and scheduled back in _after_ the modification,
thus causing an illegal instruction.

Still buggy. :/

> > 
> > 
> > > > 
> > > > You can consume the branch prediction buffers for conditional branches,
> > > > but I doubt static jumps have this impact ? I don't see what "jump
> > > > predictions" you are referring to here exactly.
> > > 
> > I don't know the details, but we definitely saw a drop in performance 
> > between using nops and static jumps.
> > > 
> > 
> > Generated by replacing all the call by 5-bytes jumps e9 00 00 00 00
> > instead of the 5-bytes add_nops ? On which architectures ?
> > 
> 
> I ran this on my Dell (Intel Xeon), which IIRC did show the performance 
> degradation. I unfortunately don't have the time to redo those tests, but 
> you are welcome to.
> 
> Just look at arch/x86/kernel/ftrace.c and replace the nop with the jump.
> In fact, the comments in that file still say it is a jmp. Remember, my 
> first go was to use the jmp.
> 

I'll try to find time to compare :

multi-instructions 5-bytes nops (although this approach is just buggy)
5-bytes jump to the next address
2-bytes jump to offset +3.

Mathieu

> -- Steve
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-08 18:21             ` Mathieu Desnoyers
@ 2008-08-08 18:41               ` Steven Rostedt
  2008-08-08 19:04                 ` Linus Torvalds
                                   ` (3 more replies)
  0 siblings, 4 replies; 56+ messages in thread
From: Steven Rostedt @ 2008-08-08 18:41 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, Linus Torvalds, David Miller, Roland McGrath,
	Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams


On Fri, 8 Aug 2008, Mathieu Desnoyers wrote:

> * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > > 
> > > 
> > > That's bad :
> > > 
> > > #define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4
> > > 
> > > #define K8_NOP5 K8_NOP3 K8_NOP2
> > > 
> > > #define K7_NOP5 K7_NOP4 ASM_NOP1
> > > 
> > > So, when you try, later, to replace these instructions with a single
> > > 5-bytes instruction, a preempted thread could iret in the middle of your
> > > 5-bytes insn and cause an illegal instruction ?
> > 
> > That's why I use kstop_machine.
> > 
> 
> kstop_machine does not guarantee that you won't have _any_ thread
> preempted with its IP pointing exactly in the middle of your instructions
> _before_ the modification, and scheduled back in _after_ the modification,
> thus causing an illegal instruction.
> 
> Still buggy. :/

Hmm, good point. Unless...

Can a processor be preempted in the middle of nops?  What do nops do for a 
processor? Can it skip them nicely in one shot?

This means I'll have to do the benchmarks again, and see whether the 
performance difference between a jmp and a nop is significant. I'm thinking 
that if the processor can safely skip nops without any type of processing, 
this may be the reason that nops are better than a jmp. A jmp causes the 
processor to do a little more work.

I might even run a test to see if I can force a processor that uses the 
three-two nops to preempt between them.

I can add a test in x86 ftrace.c to check which nop was used, and 
use the jmp if the arch does not have a 5-byte nop.

I'm assuming that jmp is more expensive than the nops because otherwise
a jmp 0 would have been used as a 5 byte nop.

-- Steve

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-08 18:41               ` Steven Rostedt
@ 2008-08-08 19:04                 ` Linus Torvalds
  2008-08-08 19:05                 ` Mathieu Desnoyers
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2008-08-08 19:04 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, linux-kernel, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath,
	Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams



On Fri, 8 Aug 2008, Steven Rostedt wrote:
> 
> Can a processor be preempted in the middle of nops?

Sure. If you have two nops in a row (and the kernel definition of the NOP 
array does _not_ guarantee that it's a single-instruction one), you may 
get a profile hit (ie any interrupt) on the second one. It's less 
_likely_, but it certainly is not architecturally in any way guaranteed 
that the kernel "nop[]" tables would be atomic.

> What do nops do for a  processor?

Depends on the microarchitecture. But many will squash it in the decode 
stage, and generate no uops for them at all, so it's purely a decode 
throughput issue and has absolutely _zero_ impact for any later CPU 
stages.

> Can it skip them nicely in one shot?

See above. It needs to decode them, and the decoder itself may well have 
some limitations - for example, the longer nops may not even decode in all 
decoders, which is why some uarchs might prefer two short nops to one long 
one, but _generally_ a nop will not make it any further than the decoder. 
But separate nops do count as separate instructions, ie they will hit all 
the normal decode limits (mostly three or four instructions decoded per 
cycle).

> I'm assuming that jmp is more expensive than the nops because otherwise
> a jmp 0 would have been used as a 5 byte nop.

Yes. A CPU core _could_ certainly do special decoding for 'jmp 0' too, but 
I don't know any that do. The 'jmp' is much more likely to be special in 
the front-end and the decoder, and can easily cause things like the 
prefetcher to hiccup (ie it tries to start prefetching at the "new" target 
address).

So I would absolutely _expect_ a 'jmp' to be noticeably more expensive 
than one of the special nop's that can be squashed by the decoder. 

A nop that is squashed by the decoder will literally take absolutely no 
other resources. It doesn't even need to be tracked from an instruction 
completion standpoint (which _may_ end up meaning that a profiler would 
actually never see a hit on the second nop, but it's quite likely to 
depend on which decoder it hits etc).

			Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-08 18:41               ` Steven Rostedt
  2008-08-08 19:04                 ` Linus Torvalds
@ 2008-08-08 19:05                 ` Mathieu Desnoyers
  2008-08-08 23:38                   ` Steven Rostedt
  2008-08-08 19:08                 ` Jeremy Fitzhardinge
  2008-08-11  2:41                 ` Rusty Russell
  3 siblings, 1 reply; 56+ messages in thread
From: Mathieu Desnoyers @ 2008-08-08 19:05 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, Linus Torvalds, David Miller, Roland McGrath,
	Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> On Fri, 8 Aug 2008, Mathieu Desnoyers wrote:
> 
> > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > > > 
> > > > 
> > > > That's bad :
> > > > 
> > > > #define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4
> > > > 
> > > > #define K8_NOP5 K8_NOP3 K8_NOP2
> > > > 
> > > > #define K7_NOP5 K7_NOP4 ASM_NOP1
> > > > 
> > > > So, when you try, later, to replace these instructions with a single
> > > > 5-bytes instruction, a preempted thread could iret in the middle of your
> > > > 5-bytes insn and cause an illegal instruction ?
> > > 
> > > That's why I use kstop_machine.
> > > 
> > 
> > kstop_machine does not guarantee that you won't have _any_ thread
> > preempted with IP pointing exactly in the middle of your instructions
> > _before_ the modification scheduled back in _after_ the modification and
> > thus causing an illegal instruction.
> > 
> > Still buggy. :/
> 
> Hmm, good point. Unless...
> 
> Can a processor be preempted in the middle of nops?  What do nops do for a 
> processor? Can it skip them nicely in one shot?
> 

Given that those are multiple instructions, I think a processor has every
right to preempt in the middle of them. And even if some specific
architecture, for some obscure reason, happens to merge them, I don't
think that behavior will be portable across Intel, AMD, ...

> This means I'll have to do the benchmarks again, and see whether the 
> performance difference between a jmp and a nop is significant. I'm thinking 
> that if the processor can safely skip nops without any type of processing, 
> this may be the reason that nops are better than a jmp. A jmp causes the 
> processor to do a little more work.
> 
> I might even run a test to see if I can force a processor that uses the 
> three-two nops to preempt between them.
> 

Yup, although one architecture not triggering this doesn't say much
about the various x86 flavors out there. In any case:
- if you trigger the problem, we have to fix it.
- if you do not succeed in triggering the problem, we will have to test it
  on a wider range of architectures and maybe end up fixing it anyway to
  play safe with the specs.

So, in either case, we end up fixing the issue.


> I can add a test in x86 ftrace.c to check to see which nop was used, and 
> use the jmp if the arch does not have a 5 byte nop.
> 

I would propose the following alternative :

Create new macros in include/asm-x86/nops.h :

/* short jump, offset 3 bytes : skips total of 5 bytes */
#define GENERIC_ATOMIC_NOP5 ".byte 0xeb,0x03,0x00,0x00,0x00\n"

#if defined(CONFIG_MK7)
#define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5
#elif defined(CONFIG_X86_P6_NOP)
#define ATOMIC_NOP5 P6_NOP5
#elif defined(CONFIG_X86_64)
#define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5
#else
#define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5
#endif

And then optimize if necessary. You will probably find plenty of
knowledgeable people who know of a 5-byte nop instruction more
efficient than this "generic" short jump with offset 0x3.

Then you can use the (buggy) 3nops/2nops as a performance baseline and
see the performance hit on each architecture.

First get it right, then make it fast....

Mathieu

> I'm assuming that jmp is more expensive than the nops because otherwise
> a jmp 0 would have been used as a 5 byte nop.
> 
> -- Steve

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-08 18:41               ` Steven Rostedt
  2008-08-08 19:04                 ` Linus Torvalds
  2008-08-08 19:05                 ` Mathieu Desnoyers
@ 2008-08-08 19:08                 ` Jeremy Fitzhardinge
  2008-08-11  2:41                 ` Rusty Russell
  3 siblings, 0 replies; 56+ messages in thread
From: Jeremy Fitzhardinge @ 2008-08-08 19:08 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, linux-kernel, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller,
	Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins,
	Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves,
	Clark Williams

Steven Rostedt wrote:
> Hmm, good point. Unless...
>
> Can a processor be preempted in a middle of nops?  What do nops do for a 
> processor? Can it skip them nicely in one shot?
>   

The CPU can be interrupted between any two instructions, with a couple 
of exceptions (it must, otherwise an unbounded run of nops could cause 
an unbounded interrupt latency).  If you don't want eip to be in the 
middle of your instruction site, you need to make sure it's nopped out 
with a single instruction.

Unfortunately, aside from P6_NOP5, none of the standard asm/nops.h nops 
are suitable.  Something like "0x66, 0x66, 0x66, 0x66, 0x90" would work 
everywhere, but it's possible that putting more than 3 prefixes on the 
instruction will kick the decoder into a slow path, which may have a 
noticeable effect.

> This means I'll have to do the benchmarks again, and see whether the 
> performance difference between a jmp and a nop is significant. I'm thinking 
> that if the processor can safely skip nops without any type of processing, 
> this may be the reason that nops are better than a jmp. A jmp causes the 
> processor to do a little more work.
>   

A no-op jmp could be easily converted to a nop early in the pipeline.  
But I don't know whether anyone bothers to do that.

    J

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-08 19:05                 ` Mathieu Desnoyers
@ 2008-08-08 23:38                   ` Steven Rostedt
  2008-08-09  0:23                     ` Andi Kleen
  2008-08-09  0:30                     ` Steven Rostedt
  0 siblings, 2 replies; 56+ messages in thread
From: Steven Rostedt @ 2008-08-08 23:38 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	Linus Torvalds, David Miller, Roland McGrath, Ulrich Drepper,
	Rusty Russell, Jeremy Fitzhardinge, Gregory Haskins,
	Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves,
	Clark Williams


[ patch included ]

On Fri, 8 Aug 2008, Mathieu Desnoyers wrote:

> * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > > That's why I use kstop_machine.
> > > > 
> > > 
> > > kstop_machine does not guarantee that you won't have _any_ thread
> > > preempted with IP pointing exactly in the middle of your instructions
> > > _before_ the modification scheduled back in _after_ the modification and
> > > thus causing an illegal instruction.
> > > 
> > > Still buggy. :/
> > 
> > Hmm, good point. Unless...
> > 
> > Can a processor be preempted in the middle of nops?  What do nops do for a 
> > processor? Can it skip them nicely in one shot?
> > 
> 
> Given that those are multiple instructions, I think a processor has every
> right to preempt in the middle of them. And even if some specific
> architecture, for some obscure reason, happens to merge them, I don't
> think that behavior will be portable across Intel, AMD, ...
> 
> > This means I'll have to do the benchmarks again, and see whether the 
> > performance difference between a jmp and a nop is significant. I'm thinking 
> > that if the processor can safely skip nops without any type of processing, 
> > this may be the reason that nops are better than a jmp. A jmp causes the 
> > processor to do a little more work.
> > 
> > I might even run a test to see if I can force a processor that uses the 
> > three-two nops to preempt between them.
> > 
> 
> Yup, although one architecture not triggering this doesn't say much
> about the various x86 flavors out there. In any case:
> - if you trigger the problem, we have to fix it.
> - if you do not succeed in triggering the problem, we will have to test it
>   on a wider range of architectures and maybe end up fixing it anyway to
>   play safe with the specs.
> 
> So, in either case, we end up fixing the issue.
> 
> 
> > I can add a test in x86 ftrace.c to check to see which nop was used, and 
> > use the jmp if the arch does not have a 5 byte nop.
> > 
> 
> I would propose the following alternative :
> 
> Create new macros in include/asm-x86/nops.h :
> 
> /* short jump, offset 3 bytes : skips total of 5 bytes */
> #define GENERIC_ATOMIC_NOP5 ".byte 0xeb,0x03,0x00,0x00,0x00\n"
> 
> #if defined(CONFIG_MK7)
> #define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5
> #elif defined(CONFIG_X86_P6_NOP)
> #define ATOMIC_NOP5 P6_NOP5
> #elif defined(CONFIG_X86_64)
> #define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5
> #else
> #define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5
> #endif
> 
> And then optimize if necessary. You will probably find plenty of
> knowledgeable people who know of a 5-byte nop instruction more
> efficient than this "generic" short jump with offset 0x3.
> 
> Then you can use the (buggy) 3nops/2nops as a performance baseline and
> see the performance hit on each architecture.
> 
> First get it right, then make it fast....
> 

I'm stubborn, I want to get it right _and_ keep it fast.

I still want the NOPs. Using jmps would hurt performance, and that would 
keep this turned off on all distros.

But let's think outside the box here (and we will ignore Alan's cat).

Right now the issue is that a task might be preempted after the first nop, 
and when we enable the code, that task will crash when it resumes at what 
used to be the second nop.

Since we are doing the modifications from kstop_machine, all the tasks are 
stopped. We can simply check whether each task was preempted in kernel 
space and, if so, whether its instruction pointer points at the second nop. 
If it does, we move the ip forward.

Here's a patch that does just that for both i386 and x86_64.

I added a field to the thread_info struct called "ip". This is a pointer 
to the location of the task's saved ip on the stack if the task was 
preempted in kernel space, and NULL otherwise:

         jz restore_all
 +       lea PT_EIP(%esp), %eax
 +       movl %eax, TI_ip(%ebp)
         call preempt_schedule_irq
 +       GET_THREAD_INFO(%ebp)
 +       movl $0, TI_ip(%ebp)
         jmp need_resched


Then, just before we enable tracing (we only need to do this when we 
enable tracing, since that is when the site still holds the two-instruction 
nop), we look at all the tasks. If task->thread_info->ip is set, the task 
was preempted while running in kernel space.

We look at **ip and see whether it matches the second nop. If it does, we 
advance the saved ip by the size of that nop:

               if (memcmp(*ip, second_nop, x86_nop5_part2) == 0)
                       /* Match, move the ip forward */
                       *ip += x86_nop5_part2;



We do this just once before enabling all the locations, and we only do it 
if we have a two-part nop.

Interestingly enough, I wrote a module that did the following:

#include <linux/module.h>
#include <linux/kthread.h>

void (*silly_func)(void);

void do_something_silly(void)
{
}


static int my_thread(void *arg)
{
        int i;
        while (!kthread_should_stop()) {
                for (i = 0; i < 100; i++)
                        silly_func();
        }
        return 0;
}

static struct task_struct *p;

static int __init mcount_stress_init(void)
{
        silly_func = do_something_silly;
        p = kthread_run(my_thread, NULL, "sillytask");
        return 0;
}

static void __exit mcount_stress_exit(void)
{
        kthread_stop(p);
}

module_init(mcount_stress_init);
module_exit(mcount_stress_exit);
MODULE_LICENSE("GPL");



The do_something_silly() function had an mcount call in it. I put printks 
in the ftrace enable code to see where this thread was preempted. It was 
preempted several times before and after the nops, but never at either nop.

Maybe I didn't run it long enough (almost 2 hours), but perhaps it is very 
unlikely to be preempted at a nop when there is a real instruction coming 
up next.

Yes, a long string of nops may be preempted, but perhaps two nops followed 
by an actual instruction are skipped too quickly.

I'll write some hacks to log where tasks are preempted in the scheduler 
itself, and see if it ever preempts at the second nop.

But here's a patch that will work around the problem that we might be 
preempted within the two nops.

Note, this is only in the slow path of enabling the function tracer. It is 
only done at enable time inside kstop_machine, which has a large overhead 
anyway.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 arch/x86/kernel/alternative.c    |   29 +++++++++++++++++++-------
 arch/x86/kernel/asm-offsets_32.c |    1 
 arch/x86/kernel/asm-offsets_64.c |    1 
 arch/x86/kernel/entry_32.S       |    4 +++
 arch/x86/kernel/entry_64.S       |    4 +++
 arch/x86/kernel/ftrace.c         |   43 +++++++++++++++++++++++++++++++++++++++
 include/asm-x86/ftrace.h         |    5 ++++
 include/asm-x86/thread_info.h    |    4 +++
 kernel/trace/ftrace.c            |   12 ++++++++++
 9 files changed, 96 insertions(+), 7 deletions(-)

Index: linux-tip.git/arch/x86/kernel/alternative.c
===================================================================
--- linux-tip.git.orig/arch/x86/kernel/alternative.c	2008-06-05 11:52:24.000000000 -0400
+++ linux-tip.git/arch/x86/kernel/alternative.c	2008-08-08 16:20:23.000000000 -0400
@@ -140,13 +140,26 @@ static const unsigned char *const p6_nop
 };
 #endif
 
+/*
+ * Some versions of x86 CPUs have a two part NOP5. This
+ * can break ftrace if a process is preempted between
+ * the two. ftrace needs to know what the second nop
+ * is to handle this case.
+ */
+int x86_nop5_part2;
+
 #ifdef CONFIG_X86_64
 
 extern char __vsyscall_0;
 const unsigned char *const *find_nop_table(void)
 {
-	return boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
-	       boot_cpu_data.x86 < 6 ? k8_nops : p6_nops;
+	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
+	    boot_cpu_data.x86 < 6) {
+		x86_nop5_part2 = 2; /* K8_NOP2 */
+		return k8_nops;
+	} else
+	/* keep x86_nop5_part2 zero */
+		return p6_nops;
 }
 
 #else /* CONFIG_X86_64 */
@@ -154,12 +167,13 @@ const unsigned char *const *find_nop_tab
 static const struct nop {
 	int cpuid;
 	const unsigned char *const *noptable;
+	int nop5_part2; /* size of part2 nop */
 } noptypes[] = {
-	{ X86_FEATURE_K8, k8_nops },
-	{ X86_FEATURE_K7, k7_nops },
-	{ X86_FEATURE_P4, p6_nops },
-	{ X86_FEATURE_P3, p6_nops },
-	{ -1, NULL }
+	{ X86_FEATURE_K8, k8_nops, 2},
+	{ X86_FEATURE_K7, k7_nops, 1 },
+	{ X86_FEATURE_P4, p6_nops, 0 },
+	{ X86_FEATURE_P3, p6_nops, 0 },
+	{ -1, NULL, 0 }
 };
 
 const unsigned char *const *find_nop_table(void)
@@ -170,6 +184,7 @@ const unsigned char *const *find_nop_tab
 	for (i = 0; noptypes[i].cpuid >= 0; i++) {
 		if (boot_cpu_has(noptypes[i].cpuid)) {
 			noptable = noptypes[i].noptable;
+			x86_nop5_part2 = noptypes[i].nop5_part2;
 			break;
 		}
 	}
Index: linux-tip.git/arch/x86/kernel/asm-offsets_32.c
===================================================================
--- linux-tip.git.orig/arch/x86/kernel/asm-offsets_32.c	2008-07-27 10:43:26.000000000 -0400
+++ linux-tip.git/arch/x86/kernel/asm-offsets_32.c	2008-08-08 15:46:55.000000000 -0400
@@ -59,6 +59,7 @@ void foo(void)
 	OFFSET(TI_restart_block, thread_info, restart_block);
 	OFFSET(TI_sysenter_return, thread_info, sysenter_return);
 	OFFSET(TI_cpu, thread_info, cpu);
+	OFFSET(TI_ip, thread_info, ip);
 	BLANK();
 
 	OFFSET(GDS_size, desc_ptr, size);
Index: linux-tip.git/arch/x86/kernel/asm-offsets_64.c
===================================================================
--- linux-tip.git.orig/arch/x86/kernel/asm-offsets_64.c	2008-07-27 10:43:26.000000000 -0400
+++ linux-tip.git/arch/x86/kernel/asm-offsets_64.c	2008-08-08 15:52:34.000000000 -0400
@@ -41,6 +41,7 @@ int main(void)
 	ENTRY(addr_limit);
 	ENTRY(preempt_count);
 	ENTRY(status);
+	ENTRY(ip);
 #ifdef CONFIG_IA32_EMULATION
 	ENTRY(sysenter_return);
 #endif
Index: linux-tip.git/arch/x86/kernel/entry_32.S
===================================================================
--- linux-tip.git.orig/arch/x86/kernel/entry_32.S	2008-07-27 10:43:26.000000000 -0400
+++ linux-tip.git/arch/x86/kernel/entry_32.S	2008-08-08 17:13:27.000000000 -0400
@@ -304,7 +304,11 @@ need_resched:
 	jz restore_all
 	testl $X86_EFLAGS_IF,PT_EFLAGS(%esp)	# interrupts off (exception path) ?
 	jz restore_all
+	lea PT_EIP(%esp), %eax
+	movl %eax, TI_ip(%ebp)
 	call preempt_schedule_irq
+	GET_THREAD_INFO(%ebp)
+	movl $0, TI_ip(%ebp)
 	jmp need_resched
 END(resume_kernel)
 #endif
Index: linux-tip.git/arch/x86/kernel/entry_64.S
===================================================================
--- linux-tip.git.orig/arch/x86/kernel/entry_64.S	2008-07-27 10:43:26.000000000 -0400
+++ linux-tip.git/arch/x86/kernel/entry_64.S	2008-08-08 17:12:47.000000000 -0400
@@ -837,7 +837,11 @@ ENTRY(retint_kernel)
 	jnc  retint_restore_args
 	bt   $9,EFLAGS-ARGOFFSET(%rsp)	/* interrupts off? */
 	jnc  retint_restore_args
+	leaq RIP-ARGOFFSET(%rsp), %rax
+	movq %rax, TI_ip(%rcx)
 	call preempt_schedule_irq
+	GET_THREAD_INFO(%rcx)
+	movq $0, TI_ip(%rcx)
 	jmp exit_intr
 #endif	
 
Index: linux-tip.git/arch/x86/kernel/ftrace.c
===================================================================
--- linux-tip.git.orig/arch/x86/kernel/ftrace.c	2008-06-26 14:58:54.000000000 -0400
+++ linux-tip.git/arch/x86/kernel/ftrace.c	2008-08-08 17:48:04.000000000 -0400
@@ -127,6 +127,46 @@ notrace int ftrace_mcount_set(unsigned l
 	return 0;
 }
 
+static const unsigned char *second_nop;
+
+void arch_ftrace_pre_enable(void)
+{
+	struct task_struct *g, *p;
+	unsigned long **ip;
+
+	int i;
+
+	if (!second_nop)
+		return;
+
+	/*
+	 * x86 has a two-part nop to cover the 5-byte call site.
+	 * If a task was preempted after the first nop and has not
+	 * yet run the second one, modifying the code can crash
+	 * the system. Thus, we look at all the tasks, and if
+	 * any of them was preempted and will run the second nop
+	 * next, we simply move its saved ip pointer past the
+	 * second nop.
+	 */
+
+	/*
+	 * Don't need to grab the task list lock, we are running
+	 * in kstop_machine
+	 */
+	do_each_thread(g, p) {
+		/*
+		 * In entry.S we save the ip when a task is preempted
+		 * and reset it when it is back running.
+		 */
+		ip = task_thread_info(p)->ip;
+		if (!ip)
+			continue;
+		if (memcmp(*ip, second_nop, x86_nop5_part2) == 0)
+			/* Match, move the ip forward */
+			*ip += x86_nop5_part2;
+	} while_each_thread(g, p);
+}
+
 int __init ftrace_dyn_arch_init(void *data)
 {
 	const unsigned char *const *noptable = find_nop_table();
@@ -137,5 +177,8 @@ int __init ftrace_dyn_arch_init(void *da
 
 	ftrace_nop = (unsigned long *)noptable[MCOUNT_INSN_SIZE];
 
+	if (x86_nop5_part2)
+		second_nop = noptable[x86_nop5_part2];
+
 	return 0;
 }
Index: linux-tip.git/include/asm-x86/ftrace.h
===================================================================
--- linux-tip.git.orig/include/asm-x86/ftrace.h	2008-08-08 13:00:51.000000000 -0400
+++ linux-tip.git/include/asm-x86/ftrace.h	2008-08-08 16:41:09.000000000 -0400
@@ -17,6 +17,11 @@ static inline unsigned long ftrace_call_
 	 */
 	return addr - 1;
 }
+
+extern int x86_nop5_part2;
+extern void arch_ftrace_pre_enable(void);
+#define ftrace_pre_enable arch_ftrace_pre_enable
+
 #endif
 
 #endif /* CONFIG_FTRACE */
Index: linux-tip.git/include/asm-x86/thread_info.h
===================================================================
--- linux-tip.git.orig/include/asm-x86/thread_info.h	2008-08-07 11:14:43.000000000 -0400
+++ linux-tip.git/include/asm-x86/thread_info.h	2008-08-08 17:06:15.000000000 -0400
@@ -29,6 +29,9 @@ struct thread_info {
 	__u32			cpu;		/* current CPU */
 	int			preempt_count;	/* 0 => preemptable,
 						   <0 => BUG */
+	unsigned long		**ip;		/* pointer to ip on stack
+						   when preempted
+						*/
 	mm_segment_t		addr_limit;
 	struct restart_block    restart_block;
 	void __user		*sysenter_return;
@@ -47,6 +50,7 @@ struct thread_info {
 	.flags		= 0,			\
 	.cpu		= 0,			\
 	.preempt_count	= 1,			\
+	.ip		= NULL,			\
 	.addr_limit	= KERNEL_DS,		\
 	.restart_block = {			\
 		.fn = do_no_restart_syscall,	\
Index: linux-tip.git/kernel/trace/ftrace.c
===================================================================
--- linux-tip.git.orig/kernel/trace/ftrace.c	2008-08-08 13:00:52.000000000 -0400
+++ linux-tip.git/kernel/trace/ftrace.c	2008-08-08 16:18:14.000000000 -0400
@@ -32,6 +32,10 @@
 
 #include "trace.h"
 
+#ifndef ftrace_pre_enable
+# define ftrace_pre_enable() do { } while (0)
+#endif
+
 /* ftrace_enabled is a method to turn ftrace on or off */
 int ftrace_enabled __read_mostly;
 static int last_ftrace_enabled;
@@ -500,6 +504,14 @@ static void ftrace_replace_code(int enab
 	else
 		new = ftrace_nop_replace();
 
+	/*
+	 * Some archs *cough*x86*cough* have more than one nop to cover
+	 * the call to mcount. In these cases, special care must be taken
+	 * before we start converting nops into calls.
+	 */
+	if (enable)
+		ftrace_pre_enable();
+
 	for (pg = ftrace_pages_start; pg; pg = pg->next) {
 		for (i = 0; i < pg->index; i++) {
 			rec = &pg->records[i];

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-08 23:38                   ` Steven Rostedt
@ 2008-08-09  0:23                     ` Andi Kleen
  2008-08-09  0:36                       ` Steven Rostedt
  2008-08-09  0:30                     ` Steven Rostedt
  1 sibling, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2008-08-09  0:23 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, LKML, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller,
	Roland McGrath, Ulrich Drepper, Rusty Russell,
	Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

Steven Rostedt <rostedt@goodmis.org> writes:

> I'm stubborn, I want to get it right _and_ keep it fast.

For me it would seem better to just not use two-part 5-byte nops
instead of adding such hacks.  I doubt there are that many of them
anyway. I bet you won't be able to measure any difference between the
different nop types in any macro benchmark.

-Andi

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-08 23:38                   ` Steven Rostedt
  2008-08-09  0:23                     ` Andi Kleen
@ 2008-08-09  0:30                     ` Steven Rostedt
  2008-08-11 18:21                       ` Mathieu Desnoyers
  1 sibling, 1 reply; 56+ messages in thread
From: Steven Rostedt @ 2008-08-09  0:30 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	Linus Torvalds, David Miller, Roland McGrath, Ulrich Drepper,
	Rusty Russell, Jeremy Fitzhardinge, Gregory Haskins,
	Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves,
	Clark Williams


[ updated patch ]

On Fri, 8 Aug 2008, Steven Rostedt wrote:
> 
> 
> The do_something_silly() function had an mcount call in it. I put printks 
> in the ftrace enable code to see where this thread was preempted. It was 
> preempted several times before and after the nops, but never at either nop.
> 
> Maybe I didn't run it long enough (almost 2 hours), but perhaps it is very 
> unlikely to be preempted at a nop when there is a real instruction coming 
> up next.
> 
> Yes, a long string of nops may be preempted, but perhaps two nops followed 
> by an actual instruction are skipped too quickly.
> 
> I'll write some hacks to log where tasks are preempted in the scheduler 
> itself, and see if it ever preempts at the second nop.

I put that code in the scheduler, and it does indeed preempt in the middle 
of the nops! It also uncovered a nasty bug I had in my code:

[...]

> Index: linux-tip.git/arch/x86/kernel/ftrace.c
> ===================================================================
> --- linux-tip.git.orig/arch/x86/kernel/ftrace.c	2008-06-26 14:58:54.000000000 -0400
> +++ linux-tip.git/arch/x86/kernel/ftrace.c	2008-08-08 17:48:04.000000000 -0400
> @@ -127,6 +127,46 @@ notrace int ftrace_mcount_set(unsigned l
>  	return 0;
>  }
>  
> +static const unsigned char *second_nop;
> +
> +void arch_ftrace_pre_enable(void)
> +{
> +	struct task_struct *g, *p;
> +	unsigned long **ip;
> +

[...]

> +		ip = task_thread_info(p)->ip;
> +		if (!ip)
> +			continue;
> +		if (memcmp(*ip, second_nop, x86_nop5_part2) == 0)
> +			/* Match, move the ip forward */
> +			*ip += x86_nop5_part2;

Notice the bug? Hint, *ip is of type unsigned long *.

Here's the updated fix:

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 arch/x86/kernel/alternative.c    |   29 ++++++++++++++++++++-------
 arch/x86/kernel/asm-offsets_32.c |    1 
 arch/x86/kernel/asm-offsets_64.c |    1 
 arch/x86/kernel/entry_32.S       |    4 +++
 arch/x86/kernel/entry_64.S       |    4 +++
 arch/x86/kernel/ftrace.c         |   41 +++++++++++++++++++++++++++++++++++++++
 include/asm-x86/ftrace.h         |    5 ++++
 include/asm-x86/thread_info.h    |    4 +++
 kernel/trace/ftrace.c            |   12 +++++++++++
 9 files changed, 94 insertions(+), 7 deletions(-)

Index: linux-tip.git/arch/x86/kernel/alternative.c
===================================================================
--- linux-tip.git.orig/arch/x86/kernel/alternative.c	2008-06-05 11:52:24.000000000 -0400
+++ linux-tip.git/arch/x86/kernel/alternative.c	2008-08-08 16:20:23.000000000 -0400
@@ -140,13 +140,26 @@ static const unsigned char *const p6_nop
 };
 #endif
 
+/*
+ * Some versions of x86 CPUs have a two part NOP5. This
+ * can break ftrace if a process is preempted between
+ * the two. ftrace needs to know what the second nop
+ * is to handle this case.
+ */
+int x86_nop5_part2;
+
 #ifdef CONFIG_X86_64
 
 extern char __vsyscall_0;
 const unsigned char *const *find_nop_table(void)
 {
-	return boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
-	       boot_cpu_data.x86 < 6 ? k8_nops : p6_nops;
+	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
+	    boot_cpu_data.x86 < 6) {
+		x86_nop5_part2 = 2; /* K8_NOP2 */
+		return k8_nops;
+	} else
+		/* keep x86_nop5_part2 zero */
+		return p6_nops;
 }
 
 #else /* CONFIG_X86_64 */
@@ -154,12 +167,13 @@ const unsigned char *const *find_nop_tab
 static const struct nop {
 	int cpuid;
 	const unsigned char *const *noptable;
+	int nop5_part2;
 } noptypes[] = {
-	{ X86_FEATURE_K8, k8_nops },
-	{ X86_FEATURE_K7, k7_nops },
-	{ X86_FEATURE_P4, p6_nops },
-	{ X86_FEATURE_P3, p6_nops },
-	{ -1, NULL }
+	{ X86_FEATURE_K8, k8_nops, 2},
+	{ X86_FEATURE_K7, k7_nops, 1 },
+	{ X86_FEATURE_P4, p6_nops, 0 },
+	{ X86_FEATURE_P3, p6_nops, 0 },
+	{ -1, NULL, 0 }
 };
 
 const unsigned char *const *find_nop_table(void)
@@ -170,6 +184,7 @@ const unsigned char *const *find_nop_tab
 	for (i = 0; noptypes[i].cpuid >= 0; i++) {
 		if (boot_cpu_has(noptypes[i].cpuid)) {
 			noptable = noptypes[i].noptable;
+			x86_nop5_part2 = noptypes[i].nop5_part2;
 			break;
 		}
 	}
Index: linux-tip.git/arch/x86/kernel/asm-offsets_32.c
===================================================================
--- linux-tip.git.orig/arch/x86/kernel/asm-offsets_32.c	2008-07-27 10:43:26.000000000 -0400
+++ linux-tip.git/arch/x86/kernel/asm-offsets_32.c	2008-08-08 15:46:55.000000000 -0400
@@ -59,6 +59,7 @@ void foo(void)
 	OFFSET(TI_restart_block, thread_info, restart_block);
 	OFFSET(TI_sysenter_return, thread_info, sysenter_return);
 	OFFSET(TI_cpu, thread_info, cpu);
+	OFFSET(TI_ip, thread_info, ip);
 	BLANK();
 
 	OFFSET(GDS_size, desc_ptr, size);
Index: linux-tip.git/arch/x86/kernel/asm-offsets_64.c
===================================================================
--- linux-tip.git.orig/arch/x86/kernel/asm-offsets_64.c	2008-07-27 10:43:26.000000000 -0400
+++ linux-tip.git/arch/x86/kernel/asm-offsets_64.c	2008-08-08 15:52:34.000000000 -0400
@@ -41,6 +41,7 @@ int main(void)
 	ENTRY(addr_limit);
 	ENTRY(preempt_count);
 	ENTRY(status);
+	ENTRY(ip);
 #ifdef CONFIG_IA32_EMULATION
 	ENTRY(sysenter_return);
 #endif
Index: linux-tip.git/arch/x86/kernel/entry_32.S
===================================================================
--- linux-tip.git.orig/arch/x86/kernel/entry_32.S	2008-07-27 10:43:26.000000000 -0400
+++ linux-tip.git/arch/x86/kernel/entry_32.S	2008-08-08 17:13:27.000000000 -0400
@@ -304,7 +304,11 @@ need_resched:
 	jz restore_all
 	testl $X86_EFLAGS_IF,PT_EFLAGS(%esp)	# interrupts off (exception path) ?
 	jz restore_all
+	lea PT_EIP(%esp), %eax
+	movl %eax, TI_ip(%ebp)
 	call preempt_schedule_irq
+	GET_THREAD_INFO(%ebp)
+	movl $0, TI_ip(%ebp)
 	jmp need_resched
 END(resume_kernel)
 #endif
Index: linux-tip.git/arch/x86/kernel/entry_64.S
===================================================================
--- linux-tip.git.orig/arch/x86/kernel/entry_64.S	2008-07-27 10:43:26.000000000 -0400
+++ linux-tip.git/arch/x86/kernel/entry_64.S	2008-08-08 17:12:47.000000000 -0400
@@ -837,7 +837,11 @@ ENTRY(retint_kernel)
 	jnc  retint_restore_args
 	bt   $9,EFLAGS-ARGOFFSET(%rsp)	/* interrupts off? */
 	jnc  retint_restore_args
+	leaq RIP-ARGOFFSET(%rsp), %rax
+	movq %rax, TI_ip(%rcx)
 	call preempt_schedule_irq
+	GET_THREAD_INFO(%rcx)
+	movq $0, TI_ip(%rcx)
 	jmp exit_intr
 #endif	
 
Index: linux-tip.git/arch/x86/kernel/ftrace.c
===================================================================
--- linux-tip.git.orig/arch/x86/kernel/ftrace.c	2008-06-26 14:58:54.000000000 -0400
+++ linux-tip.git/arch/x86/kernel/ftrace.c	2008-08-08 20:22:19.000000000 -0400
@@ -127,6 +127,44 @@ notrace int ftrace_mcount_set(unsigned l
 	return 0;
 }
 
+static const unsigned char *second_nop;
+
+void arch_ftrace_pre_enable(void)
+{
+	struct task_struct *g, *p;
+	void **ip;
+
+	if (!second_nop)
+		return;
+
+	/*
+	 * x86 has a two-part nop to cover the 5-byte call site.
+	 * If a task was preempted after the first nop and has not
+	 * yet run the second one, modifying the code can crash
+	 * the system. Thus, we look at all the tasks, and if
+	 * any of them was preempted and will run the second nop
+	 * next, we simply move its saved ip pointer past the
+	 * second nop.
+	 */
+
+	/*
+	 * Don't need to grab the task list lock, we are running
+	 * in kstop_machine
+	 */
+	do_each_thread(g, p) {
+		/*
+		 * In entry.S we save the ip when a task is preempted
+		 * and reset it when it is back running.
+		 */
+		ip = (void **)task_thread_info(p)->ip;
+		if (!ip)
+			continue;
+		if (memcmp(*ip, second_nop, x86_nop5_part2) == 0)
+			/* Match, move the ip forward */
+			*ip += x86_nop5_part2;
+	} while_each_thread(g, p);
+}
+
 int __init ftrace_dyn_arch_init(void *data)
 {
 	const unsigned char *const *noptable = find_nop_table();
@@ -137,5 +175,8 @@ int __init ftrace_dyn_arch_init(void *da
 
 	ftrace_nop = (unsigned long *)noptable[MCOUNT_INSN_SIZE];
 
+	if (x86_nop5_part2)
+		second_nop = noptable[x86_nop5_part2];
+
 	return 0;
 }
Index: linux-tip.git/include/asm-x86/ftrace.h
===================================================================
--- linux-tip.git.orig/include/asm-x86/ftrace.h	2008-08-08 13:00:51.000000000 -0400
+++ linux-tip.git/include/asm-x86/ftrace.h	2008-08-08 16:41:09.000000000 -0400
@@ -17,6 +17,11 @@ static inline unsigned long ftrace_call_
 	 */
 	return addr - 1;
 }
+
+extern int x86_nop5_part2;
+extern void arch_ftrace_pre_enable(void);
+#define ftrace_pre_enable arch_ftrace_pre_enable
+
 #endif
 
 #endif /* CONFIG_FTRACE */
Index: linux-tip.git/include/asm-x86/thread_info.h
===================================================================
--- linux-tip.git.orig/include/asm-x86/thread_info.h	2008-08-07 11:14:43.000000000 -0400
+++ linux-tip.git/include/asm-x86/thread_info.h	2008-08-08 17:06:15.000000000 -0400
@@ -29,6 +29,9 @@ struct thread_info {
 	__u32			cpu;		/* current CPU */
 	int			preempt_count;	/* 0 => preemptable,
 						   <0 => BUG */
+	unsigned long		**ip;		/* pointer to ip on stack
+						   when preempted
+						*/
 	mm_segment_t		addr_limit;
 	struct restart_block    restart_block;
 	void __user		*sysenter_return;
@@ -47,6 +50,7 @@ struct thread_info {
 	.flags		= 0,			\
 	.cpu		= 0,			\
 	.preempt_count	= 1,			\
+	.ip		= NULL,			\
 	.addr_limit	= KERNEL_DS,		\
 	.restart_block = {			\
 		.fn = do_no_restart_syscall,	\
Index: linux-tip.git/kernel/trace/ftrace.c
===================================================================
--- linux-tip.git.orig/kernel/trace/ftrace.c	2008-08-08 13:00:52.000000000 -0400
+++ linux-tip.git/kernel/trace/ftrace.c	2008-08-08 16:18:14.000000000 -0400
@@ -32,6 +32,10 @@
 
 #include "trace.h"
 
+#ifndef ftrace_pre_enable
+# define ftrace_pre_enable() do { } while (0)
+#endif
+
 /* ftrace_enabled is a method to turn ftrace on or off */
 int ftrace_enabled __read_mostly;
 static int last_ftrace_enabled;
@@ -500,6 +504,14 @@ static void ftrace_replace_code(int enab
 	else
 		new = ftrace_nop_replace();
 
+	/*
+	 * Some archs *cough*x86*cough* have more than one nop to cover
+	 * the call to mcount. In these cases, special care must be taken
+	 * before we start converting nops into calls.
+	 */
+	if (enable)
+		ftrace_pre_enable();
+
 	for (pg = ftrace_pages_start; pg; pg = pg->next) {
 		for (i = 0; i < pg->index; i++) {
 			rec = &pg->records[i];

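The pre-enable fixup in the patch above can be modeled in a few lines: if a preempted task's saved ip points at the second half of the two-part 5-byte nop, it is advanced past that half so the task can never resume into the middle of the freshly written call. Below is a rough Python model, not kernel code — the real patch compares the bytes at the saved ip against the second-nop pattern, while this sketch compares addresses, and all names and addresses are made up:

```python
# Model of the two-part-nop fixup described in the patch (illustrative
# only, not kernel code). A 5-byte mcount site is filled with a 3-byte
# nop followed by a 2-byte nop; a task preempted between the two must
# have its saved ip moved past the second nop before the site is
# rewritten as a single call instruction.

X86_NOP5_PART2 = 2   # length of the second nop, as in the patch

def fixup_saved_ip(saved_ip, site_start):
    """Advance saved_ip past the second nop if it points right at it."""
    second_nop = site_start + (5 - X86_NOP5_PART2)
    if saved_ip == second_nop:
        return saved_ip + X86_NOP5_PART2  # skip over the second nop
    return saved_ip

# Preempted exactly between the two nops: ip moves past the whole site.
assert fixup_saved_ip(0x1003, 0x1000) == 0x1005
# Preempted anywhere else: the saved ip is left alone.
assert fixup_saved_ip(0x2000, 0x1000) == 0x2000
```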
^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-09  0:23                     ` Andi Kleen
@ 2008-08-09  0:36                       ` Steven Rostedt
  2008-08-09  0:47                         ` Jeremy Fitzhardinge
                                           ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Steven Rostedt @ 2008-08-09  0:36 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Mathieu Desnoyers, LKML, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller,
	Roland McGrath, Ulrich Drepper, Rusty Russell,
	Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams


On Sat, 9 Aug 2008, Andi Kleen wrote:

> Steven Rostedt <rostedt@goodmis.org> writes:
> 
> > I'm stubborn, I want to get it right _and_ keep it fast.
> 
> For me it would seem better to just not use two part 5 byte nops
> instead of adding such hacks.  I doubt there are that many of them
> anyways. I bet you won't be able to measure any difference between the
> different nop types in any macro benchmark.

I wish we had a true 5 byte nop. The alternative is a jmp 0, which is 
measurable.  This is replacing mcount from a kernel compile with the -pg 
option. With a basic build (not counting modules), I have over 15,000 
locations that are turned into these 5 byte nops.

# objdump -dr vmlinux.o | grep mcount |wc
  15152   45489  764924

If we use the jmp 0, then yes, we will see the overhead. The double nop 
that is used for 5 bytes is significantly better than the jump.
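For reference, the 5-byte fillers being weighed in this thread encode as follows. The two-part nop bytes are my reading of the K8-style 3-byte + 2-byte pair from asm-x86/nops.h of this era; treat the exact byte values as illustrative:

```python
# The 5-byte candidates discussed in this thread, for reference.
# The two-part nop shown is the K8-style 3-byte + 2-byte pair.
CANDIDATES = {
    "jmp rel32, disp 0":      bytes([0xe9, 0x00, 0x00, 0x00, 0x00]),
    "two-part nop (3+2)":     bytes([0x66, 0x66, 0x90, 0x66, 0x90]),
    "four 0x66 prefixes":     bytes([0x66, 0x66, 0x66, 0x66, 0x90]),
    "P6 long nop":            bytes([0x0f, 0x1f, 0x44, 0x00, 0x00]),
}
for name, enc in CANDIDATES.items():
    assert len(enc) == 5          # every candidate fills the call site
    print(f"{name:20s} {enc.hex()}")
```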

-- Steve



* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-09  0:36                       ` Steven Rostedt
@ 2008-08-09  0:47                         ` Jeremy Fitzhardinge
  2008-08-09  0:51                           ` Linus Torvalds
  2008-08-09  0:51                           ` Steven Rostedt
  2008-08-09  0:53                         ` Roland McGrath
  2008-08-09  1:19                         ` Andi Kleen
  2 siblings, 2 replies; 56+ messages in thread
From: Jeremy Fitzhardinge @ 2008-08-09  0:47 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, Mathieu Desnoyers, LKML, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller,
	Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins,
	Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves,
	Clark Williams

Steven Rostedt wrote:
> I wish we had a true 5 byte nop. 

0x66 0x66 0x66 0x66 0x90

    J


* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-09  0:47                         ` Jeremy Fitzhardinge
@ 2008-08-09  0:51                           ` Linus Torvalds
  2008-08-09  1:25                             ` Steven Rostedt
  2008-08-09  0:51                           ` Steven Rostedt
  1 sibling, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2008-08-09  0:51 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Steven Rostedt, Andi Kleen, Mathieu Desnoyers, LKML, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller,
	Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins,
	Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves,
	Clark Williams



On Fri, 8 Aug 2008, Jeremy Fitzhardinge wrote:
>
> Steven Rostedt wrote:
> > I wish we had a true 5 byte nop. 
> 
> 0x66 0x66 0x66 0x66 0x90

I don't think so. Multiple redundant prefixes can be really expensive on 
some uarchs.

A no-op that isn't cheap isn't a no-op at all, it's a slow-op.

			Linus


* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-09  0:47                         ` Jeremy Fitzhardinge
  2008-08-09  0:51                           ` Linus Torvalds
@ 2008-08-09  0:51                           ` Steven Rostedt
  1 sibling, 0 replies; 56+ messages in thread
From: Steven Rostedt @ 2008-08-09  0:51 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Mathieu Desnoyers, LKML, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller,
	Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins,
	Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves,
	Clark Williams



On Fri, 8 Aug 2008, Jeremy Fitzhardinge wrote:

> Steven Rostedt wrote:
> > I wish we had a true 5 byte nop. 
> 
> 0x66 0x66 0x66 0x66 0x90
> 

But I quote you:

> Unfortunately, aside from P6_NOP5, none of the standard asm/nops.h nops 
> are suitable.  Something like "0x66, 0x66, 0x66, 0x66, 0x90" would work
> everywhere, but its possible putting more than 3 prefixes on the 
> instruction will kick the decoder into a slow path, which may have a 
> noticable effect.

I guess the next step is to test if it is slow.

-- Steve



* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-09  0:36                       ` Steven Rostedt
  2008-08-09  0:47                         ` Jeremy Fitzhardinge
@ 2008-08-09  0:53                         ` Roland McGrath
  2008-08-09  1:13                           ` Andi Kleen
  2008-08-09  1:19                         ` Andi Kleen
  2 siblings, 1 reply; 56+ messages in thread
From: Roland McGrath @ 2008-08-09  0:53 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, Mathieu Desnoyers, LKML, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller,
	Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

Are we sure there aren't some good available 5-byte nops?

e.g. for 64-bit maybe 3e 48 8d 76 00
     for 32-bit maybe 3e 8d 74 26 00

Have the chip folks confirmed that exactly the set we have in
asm-x86/nops.h are the only optimal ones?


Thanks,
Roland


* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-09  0:53                         ` Roland McGrath
@ 2008-08-09  1:13                           ` Andi Kleen
  0 siblings, 0 replies; 56+ messages in thread
From: Andi Kleen @ 2008-08-09  1:13 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Steven Rostedt, Andi Kleen, Mathieu Desnoyers, LKML, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, Andrew Morton, Linus Torvalds,
	David Miller, Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

On Fri, Aug 08, 2008 at 05:53:26PM -0700, Roland McGrath wrote:
> Are we sure there aren't some good available 5-byte nops?
> 
> e.g. for 64-bit maybe 3e 48 8d 76 00
>      for 32-bit maybe 3e 8d 74 26 00
> 
> Have the chip folks confirmed that exactly the set we have in
> asm-x86/nops.h are the only optimal ones?

They were based on the respective optimization manuals, so yes.

-Andi


* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-09  0:36                       ` Steven Rostedt
  2008-08-09  0:47                         ` Jeremy Fitzhardinge
  2008-08-09  0:53                         ` Roland McGrath
@ 2008-08-09  1:19                         ` Andi Kleen
  2008-08-09  1:30                           ` Steven Rostedt
  2008-08-09  4:12                           ` Steven Rostedt
  2 siblings, 2 replies; 56+ messages in thread
From: Andi Kleen @ 2008-08-09  1:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, Mathieu Desnoyers, LKML, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller,
	Roland McGrath, Ulrich Drepper, Rusty Russell,
	Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

> I wish we had a true 5 byte nop. The alternative is a jmp 0, which is 
> measurable.  This is replacing mcount from a kernel compile with the -pg 

Scary usage.  How much is the difference? 

BTW a good way to get most of that back would be to not use frame 
pointers if you have them enabled ;-) Unfortunately recently
the "select FRAME_POINTER"s keep creeping all over so far too many kernels
suffer from this slow mode now :-/

-Andi


* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-09  0:51                           ` Linus Torvalds
@ 2008-08-09  1:25                             ` Steven Rostedt
  2008-08-13  6:31                               ` Mathieu Desnoyers
  0 siblings, 1 reply; 56+ messages in thread
From: Steven Rostedt @ 2008-08-09  1:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeremy Fitzhardinge, Andi Kleen, Mathieu Desnoyers, LKML,
	Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams


On Fri, 8 Aug 2008, Linus Torvalds wrote:
> 
> 
> On Fri, 8 Aug 2008, Jeremy Fitzhardinge wrote:
> >
> > Steven Rostedt wrote:
> > > I wish we had a true 5 byte nop. 
> > 
> > 0x66 0x66 0x66 0x66 0x90
> 
> I don't think so. Multiple redundant prefixes can be really expensive on 
> some uarchs.
> 
> A no-op that isn't cheap isn't a no-op at all, it's a slow-op.


A quick meaningless benchmark showed a slight performance hit.

Here's 10 runs of "hackbench 50" using the two part 5 byte nop:

run 1
Time: 4.501
run 2
Time: 4.855
run 3
Time: 4.198
run 4
Time: 4.587
run 5
Time: 5.016
run 6
Time: 4.757
run 7
Time: 4.477
run 8
Time: 4.693
run 9
Time: 4.710
run 10
Time: 4.715
avg = 4.6509


And 10 runs using the above 5 byte nop:

run 1
Time: 4.832
run 2
Time: 5.319
run 3
Time: 5.213
run 4
Time: 4.830
run 5
Time: 4.363
run 6
Time: 4.391
run 7
Time: 4.772
run 8
Time: 4.992
run 9
Time: 4.727
run 10
Time: 4.825
avg = 4.8264

# cat /proc/cpuinfo
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 15
model		: 65
model name	: Dual-Core AMD Opteron(tm) Processor 2220
stepping	: 3
cpu MHz		: 2799.992
cache size	: 1024 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic 
cr8_legacy
bogomips	: 5599.98
clflush size	: 64
power management: ts fid vid ttp tm stc

There's 4 of these.

Just to make sure, I ran the above nop test again:

[ this is reverse from the above runs ]

run 1
Time: 4.723
run 2
Time: 5.080
run 3
Time: 4.521
run 4
Time: 4.841
run 5
Time: 4.696
run 6
Time: 4.946
run 7
Time: 4.754
run 8
Time: 4.717
run 9
Time: 4.905
run 10
Time: 4.814
avg = 4.7997

And again the two part nop:

run 1
Time: 4.434
run 2
Time: 4.496
run 3
Time: 4.801
run 4
Time: 4.714
run 5
Time: 4.631
run 6
Time: 5.178
run 7
Time: 4.728
run 8
Time: 4.920
run 9
Time: 4.898
run 10
Time: 4.770
avg = 4.757


This time it was close, but still seems to have some difference.

heh, perhaps it's just noise.
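As a sanity check, the averages quoted for the first pair of runs can be recomputed from the individual times above:

```python
# Recompute the averages quoted above from the first pair of runs.
two_part = [4.501, 4.855, 4.198, 4.587, 5.016,
            4.757, 4.477, 4.693, 4.710, 4.715]
prefixed = [4.832, 5.319, 5.213, 4.830, 4.363,
            4.391, 4.772, 4.992, 4.727, 4.825]

def avg(xs):
    return sum(xs) / len(xs)

print(round(avg(two_part), 4))   # matches the quoted 4.6509
print(round(avg(prefixed), 4))   # matches the quoted 4.8264

# The gap is a few percent -- about the size of the run-to-run spread,
# which is why "perhaps it's just noise" is a fair conclusion.
print(round((avg(prefixed) / avg(two_part) - 1) * 100, 1))
```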

-- Steve



* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-09  1:19                         ` Andi Kleen
@ 2008-08-09  1:30                           ` Steven Rostedt
  2008-08-09  1:55                             ` Andi Kleen
  2008-08-09  4:12                           ` Steven Rostedt
  1 sibling, 1 reply; 56+ messages in thread
From: Steven Rostedt @ 2008-08-09  1:30 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Mathieu Desnoyers, LKML, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller,
	Roland McGrath, Ulrich Drepper, Rusty Russell,
	Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams


On Sat, 9 Aug 2008, Andi Kleen wrote:

> > I wish we had a true 5 byte nop. The alternative is a jmp 0, which is 
> > measurable.  This is replacing mcount from a kernel compile with the -pg 
> 
> Scary usage.  How much is the difference? 
> 
> BTW a good way to get most of that back would be to not use frame 
> pointers if you have them enabled ;-) Unfortunately recently
> the "select FRAME_POINTER"s keep creeping all over so far too many kernels
> suffer from this slow mode now :-/

Funny, CONFIG_FTRACE happens to select that. Now the question is, would 
mcount work without it?

i386:
ENTRY(mcount)
        pushl %eax
        pushl %ecx
        pushl %edx
        movl 0xc(%esp), %eax
        movl 0x4(%ebp), %edx  <-- we record the parent ip here
        subl $MCOUNT_INSN_SIZE, %eax

        call *ftrace_trace_function

        popl %edx
        popl %ecx
        popl %eax
	ret


x86_64:
ENTRY(mcount)
        subq $0x38, %rsp
        movq %rax, (%rsp)
        movq %rcx, 8(%rsp)
        movq %rdx, 16(%rsp)
        movq %rsi, 24(%rsp)
        movq %rdi, 32(%rsp)
        movq %r8, 40(%rsp)
        movq %r9, 48(%rsp)

        movq 0x38(%rsp), %rdi
        movq 8(%rbp), %rsi <-- we record the parent pointer here
        subq $MCOUNT_INSN_SIZE, %rdi

        call   *ftrace_trace_function

        movq 48(%rsp), %r9
        movq 40(%rsp), %r8
        movq 32(%rsp), %rdi
        movq 24(%rsp), %rsi
        movq 16(%rsp), %rdx
        movq 8(%rsp), %rcx
        movq (%rsp), %rax
        addq $0x38, %rsp
	retq

-- Steve



* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-09  1:30                           ` Steven Rostedt
@ 2008-08-09  1:55                             ` Andi Kleen
  2008-08-09  2:03                               ` Steven Rostedt
  0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2008-08-09  1:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, Mathieu Desnoyers, LKML, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller,
	Roland McGrath, Ulrich Drepper, Rusty Russell,
	Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

> Funny, CONFIG_FTRACE happens to select that. Now the question is, would 
> mcount work without it?

Not without fixing gcc first. It would work if gcc always called
mcount the first thing before setting up the stack frame. Not 
sure why it doesn't do that.

Still do a benchmark of frame pointer vs no frame pointer kernel
and you'll see, especially on older CPUs without special hardware
to avoid stack stalls (e.g. not Core2)

BTW always forcing frame pointers also means that ftrace is far
from near zero overhead even when disabled. That is unless you
find a way to nop the frame pointers too, but that would be likely
very difficult because the code will actually use it.

-Andi


* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-09  1:55                             ` Andi Kleen
@ 2008-08-09  2:03                               ` Steven Rostedt
  2008-08-09  2:23                                 ` Andi Kleen
  0 siblings, 1 reply; 56+ messages in thread
From: Steven Rostedt @ 2008-08-09  2:03 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Mathieu Desnoyers, LKML, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller,
	Roland McGrath, Ulrich Drepper, Rusty Russell,
	Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams


On Sat, 9 Aug 2008, Andi Kleen wrote:

> > Funny, CONFIG_FTRACE happens to select that. Now the question is, would 
> > mcount work without it?
> 
> Not without fixing gcc first. It would work if gcc always called
> mcount the first thing before setting up the stack frame. Not 
> sure why it doesn't do that.

I could take out the frame pointer option and pass in zero for the parent 
in that case. But the parent is quite useful. Still, if it helps
with performance, it may be worth doing this.

-- Steve

> 
> Still do a benchmark of frame pointer vs no frame pointer kernel
> and you'll see, especially on older CPUs without special hardware
> to avoid stack stalls (e.g. not Core2)

I'll see what it does with just a kernel build.

-- Steve


* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-09  2:03                               ` Steven Rostedt
@ 2008-08-09  2:23                                 ` Andi Kleen
  0 siblings, 0 replies; 56+ messages in thread
From: Andi Kleen @ 2008-08-09  2:23 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, Mathieu Desnoyers, LKML, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller,
	Roland McGrath, Ulrich Drepper, Rusty Russell,
	Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

On Fri, Aug 08, 2008 at 10:03:42PM -0400, Steven Rostedt wrote:
> 
> On Sat, 9 Aug 2008, Andi Kleen wrote:
> 
> > > Funny, CONFIG_FTRACE happens to select that. Now the question is, would 
> > > mcount work without it?
> > 
> > Not without fixing gcc first. It would work if gcc always called
> > mcount the first thing before setting up the stack frame. Not 
> > sure why it doesn't do that.
> 
> I could take out the frame pointer option and pass in zero for the parent 
> in that case. But the parent is quite useful. Still, if it helps
> with performance, it may be worth doing this.

Right now gcc errors out unfortunately. Would need to ask gcc developers
to fix this first. But when they move mcount to be always first
you can still get at the parent, it will always be at 4(%esp) or 8(%rsp).

An alternative would be also run an assembler post processor that
inserts mcount calls without compiler help.

-Andi


* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-09  1:19                         ` Andi Kleen
  2008-08-09  1:30                           ` Steven Rostedt
@ 2008-08-09  4:12                           ` Steven Rostedt
  1 sibling, 0 replies; 56+ messages in thread
From: Steven Rostedt @ 2008-08-09  4:12 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Mathieu Desnoyers, LKML, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller,
	Roland McGrath, Ulrich Drepper, Rusty Russell,
	Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams


On Sat, 9 Aug 2008, Andi Kleen wrote:

> > I wish we had a true 5 byte nop. The alternative is a jmp 0, which is 
> > measurable.  This is replacing mcount from a kernel compile with the -pg 
> 
> Scary usage.  How much is the difference? 
> 
> BTW a good way to get most of that back would be to not use frame 
> pointers if you have them enabled ;-) Unfortunately recently
> the "select FRAME_POINTER"s keep creeping all over so far too many kernels
> suffer from this slow mode now :-/


I checked out the latest git tree from Linus, and set up a basic config 
for it. I built it once. Then I rebooted into various versions of the 
kernel and timed the makes. This is what I did for each kernel.

# cd /work/git/linux-compile.git
# make distclean
# cp config-use .config
# make oldconfig
# time make

Here are the results. Just for curiosity sake.

Using the two part nop:
 real    3m26.751s
 user    1m36.064s
 sys     0m39.450s

Using the 5 byte nop with the four 0x66:
 real    3m31.543s
 user    1m36.197s
 sys     0m39.309s

Using jmp 0 as a nop:
 real    3m30.025s
 user    1m35.948s
 sys     0m39.540s


Then I realized that I had the irqsoff, and preemptoff tracers configured.
These do add an overhead when configured in, even when they are disabled.
I disabled them and ran the code with the two part nop again:

Using two part nop without other tracers:
 real    3m26.851s
 user    1m36.425s
 sys     0m35.437s

I then disabled ftrace completely:

no ftrace times:
 real    3m25.807s
 user    1m36.992s
 sys     0m35.715s

Then I disabled frame pointers as well:
 real    3m25.104s
 user    1m35.726s
 sys     0m34.797s

The important number here is the sys time. The user and real times should
not be much affected by these changes. Well, the real time would, but not
the user time.

I have an old toshiba that I want to run this on too. That might give more 
interesting results. But it's getting late, and I'll do that another time.

Looking at the results, and yes I should run this more than once, but I 
just wanted an idea, this box did not show much difference between the two 
part nop, the jmp and the 5 byte nop.

Running with or without ftrace enabled did not make a difference. Heck, 
the ftrace disabled was actually slower.

But it does seem that Andi is correct about the frame pointers. It did 
slow the compile down by almost a second.

Just some food for thought.

-- Steve



* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-07 18:20 Steven Rostedt
  2008-08-07 18:47 ` Mathieu Desnoyers
  2008-08-07 21:11 ` Jeremy Fitzhardinge
@ 2008-08-09  9:48 ` Abhishek Sagar
  2008-08-09 13:01   ` Steven Rostedt
  2 siblings, 1 reply; 56+ messages in thread
From: Abhishek Sagar @ 2008-08-09  9:48 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, Linus Torvalds, David Miller, Mathieu Desnoyers,
	Roland McGrath, Ulrich Drepper, Rusty Russell,
	Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

On Thu, Aug 7, 2008 at 11:50 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> You see, the reason for this is that for ftrace to maintain performance
> when configured in but disabled, it would need to change all the
> locations that called "mcount" (enabled with the gcc -pg option) into
> nops.  The "-pg" option in gcc sets up a function profiler to call this
> function called "mcount". If you simply have "mcount" return, it will
> still add 15 to 18% overhead in performance. Changing all the calls to
> nops moved the overhead into noise.
>
> To get rid of this, I had the mcount code record the location that called
> it. Later, the "ftraced" daemon would wake up and look to see if
> any new functions were recorded. If so, it would call kstop_machine
> and convert the calls to "nops". We needed kstop_machine because bad
> things happen on SMP if you modify code that happens to be in the
> instruction cache of another CPU.

Is this new framework needed for x86 specific reasons only? From what
I gathered here, ftraced defers mcount patching simply because there's
no way to update a 5-byte nop atomically. If so, why can't mcount site
patching be left to arch specific ftrace code? For !SMP or archs which
generate word sized mcount branch calls (e.g, ARM) is there really no
way to patch  mcount sites synchronously from inside ftrace_record_ip
by disabling interrupts?

Regards,
Abhishek Sagar


* Re: [PATCH 0/5] ftrace: to kill a daemon
       [not found]                       ` <aYKF1-2qs-7@gated-at.bofh.it>
@ 2008-08-09 11:50                         ` Bodo Eggert
  2008-08-09 13:02                           ` Steven Rostedt
  0 siblings, 1 reply; 56+ messages in thread
From: Bodo Eggert @ 2008-08-09 11:50 UTC (permalink / raw)
  To: Steven Rostedt, Andi Kleen, Mathieu Desnoyers, LKML, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, Andrew Morton, Linus Torvalds,
	David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell,
	Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

Steven Rostedt <rostedt@goodmis.org> wrote:
> On Sat, 9 Aug 2008, Andi Kleen wrote:
>> Steven Rostedt <rostedt@goodmis.org> writes:

>> > I'm stubborn, I want to get it right _and_ keep it fast.
>> 
>> For me it would seem better to just not use two part 5 byte nops
>> instead of adding such hacks.  I doubt there are that many of them
>> anyways. I bet you won't be able to measure any difference between the
>> different nop types in any macro benchmark.
> 
> I wish we had a true 5 byte nop. The alternative is a jmp 0, which is
> measurable.

Did you try short jumps? (0xeb 0x03 0x?? 0x?? 0x??)



* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-09  9:48 ` Abhishek Sagar
@ 2008-08-09 13:01   ` Steven Rostedt
  2008-08-09 15:01     ` Abhishek Sagar
  0 siblings, 1 reply; 56+ messages in thread
From: Steven Rostedt @ 2008-08-09 13:01 UTC (permalink / raw)
  To: Abhishek Sagar
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, Linus Torvalds, David Miller, Mathieu Desnoyers,
	Roland McGrath, Ulrich Drepper, Rusty Russell,
	Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams


On Sat, 9 Aug 2008, Abhishek Sagar wrote:

> On Thu, Aug 7, 2008 at 11:50 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> > You see, the reason for this is that for ftrace to maintain performance
> > when configured in but disabled, it would need to change all the
> > locations that called "mcount" (enabled with the gcc -pg option) into
> > nops.  The "-pg" option in gcc sets up a function profiler to call this
> > function called "mcount". If you simply have "mcount" return, it will
> > still add 15 to 18% overhead in performance. Changing all the calls to
> > nops moved the overhead into noise.
> >
> > To get rid of this, I had the mcount code record the location that called
> > it. Later, the "ftraced" daemon would wake up and look to see if
> > any new functions were recorded. If so, it would call kstop_machine
> > and convert the calls to "nops". We needed kstop_machine because bad
> > things happen on SMP if you modify code that happens to be in the
> > instruction cache of another CPU.
> 
> Is this new framework needed for x86 specific reasons only? From what
> I gathered here, ftraced defers mcount patching simply because there's
> no way to update a 5-byte nop atomically. If so, why can't mcount site
> patching be left to arch specific ftrace code? For !SMP or archs which
> generate word sized mcount branch calls (e.g, ARM) is there really no
> way to patch  mcount sites synchronously from inside ftrace_record_ip
> by disabling interrupts?

There are two topics in this thread.

1) the x86 issue of the 5 byte instruction. The problem with x86 is that on
some CPUs the nop used consists of two nops to fill the 5 bytes. There is 
no way to change that atomically. The workarounds for this is the arch 
specific ftrace_pre_enable() that will make sure no process is about to 
execute the second part of that nop.

2) Getting rid of the daemon. The daemon is used to patch the code 
dynamically later on bootup. Now an arch may or may not be able to modify 
code in SMP, but I've been told that this is dangerous to do even on PPC.

Dynamically modifying text that might be in the pipeline on another CPU 
may or may not be dangerous on all archs.

The fix here is to convert the mcount calls to nops at boot up. This is 
really ideal on all archs. This means we know every mcount call, and we get 
rid of the requirement that we need to run the code once before we can 
trace it.

The kstop_machine is now only left at the start and stop of tracing.
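The lifecycle change described above can be sketched abstractly: one stop-everything pass at boot converts every mcount call site to a nop, and the only remaining patching events are enabling or disabling tracing. A toy Python model follows — addresses, mnemonics, and function names are all made up for illustration:

```python
# Toy model of the daemon-free lifecycle described above: convert all
# mcount call sites to nops once at boot, then patch only on explicit
# enable/disable of tracing. Nothing here is real kernel code.

sites = {0x1000: "call mcount", 0x2000: "call mcount"}

def boot_convert(sites):
    # one kstop_machine-style pass at boot: calls become nops
    for addr in sites:
        sites[addr] = "nop5"

def set_tracing(sites, enable):
    # the only remaining patching events: tracing on/off
    for addr in sites:
        sites[addr] = "call ftrace" if enable else "nop5"

boot_convert(sites)
assert all(insn == "nop5" for insn in sites.values())

set_tracing(sites, True)
assert all(insn == "call ftrace" for insn in sites.values())

set_tracing(sites, False)
assert all(insn == "nop5" for insn in sites.values())
```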

-- Steve



* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-09 11:50                         ` Bodo Eggert
@ 2008-08-09 13:02                           ` Steven Rostedt
  2008-08-09 14:25                             ` Steven Rostedt
  2008-08-09 14:42                             ` Bodo Eggert
  0 siblings, 2 replies; 56+ messages in thread
From: Steven Rostedt @ 2008-08-09 13:02 UTC (permalink / raw)
  To: Bodo Eggert
  Cc: Andi Kleen, Mathieu Desnoyers, LKML, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller,
	Roland McGrath, Ulrich Drepper, Rusty Russell,
	Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams


On Sat, 9 Aug 2008, Bodo Eggert wrote:

> Steven Rostedt <rostedt@goodmis.org> wrote:
> > On Sat, 9 Aug 2008, Andi Kleen wrote:
> >> Steven Rostedt <rostedt@goodmis.org> writes:
> 
> >> > I'm stubborn, I want to get it right _and_ keep it fast.
> >> 
> >> For me it would seem better to just not use two part 5 byte nops
> >> instead of adding such hacks.  I doubt there are that many of them
> >> anyways. I bet you won't be able to measure any difference between the
> >> different nop types in any macro benchmark.
> > 
> > I wish we had a true 5 byte nop. The alternative is a jmp 0, which is
> > measurable.
> 
> Did you try short jumps? (0xeb 0x03 0x?? 0x?? 0x??)

What would those last three bytes be?

-- Steve



* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-09 13:02                           ` Steven Rostedt
@ 2008-08-09 14:25                             ` Steven Rostedt
  2008-08-09 14:42                             ` Bodo Eggert
  1 sibling, 0 replies; 56+ messages in thread
From: Steven Rostedt @ 2008-08-09 14:25 UTC (permalink / raw)
  To: Bodo Eggert
  Cc: Andi Kleen, Mathieu Desnoyers, LKML, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller,
	Roland McGrath, Ulrich Drepper, Rusty Russell,
	Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams



On Sat, 9 Aug 2008, Steven Rostedt wrote:
> > 
> > Did you try short jumps? (0xeb 0x03 0x?? 0x?? 0x??)
> 
> What would those last three bytes be?

Ah, I see, you jump 3 bytes forward.

I'm pretty sure that it is still more expensive than a nop, on some archs.
But I'm not 100% sure on that.

-- Steve



* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-09 13:02                           ` Steven Rostedt
  2008-08-09 14:25                             ` Steven Rostedt
@ 2008-08-09 14:42                             ` Bodo Eggert
  1 sibling, 0 replies; 56+ messages in thread
From: Bodo Eggert @ 2008-08-09 14:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Bodo Eggert, Andi Kleen, Mathieu Desnoyers, LKML, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, Andrew Morton, Linus Torvalds,
	David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell,
	Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

On Sat, 9 Aug 2008, Steven Rostedt wrote:
> On Sat, 9 Aug 2008, Bodo Eggert wrote:
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> > > On Sat, 9 Aug 2008, Andi Kleen wrote:
> > >> Steven Rostedt <rostedt@goodmis.org> writes:

> > >> > I'm stubborn, I want to get it right _and_ keep it fast.
> > >> 
> > >> For me it would seem better to just not use two part 5 byte nops
> > >> instead of adding such hacks.  I doubt there are that many of them
> > >> anyways. I bet you won't be able to measure any difference between the
> > >> different nop types in any macro benchmark.
> > > 
> > > I wish we had a true 5 byte nop. The alternative is a jmp 0, which is
> > > measurable.
> > 
> > Did you try short jumps? (0xeb 0x03 0x?? 0x?? 0x??)
> 
> What would those last three bytes be?

Anything, since the CPU will ignore them.

My hope is that different kinds of jump will behave differently,
but I fear the side effect of the jmp (reread memory in case of 
self-modifying code) will cause the CPU to slow down anyway.
-- 
Is reading in the bathroom considered Multitasking?


* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-09 13:01   ` Steven Rostedt
@ 2008-08-09 15:01     ` Abhishek Sagar
  2008-08-09 15:37       ` Steven Rostedt
  0 siblings, 1 reply; 56+ messages in thread
From: Abhishek Sagar @ 2008-08-09 15:01 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, Linus Torvalds, David Miller, Mathieu Desnoyers,
	Roland McGrath, Ulrich Drepper, Rusty Russell,
	Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

On Sat, Aug 9, 2008 at 6:31 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> Dynamically modifying text that might be in the pipeline on another CPU
> may or may not be dangerous on all archs.

Kprobes leaves this mess for individual archs to handle (see
arch_arm_kprobe). What would be nice to have is an explanation as to
which archs are safe from this and under what conditions. Also, for
!SMP, this boot-time patching approach and the associated build-time
paraphernalia would simply be bloat. There we could simply have a
daemonless ftrace which patches mcount sites synchronously. Why not
let each arch make this hinge on CONFIG_SMP?

> The fix here is to convert the mcount calls to nops at boot up. This is
> really ideal on all archs. This means we know every mcount call, and we get
> rid of the requirement that we need to run the code once before we can
> trace it.

This solution indeed would fit all archs well, but for some it may be
overkill (or is it?...I'd like to know that :-\ ).

Oh, and as you pointed out, it would mean that we no longer have to run
the code before starting to trace it. But why don't we do that now?
How about calling the tracer from ftrace_record_ip? All we need to do
is pass along the parent_ip from mcount.

> The kstop_machine is now only left at the start and stop of tracing.

This is a nice fix. I'm just looking to find out what the self-modifying
code problem here translates to on different archs, for my own
understanding :-)

Thanks,
Abhishek Sagar


* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-09 15:01     ` Abhishek Sagar
@ 2008-08-09 15:37       ` Steven Rostedt
  2008-08-09 17:14         ` Abhishek Sagar
  0 siblings, 1 reply; 56+ messages in thread
From: Steven Rostedt @ 2008-08-09 15:37 UTC (permalink / raw)
  To: Abhishek Sagar
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, Linus Torvalds, David Miller, Mathieu Desnoyers,
	Roland McGrath, Ulrich Drepper, Rusty Russell,
	Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams


On Sat, 9 Aug 2008, Abhishek Sagar wrote:

> On Sat, Aug 9, 2008 at 6:31 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> > Dynamically modifying text that might be in the pipeline on another CPU
> > may or may not be dangerous on all archs.
> 
> Kprobes leaves this mess for individual archs to handle (see
> arch_arm_kprobe). What would be nice to have is an explanation as to
> which archs are safe from this and under what conditions. Also, for
> !SMP, this boot-time patching approach and the associated build time
> paraphernalia would simply be a bloat. There we can simply have a
> daemonless ftrace which patches mcount sites synchronously. Why not
> let each arch have this hinge on CONFIG_SMP?

One of the things I tried to do was to make ftrace easy to port to all 
archs. David Miller ported it to Sparc64 in 20 minutes, and most of that 
was compiling.  Doing a kprobes-style fix would require much more knowledge 
of each arch and would be more prone to errors.

> 
> > The fix here is to convert the mcount calls to nops at boot up. This is
> > really ideal on all archs. This means we know every mcount call, and we get
> > rid of the requirement that we need to run the code once before we can
> > trace it.
> 
> This solution indeed would fit all archs well, but for some it may be
> overkill (or is it?...I'd like to know that :-\ ).

It is not an issue at all. The mcount list is discarded with the init 
section, and the change to nops is relatively fast. We do make a list of 
all entries anyway, so that we can pick and choose which functions we want 
to enable tracing on.

This recording at boot up should be fine on all archs, SMP or UP and will 
get rid of that daemon.

> 
> Oh, and as you pointed out, it would mean that we no longer have to run
> the code before starting to trace it. But why don't we do that now?
> How about calling the tracer from ftrace_record_ip? All we need to do
> is pass along the parent_ip from mcount.

I want ftrace_record_ip to be as fast as possible, and having it call
the registered function also means we would need to test whether the 
function was set in the filter. That is too much for what record_ip does. 
And as I said, doing it all at boot up would be best.

> 
> > The kstop_machine is now only left at the start and stop of tracing.
> 
> This is a nice fix. I'm just looking to find out what the self-modifying
> code problem here translates to on different archs, for my own
> understanding :-)


Now, what I can do is remove the kstop_machine from UP. Hmm, I need to 
check whether the kstop_machine code itself is smart enough to simply do a 
local_irq_save; run function; local_irq_restore when it is on UP and not 
SMP.

-- Steve



* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-09 15:37       ` Steven Rostedt
@ 2008-08-09 17:14         ` Abhishek Sagar
  0 siblings, 0 replies; 56+ messages in thread
From: Abhishek Sagar @ 2008-08-09 17:14 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, Linus Torvalds, David Miller, Mathieu Desnoyers,
	Roland McGrath, Ulrich Drepper, Rusty Russell,
	Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

On Sat, Aug 9, 2008 at 9:07 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> One of the things I tried to do was to make ftrace port easily to all
> archs. David Miller ported it to Sparc64 in 20 minutes and that was mostly
> compiling.  Doing a kprobe fix would require much more knowledge to each
> arch and more prone to errors.

I agree with the point about porting. My kprobes reference was simply
to point out that an existing module seems to be overwriting code on the
fly and doesn't take care of much on non-x86 archs. Ftrace has a unique
5-byte call replacement issue specific to x86 which kprobes doesn't have
to deal with.

>> This solution indeed would fit all archs well but for some it may be
>> an overkill (or is it?...I'd like to know that :-\ ).
>
> It is not an issue at all. The mcount list is discarded with the init
> section, and the change to nops is relatively fast. We do make a list of
> all entries anyway, so that we can pick and choose which functions we want
> to enable tracing on.

Ok.

> I want ftrace_record_ip to be as fast as possible, and having it call
> the registered function also means we would need to test whether the
> function was set in the filter. That is too much for what record_ip does.
> And as I said, doing it all at boot up would be best.

Alrighty.

Thanks,
Abhishek Sagar


* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-08 18:41               ` Steven Rostedt
                                   ` (2 preceding siblings ...)
  2008-08-08 19:08                 ` Jeremy Fitzhardinge
@ 2008-08-11  2:41                 ` Rusty Russell
  2008-08-11 12:33                   ` Steven Rostedt
  3 siblings, 1 reply; 56+ messages in thread
From: Rusty Russell @ 2008-08-11  2:41 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, linux-kernel, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller,
	Roland McGrath, Ulrich Drepper, Jeremy Fitzhardinge,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

On Saturday 09 August 2008 04:41:06 Steven Rostedt wrote:
> On Fri, 8 Aug 2008, Mathieu Desnoyers wrote:
> > kstop_machine does not guarantee that you won't have _any_ thread
> > preempted with IP pointing exactly in the middle of your instructions
> > _before_ the modification scheduled back in _after_ the modification and
> > thus causing an illegal instruction.
> >
> > Still buggy. :/
>
> Hmm, good point. Unless...

You can walk the task list and fix them up, if you have to.  Of course, this 
could be extremely slow with a million threads.  Maybe just walking the 
runqueues would be sufficient?

Rusty.


* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-11  2:41                 ` Rusty Russell
@ 2008-08-11 12:33                   ` Steven Rostedt
  0 siblings, 0 replies; 56+ messages in thread
From: Steven Rostedt @ 2008-08-11 12:33 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Mathieu Desnoyers, linux-kernel, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller,
	Roland McGrath, Ulrich Drepper, Jeremy Fitzhardinge,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams


On Mon, 11 Aug 2008, Rusty Russell wrote:

> On Saturday 09 August 2008 04:41:06 Steven Rostedt wrote:
> > On Fri, 8 Aug 2008, Mathieu Desnoyers wrote:
> > > kstop_machine does not guarantee that you won't have _any_ thread
> > > preempted with IP pointing exactly in the middle of your instructions
> > > _before_ the modification scheduled back in _after_ the modification and
> > > thus causing an illegal instruction.
> > >
> > > Still buggy. :/
> >
> > Hmm, good point. Unless...
> 
> You can walk the task list and fix them up, if you have to.  Of course, this 
> could be extremely slow with a million threads.  Maybe just walking the 
> runqueues would be sufficient?

True, we could do just the run queues too. But we are only looping once 
through the tasks and checking a single pointer on each, which would only 
be set if the task was preempted while in the kernel.

As for being slow, this only happens when we enable the function tracer, 
which is slow anyway. It does not need to happen on disabling the tracer.

Note, this is enabling the calls to the tracer function. The tracer itself 
can enable and disable quickly by turning on or off a variable. This is 
just the "setup" part. Not something that happens often.

-- Steve



* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-09  0:30                     ` Steven Rostedt
@ 2008-08-11 18:21                       ` Mathieu Desnoyers
  2008-08-11 19:28                         ` Steven Rostedt
  0 siblings, 1 reply; 56+ messages in thread
From: Mathieu Desnoyers @ 2008-08-11 18:21 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	Linus Torvalds, David Miller, Roland McGrath, Ulrich Drepper,
	Rusty Russell, Jeremy Fitzhardinge, Gregory Haskins,
	Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves,
	Clark Williams

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> [ updated patch ]
> 
> On Fri, 8 Aug 2008, Steven Rostedt wrote:
> > 
> > 
> > The do_something_silly had an mcount pointer to it. I put in printks in 
> > the ftrace enabled code to see where this was preempted. It was preempted 
> > several times before and after the nops, but never at either nop.
> > 
> > Maybe I didn't run it enough (almost 2 hours), but perhaps it is very 
> > unlikely to be preempted at a nop if there's something coming up next.
> > 
> > Yes a string of nops may be preempted, but perhaps only two nops followed 
> > by an actual command might be skipped quickly.
> > 
> > I'll write some hacks to look at where it is preempted in the scheduler 
> > itself, and see if I see it preempting at the second nop ever.
> 
> I put that code in the scheduler and it does indeed preempt in the nops!
> It also uncovered a nasty bug I had in my code:
> 
> [...]
> 
> > Index: linux-tip.git/arch/x86/kernel/ftrace.c
> > ===================================================================
> > --- linux-tip.git.orig/arch/x86/kernel/ftrace.c	2008-06-26 14:58:54.000000000 -0400
> > +++ linux-tip.git/arch/x86/kernel/ftrace.c	2008-08-08 17:48:04.000000000 -0400
> > @@ -127,6 +127,46 @@ notrace int ftrace_mcount_set(unsigned l
> >  	return 0;
> >  }
> >  
> > +static const unsigned char *second_nop;
> > +
> > +void arch_ftrace_pre_enable(void)
> > +{
> > +	struct task_struct *g, *p;
> > +	unsigned long **ip;
> > +
> 
> [...]
> 
> > +		ip = task_thread_info(p)->ip;
> > +		if (!ip)
> > +			continue;
> > +		if (memcmp(*ip, second_nop, x86_nop5_part2) == 0)
> > +			/* Match, move the ip forward */
> > +			*ip += x86_nop5_part2;
> 
> Notice the bug? Hint, *ip is of type unsigned long *.
> 
> Here's the updated fix:
> 

Hi Steven,

I'm actually a bit worried about this scheduler-centric approach. The
problem is that any trap generated in the middle of the 3/2 nops (think
of performance counters if the APIC is set up as a trap gate) would
stack an interruptible trap handler on top of those instructions,
leaving a return IP pointing at the wrong spot; and because the
scheduler would interrupt the trap handler rather than the trapped
code, it would not detect this.

I am not sure whether we actually have the possibility of these
interruptible trap handlers, considering the way the kernel sets up the
gates, but I think the "let's figure out where the IP stopped" approach
is a bit complex and fragile.

Trying to find a fast atomic 5-bytes nop, involving the
microarchitecture guys, seems like a better approach to me.

Mathieu

> Signed-off-by: Steven Rostedt <srostedt@redhat.com>
> ---
>  arch/x86/kernel/alternative.c    |   29 ++++++++++++++++++++-------
>  arch/x86/kernel/asm-offsets_32.c |    1 
>  arch/x86/kernel/asm-offsets_64.c |    1 
>  arch/x86/kernel/entry_32.S       |    4 +++
>  arch/x86/kernel/entry_64.S       |    4 +++
>  arch/x86/kernel/ftrace.c         |   41 +++++++++++++++++++++++++++++++++++++++
>  include/asm-x86/ftrace.h         |    5 ++++
>  include/asm-x86/thread_info.h    |    4 +++
>  kernel/trace/ftrace.c            |   12 +++++++++++
>  9 files changed, 94 insertions(+), 7 deletions(-)
> 
> Index: linux-tip.git/arch/x86/kernel/alternative.c
> ===================================================================
> --- linux-tip.git.orig/arch/x86/kernel/alternative.c	2008-06-05 11:52:24.000000000 -0400
> +++ linux-tip.git/arch/x86/kernel/alternative.c	2008-08-08 16:20:23.000000000 -0400
> @@ -140,13 +140,26 @@ static const unsigned char *const p6_nop
>  };
>  #endif
>  
> +/*
> + * Some versions of x86 CPUs have a two part NOP5. This
> + * can break ftrace if a process is preempted between
> + * the two. ftrace needs to know what the second nop
> + * is to handle this case.
> + */
> +int x86_nop5_part2;
> +
>  #ifdef CONFIG_X86_64
>  
>  extern char __vsyscall_0;
>  const unsigned char *const *find_nop_table(void)
>  {
> -	return boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
> -	       boot_cpu_data.x86 < 6 ? k8_nops : p6_nops;
> +	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
> +	    boot_cpu_data.x86 < 6) {
> +		x86_nop5_part2 = 2; /* K8_NOP2 */
> +		return k8_nops;
> +	} else
> +		/* keep x86_nop5_part2 zero */
> +		return p6_nops;
>  }
>  
>  #else /* CONFIG_X86_64 */
> @@ -154,12 +167,13 @@ const unsigned char *const *find_nop_tab
>  static const struct nop {
>  	int cpuid;
>  	const unsigned char *const *noptable;
> +	int nop5_part2;
>  } noptypes[] = {
> -	{ X86_FEATURE_K8, k8_nops },
> -	{ X86_FEATURE_K7, k7_nops },
> -	{ X86_FEATURE_P4, p6_nops },
> -	{ X86_FEATURE_P3, p6_nops },
> -	{ -1, NULL }
> +	{ X86_FEATURE_K8, k8_nops, 2},
> +	{ X86_FEATURE_K7, k7_nops, 1 },
> +	{ X86_FEATURE_P4, p6_nops, 0 },
> +	{ X86_FEATURE_P3, p6_nops, 0 },
> +	{ -1, NULL, 0 }
>  };
>  
>  const unsigned char *const *find_nop_table(void)
> @@ -170,6 +184,7 @@ const unsigned char *const *find_nop_tab
>  	for (i = 0; noptypes[i].cpuid >= 0; i++) {
>  		if (boot_cpu_has(noptypes[i].cpuid)) {
>  			noptable = noptypes[i].noptable;
> +			x86_nop5_part2 = noptypes[i].nop5_part2;
>  			break;
>  		}
>  	}
> Index: linux-tip.git/arch/x86/kernel/asm-offsets_32.c
> ===================================================================
> --- linux-tip.git.orig/arch/x86/kernel/asm-offsets_32.c	2008-07-27 10:43:26.000000000 -0400
> +++ linux-tip.git/arch/x86/kernel/asm-offsets_32.c	2008-08-08 15:46:55.000000000 -0400
> @@ -59,6 +59,7 @@ void foo(void)
>  	OFFSET(TI_restart_block, thread_info, restart_block);
>  	OFFSET(TI_sysenter_return, thread_info, sysenter_return);
>  	OFFSET(TI_cpu, thread_info, cpu);
> +	OFFSET(TI_ip, thread_info, ip);
>  	BLANK();
>  
>  	OFFSET(GDS_size, desc_ptr, size);
> Index: linux-tip.git/arch/x86/kernel/asm-offsets_64.c
> ===================================================================
> --- linux-tip.git.orig/arch/x86/kernel/asm-offsets_64.c	2008-07-27 10:43:26.000000000 -0400
> +++ linux-tip.git/arch/x86/kernel/asm-offsets_64.c	2008-08-08 15:52:34.000000000 -0400
> @@ -41,6 +41,7 @@ int main(void)
>  	ENTRY(addr_limit);
>  	ENTRY(preempt_count);
>  	ENTRY(status);
> +	ENTRY(ip);
>  #ifdef CONFIG_IA32_EMULATION
>  	ENTRY(sysenter_return);
>  #endif
> Index: linux-tip.git/arch/x86/kernel/entry_32.S
> ===================================================================
> --- linux-tip.git.orig/arch/x86/kernel/entry_32.S	2008-07-27 10:43:26.000000000 -0400
> +++ linux-tip.git/arch/x86/kernel/entry_32.S	2008-08-08 17:13:27.000000000 -0400
> @@ -304,7 +304,11 @@ need_resched:
>  	jz restore_all
>  	testl $X86_EFLAGS_IF,PT_EFLAGS(%esp)	# interrupts off (exception path) ?
>  	jz restore_all
> +	lea PT_EIP(%esp), %eax
> +	movl %eax, TI_ip(%ebp)
>  	call preempt_schedule_irq
> +	GET_THREAD_INFO(%ebp)
> +	movl $0, TI_ip(%ebp)
>  	jmp need_resched
>  END(resume_kernel)
>  #endif
> Index: linux-tip.git/arch/x86/kernel/entry_64.S
> ===================================================================
> --- linux-tip.git.orig/arch/x86/kernel/entry_64.S	2008-07-27 10:43:26.000000000 -0400
> +++ linux-tip.git/arch/x86/kernel/entry_64.S	2008-08-08 17:12:47.000000000 -0400
> @@ -837,7 +837,11 @@ ENTRY(retint_kernel)
>  	jnc  retint_restore_args
>  	bt   $9,EFLAGS-ARGOFFSET(%rsp)	/* interrupts off? */
>  	jnc  retint_restore_args
> +	leaq RIP-ARGOFFSET(%rsp), %rax
> +	movq %rax, TI_ip(%rcx)
>  	call preempt_schedule_irq
> +	GET_THREAD_INFO(%rcx)
> +	movq $0, TI_ip(%rcx)
>  	jmp exit_intr
>  #endif	
>  
> Index: linux-tip.git/arch/x86/kernel/ftrace.c
> ===================================================================
> --- linux-tip.git.orig/arch/x86/kernel/ftrace.c	2008-06-26 14:58:54.000000000 -0400
> +++ linux-tip.git/arch/x86/kernel/ftrace.c	2008-08-08 20:22:19.000000000 -0400
> @@ -127,6 +127,44 @@ notrace int ftrace_mcount_set(unsigned l
>  	return 0;
>  }
>  
> +static const unsigned char *second_nop;
> +
> +void arch_ftrace_pre_enable(void)
> +{
> +	struct task_struct *g, *p;
> +	void **ip;
> +
> +	if (!second_nop)
> +		return;
> +
> +	/*
> +	 * Some x86 CPUs use a two-part nop to cover the 5-byte call.
> +	 * If a task was preempted after the first nop but has not yet
> +	 * run the second one, modifying the code can crash the system.
> +	 * Thus, we look at all the tasks, and if any of them was
> +	 * preempted and will run the second nop next, we simply move
> +	 * its ip pointer past the second nop.
> +	 */
> +
> +	/*
> +	 * Don't need to grab the task list lock, we are running
> +	 * in kstop_machine
> +	 */
> +	do_each_thread(g, p) {
> +		/*
> +		 * In entry.S we save the ip when a task is preempted
> +		 * and reset it when it is back running.
> +		 */
> +		ip = (void **)task_thread_info(p)->ip;
> +		if (!ip)
> +			continue;
> +		if (memcmp(*ip, second_nop, x86_nop5_part2) == 0)
> +			/* Match, move the ip forward */
> +			*ip += x86_nop5_part2;
> +	} while_each_thread(g, p);
> +}
> +
>  int __init ftrace_dyn_arch_init(void *data)
>  {
>  	const unsigned char *const *noptable = find_nop_table();
> @@ -137,5 +175,8 @@ int __init ftrace_dyn_arch_init(void *da
>  
>  	ftrace_nop = (unsigned long *)noptable[MCOUNT_INSN_SIZE];
>  
> +	if (x86_nop5_part2)
> +		second_nop = noptable[x86_nop5_part2];
> +
>  	return 0;
>  }
> Index: linux-tip.git/include/asm-x86/ftrace.h
> ===================================================================
> --- linux-tip.git.orig/include/asm-x86/ftrace.h	2008-08-08 13:00:51.000000000 -0400
> +++ linux-tip.git/include/asm-x86/ftrace.h	2008-08-08 16:41:09.000000000 -0400
> @@ -17,6 +17,11 @@ static inline unsigned long ftrace_call_
>  	 */
>  	return addr - 1;
>  }
> +
> +extern int x86_nop5_part2;
> +extern void arch_ftrace_pre_enable(void);
> +#define ftrace_pre_enable arch_ftrace_pre_enable
> +
>  #endif
>  
>  #endif /* CONFIG_FTRACE */
> Index: linux-tip.git/include/asm-x86/thread_info.h
> ===================================================================
> --- linux-tip.git.orig/include/asm-x86/thread_info.h	2008-08-07 11:14:43.000000000 -0400
> +++ linux-tip.git/include/asm-x86/thread_info.h	2008-08-08 17:06:15.000000000 -0400
> @@ -29,6 +29,9 @@ struct thread_info {
>  	__u32			cpu;		/* current CPU */
>  	int			preempt_count;	/* 0 => preemptable,
>  						   <0 => BUG */
> +	unsigned long		**ip;		/* pointer to ip on stack
> +						   when preempted
> +						*/
>  	mm_segment_t		addr_limit;
>  	struct restart_block    restart_block;
>  	void __user		*sysenter_return;
> @@ -47,6 +50,7 @@ struct thread_info {
>  	.flags		= 0,			\
>  	.cpu		= 0,			\
>  	.preempt_count	= 1,			\
> +	.ip		= NULL,			\
>  	.addr_limit	= KERNEL_DS,		\
>  	.restart_block = {			\
>  		.fn = do_no_restart_syscall,	\
> Index: linux-tip.git/kernel/trace/ftrace.c
> ===================================================================
> --- linux-tip.git.orig/kernel/trace/ftrace.c	2008-08-08 13:00:52.000000000 -0400
> +++ linux-tip.git/kernel/trace/ftrace.c	2008-08-08 16:18:14.000000000 -0400
> @@ -32,6 +32,10 @@
>  
>  #include "trace.h"
>  
> +#ifndef ftrace_pre_enable
> +# define ftrace_pre_enable() do { } while (0)
> +#endif
> +
>  /* ftrace_enabled is a method to turn ftrace on or off */
>  int ftrace_enabled __read_mostly;
>  static int last_ftrace_enabled;
> @@ -500,6 +504,14 @@ static void ftrace_replace_code(int enab
>  	else
>  		new = ftrace_nop_replace();
>  
> +	/*
> +	 * Some archs *cough*x86*cough* have more than one nop to cover
> +	 * the call to mcount. In these cases, special care must be taken
> +	 * before we start converting nops into calls.
> +	 */
> +	if (enable)
> +		ftrace_pre_enable();
> +
>  	for (pg = ftrace_pages_start; pg; pg = pg->next) {
>  		for (i = 0; i < pg->index; i++) {
>  			rec = &pg->records[i];

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-11 18:21                       ` Mathieu Desnoyers
@ 2008-08-11 19:28                         ` Steven Rostedt
  0 siblings, 0 replies; 56+ messages in thread
From: Steven Rostedt @ 2008-08-11 19:28 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	Linus Torvalds, David Miller, Roland McGrath, Ulrich Drepper,
	Rusty Russell, Jeremy Fitzhardinge, Gregory Haskins,
	Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves,
	Clark Williams



On Mon, 11 Aug 2008, Mathieu Desnoyers wrote:
> 
> Hi Steven,
> 
> I'm actually a bit worried about this scheduler-centric approach. The
> problem is that any trap generated in the middle of the 3/2 nops (think
> of performance counters if the APIC is set up as a trap gate) would
> stack an interruptible trap handler on top of those instructions,
> leaving a return IP pointing at the wrong spot; and because the
> scheduler would interrupt the trap handler rather than the trapped
> code, it would not detect this.
> 
> I am not sure whether we actually have the possibility of these
> interruptible trap handlers, considering the way the kernel sets up the
> gates, but I think the "let's figure out where the IP stopped" approach
> is a bit complex and fragile.
> 
> Trying to find a fast atomic 5-bytes nop, involving the
> microarchitecture guys, seems like a better approach to me.
> 

I agree that the fast atomic 5-byte nop is the optimal approach, but until 
we have that, we need to do this.

I don't think this approach is too complex: I wrote the code in about an 
hour, and the patch isn't that big. Aside from my one bug, which was a 
CS101-type bug (pointer arithmetic), there wasn't anything in it that 
struck me as complex.

The thing is, I want this to be enabled in a production kernel. Even if 
there is a one-in-a-million chance of this theoretical bug with a trap 
happening in the middle of the double nop, it only affects those who 
enable the tracer. No one else. The code only runs when tracing is being 
enabled.

If we need to put a non-optimal nop in to replace the call to mcount, 
it affects everyone: anyone who has CONFIG_FTRACE on and never plans 
on running a trace.

If it comes down to slowing everyone down versus hitting a one-in-a-million 
theoretical bug, I'll take my chances on the bug. Again, that code 
is only performed when tracing is being enabled. That is, when it converts 
all 20 thousand nops into calls to a tracer function. If the crash 
happens, it will only happen after tracing has started. Then we can easily 
point the bug at the tracer.

IOW, I do not want to slow down Linus' box.

-- Steve



* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-09  1:25                             ` Steven Rostedt
@ 2008-08-13  6:31                               ` Mathieu Desnoyers
  2008-08-13 15:38                                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 56+ messages in thread
From: Mathieu Desnoyers @ 2008-08-13  6:31 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Jeremy Fitzhardinge, Andi Kleen, LKML,
	Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> On Fri, 8 Aug 2008, Linus Torvalds wrote:
> > 
> > 
> > On Fri, 8 Aug 2008, Jeremy Fitzhardinge wrote:
> > >
> > > Steven Rostedt wrote:
> > > > I wish we had a true 5 byte nop. 
> > > 
> > > 0x66 0x66 0x66 0x66 0x90
> > 
> > I don't think so. Multiple redundant prefixes can be really expensive on 
> > some uarchs.
> > 
> > A no-op that isn't cheap isn't a no-op at all, it's a slow-op.
> 
> 
A quick meaningless benchmark showed a slight performance hit.
> 

Hi Steven,

I tried to run my own tests to see whether these numbers are actually
meaningful at all. My results seem to show that there is no significant
difference between the various configurations; the only tendency I see
is that the 2-byte jump with offset 0x03 is slightly faster than the
3/2 nop on an Intel Xeon. But we would have to run these a bit more
often to confirm that, I guess.

I am just trying to get a sense of whether we are trying hard to
optimize something that is worthless in practice, and to me it looks
like we are. But it could be the architecture I am using that produces
these results.

Mathieu

Intel Xeon dual quad-core
Intel(R) Xeon(R) CPU           E5405  @ 2.00GHz

3/2 nop used :
K8_NOP3 K8_NOP2
#define K8_NOP2 ".byte 0x66,0x90\n"
#define K8_NOP3 ".byte 0x66,0x66,0x90\n"

** Summary **

Test A : make -j20 2.6.27-rc2 kernel (real time)
                                          Avg.      std.dev
Case 1 : ftrace not compiled-in.          1m9.76s   0.41s
Case 2 : 3/2 nops                         1m9.95s   0.36s
Case 3 : 2-bytes jump, offset 0x03        1m9.10s   0.40s
Case 4 : 5-bytes jump, offset 0x00        1m9.25s   0.34s

Test B : hackbench 15

Case 1 : ftrace not compiled-in.          0.349s    0.007s
Case 2 : 3/2 nops                         0.351s    0.014s
Case 3 : 2-bytes jump, offset 0x03        0.350s    0.007s
Case 4 : 5-bytes jump, offset 0x00        0.351s    0.010s



** Detail **

* Test A

benchmark : make -j20 2.6.27-rc2 kernel
make clean; make -j20; make clean done before the tests to prime caches.
Same .config used.


Case 1 : ftrace not compiled-in.

real	1m9.980s
user	7m27.664s
sys	0m48.771s

real	1m9.330s
user	7m27.244s
sys	0m50.567s

real	1m9.393s
user	7m27.408s
sys	0m50.511s

real	1m9.674s
user	7m28.088s
sys	0m50.327s

real	1m10.441s
user	7m27.736s
sys	0m49.687s

real time
average : 1m9.76s
std. dev. : 0.41s

after a reboot with the same kernel :

real	1m8.758s
user	7m26.012s
sys	0m48.835s

real	1m11.035s
user	7m26.432s
sys	0m49.171s

real	1m9.834s
user	7m25.768s
sys	0m49.167s


Case 2 : 3/2 nops

real	1m9.713s
user	7m27.524s
sys	0m48.315s

real	1m9.481s
user	7m27.144s
sys	0m48.587s

real	1m10.565s
user	7m27.048s
sys	0m48.715s

real	1m10.008s
user	7m26.436s
sys	0m49.295s

real	1m9.982s
user	7m27.160s
sys	0m48.667s

real time
avg : 1m9.95s
std. dev. : 0.36s


Case 3 : 2-bytes jump, offset 0x03

real	1m9.158s
user	7m27.108s
sys	0m48.775s

real	1m9.159s
user	7m27.320s
sys	0m48.659s

real	1m8.390s
user	7m27.976s
sys	0m48.359s

real	1m9.143s
user	7m26.624s
sys	0m48.719s

real	1m9.642s
user	7m26.228s
sys	0m49.483s

real time
avg : 1m9.10s
std. dev. : 0.40s

one extra after reboot with same kernel :

real	1m8.855s
user	7m27.372s
sys	0m48.543s


Case 4 : 5-bytes jump, offset 0x00

real	1m9.173s
user	7m27.228s
sys	0m48.151s

real	1m9.735s
user	7m26.852s
sys	0m48.499s

real	1m9.502s
user	7m27.148s
sys	0m48.107s

real	1m8.727s
user	7m27.416s
sys	0m48.071s

real	1m9.115s
user	7m26.932s
sys	0m48.727s

real time
avg : 1m9.25s
std. dev. : 0.34s


* Test B

Hackbench

Case 1 : ftrace not compiled-in.

./hackbench 15
Time: 0.358
./hackbench 15
Time: 0.342
./hackbench 15
Time: 0.354
./hackbench 15
Time: 0.338
./hackbench 15
Time: 0.347

Average : 0.349
std. dev. : 0.007

Case 2 : 3/2 nops

./hackbench 15
Time: 0.328
./hackbench 15
Time: 0.368
./hackbench 15
Time: 0.351
./hackbench 15
Time: 0.343
./hackbench 15
Time: 0.366

Average : 0.351
std. dev. : 0.014

Case 3 : jmp 2 bytes

./hackbench 15
Time: 0.346
./hackbench 15
Time: 0.359
./hackbench 15
Time: 0.356
./hackbench 15
Time: 0.350
./hackbench 15
Time: 0.340

Average : 0.350
std. dev. : 0.007

Case 4 : jmp 5 bytes

./hackbench 15
Time: 0.346
./hackbench 15
Time: 0.346
./hackbench 15
Time: 0.364
./hackbench 15
Time: 0.362
./hackbench 15
Time: 0.338

Average : 0.351
std. dev. : 0.010


Hardware used :

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Intel(R) Xeon(R) CPU           E5405  @ 2.00GHz
stepping	: 6
cpu MHz		: 2000.114
cache size	: 6144 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm
constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx tm2 ssse3
cx16 xtpr dca sse4_1 lahf_lm
bogomips	: 4000.22
clflush size	: 64
cache_alignment	: 64
address sizes	: 38 bits physical, 48 bits virtual
power management:

(7 other similar cpus)


> Here's 10 runs of "hackbench 50" using the two part 5 byte nop:
> 
> run 1
> Time: 4.501
> run 2
> Time: 4.855
> run 3
> Time: 4.198
> run 4
> Time: 4.587
> run 5
> Time: 5.016
> run 6
> Time: 4.757
> run 7
> Time: 4.477
> run 8
> Time: 4.693
> run 9
> Time: 4.710
> run 10
> Time: 4.715
> avg = 4.6509
> 
> 
> And 10 runs using the above 5 byte nop:
> 
> run 1
> Time: 4.832
> run 2
> Time: 5.319
> run 3
> Time: 5.213
> run 4
> Time: 4.830
> run 5
> Time: 4.363
> run 6
> Time: 4.391
> run 7
> Time: 4.772
> run 8
> Time: 4.992
> run 9
> Time: 4.727
> run 10
> Time: 4.825
> avg = 4.8264
> 
> # cat /proc/cpuinfo
> processor	: 0
> vendor_id	: AuthenticAMD
> cpu family	: 15
> model		: 65
> model name	: Dual-Core AMD Opteron(tm) Processor 2220
> stepping	: 3
> cpu MHz		: 2799.992
> cache size	: 1024 KB
> physical id	: 0
> siblings	: 2
> core id		: 0
> cpu cores	: 2
> apicid		: 0
> initial apicid	: 0
> fdiv_bug	: no
> hlt_bug		: no
> f00f_bug	: no
> coma_bug	: no
> fpu		: yes
> fpu_exception	: yes
> cpuid level	: 1
> wp		: yes
> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
> cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
> rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic 
> cr8_legacy
> bogomips	: 5599.98
> clflush size	: 64
> power management: ts fid vid ttp tm stc
> 
> There's 4 of these.
> 
> Just to make sure, I ran the above nop test again:
> 
> [ this is reverse from the above runs ]
> 
> run 1
> Time: 4.723
> run 2
> Time: 5.080
> run 3
> Time: 4.521
> run 4
> Time: 4.841
> run 5
> Time: 4.696
> run 6
> Time: 4.946
> run 7
> Time: 4.754
> run 8
> Time: 4.717
> run 9
> Time: 4.905
> run 10
> Time: 4.814
> avg = 4.7997
> 
> And again the two part nop:
> 
> run 1
> Time: 4.434
> run 2
> Time: 4.496
> run 3
> Time: 4.801
> run 4
> Time: 4.714
> run 5
> Time: 4.631
> run 6
> Time: 5.178
> run 7
> Time: 4.728
> run 8
> Time: 4.920
> run 9
> Time: 4.898
> run 10
> Time: 4.770
> avg = 4.757
> 
> 
> This time it was close, but still seems to have some difference.
> 
> heh, perhaps it's just noise.
> 
> -- Steve
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/5] ftrace: to kill a daemon
  2008-08-13  6:31                               ` Mathieu Desnoyers
@ 2008-08-13 15:38                                 ` Mathieu Desnoyers
  0 siblings, 0 replies; 56+ messages in thread
From: Mathieu Desnoyers @ 2008-08-13 15:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Jeremy Fitzhardinge, Andi Kleen, LKML,
	Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

* Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
> * Steven Rostedt (rostedt@goodmis.org) wrote:
> > 
> > On Fri, 8 Aug 2008, Linus Torvalds wrote:
> > > 
> > > 
> > > On Fri, 8 Aug 2008, Jeremy Fitzhardinge wrote:
> > > >
> > > > Steven Rostedt wrote:
> > > > > I wish we had a true 5 byte nop. 
> > > > 
> > > > 0x66 0x66 0x66 0x66 0x90
> > > 
> > > I don't think so. Multiple redundant prefixes can be really expensive on 
> > > some uarchs.
> > > 
> > > A no-op that isn't cheap isn't a no-op at all, it's a slow-op.
> > 
> > 
> > A quick meaningless benchmark showed a slight performance hit.
> > 
> 
> Hi Steven,
> 
> I tried to run my own tests to see whether these numbers are actually
> meaningful at all. My results seem to show that there is no significant
> difference between the various configurations; the only tendency I see
> is that the 2-byte jump with offset 0x03 may be slightly faster than
> the 3/2 nop on the Intel Xeon. But we would have to run these a bit
> more often to confirm that, I guess.
> 
> I am just trying to get a sense of whether we are really trying hard to
> optimize something that is worthless in practice, and to me it looks
> like we are. But it could also be the architecture I am using that
> skews these results.
> 
> Mathieu
> 
> Intel Xeon dual quad-core
> Intel(R) Xeon(R) CPU           E5405  @ 2.00GHz
> 
> 3/2 nop used :
> K8_NOP3 K8_NOP2
> #define K8_NOP2 ".byte 0x66,0x90\n"
> #define K8_NOP3 ".byte 0x66,0x66,0x90\n"
> 

Small correction: my machine uses the P6_NOP5, which is an atomic
5-byte nop (I just looked at the runtime output of find_nop_table()).

5: nopl 0x00(%eax,%eax,1)
#define P6_NOP5 ".byte 0x0f,0x1f,0x44,0x00,0\n"

But the results stand. Maybe I should try to force a test with a
K8_NOP3 K8_NOP2 nop.

Mathieu

> ** Summary **
> 
> Test A : make -j20 2.6.27-rc2 kernel (real time)
>                                           Avg.      std.dev
> Case 1 : ftrace not compiled-in.          1m9.76s   0.41s
> Case 2 : 3/2 nops                         1m9.95s   0.36s
> Case 3 : 2-byte jump, offset 0x03         1m9.10s   0.40s
> Case 4 : 5-byte jump, offset 0x00         1m9.25s   0.34s
> 
> Test B : hackbench 15
> 
> Case 1 : ftrace not compiled-in.          0.349s    0.007s
> Case 2 : 3/2 nops                         0.351s    0.014s
> Case 3 : 2-byte jump, offset 0x03         0.350s    0.007s
> Case 4 : 5-byte jump, offset 0x00         0.351s    0.010s
> 
> 
> 
> [ detailed benchmark runs trimmed; identical to the numbers earlier in
>   the thread ]

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


end of thread, other threads:[~2008-08-13 15:43 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <aYipy-5FM-9@gated-at.bofh.it>
2008-08-07 21:28 ` [PATCH 0/5] ftrace: to kill a daemon Bodo Eggert
2008-08-07 21:24   ` Jeremy Fitzhardinge
2008-08-07 21:35   ` Steven Rostedt
     [not found] ` <aYiIP-6bN-11@gated-at.bofh.it>
     [not found]   ` <aYkAT-15Q-1@gated-at.bofh.it>
     [not found]     ` <aYDX5-fa-31@gated-at.bofh.it>
     [not found]       ` <aYE6H-tg-21@gated-at.bofh.it>
     [not found]         ` <aYEgj-Fc-17@gated-at.bofh.it>
     [not found]           ` <aYEJq-1xB-23@gated-at.bofh.it>
     [not found]             ` <aYET8-1L7-17@gated-at.bofh.it>
     [not found]               ` <aYFcq-2f2-11@gated-at.bofh.it>
     [not found]                 ` <aYFvP-2Vc-25@gated-at.bofh.it>
     [not found]                   ` <aYJIZ-15G-7@gated-at.bofh.it>
     [not found]                     ` <aYKvs-2gQ-7@gated-at.bofh.it>
     [not found]                       ` <aYKF1-2qs-7@gated-at.bofh.it>
2008-08-09 11:50                         ` Bodo Eggert
2008-08-09 13:02                           ` Steven Rostedt
2008-08-09 14:25                             ` Steven Rostedt
2008-08-09 14:42                             ` Bodo Eggert
2008-08-07 18:20 Steven Rostedt
2008-08-07 18:47 ` Mathieu Desnoyers
2008-08-07 20:42   ` Steven Rostedt
2008-08-08 17:22     ` Mathieu Desnoyers
2008-08-08 17:36       ` Steven Rostedt
2008-08-08 17:46         ` Mathieu Desnoyers
2008-08-08 18:13           ` Steven Rostedt
2008-08-08 18:15             ` Peter Zijlstra
2008-08-08 18:21             ` Mathieu Desnoyers
2008-08-08 18:41               ` Steven Rostedt
2008-08-08 19:04                 ` Linus Torvalds
2008-08-08 19:05                 ` Mathieu Desnoyers
2008-08-08 23:38                   ` Steven Rostedt
2008-08-09  0:23                     ` Andi Kleen
2008-08-09  0:36                       ` Steven Rostedt
2008-08-09  0:47                         ` Jeremy Fitzhardinge
2008-08-09  0:51                           ` Linus Torvalds
2008-08-09  1:25                             ` Steven Rostedt
2008-08-13  6:31                               ` Mathieu Desnoyers
2008-08-13 15:38                                 ` Mathieu Desnoyers
2008-08-09  0:51                           ` Steven Rostedt
2008-08-09  0:53                         ` Roland McGrath
2008-08-09  1:13                           ` Andi Kleen
2008-08-09  1:19                         ` Andi Kleen
2008-08-09  1:30                           ` Steven Rostedt
2008-08-09  1:55                             ` Andi Kleen
2008-08-09  2:03                               ` Steven Rostedt
2008-08-09  2:23                                 ` Andi Kleen
2008-08-09  4:12                           ` Steven Rostedt
2008-08-09  0:30                     ` Steven Rostedt
2008-08-11 18:21                       ` Mathieu Desnoyers
2008-08-11 19:28                         ` Steven Rostedt
2008-08-08 19:08                 ` Jeremy Fitzhardinge
2008-08-11  2:41                 ` Rusty Russell
2008-08-11 12:33                   ` Steven Rostedt
2008-08-07 21:11 ` Jeremy Fitzhardinge
2008-08-07 21:29   ` Steven Rostedt
2008-08-07 22:26     ` Roland McGrath
2008-08-08  1:21       ` Steven Rostedt
2008-08-08  1:24         ` Steven Rostedt
2008-08-08  1:56         ` Steven Rostedt
2008-08-08  7:22         ` Peter Zijlstra
2008-08-08 11:31           ` Steven Rostedt
2008-08-08  4:54       ` Sam Ravnborg
2008-08-09  9:48 ` Abhishek Sagar
2008-08-09 13:01   ` Steven Rostedt
2008-08-09 15:01     ` Abhishek Sagar
2008-08-09 15:37       ` Steven Rostedt
2008-08-09 17:14         ` Abhishek Sagar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox