* Re: Efficient x86 and x86_64 NOP microbenchmarks [not found] <20080813191926.GB15547@Krystal> @ 2008-08-13 20:00 ` Steven Rostedt 2008-08-13 20:06 ` Jeremy Fitzhardinge 2008-08-13 20:15 ` Andi Kleen 0 siblings, 2 replies; 18+ messages in thread From: Steven Rostedt @ 2008-08-13 20:00 UTC (permalink / raw) To: Andi Kleen, Thomas Gleixner Cc: Mathieu Desnoyers, Linus Torvalds, Steven Rostedt, Jeremy Fitzhardinge, LKML, Ingo Molnar, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams [ Thanks to Mathieu Desnoyers who forwarded this to me. Currently my ISP for goodmis.org is having issues: https://help.domaindirect.com/index.php?_m=news&_a=viewnews&newsid=104 ] > ----- Forwarded message from Andi Kleen <andi@firstfloor.org> ----- > > >> So microbenchmarking this way will probably make some things look >> unrealistically good. >> > > Must be careful not to miss the big picture here. > > We have two assumptions here in this thread: > > - Normal alternative() nops are relatively infrequent, typically > in points with enough pipeline bubbles anyways, and it likely doesn't > matter how they are encoded. And also they don't have an issue > with multi-part instructions anyways because they're not patched > at runtime, so always the best known nop can be used. > > - The one case where nops are very frequent and matter and multi-part > is a problem is with ftrace nopping out the call to mcount at runtime, > because that happens on every function entry. > Even there the overhead is not that big, but at least measurable > in kernel builds. > The problem is not ftrace nopping out the call at runtime. The problem is ftrace changing the nops back to calls to mcount. The nop part is simple, straightforward, and not the issue we are discussing here. The issue is which kind of nop to use. The bug with the multi-part nop happens when we _enable_ tracing. 
That is, when someone runs the tracer. The issue with the multi-part nop is that a task could have been preempted after it executed the first nop and before the second part. Then we enable tracing, and when the task is scheduled back in, it will now execute half the call to the mcount function. I want to make this point very clear. If you never run tracing, this bug will not happen. And the bug only happens on enabling the tracer, not on the disabling part. Not to mention that the bug itself will only happen about one time in a billion. > Now the numbers have shown that just by not using frame pointer ( > -pg right now implies frame pointer) you can get more benefit > than what you lose from using non-optimal nops. > No, I can easily make a patch that does not use frame pointers but still uses -pg. We just cannot print the parent function in the trace. This can easily be added to a config, as well as easily implemented. > So for me the best strategy would be to get rid of the frame pointer > and ignore the nops. This unfortunately would require going away > from -pg and instead post-process gcc output to insert "call mcount" > manually. But the nice advantage of that is that you could actually > set up a custom table of callers built in an ELF section and with > that you don't actually need the runtime patching (which is only > done currently because there's no global table of mcount calls), > but could do everything in stop_machine(). Without > runtime patching you also don't need single-part nops. > > I'm totally confused used here. How do you enable function tracing? How do we call into the code that records that a function was hit? > I think that would be the best option. I especially like it because > it would prevent forcing frame pointer, which seems to be costlier > than any kind of nops. As I stated, the frame pointer part is only there to record the parent function in tracing. 
ie: ls-4866 [00] 177596.041275: _spin_unlock <-journal_stop Here we see that the function _spin_unlock was called by the function journal_stop. We can easily turn off parent tracing now, with: # echo noprint-parent > /debug/tracing/iter_ctrl which gives us just: ls-4866 [00] 177596.041275: _spin_unlock If we disable frame pointers, the noprint-parent option would be forced. Not that devastating, but it gives the user the option to still have function tracing without requiring frame pointers. I would still require that the irqsoff tracer add frame pointers, just because knowing that the long latency of interrupts disabled happened at local_irq_save doesn't cut it ;-) Anyway, who would want to run with frame pointers disabled? If you ever get a crash, the stack trace is pretty much useless. -- Steve ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 20:00 ` Efficient x86 and x86_64 NOP microbenchmarks Steven Rostedt @ 2008-08-13 20:06 ` Jeremy Fitzhardinge 2008-08-13 20:34 ` Steven Rostedt 2008-08-13 20:15 ` Andi Kleen 1 sibling, 1 reply; 18+ messages in thread From: Jeremy Fitzhardinge @ 2008-08-13 20:06 UTC (permalink / raw) To: Steven Rostedt Cc: Andi Kleen, Thomas Gleixner, Mathieu Desnoyers, Linus Torvalds, Steven Rostedt, LKML, Ingo Molnar, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams Steven Rostedt wrote: > No, I can easily make a patch that does not use frame pointers but > still uses -pg. We just can not print the parent function in the > trace. This can easily be added to a config, as well as easily > implemented. Why? You can always get the calling function, because its return address is on the stack (assuming mcount is called before the function puts its own frame on the stack). But without a frame pointer, you can't necessarily get the caller's caller. But I think Andi's point is that gcc forces frame pointers on when you enable mcount, so there's no choice in the matter. J ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 20:06 ` Jeremy Fitzhardinge @ 2008-08-13 20:34 ` Steven Rostedt 0 siblings, 0 replies; 18+ messages in thread From: Steven Rostedt @ 2008-08-13 20:34 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Jeremy Fitzhardinge, Andi Kleen, Thomas Gleixner, Linus Torvalds, Steven Rostedt, LKML, Ingo Molnar, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams Just a curious run of Mathieu's microbenchmark:

NR_TESTS 10000000
test empty cycles : 182500444
test 2-bytes jump cycles : 195969127
test 5-bytes jump cycles : 197000202
test 3/2 nops cycles : 201333408
test 5-bytes nop with long prefix cycles : 205000067
test 5-bytes P6 nop cycles : 205000227
test Generic 1/4 5-bytes nops cycles : 200000077
test K7 1/4 5-bytes nops cycles : 197549045

And this was on a Pentium III 847.461 MHz box (my old Toshiba laptop). The jumps performed best here, but that could just be cache issues. But it is interesting to see that, of the nops, the K7 1/4 fared the best. -- Steve ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 20:00 ` Efficient x86 and x86_64 NOP microbenchmarks Steven Rostedt 2008-08-13 20:06 ` Jeremy Fitzhardinge @ 2008-08-13 20:15 ` Andi Kleen 2008-08-13 20:21 ` Linus Torvalds 2008-08-13 20:21 ` Steven Rostedt 1 sibling, 2 replies; 18+ messages in thread From: Andi Kleen @ 2008-08-13 20:15 UTC (permalink / raw) To: Steven Rostedt Cc: Andi Kleen, Thomas Gleixner, Mathieu Desnoyers, Linus Torvalds, Steven Rostedt, Jeremy Fitzhardinge, LKML, Ingo Molnar, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams On Wed, Aug 13, 2008 at 04:00:37PM -0400, Steven Rostedt wrote: > >Now the numbers have shown that just by not using frame pointer ( > >-pg right now implies frame pointer) you can get more benefit > >than what you lose from using non optimal nops. > > > > No, I can easily make a patch that does not use frame pointers but still Not without patching gcc. Try it. The patch is not very difficult and i did it here, but it needs a patch. > If we disable frame pointers, the noprint-parent option would be forced. Actually you can get the parent without frame pointer if you just force gcc to emit mcount before touching the stack frame (and manual insertion pass would do that). Then parent is at 4(%esp)/8(%rsp) Again teaching gcc that is not very difficult, but it needs a patch. > I would still require that the irqsoff tracer add frame pointers, just > because knowing that the long latency of interrupts disabled happened at > local_irq_save doesn't cut it ;-) Nope. > > Anyway, who would want to run with frame pointers disabled? If you ever > get a bug crash, the stack trace is pretty much useless. First that's not true (remember most production kernels run without frame pointers, also e.g. 
crash or systemtap know how to do proper unwinding without slow frame pointers), and if you want it at runtime too you can always add the dwarf2 unwinder (like the openSUSE kernel does) and get better backtraces than you could ever get with frame pointers (that is because e.g. most assembler code doesn't even bother to set up frame pointers, but it is all dwarf2 annotated). Also I must say the whole ftrace nopping exercise is pretty pointless without avoiding frame pointers, because it saves less than what you lose unconditionally from the "select FRAME_POINTER". -Andi ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 20:15 ` Andi Kleen @ 2008-08-13 20:21 ` Linus Torvalds 2008-08-13 20:21 ` Steven Rostedt 1 sibling, 0 replies; 18+ messages in thread From: Linus Torvalds @ 2008-08-13 20:21 UTC (permalink / raw) To: Andi Kleen Cc: Steven Rostedt, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, Jeremy Fitzhardinge, LKML, Ingo Molnar, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams On Wed, 13 Aug 2008, Andi Kleen wrote: > > Also I must say the whole ftrace noping exercise is pretty pointless without > avoiding frame pointers because it does save less than what you lose > unconditionally from the "select FRAME_POINTER" Andi, you seem to have missed the whole point. This is a _correctness_ issue as long as the nop is not a single instruction. And the workaround for that is uglier than just making a single-instruction nop. So the question now is to find a good nop that _is_ a single atomic instruction. Your blathering about frame pointers is missing the whole point! Linus ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 20:15 ` Andi Kleen 2008-08-13 20:21 ` Linus Torvalds @ 2008-08-13 20:21 ` Steven Rostedt 1 sibling, 0 replies; 18+ messages in thread From: Steven Rostedt @ 2008-08-13 20:21 UTC (permalink / raw) To: Andi Kleen Cc: Thomas Gleixner, Mathieu Desnoyers, Linus Torvalds, Steven Rostedt, Jeremy Fitzhardinge, LKML, Ingo Molnar, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams Andi Kleen wrote: > On Wed, Aug 13, 2008 at 04:00:37PM -0400, Steven Rostedt wrote: > >>> Now the numbers have shown that just by not using frame pointer ( >>> -pg right now implies frame pointer) you can get more benefit >>> than what you lose from using non optimal nops. >>> >>> >> No, I can easily make a patch that does not use frame pointers but still >> > > Not without patching gcc. Try it. The patch is not very difficult and I did > it here, but it needs a patch. > OK, I admit you are right ;-) I got the error message: gcc: -pg and -fomit-frame-pointer are incompatible -- Steve ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/5] ftrace: to kill a daemon @ 2008-08-08 18:13 Steven Rostedt 2008-08-08 18:21 ` Mathieu Desnoyers 0 siblings, 1 reply; 18+ messages in thread From: Steven Rostedt @ 2008-08-08 18:13 UTC (permalink / raw) To: Mathieu Desnoyers Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams On Fri, 8 Aug 2008, Mathieu Desnoyers wrote: > * Steven Rostedt (rostedt@goodmis.org) wrote: > > > > On Fri, 8 Aug 2008, Mathieu Desnoyers wrote: > > > * Steven Rostedt (rostedt@goodmis.org) wrote: > > > > > > > > I originally used jumps instead of nops, but unfortunately, they actually > > > > hurt performance more than adding nops. Ingo told me it was probably due > > > > to using up the jump predictions of the CPU. > > > > > > > > > > Hrm, are you sure you use a single 5-bytes nop instruction then, or do > > > you use a mix of various nop sizes (add_nops) on some architectures ? > > > > I use (for x86) what is in include/asm-x86/nops.h depending on what the > > cpuid gives us. > > > > That's bad : > > #define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4 > > #define K8_NOP5 K8_NOP3 K8_NOP2 > > #define K7_NOP5 K7_NOP4 ASM_NOP1 > > So, when you try, later, to replace these instructions with a single > 5-bytes instruction, a preempted thread could iret in the middle of your > 5-bytes insn and cause an illegal instruction ? That's why I use kstop_machine. > > > > > > > > You can consume the branch prediction buffers for conditional branches, > > > but I doubt static jumps have this impact ? I don't see what "jump > > > predictions" you are referring to here exactly. > > > > I don't know the details, but we definitely saw a drop in preformance > > between using nops and static jumps. 
> > > > Generated by replacing all the calls by 5-bytes jumps e9 00 00 00 00 > instead of the 5-bytes add_nops ? On which architectures ? > I ran this on my Dell (Intel Xeon), which IIRC did show the performance degradation. I unfortunately don't have the time to redo those tests, but you are welcome to. Just look at arch/x86/kernel/ftrace.c and replace the nop with the jump. In fact, the comments in that file still say it is a jmp. Remember, my first go was to use the jmp. -- Steve ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/5] ftrace: to kill a daemon 2008-08-08 18:13 [PATCH 0/5] ftrace: to kill a daemon Steven Rostedt @ 2008-08-08 18:21 ` Mathieu Desnoyers 2008-08-08 18:41 ` Steven Rostedt 0 siblings, 1 reply; 18+ messages in thread From: Mathieu Desnoyers @ 2008-08-08 18:21 UTC (permalink / raw) To: Steven Rostedt Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams * Steven Rostedt (rostedt@goodmis.org) wrote: > > On Fri, 8 Aug 2008, Mathieu Desnoyers wrote: > > * Steven Rostedt (rostedt@goodmis.org) wrote: > > > > > > On Fri, 8 Aug 2008, Mathieu Desnoyers wrote: > > > > * Steven Rostedt (rostedt@goodmis.org) wrote: > > > > > > > > > > I originally used jumps instead of nops, but unfortunately, they actually > > > > > hurt performance more than adding nops. Ingo told me it was probably due > > > > > to using up the jump predictions of the CPU. > > > > > > > > > > > > > Hrm, are you sure you use a single 5-bytes nop instruction then, or do > > > > you use a mix of various nop sizes (add_nops) on some architectures ? > > > > > > I use (for x86) what is in include/asm-x86/nops.h depending on what the > > > cpuid gives us. > > > > > > > That's bad : > > > > #define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4 > > > > #define K8_NOP5 K8_NOP3 K8_NOP2 > > > > #define K7_NOP5 K7_NOP4 ASM_NOP1 > > > > So, when you try, later, to replace these instructions with a single > > 5-bytes instruction, a preempted thread could iret in the middle of your > > 5-bytes insn and cause an illegal instruction ? > > That's why I use kstop_machine. > kstop_machine does not guarantee that you won't have _any_ thread preempted with IP pointing exactly in the middle of your instructions _before_ the modification scheduled back in _after_ the modification and thus causing an illegal instruction. 
Still buggy. :/ > > > > > > > > > > > > You can consume the branch prediction buffers for conditional branches, > > > > but I doubt static jumps have this impact ? I don't see what "jump > > > > predictions" you are referring to here exactly. > > > > > > I don't know the details, but we definitely saw a drop in performance > > > between using nops and static jumps. > > > > > > > Generated by replacing all the calls by 5-bytes jumps e9 00 00 00 00 > > instead of the 5-bytes add_nops ? On which architectures ? > > > > I ran this on my Dell (Intel Xeon), which IIRC did show the performance > degradation. I unfortunately don't have the time to redo those tests, but > you are welcome to. > > Just look at arch/x86/kernel/ftrace.c and replace the nop with the jump. > In fact, the comments in that file still say it is a jmp. Remember, my > first go was to use the jmp. >

I'll try to find time to compare:
- multi-instruction 5-bytes nops (although this approach is just buggy)
- 5-bytes jump to the next address
- 2-bytes jump to offset +3.

Mathieu > -- Steve > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/5] ftrace: to kill a daemon 2008-08-08 18:21 ` Mathieu Desnoyers @ 2008-08-08 18:41 ` Steven Rostedt 2008-08-08 19:05 ` Mathieu Desnoyers 0 siblings, 1 reply; 18+ messages in thread From: Steven Rostedt @ 2008-08-08 18:41 UTC (permalink / raw) To: Mathieu Desnoyers Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams On Fri, 8 Aug 2008, Mathieu Desnoyers wrote: > * Steven Rostedt (rostedt@goodmis.org) wrote: > > > > > > > > > > That's bad : > > > > > > #define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4 > > > > > > #define K8_NOP5 K8_NOP3 K8_NOP2 > > > > > > #define K7_NOP5 K7_NOP4 ASM_NOP1 > > > > > > So, when you try, later, to replace these instructions with a single > > > 5-bytes instruction, a preempted thread could iret in the middle of your > > > 5-bytes insn and cause an illegal instruction ? > > > > That's why I use kstop_machine. > > > > kstop_machine does not guarantee that you won't have _any_ thread > preempted with IP pointing exactly in the middle of your instructions > _before_ the modification scheduled back in _after_ the modification and > thus causing an illegal instruction. > > Still buggy. :/ Hmm, good point. Unless... Can a processor be preempted in the middle of nops? What do nops do for a processor? Can it skip them nicely in one shot? This means I'll have to do the benchmarks again, and see whether the performance difference between a jmp and a nop is significant. 
I can add a test in x86 ftrace.c to check to see which nop was used, and use the jmp if the arch does not have a 5 byte nop. I'm assuming that jmp is more expensive than the nops because otherwise a jmp 0 would have been used as a 5 byte nop. -- Steve ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/5] ftrace: to kill a daemon 2008-08-08 18:41 ` Steven Rostedt @ 2008-08-08 19:05 ` Mathieu Desnoyers 2008-08-08 23:38 ` Steven Rostedt 0 siblings, 1 reply; 18+ messages in thread From: Mathieu Desnoyers @ 2008-08-08 19:05 UTC (permalink / raw) To: Steven Rostedt Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams * Steven Rostedt (rostedt@goodmis.org) wrote: > > On Fri, 8 Aug 2008, Mathieu Desnoyers wrote: > > > * Steven Rostedt (rostedt@goodmis.org) wrote: > > > > > > > > > > > > > That's bad : > > > > > > > > #define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4 > > > > > > > > #define K8_NOP5 K8_NOP3 K8_NOP2 > > > > > > > > #define K7_NOP5 K7_NOP4 ASM_NOP1 > > > > > > > > So, when you try, later, to replace these instructions with a single > > > > 5-bytes instruction, a preempted thread could iret in the middle of your > > > > 5-bytes insn and cause an illegal instruction ? > > > > > > That's why I use kstop_machine. > > > > > > > kstop_machine does not guarantee that you won't have _any_ thread > > preempted with IP pointing exactly in the middle of your instructions > > _before_ the modification scheduled back in _after_ the modification and > > thus causing an illegal instruction. > > > > Still buggy. :/ > > Hmm, good point. Unless... > > Can a processor be preempted in a middle of nops? What do nops do for a > processor? Can it skip them nicely in one shot? > Given that those are multiple instructions, I think a processor has all the rights to preempt in the middle of them. And even if some specific architecture, for any obscure reason, happens to merge them, I don't think this will be portable across Intel, AMD, ... > This means I'll have to do the benchmarks again, and see what the > performance difference of a jmp and a nop is significant. 
I'm thinking > that if the processor can safely skip nops without any type of processing, > this may be the reason that nops are better than a jmp. A jmp causes the > processor to do a little more work. > > I might even run a test to see if I can force a processor that uses the > three-two nops to preempt between them. >

Yup, although one architecture not triggering this doesn't say much about the various x86 flavors out there. In any case:
- if you trigger the problem, we have to fix it.
- if you do not succeed in triggering the problem, we will have to test it on a wider architecture range and maybe end up fixing it anyway to play safe with the specs.

So, in every case, we end up fixing the issue.

> I can add a test in x86 ftrace.c to check to see which nop was used, and > use the jmp if the arch does not have a 5 byte nop. >

I would propose the following alternative. Create new macros in include/asm-x86/nops.h :

/* short jump, offset 3 bytes : skips total of 5 bytes */
#define GENERIC_ATOMIC_NOP5 ".byte 0xeb,0x03,0x00,0x00,0x00\n"

#if defined(CONFIG_MK7)
#define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5
#elif defined(CONFIG_X86_P6_NOP)
#define ATOMIC_NOP5 P6_NOP5
#elif defined(CONFIG_X86_64)
#define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5
#else
#define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5
#endif

And then optimize if necessary. You will probably find plenty of knowledgeable people who will know of a 5-bytes nop instruction more efficient than this "generic" short jump of offset 0x3. Then you can use the (buggy) 3nops/2nops as a performance baseline and see the performance hit on each architecture. First get it right, then make it fast.... Mathieu

> I'm assuming that jmp is more expensive than the nops because otherwise > a jmp 0 would have been used as a 5 byte nop. > > -- Steve

-- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/5] ftrace: to kill a daemon 2008-08-08 19:05 ` Mathieu Desnoyers @ 2008-08-08 23:38 ` Steven Rostedt 2008-08-09 0:23 ` Andi Kleen 0 siblings, 1 reply; 18+ messages in thread From: Steven Rostedt @ 2008-08-08 23:38 UTC (permalink / raw) To: Mathieu Desnoyers Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams [ patch included ] On Fri, 8 Aug 2008, Mathieu Desnoyers wrote: > * Steven Rostedt (rostedt@goodmis.org) wrote: > > > > That's why I use kstop_machine. > > > > > > > > > > kstop_machine does not guarantee that you won't have _any_ thread > > > preempted with IP pointing exactly in the middle of your instructions > > > _before_ the modification scheduled back in _after_ the modification and > > > thus causing an illegal instruction. > > > > > > Still buggy. :/ > > > > Hmm, good point. Unless... > > > > Can a processor be preempted in a middle of nops? What do nops do for a > > processor? Can it skip them nicely in one shot? > > > > Given that those are multiple instructions, I think a processor has all > the rights to preempt in the middle of them. And even if some specific > architecture, for any obscure reason, happens to merge them, I don't > think this will be portable across Intel, AMD, ... > > > This means I'll have to do the benchmarks again, and see what the > > performance difference of a jmp and a nop is significant. I'm thinking > > that if the processor can safely skip nops without any type of processing, > > this may be the reason that nops are better than a jmp. A jmp causes the > > processor to do a little more work. > > > > I might even run a test to see if I can force a processor that uses the > > three-two nops to preempt between them. 
> > > > Yup, although one architecture not triggering this doesn't say much > about the various x86 flavors out there. In any case > - if you trigger the problem, we have to fix it. > - if you do not succeed to trigger the problem, we will have to test it > on a wider architecture range and maybe end up fixit it anyway to play > safe with the specs. > > So, in every case, we end up fixing the issue. > > > > I can add a test in x86 ftrace.c to check to see which nop was used, and > > use the jmp if the arch does not have a 5 byte nop. > > > > I would propose the following alternative : > > Create new macros in include/asm-x86/nops.h : > > /* short jump, offset 3 bytes : skips total of 5 bytes */ > #define GENERIC_ATOMIC_NOP5 ".byte 0xeb,0x03,0x00,0x00,0x00\n" > > #if defined(CONFIG_MK7) > #define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5 > #elif defined(CONFIG_X86_P6_NOP) > #define ATOMIC_NOP5 P6_NOP5 > #elif defined(CONFIG_X86_64) > #define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5 > #else > #define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5 > #endif > > And then optimize if necessary. You will probably find plenty of > knowledgeable people who will know better 5-bytes nop instruction more > efficient than this "generic" short jump offset 0x3. > > Then you can use the (buggy) 3nops/2nops as a performance baseline and > see the performance hit on each architecture. > > First get it right, then make it fast.... > I'm stubborn, I want to get it right _and_ keep it fast. I still want the NOPS. Using jmps will hurt performance and that would keep this turned off on all distros. But lets think outside the box here (and we will ignore Alan's cat). Right now the issue is that we might preempt after the first nop, and when we enable the code, that task will crash when it tries to read the second nop. Since we are doing the modifications from kstop_machine, all the tasks are stopped. 
We can simply look to see if the tasks have been preempted in kernel space and if so, is their instruction pointer pointing to the second nop. If it is, move the ip forward. Here's a patch that does just that for both i386 and x86_64. I added a field in the thread_info struct called "ip". This is a pointer to the location of the task ip in the stack if it was preempted in kernel space. Null otherwise: jz restore_all + lea PT_EIP(%esp), %eax + movl %eax, TI_ip(%ebp) call preempt_schedule_irq + GET_THREAD_INFO(%ebp) + movl $0, TI_ip(%ebp) jmp need_resched Then, just before we enable tracing (we only need to do this when we enable tracing, since that is when we have a two instruction nop), we look at all the tasks. If the task->thread_info->ip is set, this means that it was preempted just before going back to the kernel. We look at the **ip and see if it compares with the second nop. If it does, we increment the ip by the size of that nop: if (memcmp(*ip, second_nop, x86_nop5_part2) == 0) /* Match, move the ip forward */ *ip += x86_nop5_part2; We do this just once before enabling all the locations, and we only do it if we have a two part nop. Interesting enough, I wrote a module that did the following: void (*silly_func)(void); void do_something_silly(void) { } static int my_thread(void *arg) { int i; while (!kthread_should_stop()) { for (i=0; i < 100; i++) silly_func(); } return 0; } static struct task_struct *p; static int __init mcount_stress_init(void) { silly_func = do_something_silly; p = kthread_run(my_thread, NULL, "sillytask"); return 0; } static void mcount_stress_exit(void) { kthread_stop(p); } The do_something_silly had an mcount pointer to it. I put in printks in the ftrace enabled code to see where this was preempted. It was preempted several times before and after the nops, but never at either nop. Maybe I didn't run it enough (almost 2 hours), but perhaps it is very unlikely to be preempted at a nop if there's something coming up next. 
Yes a string of nops may be preempted, but perhaps only two nops followed by an actual command might be skipped quickly. I'll write some hacks to look at where it is preempted in the scheduler itself, and see if I see it preempting at the second nop ever. But here's a patch that will work around the problem that we might be preempted within the two nops. Note, this is only in the slow path of enabling the function tracer. It is only done at enabling time inside the kstop_machine, which has a large overhead anyways. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- arch/x86/kernel/alternative.c | 29 +++++++++++++++++++------- arch/x86/kernel/asm-offsets_32.c | 1 arch/x86/kernel/asm-offsets_64.c | 1 arch/x86/kernel/entry_32.S | 4 +++ arch/x86/kernel/entry_64.S | 4 +++ arch/x86/kernel/ftrace.c | 43 +++++++++++++++++++++++++++++++++++++++ include/asm-x86/ftrace.h | 5 ++++ include/asm-x86/thread_info.h | 4 +++ kernel/trace/ftrace.c | 12 ++++++++++ 9 files changed, 96 insertions(+), 7 deletions(-) Index: linux-tip.git/arch/x86/kernel/alternative.c =================================================================== --- linux-tip.git.orig/arch/x86/kernel/alternative.c 2008-06-05 11:52:24.000000000 -0400 +++ linux-tip.git/arch/x86/kernel/alternative.c 2008-08-08 16:20:23.000000000 -0400 @@ -140,13 +140,26 @@ static const unsigned char *const p6_nop }; #endif +/* + * Some versions of x86 CPUs have a two part NOP5. This + * can break ftrace if a process is preempted between + * the two. ftrace needs to know what the second nop + * is to handle this case. + */ +int x86_nop5_part2; + #ifdef CONFIG_X86_64 extern char __vsyscall_0; const unsigned char *const *find_nop_table(void) { - return boot_cpu_data.x86_vendor != X86_VENDOR_INTEL || - boot_cpu_data.x86 < 6 ? 
k8_nops : p6_nops; + if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL || + boot_cpu_data.x86 < 6) { + x86_nop5_part2 = 2; /* K8_NOP2 */ + return k8_nops; + } else + /* keep k86_nop5_part2 NULL */ + return p6_nops; } #else /* CONFIG_X86_64 */ @@ -154,12 +167,13 @@ const unsigned char *const *find_nop_tab static const struct nop { int cpuid; const unsigned char *const *noptable; + int nop5_part2; /* size of part2 nop */ } noptypes[] = { - { X86_FEATURE_K8, k8_nops }, - { X86_FEATURE_K7, k7_nops }, - { X86_FEATURE_P4, p6_nops }, - { X86_FEATURE_P3, p6_nops }, - { -1, NULL } + { X86_FEATURE_K8, k8_nops, 2}, + { X86_FEATURE_K7, k7_nops, 1 }, + { X86_FEATURE_P4, p6_nops, 0 }, + { X86_FEATURE_P3, p6_nops, 0 }, + { -1, NULL, 0 } }; const unsigned char *const *find_nop_table(void) @@ -170,6 +184,7 @@ const unsigned char *const *find_nop_tab for (i = 0; noptypes[i].cpuid >= 0; i++) { if (boot_cpu_has(noptypes[i].cpuid)) { noptable = noptypes[i].noptable; + x86_nop5_part2 = noptypes[i].nop5_part2; break; } } Index: linux-tip.git/arch/x86/kernel/asm-offsets_32.c =================================================================== --- linux-tip.git.orig/arch/x86/kernel/asm-offsets_32.c 2008-07-27 10:43:26.000000000 -0400 +++ linux-tip.git/arch/x86/kernel/asm-offsets_32.c 2008-08-08 15:46:55.000000000 -0400 @@ -59,6 +59,7 @@ void foo(void) OFFSET(TI_restart_block, thread_info, restart_block); OFFSET(TI_sysenter_return, thread_info, sysenter_return); OFFSET(TI_cpu, thread_info, cpu); + OFFSET(TI_ip, thread_info, ip); BLANK(); OFFSET(GDS_size, desc_ptr, size); Index: linux-tip.git/arch/x86/kernel/asm-offsets_64.c =================================================================== --- linux-tip.git.orig/arch/x86/kernel/asm-offsets_64.c 2008-07-27 10:43:26.000000000 -0400 +++ linux-tip.git/arch/x86/kernel/asm-offsets_64.c 2008-08-08 15:52:34.000000000 -0400 @@ -41,6 +41,7 @@ int main(void) ENTRY(addr_limit); ENTRY(preempt_count); ENTRY(status); + ENTRY(ip); #ifdef 
CONFIG_IA32_EMULATION ENTRY(sysenter_return); #endif Index: linux-tip.git/arch/x86/kernel/entry_32.S =================================================================== --- linux-tip.git.orig/arch/x86/kernel/entry_32.S 2008-07-27 10:43:26.000000000 -0400 +++ linux-tip.git/arch/x86/kernel/entry_32.S 2008-08-08 17:13:27.000000000 -0400 @@ -304,7 +304,11 @@ need_resched: jz restore_all testl $X86_EFLAGS_IF,PT_EFLAGS(%esp) # interrupts off (exception path) ? jz restore_all + lea PT_EIP(%esp), %eax + movl %eax, TI_ip(%ebp) call preempt_schedule_irq + GET_THREAD_INFO(%ebp) + movl $0, TI_ip(%ebp) jmp need_resched END(resume_kernel) #endif Index: linux-tip.git/arch/x86/kernel/entry_64.S =================================================================== --- linux-tip.git.orig/arch/x86/kernel/entry_64.S 2008-07-27 10:43:26.000000000 -0400 +++ linux-tip.git/arch/x86/kernel/entry_64.S 2008-08-08 17:12:47.000000000 -0400 @@ -837,7 +837,11 @@ ENTRY(retint_kernel) jnc retint_restore_args bt $9,EFLAGS-ARGOFFSET(%rsp) /* interrupts off? */ jnc retint_restore_args + leaq RIP-ARGOFFSET(%rsp), %rax + movq %rax, TI_ip(%rcx) call preempt_schedule_irq + GET_THREAD_INFO(%rcx) + movq $0, TI_ip(%rcx) jmp exit_intr #endif Index: linux-tip.git/arch/x86/kernel/ftrace.c =================================================================== --- linux-tip.git.orig/arch/x86/kernel/ftrace.c 2008-06-26 14:58:54.000000000 -0400 +++ linux-tip.git/arch/x86/kernel/ftrace.c 2008-08-08 17:48:04.000000000 -0400 @@ -127,6 +127,46 @@ notrace int ftrace_mcount_set(unsigned l return 0; } +static const unsigned char *second_nop; + +void arch_ftrace_pre_enable(void) +{ + struct task_struct *g, *p; + unsigned long **ip; + + int i; + + if (!second_nop) + return; + + /* + * x86 has a two part nop to handle 5 byte instructions. + * If a task was preempted after the first nop, and has + * not ran the second nop, if we modify the code, we can + * crash the system. 
Thus, we will look at all the tasks + * and if any of them was preempted and will run the + * second nop next, we simply move their ip pointer past + * the second nop. + */ + + /* + * Don't need to grab the task list lock, we are running + * in kstop_machine + */ + do_each_thread(g, p) { + /* + * In entry.S we save the ip when a task is preempted + * and reset it when it is back running. + */ + ip = task_thread_info(p)->ip; + if (!ip) + continue; + if (memcmp(*ip, second_nop, x86_nop5_part2) == 0) + /* Match, move the ip forward */ + *ip += x86_nop5_part2; + } while_each_thread(g, p); +} + int __init ftrace_dyn_arch_init(void *data) { const unsigned char *const *noptable = find_nop_table(); @@ -137,5 +177,8 @@ int __init ftrace_dyn_arch_init(void *da ftrace_nop = (unsigned long *)noptable[MCOUNT_INSN_SIZE]; + if (x86_nop5_part2) + second_nop = noptable[x86_nop5_part2]; + return 0; } Index: linux-tip.git/include/asm-x86/ftrace.h =================================================================== --- linux-tip.git.orig/include/asm-x86/ftrace.h 2008-08-08 13:00:51.000000000 -0400 +++ linux-tip.git/include/asm-x86/ftrace.h 2008-08-08 16:41:09.000000000 -0400 @@ -17,6 +17,11 @@ static inline unsigned long ftrace_call_ */ return addr - 1; } + +extern int x86_nop5_part2; +extern void arch_ftrace_pre_enable(void); +#define ftrace_pre_enable arch_ftrace_pre_enable + #endif #endif /* CONFIG_FTRACE */ Index: linux-tip.git/include/asm-x86/thread_info.h =================================================================== --- linux-tip.git.orig/include/asm-x86/thread_info.h 2008-08-07 11:14:43.000000000 -0400 +++ linux-tip.git/include/asm-x86/thread_info.h 2008-08-08 17:06:15.000000000 -0400 @@ -29,6 +29,9 @@ struct thread_info { __u32 cpu; /* current CPU */ int preempt_count; /* 0 => preemptable, <0 => BUG */ + unsigned long **ip; /* pointer to ip on stackwhen + preempted + */ mm_segment_t addr_limit; struct restart_block restart_block; void __user *sysenter_return; @@ -47,6 
+50,7 @@ struct thread_info { .flags = 0, \ .cpu = 0, \ .preempt_count = 1, \ + .ip = NULL, \ .addr_limit = KERNEL_DS, \ .restart_block = { \ .fn = do_no_restart_syscall, \ Index: linux-tip.git/kernel/trace/ftrace.c =================================================================== --- linux-tip.git.orig/kernel/trace/ftrace.c 2008-08-08 13:00:52.000000000 -0400 +++ linux-tip.git/kernel/trace/ftrace.c 2008-08-08 16:18:14.000000000 -0400 @@ -32,6 +32,10 @@ #include "trace.h" +#ifndef ftrace_pre_enable +# define ftrace_pre_enable() do { } while (0) +#endif + /* ftrace_enabled is a method to turn ftrace on or off */ int ftrace_enabled __read_mostly; static int last_ftrace_enabled; @@ -500,6 +504,14 @@ static void ftrace_replace_code(int enab else new = ftrace_nop_replace(); + /* + * Some archs *cough*x86*cough* have more than one nop to cover + * the call to mcount. In these cases, special care must be taken + * before we start converting nops into calls. + */ + if (enable) + ftrace_pre_enable(); + for (pg = ftrace_pages_start; pg; pg = pg->next) { for (i = 0; i < pg->index; i++) { rec = &pg->records[i]; ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/5] ftrace: to kill a daemon 2008-08-08 23:38 ` Steven Rostedt @ 2008-08-09 0:23 ` Andi Kleen 2008-08-09 0:36 ` Steven Rostedt 0 siblings, 1 reply; 18+ messages in thread From: Andi Kleen @ 2008-08-09 0:23 UTC (permalink / raw) To: Steven Rostedt Cc: Mathieu Desnoyers, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams Steven Rostedt <rostedt@goodmis.org> writes: > I'm stubborn, I want to get it right _and_ keep it fast. For me it would seem better to just not use two part 5 byte nops instead of adding such hacks. I doubt there are that many of them anyways. I bet you won't be able to measure any difference between the different nop types in any macro benchmark. -Andi ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/5] ftrace: to kill a daemon 2008-08-09 0:23 ` Andi Kleen @ 2008-08-09 0:36 ` Steven Rostedt 2008-08-09 0:47 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 18+ messages in thread From: Steven Rostedt @ 2008-08-09 0:36 UTC (permalink / raw) To: Andi Kleen Cc: Mathieu Desnoyers, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams On Sat, 9 Aug 2008, Andi Kleen wrote: > Steven Rostedt <rostedt@goodmis.org> writes: > > > I'm stubborn, I want to get it right _and_ keep it fast. > > For me it would seem better to just not use two part 5 byte nops > instead of adding such hacks. I doubt there are that many of them > anyways. I bet you won't be able to measure any difference between the > different nop types in any macro benchmark. I wish we had a true 5 byte nop. The alternative is a jmp 0, which is measurable. This is replacing mcount from a kernel compile with the -pg option. With a basic build (not counting modules), I have over 15,000 locations that are turned into these 5 byte nops. # objdump -dr vmlinux.o | grep mcount |wc 15152 45489 764924 If we use the jmp 0, then yes, we will see the overhead. The double nop that is used for 5 bytes, is significantly better than the jump. -- Steve ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/5] ftrace: to kill a daemon 2008-08-09 0:36 ` Steven Rostedt @ 2008-08-09 0:47 ` Jeremy Fitzhardinge 2008-08-09 0:51 ` Linus Torvalds 0 siblings, 1 reply; 18+ messages in thread From: Jeremy Fitzhardinge @ 2008-08-09 0:47 UTC (permalink / raw) To: Steven Rostedt Cc: Andi Kleen, Mathieu Desnoyers, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams Steven Rostedt wrote: > I wish we had a true 5 byte nop. 0x66 0x66 0x66 0x66 0x90 J ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/5] ftrace: to kill a daemon 2008-08-09 0:47 ` Jeremy Fitzhardinge @ 2008-08-09 0:51 ` Linus Torvalds 2008-08-09 1:25 ` Steven Rostedt 0 siblings, 1 reply; 18+ messages in thread From: Linus Torvalds @ 2008-08-09 0:51 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Steven Rostedt, Andi Kleen, Mathieu Desnoyers, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams On Fri, 8 Aug 2008, Jeremy Fitzhardinge wrote: > > Steven Rostedt wrote: > > I wish we had a true 5 byte nop. > > 0x66 0x66 0x66 0x66 0x90 I don't think so. Multiple redundant prefixes can be really expensive on some uarchs. A no-op that isn't cheap isn't a no-op at all, it's a slow-op. Linus ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/5] ftrace: to kill a daemon 2008-08-09 0:51 ` Linus Torvalds @ 2008-08-09 1:25 ` Steven Rostedt 2008-08-13 17:52 ` Efficient x86 and x86_64 NOP microbenchmarks Mathieu Desnoyers 0 siblings, 1 reply; 18+ messages in thread From: Steven Rostedt @ 2008-08-09 1:25 UTC (permalink / raw) To: Linus Torvalds Cc: Jeremy Fitzhardinge, Andi Kleen, Mathieu Desnoyers, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams On Fri, 8 Aug 2008, Linus Torvalds wrote: > > > On Fri, 8 Aug 2008, Jeremy Fitzhardinge wrote: > > > > Steven Rostedt wrote: > > > I wish we had a true 5 byte nop. > > > > 0x66 0x66 0x66 0x66 0x90 > > I don't think so. Multiple redundant prefixes can be really expensive on > some uarchs. > > A no-op that isn't cheap isn't a no-op at all, it's a slow-op. A quick meaningless benchmark showed a slight performance hit.
Here's 10 runs of "hackbench 50" using the two part 5 byte nop: run 1 Time: 4.501 run 2 Time: 4.855 run 3 Time: 4.198 run 4 Time: 4.587 run 5 Time: 5.016 run 6 Time: 4.757 run 7 Time: 4.477 run 8 Time: 4.693 run 9 Time: 4.710 run 10 Time: 4.715 avg = 4.6509 And 10 runs using the above 5 byte nop: run 1 Time: 4.832 run 2 Time: 5.319 run 3 Time: 5.213 run 4 Time: 4.830 run 5 Time: 4.363 run 6 Time: 4.391 run 7 Time: 4.772 run 8 Time: 4.992 run 9 Time: 4.727 run 10 Time: 4.825 avg = 4.8264 # cat /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 15 model : 65 model name : Dual-Core AMD Opteron(tm) Processor 2220 stepping : 3 cpu MHz : 2799.992 cache size : 1024 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 2 apicid : 0 initial apicid : 0 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy bogomips : 5599.98 clflush size : 64 power management: ts fid vid ttp tm stc There's 4 of these. Just to make sure, I ran the above nop test again: [ this is reverse from the above runs ] run 1 Time: 4.723 run 2 Time: 5.080 run 3 Time: 4.521 run 4 Time: 4.841 run 5 Time: 4.696 run 6 Time: 4.946 run 7 Time: 4.754 run 8 Time: 4.717 run 9 Time: 4.905 run 10 Time: 4.814 avg = 4.7997 And again the two part nop: run 1 Time: 4.434 run 2 Time: 4.496 run 3 Time: 4.801 run 4 Time: 4.714 run 5 Time: 4.631 run 6 Time: 5.178 run 7 Time: 4.728 run 8 Time: 4.920 run 9 Time: 4.898 run 10 Time: 4.770 avg = 4.757 This time it was close, but still seems to have some difference. heh, perhaps it's just noise. -- Steve ^ permalink raw reply [flat|nested] 18+ messages in thread
* Efficient x86 and x86_64 NOP microbenchmarks 2008-08-09 1:25 ` Steven Rostedt @ 2008-08-13 17:52 ` Mathieu Desnoyers 2008-08-13 18:27 ` Linus Torvalds 0 siblings, 1 reply; 18+ messages in thread From: Mathieu Desnoyers @ 2008-08-13 17:52 UTC (permalink / raw) To: Steven Rostedt Cc: Linus Torvalds, Jeremy Fitzhardinge, Andi Kleen, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams * Steven Rostedt (rostedt@goodmis.org) wrote: > > On Fri, 8 Aug 2008, Linus Torvalds wrote: > > > > > > On Fri, 8 Aug 2008, Jeremy Fitzhardinge wrote: > > > > > > Steven Rostedt wrote: > > > > I wish we had a true 5 byte nop. > > > > > > 0x66 0x66 0x66 0x66 0x90 > > > > I don't think so. Multiple redundant prefixes can be really expensive on > > some uarchs. > > > > A no-op that isn't cheap isn't a no-op at all, it's a slow-op. > > > A quick meaningless benchmark showed a slight perfomance hit. > Hi Steven, I also did some microbenchmarks on my Intel Xeon 64 bits, AMD64 and Intel Pentium 4 boxes to compare a baseline (function doing a bit of memory read and arithmetic operations) to cases where nops are used. Here are the results. The kernel module used for the benchmarks is below, feel free to run it on your own architectures. 
Xeon : NR_TESTS 10000000 test empty cycles : 165472020 test 2-bytes jump cycles : 166666806 test 5-bytes jump cycles : 166978164 test 3/2 nops cycles : 169259406 test 5-bytes nop with long prefix cycles : 160000140 test 5-bytes P6 nop cycles : 163333458 AMD64 : NR_TESTS 10000000 test empty cycles : 145142367 test 2-bytes jump cycles : 150000178 test 5-bytes jump cycles : 150000171 test 3/2 nops cycles : 159999994 test 5-bytes nop with long prefix cycles : 150000156 test 5-bytes P6 nop cycles : 150000148 Intel Pentium 4 : NR_TESTS 10000000 test empty cycles : 290001045 test 2-bytes jump cycles : 310000568 test 5-bytes jump cycles : 310000478 test 3/2 nops cycles : 290000565 test 5-bytes nop with long prefix cycles : 311085510 test 5-bytes P6 nop cycles : 300000517 test Generic 1/4 5-bytes nops cycles : 310000553 test K7 1/4 5-bytes nops cycles : 300000533 These numbers show that both on Xeon and AMD64, the .byte 0x66,0x66,0x66,0x66,0x90 (osp osp osp osp nop, which is not currently used in nops.h) is the fastest nop on both architectures. The currently used 3/2 nops looks like a _very_ bad choice for AMD64 cycle-wise. The currently used 5-bytes P6 nop used on Xeon seems to be a bit slower than the 0x66,0x66,0x66,0x66,0x90 nop too. For the Intel Pentium 4, the best atomic choice seems to be the current one (5-bytes P6 nop : .byte 0x0f,0x1f,0x44,0x00,0), although we can see that the 3/2 nop used for K8 would be a bit faster. It is probably due to the fact that P4 handles long instruction prefixes slowly. Is there any reason why not to use these atomic nops and kill our instruction atomicity problems altogether ? 
(various cpuinfo can be found below) Mathieu /* test-nop-speed.c * */ #include <linux/module.h> #include <linux/proc_fs.h> #include <linux/sched.h> #include <linux/timex.h> #include <linux/marker.h> #include <asm/ptrace.h> #define NR_TESTS 10000000 int var, var2; struct proc_dir_entry *pentry = NULL; void empty(void) { asm volatile (""); var += 50; var /= 10; var *= var2; } void twobytesjump(void) { asm volatile ("jmp 1f\n\t" ".byte 0x00, 0x00, 0x00\n\t" "1:\n\t"); var += 50; var /= 10; var *= var2; } void fivebytesjump(void) { asm volatile (".byte 0xe9, 0x00, 0x00, 0x00, 0x00\n\t"); var += 50; var /= 10; var *= var2; } void threetwonops(void) { asm volatile (".byte 0x66,0x66,0x90,0x66,0x90\n\t"); var += 50; var /= 10; var *= var2; } void fivebytesnop(void) { asm volatile (".byte 0x66,0x66,0x66,0x66,0x90\n\t"); var += 50; var /= 10; var *= var2; } void fivebytespsixnop(void) { asm volatile (".byte 0x0f,0x1f,0x44,0x00,0\n\t"); var += 50; var /= 10; var *= var2; } /* * GENERIC_NOP1 GENERIC_NOP4, * 1: nop * _not_ nops in 64-bit mode. * 4: leal 0x00(,%esi,1),%esi */ void genericfivebytesonefournops(void) { asm volatile (".byte 0x90,0x8d,0x74,0x26,0x00\n\t"); var += 50; var /= 10; var *= var2; } /* * K7_NOP4 ASM_NOP1 * 1: nop * assumed _not_ to be nops in 64-bit mode. 
* leal 0x00(,%eax,1),%eax */ void k7fivebytesonefournops(void) { asm volatile (".byte 0x90,0x8d,0x44,0x20,0x00\n\t"); var += 50; var /= 10; var *= var2; } void perform_test(const char *name, void (*callback)(void)) { unsigned int i; cycles_t cycles1, cycles2; unsigned long flags; local_irq_save(flags); rdtsc_barrier(); cycles1 = get_cycles(); rdtsc_barrier(); for(i=0; i<NR_TESTS; i++) { callback(); } rdtsc_barrier(); cycles2 = get_cycles(); rdtsc_barrier(); local_irq_restore(flags); printk("test %s cycles : %llu\n", name, cycles2-cycles1); } static int my_open(struct inode *inode, struct file *file) { printk("NR_TESTS %d\n", NR_TESTS); perform_test("empty", empty); perform_test("2-bytes jump", twobytesjump); perform_test("5-bytes jump", fivebytesjump); perform_test("3/2 nops", threetwonops); perform_test("5-bytes nop with long prefix", fivebytesnop); perform_test("5-bytes P6 nop", fivebytespsixnop); #ifdef CONFIG_X86_32 perform_test("Generic 1/4 5-bytes nops", genericfivebytesonefournops); perform_test("K7 1/4 5-bytes nops", k7fivebytesonefournops); #endif return -EPERM; } static struct file_operations my_operations = { .open = my_open, }; int init_module(void) { pentry = create_proc_entry("testnops", 0444, NULL); if (pentry) pentry->proc_fops = &my_operations; return 0; } void cleanup_module(void) { remove_proc_entry("testnops", NULL); } MODULE_LICENSE("GPL"); MODULE_AUTHOR("Mathieu Desnoyers"); MODULE_DESCRIPTION("NOP Test"); Xeon cpuinfo : processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz stepping : 6 cpu MHz : 2000.126 cache size : 6144 KB physical id : 0 siblings : 4 core id : 0 cpu cores : 4 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx tm2 ssse3 cx16 
xtpr dca sse4_1 lahf_lm bogomips : 4000.25 clflush size : 64 cache_alignment : 64 address sizes : 38 bits physical, 48 bits virtual power management: AMD64 cpuinfo : processor : 0 vendor_id : AuthenticAMD cpu family : 15 model : 35 model name : AMD Athlon(tm)64 X2 Dual Core Processor 3800+ stepping : 2 cpu MHz : 2009.139 cache size : 512 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 2 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow rep_good pni lahf_lm cmp_legacy bogomips : 4022.42 TLB size : 1024 4K pages clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: ts fid vid ttp Pentium 4 : processor : 0 vendor_id : GenuineIntel cpu family : 15 model : 4 model name : Intel(R) Pentium(R) 4 CPU 3.00GHz stepping : 1 cpu MHz : 3000.138 cache size : 1024 KB physical id : 0 siblings : 1 core id : 0 cpu cores : 1 apicid : 0 initial apicid : 0 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 5 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx constant_tsc up pebs bts pni monitor ds_cpl cid xtpr bogomips : 6005.70 clflush size : 64 power management: > Here's 10 runs of "hackbench 50" using the two part 5 byte nop: > > run 1 > Time: 4.501 > run 2 > Time: 4.855 > run 3 > Time: 4.198 > run 4 > Time: 4.587 > run 5 > Time: 5.016 > run 6 > Time: 4.757 > run 7 > Time: 4.477 > run 8 > Time: 4.693 > run 9 > Time: 4.710 > run 10 > Time: 4.715 > avg = 4.6509 > > > And 10 runs using the above 5 byte nop: > > run 1 > Time: 4.832 > run 2 > Time: 5.319 > run 3 > Time: 5.213 > run 4 > Time: 4.830 > run 5 > Time: 4.363 > run 6 > Time: 4.391 > run 7 > Time: 4.772 > run 8 > Time: 4.992 > run 9 > Time: 4.727 > 
run 10 > Time: 4.825 > avg = 4.8264 > > # cat /proc/cpuinfo > processor : 0 > vendor_id : AuthenticAMD > cpu family : 15 > model : 65 > model name : Dual-Core AMD Opteron(tm) Processor 2220 > stepping : 3 > cpu MHz : 2799.992 > cache size : 1024 KB > physical id : 0 > siblings : 2 > core id : 0 > cpu cores : 2 > apicid : 0 > initial apicid : 0 > fdiv_bug : no > hlt_bug : no > f00f_bug : no > coma_bug : no > fpu : yes > fpu_exception : yes > cpuid level : 1 > wp : yes > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca > cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt > rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic > cr8_legacy > bogomips : 5599.98 > clflush size : 64 > power management: ts fid vid ttp tm stc > > There's 4 of these. > > Just to make sure, I ran the above nop test again: > > [ this is reverse from the above runs ] > > run 1 > Time: 4.723 > run 2 > Time: 5.080 > run 3 > Time: 4.521 > run 4 > Time: 4.841 > run 5 > Time: 4.696 > run 6 > Time: 4.946 > run 7 > Time: 4.754 > run 8 > Time: 4.717 > run 9 > Time: 4.905 > run 10 > Time: 4.814 > avg = 4.7997 > > And again the two part nop: > > run 1 > Time: 4.434 > run 2 > Time: 4.496 > run 3 > Time: 4.801 > run 4 > Time: 4.714 > run 5 > Time: 4.631 > run 6 > Time: 5.178 > run 7 > Time: 4.728 > run 8 > Time: 4.920 > run 9 > Time: 4.898 > run 10 > Time: 4.770 > avg = 4.757 > > > This time it was close, but still seems to have some difference. > > heh, perhaps it's just noise. > > -- Steve > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 17:52 ` Efficient x86 and x86_64 NOP microbenchmarks Mathieu Desnoyers @ 2008-08-13 18:27 ` Linus Torvalds 2008-08-13 18:41 ` Andi Kleen 2008-08-13 19:16 ` Mathieu Desnoyers 0 siblings, 2 replies; 18+ messages in thread From: Linus Torvalds @ 2008-08-13 18:27 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Steven Rostedt, Jeremy Fitzhardinge, Andi Kleen, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams On Wed, 13 Aug 2008, Mathieu Desnoyers wrote: > > I also did some microbenchmarks on my Intel Xeon 64 bits, AMD64 and > Intel Pentium 4 boxes to compare a baseline Note that the biggest problems of a jump-based nop are likely to happen when there are I$ misses and/or when there are other jumps involved. I.e. some microarchitectures tend to have issues with jumps to jumps, or when there are multiple control changes in the same (possibly partial) cacheline because the instruction stream prediction may be predecoded in the L1 I$, and multiple branches in the same cacheline - or in the same execution cycle - can pollute that kind of thing. So microbenchmarking this way will probably make some things look unrealistically good. On the P4, the trace cache makes things even more interesting, since it's another level of I$ entirely, with very different behavior for the hit case vs the miss case. And I$ misses for the kernel are actually fairly high. Not in microbenchmarks that tend to have very repetitive behavior and a small I$ footprint, but in a lot of real-life loads the *bulk* of all action is in user space, and then the kernel side is often invoked with few loops (the kernel has very few loops indeed) and a cold I$.
So your numbers are interesting, but it would be really good to also get some info from Intel/AMD who may know about microarchitectural issues for the cases that don't show up in the hot-I$-cache environment. Linus ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 18:27 ` Linus Torvalds @ 2008-08-13 18:41 ` Andi Kleen 2008-08-13 18:45 ` Avi Kivity 2008-08-13 19:30 ` Mathieu Desnoyers 2008-08-13 19:16 ` Mathieu Desnoyers 1 sibling, 2 replies; 18+ messages in thread From: Andi Kleen @ 2008-08-13 18:41 UTC (permalink / raw) To: Linus Torvalds Cc: Mathieu Desnoyers, Steven Rostedt, Jeremy Fitzhardinge, Andi Kleen, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams > So microbenchmarking this way will probably make some things look > unrealistically good. Must be careful not to miss the big picture here. We have two assumptions here in this thread: - Normal alternative() nops are relatively infrequent, typically in points with enough pipeline bubbles anyways, and it likely doesn't matter how they are encoded. And also they don't have an issue with multi-part instructions anyways because they're not patched at runtime, so always the best known can be used. - The one case where nops are very frequent and matter and multipart is a problem is with ftrace noping out the call to mcount at runtime because that happens on every function entry. Even there the overhead is not that big, but at least measurable in kernel builds. Now the numbers have shown that just by not using frame pointer ( -pg right now implies frame pointer) you can get more benefit than what you lose from using non optimal nops. So for me the best strategy would be to get rid of the frame pointer and ignore the nops. This unfortunately would require going away from -pg and instead post process gcc output to insert "call mcount" manually.
But the nice advantage of that is that you could actually set up a custom table of callers built in an ELF section and with that you don't actually need the runtime patching (which is only done currently because there's no global table of mcount calls), but could do everything in stop_machine(). Without runtime patching you also don't need single part nops. I think that would be the best option. I especially like it because it would prevent forcing frame pointer which seems to be costlier than any kinds of nops. -Andi ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 18:41 ` Andi Kleen @ 2008-08-13 18:45 ` Avi Kivity 2008-08-13 18:51 ` Andi Kleen 2008-08-13 19:30 ` Mathieu Desnoyers 1 sibling, 1 reply; 18+ messages in thread From: Avi Kivity @ 2008-08-13 18:45 UTC (permalink / raw) To: Andi Kleen Cc: Linus Torvalds, Mathieu Desnoyers, Steven Rostedt, Jeremy Fitzhardinge, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams Andi Kleen wrote: > So for me the best strategy would be to get rid of the frame pointer > and ignore the nops. This unfortunately would require going away > from -pg and instead post process gcc output to insert "call mcount" > manually. But the nice advantage of that is that you could actually > set up a custom table of callers built in a ELF section and with > that you don't actually need the runtime patching (which is only > done currently because there's no global table of mcount calls), > but could do everything in stop_machine(). Without > runtime patching you also don't need single part nops. > > I think that would be the best option. I especially like it because > it would prevent forcing frame pointer which seems to be costlier > than any kinds of nosp. > > How would you deal with inlines? Using debug information? -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 18:45 ` Avi Kivity @ 2008-08-13 18:51 ` Andi Kleen 2008-08-13 18:56 ` Avi Kivity 0 siblings, 1 reply; 18+ messages in thread From: Andi Kleen @ 2008-08-13 18:51 UTC (permalink / raw) To: Avi Kivity Cc: Andi Kleen, Linus Torvalds, Mathieu Desnoyers, Steven Rostedt, Jeremy Fitzhardinge, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams > How would you deal with inlines? Using debug information? -pg already ignores inlines, so they aren't even traced today. It pretty much has to, assume an inline gets spread out by the global optimizer over the rest of the function, where would the mcount calls be inserted? -Andi ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 18:51 ` Andi Kleen @ 2008-08-13 18:56 ` Avi Kivity 0 siblings, 0 replies; 18+ messages in thread From: Avi Kivity @ 2008-08-13 18:56 UTC (permalink / raw) To: Andi Kleen Cc: Linus Torvalds, Mathieu Desnoyers, Steven Rostedt, Jeremy Fitzhardinge, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams Andi Kleen wrote: >> How would you deal with inlines? Using debug information? >> > > -pg already ignores inlines, so they aren't even traced today. > > It pretty much has to, assume an inline gets spread out by > the global optimizer over the rest of the function, where would > the mcount calls be inserted? > Good point. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 18:41 ` Andi Kleen 2008-08-13 18:45 ` Avi Kivity @ 2008-08-13 19:30 ` Mathieu Desnoyers 2008-08-13 19:37 ` Andi Kleen 1 sibling, 1 reply; 18+ messages in thread From: Mathieu Desnoyers @ 2008-08-13 19:30 UTC (permalink / raw) To: Andi Kleen Cc: Linus Torvalds, Steven Rostedt, Jeremy Fitzhardinge, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams * Andi Kleen (andi@firstfloor.org) wrote: > > So microbenchmarking this way will probably make some things look > > unrealistically good. > > Must be careful to miss the big picture here. > > We have two assumptions here in this thread: > > - Normal alternative() nops are relatively infrequent, typically > in points with enough pipeline bubbles anyways, and it likely doesn't > matter how they are encode. And also they don't have an issue > with mult part instructions anyways because they're not patched > at runtime, so always the best known can be used. > > - The one case where nops are very frequent and matter and multipart > is a problem is with ftrace noping out the call to mcount at runtime > because that happens on every function entry. > Even there the overhead is not that big, but at least measurable > in kernel builds. > > Now the numbers have shown that just by not using frame pointer ( > -pg right now implies frame pointer) you can get more benefit > than what you lose from using non optimal nops. > > So for me the best strategy would be to get rid of the frame pointer > and ignore the nops. This unfortunately would require going away > from -pg and instead post process gcc output to insert "call mcount" > manually. 
But the nice advantage of that is that you could actually > set up a custom table of callers built in a ELF section and with > that you don't actually need the runtime patching (which is only > done currently because there's no global table of mcount calls), > but could do everything in stop_machine(). Without > runtime patching you also don't need single part nops. > I agree that if frame pointer brings a too big overhead, it should not be used. Sorry to ask, I feel I must be missing something, but I'm trying to figure out where you propose to add the "call mcount" ? In the caller or in the callee ? In the caller, I guess it would replace the normal function call, call a trampoline which would jump to the normal code. In the callee, as what is currently done with -pg, the callee would have a call mcount at the beginning of the function. Or is it a different scheme I don't see ? I am trying to figure out how you happen to do all that without dynamic code modification and manage not to hurt performance. Mathieu > I think that would be the best option. I especially like it because > it would prevent forcing frame pointer which seems to be costlier > than any kinds of nosp. > > -Andi > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 19:30 ` Mathieu Desnoyers @ 2008-08-13 19:37 ` Andi Kleen 2008-08-13 20:01 ` Mathieu Desnoyers 2008-08-15 21:34 ` Steven Rostedt 0 siblings, 2 replies; 18+ messages in thread From: Andi Kleen @ 2008-08-13 19:37 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Andi Kleen, Linus Torvalds, Steven Rostedt, Jeremy Fitzhardinge, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams > Sorry to ask, I feel I must be missing something, but I'm trying to > figure out where you propose to add the "call mcount" ? In the caller or > in the callee ? Callee, like gcc. Caller would likely be more bloated because there are more calls than functions. Also, if it was at the caller, more code would be needed because the function currently executed couldn't be gotten from the stack directly. > Or is it a different scheme I don't see ? I am trying to figure out how > you happen to do all that without dynamic code modification and manage > not to hurt performance. The dynamic code modification is only needed because there is no global table of the mcount call sites. So instead it discovers them at runtime, but that requires runtime-safe patching. With a custom call scheme one could just build up a table of call sites at link time using an ELF section and then, when tracing is enabled/disabled, always patch them all in one go in a stop_machine(). Then you wouldn't need parallel-execution-safe patching anymore and it doesn't matter what the nops look like. The other advantage is that it would allow getting rid of the frame pointer. -Andi ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 19:37 ` Andi Kleen @ 2008-08-13 20:01 ` Mathieu Desnoyers 2008-08-15 21:34 ` Steven Rostedt 1 sibling, 0 replies; 18+ messages in thread From: Mathieu Desnoyers @ 2008-08-13 20:01 UTC (permalink / raw) To: Andi Kleen Cc: Linus Torvalds, Steven Rostedt, Steven Rostedt, Jeremy Fitzhardinge, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams * Andi Kleen (andi@firstfloor.org) wrote: > > Sorry to ask, I feel I must be missing something, but I'm trying to > > figure out where you propose to add the "call mcount" ? In the caller or > > in the callee ? > > Callee, like gcc. Caller would likely be more bloated because > there are more calls than functions. Also, if it was at the > caller, more code would be needed because the function currently > executed couldn't be gotten from the stack directly. > > > Or is it a different scheme I don't see ? I am trying to figure out how > > you happen to do all that without dynamic code modification and manage > > not to hurt performance. > > The dynamic code modification is only needed because there is no > global table of the mcount call sites. So instead it discovers > them at runtime, but that requires runtime-safe patching. > > With a custom call scheme one could just build up a table of > call sites at link time using an ELF section and then, when > tracing is enabled/disabled, always patch them all in one go > in a stop_machine(). Then you wouldn't need parallel-execution-safe > patching anymore and it doesn't matter what the nops look like.
>
I agree that the custom call scheme could let you know the mcount call site addresses at link time, so you could replace the call instructions with nops (at link time, so you actually don't know much about the exact hardware the kernel will be running on, which makes it harder to choose the best nop). To me, it seems that doing this at link time, as you propose, is the best approach, as it won't impact the system bootup time as much as the current ftrace scheme. However, I disagree with you on one point: if you use nops which are made of multiple instructions smaller than 5 bytes, enabling the tracer (patching all the sites in a stop_machine()) still presents the risk of having a preempted thread with a return IP pointing directly into the middle of what will become a 5-byte call instruction. When the thread is scheduled again after the stop_machine, an illegal instruction fault (or any random effect) will occur. Therefore, building a table of mcount call sites in an ELF section, and declaring a _single_ 5-byte nop instruction in the instruction stream that would fit all target architectures in lieu of the mcount call, so it can later be patched in with the 5-byte call at runtime, seems like a good way to go. Mathieu P.S.: It would be good to have a look at the alternative.c lock prefix vs preemption race I identified a few weeks ago. Actually, this currently existing cpu hotplug bug is related to the preemption issue I just explained here. ref. http://lkml.org/lkml/2008/7/30/265, especially: "As a general rule, never try to combine smaller instructions into a bigger one, except in the case of adding a lock prefix to an instruction: this case ensures that the non-lock-prefixed instruction is still valid after the change has been done. We could however run into a nasty non-synchronized atomic instruction use in SMP mode if a thread happens to be scheduled out right after the lock prefix. Hopefully the alternative code uses the refrigerator... (hrm, it doesn't).
Actually, alternative.c lock-prefix modification is O.K. for spinlocks because they execute with preemption off, but not for other atomic operations which may execute with preemption on." > The other advantage is that it would allow getting rid of > the frame pointer. > > -Andi > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 19:37 ` Andi Kleen 2008-08-13 20:01 ` Mathieu Desnoyers @ 2008-08-15 21:34 ` Steven Rostedt 2008-08-15 21:51 ` Andi Kleen 1 sibling, 1 reply; 18+ messages in thread From: Steven Rostedt @ 2008-08-15 21:34 UTC (permalink / raw) To: Andi Kleen Cc: Mathieu Desnoyers, Linus Torvalds, Jeremy Fitzhardinge, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams [ Finally got my goodmis email back ] On Wed, 13 Aug 2008, Andi Kleen wrote: > > Sorry to ask, I feel I must be missing something, but I'm trying to > > figure out where you propose to add the "call mcount" ? In the caller or > > in the callee ? > > callee like gcc. caller would be likely more bloated because > there are more calls than functions. Also if it was at the > callee more code would be needed because the function currently > executed couldn't be gotten from stack directly. > > > Or is it a different scheme I don't see ? I am trying to figure out how > > you happen to do all that without dynamic code modification and manage > > not to hurt performance. > > The dynamic code modification is only needed because there is no > global table of the mcount call sites. So instead it discovers > them at runtime, but that requires runtime-safe patching The new code does not discover the places at runtime. The old code did that. The "to kill a daemon" patch series removed the runtime discovery and replaced it with discovery at compile time. > > With a custom call scheme one could just build up a table of > call sites at link time using an ELF section and then when > tracing is enabled/disabled always patch them all in one go > in a stop_machine(). Then you wouldn't need parallel execution safe > patching anymore and it doesn't matter what the nops look like. The current patch set pretty much does exactly this.
Yes, I patch at boot up all in one go, before the other CPUs are even active. This takes all of 6 milliseconds to do. Not much extra time for bootup. > > The other advantage is that it would allow getting rid of > the frame pointer. This is the only advantage that you have. -- Steve ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-15 21:34 ` Steven Rostedt @ 2008-08-15 21:51 ` Andi Kleen 0 siblings, 0 replies; 18+ messages in thread From: Andi Kleen @ 2008-08-15 21:51 UTC (permalink / raw) To: Steven Rostedt Cc: Andi Kleen, Mathieu Desnoyers, Linus Torvalds, Jeremy Fitzhardinge, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams > > The other advantage is that it would allow getting rid of > > the frame pointer. > > This is the only advantage that you have. Ok. But it's a serious one. It gives slightly more gain than your whole complicated patching exercise. Ok, maybe it would be better to just properly fix gcc, but the problem is it takes forever for the user base to actually start using a new gcc :/ -Andi ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 18:27 ` Linus Torvalds 2008-08-13 18:41 ` Andi Kleen @ 2008-08-13 19:16 ` Mathieu Desnoyers 1 sibling, 0 replies; 18+ messages in thread From: Mathieu Desnoyers @ 2008-08-13 19:16 UTC (permalink / raw) To: Linus Torvalds Cc: Steven Rostedt, Steven Rostedt, Jeremy Fitzhardinge, Andi Kleen, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams * Linus Torvalds (torvalds@linux-foundation.org) wrote: > > > On Wed, 13 Aug 2008, Mathieu Desnoyers wrote: > > > > I also did some microbenchmarks on my Intel Xeon 64 bits, AMD64 and > > Intel Pentium 4 boxes to compare a baseline > > Note that the biggest problems of a jump-based nop are likely to happen > when there are I$ misses and/or when there are other jumps involved. Ie some > microarchitectures tend to have issues with jumps to jumps, or when > there are multiple control changes in the same (possibly partial) > cacheline, because the instruction stream prediction may be predecoded in > the L1 I$, and multiple branches in the same cacheline - or in the same > execution cycle - can pollute that kind of thing. > Yup, I agree. Actually, the tests I ran show that using jumps as nops does not seem to be the best solution, even cycle-wise. > So microbenchmarking this way will probably make some things look > unrealistically good. > Yes, I am aware of these "high locality" effects. I use these tests as a starting point to find out which nops are good candidates, and then it can be later validated with more thorough testing on real workloads, which will suffer from higher standard deviation. Interestingly enough, the P6_NOPS seem to be a poor choice both at the macro and micro levels for the Intel Xeon (referring to http://lkml.org/lkml/2008/8/13/253 for the macro-benchmarks).
> On the P4, the trace cache makes things even more interesting, since it's > another level of I$ entirely, with very different behavior for the hit > case vs the miss case. As long as the whole kernel agrees on which instructions should be used for frequently used nops, the instruction trace cache should behave properly. > > And I$ misses for the kernel are actually fairly high. Not in > microbenchmarks that tend to have very repetitive behavior and a small I$ > footprint, but in a lot of real-life loads the *bulk* of all action is in > user space, and then the kernel side is often invoked with few loops (the > kernel has very few loops indeed) and a cold I$. I assume the effect of an I$ miss to be the same for all the tested scenarios (except on P4, and maybe except for the jump cases), given that in each case we load 5 bytes' worth of instructions. Even considering this, the results I get show that the choices made in the current kernel might not be the best ones. > > So your numbers are interesting, but it would be really good to also get > some info from Intel/AMD who may know about microarchitectural issues for > the cases that don't show up in the hot-I$-cache environment. > Yep. I think it may make a difference if we use jumps, but I doubt it will change anything for the other various nops. Still, having that information would be good. Some more numbers follow for older architectures.
Intel Pentium 3, 550MHz
NR_TESTS 10000000
test empty cycles : 510000254
test 2-bytes jump cycles : 510000077
test 5-bytes jump cycles : 510000101
test 3/2 nops cycles : 500000072
test 5-bytes nop with long prefix cycles : 500000107
test 5-bytes P6 nop cycles : 500000069 (current choice ok)
test Generic 1/4 5-bytes nops cycles : 514687590
test K7 1/4 5-bytes nops cycles : 530000012

Intel Pentium 3, 933MHz
NR_TESTS 10000000
test empty cycles : 510000565
test 2-bytes jump cycles : 510000133
test 5-bytes jump cycles : 510000363
test 3/2 nops cycles : 500000358
test 5-bytes nop with long prefix cycles : 500000331
test 5-bytes P6 nop cycles : 500000625 (current choice ok)
test Generic 1/4 5-bytes nops cycles : 514687797
test K7 1/4 5-bytes nops cycles : 530000273

Intel Pentium M, 2GHz
NR_TESTS 10000000
test empty cycles : 180000515
test 2-bytes jump cycles : 180000386 (would be the best)
test 5-bytes jump cycles : 205000435
test 3/2 nops cycles : 193333517
test 5-bytes nop with long prefix cycles : 205000167
test 5-bytes P6 nop cycles : 205937652
test Generic 1/4 5-bytes nops cycles : 187500174
test K7 1/4 5-bytes nops cycles : 193750161

Intel Pentium 3, 550MHz
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 7
model name : Pentium III (Katmai)
stepping : 3
cpu MHz : 551.295
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips : 1103.44
clflush size : 32

Intel Pentium 3, 933MHz
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 8
model name : Pentium III (Coppermine)
stepping : 6
cpu MHz : 933.134
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips : 1868.22
clflush size : 32

Intel Pentium M, 2GHz
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 13
model name : Intel(R) Pentium(R) M processor 2.00GHz
stepping : 8
cpu MHz : 2000.000
cache size : 2048 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat clflush dts acpi mmx fxsr sse sse2 ss tm pbe nx bts est tm2
bogomips : 3994.64
clflush size : 64

Mathieu > Linus -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2008-08-15 21:50 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20080813191926.GB15547@Krystal>
2008-08-13 20:00 ` Efficient x86 and x86_64 NOP microbenchmarks Steven Rostedt
2008-08-13 20:06 ` Jeremy Fitzhardinge
2008-08-13 20:34 ` Steven Rostedt
2008-08-13 20:15 ` Andi Kleen
2008-08-13 20:21 ` Linus Torvalds
2008-08-13 20:21 ` Steven Rostedt
2008-08-08 18:13 [PATCH 0/5] ftrace: to kill a daemon Steven Rostedt
2008-08-08 18:21 ` Mathieu Desnoyers
2008-08-08 18:41 ` Steven Rostedt
2008-08-08 19:05 ` Mathieu Desnoyers
2008-08-08 23:38 ` Steven Rostedt
2008-08-09 0:23 ` Andi Kleen
2008-08-09 0:36 ` Steven Rostedt
2008-08-09 0:47 ` Jeremy Fitzhardinge
2008-08-09 0:51 ` Linus Torvalds
2008-08-09 1:25 ` Steven Rostedt
2008-08-13 17:52 ` Efficient x86 and x86_64 NOP microbenchmarks Mathieu Desnoyers
2008-08-13 18:27 ` Linus Torvalds
2008-08-13 18:41 ` Andi Kleen
2008-08-13 18:45 ` Avi Kivity
2008-08-13 18:51 ` Andi Kleen
2008-08-13 18:56 ` Avi Kivity
2008-08-13 19:30 ` Mathieu Desnoyers
2008-08-13 19:37 ` Andi Kleen
2008-08-13 20:01 ` Mathieu Desnoyers
2008-08-15 21:34 ` Steven Rostedt
2008-08-15 21:51 ` Andi Kleen
2008-08-13 19:16 ` Mathieu Desnoyers