* Re: Efficient x86 and x86_64 NOP microbenchmarks [not found] <20080813191926.GB15547@Krystal> @ 2008-08-13 20:00 ` Steven Rostedt 2008-08-13 20:06 ` Jeremy Fitzhardinge 2008-08-13 20:15 ` Andi Kleen 0 siblings, 2 replies; 18+ messages in thread From: Steven Rostedt @ 2008-08-13 20:00 UTC (permalink / raw) To: Andi Kleen, Thomas Gleixner Cc: Mathieu Desnoyers, Linus Torvalds, Steven Rostedt, Jeremy Fitzhardinge, LKML, Ingo Molnar, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams [ Thanks to Mathieu Desnoyers who forwarded this to me. Currently my ISP for goodmis.org is having issues: https://help.domaindirect.com/index.php?_m=news&_a=viewnews&newsid=104 ] > ----- Forwarded message from Andi Kleen <andi@firstfloor.org> ----- > > >> So microbenchmarking this way will probably make some things look >> unrealistically good. >> > > Must be careful not to miss the big picture here. > > We have two assumptions here in this thread: > > - Normal alternative() nops are relatively infrequent, typically > in points with enough pipeline bubbles anyways, and it likely doesn't > matter how they are encoded. And also they don't have an issue > with multi-part instructions anyways because they're not patched > at runtime, so always the best known nop can be used. > > - The one case where nops are very frequent and matter and multi-part > is a problem is with ftrace nopping out the call to mcount at runtime, > because that happens on every function entry. > Even there the overhead is not that big, but at least measurable > in kernel builds. > The problem is not ftrace nopping out the call at runtime. The problem is ftrace changing the nops back to calls to mcount. The nop part is simple, straightforward, and not the issue we are discussing here. The issue is which kind of nop to use. The bug with the multi-part nop happens when we _enable_ tracing. 
That is, when someone runs the tracer. The issue with the multi-part nop is that a task could have been preempted after it executed the first nop and before the second part. Then we enable tracing, and when the task is scheduled back in, it will now execute half the call to the mcount function. I want to make this point very clear. If you never run tracing, this bug will not happen. And the bug only happens on enabling the tracer, not on the disabling part. Not to mention that the bug itself will only happen about one time in a billion. > Now the numbers have shown that just by not using frame pointer ( > -pg right now implies frame pointer) you can get more benefit > than what you lose from using non-optimal nops. > No, I can easily make a patch that does not use frame pointers but still uses -pg. We just cannot print the parent function in the trace. This can easily be added to a config, as well as easily implemented. > So for me the best strategy would be to get rid of the frame pointer > and ignore the nops. This unfortunately would require going away > from -pg and instead post-process gcc output to insert "call mcount" > manually. But the nice advantage of that is that you could actually > set up a custom table of callers built in an ELF section and with > that you don't actually need the runtime patching (which is only > done currently because there's no global table of mcount calls), > but could do everything in stop_machine(). Without > runtime patching you also don't need single-part nops. > > I'm totally confused used here. How do you enable function tracing? How do we call into the code that records that a function was hit? > I think that would be the best option. I especially like it because > it would prevent forcing frame pointer, which seems to be costlier > than any kind of nops. As I stated, the frame pointer part is only there to record the parent function in tracing. 
ie: ls-4866 [00] 177596.041275: _spin_unlock <-journal_stop Here we see that the function _spin_unlock was called by the function journal_stop. We can easily turn off parent tracing now, with: # echo noprint-parent > /debug/tracing/iter_ctrl which gives us just: ls-4866 [00] 177596.041275: _spin_unlock If we disable frame pointers, the noprint-parent option would be forced. Not that devastating, but it gives the user the option to still have function tracing without requiring frame pointers. I would still require that the irqsoff tracer add frame pointers, just because knowing that the long latency of interrupts disabled happened at local_irq_save doesn't cut it ;-) Anyway, who would want to run with frame pointers disabled? If you ever get a crash, the stack trace is pretty much useless. -- Steve ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 20:00 ` Efficient x86 and x86_64 NOP microbenchmarks Steven Rostedt @ 2008-08-13 20:06 ` Jeremy Fitzhardinge 2008-08-13 20:34 ` Steven Rostedt 2008-08-13 20:15 ` Andi Kleen 1 sibling, 1 reply; 18+ messages in thread From: Jeremy Fitzhardinge @ 2008-08-13 20:06 UTC (permalink / raw) To: Steven Rostedt Cc: Andi Kleen, Thomas Gleixner, Mathieu Desnoyers, Linus Torvalds, Steven Rostedt, LKML, Ingo Molnar, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams Steven Rostedt wrote: > No, I can easily make a patch that does not use frame pointers but > still uses -pg. We just can not print the parent function in the > trace. This can easily be added to a config, as well as easily > implemented. Why? You can always get the calling function, because its return address is on the stack (assuming mcount is called before the function puts its own frame on the stack). But without a frame pointer, you can't necessarily get the caller's caller. But I think Andi's point is that gcc forces frame pointers on when you enable mcount, so there's no choice in the matter. J ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 20:06 ` Jeremy Fitzhardinge @ 2008-08-13 20:34 ` Steven Rostedt 0 siblings, 0 replies; 18+ messages in thread From: Steven Rostedt @ 2008-08-13 20:34 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Jeremy Fitzhardinge, Andi Kleen, Thomas Gleixner, Linus Torvalds, Steven Rostedt, LKML, Ingo Molnar, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams Just a curious run of Mathieu's microbenchmark:

NR_TESTS 10000000
test empty cycles : 182500444
test 2-bytes jump cycles : 195969127
test 5-bytes jump cycles : 197000202
test 3/2 nops cycles : 201333408
test 5-bytes nop with long prefix cycles : 205000067
test 5-bytes P6 nop cycles : 205000227
test Generic 1/4 5-bytes nops cycles : 200000077
test K7 1/4 5-bytes nops cycles : 197549045

And this was on a Pentium III 847.461 MHz box (my old Toshiba laptop). The jumps performed best here, but that could just be cache issues. But it is interesting to see that, of the nops, the K7 1/4 fared the best. -- Steve ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 20:00 ` Efficient x86 and x86_64 NOP microbenchmarks Steven Rostedt 2008-08-13 20:06 ` Jeremy Fitzhardinge @ 2008-08-13 20:15 ` Andi Kleen 2008-08-13 20:21 ` Linus Torvalds 2008-08-13 20:21 ` Steven Rostedt 1 sibling, 2 replies; 18+ messages in thread From: Andi Kleen @ 2008-08-13 20:15 UTC (permalink / raw) To: Steven Rostedt Cc: Andi Kleen, Thomas Gleixner, Mathieu Desnoyers, Linus Torvalds, Steven Rostedt, Jeremy Fitzhardinge, LKML, Ingo Molnar, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams On Wed, Aug 13, 2008 at 04:00:37PM -0400, Steven Rostedt wrote: > >Now the numbers have shown that just by not using frame pointer ( > >-pg right now implies frame pointer) you can get more benefit > >than what you lose from using non optimal nops. > > > > No, I can easily make a patch that does not use frame pointers but still Not without patching gcc. Try it. The patch is not very difficult and i did it here, but it needs a patch. > If we disable frame pointers, the noprint-parent option would be forced. Actually you can get the parent without frame pointer if you just force gcc to emit mcount before touching the stack frame (and manual insertion pass would do that). Then parent is at 4(%esp)/8(%rsp) Again teaching gcc that is not very difficult, but it needs a patch. > I would still require that the irqsoff tracer add frame pointers, just > because knowing that the long latency of interrupts disabled happened at > local_irq_save doesn't cut it ;-) Nope. > > Anyway, who would want to run with frame pointers disabled? If you ever > get a bug crash, the stack trace is pretty much useless. First that's not true (remember most production kernels run without frame pointers, also e.g. 
crash or systemtap know how to do proper unwinding without slow frame pointers), and if you want it at runtime too you can always add the dwarf2 unwinder (like the openSUSE kernel does) and get better backtraces than you could ever get with frame pointers (that is because e.g. most assembler code doesn't even bother to set up frame pointers, but it is all dwarf2 annotated). Also I must say the whole ftrace nopping exercise is pretty pointless without avoiding frame pointers, because it saves less than what you lose unconditionally from the "select FRAME_POINTER". -Andi ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 20:15 ` Andi Kleen @ 2008-08-13 20:21 ` Linus Torvalds 2008-08-13 20:21 ` Steven Rostedt 1 sibling, 0 replies; 18+ messages in thread From: Linus Torvalds @ 2008-08-13 20:21 UTC (permalink / raw) To: Andi Kleen Cc: Steven Rostedt, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, Jeremy Fitzhardinge, LKML, Ingo Molnar, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams On Wed, 13 Aug 2008, Andi Kleen wrote: > > Also I must say the whole ftrace noping exercise is pretty pointless without > avoiding frame pointers because it does save less than what you lose > unconditionally from the "select FRAME_POINTER" Andi, you seem to have missed the whole point. This is a _correctness_ issue as long as the nop is not a single instruction. And the workaround for that is uglier than just making a single-instruction nop. So the question now is to find a good nop that _is_ a single atomic instruction. Your blathering about frame pointers is missing the whole point! Linus ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 20:15 ` Andi Kleen 2008-08-13 20:21 ` Linus Torvalds @ 2008-08-13 20:21 ` Steven Rostedt 1 sibling, 0 replies; 18+ messages in thread From: Steven Rostedt @ 2008-08-13 20:21 UTC (permalink / raw) To: Andi Kleen Cc: Thomas Gleixner, Mathieu Desnoyers, Linus Torvalds, Steven Rostedt, Jeremy Fitzhardinge, LKML, Ingo Molnar, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams Andi Kleen wrote: > On Wed, Aug 13, 2008 at 04:00:37PM -0400, Steven Rostedt wrote: > >>> Now the numbers have shown that just by not using frame pointer ( >>> -pg right now implies frame pointer) you can get more benefit >>> than what you lose from using non optimal nops. >>> >>> >> No, I can easily make a patch that does not use frame pointers but still >> > > Not without patching gcc. Try it. The patch is not very difficult and I did > it here, but it needs a patch. > OK, I admit you are right ;-) I got the error message: gcc: -pg and -fomit-frame-pointer are incompatible -- Steve ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/5] ftrace: to kill a daemon @ 2008-08-08 18:13 Steven Rostedt 2008-08-08 18:21 ` Mathieu Desnoyers 0 siblings, 1 reply; 18+ messages in thread From: Steven Rostedt @ 2008-08-08 18:13 UTC (permalink / raw) To: Mathieu Desnoyers Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams On Fri, 8 Aug 2008, Mathieu Desnoyers wrote: > * Steven Rostedt (rostedt@goodmis.org) wrote: > > > > On Fri, 8 Aug 2008, Mathieu Desnoyers wrote: > > > * Steven Rostedt (rostedt@goodmis.org) wrote: > > > > > > > > I originally used jumps instead of nops, but unfortunately, they actually > > > > hurt performance more than adding nops. Ingo told me it was probably due > > > > to using up the jump predictions of the CPU. > > > > > > > > > > Hrm, are you sure you use a single 5-bytes nop instruction then, or do > > > you use a mix of various nop sizes (add_nops) on some architectures ? > > > > I use (for x86) what is in include/asm-x86/nops.h depending on what the > > cpuid gives us. > > > > That's bad : > > #define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4 > > #define K8_NOP5 K8_NOP3 K8_NOP2 > > #define K7_NOP5 K7_NOP4 ASM_NOP1 > > So, when you try, later, to replace these instructions with a single > 5-bytes instruction, a preempted thread could iret in the middle of your > 5-bytes insn and cause an illegal instruction ? That's why I use kstop_machine. > > > > > > > > You can consume the branch prediction buffers for conditional branches, > > > but I doubt static jumps have this impact ? I don't see what "jump > > > predictions" you are referring to here exactly. > > > > I don't know the details, but we definitely saw a drop in preformance > > between using nops and static jumps. 
> > > > Generated by replacing all the calls by 5-bytes jumps e9 00 00 00 00 > instead of the 5-bytes add_nops ? On which architectures ? > I ran this on my Dell (Intel Xeon), which IIRC did show the performance degradation. I unfortunately don't have the time to redo those tests, but you are welcome to. Just look at arch/x86/kernel/ftrace.c and replace the nop with the jump. In fact, the comments in that file still say it is a jmp. Remember, my first go was to use the jmp. -- Steve ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/5] ftrace: to kill a daemon 2008-08-08 18:13 [PATCH 0/5] ftrace: to kill a daemon Steven Rostedt @ 2008-08-08 18:21 ` Mathieu Desnoyers 2008-08-08 18:41 ` Steven Rostedt 0 siblings, 1 reply; 18+ messages in thread From: Mathieu Desnoyers @ 2008-08-08 18:21 UTC (permalink / raw) To: Steven Rostedt Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams * Steven Rostedt (rostedt@goodmis.org) wrote: > > On Fri, 8 Aug 2008, Mathieu Desnoyers wrote: > > * Steven Rostedt (rostedt@goodmis.org) wrote: > > > > > > On Fri, 8 Aug 2008, Mathieu Desnoyers wrote: > > > > * Steven Rostedt (rostedt@goodmis.org) wrote: > > > > > > > > > > I originally used jumps instead of nops, but unfortunately, they actually > > > > > hurt performance more than adding nops. Ingo told me it was probably due > > > > > to using up the jump predictions of the CPU. > > > > > > > > > > > > > Hrm, are you sure you use a single 5-bytes nop instruction then, or do > > > > you use a mix of various nop sizes (add_nops) on some architectures ? > > > > > > I use (for x86) what is in include/asm-x86/nops.h depending on what the > > > cpuid gives us. > > > > > > > That's bad : > > > > #define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4 > > > > #define K8_NOP5 K8_NOP3 K8_NOP2 > > > > #define K7_NOP5 K7_NOP4 ASM_NOP1 > > > > So, when you try, later, to replace these instructions with a single > > 5-bytes instruction, a preempted thread could iret in the middle of your > > 5-bytes insn and cause an illegal instruction ? > > That's why I use kstop_machine. > kstop_machine does not guarantee that you won't have _any_ thread preempted with IP pointing exactly in the middle of your instructions _before_ the modification scheduled back in _after_ the modification and thus causing an illegal instruction. 
Still buggy. :/ > > > > > > > > > > > > You can consume the branch prediction buffers for conditional branches, > > > > but I doubt static jumps have this impact ? I don't see what "jump > > > > predictions" you are referring to here exactly. > > > > > > I don't know the details, but we definitely saw a drop in performance > > > between using nops and static jumps. > > > > > > > Generated by replacing all the calls by 5-bytes jumps e9 00 00 00 00 > > instead of the 5-bytes add_nops ? On which architectures ? > > > > I ran this on my Dell (Intel Xeon), which IIRC did show the performance > degradation. I unfortunately don't have the time to redo those tests, but > you are welcome to. > > Just look at arch/x86/kernel/ftrace.c and replace the nop with the jump. > In fact, the comments in that file still say it is a jmp. Remember, my > first go was to use the jmp. >

I'll try to find time to compare:
- multi-instruction 5-bytes nops (although this approach is just buggy)
- 5-bytes jump to the next address
- 2-bytes jump to offset +3.

Mathieu > -- Steve > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/5] ftrace: to kill a daemon 2008-08-08 18:21 ` Mathieu Desnoyers @ 2008-08-08 18:41 ` Steven Rostedt 2008-08-08 19:05 ` Mathieu Desnoyers 0 siblings, 1 reply; 18+ messages in thread From: Steven Rostedt @ 2008-08-08 18:41 UTC (permalink / raw) To: Mathieu Desnoyers Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams On Fri, 8 Aug 2008, Mathieu Desnoyers wrote: > * Steven Rostedt (rostedt@goodmis.org) wrote: > > > > > > > > > > That's bad : > > > > > > #define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4 > > > > > > #define K8_NOP5 K8_NOP3 K8_NOP2 > > > > > > #define K7_NOP5 K7_NOP4 ASM_NOP1 > > > > > > So, when you try, later, to replace these instructions with a single > > > 5-bytes instruction, a preempted thread could iret in the middle of your > > > 5-bytes insn and cause an illegal instruction ? > > > > That's why I use kstop_machine. > > > > kstop_machine does not guarantee that you won't have _any_ thread > preempted with IP pointing exactly in the middle of your instructions > _before_ the modification scheduled back in _after_ the modification and > thus causing an illegal instruction. > > Still buggy. :/ Hmm, good point. Unless... Can a processor be preempted in the middle of nops? What do nops do for a processor? Can it skip them nicely in one shot? This means I'll have to do the benchmarks again, and see whether the performance difference between a jmp and a nop is significant. 
I can add a test in x86 ftrace.c to check to see which nop was used, and use the jmp if the arch does not have a 5 byte nop. I'm assuming that jmp is more expensive than the nops because otherwise a jmp 0 would have been used as a 5 byte nop. -- Steve ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/5] ftrace: to kill a daemon 2008-08-08 18:41 ` Steven Rostedt @ 2008-08-08 19:05 ` Mathieu Desnoyers 2008-08-08 23:38 ` Steven Rostedt 0 siblings, 1 reply; 18+ messages in thread From: Mathieu Desnoyers @ 2008-08-08 19:05 UTC (permalink / raw) To: Steven Rostedt Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams * Steven Rostedt (rostedt@goodmis.org) wrote: > > On Fri, 8 Aug 2008, Mathieu Desnoyers wrote: > > > * Steven Rostedt (rostedt@goodmis.org) wrote: > > > > > > > > > > > > > That's bad : > > > > > > > > #define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4 > > > > > > > > #define K8_NOP5 K8_NOP3 K8_NOP2 > > > > > > > > #define K7_NOP5 K7_NOP4 ASM_NOP1 > > > > > > > > So, when you try, later, to replace these instructions with a single > > > > 5-bytes instruction, a preempted thread could iret in the middle of your > > > > 5-bytes insn and cause an illegal instruction ? > > > > > > That's why I use kstop_machine. > > > > > > > kstop_machine does not guarantee that you won't have _any_ thread > > preempted with IP pointing exactly in the middle of your instructions > > _before_ the modification scheduled back in _after_ the modification and > > thus causing an illegal instruction. > > > > Still buggy. :/ > > Hmm, good point. Unless... > > Can a processor be preempted in a middle of nops? What do nops do for a > processor? Can it skip them nicely in one shot? > Given that those are multiple instructions, I think a processor has all the rights to preempt in the middle of them. And even if some specific architecture, for any obscure reason, happens to merge them, I don't think this will be portable across Intel, AMD, ... > This means I'll have to do the benchmarks again, and see what the > performance difference of a jmp and a nop is significant. 
I'm thinking > that if the processor can safely skip nops without any type of processing, > this may be the reason that nops are better than a jmp. A jmp causes the > processor to do a little more work. > > I might even run a test to see if I can force a processor that uses the > three-two nops to preempt between them. >

Yup, although one architecture not triggering this doesn't say much about the various x86 flavors out there. In any case:
- if you trigger the problem, we have to fix it.
- if you do not succeed in triggering the problem, we will have to test it on a wider architecture range and maybe end up fixing it anyway to play safe with the specs.

So, in every case, we end up fixing the issue.

> I can add a test in x86 ftrace.c to check to see which nop was used, and > use the jmp if the arch does not have a 5 byte nop. >

I would propose the following alternative. Create new macros in include/asm-x86/nops.h :

/* short jump, offset 3 bytes : skips total of 5 bytes */
#define GENERIC_ATOMIC_NOP5 ".byte 0xeb,0x03,0x00,0x00,0x00\n"

#if defined(CONFIG_MK7)
#define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5
#elif defined(CONFIG_X86_P6_NOP)
#define ATOMIC_NOP5 P6_NOP5
#elif defined(CONFIG_X86_64)
#define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5
#else
#define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5
#endif

And then optimize if necessary. You will probably find plenty of knowledgeable people who will know of a 5-bytes nop instruction more efficient than this "generic" short jump of offset 0x3. Then you can use the (buggy) 3nops/2nops as a performance baseline and see the performance hit on each architecture. First get it right, then make it fast.... Mathieu

> I'm assuming that jmp is more expensive than the nops because otherwise > a jmp 0 would have been used as a 5 byte nop. > > -- Steve

-- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/5] ftrace: to kill a daemon 2008-08-08 19:05 ` Mathieu Desnoyers @ 2008-08-08 23:38 ` Steven Rostedt 2008-08-09 0:23 ` Andi Kleen 0 siblings, 1 reply; 18+ messages in thread From: Steven Rostedt @ 2008-08-08 23:38 UTC (permalink / raw) To: Mathieu Desnoyers Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams [ patch included ] On Fri, 8 Aug 2008, Mathieu Desnoyers wrote: > * Steven Rostedt (rostedt@goodmis.org) wrote: > > > > That's why I use kstop_machine. > > > > > > > > > > kstop_machine does not guarantee that you won't have _any_ thread > > > preempted with IP pointing exactly in the middle of your instructions > > > _before_ the modification scheduled back in _after_ the modification and > > > thus causing an illegal instruction. > > > > > > Still buggy. :/ > > > > Hmm, good point. Unless... > > > > Can a processor be preempted in a middle of nops? What do nops do for a > > processor? Can it skip them nicely in one shot? > > > > Given that those are multiple instructions, I think a processor has all > the rights to preempt in the middle of them. And even if some specific > architecture, for any obscure reason, happens to merge them, I don't > think this will be portable across Intel, AMD, ... > > > This means I'll have to do the benchmarks again, and see what the > > performance difference of a jmp and a nop is significant. I'm thinking > > that if the processor can safely skip nops without any type of processing, > > this may be the reason that nops are better than a jmp. A jmp causes the > > processor to do a little more work. > > > > I might even run a test to see if I can force a processor that uses the > > three-two nops to preempt between them. 
> > > > Yup, although one architecture not triggering this doesn't say much > about the various x86 flavors out there. In any case > - if you trigger the problem, we have to fix it. > - if you do not succeed to trigger the problem, we will have to test it > on a wider architecture range and maybe end up fixit it anyway to play > safe with the specs. > > So, in every case, we end up fixing the issue. > > > > I can add a test in x86 ftrace.c to check to see which nop was used, and > > use the jmp if the arch does not have a 5 byte nop. > > > > I would propose the following alternative : > > Create new macros in include/asm-x86/nops.h : > > /* short jump, offset 3 bytes : skips total of 5 bytes */ > #define GENERIC_ATOMIC_NOP5 ".byte 0xeb,0x03,0x00,0x00,0x00\n" > > #if defined(CONFIG_MK7) > #define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5 > #elif defined(CONFIG_X86_P6_NOP) > #define ATOMIC_NOP5 P6_NOP5 > #elif defined(CONFIG_X86_64) > #define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5 > #else > #define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5 > #endif > > And then optimize if necessary. You will probably find plenty of > knowledgeable people who will know better 5-bytes nop instruction more > efficient than this "generic" short jump offset 0x3. > > Then you can use the (buggy) 3nops/2nops as a performance baseline and > see the performance hit on each architecture. > > First get it right, then make it fast.... > I'm stubborn, I want to get it right _and_ keep it fast. I still want the NOPS. Using jmps will hurt performance and that would keep this turned off on all distros. But lets think outside the box here (and we will ignore Alan's cat). Right now the issue is that we might preempt after the first nop, and when we enable the code, that task will crash when it tries to read the second nop. Since we are doing the modifications from kstop_machine, all the tasks are stopped. 
We can simply look to see if the tasks have been preempted in kernel space and if so, is their instruction pointer pointing to the second nop. If it is, move the ip forward. Here's a patch that does just that for both i386 and x86_64. I added a field in the thread_info struct called "ip". This is a pointer to the location of the task ip in the stack if it was preempted in kernel space. Null otherwise: jz restore_all + lea PT_EIP(%esp), %eax + movl %eax, TI_ip(%ebp) call preempt_schedule_irq + GET_THREAD_INFO(%ebp) + movl $0, TI_ip(%ebp) jmp need_resched Then, just before we enable tracing (we only need to do this when we enable tracing, since that is when we have a two instruction nop), we look at all the tasks. If the task->thread_info->ip is set, this means that it was preempted just before going back to the kernel. We look at the **ip and see if it compares with the second nop. If it does, we increment the ip by the size of that nop: if (memcmp(*ip, second_nop, x86_nop5_part2) == 0) /* Match, move the ip forward */ *ip += x86_nop5_part2; We do this just once before enabling all the locations, and we only do it if we have a two part nop. Interesting enough, I wrote a module that did the following: void (*silly_func)(void); void do_something_silly(void) { } static int my_thread(void *arg) { int i; while (!kthread_should_stop()) { for (i=0; i < 100; i++) silly_func(); } return 0; } static struct task_struct *p; static int __init mcount_stress_init(void) { silly_func = do_something_silly; p = kthread_run(my_thread, NULL, "sillytask"); return 0; } static void mcount_stress_exit(void) { kthread_stop(p); } The do_something_silly had an mcount pointer to it. I put in printks in the ftrace enabled code to see where this was preempted. It was preempted several times before and after the nops, but never at either nop. Maybe I didn't run it enough (almost 2 hours), but perhaps it is very unlikely to be preempted at a nop if there's something coming up next. 
Yes a string of nops may be preempted, but perhaps only two nops followed by an actual command might be skipped quickly. I'll write some hacks to look at where it is preempted in the scheduler itself, and see if I see it preempting at the second nop ever. But here's a patch that will work around the problem that we might be preempted within the two nops. Note, this is only in the slow path of enabling the function tracer. It is only done at enabling time inside the kstop_machine, which has a large overhead anyways. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- arch/x86/kernel/alternative.c | 29 +++++++++++++++++++------- arch/x86/kernel/asm-offsets_32.c | 1 arch/x86/kernel/asm-offsets_64.c | 1 arch/x86/kernel/entry_32.S | 4 +++ arch/x86/kernel/entry_64.S | 4 +++ arch/x86/kernel/ftrace.c | 43 +++++++++++++++++++++++++++++++++++++++ include/asm-x86/ftrace.h | 5 ++++ include/asm-x86/thread_info.h | 4 +++ kernel/trace/ftrace.c | 12 ++++++++++ 9 files changed, 96 insertions(+), 7 deletions(-) Index: linux-tip.git/arch/x86/kernel/alternative.c =================================================================== --- linux-tip.git.orig/arch/x86/kernel/alternative.c 2008-06-05 11:52:24.000000000 -0400 +++ linux-tip.git/arch/x86/kernel/alternative.c 2008-08-08 16:20:23.000000000 -0400 @@ -140,13 +140,26 @@ static const unsigned char *const p6_nop }; #endif +/* + * Some versions of x86 CPUs have a two part NOP5. This + * can break ftrace if a process is preempted between + * the two. ftrace needs to know what the second nop + * is to handle this case. + */ +int x86_nop5_part2; + #ifdef CONFIG_X86_64 extern char __vsyscall_0; const unsigned char *const *find_nop_table(void) { - return boot_cpu_data.x86_vendor != X86_VENDOR_INTEL || - boot_cpu_data.x86 < 6 ? 
k8_nops : p6_nops; + if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL || + boot_cpu_data.x86 < 6) { + x86_nop5_part2 = 2; /* K8_NOP2 */ + return k8_nops; + } else + /* keep k86_nop5_part2 NULL */ + return p6_nops; } #else /* CONFIG_X86_64 */ @@ -154,12 +167,13 @@ const unsigned char *const *find_nop_tab static const struct nop { int cpuid; const unsigned char *const *noptable; + int nop5_part2; /* size of part2 nop */ } noptypes[] = { - { X86_FEATURE_K8, k8_nops }, - { X86_FEATURE_K7, k7_nops }, - { X86_FEATURE_P4, p6_nops }, - { X86_FEATURE_P3, p6_nops }, - { -1, NULL } + { X86_FEATURE_K8, k8_nops, 2}, + { X86_FEATURE_K7, k7_nops, 1 }, + { X86_FEATURE_P4, p6_nops, 0 }, + { X86_FEATURE_P3, p6_nops, 0 }, + { -1, NULL, 0 } }; const unsigned char *const *find_nop_table(void) @@ -170,6 +184,7 @@ const unsigned char *const *find_nop_tab for (i = 0; noptypes[i].cpuid >= 0; i++) { if (boot_cpu_has(noptypes[i].cpuid)) { noptable = noptypes[i].noptable; + x86_nop5_part2 = noptypes[i].nop5_part2; break; } } Index: linux-tip.git/arch/x86/kernel/asm-offsets_32.c =================================================================== --- linux-tip.git.orig/arch/x86/kernel/asm-offsets_32.c 2008-07-27 10:43:26.000000000 -0400 +++ linux-tip.git/arch/x86/kernel/asm-offsets_32.c 2008-08-08 15:46:55.000000000 -0400 @@ -59,6 +59,7 @@ void foo(void) OFFSET(TI_restart_block, thread_info, restart_block); OFFSET(TI_sysenter_return, thread_info, sysenter_return); OFFSET(TI_cpu, thread_info, cpu); + OFFSET(TI_ip, thread_info, ip); BLANK(); OFFSET(GDS_size, desc_ptr, size); Index: linux-tip.git/arch/x86/kernel/asm-offsets_64.c =================================================================== --- linux-tip.git.orig/arch/x86/kernel/asm-offsets_64.c 2008-07-27 10:43:26.000000000 -0400 +++ linux-tip.git/arch/x86/kernel/asm-offsets_64.c 2008-08-08 15:52:34.000000000 -0400 @@ -41,6 +41,7 @@ int main(void) ENTRY(addr_limit); ENTRY(preempt_count); ENTRY(status); + ENTRY(ip); #ifdef 
CONFIG_IA32_EMULATION ENTRY(sysenter_return); #endif Index: linux-tip.git/arch/x86/kernel/entry_32.S =================================================================== --- linux-tip.git.orig/arch/x86/kernel/entry_32.S 2008-07-27 10:43:26.000000000 -0400 +++ linux-tip.git/arch/x86/kernel/entry_32.S 2008-08-08 17:13:27.000000000 -0400 @@ -304,7 +304,11 @@ need_resched: jz restore_all testl $X86_EFLAGS_IF,PT_EFLAGS(%esp) # interrupts off (exception path) ? jz restore_all + lea PT_EIP(%esp), %eax + movl %eax, TI_ip(%ebp) call preempt_schedule_irq + GET_THREAD_INFO(%ebp) + movl $0, TI_ip(%ebp) jmp need_resched END(resume_kernel) #endif Index: linux-tip.git/arch/x86/kernel/entry_64.S =================================================================== --- linux-tip.git.orig/arch/x86/kernel/entry_64.S 2008-07-27 10:43:26.000000000 -0400 +++ linux-tip.git/arch/x86/kernel/entry_64.S 2008-08-08 17:12:47.000000000 -0400 @@ -837,7 +837,11 @@ ENTRY(retint_kernel) jnc retint_restore_args bt $9,EFLAGS-ARGOFFSET(%rsp) /* interrupts off? */ jnc retint_restore_args + leaq RIP-ARGOFFSET(%rsp), %rax + movq %rax, TI_ip(%rcx) call preempt_schedule_irq + GET_THREAD_INFO(%rcx) + movq $0, TI_ip(%rcx) jmp exit_intr #endif Index: linux-tip.git/arch/x86/kernel/ftrace.c =================================================================== --- linux-tip.git.orig/arch/x86/kernel/ftrace.c 2008-06-26 14:58:54.000000000 -0400 +++ linux-tip.git/arch/x86/kernel/ftrace.c 2008-08-08 17:48:04.000000000 -0400 @@ -127,6 +127,46 @@ notrace int ftrace_mcount_set(unsigned l return 0; } +static const unsigned char *second_nop; + +void arch_ftrace_pre_enable(void) +{ + struct task_struct *g, *p; + unsigned long **ip; + + int i; + + if (!second_nop) + return; + + /* + * x86 has a two part nop to handle 5 byte instructions. + * If a task was preempted after the first nop, and has + * not ran the second nop, if we modify the code, we can + * crash the system. 
Thus, we will look at all the tasks + * and if any of them was preempted and will run the + * second nop next, we simply move their ip pointer past + * the second nop. + */ + + /* + * Don't need to grab the task list lock, we are running + * in kstop_machine + */ + do_each_thread(g, p) { + /* + * In entry.S we save the ip when a task is preempted + * and reset it when it is back running. + */ + ip = task_thread_info(p)->ip; + if (!ip) + continue; + if (memcmp(*ip, second_nop, x86_nop5_part2) == 0) + /* Match, move the ip forward */ + *ip += x86_nop5_part2; + } while_each_thread(g, p); +} + int __init ftrace_dyn_arch_init(void *data) { const unsigned char *const *noptable = find_nop_table(); @@ -137,5 +177,8 @@ int __init ftrace_dyn_arch_init(void *da ftrace_nop = (unsigned long *)noptable[MCOUNT_INSN_SIZE]; + if (x86_nop5_part2) + second_nop = noptable[x86_nop5_part2]; + return 0; } Index: linux-tip.git/include/asm-x86/ftrace.h =================================================================== --- linux-tip.git.orig/include/asm-x86/ftrace.h 2008-08-08 13:00:51.000000000 -0400 +++ linux-tip.git/include/asm-x86/ftrace.h 2008-08-08 16:41:09.000000000 -0400 @@ -17,6 +17,11 @@ static inline unsigned long ftrace_call_ */ return addr - 1; } + +extern int x86_nop5_part2; +extern void arch_ftrace_pre_enable(void); +#define ftrace_pre_enable arch_ftrace_pre_enable + #endif #endif /* CONFIG_FTRACE */ Index: linux-tip.git/include/asm-x86/thread_info.h =================================================================== --- linux-tip.git.orig/include/asm-x86/thread_info.h 2008-08-07 11:14:43.000000000 -0400 +++ linux-tip.git/include/asm-x86/thread_info.h 2008-08-08 17:06:15.000000000 -0400 @@ -29,6 +29,9 @@ struct thread_info { __u32 cpu; /* current CPU */ int preempt_count; /* 0 => preemptable, <0 => BUG */ + unsigned long **ip; /* pointer to ip on stackwhen + preempted + */ mm_segment_t addr_limit; struct restart_block restart_block; void __user *sysenter_return; @@ -47,6 
+50,7 @@ struct thread_info { .flags = 0, \ .cpu = 0, \ .preempt_count = 1, \ + .ip = NULL, \ .addr_limit = KERNEL_DS, \ .restart_block = { \ .fn = do_no_restart_syscall, \ Index: linux-tip.git/kernel/trace/ftrace.c =================================================================== --- linux-tip.git.orig/kernel/trace/ftrace.c 2008-08-08 13:00:52.000000000 -0400 +++ linux-tip.git/kernel/trace/ftrace.c 2008-08-08 16:18:14.000000000 -0400 @@ -32,6 +32,10 @@ #include "trace.h" +#ifndef ftrace_pre_enable +# define ftrace_pre_enable() do { } while (0) +#endif + /* ftrace_enabled is a method to turn ftrace on or off */ int ftrace_enabled __read_mostly; static int last_ftrace_enabled; @@ -500,6 +504,14 @@ static void ftrace_replace_code(int enab else new = ftrace_nop_replace(); + /* + * Some archs *cough*x86*cough* have more than one nop to cover + * the call to mcount. In these cases, special care must be taken + * before we start converting nops into calls. + */ + if (enable) + ftrace_pre_enable(); + for (pg = ftrace_pages_start; pg; pg = pg->next) { for (i = 0; i < pg->index; i++) { rec = &pg->records[i]; ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/5] ftrace: to kill a daemon 2008-08-08 23:38 ` Steven Rostedt @ 2008-08-09 0:23 ` Andi Kleen 2008-08-09 0:36 ` Steven Rostedt 0 siblings, 1 reply; 18+ messages in thread From: Andi Kleen @ 2008-08-09 0:23 UTC (permalink / raw) To: Steven Rostedt Cc: Mathieu Desnoyers, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams Steven Rostedt <rostedt@goodmis.org> writes: > I'm stubborn, I want to get it right _and_ keep it fast. For me it would seem better to just not use two part 5 byte nops instead of adding such hacks. I doubt there are that many of them anyways. I bet you won't be able to measure any difference between the different nop types in any macro benchmark. -Andi ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/5] ftrace: to kill a daemon 2008-08-09 0:23 ` Andi Kleen @ 2008-08-09 0:36 ` Steven Rostedt 2008-08-09 0:47 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 18+ messages in thread From: Steven Rostedt @ 2008-08-09 0:36 UTC (permalink / raw) To: Andi Kleen Cc: Mathieu Desnoyers, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Jeremy Fitzhardinge, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams On Sat, 9 Aug 2008, Andi Kleen wrote: > Steven Rostedt <rostedt@goodmis.org> writes: > > > I'm stubborn, I want to get it right _and_ keep it fast. > > For me it would seem better to just not use two part 5 byte nops > instead of adding such hacks. I doubt there are that many of them > anyways. I bet you won't be able to measure any difference between the > different nop types in any macro benchmark. I wish we had a true 5 byte nop. The alternative is a jmp 0, which is measurable. This is replacing mcount from a kernel compile with the -pg option. With a basic build (not counting modules), I have over 15,000 locations that are turned into these 5 byte nops. # objdump -dr vmlinux.o | grep mcount |wc 15152 45489 764924 If we use the jmp 0, then yes, we will see the overhead. The double nop that is used for 5 bytes, is significantly better than the jump. -- Steve ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/5] ftrace: to kill a daemon 2008-08-09 0:36 ` Steven Rostedt @ 2008-08-09 0:47 ` Jeremy Fitzhardinge 2008-08-09 0:51 ` Linus Torvalds 0 siblings, 1 reply; 18+ messages in thread From: Jeremy Fitzhardinge @ 2008-08-09 0:47 UTC (permalink / raw) To: Steven Rostedt Cc: Andi Kleen, Mathieu Desnoyers, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Linus Torvalds, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams Steven Rostedt wrote: > I wish we had a true 5 byte nop. 0x66 0x66 0x66 0x66 0x90 J ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/5] ftrace: to kill a daemon 2008-08-09 0:47 ` Jeremy Fitzhardinge @ 2008-08-09 0:51 ` Linus Torvalds 2008-08-09 1:25 ` Steven Rostedt 0 siblings, 1 reply; 18+ messages in thread From: Linus Torvalds @ 2008-08-09 0:51 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Steven Rostedt, Andi Kleen, Mathieu Desnoyers, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams On Fri, 8 Aug 2008, Jeremy Fitzhardinge wrote: > > Steven Rostedt wrote: > > I wish we had a true 5 byte nop. > > 0x66 0x66 0x66 0x66 0x90 I don't think so. Multiple redundant prefixes can be really expensive on some uarchs. A no-op that isn't cheap isn't a no-op at all, it's a slow-op. Linus ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/5] ftrace: to kill a daemon 2008-08-09 0:51 ` Linus Torvalds @ 2008-08-09 1:25 ` Steven Rostedt 2008-08-13 17:52 ` Efficient x86 and x86_64 NOP microbenchmarks Mathieu Desnoyers 0 siblings, 1 reply; 18+ messages in thread From: Steven Rostedt @ 2008-08-09 1:25 UTC (permalink / raw) To: Linus Torvalds Cc: Jeremy Fitzhardinge, Andi Kleen, Mathieu Desnoyers, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams On Fri, 8 Aug 2008, Linus Torvalds wrote: > > > On Fri, 8 Aug 2008, Jeremy Fitzhardinge wrote: > > > > Steven Rostedt wrote: > > > I wish we had a true 5 byte nop. > > > > 0x66 0x66 0x66 0x66 0x90 > > I don't think so. Multiple redundant prefixes can be really expensive on > some uarchs. > > A no-op that isn't cheap isn't a no-op at all, it's a slow-op. A quick meaningless benchmark showed a slight performance hit.
Here's 10 runs of "hackbench 50" using the two part 5 byte nop: run 1 Time: 4.501 run 2 Time: 4.855 run 3 Time: 4.198 run 4 Time: 4.587 run 5 Time: 5.016 run 6 Time: 4.757 run 7 Time: 4.477 run 8 Time: 4.693 run 9 Time: 4.710 run 10 Time: 4.715 avg = 4.6509 And 10 runs using the above 5 byte nop: run 1 Time: 4.832 run 2 Time: 5.319 run 3 Time: 5.213 run 4 Time: 4.830 run 5 Time: 4.363 run 6 Time: 4.391 run 7 Time: 4.772 run 8 Time: 4.992 run 9 Time: 4.727 run 10 Time: 4.825 avg = 4.8264 # cat /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 15 model : 65 model name : Dual-Core AMD Opteron(tm) Processor 2220 stepping : 3 cpu MHz : 2799.992 cache size : 1024 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 2 apicid : 0 initial apicid : 0 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy bogomips : 5599.98 clflush size : 64 power management: ts fid vid ttp tm stc There's 4 of these. Just to make sure, I ran the above nop test again: [ this is reverse from the above runs ] run 1 Time: 4.723 run 2 Time: 5.080 run 3 Time: 4.521 run 4 Time: 4.841 run 5 Time: 4.696 run 6 Time: 4.946 run 7 Time: 4.754 run 8 Time: 4.717 run 9 Time: 4.905 run 10 Time: 4.814 avg = 4.7997 And again the two part nop: run 1 Time: 4.434 run 2 Time: 4.496 run 3 Time: 4.801 run 4 Time: 4.714 run 5 Time: 4.631 run 6 Time: 5.178 run 7 Time: 4.728 run 8 Time: 4.920 run 9 Time: 4.898 run 10 Time: 4.770 avg = 4.757 This time it was close, but still seems to have some difference. heh, perhaps it's just noise. -- Steve ^ permalink raw reply [flat|nested] 18+ messages in thread
* Efficient x86 and x86_64 NOP microbenchmarks 2008-08-09 1:25 ` Steven Rostedt @ 2008-08-13 17:52 ` Mathieu Desnoyers 2008-08-13 18:27 ` Linus Torvalds 0 siblings, 1 reply; 18+ messages in thread From: Mathieu Desnoyers @ 2008-08-13 17:52 UTC (permalink / raw) To: Steven Rostedt Cc: Linus Torvalds, Jeremy Fitzhardinge, Andi Kleen, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams * Steven Rostedt (rostedt@goodmis.org) wrote: > > On Fri, 8 Aug 2008, Linus Torvalds wrote: > > > > > > On Fri, 8 Aug 2008, Jeremy Fitzhardinge wrote: > > > > > > Steven Rostedt wrote: > > > > I wish we had a true 5 byte nop. > > > > > > 0x66 0x66 0x66 0x66 0x90 > > > > I don't think so. Multiple redundant prefixes can be really expensive on > > some uarchs. > > > > A no-op that isn't cheap isn't a no-op at all, it's a slow-op. > > > A quick meaningless benchmark showed a slight perfomance hit. > Hi Steven, I also did some microbenchmarks on my Intel Xeon 64 bits, AMD64 and Intel Pentium 4 boxes to compare a baseline (function doing a bit of memory read and arithmetic operations) to cases where nops are used. Here are the results. The kernel module used for the benchmarks is below, feel free to run it on your own architectures. 
Xeon : NR_TESTS 10000000 test empty cycles : 165472020 test 2-bytes jump cycles : 166666806 test 5-bytes jump cycles : 166978164 test 3/2 nops cycles : 169259406 test 5-bytes nop with long prefix cycles : 160000140 test 5-bytes P6 nop cycles : 163333458 AMD64 : NR_TESTS 10000000 test empty cycles : 145142367 test 2-bytes jump cycles : 150000178 test 5-bytes jump cycles : 150000171 test 3/2 nops cycles : 159999994 test 5-bytes nop with long prefix cycles : 150000156 test 5-bytes P6 nop cycles : 150000148 Intel Pentium 4 : NR_TESTS 10000000 test empty cycles : 290001045 test 2-bytes jump cycles : 310000568 test 5-bytes jump cycles : 310000478 test 3/2 nops cycles : 290000565 test 5-bytes nop with long prefix cycles : 311085510 test 5-bytes P6 nop cycles : 300000517 test Generic 1/4 5-bytes nops cycles : 310000553 test K7 1/4 5-bytes nops cycles : 300000533 These numbers show that both on Xeon and AMD64, the .byte 0x66,0x66,0x66,0x66,0x90 (osp osp osp osp nop, which is not currently used in nops.h) is the fastest nop on both architectures. The currently used 3/2 nops looks like a _very_ bad choice for AMD64 cycle-wise. The currently used 5-bytes P6 nop used on Xeon seems to be a bit slower than the 0x66,0x66,0x66,0x66,0x90 nop too. For the Intel Pentium 4, the best atomic choice seems to be the current one (5-bytes P6 nop : .byte 0x0f,0x1f,0x44,0x00,0), although we can see that the 3/2 nop used for K8 would be a bit faster. It is probably due to the fact that P4 handles long instruction prefixes slowly. Is there any reason why not to use these atomic nops and kill our instruction atomicity problems altogether ? 
(various cpuinfo can be found below) Mathieu /* test-nop-speed.c * */ #include <linux/module.h> #include <linux/proc_fs.h> #include <linux/sched.h> #include <linux/timex.h> #include <linux/marker.h> #include <asm/ptrace.h> #define NR_TESTS 10000000 int var, var2; struct proc_dir_entry *pentry = NULL; void empty(void) { asm volatile (""); var += 50; var /= 10; var *= var2; } void twobytesjump(void) { asm volatile ("jmp 1f\n\t" ".byte 0x00, 0x00, 0x00\n\t" "1:\n\t"); var += 50; var /= 10; var *= var2; } void fivebytesjump(void) { asm volatile (".byte 0xe9, 0x00, 0x00, 0x00, 0x00\n\t"); var += 50; var /= 10; var *= var2; } void threetwonops(void) { asm volatile (".byte 0x66,0x66,0x90,0x66,0x90\n\t"); var += 50; var /= 10; var *= var2; } void fivebytesnop(void) { asm volatile (".byte 0x66,0x66,0x66,0x66,0x90\n\t"); var += 50; var /= 10; var *= var2; } void fivebytespsixnop(void) { asm volatile (".byte 0x0f,0x1f,0x44,0x00,0\n\t"); var += 50; var /= 10; var *= var2; } /* * GENERIC_NOP1 GENERIC_NOP4, * 1: nop * _not_ nops in 64-bit mode. * 4: leal 0x00(,%esi,1),%esi */ void genericfivebytesonefournops(void) { asm volatile (".byte 0x90,0x8d,0x74,0x26,0x00\n\t"); var += 50; var /= 10; var *= var2; } /* * K7_NOP4 ASM_NOP1 * 1: nop * assumed _not_ to be nops in 64-bit mode. 
* leal 0x00(,%eax,1),%eax */ void k7fivebytesonefournops(void) { asm volatile (".byte 0x90,0x8d,0x44,0x20,0x00\n\t"); var += 50; var /= 10; var *= var2; } void perform_test(const char *name, void (*callback)(void)) { unsigned int i; cycles_t cycles1, cycles2; unsigned long flags; local_irq_save(flags); rdtsc_barrier(); cycles1 = get_cycles(); rdtsc_barrier(); for(i=0; i<NR_TESTS; i++) { callback(); } rdtsc_barrier(); cycles2 = get_cycles(); rdtsc_barrier(); local_irq_restore(flags); printk("test %s cycles : %llu\n", name, cycles2-cycles1); } static int my_open(struct inode *inode, struct file *file) { printk("NR_TESTS %d\n", NR_TESTS); perform_test("empty", empty); perform_test("2-bytes jump", twobytesjump); perform_test("5-bytes jump", fivebytesjump); perform_test("3/2 nops", threetwonops); perform_test("5-bytes nop with long prefix", fivebytesnop); perform_test("5-bytes P6 nop", fivebytespsixnop); #ifdef CONFIG_X86_32 perform_test("Generic 1/4 5-bytes nops", genericfivebytesonefournops); perform_test("K7 1/4 5-bytes nops", k7fivebytesonefournops); #endif return -EPERM; } static struct file_operations my_operations = { .open = my_open, }; int init_module(void) { pentry = create_proc_entry("testnops", 0444, NULL); if (pentry) pentry->proc_fops = &my_operations; return 0; } void cleanup_module(void) { remove_proc_entry("testnops", NULL); } MODULE_LICENSE("GPL"); MODULE_AUTHOR("Mathieu Desnoyers"); MODULE_DESCRIPTION("NOP Test"); Xeon cpuinfo : processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz stepping : 6 cpu MHz : 2000.126 cache size : 6144 KB physical id : 0 siblings : 4 core id : 0 cpu cores : 4 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx tm2 ssse3 cx16 
xtpr dca sse4_1 lahf_lm bogomips : 4000.25 clflush size : 64 cache_alignment : 64 address sizes : 38 bits physical, 48 bits virtual power management: AMD64 cpuinfo : processor : 0 vendor_id : AuthenticAMD cpu family : 15 model : 35 model name : AMD Athlon(tm)64 X2 Dual Core Processor 3800+ stepping : 2 cpu MHz : 2009.139 cache size : 512 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 2 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow rep_good pni lahf_lm cmp_legacy bogomips : 4022.42 TLB size : 1024 4K pages clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: ts fid vid ttp Pentium 4 : processor : 0 vendor_id : GenuineIntel cpu family : 15 model : 4 model name : Intel(R) Pentium(R) 4 CPU 3.00GHz stepping : 1 cpu MHz : 3000.138 cache size : 1024 KB physical id : 0 siblings : 1 core id : 0 cpu cores : 1 apicid : 0 initial apicid : 0 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 5 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx constant_tsc up pebs bts pni monitor ds_cpl cid xtpr bogomips : 6005.70 clflush size : 64 power management: > Here's 10 runs of "hackbench 50" using the two part 5 byte nop: > > run 1 > Time: 4.501 > run 2 > Time: 4.855 > run 3 > Time: 4.198 > run 4 > Time: 4.587 > run 5 > Time: 5.016 > run 6 > Time: 4.757 > run 7 > Time: 4.477 > run 8 > Time: 4.693 > run 9 > Time: 4.710 > run 10 > Time: 4.715 > avg = 4.6509 > > > And 10 runs using the above 5 byte nop: > > run 1 > Time: 4.832 > run 2 > Time: 5.319 > run 3 > Time: 5.213 > run 4 > Time: 4.830 > run 5 > Time: 4.363 > run 6 > Time: 4.391 > run 7 > Time: 4.772 > run 8 > Time: 4.992 > run 9 > Time: 4.727 > 
run 10 > Time: 4.825 > avg = 4.8264 > > # cat /proc/cpuinfo > processor : 0 > vendor_id : AuthenticAMD > cpu family : 15 > model : 65 > model name : Dual-Core AMD Opteron(tm) Processor 2220 > stepping : 3 > cpu MHz : 2799.992 > cache size : 1024 KB > physical id : 0 > siblings : 2 > core id : 0 > cpu cores : 2 > apicid : 0 > initial apicid : 0 > fdiv_bug : no > hlt_bug : no > f00f_bug : no > coma_bug : no > fpu : yes > fpu_exception : yes > cpuid level : 1 > wp : yes > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca > cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt > rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic > cr8_legacy > bogomips : 5599.98 > clflush size : 64 > power management: ts fid vid ttp tm stc > > There's 4 of these. > > Just to make sure, I ran the above nop test again: > > [ this is reverse from the above runs ] > > run 1 > Time: 4.723 > run 2 > Time: 5.080 > run 3 > Time: 4.521 > run 4 > Time: 4.841 > run 5 > Time: 4.696 > run 6 > Time: 4.946 > run 7 > Time: 4.754 > run 8 > Time: 4.717 > run 9 > Time: 4.905 > run 10 > Time: 4.814 > avg = 4.7997 > > And again the two part nop: > > run 1 > Time: 4.434 > run 2 > Time: 4.496 > run 3 > Time: 4.801 > run 4 > Time: 4.714 > run 5 > Time: 4.631 > run 6 > Time: 5.178 > run 7 > Time: 4.728 > run 8 > Time: 4.920 > run 9 > Time: 4.898 > run 10 > Time: 4.770 > avg = 4.757 > > > This time it was close, but still seems to have some difference. > > heh, perhaps it's just noise. > > -- Steve > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 17:52 ` Efficient x86 and x86_64 NOP microbenchmarks Mathieu Desnoyers @ 2008-08-13 18:27 ` Linus Torvalds 2008-08-13 18:41 ` Andi Kleen 2008-08-13 19:16 ` Mathieu Desnoyers 0 siblings, 2 replies; 18+ messages in thread From: Linus Torvalds @ 2008-08-13 18:27 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Steven Rostedt, Jeremy Fitzhardinge, Andi Kleen, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams On Wed, 13 Aug 2008, Mathieu Desnoyers wrote: > > I also did some microbenchmarks on my Intel Xeon 64 bits, AMD64 and > Intel Pentium 4 boxes to compare a baseline Note that the biggest problems of a jump-based nop are likely to happen when there are I$ misses and/or when there are other jumps involved. I.e. some microarchitectures tend to have issues with jumps to jumps, or when there are multiple control changes in the same (possibly partial) cacheline because the instruction stream prediction may be predecoded in the L1 I$, and multiple branches in the same cacheline - or in the same execution cycle - can pollute that kind of thing. So microbenchmarking this way will probably make some things look unrealistically good. On the P4, the trace cache makes things even more interesting, since it's another level of I$ entirely, with very different behavior for the hit case vs the miss case. And I$ misses for the kernel are actually fairly high. Not in microbenchmarks that tend to have very repetitive behavior and a small I$ footprint, but in a lot of real-life loads the *bulk* of all action is in user space, and then the kernel side is often invoked with few loops (the kernel has very few loops indeed) and a cold I$.
So your numbers are interesting, but it would be really good to also get some info from Intel/AMD who may know about microarchitectural issues for the cases that don't show up in the hot-I$-cache environment. Linus ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 18:27 ` Linus Torvalds @ 2008-08-13 18:41 ` Andi Kleen 2008-08-13 18:45 ` Avi Kivity 2008-08-13 19:30 ` Mathieu Desnoyers 2008-08-13 19:16 ` Mathieu Desnoyers 1 sibling, 2 replies; 18+ messages in thread From: Andi Kleen @ 2008-08-13 18:41 UTC (permalink / raw) To: Linus Torvalds Cc: Mathieu Desnoyers, Steven Rostedt, Jeremy Fitzhardinge, Andi Kleen, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams > So microbenchmarking this way will probably make some things look > unrealistically good. Must be careful not to miss the big picture here. We have two assumptions here in this thread: - Normal alternative() nops are relatively infrequent, typically in points with enough pipeline bubbles anyways, and it likely doesn't matter how they are encoded. And also they don't have an issue with multi-part instructions anyways because they're not patched at runtime, so always the best known can be used. - The one case where nops are very frequent and matter and multipart is a problem is with ftrace noping out the call to mcount at runtime because that happens on every function entry. Even there the overhead is not that big, but at least measurable in kernel builds. Now the numbers have shown that just by not using frame pointer ( -pg right now implies frame pointer) you can get more benefit than what you lose from using non optimal nops. So for me the best strategy would be to get rid of the frame pointer and ignore the nops. This unfortunately would require going away from -pg and instead post process gcc output to insert "call mcount" manually.
But the nice advantage of that is that you could actually set up a custom table of callers built in an ELF section and with that you don't actually need the runtime patching (which is only done currently because there's no global table of mcount calls), but could do everything in stop_machine(). Without runtime patching you also don't need single part nops. I think that would be the best option. I especially like it because it would prevent forcing frame pointer which seems to be costlier than any kinds of nops. -Andi ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 18:41 ` Andi Kleen @ 2008-08-13 18:45 ` Avi Kivity 2008-08-13 18:51 ` Andi Kleen 2008-08-13 19:30 ` Mathieu Desnoyers 1 sibling, 1 reply; 18+ messages in thread From: Avi Kivity @ 2008-08-13 18:45 UTC (permalink / raw) To: Andi Kleen Cc: Linus Torvalds, Mathieu Desnoyers, Steven Rostedt, Jeremy Fitzhardinge, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams Andi Kleen wrote: > So for me the best strategy would be to get rid of the frame pointer > and ignore the nops. This unfortunately would require going away > from -pg and instead post process gcc output to insert "call mcount" > manually. But the nice advantage of that is that you could actually > set up a custom table of callers built in a ELF section and with > that you don't actually need the runtime patching (which is only > done currently because there's no global table of mcount calls), > but could do everything in stop_machine(). Without > runtime patching you also don't need single part nops. > > I think that would be the best option. I especially like it because > it would prevent forcing frame pointer which seems to be costlier > than any kinds of nosp. > > How would you deal with inlines? Using debug information? -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 18:45 ` Avi Kivity @ 2008-08-13 18:51 ` Andi Kleen 2008-08-13 18:56 ` Avi Kivity 0 siblings, 1 reply; 18+ messages in thread From: Andi Kleen @ 2008-08-13 18:51 UTC (permalink / raw) To: Avi Kivity Cc: Andi Kleen, Linus Torvalds, Mathieu Desnoyers, Steven Rostedt, Jeremy Fitzhardinge, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams > How would you deal with inlines? Using debug information? -pg already ignores inlines, so they aren't even traced today. It pretty much has to, assume an inline gets spread out by the global optimizer over the rest of the function, where would the mcount calls be inserted? -Andi ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 18:51 ` Andi Kleen @ 2008-08-13 18:56 ` Avi Kivity 0 siblings, 0 replies; 18+ messages in thread From: Avi Kivity @ 2008-08-13 18:56 UTC (permalink / raw) To: Andi Kleen Cc: Linus Torvalds, Mathieu Desnoyers, Steven Rostedt, Jeremy Fitzhardinge, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams Andi Kleen wrote: >> How would you deal with inlines? Using debug information? >> > > -pg already ignores inlines, so they aren't even traced today. > > It pretty much has to, assume an inline gets spread out by > the global optimizer over the rest of the function, where would > the mcount calls be inserted? > Good point. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 18:41 ` Andi Kleen 2008-08-13 18:45 ` Avi Kivity @ 2008-08-13 19:30 ` Mathieu Desnoyers 2008-08-13 19:37 ` Andi Kleen 1 sibling, 1 reply; 18+ messages in thread From: Mathieu Desnoyers @ 2008-08-13 19:30 UTC (permalink / raw) To: Andi Kleen Cc: Linus Torvalds, Steven Rostedt, Jeremy Fitzhardinge, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams * Andi Kleen (andi@firstfloor.org) wrote: > > So microbenchmarking this way will probably make some things look > > unrealistically good. > > Must be careful to miss the big picture here. > > We have two assumptions here in this thread: > > - Normal alternative() nops are relatively infrequent, typically > in points with enough pipeline bubbles anyways, and it likely doesn't > matter how they are encode. And also they don't have an issue > with mult part instructions anyways because they're not patched > at runtime, so always the best known can be used. > > - The one case where nops are very frequent and matter and multipart > is a problem is with ftrace noping out the call to mcount at runtime > because that happens on every function entry. > Even there the overhead is not that big, but at least measurable > in kernel builds. > > Now the numbers have shown that just by not using frame pointer ( > -pg right now implies frame pointer) you can get more benefit > than what you lose from using non optimal nops. > > So for me the best strategy would be to get rid of the frame pointer > and ignore the nops. This unfortunately would require going away > from -pg and instead post process gcc output to insert "call mcount" > manually. 
But the nice advantage of that is that you could actually > set up a custom table of callers built in a ELF section and with > that you don't actually need the runtime patching (which is only > done currently because there's no global table of mcount calls), > but could do everything in stop_machine(). Without > runtime patching you also don't need single part nops. > I agree that if frame pointer brings a too big overhead, it should not be used. Sorry to ask, I feel I must be missing something, but I'm trying to figure out where you propose to add the "call mcount" ? In the caller or in the callee ? In the caller, I guess it would replace the normal function call, call a trampoline which would jump to the normal code. In the callee, as what is currently done with -pg, the callee would have a call mcount at the beginning of the function. Or is it a different scheme I don't see ? I am trying to figure out how you happen to do all that without dynamic code modification and manage not to hurt performance. Mathieu > I think that would be the best option. I especially like it because > it would prevent forcing frame pointer which seems to be costlier > than any kinds of nosp. > > -Andi > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 19:30 ` Mathieu Desnoyers @ 2008-08-13 19:37 ` Andi Kleen 2008-08-13 20:01 ` Mathieu Desnoyers 2008-08-15 21:34 ` Steven Rostedt 0 siblings, 2 replies; 18+ messages in thread From: Andi Kleen @ 2008-08-13 19:37 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Andi Kleen, Linus Torvalds, Steven Rostedt, Jeremy Fitzhardinge, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams > Sorry to ask, I feel I must be missing something, but I'm trying to > figure out where you propose to add the "call mcount" ? In the caller or > in the callee ? Callee, like gcc. Caller would likely be more bloated because there are more calls than functions. Also, if it was at the caller, more code would be needed because the function currently executed couldn't be gotten from the stack directly. > Or is it a different scheme I don't see ? I am trying to figure out how > you happen to do all that without dynamic code modification and manage > not to hurt performance. The dynamic code modification is only needed because there is no global table of the mcount call sites. So instead it discovers them at runtime, but that requires runtime-safe patching. With a custom call scheme one could just build up a table of call sites at link time using an ELF section and then, when tracing is enabled/disabled, always patch them all in one go in a stop_machine(). Then you wouldn't need parallel-execution-safe patching anymore and it doesn't matter what the nops look like. The other advantage is that it would allow getting rid of the frame pointer. -Andi ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 19:37 ` Andi Kleen @ 2008-08-13 20:01 ` Mathieu Desnoyers 2008-08-15 21:34 ` Steven Rostedt 1 sibling, 0 replies; 18+ messages in thread From: Mathieu Desnoyers @ 2008-08-13 20:01 UTC (permalink / raw) To: Andi Kleen Cc: Linus Torvalds, Steven Rostedt, Steven Rostedt, Jeremy Fitzhardinge, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams * Andi Kleen (andi@firstfloor.org) wrote: > > Sorry to ask, I feel I must be missing something, but I'm trying to > > figure out where you propose to add the "call mcount" ? In the caller or > > in the callee ? > > Callee, like gcc. Caller would likely be more bloated because > there are more calls than functions. Also, if it was at the > caller, more code would be needed because the function currently > executed couldn't be gotten from the stack directly. > > > Or is it a different scheme I don't see ? I am trying to figure out how > > you happen to do all that without dynamic code modification and manage > > not to hurt performance. > > The dynamic code modification is only needed because there is no > global table of the mcount call sites. So instead it discovers > them at runtime, but that requires runtime-safe patching. > > With a custom call scheme one could just build up a table of > call sites at link time using an ELF section and then, when > tracing is enabled/disabled, always patch them all in one go > in a stop_machine(). Then you wouldn't need parallel-execution-safe > patching anymore and it doesn't matter what the nops look like.
>
I agree that the custom call scheme could let you know the mcount call site addresses at link time, so you could replace the call instructions with nops (at link time, so you actually don't know much about the exact hardware the kernel will be running on, which makes it harder to choose the best nop). To me, it seems that doing this at link time, as you propose, is the best approach, as it won't impact the system bootup time as much as the current ftrace scheme. However, I disagree with you on one point: if you use nops which are made of multiple instructions smaller than 5 bytes, enabling the tracer (patching all the sites in a stop_machine()) still presents the risk of having a preempted thread with a return IP pointing directly into the middle of what will become a 5-byte call instruction. When the thread is scheduled again after the stop_machine, an illegal instruction fault (or any random effect) will occur. Therefore, building a table of mcount call sites in an ELF section, and declaring a _single_ 5-byte nop instruction in the instruction stream that would fit all target architectures in lieu of the mcount call, so it can later be patched in with the 5-byte call at runtime, seems like a good way to go. Mathieu P.S.: It would be good to have a look at the alternative.c lock prefix vs preemption race I identified a few weeks ago. Actually, this currently existing cpu hotplug bug is related to the preemption issue I just explained here. ref. http://lkml.org/lkml/2008/7/30/265, especially: "As a general rule, never try to combine smaller instructions into a bigger one, except in the case of adding a lock prefix to an instruction: this case ensures that the non-lock-prefixed instruction is still valid after the change has been done. We could however run into a nasty non-synchronized atomic instruction use in SMP mode if a thread happens to be scheduled out right after the lock prefix. Hopefully the alternative code uses the refrigerator... (hrm, it doesn't).
Actually, alternative.c lock-prefix modification is O.K. for spinlocks because they execute with preemption off, but not for other atomic operations which may execute with preemption on." > The other advantage is that it would allow getting rid of > the frame pointer. > > -Andi > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 19:37 ` Andi Kleen 2008-08-13 20:01 ` Mathieu Desnoyers @ 2008-08-15 21:34 ` Steven Rostedt 2008-08-15 21:51 ` Andi Kleen 1 sibling, 1 reply; 18+ messages in thread From: Steven Rostedt @ 2008-08-15 21:34 UTC (permalink / raw) To: Andi Kleen Cc: Mathieu Desnoyers, Linus Torvalds, Jeremy Fitzhardinge, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams [ Finally got my goodmis email back ] On Wed, 13 Aug 2008, Andi Kleen wrote: > > Sorry to ask, I feel I must be missing something, but I'm trying to > > figure out where you propose to add the "call mcount" ? In the caller or > > in the callee ? > > callee like gcc. caller would be likely more bloated because > there are more calls than functions. Also if it was at the > callee more code would be needed because the function currently > executed couldn't be gotten from stack directly. > > > Or is it a different scheme I don't see ? I am trying to figure out how > > you happen to do all that without dynamic code modification and manage > > not to hurt performance. > > The dynamic code modification is only needed because there is no > global table of the mcount call sites. So instead it discovers > them at runtime, but that requires runtime-safe patching The new code does not discover the places at runtime. The old code did that. The "to kill a daemon" patch series removed the runtime discovery and replaced it with discovery at compile time. > > With a custom call scheme one could just build up a table of > call sites at link time using an ELF section and then when > tracing is enabled/disabled always patch them all in one go > in a stop_machine(). Then you wouldn't need parallel execution safe > patching anymore and it doesn't matter what the nops look like. The current patch set pretty much does exactly this.
Yes, I patch at boot up all in one go, before the other CPUs are even active. This takes all of 6 milliseconds to do. Not much extra time for bootup. > > The other advantage is that it would allow getting rid of > the frame pointer. This is the only advantage that you have. -- Steve ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-15 21:34 ` Steven Rostedt @ 2008-08-15 21:51 ` Andi Kleen 0 siblings, 0 replies; 18+ messages in thread From: Andi Kleen @ 2008-08-15 21:51 UTC (permalink / raw) To: Steven Rostedt Cc: Andi Kleen, Mathieu Desnoyers, Linus Torvalds, Jeremy Fitzhardinge, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams > > The other advantage is that it would allow getting rid of > > the frame pointer. > > This is the only advantage that you have. Ok. But it's a serious one. It gives slightly more gain than your whole complicated patching exercise. Ok, maybe it would be better to just properly fix gcc, but the problem is it takes forever for the user base to actually start using a new gcc :/ -Andi ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Efficient x86 and x86_64 NOP microbenchmarks 2008-08-13 18:27 ` Linus Torvalds 2008-08-13 18:41 ` Andi Kleen @ 2008-08-13 19:16 ` Mathieu Desnoyers 1 sibling, 0 replies; 18+ messages in thread From: Mathieu Desnoyers @ 2008-08-13 19:16 UTC (permalink / raw) To: Linus Torvalds Cc: Steven Rostedt, Steven Rostedt, Jeremy Fitzhardinge, Andi Kleen, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves, Clark Williams * Linus Torvalds (torvalds@linux-foundation.org) wrote: > > > On Wed, 13 Aug 2008, Mathieu Desnoyers wrote: > > > > I also did some microbenchmarks on my Intel Xeon 64 bits, AMD64 and > > Intel Pentium 4 boxes to compare a baseline > > Note that the biggest problems of a jump-based nop are likely to happen > when there are I$ misses and/or when there are other jumps involved. Ie some > microarchitectures tend to have issues with jumps to jumps, or when > there are multiple control changes in the same (possibly partial) > cacheline, because the instruction stream prediction may be predecoded in > the L1 I$, and multiple branches in the same cacheline - or in the same > execution cycle - can pollute that kind of thing. > Yup, I agree. Actually, the tests I ran show that using jumps as nops does not seem to be the best solution, even cycle-wise. > So microbenchmarking this way will probably make some things look > unrealistically good. > Yes, I am aware of these "high locality" effects. I use these tests as a starting point to find out which nops are good candidates, and then it can be later validated with more thorough testing on real workloads, which will suffer from higher standard deviation. Interestingly enough, the P6_NOPS seem to be a poor choice both at the macro and micro levels for the Intel Xeon (referring to http://lkml.org/lkml/2008/8/13/253 for the macro-benchmarks).
> On the P4, the trace cache makes things even more interesting, since it's > another level of I$ entirely, with very different behavior for the hit > case vs the miss case. As long as the whole kernel agrees on which instructions should be used for frequently used nops, the instruction trace cache should behave properly. > > And I$ misses for the kernel are actually fairly high. Not in > microbenchmarks that tend to have very repetitive behavior and a small I$ > footprint, but in a lot of real-life loads the *bulk* of all action is in > user space, and then the kernel side is often invoked with few loops (the > kernel has very few loops indeed) and a cold I$. I assume the effect of an I$ miss to be the same for all the tested scenarios (except on P4, and maybe except for the jump cases), given that in each case we load 5 bytes' worth of instructions. Even considering this, the results I get show that the choices made in the current kernel might not be the best ones. > > So your numbers are interesting, but it would be really good to also get > some info from Intel/AMD who may know about microarchitectural issues for > the cases that don't show up in the hot-I$-cache environment. > Yep. I think it may make a difference if we use jumps, but I doubt it will change anything for the other various nops. Still, having that information would be good. Some more numbers follow for older architectures.
Intel Pentium 3, 550MHz
NR_TESTS 10000000
test empty cycles : 510000254
test 2-bytes jump cycles : 510000077
test 5-bytes jump cycles : 510000101
test 3/2 nops cycles : 500000072
test 5-bytes nop with long prefix cycles : 500000107
test 5-bytes P6 nop cycles : 500000069 (current choice ok)
test Generic 1/4 5-bytes nops cycles : 514687590
test K7 1/4 5-bytes nops cycles : 530000012

Intel Pentium 3, 933MHz
NR_TESTS 10000000
test empty cycles : 510000565
test 2-bytes jump cycles : 510000133
test 5-bytes jump cycles : 510000363
test 3/2 nops cycles : 500000358
test 5-bytes nop with long prefix cycles : 500000331
test 5-bytes P6 nop cycles : 500000625 (current choice ok)
test Generic 1/4 5-bytes nops cycles : 514687797
test K7 1/4 5-bytes nops cycles : 530000273

Intel Pentium M, 2GHz
NR_TESTS 10000000
test empty cycles : 180000515
test 2-bytes jump cycles : 180000386 (would be the best)
test 5-bytes jump cycles : 205000435
test 3/2 nops cycles : 193333517
test 5-bytes nop with long prefix cycles : 205000167
test 5-bytes P6 nop cycles : 205937652
test Generic 1/4 5-bytes nops cycles : 187500174
test K7 1/4 5-bytes nops cycles : 193750161

Intel Pentium 3, 550MHz
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 7
model name : Pentium III (Katmai)
stepping : 3
cpu MHz : 551.295
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips : 1103.44
clflush size : 32

Intel Pentium 3, 933MHz
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 8
model name : Pentium III (Coppermine)
stepping : 6
cpu MHz : 933.134
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips : 1868.22
clflush size : 32

Intel Pentium M, 2GHz
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 13
model name : Intel(R) Pentium(R) M processor 2.00GHz
stepping : 8
cpu MHz : 2000.000
cache size : 2048 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat clflush dts acpi mmx fxsr sse sse2 ss tm pbe nx bts est tm2
bogomips : 3994.64
clflush size : 64

Mathieu > Linus -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2008-08-15 21:50 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20080813191926.GB15547@Krystal>
2008-08-13 20:00 ` Efficient x86 and x86_64 NOP microbenchmarks Steven Rostedt
2008-08-13 20:06 ` Jeremy Fitzhardinge
2008-08-13 20:34 ` Steven Rostedt
2008-08-13 20:15 ` Andi Kleen
2008-08-13 20:21 ` Linus Torvalds
2008-08-13 20:21 ` Steven Rostedt
2008-08-08 18:13 [PATCH 0/5] ftrace: to kill a daemon Steven Rostedt
2008-08-08 18:21 ` Mathieu Desnoyers
2008-08-08 18:41 ` Steven Rostedt
2008-08-08 19:05 ` Mathieu Desnoyers
2008-08-08 23:38 ` Steven Rostedt
2008-08-09 0:23 ` Andi Kleen
2008-08-09 0:36 ` Steven Rostedt
2008-08-09 0:47 ` Jeremy Fitzhardinge
2008-08-09 0:51 ` Linus Torvalds
2008-08-09 1:25 ` Steven Rostedt
2008-08-13 17:52 ` Efficient x86 and x86_64 NOP microbenchmarks Mathieu Desnoyers
2008-08-13 18:27 ` Linus Torvalds
2008-08-13 18:41 ` Andi Kleen
2008-08-13 18:45 ` Avi Kivity
2008-08-13 18:51 ` Andi Kleen
2008-08-13 18:56 ` Avi Kivity
2008-08-13 19:30 ` Mathieu Desnoyers
2008-08-13 19:37 ` Andi Kleen
2008-08-13 20:01 ` Mathieu Desnoyers
2008-08-15 21:34 ` Steven Rostedt
2008-08-15 21:51 ` Andi Kleen
2008-08-13 19:16 ` Mathieu Desnoyers