public inbox for linux-kernel@vger.kernel.org
* Efficient x86 and x86_64 NOP microbenchmarks
  2008-08-09  1:25                 ` Steven Rostedt
@ 2008-08-13 17:52                   ` Mathieu Desnoyers
  2008-08-13 18:27                     ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Mathieu Desnoyers @ 2008-08-13 17:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Jeremy Fitzhardinge, Andi Kleen, LKML,
	Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> On Fri, 8 Aug 2008, Linus Torvalds wrote:
> > 
> > 
> > On Fri, 8 Aug 2008, Jeremy Fitzhardinge wrote:
> > >
> > > Steven Rostedt wrote:
> > > > I wish we had a true 5 byte nop. 
> > > 
> > > 0x66 0x66 0x66 0x66 0x90
> > 
> > I don't think so. Multiple redundant prefixes can be really expensive on 
> > some uarchs.
> > 
> > A no-op that isn't cheap isn't a no-op at all, it's a slow-op.
> 
> 
> A quick meaningless benchmark showed a slight performance hit.
> 

Hi Steven,

I also ran some microbenchmarks on my 64-bit Intel Xeon, AMD64 and
Intel Pentium 4 boxes, comparing a baseline (a function doing a bit of
memory reads and arithmetic operations) to the same function with nops
added. Here are the results. The kernel module used for the benchmarks
is below; feel free to run it on your own machines.

Xeon :

NR_TESTS                                    10000000
test empty cycles :                        165472020
test 2-bytes jump cycles :                 166666806
test 5-bytes jump cycles :                 166978164
test 3/2 nops cycles :                     169259406
test 5-bytes nop with long prefix cycles : 160000140
test 5-bytes P6 nop cycles :               163333458


AMD64 :

NR_TESTS                                    10000000
test empty cycles :                        145142367
test 2-bytes jump cycles :                 150000178
test 5-bytes jump cycles :                 150000171
test 3/2 nops cycles :                     159999994
test 5-bytes nop with long prefix cycles : 150000156
test 5-bytes P6 nop cycles :               150000148


Intel Pentium 4 :

NR_TESTS                                    10000000
test empty cycles :                        290001045
test 2-bytes jump cycles :                 310000568
test 5-bytes jump cycles :                 310000478
test 3/2 nops cycles :                     290000565
test 5-bytes nop with long prefix cycles : 311085510
test 5-bytes P6 nop cycles :               300000517
test Generic 1/4 5-bytes nops cycles :     310000553
test K7 1/4 5-bytes nops cycles :          300000533


These numbers show that

   .byte 0x66,0x66,0x66,0x66,0x90

(osp osp osp osp nop, which is not currently used in nops.h)

is the fastest nop on both the Xeon and AMD64.

The currently used 3/2 nops look like a _very_ bad choice for AMD64,
cycle-wise.

The 5-byte P6 nop currently used on the Xeon also seems to be a bit
slower than the 0x66,0x66,0x66,0x66,0x90 nop.

For the Intel Pentium 4, the best atomic choice seems to be the current
one (the 5-byte P6 nop, .byte 0x0f,0x1f,0x44,0x00,0x00), although the
3/2 nops used for K8 would be a bit faster. This is probably because
the P4 handles long instruction prefixes slowly.
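For reference, here is how I read the five-byte sequences compared
above, decoded from the .byte values in the test module below (the
short-jump encoding eb 03 is what gas emits for the module's "jmp 1f"
over three filler bytes); treat the mnemonics as a sketch:

```asm
# test name                      bytes             rough 32-bit decode
# 2-bytes jump (+3 filler)       eb 03 00 00 00    jmp .+5
# 5-bytes jump                   e9 00 00 00 00    jmp .+5
# 3/2 nops                       66 66 90 66 90    osp osp nop ; osp nop
# 5-bytes nop with long prefix   66 66 66 66 90    osp osp osp osp nop
# 5-bytes P6 nop                 0f 1f 44 00 00    nopl 0x0(%eax,%eax,1)
# Generic 1/4 nops (32-bit)      90 8d 74 26 00    nop ; lea 0x0(%esi),%esi
# K7 1/4 nops (32-bit)           90 8d 44 20 00    nop ; lea 0x0(%eax),%eax
```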

Is there any reason not to use these atomic nops and kill our
instruction-atomicity problems altogether?

(various cpuinfo can be found below)

Mathieu


/* test-nop-speed.c
 *
 */

#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/sched.h>
#include <linux/timex.h>
#include <linux/marker.h>
#include <asm/ptrace.h>

#define NR_TESTS 10000000

int var, var2;

struct proc_dir_entry *pentry = NULL;

void empty(void)
{
	asm volatile ("");
	var += 50;
	var /= 10;
	var *= var2;
}

void twobytesjump(void)
{
	/* jmp 1f assembles to eb 03: a 2-byte short jump over 3 filler bytes */
	asm volatile ("jmp 1f\n\t"
		".byte 0x00, 0x00, 0x00\n\t"
		"1:\n\t");
	var += 50;
	var /= 10;
	var *= var2;
}

void fivebytesjump(void)
{
	/* e9 00 00 00 00: a 5-byte near jump to the next instruction */
	asm volatile (".byte 0xe9, 0x00, 0x00, 0x00, 0x00\n\t");
	var += 50;
	var /= 10;
	var *= var2;
}

void threetwonops(void)
{
	/* K8-style 3-byte + 2-byte nop pair: 66 66 90 ; 66 90 */
	asm volatile (".byte 0x66,0x66,0x90,0x66,0x90\n\t");
	var += 50;
	var /= 10;
	var *= var2;
}

void fivebytesnop(void)
{
	/* osp osp osp osp nop: a single atomic 5-byte instruction */
	asm volatile (".byte 0x66,0x66,0x66,0x66,0x90\n\t");
	var += 50;
	var /= 10;
	var *= var2;
}

void fivebytespsixnop(void)
{
	/* P6 nop: 0f 1f 44 00 00 = nopl 0x0(%eax,%eax,1) */
	asm volatile (".byte 0x0f,0x1f,0x44,0x00,0x00\n\t");
	var += 50;
	var /= 10;
	var *= var2;
}

/*
 * GENERIC_NOP1 GENERIC_NOP4:
 * 1: nop
 * 4: leal 0x00(,%esi,1),%esi
 * _not_ nops in 64-bit mode.
 */
void genericfivebytesonefournops(void)
{
	asm volatile (".byte 0x90,0x8d,0x74,0x26,0x00\n\t");
	var += 50;
	var /= 10;
	var *= var2;
}

/*
 * K7_NOP4 ASM_NOP1:
 * 1: nop
 * 4: leal 0x00(,%eax,1),%eax
 * assumed _not_ to be nops in 64-bit mode.
 */
void k7fivebytesonefournops(void)
{
	asm volatile (".byte 0x90,0x8d,0x44,0x20,0x00\n\t");
	var += 50;
	var /= 10;
	var *= var2;
}

void perform_test(const char *name, void (*callback)(void))
{
	unsigned int i;
	cycles_t cycles1, cycles2;
	unsigned long flags;

	local_irq_save(flags);
	rdtsc_barrier();
	cycles1 = get_cycles();
	rdtsc_barrier();
	for (i = 0; i < NR_TESTS; i++) {
		callback();
	}
	rdtsc_barrier();
	cycles2 = get_cycles();
	rdtsc_barrier();
	local_irq_restore(flags);
	printk("test %s cycles : %llu\n", name, cycles2-cycles1);
}

static int my_open(struct inode *inode, struct file *file)
{
	printk("NR_TESTS %d\n", NR_TESTS);

	perform_test("empty", empty);
	perform_test("2-bytes jump", twobytesjump);
	perform_test("5-bytes jump", fivebytesjump);
	perform_test("3/2 nops", threetwonops);
	perform_test("5-bytes nop with long prefix", fivebytesnop);
	perform_test("5-bytes P6 nop", fivebytespsixnop);
#ifdef CONFIG_X86_32
	perform_test("Generic 1/4 5-bytes nops", genericfivebytesonefournops);
	perform_test("K7 1/4 5-bytes nops", k7fivebytesonefournops);
#endif

	return -EPERM;
}


static struct file_operations my_operations = {
	.open = my_open,
};

int init_module(void)
{
	pentry = create_proc_entry("testnops", 0444, NULL);
	if (pentry)
		pentry->proc_fops = &my_operations;

	return 0;
}

void cleanup_module(void)
{
	remove_proc_entry("testnops", NULL);
}

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Mathieu Desnoyers");
MODULE_DESCRIPTION("NOP Test");


Xeon cpuinfo :

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Intel(R) Xeon(R) CPU           E5405  @ 2.00GHz
stepping	: 6
cpu MHz		: 2000.126
cache size	: 6144 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx tm2 ssse3 cx16 xtpr dca sse4_1 lahf_lm
bogomips	: 4000.25
clflush size	: 64
cache_alignment	: 64
address sizes	: 38 bits physical, 48 bits virtual
power management:

AMD64 cpuinfo :

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 15
model		: 35
model name	: AMD Athlon(tm)64 X2 Dual Core Processor  3800+
stepping	: 2
cpu MHz		: 2009.139
cache size	: 512 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow rep_good pni lahf_lm cmp_legacy
bogomips	: 4022.42
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

Pentium 4 :


processor	: 0
vendor_id	: GenuineIntel
cpu family	: 15
model		: 4
model name	: Intel(R) Pentium(R) 4 CPU 3.00GHz
stepping	: 1
cpu MHz		: 3000.138
cache size	: 1024 KB
physical id	: 0
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 5
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx constant_tsc up pebs bts pni monitor ds_cpl cid xtpr
bogomips	: 6005.70
clflush size	: 64
power management:



> Here's 10 runs of "hackbench 50" using the two part 5 byte nop:
> 
> run 1
> Time: 4.501
> run 2
> Time: 4.855
> run 3
> Time: 4.198
> run 4
> Time: 4.587
> run 5
> Time: 5.016
> run 6
> Time: 4.757
> run 7
> Time: 4.477
> run 8
> Time: 4.693
> run 9
> Time: 4.710
> run 10
> Time: 4.715
> avg = 4.6509
> 
> 
> And 10 runs using the above 5 byte nop:
> 
> run 1
> Time: 4.832
> run 2
> Time: 5.319
> run 3
> Time: 5.213
> run 4
> Time: 4.830
> run 5
> Time: 4.363
> run 6
> Time: 4.391
> run 7
> Time: 4.772
> run 8
> Time: 4.992
> run 9
> Time: 4.727
> run 10
> Time: 4.825
> avg = 4.8264
> 
> # cat /proc/cpuinfo
> processor	: 0
> vendor_id	: AuthenticAMD
> cpu family	: 15
> model		: 65
> model name	: Dual-Core AMD Opteron(tm) Processor 2220
> stepping	: 3
> cpu MHz		: 2799.992
> cache size	: 1024 KB
> physical id	: 0
> siblings	: 2
> core id		: 0
> cpu cores	: 2
> apicid		: 0
> initial apicid	: 0
> fdiv_bug	: no
> hlt_bug		: no
> f00f_bug	: no
> coma_bug	: no
> fpu		: yes
> fpu_exception	: yes
> cpuid level	: 1
> wp		: yes
> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
> cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
> rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic 
> cr8_legacy
> bogomips	: 5599.98
> clflush size	: 64
> power management: ts fid vid ttp tm stc
> 
> There's 4 of these.
> 
> Just to make sure, I ran the above nop test again:
> 
> [ this is reverse from the above runs ]
> 
> run 1
> Time: 4.723
> run 2
> Time: 5.080
> run 3
> Time: 4.521
> run 4
> Time: 4.841
> run 5
> Time: 4.696
> run 6
> Time: 4.946
> run 7
> Time: 4.754
> run 8
> Time: 4.717
> run 9
> Time: 4.905
> run 10
> Time: 4.814
> avg = 4.7997
> 
> And again the two part nop:
> 
> run 1
> Time: 4.434
> run 2
> Time: 4.496
> run 3
> Time: 4.801
> run 4
> Time: 4.714
> run 5
> Time: 4.631
> run 6
> Time: 5.178
> run 7
> Time: 4.728
> run 8
> Time: 4.920
> run 9
> Time: 4.898
> run 10
> Time: 4.770
> avg = 4.757
> 
> 
> This time it was close, but still seems to have some difference.
> 
> heh, perhaps it's just noise.
> 
> -- Steve
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Efficient x86 and x86_64 NOP microbenchmarks
  2008-08-13 17:52                   ` Efficient x86 and x86_64 NOP microbenchmarks Mathieu Desnoyers
@ 2008-08-13 18:27                     ` Linus Torvalds
  2008-08-13 18:41                       ` Andi Kleen
  2008-08-13 19:16                       ` Mathieu Desnoyers
  0 siblings, 2 replies; 18+ messages in thread
From: Linus Torvalds @ 2008-08-13 18:27 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Jeremy Fitzhardinge, Andi Kleen, LKML,
	Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams



On Wed, 13 Aug 2008, Mathieu Desnoyers wrote:
> 
> I also did some microbenchmarks on my Intel Xeon 64 bits, AMD64 and
> Intel Pentium 4 boxes to compare a baseline

Note that the biggest problems of a jump-based nop are likely to happen 
when there are I$ misses and/or when there are other jumps involved. I.e., 
some microarchitectures tend to have issues with jumps to jumps, or when 
there are multiple control changes in the same (possibly partial) 
cacheline, because the instruction stream prediction may be predecoded in 
the L1 I$, and multiple branches in the same cacheline - or in the same 
execution cycle - can pollute that kind of thing.

So microbenchmarking this way will probably make some things look 
unrealistically good. 

On the P4, the trace cache makes things even more interesting, since it's 
another level of I$ entirely, with very different behavior for the hit 
case vs the miss case.

And I$ misses for the kernel are actually fairly high. Not in 
microbenchmarks, which tend to have very repetitive behavior and a small 
I$ footprint, but in a lot of real-life loads the *bulk* of all action is 
in user space, and then the kernel side is often invoked with few loops 
(the kernel has very few loops indeed) and a cold I$.

So your numbers are interesting, but it would be really good to also get 
some info from Intel/AMD who may know about microarchitectural issues for 
the cases that don't show up in the hot-I$-cache environment.

			Linus


* Re: Efficient x86 and x86_64 NOP microbenchmarks
  2008-08-13 18:27                     ` Linus Torvalds
@ 2008-08-13 18:41                       ` Andi Kleen
  2008-08-13 18:45                         ` Avi Kivity
  2008-08-13 19:30                         ` Mathieu Desnoyers
  2008-08-13 19:16                       ` Mathieu Desnoyers
  1 sibling, 2 replies; 18+ messages in thread
From: Andi Kleen @ 2008-08-13 18:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, Steven Rostedt, Jeremy Fitzhardinge,
	Andi Kleen, LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper,
	Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

> So microbenchmarking this way will probably make some things look 
> unrealistically good. 

Must be careful not to miss the big picture here.

We have two assumptions in this thread:

- Normal alternative() nops are relatively infrequent, typically at
points with enough pipeline bubbles anyway, and it likely doesn't
matter how they are encoded. They also don't have an issue with
multi-part instructions, because they're not patched at runtime, so
the best-known sequence can always be used.

- The one case where nops are very frequent, where they matter, and
where multi-part sequences are a problem is ftrace nopping out the
call to mcount at runtime, because that happens on every function
entry. Even there the overhead is not that big, but it is at least
measurable in kernel builds.

Now the numbers have shown that just by not using the frame pointer
(-pg right now implies frame pointers) you gain more than you lose
from using non-optimal nops.

So for me the best strategy would be to get rid of the frame pointer
and ignore the nops. Unfortunately this would require moving away
from -pg and instead post-processing gcc output to insert "call mcount"
manually. But the nice advantage is that you could then build a custom
table of call sites into an ELF section, and with that you don't
actually need the runtime patching (which is only done currently
because there's no global table of mcount calls) but could do
everything in stop_machine(). Without runtime patching you also don't
need single-part nops.

I think that would be the best option. I especially like it because
it would avoid forcing the frame pointer, which seems to be costlier
than any kind of nop.

-Andi



* Re: Efficient x86 and x86_64 NOP microbenchmarks
  2008-08-13 18:41                       ` Andi Kleen
@ 2008-08-13 18:45                         ` Avi Kivity
  2008-08-13 18:51                           ` Andi Kleen
  2008-08-13 19:30                         ` Mathieu Desnoyers
  1 sibling, 1 reply; 18+ messages in thread
From: Avi Kivity @ 2008-08-13 18:45 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Mathieu Desnoyers, Steven Rostedt,
	Jeremy Fitzhardinge, LKML, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath,
	Ulrich Drepper, Rusty Russell, Gregory Haskins,
	Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves,
	Clark Williams

Andi Kleen wrote:
> So for me the best strategy would be to get rid of the frame pointer
> and ignore the nops. This unfortunately would require going away
> from -pg and instead post process gcc output to insert "call mcount"
> manually. But the nice advantage of that is that you could actually 
> set up a custom table of callers built in a ELF section and with
> that you don't actually need the runtime patching (which is only
> done currently because there's no global table of mcount calls),
> but could do everything in stop_machine(). Without
> runtime patching you also don't need single part nops. 
>
> I think that would be the best option. I especially like it because
> it would prevent forcing frame pointer which seems to be costlier
> than any kinds of nops.
>
>   

How would you deal with inlines?  Using debug information?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: Efficient x86 and x86_64 NOP microbenchmarks
  2008-08-13 18:45                         ` Avi Kivity
@ 2008-08-13 18:51                           ` Andi Kleen
  2008-08-13 18:56                             ` Avi Kivity
  0 siblings, 1 reply; 18+ messages in thread
From: Andi Kleen @ 2008-08-13 18:51 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Andi Kleen, Linus Torvalds, Mathieu Desnoyers, Steven Rostedt,
	Jeremy Fitzhardinge, LKML, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath,
	Ulrich Drepper, Rusty Russell, Gregory Haskins,
	Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves,
	Clark Williams

> How would you deal with inlines?  Using debug information?

-pg already ignores inlines, so they aren't even traced today.

It pretty much has to: if an inline gets spread out by the global
optimizer over the rest of the function, where would the mcount calls
be inserted?

-Andi



* Re: Efficient x86 and x86_64 NOP microbenchmarks
  2008-08-13 18:51                           ` Andi Kleen
@ 2008-08-13 18:56                             ` Avi Kivity
  0 siblings, 0 replies; 18+ messages in thread
From: Avi Kivity @ 2008-08-13 18:56 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Mathieu Desnoyers, Steven Rostedt,
	Jeremy Fitzhardinge, LKML, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath,
	Ulrich Drepper, Rusty Russell, Gregory Haskins,
	Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves,
	Clark Williams

Andi Kleen wrote:
>> How would you deal with inlines?  Using debug information?
>>     
>
> -pg already ignores inlines, so they aren't even traced today.
>
> It pretty much has to, assume an inline gets spread out by
> the global optimizer over the rest of the function, where would
> the mcount calls be inserted?
>   

Good point.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: Efficient x86 and x86_64 NOP microbenchmarks
  2008-08-13 18:27                     ` Linus Torvalds
  2008-08-13 18:41                       ` Andi Kleen
@ 2008-08-13 19:16                       ` Mathieu Desnoyers
  1 sibling, 0 replies; 18+ messages in thread
From: Mathieu Desnoyers @ 2008-08-13 19:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Steven Rostedt, Jeremy Fitzhardinge, Andi Kleen,
	LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Wed, 13 Aug 2008, Mathieu Desnoyers wrote:
> > 
> > I also did some microbenchmarks on my Intel Xeon 64 bits, AMD64 and
> > Intel Pentium 4 boxes to compare a baseline
> 
> Note that the biggest problems of a jump-based nop are likely to happen 
> when there are I$ misses and/or when there are other jumps involved. Ie a 
> some microarchitectures tend to have issues with jumps to jumps, or when 
> there are multiple control changes in the same (possibly partial) 
> cacheline because the instruction stream prediction may be predecoded in 
> the L1 I$, and multiple branches in the same cacheline - or in the same 
> execution cycle - can pollute that kind of thing.
> 

Yup, I agree. Actually, the tests I ran show that using jumps as nops
does not seem to be the best solution, even cycle-wise.

> So microbenchmarking this way will probably make some things look 
> unrealistically good. 
> 

Yes, I am aware of these "high locality" effects. I use these tests as a
starting point to find out which nops are good candidates; they can
later be validated with more thorough testing on real workloads, which
will suffer from a higher standard deviation.

Interestingly enough, the P6_NOPS seem to be a poor choice at both the
macro and micro levels for the Intel Xeon (see
http://lkml.org/lkml/2008/8/13/253 for the macro-benchmarks).

> On the P4, the trace cache makes things even more interesting, since it's 
> another level of I$ entirely, with very different behavior for the hit 
> case vs the miss case.

As long as the whole kernel agrees on which instructions should be used
for frequently used nops, the instruction trace cache should behave
properly.

> 
> And I$ misses for the kernel are actually fairly high. Not in 
> microbenchmarks that tend to have very repetive behavior and a small I$ 
> footprint, but in a lot of real-life loads the *bulk* of all action is in 
> user space, and then the kernel side is often invoced with few loops (the 
> kernel has very few loops indeed) and a cold I$.

I assume the effect of an I$ miss to be the same for all the tested
scenarios (except on P4, and maybe except for the jump cases), given
that in each case we load 5 bytes' worth of instructions. Even
considering this, the results I get show that the choices made in the
current kernel might not be the best ones.

> 
> So your numbers are interesting, but it would be really good to also get 
> some info from Intel/AMD who may know about microarchitectural issues for 
> the cases that don't show up in the hot-I$-cache environment.
> 

Yep. I think it may make a difference if we use jumps, but I doubt it
will change anything for the other nop variants. Still, having that
information would be good.

Some more numbers follow for older architectures.

Intel Pentium 3, 550MHz

NR_TESTS                                    10000000
test empty cycles :                        510000254
test 2-bytes jump cycles :                 510000077
test 5-bytes jump cycles :                 510000101
test 3/2 nops cycles :                     500000072
test 5-bytes nop with long prefix cycles : 500000107
test 5-bytes P6 nop cycles :               500000069 (current choice ok)
test Generic 1/4 5-bytes nops cycles :     514687590
test K7 1/4 5-bytes nops cycles :          530000012

Intel Pentium 3, 933MHz

NR_TESTS                                    10000000
test empty cycles :                        510000565
test 2-bytes jump cycles :                 510000133
test 5-bytes jump cycles :                 510000363
test 3/2 nops cycles :                     500000358
test 5-bytes nop with long prefix cycles : 500000331
test 5-bytes P6 nop cycles :               500000625 (current choice ok)
test Generic 1/4 5-bytes nops cycles :     514687797
test K7 1/4 5-bytes nops cycles :          530000273


Intel Pentium M, 2GHz

NR_TESTS                                    10000000
test empty cycles :                        180000515
test 2-bytes jump cycles :                 180000386 (would be the best)
test 5-bytes jump cycles :                 205000435
test 3/2 nops cycles :                     193333517
test 5-bytes nop with long prefix cycles : 205000167
test 5-bytes P6 nop cycles :               205937652
test Generic 1/4 5-bytes nops cycles :     187500174
test K7 1/4 5-bytes nops cycles :          193750161


Intel Pentium 3, 550MHz

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 7
model name	: Pentium III (Katmai)
stepping	: 3
cpu MHz		: 551.295
cache size	: 512 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 2
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips	: 1103.44
clflush size	: 32

Intel Pentium 3, 933MHz

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 8
model name	: Pentium III (Coppermine)
stepping	: 6
cpu MHz		: 933.134
cache size	: 256 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 2
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips	: 1868.22
clflush size	: 32

Intel Pentium M, 2GHz

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 13
model name	: Intel(R) Pentium(R) M processor 2.00GHz
stepping	: 8
cpu MHz		: 2000.000
cache size	: 2048 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 2
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat clflush dts acpi mmx fxsr sse sse2 ss tm pbe nx bts est tm2
bogomips	: 3994.64
clflush size	: 64

Mathieu

> 			Linus




-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: Efficient x86 and x86_64 NOP microbenchmarks
  2008-08-13 18:41                       ` Andi Kleen
  2008-08-13 18:45                         ` Avi Kivity
@ 2008-08-13 19:30                         ` Mathieu Desnoyers
  2008-08-13 19:37                           ` Andi Kleen
  1 sibling, 1 reply; 18+ messages in thread
From: Mathieu Desnoyers @ 2008-08-13 19:30 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Steven Rostedt, Jeremy Fitzhardinge, LKML,
	Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

* Andi Kleen (andi@firstfloor.org) wrote:
> > So microbenchmarking this way will probably make some things look 
> > unrealistically good. 
> 
> Must be careful not to miss the big picture here.
> 
> We have two assumptions here in this thread:
> 
> - Normal alternative() nops are relatively infrequent, typically
> in points with enough pipeline bubbles anyways, and it likely doesn't
> matter how they are encode. And also they don't have an issue
> with mult part instructions anyways because they're not patched
> at runtime, so always the best known can be used.
> 
> - The one case where nops are very frequent and matter and multipart
> is a problem is with ftrace noping out the call to mcount at runtime 
> because that happens on every function entry.
> Even there the overhead is not that big, but at least measurable 
> in kernel builds.
> 
> Now the numbers have shown that just by not using frame pointer (
> -pg right now implies frame pointer) you can get more benefit 
> than what you lose from using non optimal nops.
> 
> So for me the best strategy would be to get rid of the frame pointer
> and ignore the nops. This unfortunately would require going away
> from -pg and instead post process gcc output to insert "call mcount"
> manually. But the nice advantage of that is that you could actually 
> set up a custom table of callers built in a ELF section and with
> that you don't actually need the runtime patching (which is only
> done currently because there's no global table of mcount calls),
> but could do everything in stop_machine(). Without
> runtime patching you also don't need single part nops. 
> 

I agree that if the frame pointer brings too big an overhead, it should
not be used.

Sorry to ask, and I feel I must be missing something, but I'm trying to
figure out where you propose to add the "call mcount": in the caller or
in the callee?

In the caller, I guess it would replace the normal function call with a
call to a trampoline, which would then jump to the normal code.

In the callee, as is currently done with -pg, the callee would have a
call to mcount at the beginning of the function.

Or is it a different scheme I don't see? I am trying to figure out how
you can do all that without dynamic code modification and still manage
not to hurt performance.

Mathieu

> I think that would be the best option. I especially like it because
> it would prevent forcing frame pointer which seems to be costlier
> than any kinds of nops.
> 
> -Andi
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: Efficient x86 and x86_64 NOP microbenchmarks
  2008-08-13 19:30                         ` Mathieu Desnoyers
@ 2008-08-13 19:37                           ` Andi Kleen
  2008-08-13 20:01                             ` Mathieu Desnoyers
  2008-08-15 21:34                             ` Steven Rostedt
  0 siblings, 2 replies; 18+ messages in thread
From: Andi Kleen @ 2008-08-13 19:37 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andi Kleen, Linus Torvalds, Steven Rostedt, Jeremy Fitzhardinge,
	LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

> Sorry to ask, I feel I must be missing something, but I'm trying to
> figure out where you propose to add the "call mcount" ? In the caller or
> in the callee ?

In the callee, like gcc does. The caller would likely be more bloated,
because there are more calls than functions. Also, if it were in the
caller, more code would be needed, because the function currently being
executed couldn't be obtained from the stack directly.

> Or is it a different scheme I don't see ? I am trying to figure out how
> you happen to do all that without dynamic code modification and manage
> not to hurt performance.

The dynamic code modification is only needed because there is no
global table of the mcount call sites. So instead it discovers
them at runtime, but that requires runtime-safe patching.

With a custom call scheme one could just build up a table of
call sites at link time, using an ELF section, and then, when
tracing is enabled or disabled, patch them all in one go inside
stop_machine(). Then you wouldn't need patching that is safe against
parallel execution anymore, and it wouldn't matter what the nops look
like.

The other advantage is that it would allow getting rid of
the frame pointer.

-Andi



* Re: Efficient x86 and x86_64 NOP microbenchmarks
       [not found] <20080813191926.GB15547@Krystal>
@ 2008-08-13 20:00 ` Steven Rostedt
  2008-08-13 20:06   ` Jeremy Fitzhardinge
  2008-08-13 20:15   ` Andi Kleen
  0 siblings, 2 replies; 18+ messages in thread
From: Steven Rostedt @ 2008-08-13 20:00 UTC (permalink / raw)
  To: Andi Kleen, Thomas Gleixner
  Cc: Mathieu Desnoyers, Linus Torvalds, Steven Rostedt,
	Jeremy Fitzhardinge, LKML, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, David Miller, Roland McGrath, Ulrich Drepper,
	Rusty Russell, Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

[
  Thanks to Mathieu Desnoyers, who forwarded this to me. Currently my
ISP for goodmis.org is having issues:
  https://help.domaindirect.com/index.php?_m=news&_a=viewnews&newsid=104
]
> ----- Forwarded message from Andi Kleen <andi@firstfloor.org> -----
>
>   
>> So microbenchmarking this way will probably make some things look 
>> unrealistically good. 
>>     
>
> Must be careful not to miss the big picture here.
>
> We have two assumptions here in this thread:
>
> - Normal alternative() nops are relatively infrequent, typically
> at points with enough pipeline bubbles anyway, and it likely doesn't
> matter how they are encoded. And they also don't have an issue
> with multi-part instructions anyway, because they're not patched
> at runtime, so the best known encoding can always be used.
>
> - The one case where nops are very frequent and matter and multipart
> is a problem is with ftrace noping out the call to mcount at runtime 
> because that happens on every function entry.
> Even there the overhead is not that big, but at least measurable 
> in kernel builds.
>   

The problem is not ftrace noping out the call at runtime. The problem is 
ftrace changing the nops back to calls to mcount.

The nop part is simple, straightforward, and not the issue we are
discussing here. The issue is which kind of nop to use. The bug with the
multi-part nop happens when we _enable_ tracing. That is, when someone
runs the tracer. The issue with the multi-part nop is that a task could
have been preempted after it executed the first nop and before the
second part. Then we enable tracing, and when the task is scheduled back
in, it will now execute half of the call to the mcount function.

I want to make this point very clear. If you never run tracing, this
bug will not happen. And the bug only happens on enabling the tracer,
not on disabling it. Not to mention that the bug itself will only happen
1 time in a billion.

> Now the numbers have shown that just by not using frame pointer (
> -pg right now implies frame pointer) you can get more benefit 
> than what you lose from using non optimal nops.
>   

No, I can easily make a patch that does not use frame pointers but
still uses -pg. We just cannot print the parent function in the trace.
This can easily be added as a config option, and is easily implemented.

> So for me the best strategy would be to get rid of the frame pointer
> and ignore the nops. This unfortunately would require going away
> from -pg and instead post process gcc output to insert "call mcount"
> manually. But the nice advantage of that is that you could actually 
> set up a custom table of callers built in a ELF section and with
> that you don't actually need the runtime patching (which is only
> done currently because there's no global table of mcount calls),
> but could do everything in stop_machine(). Without
> runtime patching you also don't need single part nops. 
>
>   

I'm totally confused here.  How do you enable function tracing?  How do
we make a call to the code that will record that a function was hit?

> I think that would be the best option. I especially like it because
> it would prevent forcing frame pointer which seems to be costlier
> than any kinds of nops.

As I stated, the frame pointer part is only to record the parent 
function in tracing. ie:

             ls-4866  [00] 177596.041275: _spin_unlock <-journal_stop


Here we see that the function _spin_unlock was called by the function 
journal_stop. We can easily turn off parent tracing now, with:

# echo noprint-parent > /debug/tracing/iter_ctrl

which gives us just:

             ls-4866  [00] 177596.041275: _spin_unlock


If we disable frame pointers, the noprint-parent option would be
forced. Not that devastating, and it still gives the user the option of
function tracing without requiring frame pointers.

I would still require that the irqsoff tracer add frame pointers, just 
because knowing that the long latency of interrupts disabled happened at 
local_irq_save doesn't cut it ;-)

Anyway, who would want to run with frame pointers disabled? If you ever 
get a bug crash, the stack trace is pretty much useless.

-- Steve


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Efficient x86 and x86_64 NOP microbenchmarks
  2008-08-13 19:37                           ` Andi Kleen
@ 2008-08-13 20:01                             ` Mathieu Desnoyers
  2008-08-15 21:34                             ` Steven Rostedt
  1 sibling, 0 replies; 18+ messages in thread
From: Mathieu Desnoyers @ 2008-08-13 20:01 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Steven Rostedt, Steven Rostedt,
	Jeremy Fitzhardinge, LKML, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath,
	Ulrich Drepper, Rusty Russell, Gregory Haskins,
	Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves,
	Clark Williams

* Andi Kleen (andi@firstfloor.org) wrote:
> > Sorry to ask, I feel I must be missing something, but I'm trying to
> > figure out where you propose to add the "call mcount" ? In the caller or
> > in the callee ?
> 
> In the callee, like gcc does. The caller would likely be more bloated
> because there are more calls than functions. Also, if it was at the
> caller, more code would be needed because the function currently
> executing couldn't be obtained from the stack directly.
> 
> > Or is it a different scheme I don't see ? I am trying to figure out how
> > you happen to do all that without dynamic code modification and manage
> > not to hurt performance.
> 
> The dynamic code modification is only needed because there is no
> global table of the mcount call sites. So instead it discovers
> them at runtime, but that requires runtime-safe patching.
> 
> With a custom call scheme one could just build up a table of 
> call sites at link time using an ELF section and then when
> tracing is enabled/disabled always patch them all in one go
> in a stop_machine(). Then you wouldn't need parallel execution safe
> patching anymore and it doesn't matter what the nops look like.
> 

I agree that the custom call scheme could let you know the mcount call
site addresses at link time, so you could replace the call instructions
with nops (at link time, so you actually don't know much about the
exact hardware the kernel will be running on, which makes it harder to
choose the best nop). To me, it seems that doing this at link time,
as you propose, is the best approach, as it won't impact the system
bootup time as much as the current ftrace scheme.

However, I disagree with you on one point: if you use nops which are
made of multiple instructions smaller than 5 bytes, enabling the tracer
(patching all the sites in a stop_machine()) still presents the risk of
having a preempted thread with a return IP pointing directly into the
middle of what will become a 5-byte call instruction. When the thread is
scheduled again after the stop_machine(), an illegal instruction fault
(or any random effect) will occur.

Therefore, building a table of mcount call sites in an ELF section, and
declaring a _single_ 5-byte nop instruction in the instruction stream
that would fit all target architectures in lieu of the mcount call, so
that it can later be patched into the 5-byte call at runtime, seems like
a good way to go.

Mathieu

P.S. : It would be good to have a look at the alternative.c lock prefix
vs preemption race I identified a few weeks ago. Actually, this
currently existing cpu hotplug bug is related to the preemption issue I
just explained here. ref. http://lkml.org/lkml/2008/7/30/265,
especially:

"As a general rule, never try to combine smaller instructions into a
bigger one, except in the case of adding a lock-prefix to an instruction :
this case insures that the non-lock prefixed instruction is still
valid after the change has been done. We could however run into a nasty
non-synchronized atomic instruction use in SMP mode if a thread happens
to be scheduled out right after the lock prefix. Hopefully the
alternative code uses the refrigerator... (hrm, it doesn't).

Actually, alternative.c lock-prefix modification is O.K. for spinlocks
because they execute with preemption off, but not for other atomic
operations which may execute with preemption on."


> The other advantage is that it would allow getting rid of
> the frame pointer.
> 
> -Andi
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Efficient x86 and x86_64 NOP microbenchmarks
  2008-08-13 20:00 ` Efficient x86 and x86_64 NOP microbenchmarks Steven Rostedt
@ 2008-08-13 20:06   ` Jeremy Fitzhardinge
  2008-08-13 20:34     ` Steven Rostedt
  2008-08-13 20:15   ` Andi Kleen
  1 sibling, 1 reply; 18+ messages in thread
From: Jeremy Fitzhardinge @ 2008-08-13 20:06 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, Thomas Gleixner, Mathieu Desnoyers, Linus Torvalds,
	Steven Rostedt, LKML, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams

Steven Rostedt wrote:
> No, I can easily make a patch that does not use frame pointers but 
> still uses -pg. We just can not print the parent function in the 
> trace. This can easily be added to a config, as well as easily 
> implemented.

Why?  You can always get the calling function, because its return 
address is on the stack (assuming mcount is called before the function 
puts its own frame on the stack).  But without a frame pointer, you 
can't necessarily get the caller's caller.

But I think Andi's point is that gcc forces frame pointers on when you 
enable mcount, so there's no choice in the matter.

    J

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Efficient x86 and x86_64 NOP microbenchmarks
  2008-08-13 20:00 ` Efficient x86 and x86_64 NOP microbenchmarks Steven Rostedt
  2008-08-13 20:06   ` Jeremy Fitzhardinge
@ 2008-08-13 20:15   ` Andi Kleen
  2008-08-13 20:21     ` Linus Torvalds
  2008-08-13 20:21     ` Steven Rostedt
  1 sibling, 2 replies; 18+ messages in thread
From: Andi Kleen @ 2008-08-13 20:15 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, Thomas Gleixner, Mathieu Desnoyers, Linus Torvalds,
	Steven Rostedt, Jeremy Fitzhardinge, LKML, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath,
	Ulrich Drepper, Rusty Russell, Gregory Haskins,
	Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves,
	Clark Williams

On Wed, Aug 13, 2008 at 04:00:37PM -0400, Steven Rostedt wrote:
> >Now the numbers have shown that just by not using frame pointer (
> >-pg right now implies frame pointer) you can get more benefit 
> >than what you lose from using non optimal nops.
> >  
> 
> No, I can easily make a patch that does not use frame pointers but still 

Not without patching gcc. Try it. The patch is not very difficult, and
I have done it here, but it does need a gcc patch.

> If we disable frame pointers, the noprint-parent option would be forced. 

Actually you can get the parent without a frame pointer if you force
gcc to emit the mcount call before touching the stack frame (a manual
insertion pass would do that). Then the parent is at 4(%esp)/8(%rsp).
Again, teaching gcc that is not very difficult, but it needs a patch.

> I would still require that the irqsoff tracer add frame pointers, just 
> because knowing that the long latency of interrupts disabled happened at 
> local_irq_save doesn't cut it ;-)

Nope.
> 
> Anyway, who would want to run with frame pointers disabled? If you ever 
> get a bug crash, the stack trace is pretty much useless.

First, that's not true (remember most production kernels run without
frame pointers, and e.g. crash or systemtap know how to do proper
unwinding without slow frame pointers). And if you want it at runtime
too, you can always add the dwarf2 unwinder (like the openSUSE kernel
does) and get better backtraces than you could ever get with frame
pointers (that is because e.g. most assembler code doesn't even bother
to set up frame pointers, but it is all dwarf2 annotated).

Also I must say the whole ftrace noping exercise is pretty pointless
without avoiding frame pointers, because it saves less than what you
lose unconditionally from the "select FRAME_POINTER".

-Andi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Efficient x86 and x86_64 NOP microbenchmarks
  2008-08-13 20:15   ` Andi Kleen
@ 2008-08-13 20:21     ` Linus Torvalds
  2008-08-13 20:21     ` Steven Rostedt
  1 sibling, 0 replies; 18+ messages in thread
From: Linus Torvalds @ 2008-08-13 20:21 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Steven Rostedt, Thomas Gleixner, Mathieu Desnoyers,
	Steven Rostedt, Jeremy Fitzhardinge, LKML, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath,
	Ulrich Drepper, Rusty Russell, Gregory Haskins,
	Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves,
	Clark Williams



On Wed, 13 Aug 2008, Andi Kleen wrote:
> 
> Also I must say the whole ftrace noping exercise is pretty pointless without
> avoiding frame pointers because it does save less than what you lose
> unconditionally from the "select FRAME_POINTER"

Andi, you seem to have missed the whole point. This is a _correctness_ 
issue as long as the nop is not a single instruction. And the workaround 
for that is uglier than just making a single-instruction nop.

So the question now is to find a good nop that _is_ a single atomic 
instruction. Your blathering about frame pointers is missing the whole 
point!

			Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Efficient x86 and x86_64 NOP microbenchmarks
  2008-08-13 20:15   ` Andi Kleen
  2008-08-13 20:21     ` Linus Torvalds
@ 2008-08-13 20:21     ` Steven Rostedt
  1 sibling, 0 replies; 18+ messages in thread
From: Steven Rostedt @ 2008-08-13 20:21 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Thomas Gleixner, Mathieu Desnoyers, Linus Torvalds,
	Steven Rostedt, Jeremy Fitzhardinge, LKML, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath,
	Ulrich Drepper, Rusty Russell, Gregory Haskins,
	Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves,
	Clark Williams

Andi Kleen wrote:
> On Wed, Aug 13, 2008 at 04:00:37PM -0400, Steven Rostedt wrote:
>   
>>> Now the numbers have shown that just by not using frame pointer (
>>> -pg right now implies frame pointer) you can get more benefit 
>>> than what you lose from using non optimal nops.
>>>  
>>>       
>> No, I can easily make a patch that does not use frame pointers but still 
>>     
>
> Not without patching gcc. Try it. The patch is not very difficult and i did
> it here, but it needs a patch. 
>   

OK, I admit you are right ;-)

I got the error message:

    gcc: -pg and -fomit-frame-pointer are incompatible

-- Steve


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Efficient x86 and x86_64 NOP microbenchmarks
  2008-08-13 20:06   ` Jeremy Fitzhardinge
@ 2008-08-13 20:34     ` Steven Rostedt
  0 siblings, 0 replies; 18+ messages in thread
From: Steven Rostedt @ 2008-08-13 20:34 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Jeremy Fitzhardinge, Andi Kleen, Thomas Gleixner, Linus Torvalds,
	Steven Rostedt, LKML, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams


Just a curious run of Mathieu's microbenchmark:

NR_TESTS 10000000
                       test empty cycles : 182500444
                test 2-bytes jump cycles : 195969127
                test 5-bytes jump cycles : 197000202
                    test 3/2 nops cycles : 201333408
test 5-bytes nop with long prefix cycles : 205000067
              test 5-bytes P6 nop cycles : 205000227
    test Generic 1/4 5-bytes nops cycles : 200000077
         test K7 1/4 5-bytes nops cycles : 197549045


And this was on a Pentium III 847.461 MHz box (my old toshiba laptop)

The jumps here performed the best, but that could just be cache issues.
Still, it is interesting to see that of the nops, the K7 1/4 fared the
best.

-- Steve


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Efficient x86 and x86_64 NOP microbenchmarks
  2008-08-13 19:37                           ` Andi Kleen
  2008-08-13 20:01                             ` Mathieu Desnoyers
@ 2008-08-15 21:34                             ` Steven Rostedt
  2008-08-15 21:51                               ` Andi Kleen
  1 sibling, 1 reply; 18+ messages in thread
From: Steven Rostedt @ 2008-08-15 21:34 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Mathieu Desnoyers, Linus Torvalds, Jeremy Fitzhardinge, LKML,
	Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	David Miller, Roland McGrath, Ulrich Drepper, Rusty Russell,
	Gregory Haskins, Arnaldo Carvalho de Melo,
	Luis Claudio R. Goncalves, Clark Williams


[ Finally got my goodmis email back ]

On Wed, 13 Aug 2008, Andi Kleen wrote:

> > Sorry to ask, I feel I must be missing something, but I'm trying to
> > figure out where you propose to add the "call mcount" ? In the caller or
> > in the callee ?
> 
> In the callee, like gcc does. The caller would likely be more bloated
> because there are more calls than functions. Also, if it was at the
> caller, more code would be needed because the function currently
> executing couldn't be obtained from the stack directly.
> 
> > Or is it a different scheme I don't see ? I am trying to figure out how
> > you happen to do all that without dynamic code modification and manage
> > not to hurt performance.
> 
> The dynamic code modification is only needed because there is no
> global table of the mcount call sites. So instead it discovers
> them at runtime, but that requires runtime-safe patching.

The new code does not discover the call sites at runtime; the old code
did that. The "to kill a daemon" patch series removed the runtime
discovery and replaced it with discovery at compile time.

> 
> With a custom call scheme one could just build up a table of 
> call sites at link time using an ELF section and then when
> tracing is enabled/disabled always patch them all in one go
> in a stop_machine(). Then you wouldn't need parallel execution safe
> patching anymore and it doesn't matter what the nops look like.

The current patch set pretty much does exactly this. Yes, I patch at
boot up all in one go, before the other CPUs are even active. This
takes all of 6 milliseconds. Not much extra time for bootup.

> 
> The other advantage is that it would allow getting rid of
> the frame pointer.

This is the only advantage that you have.

-- Steve


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Efficient x86 and x86_64 NOP microbenchmarks
  2008-08-15 21:34                             ` Steven Rostedt
@ 2008-08-15 21:51                               ` Andi Kleen
  0 siblings, 0 replies; 18+ messages in thread
From: Andi Kleen @ 2008-08-15 21:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, Mathieu Desnoyers, Linus Torvalds,
	Jeremy Fitzhardinge, LKML, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, David Miller, Roland McGrath,
	Ulrich Drepper, Rusty Russell, Gregory Haskins,
	Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves,
	Clark Williams

> > The other advantage is that it would allow getting rid of
> > the frame pointer.
> 
> This is the only advantage that you have.

Ok. But it's a serious one.  It gives slightly more gain than your
whole complicated patching exercise.

Ok, maybe it would be better to just properly fix gcc, but the problem
is that it takes forever for the user base to actually start using a
new gcc :/

-Andi

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2008-08-15 21:50 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20080813191926.GB15547@Krystal>
2008-08-13 20:00 ` Efficient x86 and x86_64 NOP microbenchmarks Steven Rostedt
2008-08-13 20:06   ` Jeremy Fitzhardinge
2008-08-13 20:34     ` Steven Rostedt
2008-08-13 20:15   ` Andi Kleen
2008-08-13 20:21     ` Linus Torvalds
2008-08-13 20:21     ` Steven Rostedt
2008-08-08 18:13 [PATCH 0/5] ftrace: to kill a daemon Steven Rostedt
2008-08-08 18:21 ` Mathieu Desnoyers
2008-08-08 18:41   ` Steven Rostedt
2008-08-08 19:05     ` Mathieu Desnoyers
2008-08-08 23:38       ` Steven Rostedt
2008-08-09  0:23         ` Andi Kleen
2008-08-09  0:36           ` Steven Rostedt
2008-08-09  0:47             ` Jeremy Fitzhardinge
2008-08-09  0:51               ` Linus Torvalds
2008-08-09  1:25                 ` Steven Rostedt
2008-08-13 17:52                   ` Efficient x86 and x86_64 NOP microbenchmarks Mathieu Desnoyers
2008-08-13 18:27                     ` Linus Torvalds
2008-08-13 18:41                       ` Andi Kleen
2008-08-13 18:45                         ` Avi Kivity
2008-08-13 18:51                           ` Andi Kleen
2008-08-13 18:56                             ` Avi Kivity
2008-08-13 19:30                         ` Mathieu Desnoyers
2008-08-13 19:37                           ` Andi Kleen
2008-08-13 20:01                             ` Mathieu Desnoyers
2008-08-15 21:34                             ` Steven Rostedt
2008-08-15 21:51                               ` Andi Kleen
2008-08-13 19:16                       ` Mathieu Desnoyers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox