From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
Jeremy Fitzhardinge <jeremy@goop.org>,
Andi Kleen <andi@firstfloor.org>,
LKML <linux-kernel@vger.kernel.org>, Ingo Molnar <mingo@elte.hu>,
Thomas Gleixner <tglx@linutronix.de>,
Peter Zijlstra <peterz@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
David Miller <davem@davemloft.net>,
Roland McGrath <roland@redhat.com>,
Ulrich Drepper <drepper@redhat.com>,
Rusty Russell <rusty@rustcorp.com.au>,
Gregory Haskins <ghaskins@novell.com>,
Arnaldo Carvalho de Melo <acme@redhat.com>,
"Luis Claudio R. Goncalves" <lclaudio@uudg.org>,
Clark Williams <williams@redhat.com>
Subject: Efficient x86 and x86_64 NOP microbenchmarks
Date: Wed, 13 Aug 2008 13:52:13 -0400 [thread overview]
Message-ID: <20080813175213.GA8679@Krystal> (raw)
In-Reply-To: <alpine.DEB.1.10.0808082113090.3707@gandalf.stny.rr.com>
* Steven Rostedt (rostedt@goodmis.org) wrote:
>
> On Fri, 8 Aug 2008, Linus Torvalds wrote:
> >
> >
> > On Fri, 8 Aug 2008, Jeremy Fitzhardinge wrote:
> > >
> > > Steven Rostedt wrote:
> > > > I wish we had a true 5 byte nop.
> > >
> > > 0x66 0x66 0x66 0x66 0x90
> >
> > I don't think so. Multiple redundant prefixes can be really expensive on
> > some uarchs.
> >
> > A no-op that isn't cheap isn't a no-op at all, it's a slow-op.
>
>
> A quick meaningless benchmark showed a slight perfomance hit.
>
Hi Steven,
I also did some microbenchmarks on my Intel Xeon 64 bits, AMD64 and
Intel Pentium 4 boxes to compare a baseline (function doing a bit of
memory read and arithmetic operations) to cases where nops are used.
Here are the results. The kernel module used for the benchmarks is
below, feel free to run it on your own architectures.
Xeon :
NR_TESTS 10000000
test empty cycles : 165472020
test 2-bytes jump cycles : 166666806
test 5-bytes jump cycles : 166978164
test 3/2 nops cycles : 169259406
test 5-bytes nop with long prefix cycles : 160000140
test 5-bytes P6 nop cycles : 163333458
AMD64 :
NR_TESTS 10000000
test empty cycles : 145142367
test 2-bytes jump cycles : 150000178
test 5-bytes jump cycles : 150000171
test 3/2 nops cycles : 159999994
test 5-bytes nop with long prefix cycles : 150000156
test 5-bytes P6 nop cycles : 150000148
Intel Pentium 4 :
NR_TESTS 10000000
test empty cycles : 290001045
test 2-bytes jump cycles : 310000568
test 5-bytes jump cycles : 310000478
test 3/2 nops cycles : 290000565
test 5-bytes nop with long prefix cycles : 311085510
test 5-bytes P6 nop cycles : 300000517
test Generic 1/4 5-bytes nops cycles : 310000553
test K7 1/4 5-bytes nops cycles : 300000533
These numbers show that both on Xeon and AMD64, the
.byte 0x66,0x66,0x66,0x66,0x90
(osp osp osp osp nop, which is not currently used in nops.h)
is the fastest nop on both architectures.
The currently used 3/2 nops looks like a _very_ bad choice for AMD64
cycle-wise.
The currently used 5-bytes P6 nop used on Xeon seems to be a bit slower
than the 0x66,0x66,0x66,0x66,0x90 nop too.
For the Intel Pentium 4, the best atomic choice seems to be the current
one (5-bytes P6 nop : .byte 0x0f,0x1f,0x44,0x00,0), although we can see
that the 3/2 nop used for K8 would be a bit faster. It is probably due
to the fact that P4 handles long instruction prefixes slowly.
Is there any reason why not to use these atomic nops and kill our
instruction atomicity problems altogether ?
(various cpuinfo can be found below)
Mathieu
/* test-nop-speed.c
*
*/
#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/sched.h>
#include <linux/timex.h>
#include <linux/marker.h>
#include <asm/ptrace.h>
#define NR_TESTS 10000000
int var, var2;
struct proc_dir_entry *pentry = NULL;
void empty(void)
{
asm volatile ("");
var += 50;
var /= 10;
var *= var2;
}
void twobytesjump(void)
{
asm volatile ("jmp 1f\n\t"
".byte 0x00, 0x00, 0x00\n\t"
"1:\n\t");
var += 50;
var /= 10;
var *= var2;
}
void fivebytesjump(void)
{
asm volatile (".byte 0xe9, 0x00, 0x00, 0x00, 0x00\n\t");
var += 50;
var /= 10;
var *= var2;
}
void threetwonops(void)
{
asm volatile (".byte 0x66,0x66,0x90,0x66,0x90\n\t");
var += 50;
var /= 10;
var *= var2;
}
void fivebytesnop(void)
{
asm volatile (".byte 0x66,0x66,0x66,0x66,0x90\n\t");
var += 50;
var /= 10;
var *= var2;
}
void fivebytespsixnop(void)
{
asm volatile (".byte 0x0f,0x1f,0x44,0x00,0\n\t");
var += 50;
var /= 10;
var *= var2;
}
/*
* GENERIC_NOP1 GENERIC_NOP4,
* 1: nop
* _not_ nops in 64-bit mode.
* 4: leal 0x00(,%esi,1),%esi
*/
void genericfivebytesonefournops(void)
{
asm volatile (".byte 0x90,0x8d,0x74,0x26,0x00\n\t");
var += 50;
var /= 10;
var *= var2;
}
/*
* K7_NOP4 ASM_NOP1
* 1: nop
* assumed _not_ to be nops in 64-bit mode.
* leal 0x00(,%eax,1),%eax
*/
void k7fivebytesonefournops(void)
{
asm volatile (".byte 0x90,0x8d,0x44,0x20,0x00\n\t");
var += 50;
var /= 10;
var *= var2;
}
void perform_test(const char *name, void (*callback)(void))
{
unsigned int i;
cycles_t cycles1, cycles2;
unsigned long flags;
local_irq_save(flags);
rdtsc_barrier();
cycles1 = get_cycles();
rdtsc_barrier();
for(i=0; i<NR_TESTS; i++) {
callback();
}
rdtsc_barrier();
cycles2 = get_cycles();
rdtsc_barrier();
local_irq_restore(flags);
printk("test %s cycles : %llu\n", name, cycles2-cycles1);
}
static int my_open(struct inode *inode, struct file *file)
{
printk("NR_TESTS %d\n", NR_TESTS);
perform_test("empty", empty);
perform_test("2-bytes jump", twobytesjump);
perform_test("5-bytes jump", fivebytesjump);
perform_test("3/2 nops", threetwonops);
perform_test("5-bytes nop with long prefix", fivebytesnop);
perform_test("5-bytes P6 nop", fivebytespsixnop);
#ifdef CONFIG_X86_32
perform_test("Generic 1/4 5-bytes nops", genericfivebytesonefournops);
perform_test("K7 1/4 5-bytes nops", k7fivebytesonefournops);
#endif
return -EPERM;
}
static struct file_operations my_operations = {
.open = my_open,
};
int init_module(void)
{
pentry = create_proc_entry("testnops", 0444, NULL);
if (pentry)
pentry->proc_fops = &my_operations;
return 0;
}
void cleanup_module(void)
{
remove_proc_entry("testnops", NULL);
}
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Mathieu Desnoyers");
MODULE_DESCRIPTION("NOP Test");
Xeon cpuinfo :
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
stepping : 6
cpu MHz : 2000.126
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx tm2 ssse3 cx16 xtpr dca sse4_1 lahf_lm
bogomips : 4000.25
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power management:
AMD64 cpuinfo :
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 35
model name : AMD Athlon(tm)64 X2 Dual Core Processor 3800+
stepping : 2
cpu MHz : 2009.139
cache size : 512 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow rep_good pni lahf_lm cmp_legacy
bogomips : 4022.42
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
Pentium 4 :
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Pentium(R) 4 CPU 3.00GHz
stepping : 1
cpu MHz : 3000.138
cache size : 1024 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx constant_tsc up pebs bts pni monitor ds_cpl cid xtpr
bogomips : 6005.70
clflush size : 64
power management:
> Here's 10 runs of "hackbench 50" using the two part 5 byte nop:
>
> run 1
> Time: 4.501
> run 2
> Time: 4.855
> run 3
> Time: 4.198
> run 4
> Time: 4.587
> run 5
> Time: 5.016
> run 6
> Time: 4.757
> run 7
> Time: 4.477
> run 8
> Time: 4.693
> run 9
> Time: 4.710
> run 10
> Time: 4.715
> avg = 4.6509
>
>
> And 10 runs using the above 5 byte nop:
>
> run 1
> Time: 4.832
> run 2
> Time: 5.319
> run 3
> Time: 5.213
> run 4
> Time: 4.830
> run 5
> Time: 4.363
> run 6
> Time: 4.391
> run 7
> Time: 4.772
> run 8
> Time: 4.992
> run 9
> Time: 4.727
> run 10
> Time: 4.825
> avg = 4.8264
>
> # cat /proc/cpuinfo
> processor : 0
> vendor_id : AuthenticAMD
> cpu family : 15
> model : 65
> model name : Dual-Core AMD Opteron(tm) Processor 2220
> stepping : 3
> cpu MHz : 2799.992
> cache size : 1024 KB
> physical id : 0
> siblings : 2
> core id : 0
> cpu cores : 2
> apicid : 0
> initial apicid : 0
> fdiv_bug : no
> hlt_bug : no
> f00f_bug : no
> coma_bug : no
> fpu : yes
> fpu_exception : yes
> cpuid level : 1
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
> rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic
> cr8_legacy
> bogomips : 5599.98
> clflush size : 64
> power management: ts fid vid ttp tm stc
>
> There's 4 of these.
>
> Just to make sure, I ran the above nop test again:
>
> [ this is reverse from the above runs ]
>
> run 1
> Time: 4.723
> run 2
> Time: 5.080
> run 3
> Time: 4.521
> run 4
> Time: 4.841
> run 5
> Time: 4.696
> run 6
> Time: 4.946
> run 7
> Time: 4.754
> run 8
> Time: 4.717
> run 9
> Time: 4.905
> run 10
> Time: 4.814
> avg = 4.7997
>
> And again the two part nop:
>
> run 1
> Time: 4.434
> run 2
> Time: 4.496
> run 3
> Time: 4.801
> run 4
> Time: 4.714
> run 5
> Time: 4.631
> run 6
> Time: 5.178
> run 7
> Time: 4.728
> run 8
> Time: 4.920
> run 9
> Time: 4.898
> run 10
> Time: 4.770
> avg = 4.757
>
>
> This time it was close, but still seems to have some difference.
>
> heh, perhaps it's just noise.
>
> -- Steve
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
next prev parent reply other threads:[~2008-08-13 17:52 UTC|newest]
Thread overview: 106+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-08-07 18:20 [PATCH 0/5] ftrace: to kill a daemon Steven Rostedt
2008-08-07 18:20 ` [PATCH 1/5] ftrace: create __mcount_loc section Steven Rostedt
2008-08-07 18:20 ` [PATCH 2/5] ftrace: mcount call site on boot nops core Steven Rostedt
2008-08-07 18:20 ` [PATCH 3/5] ftrace: enable mcount recording for modules Steven Rostedt
2008-08-08 6:43 ` Rusty Russell
2008-08-08 12:51 ` Steven Rostedt
2008-08-07 18:20 ` [PATCH 4/5] ftrace: rebuild everything on change to FTRACE_MCOUNT_RECORD Steven Rostedt
2008-08-07 18:20 ` [PATCH 5/5] ftrace: enable using mcount recording on x86 Steven Rostedt
2008-08-07 18:47 ` [PATCH 0/5] ftrace: to kill a daemon Mathieu Desnoyers
2008-08-07 20:42 ` Steven Rostedt
2008-08-08 17:22 ` Mathieu Desnoyers
2008-08-08 17:36 ` Steven Rostedt
2008-08-08 17:46 ` Mathieu Desnoyers
2008-08-08 18:13 ` Steven Rostedt
2008-08-08 18:15 ` Peter Zijlstra
2008-08-08 18:21 ` Mathieu Desnoyers
2008-08-08 18:41 ` Steven Rostedt
2008-08-08 19:04 ` Linus Torvalds
2008-08-08 19:05 ` Mathieu Desnoyers
2008-08-08 23:38 ` Steven Rostedt
2008-08-09 0:23 ` Andi Kleen
2008-08-09 0:36 ` Steven Rostedt
2008-08-09 0:47 ` Jeremy Fitzhardinge
2008-08-09 0:51 ` Linus Torvalds
2008-08-09 1:25 ` Steven Rostedt
2008-08-13 6:31 ` Mathieu Desnoyers
2008-08-13 15:38 ` Mathieu Desnoyers
2008-08-13 17:52 ` Mathieu Desnoyers [this message]
2008-08-13 18:27 ` Efficient x86 and x86_64 NOP microbenchmarks Linus Torvalds
2008-08-13 18:41 ` Andi Kleen
2008-08-13 18:45 ` Avi Kivity
2008-08-13 18:51 ` Andi Kleen
2008-08-13 18:56 ` Avi Kivity
2008-08-13 19:30 ` Mathieu Desnoyers
2008-08-13 19:37 ` Andi Kleen
2008-08-13 20:01 ` Mathieu Desnoyers
2008-08-13 23:41 ` [RFC PATCH] x86 alternatives : fix LOCK_PREFIX race with preemptible kernel and CPU hotplug Mathieu Desnoyers
2008-08-14 0:01 ` H. Peter Anvin
2008-08-14 1:13 ` Mathieu Desnoyers
2008-08-14 1:22 ` Jeremy Fitzhardinge
2008-08-14 1:26 ` Roland McGrath
2008-08-14 1:49 ` Mathieu Desnoyers
2008-08-14 3:35 ` Jeremy Fitzhardinge
2008-08-14 15:18 ` Mathieu Desnoyers
2008-08-14 16:10 ` Linus Torvalds
2008-08-14 16:13 ` H. Peter Anvin
2008-08-14 16:58 ` Mathieu Desnoyers
2008-08-14 17:05 ` Jeremy Fitzhardinge
2008-08-14 17:30 ` Mathieu Desnoyers
2008-08-14 17:43 ` Jeremy Fitzhardinge
2008-08-14 18:37 ` H. Peter Anvin
2008-08-14 18:53 ` Mathieu Desnoyers
2008-08-14 19:29 ` Jeremy Fitzhardinge
2008-08-14 20:31 ` Mathieu Desnoyers
2008-08-14 20:39 ` H. Peter Anvin
2008-08-14 21:46 ` Jeremy Fitzhardinge
2008-08-14 22:26 ` H. Peter Anvin
2008-08-14 17:17 ` H. Peter Anvin
2008-08-14 18:09 ` Mathieu Desnoyers
2008-08-14 19:49 ` Mathieu Desnoyers
2008-08-14 17:04 ` Jeremy Fitzhardinge
2008-08-14 17:18 ` H. Peter Anvin
2008-08-14 17:28 ` Jeremy Fitzhardinge
2008-08-14 17:31 ` H. Peter Anvin
2008-08-14 17:46 ` Mathieu Desnoyers
2008-08-14 17:49 ` Jeremy Fitzhardinge
2008-08-14 17:55 ` Mathieu Desnoyers
2008-08-14 18:59 ` Gregory Haskins
2008-08-15 21:34 ` Efficient x86 and x86_64 NOP microbenchmarks Steven Rostedt
2008-08-15 21:51 ` Andi Kleen
2008-08-13 19:16 ` Mathieu Desnoyers
2008-08-09 0:51 ` [PATCH 0/5] ftrace: to kill a daemon Steven Rostedt
2008-08-09 0:53 ` Roland McGrath
2008-08-09 1:13 ` Andi Kleen
2008-08-09 1:19 ` Andi Kleen
2008-08-09 1:30 ` Steven Rostedt
2008-08-09 1:55 ` Andi Kleen
2008-08-09 2:03 ` Steven Rostedt
2008-08-09 2:23 ` Andi Kleen
2008-08-09 4:12 ` Steven Rostedt
2008-08-09 0:30 ` Steven Rostedt
2008-08-11 18:21 ` Mathieu Desnoyers
2008-08-11 19:28 ` Steven Rostedt
2008-08-08 19:08 ` Jeremy Fitzhardinge
2008-08-11 2:41 ` Rusty Russell
2008-08-11 12:33 ` Steven Rostedt
2008-08-07 21:11 ` Jeremy Fitzhardinge
2008-08-07 21:29 ` Steven Rostedt
2008-08-07 22:26 ` Roland McGrath
2008-08-08 1:21 ` Steven Rostedt
2008-08-08 1:24 ` Steven Rostedt
2008-08-08 1:56 ` Steven Rostedt
2008-08-08 7:22 ` Peter Zijlstra
2008-08-08 11:31 ` Steven Rostedt
2008-08-08 4:54 ` Sam Ravnborg
2008-08-09 9:48 ` Abhishek Sagar
2008-08-09 13:01 ` Steven Rostedt
2008-08-09 15:01 ` Abhishek Sagar
2008-08-09 15:37 ` Steven Rostedt
2008-08-09 17:14 ` Abhishek Sagar
[not found] <20080813191926.GB15547@Krystal>
2008-08-13 20:00 ` Efficient x86 and x86_64 NOP microbenchmarks Steven Rostedt
2008-08-13 20:06 ` Jeremy Fitzhardinge
2008-08-13 20:34 ` Steven Rostedt
2008-08-13 20:15 ` Andi Kleen
2008-08-13 20:21 ` Linus Torvalds
2008-08-13 20:21 ` Steven Rostedt
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20080813175213.GA8679@Krystal \
--to=mathieu.desnoyers@polymtl.ca \
--cc=acme@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=andi@firstfloor.org \
--cc=davem@davemloft.net \
--cc=drepper@redhat.com \
--cc=ghaskins@novell.com \
--cc=jeremy@goop.org \
--cc=lclaudio@uudg.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@elte.hu \
--cc=peterz@infradead.org \
--cc=roland@redhat.com \
--cc=rostedt@goodmis.org \
--cc=rusty@rustcorp.com.au \
--cc=tglx@linutronix.de \
--cc=torvalds@linux-foundation.org \
--cc=williams@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox