* x86 - cpu_relax - why nop vs. pause?
@ 2010-02-07 17:28 Michael Breuer
2010-02-07 18:09 ` Joerg Roedel
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Michael Breuer @ 2010-02-07 17:28 UTC
To: Linux Kernel Mailing List
I did search and noticed some old discussions. Looking at both Intel and
AMD documentation, it would seem that PAUSE is the preferred instruction
within a spin lock. Further, both Intel and AMD specifications state
that the instruction is backward compatible with older x86 processors.
For fun, I changed nop to pause on my Core i7 920 (SMT enabled) and I'm
seeing about a 5-10% performance improvement on 2.6.33-rc7. Perf top
shows time spent in spin_lock under load drops from an average of around
35% to about 25%.
Thoughts?

* Re: x86 - cpu_relax - why nop vs. pause?
From: Joerg Roedel @ 2010-02-07 18:09 UTC
To: Michael Breuer; +Cc: Linux Kernel Mailing List

On Sun, Feb 07, 2010 at 12:28:51PM -0500, Michael Breuer wrote:
> I did search and noticed some old discussions. Looking at both Intel and
> AMD documentation, it would seem that PAUSE is the preferred instruction
> within a spin lock. Further, both Intel and AMD specifications state
> that the instruction is backward compatible with older x86 processors.

It's not the primary reason, but the hardware virtualization extensions
of x86 processors support an intercept after a configured number of
pause instructions have been executed. This is used to detect spinning
vcpus where the lock-holder is scheduled out.

> For fun, I changed nop to pause on my core i7 920 (smt enabled) and I'm
> seeing about a 5-10% performance improvement on 2.6.33 rc7. Perf top
> shows time spent in spin_lock under load drops from an average of around
> 35% to about 25%.

What benchmarks have you used for your measurements?

	Joerg
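For readers following along, the kind of busy-wait loop under discussion can be sketched in user space roughly as follows. This is only an illustration, not the kernel's code: cpu_relax() here is a hypothetical stand-in for the kernel helper of the same name, and spin_until_set() and max_spins are names invented for the example. Each pass through the loop executes one PAUSE, which is the instruction the virtualization intercept mentioned above counts.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the kernel's cpu_relax(). */
static inline void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
	/* "rep; nop" assembles to f3 90, which is the PAUSE encoding. */
	__asm__ __volatile__("rep; nop" ::: "memory");
#else
	/* Assumption: on other architectures, use a plain compiler barrier. */
	__asm__ __volatile__("" ::: "memory");
#endif
}

/*
 * Spin until *flag is observed true, or give up after max_spins polls.
 * Returns true if the flag was seen set, false if the spin budget ran out.
 */
static bool spin_until_set(atomic_bool *flag, unsigned long max_spins)
{
	assert(flag);

	for (unsigned long i = 0; i < max_spins; i++) {
		if (atomic_load_explicit(flag, memory_order_acquire))
			return true;
		cpu_relax();	/* hint to the CPU that this is a spin-wait */
	}
	return false;
}
```

The spin budget is there only so the sketch terminates when nobody sets the flag; a real spinlock would keep spinning.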

* Re: x86 - cpu_relax - why nop vs. pause?
From: Arjan van de Ven @ 2010-02-07 18:32 UTC
To: Michael Breuer; +Cc: Linux Kernel Mailing List

On Sun, 07 Feb 2010 12:28:51 -0500
Michael Breuer <mbreuer@majjas.com> wrote:

> I did search and noticed some old discussions. Looking at both Intel
> and AMD documentation, it would seem that PAUSE is the preferred
> instruction within a spin lock. Further, both Intel and AMD
> specifications state that the instruction is backward compatible with
> older x86 processors.
>

That's odd... rep nop and pause ought to be the same...

--
Arjan van de Ven	Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
[parent not found: <1265566470.6280.10.camel@marge.simson.net>]

* Re: x86 - cpu_relax - why nop vs. pause?
From: Michael Breuer @ 2010-02-07 20:08 UTC
To: Linux Kernel Mailing List; +Cc: Mike Galbraith

On 2/7/2010 1:14 PM, Mike Galbraith wrote:
> On Sun, 2010-02-07 at 12:28 -0500, Michael Breuer wrote:
>
>> I did search and noticed some old discussions. Looking at both Intel and
>> AMD documentation, it would seem that PAUSE is the preferred instruction
>> within a spin lock. Further, both Intel and AMD specifications state
>> that the instruction is backward compatible with older x86 processors.
>>
>> For fun, I changed nop to pause on my core i7 920 (smt enabled) and I'm
>> seeing about a 5-10% performance improvement on 2.6.33 rc7. Perf top
>> shows time spent in spin_lock under load drops from an average of around
>> 35% to about 25%.
>>
>> Thoughts?
>>
> /* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */
>
> 00000000004004fc <rep_nop>:
>   4004fc: 55        push %rbp
>   4004fd: 48 89 e5  mov %rsp,%rbp
>   400500: f3 90     pause
>   400502: c9        leaveq
>   400503: c3        retq
>
> 0000000000400504 <pause>:
>   400504: 55        push %rbp
>   400505: 48 89 e5  mov %rsp,%rbp
>   400508: f3 90     pause
>   40050a: c9        leaveq
>   40050b: c3        retq
>
> foo.c
>
> static inline void rep_nop(void)
> {
> 	asm volatile("rep; nop" ::: "memory");
> }
>
> static inline void pause(void)
> {
> 	asm volatile("pause" ::: "memory");
> }
>
> void main(void)
> {
> 	rep_nop();
> 	pause();
> }
>

Interesting, and this got me thinking... and testing... I think there's an
optimization issue with gcc:

First of all - a bit of background on how I got here:

After reading the Intel documentation, I tried replacing rep;nop with
pause (in theory exactly what's shown above). The system hung on booting.
I then tried replacing nop with pause (rep;pause) and the system
booted. Using the above example, the opcode becomes f3 f3 90 vs f3 90
(rep nop).

Given the above compiler test case, this seemed odd, to say the least.
So I played a bit more with gcc. Seems that the optimizer (-O3) is
handling the *three* cases differently (objdump output).

Base code for all three cases (the only change is the asm volatile line,
as shown for each case):

static inline void pause(void)
{
	asm volatile("pause" ::: "memory");
}

void main(void)
{
	pause();
}

Case1 - asm volatile("pause" ::: "memory");
0000000000400480 <main>:
  400480: f3 90     pause
  400482: c3        retq
  400483: 90        nop

Case2 - asm volatile("rep;nop" ::: "memory") Note: this didn't inline!

0000000000400474 <pause>:
  400474: 55              push %rbp
  400475: 48 89 e5        mov %rsp,%rbp
  400478: f3 90           pause
  40047a: c9              leaveq
  40047b: c3              retq

000000000040047c <main>:
  40047c: 55              push %rbp
  40047d: 48 89 e5        mov %rsp,%rbp
  400480: e8 ef ff ff ff  callq 400474 <pause>
  400485: c9              leaveq
  400486: c3              retq
  400487: 90              nop
  400488: 90              nop
  400489: 90              nop
  40048a: 90              nop
  40048b: 90              nop
  40048c: 90              nop
  40048d: 90              nop
  40048e: 90              nop
  40048f: 90              nop

Case3 - asm volatile("rep;pause" ::: "memory")
0000000000400480 <main>:
  400480: f3 f3 90  pause
  400483: c3        retq
  400484: 90        nop
_______
Note the difference between the opcodes in case 1 and case 3, and the mess
made by the compiler in case 2.

As to benchmarks - I've checked a few things, no formal or lasting
stuff... but striking at first glance:

1) At idle, perf top shows time spent in _raw_spin_lock dropping from
   ~35% to ~25%.
2) Running a media transcode (single core - handbrakecli): frame rate
   increased by about 5-10%.
3) During file-intensive operations (#2, above, or copying large files
   - ext4 on software raid6) - latencytop shows a decrease on writing a
   page to disc from about 120ms to about 90ms.

* Re: x86 - cpu_relax - why nop vs. pause?
From: Michael Breuer @ 2010-02-07 21:15 UTC
To: Linux Kernel Mailing List; +Cc: Mike Galbraith

On 02/07/2010 03:08 PM, Michael Breuer wrote:
> ...
> Case2 - asm volatile("rep;nop" ::: "memory") Note: this didn't inline!
> ...
> Note the difference between opcodes case 1 and case 3, and the mess
> made by the compiler in case 2.
> ...

Disregard case 2 - I was missing -O3 when I built it. With -O3 or -O2,
rep;nop and pause are identical. The interesting case is rep;pause, which
is different (f3 f3 90) and seems more efficient.
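One way to check the encodings without going through objdump is to have the assembler place each variant between a pair of labels and then inspect the bytes from C. The following is a minimal sketch under a few assumptions: GCC or Clang, an x86 target, and an ELF platform such as Linux (for .pushsection). The enc_* labels and helper names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if !defined(__x86_64__) && !defined(__i386__)
#error "This sketch only makes sense on x86."
#endif

/* Assemble each variant into .rodata, bracketed by start/end labels. */
__asm__(
	".pushsection .rodata\n"
	"enc_pause_start:     pause\n"
	"enc_pause_end:\n"
	"enc_rep_nop_start:   rep; nop\n"
	"enc_rep_nop_end:\n"
	"enc_rep_pause_start: rep; pause\n"
	"enc_rep_pause_end:\n"
	".popsection\n");

extern const unsigned char enc_pause_start[], enc_pause_end[];
extern const unsigned char enc_rep_nop_start[], enc_rep_nop_end[];
extern const unsigned char enc_rep_pause_start[], enc_rep_pause_end[];

/* Number of bytes the assembler emitted between two labels. */
static size_t enc_len(const unsigned char *start, const unsigned char *end)
{
	return (size_t)((uintptr_t)end - (uintptr_t)start);
}

/* Returns 1 if two encodings have the same length and the same bytes. */
static int enc_equal(const unsigned char *a, const unsigned char *a_end,
		     const unsigned char *b, const unsigned char *b_end)
{
	size_t len = enc_len(a, a_end);

	return len == enc_len(b, b_end) && memcmp(a, b, len) == 0;
}
```

With this in place, enc_equal() on the pause and rep;nop pairs should report a match (both f3 90), while the rep;pause pair is one byte longer (f3 f3 90), in line with the objdump output quoted in this thread.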

* Re: x86 - cpu_relax - why nop vs. pause?
From: Michael Breuer @ 2010-02-08 3:50 UTC
To: Linux Kernel Mailing List; +Cc: Mike Galbraith, Arjan van de Ven, Joerg Roedel

On 2/7/2010 4:15 PM, Michael Breuer wrote:
> On 02/07/2010 03:08 PM, Michael Breuer wrote:
>> On 2/7/2010 1:14 PM, Mike Galbraith wrote:
>> ...
>> Case1 - asm volatile("pause" ::: "memory");
>> 0000000000400480 <main>:
>>   400480: f3 90     pause
>>   400482: c3        retq
>>   400483: 90        nop
>>
>> ...
>>
>> Case3 - asm volatile("rep;pause" ::: "memory")
>> 0000000000400480 <main>:
>>   400480: f3 f3 90  pause
>>   400483: c3        retq
>>   400484: 90        nop
>> _______
>> Note the difference between opcodes case 1 and case 3, and the mess
>> made by the compiler in case 2.
>>
>> As to benchmarks - I've checked a few things, no formal or lasting
>> stuff... but striking at first glance:
>>
>> 1) At idle, perf top shows time spent in _raw_spin_lock dropping from
>> ~35% to ~25%.
>> 2) Running a media transcode (single core - handbrakecli): frame rate
>> increased by about 5-10%.
>> 3) During file-intensive operations (#2, above, or copying large
>> files - ext4 on software raid6) - latencytop shows a decrease on
>> writing a page to disc from about 120ms to about 90ms.
>>
> Disregard case 2 - was missing -O3. With -O3 or -O2 rep;nop and pause
> are identical. The interesting case is rep;pause which is different
> and seems more efficient.

Just to move away from this... totally perplexed, I retested a bit.
Seems something else had gone wrong causing me to try 'rep;pause' vs.
'pause'. The resulting opcode is f3 f3 90, as noted above.

I do see what seems to be a small but noticeable performance improvement
- no idea if there's a downside, and also no idea what f3 f3 90 does vs.
f3 90. Might be something interesting, or maybe not.

Test scenario:

Boot clean to single user mode. Run tiotest -8 five times. %cpu is
%usr + %sys as reported by tiotest.

Results:

Test           Variant    sec    MB/sec     %cpu
Writes         pause      1.14   72.01      322.44
               rep;pause  1.12   70.4       311.58
Random Writes  pause      3.7    8.51       66.48
               rep;pause  3.46   9.04       72.34
Reads          pause      -      11557.48   6040.74
               rep;pause  -      11620.15   5974.90
Random Reads   pause      -      11416.9    5330.50
               rep;pause  -      11786.99   5118.66
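To put the throughput figures above on a common scale, the relative change from pause to rep;pause can be computed per test. The sketch below only restates the MB/sec numbers reported in this message; pct_change(), struct tiotest_row and print_throughput_changes() are names made up for the illustration, and nothing here says anything about run-to-run noise, which the thread does not report.

```c
#include <assert.h>
#include <stdio.h>

/* Relative change from `before` to `after`, in percent. */
static double pct_change(double before, double after)
{
	return (after - before) / before * 100.0;
}

/* One row of the reported results (throughput in MB/sec). */
struct tiotest_row {
	const char *test;
	double pause_mbs;
	double rep_pause_mbs;
};

/* Figures copied from the results in this message. */
static const struct tiotest_row rows[] = {
	{ "Writes",        72.01,    70.4     },
	{ "Random Writes", 8.51,     9.04     },
	{ "Reads",         11557.48, 11620.15 },
	{ "Random Reads",  11416.9,  11786.99 },
};

/* Print the pause -> rep;pause throughput change for each test. */
static void print_throughput_changes(void)
{
	for (size_t i = 0; i < sizeof(rows) / sizeof(rows[0]); i++)
		printf("%-14s %+6.2f%%\n", rows[i].test,
		       pct_change(rows[i].pause_mbs, rows[i].rep_pause_mbs));
}
```

By this measure the sequential write throughput is roughly 2% lower with rep;pause, random writes about 6% higher, and the two read tests about 0.5% and 3% higher, so the picture is mixed rather than uniformly better.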

* Re: x86 - cpu_relax - why nop vs. pause?
From: Artur Skawina @ 2010-02-08 13:33 UTC
To: Michael Breuer
Cc: Linux Kernel Mailing List, Mike Galbraith, Arjan van de Ven, Joerg Roedel

Michael Breuer wrote:
> Just to move away from this... totally perplexed, I retested a bit.
> Seems something else had gone wrong causing me to try 'rep;pause' vs.
> 'pause'. The resulting opcode is f3 f3 90, as noted above.
>
> I do see what seems to be a small but noticeable performance improvement
> - no idea if there's a downside, and also no idea what f3 f3 90 does vs.
> f3 90. Might be something interesting, or maybe not.

Alignment? In other words, what happens if you use e.g. "nop; rep; nop;" or
"rep; nop; nop;"?
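The variants suggested here could be set up for a side-by-side comparison along the following lines. This is a sketch only: the relax_* helpers, the RELAX_ASM macro and spin_n() are hypothetical names, and the byte sequences in the comments are what the assembler would be expected to emit for each string. Both suggested variants are three bytes long, the same as f3 f3 90, which is presumably why they would help separate an alignment effect from anything the extra prefix does.

```c
#include <assert.h>

#if defined(__x86_64__) || defined(__i386__)
#define RELAX_ASM(insns) __asm__ __volatile__(insns ::: "memory")
#else
/* Assumption: off x86, every variant degrades to a compiler barrier. */
#define RELAX_ASM(insns) __asm__ __volatile__("" ::: "memory")
#endif

/* Baseline: f3 90 (2 bytes). */
static inline void relax_rep_nop(void)     { RELAX_ASM("rep; nop"); }
/* The variant tested in this thread: f3 f3 90 (3 bytes). */
static inline void relax_rep_pause(void)   { RELAX_ASM("rep; pause"); }
/* Suggested variants, also 3 bytes each: 90 f3 90 and f3 90 90. */
static inline void relax_nop_rep_nop(void) { RELAX_ASM("nop; rep; nop"); }
static inline void relax_rep_nop_nop(void) { RELAX_ASM("rep; nop; nop"); }

/* Call one relax variant n times; returns the number of calls made. */
static unsigned long spin_n(void (*relax)(void), unsigned long n)
{
	unsigned long done = 0;

	while (done < n) {
		relax();
		done++;
	}
	return done;
}
```

Timing spin_n() for each variant would not reproduce the kernel measurements, since the effect reported in the thread was observed inside spinlock loops, but it gives a starting point for comparing the encodings in isolation.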