From: Peter Zijlstra <peterz@infradead.org>
To: Andi Kleen <ak@linux.intel.com>
Cc: Andi Kleen <andi@firstfloor.org>,
x86@kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/3] x86: Move msr accesses out of line
Date: Wed, 25 Feb 2015 13:27:01 +0100
Message-ID: <20150225122701.GK5029@twins.programming.kicks-ass.net>
In-Reply-To: <20150223174340.GD27767@tassilo.jf.intel.com>

On Mon, Feb 23, 2015 at 09:43:40AM -0800, Andi Kleen wrote:
> On Mon, Feb 23, 2015 at 06:04:36PM +0100, Peter Zijlstra wrote:
> > On Fri, Feb 20, 2015 at 05:38:55PM -0800, Andi Kleen wrote:
> >
> > > This patch moves the MSR functions out of line. A MSR access is typically
> > > 40-100 cycles or even slower, a call is a few cycles at best, so the
> > > additional function call is not really significant.
> >
> > If I look at the below PDF a CALL+PUSH EBP+MOV RSP,RBP+ ... +POP+RET
> > ends up being 5+1.5+0.5+ .. + 1.5+8 = 16.5 + .. cycles.
>
> You cannot just add up the latency cycles. The CPU runs all of this
> in parallel.
>
> Latency cycles would only be interesting if these instructions were
> on the critical path for computing the result, which they are not.
>
> It should be a few cycles overhead.

I thought that since CALL touches RSP, PUSH touches RSP, MOV RSP,
(obviously) touches RSP, POP touches RSP and well, RET does too, there
were strong dependencies between the instructions and there would be
little room to parallelize things.

I'm glad you so patiently educated me on the wonders of modern
architectures and how they can indeed do all this in parallel.

Still, I wondered, so I ran me a little test. Note that I used a
serializing instruction (LOCK XCHG) because WRMSR is too.

I see a ~14 cycle difference per call between the inline and noinline
version.

If I substitute the LOCK XCHG with XADD, I get to 1.5 cycles in
difference, so clearly there is some magic happening, but serializing
instructions wreck it.

Can anybody explain how such RSP deps get magicked away?
---
root@ivb-ep:~# cat call.c
#define __always_inline inline __attribute__((always_inline))
#define noinline __attribute__((noinline))

static int
#ifdef FOO
noinline
#else
__always_inline
#endif
xchg(int *ptr, int val)
{
	asm volatile ("LOCK xchgl %0, %1\n"
		      : "+r" (val), "+m" (*(ptr))
		      : : "memory", "cc");
	return val;
}

void main(void)
{
	int val = 0, old;

	for (int i = 0; i < 1000000000; i++)
		old = xchg(&val, i);
}
root@ivb-ep:~# gcc -std=gnu99 -O3 -fno-omit-frame-pointer -DFOO -o call call.c
root@ivb-ep:~# objdump -D call | awk '/<[^>]*>:/ {p=0} /<main>:/ {p=1} /<xchg>:/ {p=1} { if (p) print $0 }'
00000000004003e0 <main>:
4003e0: 55 push %rbp
4003e1: 48 89 e5 mov %rsp,%rbp
4003e4: 53 push %rbx
4003e5: 31 db xor %ebx,%ebx
4003e7: 48 83 ec 18 sub $0x18,%rsp
4003eb: c7 45 e0 00 00 00 00 movl $0x0,-0x20(%rbp)
4003f2: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
4003f8: 48 8d 7d e0 lea -0x20(%rbp),%rdi
4003fc: 89 de mov %ebx,%esi
4003fe: 83 c3 01 add $0x1,%ebx
400401: e8 fa 00 00 00 callq 400500 <xchg>
400406: 81 fb 00 ca 9a 3b cmp $0x3b9aca00,%ebx
40040c: 75 ea jne 4003f8 <main+0x18>
40040e: 48 83 c4 18 add $0x18,%rsp
400412: 5b pop %rbx
400413: 5d pop %rbp
400414: c3 retq
0000000000400500 <xchg>:
400500: 55 push %rbp
400501: 89 f0 mov %esi,%eax
400503: 48 89 e5 mov %rsp,%rbp
400506: f0 87 07 lock xchg %eax,(%rdi)
400509: 5d pop %rbp
40050a: c3 retq
40050b: 90 nop
40050c: 90 nop
40050d: 90 nop
40050e: 90 nop
40050f: 90 nop
root@ivb-ep:~# gcc -std=gnu99 -O3 -fno-omit-frame-pointer -o call-inline call.c
root@ivb-ep:~# objdump -D call-inline | awk '/<[^>]*>:/ {p=0} /<main>:/ {p=1} /<xchg>:/ {p=1} { if (p) print $0 }'
00000000004003e0 <main>:
4003e0: 55 push %rbp
4003e1: 31 c0 xor %eax,%eax
4003e3: 48 89 e5 mov %rsp,%rbp
4003e6: c7 45 f0 00 00 00 00 movl $0x0,-0x10(%rbp)
4003ed: 0f 1f 00 nopl (%rax)
4003f0: 89 c2 mov %eax,%edx
4003f2: f0 87 55 f0 lock xchg %edx,-0x10(%rbp)
4003f6: 83 c0 01 add $0x1,%eax
4003f9: 3d 00 ca 9a 3b cmp $0x3b9aca00,%eax
4003fe: 75 f0 jne 4003f0 <main+0x10>
400400: 5d pop %rbp
400401: c3 retq
root@ivb-ep:~# perf stat -e "cycles:u" ./call
Performance counter stats for './call':
36,309,274,162 cycles:u
10.561819310 seconds time elapsed
root@ivb-ep:~# perf stat -e "cycles:u" ./call-inline
Performance counter stats for './call-inline':
22,004,045,745 cycles:u
6.498271508 seconds time elapsed
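
The per-call figure follows from the two cycle counts above and the
loop trip count in call.c. A small sketch of that arithmetic; the macro
and function names are made up for illustration, only the numbers come
from the perf output:

```c
#include <stdint.h>

/* Cycle counts copied from the two perf stat runs above. */
#define CYCLES_NOINLINE	36309274162ULL
#define CYCLES_INLINE	22004045745ULL
/* Loop trip count from call.c. */
#define ITERATIONS	1000000000ULL

/* Extra cycles per loop iteration, i.e. per call to xchg(). */
static double extra_cycles_per_call(uint64_t noinline_cycles,
				    uint64_t inline_cycles,
				    uint64_t iterations)
{
	return (double)(noinline_cycles - inline_cycles) / (double)iterations;
}
```

Calling extra_cycles_per_call(CYCLES_NOINLINE, CYCLES_INLINE,
ITERATIONS) gives roughly 14.3, which is the ~14 cycle difference
quoted above.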