From: Jeremy Fitzhardinge <jeremy@goop.org>
To: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
Linus Torvalds <torvalds@linux-foundation.org>,
Ingo Molnar <mingo@elte.hu>,
the arch/x86 maintainers <x86@kernel.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Nick Piggin <npiggin@kernel.dk>,
Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Subject: Re: [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code
Date: Fri, 15 Jul 2011 10:24:49 -0700
Message-ID: <4E2077E1.4060606@goop.org>
In-Reply-To: <1ade2577-044b-4391-a607-91a4a989c3ff@email.android.com>
On 06/24/2011 08:15 PM, H. Peter Anvin wrote:
>>> Could you give us the delta in *compiled* code size?
>> Sure. Do you mean for the individual lock sequences, or for the
>> overall
>> kernel?
>>
>> J
> Overall is fine.
Here's both ;)
For the NR_CPUS < 256 case, it shrinks the kernel text by a little over
1k (1232 bytes), with a small data reduction (112 bytes) as well:
text data bss dec hex filename
7287009 1915524 2347008 11549541 b03b65 vmlinux-spin-base
7285777 1915412 2347008 11548197 b03625 vmlinux-cleantick
A comparison of spin_lock before:
<do_raw_spin_lock>:
55 push %rbp
48 89 e5 mov %rsp,%rbp
e8 6f fa 3d 00 callq <mcount>
b8 00 01 00 00 mov $0x100,%eax
f0 66 0f c1 07 lock xadd %ax,(%rdi)
1: 38 e0 cmp %ah,%al
74 06 je 2f
f3 90 pause
8a 07 mov (%rdi),%al
eb f6 jmp 1b
2: 5d pop %rbp
c3 retq
and after:
<do_raw_spin_lock>:
55 push %rbp
48 89 e5 mov %rsp,%rbp
e8 1f f6 3d 00 callq <mcount>
b8 00 01 00 00 mov $0x100,%eax
f0 66 0f c1 07 lock xadd %ax,(%rdi)
0f b6 d4 movzbl %ah,%edx
1: 38 d0 cmp %dl,%al
74 06 je 2f
f3 90 pause
8a 07 mov (%rdi),%al
eb f6 jmp 1b
2: 5d pop %rbp
c3 retq
In other words, the generated code is identical except that where the
original has "cmp %ah,%al", the compiler generates
	movzbl %ah,%edx
	cmp    %dl,%al
I posted a gcc bug about it not generating "cmp %ah,%al" in this case, but
the general consensus was that it's not a good choice on current Intel chips,
because it causes a partial-register stall, and that the currently generated
code is better.  (And gcc can generate "cmp %Xh,%Xl" if it wants to - see
below.)
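For reference, the C this compiles from looks roughly like the following - a
minimal sketch of the small-ticket (NR_CPUS < 256) lock path using GCC atomic
builtins rather than the kernel's own accessors; the type and function names
here are hypothetical, not what the patches actually use:

```c
#include <stdint.h>
#include <assert.h>

/* Small-ticket lock: 8-bit head and tail packed into one 16-bit word.
 * On little-endian x86, head is the low byte (%al in the disassembly
 * above) and tail the high byte (%ah). */
typedef union {
	uint16_t head_tail;
	struct {
		uint8_t head;	/* ticket currently being served */
		uint8_t tail;	/* next ticket to hand out */
	} tickets;
} ticketlock_t;

void ticket_lock(ticketlock_t *lock)
{
	ticketlock_t old;

	/* lock xadd: atomically grab a ticket (tail += 1, i.e. add
	 * 0x100) and fetch the old head/tail pair in one go. */
	old.head_tail = __atomic_fetch_add(&lock->head_tail, 1 << 8,
					   __ATOMIC_ACQUIRE);

	/* Spin until the head catches up to our ticket. */
	while (old.tickets.tail !=
	       __atomic_load_n(&lock->tickets.head, __ATOMIC_ACQUIRE))
		;	/* cpu_relax() / pause would go here */
}
```

The union layout assumes little-endian byte order, matching the x86 code
above; the mov/lock xadd/cmp/pause loop in the disassembly falls straight out
of this shape.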
Likewise trylock before:
<do_raw_spin_trylock>:
55 push %rbp
48 89 e5 mov %rsp,%rbp
e8 50 fa 3d 00 callq <mcount>
0f b7 07 movzwl (%rdi),%eax
38 e0 cmp %ah,%al
8d 90 00 01 00 00 lea 0x100(%rax),%edx
75 05 jne 1f
f0 66 0f b1 17 lock cmpxchg %dx,(%rdi)
1: 0f 94 c2 sete %dl
0f b6 c2 movzbl %dl,%eax
5d pop %rbp
c3 retq
and after:
<do_raw_spin_trylock>:
55 push %rbp
48 89 e5 mov %rsp,%rbp
e8 fd f5 3d 00 callq <mcount>
31 c0 xor %eax,%eax
66 8b 17 mov (%rdi),%dx
38 d6 cmp %dl,%dh
75 16 jne 1f
8d 8a 00 01 00 00 lea 0x100(%rdx),%ecx
89 d0 mov %edx,%eax
f0 66 0f b1 0f lock cmpxchg %cx,(%rdi)
66 39 d0 cmp %dx,%ax
0f 94 c0 sete %al
0f b6 c0 movzbl %al,%eax
1: 5d pop %rbp
c3 retq
The differences here are a little more extensive, but the code is
broadly comparable. Some interesting points on the gcc code:
* it pre-loads the failure return value, so it doesn't need to use
  sete unless it's actually doing the cmpxchg - whether this is good
  or not depends on how often trylock fails vs succeeds, but it's
  pretty minor either way
* it generates a 16-bit load, rather than a zero-extending 32-bit load
* it *can* generate a "cmp %dl,%dh" if it wants to
* it ends up re-comparing the "old" and "new" values after cmpxchg
  rather than reusing the flags cmpxchg sets.  This could be fixed by
  having a cmpxchg() variant which returns a flag rather than the old
  value, which would be generally useful since most cmpxchg() callers
  end up doing that comparison anyway.
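The trylock path, sketched in the same hypothetical style (GCC builtins, made-up
names), shows where that redundant compare comes from - note that
__atomic_compare_exchange_n already returns a success flag directly, which is
exactly the shape of cmpxchg() variant suggested in the last point:

```c
#include <stdint.h>
#include <assert.h>

/* Small-ticket trylock; head is the low byte, tail the high byte, as
 * in the disassembly above (little-endian layout assumed). */
typedef union {
	uint16_t head_tail;
	struct { uint8_t head, tail; } tickets;
} ticketlock_t;

int ticket_trylock(ticketlock_t *lock)
{
	ticketlock_t old, next;

	old.head_tail = __atomic_load_n(&lock->head_tail, __ATOMIC_RELAXED);
	if (old.tickets.head != old.tickets.tail)
		return 0;			/* lock is held; don't even try */

	next.head_tail = old.head_tail + (1 << 8);	/* take the next ticket */

	/* The builtin returns a success flag rather than the old value,
	 * so no re-compare is needed after the cmpxchg - unlike the
	 * kernel's cmpxchg(), which returns old and forces the extra
	 * "cmp %dx,%ax" seen above. */
	return __atomic_compare_exchange_n(&lock->head_tail, &old.head_tail,
					   next.head_tail, 0,
					   __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}
```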
spin_unlock is a little trickier to compare because it's inlined, but
I'm guessing that's the source of the code shrinkage.
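The unlock path is small enough to inline everywhere; roughly, it is just this
(again a sketch with GCC builtins and hypothetical names, not the kernel's
actual accessors):

```c
#include <stdint.h>
#include <assert.h>

typedef union {
	uint16_t head_tail;
	struct { uint8_t head, tail; } tickets;  /* little-endian layout */
} ticketlock_t;

void ticket_unlock(ticketlock_t *lock)
{
	/* Only the lock holder writes head, so on x86 this needs no
	 * lock prefix - it compiles down to a plain incb plus a
	 * compiler barrier; portable C spells it as a release-ordered
	 * increment. */
	__atomic_fetch_add(&lock->tickets.head, 1, __ATOMIC_RELEASE);
}
```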
Likewise, for NR_CPUS=512, there's a ~900 byte kernel shrinkage with the
new code:
text data bss dec hex filename
7296571 1945380 2424832 11666783 b2055f vmlinux-spin-base-big
7295610 1945412 2424832 11665854 b201be vmlinux-cleantick-big
The generated code for do_raw_spin_lock is even closer:
Before:
<do_raw_spin_lock>:
55 push %rbp
48 89 e5 mov %rsp,%rbp
e8 cf 0a 3e 00 callq <mcount>
b8 00 00 01 00 mov $0x10000,%eax
f0 0f c1 07 lock xadd %eax,(%rdi)
0f b7 d0 movzwl %ax,%edx
c1 e8 10 shr $0x10,%eax
1: 39 c2 cmp %eax,%edx
74 07 je 2f
f3 90 pause
0f b7 17 movzwl (%rdi),%edx
eb f5 jmp 1b
2: 5d pop %rbp
c3 retq
After:
<do_raw_spin_lock>:
55 push %rbp
48 89 e5 mov %rsp,%rbp
e8 63 07 3e 00 callq <mcount>
b8 00 00 01 00 mov $0x10000,%eax
f0 0f c1 07 lock xadd %eax,(%rdi)
89 c2 mov %eax,%edx
c1 ea 10 shr $0x10,%edx
1: 66 39 d0 cmp %dx,%ax
74 07 je 2f
f3 90 pause
66 8b 07 mov (%rdi),%ax
eb f4 jmp 1b
2: 5d pop %rbp
c3 retq
In other words, identical aside from using 16-bit registers rather than
32-bit registers with zero-extension.
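The reason the two sizes compile to such similar code is that the series makes
them share one C implementation, parameterized only by the ticket type.  A
rough sketch of that shape (GCC builtins; CONFIG_NR_CPUS default and all names
are hypothetical here):

```c
#include <stdint.h>
#include <assert.h>

#ifndef CONFIG_NR_CPUS
#define CONFIG_NR_CPUS 64	/* hypothetical default for this sketch */
#endif

/* Pick the ticket width from NR_CPUS: 8-bit tickets in a 16-bit word
 * for < 256 CPUs, 16-bit tickets in a 32-bit word otherwise. */
#if CONFIG_NR_CPUS < 256
typedef uint8_t  ticket_t;
typedef uint16_t ticketpair_t;
#else
typedef uint16_t ticket_t;
typedef uint32_t ticketpair_t;
#endif

#define TICKET_SHIFT (sizeof(ticket_t) * 8)

typedef union {
	ticketpair_t head_tail;
	struct { ticket_t head, tail; } tickets;  /* little-endian layout */
} ticketlock_t;

/* One lock implementation for both ticket sizes; only the types and
 * the xadd increment differ, so the compiler emits the 16-bit or
 * 32-bit variant automatically. */
void ticket_lock(ticketlock_t *lock)
{
	ticketlock_t old;

	old.head_tail = __atomic_fetch_add(&lock->head_tail,
					   (ticketpair_t)1 << TICKET_SHIFT,
					   __ATOMIC_ACQUIRE);
	while (old.tickets.tail !=
	       __atomic_load_n(&lock->tickets.head, __ATOMIC_ACQUIRE))
		;	/* cpu_relax() would go here */
}
```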
Trylock:
Before:
<do_raw_spin_trylock>:
55 push %rbp
48 89 e5 mov %rsp,%rbp
e8 aa 0a 3e 00 callq <mcount>
8b 07 mov (%rdi),%eax
89 c2 mov %eax,%edx
c1 c0 10 rol $0x10,%eax
39 c2 cmp %eax,%edx
8d 90 00 00 01 00 lea 0x10000(%rax),%edx
75 04 jne 1f
f0 0f b1 17 lock cmpxchg %edx,(%rdi)
1: 0f 94 c2 sete %dl
0f b6 c2 movzbl %dl,%eax
5d pop %rbp
c3 retq
After:
<do_raw_spin_trylock>:
55 push %rbp
48 89 e5 mov %rsp,%rbp
e8 3e 07 3e 00 callq <mcount>
31 c0 xor %eax,%eax
8b 17 mov (%rdi),%edx
89 d1 mov %edx,%ecx
c1 e9 10 shr $0x10,%ecx
66 39 ca cmp %cx,%dx
75 14 jne 1f
8d 8a 00 00 01 00 lea 0x10000(%rdx),%ecx
89 d0 mov %edx,%eax
f0 0f b1 0f lock cmpxchg %ecx,(%rdi)
39 d0 cmp %edx,%eax
0f 94 c0 sete %al
0f b6 c0 movzbl %al,%eax
1: 5d pop %rbp
c3 retq
The differences here are similar to the < 256 CPU case:
* gcc generates a straightforward shift and 16-bit compare of the
  head and tail, rather than the rol version (which I guess keeps
  everything 32-bit)
* the same pre-loading of the failure return value rather than
  relying on sete
* the same redundant compare after cmpxchg
So conclusion:
* overall kernel code size reduction
* the spinlock code is very similar
* the trylock code could be improved a little, but it's unclear to me
  that it would make much difference (since there's a big fat locked
  cmpxchg in the middle of any interesting code path)
* unlock is inlined, so I couldn't evaluate that, but since it's just
  an unlocked inc, it's hard to see how that could go wrong.
J