From: Jeremy Fitzhardinge <jeremy@goop.org>
To: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
Linus Torvalds <torvalds@linux-foundation.org>,
Ingo Molnar <mingo@elte.hu>,
the arch/x86 maintainers <x86@kernel.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Nick Piggin <npiggin@kernel.dk>,
Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Subject: Re: [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code
Date: Fri, 15 Jul 2011 10:24:49 -0700 [thread overview]
Message-ID: <4E2077E1.4060606@goop.org> (raw)
In-Reply-To: <1ade2577-044b-4391-a607-91a4a989c3ff@email.android.com>
On 06/24/2011 08:15 PM, H. Peter Anvin wrote:
>>> Could you give us the delta in *compiled* code size?
>> Sure. Do you mean for the individual lock sequences, or for the
>> overall
>> kernel?
>>
>> J
> Overall is fine.
Here's both ;)
For the NR_CPUS < 256 case, it shrinks the kernel text by a little over
1k (and a bit of a data reduction too?):
text data bss dec hex filename
7287009 1915524 2347008 11549541 b03b65 vmlinux-spin-base
7285777 1915412 2347008 11548197 b03625 vmlinux-cleantick
A comparison of spin_lock before:
<do_raw_spin_lock>:
55 push %rbp
48 89 e5 mov %rsp,%rbp
e8 6f fa 3d 00 callq <mcount>
b8 00 01 00 00 mov $0x100,%eax
f0 66 0f c1 07 lock xadd %ax,(%rdi)
1: 38 e0 cmp %ah,%al
74 06 je 2f
f3 90 pause
8a 07 mov (%rdi),%al
eb f6 jmp 1b
2: 5d pop %rbp
c3 retq
and after:
<do_raw_spin_lock>:
55 push %rbp
48 89 e5 mov %rsp,%rbp
e8 1f f6 3d 00 callq <mcount>
b8 00 01 00 00 mov $0x100,%eax
f0 66 0f c1 07 lock xadd %ax,(%rdi)
0f b6 d4 movzbl %ah,%edx
1: 38 d0 cmp %dl,%al
74 06 je 2f
f3 90 pause
8a 07 mov (%rdi),%al
eb f6 jmp 1b
2: 5d pop %rbp
c3 retq
IOW the generated code is identical except that the original has "cmp %ah,%al" and
the compiler generates
movzbl %ah,%edx
cmp %dl,%al
I posted a bug about gcc not generating cmp %ah, %al in this case, but the general consensus was that
it's not a good choice on current Intel chips, because it causes a partial word stall, and that the
current generated code is better. (And gcc can generate "cmp %Xh, %Xl" if it wants to - see below.)
Likewise trylock before:
<do_raw_spin_trylock>:
55 push %rbp
48 89 e5 mov %rsp,%rbp
e8 50 fa 3d 00 callq <mcount>
0f b7 07 movzwl (%rdi),%eax
38 e0 cmp %ah,%al
8d 90 00 01 00 00 lea 0x100(%rax),%edx
75 05 jne 1f
f0 66 0f b1 17 lock cmpxchg %dx,(%rdi)
1: 0f 94 c2 sete %dl
0f b6 c2 movzbl %dl,%eax
5d pop %rbp
c3 retq
and after:
<do_raw_spin_trylock>:
55 push %rbp
48 89 e5 mov %rsp,%rbp
e8 fd f5 3d 00 callq <mcount>
31 c0 xor %eax,%eax
66 8b 17 mov (%rdi),%dx
38 d6 cmp %dl,%dh
75 16 jne 1f
8d 8a 00 01 00 00 lea 0x100(%rdx),%ecx
89 d0 mov %edx,%eax
f0 66 0f b1 0f lock cmpxchg %cx,(%rdi)
66 39 d0 cmp %dx,%ax
0f 94 c0 sete %al
0f b6 c0 movzbl %al,%eax
1: 5d pop %rbp
c3 retq
The differences here are a little more extensive, but the code is
broadly comparable. Some interesting points on the gcc code:
* it pre-loads the failure return value, so it doesn't need to use
sete unless its actually doing the cmpxchg - whether this is good
or not depends on how often trylock fails vs succeeds, but is
pretty minor either way
* it generates a 16 bit load, rather than using zero extending
32-bit load
* it *can* generate a "cmp %dl,%dh" if it wants to
* it ends up re-comparing the "old" and "new" values after cmpxchg
rather than reusing the flags it sets. This could be fixed by
having a cmpxchg() variant which returns a flag rather than old,
which would be generally useful since most cmpxchg() callers end
up doing that comparison anyway.
spin_unlock is a little trickier to compare because it's inlined, but
I'm guessing that's the source of the code shrinkage.
Likewise, for NR_CPUS=512, there's a ~900 byte kernel shrinkage with the
new code:
text data bss dec hex filename
7296571 1945380 2424832 11666783 b2055f vmlinux-spin-base-big
7295610 1945412 2424832 11665854 b201be vmlinux-cleantick-big
The generated code for do_raw_spin_lock is even closer:
Before:
<do_raw_spin_lock>:
55 push %rbp
48 89 e5 mov %rsp,%rbp
e8 cf 0a 3e 00 callq <mcount>
b8 00 00 01 00 mov $0x10000,%eax
f0 0f c1 07 lock xadd %eax,(%rdi)
0f b7 d0 movzwl %ax,%edx
c1 e8 10 shr $0x10,%eax
1: 39 c2 cmp %eax,%edx
74 07 je 2f
f3 90 pause
0f b7 17 movzwl (%rdi),%edx
eb f5 jmp 1b
2: 5d pop %rbp
c3 retq
After:
<do_raw_spin_lock>:
55 push %rbp
48 89 e5 mov %rsp,%rbp
e8 63 07 3e 00 callq <mcount>
b8 00 00 01 00 mov $0x10000,%eax
f0 0f c1 07 lock xadd %eax,(%rdi)
89 c2 mov %eax,%edx
c1 ea 10 shr $0x10,%edx
1: 66 39 d0 cmp %dx,%ax
74 07 je 2f
f3 90 pause
66 8b 07 mov (%rdi),%ax
eb f4 jmp 1b
2: 5d pop %rbp
c3 retq
In other words, identical aside from using 16 bit regs rather than 32
bit regs and zero extend.
Trylock:
Before:
<do_raw_spin_trylock>:
55 push %rbp
48 89 e5 mov %rsp,%rbp
e8 aa 0a 3e 00 callq <mcount>
8b 07 mov (%rdi),%eax
89 c2 mov %eax,%edx
c1 c0 10 rol $0x10,%eax
39 c2 cmp %eax,%edx
8d 90 00 00 01 00 lea 0x10000(%rax),%edx
75 04 jne 1f
f0 0f b1 17 lock cmpxchg %edx,(%rdi)
1: 0f 94 c2 sete %dl
0f b6 c2 movzbl %dl,%eax
5d pop %rbp
c3 retq
After:
<do_raw_spin_trylock>:
55 push %rbp
48 89 e5 mov %rsp,%rbp
e8 3e 07 3e 00 callq <mcount>
31 c0 xor %eax,%eax
8b 17 mov (%rdi),%edx
89 d1 mov %edx,%ecx
c1 e9 10 shr $0x10,%ecx
66 39 ca cmp %cx,%dx
75 14 jne 1f
8d 8a 00 00 01 00 lea 0x10000(%rdx),%ecx
89 d0 mov %edx,%eax
f0 0f b1 0f lock cmpxchg %ecx,(%rdi)
39 d0 cmp %edx,%eax
0f 94 c0 sete %al
0f b6 c0 movzbl %al,%eax
1: 5d pop %rbp
c3 retq
The differences here are similar to the < 256 CPU case:
* gcc generates a straightforward shift and 16-bit compare to
compare the head and tail, rather than the rol version (which I
guess keeps everything 32b)
* same pre-loading the failure return value rather than reusing sete
* same redundant compare after cmpxchg
So conclusion:
* overall kernel code size reduction
* the spinlock code is very similar
* the trylock code could be improved a little, but its unclear to me
that it would make much difference (since there's a big fat locked
cmpxchg in the middle of any interesting code path)
* unlock is inlined, so I couldn't evaluate that, but since its just
an unlocked inc, its hard to see how that could go wrong.
J
next prev parent reply other threads:[~2011-07-15 17:24 UTC|newest]
Thread overview: 49+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-06-16 21:40 [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code Jeremy Fitzhardinge
2011-06-16 21:40 ` [PATCH 1/7] x86/ticketlock: clean up types and accessors Jeremy Fitzhardinge
2011-06-16 21:40 ` [PATCH 2/7] x86/ticketlock: convert spin loop to C Jeremy Fitzhardinge
2011-06-16 21:40 ` [PATCH 3/7] x86/ticketlock: Use C for __ticket_spin_unlock Jeremy Fitzhardinge
2011-06-16 21:40 ` [PATCH 4/7] x86/ticketlock: make large and small ticket versions of spin_lock the same Jeremy Fitzhardinge
2011-06-16 21:40 ` [PATCH 5/7] x86/ticketlock: make __ticket_spin_lock common Jeremy Fitzhardinge
2011-06-16 21:40 ` [PATCH 6/7] x86/ticketlock: make __ticket_spin_trylock common Jeremy Fitzhardinge
2011-06-16 21:40 ` [PATCH 7/7] x86/ticketlock: prevent memory accesses from reordered out of lock region Jeremy Fitzhardinge
2011-06-21 14:01 ` [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code Peter Zijlstra
2011-06-21 17:54 ` Jeremy Fitzhardinge
2011-06-22 19:21 ` Jeremy Fitzhardinge
2011-06-22 20:19 ` H. Peter Anvin
2011-06-22 20:59 ` Jeremy Fitzhardinge
2011-06-22 21:07 ` H. Peter Anvin
2011-06-22 21:35 ` Jeremy Fitzhardinge
2011-06-22 23:16 ` H. Peter Anvin
2011-06-21 14:18 ` Konrad Rzeszutek Wilk
2011-06-24 1:19 ` [PATCH 1/8] x86/ticketlock: clean up types and accessors Jeremy Fitzhardinge
2011-06-24 1:19 ` [PATCH 2/8] x86/ticketlock: convert spin loop to C Jeremy Fitzhardinge
2011-07-22 19:55 ` [tip:x86/spinlocks] x86, ticketlock: Convert " tip-bot for Jeremy Fitzhardinge
2011-06-24 1:19 ` [PATCH 3/8] x86/ticketlock: Use C for __ticket_spin_unlock Jeremy Fitzhardinge
2011-06-24 21:52 ` H. Peter Anvin
2011-06-24 22:41 ` Jeremy Fitzhardinge
2011-07-22 18:32 ` H. Peter Anvin
2011-07-22 19:28 ` Jeremy Fitzhardinge
2011-07-22 19:56 ` [tip:x86/spinlocks] x86, ticketlock: Use asm volatile for __ticket_unlock_release() tip-bot for H. Peter Anvin
2011-07-22 19:56 ` [tip:x86/spinlocks] x86, ticketlock: Use C for __ticket_spin_unlock tip-bot for Jeremy Fitzhardinge
2011-06-24 1:19 ` [PATCH 4/8] x86/ticketlock: make large and small ticket versions of spin_lock the same Jeremy Fitzhardinge
2011-07-22 19:57 ` [tip:x86/spinlocks] x86, ticketlock: Make " tip-bot for Jeremy Fitzhardinge
2011-06-24 1:19 ` [PATCH 5/8] x86/ticketlock: make __ticket_spin_lock common Jeremy Fitzhardinge
2011-07-22 19:57 ` [tip:x86/spinlocks] x86, ticketlock: Make " tip-bot for Jeremy Fitzhardinge
2011-06-24 1:19 ` [PATCH 6/8] x86/ticketlock: make __ticket_spin_trylock common Jeremy Fitzhardinge
2011-07-22 19:57 ` [tip:x86/spinlocks] x86, ticketlock: Make " tip-bot for Jeremy Fitzhardinge
2011-06-24 1:19 ` [PATCH 7/8] x86: add xadd helper macro Jeremy Fitzhardinge
2011-07-22 19:58 ` [tip:x86/spinlocks] x86: Add " tip-bot for Jeremy Fitzhardinge
2011-06-24 1:19 ` [PATCH 8/8] x86/ticketlock: use xadd helper Jeremy Fitzhardinge
2011-07-22 19:58 ` [tip:x86/spinlocks] x86, ticketlock: Use " tip-bot for Jeremy Fitzhardinge
2011-07-22 19:55 ` [tip:x86/spinlocks] x86, ticketlock: Clean up types and accessors tip-bot for Jeremy Fitzhardinge
2011-06-24 21:50 ` [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code H. Peter Anvin
2011-06-24 22:42 ` Jeremy Fitzhardinge
2011-06-25 3:15 ` H. Peter Anvin
2011-07-15 17:24 ` Jeremy Fitzhardinge [this message]
2011-07-15 23:06 ` H. Peter Anvin
2011-07-16 0:14 ` Jeremy Fitzhardinge
2011-06-25 10:11 ` Ingo Molnar
2011-06-29 20:44 ` Andi Kleen
2011-07-21 23:33 ` Jeremy Fitzhardinge
2011-07-22 16:25 ` Andi Kleen
2011-09-07 23:25 ` Jeremy Fitzhardinge
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4E2077E1.4060606@goop.org \
--to=jeremy@goop.org \
--cc=hpa@zytor.com \
--cc=jeremy.fitzhardinge@citrix.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@elte.hu \
--cc=npiggin@kernel.dk \
--cc=peterz@infradead.org \
--cc=torvalds@linux-foundation.org \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.