public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Jeremy Fitzhardinge <jeremy@goop.org>
To: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Ingo Molnar <mingo@elte.hu>,
	the arch/x86 maintainers <x86@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Nick Piggin <npiggin@kernel.dk>,
	Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Subject: Re: [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code
Date: Fri, 15 Jul 2011 10:24:49 -0700	[thread overview]
Message-ID: <4E2077E1.4060606@goop.org> (raw)
In-Reply-To: <1ade2577-044b-4391-a607-91a4a989c3ff@email.android.com>

On 06/24/2011 08:15 PM, H. Peter Anvin wrote:
>>> Could you give us the delta in *compiled* code size?
>> Sure.  Do you mean for the individual lock sequences, or for the
>> overall
>> kernel?
>>
>>    J
> Overall is fine.

Here's both ;)

For the NR_CPUS < 256 case, it shrinks the kernel text by a little over
1k (and a bit of a data reduction too?):

   text	   data	    bss	    dec	    hex	filename
7287009	1915524	2347008	11549541	 b03b65	vmlinux-spin-base
7285777	1915412	2347008	11548197	 b03625	vmlinux-cleantick

A comparison of spin_lock before:

<do_raw_spin_lock>:
        55                      push   %rbp
        48 89 e5                mov    %rsp,%rbp
        e8 6f fa 3d 00          callq  <mcount>
        b8 00 01 00 00          mov    $0x100,%eax
        f0 66 0f c1 07          lock xadd %ax,(%rdi)
1:      38 e0                   cmp    %ah,%al
        74 06                   je     2f
        f3 90                   pause  
        8a 07                   mov    (%rdi),%al
        eb f6                   jmp    1b
2:      5d                      pop    %rbp
        c3                      retq   

and after:

<do_raw_spin_lock>:
        55                      push   %rbp
        48 89 e5                mov    %rsp,%rbp
        e8 1f f6 3d 00          callq  <mcount>
        b8 00 01 00 00          mov    $0x100,%eax
        f0 66 0f c1 07          lock xadd %ax,(%rdi)
        0f b6 d4                movzbl %ah,%edx
1:      38 d0                   cmp    %dl,%al
        74 06                   je     2f
        f3 90                   pause  
        8a 07                   mov    (%rdi),%al
        eb f6                   jmp    1b
2:      5d                      pop    %rbp
        c3                      retq   

IOW the generated code is identical except that the original has "cmp %ah,%al" and
the compiler generates
	movzbl %ah,%edx
	cmp    %dl,%al

I posted a bug about gcc not generating cmp %ah, %al in this case, but the general consensus was that
it's not a good choice on current Intel chips, because it causes a partial word stall, and that the
current generated code is better.  (And gcc can generate "cmp %Xh, %Xl" if it wants to - see below.)

Likewise trylock before:

<do_raw_spin_trylock>:
        55                      push   %rbp
        48 89 e5                mov    %rsp,%rbp
        e8 50 fa 3d 00          callq  <mcount>
        0f b7 07                movzwl (%rdi),%eax
        38 e0                   cmp    %ah,%al
        8d 90 00 01 00 00       lea    0x100(%rax),%edx
        75 05                   jne    1f
        f0 66 0f b1 17          lock cmpxchg %dx,(%rdi)
1:      0f 94 c2                sete   %dl
        0f b6 c2                movzbl %dl,%eax
        5d                      pop    %rbp
        c3                      retq   

and after:

<do_raw_spin_trylock>:
        55                      push   %rbp
        48 89 e5                mov    %rsp,%rbp
        e8 fd f5 3d 00          callq  <mcount>
        31 c0                   xor    %eax,%eax
        66 8b 17                mov    (%rdi),%dx
        38 d6                   cmp    %dl,%dh
        75 16                   jne    1f
        8d 8a 00 01 00 00       lea    0x100(%rdx),%ecx
        89 d0                   mov    %edx,%eax
        f0 66 0f b1 0f          lock cmpxchg %cx,(%rdi)
        66 39 d0                cmp    %dx,%ax
        0f 94 c0                sete   %al
        0f b6 c0                movzbl %al,%eax
1:      5d                      pop    %rbp
        c3                      retq   


The differences here are a little more extensive, but the code is
broadly comparable.   Some interesting points on the gcc code:

    * it pre-loads the failure return value, so it doesn't need to use
      sete unless its actually doing the cmpxchg - whether this is good
      or not depends on how often trylock fails vs succeeds, but is
      pretty minor either way
    * it generates a 16 bit load, rather than using zero extending
      32-bit load
    * it *can* generate a "cmp %dl,%dh" if it wants to
    * it ends up re-comparing the "old" and "new" values after cmpxchg
      rather than reusing the flags it sets.  This could be fixed by
      having a cmpxchg() variant which returns a flag rather than old,
      which would be generally useful since most cmpxchg() callers end
      up doing that comparison anyway.

spin_unlock is a little trickier to compare because it's inlined, but
I'm guessing that's the source of the code shrinkage.

Likewise, for NR_CPUS=512, there's a ~900 byte kernel shrinkage with the
new code:

   text	   data	    bss	    dec	    hex	filename
7296571	1945380	2424832	11666783	 b2055f	vmlinux-spin-base-big
7295610	1945412	2424832	11665854	 b201be	vmlinux-cleantick-big

The generated code for do_raw_spin_lock is even closer:

Before:

<do_raw_spin_lock>:
       55                      push   %rbp
       48 89 e5                mov    %rsp,%rbp
       e8 cf 0a 3e 00          callq  <mcount>
       b8 00 00 01 00          mov    $0x10000,%eax
       f0 0f c1 07             lock xadd %eax,(%rdi)
       0f b7 d0                movzwl %ax,%edx
       c1 e8 10                shr    $0x10,%eax
1:     39 c2                   cmp    %eax,%edx
       74 07                   je     2f
       f3 90                   pause  
       0f b7 17                movzwl (%rdi),%edx
       eb f5                   jmp    1b
2:     5d                      pop    %rbp
       c3                      retq   

After:

<do_raw_spin_lock>:
       55                      push   %rbp
       48 89 e5                mov    %rsp,%rbp
       e8 63 07 3e 00          callq  <mcount>
       b8 00 00 01 00          mov    $0x10000,%eax
       f0 0f c1 07             lock xadd %eax,(%rdi)
       89 c2                   mov    %eax,%edx
       c1 ea 10                shr    $0x10,%edx
1:     66 39 d0                cmp    %dx,%ax
       74 07                   je     2f
       f3 90                   pause  
       66 8b 07                mov    (%rdi),%ax
       eb f4                   jmp    1b
2:     5d                      pop    %rbp
       c3                      retq   

In other words, identical aside from using 16 bit regs rather than 32
bit regs and zero extend.

Trylock:

Before:

<do_raw_spin_trylock>:
       55                      push   %rbp
       48 89 e5                mov    %rsp,%rbp
       e8 aa 0a 3e 00          callq  <mcount>
       8b 07                   mov    (%rdi),%eax
       89 c2                   mov    %eax,%edx
       c1 c0 10                rol    $0x10,%eax
       39 c2                   cmp    %eax,%edx
       8d 90 00 00 01 00       lea    0x10000(%rax),%edx
       75 04                   jne    1f
       f0 0f b1 17             lock cmpxchg %edx,(%rdi)
1:     0f 94 c2                sete   %dl
       0f b6 c2                movzbl %dl,%eax
       5d                      pop    %rbp
       c3                      retq   

After:

<do_raw_spin_trylock>:
       55                      push   %rbp
       48 89 e5                mov    %rsp,%rbp
       e8 3e 07 3e 00          callq  <mcount>
       31 c0                   xor    %eax,%eax
       8b 17                   mov    (%rdi),%edx
       89 d1                   mov    %edx,%ecx
       c1 e9 10                shr    $0x10,%ecx
       66 39 ca                cmp    %cx,%dx
       75 14                   jne    1f
       8d 8a 00 00 01 00       lea    0x10000(%rdx),%ecx
       89 d0                   mov    %edx,%eax
       f0 0f b1 0f             lock cmpxchg %ecx,(%rdi)
       39 d0                   cmp    %edx,%eax
       0f 94 c0                sete   %al
       0f b6 c0                movzbl %al,%eax
1:     5d                      pop    %rbp
       c3                      retq   

The differences here are similar to the < 256 CPU case:

    * gcc generates a straightforward shift and 16-bit compare to
      compare the head and tail, rather than the rol version (which I
      guess keeps everything 32b)
    * same pre-loading the failure return value rather than reusing sete
    * same redundant compare after cmpxchg


So conclusion:

    * overall kernel code size reduction
    * the spinlock code is very similar
    * the trylock code could be improved a little, but its unclear to me
      that it would make much difference (since there's a big fat locked
      cmpxchg in the middle of any interesting code path)
    * unlock is inlined, so I couldn't evaluate that, but since its just
      an unlocked inc, its hard to see how that could go wrong.

    J

  reply	other threads:[~2011-07-15 17:24 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-06-16 21:40 [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code Jeremy Fitzhardinge
2011-06-16 21:40 ` [PATCH 1/7] x86/ticketlock: clean up types and accessors Jeremy Fitzhardinge
2011-06-16 21:40 ` [PATCH 2/7] x86/ticketlock: convert spin loop to C Jeremy Fitzhardinge
2011-06-16 21:40 ` [PATCH 3/7] x86/ticketlock: Use C for __ticket_spin_unlock Jeremy Fitzhardinge
2011-06-16 21:40 ` [PATCH 4/7] x86/ticketlock: make large and small ticket versions of spin_lock the same Jeremy Fitzhardinge
2011-06-16 21:40 ` [PATCH 5/7] x86/ticketlock: make __ticket_spin_lock common Jeremy Fitzhardinge
2011-06-16 21:40 ` [PATCH 6/7] x86/ticketlock: make __ticket_spin_trylock common Jeremy Fitzhardinge
2011-06-16 21:40 ` [PATCH 7/7] x86/ticketlock: prevent memory accesses from reordered out of lock region Jeremy Fitzhardinge
2011-06-21 14:01 ` [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code Peter Zijlstra
2011-06-21 17:54   ` Jeremy Fitzhardinge
2011-06-22 19:21   ` Jeremy Fitzhardinge
2011-06-22 20:19     ` H. Peter Anvin
2011-06-22 20:59       ` Jeremy Fitzhardinge
2011-06-22 21:07         ` H. Peter Anvin
2011-06-22 21:35           ` Jeremy Fitzhardinge
2011-06-22 23:16             ` H. Peter Anvin
2011-06-21 14:18 ` Konrad Rzeszutek Wilk
2011-06-24  1:19 ` [PATCH 1/8] x86/ticketlock: clean up types and accessors Jeremy Fitzhardinge
2011-06-24  1:19   ` [PATCH 2/8] x86/ticketlock: convert spin loop to C Jeremy Fitzhardinge
2011-07-22 19:55     ` [tip:x86/spinlocks] x86, ticketlock: Convert " tip-bot for Jeremy Fitzhardinge
2011-06-24  1:19   ` [PATCH 3/8] x86/ticketlock: Use C for __ticket_spin_unlock Jeremy Fitzhardinge
2011-06-24 21:52     ` H. Peter Anvin
2011-06-24 22:41       ` Jeremy Fitzhardinge
2011-07-22 18:32         ` H. Peter Anvin
2011-07-22 19:28           ` Jeremy Fitzhardinge
2011-07-22 19:56       ` [tip:x86/spinlocks] x86, ticketlock: Use asm volatile for __ticket_unlock_release() tip-bot for H. Peter Anvin
2011-07-22 19:56     ` [tip:x86/spinlocks] x86, ticketlock: Use C for __ticket_spin_unlock tip-bot for Jeremy Fitzhardinge
2011-06-24  1:19   ` [PATCH 4/8] x86/ticketlock: make large and small ticket versions of spin_lock the same Jeremy Fitzhardinge
2011-07-22 19:57     ` [tip:x86/spinlocks] x86, ticketlock: Make " tip-bot for Jeremy Fitzhardinge
2011-06-24  1:19   ` [PATCH 5/8] x86/ticketlock: make __ticket_spin_lock common Jeremy Fitzhardinge
2011-07-22 19:57     ` [tip:x86/spinlocks] x86, ticketlock: Make " tip-bot for Jeremy Fitzhardinge
2011-06-24  1:19   ` [PATCH 6/8] x86/ticketlock: make __ticket_spin_trylock common Jeremy Fitzhardinge
2011-07-22 19:57     ` [tip:x86/spinlocks] x86, ticketlock: Make " tip-bot for Jeremy Fitzhardinge
2011-06-24  1:19   ` [PATCH 7/8] x86: add xadd helper macro Jeremy Fitzhardinge
2011-07-22 19:58     ` [tip:x86/spinlocks] x86: Add " tip-bot for Jeremy Fitzhardinge
2011-06-24  1:19   ` [PATCH 8/8] x86/ticketlock: use xadd helper Jeremy Fitzhardinge
2011-07-22 19:58     ` [tip:x86/spinlocks] x86, ticketlock: Use " tip-bot for Jeremy Fitzhardinge
2011-07-22 19:55   ` [tip:x86/spinlocks] x86, ticketlock: Clean up types and accessors tip-bot for Jeremy Fitzhardinge
2011-06-24 21:50 ` [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code H. Peter Anvin
2011-06-24 22:42   ` Jeremy Fitzhardinge
2011-06-25  3:15     ` H. Peter Anvin
2011-07-15 17:24       ` Jeremy Fitzhardinge [this message]
2011-07-15 23:06         ` H. Peter Anvin
2011-07-16  0:14           ` Jeremy Fitzhardinge
2011-06-25 10:11 ` Ingo Molnar
2011-06-29 20:44 ` Andi Kleen
2011-07-21 23:33   ` Jeremy Fitzhardinge
2011-07-22 16:25     ` Andi Kleen
2011-09-07 23:25   ` Jeremy Fitzhardinge

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4E2077E1.4060606@goop.org \
    --to=jeremy@goop.org \
    --cc=hpa@zytor.com \
    --cc=jeremy.fitzhardinge@citrix.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=npiggin@kernel.dk \
    --cc=peterz@infradead.org \
    --cc=torvalds@linux-foundation.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox