All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jeremy Fitzhardinge <jeremy@goop.org>
To: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Ingo Molnar <mingo@elte.hu>,
	the arch/x86 maintainers <x86@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Nick Piggin <npiggin@kernel.dk>,
	Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Subject: Re: [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code
Date: Fri, 15 Jul 2011 10:24:49 -0700	[thread overview]
Message-ID: <4E2077E1.4060606@goop.org> (raw)
In-Reply-To: <1ade2577-044b-4391-a607-91a4a989c3ff@email.android.com>

On 06/24/2011 08:15 PM, H. Peter Anvin wrote:
>>> Could you give us the delta in *compiled* code size?
>> Sure.  Do you mean for the individual lock sequences, or for the
>> overall
>> kernel?
>>
>>    J
> Overall is fine.

Here's both ;)

For the NR_CPUS < 256 case, it shrinks the kernel text by a little over
1k (and a bit of a data reduction too?):

   text	   data	    bss	    dec	    hex	filename
7287009	1915524	2347008	11549541	 b03b65	vmlinux-spin-base
7285777	1915412	2347008	11548197	 b03625	vmlinux-cleantick

A comparison of spin_lock before:

<do_raw_spin_lock>:
        55                      push   %rbp
        48 89 e5                mov    %rsp,%rbp
        e8 6f fa 3d 00          callq  <mcount>
        b8 00 01 00 00          mov    $0x100,%eax
        f0 66 0f c1 07          lock xadd %ax,(%rdi)
1:      38 e0                   cmp    %ah,%al
        74 06                   je     2f
        f3 90                   pause  
        8a 07                   mov    (%rdi),%al
        eb f6                   jmp    1b
2:      5d                      pop    %rbp
        c3                      retq   

and after:

<do_raw_spin_lock>:
        55                      push   %rbp
        48 89 e5                mov    %rsp,%rbp
        e8 1f f6 3d 00          callq  <mcount>
        b8 00 01 00 00          mov    $0x100,%eax
        f0 66 0f c1 07          lock xadd %ax,(%rdi)
        0f b6 d4                movzbl %ah,%edx
1:      38 d0                   cmp    %dl,%al
        74 06                   je     2f
        f3 90                   pause  
        8a 07                   mov    (%rdi),%al
        eb f6                   jmp    1b
2:      5d                      pop    %rbp
        c3                      retq   

IOW the generated code is identical except that the original has "cmp %ah,%al" and
the compiler generates
	movzbl %ah,%edx
	cmp    %dl,%al

I posted a bug about gcc not generating cmp %ah, %al in this case, but the general consensus was that
it's not a good choice on current Intel chips, because it causes a partial word stall, and that the
current generated code is better.  (And gcc can generate "cmp %Xh, %Xl" if it wants to - see below.)

Likewise trylock before:

<do_raw_spin_trylock>:
        55                      push   %rbp
        48 89 e5                mov    %rsp,%rbp
        e8 50 fa 3d 00          callq  <mcount>
        0f b7 07                movzwl (%rdi),%eax
        38 e0                   cmp    %ah,%al
        8d 90 00 01 00 00       lea    0x100(%rax),%edx
        75 05                   jne    1f
        f0 66 0f b1 17          lock cmpxchg %dx,(%rdi)
1:      0f 94 c2                sete   %dl
        0f b6 c2                movzbl %dl,%eax
        5d                      pop    %rbp
        c3                      retq   

and after:

<do_raw_spin_trylock>:
        55                      push   %rbp
        48 89 e5                mov    %rsp,%rbp
        e8 fd f5 3d 00          callq  <mcount>
        31 c0                   xor    %eax,%eax
        66 8b 17                mov    (%rdi),%dx
        38 d6                   cmp    %dl,%dh
        75 16                   jne    1f
        8d 8a 00 01 00 00       lea    0x100(%rdx),%ecx
        89 d0                   mov    %edx,%eax
        f0 66 0f b1 0f          lock cmpxchg %cx,(%rdi)
        66 39 d0                cmp    %dx,%ax
        0f 94 c0                sete   %al
        0f b6 c0                movzbl %al,%eax
1:      5d                      pop    %rbp
        c3                      retq   


The differences here are a little more extensive, but the code is
broadly comparable.   Some interesting points on the gcc code:

    * it pre-loads the failure return value, so it doesn't need to use
      sete unless its actually doing the cmpxchg - whether this is good
      or not depends on how often trylock fails vs succeeds, but is
      pretty minor either way
    * it generates a 16 bit load, rather than using zero extending
      32-bit load
    * it *can* generate a "cmp %dl,%dh" if it wants to
    * it ends up re-comparing the "old" and "new" values after cmpxchg
      rather than reusing the flags it sets.  This could be fixed by
      having a cmpxchg() variant which returns a flag rather than old,
      which would be generally useful since most cmpxchg() callers end
      up doing that comparison anyway.

spin_unlock is a little trickier to compare because it's inlined, but
I'm guessing that's the source of the code shrinkage.

Likewise, for NR_CPUS=512, there's a ~900 byte kernel shrinkage with the
new code:

   text	   data	    bss	    dec	    hex	filename
7296571	1945380	2424832	11666783	 b2055f	vmlinux-spin-base-big
7295610	1945412	2424832	11665854	 b201be	vmlinux-cleantick-big

The generated code for do_raw_spin_lock is even closer:

Before:

<do_raw_spin_lock>:
       55                      push   %rbp
       48 89 e5                mov    %rsp,%rbp
       e8 cf 0a 3e 00          callq  <mcount>
       b8 00 00 01 00          mov    $0x10000,%eax
       f0 0f c1 07             lock xadd %eax,(%rdi)
       0f b7 d0                movzwl %ax,%edx
       c1 e8 10                shr    $0x10,%eax
1:     39 c2                   cmp    %eax,%edx
       74 07                   je     2f
       f3 90                   pause  
       0f b7 17                movzwl (%rdi),%edx
       eb f5                   jmp    1b
2:     5d                      pop    %rbp
       c3                      retq   

After:

<do_raw_spin_lock>:
       55                      push   %rbp
       48 89 e5                mov    %rsp,%rbp
       e8 63 07 3e 00          callq  <mcount>
       b8 00 00 01 00          mov    $0x10000,%eax
       f0 0f c1 07             lock xadd %eax,(%rdi)
       89 c2                   mov    %eax,%edx
       c1 ea 10                shr    $0x10,%edx
1:     66 39 d0                cmp    %dx,%ax
       74 07                   je     2f
       f3 90                   pause  
       66 8b 07                mov    (%rdi),%ax
       eb f4                   jmp    1b
2:     5d                      pop    %rbp
       c3                      retq   

In other words, identical aside from using 16 bit regs rather than 32
bit regs and zero extend.

Trylock:

Before:

<do_raw_spin_trylock>:
       55                      push   %rbp
       48 89 e5                mov    %rsp,%rbp
       e8 aa 0a 3e 00          callq  <mcount>
       8b 07                   mov    (%rdi),%eax
       89 c2                   mov    %eax,%edx
       c1 c0 10                rol    $0x10,%eax
       39 c2                   cmp    %eax,%edx
       8d 90 00 00 01 00       lea    0x10000(%rax),%edx
       75 04                   jne    1f
       f0 0f b1 17             lock cmpxchg %edx,(%rdi)
1:     0f 94 c2                sete   %dl
       0f b6 c2                movzbl %dl,%eax
       5d                      pop    %rbp
       c3                      retq   

After:

<do_raw_spin_trylock>:
       55                      push   %rbp
       48 89 e5                mov    %rsp,%rbp
       e8 3e 07 3e 00          callq  <mcount>
       31 c0                   xor    %eax,%eax
       8b 17                   mov    (%rdi),%edx
       89 d1                   mov    %edx,%ecx
       c1 e9 10                shr    $0x10,%ecx
       66 39 ca                cmp    %cx,%dx
       75 14                   jne    1f
       8d 8a 00 00 01 00       lea    0x10000(%rdx),%ecx
       89 d0                   mov    %edx,%eax
       f0 0f b1 0f             lock cmpxchg %ecx,(%rdi)
       39 d0                   cmp    %edx,%eax
       0f 94 c0                sete   %al
       0f b6 c0                movzbl %al,%eax
1:     5d                      pop    %rbp
       c3                      retq   

The differences here are similar to the < 256 CPU case:

    * gcc generates a straightforward shift and 16-bit compare to
      compare the head and tail, rather than the rol version (which I
      guess keeps everything 32b)
    * same pre-loading the failure return value rather than reusing sete
    * same redundant compare after cmpxchg


So conclusion:

    * overall kernel code size reduction
    * the spinlock code is very similar
    * the trylock code could be improved a little, but its unclear to me
      that it would make much difference (since there's a big fat locked
      cmpxchg in the middle of any interesting code path)
    * unlock is inlined, so I couldn't evaluate that, but since its just
      an unlocked inc, its hard to see how that could go wrong.

    J

  reply	other threads:[~2011-07-15 17:24 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-06-16 21:40 [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code Jeremy Fitzhardinge
2011-06-16 21:40 ` [PATCH 1/7] x86/ticketlock: clean up types and accessors Jeremy Fitzhardinge
2011-06-16 21:40 ` [PATCH 2/7] x86/ticketlock: convert spin loop to C Jeremy Fitzhardinge
2011-06-16 21:40 ` [PATCH 3/7] x86/ticketlock: Use C for __ticket_spin_unlock Jeremy Fitzhardinge
2011-06-16 21:40 ` [PATCH 4/7] x86/ticketlock: make large and small ticket versions of spin_lock the same Jeremy Fitzhardinge
2011-06-16 21:40 ` [PATCH 5/7] x86/ticketlock: make __ticket_spin_lock common Jeremy Fitzhardinge
2011-06-16 21:40 ` [PATCH 6/7] x86/ticketlock: make __ticket_spin_trylock common Jeremy Fitzhardinge
2011-06-16 21:40 ` [PATCH 7/7] x86/ticketlock: prevent memory accesses from reordered out of lock region Jeremy Fitzhardinge
2011-06-21 14:01 ` [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code Peter Zijlstra
2011-06-21 17:54   ` Jeremy Fitzhardinge
2011-06-22 19:21   ` Jeremy Fitzhardinge
2011-06-22 20:19     ` H. Peter Anvin
2011-06-22 20:59       ` Jeremy Fitzhardinge
2011-06-22 21:07         ` H. Peter Anvin
2011-06-22 21:35           ` Jeremy Fitzhardinge
2011-06-22 23:16             ` H. Peter Anvin
2011-06-21 14:18 ` Konrad Rzeszutek Wilk
2011-06-24  1:19 ` [PATCH 1/8] x86/ticketlock: clean up types and accessors Jeremy Fitzhardinge
2011-06-24  1:19   ` [PATCH 2/8] x86/ticketlock: convert spin loop to C Jeremy Fitzhardinge
2011-07-22 19:55     ` [tip:x86/spinlocks] x86, ticketlock: Convert " tip-bot for Jeremy Fitzhardinge
2011-06-24  1:19   ` [PATCH 3/8] x86/ticketlock: Use C for __ticket_spin_unlock Jeremy Fitzhardinge
2011-06-24 21:52     ` H. Peter Anvin
2011-06-24 22:41       ` Jeremy Fitzhardinge
2011-07-22 18:32         ` H. Peter Anvin
2011-07-22 19:28           ` Jeremy Fitzhardinge
2011-07-22 19:56       ` [tip:x86/spinlocks] x86, ticketlock: Use asm volatile for __ticket_unlock_release() tip-bot for H. Peter Anvin
2011-07-22 19:56     ` [tip:x86/spinlocks] x86, ticketlock: Use C for __ticket_spin_unlock tip-bot for Jeremy Fitzhardinge
2011-06-24  1:19   ` [PATCH 4/8] x86/ticketlock: make large and small ticket versions of spin_lock the same Jeremy Fitzhardinge
2011-07-22 19:57     ` [tip:x86/spinlocks] x86, ticketlock: Make " tip-bot for Jeremy Fitzhardinge
2011-06-24  1:19   ` [PATCH 5/8] x86/ticketlock: make __ticket_spin_lock common Jeremy Fitzhardinge
2011-07-22 19:57     ` [tip:x86/spinlocks] x86, ticketlock: Make " tip-bot for Jeremy Fitzhardinge
2011-06-24  1:19   ` [PATCH 6/8] x86/ticketlock: make __ticket_spin_trylock common Jeremy Fitzhardinge
2011-07-22 19:57     ` [tip:x86/spinlocks] x86, ticketlock: Make " tip-bot for Jeremy Fitzhardinge
2011-06-24  1:19   ` [PATCH 7/8] x86: add xadd helper macro Jeremy Fitzhardinge
2011-07-22 19:58     ` [tip:x86/spinlocks] x86: Add " tip-bot for Jeremy Fitzhardinge
2011-06-24  1:19   ` [PATCH 8/8] x86/ticketlock: use xadd helper Jeremy Fitzhardinge
2011-07-22 19:58     ` [tip:x86/spinlocks] x86, ticketlock: Use " tip-bot for Jeremy Fitzhardinge
2011-07-22 19:55   ` [tip:x86/spinlocks] x86, ticketlock: Clean up types and accessors tip-bot for Jeremy Fitzhardinge
2011-06-24 21:50 ` [PATCH RFC 0/7] x86: convert ticketlocks to C and remove duplicate code H. Peter Anvin
2011-06-24 22:42   ` Jeremy Fitzhardinge
2011-06-25  3:15     ` H. Peter Anvin
2011-07-15 17:24       ` Jeremy Fitzhardinge [this message]
2011-07-15 23:06         ` H. Peter Anvin
2011-07-16  0:14           ` Jeremy Fitzhardinge
2011-06-25 10:11 ` Ingo Molnar
2011-06-29 20:44 ` Andi Kleen
2011-07-21 23:33   ` Jeremy Fitzhardinge
2011-07-22 16:25     ` Andi Kleen
2011-09-07 23:25   ` Jeremy Fitzhardinge

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4E2077E1.4060606@goop.org \
    --to=jeremy@goop.org \
    --cc=hpa@zytor.com \
    --cc=jeremy.fitzhardinge@citrix.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=npiggin@kernel.dk \
    --cc=peterz@infradead.org \
    --cc=torvalds@linux-foundation.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.