Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Joe Perches <joe@perches.com>
To: Alexander Duyck <alexander.duyck@gmail.com>,
	Tom Herbert <tom@herbertland.com>
Cc: Alexander Duyck <aduyck@mirantis.com>,
	Netdev <netdev@vger.kernel.org>,
	David Miller <davem@davemloft.net>
Subject: Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap
Date: Wed, 09 Mar 2016 16:18:33 -0800	[thread overview]
Message-ID: <1457569113.3433.7.camel@perches.com> (raw)
In-Reply-To: <CAKgT0UebO62GBsmL17JrZW0Ptzmr05buc1x6pHv6A_PAr4HBLQ@mail.gmail.com>

On Wed, 2016-03-09 at 08:08 -0800, Alexander Duyck wrote:
> On Tue, Mar 8, 2016 at 10:31 PM, Tom Herbert <tom@herbertland.com> wrote:
> > I took a look inlining these.
> > 
> > #define rol32(V, X) ({                          \
> >         int word = V;                           \
> >         if (__builtin_constant_p(X))            \
> >                 asm("roll $" #X ",%[word]\n\t"  \
> >                     : [word] "=r" (word));      \
> >         else                                    \
> >                 asm("roll %%cl,%[word]\n\t"     \
> >                     : [word] "=r" (word)        \
> >                     : "c" (X));                 \
> >         word;                                   \
> > })
> > 
> > With this I'm seeing a nice speedup in jhash which uses a lot of rol32s...
> Is gcc really not converting the rol32 calls into rotates?

No, it is.

The difference in the object code with the asm for instance is:

(old, compiled with gcc 5.3.1)

<jhash_2words.constprop.5>:
     84e:       81 ee 09 41 52 21       sub    $0x21524109,%esi
     854:       81 ef 09 41 52 21       sub    $0x21524109,%edi
     85a:       55                      push   %rbp
     85b:       89 f0                   mov    %esi,%eax
     85d:       89 f2                   mov    %esi,%edx
     85f:       48 ff 05 00 00 00 00    incq   0x0(%rip)        # 866 <jhash_2words.constprop.5+0x18>
     866:       c1 c2 0e                rol    $0xe,%edx
     869:       35 f7 be ad de          xor    $0xdeadbef7,%eax
     86e:       48 89 e5                mov    %rsp,%rbp
     871:       29 d0                   sub    %edx,%eax
     873:       48 ff 05 00 00 00 00    incq   0x0(%rip)        # 87a <jhash_2words.constprop.5+0x2c>
     87a:       48 ff 05 00 00 00 00    incq   0x0(%rip)        # 881 <jhash_2words.constprop.5+0x33>
     881:       89 c2                   mov    %eax,%edx
     883:       31 c7                   xor    %eax,%edi
     885:       c1 c2 0b                rol    $0xb,%edx
     888:       29 d7                   sub    %edx,%edi
     88a:       89 fa                   mov    %edi,%edx
     88c:       31 fe                   xor    %edi,%esi
     88e:       c1 ca 07                ror    $0x7,%edx
     891:       29 d6                   sub    %edx,%esi
     893:       89 f2                   mov    %esi,%edx
     895:       31 f0                   xor    %esi,%eax
     897:       c1 c2 10                rol    $0x10,%edx
     89a:       29 d0                   sub    %edx,%eax
     89c:       89 c2                   mov    %eax,%edx
     89e:       31 c7                   xor    %eax,%edi
     8a0:       c1 c2 04                rol    $0x4,%edx
     8a3:       29 d7                   sub    %edx,%edi
     8a5:       31 fe                   xor    %edi,%esi
     8a7:       c1 c7 0e                rol    $0xe,%edi
     8aa:       29 fe                   sub    %edi,%esi
     8ac:       31 f0                   xor    %esi,%eax
     8ae:       c1 ce 08                ror    $0x8,%esi
     8b1:       29 f0                   sub    %esi,%eax
     8b3:       5d                      pop    %rbp
     8b4:       c3                      retq   

vs Tom's asm

000000000000084e <jhash_2words.constprop.5>:
     84e:       81 ee 09 41 52 21       sub    $0x21524109,%esi
     854:       8d 87 f7 be ad de       lea    -0x21524109(%rdi),%eax
     85a:       55                      push   %rbp
     85b:       89 f2                   mov    %esi,%edx
     85d:       48 ff 05 00 00 00 00    incq   0x0(%rip)        # 864 <jhash_2words.constprop.5+0x16>
     864:       48 ff 05 00 00 00 00    incq   0x0(%rip)        # 86b <jhash_2words.constprop.5+0x1d>
     86b:       81 f2 f7 be ad de       xor    $0xdeadbef7,%edx
     871:       48 89 e5                mov    %rsp,%rbp
     874:       c1 c1 0e                rol    $0xe,%ecx
     877:       29 ca                   sub    %ecx,%edx
     879:       31 d0                   xor    %edx,%eax
     87b:       c1 c7 0b                rol    $0xb,%edi
     87e:       29 f8                   sub    %edi,%eax
     880:       48 ff 05 00 00 00 00    incq   0x0(%rip)        # 887 <jhash_2words.constprop.5+0x39>
     887:       31 c6                   xor    %eax,%esi
     889:       c1 c7 19                rol    $0x19,%edi
     88c:       29 fe                   sub    %edi,%esi
     88e:       31 f2                   xor    %esi,%edx
     890:       c1 c7 10                rol    $0x10,%edi
     893:       29 fa                   sub    %edi,%edx
     895:       31 d0                   xor    %edx,%eax
     897:       c1 c7 04                rol    $0x4,%edi
     89a:       29 f8                   sub    %edi,%eax
     89c:       31 f0                   xor    %esi,%eax
     89e:       29 c8                   sub    %ecx,%eax
     8a0:       31 d0                   xor    %edx,%eax
     8a2:       5d                      pop    %rbp
     8a3:       c1 c2 18                rol    $0x18,%edx
     8a6:       29 d0                   sub    %edx,%eax
     8a8:       c3                      retq   

> If we need this type of code in order to get the rotates to occur as
> expected then maybe we need to look at doing arch specific versions of
> the functions in bitops.h in order to improve the performance since I
> know these calls are used in some performance critical paths such as
> crypto and hashing.

Yeah, maybe, but why couldn't gcc generate similar code
as Tom's asm? (modulo the ripple reducing ror vs rol uses
when the shift is > 16

next prev parent reply	other threads:[~2016-03-10  0:18 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-03-08 22:42 [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap Alexander Duyck
2016-03-08 23:25 ` Joe Perches
2016-03-09  5:23   ` Alexander Duyck
2016-03-09  5:50     ` Joe Perches
2016-03-09  6:08       ` Alexander Duyck
2016-03-09  6:31         ` Tom Herbert
2016-03-09 16:08           ` Alexander Duyck
2016-03-10  0:18             ` Joe Perches [this message]
2016-03-10  0:58               ` Tom Herbert
2016-03-09 10:54   ` David Laight
2016-03-09 16:03     ` Alexander Duyck

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1457569113.3433.7.camel@perches.com \
    --to=joe@perches.com \
    --cc=aduyck@mirantis.com \
    --cc=alexander.duyck@gmail.com \
    --cc=davem@davemloft.net \
    --cc=netdev@vger.kernel.org \
    --cc=tom@herbertland.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.