* [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap
@ 2016-03-08 22:42 Alexander Duyck
  2016-03-08 23:25 ` Joe Perches
  0 siblings, 1 reply; 11+ messages in thread

From: Alexander Duyck @ 2016-03-08 22:42 UTC (permalink / raw)
To: netdev, davem

The code for csum_block_add was doing a funky byteswap to swap the even and
odd bytes of the checksum if the offset was odd.  Instead of doing this we
can save ourselves some trouble and just shift by 8 as this should have the
same effect in terms of the final checksum value and only requires one
instruction.

In addition we can update csum_block_sub to just use csum_block_add with an
inverse value for csum2.  This way we follow the same code path as
csum_block_add without having to duplicate it.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
 include/net/checksum.h | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/include/net/checksum.h b/include/net/checksum.h
index 10a16b5bd1c7..f9fac66c0e66 100644
--- a/include/net/checksum.h
+++ b/include/net/checksum.h
@@ -88,8 +88,10 @@ static inline __wsum
 csum_block_add(__wsum csum, __wsum csum2, int offset)
 {
 	u32 sum = (__force u32)csum2;
-	if (offset&1)
-		sum = ((sum&0xFF00FF)<<8)+((sum>>8)&0xFF00FF);
+
+	if (offset & 1)
+		sum = (sum << 24) + (sum >> 8);
+
 	return csum_add(csum, (__force __wsum)sum);
 }
 
@@ -102,10 +104,7 @@ csum_block_add_ext(__wsum csum, __wsum csum2, int offset, int len)
 static inline __wsum
 csum_block_sub(__wsum csum, __wsum csum2, int offset)
 {
-	u32 sum = (__force u32)csum2;
-	if (offset&1)
-		sum = ((sum&0xFF00FF)<<8)+((sum>>8)&0xFF00FF);
-	return csum_sub(csum, (__force __wsum)sum);
+	return csum_block_add(csum, ~csum2, offset);
 }
 
 static inline __wsum csum_unfold(__sum16 n)

^ permalink raw reply related	[flat|nested] 11+ messages in thread
* Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap 2016-03-08 22:42 [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap Alexander Duyck @ 2016-03-08 23:25 ` Joe Perches 2016-03-09 5:23 ` Alexander Duyck 2016-03-09 10:54 ` David Laight 0 siblings, 2 replies; 11+ messages in thread From: Joe Perches @ 2016-03-08 23:25 UTC (permalink / raw) To: Alexander Duyck, netdev, davem On Tue, 2016-03-08 at 14:42 -0800, Alexander Duyck wrote: > The code for csum_block_add was doing a funky byteswap to swap the even and > odd bytes of the checksum if the offset was odd. Instead of doing this we > can save ourselves some trouble and just shift by 8 as this should have the > same effect in terms of the final checksum value and only requires one > instruction. 3 instructions? > In addition we can update csum_block_sub to just use csum_block_add with a > inverse value for csum2. This way we follow the same code path as > csum_block_add without having to duplicate it. > > Signed-off-by: Alexander Duyck <aduyck@mirantis.com> > --- > include/net/checksum.h | 11 +++++------ > 1 file changed, 5 insertions(+), 6 deletions(-) > > diff --git a/include/net/checksum.h b/include/net/checksum.h > index 10a16b5bd1c7..f9fac66c0e66 100644 > --- a/include/net/checksum.h > +++ b/include/net/checksum.h > @@ -88,8 +88,10 @@ static inline __wsum > csum_block_add(__wsum csum, __wsum csum2, int offset) > { > u32 sum = (__force u32)csum2; > - if (offset&1) > - sum = ((sum&0xFF00FF)<<8)+((sum>>8)&0xFF00FF); > + > + if (offset & 1) > + sum = (sum << 24) + (sum >> 8); Maybe use ror32(sum, 8); or maybe something like: { u32 sum; /* rotated csum2 of odd offset will be the right checksum */ if (offset & 1) sum = ror32((__force u32)csum2, 8); else sum = (__force u32)csum2; ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap 2016-03-08 23:25 ` Joe Perches @ 2016-03-09 5:23 ` Alexander Duyck 2016-03-09 5:50 ` Joe Perches 2016-03-09 10:54 ` David Laight 1 sibling, 1 reply; 11+ messages in thread From: Alexander Duyck @ 2016-03-09 5:23 UTC (permalink / raw) To: Joe Perches; +Cc: Alexander Duyck, Netdev, David Miller On Tue, Mar 8, 2016 at 3:25 PM, Joe Perches <joe@perches.com> wrote: > On Tue, 2016-03-08 at 14:42 -0800, Alexander Duyck wrote: >> The code for csum_block_add was doing a funky byteswap to swap the even and >> odd bytes of the checksum if the offset was odd. Instead of doing this we >> can save ourselves some trouble and just shift by 8 as this should have the >> same effect in terms of the final checksum value and only requires one >> instruction. > > 3 instructions? I was talking about just the one ror vs mov, shl, shr, and ,and, add. I assume when you say 3 you are including the test and either some form of conditional move or jump? >> In addition we can update csum_block_sub to just use csum_block_add with a >> inverse value for csum2. This way we follow the same code path as >> csum_block_add without having to duplicate it. >> >> Signed-off-by: Alexander Duyck <aduyck@mirantis.com> >> --- >> include/net/checksum.h | 11 +++++------ >> 1 file changed, 5 insertions(+), 6 deletions(-) >> >> diff --git a/include/net/checksum.h b/include/net/checksum.h >> index 10a16b5bd1c7..f9fac66c0e66 100644 >> --- a/include/net/checksum.h >> +++ b/include/net/checksum.h >> @@ -88,8 +88,10 @@ static inline __wsum >> csum_block_add(__wsum csum, __wsum csum2, int offset) >> { >> u32 sum = (__force u32)csum2; >> - if (offset&1) >> - sum = ((sum&0xFF00FF)<<8)+((sum>>8)&0xFF00FF); >> + >> + if (offset & 1) >> + sum = (sum << 24) + (sum >> 8); > > Maybe use ror32(sum, 8); I was actually thinking I could use something like this. I didn't realize it was even available. 
> or maybe something like: > > { > u32 sum; > > /* rotated csum2 of odd offset will be the right checksum */ > if (offset & 1) > sum = ror32((__force u32)csum2, 8); > else > sum = (__force u32)csum2; > Any specific reason for breaking it up like this? It seems like it was easier to just have sum be assigned first and then rotating it if needed. What is gained by splitting the assignment up over two different calls? ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap 2016-03-09 5:23 ` Alexander Duyck @ 2016-03-09 5:50 ` Joe Perches 2016-03-09 6:08 ` Alexander Duyck 0 siblings, 1 reply; 11+ messages in thread From: Joe Perches @ 2016-03-09 5:50 UTC (permalink / raw) To: Alexander Duyck; +Cc: Alexander Duyck, Netdev, David Miller On Tue, 2016-03-08 at 21:23 -0800, Alexander Duyck wrote: > On Tue, Mar 8, 2016 at 3:25 PM, Joe Perches <joe@perches.com> wrote: > > On Tue, 2016-03-08 at 14:42 -0800, Alexander Duyck wrote: > > > The code for csum_block_add was doing a funky byteswap to swap the even and > > > odd bytes of the checksum if the offset was odd. Instead of doing this we > > > can save ourselves some trouble and just shift by 8 as this should have the > > > same effect in terms of the final checksum value and only requires one > > > instruction. > > 3 instructions? > I was talking about just the one ror vs mov, shl, shr, and ,and, add. > > I assume when you say 3 you are including the test and either some > form of conditional move or jump? Yeah, instruction count also depends on architecture (arm/x86/ppc...) > > > diff --git a/include/net/checksum.h b/include/net/checksum.h [] > > > @@ -88,8 +88,10 @@ static inline __wsum > > > csum_block_add(__wsum csum, __wsum csum2, int offset) > > > { > > > u32 sum = (__force u32)csum2; > > > - if (offset&1) > > > - sum = ((sum&0xFF00FF)<<8)+((sum>>8)&0xFF00FF); > > > + > > > + if (offset & 1) > > > + sum = (sum << 24) + (sum >> 8); > > Maybe use ror32(sum, 8); > I was actually thinking I could use something like this. I didn't > realize it was even available. Now you know: bitops.h > > or maybe something like: > > > > { > > u32 sum; > > > > /* rotated csum2 of odd offset will be the right checksum */ > > if (offset & 1) > > sum = ror32((__force u32)csum2, 8); > > else > > sum = (__force u32)csum2; > > > Any specific reason for breaking it up like this? 
It seems like it > was easier to just have sum be assigned first and then rotating it if > needed. What is gained by splitting the assignment up over two > different calls? It's only for reader clarity where a comment could be useful. The compiler output shouldn't change. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap 2016-03-09 5:50 ` Joe Perches @ 2016-03-09 6:08 ` Alexander Duyck 2016-03-09 6:31 ` Tom Herbert 0 siblings, 1 reply; 11+ messages in thread From: Alexander Duyck @ 2016-03-09 6:08 UTC (permalink / raw) To: Joe Perches; +Cc: Alexander Duyck, Netdev, David Miller On Tue, Mar 8, 2016 at 9:50 PM, Joe Perches <joe@perches.com> wrote: > On Tue, 2016-03-08 at 21:23 -0800, Alexander Duyck wrote: >> On Tue, Mar 8, 2016 at 3:25 PM, Joe Perches <joe@perches.com> wrote: >> > On Tue, 2016-03-08 at 14:42 -0800, Alexander Duyck wrote: >> > > The code for csum_block_add was doing a funky byteswap to swap the even and >> > > odd bytes of the checksum if the offset was odd. Instead of doing this we >> > > can save ourselves some trouble and just shift by 8 as this should have the >> > > same effect in terms of the final checksum value and only requires one >> > > instruction. >> > 3 instructions? >> I was talking about just the one ror vs mov, shl, shr, and ,and, add. >> >> I assume when you say 3 you are including the test and either some >> form of conditional move or jump? > > Yeah, instruction count also depends on architecture (arm/x86/ppc...) Right. But the general idea is that rotate is an instruction most architectures have. I haven't heard of an instruction that swaps even and odd bytes of a 32 bit word. >> > > diff --git a/include/net/checksum.h b/include/net/checksum.h > [] >> > > @@ -88,8 +88,10 @@ static inline __wsum >> > > csum_block_add(__wsum csum, __wsum csum2, int offset) >> > > { >> > > u32 sum = (__force u32)csum2; >> > > - if (offset&1) >> > > - sum = ((sum&0xFF00FF)<<8)+((sum>>8)&0xFF00FF); >> > > + >> > > + if (offset & 1) >> > > + sum = (sum << 24) + (sum >> 8); >> > Maybe use ror32(sum, 8); >> I was actually thinking I could use something like this. I didn't >> realize it was even available. 
> > Now you know: bitops.h > >> > or maybe something like: >> > >> > { >> > u32 sum; >> > >> > /* rotated csum2 of odd offset will be the right checksum */ >> > if (offset & 1) >> > sum = ror32((__force u32)csum2, 8); >> > else >> > sum = (__force u32)csum2; >> > >> Any specific reason for breaking it up like this? It seems like it >> was easier to just have sum be assigned first and then rotating it if >> needed. What is gained by splitting the assignment up over two >> different calls? > > It's only for reader clarity where a comment could be useful. > The compiler output shouldn't change. Okay, well I can add a one line comment about aligning to a 16b boundary for clarity. - Alex ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap 2016-03-09 6:08 ` Alexander Duyck @ 2016-03-09 6:31 ` Tom Herbert 2016-03-09 16:08 ` Alexander Duyck 0 siblings, 1 reply; 11+ messages in thread From: Tom Herbert @ 2016-03-09 6:31 UTC (permalink / raw) To: Alexander Duyck; +Cc: Joe Perches, Alexander Duyck, Netdev, David Miller On Tue, Mar 8, 2016 at 10:08 PM, Alexander Duyck <alexander.duyck@gmail.com> wrote: > On Tue, Mar 8, 2016 at 9:50 PM, Joe Perches <joe@perches.com> wrote: >> On Tue, 2016-03-08 at 21:23 -0800, Alexander Duyck wrote: >>> On Tue, Mar 8, 2016 at 3:25 PM, Joe Perches <joe@perches.com> wrote: >>> > On Tue, 2016-03-08 at 14:42 -0800, Alexander Duyck wrote: >>> > > The code for csum_block_add was doing a funky byteswap to swap the even and >>> > > odd bytes of the checksum if the offset was odd. Instead of doing this we >>> > > can save ourselves some trouble and just shift by 8 as this should have the >>> > > same effect in terms of the final checksum value and only requires one >>> > > instruction. >>> > 3 instructions? >>> I was talking about just the one ror vs mov, shl, shr, and ,and, add. >>> >>> I assume when you say 3 you are including the test and either some >>> form of conditional move or jump? >> >> Yeah, instruction count also depends on architecture (arm/x86/ppc...) > > Right. But the general idea is that rotate is an instruction most > architectures have. I haven't heard of an instruction that swaps even > and odd bytes of a 32 bit word. > Yes, I took a look inlining these. #define rol32(V, X) ({ \ int word = V; \ if (__builtin_constant_p(X)) \ asm("roll $" #X ",%[word]\n\t" \ : [word] "=r" (word)); \ else \ asm("roll %%cl,%[word]\n\t" \ : [word] "=r" (word) \ : "c" (X)); \ word; \ }) With this I'm seeing a nice speedup in jhash which uses a lot of rol32s... 
>>> > > diff --git a/include/net/checksum.h b/include/net/checksum.h >> [] >>> > > @@ -88,8 +88,10 @@ static inline __wsum >>> > > csum_block_add(__wsum csum, __wsum csum2, int offset) >>> > > { >>> > > u32 sum = (__force u32)csum2; >>> > > - if (offset&1) >>> > > - sum = ((sum&0xFF00FF)<<8)+((sum>>8)&0xFF00FF); >>> > > + >>> > > + if (offset & 1) >>> > > + sum = (sum << 24) + (sum >> 8); >>> > Maybe use ror32(sum, 8); >>> I was actually thinking I could use something like this. I didn't >>> realize it was even available. >> >> Now you know: bitops.h >> >>> > or maybe something like: >>> > >>> > { >>> > u32 sum; >>> > >>> > /* rotated csum2 of odd offset will be the right checksum */ >>> > if (offset & 1) >>> > sum = ror32((__force u32)csum2, 8); >>> > else >>> > sum = (__force u32)csum2; >>> > >>> Any specific reason for breaking it up like this? It seems like it >>> was easier to just have sum be assigned first and then rotating it if >>> needed. What is gained by splitting the assignment up over two >>> different calls? >> >> It's only for reader clarity where a comment could be useful. >> The compiler output shouldn't change. > > Okay, well I can add a one line comment about aligning to a 16b > boundary for clarity. > > - Alex ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap 2016-03-09 6:31 ` Tom Herbert @ 2016-03-09 16:08 ` Alexander Duyck 2016-03-10 0:18 ` Joe Perches 0 siblings, 1 reply; 11+ messages in thread From: Alexander Duyck @ 2016-03-09 16:08 UTC (permalink / raw) To: Tom Herbert; +Cc: Joe Perches, Alexander Duyck, Netdev, David Miller On Tue, Mar 8, 2016 at 10:31 PM, Tom Herbert <tom@herbertland.com> wrote: > On Tue, Mar 8, 2016 at 10:08 PM, Alexander Duyck > <alexander.duyck@gmail.com> wrote: >> On Tue, Mar 8, 2016 at 9:50 PM, Joe Perches <joe@perches.com> wrote: >>> On Tue, 2016-03-08 at 21:23 -0800, Alexander Duyck wrote: >>>> On Tue, Mar 8, 2016 at 3:25 PM, Joe Perches <joe@perches.com> wrote: >>>> > On Tue, 2016-03-08 at 14:42 -0800, Alexander Duyck wrote: >>>> > > The code for csum_block_add was doing a funky byteswap to swap the even and >>>> > > odd bytes of the checksum if the offset was odd. Instead of doing this we >>>> > > can save ourselves some trouble and just shift by 8 as this should have the >>>> > > same effect in terms of the final checksum value and only requires one >>>> > > instruction. >>>> > 3 instructions? >>>> I was talking about just the one ror vs mov, shl, shr, and ,and, add. >>>> >>>> I assume when you say 3 you are including the test and either some >>>> form of conditional move or jump? >>> >>> Yeah, instruction count also depends on architecture (arm/x86/ppc...) >> >> Right. But the general idea is that rotate is an instruction most >> architectures have. I haven't heard of an instruction that swaps even >> and odd bytes of a 32 bit word. >> > Yes, I took a look inlining these. > > #define rol32(V, X) ({ \ > int word = V; \ > if (__builtin_constant_p(X)) \ > asm("roll $" #X ",%[word]\n\t" \ > : [word] "=r" (word)); \ > else \ > asm("roll %%cl,%[word]\n\t" \ > : [word] "=r" (word) \ > : "c" (X)); \ > word; \ > }) > > With this I'm seeing a nice speedup in jhash which uses a lot of rol32s... 
Is gcc really not converting the rol32 calls into rotates? If we need this type of code in order to get the rotates to occur as expected then maybe we need to look at doing arch specific versions of the functions in bitops.h in order to improve the performance since I know these calls are used in some performance critical paths such as crypto and hashing. - Alex ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap 2016-03-09 16:08 ` Alexander Duyck @ 2016-03-10 0:18 ` Joe Perches 2016-03-10 0:58 ` Tom Herbert 0 siblings, 1 reply; 11+ messages in thread From: Joe Perches @ 2016-03-10 0:18 UTC (permalink / raw) To: Alexander Duyck, Tom Herbert; +Cc: Alexander Duyck, Netdev, David Miller On Wed, 2016-03-09 at 08:08 -0800, Alexander Duyck wrote: > On Tue, Mar 8, 2016 at 10:31 PM, Tom Herbert <tom@herbertland.com> wrote: > > I took a look inlining these. > > > > #define rol32(V, X) ({ \ > > int word = V; \ > > if (__builtin_constant_p(X)) \ > > asm("roll $" #X ",%[word]\n\t" \ > > : [word] "=r" (word)); \ > > else \ > > asm("roll %%cl,%[word]\n\t" \ > > : [word] "=r" (word) \ > > : "c" (X)); \ > > word; \ > > }) > > > > With this I'm seeing a nice speedup in jhash which uses a lot of rol32s... > Is gcc really not converting the rol32 calls into rotates? No, it is. The difference in the object code with the asm for instance is: (old, compiled with gcc 5.3.1) <jhash_2words.constprop.5>: 84e: 81 ee 09 41 52 21 sub $0x21524109,%esi 854: 81 ef 09 41 52 21 sub $0x21524109,%edi 85a: 55 push %rbp 85b: 89 f0 mov %esi,%eax 85d: 89 f2 mov %esi,%edx 85f: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 866 <jhash_2words.constprop.5+0x18> 866: c1 c2 0e rol $0xe,%edx 869: 35 f7 be ad de xor $0xdeadbef7,%eax 86e: 48 89 e5 mov %rsp,%rbp 871: 29 d0 sub %edx,%eax 873: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 87a <jhash_2words.constprop.5+0x2c> 87a: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 881 <jhash_2words.constprop.5+0x33> 881: 89 c2 mov %eax,%edx 883: 31 c7 xor %eax,%edi 885: c1 c2 0b rol $0xb,%edx 888: 29 d7 sub %edx,%edi 88a: 89 fa mov %edi,%edx 88c: 31 fe xor %edi,%esi 88e: c1 ca 07 ror $0x7,%edx 891: 29 d6 sub %edx,%esi 893: 89 f2 mov %esi,%edx 895: 31 f0 xor %esi,%eax 897: c1 c2 10 rol $0x10,%edx 89a: 29 d0 sub %edx,%eax 89c: 89 c2 mov %eax,%edx 89e: 31 c7 xor %eax,%edi 8a0: c1 c2 04 rol $0x4,%edx 8a3: 29 d7 sub %edx,%edi 
8a5: 31 fe xor %edi,%esi 8a7: c1 c7 0e rol $0xe,%edi 8aa: 29 fe sub %edi,%esi 8ac: 31 f0 xor %esi,%eax 8ae: c1 ce 08 ror $0x8,%esi 8b1: 29 f0 sub %esi,%eax 8b3: 5d pop %rbp 8b4: c3 retq vs Tom's asm 000000000000084e <jhash_2words.constprop.5>: 84e: 81 ee 09 41 52 21 sub $0x21524109,%esi 854: 8d 87 f7 be ad de lea -0x21524109(%rdi),%eax 85a: 55 push %rbp 85b: 89 f2 mov %esi,%edx 85d: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 864 <jhash_2words.constprop.5+0x16> 864: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 86b <jhash_2words.constprop.5+0x1d> 86b: 81 f2 f7 be ad de xor $0xdeadbef7,%edx 871: 48 89 e5 mov %rsp,%rbp 874: c1 c1 0e rol $0xe,%ecx 877: 29 ca sub %ecx,%edx 879: 31 d0 xor %edx,%eax 87b: c1 c7 0b rol $0xb,%edi 87e: 29 f8 sub %edi,%eax 880: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 887 <jhash_2words.constprop.5+0x39> 887: 31 c6 xor %eax,%esi 889: c1 c7 19 rol $0x19,%edi 88c: 29 fe sub %edi,%esi 88e: 31 f2 xor %esi,%edx 890: c1 c7 10 rol $0x10,%edi 893: 29 fa sub %edi,%edx 895: 31 d0 xor %edx,%eax 897: c1 c7 04 rol $0x4,%edi 89a: 29 f8 sub %edi,%eax 89c: 31 f0 xor %esi,%eax 89e: 29 c8 sub %ecx,%eax 8a0: 31 d0 xor %edx,%eax 8a2: 5d pop %rbp 8a3: c1 c2 18 rol $0x18,%edx 8a6: 29 d0 sub %edx,%eax 8a8: c3 retq > If we need this type of code in order to get the rotates to occur as > expected then maybe we need to look at doing arch specific versions of > the functions in bitops.h in order to improve the performance since I > know these calls are used in some performance critical paths such as > crypto and hashing. Yeah, maybe, but why couldn't gcc generate similar code as Tom's asm? (modulo the ripple reducing ror vs rol uses when the shift is > 16 ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap 2016-03-10 0:18 ` Joe Perches @ 2016-03-10 0:58 ` Tom Herbert 0 siblings, 0 replies; 11+ messages in thread From: Tom Herbert @ 2016-03-10 0:58 UTC (permalink / raw) To: Joe Perches; +Cc: Alexander Duyck, Alexander Duyck, Netdev, David Miller On Wed, Mar 9, 2016 at 4:18 PM, Joe Perches <joe@perches.com> wrote: > On Wed, 2016-03-09 at 08:08 -0800, Alexander Duyck wrote: >> On Tue, Mar 8, 2016 at 10:31 PM, Tom Herbert <tom@herbertland.com> wrote: >> > I took a look inlining these. >> > >> > #define rol32(V, X) ({ \ >> > int word = V; \ >> > if (__builtin_constant_p(X)) \ >> > asm("roll $" #X ",%[word]\n\t" \ >> > : [word] "=r" (word)); \ >> > else \ >> > asm("roll %%cl,%[word]\n\t" \ >> > : [word] "=r" (word) \ >> > : "c" (X)); \ >> > word; \ >> > }) >> > >> > With this I'm seeing a nice speedup in jhash which uses a lot of rol32s... >> Is gcc really not converting the rol32 calls into rotates? > > No, it is. 
> > The difference in the object code with the asm for instance is: > > (old, compiled with gcc 5.3.1) > > <jhash_2words.constprop.5>: > 84e: 81 ee 09 41 52 21 sub $0x21524109,%esi > 854: 81 ef 09 41 52 21 sub $0x21524109,%edi > 85a: 55 push %rbp > 85b: 89 f0 mov %esi,%eax > 85d: 89 f2 mov %esi,%edx > 85f: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 866 <jhash_2words.constprop.5+0x18> > 866: c1 c2 0e rol $0xe,%edx > 869: 35 f7 be ad de xor $0xdeadbef7,%eax > 86e: 48 89 e5 mov %rsp,%rbp > 871: 29 d0 sub %edx,%eax > 873: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 87a <jhash_2words.constprop.5+0x2c> > 87a: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 881 <jhash_2words.constprop.5+0x33> > 881: 89 c2 mov %eax,%edx > 883: 31 c7 xor %eax,%edi > 885: c1 c2 0b rol $0xb,%edx > 888: 29 d7 sub %edx,%edi > 88a: 89 fa mov %edi,%edx > 88c: 31 fe xor %edi,%esi > 88e: c1 ca 07 ror $0x7,%edx > 891: 29 d6 sub %edx,%esi > 893: 89 f2 mov %esi,%edx > 895: 31 f0 xor %esi,%eax > 897: c1 c2 10 rol $0x10,%edx > 89a: 29 d0 sub %edx,%eax > 89c: 89 c2 mov %eax,%edx > 89e: 31 c7 xor %eax,%edi > 8a0: c1 c2 04 rol $0x4,%edx > 8a3: 29 d7 sub %edx,%edi > 8a5: 31 fe xor %edi,%esi > 8a7: c1 c7 0e rol $0xe,%edi > 8aa: 29 fe sub %edi,%esi > 8ac: 31 f0 xor %esi,%eax > 8ae: c1 ce 08 ror $0x8,%esi > 8b1: 29 f0 sub %esi,%eax > 8b3: 5d pop %rbp > 8b4: c3 retq > > vs Tom's asm > > 000000000000084e <jhash_2words.constprop.5>: > 84e: 81 ee 09 41 52 21 sub $0x21524109,%esi > 854: 8d 87 f7 be ad de lea -0x21524109(%rdi),%eax > 85a: 55 push %rbp > 85b: 89 f2 mov %esi,%edx > 85d: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 864 <jhash_2words.constprop.5+0x16> > 864: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 86b <jhash_2words.constprop.5+0x1d> > 86b: 81 f2 f7 be ad de xor $0xdeadbef7,%edx > 871: 48 89 e5 mov %rsp,%rbp > 874: c1 c1 0e rol $0xe,%ecx > 877: 29 ca sub %ecx,%edx > 879: 31 d0 xor %edx,%eax > 87b: c1 c7 0b rol $0xb,%edi > 87e: 29 f8 sub %edi,%eax > 880: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 887 <jhash_2words.constprop.5+0x39> > 887: 
31 c6 xor %eax,%esi > 889: c1 c7 19 rol $0x19,%edi > 88c: 29 fe sub %edi,%esi > 88e: 31 f2 xor %esi,%edx > 890: c1 c7 10 rol $0x10,%edi > 893: 29 fa sub %edi,%edx > 895: 31 d0 xor %edx,%eax > 897: c1 c7 04 rol $0x4,%edi > 89a: 29 f8 sub %edi,%eax > 89c: 31 f0 xor %esi,%eax > 89e: 29 c8 sub %ecx,%eax > 8a0: 31 d0 xor %edx,%eax > 8a2: 5d pop %rbp > 8a3: c1 c2 18 rol $0x18,%edx > 8a6: 29 d0 sub %edx,%eax > 8a8: c3 retq > >> If we need this type of code in order to get the rotates to occur as >> expected then maybe we need to look at doing arch specific versions of >> the functions in bitops.h in order to improve the performance since I >> know these calls are used in some performance critical paths such as >> crypto and hashing. > > Yeah, maybe, but why couldn't gcc generate similar code > as Tom's asm? (modulo the ripple reducing ror vs rol uses > when the shift is > 16 I see gcc doing that now, not sure why I was seeing differences before.... ^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap
  2016-03-08 23:25 ` Joe Perches
  2016-03-09  5:23   ` Alexander Duyck
@ 2016-03-09 10:54   ` David Laight
  2016-03-09 16:03     ` Alexander Duyck
  1 sibling, 1 reply; 11+ messages in thread

From: David Laight @ 2016-03-09 10:54 UTC (permalink / raw)
To: 'Joe Perches', Alexander Duyck, netdev@vger.kernel.org, davem@davemloft.net

From: Joe Perches
> Sent: 08 March 2016 23:26
...
> > +
> > +	if (offset & 1)
> > +		sum = (sum << 24) + (sum >> 8);
>
> Maybe use ror32(sum, 8);
>
> or maybe something like:
>
> {
> 	u32 sum;
>
> 	/* rotated csum2 of odd offset will be the right checksum */
> 	if (offset & 1)
> 		sum = ror32((__force u32)csum2, 8);
> 	else
> 		sum = (__force u32)csum2;

Or even:
	sum = ror32((__force u32)csum2, (offset & 1) * 8);
to remove the conditional.
Assuming 'rotate by 0 bits' is valid.
If not add 16 to rotate by 16 or 24.

	David

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap 2016-03-09 10:54 ` David Laight @ 2016-03-09 16:03 ` Alexander Duyck 0 siblings, 0 replies; 11+ messages in thread From: Alexander Duyck @ 2016-03-09 16:03 UTC (permalink / raw) To: David Laight Cc: Joe Perches, Alexander Duyck, netdev@vger.kernel.org, davem@davemloft.net On Wed, Mar 9, 2016 at 2:54 AM, David Laight <David.Laight@aculab.com> wrote: > From: Joe Perches >> Sent: 08 March 2016 23:26 > ... >> > + >> > + if (offset & 1) >> > + sum = (sum << 24) + (sum >> 8); >> >> Maybe use ror32(sum, 8); >> >> or maybe something like: >> >> { >> u32 sum; >> >> /* rotated csum2 of odd offset will be the right checksum */ >> if (offset & 1) >> sum = ror32((__force u32)csum2, 8); >> else >> sum = (__force u32)csum2; > > Or even: > sum = ror32((__force u32)csum2, (offset & 1) * 8); > to remove the conditional. > Assuming 'rotate by 0 bits' is valid. > If not add 16 to rotate by 16 or 24. The problem is "ror %cl" can be significantly more expensive than just a "ror $8". In the case of x86 the difference is as much as 6 cycles or more on some of the older architectures so it may be better to just do the rotate by 8 and then an "and" or "test" and "cmovne" which is what this compiles into right now. - Alex ^ permalink raw reply [flat|nested] 11+ messages in thread