* [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap
@ 2016-03-08 22:42 Alexander Duyck
2016-03-08 23:25 ` Joe Perches
0 siblings, 1 reply; 11+ messages in thread
From: Alexander Duyck @ 2016-03-08 22:42 UTC (permalink / raw)
To: netdev, davem
The code for csum_block_add was doing a funky byteswap to swap the even and
odd bytes of the checksum if the offset was odd. Instead of doing this we
can save ourselves some trouble and just shift by 8 as this should have the
same effect in terms of the final checksum value and only requires one
instruction.
In addition we can update csum_block_sub to just use csum_block_add with an
inverse value for csum2. This way we follow the same code path as
csum_block_add without having to duplicate it.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
include/net/checksum.h | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/include/net/checksum.h b/include/net/checksum.h
index 10a16b5bd1c7..f9fac66c0e66 100644
--- a/include/net/checksum.h
+++ b/include/net/checksum.h
@@ -88,8 +88,10 @@ static inline __wsum
csum_block_add(__wsum csum, __wsum csum2, int offset)
{
u32 sum = (__force u32)csum2;
- if (offset&1)
- sum = ((sum&0xFF00FF)<<8)+((sum>>8)&0xFF00FF);
+
+ if (offset & 1)
+ sum = (sum << 24) + (sum >> 8);
+
return csum_add(csum, (__force __wsum)sum);
}
@@ -102,10 +104,7 @@ csum_block_add_ext(__wsum csum, __wsum csum2, int offset, int len)
static inline __wsum
csum_block_sub(__wsum csum, __wsum csum2, int offset)
{
- u32 sum = (__force u32)csum2;
- if (offset&1)
- sum = ((sum&0xFF00FF)<<8)+((sum>>8)&0xFF00FF);
- return csum_sub(csum, (__force __wsum)sum);
+ return csum_block_add(csum, ~csum2, offset);
}
static inline __wsum csum_unfold(__sum16 n)
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap
2016-03-08 22:42 [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap Alexander Duyck
@ 2016-03-08 23:25 ` Joe Perches
2016-03-09 5:23 ` Alexander Duyck
2016-03-09 10:54 ` David Laight
0 siblings, 2 replies; 11+ messages in thread
From: Joe Perches @ 2016-03-08 23:25 UTC (permalink / raw)
To: Alexander Duyck, netdev, davem
On Tue, 2016-03-08 at 14:42 -0800, Alexander Duyck wrote:
> The code for csum_block_add was doing a funky byteswap to swap the even and
> odd bytes of the checksum if the offset was odd. Instead of doing this we
> can save ourselves some trouble and just shift by 8 as this should have the
> same effect in terms of the final checksum value and only requires one
> instruction.
3 instructions?
> In addition we can update csum_block_sub to just use csum_block_add with an
> inverse value for csum2. This way we follow the same code path as
> csum_block_add without having to duplicate it.
>
> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
> ---
> include/net/checksum.h | 11 +++++------
> 1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/include/net/checksum.h b/include/net/checksum.h
> index 10a16b5bd1c7..f9fac66c0e66 100644
> --- a/include/net/checksum.h
> +++ b/include/net/checksum.h
> @@ -88,8 +88,10 @@ static inline __wsum
> csum_block_add(__wsum csum, __wsum csum2, int offset)
> {
> u32 sum = (__force u32)csum2;
> - if (offset&1)
> - sum = ((sum&0xFF00FF)<<8)+((sum>>8)&0xFF00FF);
> +
> + if (offset & 1)
> + sum = (sum << 24) + (sum >> 8);
Maybe use ror32(sum, 8);
or maybe something like:
{
u32 sum;
/* rotated csum2 of odd offset will be the right checksum */
if (offset & 1)
sum = ror32((__force u32)csum2, 8);
else
sum = (__force u32)csum2;
* Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap
2016-03-08 23:25 ` Joe Perches
@ 2016-03-09 5:23 ` Alexander Duyck
2016-03-09 5:50 ` Joe Perches
2016-03-09 10:54 ` David Laight
1 sibling, 1 reply; 11+ messages in thread
From: Alexander Duyck @ 2016-03-09 5:23 UTC (permalink / raw)
To: Joe Perches; +Cc: Alexander Duyck, Netdev, David Miller
On Tue, Mar 8, 2016 at 3:25 PM, Joe Perches <joe@perches.com> wrote:
> On Tue, 2016-03-08 at 14:42 -0800, Alexander Duyck wrote:
>> The code for csum_block_add was doing a funky byteswap to swap the even and
>> odd bytes of the checksum if the offset was odd. Instead of doing this we
>> can save ourselves some trouble and just shift by 8 as this should have the
>> same effect in terms of the final checksum value and only requires one
>> instruction.
>
> 3 instructions?
I was talking about just the one ror vs mov, shl, shr, and, and, add.
I assume when you say 3 you are including the test and either some
form of conditional move or jump?
>> In addition we can update csum_block_sub to just use csum_block_add with an
>> inverse value for csum2. This way we follow the same code path as
>> csum_block_add without having to duplicate it.
>>
>> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
>> ---
>> include/net/checksum.h | 11 +++++------
>> 1 file changed, 5 insertions(+), 6 deletions(-)
>>
>> diff --git a/include/net/checksum.h b/include/net/checksum.h
>> index 10a16b5bd1c7..f9fac66c0e66 100644
>> --- a/include/net/checksum.h
>> +++ b/include/net/checksum.h
>> @@ -88,8 +88,10 @@ static inline __wsum
>> csum_block_add(__wsum csum, __wsum csum2, int offset)
>> {
>> u32 sum = (__force u32)csum2;
>> - if (offset&1)
>> - sum = ((sum&0xFF00FF)<<8)+((sum>>8)&0xFF00FF);
>> +
>> + if (offset & 1)
>> + sum = (sum << 24) + (sum >> 8);
>
> Maybe use ror32(sum, 8);
I was actually thinking I could use something like this. I didn't
realize it was even available.
> or maybe something like:
>
> {
> u32 sum;
>
> /* rotated csum2 of odd offset will be the right checksum */
> if (offset & 1)
> sum = ror32((__force u32)csum2, 8);
> else
> sum = (__force u32)csum2;
>
Any specific reason for breaking it up like this? It seems like it
was easier to just have sum be assigned first and then rotating it if
needed. What is gained by splitting the assignment up over two
different calls?
* Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap
2016-03-09 5:23 ` Alexander Duyck
@ 2016-03-09 5:50 ` Joe Perches
2016-03-09 6:08 ` Alexander Duyck
0 siblings, 1 reply; 11+ messages in thread
From: Joe Perches @ 2016-03-09 5:50 UTC (permalink / raw)
To: Alexander Duyck; +Cc: Alexander Duyck, Netdev, David Miller
On Tue, 2016-03-08 at 21:23 -0800, Alexander Duyck wrote:
> On Tue, Mar 8, 2016 at 3:25 PM, Joe Perches <joe@perches.com> wrote:
> > On Tue, 2016-03-08 at 14:42 -0800, Alexander Duyck wrote:
> > > The code for csum_block_add was doing a funky byteswap to swap the even and
> > > odd bytes of the checksum if the offset was odd. Instead of doing this we
> > > can save ourselves some trouble and just shift by 8 as this should have the
> > > same effect in terms of the final checksum value and only requires one
> > > instruction.
> > 3 instructions?
> I was talking about just the one ror vs mov, shl, shr, and, and, add.
>
> I assume when you say 3 you are including the test and either some
> form of conditional move or jump?
Yeah, instruction count also depends on architecture (arm/x86/ppc...)
> > > diff --git a/include/net/checksum.h b/include/net/checksum.h
[]
> > > @@ -88,8 +88,10 @@ static inline __wsum
> > > csum_block_add(__wsum csum, __wsum csum2, int offset)
> > > {
> > > u32 sum = (__force u32)csum2;
> > > - if (offset&1)
> > > - sum = ((sum&0xFF00FF)<<8)+((sum>>8)&0xFF00FF);
> > > +
> > > + if (offset & 1)
> > > + sum = (sum << 24) + (sum >> 8);
> > Maybe use ror32(sum, 8);
> I was actually thinking I could use something like this. I didn't
> realize it was even available.
Now you know: bitops.h
> > or maybe something like:
> >
> > {
> > u32 sum;
> >
> > /* rotated csum2 of odd offset will be the right checksum */
> > if (offset & 1)
> > sum = ror32((__force u32)csum2, 8);
> > else
> > sum = (__force u32)csum2;
> >
> Any specific reason for breaking it up like this? It seems like it
> was easier to just have sum be assigned first and then rotating it if
> needed. What is gained by splitting the assignment up over two
> different calls?
It's only for reader clarity where a comment could be useful.
The compiler output shouldn't change.
* Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap
2016-03-09 5:50 ` Joe Perches
@ 2016-03-09 6:08 ` Alexander Duyck
2016-03-09 6:31 ` Tom Herbert
0 siblings, 1 reply; 11+ messages in thread
From: Alexander Duyck @ 2016-03-09 6:08 UTC (permalink / raw)
To: Joe Perches; +Cc: Alexander Duyck, Netdev, David Miller
On Tue, Mar 8, 2016 at 9:50 PM, Joe Perches <joe@perches.com> wrote:
> On Tue, 2016-03-08 at 21:23 -0800, Alexander Duyck wrote:
>> On Tue, Mar 8, 2016 at 3:25 PM, Joe Perches <joe@perches.com> wrote:
>> > On Tue, 2016-03-08 at 14:42 -0800, Alexander Duyck wrote:
>> > > The code for csum_block_add was doing a funky byteswap to swap the even and
>> > > odd bytes of the checksum if the offset was odd. Instead of doing this we
>> > > can save ourselves some trouble and just shift by 8 as this should have the
>> > > same effect in terms of the final checksum value and only requires one
>> > > instruction.
>> > 3 instructions?
>> I was talking about just the one ror vs mov, shl, shr, and ,and, add.
>>
>> I assume when you say 3 you are including the test and either some
>> form of conditional move or jump?
>
> Yeah, instruction count also depends on architecture (arm/x86/ppc...)
Right. But the general idea is that rotate is an instruction most
architectures have. I haven't heard of an instruction that swaps even
and odd bytes of a 32 bit word.
>> > > diff --git a/include/net/checksum.h b/include/net/checksum.h
> []
>> > > @@ -88,8 +88,10 @@ static inline __wsum
>> > > csum_block_add(__wsum csum, __wsum csum2, int offset)
>> > > {
>> > > u32 sum = (__force u32)csum2;
>> > > - if (offset&1)
>> > > - sum = ((sum&0xFF00FF)<<8)+((sum>>8)&0xFF00FF);
>> > > +
>> > > + if (offset & 1)
>> > > + sum = (sum << 24) + (sum >> 8);
>> > Maybe use ror32(sum, 8);
>> I was actually thinking I could use something like this. I didn't
>> realize it was even available.
>
> Now you know: bitops.h
>
>> > or maybe something like:
>> >
>> > {
>> > u32 sum;
>> >
>> > /* rotated csum2 of odd offset will be the right checksum */
>> > if (offset & 1)
>> > sum = ror32((__force u32)csum2, 8);
>> > else
>> > sum = (__force u32)csum2;
>> >
>> Any specific reason for breaking it up like this? It seems like it
>> was easier to just have sum be assigned first and then rotating it if
>> needed. What is gained by splitting the assignment up over two
>> different calls?
>
> It's only for reader clarity where a comment could be useful.
> The compiler output shouldn't change.
Okay, well I can add a one line comment about aligning to a 16b
boundary for clarity.
- Alex
* Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap
2016-03-09 6:08 ` Alexander Duyck
@ 2016-03-09 6:31 ` Tom Herbert
2016-03-09 16:08 ` Alexander Duyck
0 siblings, 1 reply; 11+ messages in thread
From: Tom Herbert @ 2016-03-09 6:31 UTC (permalink / raw)
To: Alexander Duyck; +Cc: Joe Perches, Alexander Duyck, Netdev, David Miller
On Tue, Mar 8, 2016 at 10:08 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Tue, Mar 8, 2016 at 9:50 PM, Joe Perches <joe@perches.com> wrote:
>> On Tue, 2016-03-08 at 21:23 -0800, Alexander Duyck wrote:
>>> On Tue, Mar 8, 2016 at 3:25 PM, Joe Perches <joe@perches.com> wrote:
>>> > On Tue, 2016-03-08 at 14:42 -0800, Alexander Duyck wrote:
>>> > > The code for csum_block_add was doing a funky byteswap to swap the even and
>>> > > odd bytes of the checksum if the offset was odd. Instead of doing this we
>>> > > can save ourselves some trouble and just shift by 8 as this should have the
>>> > > same effect in terms of the final checksum value and only requires one
>>> > > instruction.
>>> > 3 instructions?
>>> I was talking about just the one ror vs mov, shl, shr, and, and, add.
>>>
>>> I assume when you say 3 you are including the test and either some
>>> form of conditional move or jump?
>>
>> Yeah, instruction count also depends on architecture (arm/x86/ppc...)
>
> Right. But the general idea is that rotate is an instruction most
> architectures have. I haven't heard of an instruction that swaps even
> and odd bytes of a 32 bit word.
>
Yes, I took a look inlining these.
#define rol32(V, X) ({ \
int word = V; \
if (__builtin_constant_p(X)) \
asm("roll $" #X ",%[word]\n\t" \
: [word] "+r" (word)); \
else \
asm("roll %%cl,%[word]\n\t" \
: [word] "+r" (word) \
: "c" (X)); \
word; \
})
With this I'm seeing a nice speedup in jhash which uses a lot of rol32s...
>>> > > diff --git a/include/net/checksum.h b/include/net/checksum.h
>> []
>>> > > @@ -88,8 +88,10 @@ static inline __wsum
>>> > > csum_block_add(__wsum csum, __wsum csum2, int offset)
>>> > > {
>>> > > u32 sum = (__force u32)csum2;
>>> > > - if (offset&1)
>>> > > - sum = ((sum&0xFF00FF)<<8)+((sum>>8)&0xFF00FF);
>>> > > +
>>> > > + if (offset & 1)
>>> > > + sum = (sum << 24) + (sum >> 8);
>>> > Maybe use ror32(sum, 8);
>>> I was actually thinking I could use something like this. I didn't
>>> realize it was even available.
>>
>> Now you know: bitops.h
>>
>>> > or maybe something like:
>>> >
>>> > {
>>> > u32 sum;
>>> >
>>> > /* rotated csum2 of odd offset will be the right checksum */
>>> > if (offset & 1)
>>> > sum = ror32((__force u32)csum2, 8);
>>> > else
>>> > sum = (__force u32)csum2;
>>> >
>>> Any specific reason for breaking it up like this? It seems like it
>>> was easier to just have sum be assigned first and then rotating it if
>>> needed. What is gained by splitting the assignment up over two
>>> different calls?
>>
>> It's only for reader clarity where a comment could be useful.
>> The compiler output shouldn't change.
>
> Okay, well I can add a one line comment about aligning to a 16b
> boundary for clarity.
>
> - Alex
* RE: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap
2016-03-08 23:25 ` Joe Perches
2016-03-09 5:23 ` Alexander Duyck
@ 2016-03-09 10:54 ` David Laight
2016-03-09 16:03 ` Alexander Duyck
1 sibling, 1 reply; 11+ messages in thread
From: David Laight @ 2016-03-09 10:54 UTC (permalink / raw)
To: 'Joe Perches', Alexander Duyck, netdev@vger.kernel.org,
davem@davemloft.net
From: Joe Perches
> Sent: 08 March 2016 23:26
...
> > +
> > + if (offset & 1)
> > + sum = (sum << 24) + (sum >> 8);
>
> Maybe use ror32(sum, 8);
>
> or maybe something like:
>
> {
> u32 sum;
>
> /* rotated csum2 of odd offset will be the right checksum */
> if (offset & 1)
> sum = ror32((__force u32)csum2, 8);
> else
> sum = (__force u32)csum2;
Or even:
sum = ror32((__force u32)csum2, (offset & 1) * 8);
to remove the conditional.
Assuming 'rotate by 0 bits' is valid.
If not add 16 to rotate by 16 or 24.
David
* Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap
2016-03-09 10:54 ` David Laight
@ 2016-03-09 16:03 ` Alexander Duyck
0 siblings, 0 replies; 11+ messages in thread
From: Alexander Duyck @ 2016-03-09 16:03 UTC (permalink / raw)
To: David Laight
Cc: Joe Perches, Alexander Duyck, netdev@vger.kernel.org,
davem@davemloft.net
On Wed, Mar 9, 2016 at 2:54 AM, David Laight <David.Laight@aculab.com> wrote:
> From: Joe Perches
>> Sent: 08 March 2016 23:26
> ...
>> > +
>> > + if (offset & 1)
>> > + sum = (sum << 24) + (sum >> 8);
>>
>> Maybe use ror32(sum, 8);
>>
>> or maybe something like:
>>
>> {
>> u32 sum;
>>
>> /* rotated csum2 of odd offset will be the right checksum */
>> if (offset & 1)
>> sum = ror32((__force u32)csum2, 8);
>> else
>> sum = (__force u32)csum2;
>
> Or even:
> sum = ror32((__force u32)csum2, (offset & 1) * 8);
> to remove the conditional.
> Assuming 'rotate by 0 bits' is valid.
> If not add 16 to rotate by 16 or 24.
The problem is "ror %cl" can be significantly more expensive than just
a "ror $8". In the case of x86 the difference is as much as 6 cycles
or more on some of the older architectures so it may be better to just
do the rotate by 8 and then an "and" or "test" and "cmovne" which is
what this compiles into right now.
- Alex
* Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap
2016-03-09 6:31 ` Tom Herbert
@ 2016-03-09 16:08 ` Alexander Duyck
2016-03-10 0:18 ` Joe Perches
0 siblings, 1 reply; 11+ messages in thread
From: Alexander Duyck @ 2016-03-09 16:08 UTC (permalink / raw)
To: Tom Herbert; +Cc: Joe Perches, Alexander Duyck, Netdev, David Miller
On Tue, Mar 8, 2016 at 10:31 PM, Tom Herbert <tom@herbertland.com> wrote:
> On Tue, Mar 8, 2016 at 10:08 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On Tue, Mar 8, 2016 at 9:50 PM, Joe Perches <joe@perches.com> wrote:
>>> On Tue, 2016-03-08 at 21:23 -0800, Alexander Duyck wrote:
>>>> On Tue, Mar 8, 2016 at 3:25 PM, Joe Perches <joe@perches.com> wrote:
>>>> > On Tue, 2016-03-08 at 14:42 -0800, Alexander Duyck wrote:
>>>> > > The code for csum_block_add was doing a funky byteswap to swap the even and
>>>> > > odd bytes of the checksum if the offset was odd. Instead of doing this we
>>>> > > can save ourselves some trouble and just shift by 8 as this should have the
>>>> > > same effect in terms of the final checksum value and only requires one
>>>> > > instruction.
>>>> > 3 instructions?
>>>> I was talking about just the one ror vs mov, shl, shr, and, and, add.
>>>>
>>>> I assume when you say 3 you are including the test and either some
>>>> form of conditional move or jump?
>>>
>>> Yeah, instruction count also depends on architecture (arm/x86/ppc...)
>>
>> Right. But the general idea is that rotate is an instruction most
>> architectures have. I haven't heard of an instruction that swaps even
>> and odd bytes of a 32 bit word.
>>
> Yes, I took a look inlining these.
>
> #define rol32(V, X) ({ \
> int word = V; \
> if (__builtin_constant_p(X)) \
> asm("roll $" #X ",%[word]\n\t" \
> : [word] "=r" (word)); \
> else \
> asm("roll %%cl,%[word]\n\t" \
> : [word] "=r" (word) \
> : "c" (X)); \
> word; \
> })
>
> With this I'm seeing a nice speedup in jhash which uses a lot of rol32s...
Is gcc really not converting the rol32 calls into rotates?
If we need this type of code in order to get the rotates to occur as
expected then maybe we need to look at doing arch specific versions of
the functions in bitops.h in order to improve the performance since I
know these calls are used in some performance critical paths such as
crypto and hashing.
- Alex
* Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap
2016-03-09 16:08 ` Alexander Duyck
@ 2016-03-10 0:18 ` Joe Perches
2016-03-10 0:58 ` Tom Herbert
0 siblings, 1 reply; 11+ messages in thread
From: Joe Perches @ 2016-03-10 0:18 UTC (permalink / raw)
To: Alexander Duyck, Tom Herbert; +Cc: Alexander Duyck, Netdev, David Miller
On Wed, 2016-03-09 at 08:08 -0800, Alexander Duyck wrote:
> On Tue, Mar 8, 2016 at 10:31 PM, Tom Herbert <tom@herbertland.com> wrote:
> > I took a look inlining these.
> >
> > #define rol32(V, X) ({ \
> > int word = V; \
> > if (__builtin_constant_p(X)) \
> > asm("roll $" #X ",%[word]\n\t" \
> > : [word] "=r" (word)); \
> > else \
> > asm("roll %%cl,%[word]\n\t" \
> > : [word] "=r" (word) \
> > : "c" (X)); \
> > word; \
> > })
> >
> > With this I'm seeing a nice speedup in jhash which uses a lot of rol32s...
> Is gcc really not converting the rol32 calls into rotates?
No, it is.
The difference in the object code with the asm for instance is:
(old, compiled with gcc 5.3.1)
<jhash_2words.constprop.5>:
84e: 81 ee 09 41 52 21 sub $0x21524109,%esi
854: 81 ef 09 41 52 21 sub $0x21524109,%edi
85a: 55 push %rbp
85b: 89 f0 mov %esi,%eax
85d: 89 f2 mov %esi,%edx
85f: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 866 <jhash_2words.constprop.5+0x18>
866: c1 c2 0e rol $0xe,%edx
869: 35 f7 be ad de xor $0xdeadbef7,%eax
86e: 48 89 e5 mov %rsp,%rbp
871: 29 d0 sub %edx,%eax
873: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 87a <jhash_2words.constprop.5+0x2c>
87a: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 881 <jhash_2words.constprop.5+0x33>
881: 89 c2 mov %eax,%edx
883: 31 c7 xor %eax,%edi
885: c1 c2 0b rol $0xb,%edx
888: 29 d7 sub %edx,%edi
88a: 89 fa mov %edi,%edx
88c: 31 fe xor %edi,%esi
88e: c1 ca 07 ror $0x7,%edx
891: 29 d6 sub %edx,%esi
893: 89 f2 mov %esi,%edx
895: 31 f0 xor %esi,%eax
897: c1 c2 10 rol $0x10,%edx
89a: 29 d0 sub %edx,%eax
89c: 89 c2 mov %eax,%edx
89e: 31 c7 xor %eax,%edi
8a0: c1 c2 04 rol $0x4,%edx
8a3: 29 d7 sub %edx,%edi
8a5: 31 fe xor %edi,%esi
8a7: c1 c7 0e rol $0xe,%edi
8aa: 29 fe sub %edi,%esi
8ac: 31 f0 xor %esi,%eax
8ae: c1 ce 08 ror $0x8,%esi
8b1: 29 f0 sub %esi,%eax
8b3: 5d pop %rbp
8b4: c3 retq
vs Tom's asm
000000000000084e <jhash_2words.constprop.5>:
84e: 81 ee 09 41 52 21 sub $0x21524109,%esi
854: 8d 87 f7 be ad de lea -0x21524109(%rdi),%eax
85a: 55 push %rbp
85b: 89 f2 mov %esi,%edx
85d: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 864 <jhash_2words.constprop.5+0x16>
864: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 86b <jhash_2words.constprop.5+0x1d>
86b: 81 f2 f7 be ad de xor $0xdeadbef7,%edx
871: 48 89 e5 mov %rsp,%rbp
874: c1 c1 0e rol $0xe,%ecx
877: 29 ca sub %ecx,%edx
879: 31 d0 xor %edx,%eax
87b: c1 c7 0b rol $0xb,%edi
87e: 29 f8 sub %edi,%eax
880: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 887 <jhash_2words.constprop.5+0x39>
887: 31 c6 xor %eax,%esi
889: c1 c7 19 rol $0x19,%edi
88c: 29 fe sub %edi,%esi
88e: 31 f2 xor %esi,%edx
890: c1 c7 10 rol $0x10,%edi
893: 29 fa sub %edi,%edx
895: 31 d0 xor %edx,%eax
897: c1 c7 04 rol $0x4,%edi
89a: 29 f8 sub %edi,%eax
89c: 31 f0 xor %esi,%eax
89e: 29 c8 sub %ecx,%eax
8a0: 31 d0 xor %edx,%eax
8a2: 5d pop %rbp
8a3: c1 c2 18 rol $0x18,%edx
8a6: 29 d0 sub %edx,%eax
8a8: c3 retq
> If we need this type of code in order to get the rotates to occur as
> expected then maybe we need to look at doing arch specific versions of
> the functions in bitops.h in order to improve the performance since I
> know these calls are used in some performance critical paths such as
> crypto and hashing.
Yeah, maybe, but why couldn't gcc generate similar code
as Tom's asm? (modulo the ripple reducing ror vs rol uses
when the shift is > 16)
* Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap
2016-03-10 0:18 ` Joe Perches
@ 2016-03-10 0:58 ` Tom Herbert
0 siblings, 0 replies; 11+ messages in thread
From: Tom Herbert @ 2016-03-10 0:58 UTC (permalink / raw)
To: Joe Perches; +Cc: Alexander Duyck, Alexander Duyck, Netdev, David Miller
On Wed, Mar 9, 2016 at 4:18 PM, Joe Perches <joe@perches.com> wrote:
> On Wed, 2016-03-09 at 08:08 -0800, Alexander Duyck wrote:
>> On Tue, Mar 8, 2016 at 10:31 PM, Tom Herbert <tom@herbertland.com> wrote:
>> > I took a look inlining these.
>> >
>> > #define rol32(V, X) ({ \
>> > int word = V; \
>> > if (__builtin_constant_p(X)) \
>> > asm("roll $" #X ",%[word]\n\t" \
>> > : [word] "=r" (word)); \
>> > else \
>> > asm("roll %%cl,%[word]\n\t" \
>> > : [word] "=r" (word) \
>> > : "c" (X)); \
>> > word; \
>> > })
>> >
>> > With this I'm seeing a nice speedup in jhash which uses a lot of rol32s...
>> Is gcc really not converting the rol32 calls into rotates?
>
> No, it is.
>
> The difference in the object code with the asm for instance is:
>
> (old, compiled with gcc 5.3.1)
>
> <jhash_2words.constprop.5>:
> 84e: 81 ee 09 41 52 21 sub $0x21524109,%esi
> 854: 81 ef 09 41 52 21 sub $0x21524109,%edi
> 85a: 55 push %rbp
> 85b: 89 f0 mov %esi,%eax
> 85d: 89 f2 mov %esi,%edx
> 85f: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 866 <jhash_2words.constprop.5+0x18>
> 866: c1 c2 0e rol $0xe,%edx
> 869: 35 f7 be ad de xor $0xdeadbef7,%eax
> 86e: 48 89 e5 mov %rsp,%rbp
> 871: 29 d0 sub %edx,%eax
> 873: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 87a <jhash_2words.constprop.5+0x2c>
> 87a: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 881 <jhash_2words.constprop.5+0x33>
> 881: 89 c2 mov %eax,%edx
> 883: 31 c7 xor %eax,%edi
> 885: c1 c2 0b rol $0xb,%edx
> 888: 29 d7 sub %edx,%edi
> 88a: 89 fa mov %edi,%edx
> 88c: 31 fe xor %edi,%esi
> 88e: c1 ca 07 ror $0x7,%edx
> 891: 29 d6 sub %edx,%esi
> 893: 89 f2 mov %esi,%edx
> 895: 31 f0 xor %esi,%eax
> 897: c1 c2 10 rol $0x10,%edx
> 89a: 29 d0 sub %edx,%eax
> 89c: 89 c2 mov %eax,%edx
> 89e: 31 c7 xor %eax,%edi
> 8a0: c1 c2 04 rol $0x4,%edx
> 8a3: 29 d7 sub %edx,%edi
> 8a5: 31 fe xor %edi,%esi
> 8a7: c1 c7 0e rol $0xe,%edi
> 8aa: 29 fe sub %edi,%esi
> 8ac: 31 f0 xor %esi,%eax
> 8ae: c1 ce 08 ror $0x8,%esi
> 8b1: 29 f0 sub %esi,%eax
> 8b3: 5d pop %rbp
> 8b4: c3 retq
>
> vs Tom's asm
>
> 000000000000084e <jhash_2words.constprop.5>:
> 84e: 81 ee 09 41 52 21 sub $0x21524109,%esi
> 854: 8d 87 f7 be ad de lea -0x21524109(%rdi),%eax
> 85a: 55 push %rbp
> 85b: 89 f2 mov %esi,%edx
> 85d: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 864 <jhash_2words.constprop.5+0x16>
> 864: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 86b <jhash_2words.constprop.5+0x1d>
> 86b: 81 f2 f7 be ad de xor $0xdeadbef7,%edx
> 871: 48 89 e5 mov %rsp,%rbp
> 874: c1 c1 0e rol $0xe,%ecx
> 877: 29 ca sub %ecx,%edx
> 879: 31 d0 xor %edx,%eax
> 87b: c1 c7 0b rol $0xb,%edi
> 87e: 29 f8 sub %edi,%eax
> 880: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 887 <jhash_2words.constprop.5+0x39>
> 887: 31 c6 xor %eax,%esi
> 889: c1 c7 19 rol $0x19,%edi
> 88c: 29 fe sub %edi,%esi
> 88e: 31 f2 xor %esi,%edx
> 890: c1 c7 10 rol $0x10,%edi
> 893: 29 fa sub %edi,%edx
> 895: 31 d0 xor %edx,%eax
> 897: c1 c7 04 rol $0x4,%edi
> 89a: 29 f8 sub %edi,%eax
> 89c: 31 f0 xor %esi,%eax
> 89e: 29 c8 sub %ecx,%eax
> 8a0: 31 d0 xor %edx,%eax
> 8a2: 5d pop %rbp
> 8a3: c1 c2 18 rol $0x18,%edx
> 8a6: 29 d0 sub %edx,%eax
> 8a8: c3 retq
>
>> If we need this type of code in order to get the rotates to occur as
>> expected then maybe we need to look at doing arch specific versions of
>> the functions in bitops.h in order to improve the performance since I
>> know these calls are used in some performance critical paths such as
>> crypto and hashing.
>
> Yeah, maybe, but why couldn't gcc generate similar code
> as Tom's asm? (modulo the ripple reducing ror vs rol uses
> when the shift is > 16)
I see gcc doing that now, not sure why I was seeing differences before....