* [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap
@ 2016-03-08 22:42 Alexander Duyck
  2016-03-08 23:25 ` Joe Perches
  0 siblings, 1 reply; 11+ messages in thread

From: Alexander Duyck @ 2016-03-08 22:42 UTC (permalink / raw)
To: netdev, davem

The code for csum_block_add was doing a funky byteswap to swap the even and
odd bytes of the checksum if the offset was odd.  Instead of doing this we
can save ourselves some trouble and just shift by 8 as this should have the
same effect in terms of the final checksum value and only requires one
instruction.

In addition we can update csum_block_sub to just use csum_block_add with an
inverse value for csum2.  This way we follow the same code path as
csum_block_add without having to duplicate it.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
 include/net/checksum.h | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/include/net/checksum.h b/include/net/checksum.h
index 10a16b5bd1c7..f9fac66c0e66 100644
--- a/include/net/checksum.h
+++ b/include/net/checksum.h
@@ -88,8 +88,10 @@ static inline __wsum
 csum_block_add(__wsum csum, __wsum csum2, int offset)
 {
 	u32 sum = (__force u32)csum2;
-	if (offset&1)
-		sum = ((sum&0xFF00FF)<<8)+((sum>>8)&0xFF00FF);
+
+	if (offset & 1)
+		sum = (sum << 24) + (sum >> 8);
+
 	return csum_add(csum, (__force __wsum)sum);
 }
 
@@ -102,10 +104,7 @@ csum_block_add_ext(__wsum csum, __wsum csum2, int offset, int len)
 static inline __wsum
 csum_block_sub(__wsum csum, __wsum csum2, int offset)
 {
-	u32 sum = (__force u32)csum2;
-	if (offset&1)
-		sum = ((sum&0xFF00FF)<<8)+((sum>>8)&0xFF00FF);
-	return csum_sub(csum, (__force __wsum)sum);
+	return csum_block_add(csum, ~csum2, offset);
 }
 
 static inline __wsum csum_unfold(__sum16 n)

^ permalink raw reply related	[flat|nested] 11+ messages in thread
* Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap 2016-03-08 22:42 [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap Alexander Duyck @ 2016-03-08 23:25 ` Joe Perches 2016-03-09 5:23 ` Alexander Duyck 2016-03-09 10:54 ` David Laight 0 siblings, 2 replies; 11+ messages in thread From: Joe Perches @ 2016-03-08 23:25 UTC (permalink / raw) To: Alexander Duyck, netdev, davem On Tue, 2016-03-08 at 14:42 -0800, Alexander Duyck wrote: > The code for csum_block_add was doing a funky byteswap to swap the even and > odd bytes of the checksum if the offset was odd. Instead of doing this we > can save ourselves some trouble and just shift by 8 as this should have the > same effect in terms of the final checksum value and only requires one > instruction. 3 instructions? > In addition we can update csum_block_sub to just use csum_block_add with a > inverse value for csum2. This way we follow the same code path as > csum_block_add without having to duplicate it. > > Signed-off-by: Alexander Duyck <aduyck@mirantis.com> > --- > include/net/checksum.h | 11 +++++------ > 1 file changed, 5 insertions(+), 6 deletions(-) > > diff --git a/include/net/checksum.h b/include/net/checksum.h > index 10a16b5bd1c7..f9fac66c0e66 100644 > --- a/include/net/checksum.h > +++ b/include/net/checksum.h > @@ -88,8 +88,10 @@ static inline __wsum > csum_block_add(__wsum csum, __wsum csum2, int offset) > { > u32 sum = (__force u32)csum2; > - if (offset&1) > - sum = ((sum&0xFF00FF)<<8)+((sum>>8)&0xFF00FF); > + > + if (offset & 1) > + sum = (sum << 24) + (sum >> 8); Maybe use ror32(sum, 8); or maybe something like: { u32 sum; /* rotated csum2 of odd offset will be the right checksum */ if (offset & 1) sum = ror32((__force u32)csum2, 8); else sum = (__force u32)csum2; ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap 2016-03-08 23:25 ` Joe Perches @ 2016-03-09 5:23 ` Alexander Duyck 2016-03-09 5:50 ` Joe Perches 2016-03-09 10:54 ` David Laight 1 sibling, 1 reply; 11+ messages in thread From: Alexander Duyck @ 2016-03-09 5:23 UTC (permalink / raw) To: Joe Perches; +Cc: Alexander Duyck, Netdev, David Miller On Tue, Mar 8, 2016 at 3:25 PM, Joe Perches <joe@perches.com> wrote: > On Tue, 2016-03-08 at 14:42 -0800, Alexander Duyck wrote: >> The code for csum_block_add was doing a funky byteswap to swap the even and >> odd bytes of the checksum if the offset was odd. Instead of doing this we >> can save ourselves some trouble and just shift by 8 as this should have the >> same effect in terms of the final checksum value and only requires one >> instruction. > > 3 instructions? I was talking about just the one ror vs mov, shl, shr, and ,and, add. I assume when you say 3 you are including the test and either some form of conditional move or jump? >> In addition we can update csum_block_sub to just use csum_block_add with a >> inverse value for csum2. This way we follow the same code path as >> csum_block_add without having to duplicate it. >> >> Signed-off-by: Alexander Duyck <aduyck@mirantis.com> >> --- >> include/net/checksum.h | 11 +++++------ >> 1 file changed, 5 insertions(+), 6 deletions(-) >> >> diff --git a/include/net/checksum.h b/include/net/checksum.h >> index 10a16b5bd1c7..f9fac66c0e66 100644 >> --- a/include/net/checksum.h >> +++ b/include/net/checksum.h >> @@ -88,8 +88,10 @@ static inline __wsum >> csum_block_add(__wsum csum, __wsum csum2, int offset) >> { >> u32 sum = (__force u32)csum2; >> - if (offset&1) >> - sum = ((sum&0xFF00FF)<<8)+((sum>>8)&0xFF00FF); >> + >> + if (offset & 1) >> + sum = (sum << 24) + (sum >> 8); > > Maybe use ror32(sum, 8); I was actually thinking I could use something like this. I didn't realize it was even available. 
> or maybe something like: > > { > u32 sum; > > /* rotated csum2 of odd offset will be the right checksum */ > if (offset & 1) > sum = ror32((__force u32)csum2, 8); > else > sum = (__force u32)csum2; > Any specific reason for breaking it up like this? It seems like it was easier to just have sum be assigned first and then rotating it if needed. What is gained by splitting the assignment up over two different calls? ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap 2016-03-09 5:23 ` Alexander Duyck @ 2016-03-09 5:50 ` Joe Perches 2016-03-09 6:08 ` Alexander Duyck 0 siblings, 1 reply; 11+ messages in thread From: Joe Perches @ 2016-03-09 5:50 UTC (permalink / raw) To: Alexander Duyck; +Cc: Alexander Duyck, Netdev, David Miller On Tue, 2016-03-08 at 21:23 -0800, Alexander Duyck wrote: > On Tue, Mar 8, 2016 at 3:25 PM, Joe Perches <joe@perches.com> wrote: > > On Tue, 2016-03-08 at 14:42 -0800, Alexander Duyck wrote: > > > The code for csum_block_add was doing a funky byteswap to swap the even and > > > odd bytes of the checksum if the offset was odd. Instead of doing this we > > > can save ourselves some trouble and just shift by 8 as this should have the > > > same effect in terms of the final checksum value and only requires one > > > instruction. > > 3 instructions? > I was talking about just the one ror vs mov, shl, shr, and ,and, add. > > I assume when you say 3 you are including the test and either some > form of conditional move or jump? Yeah, instruction count also depends on architecture (arm/x86/ppc...) > > > diff --git a/include/net/checksum.h b/include/net/checksum.h [] > > > @@ -88,8 +88,10 @@ static inline __wsum > > > csum_block_add(__wsum csum, __wsum csum2, int offset) > > > { > > > u32 sum = (__force u32)csum2; > > > - if (offset&1) > > > - sum = ((sum&0xFF00FF)<<8)+((sum>>8)&0xFF00FF); > > > + > > > + if (offset & 1) > > > + sum = (sum << 24) + (sum >> 8); > > Maybe use ror32(sum, 8); > I was actually thinking I could use something like this. I didn't > realize it was even available. Now you know: bitops.h > > or maybe something like: > > > > { > > u32 sum; > > > > /* rotated csum2 of odd offset will be the right checksum */ > > if (offset & 1) > > sum = ror32((__force u32)csum2, 8); > > else > > sum = (__force u32)csum2; > > > Any specific reason for breaking it up like this? 
It seems like it > was easier to just have sum be assigned first and then rotating it if > needed. What is gained by splitting the assignment up over two > different calls? It's only for reader clarity where a comment could be useful. The compiler output shouldn't change. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap 2016-03-09 5:50 ` Joe Perches @ 2016-03-09 6:08 ` Alexander Duyck 2016-03-09 6:31 ` Tom Herbert 0 siblings, 1 reply; 11+ messages in thread From: Alexander Duyck @ 2016-03-09 6:08 UTC (permalink / raw) To: Joe Perches; +Cc: Alexander Duyck, Netdev, David Miller On Tue, Mar 8, 2016 at 9:50 PM, Joe Perches <joe@perches.com> wrote: > On Tue, 2016-03-08 at 21:23 -0800, Alexander Duyck wrote: >> On Tue, Mar 8, 2016 at 3:25 PM, Joe Perches <joe@perches.com> wrote: >> > On Tue, 2016-03-08 at 14:42 -0800, Alexander Duyck wrote: >> > > The code for csum_block_add was doing a funky byteswap to swap the even and >> > > odd bytes of the checksum if the offset was odd. Instead of doing this we >> > > can save ourselves some trouble and just shift by 8 as this should have the >> > > same effect in terms of the final checksum value and only requires one >> > > instruction. >> > 3 instructions? >> I was talking about just the one ror vs mov, shl, shr, and ,and, add. >> >> I assume when you say 3 you are including the test and either some >> form of conditional move or jump? > > Yeah, instruction count also depends on architecture (arm/x86/ppc...) Right. But the general idea is that rotate is an instruction most architectures have. I haven't heard of an instruction that swaps even and odd bytes of a 32 bit word. >> > > diff --git a/include/net/checksum.h b/include/net/checksum.h > [] >> > > @@ -88,8 +88,10 @@ static inline __wsum >> > > csum_block_add(__wsum csum, __wsum csum2, int offset) >> > > { >> > > u32 sum = (__force u32)csum2; >> > > - if (offset&1) >> > > - sum = ((sum&0xFF00FF)<<8)+((sum>>8)&0xFF00FF); >> > > + >> > > + if (offset & 1) >> > > + sum = (sum << 24) + (sum >> 8); >> > Maybe use ror32(sum, 8); >> I was actually thinking I could use something like this. I didn't >> realize it was even available. 
> > Now you know: bitops.h > >> > or maybe something like: >> > >> > { >> > u32 sum; >> > >> > /* rotated csum2 of odd offset will be the right checksum */ >> > if (offset & 1) >> > sum = ror32((__force u32)csum2, 8); >> > else >> > sum = (__force u32)csum2; >> > >> Any specific reason for breaking it up like this? It seems like it >> was easier to just have sum be assigned first and then rotating it if >> needed. What is gained by splitting the assignment up over two >> different calls? > > It's only for reader clarity where a comment could be useful. > The compiler output shouldn't change. Okay, well I can add a one line comment about aligning to a 16b boundary for clarity. - Alex ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap 2016-03-09 6:08 ` Alexander Duyck @ 2016-03-09 6:31 ` Tom Herbert 2016-03-09 16:08 ` Alexander Duyck 0 siblings, 1 reply; 11+ messages in thread From: Tom Herbert @ 2016-03-09 6:31 UTC (permalink / raw) To: Alexander Duyck; +Cc: Joe Perches, Alexander Duyck, Netdev, David Miller On Tue, Mar 8, 2016 at 10:08 PM, Alexander Duyck <alexander.duyck@gmail.com> wrote: > On Tue, Mar 8, 2016 at 9:50 PM, Joe Perches <joe@perches.com> wrote: >> On Tue, 2016-03-08 at 21:23 -0800, Alexander Duyck wrote: >>> On Tue, Mar 8, 2016 at 3:25 PM, Joe Perches <joe@perches.com> wrote: >>> > On Tue, 2016-03-08 at 14:42 -0800, Alexander Duyck wrote: >>> > > The code for csum_block_add was doing a funky byteswap to swap the even and >>> > > odd bytes of the checksum if the offset was odd. Instead of doing this we >>> > > can save ourselves some trouble and just shift by 8 as this should have the >>> > > same effect in terms of the final checksum value and only requires one >>> > > instruction. >>> > 3 instructions? >>> I was talking about just the one ror vs mov, shl, shr, and ,and, add. >>> >>> I assume when you say 3 you are including the test and either some >>> form of conditional move or jump? >> >> Yeah, instruction count also depends on architecture (arm/x86/ppc...) > > Right. But the general idea is that rotate is an instruction most > architectures have. I haven't heard of an instruction that swaps even > and odd bytes of a 32 bit word. > Yes, I took a look inlining these. #define rol32(V, X) ({ \ int word = V; \ if (__builtin_constant_p(X)) \ asm("roll $" #X ",%[word]\n\t" \ : [word] "=r" (word)); \ else \ asm("roll %%cl,%[word]\n\t" \ : [word] "=r" (word) \ : "c" (X)); \ word; \ }) With this I'm seeing a nice speedup in jhash which uses a lot of rol32s... 
>>> > > diff --git a/include/net/checksum.h b/include/net/checksum.h >> [] >>> > > @@ -88,8 +88,10 @@ static inline __wsum >>> > > csum_block_add(__wsum csum, __wsum csum2, int offset) >>> > > { >>> > > u32 sum = (__force u32)csum2; >>> > > - if (offset&1) >>> > > - sum = ((sum&0xFF00FF)<<8)+((sum>>8)&0xFF00FF); >>> > > + >>> > > + if (offset & 1) >>> > > + sum = (sum << 24) + (sum >> 8); >>> > Maybe use ror32(sum, 8); >>> I was actually thinking I could use something like this. I didn't >>> realize it was even available. >> >> Now you know: bitops.h >> >>> > or maybe something like: >>> > >>> > { >>> > u32 sum; >>> > >>> > /* rotated csum2 of odd offset will be the right checksum */ >>> > if (offset & 1) >>> > sum = ror32((__force u32)csum2, 8); >>> > else >>> > sum = (__force u32)csum2; >>> > >>> Any specific reason for breaking it up like this? It seems like it >>> was easier to just have sum be assigned first and then rotating it if >>> needed. What is gained by splitting the assignment up over two >>> different calls? >> >> It's only for reader clarity where a comment could be useful. >> The compiler output shouldn't change. > > Okay, well I can add a one line comment about aligning to a 16b > boundary for clarity. > > - Alex ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap 2016-03-09 6:31 ` Tom Herbert @ 2016-03-09 16:08 ` Alexander Duyck 2016-03-10 0:18 ` Joe Perches 0 siblings, 1 reply; 11+ messages in thread From: Alexander Duyck @ 2016-03-09 16:08 UTC (permalink / raw) To: Tom Herbert; +Cc: Joe Perches, Alexander Duyck, Netdev, David Miller On Tue, Mar 8, 2016 at 10:31 PM, Tom Herbert <tom@herbertland.com> wrote: > On Tue, Mar 8, 2016 at 10:08 PM, Alexander Duyck > <alexander.duyck@gmail.com> wrote: >> On Tue, Mar 8, 2016 at 9:50 PM, Joe Perches <joe@perches.com> wrote: >>> On Tue, 2016-03-08 at 21:23 -0800, Alexander Duyck wrote: >>>> On Tue, Mar 8, 2016 at 3:25 PM, Joe Perches <joe@perches.com> wrote: >>>> > On Tue, 2016-03-08 at 14:42 -0800, Alexander Duyck wrote: >>>> > > The code for csum_block_add was doing a funky byteswap to swap the even and >>>> > > odd bytes of the checksum if the offset was odd. Instead of doing this we >>>> > > can save ourselves some trouble and just shift by 8 as this should have the >>>> > > same effect in terms of the final checksum value and only requires one >>>> > > instruction. >>>> > 3 instructions? >>>> I was talking about just the one ror vs mov, shl, shr, and ,and, add. >>>> >>>> I assume when you say 3 you are including the test and either some >>>> form of conditional move or jump? >>> >>> Yeah, instruction count also depends on architecture (arm/x86/ppc...) >> >> Right. But the general idea is that rotate is an instruction most >> architectures have. I haven't heard of an instruction that swaps even >> and odd bytes of a 32 bit word. >> > Yes, I took a look inlining these. > > #define rol32(V, X) ({ \ > int word = V; \ > if (__builtin_constant_p(X)) \ > asm("roll $" #X ",%[word]\n\t" \ > : [word] "=r" (word)); \ > else \ > asm("roll %%cl,%[word]\n\t" \ > : [word] "=r" (word) \ > : "c" (X)); \ > word; \ > }) > > With this I'm seeing a nice speedup in jhash which uses a lot of rol32s... 
Is gcc really not converting the rol32 calls into rotates? If we need this type of code in order to get the rotates to occur as expected then maybe we need to look at doing arch specific versions of the functions in bitops.h in order to improve the performance since I know these calls are used in some performance critical paths such as crypto and hashing. - Alex ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap 2016-03-09 16:08 ` Alexander Duyck @ 2016-03-10 0:18 ` Joe Perches 2016-03-10 0:58 ` Tom Herbert 0 siblings, 1 reply; 11+ messages in thread From: Joe Perches @ 2016-03-10 0:18 UTC (permalink / raw) To: Alexander Duyck, Tom Herbert; +Cc: Alexander Duyck, Netdev, David Miller On Wed, 2016-03-09 at 08:08 -0800, Alexander Duyck wrote: > On Tue, Mar 8, 2016 at 10:31 PM, Tom Herbert <tom@herbertland.com> wrote: > > I took a look inlining these. > > > > #define rol32(V, X) ({ \ > > int word = V; \ > > if (__builtin_constant_p(X)) \ > > asm("roll $" #X ",%[word]\n\t" \ > > : [word] "=r" (word)); \ > > else \ > > asm("roll %%cl,%[word]\n\t" \ > > : [word] "=r" (word) \ > > : "c" (X)); \ > > word; \ > > }) > > > > With this I'm seeing a nice speedup in jhash which uses a lot of rol32s... > Is gcc really not converting the rol32 calls into rotates? No, it is. The difference in the object code with the asm for instance is: (old, compiled with gcc 5.3.1) <jhash_2words.constprop.5>: 84e: 81 ee 09 41 52 21 sub $0x21524109,%esi 854: 81 ef 09 41 52 21 sub $0x21524109,%edi 85a: 55 push %rbp 85b: 89 f0 mov %esi,%eax 85d: 89 f2 mov %esi,%edx 85f: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 866 <jhash_2words.constprop.5+0x18> 866: c1 c2 0e rol $0xe,%edx 869: 35 f7 be ad de xor $0xdeadbef7,%eax 86e: 48 89 e5 mov %rsp,%rbp 871: 29 d0 sub %edx,%eax 873: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 87a <jhash_2words.constprop.5+0x2c> 87a: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 881 <jhash_2words.constprop.5+0x33> 881: 89 c2 mov %eax,%edx 883: 31 c7 xor %eax,%edi 885: c1 c2 0b rol $0xb,%edx 888: 29 d7 sub %edx,%edi 88a: 89 fa mov %edi,%edx 88c: 31 fe xor %edi,%esi 88e: c1 ca 07 ror $0x7,%edx 891: 29 d6 sub %edx,%esi 893: 89 f2 mov %esi,%edx 895: 31 f0 xor %esi,%eax 897: c1 c2 10 rol $0x10,%edx 89a: 29 d0 sub %edx,%eax 89c: 89 c2 mov %eax,%edx 89e: 31 c7 xor %eax,%edi 8a0: c1 c2 04 rol $0x4,%edx 8a3: 29 d7 sub %edx,%edi 
8a5: 31 fe xor %edi,%esi 8a7: c1 c7 0e rol $0xe,%edi 8aa: 29 fe sub %edi,%esi 8ac: 31 f0 xor %esi,%eax 8ae: c1 ce 08 ror $0x8,%esi 8b1: 29 f0 sub %esi,%eax 8b3: 5d pop %rbp 8b4: c3 retq vs Tom's asm 000000000000084e <jhash_2words.constprop.5>: 84e: 81 ee 09 41 52 21 sub $0x21524109,%esi 854: 8d 87 f7 be ad de lea -0x21524109(%rdi),%eax 85a: 55 push %rbp 85b: 89 f2 mov %esi,%edx 85d: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 864 <jhash_2words.constprop.5+0x16> 864: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 86b <jhash_2words.constprop.5+0x1d> 86b: 81 f2 f7 be ad de xor $0xdeadbef7,%edx 871: 48 89 e5 mov %rsp,%rbp 874: c1 c1 0e rol $0xe,%ecx 877: 29 ca sub %ecx,%edx 879: 31 d0 xor %edx,%eax 87b: c1 c7 0b rol $0xb,%edi 87e: 29 f8 sub %edi,%eax 880: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 887 <jhash_2words.constprop.5+0x39> 887: 31 c6 xor %eax,%esi 889: c1 c7 19 rol $0x19,%edi 88c: 29 fe sub %edi,%esi 88e: 31 f2 xor %esi,%edx 890: c1 c7 10 rol $0x10,%edi 893: 29 fa sub %edi,%edx 895: 31 d0 xor %edx,%eax 897: c1 c7 04 rol $0x4,%edi 89a: 29 f8 sub %edi,%eax 89c: 31 f0 xor %esi,%eax 89e: 29 c8 sub %ecx,%eax 8a0: 31 d0 xor %edx,%eax 8a2: 5d pop %rbp 8a3: c1 c2 18 rol $0x18,%edx 8a6: 29 d0 sub %edx,%eax 8a8: c3 retq > If we need this type of code in order to get the rotates to occur as > expected then maybe we need to look at doing arch specific versions of > the functions in bitops.h in order to improve the performance since I > know these calls are used in some performance critical paths such as > crypto and hashing. Yeah, maybe, but why couldn't gcc generate similar code as Tom's asm? (modulo the ripple reducing ror vs rol uses when the shift is > 16 ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap 2016-03-10 0:18 ` Joe Perches @ 2016-03-10 0:58 ` Tom Herbert 0 siblings, 0 replies; 11+ messages in thread From: Tom Herbert @ 2016-03-10 0:58 UTC (permalink / raw) To: Joe Perches; +Cc: Alexander Duyck, Alexander Duyck, Netdev, David Miller On Wed, Mar 9, 2016 at 4:18 PM, Joe Perches <joe@perches.com> wrote: > On Wed, 2016-03-09 at 08:08 -0800, Alexander Duyck wrote: >> On Tue, Mar 8, 2016 at 10:31 PM, Tom Herbert <tom@herbertland.com> wrote: >> > I took a look inlining these. >> > >> > #define rol32(V, X) ({ \ >> > int word = V; \ >> > if (__builtin_constant_p(X)) \ >> > asm("roll $" #X ",%[word]\n\t" \ >> > : [word] "=r" (word)); \ >> > else \ >> > asm("roll %%cl,%[word]\n\t" \ >> > : [word] "=r" (word) \ >> > : "c" (X)); \ >> > word; \ >> > }) >> > >> > With this I'm seeing a nice speedup in jhash which uses a lot of rol32s... >> Is gcc really not converting the rol32 calls into rotates? > > No, it is. 
> > The difference in the object code with the asm for instance is: > > (old, compiled with gcc 5.3.1) > > <jhash_2words.constprop.5>: > 84e: 81 ee 09 41 52 21 sub $0x21524109,%esi > 854: 81 ef 09 41 52 21 sub $0x21524109,%edi > 85a: 55 push %rbp > 85b: 89 f0 mov %esi,%eax > 85d: 89 f2 mov %esi,%edx > 85f: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 866 <jhash_2words.constprop.5+0x18> > 866: c1 c2 0e rol $0xe,%edx > 869: 35 f7 be ad de xor $0xdeadbef7,%eax > 86e: 48 89 e5 mov %rsp,%rbp > 871: 29 d0 sub %edx,%eax > 873: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 87a <jhash_2words.constprop.5+0x2c> > 87a: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 881 <jhash_2words.constprop.5+0x33> > 881: 89 c2 mov %eax,%edx > 883: 31 c7 xor %eax,%edi > 885: c1 c2 0b rol $0xb,%edx > 888: 29 d7 sub %edx,%edi > 88a: 89 fa mov %edi,%edx > 88c: 31 fe xor %edi,%esi > 88e: c1 ca 07 ror $0x7,%edx > 891: 29 d6 sub %edx,%esi > 893: 89 f2 mov %esi,%edx > 895: 31 f0 xor %esi,%eax > 897: c1 c2 10 rol $0x10,%edx > 89a: 29 d0 sub %edx,%eax > 89c: 89 c2 mov %eax,%edx > 89e: 31 c7 xor %eax,%edi > 8a0: c1 c2 04 rol $0x4,%edx > 8a3: 29 d7 sub %edx,%edi > 8a5: 31 fe xor %edi,%esi > 8a7: c1 c7 0e rol $0xe,%edi > 8aa: 29 fe sub %edi,%esi > 8ac: 31 f0 xor %esi,%eax > 8ae: c1 ce 08 ror $0x8,%esi > 8b1: 29 f0 sub %esi,%eax > 8b3: 5d pop %rbp > 8b4: c3 retq > > vs Tom's asm > > 000000000000084e <jhash_2words.constprop.5>: > 84e: 81 ee 09 41 52 21 sub $0x21524109,%esi > 854: 8d 87 f7 be ad de lea -0x21524109(%rdi),%eax > 85a: 55 push %rbp > 85b: 89 f2 mov %esi,%edx > 85d: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 864 <jhash_2words.constprop.5+0x16> > 864: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 86b <jhash_2words.constprop.5+0x1d> > 86b: 81 f2 f7 be ad de xor $0xdeadbef7,%edx > 871: 48 89 e5 mov %rsp,%rbp > 874: c1 c1 0e rol $0xe,%ecx > 877: 29 ca sub %ecx,%edx > 879: 31 d0 xor %edx,%eax > 87b: c1 c7 0b rol $0xb,%edi > 87e: 29 f8 sub %edi,%eax > 880: 48 ff 05 00 00 00 00 incq 0x0(%rip) # 887 <jhash_2words.constprop.5+0x39> > 887: 
31 c6 xor %eax,%esi > 889: c1 c7 19 rol $0x19,%edi > 88c: 29 fe sub %edi,%esi > 88e: 31 f2 xor %esi,%edx > 890: c1 c7 10 rol $0x10,%edi > 893: 29 fa sub %edi,%edx > 895: 31 d0 xor %edx,%eax > 897: c1 c7 04 rol $0x4,%edi > 89a: 29 f8 sub %edi,%eax > 89c: 31 f0 xor %esi,%eax > 89e: 29 c8 sub %ecx,%eax > 8a0: 31 d0 xor %edx,%eax > 8a2: 5d pop %rbp > 8a3: c1 c2 18 rol $0x18,%edx > 8a6: 29 d0 sub %edx,%eax > 8a8: c3 retq > >> If we need this type of code in order to get the rotates to occur as >> expected then maybe we need to look at doing arch specific versions of >> the functions in bitops.h in order to improve the performance since I >> know these calls are used in some performance critical paths such as >> crypto and hashing. > > Yeah, maybe, but why couldn't gcc generate similar code > as Tom's asm? (modulo the ripple reducing ror vs rol uses > when the shift is > 16 I see gcc doing that now, not sure why I was seeing differences before.... ^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap
  2016-03-08 23:25 ` Joe Perches
  2016-03-09  5:23   ` Alexander Duyck
@ 2016-03-09 10:54   ` David Laight
  2016-03-09 16:03     ` Alexander Duyck
  1 sibling, 1 reply; 11+ messages in thread

From: David Laight @ 2016-03-09 10:54 UTC (permalink / raw)
To: 'Joe Perches', Alexander Duyck, netdev@vger.kernel.org, davem@davemloft.net

From: Joe Perches
> Sent: 08 March 2016 23:26
...
> > +
> > +	if (offset & 1)
> > +		sum = (sum << 24) + (sum >> 8);
>
> Maybe use ror32(sum, 8);
>
> or maybe something like:
>
> {
> 	u32 sum;
>
> 	/* rotated csum2 of odd offset will be the right checksum */
> 	if (offset & 1)
> 		sum = ror32((__force u32)csum2, 8);
> 	else
> 		sum = (__force u32)csum2;

Or even:
	sum = ror32((__force u32)csum2, (offset & 1) * 8);
to remove the conditional.
Assuming 'rotate by 0 bits' is valid.
If not add 16 to rotate by 16 or 24.

	David

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap 2016-03-09 10:54 ` David Laight @ 2016-03-09 16:03 ` Alexander Duyck 0 siblings, 0 replies; 11+ messages in thread From: Alexander Duyck @ 2016-03-09 16:03 UTC (permalink / raw) To: David Laight Cc: Joe Perches, Alexander Duyck, netdev@vger.kernel.org, davem@davemloft.net On Wed, Mar 9, 2016 at 2:54 AM, David Laight <David.Laight@aculab.com> wrote: > From: Joe Perches >> Sent: 08 March 2016 23:26 > ... >> > + >> > + if (offset & 1) >> > + sum = (sum << 24) + (sum >> 8); >> >> Maybe use ror32(sum, 8); >> >> or maybe something like: >> >> { >> u32 sum; >> >> /* rotated csum2 of odd offset will be the right checksum */ >> if (offset & 1) >> sum = ror32((__force u32)csum2, 8); >> else >> sum = (__force u32)csum2; > > Or even: > sum = ror32((__force u32)csum2, (offset & 1) * 8); > to remove the conditional. > Assuming 'rotate by 0 bits' is valid. > If not add 16 to rotate by 16 or 24. The problem is "ror %cl" can be significantly more expensive than just a "ror $8". In the case of x86 the difference is as much as 6 cycles or more on some of the older architectures so it may be better to just do the rotate by 8 and then an "and" or "test" and "cmovne" which is what this compiles into right now. - Alex ^ permalink raw reply [flat|nested] 11+ messages in thread