* [PATCH] arm64: Implement optimised IP checksum helpers @ 2016-05-12 14:26 ` Robin Murphy 0 siblings, 0 replies; 10+ messages in thread From: Robin Murphy @ 2016-05-12 14:26 UTC (permalink / raw) To: linux-arm-kernel AArch64 is capable of 128-bit memory accesses without alignment restrictions, which makes it both possible and highly practical to slurp up a typical 20-byte IP header in just 2 loads. Implement our own version of ip_fast_checksum() to take advantage of that, resulting in considerably fewer instructions and memory accesses than the generic version. We can also get more optimal code generation for csum_fold() by defining it a slightly different way round from the generic version, so throw that into the mix too. Suggested-by: Luke Starrett <luke.starrett@broadcom.com> Signed-off-by: Robin Murphy <robin.murphy@arm.com> --- Having spent an evening poring over disassembler output, it turns out that there's really no need to drop into asm for this, hurrah! I've taken it for a spin on both big and little-endian, with no ill effects. arch/arm64/include/asm/checksum.h | 49 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 49 insertions(+) create mode 100644 arch/arm64/include/asm/checksum.h diff --git a/arch/arm64/include/asm/checksum.h b/arch/arm64/include/asm/checksum.h new file mode 100644 index 000000000000..7e975ffce837 --- /dev/null +++ b/arch/arm64/include/asm/checksum.h @@ -0,0 +1,49 @@ +/* + * Copyright (C) 2016 ARM Ltd. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see <http://www.gnu.org/licenses/>. + */ +#ifndef __ASM_CHECKSUM_H +#define __ASM_CHECKSUM_H + +static inline __sum16 csum_fold(__wsum csum) +{ + u32 sum = (__force u32)csum; + sum += (sum >> 16) | (sum << 16); + return ~(__force __sum16)(sum >> 16); +} +#define csum_fold csum_fold + +static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl) +{ + __uint128_t tmp; + u64 sum; + + tmp = *(const __uint128_t *)iph; + iph += 16; + ihl -= 4; + tmp += ((tmp >> 64) | (tmp << 64)); + sum = tmp >> 64; + do { + sum += *(const u32 *)iph; + iph += 4; + } while (--ihl); + + sum += ((sum >> 32) | (sum << 32)); + return csum_fold(sum >> 32); +} +#define ip_fast_csum ip_fast_csum + +#include <asm-generic/checksum.h> + +#endif /* __ASM_CHECKSUM_H */ -- 2.8.1.dirty ^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH] arm64: Implement optimised IP checksum helpers @ 2016-05-12 14:26 ` Robin Murphy 0 siblings, 0 replies; 10+ messages in thread From: Robin Murphy @ 2016-05-12 14:26 UTC (permalink / raw) To: will.deacon, catalin.marinas, luke.starrett Cc: linux-kernel, linux-arm-kernel, bcm-kernel-feedback-list AArch64 is capable of 128-bit memory accesses without alignment restrictions, which makes it both possible and highly practical to slurp up a typical 20-byte IP header in just 2 loads. Implement our own version of ip_fast_checksum() to take advantage of that, resulting in considerably fewer instructions and memory accesses than the generic version. We can also get more optimal code generation for csum_fold() by defining it a slightly different way round from the generic version, so throw that into the mix too. Suggested-by: Luke Starrett <luke.starrett@broadcom.com> Signed-off-by: Robin Murphy <robin.murphy@arm.com> --- Having spent an evening poring over disassembler output, it turns out that there's really no need to drop into asm for this, hurrah! I've taken it for a spin on both big and little-endian, with no ill effects. arch/arm64/include/asm/checksum.h | 49 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 49 insertions(+) create mode 100644 arch/arm64/include/asm/checksum.h diff --git a/arch/arm64/include/asm/checksum.h b/arch/arm64/include/asm/checksum.h new file mode 100644 index 000000000000..7e975ffce837 --- /dev/null +++ b/arch/arm64/include/asm/checksum.h @@ -0,0 +1,49 @@ +/* + * Copyright (C) 2016 ARM Ltd. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see <http://www.gnu.org/licenses/>. + */ +#ifndef __ASM_CHECKSUM_H +#define __ASM_CHECKSUM_H + +static inline __sum16 csum_fold(__wsum csum) +{ + u32 sum = (__force u32)csum; + sum += (sum >> 16) | (sum << 16); + return ~(__force __sum16)(sum >> 16); +} +#define csum_fold csum_fold + +static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl) +{ + __uint128_t tmp; + u64 sum; + + tmp = *(const __uint128_t *)iph; + iph += 16; + ihl -= 4; + tmp += ((tmp >> 64) | (tmp << 64)); + sum = tmp >> 64; + do { + sum += *(const u32 *)iph; + iph += 4; + } while (--ihl); + + sum += ((sum >> 32) | (sum << 32)); + return csum_fold(sum >> 32); +} +#define ip_fast_csum ip_fast_csum + +#include <asm-generic/checksum.h> + +#endif /* __ASM_CHECKSUM_H */ -- 2.8.1.dirty ^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH] arm64: Implement optimised IP checksum helpers 2016-05-12 14:26 ` Robin Murphy @ 2016-05-12 15:34 ` Luke Starrett -1 siblings, 0 replies; 10+ messages in thread From: Luke Starrett @ 2016-05-12 15:34 UTC (permalink / raw) To: linux-arm-kernel Hi Robin, I pulled this in to a userspace test app expecting that the __uint128_t type might cause GCC to emit 'ldp'. Seems like that was that your intent based on your commit note. Instead I see two 64b loads (ldr Xn), and a single 32b load (ldr Wn) for the trailing 4B. This was with Linaro GCC 4.9-2015.06. Otherwise, the C cycle count looks good enough compared to the asm version. Thanks, Luke On 5/12/2016 10:26 AM, Robin Murphy wrote: > AArch64 is capable of 128-bit memory accesses without alignment > restrictions, which makes it both possible and highly practical to slurp > up a typical 20-byte IP header in just 2 loads. Implement our own > version of ip_fast_checksum() to take advantage of that, resulting in > considerably fewer instructions and memory accesses than the generic > version. We can also get more optimal code generation for csum_fold() by > defining it a slightly different way round from the generic version, so > throw that into the mix too. > > Suggested-by: Luke Starrett <luke.starrett@broadcom.com> > Signed-off-by: Robin Murphy <robin.murphy@arm.com> > --- > > Having spent an evening poring over disassembler output, it turns out > that there's really no need to drop into asm for this, hurrah! I've > taken it for a spin on both big and little-endian, with no ill effects. > > arch/arm64/include/asm/checksum.h | 49 +++++++++++++++++++++++++++++++++++++++ > 1 file changed, 49 insertions(+) > create mode 100644 arch/arm64/include/asm/checksum.h > > diff --git a/arch/arm64/include/asm/checksum.h b/arch/arm64/include/asm/checksum.h > new file mode 100644 > index 000000000000..7e975ffce837 > --- /dev/null > +++ b/arch/arm64/include/asm/checksum.h > @@ -0,0 +1,49 @@ > +/* > + * Copyright (C) 2016 ARM Ltd. > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License version 2 as > + * published by the Free Software Foundation. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU General Public License for more details. > + * > + * You should have received a copy of the GNU General Public License > + * along with this program. If not, see <http://www.gnu.org/licenses/>. > + */ > +#ifndef __ASM_CHECKSUM_H > +#define __ASM_CHECKSUM_H > + > +static inline __sum16 csum_fold(__wsum csum) > +{ > + u32 sum = (__force u32)csum; > + sum += (sum >> 16) | (sum << 16); > + return ~(__force __sum16)(sum >> 16); > +} > +#define csum_fold csum_fold > + > +static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl) > +{ > + __uint128_t tmp; > + u64 sum; > + > + tmp = *(const __uint128_t *)iph; > + iph += 16; > + ihl -= 4; > + tmp += ((tmp >> 64) | (tmp << 64)); > + sum = tmp >> 64; > + do { > + sum += *(const u32 *)iph; > + iph += 4; > + } while (--ihl); > + > + sum += ((sum >> 32) | (sum << 32)); > + return csum_fold(sum >> 32); > +} > +#define ip_fast_csum ip_fast_csum > + > +#include <asm-generic/checksum.h> > + > +#endif /* __ASM_CHECKSUM_H */ ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] arm64: Implement optimised IP checksum helpers @ 2016-05-12 15:34 ` Luke Starrett 0 siblings, 0 replies; 10+ messages in thread From: Luke Starrett @ 2016-05-12 15:34 UTC (permalink / raw) To: Robin Murphy, will.deacon, catalin.marinas Cc: linux-kernel, linux-arm-kernel, bcm-kernel-feedback-list Hi Robin, I pulled this in to a userspace test app expecting that the __uint128_t type might cause GCC to emit 'ldp'. Seems like that was that your intent based on your commit note. Instead I see two 64b loads (ldr Xn), and a single 32b load (ldr Wn) for the trailing 4B. This was with Linaro GCC 4.9-2015.06. Otherwise, the C cycle count looks good enough compared to the asm version. Thanks, Luke On 5/12/2016 10:26 AM, Robin Murphy wrote: > AArch64 is capable of 128-bit memory accesses without alignment > restrictions, which makes it both possible and highly practical to slurp > up a typical 20-byte IP header in just 2 loads. Implement our own > version of ip_fast_checksum() to take advantage of that, resulting in > considerably fewer instructions and memory accesses than the generic > version. We can also get more optimal code generation for csum_fold() by > defining it a slightly different way round from the generic version, so > throw that into the mix too. > > Suggested-by: Luke Starrett <luke.starrett@broadcom.com> > Signed-off-by: Robin Murphy <robin.murphy@arm.com> > --- > > Having spent an evening poring over disassembler output, it turns out > that there's really no need to drop into asm for this, hurrah! I've > taken it for a spin on both big and little-endian, with no ill effects. > > arch/arm64/include/asm/checksum.h | 49 +++++++++++++++++++++++++++++++++++++++ > 1 file changed, 49 insertions(+) > create mode 100644 arch/arm64/include/asm/checksum.h > > diff --git a/arch/arm64/include/asm/checksum.h b/arch/arm64/include/asm/checksum.h > new file mode 100644 > index 000000000000..7e975ffce837 > --- /dev/null > +++ b/arch/arm64/include/asm/checksum.h > @@ -0,0 +1,49 @@ > +/* > + * Copyright (C) 2016 ARM Ltd. > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License version 2 as > + * published by the Free Software Foundation. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU General Public License for more details. > + * > + * You should have received a copy of the GNU General Public License > + * along with this program. If not, see <http://www.gnu.org/licenses/>. > + */ > +#ifndef __ASM_CHECKSUM_H > +#define __ASM_CHECKSUM_H > + > +static inline __sum16 csum_fold(__wsum csum) > +{ > + u32 sum = (__force u32)csum; > + sum += (sum >> 16) | (sum << 16); > + return ~(__force __sum16)(sum >> 16); > +} > +#define csum_fold csum_fold > + > +static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl) > +{ > + __uint128_t tmp; > + u64 sum; > + > + tmp = *(const __uint128_t *)iph; > + iph += 16; > + ihl -= 4; > + tmp += ((tmp >> 64) | (tmp << 64)); > + sum = tmp >> 64; > + do { > + sum += *(const u32 *)iph; > + iph += 4; > + } while (--ihl); > + > + sum += ((sum >> 32) | (sum << 32)); > + return csum_fold(sum >> 32); > +} > +#define ip_fast_csum ip_fast_csum > + > +#include <asm-generic/checksum.h> > + > +#endif /* __ASM_CHECKSUM_H */ ^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH] arm64: Implement optimised IP checksum helpers 2016-05-12 15:34 ` Luke Starrett @ 2016-05-12 17:08 ` Robin Murphy -1 siblings, 0 replies; 10+ messages in thread From: Robin Murphy @ 2016-05-12 17:08 UTC (permalink / raw) To: linux-arm-kernel Hi Luke, On 12/05/16 16:34, Luke Starrett wrote: > Hi Robin, > > I pulled this in to a userspace test app expecting that the __uint128_t > type might cause GCC to emit 'ldp'. Seems like that was that your > intent based on your commit note. Instead I see two 64b loads (ldr Xn), > and a single 32b load (ldr Wn) for the trailing 4B. This was with > Linaro GCC 4.9-2015.06. GCC 5 happily emits ldp there, but indeed I couldn't figure out how to convince GCC 4 to do so. From a quick ferret around in the GCC Git, it looks like the relevant optimisations may have only gone in post-4.9. > Otherwise, the C cycle count looks good enough compared to the asm version. Yeah, compiling as standalone functions with GCC 5 I get 19 instructions vs. 17 for the asm, but the loop logic gets optimised out completely when ihl is a compile-time constant (e.g. inet_gro_receive()) Cheers, Robin. > > Thanks, > > Luke > > > On 5/12/2016 10:26 AM, Robin Murphy wrote: >> AArch64 is capable of 128-bit memory accesses without alignment >> restrictions, which makes it both possible and highly practical to slurp >> up a typical 20-byte IP header in just 2 loads. Implement our own >> version of ip_fast_checksum() to take advantage of that, resulting in >> considerably fewer instructions and memory accesses than the generic >> version. We can also get more optimal code generation for csum_fold() by >> defining it a slightly different way round from the generic version, so >> throw that into the mix too. >> >> Suggested-by: Luke Starrett <luke.starrett@broadcom.com> >> Signed-off-by: Robin Murphy <robin.murphy@arm.com> >> --- >> >> Having spent an evening poring over disassembler output, it turns out >> that there's really no need to drop into asm for this, hurrah! I've >> taken it for a spin on both big and little-endian, with no ill effects. >> >> arch/arm64/include/asm/checksum.h | 49 >> +++++++++++++++++++++++++++++++++++++++ >> 1 file changed, 49 insertions(+) >> create mode 100644 arch/arm64/include/asm/checksum.h >> >> diff --git a/arch/arm64/include/asm/checksum.h >> b/arch/arm64/include/asm/checksum.h >> new file mode 100644 >> index 000000000000..7e975ffce837 >> --- /dev/null >> +++ b/arch/arm64/include/asm/checksum.h >> @@ -0,0 +1,49 @@ >> +/* >> + * Copyright (C) 2016 ARM Ltd. >> + * >> + * This program is free software; you can redistribute it and/or modify >> + * it under the terms of the GNU General Public License version 2 as >> + * published by the Free Software Foundation. >> + * >> + * This program is distributed in the hope that it will be useful, >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the >> + * GNU General Public License for more details. >> + * >> + * You should have received a copy of the GNU General Public License >> + * along with this program. If not, see <http://www.gnu.org/licenses/>. >> + */ >> +#ifndef __ASM_CHECKSUM_H >> +#define __ASM_CHECKSUM_H >> + >> +static inline __sum16 csum_fold(__wsum csum) >> +{ >> + u32 sum = (__force u32)csum; >> + sum += (sum >> 16) | (sum << 16); >> + return ~(__force __sum16)(sum >> 16); >> +} >> +#define csum_fold csum_fold >> + >> +static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl) >> +{ >> + __uint128_t tmp; >> + u64 sum; >> + >> + tmp = *(const __uint128_t *)iph; >> + iph += 16; >> + ihl -= 4; >> + tmp += ((tmp >> 64) | (tmp << 64)); >> + sum = tmp >> 64; >> + do { >> + sum += *(const u32 *)iph; >> + iph += 4; >> + } while (--ihl); >> + >> + sum += ((sum >> 32) | (sum << 32)); >> + return csum_fold(sum >> 32); >> +} >> +#define ip_fast_csum ip_fast_csum >> + >> +#include <asm-generic/checksum.h> >> + >> +#endif /* __ASM_CHECKSUM_H */ > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] arm64: Implement optimised IP checksum helpers @ 2016-05-12 17:08 ` Robin Murphy 0 siblings, 0 replies; 10+ messages in thread From: Robin Murphy @ 2016-05-12 17:08 UTC (permalink / raw) To: Luke Starrett, will.deacon, catalin.marinas Cc: linux-kernel, linux-arm-kernel, bcm-kernel-feedback-list Hi Luke, On 12/05/16 16:34, Luke Starrett wrote: > Hi Robin, > > I pulled this in to a userspace test app expecting that the __uint128_t > type might cause GCC to emit 'ldp'. Seems like that was that your > intent based on your commit note. Instead I see two 64b loads (ldr Xn), > and a single 32b load (ldr Wn) for the trailing 4B. This was with > Linaro GCC 4.9-2015.06. GCC 5 happily emits ldp there, but indeed I couldn't figure out how to convince GCC 4 to do so. From a quick ferret around in the GCC Git, it looks like the relevant optimisations may have only gone in post-4.9. > Otherwise, the C cycle count looks good enough compared to the asm version. Yeah, compiling as standalone functions with GCC 5 I get 19 instructions vs. 17 for the asm, but the loop logic gets optimised out completely when ihl is a compile-time constant (e.g. inet_gro_receive()) Cheers, Robin. > > Thanks, > > Luke > > > On 5/12/2016 10:26 AM, Robin Murphy wrote: >> AArch64 is capable of 128-bit memory accesses without alignment >> restrictions, which makes it both possible and highly practical to slurp >> up a typical 20-byte IP header in just 2 loads. Implement our own >> version of ip_fast_checksum() to take advantage of that, resulting in >> considerably fewer instructions and memory accesses than the generic >> version. We can also get more optimal code generation for csum_fold() by >> defining it a slightly different way round from the generic version, so >> throw that into the mix too. >> >> Suggested-by: Luke Starrett <luke.starrett@broadcom.com> >> Signed-off-by: Robin Murphy <robin.murphy@arm.com> >> --- >> >> Having spent an evening poring over disassembler output, it turns out >> that there's really no need to drop into asm for this, hurrah! I've >> taken it for a spin on both big and little-endian, with no ill effects. >> >> arch/arm64/include/asm/checksum.h | 49 >> +++++++++++++++++++++++++++++++++++++++ >> 1 file changed, 49 insertions(+) >> create mode 100644 arch/arm64/include/asm/checksum.h >> >> diff --git a/arch/arm64/include/asm/checksum.h >> b/arch/arm64/include/asm/checksum.h >> new file mode 100644 >> index 000000000000..7e975ffce837 >> --- /dev/null >> +++ b/arch/arm64/include/asm/checksum.h >> @@ -0,0 +1,49 @@ >> +/* >> + * Copyright (C) 2016 ARM Ltd. >> + * >> + * This program is free software; you can redistribute it and/or modify >> + * it under the terms of the GNU General Public License version 2 as >> + * published by the Free Software Foundation. >> + * >> + * This program is distributed in the hope that it will be useful, >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the >> + * GNU General Public License for more details. >> + * >> + * You should have received a copy of the GNU General Public License >> + * along with this program. If not, see <http://www.gnu.org/licenses/>. >> + */ >> +#ifndef __ASM_CHECKSUM_H >> +#define __ASM_CHECKSUM_H >> + >> +static inline __sum16 csum_fold(__wsum csum) >> +{ >> + u32 sum = (__force u32)csum; >> + sum += (sum >> 16) | (sum << 16); >> + return ~(__force __sum16)(sum >> 16); >> +} >> +#define csum_fold csum_fold >> + >> +static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl) >> +{ >> + __uint128_t tmp; >> + u64 sum; >> + >> + tmp = *(const __uint128_t *)iph; >> + iph += 16; >> + ihl -= 4; >> + tmp += ((tmp >> 64) | (tmp << 64)); >> + sum = tmp >> 64; >> + do { >> + sum += *(const u32 *)iph; >> + iph += 4; >> + } while (--ihl); >> + >> + sum += ((sum >> 32) | (sum << 32)); >> + return csum_fold(sum >> 32); >> +} >> +#define ip_fast_csum ip_fast_csum >> + >> +#include <asm-generic/checksum.h> >> + >> +#endif /* __ASM_CHECKSUM_H */ > ^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH] arm64: Implement optimised IP checksum helpers 2016-05-12 17:08 ` Robin Murphy @ 2016-05-12 18:32 ` Luke Starrett -1 siblings, 0 replies; 10+ messages in thread From: Luke Starrett @ 2016-05-12 18:32 UTC (permalink / raw) To: linux-arm-kernel Hi Robin, On 5/12/2016 1:08 PM, Robin Murphy wrote: > Hi Luke, > > On 12/05/16 16:34, Luke Starrett wrote: >> Hi Robin, >> >> I pulled this in to a userspace test app expecting that the __uint128_t >> type might cause GCC to emit 'ldp'. Seems like that was that your >> intent based on your commit note. Instead I see two 64b loads (ldr Xn), >> and a single 32b load (ldr Wn) for the trailing 4B. This was with >> Linaro GCC 4.9-2015.06. > > GCC 5 happily emits ldp there, but indeed I couldn't figure out how to > convince GCC 4 to do so. From a quick ferret around in the GCC Git, it > looks like the relevant optimisations may have only gone in post-4.9. > Not a problem. I was just curious for my own selfish reasons. >> Otherwise, the C cycle count looks good enough compared to the asm >> version. > > Yeah, compiling as standalone functions with GCC 5 I get 19 > instructions vs. 17 for the asm, but the loop logic gets optimised out > completely when ihl is a compile-time constant (e.g. inet_gro_receive()) > I updated to Linaro GCC 5.3-2016.02, and saw what you described. I ran some smoke testing against a random header generator. LGTM. Acked-by: Luke Starrett <luke.starrett@broadcom.com> ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] arm64: Implement optimised IP checksum helpers @ 2016-05-12 18:32 ` Luke Starrett 0 siblings, 0 replies; 10+ messages in thread From: Luke Starrett @ 2016-05-12 18:32 UTC (permalink / raw) To: Robin Murphy, will.deacon, catalin.marinas Cc: linux-kernel, linux-arm-kernel, bcm-kernel-feedback-list Hi Robin, On 5/12/2016 1:08 PM, Robin Murphy wrote: > Hi Luke, > > On 12/05/16 16:34, Luke Starrett wrote: >> Hi Robin, >> >> I pulled this in to a userspace test app expecting that the __uint128_t >> type might cause GCC to emit 'ldp'. Seems like that was that your >> intent based on your commit note. Instead I see two 64b loads (ldr Xn), >> and a single 32b load (ldr Wn) for the trailing 4B. This was with >> Linaro GCC 4.9-2015.06. > > GCC 5 happily emits ldp there, but indeed I couldn't figure out how to > convince GCC 4 to do so. From a quick ferret around in the GCC Git, it > looks like the relevant optimisations may have only gone in post-4.9. > Not a problem. I was just curious for my own selfish reasons. >> Otherwise, the C cycle count looks good enough compared to the asm >> version. > > Yeah, compiling as standalone functions with GCC 5 I get 19 > instructions vs. 17 for the asm, but the loop logic gets optimised out > completely when ihl is a compile-time constant (e.g. inet_gro_receive()) > I updated to Linaro GCC 5.3-2016.02, and saw what you described. I ran some smoke testing against a random header generator. LGTM. Acked-by: Luke Starrett <luke.starrett@broadcom.com> ^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH] arm64: Implement optimised IP checksum helpers 2016-05-12 14:26 ` Robin Murphy @ 2016-06-17 17:00 ` Catalin Marinas -1 siblings, 0 replies; 10+ messages in thread From: Catalin Marinas @ 2016-06-17 17:00 UTC (permalink / raw) To: linux-arm-kernel On Thu, May 12, 2016 at 03:26:48PM +0100, Robin Murphy wrote: > AArch64 is capable of 128-bit memory accesses without alignment > restrictions, which makes it both possible and highly practical to slurp > up a typical 20-byte IP header in just 2 loads. Implement our own > version of ip_fast_checksum() to take advantage of that, resulting in > considerably fewer instructions and memory accesses than the generic > version. We can also get more optimal code generation for csum_fold() by > defining it a slightly different way round from the generic version, so > throw that into the mix too. > > Suggested-by: Luke Starrett <luke.starrett@broadcom.com> > Signed-off-by: Robin Murphy <robin.murphy@arm.com> Queued for 4.8. Thanks. -- Catalin ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] arm64: Implement optimised IP checksum helpers @ 2016-06-17 17:00 ` Catalin Marinas 0 siblings, 0 replies; 10+ messages in thread From: Catalin Marinas @ 2016-06-17 17:00 UTC (permalink / raw) To: Robin Murphy Cc: will.deacon, luke.starrett, bcm-kernel-feedback-list, linux-kernel, linux-arm-kernel On Thu, May 12, 2016 at 03:26:48PM +0100, Robin Murphy wrote: > AArch64 is capable of 128-bit memory accesses without alignment > restrictions, which makes it both possible and highly practical to slurp > up a typical 20-byte IP header in just 2 loads. Implement our own > version of ip_fast_checksum() to take advantage of that, resulting in > considerably fewer instructions and memory accesses than the generic > version. We can also get more optimal code generation for csum_fold() by > defining it a slightly different way round from the generic version, so > throw that into the mix too. > > Suggested-by: Luke Starrett <luke.starrett@broadcom.com> > Signed-off-by: Robin Murphy <robin.murphy@arm.com> Queued for 4.8. Thanks. -- Catalin ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2016-06-17 17:00 UTC | newest] Thread overview: 10+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2016-05-12 14:26 [PATCH] arm64: Implement optimised IP checksum helpers Robin Murphy 2016-05-12 14:26 ` Robin Murphy 2016-05-12 15:34 ` Luke Starrett 2016-05-12 15:34 ` Luke Starrett 2016-05-12 17:08 ` Robin Murphy 2016-05-12 17:08 ` Robin Murphy 2016-05-12 18:32 ` Luke Starrett 2016-05-12 18:32 ` Luke Starrett 2016-06-17 17:00 ` Catalin Marinas 2016-06-17 17:00 ` Catalin Marinas
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.