All of lore.kernel.org
 help / color / mirror / Atom feed
From: robin.murphy@arm.com (Robin Murphy)
To: linux-arm-kernel@lists.infradead.org
Subject: [PATCH] arm64: Implement optimised IP checksum helpers
Date: Thu, 12 May 2016 18:08:41 +0100	[thread overview]
Message-ID: <5734B899.3040207@arm.com> (raw)
In-Reply-To: <a3d91ddf-565d-5cef-0a56-39154e3a00d2@broadcom.com>

Hi Luke,

On 12/05/16 16:34, Luke Starrett wrote:
> Hi Robin,
>
> I pulled this in to a userspace test app expecting that the __uint128_t
> type might cause GCC to emit 'ldp'.  Seems like that was that your
> intent based on your commit note.  Instead I see two 64b loads (ldr Xn),
> and a single 32b load (ldr Wn) for the trailing 4B.  This was with
> Linaro GCC 4.9-2015.06.

GCC 5 happily emits ldp there, but indeed I couldn't figure out how to 
convince GCC 4 to do so. From a quick ferret around in the GCC Git, it 
looks like the relevant optimisations may have only gone in post-4.9.

> Otherwise, the C cycle count looks good enough compared to the asm version.

Yeah, compiling as standalone functions with GCC 5 I get 19 instructions 
vs. 17 for the asm, but the loop logic gets optimised out completely 
when ihl is a compile-time constant (e.g. inet_gro_receive())

Cheers,
Robin.

>
> Thanks,
>
> Luke
>
>
> On 5/12/2016 10:26 AM, Robin Murphy wrote:
>> AArch64 is capable of 128-bit memory accesses without alignment
>> restrictions, which makes it both possible and highly practical to slurp
>> up a typical 20-byte IP header in just 2 loads. Implement our own
>> version of ip_fast_checksum() to take advantage of that, resulting in
>> considerably fewer instructions and memory accesses than the generic
>> version. We can also get more optimal code generation for csum_fold() by
>> defining it a slightly different way round from the generic version, so
>> throw that into the mix too.
>>
>> Suggested-by: Luke Starrett <luke.starrett@broadcom.com>
>> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
>> ---
>>
>> Having spent an evening poring over disassembler output, it turns out
>> that there's really no need to drop into asm for this, hurrah! I've
>> taken it for a spin on both big and little-endian, with no ill effects.
>>
>>   arch/arm64/include/asm/checksum.h | 49
>> +++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 49 insertions(+)
>>   create mode 100644 arch/arm64/include/asm/checksum.h
>>
>> diff --git a/arch/arm64/include/asm/checksum.h
>> b/arch/arm64/include/asm/checksum.h
>> new file mode 100644
>> index 000000000000..7e975ffce837
>> --- /dev/null
>> +++ b/arch/arm64/include/asm/checksum.h
>> @@ -0,0 +1,49 @@
>> +/*
>> + * Copyright (C) 2016 ARM Ltd.
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program.  If not, see <http://www.gnu.org/licenses/>.
>> + */
>> +#ifndef __ASM_CHECKSUM_H
>> +#define __ASM_CHECKSUM_H
>> +
>> +static inline __sum16 csum_fold(__wsum csum)
>> +{
>> +    u32 sum = (__force u32)csum;
>> +    sum += (sum >> 16) | (sum << 16);
>> +    return ~(__force __sum16)(sum >> 16);
>> +}
>> +#define csum_fold csum_fold
>> +
>> +static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
>> +{
>> +    __uint128_t tmp;
>> +    u64 sum;
>> +
>> +    tmp = *(const __uint128_t *)iph;
>> +    iph += 16;
>> +    ihl -= 4;
>> +    tmp += ((tmp >> 64) | (tmp << 64));
>> +    sum = tmp >> 64;
>> +    do {
>> +        sum += *(const u32 *)iph;
>> +        iph += 4;
>> +    } while (--ihl);
>> +
>> +    sum += ((sum >> 32) | (sum << 32));
>> +    return csum_fold(sum >> 32);
>> +}
>> +#define ip_fast_csum ip_fast_csum
>> +
>> +#include <asm-generic/checksum.h>
>> +
>> +#endif    /* __ASM_CHECKSUM_H */
>

WARNING: multiple messages have this Message-ID (diff)
From: Robin Murphy <robin.murphy@arm.com>
To: Luke Starrett <luke.starrett@broadcom.com>,
	will.deacon@arm.com, catalin.marinas@arm.com
Cc: linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	bcm-kernel-feedback-list@broadcom.com
Subject: Re: [PATCH] arm64: Implement optimised IP checksum helpers
Date: Thu, 12 May 2016 18:08:41 +0100	[thread overview]
Message-ID: <5734B899.3040207@arm.com> (raw)
In-Reply-To: <a3d91ddf-565d-5cef-0a56-39154e3a00d2@broadcom.com>

Hi Luke,

On 12/05/16 16:34, Luke Starrett wrote:
> Hi Robin,
>
> I pulled this in to a userspace test app expecting that the __uint128_t
> type might cause GCC to emit 'ldp'.  Seems like that was that your
> intent based on your commit note.  Instead I see two 64b loads (ldr Xn),
> and a single 32b load (ldr Wn) for the trailing 4B.  This was with
> Linaro GCC 4.9-2015.06.

GCC 5 happily emits ldp there, but indeed I couldn't figure out how to 
convince GCC 4 to do so. From a quick ferret around in the GCC Git, it 
looks like the relevant optimisations may have only gone in post-4.9.

> Otherwise, the C cycle count looks good enough compared to the asm version.

Yeah, compiling as standalone functions with GCC 5 I get 19 instructions 
vs. 17 for the asm, but the loop logic gets optimised out completely 
when ihl is a compile-time constant (e.g. inet_gro_receive())

Cheers,
Robin.

>
> Thanks,
>
> Luke
>
>
> On 5/12/2016 10:26 AM, Robin Murphy wrote:
>> AArch64 is capable of 128-bit memory accesses without alignment
>> restrictions, which makes it both possible and highly practical to slurp
>> up a typical 20-byte IP header in just 2 loads. Implement our own
>> version of ip_fast_checksum() to take advantage of that, resulting in
>> considerably fewer instructions and memory accesses than the generic
>> version. We can also get more optimal code generation for csum_fold() by
>> defining it a slightly different way round from the generic version, so
>> throw that into the mix too.
>>
>> Suggested-by: Luke Starrett <luke.starrett@broadcom.com>
>> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
>> ---
>>
>> Having spent an evening poring over disassembler output, it turns out
>> that there's really no need to drop into asm for this, hurrah! I've
>> taken it for a spin on both big and little-endian, with no ill effects.
>>
>>   arch/arm64/include/asm/checksum.h | 49
>> +++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 49 insertions(+)
>>   create mode 100644 arch/arm64/include/asm/checksum.h
>>
>> diff --git a/arch/arm64/include/asm/checksum.h
>> b/arch/arm64/include/asm/checksum.h
>> new file mode 100644
>> index 000000000000..7e975ffce837
>> --- /dev/null
>> +++ b/arch/arm64/include/asm/checksum.h
>> @@ -0,0 +1,49 @@
>> +/*
>> + * Copyright (C) 2016 ARM Ltd.
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program.  If not, see <http://www.gnu.org/licenses/>.
>> + */
>> +#ifndef __ASM_CHECKSUM_H
>> +#define __ASM_CHECKSUM_H
>> +
>> +static inline __sum16 csum_fold(__wsum csum)
>> +{
>> +    u32 sum = (__force u32)csum;
>> +    sum += (sum >> 16) | (sum << 16);
>> +    return ~(__force __sum16)(sum >> 16);
>> +}
>> +#define csum_fold csum_fold
>> +
>> +static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
>> +{
>> +    __uint128_t tmp;
>> +    u64 sum;
>> +
>> +    tmp = *(const __uint128_t *)iph;
>> +    iph += 16;
>> +    ihl -= 4;
>> +    tmp += ((tmp >> 64) | (tmp << 64));
>> +    sum = tmp >> 64;
>> +    do {
>> +        sum += *(const u32 *)iph;
>> +        iph += 4;
>> +    } while (--ihl);
>> +
>> +    sum += ((sum >> 32) | (sum << 32));
>> +    return csum_fold(sum >> 32);
>> +}
>> +#define ip_fast_csum ip_fast_csum
>> +
>> +#include <asm-generic/checksum.h>
>> +
>> +#endif    /* __ASM_CHECKSUM_H */
>

  reply	other threads:[~2016-05-12 17:08 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-05-12 14:26 [PATCH] arm64: Implement optimised IP checksum helpers Robin Murphy
2016-05-12 14:26 ` Robin Murphy
2016-05-12 15:34 ` Luke Starrett
2016-05-12 15:34   ` Luke Starrett
2016-05-12 17:08   ` Robin Murphy [this message]
2016-05-12 17:08     ` Robin Murphy
2016-05-12 18:32     ` Luke Starrett
2016-05-12 18:32       ` Luke Starrett
2016-06-17 17:00 ` Catalin Marinas
2016-06-17 17:00   ` Catalin Marinas

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5734B899.3040207@arm.com \
    --to=robin.murphy@arm.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.