Subject: Re: [PATCH] arm64: do_csum: implement accelerated scalar version
From: Robin Murphy
To: Will Deacon, Zhangshaokun
Cc: Ard Biesheuvel, linux-arm-kernel@lists.infradead.org, netdev@vger.kernel.org, ilias.apalodimas@linaro.org, "huanglingyan (A)", steve.capper@arm.com
Date: Mon, 15 Apr 2019 19:18:22 +0100
Message-ID: <41b30c72-c1c5-14b2-b2e1-3507d552830d@arm.com>
In-Reply-To: <20190412095243.GA27193@fuggles.cambridge.arm.com>

On 12/04/2019 10:52, Will Deacon wrote:
> On Fri, Apr 12, 2019 at 10:31:16AM +0800, Zhangshaokun wrote:
>> On 2019/2/19 7:08, Ard Biesheuvel wrote:
>>> It turns out that the IP checksumming code is still exercised often,
>>> even though one might expect that modern NICs with checksum offload
>>> have no use for it. However, as Lingyan points out, there are
>>> combinations of features where the network stack may still fall back
>>> to software checksumming, and so it makes sense to provide an
>>> optimized implementation in software as well.
>>>
>>> So provide an implementation of do_csum() in scalar assembler, which,
>>> unlike C, gives direct access to the carry flag, making the code run
>>> substantially faster. The routine uses overlapping 64-byte loads for
>>> all input sizes > 64 bytes, in order to reduce the number of branches
>>> and improve performance on cores with deep pipelines.
>>>
>>> On Cortex-A57, this implementation is on par with Lingyan's NEON
>>> implementation, and roughly 7x as fast as the generic C code.
>>>
>>> Cc: "huanglingyan (A)"
>>> Signed-off-by: Ard Biesheuvel
>>> ---
>>> Test code after the patch.
>>
>> Hi maintainers and Ard,
>>
>> Any update on it?
>
> I'm waiting for Robin to come back with numbers for a C implementation.
>
> Robin -- did you get anywhere with that?
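First, a quick recap of why the asm version wins at all: AArch64's
adds/adcs instructions fold every carry-out straight back into the
running sum, which plain C has no way to express directly. Something
along these lines (an illustrative fragment only, not Ard's actual
routine; csum32_asm_sketch is a made-up name, and u64 is assumed to
come from the kernel's <linux/types.h>):

/*
 * Illustrative sketch, not the patch's code: accumulate one aligned
 * 32-byte block via the carry flag. Each adcs adds the previous
 * carry-out back in; the final adc folds the last one into the result.
 */
static u64 csum32_asm_sketch(const u64 *p, u64 sum)
{
	asm("ldp	x8, x9, [%1]\n\t"
	    "ldp	x10, x11, [%1, #16]\n\t"
	    "adds	%0, %0, x8\n\t"
	    "adcs	%0, %0, x9\n\t"
	    "adcs	%0, %0, x10\n\t"
	    "adcs	%0, %0, x11\n\t"
	    "adc	%0, %0, xzr"
	    : "+r" (sum)
	    : "r" (p)
	    : "x8", "x9", "x10", "x11", "memory");
	return sum;
}

The __uint128_t trick in accumulate() below is essentially about
coaxing the compiler into emitting that same carry handling from C.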
Still not what I would call finished, but what I've got so far (besides
an increasingly elaborate test rig) is below. It still wants some
unrolling in the middle to really fly (and actual testing on BE), but
the worst-case performance already equals or just beats this asm
version on Cortex-A53 with GCC 7, by virtue of being
alignment-insensitive and branchless apart from the main loop.
Unfortunately, the advantage of C code being instrumentable does also
come around to bite me...

Robin.

----->8-----
/* Looks dumb, but generates nice-ish code */
static u64 accumulate(u64 sum, u64 data)
{
	__uint128_t tmp = (__uint128_t)sum + data;

	return tmp + (tmp >> 64);
}

unsigned int do_csum_c(const unsigned char *buff, int len)
{
	unsigned int offset, shift, sum, count;
	u64 data, *ptr;
	u64 sum64 = 0;

	/* Guard len == 0, which would otherwise hit a shift-by-64 in the tail */
	if (unlikely(len == 0))
		return 0;

	offset = (unsigned long)buff & 0x7;
	/*
	 * This is to all intents and purposes safe, since rounding down cannot
	 * result in a different page or cache line being accessed, and @buff
	 * should absolutely not be pointing to anything read-sensitive.
	 * It does, however, piss off KASAN...
	 */
	ptr = (u64 *)(buff - offset);
	shift = offset * 8;

	/*
	 * Head: zero out any excess leading bytes. Shifting back by the same
	 * amount should be at least as fast as any other way of handling the
	 * odd/even alignment, and means we can ignore it until the very end.
	 */
	data = *ptr++;
#ifdef __LITTLE_ENDIAN
	data = (data >> shift) << shift;
#else
	data = (data << shift) >> shift;
#endif
	count = 8 - offset;

	/* Body: straightforward aligned loads from here on... */
	/* TODO: fancy stuff with larger strides and uint128s? */
	while (len > count) {
		sum64 = accumulate(sum64, data);
		data = *ptr++;
		count += 8;
	}
	/*
	 * Tail: zero any over-read bytes similarly to the head, again
	 * preserving odd/even alignment.
	 */
	shift = (count - len) * 8;
#ifdef __LITTLE_ENDIAN
	data = (data << shift) >> shift;
#else
	data = (data >> shift) << shift;
#endif
	sum64 = accumulate(sum64, data);

	/* Finally, fold 64 -> 32 -> 16 bits, swapping bytes back for odd offsets */
	sum64 += (sum64 >> 32) | (sum64 << 32);
	sum = sum64 >> 32;
	sum += (sum >> 16) | (sum << 16);
	if (offset & 1)
		return (u16)swab32(sum);

	return sum >> 16;
}
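Since I mentioned the test rig: for anyone who wants to poke at this in
userspace, a minimal sketch along the lines below (emphatically not my
actual rig; the u64/u16/swab32/unlikely shims are stand-ins for the
kernel definitions, and it assumes a little-endian host) cross-checks
do_csum_c() against a naive reference over random offsets and lengths:

/*
 * Hypothetical userspace harness, not the actual test rig. The shims
 * below stand in for the kernel definitions; paste do_csum_c() from
 * above after them. Assumes a little-endian host.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef uint64_t u64;
typedef uint16_t u16;
#define swab32(x) __builtin_bswap32(x)
#define unlikely(x) (x)
#ifndef __LITTLE_ENDIAN
#define __LITTLE_ENDIAN 1	/* assuming an LE host, per the note above */
#endif

/* ... do_csum_c() as posted above goes here ... */

/* Naive reference: sum LE 16-bit words, zero-pad an odd tail, fold */
static unsigned int ref_csum(const unsigned char *buff, int len)
{
	u64 sum = 0;
	int i;

	for (i = 0; i + 1 < len; i += 2)
		sum += buff[i] | (unsigned int)buff[i + 1] << 8;
	if (len & 1)
		sum += buff[len - 1];
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return sum;
}

int main(void)
{
	static unsigned char buf[4096 + 16];	/* slack for the over-read */
	int i, off, len;

	srand(1);
	for (i = 0; i < (int)sizeof(buf); i++)
		buf[i] = rand();

	for (i = 0; i < 1000000; i++) {
		off = rand() % 8;
		len = rand() % 4096;
		if (ref_csum(buf + off, len) != do_csum_c(buf + off, len)) {
			printf("mismatch: off=%d len=%d\n", off, len);
			return 1;
		}
	}
	printf("all OK\n");
	return 0;
}

Random offsets and lengths exercise both the head/tail masking and the
loop; BE obviously still needs checking on real hardware.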