Subject: Re: [PATCH] arm64: do_csum: implement accelerated scalar version
To: David Laight, 'Will Deacon'
Cc: Zhangshaokun, Ard Biesheuvel,
 "linux-arm-kernel@lists.infradead.org",
 "netdev@vger.kernel.org",
 "ilias.apalodimas@linaro.org",
 "huanglingyan (A)",
 "steve.capper@arm.com"
References: <20190218230842.11448-1-ard.biesheuvel@linaro.org>
 <20190412095243.GA27193@fuggles.cambridge.arm.com>
 <41b30c72-c1c5-14b2-b2e1-3507d552830d@arm.com>
 <20190515094704.GC24357@fuggles.cambridge.arm.com>
 <6e755b2daaf341128cb3b54f36172442@AcuMS.aculab.com>
 <3d4fdbb5-7c7f-9331-187e-14c09dd1c18d@arm.com>
 <9f72aecd99e74c1a939df6562ed9c18c@AcuMS.aculab.com>
From: Robin Murphy
Message-ID: <083f8222-971c-0d8e-4650-0d88b193e316@arm.com>
Date: Wed, 15 May 2019 13:39:47 +0100
In-Reply-To: <9f72aecd99e74c1a939df6562ed9c18c@AcuMS.aculab.com>
X-Mailing-List: netdev@vger.kernel.org

On 15/05/2019 12:13, David Laight wrote:
> From: Robin Murphy
>> Sent: 15 May 2019 11:58
>> To: David Laight; 'Will Deacon'
>> Cc: Zhangshaokun; Ard Biesheuvel; linux-arm-kernel@lists.infradead.org; netdev@vger.kernel.org;
>> ilias.apalodimas@linaro.org; huanglingyan (A); steve.capper@arm.com
>> Subject: Re: [PATCH] arm64: do_csum: implement accelerated scalar version
>>
>> On 15/05/2019 11:15, David Laight wrote:
>>> ...
>>>>> 	ptr = (u64 *)(buff - offset);
>>>>> 	shift = offset * 8;
>>>>>
>>>>> 	/*
>>>>> 	 * Head: zero out any excess leading bytes. Shifting back by the same
>>>>> 	 * amount should be at least as fast as any other way of handling the
>>>>> 	 * odd/even alignment, and means we can ignore it until the very end.
>>>>> 	 */
>>>>> 	data = *ptr++;
>>>>> #ifdef __LITTLE_ENDIAN
>>>>> 	data = (data >> shift) << shift;
>>>>> #else
>>>>> 	data = (data << shift) >> shift;
>>>>> #endif
>>>
>>> I suspect that
>>> #ifdef __LITTLE_ENDIAN
>>> 	data &= ~0ull << shift;
>>> #else
>>> 	data &= ~0ull >> shift;
>>> #endif
>>> is likely to be better.
>>
>> Out of interest, better in which respects? For the A64 ISA at least,
>> that would take 3 instructions plus an additional scratch register, e.g.:
>>
>> 	MOV	x2, #~0
>> 	LSL	x2, x2, x1
>> 	AND	x0, x0, x1

[That should have been "AND x0, x1, x2", obviously...]
>>
>> (alternatively "AND x0, x0, x1 LSL x2" to save 4 bytes of code, but that
>> will typically take as many cycles if not more than just pipelining the
>> two 'simple' ALU instructions)
>>
>> Whereas the original is just two shift instructions in-place.
>>
>> 	LSR	x0, x0, x1
>> 	LSL	x0, x0, x1
>>
>> If the operation were repeated, the constant generation could certainly
>> be amortised over multiple subsequent ANDs for a net win, but that isn't
>> the case here.
>
> On a superscalar processor you reduce the register dependency
> chain by one instruction.
> The original code is pretty much a single dependency chain so
> you are likely to be able to generate the mask 'for free'.

Gotcha, although 'free' still means additional I$ and register rename
footprint, vs. (typically) just 1 extra cycle to forward an ALU result.
It's an interesting consideration, but in our case there are almost
certainly far more little in-order cores out in the wild than big OoO
ones, and the double-shift will always be objectively better for those.

Thanks,
Robin.
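
For reference, the two head-masking variants discussed in this thread can be compared in plain C. This is only a sketch of the little-endian case, not code from the patch: the function names head_mask_shift(), head_mask_and() and head_masks_agree() are made up for illustration, and offset is assumed to be the 0..7 byte misalignment of the buffer, so the shift count stays below 64.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Double-shift variant, as in the quoted patch (little-endian branch):
 * shift the unwanted low-order bytes out, then shift zeroes back in.
 */
static inline uint64_t head_mask_shift(uint64_t data, unsigned int offset)
{
	unsigned int shift;

	assert(offset < 8);	/* assumed precondition: shift stays below 64 */
	shift = offset * 8;

	return (data >> shift) << shift;
}

/*
 * Mask-and-AND variant, as suggested in the reply (little-endian branch):
 * build an all-ones mask, shift it, then AND it into the data.
 */
static inline uint64_t head_mask_and(uint64_t data, unsigned int offset)
{
	unsigned int shift;

	assert(offset < 8);	/* assumed precondition: shift stays below 64 */
	shift = offset * 8;

	return data & (~0ull << shift);
}

/* Hypothetical helper: returns 1 if both variants agree for every offset. */
static int head_masks_agree(uint64_t data)
{
	unsigned int offset;

	for (offset = 0; offset < 8; offset++)
		if (head_mask_shift(data, offset) != head_mask_and(data, offset))
			return 0;

	return 1;
}
```

With data = 0x1122334455667788 and offset = 3, both variants return 0x1122334455000000, so the choice between them is purely about instruction count and dependency chains, which is what the thread is weighing up.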