From: Eric Biggers
Subject: Re: [PATCH] crypto: arm/chacha20 - faster 8-bit rotations and other optimizations
Date: Fri, 31 Aug 2018 23:42:17 -0700
Message-ID: <20180901064216.GB6466@sol.localdomain>
References: <20180831080140.20553-1-ebiggers@kernel.org>
To: Ard Biesheuvel
Cc: "open list:HARDWARE RANDOM NUMBER GENERATOR CORE", linux-arm-kernel, Herbert Xu
List-Id: linux-crypto.vger.kernel.org

On Fri, Aug 31, 2018 at 06:51:34PM +0200, Ard Biesheuvel wrote:
> >>
> >> +       adr             ip, .Lrol8_table
> >>         mov             r3, #10
> >>
> >>  .Ldoubleround4:
> >> @@ -238,24 +268,25 @@ ENTRY(chacha20_4block_xor_neon)
> >>         // x1 += x5, x13 = rotl32(x13 ^ x1, 8)
> >>         // x2 += x6, x14 = rotl32(x14 ^ x2, 8)
> >>         // x3 += x7, x15 = rotl32(x15 ^ x3, 8)
> >> +       vld1.8          {d16}, [ip, :64]
>
> Also, would it perhaps be more efficient to keep the rotation vector
> in a pair of GPRs, and use something like
>
>   vmov d16, r4, r5
>
> here?
>

I tried that, but it doesn't help on either Cortex-A7 or Cortex-A53.
In fact it's very slightly worse.

- Eric