From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mailtransmit05.runbox.com (mailtransmit05.runbox.com [185.226.149.38]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 00CCF274B58; Fri, 26 Dec 2025 20:25:03 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=185.226.149.38 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1766780710; cv=none; b=etCdiOYSyWJdfk1vXqffNFGPws8voNwA8ZdcRS4MQGi82uczuMZLhUNAzr1bn1e6syR8BzurqsCDqvatvjAjvSA/KEjCqcLpuandSgdf/Cra26/vFFXSOKC6CKR24UGLU4jeGztsP4cmiXTwVDVGvCxxskFXvojGz8NWIkn1t7Y= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1766780710; c=relaxed/simple; bh=nvlLVnvatpqwgOjAS/q3NGOgoHfMJaX/ZS0arCm25vI=; h=Date:From:To:Cc:Subject:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=Q1bfNFaNm6oXhQIvfNLjNzP83HR4HER+Y5CNUC+bhnLYmuqlVDvLa8+weXqFRgGy3GNycDfwhETM+/zqGO5am8swgTBUFUcHQMAKXicWeiqSb5fNZFVEEq0dlzwSyDkG2bhUVT+nlsInqF3ihc1gDJliXpLW0GIc8b2zgkbk2BY= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=runbox.com; spf=pass smtp.mailfrom=runbox.com; dkim=pass (2048-bit key) header.d=runbox.com header.i=@runbox.com header.b=JT6QTaUX; arc=none smtp.client-ip=185.226.149.38 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=runbox.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=runbox.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=runbox.com header.i=@runbox.com header.b="JT6QTaUX" Received: from mailtransmit03.runbox ([10.9.9.163] helo=aibo.runbox.com) by mailtransmit05.runbox.com with esmtps (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.93) (envelope-from ) id 1vZEMf-003iee-38; Fri, 26 Dec 2025 21:24:45 +0100 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=runbox.com; s=selector2; h=Content-Transfer-Encoding:Content-Type:MIME-Version: References:In-Reply-To:Message-ID:Subject:Cc:To:From:Date; bh=i1GRlWCzDCZz3wdSHgzwoDEJfj5lb11Zn8FwETxb/9c=; b=JT6QTaUXUK+E4xcKu7QGqIZCOa TKrrz8OrTaQqR+6vp6Me/XQ5GevgJZn0yZyHvJ9bkASBio+x6/RWP1EEXF0vpSLNH7M8LPlVrcXzG 3p9/33+lbiQ/aMuWhlCSXdYTz1yFWwpot82lX3fMA+4MbHPrjPma/oJ/hBpMz9atpp7zp79D68Ykj N+6GXhcLYnvK7rtT19eq9dxr1ZziF3EEa+hm3F/hDNF7x8zY+9nV1ctacLh3EKAzgorMuLSQU3dgx HMqLG837EHQT5/bLHjq8qGQnwZ6YuZizsmRsXhb6coKVtbQI/Ts1B2huGfAO+yFMxGzXaRnW/ooW+ lyPkVvDA==; Received: from [10.9.9.73] (helo=submission02.runbox) by mailtransmit03.runbox with esmtp (Exim 4.86_2) (envelope-from ) id 1vZEMd-0008Sz-UA; Fri, 26 Dec 2025 21:24:44 +0100 Received: by submission02.runbox with esmtpsa [Authenticated ID (1493616)] (TLS1.2:ECDHE_SECP256R1__RSA_SHA256__AES_256_GCM:256) (Exim 4.93) id 1vZEMV-00H1T6-CE; Fri, 26 Dec 2025 21:24:35 +0100 Date: Fri, 26 Dec 2025 20:24:33 +0000 From: david laight To: Eric Biggers Cc: linux-crypto@vger.kernel.org, linux-kernel@vger.kernel.org, Ard Biesheuvel , "Jason A . Donenfeld" , Herbert Xu , Thorsten Blum , Nathan Chancellor , Nick Desaulniers , Bill Wendling , Justin Stitt , David Sterba , llvm@lists.linux.dev, linux-btrfs@vger.kernel.org Subject: Re: [PATCH] lib/crypto: blake2b: Roll up BLAKE2b round loop on 32-bit Message-ID: <20251226202433.107af09a@pumpkin> In-Reply-To: <20251205201411.GA1954@quark> References: <20251203190652.144076-1-ebiggers@kernel.org> <20251205141644.313404db@pumpkin> <20251205201411.GA1954@quark> X-Mailer: Claws Mail 4.1.1 (GTK 3.24.38; arm-unknown-linux-gnueabihf) Precedence: bulk X-Mailing-List: linux-btrfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit On Fri, 5 Dec 2025 12:14:11 -0800 Eric Biggers wrote: > On Fri, Dec 05, 2025 at 02:16:44PM +0000, david laight wrote: > > Note that executing two G() in parallel probably requires the source > > interleave the instructions for the two G() rather than relying on the > > cpu's 'out of order execution' to do all the work > > (Intel cpu might manage it...). > > I actually tried that earlier, and it didn't help. Either the compiler > interleaved the calculations already, or the CPU did, or both. > > It definitely could use some more investigation to better understand > exactly what is going on, though. > > You're welcome to take a closer look, if you're interested. I had a quick look at the objdump output for the 'not unrolled loop' of blake2s on x86-64 compiled with gcc 12.2. The generated code seemed reasonable. A single register tracked the array of offsets for the data buffer. So on x86 there was a read of the offset then nn(%rsp,%reg,4) to get the value (%reg,8 for blake2b). There weren't many spills to stack, I suspect that 14 of the v[] were assigned to registers - but didn't analyse the entire loop. The fully unrolled loop is harder to read, but one of the v[] still needs spilling to stack. Each 1/2G has at least one memory read and seven ALU operations. The Intel cpu (Haswell onwards) can execute 4 ALU instructions every clock - so however well the multiple G get scheduled each 1/2G will be (pretty much) two clocks. That really means it should be possible to include the second memory read (for the not-unrolled loop) without slowing things down. Even if the nn(%rsp,%reg,8) needs an extra ALU operations the change shouldn't be massive. Which makes be wonder whether the slowdown for rolling-up the loop is due to data cache effects rather than actual ALU instructions. Of course this is x86 and the nn(%rsp,%reg,8) addressing mode helps. Otherwise you'd want to multiply the offsets by 8 and, ideally, add in the stack offset of the data[] array allowing the simpler (%sp,%reg) addressing mode. I've still not done any timings, on holiday with the wrong computers. David > > - Eric >