Date: Wed, 30 Apr 2025 19:26:17 -0700
From: Eric Biggers
To: Herbert Xu
Cc: Linux Crypto Mailing List, linux-kernel@vger.kernel.org,
	linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	linux-riscv@lists.infradead.org, sparclinux@vger.kernel.org,
	linux-s390@vger.kernel.org, x86@kernel.org, Ard Biesheuvel,
	"Jason A. Donenfeld", Linus Torvalds
Subject: Re: [PATCH 00/12] crypto: sha256 - Use partial block API
Message-ID: <20250501022617.GA65059@sol.localdomain>
References: <20250430174543.GB1958@sol.localdomain>

On Thu, May 01, 2025 at 09:21:15AM +0800, Herbert Xu wrote:
> On Wed, Apr 30, 2025 at 10:45:43AM -0700, Eric Biggers wrote:
> >
> > As for your sha256_finup "optimization", it's an interesting idea, but
> > unfortunately it slightly slows down the common case, which is
> > count % 64 < 56, due to the unnecessary copy to the stack and the
> > following zeroization.  In the uncommon case where count % 64 >= 56 you
> > do get to pass nblocks=2 to sha256_blocks_*(), but ultimately SHA-256 is
> > serialized block-by-block anyway, so it ends up being only slightly
> > faster in that case, which again is the uncommon case.  So while it's an
> > interesting idea, it doesn't seem to actually be better.  And the fact
> > that that patch is also being used to submit unrelated, more dubious
> > changes isn't very helpful, of course.
>
> I'm more than willing to change sha256_finup if you can prove it
> with real numbers that it is worse than the single-block version.

Interesting approach -- pushing out misguided optimizations without data, then
demanding data for them to be reverted.  It's obviously worse for
len % 64 < 56 for the reason I gave, so this is a waste of time IMO.
But since you're insisting on data anyway, here are some quick benchmarks on
AMD Zen 5 (not going to bother formatting into a table):

Before your finup "optimization":

sha256(len=0): 145 cycles
sha256(len=1): 146 cycles
sha256(len=2): 146 cycles
sha256(len=3): 146 cycles
sha256(len=4): 146 cycles
sha256(len=5): 146 cycles
sha256(len=6): 146 cycles
sha256(len=7): 146 cycles
sha256(len=8): 151 cycles
sha256(len=9): 148 cycles
sha256(len=10): 148 cycles
sha256(len=11): 148 cycles
sha256(len=12): 148 cycles
sha256(len=13): 148 cycles
sha256(len=14): 148 cycles
sha256(len=15): 149 cycles
sha256(len=16): 149 cycles
sha256(len=17): 148 cycles
sha256(len=18): 148 cycles
sha256(len=19): 148 cycles
sha256(len=20): 148 cycles
sha256(len=21): 148 cycles
sha256(len=22): 148 cycles
sha256(len=23): 148 cycles
sha256(len=24): 148 cycles
sha256(len=25): 148 cycles
sha256(len=26): 148 cycles
sha256(len=27): 148 cycles
sha256(len=28): 148 cycles
sha256(len=29): 148 cycles
sha256(len=30): 148 cycles
sha256(len=31): 148 cycles
sha256(len=32): 151 cycles
sha256(len=33): 148 cycles
sha256(len=34): 148 cycles
sha256(len=35): 148 cycles
sha256(len=36): 148 cycles
sha256(len=37): 148 cycles
sha256(len=38): 148 cycles
sha256(len=39): 148 cycles
sha256(len=40): 148 cycles
sha256(len=41): 148 cycles
sha256(len=42): 148 cycles
sha256(len=43): 148 cycles
sha256(len=44): 148 cycles
sha256(len=45): 148 cycles
sha256(len=46): 150 cycles
sha256(len=47): 149 cycles
sha256(len=48): 147 cycles
sha256(len=49): 147 cycles
sha256(len=50): 147 cycles
sha256(len=51): 147 cycles
sha256(len=52): 147 cycles
sha256(len=53): 147 cycles
sha256(len=54): 147 cycles
sha256(len=55): 148 cycles
sha256(len=56): 278 cycles
sha256(len=57): 278 cycles
sha256(len=58): 278 cycles
sha256(len=59): 278 cycles
sha256(len=60): 277 cycles
sha256(len=61): 277 cycles
sha256(len=62): 277 cycles
sha256(len=63): 276 cycles
sha256(len=64): 276 cycles

After your finup "optimization":

sha256(len=0): 188 cycles
sha256(len=1): 190 cycles
sha256(len=2): 190 cycles
sha256(len=3): 190 cycles
sha256(len=4): 189 cycles
sha256(len=5): 189 cycles
sha256(len=6): 189 cycles
sha256(len=7): 190 cycles
sha256(len=8): 187 cycles
sha256(len=9): 188 cycles
sha256(len=10): 188 cycles
sha256(len=11): 188 cycles
sha256(len=12): 189 cycles
sha256(len=13): 189 cycles
sha256(len=14): 188 cycles
sha256(len=15): 189 cycles
sha256(len=16): 189 cycles
sha256(len=17): 190 cycles
sha256(len=18): 190 cycles
sha256(len=19): 190 cycles
sha256(len=20): 190 cycles
sha256(len=21): 190 cycles
sha256(len=22): 190 cycles
sha256(len=23): 190 cycles
sha256(len=24): 191 cycles
sha256(len=25): 191 cycles
sha256(len=26): 191 cycles
sha256(len=27): 191 cycles
sha256(len=28): 191 cycles
sha256(len=29): 192 cycles
sha256(len=30): 191 cycles
sha256(len=31): 191 cycles
sha256(len=32): 191 cycles
sha256(len=33): 191 cycles
sha256(len=34): 191 cycles
sha256(len=35): 191 cycles
sha256(len=36): 192 cycles
sha256(len=37): 192 cycles
sha256(len=38): 192 cycles
sha256(len=39): 191 cycles
sha256(len=40): 191 cycles
sha256(len=41): 194 cycles
sha256(len=42): 193 cycles
sha256(len=43): 193 cycles
sha256(len=44): 193 cycles
sha256(len=45): 193 cycles
sha256(len=46): 194 cycles
sha256(len=47): 194 cycles
sha256(len=48): 193 cycles
sha256(len=49): 195 cycles
sha256(len=50): 195 cycles
sha256(len=51): 196 cycles
sha256(len=52): 196 cycles
sha256(len=53): 195 cycles
sha256(len=54): 195 cycles
sha256(len=55): 195 cycles
sha256(len=56): 297 cycles
sha256(len=57): 297 cycles
sha256(len=58): 297 cycles
sha256(len=59): 297 cycles
sha256(len=60): 297 cycles
sha256(len=61): 297 cycles
sha256(len=62): 297 cycles
sha256(len=63): 297 cycles
sha256(len=64): 292 cycles

So your "optimization" made it ~43 cycles slower for len % 64 < 56, or ~19
cycles slower for len % 64 >= 56.  As I said, it's from the overhead of
unnecessarily copying the data onto the stack and then having to zeroize it
at the end.

- Eric
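For reference, a minimal sketch of the kind of per-length cycle measurement
that could produce numbers like the above.  This is a hypothetical userspace
harness, not the actual test code used here; it assumes x86_64 (__rdtsc, which
counts TSC reference cycles rather than core clocks) and a one-shot
sha256(data, len, digest) function linked in from elsewhere.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>

/* Assumed one-shot SHA-256 API under test. */
void sha256(const uint8_t *data, size_t len, uint8_t digest[32]);

int main(void)
{
	uint8_t data[64], digest[32];

	memset(data, 0xff, sizeof(data));

	for (size_t len = 0; len <= 64; len++) {
		uint64_t best = UINT64_MAX;

		/* Take the best of many runs to filter out interrupts and
		 * other noise. */
		for (int i = 0; i < 100000; i++) {
			uint64_t t0 = __rdtsc();
			sha256(data, len, digest);
			uint64_t t1 = __rdtsc();
			if (t1 - t0 < best)
				best = t1 - t0;
		}
		printf("sha256(len=%zu): %llu cycles\n",
		       len, (unsigned long long)best);
	}
	return 0;
}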