From: Eric Biggers <ebiggers@kernel.org>
To: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Linux Crypto Mailing List <linux-crypto@vger.kernel.org>,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-mips@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, linux-riscv@lists.infradead.org,
	sparclinux@vger.kernel.org, linux-s390@vger.kernel.org,
	x86@kernel.org, Ard Biesheuvel <ardb@kernel.org>,
	"Jason A . Donenfeld" <Jason@zx2c4.com>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH 00/12] crypto: sha256 - Use partial block API
Date: Wed, 30 Apr 2025 19:26:17 -0700
Message-ID: <20250501022617.GA65059@sol.localdomain>
In-Reply-To: <aBLMi5XOQKJyJGu-@gondor.apana.org.au>

On Thu, May 01, 2025 at 09:21:15AM +0800, Herbert Xu wrote:
> On Wed, Apr 30, 2025 at 10:45:43AM -0700, Eric Biggers wrote:
> >
> > As for your sha256_finup "optimization", it's an interesting idea, but
> > unfortunately it slightly slows down the common case which is count % 64 < 56,
> > due to the unnecessary copy to the stack and the following zeroization.  In the
> > uncommon case where count % 64 >= 56 you do get to pass nblocks=2 to
> > sha256_blocks_*(), but ultimately SHA-256 is serialized block-by-block anyway,
> > so it ends up being only slightly faster in that case, which again is the
> > uncommon case.  So while it's an interesting idea, it doesn't seem to actually
> > be better.  And the fact that that patch is also being used to submit unrelated,
> > more dubious changes isn't very helpful, of course.
> 
> I'm more than willing to change sha256_finup if you can prove it
> with real numbers that it is worse than the single-block version.

Interesting approach -- pushing out misguided optimizations without data, then
demanding data before they can be reverted.  It's obviously worse for
len % 64 < 56 for the reason I gave, so this is a waste of time IMO.  But since
you're insisting on data anyway, here are some quick benchmarks on AMD Zen 5
(not going to bother formatting into a table):

Before your finup "optimization":

 sha256(len=0): 145 cycles
 sha256(len=1): 146 cycles
 sha256(len=2): 146 cycles
 sha256(len=3): 146 cycles
 sha256(len=4): 146 cycles
 sha256(len=5): 146 cycles
 sha256(len=6): 146 cycles
 sha256(len=7): 146 cycles
 sha256(len=8): 151 cycles
 sha256(len=9): 148 cycles
 sha256(len=10): 148 cycles
 sha256(len=11): 148 cycles
 sha256(len=12): 148 cycles
 sha256(len=13): 148 cycles
 sha256(len=14): 148 cycles
 sha256(len=15): 149 cycles
 sha256(len=16): 149 cycles
 sha256(len=17): 148 cycles
 sha256(len=18): 148 cycles
 sha256(len=19): 148 cycles
 sha256(len=20): 148 cycles
 sha256(len=21): 148 cycles
 sha256(len=22): 148 cycles
 sha256(len=23): 148 cycles
 sha256(len=24): 148 cycles
 sha256(len=25): 148 cycles
 sha256(len=26): 148 cycles
 sha256(len=27): 148 cycles
 sha256(len=28): 148 cycles
 sha256(len=29): 148 cycles
 sha256(len=30): 148 cycles
 sha256(len=31): 148 cycles
 sha256(len=32): 151 cycles
 sha256(len=33): 148 cycles
 sha256(len=34): 148 cycles
 sha256(len=35): 148 cycles
 sha256(len=36): 148 cycles
 sha256(len=37): 148 cycles
 sha256(len=38): 148 cycles
 sha256(len=39): 148 cycles
 sha256(len=40): 148 cycles
 sha256(len=41): 148 cycles
 sha256(len=42): 148 cycles
 sha256(len=43): 148 cycles
 sha256(len=44): 148 cycles
 sha256(len=45): 148 cycles
 sha256(len=46): 150 cycles
 sha256(len=47): 149 cycles
 sha256(len=48): 147 cycles
 sha256(len=49): 147 cycles
 sha256(len=50): 147 cycles
 sha256(len=51): 147 cycles
 sha256(len=52): 147 cycles
 sha256(len=53): 147 cycles
 sha256(len=54): 147 cycles
 sha256(len=55): 148 cycles
 sha256(len=56): 278 cycles
 sha256(len=57): 278 cycles
 sha256(len=58): 278 cycles
 sha256(len=59): 278 cycles
 sha256(len=60): 277 cycles
 sha256(len=61): 277 cycles
 sha256(len=62): 277 cycles
 sha256(len=63): 276 cycles
 sha256(len=64): 276 cycles

After your finup "optimization":

 sha256(len=0): 188 cycles
 sha256(len=1): 190 cycles
 sha256(len=2): 190 cycles
 sha256(len=3): 190 cycles
 sha256(len=4): 189 cycles
 sha256(len=5): 189 cycles
 sha256(len=6): 189 cycles
 sha256(len=7): 190 cycles
 sha256(len=8): 187 cycles
 sha256(len=9): 188 cycles
 sha256(len=10): 188 cycles
 sha256(len=11): 188 cycles
 sha256(len=12): 189 cycles
 sha256(len=13): 189 cycles
 sha256(len=14): 188 cycles
 sha256(len=15): 189 cycles
 sha256(len=16): 189 cycles
 sha256(len=17): 190 cycles
 sha256(len=18): 190 cycles
 sha256(len=19): 190 cycles
 sha256(len=20): 190 cycles
 sha256(len=21): 190 cycles
 sha256(len=22): 190 cycles
 sha256(len=23): 190 cycles
 sha256(len=24): 191 cycles
 sha256(len=25): 191 cycles
 sha256(len=26): 191 cycles
 sha256(len=27): 191 cycles
 sha256(len=28): 191 cycles
 sha256(len=29): 192 cycles
 sha256(len=30): 191 cycles
 sha256(len=31): 191 cycles
 sha256(len=32): 191 cycles
 sha256(len=33): 191 cycles
 sha256(len=34): 191 cycles
 sha256(len=35): 191 cycles
 sha256(len=36): 192 cycles
 sha256(len=37): 192 cycles
 sha256(len=38): 192 cycles
 sha256(len=39): 191 cycles
 sha256(len=40): 191 cycles
 sha256(len=41): 194 cycles
 sha256(len=42): 193 cycles
 sha256(len=43): 193 cycles
 sha256(len=44): 193 cycles
 sha256(len=45): 193 cycles
 sha256(len=46): 194 cycles
 sha256(len=47): 194 cycles
 sha256(len=48): 193 cycles
 sha256(len=49): 195 cycles
 sha256(len=50): 195 cycles
 sha256(len=51): 196 cycles
 sha256(len=52): 196 cycles
 sha256(len=53): 195 cycles
 sha256(len=54): 195 cycles
 sha256(len=55): 195 cycles
 sha256(len=56): 297 cycles
 sha256(len=57): 297 cycles
 sha256(len=58): 297 cycles
 sha256(len=59): 297 cycles
 sha256(len=60): 297 cycles
 sha256(len=61): 297 cycles
 sha256(len=62): 297 cycles
 sha256(len=63): 297 cycles
 sha256(len=64): 292 cycles
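
(For reference, per-length cycle counts like these can be collected with a
loop along the following lines.  This is only a hypothetical sketch -- using
the lib/crypto sha256() one-shot helper, get_cycles(), and a minimum-over-
many-runs measurement -- and not necessarily the exact harness that produced
the numbers above.)

#include <linux/kernel.h>
#include <linux/minmax.h>
#include <linux/timex.h>
#include <crypto/sha2.h>

static void bench_sha256_lengths(void)
{
        static const u8 data[SHA256_BLOCK_SIZE]; /* zero-filled input is fine for timing */
        u8 digest[SHA256_DIGEST_SIZE];
        size_t len;

        for (len = 0; len <= SHA256_BLOCK_SIZE; len++) {
                cycles_t best = ~(cycles_t)0;
                int i;

                /* Take the minimum over many runs to filter out noise. */
                for (i = 0; i < 10000; i++) {
                        cycles_t t0 = get_cycles();

                        sha256(data, len, digest);
                        best = min(best, get_cycles() - t0);
                }
                pr_info("sha256(len=%zu): %llu cycles\n",
                        len, (unsigned long long)best);
        }
}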

So your "optimization" made it ~43 cycles slower for len % 64 < 56, or ~19
cycles slower for len % 64 >= 56.

As I said, the slowdown comes from the overhead of unnecessarily copying the
data onto the stack and then having to zeroize it at the end.
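
Roughly, the two finalization strategies compare as follows.  This is just a
simplified illustrative sketch -- not the actual code from either version of
the kernel -- and the context layout, put_be64(), and sha256_blocks() are
stand-ins for the real helpers:

#include <stdint.h>
#include <string.h>

#define SHA256_BLOCK_SIZE 64

struct sha256_ctx {
        uint32_t h[8];
        uint64_t count;                         /* total bytes hashed */
        uint8_t buf[SHA256_BLOCK_SIZE];         /* unprocessed partial block */
};

/* Stand-in for the arch-optimized compression routine; rounds elided. */
static void sha256_blocks(uint32_t h[8], const uint8_t *data, size_t nblocks)
{
        (void)h; (void)data; (void)nblocks;
}

static void put_be64(uint8_t *p, uint64_t v)
{
        for (int i = 0; i < 8; i++)
                p[i] = v >> (56 - 8 * i);
}

/* Old approach: pad in place in the existing partial-block buffer, doing
 * either one or two single-block calls depending on whether the 8-byte
 * length field still fits. */
static void finup_single_block(struct sha256_ctx *ctx)
{
        size_t partial = ctx->count % SHA256_BLOCK_SIZE;

        ctx->buf[partial++] = 0x80;
        if (partial > SHA256_BLOCK_SIZE - 8) {
                /* Length field doesn't fit: emit an extra padding block. */
                memset(&ctx->buf[partial], 0, SHA256_BLOCK_SIZE - partial);
                sha256_blocks(ctx->h, ctx->buf, 1);
                partial = 0;
        }
        memset(&ctx->buf[partial], 0, SHA256_BLOCK_SIZE - 8 - partial);
        put_be64(&ctx->buf[SHA256_BLOCK_SIZE - 8], ctx->count << 3);
        sha256_blocks(ctx->h, ctx->buf, 1);
}

/* New approach being criticized: always build the padding in a 128-byte
 * stack buffer so that the count % 64 >= 56 case becomes one nblocks=2
 * call, at the cost of an extra copy plus an extra zeroization on every
 * call, including the common count % 64 < 56 case. */
static void finup_two_block(struct sha256_ctx *ctx)
{
        uint8_t block[2 * SHA256_BLOCK_SIZE] = {0};
        size_t partial = ctx->count % SHA256_BLOCK_SIZE;
        size_t nblocks = (partial >= SHA256_BLOCK_SIZE - 8) ? 2 : 1;
        size_t len = nblocks * SHA256_BLOCK_SIZE;

        memcpy(block, ctx->buf, partial);       /* the extra copy */
        block[partial] = 0x80;
        put_be64(&block[len - 8], ctx->count << 3);
        sha256_blocks(ctx->h, block, nblocks);
        memset(block, 0, sizeof(block));        /* the extra zeroization
                                                   (memzero_explicit() in the
                                                   kernel) */
}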

- Eric

