* [PATCH] lib/crypto: blake2b: Roll up BLAKE2b round loop on 32-bit
@ 2025-12-03 19:06 Eric Biggers
2025-12-04 9:05 ` Ard Biesheuvel
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Eric Biggers @ 2025-12-03 19:06 UTC (permalink / raw)
To: linux-crypto
Cc: linux-kernel, Ard Biesheuvel, Jason A . Donenfeld, Herbert Xu,
Thorsten Blum, Nathan Chancellor, Nick Desaulniers, Bill Wendling,
Justin Stitt, David Laight, David Sterba, llvm, linux-btrfs,
Eric Biggers
BLAKE2b has a state of 16 64-bit words. Add the message data in and
there are 32 64-bit words. With the current code where all the rounds
are unrolled to enable constant-folding of the blake2b_sigma values,
this results in a very large code size on 32-bit kernels, including a
recurring issue where gcc uses a large amount of stack.
There's just not much benefit to this unrolling when the code is already
so large. Let's roll up the rounds when !CONFIG_64BIT. Then, remove
the now-unnecessary override of the stack frame size warning.
Code size improvements for blake2b_compress_generic():

                   Size before (bytes)   Size after (bytes)
                   -------------------   ------------------
      i386, gcc           27584                 3632
      i386, clang         18208                 3248
      arm32, gcc          19912                 2860
      arm32, clang        21336                 3344

Running the BLAKE2b benchmark on a !CONFIG_64BIT kernel on an x86_64
processor shows a 16384-byte throughput change of 351 MB/s => 340 MB/s
(gcc) or 442 MB/s => 375 MB/s (clang), so there is not much of a
slowdown either.  Note also that the microbenchmark effectively
disregards cache usage, which matters in practice and is far better
with the smaller code.

Note: If we rolled up the loop on x86_64 too, the change would be
7024 bytes => 1584 bytes and 1960 MB/s => 1396 MB/s (gcc), or
6848 bytes => 1696 bytes and 1920 MB/s => 1263 MB/s (clang).
Maybe still worth it, though not quite as clearly beneficial.
Fixes: 91d689337fe8 ("crypto: blake2b - add blake2b generic implementation")
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
---
lib/crypto/Makefile | 1 -
lib/crypto/blake2b.c | 25 +++++++++++++------------
2 files changed, 13 insertions(+), 13 deletions(-)
diff --git a/lib/crypto/Makefile b/lib/crypto/Makefile
index b5346cebbb55..330ab65b29c4 100644
--- a/lib/crypto/Makefile
+++ b/lib/crypto/Makefile
@@ -31,11 +31,10 @@ obj-$(CONFIG_CRYPTO_LIB_GF128MUL) += gf128mul.o
################################################################################
obj-$(CONFIG_CRYPTO_LIB_BLAKE2B) += libblake2b.o
libblake2b-y := blake2b.o
-CFLAGS_blake2b.o := -Wframe-larger-than=4096 # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930
ifeq ($(CONFIG_CRYPTO_LIB_BLAKE2B_ARCH),y)
CFLAGS_blake2b.o += -I$(src)/$(SRCARCH)
libblake2b-$(CONFIG_ARM) += arm/blake2b-neon-core.o
endif # CONFIG_CRYPTO_LIB_BLAKE2B_ARCH
diff --git a/lib/crypto/blake2b.c b/lib/crypto/blake2b.c
index 09c6d65d8a6e..923cda5c7603 100644
--- a/lib/crypto/blake2b.c
+++ b/lib/crypto/blake2b.c
@@ -12,10 +12,11 @@
#include <linux/bug.h>
#include <linux/export.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/string.h>
+#include <linux/unroll.h>
#include <linux/types.h>
static const u8 blake2b_sigma[12][16] = {
{ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
{ 14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3 },
@@ -81,22 +82,22 @@ blake2b_compress_generic(struct blake2b_ctx *ctx,
G(r, 4, v[0], v[ 5], v[10], v[15]); \
G(r, 5, v[1], v[ 6], v[11], v[12]); \
G(r, 6, v[2], v[ 7], v[ 8], v[13]); \
G(r, 7, v[3], v[ 4], v[ 9], v[14]); \
} while (0)
- ROUND(0);
- ROUND(1);
- ROUND(2);
- ROUND(3);
- ROUND(4);
- ROUND(5);
- ROUND(6);
- ROUND(7);
- ROUND(8);
- ROUND(9);
- ROUND(10);
- ROUND(11);
+
+#ifdef CONFIG_64BIT
+ /*
+ * Unroll the rounds loop to enable constant-folding of the
+ * blake2b_sigma values. Seems worthwhile on 64-bit kernels.
+ * Not worthwhile on 32-bit kernels because the code size is
+ * already so large there due to BLAKE2b using 64-bit words.
+ */
+ unrolled_full
+#endif
+ for (int r = 0; r < 12; r++)
+ ROUND(r);
#undef G
#undef ROUND
for (i = 0; i < 8; ++i)
base-commit: 3f9f0252130e7dd60d41be0802bf58f6471c691d
--
2.52.0
* Re: [PATCH] lib/crypto: blake2b: Roll up BLAKE2b round loop on 32-bit
2025-12-03 19:06 [PATCH] lib/crypto: blake2b: Roll up BLAKE2b round loop on 32-bit Eric Biggers
@ 2025-12-04 9:05 ` Ard Biesheuvel
2025-12-04 17:56 ` Jason A. Donenfeld
2025-12-05 14:16 ` david laight
2 siblings, 0 replies; 7+ messages in thread
From: Ard Biesheuvel @ 2025-12-04 9:05 UTC (permalink / raw)
To: Eric Biggers
Cc: linux-crypto, linux-kernel, Jason A . Donenfeld, Herbert Xu,
Thorsten Blum, Nathan Chancellor, Nick Desaulniers, Bill Wendling,
Justin Stitt, David Laight, David Sterba, llvm, linux-btrfs
On Wed, 3 Dec 2025 at 20:09, Eric Biggers <ebiggers@kernel.org> wrote:
>
> BLAKE2b has a state of 16 64-bit words. Add the message data in and
> there are 32 64-bit words. With the current code where all the rounds
> are unrolled to enable constant-folding of the blake2b_sigma values,
> this results in a very large code size on 32-bit kernels, including a
> recurring issue where gcc uses a large amount of stack.
>
> There's just not much benefit to this unrolling when the code is already
> so large. Let's roll up the rounds when !CONFIG_64BIT. Then, remove
> the now-unnecessary override of the stack frame size warning.
>
> Code size improvements for blake2b_compress_generic():
>
>                    Size before (bytes)   Size after (bytes)
>                    -------------------   ------------------
>       i386, gcc           27584                 3632
>       i386, clang         18208                 3248
>       arm32, gcc          19912                 2860
>       arm32, clang        21336                 3344
>
> Running the BLAKE2b benchmark on a !CONFIG_64BIT kernel on an x86_64
> processor shows a 16384-byte throughput change of 351 MB/s => 340 MB/s
> (gcc) or 442 MB/s => 375 MB/s (clang), so there is not much of a
> slowdown either.  Note also that the microbenchmark effectively
> disregards cache usage, which matters in practice and is far better
> with the smaller code.
>
> Note: If we rolled up the loop on x86_64 too, the change would be
> 7024 bytes => 1584 bytes and 1960 MB/s => 1396 MB/s (gcc), or
> 6848 bytes => 1696 bytes and 1920 MB/s => 1263 MB/s (clang).
> Maybe still worth it, though not quite as clearly beneficial.
>
> Fixes: 91d689337fe8 ("crypto: blake2b - add blake2b generic implementation")
> Signed-off-by: Eric Biggers <ebiggers@kernel.org>
> ---
> lib/crypto/Makefile | 1 -
> lib/crypto/blake2b.c | 25 +++++++++++++------------
> 2 files changed, 13 insertions(+), 13 deletions(-)
>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
* Re: [PATCH] lib/crypto: blake2b: Roll up BLAKE2b round loop on 32-bit
2025-12-03 19:06 [PATCH] lib/crypto: blake2b: Roll up BLAKE2b round loop on 32-bit Eric Biggers
2025-12-04 9:05 ` Ard Biesheuvel
@ 2025-12-04 17:56 ` Jason A. Donenfeld
2025-12-05 4:58 ` Eric Biggers
2025-12-05 14:16 ` david laight
2 siblings, 1 reply; 7+ messages in thread
From: Jason A. Donenfeld @ 2025-12-04 17:56 UTC (permalink / raw)
To: Eric Biggers
Cc: linux-crypto, linux-kernel, Ard Biesheuvel, Herbert Xu,
Thorsten Blum, Nathan Chancellor, Nick Desaulniers, Bill Wendling,
Justin Stitt, David Laight, David Sterba, llvm, linux-btrfs
On Wed, Dec 03, 2025 at 11:06:52AM -0800, Eric Biggers wrote:
> G(r, 4, v[0], v[ 5], v[10], v[15]); \
> G(r, 5, v[1], v[ 6], v[11], v[12]); \
> G(r, 6, v[2], v[ 7], v[ 8], v[13]); \
> G(r, 7, v[3], v[ 4], v[ 9], v[14]); \
> } while (0)
> - ROUND(0);
> - ROUND(1);
> - ROUND(2);
> - ROUND(3);
> - ROUND(4);
> - ROUND(5);
> - ROUND(6);
> - ROUND(7);
> - ROUND(8);
> - ROUND(9);
> - ROUND(10);
> - ROUND(11);
> +
> +#ifdef CONFIG_64BIT
> + /*
> + * Unroll the rounds loop to enable constant-folding of the
> + * blake2b_sigma values. Seems worthwhile on 64-bit kernels.
> + * Not worthwhile on 32-bit kernels because the code size is
> + * already so large there due to BLAKE2b using 64-bit words.
> + */
> + unrolled_full
> +#endif
> + for (int r = 0; r < 12; r++)
> + ROUND(r);
>
> #undef G
> #undef ROUND
Since you're now using `unrolled_full`, ROUND doesn't need to be a macro
anymore. You can just do:

	unrolled_full
	for (int r = 0; r < 12; r++) {
		G(r, 0, v[0], v[ 4], v[ 8], v[12]);
		G(r, 1, v[1], v[ 5], v[ 9], v[13]);
		G(r, 2, v[2], v[ 6], v[10], v[14]);
		G(r, 3, v[3], v[ 7], v[11], v[15]);
		G(r, 4, v[0], v[ 5], v[10], v[15]);
		G(r, 5, v[1], v[ 6], v[11], v[12]);
		G(r, 6, v[2], v[ 7], v[ 8], v[13]);
		G(r, 7, v[3], v[ 4], v[ 9], v[14]);
	}

Likewise, you can simplify the blake2s implementation in the same way
(but don't make the unrolled_full conditional there, obviously).
`unrolled_full` seems like a nice way of doing this compared to macros.
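
For blake2s that would presumably end up as (untested sketch, assuming
the same G() calling convention and its 10 rounds):

	unrolled_full
	for (int r = 0; r < 10; r++) {
		G(r, 0, v[0], v[ 4], v[ 8], v[12]);
		G(r, 1, v[1], v[ 5], v[ 9], v[13]);
		G(r, 2, v[2], v[ 6], v[10], v[14]);
		G(r, 3, v[3], v[ 7], v[11], v[15]);
		G(r, 4, v[0], v[ 5], v[10], v[15]);
		G(r, 5, v[1], v[ 6], v[11], v[12]);
		G(r, 6, v[2], v[ 7], v[ 8], v[13]);
		G(r, 7, v[3], v[ 4], v[ 9], v[14]);
	}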
Jason
* Re: [PATCH] lib/crypto: blake2b: Roll up BLAKE2b round loop on 32-bit
2025-12-04 17:56 ` Jason A. Donenfeld
@ 2025-12-05 4:58 ` Eric Biggers
0 siblings, 0 replies; 7+ messages in thread
From: Eric Biggers @ 2025-12-05 4:58 UTC (permalink / raw)
To: Jason A. Donenfeld
Cc: linux-crypto, linux-kernel, Ard Biesheuvel, Herbert Xu,
Thorsten Blum, Nathan Chancellor, Nick Desaulniers, Bill Wendling,
Justin Stitt, David Laight, David Sterba, llvm, linux-btrfs
On Thu, Dec 04, 2025 at 12:56:32PM -0500, Jason A. Donenfeld wrote:
> On Wed, Dec 03, 2025 at 11:06:52AM -0800, Eric Biggers wrote:
> > G(r, 4, v[0], v[ 5], v[10], v[15]); \
> > G(r, 5, v[1], v[ 6], v[11], v[12]); \
> > G(r, 6, v[2], v[ 7], v[ 8], v[13]); \
> > G(r, 7, v[3], v[ 4], v[ 9], v[14]); \
> > } while (0)
> > - ROUND(0);
> > - ROUND(1);
> > - ROUND(2);
> > - ROUND(3);
> > - ROUND(4);
> > - ROUND(5);
> > - ROUND(6);
> > - ROUND(7);
> > - ROUND(8);
> > - ROUND(9);
> > - ROUND(10);
> > - ROUND(11);
> > +
> > +#ifdef CONFIG_64BIT
> > + /*
> > + * Unroll the rounds loop to enable constant-folding of the
> > + * blake2b_sigma values. Seems worthwhile on 64-bit kernels.
> > + * Not worthwhile on 32-bit kernels because the code size is
> > + * already so large there due to BLAKE2b using 64-bit words.
> > + */
> > + unrolled_full
> > +#endif
> > + for (int r = 0; r < 12; r++)
> > + ROUND(r);
> >
> > #undef G
> > #undef ROUND
>
> Since you're now using `unrolled_full`, ROUND doesn't need to be a macro
> anymore. You can just do:
>
> 	unrolled_full
> 	for (int r = 0; r < 12; r++) {
> 		G(r, 0, v[0], v[ 4], v[ 8], v[12]);
> 		G(r, 1, v[1], v[ 5], v[ 9], v[13]);
> 		G(r, 2, v[2], v[ 6], v[10], v[14]);
> 		G(r, 3, v[3], v[ 7], v[11], v[15]);
> 		G(r, 4, v[0], v[ 5], v[10], v[15]);
> 		G(r, 5, v[1], v[ 6], v[11], v[12]);
> 		G(r, 6, v[2], v[ 7], v[ 8], v[13]);
> 		G(r, 7, v[3], v[ 4], v[ 9], v[14]);
> 	}
>
> Likewise, you can simplify the blake2s implementation in the same way
> (but don't make the unrolled_full conditional there, obviously).
> `unrolled_full` seems like a nice way of doing this compared to macros.
Yes, good idea. I'll do that.
- Eric
* Re: [PATCH] lib/crypto: blake2b: Roll up BLAKE2b round loop on 32-bit
2025-12-03 19:06 [PATCH] lib/crypto: blake2b: Roll up BLAKE2b round loop on 32-bit Eric Biggers
2025-12-04 9:05 ` Ard Biesheuvel
2025-12-04 17:56 ` Jason A. Donenfeld
@ 2025-12-05 14:16 ` david laight
2025-12-05 20:14 ` Eric Biggers
2 siblings, 1 reply; 7+ messages in thread
From: david laight @ 2025-12-05 14:16 UTC (permalink / raw)
To: Eric Biggers
Cc: linux-crypto, linux-kernel, Ard Biesheuvel, Jason A . Donenfeld,
Herbert Xu, Thorsten Blum, Nathan Chancellor, Nick Desaulniers,
Bill Wendling, Justin Stitt, David Sterba, llvm, linux-btrfs
On Wed, 3 Dec 2025 11:06:52 -0800
Eric Biggers <ebiggers@kernel.org> wrote:
> BLAKE2b has a state of 16 64-bit words. Add the message data in and
> there are 32 64-bit words. With the current code where all the rounds
> are unrolled to enable constant-folding of the blake2b_sigma values,
> this results in a very large code size on 32-bit kernels, including a
> recurring issue where gcc uses a large amount of stack.
>
> There's just not much benefit to this unrolling when the code is already
> so large. Let's roll up the rounds when !CONFIG_64BIT. Then, remove
> the now-unnecessary override of the stack frame size warning.
>
> Code size improvements for blake2b_compress_generic():
>
>                    Size before (bytes)   Size after (bytes)
>                    -------------------   ------------------
>       i386, gcc           27584                 3632
>       i386, clang         18208                 3248
>       arm32, gcc          19912                 2860
>       arm32, clang        21336                 3344
>
> Running the BLAKE2b benchmark on a !CONFIG_64BIT kernel on an x86_64
> processor shows a 16384-byte throughput change of 351 MB/s => 340 MB/s
> (gcc) or 442 MB/s => 375 MB/s (clang), so there is not much of a
> slowdown either.  Note also that the microbenchmark effectively
> disregards cache usage, which matters in practice and is far better
> with the smaller code.

Any idea how many clocks those are for each G() ?
That number would give an idea of the actual 'quality' of the code.
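(For reference, G() here is the standard BLAKE2b mixing function, as in
lib/crypto/blake2b.c and RFC 7693, roughly:

	#define G(r, i, a, b, c, d) do {                        \
		a = a + b + m[blake2b_sigma[r][2 * i + 0]];     \
		d = ror64(d ^ a, 32);                           \
		c = c + d;                                      \
		b = ror64(b ^ c, 24);                           \
		a = a + b + m[blake2b_sigma[r][2 * i + 1]];     \
		d = ror64(d ^ a, 16);                           \
		c = c + d;                                      \
		b = ror64(b ^ c, 63);                           \
	} while (0)

i.e. six 64-bit adds, four xors and four rotates per G(), which is what
the counts below refer to.)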
A quick count shows 14 alu operations with a register dependency
chain length of 12.
So however hard you try G() will take 12 clocks (on 64bit) provided
all the instructions have no extra result latency (probably true).
That means there is plenty of time for two memory reads for each of the
m[*b2b_sigma++] accesses (including the increment).
On x86-64 there aren't enough registers to hold all of v[], so there also
need to be another 4 reads and writes for each G().
Total 8 memory reads, 4 memory writes and 12 alu clocks - shouldn't be
too hard to get that to run in 12 clocks.
Because of the long register dependency chain there will be gains from
running two G() in parallel.
I don't think you'll get two to run in 12 clocks on x86-64.
While two reads and a write can be done in each clock on later cpu it
definitely needs a 'following wind'.
Arm-64 is another matter, that should be able to hold all of v[] in
registers throughout.
(Although the memcpy(v, ctx->h, 64) probably needs replacing with 8
separate assignments.)
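i.e. something like (illustrative only):

	v[0] = ctx->h[0]; v[1] = ctx->h[1];
	v[2] = ctx->h[2]; v[3] = ctx->h[3];
	v[4] = ctx->h[4]; v[5] = ctx->h[5];
	v[6] = ctx->h[6]; v[7] = ctx->h[7];

so the compiler is free to keep v[0..7] in registers rather than going
through a stack copy.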
Note that executing two G() in parallel probably requires the source
interleave the instructions for the two G() rather than relying on the
cpu's 'out of order execution' to do all the work
(Intel cpu might manage it...).
While all that is a lot of changes to get right, I suspect that just:

	const u8 *b2b_sigma = blake2b_sigma[0];

	#define G(a, b, c, d) \
		a += b + m[b2b_sigma[0]];
		...
		a += b + m[b2b_sigma[1]];
		b2b_sigma += 2;
		...

will remove almost all the benefit from unrolling the 'ROUND' loop,
especially since the loop termination condition can use b2b_sigma.
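Fleshing that out a little (purely illustrative, untested), the rounds
could become something like:

	const u8 *b2b_sigma = blake2b_sigma[0];
	/* end pointer doubles as the loop bound */
	const u8 *b2b_sigma_end = b2b_sigma + sizeof(blake2b_sigma);

	/* no per-round constants needed: just walk b2b_sigma */
	#define G(a, b, c, d)                           \
		do {                                    \
			a += b + m[b2b_sigma[0]];       \
			d = ror64(d ^ a, 32);           \
			c += d;                         \
			b = ror64(b ^ c, 24);           \
			a += b + m[b2b_sigma[1]];       \
			d = ror64(d ^ a, 16);           \
			c += d;                         \
			b = ror64(b ^ c, 63);           \
			b2b_sigma += 2;                 \
		} while (0)

	do {
		G(v[0], v[ 4], v[ 8], v[12]);
		G(v[1], v[ 5], v[ 9], v[13]);
		G(v[2], v[ 6], v[10], v[14]);
		G(v[3], v[ 7], v[11], v[15]);
		G(v[0], v[ 5], v[10], v[15]);
		G(v[1], v[ 6], v[11], v[12]);
		G(v[2], v[ 7], v[ 8], v[13]);
		G(v[3], v[ 4], v[ 9], v[14]);
	} while (b2b_sigma != b2b_sigma_end);

i.e. the sigma walk replaces the constant-folding that the full unroll
was buying.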
Everything except x86 will also benefit from multiplying all of
blake2b_sigma by 8 and doing *(u64 *)((u8 *)m + b2b_sigma[0]) for
the accesses.
32bit is another matter entirely.
I think gcc can handle u64 as either a pair of 32bit values or as
a register-pair (so pairs of simode ops, or single dimode ones).
The code seems better if you stop it doing the latter, but breathe
on something that joins up the two parts and you are stuck with it.
I can't help thinking that swapping the order of the bit-pairs in the
index to v[] would make it easier to write ROUND() as a real loop.
The code size (and I-cache) reduction might make up for any losses,
especially since the code is really memory limited (because of the
accesses to v[]) rather than alu limited - so a few more alu
instructions may make little difference.
David
* Re: [PATCH] lib/crypto: blake2b: Roll up BLAKE2b round loop on 32-bit
2025-12-05 14:16 ` david laight
@ 2025-12-05 20:14 ` Eric Biggers
2025-12-05 22:04 ` david laight
0 siblings, 1 reply; 7+ messages in thread
From: Eric Biggers @ 2025-12-05 20:14 UTC (permalink / raw)
To: david laight
Cc: linux-crypto, linux-kernel, Ard Biesheuvel, Jason A . Donenfeld,
Herbert Xu, Thorsten Blum, Nathan Chancellor, Nick Desaulniers,
Bill Wendling, Justin Stitt, David Sterba, llvm, linux-btrfs
On Fri, Dec 05, 2025 at 02:16:44PM +0000, david laight wrote:
> Note that executing two G() in parallel probably requires the source
> interleave the instructions for the two G() rather than relying on the
> cpu's 'out of order execution' to do all the work
> (Intel cpu might manage it...).
I actually tried that earlier, and it didn't help. Either the compiler
interleaved the calculations already, or the CPU did, or both.
It definitely could use some more investigation to better understand
exactly what is going on, though.
You're welcome to take a closer look, if you're interested.
- Eric
* Re: [PATCH] lib/crypto: blake2b: Roll up BLAKE2b round loop on 32-bit
2025-12-05 20:14 ` Eric Biggers
@ 2025-12-05 22:04 ` david laight
0 siblings, 0 replies; 7+ messages in thread
From: david laight @ 2025-12-05 22:04 UTC (permalink / raw)
To: Eric Biggers
Cc: linux-crypto, linux-kernel, Ard Biesheuvel, Jason A . Donenfeld,
Herbert Xu, Thorsten Blum, Nathan Chancellor, Nick Desaulniers,
Bill Wendling, Justin Stitt, David Sterba, llvm, linux-btrfs
On Fri, 5 Dec 2025 12:14:11 -0800
Eric Biggers <ebiggers@kernel.org> wrote:
> On Fri, Dec 05, 2025 at 02:16:44PM +0000, david laight wrote:
> > Note that executing two G() in parallel probably requires the source
> > interleave the instructions for the two G() rather than relying on the
> > cpu's 'out of order execution' to do all the work
> > (Intel cpu might manage it...).
>
> I actually tried that earlier, and it didn't help. Either the compiler
> interleaved the calculations already, or the CPU did, or both.
Or they are never interleaved and doing that didn't help.
> It definitely could use some more investigation to better understand
> exactly what is going on, though.
>
> You're welcome to take a closer look, if you're interested.
I might try calling the code from my 'clock counting' wrapper.
That is good enough to see the data dependency of a 'div' instruction
and the effect of branch misses due to taking a different path from
the previous call (about 20 clocks on a zen-5).
Did you notice that 'u64 t[2];' is a 128bit 'byte counter' for the
buffer being processed!
I doubt 32bit needs that many bits :-)
David
>
> - Eric
>