[PATCH 0/6] x86 BLAKE2s cleanups

linux-crypto.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/6] x86 BLAKE2s cleanups
@ 2025-11-02 23:42 Eric Biggers
  2025-11-02 23:42 ` [PATCH 1/6] lib/crypto: x86/blake2s: Fix 32-bit arg treated as 64-bit Eric Biggers
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: Eric Biggers @ 2025-11-02 23:42 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-kernel, Ard Biesheuvel, Jason A . Donenfeld, Herbert Xu,
	x86, Samuel Neves, Eric Biggers

Various cleanups for the SSSE3 and AVX512 implementations of BLAKE2s.

This is targeting libcrypto-next.

Eric Biggers (6):
  lib/crypto: x86/blake2s: Fix 32-bit arg treated as 64-bit
  lib/crypto: x86/blake2s: Drop check for nblocks == 0
  lib/crypto: x86/blake2s: Use local labels for data
  lib/crypto: x86/blake2s: Improve readability
  lib/crypto: x86/blake2s: Avoid writing back unchanged 'f' value
  lib/crypto: x86/blake2s: Use vpternlogd for 3-input XORs

 lib/crypto/x86/blake2s-core.S | 275 +++++++++++++++++++---------------
 1 file changed, 157 insertions(+), 118 deletions(-)

base-commit: 5a2a5e62a5216ba05d4481cf90d915f4de0bfde9
-- 
2.51.2


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/6] lib/crypto: x86/blake2s: Fix 32-bit arg treated as 64-bit
  2025-11-02 23:42 [PATCH 0/6] x86 BLAKE2s cleanups Eric Biggers
@ 2025-11-02 23:42 ` Eric Biggers
  2025-11-02 23:42 ` [PATCH 2/6] lib/crypto: x86/blake2s: Drop check for nblocks == 0 Eric Biggers
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Eric Biggers @ 2025-11-02 23:42 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-kernel, Ard Biesheuvel, Jason A . Donenfeld, Herbert Xu,
	x86, Samuel Neves, Eric Biggers, stable

In the C code, the 'inc' argument to the assembly functions
blake2s_compress_ssse3() and blake2s_compress_avx512() is declared with
type u32, matching blake2s_compress().  The assembly code then reads it
from the 64-bit %rcx.  However, the ABI doesn't guarantee zero-extension
to 64 bits, nor do gcc or clang guarantee it.  Therefore, fix these
functions to read this argument from the 32-bit %ecx.

In theory, this bug could have caused the wrong 'inc' value to be used,
causing incorrect BLAKE2s hashes.  In practice, probably not: I've fixed
essentially this same bug in many other assembly files too, but there's
never been a real report of it having caused a problem.  In x86_64, all
writes to 32-bit registers are zero-extended to 64 bits.  That results
in zero-extension in nearly all situations.  I've only been able to
demonstrate a lack of zero-extension with a somewhat contrived example
involving truncation, e.g. when the C code has a u64 variable holding
0x1234567800000040 and passes it as a u32 expecting it to be truncated
to 0x40 (64).  But that's not what the real code does, of course.

Fixes: ed0356eda153 ("crypto: blake2s - x86_64 SIMD implementation")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
---
 lib/crypto/x86/blake2s-core.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/crypto/x86/blake2s-core.S b/lib/crypto/x86/blake2s-core.S
index ef8e9f427aab..093e7814f387 100644
--- a/lib/crypto/x86/blake2s-core.S
+++ b/lib/crypto/x86/blake2s-core.S
@@ -50,11 +50,11 @@ SYM_FUNC_START(blake2s_compress_ssse3)
 	movdqu		(%rdi),%xmm0
 	movdqu		0x10(%rdi),%xmm1
 	movdqa		ROT16(%rip),%xmm12
 	movdqa		ROR328(%rip),%xmm13
 	movdqu		0x20(%rdi),%xmm14
-	movq		%rcx,%xmm15
+	movd		%ecx,%xmm15
 	leaq		SIGMA+0xa0(%rip),%r8
 	jmp		.Lbeginofloop
 	.align		32
 .Lbeginofloop:
 	movdqa		%xmm0,%xmm10
@@ -174,11 +174,11 @@ SYM_FUNC_END(blake2s_compress_ssse3)

 SYM_FUNC_START(blake2s_compress_avx512)
 	vmovdqu		(%rdi),%xmm0
 	vmovdqu		0x10(%rdi),%xmm1
 	vmovdqu		0x20(%rdi),%xmm4
-	vmovq		%rcx,%xmm5
+	vmovd		%ecx,%xmm5
 	vmovdqa		IV(%rip),%xmm14
 	vmovdqa		IV+16(%rip),%xmm15
 	jmp		.Lblake2s_compress_avx512_mainloop
 .align 32
 .Lblake2s_compress_avx512_mainloop:
-- 
2.51.2

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/6] lib/crypto: x86/blake2s: Drop check for nblocks == 0
  2025-11-02 23:42 [PATCH 0/6] x86 BLAKE2s cleanups Eric Biggers
  2025-11-02 23:42 ` [PATCH 1/6] lib/crypto: x86/blake2s: Fix 32-bit arg treated as 64-bit Eric Biggers
@ 2025-11-02 23:42 ` Eric Biggers
  2025-11-02 23:42 ` [PATCH 3/6] lib/crypto: x86/blake2s: Use local labels for data Eric Biggers
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Eric Biggers @ 2025-11-02 23:42 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-kernel, Ard Biesheuvel, Jason A . Donenfeld, Herbert Xu,
	x86, Samuel Neves, Eric Biggers

Since blake2s_compress() is always passed nblocks != 0, remove the
unnecessary check for nblocks == 0 from blake2s_compress_ssse3().

Note that this makes it consistent with blake2s_compress_avx512() in the
same file as well as the arm32 blake2s_compress().

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
---
 lib/crypto/x86/blake2s-core.S | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/lib/crypto/x86/blake2s-core.S b/lib/crypto/x86/blake2s-core.S
index 093e7814f387..aee13b97cc34 100644
--- a/lib/crypto/x86/blake2s-core.S
+++ b/lib/crypto/x86/blake2s-core.S
@@ -43,12 +43,10 @@ SIGMA2:
 .byte 15,  5,  4, 13, 10,  7,  3, 11, 12,  2,  0,  6,  9,  8,  1, 14
 .byte  8,  7, 14, 11, 13, 15,  0, 12, 10,  4,  5,  6,  3,  2,  1,  9
 
 .text
 SYM_FUNC_START(blake2s_compress_ssse3)
-	testq		%rdx,%rdx
-	je		.Lendofloop
 	movdqu		(%rdi),%xmm0
 	movdqu		0x10(%rdi),%xmm1
 	movdqa		ROT16(%rip),%xmm12
 	movdqa		ROR328(%rip),%xmm13
 	movdqu		0x20(%rdi),%xmm14
@@ -166,11 +164,10 @@ SYM_FUNC_START(blake2s_compress_ssse3)
 	decq		%rdx
 	jnz		.Lbeginofloop
 	movdqu		%xmm0,(%rdi)
 	movdqu		%xmm1,0x10(%rdi)
 	movdqu		%xmm14,0x20(%rdi)
-.Lendofloop:
 	RET
 SYM_FUNC_END(blake2s_compress_ssse3)
 
 SYM_FUNC_START(blake2s_compress_avx512)
 	vmovdqu		(%rdi),%xmm0
-- 
2.51.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 3/6] lib/crypto: x86/blake2s: Use local labels for data
  2025-11-02 23:42 [PATCH 0/6] x86 BLAKE2s cleanups Eric Biggers
  2025-11-02 23:42 ` [PATCH 1/6] lib/crypto: x86/blake2s: Fix 32-bit arg treated as 64-bit Eric Biggers
  2025-11-02 23:42 ` [PATCH 2/6] lib/crypto: x86/blake2s: Drop check for nblocks == 0 Eric Biggers
@ 2025-11-02 23:42 ` Eric Biggers
  2025-11-02 23:42 ` [PATCH 4/6] lib/crypto: x86/blake2s: Improve readability Eric Biggers
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Eric Biggers @ 2025-11-02 23:42 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-kernel, Ard Biesheuvel, Jason A . Donenfeld, Herbert Xu,
	x86, Samuel Neves, Eric Biggers

Following the usual practice, prefix the names of the data labels with
".L" so that the assembler treats them as truly local.  This more
clearly expresses the intent and is less error-prone.

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
---
 lib/crypto/x86/blake2s-core.S | 45 ++++++++++++++++++++---------------
 1 file changed, 26 insertions(+), 19 deletions(-)

diff --git a/lib/crypto/x86/blake2s-core.S b/lib/crypto/x86/blake2s-core.S
index aee13b97cc34..14e487559c09 100644
--- a/lib/crypto/x86/blake2s-core.S
+++ b/lib/crypto/x86/blake2s-core.S
@@ -4,36 +4,43 @@
  * Copyright (C) 2017-2019 Samuel Neves <sneves@dei.uc.pt>. All Rights Reserved.
  */
 
 #include <linux/linkage.h>
 
-.section .rodata.cst32.BLAKE2S_IV, "aM", @progbits, 32
+.section .rodata.cst32.iv, "aM", @progbits, 32
 .align 32
-IV:	.octa 0xA54FF53A3C6EF372BB67AE856A09E667
+.Liv:
+	.octa 0xA54FF53A3C6EF372BB67AE856A09E667
 	.octa 0x5BE0CD191F83D9AB9B05688C510E527F
-.section .rodata.cst16.ROT16, "aM", @progbits, 16
+
+.section .rodata.cst16.ror16, "aM", @progbits, 16
 .align 16
-ROT16:	.octa 0x0D0C0F0E09080B0A0504070601000302
-.section .rodata.cst16.ROR328, "aM", @progbits, 16
+.Lror16:
+	.octa 0x0D0C0F0E09080B0A0504070601000302
+
+.section .rodata.cst16.ror8, "aM", @progbits, 16
 .align 16
-ROR328:	.octa 0x0C0F0E0D080B0A090407060500030201
-.section .rodata.cst64.BLAKE2S_SIGMA, "aM", @progbits, 160
+.Lror8:
+	.octa 0x0C0F0E0D080B0A090407060500030201
+
+.section .rodata.cst64.sigma, "aM", @progbits, 160
 .align 64
-SIGMA:
+.Lsigma:
 .byte  0,  2,  4,  6,  1,  3,  5,  7, 14,  8, 10, 12, 15,  9, 11, 13
 .byte 14,  4,  9, 13, 10,  8, 15,  6,  5,  1,  0, 11,  3, 12,  2,  7
 .byte 11, 12,  5, 15,  8,  0,  2, 13,  9, 10,  3,  7,  4, 14,  6,  1
 .byte  7,  3, 13, 11,  9,  1, 12, 14, 15,  2,  5,  4,  8,  6, 10,  0
 .byte  9,  5,  2, 10,  0,  7,  4, 15,  3, 14, 11,  6, 13,  1, 12,  8
 .byte  2,  6,  0,  8, 12, 10, 11,  3,  1,  4,  7, 15,  9, 13,  5, 14
 .byte 12,  1, 14,  4,  5, 15, 13, 10,  8,  0,  6,  9, 11,  7,  3,  2
 .byte 13,  7, 12,  3, 11, 14,  1,  9,  2,  5, 15,  8, 10,  0,  4,  6
 .byte  6, 14, 11,  0, 15,  9,  3,  8, 10, 12, 13,  1,  5,  2,  7,  4
 .byte 10,  8,  7,  1,  2,  4,  6,  5, 13, 15,  9,  3,  0, 11, 14, 12
-.section .rodata.cst64.BLAKE2S_SIGMA2, "aM", @progbits, 160
+
+.section .rodata.cst64.sigma2, "aM", @progbits, 160
 .align 64
-SIGMA2:
+.Lsigma2:
 .byte  0,  2,  4,  6,  1,  3,  5,  7, 14,  8, 10, 12, 15,  9, 11, 13
 .byte  8,  2, 13, 15, 10,  9, 12,  3,  6,  4,  0, 14,  5, 11,  1,  7
 .byte 11, 13,  8,  6,  5, 10, 14,  3,  2,  4, 12, 15,  1,  0,  7,  9
 .byte 11, 10,  7,  0,  8, 15,  1, 13,  3,  6,  2, 12,  4, 14,  9,  5
 .byte  4, 10,  9, 14, 15,  0, 11,  8,  1,  7,  3, 13,  2,  5,  6, 12
@@ -45,25 +52,25 @@ SIGMA2:
 
 .text
 SYM_FUNC_START(blake2s_compress_ssse3)
 	movdqu		(%rdi),%xmm0
 	movdqu		0x10(%rdi),%xmm1
-	movdqa		ROT16(%rip),%xmm12
-	movdqa		ROR328(%rip),%xmm13
+	movdqa		.Lror16(%rip),%xmm12
+	movdqa		.Lror8(%rip),%xmm13
 	movdqu		0x20(%rdi),%xmm14
 	movd		%ecx,%xmm15
-	leaq		SIGMA+0xa0(%rip),%r8
+	leaq		.Lsigma+0xa0(%rip),%r8
 	jmp		.Lbeginofloop
 	.align		32
 .Lbeginofloop:
 	movdqa		%xmm0,%xmm10
 	movdqa		%xmm1,%xmm11
 	paddq		%xmm15,%xmm14
-	movdqa		IV(%rip),%xmm2
+	movdqa		.Liv(%rip),%xmm2
 	movdqa		%xmm14,%xmm3
-	pxor		IV+0x10(%rip),%xmm3
-	leaq		SIGMA(%rip),%rcx
+	pxor		.Liv+0x10(%rip),%xmm3
+	leaq		.Lsigma(%rip),%rcx
 .Lroundloop:
 	movzbl		(%rcx),%eax
 	movd		(%rsi,%rax,4),%xmm4
 	movzbl		0x1(%rcx),%eax
 	movd		(%rsi,%rax,4),%xmm5
@@ -172,12 +179,12 @@ SYM_FUNC_END(blake2s_compress_ssse3)
 SYM_FUNC_START(blake2s_compress_avx512)
 	vmovdqu		(%rdi),%xmm0
 	vmovdqu		0x10(%rdi),%xmm1
 	vmovdqu		0x20(%rdi),%xmm4
 	vmovd		%ecx,%xmm5
-	vmovdqa		IV(%rip),%xmm14
-	vmovdqa		IV+16(%rip),%xmm15
+	vmovdqa		.Liv(%rip),%xmm14
+	vmovdqa		.Liv+16(%rip),%xmm15
 	jmp		.Lblake2s_compress_avx512_mainloop
 .align 32
 .Lblake2s_compress_avx512_mainloop:
 	vmovdqa		%xmm0,%xmm10
 	vmovdqa		%xmm1,%xmm11
@@ -185,11 +192,11 @@ SYM_FUNC_START(blake2s_compress_avx512)
 	vmovdqa		%xmm14,%xmm2
 	vpxor		%xmm15,%xmm4,%xmm3
 	vmovdqu		(%rsi),%ymm6
 	vmovdqu		0x20(%rsi),%ymm7
 	addq		$0x40,%rsi
-	leaq		SIGMA2(%rip),%rax
+	leaq		.Lsigma2(%rip),%rax
 	movb		$0xa,%cl
 .Lblake2s_compress_avx512_roundloop:
 	vpmovzxbd	(%rax),%ymm8
 	vpmovzxbd	0x8(%rax),%ymm9
 	addq		$0x10,%rax
-- 
2.51.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 4/6] lib/crypto: x86/blake2s: Improve readability
  2025-11-02 23:42 [PATCH 0/6] x86 BLAKE2s cleanups Eric Biggers
                   ` (2 preceding siblings ...)
  2025-11-02 23:42 ` [PATCH 3/6] lib/crypto: x86/blake2s: Use local labels for data Eric Biggers
@ 2025-11-02 23:42 ` Eric Biggers
  2025-11-02 23:42 ` [PATCH 5/6] lib/crypto: x86/blake2s: Avoid writing back unchanged 'f' value Eric Biggers
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Eric Biggers @ 2025-11-02 23:42 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-kernel, Ard Biesheuvel, Jason A . Donenfeld, Herbert Xu,
	x86, Samuel Neves, Eric Biggers

Various cleanups for readability.  No change to the generated code:

- Add some comments
- Add #defines for arguments
- Rename some labels
- Use decimal constants instead of hex where it makes sense.
  (The pshufd immediates intentionally remain as hex.)
- Add blank lines when there's a logical break

The round loop still could use some work, but this is at least a start.

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
---
 lib/crypto/x86/blake2s-core.S | 231 ++++++++++++++++++++--------------
 1 file changed, 134 insertions(+), 97 deletions(-)

diff --git a/lib/crypto/x86/blake2s-core.S b/lib/crypto/x86/blake2s-core.S
index 14e487559c09..f805a49c590d 100644
--- a/lib/crypto/x86/blake2s-core.S
+++ b/lib/crypto/x86/blake2s-core.S
@@ -48,209 +48,246 @@
 .byte  4,  8, 15,  9, 14, 11, 13,  5,  3,  2,  1, 12,  6, 10,  7,  0
 .byte  6, 13,  0, 14, 12,  2,  1, 11, 15,  4,  5,  8,  7,  9,  3, 10
 .byte 15,  5,  4, 13, 10,  7,  3, 11, 12,  2,  0,  6,  9,  8,  1, 14
 .byte  8,  7, 14, 11, 13, 15,  0, 12, 10,  4,  5,  6,  3,  2,  1,  9
 
+#define CTX		%rdi
+#define DATA		%rsi
+#define NBLOCKS		%rdx
+#define INC		%ecx
+
 .text
+//
+// void blake2s_compress_ssse3(struct blake2s_ctx *ctx,
+//			       const u8 *data, size_t nblocks, u32 inc);
+//
+// Only the first three fields of struct blake2s_ctx are used:
+//	u32 h[8];	(inout)
+//	u32 t[2];	(inout)
+//	u32 f[2];	(in)
+//
 SYM_FUNC_START(blake2s_compress_ssse3)
-	movdqu		(%rdi),%xmm0
-	movdqu		0x10(%rdi),%xmm1
+	movdqu		(CTX),%xmm0		// Load h[0..3]
+	movdqu		16(CTX),%xmm1		// Load h[4..7]
 	movdqa		.Lror16(%rip),%xmm12
 	movdqa		.Lror8(%rip),%xmm13
-	movdqu		0x20(%rdi),%xmm14
-	movd		%ecx,%xmm15
-	leaq		.Lsigma+0xa0(%rip),%r8
-	jmp		.Lbeginofloop
+	movdqu		32(CTX),%xmm14		// Load t and f
+	movd		INC,%xmm15		// Load inc
+	leaq		.Lsigma+160(%rip),%r8
+	jmp		.Lssse3_mainloop
+
 	.align		32
-.Lbeginofloop:
-	movdqa		%xmm0,%xmm10
-	movdqa		%xmm1,%xmm11
-	paddq		%xmm15,%xmm14
-	movdqa		.Liv(%rip),%xmm2
+.Lssse3_mainloop:
+	// Main loop: each iteration processes one 64-byte block.
+	movdqa		%xmm0,%xmm10		// Save h[0..3] and let v[0..3] = h[0..3]
+	movdqa		%xmm1,%xmm11		// Save h[4..7] and let v[4..7] = h[4..7]
+	paddq		%xmm15,%xmm14		// t += inc (64-bit addition)
+	movdqa		.Liv(%rip),%xmm2	// v[8..11] = iv[0..3]
 	movdqa		%xmm14,%xmm3
-	pxor		.Liv+0x10(%rip),%xmm3
+	pxor		.Liv+16(%rip),%xmm3	// v[12..15] = iv[4..7] ^ [t, f]
 	leaq		.Lsigma(%rip),%rcx
-.Lroundloop:
+
+.Lssse3_roundloop:
+	// Round loop: each iteration does 1 round (of 10 rounds total).
 	movzbl		(%rcx),%eax
-	movd		(%rsi,%rax,4),%xmm4
-	movzbl		0x1(%rcx),%eax
-	movd		(%rsi,%rax,4),%xmm5
-	movzbl		0x2(%rcx),%eax
-	movd		(%rsi,%rax,4),%xmm6
-	movzbl		0x3(%rcx),%eax
-	movd		(%rsi,%rax,4),%xmm7
+	movd		(DATA,%rax,4),%xmm4
+	movzbl		1(%rcx),%eax
+	movd		(DATA,%rax,4),%xmm5
+	movzbl		2(%rcx),%eax
+	movd		(DATA,%rax,4),%xmm6
+	movzbl		3(%rcx),%eax
+	movd		(DATA,%rax,4),%xmm7
 	punpckldq	%xmm5,%xmm4
 	punpckldq	%xmm7,%xmm6
 	punpcklqdq	%xmm6,%xmm4
 	paddd		%xmm4,%xmm0
 	paddd		%xmm1,%xmm0
 	pxor		%xmm0,%xmm3
 	pshufb		%xmm12,%xmm3
 	paddd		%xmm3,%xmm2
 	pxor		%xmm2,%xmm1
 	movdqa		%xmm1,%xmm8
-	psrld		$0xc,%xmm1
-	pslld		$0x14,%xmm8
+	psrld		$12,%xmm1
+	pslld		$20,%xmm8
 	por		%xmm8,%xmm1
-	movzbl		0x4(%rcx),%eax
-	movd		(%rsi,%rax,4),%xmm5
-	movzbl		0x5(%rcx),%eax
-	movd		(%rsi,%rax,4),%xmm6
-	movzbl		0x6(%rcx),%eax
-	movd		(%rsi,%rax,4),%xmm7
-	movzbl		0x7(%rcx),%eax
-	movd		(%rsi,%rax,4),%xmm4
+	movzbl		4(%rcx),%eax
+	movd		(DATA,%rax,4),%xmm5
+	movzbl		5(%rcx),%eax
+	movd		(DATA,%rax,4),%xmm6
+	movzbl		6(%rcx),%eax
+	movd		(DATA,%rax,4),%xmm7
+	movzbl		7(%rcx),%eax
+	movd		(DATA,%rax,4),%xmm4
 	punpckldq	%xmm6,%xmm5
 	punpckldq	%xmm4,%xmm7
 	punpcklqdq	%xmm7,%xmm5
 	paddd		%xmm5,%xmm0
 	paddd		%xmm1,%xmm0
 	pxor		%xmm0,%xmm3
 	pshufb		%xmm13,%xmm3
 	paddd		%xmm3,%xmm2
 	pxor		%xmm2,%xmm1
 	movdqa		%xmm1,%xmm8
-	psrld		$0x7,%xmm1
-	pslld		$0x19,%xmm8
+	psrld		$7,%xmm1
+	pslld		$25,%xmm8
 	por		%xmm8,%xmm1
 	pshufd		$0x93,%xmm0,%xmm0
 	pshufd		$0x4e,%xmm3,%xmm3
 	pshufd		$0x39,%xmm2,%xmm2
-	movzbl		0x8(%rcx),%eax
-	movd		(%rsi,%rax,4),%xmm6
-	movzbl		0x9(%rcx),%eax
-	movd		(%rsi,%rax,4),%xmm7
-	movzbl		0xa(%rcx),%eax
-	movd		(%rsi,%rax,4),%xmm4
-	movzbl		0xb(%rcx),%eax
-	movd		(%rsi,%rax,4),%xmm5
+	movzbl		8(%rcx),%eax
+	movd		(DATA,%rax,4),%xmm6
+	movzbl		9(%rcx),%eax
+	movd		(DATA,%rax,4),%xmm7
+	movzbl		10(%rcx),%eax
+	movd		(DATA,%rax,4),%xmm4
+	movzbl		11(%rcx),%eax
+	movd		(DATA,%rax,4),%xmm5
 	punpckldq	%xmm7,%xmm6
 	punpckldq	%xmm5,%xmm4
 	punpcklqdq	%xmm4,%xmm6
 	paddd		%xmm6,%xmm0
 	paddd		%xmm1,%xmm0
 	pxor		%xmm0,%xmm3
 	pshufb		%xmm12,%xmm3
 	paddd		%xmm3,%xmm2
 	pxor		%xmm2,%xmm1
 	movdqa		%xmm1,%xmm8
-	psrld		$0xc,%xmm1
-	pslld		$0x14,%xmm8
+	psrld		$12,%xmm1
+	pslld		$20,%xmm8
 	por		%xmm8,%xmm1
-	movzbl		0xc(%rcx),%eax
-	movd		(%rsi,%rax,4),%xmm7
-	movzbl		0xd(%rcx),%eax
-	movd		(%rsi,%rax,4),%xmm4
-	movzbl		0xe(%rcx),%eax
-	movd		(%rsi,%rax,4),%xmm5
-	movzbl		0xf(%rcx),%eax
-	movd		(%rsi,%rax,4),%xmm6
+	movzbl		12(%rcx),%eax
+	movd		(DATA,%rax,4),%xmm7
+	movzbl		13(%rcx),%eax
+	movd		(DATA,%rax,4),%xmm4
+	movzbl		14(%rcx),%eax
+	movd		(DATA,%rax,4),%xmm5
+	movzbl		15(%rcx),%eax
+	movd		(DATA,%rax,4),%xmm6
 	punpckldq	%xmm4,%xmm7
 	punpckldq	%xmm6,%xmm5
 	punpcklqdq	%xmm5,%xmm7
 	paddd		%xmm7,%xmm0
 	paddd		%xmm1,%xmm0
 	pxor		%xmm0,%xmm3
 	pshufb		%xmm13,%xmm3
 	paddd		%xmm3,%xmm2
 	pxor		%xmm2,%xmm1
 	movdqa		%xmm1,%xmm8
-	psrld		$0x7,%xmm1
-	pslld		$0x19,%xmm8
+	psrld		$7,%xmm1
+	pslld		$25,%xmm8
 	por		%xmm8,%xmm1
 	pshufd		$0x39,%xmm0,%xmm0
 	pshufd		$0x4e,%xmm3,%xmm3
 	pshufd		$0x93,%xmm2,%xmm2
-	addq		$0x10,%rcx
+	addq		$16,%rcx
 	cmpq		%r8,%rcx
-	jnz		.Lroundloop
+	jnz		.Lssse3_roundloop
+
+	// Compute the new h: h[0..7] ^= v[0..7] ^ v[8..15]
 	pxor		%xmm2,%xmm0
 	pxor		%xmm3,%xmm1
 	pxor		%xmm10,%xmm0
 	pxor		%xmm11,%xmm1
-	addq		$0x40,%rsi
-	decq		%rdx
-	jnz		.Lbeginofloop
-	movdqu		%xmm0,(%rdi)
-	movdqu		%xmm1,0x10(%rdi)
-	movdqu		%xmm14,0x20(%rdi)
+	addq		$64,DATA
+	decq		NBLOCKS
+	jnz		.Lssse3_mainloop
+
+	movdqu		%xmm0,(CTX)		// Store new h[0..3]
+	movdqu		%xmm1,16(CTX)		// Store new h[4..7]
+	movdqu		%xmm14,32(CTX)		// Store new t and f
 	RET
 SYM_FUNC_END(blake2s_compress_ssse3)
 
+//
+// void blake2s_compress_avx512(struct blake2s_ctx *ctx,
+//				const u8 *data, size_t nblocks, u32 inc);
+//
+// Only the first three fields of struct blake2s_ctx are used:
+//	u32 h[8];	(inout)
+//	u32 t[2];	(inout)
+//	u32 f[2];	(in)
+//
 SYM_FUNC_START(blake2s_compress_avx512)
-	vmovdqu		(%rdi),%xmm0
-	vmovdqu		0x10(%rdi),%xmm1
-	vmovdqu		0x20(%rdi),%xmm4
-	vmovd		%ecx,%xmm5
-	vmovdqa		.Liv(%rip),%xmm14
-	vmovdqa		.Liv+16(%rip),%xmm15
-	jmp		.Lblake2s_compress_avx512_mainloop
-.align 32
-.Lblake2s_compress_avx512_mainloop:
-	vmovdqa		%xmm0,%xmm10
-	vmovdqa		%xmm1,%xmm11
-	vpaddq		%xmm5,%xmm4,%xmm4
-	vmovdqa		%xmm14,%xmm2
-	vpxor		%xmm15,%xmm4,%xmm3
-	vmovdqu		(%rsi),%ymm6
-	vmovdqu		0x20(%rsi),%ymm7
-	addq		$0x40,%rsi
+	vmovdqu		(CTX),%xmm0		// Load h[0..3]
+	vmovdqu		16(CTX),%xmm1		// Load h[4..7]
+	vmovdqu		32(CTX),%xmm4		// Load t and f
+	vmovd		INC,%xmm5		// Load inc
+	vmovdqa		.Liv(%rip),%xmm14	// Load iv[0..3]
+	vmovdqa		.Liv+16(%rip),%xmm15	// Load iv[4..7]
+	jmp		.Lavx512_mainloop
+
+	.align		32
+.Lavx512_mainloop:
+	// Main loop: each iteration processes one 64-byte block.
+	vmovdqa		%xmm0,%xmm10		// Save h[0..3] and let v[0..3] = h[0..3]
+	vmovdqa		%xmm1,%xmm11		// Save h[4..7] and let v[4..7] = h[4..7]
+	vpaddq		%xmm5,%xmm4,%xmm4	// t += inc (64-bit addition)
+	vmovdqa		%xmm14,%xmm2		// v[8..11] = iv[0..3]
+	vpxor		%xmm15,%xmm4,%xmm3	// v[12..15] = iv[4..7] ^ [t, f]
+	vmovdqu		(DATA),%ymm6		// Load first 8 data words
+	vmovdqu		32(DATA),%ymm7		// Load second 8 data words
+	addq		$64,DATA
 	leaq		.Lsigma2(%rip),%rax
-	movb		$0xa,%cl
-.Lblake2s_compress_avx512_roundloop:
+	movb		$10,%cl			// Set num rounds remaining
+
+.Lavx512_roundloop:
+	// Round loop: each iteration does 1 round (of 10 rounds total).
 	vpmovzxbd	(%rax),%ymm8
-	vpmovzxbd	0x8(%rax),%ymm9
-	addq		$0x10,%rax
+	vpmovzxbd	8(%rax),%ymm9
+	addq		$16,%rax
 	vpermi2d	%ymm7,%ymm6,%ymm8
 	vpermi2d	%ymm7,%ymm6,%ymm9
 	vmovdqa		%ymm8,%ymm6
 	vmovdqa		%ymm9,%ymm7
 	vpaddd		%xmm8,%xmm0,%xmm0
 	vpaddd		%xmm1,%xmm0,%xmm0
 	vpxor		%xmm0,%xmm3,%xmm3
-	vprord		$0x10,%xmm3,%xmm3
+	vprord		$16,%xmm3,%xmm3
 	vpaddd		%xmm3,%xmm2,%xmm2
 	vpxor		%xmm2,%xmm1,%xmm1
-	vprord		$0xc,%xmm1,%xmm1
-	vextracti128	$0x1,%ymm8,%xmm8
+	vprord		$12,%xmm1,%xmm1
+	vextracti128	$1,%ymm8,%xmm8
 	vpaddd		%xmm8,%xmm0,%xmm0
 	vpaddd		%xmm1,%xmm0,%xmm0
 	vpxor		%xmm0,%xmm3,%xmm3
-	vprord		$0x8,%xmm3,%xmm3
+	vprord		$8,%xmm3,%xmm3
 	vpaddd		%xmm3,%xmm2,%xmm2
 	vpxor		%xmm2,%xmm1,%xmm1
-	vprord		$0x7,%xmm1,%xmm1
+	vprord		$7,%xmm1,%xmm1
 	vpshufd		$0x93,%xmm0,%xmm0
 	vpshufd		$0x4e,%xmm3,%xmm3
 	vpshufd		$0x39,%xmm2,%xmm2
 	vpaddd		%xmm9,%xmm0,%xmm0
 	vpaddd		%xmm1,%xmm0,%xmm0
 	vpxor		%xmm0,%xmm3,%xmm3
-	vprord		$0x10,%xmm3,%xmm3
+	vprord		$16,%xmm3,%xmm3
 	vpaddd		%xmm3,%xmm2,%xmm2
 	vpxor		%xmm2,%xmm1,%xmm1
-	vprord		$0xc,%xmm1,%xmm1
-	vextracti128	$0x1,%ymm9,%xmm9
+	vprord		$12,%xmm1,%xmm1
+	vextracti128	$1,%ymm9,%xmm9
 	vpaddd		%xmm9,%xmm0,%xmm0
 	vpaddd		%xmm1,%xmm0,%xmm0
 	vpxor		%xmm0,%xmm3,%xmm3
-	vprord		$0x8,%xmm3,%xmm3
+	vprord		$8,%xmm3,%xmm3
 	vpaddd		%xmm3,%xmm2,%xmm2
 	vpxor		%xmm2,%xmm1,%xmm1
-	vprord		$0x7,%xmm1,%xmm1
+	vprord		$7,%xmm1,%xmm1
 	vpshufd		$0x39,%xmm0,%xmm0
 	vpshufd		$0x4e,%xmm3,%xmm3
 	vpshufd		$0x93,%xmm2,%xmm2
 	decb		%cl
-	jne		.Lblake2s_compress_avx512_roundloop
+	jne		.Lavx512_roundloop
+
+	// Compute the new h: h[0..7] ^= v[0..7] ^ v[8..15]
 	vpxor		%xmm10,%xmm0,%xmm0
 	vpxor		%xmm11,%xmm1,%xmm1
 	vpxor		%xmm2,%xmm0,%xmm0
 	vpxor		%xmm3,%xmm1,%xmm1
-	decq		%rdx
-	jne		.Lblake2s_compress_avx512_mainloop
-	vmovdqu		%xmm0,(%rdi)
-	vmovdqu		%xmm1,0x10(%rdi)
-	vmovdqu		%xmm4,0x20(%rdi)
+	decq		NBLOCKS
+	jne		.Lavx512_mainloop
+
+	vmovdqu		%xmm0,(CTX)		// Store new h[0..3]
+	vmovdqu		%xmm1,16(CTX)		// Store new h[4..7]
+	vmovdqu		%xmm4,32(CTX)		// Store new t and f
 	vzeroupper
 	RET
 SYM_FUNC_END(blake2s_compress_avx512)
-- 
2.51.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 5/6] lib/crypto: x86/blake2s: Avoid writing back unchanged 'f' value
  2025-11-02 23:42 [PATCH 0/6] x86 BLAKE2s cleanups Eric Biggers
                   ` (3 preceding siblings ...)
  2025-11-02 23:42 ` [PATCH 4/6] lib/crypto: x86/blake2s: Improve readability Eric Biggers
@ 2025-11-02 23:42 ` Eric Biggers
  2025-11-02 23:42 ` [PATCH 6/6] lib/crypto: x86/blake2s: Use vpternlogd for 3-input XORs Eric Biggers
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Eric Biggers @ 2025-11-02 23:42 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-kernel, Ard Biesheuvel, Jason A . Donenfeld, Herbert Xu,
	x86, Samuel Neves, Eric Biggers

Just before returning, blake2s_compress_ssse3() and
blake2s_compress_avx512() store updated values to the 'h', 't', and 'f'
fields of struct blake2s_ctx.  But 'f' is always unchanged (which is
correct; only the C code changes it).  So, there's no need to write to
'f'.  Use 64-bit stores (movq and vmovq) instead of 128-bit stores
(movdqu and vmovdqu) so that only 't' is written.

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
---
 lib/crypto/x86/blake2s-core.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/crypto/x86/blake2s-core.S b/lib/crypto/x86/blake2s-core.S
index f805a49c590d..869064f6ac16 100644
--- a/lib/crypto/x86/blake2s-core.S
+++ b/lib/crypto/x86/blake2s-core.S
@@ -191,11 +191,11 @@ SYM_FUNC_START(blake2s_compress_ssse3)
 	decq		NBLOCKS
 	jnz		.Lssse3_mainloop
 
 	movdqu		%xmm0,(CTX)		// Store new h[0..3]
 	movdqu		%xmm1,16(CTX)		// Store new h[4..7]
-	movdqu		%xmm14,32(CTX)		// Store new t and f
+	movq		%xmm14,32(CTX)		// Store new t (f is unchanged)
 	RET
 SYM_FUNC_END(blake2s_compress_ssse3)
 
 //
 // void blake2s_compress_avx512(struct blake2s_ctx *ctx,
@@ -285,9 +285,9 @@ SYM_FUNC_START(blake2s_compress_avx512)
 	decq		NBLOCKS
 	jne		.Lavx512_mainloop
 
 	vmovdqu		%xmm0,(CTX)		// Store new h[0..3]
 	vmovdqu		%xmm1,16(CTX)		// Store new h[4..7]
-	vmovdqu		%xmm4,32(CTX)		// Store new t and f
+	vmovq		%xmm4,32(CTX)		// Store new t (f is unchanged)
 	vzeroupper
 	RET
 SYM_FUNC_END(blake2s_compress_avx512)
-- 
2.51.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 6/6] lib/crypto: x86/blake2s: Use vpternlogd for 3-input XORs
  2025-11-02 23:42 [PATCH 0/6] x86 BLAKE2s cleanups Eric Biggers
                   ` (4 preceding siblings ...)
  2025-11-02 23:42 ` [PATCH 5/6] lib/crypto: x86/blake2s: Avoid writing back unchanged 'f' value Eric Biggers
@ 2025-11-02 23:42 ` Eric Biggers
  2025-11-03  8:14 ` [PATCH 0/6] x86 BLAKE2s cleanups Ard Biesheuvel
  2025-11-03 17:35 ` Eric Biggers
  7 siblings, 0 replies; 9+ messages in thread
From: Eric Biggers @ 2025-11-02 23:42 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-kernel, Ard Biesheuvel, Jason A . Donenfeld, Herbert Xu,
	x86, Samuel Neves, Eric Biggers

AVX-512 supports 3-input XORs via the vpternlogd (or vpternlogq)
instruction with immediate 0x96.  This approach, vs. the alternative of
two vpxor instructions, is already used in the CRC, AES-GCM, and AES-XTS
code, since it reduces the instruction count and is faster on some CPUs.
Make blake2s_compress_avx512() take advantage of it too.

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
---
 lib/crypto/x86/blake2s-core.S | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/lib/crypto/x86/blake2s-core.S b/lib/crypto/x86/blake2s-core.S
index 869064f6ac16..7b1d98ca7482 100644
--- a/lib/crypto/x86/blake2s-core.S
+++ b/lib/crypto/x86/blake2s-core.S
@@ -276,14 +276,12 @@ SYM_FUNC_START(blake2s_compress_avx512)
 	vpshufd		$0x93,%xmm2,%xmm2
 	decb		%cl
 	jne		.Lavx512_roundloop
 
 	// Compute the new h: h[0..7] ^= v[0..7] ^ v[8..15]
-	vpxor		%xmm10,%xmm0,%xmm0
-	vpxor		%xmm11,%xmm1,%xmm1
-	vpxor		%xmm2,%xmm0,%xmm0
-	vpxor		%xmm3,%xmm1,%xmm1
+	vpternlogd	$0x96,%xmm10,%xmm2,%xmm0
+	vpternlogd	$0x96,%xmm11,%xmm3,%xmm1
 	decq		NBLOCKS
 	jne		.Lavx512_mainloop
 
 	vmovdqu		%xmm0,(CTX)		// Store new h[0..3]
 	vmovdqu		%xmm1,16(CTX)		// Store new h[4..7]
-- 
2.51.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/6] x86 BLAKE2s cleanups
  2025-11-02 23:42 [PATCH 0/6] x86 BLAKE2s cleanups Eric Biggers
                   ` (5 preceding siblings ...)
  2025-11-02 23:42 ` [PATCH 6/6] lib/crypto: x86/blake2s: Use vpternlogd for 3-input XORs Eric Biggers
@ 2025-11-03  8:14 ` Ard Biesheuvel
  2025-11-03 17:35 ` Eric Biggers
  7 siblings, 0 replies; 9+ messages in thread
From: Ard Biesheuvel @ 2025-11-03  8:14 UTC (permalink / raw)
  To: Eric Biggers
  Cc: linux-crypto, linux-kernel, Jason A . Donenfeld, Herbert Xu, x86,
	Samuel Neves

On Mon, 3 Nov 2025 at 00:44, Eric Biggers <ebiggers@kernel.org> wrote:
>
> Various cleanups for the SSSE3 and AVX512 implementations of BLAKE2s.
>
> This is targeting libcrypto-next.
>
> Eric Biggers (6):
>   lib/crypto: x86/blake2s: Fix 32-bit arg treated as 64-bit
>   lib/crypto: x86/blake2s: Drop check for nblocks == 0
>   lib/crypto: x86/blake2s: Use local labels for data
>   lib/crypto: x86/blake2s: Improve readability
>   lib/crypto: x86/blake2s: Avoid writing back unchanged 'f' value
>   lib/crypto: x86/blake2s: Use vpternlogd for 3-input XORs
>


Reviewed-by: Ard Biesheuvel <ardb@kernel.org>


>  lib/crypto/x86/blake2s-core.S | 275 +++++++++++++++++++---------------
>  1 file changed, 157 insertions(+), 118 deletions(-)
>
> base-commit: 5a2a5e62a5216ba05d4481cf90d915f4de0bfde9
> --
> 2.51.2
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/6] x86 BLAKE2s cleanups
  2025-11-02 23:42 [PATCH 0/6] x86 BLAKE2s cleanups Eric Biggers
                   ` (6 preceding siblings ...)
  2025-11-03  8:14 ` [PATCH 0/6] x86 BLAKE2s cleanups Ard Biesheuvel
@ 2025-11-03 17:35 ` Eric Biggers
  7 siblings, 0 replies; 9+ messages in thread
From: Eric Biggers @ 2025-11-03 17:35 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-kernel, Ard Biesheuvel, Jason A . Donenfeld, Herbert Xu,
	x86, Samuel Neves

On Sun, Nov 02, 2025 at 03:42:03PM -0800, Eric Biggers wrote:
> Various cleanups for the SSSE3 and AVX512 implementations of BLAKE2s.
> 
> This is targeting libcrypto-next.
> 
> Eric Biggers (6):
>   lib/crypto: x86/blake2s: Fix 32-bit arg treated as 64-bit
>   lib/crypto: x86/blake2s: Drop check for nblocks == 0
>   lib/crypto: x86/blake2s: Use local labels for data
>   lib/crypto: x86/blake2s: Improve readability
>   lib/crypto: x86/blake2s: Avoid writing back unchanged 'f' value
>   lib/crypto: x86/blake2s: Use vpternlogd for 3-input XORs
> 
>  lib/crypto/x86/blake2s-core.S | 275 +++++++++++++++++++---------------
>  1 file changed, 157 insertions(+), 118 deletions(-)
> 
> base-commit: 5a2a5e62a5216ba05d4481cf90d915f4de0bfde9

Applied to https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=libcrypto-next

- Eric

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2025-11-03 17:37 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-02 23:42 [PATCH 0/6] x86 BLAKE2s cleanups Eric Biggers
2025-11-02 23:42 ` [PATCH 1/6] lib/crypto: x86/blake2s: Fix 32-bit arg treated as 64-bit Eric Biggers
2025-11-02 23:42 ` [PATCH 2/6] lib/crypto: x86/blake2s: Drop check for nblocks == 0 Eric Biggers
2025-11-02 23:42 ` [PATCH 3/6] lib/crypto: x86/blake2s: Use local labels for data Eric Biggers
2025-11-02 23:42 ` [PATCH 4/6] lib/crypto: x86/blake2s: Improve readability Eric Biggers
2025-11-02 23:42 ` [PATCH 5/6] lib/crypto: x86/blake2s: Avoid writing back unchanged 'f' value Eric Biggers
2025-11-02 23:42 ` [PATCH 6/6] lib/crypto: x86/blake2s: Use vpternlogd for 3-input XORs Eric Biggers
2025-11-03  8:14 ` [PATCH 0/6] x86 BLAKE2s cleanups Ard Biesheuvel
2025-11-03 17:35 ` Eric Biggers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).