* [PATCH 0/6] x86 BLAKE2s cleanups
From: Eric Biggers @ 2025-11-02 23:42 UTC (permalink / raw)
To: linux-crypto
Cc: linux-kernel, Ard Biesheuvel, Jason A . Donenfeld, Herbert Xu,
x86, Samuel Neves, Eric Biggers
Various cleanups for the SSSE3 and AVX512 implementations of BLAKE2s.
This is targeting libcrypto-next.
Eric Biggers (6):
lib/crypto: x86/blake2s: Fix 32-bit arg treated as 64-bit
lib/crypto: x86/blake2s: Drop check for nblocks == 0
lib/crypto: x86/blake2s: Use local labels for data
lib/crypto: x86/blake2s: Improve readability
lib/crypto: x86/blake2s: Avoid writing back unchanged 'f' value
lib/crypto: x86/blake2s: Use vpternlogd for 3-input XORs
lib/crypto/x86/blake2s-core.S | 275 +++++++++++++++++++---------------
1 file changed, 157 insertions(+), 118 deletions(-)
base-commit: 5a2a5e62a5216ba05d4481cf90d915f4de0bfde9
--
2.51.2
* [PATCH 1/6] lib/crypto: x86/blake2s: Fix 32-bit arg treated as 64-bit
From: Eric Biggers @ 2025-11-02 23:42 UTC (permalink / raw)
To: linux-crypto
Cc: linux-kernel, Ard Biesheuvel, Jason A . Donenfeld, Herbert Xu,
x86, Samuel Neves, Eric Biggers, stable
In the C code, the 'inc' argument to the assembly functions
blake2s_compress_ssse3() and blake2s_compress_avx512() is declared with
type u32, matching blake2s_compress(). The assembly code then reads it
from the 64-bit %rcx. However, the ABI doesn't guarantee zero-extension
to 64 bits, nor do gcc or clang guarantee it. Therefore, fix these
functions to read this argument from the 32-bit %ecx.
In theory, this bug could have caused the wrong 'inc' value to be used,
causing incorrect BLAKE2s hashes. In practice, probably not: I've fixed
essentially this same bug in many other assembly files too, but there's
never been a real report of it having caused a problem. On x86_64, every
write to a 32-bit register zero-extends into the full 64-bit register, so
the upper bits are already zero in nearly all situations. I've only been able to
demonstrate a lack of zero-extension with a somewhat contrived example
involving truncation, e.g. when the C code has a u64 variable holding
0x1234567800000040 and passes it as a u32 expecting it to be truncated
to 0x40 (64). But that's not what the real code does, of course.
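As a rough illustration of the contrived case above, the following C sketch models what the callee sees when it reads the argument register at full width versus at 32 bits. The helper names and the 'reg' parameter (standing in for the raw 64-bit contents of %rcx on entry) are made up for illustration. It only models register contents and is not a statement about what gcc or clang actually emit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: 'reg' is the raw 64-bit content of %rcx on entry. */

/* Old behavior (movq %rcx,...): all 64 bits are taken as 'inc'. */
static uint64_t inc_read_as_64bit(uint64_t reg)
{
	return reg;
}

/* Fixed behavior (movd %ecx,...): only the low 32 bits are used. */
static uint64_t inc_read_as_32bit(uint64_t reg)
{
	return (uint32_t)reg;
}

/* t += inc is a 64-bit addition, so stale upper bits would corrupt t. */
static uint64_t add_to_counter(uint64_t t, uint64_t inc)
{
	return t + inc;
}

/* The contrived case: a u64 holding 0x1234567800000040 passed as a u32. */
static void check_contrived_case(void)
{
	const uint64_t reg = 0x1234567800000040ULL;

	assert(inc_read_as_32bit(reg) == 0x40);
	assert(inc_read_as_64bit(reg) != 0x40);
	assert(add_to_counter(0, inc_read_as_32bit(reg)) == 64);
}
```

Under this model, only the 32-bit read recovers the intended increment of 64 when the upper half of the register holds leftover bits.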
Fixes: ed0356eda153 ("crypto: blake2s - x86_64 SIMD implementation")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
---
lib/crypto/x86/blake2s-core.S | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/lib/crypto/x86/blake2s-core.S b/lib/crypto/x86/blake2s-core.S
index ef8e9f427aab..093e7814f387 100644
--- a/lib/crypto/x86/blake2s-core.S
+++ b/lib/crypto/x86/blake2s-core.S
@@ -50,11 +50,11 @@ SYM_FUNC_START(blake2s_compress_ssse3)
movdqu (%rdi),%xmm0
movdqu 0x10(%rdi),%xmm1
movdqa ROT16(%rip),%xmm12
movdqa ROR328(%rip),%xmm13
movdqu 0x20(%rdi),%xmm14
- movq %rcx,%xmm15
+ movd %ecx,%xmm15
leaq SIGMA+0xa0(%rip),%r8
jmp .Lbeginofloop
.align 32
.Lbeginofloop:
movdqa %xmm0,%xmm10
@@ -174,11 +174,11 @@ SYM_FUNC_END(blake2s_compress_ssse3)
SYM_FUNC_START(blake2s_compress_avx512)
vmovdqu (%rdi),%xmm0
vmovdqu 0x10(%rdi),%xmm1
vmovdqu 0x20(%rdi),%xmm4
- vmovq %rcx,%xmm5
+ vmovd %ecx,%xmm5
vmovdqa IV(%rip),%xmm14
vmovdqa IV+16(%rip),%xmm15
jmp .Lblake2s_compress_avx512_mainloop
.align 32
.Lblake2s_compress_avx512_mainloop:
--
2.51.2
* [PATCH 2/6] lib/crypto: x86/blake2s: Drop check for nblocks == 0
From: Eric Biggers @ 2025-11-02 23:42 UTC (permalink / raw)
To: linux-crypto
Cc: linux-kernel, Ard Biesheuvel, Jason A . Donenfeld, Herbert Xu,
x86, Samuel Neves, Eric Biggers
Since blake2s_compress() is always passed nblocks != 0, remove the
unnecessary check for nblocks == 0 from blake2s_compress_ssse3().
Note that this makes it consistent with blake2s_compress_avx512() in the
same file as well as the arm32 blake2s_compress().
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
---
lib/crypto/x86/blake2s-core.S | 3 ---
1 file changed, 3 deletions(-)
diff --git a/lib/crypto/x86/blake2s-core.S b/lib/crypto/x86/blake2s-core.S
index 093e7814f387..aee13b97cc34 100644
--- a/lib/crypto/x86/blake2s-core.S
+++ b/lib/crypto/x86/blake2s-core.S
@@ -43,12 +43,10 @@ SIGMA2:
.byte 15, 5, 4, 13, 10, 7, 3, 11, 12, 2, 0, 6, 9, 8, 1, 14
.byte 8, 7, 14, 11, 13, 15, 0, 12, 10, 4, 5, 6, 3, 2, 1, 9
.text
SYM_FUNC_START(blake2s_compress_ssse3)
- testq %rdx,%rdx
- je .Lendofloop
movdqu (%rdi),%xmm0
movdqu 0x10(%rdi),%xmm1
movdqa ROT16(%rip),%xmm12
movdqa ROR328(%rip),%xmm13
movdqu 0x20(%rdi),%xmm14
@@ -166,11 +164,10 @@ SYM_FUNC_START(blake2s_compress_ssse3)
decq %rdx
jnz .Lbeginofloop
movdqu %xmm0,(%rdi)
movdqu %xmm1,0x10(%rdi)
movdqu %xmm14,0x20(%rdi)
-.Lendofloop:
RET
SYM_FUNC_END(blake2s_compress_ssse3)
SYM_FUNC_START(blake2s_compress_avx512)
vmovdqu (%rdi),%xmm0
--
2.51.2
* [PATCH 3/6] lib/crypto: x86/blake2s: Use local labels for data
From: Eric Biggers @ 2025-11-02 23:42 UTC (permalink / raw)
To: linux-crypto
Cc: linux-kernel, Ard Biesheuvel, Jason A . Donenfeld, Herbert Xu,
x86, Samuel Neves, Eric Biggers
Following the usual practice, prefix the names of the data labels with
".L" so that the assembler treats them as truly local. This more
clearly expresses the intent and is less error-prone.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
---
lib/crypto/x86/blake2s-core.S | 45 ++++++++++++++++++++---------------
1 file changed, 26 insertions(+), 19 deletions(-)
diff --git a/lib/crypto/x86/blake2s-core.S b/lib/crypto/x86/blake2s-core.S
index aee13b97cc34..14e487559c09 100644
--- a/lib/crypto/x86/blake2s-core.S
+++ b/lib/crypto/x86/blake2s-core.S
@@ -4,36 +4,43 @@
* Copyright (C) 2017-2019 Samuel Neves <sneves@dei.uc.pt>. All Rights Reserved.
*/
#include <linux/linkage.h>
-.section .rodata.cst32.BLAKE2S_IV, "aM", @progbits, 32
+.section .rodata.cst32.iv, "aM", @progbits, 32
.align 32
-IV: .octa 0xA54FF53A3C6EF372BB67AE856A09E667
+.Liv:
+ .octa 0xA54FF53A3C6EF372BB67AE856A09E667
.octa 0x5BE0CD191F83D9AB9B05688C510E527F
-.section .rodata.cst16.ROT16, "aM", @progbits, 16
+
+.section .rodata.cst16.ror16, "aM", @progbits, 16
.align 16
-ROT16: .octa 0x0D0C0F0E09080B0A0504070601000302
-.section .rodata.cst16.ROR328, "aM", @progbits, 16
+.Lror16:
+ .octa 0x0D0C0F0E09080B0A0504070601000302
+
+.section .rodata.cst16.ror8, "aM", @progbits, 16
.align 16
-ROR328: .octa 0x0C0F0E0D080B0A090407060500030201
-.section .rodata.cst64.BLAKE2S_SIGMA, "aM", @progbits, 160
+.Lror8:
+ .octa 0x0C0F0E0D080B0A090407060500030201
+
+.section .rodata.cst64.sigma, "aM", @progbits, 160
.align 64
-SIGMA:
+.Lsigma:
.byte 0, 2, 4, 6, 1, 3, 5, 7, 14, 8, 10, 12, 15, 9, 11, 13
.byte 14, 4, 9, 13, 10, 8, 15, 6, 5, 1, 0, 11, 3, 12, 2, 7
.byte 11, 12, 5, 15, 8, 0, 2, 13, 9, 10, 3, 7, 4, 14, 6, 1
.byte 7, 3, 13, 11, 9, 1, 12, 14, 15, 2, 5, 4, 8, 6, 10, 0
.byte 9, 5, 2, 10, 0, 7, 4, 15, 3, 14, 11, 6, 13, 1, 12, 8
.byte 2, 6, 0, 8, 12, 10, 11, 3, 1, 4, 7, 15, 9, 13, 5, 14
.byte 12, 1, 14, 4, 5, 15, 13, 10, 8, 0, 6, 9, 11, 7, 3, 2
.byte 13, 7, 12, 3, 11, 14, 1, 9, 2, 5, 15, 8, 10, 0, 4, 6
.byte 6, 14, 11, 0, 15, 9, 3, 8, 10, 12, 13, 1, 5, 2, 7, 4
.byte 10, 8, 7, 1, 2, 4, 6, 5, 13, 15, 9, 3, 0, 11, 14, 12
-.section .rodata.cst64.BLAKE2S_SIGMA2, "aM", @progbits, 160
+
+.section .rodata.cst64.sigma2, "aM", @progbits, 160
.align 64
-SIGMA2:
+.Lsigma2:
.byte 0, 2, 4, 6, 1, 3, 5, 7, 14, 8, 10, 12, 15, 9, 11, 13
.byte 8, 2, 13, 15, 10, 9, 12, 3, 6, 4, 0, 14, 5, 11, 1, 7
.byte 11, 13, 8, 6, 5, 10, 14, 3, 2, 4, 12, 15, 1, 0, 7, 9
.byte 11, 10, 7, 0, 8, 15, 1, 13, 3, 6, 2, 12, 4, 14, 9, 5
.byte 4, 10, 9, 14, 15, 0, 11, 8, 1, 7, 3, 13, 2, 5, 6, 12
@@ -45,25 +52,25 @@ SIGMA2:
.text
SYM_FUNC_START(blake2s_compress_ssse3)
movdqu (%rdi),%xmm0
movdqu 0x10(%rdi),%xmm1
- movdqa ROT16(%rip),%xmm12
- movdqa ROR328(%rip),%xmm13
+ movdqa .Lror16(%rip),%xmm12
+ movdqa .Lror8(%rip),%xmm13
movdqu 0x20(%rdi),%xmm14
movd %ecx,%xmm15
- leaq SIGMA+0xa0(%rip),%r8
+ leaq .Lsigma+0xa0(%rip),%r8
jmp .Lbeginofloop
.align 32
.Lbeginofloop:
movdqa %xmm0,%xmm10
movdqa %xmm1,%xmm11
paddq %xmm15,%xmm14
- movdqa IV(%rip),%xmm2
+ movdqa .Liv(%rip),%xmm2
movdqa %xmm14,%xmm3
- pxor IV+0x10(%rip),%xmm3
- leaq SIGMA(%rip),%rcx
+ pxor .Liv+0x10(%rip),%xmm3
+ leaq .Lsigma(%rip),%rcx
.Lroundloop:
movzbl (%rcx),%eax
movd (%rsi,%rax,4),%xmm4
movzbl 0x1(%rcx),%eax
movd (%rsi,%rax,4),%xmm5
@@ -172,12 +179,12 @@ SYM_FUNC_END(blake2s_compress_ssse3)
SYM_FUNC_START(blake2s_compress_avx512)
vmovdqu (%rdi),%xmm0
vmovdqu 0x10(%rdi),%xmm1
vmovdqu 0x20(%rdi),%xmm4
vmovd %ecx,%xmm5
- vmovdqa IV(%rip),%xmm14
- vmovdqa IV+16(%rip),%xmm15
+ vmovdqa .Liv(%rip),%xmm14
+ vmovdqa .Liv+16(%rip),%xmm15
jmp .Lblake2s_compress_avx512_mainloop
.align 32
.Lblake2s_compress_avx512_mainloop:
vmovdqa %xmm0,%xmm10
vmovdqa %xmm1,%xmm11
@@ -185,11 +192,11 @@ SYM_FUNC_START(blake2s_compress_avx512)
vmovdqa %xmm14,%xmm2
vpxor %xmm15,%xmm4,%xmm3
vmovdqu (%rsi),%ymm6
vmovdqu 0x20(%rsi),%ymm7
addq $0x40,%rsi
- leaq SIGMA2(%rip),%rax
+ leaq .Lsigma2(%rip),%rax
movb $0xa,%cl
.Lblake2s_compress_avx512_roundloop:
vpmovzxbd (%rax),%ymm8
vpmovzxbd 0x8(%rax),%ymm9
addq $0x10,%rax
--
2.51.2
* [PATCH 4/6] lib/crypto: x86/blake2s: Improve readability
From: Eric Biggers @ 2025-11-02 23:42 UTC (permalink / raw)
To: linux-crypto
Cc: linux-kernel, Ard Biesheuvel, Jason A . Donenfeld, Herbert Xu,
x86, Samuel Neves, Eric Biggers
Various cleanups for readability. No change to the generated code:
- Add some comments
- Add #defines for arguments
- Rename some labels
- Use decimal constants instead of hex where it makes sense.
(The pshufd immediates intentionally remain as hex.)
- Add blank lines when there's a logical break
The round loop still could use some work, but this is at least a start.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
---
lib/crypto/x86/blake2s-core.S | 231 ++++++++++++++++++++--------------
1 file changed, 134 insertions(+), 97 deletions(-)
diff --git a/lib/crypto/x86/blake2s-core.S b/lib/crypto/x86/blake2s-core.S
index 14e487559c09..f805a49c590d 100644
--- a/lib/crypto/x86/blake2s-core.S
+++ b/lib/crypto/x86/blake2s-core.S
@@ -48,209 +48,246 @@
.byte 4, 8, 15, 9, 14, 11, 13, 5, 3, 2, 1, 12, 6, 10, 7, 0
.byte 6, 13, 0, 14, 12, 2, 1, 11, 15, 4, 5, 8, 7, 9, 3, 10
.byte 15, 5, 4, 13, 10, 7, 3, 11, 12, 2, 0, 6, 9, 8, 1, 14
.byte 8, 7, 14, 11, 13, 15, 0, 12, 10, 4, 5, 6, 3, 2, 1, 9
+#define CTX %rdi
+#define DATA %rsi
+#define NBLOCKS %rdx
+#define INC %ecx
+
.text
+//
+// void blake2s_compress_ssse3(struct blake2s_ctx *ctx,
+// const u8 *data, size_t nblocks, u32 inc);
+//
+// Only the first three fields of struct blake2s_ctx are used:
+// u32 h[8]; (inout)
+// u32 t[2]; (inout)
+// u32 f[2]; (in)
+//
SYM_FUNC_START(blake2s_compress_ssse3)
- movdqu (%rdi),%xmm0
- movdqu 0x10(%rdi),%xmm1
+ movdqu (CTX),%xmm0 // Load h[0..3]
+ movdqu 16(CTX),%xmm1 // Load h[4..7]
movdqa .Lror16(%rip),%xmm12
movdqa .Lror8(%rip),%xmm13
- movdqu 0x20(%rdi),%xmm14
- movd %ecx,%xmm15
- leaq .Lsigma+0xa0(%rip),%r8
- jmp .Lbeginofloop
+ movdqu 32(CTX),%xmm14 // Load t and f
+ movd INC,%xmm15 // Load inc
+ leaq .Lsigma+160(%rip),%r8
+ jmp .Lssse3_mainloop
+
.align 32
-.Lbeginofloop:
- movdqa %xmm0,%xmm10
- movdqa %xmm1,%xmm11
- paddq %xmm15,%xmm14
- movdqa .Liv(%rip),%xmm2
+.Lssse3_mainloop:
+ // Main loop: each iteration processes one 64-byte block.
+ movdqa %xmm0,%xmm10 // Save h[0..3] and let v[0..3] = h[0..3]
+ movdqa %xmm1,%xmm11 // Save h[4..7] and let v[4..7] = h[4..7]
+ paddq %xmm15,%xmm14 // t += inc (64-bit addition)
+ movdqa .Liv(%rip),%xmm2 // v[8..11] = iv[0..3]
movdqa %xmm14,%xmm3
- pxor .Liv+0x10(%rip),%xmm3
+ pxor .Liv+16(%rip),%xmm3 // v[12..15] = iv[4..7] ^ [t, f]
leaq .Lsigma(%rip),%rcx
-.Lroundloop:
+
+.Lssse3_roundloop:
+ // Round loop: each iteration does 1 round (of 10 rounds total).
movzbl (%rcx),%eax
- movd (%rsi,%rax,4),%xmm4
- movzbl 0x1(%rcx),%eax
- movd (%rsi,%rax,4),%xmm5
- movzbl 0x2(%rcx),%eax
- movd (%rsi,%rax,4),%xmm6
- movzbl 0x3(%rcx),%eax
- movd (%rsi,%rax,4),%xmm7
+ movd (DATA,%rax,4),%xmm4
+ movzbl 1(%rcx),%eax
+ movd (DATA,%rax,4),%xmm5
+ movzbl 2(%rcx),%eax
+ movd (DATA,%rax,4),%xmm6
+ movzbl 3(%rcx),%eax
+ movd (DATA,%rax,4),%xmm7
punpckldq %xmm5,%xmm4
punpckldq %xmm7,%xmm6
punpcklqdq %xmm6,%xmm4
paddd %xmm4,%xmm0
paddd %xmm1,%xmm0
pxor %xmm0,%xmm3
pshufb %xmm12,%xmm3
paddd %xmm3,%xmm2
pxor %xmm2,%xmm1
movdqa %xmm1,%xmm8
- psrld $0xc,%xmm1
- pslld $0x14,%xmm8
+ psrld $12,%xmm1
+ pslld $20,%xmm8
por %xmm8,%xmm1
- movzbl 0x4(%rcx),%eax
- movd (%rsi,%rax,4),%xmm5
- movzbl 0x5(%rcx),%eax
- movd (%rsi,%rax,4),%xmm6
- movzbl 0x6(%rcx),%eax
- movd (%rsi,%rax,4),%xmm7
- movzbl 0x7(%rcx),%eax
- movd (%rsi,%rax,4),%xmm4
+ movzbl 4(%rcx),%eax
+ movd (DATA,%rax,4),%xmm5
+ movzbl 5(%rcx),%eax
+ movd (DATA,%rax,4),%xmm6
+ movzbl 6(%rcx),%eax
+ movd (DATA,%rax,4),%xmm7
+ movzbl 7(%rcx),%eax
+ movd (DATA,%rax,4),%xmm4
punpckldq %xmm6,%xmm5
punpckldq %xmm4,%xmm7
punpcklqdq %xmm7,%xmm5
paddd %xmm5,%xmm0
paddd %xmm1,%xmm0
pxor %xmm0,%xmm3
pshufb %xmm13,%xmm3
paddd %xmm3,%xmm2
pxor %xmm2,%xmm1
movdqa %xmm1,%xmm8
- psrld $0x7,%xmm1
- pslld $0x19,%xmm8
+ psrld $7,%xmm1
+ pslld $25,%xmm8
por %xmm8,%xmm1
pshufd $0x93,%xmm0,%xmm0
pshufd $0x4e,%xmm3,%xmm3
pshufd $0x39,%xmm2,%xmm2
- movzbl 0x8(%rcx),%eax
- movd (%rsi,%rax,4),%xmm6
- movzbl 0x9(%rcx),%eax
- movd (%rsi,%rax,4),%xmm7
- movzbl 0xa(%rcx),%eax
- movd (%rsi,%rax,4),%xmm4
- movzbl 0xb(%rcx),%eax
- movd (%rsi,%rax,4),%xmm5
+ movzbl 8(%rcx),%eax
+ movd (DATA,%rax,4),%xmm6
+ movzbl 9(%rcx),%eax
+ movd (DATA,%rax,4),%xmm7
+ movzbl 10(%rcx),%eax
+ movd (DATA,%rax,4),%xmm4
+ movzbl 11(%rcx),%eax
+ movd (DATA,%rax,4),%xmm5
punpckldq %xmm7,%xmm6
punpckldq %xmm5,%xmm4
punpcklqdq %xmm4,%xmm6
paddd %xmm6,%xmm0
paddd %xmm1,%xmm0
pxor %xmm0,%xmm3
pshufb %xmm12,%xmm3
paddd %xmm3,%xmm2
pxor %xmm2,%xmm1
movdqa %xmm1,%xmm8
- psrld $0xc,%xmm1
- pslld $0x14,%xmm8
+ psrld $12,%xmm1
+ pslld $20,%xmm8
por %xmm8,%xmm1
- movzbl 0xc(%rcx),%eax
- movd (%rsi,%rax,4),%xmm7
- movzbl 0xd(%rcx),%eax
- movd (%rsi,%rax,4),%xmm4
- movzbl 0xe(%rcx),%eax
- movd (%rsi,%rax,4),%xmm5
- movzbl 0xf(%rcx),%eax
- movd (%rsi,%rax,4),%xmm6
+ movzbl 12(%rcx),%eax
+ movd (DATA,%rax,4),%xmm7
+ movzbl 13(%rcx),%eax
+ movd (DATA,%rax,4),%xmm4
+ movzbl 14(%rcx),%eax
+ movd (DATA,%rax,4),%xmm5
+ movzbl 15(%rcx),%eax
+ movd (DATA,%rax,4),%xmm6
punpckldq %xmm4,%xmm7
punpckldq %xmm6,%xmm5
punpcklqdq %xmm5,%xmm7
paddd %xmm7,%xmm0
paddd %xmm1,%xmm0
pxor %xmm0,%xmm3
pshufb %xmm13,%xmm3
paddd %xmm3,%xmm2
pxor %xmm2,%xmm1
movdqa %xmm1,%xmm8
- psrld $0x7,%xmm1
- pslld $0x19,%xmm8
+ psrld $7,%xmm1
+ pslld $25,%xmm8
por %xmm8,%xmm1
pshufd $0x39,%xmm0,%xmm0
pshufd $0x4e,%xmm3,%xmm3
pshufd $0x93,%xmm2,%xmm2
- addq $0x10,%rcx
+ addq $16,%rcx
cmpq %r8,%rcx
- jnz .Lroundloop
+ jnz .Lssse3_roundloop
+
+ // Compute the new h: h[0..7] ^= v[0..7] ^ v[8..15]
pxor %xmm2,%xmm0
pxor %xmm3,%xmm1
pxor %xmm10,%xmm0
pxor %xmm11,%xmm1
- addq $0x40,%rsi
- decq %rdx
- jnz .Lbeginofloop
- movdqu %xmm0,(%rdi)
- movdqu %xmm1,0x10(%rdi)
- movdqu %xmm14,0x20(%rdi)
+ addq $64,DATA
+ decq NBLOCKS
+ jnz .Lssse3_mainloop
+
+ movdqu %xmm0,(CTX) // Store new h[0..3]
+ movdqu %xmm1,16(CTX) // Store new h[4..7]
+ movdqu %xmm14,32(CTX) // Store new t and f
RET
SYM_FUNC_END(blake2s_compress_ssse3)
+//
+// void blake2s_compress_avx512(struct blake2s_ctx *ctx,
+// const u8 *data, size_t nblocks, u32 inc);
+//
+// Only the first three fields of struct blake2s_ctx are used:
+// u32 h[8]; (inout)
+// u32 t[2]; (inout)
+// u32 f[2]; (in)
+//
SYM_FUNC_START(blake2s_compress_avx512)
- vmovdqu (%rdi),%xmm0
- vmovdqu 0x10(%rdi),%xmm1
- vmovdqu 0x20(%rdi),%xmm4
- vmovd %ecx,%xmm5
- vmovdqa .Liv(%rip),%xmm14
- vmovdqa .Liv+16(%rip),%xmm15
- jmp .Lblake2s_compress_avx512_mainloop
-.align 32
-.Lblake2s_compress_avx512_mainloop:
- vmovdqa %xmm0,%xmm10
- vmovdqa %xmm1,%xmm11
- vpaddq %xmm5,%xmm4,%xmm4
- vmovdqa %xmm14,%xmm2
- vpxor %xmm15,%xmm4,%xmm3
- vmovdqu (%rsi),%ymm6
- vmovdqu 0x20(%rsi),%ymm7
- addq $0x40,%rsi
+ vmovdqu (CTX),%xmm0 // Load h[0..3]
+ vmovdqu 16(CTX),%xmm1 // Load h[4..7]
+ vmovdqu 32(CTX),%xmm4 // Load t and f
+ vmovd INC,%xmm5 // Load inc
+ vmovdqa .Liv(%rip),%xmm14 // Load iv[0..3]
+ vmovdqa .Liv+16(%rip),%xmm15 // Load iv[4..7]
+ jmp .Lavx512_mainloop
+
+ .align 32
+.Lavx512_mainloop:
+ // Main loop: each iteration processes one 64-byte block.
+ vmovdqa %xmm0,%xmm10 // Save h[0..3] and let v[0..3] = h[0..3]
+ vmovdqa %xmm1,%xmm11 // Save h[4..7] and let v[4..7] = h[4..7]
+ vpaddq %xmm5,%xmm4,%xmm4 // t += inc (64-bit addition)
+ vmovdqa %xmm14,%xmm2 // v[8..11] = iv[0..3]
+ vpxor %xmm15,%xmm4,%xmm3 // v[12..15] = iv[4..7] ^ [t, f]
+ vmovdqu (DATA),%ymm6 // Load first 8 data words
+ vmovdqu 32(DATA),%ymm7 // Load second 8 data words
+ addq $64,DATA
leaq .Lsigma2(%rip),%rax
- movb $0xa,%cl
-.Lblake2s_compress_avx512_roundloop:
+ movb $10,%cl // Set num rounds remaining
+
+.Lavx512_roundloop:
+ // Round loop: each iteration does 1 round (of 10 rounds total).
vpmovzxbd (%rax),%ymm8
- vpmovzxbd 0x8(%rax),%ymm9
- addq $0x10,%rax
+ vpmovzxbd 8(%rax),%ymm9
+ addq $16,%rax
vpermi2d %ymm7,%ymm6,%ymm8
vpermi2d %ymm7,%ymm6,%ymm9
vmovdqa %ymm8,%ymm6
vmovdqa %ymm9,%ymm7
vpaddd %xmm8,%xmm0,%xmm0
vpaddd %xmm1,%xmm0,%xmm0
vpxor %xmm0,%xmm3,%xmm3
- vprord $0x10,%xmm3,%xmm3
+ vprord $16,%xmm3,%xmm3
vpaddd %xmm3,%xmm2,%xmm2
vpxor %xmm2,%xmm1,%xmm1
- vprord $0xc,%xmm1,%xmm1
- vextracti128 $0x1,%ymm8,%xmm8
+ vprord $12,%xmm1,%xmm1
+ vextracti128 $1,%ymm8,%xmm8
vpaddd %xmm8,%xmm0,%xmm0
vpaddd %xmm1,%xmm0,%xmm0
vpxor %xmm0,%xmm3,%xmm3
- vprord $0x8,%xmm3,%xmm3
+ vprord $8,%xmm3,%xmm3
vpaddd %xmm3,%xmm2,%xmm2
vpxor %xmm2,%xmm1,%xmm1
- vprord $0x7,%xmm1,%xmm1
+ vprord $7,%xmm1,%xmm1
vpshufd $0x93,%xmm0,%xmm0
vpshufd $0x4e,%xmm3,%xmm3
vpshufd $0x39,%xmm2,%xmm2
vpaddd %xmm9,%xmm0,%xmm0
vpaddd %xmm1,%xmm0,%xmm0
vpxor %xmm0,%xmm3,%xmm3
- vprord $0x10,%xmm3,%xmm3
+ vprord $16,%xmm3,%xmm3
vpaddd %xmm3,%xmm2,%xmm2
vpxor %xmm2,%xmm1,%xmm1
- vprord $0xc,%xmm1,%xmm1
- vextracti128 $0x1,%ymm9,%xmm9
+ vprord $12,%xmm1,%xmm1
+ vextracti128 $1,%ymm9,%xmm9
vpaddd %xmm9,%xmm0,%xmm0
vpaddd %xmm1,%xmm0,%xmm0
vpxor %xmm0,%xmm3,%xmm3
- vprord $0x8,%xmm3,%xmm3
+ vprord $8,%xmm3,%xmm3
vpaddd %xmm3,%xmm2,%xmm2
vpxor %xmm2,%xmm1,%xmm1
- vprord $0x7,%xmm1,%xmm1
+ vprord $7,%xmm1,%xmm1
vpshufd $0x39,%xmm0,%xmm0
vpshufd $0x4e,%xmm3,%xmm3
vpshufd $0x93,%xmm2,%xmm2
decb %cl
- jne .Lblake2s_compress_avx512_roundloop
+ jne .Lavx512_roundloop
+
+ // Compute the new h: h[0..7] ^= v[0..7] ^ v[8..15]
vpxor %xmm10,%xmm0,%xmm0
vpxor %xmm11,%xmm1,%xmm1
vpxor %xmm2,%xmm0,%xmm0
vpxor %xmm3,%xmm1,%xmm1
- decq %rdx
- jne .Lblake2s_compress_avx512_mainloop
- vmovdqu %xmm0,(%rdi)
- vmovdqu %xmm1,0x10(%rdi)
- vmovdqu %xmm4,0x20(%rdi)
+ decq NBLOCKS
+ jne .Lavx512_mainloop
+
+ vmovdqu %xmm0,(CTX) // Store new h[0..3]
+ vmovdqu %xmm1,16(CTX) // Store new h[4..7]
+ vmovdqu %xmm4,32(CTX) // Store new t and f
vzeroupper
RET
SYM_FUNC_END(blake2s_compress_avx512)
--
2.51.2
* [PATCH 5/6] lib/crypto: x86/blake2s: Avoid writing back unchanged 'f' value
From: Eric Biggers @ 2025-11-02 23:42 UTC (permalink / raw)
To: linux-crypto
Cc: linux-kernel, Ard Biesheuvel, Jason A . Donenfeld, Herbert Xu,
x86, Samuel Neves, Eric Biggers
Just before returning, blake2s_compress_ssse3() and
blake2s_compress_avx512() store updated values to the 'h', 't', and 'f'
fields of struct blake2s_ctx. But 'f' is always unchanged (which is
correct; only the C code changes it). So, there's no need to write to
'f'. Use 64-bit stores (movq and vmovq) instead of 128-bit stores
(movdqu and vmovdqu) so that only 't' is written.
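To make the offsets concrete, here is a small C sketch of the layout implied by the field list in the assembly comments (h, t, f as the first three fields). The struct name ctx_prefix and the helper store_touches_f() are hypothetical stand-ins, not the kernel's definitions. The sketch just checks which bytes a store at offset 32 would cover.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the first three fields of struct blake2s_ctx. */
struct ctx_prefix {
	uint32_t h[8];	/* bytes  0..31 */
	uint32_t t[2];	/* bytes 32..39 */
	uint32_t f[2];	/* bytes 40..47 */
};

_Static_assert(offsetof(struct ctx_prefix, t) == 32, "t follows h");
_Static_assert(offsetof(struct ctx_prefix, f) == 40, "f follows t");

/* Returns 1 if a store of 'len' bytes at byte offset 'off' overlaps f. */
static int store_touches_f(size_t off, size_t len)
{
	const size_t f_start = offsetof(struct ctx_prefix, f);
	const size_t f_end = f_start + sizeof(((struct ctx_prefix *)0)->f);

	return off < f_end && off + len > f_start;
}

static void check_store_widths(void)
{
	assert(store_touches_f(32, 16) == 1);	/* 128-bit store: t and f */
	assert(store_touches_f(32, 8) == 0);	/* 64-bit store: t only */
}
```

With this layout, a 16-byte store at offset 32 spans both t and f, while an 8-byte store at the same offset ends exactly where f begins.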
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
---
lib/crypto/x86/blake2s-core.S | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/lib/crypto/x86/blake2s-core.S b/lib/crypto/x86/blake2s-core.S
index f805a49c590d..869064f6ac16 100644
--- a/lib/crypto/x86/blake2s-core.S
+++ b/lib/crypto/x86/blake2s-core.S
@@ -191,11 +191,11 @@ SYM_FUNC_START(blake2s_compress_ssse3)
decq NBLOCKS
jnz .Lssse3_mainloop
movdqu %xmm0,(CTX) // Store new h[0..3]
movdqu %xmm1,16(CTX) // Store new h[4..7]
- movdqu %xmm14,32(CTX) // Store new t and f
+ movq %xmm14,32(CTX) // Store new t (f is unchanged)
RET
SYM_FUNC_END(blake2s_compress_ssse3)
//
// void blake2s_compress_avx512(struct blake2s_ctx *ctx,
@@ -285,9 +285,9 @@ SYM_FUNC_START(blake2s_compress_avx512)
decq NBLOCKS
jne .Lavx512_mainloop
vmovdqu %xmm0,(CTX) // Store new h[0..3]
vmovdqu %xmm1,16(CTX) // Store new h[4..7]
- vmovdqu %xmm4,32(CTX) // Store new t and f
+ vmovq %xmm4,32(CTX) // Store new t (f is unchanged)
vzeroupper
RET
SYM_FUNC_END(blake2s_compress_avx512)
--
2.51.2
* [PATCH 6/6] lib/crypto: x86/blake2s: Use vpternlogd for 3-input XORs
From: Eric Biggers @ 2025-11-02 23:42 UTC (permalink / raw)
To: linux-crypto
Cc: linux-kernel, Ard Biesheuvel, Jason A . Donenfeld, Herbert Xu,
x86, Samuel Neves, Eric Biggers
AVX-512 supports 3-input XORs via the vpternlogd (or vpternlogq)
instruction with immediate 0x96. This approach, vs. the alternative of
two vpxor instructions, is already used in the CRC, AES-GCM, and AES-XTS
code, since it reduces the instruction count and is faster on some CPUs.
Make blake2s_compress_avx512() take advantage of it too.
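For readers unfamiliar with where 0x96 comes from: the immediate is an 8-entry truth table indexed by the three input bits. The C sketch below derives it for XOR and models the per-bit lookup in software. The function names are made up for illustration, and this is a scalar model of the instruction's semantics as I understand them, not code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Build the 8-bit truth table for a ^ b ^ c, indexed by (a << 2 | b << 1 | c). */
static uint8_t ternlog_imm_for_xor3(void)
{
	uint8_t imm = 0;

	for (unsigned idx = 0; idx < 8; idx++) {
		unsigned a = (idx >> 2) & 1, b = (idx >> 1) & 1, c = idx & 1;

		if (a ^ b ^ c)
			imm |= (uint8_t)(1u << idx);
	}
	return imm;
}

/* Scalar model of a ternary-logic op on one 32-bit lane. */
static uint32_t ternlog32(uint8_t imm, uint32_t a, uint32_t b, uint32_t c)
{
	uint32_t out = 0;

	for (unsigned bit = 0; bit < 32; bit++) {
		unsigned idx = (((a >> bit) & 1) << 2) |
			       (((b >> bit) & 1) << 1) |
			       ((c >> bit) & 1);

		out |= (uint32_t)((imm >> idx) & 1) << bit;
	}
	return out;
}

static void check_xor3(void)
{
	const uint32_t x = 0x6A09E667, y = 0xBB67AE85, z = 0x3C6EF372;

	assert(ternlog_imm_for_xor3() == 0x96);
	assert(ternlog32(0x96, x, y, z) == (x ^ y ^ z));
}
```

Because XOR is symmetric in its inputs, the table is the same regardless of which operand is treated as a, b, or c, so the operand order in the instruction does not matter for this particular immediate.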
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
---
lib/crypto/x86/blake2s-core.S | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/lib/crypto/x86/blake2s-core.S b/lib/crypto/x86/blake2s-core.S
index 869064f6ac16..7b1d98ca7482 100644
--- a/lib/crypto/x86/blake2s-core.S
+++ b/lib/crypto/x86/blake2s-core.S
@@ -276,14 +276,12 @@ SYM_FUNC_START(blake2s_compress_avx512)
vpshufd $0x93,%xmm2,%xmm2
decb %cl
jne .Lavx512_roundloop
// Compute the new h: h[0..7] ^= v[0..7] ^ v[8..15]
- vpxor %xmm10,%xmm0,%xmm0
- vpxor %xmm11,%xmm1,%xmm1
- vpxor %xmm2,%xmm0,%xmm0
- vpxor %xmm3,%xmm1,%xmm1
+ vpternlogd $0x96,%xmm10,%xmm2,%xmm0
+ vpternlogd $0x96,%xmm11,%xmm3,%xmm1
decq NBLOCKS
jne .Lavx512_mainloop
vmovdqu %xmm0,(CTX) // Store new h[0..3]
vmovdqu %xmm1,16(CTX) // Store new h[4..7]
--
2.51.2
* Re: [PATCH 0/6] x86 BLAKE2s cleanups
From: Ard Biesheuvel @ 2025-11-03 8:14 UTC (permalink / raw)
To: Eric Biggers
Cc: linux-crypto, linux-kernel, Jason A . Donenfeld, Herbert Xu, x86,
Samuel Neves
On Mon, 3 Nov 2025 at 00:44, Eric Biggers <ebiggers@kernel.org> wrote:
>
> Various cleanups for the SSSE3 and AVX512 implementations of BLAKE2s.
>
> This is targeting libcrypto-next.
>
> Eric Biggers (6):
> lib/crypto: x86/blake2s: Fix 32-bit arg treated as 64-bit
> lib/crypto: x86/blake2s: Drop check for nblocks == 0
> lib/crypto: x86/blake2s: Use local labels for data
> lib/crypto: x86/blake2s: Improve readability
> lib/crypto: x86/blake2s: Avoid writing back unchanged 'f' value
> lib/crypto: x86/blake2s: Use vpternlogd for 3-input XORs
>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
> lib/crypto/x86/blake2s-core.S | 275 +++++++++++++++++++---------------
> 1 file changed, 157 insertions(+), 118 deletions(-)
>
> base-commit: 5a2a5e62a5216ba05d4481cf90d915f4de0bfde9
> --
> 2.51.2
>
* Re: [PATCH 0/6] x86 BLAKE2s cleanups
From: Eric Biggers @ 2025-11-03 17:35 UTC (permalink / raw)
To: linux-crypto
Cc: linux-kernel, Ard Biesheuvel, Jason A . Donenfeld, Herbert Xu,
x86, Samuel Neves
On Sun, Nov 02, 2025 at 03:42:03PM -0800, Eric Biggers wrote:
> Various cleanups for the SSSE3 and AVX512 implementations of BLAKE2s.
>
> This is targeting libcrypto-next.
>
> Eric Biggers (6):
> lib/crypto: x86/blake2s: Fix 32-bit arg treated as 64-bit
> lib/crypto: x86/blake2s: Drop check for nblocks == 0
> lib/crypto: x86/blake2s: Use local labels for data
> lib/crypto: x86/blake2s: Improve readability
> lib/crypto: x86/blake2s: Avoid writing back unchanged 'f' value
> lib/crypto: x86/blake2s: Use vpternlogd for 3-input XORs
>
> lib/crypto/x86/blake2s-core.S | 275 +++++++++++++++++++---------------
> 1 file changed, 157 insertions(+), 118 deletions(-)
>
> base-commit: 5a2a5e62a5216ba05d4481cf90d915f4de0bfde9
Applied to https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=libcrypto-next
- Eric