* [PATCH 0/8] VAES+AVX2 optimized implementation of AES-GCM
@ 2025-10-02 2:31 Eric Biggers
2025-10-02 2:31 ` [PATCH 1/8] crypto: x86/aes-gcm - add VAES+AVX2 optimized code Eric Biggers
From: Eric Biggers @ 2025-10-02 2:31 UTC (permalink / raw)
To: linux-crypto
Cc: linux-kernel, x86, Ard Biesheuvel, Jason A . Donenfeld,
Eric Biggers
This patchset replaces the 256-bit vector implementation of AES-GCM for
x86_64 with one that requires AVX2 rather than AVX512. This greatly
improves AES-GCM performance on CPUs that have VAES but not AVX512, for
example by up to 74% on AMD Zen 3. For more details, see patch 1.

This patchset also renames the 512-bit vector implementation of AES-GCM
for x86_64 to be named after AVX512 rather than AVX10/512, then adds
some additional optimizations to it.

This patchset applies to next-20250929 and is targeting 6.19. Herbert,
I'd prefer to just apply this myself, but let me know if you'd prefer
to take it instead (considering that AES-GCM hasn't been librarified
yet). Either way, there's no hurry.
Eric Biggers (8):
crypto: x86/aes-gcm - add VAES+AVX2 optimized code
crypto: x86/aes-gcm - remove VAES+AVX10/256 optimized code
crypto: x86/aes-gcm - rename avx10 and avx10_512 to avx512
crypto: x86/aes-gcm - clean up AVX512 code to assume 512-bit vectors
crypto: x86/aes-gcm - reorder AVX512 precompute and aad_update
functions
crypto: x86/aes-gcm - revise some comments in AVX512 code
crypto: x86/aes-gcm - optimize AVX512 precomputation of H^2 from H^1
crypto: x86/aes-gcm - optimize long AAD processing with AVX512
arch/x86/crypto/Makefile | 5 +-
arch/x86/crypto/aes-gcm-aesni-x86_64.S | 12 +-
arch/x86/crypto/aes-gcm-vaes-avx2.S | 1150 +++++++++++++++++
...m-avx10-x86_64.S => aes-gcm-vaes-avx512.S} | 722 +++++------
arch/x86/crypto/aesni-intel_glue.c | 264 ++--
5 files changed, 1667 insertions(+), 486 deletions(-)
create mode 100644 arch/x86/crypto/aes-gcm-vaes-avx2.S
rename arch/x86/crypto/{aes-gcm-avx10-x86_64.S => aes-gcm-vaes-avx512.S} (69%)
base-commit: 3b9b1f8df454caa453c7fb07689064edb2eda90a
--
2.51.0
* [PATCH 1/8] crypto: x86/aes-gcm - add VAES+AVX2 optimized code
2025-10-02 2:31 [PATCH 0/8] VAES+AVX2 optimized implementation of AES-GCM Eric Biggers
@ 2025-10-02 2:31 ` Eric Biggers
2025-10-17 18:34 ` Eric Biggers
2025-10-02 2:31 ` [PATCH 2/8] crypto: x86/aes-gcm - remove VAES+AVX10/256 " Eric Biggers
From: Eric Biggers @ 2025-10-02 2:31 UTC (permalink / raw)
To: linux-crypto
Cc: linux-kernel, x86, Ard Biesheuvel, Jason A . Donenfeld,
Eric Biggers
Add an implementation of AES-GCM that uses 256-bit vectors and the
following CPU features: Vector AES (VAES), Vector Carryless
Multiplication (VPCLMULQDQ), and AVX2.

It doesn't require AVX512, so unlike the existing VAES+AVX512 code it
works on CPUs that support VAES but not AVX512, specifically:

- AMD Zen 3, both client and server
- Intel Alder Lake, Raptor Lake, Meteor Lake, Arrow Lake, and Lunar
  Lake. (These are client CPUs.)
- Intel Sierra Forest. (This is a server CPU.)

On these CPUs, this VAES+AVX2 code is much faster than the existing
AES-NI code, which uses only 128-bit vectors.

These CPUs are widely deployed, making VAES+AVX2 code worthwhile even
though hopefully future x86_64 CPUs will uniformly support AVX512.

This implementation will also serve as the fallback 256-bit
implementation for older Intel CPUs (Ice Lake and Tiger Lake) that
support AVX512 but downclock too eagerly when 512-bit vectors are used.
Currently, the VAES+AVX10/256 implementation serves that purpose. A
later commit will remove that and just use the VAES+AVX2 one. (Note
that AES-XTS and AES-CTR already successfully use this approach.)

I originally wrote this AES-GCM implementation for BoringSSL. It's been
in BoringSSL for a while now, including in Chromium. This is a port of
it to the Linux kernel. The main changes in the Linux version include:

- Port from "perlasm" to a standard .S file.
- Align all assembly functions with what aesni-intel_glue.c expects,
  including adding support for lengths not a multiple of 16 bytes.
- Rework the en/decryption of the final 1 to 127 bytes.

This commit increases AES-256-GCM throughput on AMD Milan (Zen 3) by up
to 74%, as shown by the following tables:

Table 1: AES-256-GCM encryption throughput change,
         CPU vs. message length in bytes:

                      | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
----------------------+-------+-------+-------+-------+-------+-------+
AMD Milan (Zen 3)     |   67% |   59% |   61% |   39% |   23% |   27% |

                      |   300 |   200 |    64 |    63 |    16 |
----------------------+-------+-------+-------+-------+-------+
AMD Milan (Zen 3)     |   14% |   12% |    7% |    7% |    0% |

Table 2: AES-256-GCM decryption throughput change,
         CPU vs. message length in bytes:

                      | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
----------------------+-------+-------+-------+-------+-------+-------+
AMD Milan (Zen 3)     |   74% |   65% |   65% |   44% |   23% |   26% |

                      |   300 |   200 |    64 |    63 |    16 |
----------------------+-------+-------+-------+-------+-------+
AMD Milan (Zen 3)     |   12% |   11% |    3% |    2% |   -3% |

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
---
arch/x86/crypto/Makefile | 1 +
arch/x86/crypto/aes-gcm-vaes-avx2.S | 1150 +++++++++++++++++++++++++++
arch/x86/crypto/aesni-intel_glue.c | 111 ++-
3 files changed, 1260 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/crypto/aes-gcm-vaes-avx2.S
diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index 2d30d5d361458..f6f7b2b8b853e 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -44,10 +44,11 @@ aegis128-aesni-y := aegis128-aesni-asm.o aegis128-aesni-glue.o
obj-$(CONFIG_CRYPTO_AES_NI_INTEL) += aesni-intel.o
aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o
aesni-intel-$(CONFIG_64BIT) += aes-ctr-avx-x86_64.o \
aes-gcm-aesni-x86_64.o \
+ aes-gcm-vaes-avx2.o \
aes-xts-avx-x86_64.o \
aes-gcm-avx10-x86_64.o
obj-$(CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL) += ghash-clmulni-intel.o
ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o
diff --git a/arch/x86/crypto/aes-gcm-vaes-avx2.S b/arch/x86/crypto/aes-gcm-vaes-avx2.S
new file mode 100644
index 0000000000000..e628dbb33c0e7
--- /dev/null
+++ b/arch/x86/crypto/aes-gcm-vaes-avx2.S
@@ -0,0 +1,1150 @@
+/* SPDX-License-Identifier: Apache-2.0 OR BSD-2-Clause */
+//
+// AES-GCM implementation for x86_64 CPUs that support the following CPU
+// features: VAES && VPCLMULQDQ && AVX2
+//
+// Copyright 2025 Google LLC
+//
+// Author: Eric Biggers <ebiggers@google.com>
+//
+//------------------------------------------------------------------------------
+//
+// This file is dual-licensed, meaning that you can use it under your choice of
+// either of the following two licenses:
+//
+// Licensed under the Apache License 2.0 (the "License"). You may obtain a copy
+// of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+//
+// or
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are met:
+//
+// 1. Redistributions of source code must retain the above copyright notice,
+// this list of conditions and the following disclaimer.
+//
+// 2. Redistributions in binary form must reproduce the above copyright
+// notice, this list of conditions and the following disclaimer in the
+// documentation and/or other materials provided with the distribution.
+//
+// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+// AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+// ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
+// LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+// CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+// SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+// INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+// CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+// ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+// POSSIBILITY OF SUCH DAMAGE.
+//
+// -----------------------------------------------------------------------------
+//
+// This is similar to aes-gcm-avx10-x86_64.S, but it uses AVX2 instead of
+// AVX512. This means it can only use 16 vector registers instead of 32, the
+// maximum vector length is 32 bytes, and some instructions such as vpternlogd
+// and masked loads/stores are unavailable. However, it is able to run on CPUs
+// that have VAES without AVX512, namely AMD Zen 3 (including "Milan" server
+// CPUs), various Intel client CPUs such as Alder Lake, and Intel Sierra Forest.
+//
+// This implementation also uses Karatsuba multiplication instead of schoolbook
+// multiplication for GHASH in its main loop. This does not help much on Intel,
+// but it improves performance by ~5% on AMD Zen 3. Other factors weighing
+// slightly in favor of Karatsuba multiplication in this implementation are the
+// lower maximum vector length (which means there are fewer key powers, so we
+// can cache the halves of each key power XOR'd together and still use less
+// memory than the AVX512 implementation), and the unavailability of the
+// vpternlogd instruction (which helped schoolbook a bit more than Karatsuba).
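+//
+// As a rough sketch of why the cached XOR'd halves help: in GF(2)[x],
+// addition is XOR, so the middle term a_L*b_H + a_H*b_L of a 128x128-bit
+// carryless multiplication equals (a_L ^ a_H)*(b_L ^ b_H) ^ LO ^ HI, where
+// LO = a_L*b_L and HI = a_H*b_H. With b_L ^ b_H precomputed for each key
+// power (h_powers_xored), each vector of data blocks then needs only three
+// vpclmulqdq instructions instead of four, at the cost of a few extra vpxor
+// and vpunpckhqdq instructions.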
+
+#include <linux/linkage.h>
+
+.section .rodata
+.p2align 4
+
+ // The below three 16-byte values must be in the order that they are, as
+ // they are really two 32-byte tables and a 16-byte value that overlap:
+ //
+ // - The first 32-byte table begins at .Lselect_high_bytes_table.
+ // For 0 <= len <= 16, the 16-byte value at
+ // '.Lselect_high_bytes_table + len' selects the high 'len' bytes of
+ // another 16-byte value when AND'ed with it.
+ //
+ // - The second 32-byte table begins at .Lrshift_and_bswap_table.
+ // For 0 <= len <= 16, the 16-byte value at
+ // '.Lrshift_and_bswap_table + len' is a vpshufb mask that does the
+ // following operation: right-shift by '16 - len' bytes (shifting in
+ // zeroes), then reflect all 16 bytes.
+ //
+ // - The 16-byte value at .Lbswap_mask is a vpshufb mask that reflects
+ // all 16 bytes.
+.Lselect_high_bytes_table:
+ .octa 0
+.Lrshift_and_bswap_table:
+ .octa 0xffffffffffffffffffffffffffffffff
+.Lbswap_mask:
+ .octa 0x000102030405060708090a0b0c0d0e0f
+
+ // Sixteen 0x0f bytes. By XOR'ing an entry of .Lrshift_and_bswap_table
+ // with this, we get a mask that left-shifts by '16 - len' bytes.
+.Lfifteens:
+ .octa 0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f
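+
+	// For example, with len = 4: the 16 bytes at
+	// '.Lselect_high_bytes_table + 4' are twelve 0x00 bytes followed by
+	// four 0xff bytes, so AND'ing with them keeps only the 4 highest
+	// bytes of a block. The 16 bytes at '.Lrshift_and_bswap_table + 4'
+	// are twelve 0xff bytes followed by 0x0f,0x0e,0x0d,0x0c, i.e. a
+	// vpshufb mask that zeroes the low 12 bytes and places the 4 highest
+	// source bytes, byte-reflected, in the top 4 positions. XOR'ing that
+	// mask with .Lfifteens instead gives twelve 0xf0 bytes (which
+	// vpshufb zeroes) followed by 0x00,0x01,0x02,0x03, i.e. a left-shift
+	// by 12 bytes.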
+
+ // This is the GHASH reducing polynomial without its constant term, i.e.
+ // x^128 + x^7 + x^2 + x, represented using the backwards mapping
+ // between bits and polynomial coefficients.
+ //
+ // Alternatively, it can be interpreted as the naturally-ordered
+ // representation of the polynomial x^127 + x^126 + x^121 + 1, i.e. the
+ // "reversed" GHASH reducing polynomial without its x^128 term.
+.Lgfpoly:
+ .octa 0xc2000000000000000000000000000001
+
+ // Same as above, but with the (1 << 64) bit set.
+.Lgfpoly_and_internal_carrybit:
+ .octa 0xc2000000000000010000000000000001
+
+ // Values needed to prepare the initial vector of counter blocks.
+.Lctr_pattern:
+ .octa 0
+ .octa 1
+
+ // The number of AES blocks per vector, as a 128-bit value.
+.Linc_2blocks:
+ .octa 2
+
+// Offsets in struct aes_gcm_key_vaes_avx2
+#define OFFSETOF_AESKEYLEN 480
+#define OFFSETOF_H_POWERS 512
+#define NUM_H_POWERS 8
+#define OFFSETOFEND_H_POWERS (OFFSETOF_H_POWERS + (NUM_H_POWERS * 16))
+#define OFFSETOF_H_POWERS_XORED OFFSETOFEND_H_POWERS
+
+.text
+
+// Do one step of GHASH-multiplying the 128-bit lanes of \a by the 128-bit lanes
+// of \b and storing the reduced products in \dst. Uses schoolbook
+// multiplication.
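+//
+// In brief, the math is: the 256-bit carryless product is
+// LO + MI*x^64 + HI*x^128, where LO = a_L*b_L, MI = a_L*b_H + a_H*b_L, and
+// HI = a_H*b_H. It is then reduced to 128 bits by two identical folding
+// steps: multiply the low half of the lower term by x^63 + x^62 + x^57,
+// swap the two 64-bit halves of the lower term, and XOR both results into
+// the next higher term. Folding LO into MI and then MI into HI leaves the
+// reduced product in \dst. See aes-gcm-avx10-x86_64.S for the full
+// derivation of why this works with the bit-reflected representation.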
+.macro _ghash_mul_step i, a, b, dst, gfpoly, t0, t1, t2
+.if \i == 0
+ vpclmulqdq $0x00, \a, \b, \t0 // LO = a_L * b_L
+ vpclmulqdq $0x01, \a, \b, \t1 // MI_0 = a_L * b_H
+.elseif \i == 1
+ vpclmulqdq $0x10, \a, \b, \t2 // MI_1 = a_H * b_L
+.elseif \i == 2
+ vpxor \t2, \t1, \t1 // MI = MI_0 + MI_1
+.elseif \i == 3
+ vpclmulqdq $0x01, \t0, \gfpoly, \t2 // LO_L*(x^63 + x^62 + x^57)
+.elseif \i == 4
+ vpshufd $0x4e, \t0, \t0 // Swap halves of LO
+.elseif \i == 5
+ vpxor \t0, \t1, \t1 // Fold LO into MI (part 1)
+ vpxor \t2, \t1, \t1 // Fold LO into MI (part 2)
+.elseif \i == 6
+ vpclmulqdq $0x11, \a, \b, \dst // HI = a_H * b_H
+.elseif \i == 7
+ vpclmulqdq $0x01, \t1, \gfpoly, \t0 // MI_L*(x^63 + x^62 + x^57)
+.elseif \i == 8
+ vpshufd $0x4e, \t1, \t1 // Swap halves of MI
+.elseif \i == 9
+ vpxor \t1, \dst, \dst // Fold MI into HI (part 1)
+ vpxor \t0, \dst, \dst // Fold MI into HI (part 2)
+.endif
+.endm
+
+// GHASH-multiply the 128-bit lanes of \a by the 128-bit lanes of \b and store
+// the reduced products in \dst. See _ghash_mul_step for full explanation.
+.macro _ghash_mul a, b, dst, gfpoly, t0, t1, t2
+.irp i, 0,1,2,3,4,5,6,7,8,9
+ _ghash_mul_step \i, \a, \b, \dst, \gfpoly, \t0, \t1, \t2
+.endr
+.endm
+
+// GHASH-multiply the 128-bit lanes of \a by the 128-bit lanes of \b and add the
+// *unreduced* products to \lo, \mi, and \hi.
+.macro _ghash_mul_noreduce a, b, lo, mi, hi, t0
+ vpclmulqdq $0x00, \a, \b, \t0 // a_L * b_L
+ vpxor \t0, \lo, \lo
+ vpclmulqdq $0x01, \a, \b, \t0 // a_L * b_H
+ vpxor \t0, \mi, \mi
+ vpclmulqdq $0x10, \a, \b, \t0 // a_H * b_L
+ vpxor \t0, \mi, \mi
+ vpclmulqdq $0x11, \a, \b, \t0 // a_H * b_H
+ vpxor \t0, \hi, \hi
+.endm
+
+// Reduce the unreduced products from \lo, \mi, and \hi and store the 128-bit
+// reduced products in \hi. See _ghash_mul_step for explanation of reduction.
+.macro _ghash_reduce lo, mi, hi, gfpoly, t0
+ vpclmulqdq $0x01, \lo, \gfpoly, \t0
+ vpshufd $0x4e, \lo, \lo
+ vpxor \lo, \mi, \mi
+ vpxor \t0, \mi, \mi
+ vpclmulqdq $0x01, \mi, \gfpoly, \t0
+ vpshufd $0x4e, \mi, \mi
+ vpxor \mi, \hi, \hi
+ vpxor \t0, \hi, \hi
+.endm
+
+// This is a specialized version of _ghash_mul that computes \a * \a, i.e. it
+// squares \a. It skips computing MI = (a_L * a_H) + (a_H * a_L) = 0.
+.macro _ghash_square a, dst, gfpoly, t0, t1
+ vpclmulqdq $0x00, \a, \a, \t0 // LO = a_L * a_L
+ vpclmulqdq $0x11, \a, \a, \dst // HI = a_H * a_H
+ vpclmulqdq $0x01, \t0, \gfpoly, \t1 // LO_L*(x^63 + x^62 + x^57)
+ vpshufd $0x4e, \t0, \t0 // Swap halves of LO
+ vpxor \t0, \t1, \t1 // Fold LO into MI
+ vpclmulqdq $0x01, \t1, \gfpoly, \t0 // MI_L*(x^63 + x^62 + x^57)
+ vpshufd $0x4e, \t1, \t1 // Swap halves of MI
+ vpxor \t1, \dst, \dst // Fold MI into HI (part 1)
+ vpxor \t0, \dst, \dst // Fold MI into HI (part 2)
+.endm
+
+// void aes_gcm_precompute_vaes_avx2(struct aes_gcm_key_vaes_avx2 *key);
+//
+// Given the expanded AES key |key->base.aes_key|, derive the GHASH subkey and
+// initialize |key->h_powers| and |key->h_powers_xored|.
+//
+// We use h_powers[0..7] to store H^8 through H^1, and h_powers_xored[0..7] to
+// store the 64-bit halves of the key powers XOR'd together (for Karatsuba
+// multiplication) in the order 8,6,7,5,4,2,3,1.
+SYM_FUNC_START(aes_gcm_precompute_vaes_avx2)
+
+ // Function arguments
+ .set KEY, %rdi
+
+ // Additional local variables
+ .set POWERS_PTR, %rsi
+ .set RNDKEYLAST_PTR, %rdx
+ .set TMP0, %ymm0
+ .set TMP0_XMM, %xmm0
+ .set TMP1, %ymm1
+ .set TMP1_XMM, %xmm1
+ .set TMP2, %ymm2
+ .set TMP2_XMM, %xmm2
+ .set H_CUR, %ymm3
+ .set H_CUR_XMM, %xmm3
+ .set H_CUR2, %ymm4
+ .set H_CUR2_XMM, %xmm4
+ .set H_INC, %ymm5
+ .set H_INC_XMM, %xmm5
+ .set GFPOLY, %ymm6
+ .set GFPOLY_XMM, %xmm6
+
+ // Encrypt an all-zeroes block to get the raw hash subkey.
+ movl OFFSETOF_AESKEYLEN(KEY), %eax
+ lea 6*16(KEY,%rax,4), RNDKEYLAST_PTR
+ vmovdqu (KEY), H_CUR_XMM // Zero-th round key XOR all-zeroes block
+ lea 16(KEY), %rax
+1:
+ vaesenc (%rax), H_CUR_XMM, H_CUR_XMM
+ add $16, %rax
+ cmp %rax, RNDKEYLAST_PTR
+ jne 1b
+ vaesenclast (RNDKEYLAST_PTR), H_CUR_XMM, H_CUR_XMM
+
+ // Reflect the bytes of the raw hash subkey.
+ vpshufb .Lbswap_mask(%rip), H_CUR_XMM, H_CUR_XMM
+
+ // Finish preprocessing the byte-reflected hash subkey by multiplying it
+ // by x^-1 ("standard" interpretation of polynomial coefficients) or
+ // equivalently x^1 (natural interpretation). This gets the key into a
+ // format that avoids having to bit-reflect the data blocks later.
+ vpshufd $0xd3, H_CUR_XMM, TMP0_XMM
+ vpsrad $31, TMP0_XMM, TMP0_XMM
+ vpaddq H_CUR_XMM, H_CUR_XMM, H_CUR_XMM
+ vpand .Lgfpoly_and_internal_carrybit(%rip), TMP0_XMM, TMP0_XMM
+ vpxor TMP0_XMM, H_CUR_XMM, H_CUR_XMM
+
+ // Load the gfpoly constant.
+ vbroadcasti128 .Lgfpoly(%rip), GFPOLY
+
+ // Square H^1 to get H^2.
+ _ghash_square H_CUR_XMM, H_INC_XMM, GFPOLY_XMM, TMP0_XMM, TMP1_XMM
+
+ // Create H_CUR = [H^2, H^1] and H_INC = [H^2, H^2].
+ vinserti128 $1, H_CUR_XMM, H_INC, H_CUR
+ vinserti128 $1, H_INC_XMM, H_INC, H_INC
+
+ // Compute H_CUR2 = [H^4, H^3].
+ _ghash_mul H_INC, H_CUR, H_CUR2, GFPOLY, TMP0, TMP1, TMP2
+
+ // Store [H^2, H^1] and [H^4, H^3].
+ vmovdqu H_CUR, OFFSETOF_H_POWERS+3*32(KEY)
+ vmovdqu H_CUR2, OFFSETOF_H_POWERS+2*32(KEY)
+
+ // For Karatsuba multiplication: compute and store the two 64-bit halves
+ // of each key power XOR'd together. Order is 4,2,3,1.
+ vpunpcklqdq H_CUR, H_CUR2, TMP0
+ vpunpckhqdq H_CUR, H_CUR2, TMP1
+ vpxor TMP1, TMP0, TMP0
+ vmovdqu TMP0, OFFSETOF_H_POWERS_XORED+32(KEY)
+
+ // Compute and store H_CUR = [H^6, H^5] and H_CUR2 = [H^8, H^7].
+ _ghash_mul H_INC, H_CUR2, H_CUR, GFPOLY, TMP0, TMP1, TMP2
+ _ghash_mul H_INC, H_CUR, H_CUR2, GFPOLY, TMP0, TMP1, TMP2
+ vmovdqu H_CUR, OFFSETOF_H_POWERS+1*32(KEY)
+ vmovdqu H_CUR2, OFFSETOF_H_POWERS+0*32(KEY)
+
+ // Again, compute and store the two 64-bit halves of each key power
+ // XOR'd together. Order is 8,6,7,5.
+ vpunpcklqdq H_CUR, H_CUR2, TMP0
+ vpunpckhqdq H_CUR, H_CUR2, TMP1
+ vpxor TMP1, TMP0, TMP0
+ vmovdqu TMP0, OFFSETOF_H_POWERS_XORED(KEY)
+
+ vzeroupper
+ RET
+SYM_FUNC_END(aes_gcm_precompute_vaes_avx2)
+
+// Do one step of the GHASH update of four vectors of data blocks.
+// \i: the step to do, 0 through 9
+// \ghashdata_ptr: pointer to the data blocks (ciphertext or AAD)
+// KEY: pointer to struct aes_gcm_key_vaes_avx2
+// BSWAP_MASK: mask for reflecting the bytes of blocks
+// H_POW[2-1]_XORED: cached values from KEY->h_powers_xored
+// TMP[0-2]: temporary registers. TMP[1-2] must be preserved across steps.
+// LO, MI: working state for this macro that must be preserved across steps
+// GHASH_ACC: the GHASH accumulator (input/output)
+.macro _ghash_step_4x i, ghashdata_ptr
+ .set HI, GHASH_ACC # alias
+ .set HI_XMM, GHASH_ACC_XMM
+.if \i == 0
+ // First vector
+ vmovdqu 0*32(\ghashdata_ptr), TMP1
+ vpshufb BSWAP_MASK, TMP1, TMP1
+ vmovdqu OFFSETOF_H_POWERS+0*32(KEY), TMP2
+ vpxor GHASH_ACC, TMP1, TMP1
+ vpclmulqdq $0x00, TMP2, TMP1, LO
+ vpclmulqdq $0x11, TMP2, TMP1, HI
+ vpunpckhqdq TMP1, TMP1, TMP0
+ vpxor TMP1, TMP0, TMP0
+ vpclmulqdq $0x00, H_POW2_XORED, TMP0, MI
+.elseif \i == 1
+.elseif \i == 2
+ // Second vector
+ vmovdqu 1*32(\ghashdata_ptr), TMP1
+ vpshufb BSWAP_MASK, TMP1, TMP1
+ vmovdqu OFFSETOF_H_POWERS+1*32(KEY), TMP2
+ vpclmulqdq $0x00, TMP2, TMP1, TMP0
+ vpxor TMP0, LO, LO
+ vpclmulqdq $0x11, TMP2, TMP1, TMP0
+ vpxor TMP0, HI, HI
+ vpunpckhqdq TMP1, TMP1, TMP0
+ vpxor TMP1, TMP0, TMP0
+ vpclmulqdq $0x10, H_POW2_XORED, TMP0, TMP0
+ vpxor TMP0, MI, MI
+.elseif \i == 3
+ // Third vector
+ vmovdqu 2*32(\ghashdata_ptr), TMP1
+ vpshufb BSWAP_MASK, TMP1, TMP1
+ vmovdqu OFFSETOF_H_POWERS+2*32(KEY), TMP2
+.elseif \i == 4
+ vpclmulqdq $0x00, TMP2, TMP1, TMP0
+ vpxor TMP0, LO, LO
+ vpclmulqdq $0x11, TMP2, TMP1, TMP0
+ vpxor TMP0, HI, HI
+.elseif \i == 5
+ vpunpckhqdq TMP1, TMP1, TMP0
+ vpxor TMP1, TMP0, TMP0
+ vpclmulqdq $0x00, H_POW1_XORED, TMP0, TMP0
+ vpxor TMP0, MI, MI
+
+ // Fourth vector
+ vmovdqu 3*32(\ghashdata_ptr), TMP1
+ vpshufb BSWAP_MASK, TMP1, TMP1
+.elseif \i == 6
+ vmovdqu OFFSETOF_H_POWERS+3*32(KEY), TMP2
+ vpclmulqdq $0x00, TMP2, TMP1, TMP0
+ vpxor TMP0, LO, LO
+ vpclmulqdq $0x11, TMP2, TMP1, TMP0
+ vpxor TMP0, HI, HI
+ vpunpckhqdq TMP1, TMP1, TMP0
+ vpxor TMP1, TMP0, TMP0
+ vpclmulqdq $0x10, H_POW1_XORED, TMP0, TMP0
+ vpxor TMP0, MI, MI
+.elseif \i == 7
+ // Finalize 'mi' following Karatsuba multiplication.
+ vpxor LO, MI, MI
+ vpxor HI, MI, MI
+
+ // Fold lo into mi.
+ vbroadcasti128 .Lgfpoly(%rip), TMP2
+ vpclmulqdq $0x01, LO, TMP2, TMP0
+ vpshufd $0x4e, LO, LO
+ vpxor LO, MI, MI
+ vpxor TMP0, MI, MI
+.elseif \i == 8
+ // Fold mi into hi.
+ vpclmulqdq $0x01, MI, TMP2, TMP0
+ vpshufd $0x4e, MI, MI
+ vpxor MI, HI, HI
+ vpxor TMP0, HI, HI
+.elseif \i == 9
+ vextracti128 $1, HI, TMP0_XMM
+ vpxor TMP0_XMM, HI_XMM, GHASH_ACC_XMM
+.endif
+.endm
+
+// Update GHASH with four vectors of data blocks. See _ghash_step_4x for full
+// explanation.
+.macro _ghash_4x ghashdata_ptr
+.irp i, 0,1,2,3,4,5,6,7,8,9
+ _ghash_step_4x \i, \ghashdata_ptr
+.endr
+.endm
+
+// Load 1 <= %ecx <= 16 bytes from the pointer \src into the xmm register \dst
+// and zeroize any remaining bytes. Clobbers %rax, %rcx, and \tmp{64,32}.
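+//
+// For example, with %ecx = 11: the first vmovq loads bytes 0-7, the second
+// 8-byte load covers bytes 3-10, and the shift right by (8 - 3)*8 = 40 bits
+// discards the 5 overlapping bytes, leaving bytes 8-10 zero-extended in the
+// upper half of \dst.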
+.macro _load_partial_block src, dst, tmp64, tmp32
+ sub $8, %ecx // LEN - 8
+ jle .Lle8\@
+
+ // Load 9 <= LEN <= 16 bytes.
+ vmovq (\src), \dst // Load first 8 bytes
+ mov (\src, %rcx), %rax // Load last 8 bytes
+ neg %ecx
+ shl $3, %ecx
+ shr %cl, %rax // Discard overlapping bytes
+ vpinsrq $1, %rax, \dst, \dst
+ jmp .Ldone\@
+
+.Lle8\@:
+ add $4, %ecx // LEN - 4
+ jl .Llt4\@
+
+ // Load 4 <= LEN <= 8 bytes.
+ mov (\src), %eax // Load first 4 bytes
+ mov (\src, %rcx), \tmp32 // Load last 4 bytes
+ jmp .Lcombine\@
+
+.Llt4\@:
+ // Load 1 <= LEN <= 3 bytes.
+ add $2, %ecx // LEN - 2
+ movzbl (\src), %eax // Load first byte
+ jl .Lmovq\@
+ movzwl (\src, %rcx), \tmp32 // Load last 2 bytes
+.Lcombine\@:
+ shl $3, %ecx
+ shl %cl, \tmp64
+ or \tmp64, %rax // Combine the two parts
+.Lmovq\@:
+ vmovq %rax, \dst
+.Ldone\@:
+.endm
+
+// Store 1 <= %ecx <= 16 bytes from the xmm register \src to the pointer \dst.
+// Clobbers %rax, %rcx, and \tmp{64,32}.
+.macro _store_partial_block src, dst, tmp64, tmp32
+ sub $8, %ecx // LEN - 8
+ jl .Llt8\@
+
+ // Store 8 <= LEN <= 16 bytes.
+ vpextrq $1, \src, %rax
+ mov %ecx, \tmp32
+ shl $3, %ecx
+ ror %cl, %rax
+ mov %rax, (\dst, \tmp64) // Store last LEN - 8 bytes
+ vmovq \src, (\dst) // Store first 8 bytes
+ jmp .Ldone\@
+
+.Llt8\@:
+ add $4, %ecx // LEN - 4
+ jl .Llt4\@
+
+ // Store 4 <= LEN <= 7 bytes.
+ vpextrd $1, \src, %eax
+ mov %ecx, \tmp32
+ shl $3, %ecx
+ ror %cl, %eax
+ mov %eax, (\dst, \tmp64) // Store last LEN - 4 bytes
+ vmovd \src, (\dst) // Store first 4 bytes
+ jmp .Ldone\@
+
+.Llt4\@:
+ // Store 1 <= LEN <= 3 bytes.
+ vpextrb $0, \src, 0(\dst)
+ cmp $-2, %ecx // LEN - 4 == -2, i.e. LEN == 2?
+ jl .Ldone\@
+ vpextrb $1, \src, 1(\dst)
+ je .Ldone\@
+ vpextrb $2, \src, 2(\dst)
+.Ldone\@:
+.endm
+
+// void aes_gcm_aad_update_vaes_avx2(const struct aes_gcm_key_vaes_avx2 *key,
+// u8 ghash_acc[16],
+// const u8 *aad, int aadlen);
+//
+// This function processes the AAD (Additional Authenticated Data) in GCM.
+// Using the key |key|, it updates the GHASH accumulator |ghash_acc| with the
+// data given by |aad| and |aadlen|. On the first call, |ghash_acc| must be all
+// zeroes. |aadlen| must be a multiple of 16, except on the last call where it
+// can be any length. The caller must do any buffering needed to ensure this.
+//
+// This handles large amounts of AAD efficiently, while also keeping overhead
+// low for small amounts which is the common case. TLS and IPsec use less than
+// one block of AAD, but (uncommonly) other use cases may use much more.
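+//
+// The structure of the code below follows this: AADLEN <= 16 takes a short
+// path that uses only 128-bit (xmm) operations, while larger AADLEN is
+// processed 128 bytes at a time, then 32 bytes at a time, and finally as a
+// 1 to 31 byte remainder, if any.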
+SYM_FUNC_START(aes_gcm_aad_update_vaes_avx2)
+
+ // Function arguments
+ .set KEY, %rdi
+ .set GHASH_ACC_PTR, %rsi
+ .set AAD, %rdx
+ .set AADLEN, %ecx // Must be %ecx for _load_partial_block
+ .set AADLEN64, %rcx // Zero-extend AADLEN before using!
+
+ // Additional local variables.
+ // %rax and %r8 are used as temporary registers.
+ .set TMP0, %ymm0
+ .set TMP0_XMM, %xmm0
+ .set TMP1, %ymm1
+ .set TMP1_XMM, %xmm1
+ .set TMP2, %ymm2
+ .set TMP2_XMM, %xmm2
+ .set LO, %ymm3
+ .set LO_XMM, %xmm3
+ .set MI, %ymm4
+ .set MI_XMM, %xmm4
+ .set GHASH_ACC, %ymm5
+ .set GHASH_ACC_XMM, %xmm5
+ .set BSWAP_MASK, %ymm6
+ .set BSWAP_MASK_XMM, %xmm6
+ .set GFPOLY, %ymm7
+ .set GFPOLY_XMM, %xmm7
+ .set H_POW2_XORED, %ymm8
+ .set H_POW1_XORED, %ymm9
+
+ // Load the bswap_mask and gfpoly constants. Since AADLEN is usually
+ // small, usually only 128-bit vectors will be used. So as an
+ // optimization, don't broadcast these constants to both 128-bit lanes
+ // quite yet.
+ vmovdqu .Lbswap_mask(%rip), BSWAP_MASK_XMM
+ vmovdqu .Lgfpoly(%rip), GFPOLY_XMM
+
+ // Load the GHASH accumulator.
+ vmovdqu (GHASH_ACC_PTR), GHASH_ACC_XMM
+
+ // Check for the common case of AADLEN <= 16, as well as AADLEN == 0.
+ test AADLEN, AADLEN
+ jz .Laad_done
+ cmp $16, AADLEN
+ jle .Laad_lastblock
+
+ // AADLEN > 16, so we'll operate on full vectors. Broadcast bswap_mask
+ // and gfpoly to both 128-bit lanes.
+ vinserti128 $1, BSWAP_MASK_XMM, BSWAP_MASK, BSWAP_MASK
+ vinserti128 $1, GFPOLY_XMM, GFPOLY, GFPOLY
+
+ // If AADLEN >= 128, update GHASH with 128 bytes of AAD at a time.
+ add $-128, AADLEN // 128 is 4 bytes, -128 is 1 byte
+ jl .Laad_loop_4x_done
+ vmovdqu OFFSETOF_H_POWERS_XORED(KEY), H_POW2_XORED
+ vmovdqu OFFSETOF_H_POWERS_XORED+32(KEY), H_POW1_XORED
+.Laad_loop_4x:
+ _ghash_4x AAD
+ sub $-128, AAD
+ add $-128, AADLEN
+ jge .Laad_loop_4x
+.Laad_loop_4x_done:
+
+ // If AADLEN >= 32, update GHASH with 32 bytes of AAD at a time.
+ add $96, AADLEN
+ jl .Laad_loop_1x_done
+.Laad_loop_1x:
+ vmovdqu (AAD), TMP0
+ vpshufb BSWAP_MASK, TMP0, TMP0
+ vpxor TMP0, GHASH_ACC, GHASH_ACC
+ vmovdqu OFFSETOFEND_H_POWERS-32(KEY), TMP0
+ _ghash_mul TMP0, GHASH_ACC, GHASH_ACC, GFPOLY, TMP1, TMP2, LO
+ vextracti128 $1, GHASH_ACC, TMP0_XMM
+ vpxor TMP0_XMM, GHASH_ACC_XMM, GHASH_ACC_XMM
+ add $32, AAD
+ sub $32, AADLEN
+ jge .Laad_loop_1x
+.Laad_loop_1x_done:
+ add $32, AADLEN
+ // Now 0 <= AADLEN < 32.
+
+ jz .Laad_done
+ cmp $16, AADLEN
+ jle .Laad_lastblock
+
+.Laad_last2blocks:
+ // Update GHASH with the remaining 17 <= AADLEN <= 31 bytes of AAD.
+ mov AADLEN, AADLEN // Zero-extend AADLEN to AADLEN64.
+ vmovdqu (AAD), TMP0_XMM
+ vmovdqu -16(AAD, AADLEN64), TMP1_XMM
+ vpshufb BSWAP_MASK_XMM, TMP0_XMM, TMP0_XMM
+ vpxor TMP0_XMM, GHASH_ACC_XMM, GHASH_ACC_XMM
+ lea .Lrshift_and_bswap_table(%rip), %rax
+ vpshufb -16(%rax, AADLEN64), TMP1_XMM, TMP1_XMM
+ vinserti128 $1, TMP1_XMM, GHASH_ACC, GHASH_ACC
+ vmovdqu OFFSETOFEND_H_POWERS-32(KEY), TMP0
+ _ghash_mul TMP0, GHASH_ACC, GHASH_ACC, GFPOLY, TMP1, TMP2, LO
+ vextracti128 $1, GHASH_ACC, TMP0_XMM
+ vpxor TMP0_XMM, GHASH_ACC_XMM, GHASH_ACC_XMM
+ jmp .Laad_done
+
+.Laad_lastblock:
+ // Update GHASH with the remaining 1 <= AADLEN <= 16 bytes of AAD.
+ _load_partial_block AAD, TMP0_XMM, %r8, %r8d
+ vpshufb BSWAP_MASK_XMM, TMP0_XMM, TMP0_XMM
+ vpxor TMP0_XMM, GHASH_ACC_XMM, GHASH_ACC_XMM
+ vmovdqu OFFSETOFEND_H_POWERS-16(KEY), TMP0_XMM
+ _ghash_mul TMP0_XMM, GHASH_ACC_XMM, GHASH_ACC_XMM, GFPOLY_XMM, \
+ TMP1_XMM, TMP2_XMM, LO_XMM
+
+.Laad_done:
+ // Store the updated GHASH accumulator back to memory.
+ vmovdqu GHASH_ACC_XMM, (GHASH_ACC_PTR)
+
+ vzeroupper
+ RET
+SYM_FUNC_END(aes_gcm_aad_update_vaes_avx2)
+
+// Do one non-last round of AES encryption on the blocks in the given AESDATA
+// vectors using the round key that has been broadcast to all 128-bit lanes of
+// \round_key.
+.macro _vaesenc round_key, vecs:vararg
+.irp i, \vecs
+ vaesenc \round_key, AESDATA\i, AESDATA\i
+.endr
+.endm
+
+// Generate counter blocks in the given AESDATA vectors, then do the zero-th AES
+// round on them. Clobbers TMP0.
+.macro _ctr_begin vecs:vararg
+ vbroadcasti128 .Linc_2blocks(%rip), TMP0
+.irp i, \vecs
+ vpshufb BSWAP_MASK, LE_CTR, AESDATA\i
+ vpaddd TMP0, LE_CTR, LE_CTR
+.endr
+.irp i, \vecs
+ vpxor RNDKEY0, AESDATA\i, AESDATA\i
+.endr
+.endm
+
+// Generate and encrypt counter blocks in the given AESDATA vectors, excluding
+// the last AES round. Clobbers TMP0.
+.macro _aesenc_loop vecs:vararg
+ _ctr_begin \vecs
+ lea 16(KEY), %rax
+.Laesenc_loop\@:
+ vbroadcasti128 (%rax), TMP0
+ _vaesenc TMP0, \vecs
+ add $16, %rax
+ cmp %rax, RNDKEYLAST_PTR
+ jne .Laesenc_loop\@
+.endm
+
+// Finalize the keystream blocks in the given AESDATA vectors by doing the last
+// AES round, then XOR those keystream blocks with the corresponding data.
+// Reduce latency by doing the XOR before the vaesenclast, utilizing the
+// property vaesenclast(key, a) ^ b == vaesenclast(key ^ b, a). Clobbers TMP0.
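+// (The property holds because vaesenclast(key, a) = ShiftRows(SubBytes(a))
+// ^ key, so XOR'ing b into the round key or into the result is equivalent.)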
+.macro _aesenclast_and_xor vecs:vararg
+.irp i, \vecs
+ vpxor \i*32(SRC), RNDKEYLAST, TMP0
+ vaesenclast TMP0, AESDATA\i, AESDATA\i
+.endr
+.irp i, \vecs
+ vmovdqu AESDATA\i, \i*32(DST)
+.endr
+.endm
+
+// void aes_gcm_{enc,dec}_update_vaes_avx2(const struct aes_gcm_key_vaes_avx2 *key,
+// const u32 le_ctr[4], u8 ghash_acc[16],
+// const u8 *src, u8 *dst, int datalen);
+//
+// This macro generates a GCM encryption or decryption update function with the
+// above prototype (with \enc selecting which one). The function computes the
+// next portion of the CTR keystream, XOR's it with |datalen| bytes from |src|,
+// and writes the resulting encrypted or decrypted data to |dst|. It also
+// updates the GHASH accumulator |ghash_acc| using the next |datalen| ciphertext
+// bytes.
+//
+// |datalen| must be a multiple of 16, except on the last call where it can be
+// any length. The caller must do any buffering needed to ensure this. Both
+// in-place and out-of-place en/decryption are supported.
+//
+// |le_ctr| must give the current counter in little-endian format. This
+// function loads the counter from |le_ctr| and increments the loaded counter as
+// needed, but it does *not* store the updated counter back to |le_ctr|. The
+// caller must update |le_ctr| if any more data segments follow. Internally,
+// only the low 32-bit word of the counter is incremented, following the GCM
+// standard.
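+//
+// For reference, the glue code calls these functions roughly as follows:
+// aes_gcm_precompute_*() once at setkey time, then for each request
+// aes_gcm_aad_update_*() for the AAD, one or more calls to this function for
+// the data (with the caller updating |le_ctr| between calls), and finally
+// aes_gcm_{enc,dec}_final_*() to produce or verify the authentication tag.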
+.macro _aes_gcm_update enc
+
+ // Function arguments
+ .set KEY, %rdi
+ .set LE_CTR_PTR, %rsi
+ .set LE_CTR_PTR32, %esi
+ .set GHASH_ACC_PTR, %rdx
+ .set SRC, %rcx // Assumed to be %rcx.
+ // See .Ltail_xor_and_ghash_partial_vec
+ .set DST, %r8
+ .set DATALEN, %r9d
+ .set DATALEN64, %r9 // Zero-extend DATALEN before using!
+
+ // Additional local variables
+
+ // %rax is used as a temporary register. LE_CTR_PTR is also available
+ // as a temporary register after the counter is loaded.
+
+ // AES key length in bytes
+ .set AESKEYLEN, %r10d
+ .set AESKEYLEN64, %r10
+
+ // Pointer to the last AES round key for the chosen AES variant
+ .set RNDKEYLAST_PTR, %r11
+
+ // BSWAP_MASK is the shuffle mask for byte-reflecting 128-bit values
+ // using vpshufb, copied to all 128-bit lanes.
+ .set BSWAP_MASK, %ymm0
+ .set BSWAP_MASK_XMM, %xmm0
+
+ // GHASH_ACC is the accumulator variable for GHASH. When fully reduced,
+ // only the lowest 128-bit lane can be nonzero. When not fully reduced,
+ // more than one lane may be used, and they need to be XOR'd together.
+ .set GHASH_ACC, %ymm1
+ .set GHASH_ACC_XMM, %xmm1
+
+ // TMP[0-2] are temporary registers.
+ .set TMP0, %ymm2
+ .set TMP0_XMM, %xmm2
+ .set TMP1, %ymm3
+ .set TMP1_XMM, %xmm3
+ .set TMP2, %ymm4
+ .set TMP2_XMM, %xmm4
+
+ // LO and MI are used to accumulate unreduced GHASH products.
+ .set LO, %ymm5
+ .set LO_XMM, %xmm5
+ .set MI, %ymm6
+ .set MI_XMM, %xmm6
+
+ // H_POW[2-1]_XORED contain cached values from KEY->h_powers_xored. The
+ // descending numbering reflects the order of the key powers.
+ .set H_POW2_XORED, %ymm7
+ .set H_POW2_XORED_XMM, %xmm7
+ .set H_POW1_XORED, %ymm8
+ .set H_POW1_XORED_XMM, %xmm8
+
+ // RNDKEY0 caches the zero-th round key, and RNDKEYLAST the last one.
+ .set RNDKEY0, %ymm9
+ .set RNDKEYLAST, %ymm10
+
+ // LE_CTR contains the next set of little-endian counter blocks.
+ .set LE_CTR, %ymm11
+
+ // AESDATA[0-3] hold the counter blocks that are being encrypted by AES.
+ .set AESDATA0, %ymm12
+ .set AESDATA0_XMM, %xmm12
+ .set AESDATA1, %ymm13
+ .set AESDATA1_XMM, %xmm13
+ .set AESDATA2, %ymm14
+ .set AESDATA2_XMM, %xmm14
+ .set AESDATA3, %ymm15
+ .set AESDATA3_XMM, %xmm15
+
+.if \enc
+ .set GHASHDATA_PTR, DST
+.else
+ .set GHASHDATA_PTR, SRC
+.endif
+
+ vbroadcasti128 .Lbswap_mask(%rip), BSWAP_MASK
+
+ // Load the GHASH accumulator and the starting counter.
+ vmovdqu (GHASH_ACC_PTR), GHASH_ACC_XMM
+ vbroadcasti128 (LE_CTR_PTR), LE_CTR
+
+ // Load the AES key length in bytes.
+ movl OFFSETOF_AESKEYLEN(KEY), AESKEYLEN
+
+ // Make RNDKEYLAST_PTR point to the last AES round key. This is the
+ // round key with index 10, 12, or 14 for AES-128, AES-192, or AES-256
+ // respectively. Then load the zero-th and last round keys.
+ lea 6*16(KEY,AESKEYLEN64,4), RNDKEYLAST_PTR
+ vbroadcasti128 (KEY), RNDKEY0
+ vbroadcasti128 (RNDKEYLAST_PTR), RNDKEYLAST
+
+ // Finish initializing LE_CTR by adding 1 to the second block.
+ vpaddd .Lctr_pattern(%rip), LE_CTR, LE_CTR
+
+ // If there are at least 128 bytes of data, then continue into the loop
+ // that processes 128 bytes of data at a time. Otherwise skip it.
+ add $-128, DATALEN // 128 is 4 bytes, -128 is 1 byte
+ jl .Lcrypt_loop_4x_done\@
+
+ vmovdqu OFFSETOF_H_POWERS_XORED(KEY), H_POW2_XORED
+ vmovdqu OFFSETOF_H_POWERS_XORED+32(KEY), H_POW1_XORED
+
+ // Main loop: en/decrypt and hash 4 vectors (128 bytes) at a time.
+
+.if \enc
+ // Encrypt the first 4 vectors of plaintext blocks.
+ _aesenc_loop 0,1,2,3
+ _aesenclast_and_xor 0,1,2,3
+ sub $-128, SRC // 128 is 4 bytes, -128 is 1 byte
+ add $-128, DATALEN
+ jl .Lghash_last_ciphertext_4x\@
+.endif
+
+.align 16
+.Lcrypt_loop_4x\@:
+
+ // Start the AES encryption of the counter blocks.
+ _ctr_begin 0,1,2,3
+ cmp $24, AESKEYLEN
+ jl 128f // AES-128?
+ je 192f // AES-192?
+ // AES-256
+ vbroadcasti128 -13*16(RNDKEYLAST_PTR), TMP0
+ _vaesenc TMP0, 0,1,2,3
+ vbroadcasti128 -12*16(RNDKEYLAST_PTR), TMP0
+ _vaesenc TMP0, 0,1,2,3
+192:
+ vbroadcasti128 -11*16(RNDKEYLAST_PTR), TMP0
+ _vaesenc TMP0, 0,1,2,3
+ vbroadcasti128 -10*16(RNDKEYLAST_PTR), TMP0
+ _vaesenc TMP0, 0,1,2,3
+128:
+
+ // Finish the AES encryption of the counter blocks in AESDATA[0-3],
+ // interleaved with the GHASH update of the ciphertext blocks.
+.irp i, 9,8,7,6,5,4,3,2,1
+ _ghash_step_4x (9 - \i), GHASHDATA_PTR
+ vbroadcasti128 -\i*16(RNDKEYLAST_PTR), TMP0
+ _vaesenc TMP0, 0,1,2,3
+.endr
+ _ghash_step_4x 9, GHASHDATA_PTR
+.if \enc
+ sub $-128, DST // 128 is 4 bytes, -128 is 1 byte
+.endif
+ _aesenclast_and_xor 0,1,2,3
+ sub $-128, SRC
+.if !\enc
+ sub $-128, DST
+.endif
+ add $-128, DATALEN
+ jge .Lcrypt_loop_4x\@
+
+.if \enc
+.Lghash_last_ciphertext_4x\@:
+ // Update GHASH with the last set of ciphertext blocks.
+ _ghash_4x DST
+ sub $-128, DST
+.endif
+
+.Lcrypt_loop_4x_done\@:
+
+ // Undo the extra subtraction by 128 and check whether data remains.
+ sub $-128, DATALEN // 128 is 4 bytes, -128 is 1 byte
+ jz .Ldone\@
+
+ // The data length isn't a multiple of 128 bytes. Process the remaining
+ // data of length 1 <= DATALEN < 128.
+ //
+ // Since there are enough key powers available for all remaining data,
+ // there is no need to do a GHASH reduction after each iteration.
+ // Instead, multiply each remaining block by its own key power, and only
+ // do a GHASH reduction at the very end.
+
+ // Make POWERS_PTR point to the key powers [H^N, H^(N-1), ...] where N
+ // is the number of blocks that remain.
+ .set POWERS_PTR, LE_CTR_PTR // LE_CTR_PTR is free to be reused.
+ .set POWERS_PTR32, LE_CTR_PTR32
+ mov DATALEN, %eax
+ neg %rax
+ and $~15, %rax // -round_up(DATALEN, 16)
+ lea OFFSETOFEND_H_POWERS(KEY,%rax), POWERS_PTR
+
+ // Start collecting the unreduced GHASH intermediate value LO, MI, HI.
+ .set HI, H_POW2_XORED // H_POW2_XORED is free to be reused.
+ .set HI_XMM, H_POW2_XORED_XMM
+ vpxor LO_XMM, LO_XMM, LO_XMM
+ vpxor MI_XMM, MI_XMM, MI_XMM
+ vpxor HI_XMM, HI_XMM, HI_XMM
+
+ // 1 <= DATALEN < 128. Generate 2 or 4 more vectors of keystream blocks
+ // excluding the last AES round, depending on the remaining DATALEN.
+ cmp $64, DATALEN
+ jg .Ltail_gen_4_keystream_vecs\@
+ _aesenc_loop 0,1
+ cmp $32, DATALEN
+ jge .Ltail_xor_and_ghash_full_vec_loop\@
+ jmp .Ltail_xor_and_ghash_partial_vec\@
+.Ltail_gen_4_keystream_vecs\@:
+ _aesenc_loop 0,1,2,3
+
+ // XOR the remaining data and accumulate the unreduced GHASH products
+ // for DATALEN >= 32, starting with one full 32-byte vector at a time.
+.Ltail_xor_and_ghash_full_vec_loop\@:
+.if \enc
+ _aesenclast_and_xor 0
+ vpshufb BSWAP_MASK, AESDATA0, AESDATA0
+.else
+ vmovdqu (SRC), TMP1
+ vpxor TMP1, RNDKEYLAST, TMP0
+ vaesenclast TMP0, AESDATA0, AESDATA0
+ vmovdqu AESDATA0, (DST)
+ vpshufb BSWAP_MASK, TMP1, AESDATA0
+.endif
+ // The ciphertext blocks (i.e. GHASH input data) are now in AESDATA0.
+ vpxor GHASH_ACC, AESDATA0, AESDATA0
+ vmovdqu (POWERS_PTR), TMP2
+ _ghash_mul_noreduce TMP2, AESDATA0, LO, MI, HI, TMP0
+ vmovdqa AESDATA1, AESDATA0
+ vmovdqa AESDATA2, AESDATA1
+ vmovdqa AESDATA3, AESDATA2
+ vpxor GHASH_ACC_XMM, GHASH_ACC_XMM, GHASH_ACC_XMM
+ add $32, SRC
+ add $32, DST
+ add $32, POWERS_PTR
+ sub $32, DATALEN
+ cmp $32, DATALEN
+ jge .Ltail_xor_and_ghash_full_vec_loop\@
+ test DATALEN, DATALEN
+ jz .Ltail_ghash_reduce\@
+
+.Ltail_xor_and_ghash_partial_vec\@:
+ // XOR the remaining data and accumulate the unreduced GHASH products,
+ // for 1 <= DATALEN < 32.
+ vaesenclast RNDKEYLAST, AESDATA0, AESDATA0
+ cmp $16, DATALEN
+ jle .Ltail_xor_and_ghash_1to16bytes\@
+
+ // Handle 17 <= DATALEN < 32.
+
+ // Load a vpshufb mask that will right-shift by '32 - DATALEN' bytes
+ // (shifting in zeroes), then reflect all 16 bytes.
+ lea .Lrshift_and_bswap_table(%rip), %rax
+ vmovdqu -16(%rax, DATALEN64), TMP2_XMM
+
+ // Move the second keystream block to its own register and left-align it
+ vextracti128 $1, AESDATA0, AESDATA1_XMM
+ vpxor .Lfifteens(%rip), TMP2_XMM, TMP0_XMM
+ vpshufb TMP0_XMM, AESDATA1_XMM, AESDATA1_XMM
+
+ // Using overlapping loads and stores, XOR the source data with the
+ // keystream and write the destination data. Then prepare the GHASH
+ // input data: the full ciphertext block and the zero-padded partial
+ // ciphertext block, both byte-reflected, in AESDATA0.
+.if \enc
+ vpxor -16(SRC, DATALEN64), AESDATA1_XMM, AESDATA1_XMM
+ vpxor (SRC), AESDATA0_XMM, AESDATA0_XMM
+ vmovdqu AESDATA1_XMM, -16(DST, DATALEN64)
+ vmovdqu AESDATA0_XMM, (DST)
+ vpshufb TMP2_XMM, AESDATA1_XMM, AESDATA1_XMM
+ vpshufb BSWAP_MASK_XMM, AESDATA0_XMM, AESDATA0_XMM
+.else
+ vmovdqu -16(SRC, DATALEN64), TMP1_XMM
+ vmovdqu (SRC), TMP0_XMM
+ vpxor TMP1_XMM, AESDATA1_XMM, AESDATA1_XMM
+ vpxor TMP0_XMM, AESDATA0_XMM, AESDATA0_XMM
+ vmovdqu AESDATA1_XMM, -16(DST, DATALEN64)
+ vmovdqu AESDATA0_XMM, (DST)
+ vpshufb TMP2_XMM, TMP1_XMM, AESDATA1_XMM
+ vpshufb BSWAP_MASK_XMM, TMP0_XMM, AESDATA0_XMM
+.endif
+ vpxor GHASH_ACC_XMM, AESDATA0_XMM, AESDATA0_XMM
+ vinserti128 $1, AESDATA1_XMM, AESDATA0, AESDATA0
+ vmovdqu (POWERS_PTR), TMP2
+ jmp .Ltail_ghash_last_vec\@
+
+.Ltail_xor_and_ghash_1to16bytes\@:
+ // Handle 1 <= DATALEN <= 16. Carefully load and store the
+ // possibly-partial block, which we mustn't access out of bounds.
+ vmovdqu (POWERS_PTR), TMP2_XMM
+ mov SRC, KEY // Free up %rcx, assuming SRC == %rcx
+ mov DATALEN, %ecx
+ _load_partial_block KEY, TMP0_XMM, POWERS_PTR, POWERS_PTR32
+ vpxor TMP0_XMM, AESDATA0_XMM, AESDATA0_XMM
+ mov DATALEN, %ecx
+ _store_partial_block AESDATA0_XMM, DST, POWERS_PTR, POWERS_PTR32
+.if \enc
+ lea .Lselect_high_bytes_table(%rip), %rax
+ vpshufb BSWAP_MASK_XMM, AESDATA0_XMM, AESDATA0_XMM
+ vpand (%rax, DATALEN64), AESDATA0_XMM, AESDATA0_XMM
+.else
+ vpshufb BSWAP_MASK_XMM, TMP0_XMM, AESDATA0_XMM
+.endif
+ vpxor GHASH_ACC_XMM, AESDATA0_XMM, AESDATA0_XMM
+
+.Ltail_ghash_last_vec\@:
+ // Accumulate the unreduced GHASH products for the last 1-2 blocks. The
+ // GHASH input data is in AESDATA0. If only one block remains, then the
+ // second block in AESDATA0 is zero and does not affect the result.
+ _ghash_mul_noreduce TMP2, AESDATA0, LO, MI, HI, TMP0
+
+.Ltail_ghash_reduce\@:
+ // Finally, do the GHASH reduction.
+ vbroadcasti128 .Lgfpoly(%rip), TMP0
+ _ghash_reduce LO, MI, HI, TMP0, TMP1
+ vextracti128 $1, HI, GHASH_ACC_XMM
+ vpxor HI_XMM, GHASH_ACC_XMM, GHASH_ACC_XMM
+
+.Ldone\@:
+ // Store the updated GHASH accumulator back to memory.
+ vmovdqu GHASH_ACC_XMM, (GHASH_ACC_PTR)
+
+ vzeroupper
+ RET
+.endm
+
+// void aes_gcm_enc_final_vaes_avx2(const struct aes_gcm_key_vaes_avx2 *key,
+// const u32 le_ctr[4], u8 ghash_acc[16],
+// u64 total_aadlen, u64 total_datalen);
+// bool aes_gcm_dec_final_vaes_avx2(const struct aes_gcm_key_vaes_avx2 *key,
+// const u32 le_ctr[4], const u8 ghash_acc[16],
+// u64 total_aadlen, u64 total_datalen,
+// const u8 tag[16], int taglen);
+//
+// This macro generates one of the above two functions (with \enc selecting
+// which one). Both functions finish computing the GCM authentication tag by
+// updating GHASH with the lengths block and encrypting the GHASH accumulator.
+// |total_aadlen| and |total_datalen| must be the total length of the additional
+// authenticated data and the en/decrypted data in bytes, respectively.
+//
+// The encryption function then stores the full-length (16-byte) computed
+// authentication tag to |ghash_acc|. The decryption function instead loads the
+// expected authentication tag (the one that was transmitted) from the 16-byte
+// buffer |tag|, compares the first 4 <= |taglen| <= 16 bytes of it to the
+// computed tag in constant time, and returns true if and only if they match.
+.macro _aes_gcm_final enc
+
+ // Function arguments
+ .set KEY, %rdi
+ .set LE_CTR_PTR, %rsi
+ .set GHASH_ACC_PTR, %rdx
+ .set TOTAL_AADLEN, %rcx
+ .set TOTAL_DATALEN, %r8
+ .set TAG, %r9
+ .set TAGLEN, %r10d // Originally at 8(%rsp)
+ .set TAGLEN64, %r10
+
+ // Additional local variables.
+ // %rax and %xmm0-%xmm3 are used as temporary registers.
+ .set AESKEYLEN, %r11d
+ .set AESKEYLEN64, %r11
+ .set GFPOLY, %xmm4
+ .set BSWAP_MASK, %xmm5
+ .set LE_CTR, %xmm6
+ .set GHASH_ACC, %xmm7
+ .set H_POW1, %xmm8
+
+ // Load some constants.
+ vmovdqa .Lgfpoly(%rip), GFPOLY
+ vmovdqa .Lbswap_mask(%rip), BSWAP_MASK
+
+ // Load the AES key length in bytes.
+ movl OFFSETOF_AESKEYLEN(KEY), AESKEYLEN
+
+ // Set up a counter block with 1 in the low 32-bit word. This is the
+ // counter that produces the ciphertext needed to encrypt the auth tag.
+ // GFPOLY has 1 in the low word, so grab the 1 from there using a blend.
+ vpblendd $0xe, (LE_CTR_PTR), GFPOLY, LE_CTR
+
+ // Build the lengths block and XOR it with the GHASH accumulator.
+ // Although the lengths block is defined as the AAD length followed by
+ // the en/decrypted data length, both in big-endian byte order, a byte
+ // reflection of the full block is needed because of the way we compute
+ // GHASH (see _ghash_mul_step). By using little-endian values in the
+ // opposite order, we avoid having to reflect any bytes here.
+ vmovq TOTAL_DATALEN, %xmm0
+ vpinsrq $1, TOTAL_AADLEN, %xmm0, %xmm0
+ vpsllq $3, %xmm0, %xmm0 // Bytes to bits
+ vpxor (GHASH_ACC_PTR), %xmm0, GHASH_ACC
+
+ // Load the first hash key power (H^1), which is stored last.
+ vmovdqu OFFSETOFEND_H_POWERS-16(KEY), H_POW1
+
+ // Load TAGLEN if decrypting.
+.if !\enc
+ movl 8(%rsp), TAGLEN
+.endif
+
+ // Make %rax point to the last AES round key for the chosen AES variant.
+ lea 6*16(KEY,AESKEYLEN64,4), %rax
+
+ // Start the AES encryption of the counter block by swapping the counter
+ // block to big-endian and XOR-ing it with the zero-th AES round key.
+ vpshufb BSWAP_MASK, LE_CTR, %xmm0
+ vpxor (KEY), %xmm0, %xmm0
+
+ // Complete the AES encryption and multiply GHASH_ACC by H^1.
+ // Interleave the AES and GHASH instructions to improve performance.
+ cmp $24, AESKEYLEN
+ jl 128f // AES-128?
+ je 192f // AES-192?
+ // AES-256
+ vaesenc -13*16(%rax), %xmm0, %xmm0
+ vaesenc -12*16(%rax), %xmm0, %xmm0
+192:
+ vaesenc -11*16(%rax), %xmm0, %xmm0
+ vaesenc -10*16(%rax), %xmm0, %xmm0
+128:
+.irp i, 0,1,2,3,4,5,6,7,8
+ _ghash_mul_step \i, H_POW1, GHASH_ACC, GHASH_ACC, GFPOLY, \
+ %xmm1, %xmm2, %xmm3
+ vaesenc (\i-9)*16(%rax), %xmm0, %xmm0
+.endr
+ _ghash_mul_step 9, H_POW1, GHASH_ACC, GHASH_ACC, GFPOLY, \
+ %xmm1, %xmm2, %xmm3
+
+ // Undo the byte reflection of the GHASH accumulator.
+ vpshufb BSWAP_MASK, GHASH_ACC, GHASH_ACC
+
+ // Do the last AES round and XOR the resulting keystream block with the
+ // GHASH accumulator to produce the full computed authentication tag.
+ //
+ // Reduce latency by taking advantage of the property vaesenclast(key,
+ // a) ^ b == vaesenclast(key ^ b, a). I.e., XOR GHASH_ACC into the last
+ // round key, instead of XOR'ing the final AES output with GHASH_ACC.
+ //
+ // enc_final then returns the computed auth tag, while dec_final
+ // compares it with the transmitted one and returns a bool. To compare
+ // the tags, dec_final XORs them together and uses vptest to check
+ // whether the result is all-zeroes. This should be constant-time.
+ // dec_final applies the vaesenclast optimization to this additional
+ // value XOR'd too.
+.if \enc
+ vpxor (%rax), GHASH_ACC, %xmm1
+ vaesenclast %xmm1, %xmm0, GHASH_ACC
+ vmovdqu GHASH_ACC, (GHASH_ACC_PTR)
+.else
+ vpxor (TAG), GHASH_ACC, GHASH_ACC
+ vpxor (%rax), GHASH_ACC, GHASH_ACC
+ vaesenclast GHASH_ACC, %xmm0, %xmm0
+ lea .Lselect_high_bytes_table(%rip), %rax
+ vmovdqu (%rax, TAGLEN64), %xmm1
+ vpshufb BSWAP_MASK, %xmm1, %xmm1 // select low bytes, not high
+ vptest %xmm1, %xmm0
+ sete %al
+.endif
+ // No need for vzeroupper here, since only used xmm registers were used.
+ RET
+.endm
+
+SYM_FUNC_START(aes_gcm_enc_update_vaes_avx2)
+ _aes_gcm_update 1
+SYM_FUNC_END(aes_gcm_enc_update_vaes_avx2)
+SYM_FUNC_START(aes_gcm_dec_update_vaes_avx2)
+ _aes_gcm_update 0
+SYM_FUNC_END(aes_gcm_dec_update_vaes_avx2)
+
+SYM_FUNC_START(aes_gcm_enc_final_vaes_avx2)
+ _aes_gcm_final 1
+SYM_FUNC_END(aes_gcm_enc_final_vaes_avx2)
+SYM_FUNC_START(aes_gcm_dec_final_vaes_avx2)
+ _aes_gcm_final 0
+SYM_FUNC_END(aes_gcm_dec_final_vaes_avx2)
diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index d953ac470aae3..e2847d67430fd 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -872,10 +872,40 @@ struct aes_gcm_key_aesni {
#define AES_GCM_KEY_AESNI(key) \
container_of((key), struct aes_gcm_key_aesni, base)
#define AES_GCM_KEY_AESNI_SIZE \
(sizeof(struct aes_gcm_key_aesni) + (15 & ~(CRYPTO_MINALIGN - 1)))
+/* Key struct used by the VAES + AVX2 implementation of AES-GCM */
+struct aes_gcm_key_vaes_avx2 {
+ /*
+ * Common part of the key. The assembly code prefers 16-byte alignment
+ * for the round keys; we get this by them being located at the start of
+ * the struct and the whole struct being 32-byte aligned.
+ */
+ struct aes_gcm_key base;
+
+ /*
+ * Powers of the hash key H^8 through H^1. These are 128-bit values.
+ * They all have an extra factor of x^-1 and are byte-reversed.
+ * The assembly code prefers 32-byte alignment for this.
+ */
+ u64 h_powers[8][2] __aligned(32);
+
+ /*
+ * Each entry in this array contains the two halves of an entry of
+ * h_powers XOR'd together, in the following order:
+ * H^8,H^6,H^7,H^5,H^4,H^2,H^3,H^1 i.e. indices 0,2,1,3,4,6,5,7.
+ * This is used for Karatsuba multiplication.
+ */
+ u64 h_powers_xored[8];
+};
+
+#define AES_GCM_KEY_VAES_AVX2(key) \
+ container_of((key), struct aes_gcm_key_vaes_avx2, base)
+#define AES_GCM_KEY_VAES_AVX2_SIZE \
+ (sizeof(struct aes_gcm_key_vaes_avx2) + (31 & ~(CRYPTO_MINALIGN - 1)))
+
/* Key struct used by the VAES + AVX10 implementations of AES-GCM */
struct aes_gcm_key_avx10 {
/*
* Common part of the key. The assembly code prefers 16-byte alignment
* for the round keys; we get this by them being located at the start of
@@ -908,27 +938,32 @@ struct aes_gcm_key_avx10 {
* indirect calls (which are very expensive on x86) regardless of inlining.
*/
#define FLAG_RFC4106 BIT(0)
#define FLAG_ENC BIT(1)
#define FLAG_AVX BIT(2)
-#define FLAG_AVX10_256 BIT(3)
-#define FLAG_AVX10_512 BIT(4)
+#define FLAG_VAES_AVX2 BIT(3)
+#define FLAG_AVX10_256 BIT(4)
+#define FLAG_AVX10_512 BIT(5)
static inline struct aes_gcm_key *
aes_gcm_key_get(struct crypto_aead *tfm, int flags)
{
if (flags & (FLAG_AVX10_256 | FLAG_AVX10_512))
return PTR_ALIGN(crypto_aead_ctx(tfm), 64);
+ else if (flags & FLAG_VAES_AVX2)
+ return PTR_ALIGN(crypto_aead_ctx(tfm), 32);
else
return PTR_ALIGN(crypto_aead_ctx(tfm), 16);
}
asmlinkage void
aes_gcm_precompute_aesni(struct aes_gcm_key_aesni *key);
asmlinkage void
aes_gcm_precompute_aesni_avx(struct aes_gcm_key_aesni *key);
asmlinkage void
+aes_gcm_precompute_vaes_avx2(struct aes_gcm_key_vaes_avx2 *key);
+asmlinkage void
aes_gcm_precompute_vaes_avx10_256(struct aes_gcm_key_avx10 *key);
asmlinkage void
aes_gcm_precompute_vaes_avx10_512(struct aes_gcm_key_avx10 *key);
static void aes_gcm_precompute(struct aes_gcm_key *key, int flags)
@@ -945,10 +980,12 @@ static void aes_gcm_precompute(struct aes_gcm_key *key, int flags)
*/
if (flags & FLAG_AVX10_512)
aes_gcm_precompute_vaes_avx10_512(AES_GCM_KEY_AVX10(key));
else if (flags & FLAG_AVX10_256)
aes_gcm_precompute_vaes_avx10_256(AES_GCM_KEY_AVX10(key));
+ else if (flags & FLAG_VAES_AVX2)
+ aes_gcm_precompute_vaes_avx2(AES_GCM_KEY_VAES_AVX2(key));
else if (flags & FLAG_AVX)
aes_gcm_precompute_aesni_avx(AES_GCM_KEY_AESNI(key));
else
aes_gcm_precompute_aesni(AES_GCM_KEY_AESNI(key));
}
@@ -958,19 +995,25 @@ aes_gcm_aad_update_aesni(const struct aes_gcm_key_aesni *key,
u8 ghash_acc[16], const u8 *aad, int aadlen);
asmlinkage void
aes_gcm_aad_update_aesni_avx(const struct aes_gcm_key_aesni *key,
u8 ghash_acc[16], const u8 *aad, int aadlen);
asmlinkage void
+aes_gcm_aad_update_vaes_avx2(const struct aes_gcm_key_vaes_avx2 *key,
+ u8 ghash_acc[16], const u8 *aad, int aadlen);
+asmlinkage void
aes_gcm_aad_update_vaes_avx10(const struct aes_gcm_key_avx10 *key,
u8 ghash_acc[16], const u8 *aad, int aadlen);
static void aes_gcm_aad_update(const struct aes_gcm_key *key, u8 ghash_acc[16],
const u8 *aad, int aadlen, int flags)
{
if (flags & (FLAG_AVX10_256 | FLAG_AVX10_512))
aes_gcm_aad_update_vaes_avx10(AES_GCM_KEY_AVX10(key), ghash_acc,
aad, aadlen);
+ else if (flags & FLAG_VAES_AVX2)
+ aes_gcm_aad_update_vaes_avx2(AES_GCM_KEY_VAES_AVX2(key),
+ ghash_acc, aad, aadlen);
else if (flags & FLAG_AVX)
aes_gcm_aad_update_aesni_avx(AES_GCM_KEY_AESNI(key), ghash_acc,
aad, aadlen);
else
aes_gcm_aad_update_aesni(AES_GCM_KEY_AESNI(key), ghash_acc,
@@ -984,10 +1027,14 @@ aes_gcm_enc_update_aesni(const struct aes_gcm_key_aesni *key,
asmlinkage void
aes_gcm_enc_update_aesni_avx(const struct aes_gcm_key_aesni *key,
const u32 le_ctr[4], u8 ghash_acc[16],
const u8 *src, u8 *dst, int datalen);
asmlinkage void
+aes_gcm_enc_update_vaes_avx2(const struct aes_gcm_key_vaes_avx2 *key,
+ const u32 le_ctr[4], u8 ghash_acc[16],
+ const u8 *src, u8 *dst, int datalen);
+asmlinkage void
aes_gcm_enc_update_vaes_avx10_256(const struct aes_gcm_key_avx10 *key,
const u32 le_ctr[4], u8 ghash_acc[16],
const u8 *src, u8 *dst, int datalen);
asmlinkage void
aes_gcm_enc_update_vaes_avx10_512(const struct aes_gcm_key_avx10 *key,
@@ -1001,10 +1048,14 @@ aes_gcm_dec_update_aesni(const struct aes_gcm_key_aesni *key,
asmlinkage void
aes_gcm_dec_update_aesni_avx(const struct aes_gcm_key_aesni *key,
const u32 le_ctr[4], u8 ghash_acc[16],
const u8 *src, u8 *dst, int datalen);
asmlinkage void
+aes_gcm_dec_update_vaes_avx2(const struct aes_gcm_key_vaes_avx2 *key,
+ const u32 le_ctr[4], u8 ghash_acc[16],
+ const u8 *src, u8 *dst, int datalen);
+asmlinkage void
aes_gcm_dec_update_vaes_avx10_256(const struct aes_gcm_key_avx10 *key,
const u32 le_ctr[4], u8 ghash_acc[16],
const u8 *src, u8 *dst, int datalen);
asmlinkage void
aes_gcm_dec_update_vaes_avx10_512(const struct aes_gcm_key_avx10 *key,
@@ -1024,10 +1075,14 @@ aes_gcm_update(const struct aes_gcm_key *key,
src, dst, datalen);
else if (flags & FLAG_AVX10_256)
aes_gcm_enc_update_vaes_avx10_256(AES_GCM_KEY_AVX10(key),
le_ctr, ghash_acc,
src, dst, datalen);
+ else if (flags & FLAG_VAES_AVX2)
+ aes_gcm_enc_update_vaes_avx2(AES_GCM_KEY_VAES_AVX2(key),
+ le_ctr, ghash_acc,
+ src, dst, datalen);
else if (flags & FLAG_AVX)
aes_gcm_enc_update_aesni_avx(AES_GCM_KEY_AESNI(key),
le_ctr, ghash_acc,
src, dst, datalen);
else
@@ -1040,10 +1095,14 @@ aes_gcm_update(const struct aes_gcm_key *key,
src, dst, datalen);
else if (flags & FLAG_AVX10_256)
aes_gcm_dec_update_vaes_avx10_256(AES_GCM_KEY_AVX10(key),
le_ctr, ghash_acc,
src, dst, datalen);
+ else if (flags & FLAG_VAES_AVX2)
+ aes_gcm_dec_update_vaes_avx2(AES_GCM_KEY_VAES_AVX2(key),
+ le_ctr, ghash_acc,
+ src, dst, datalen);
else if (flags & FLAG_AVX)
aes_gcm_dec_update_aesni_avx(AES_GCM_KEY_AESNI(key),
le_ctr, ghash_acc,
src, dst, datalen);
else
@@ -1060,10 +1119,14 @@ aes_gcm_enc_final_aesni(const struct aes_gcm_key_aesni *key,
asmlinkage void
aes_gcm_enc_final_aesni_avx(const struct aes_gcm_key_aesni *key,
const u32 le_ctr[4], u8 ghash_acc[16],
u64 total_aadlen, u64 total_datalen);
asmlinkage void
+aes_gcm_enc_final_vaes_avx2(const struct aes_gcm_key_vaes_avx2 *key,
+ const u32 le_ctr[4], u8 ghash_acc[16],
+ u64 total_aadlen, u64 total_datalen);
+asmlinkage void
aes_gcm_enc_final_vaes_avx10(const struct aes_gcm_key_avx10 *key,
const u32 le_ctr[4], u8 ghash_acc[16],
u64 total_aadlen, u64 total_datalen);
/* __always_inline to optimize out the branches based on @flags */
@@ -1074,10 +1137,14 @@ aes_gcm_enc_final(const struct aes_gcm_key *key,
{
if (flags & (FLAG_AVX10_256 | FLAG_AVX10_512))
aes_gcm_enc_final_vaes_avx10(AES_GCM_KEY_AVX10(key),
le_ctr, ghash_acc,
total_aadlen, total_datalen);
+ else if (flags & FLAG_VAES_AVX2)
+ aes_gcm_enc_final_vaes_avx2(AES_GCM_KEY_VAES_AVX2(key),
+ le_ctr, ghash_acc,
+ total_aadlen, total_datalen);
else if (flags & FLAG_AVX)
aes_gcm_enc_final_aesni_avx(AES_GCM_KEY_AESNI(key),
le_ctr, ghash_acc,
total_aadlen, total_datalen);
else
@@ -1095,10 +1162,15 @@ asmlinkage bool __must_check
aes_gcm_dec_final_aesni_avx(const struct aes_gcm_key_aesni *key,
const u32 le_ctr[4], const u8 ghash_acc[16],
u64 total_aadlen, u64 total_datalen,
const u8 tag[16], int taglen);
asmlinkage bool __must_check
+aes_gcm_dec_final_vaes_avx2(const struct aes_gcm_key_vaes_avx2 *key,
+ const u32 le_ctr[4], const u8 ghash_acc[16],
+ u64 total_aadlen, u64 total_datalen,
+ const u8 tag[16], int taglen);
+asmlinkage bool __must_check
aes_gcm_dec_final_vaes_avx10(const struct aes_gcm_key_avx10 *key,
const u32 le_ctr[4], const u8 ghash_acc[16],
u64 total_aadlen, u64 total_datalen,
const u8 tag[16], int taglen);
@@ -1111,10 +1183,15 @@ aes_gcm_dec_final(const struct aes_gcm_key *key, const u32 le_ctr[4],
if (flags & (FLAG_AVX10_256 | FLAG_AVX10_512))
return aes_gcm_dec_final_vaes_avx10(AES_GCM_KEY_AVX10(key),
le_ctr, ghash_acc,
total_aadlen, total_datalen,
tag, taglen);
+ else if (flags & FLAG_VAES_AVX2)
+ return aes_gcm_dec_final_vaes_avx2(AES_GCM_KEY_VAES_AVX2(key),
+ le_ctr, ghash_acc,
+ total_aadlen, total_datalen,
+ tag, taglen);
else if (flags & FLAG_AVX)
return aes_gcm_dec_final_aesni_avx(AES_GCM_KEY_AESNI(key),
le_ctr, ghash_acc,
total_aadlen, total_datalen,
tag, taglen);
@@ -1193,10 +1270,14 @@ static int gcm_setkey(struct crypto_aead *tfm, const u8 *raw_key,
BUILD_BUG_ON(offsetof(struct aes_gcm_key_aesni, base.aes_key.key_enc) != 0);
BUILD_BUG_ON(offsetof(struct aes_gcm_key_aesni, base.aes_key.key_length) != 480);
BUILD_BUG_ON(offsetof(struct aes_gcm_key_aesni, h_powers) != 496);
BUILD_BUG_ON(offsetof(struct aes_gcm_key_aesni, h_powers_xored) != 624);
BUILD_BUG_ON(offsetof(struct aes_gcm_key_aesni, h_times_x64) != 688);
+ BUILD_BUG_ON(offsetof(struct aes_gcm_key_vaes_avx2, base.aes_key.key_enc) != 0);
+ BUILD_BUG_ON(offsetof(struct aes_gcm_key_vaes_avx2, base.aes_key.key_length) != 480);
+ BUILD_BUG_ON(offsetof(struct aes_gcm_key_vaes_avx2, h_powers) != 512);
+ BUILD_BUG_ON(offsetof(struct aes_gcm_key_vaes_avx2, h_powers_xored) != 640);
BUILD_BUG_ON(offsetof(struct aes_gcm_key_avx10, base.aes_key.key_enc) != 0);
BUILD_BUG_ON(offsetof(struct aes_gcm_key_avx10, base.aes_key.key_length) != 480);
BUILD_BUG_ON(offsetof(struct aes_gcm_key_avx10, h_powers) != 512);
BUILD_BUG_ON(offsetof(struct aes_gcm_key_avx10, padding) != 768);
@@ -1238,10 +1319,26 @@ static int gcm_setkey(struct crypto_aead *tfm, const u8 *raw_key,
k->h_powers[i][0] = be64_to_cpu(h.b);
k->h_powers[i][1] = be64_to_cpu(h.a);
gf128mul_lle(&h, &h1);
}
memset(k->padding, 0, sizeof(k->padding));
+ } else if (flags & FLAG_VAES_AVX2) {
+ struct aes_gcm_key_vaes_avx2 *k =
+ AES_GCM_KEY_VAES_AVX2(key);
+ static const u8 indices[8] = { 0, 2, 1, 3, 4, 6, 5, 7 };
+
+ for (i = ARRAY_SIZE(k->h_powers) - 1; i >= 0; i--) {
+ k->h_powers[i][0] = be64_to_cpu(h.b);
+ k->h_powers[i][1] = be64_to_cpu(h.a);
+ gf128mul_lle(&h, &h1);
+ }
+ for (i = 0; i < ARRAY_SIZE(k->h_powers_xored); i++) {
+ int j = indices[i];
+
+ k->h_powers_xored[i] = k->h_powers[j][0] ^
+ k->h_powers[j][1];
+ }
} else {
struct aes_gcm_key_aesni *k = AES_GCM_KEY_AESNI(key);
for (i = ARRAY_SIZE(k->h_powers) - 1; i >= 0; i--) {
k->h_powers[i][0] = be64_to_cpu(h.b);
@@ -1506,10 +1603,15 @@ DEFINE_GCM_ALGS(aesni, /* no flags */ 0,
/* aes_gcm_algs_aesni_avx */
DEFINE_GCM_ALGS(aesni_avx, FLAG_AVX,
"generic-gcm-aesni-avx", "rfc4106-gcm-aesni-avx",
AES_GCM_KEY_AESNI_SIZE, 500);
+/* aes_gcm_algs_vaes_avx2 */
+DEFINE_GCM_ALGS(vaes_avx2, FLAG_VAES_AVX2,
+ "generic-gcm-vaes-avx2", "rfc4106-gcm-vaes-avx2",
+ AES_GCM_KEY_VAES_AVX2_SIZE, 600);
+
/* aes_gcm_algs_vaes_avx10_256 */
DEFINE_GCM_ALGS(vaes_avx10_256, FLAG_AVX10_256,
"generic-gcm-vaes-avx10_256", "rfc4106-gcm-vaes-avx10_256",
AES_GCM_KEY_AVX10_SIZE, 700);
@@ -1546,10 +1648,14 @@ static int __init register_avx_algs(void)
return 0;
err = crypto_register_skciphers(skcipher_algs_vaes_avx2,
ARRAY_SIZE(skcipher_algs_vaes_avx2));
if (err)
return err;
+ err = crypto_register_aeads(aes_gcm_algs_vaes_avx2,
+ ARRAY_SIZE(aes_gcm_algs_vaes_avx2));
+ if (err)
+ return err;
if (!boot_cpu_has(X86_FEATURE_AVX512BW) ||
!boot_cpu_has(X86_FEATURE_AVX512VL) ||
!boot_cpu_has(X86_FEATURE_BMI2) ||
!cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM |
@@ -1593,10 +1699,11 @@ static void unregister_avx_algs(void)
{
unregister_skciphers(skcipher_algs_aesni_avx);
unregister_aeads(aes_gcm_algs_aesni_avx);
unregister_skciphers(skcipher_algs_vaes_avx2);
unregister_skciphers(skcipher_algs_vaes_avx512);
+ unregister_aeads(aes_gcm_algs_vaes_avx2);
unregister_aeads(aes_gcm_algs_vaes_avx10_256);
unregister_aeads(aes_gcm_algs_vaes_avx10_512);
}
#else /* CONFIG_X86_64 */
static struct aead_alg aes_gcm_algs_aesni[0];
--
2.51.0
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 2/8] crypto: x86/aes-gcm - remove VAES+AVX10/256 optimized code
2025-10-02 2:31 [PATCH 0/8] VAES+AVX2 optimized implementation of AES-GCM Eric Biggers
2025-10-02 2:31 ` [PATCH 1/8] crypto: x86/aes-gcm - add VAES+AVX2 optimized code Eric Biggers
@ 2025-10-02 2:31 ` Eric Biggers
2025-10-02 2:31 ` [PATCH 3/8] crypto: x86/aes-gcm - rename avx10 and avx10_512 to avx512 Eric Biggers
` (8 subsequent siblings)
10 siblings, 0 replies; 20+ messages in thread
From: Eric Biggers @ 2025-10-02 2:31 UTC (permalink / raw)
To: linux-crypto
Cc: linux-kernel, x86, Ard Biesheuvel, Jason A . Donenfeld,
Eric Biggers
Remove the VAES+AVX10/256 optimized implementation of AES-GCM.
It's no longer expected to be useful for future CPUs, since Intel
changed the AVX10 specification to require 512-bit vectors.
In addition, it's no longer very useful to serve as the 256-bit fallback
for older Intel CPUs (Ice Lake and Tiger Lake) that downclock too
eagerly when 512-bit vectors are used. This is because I ended up
writing another 256-bit implementation anyway, using VAES+AVX2. The
VAES+AVX2 implementation is almost as fast as the VAES+AVX10/256 one, as
shown by the following tables. So, let's just use it instead.
Table 1: AES-256-GCM encryption throughput change,
         CPU vs. message length in bytes:
                      | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
----------------------+-------+-------+-------+-------+-------+-------+
Intel Ice Lake Server |   -2% |   -1% |    0% |   -2% |   -2% |    3% |
                      |   300 |   200 |    64 |    63 |    16 |
----------------------+-------+-------+-------+-------+-------+
Intel Ice Lake Server |    1% |    0% |    4% |    2% |   -6% |
Table 2: AES-256-GCM decryption throughput change,
         CPU vs. message length in bytes:
                      | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
----------------------+-------+-------+-------+-------+-------+-------+
Intel Ice Lake Server |   -1% |   -1% |    1% |   -2% |    0% |    2% |
                      |   300 |   200 |    64 |    63 |    16 |
----------------------+-------+-------+-------+-------+-------+
Intel Ice Lake Server |   -1% |    4% |    1% |    0% |   -5% |
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
---
arch/x86/crypto/aes-gcm-avx10-x86_64.S | 11 ------
arch/x86/crypto/aesni-intel_glue.c | 54 +++-----------------------
2 files changed, 6 insertions(+), 59 deletions(-)
diff --git a/arch/x86/crypto/aes-gcm-avx10-x86_64.S b/arch/x86/crypto/aes-gcm-avx10-x86_64.S
index 02ee11083d4f8..4fb04506d7932 100644
--- a/arch/x86/crypto/aes-gcm-avx10-x86_64.S
+++ b/arch/x86/crypto/aes-gcm-avx10-x86_64.S
@@ -1079,21 +1079,10 @@
.endif
// No need for vzeroupper here, since only used xmm registers were used.
RET
.endm
-_set_veclen 32
-SYM_FUNC_START(aes_gcm_precompute_vaes_avx10_256)
- _aes_gcm_precompute
-SYM_FUNC_END(aes_gcm_precompute_vaes_avx10_256)
-SYM_FUNC_START(aes_gcm_enc_update_vaes_avx10_256)
- _aes_gcm_update 1
-SYM_FUNC_END(aes_gcm_enc_update_vaes_avx10_256)
-SYM_FUNC_START(aes_gcm_dec_update_vaes_avx10_256)
- _aes_gcm_update 0
-SYM_FUNC_END(aes_gcm_dec_update_vaes_avx10_256)
-
_set_veclen 64
SYM_FUNC_START(aes_gcm_precompute_vaes_avx10_512)
_aes_gcm_precompute
SYM_FUNC_END(aes_gcm_precompute_vaes_avx10_512)
SYM_FUNC_START(aes_gcm_enc_update_vaes_avx10_512)
diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index e2847d67430fd..1ed8513208d36 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -939,17 +939,16 @@ struct aes_gcm_key_avx10 {
*/
#define FLAG_RFC4106 BIT(0)
#define FLAG_ENC BIT(1)
#define FLAG_AVX BIT(2)
#define FLAG_VAES_AVX2 BIT(3)
-#define FLAG_AVX10_256 BIT(4)
-#define FLAG_AVX10_512 BIT(5)
+#define FLAG_AVX10_512 BIT(4)
static inline struct aes_gcm_key *
aes_gcm_key_get(struct crypto_aead *tfm, int flags)
{
- if (flags & (FLAG_AVX10_256 | FLAG_AVX10_512))
+ if (flags & FLAG_AVX10_512)
return PTR_ALIGN(crypto_aead_ctx(tfm), 64);
else if (flags & FLAG_VAES_AVX2)
return PTR_ALIGN(crypto_aead_ctx(tfm), 32);
else
return PTR_ALIGN(crypto_aead_ctx(tfm), 16);
@@ -960,30 +959,16 @@ aes_gcm_precompute_aesni(struct aes_gcm_key_aesni *key);
asmlinkage void
aes_gcm_precompute_aesni_avx(struct aes_gcm_key_aesni *key);
asmlinkage void
aes_gcm_precompute_vaes_avx2(struct aes_gcm_key_vaes_avx2 *key);
asmlinkage void
-aes_gcm_precompute_vaes_avx10_256(struct aes_gcm_key_avx10 *key);
-asmlinkage void
aes_gcm_precompute_vaes_avx10_512(struct aes_gcm_key_avx10 *key);
static void aes_gcm_precompute(struct aes_gcm_key *key, int flags)
{
- /*
- * To make things a bit easier on the assembly side, the AVX10
- * implementations use the same key format. Therefore, a single
- * function using 256-bit vectors would suffice here. However, it's
- * straightforward to provide a 512-bit one because of how the assembly
- * code is structured, and it works nicely because the total size of the
- * key powers is a multiple of 512 bits. So we take advantage of that.
- *
- * A similar situation applies to the AES-NI implementations.
- */
if (flags & FLAG_AVX10_512)
aes_gcm_precompute_vaes_avx10_512(AES_GCM_KEY_AVX10(key));
- else if (flags & FLAG_AVX10_256)
- aes_gcm_precompute_vaes_avx10_256(AES_GCM_KEY_AVX10(key));
else if (flags & FLAG_VAES_AVX2)
aes_gcm_precompute_vaes_avx2(AES_GCM_KEY_VAES_AVX2(key));
else if (flags & FLAG_AVX)
aes_gcm_precompute_aesni_avx(AES_GCM_KEY_AESNI(key));
else
@@ -1004,11 +989,11 @@ aes_gcm_aad_update_vaes_avx10(const struct aes_gcm_key_avx10 *key,
u8 ghash_acc[16], const u8 *aad, int aadlen);
static void aes_gcm_aad_update(const struct aes_gcm_key *key, u8 ghash_acc[16],
const u8 *aad, int aadlen, int flags)
{
- if (flags & (FLAG_AVX10_256 | FLAG_AVX10_512))
+ if (flags & FLAG_AVX10_512)
aes_gcm_aad_update_vaes_avx10(AES_GCM_KEY_AVX10(key), ghash_acc,
aad, aadlen);
else if (flags & FLAG_VAES_AVX2)
aes_gcm_aad_update_vaes_avx2(AES_GCM_KEY_VAES_AVX2(key),
ghash_acc, aad, aadlen);
@@ -1031,14 +1016,10 @@ aes_gcm_enc_update_aesni_avx(const struct aes_gcm_key_aesni *key,
asmlinkage void
aes_gcm_enc_update_vaes_avx2(const struct aes_gcm_key_vaes_avx2 *key,
const u32 le_ctr[4], u8 ghash_acc[16],
const u8 *src, u8 *dst, int datalen);
asmlinkage void
-aes_gcm_enc_update_vaes_avx10_256(const struct aes_gcm_key_avx10 *key,
- const u32 le_ctr[4], u8 ghash_acc[16],
- const u8 *src, u8 *dst, int datalen);
-asmlinkage void
aes_gcm_enc_update_vaes_avx10_512(const struct aes_gcm_key_avx10 *key,
const u32 le_ctr[4], u8 ghash_acc[16],
const u8 *src, u8 *dst, int datalen);
asmlinkage void
@@ -1052,14 +1033,10 @@ aes_gcm_dec_update_aesni_avx(const struct aes_gcm_key_aesni *key,
asmlinkage void
aes_gcm_dec_update_vaes_avx2(const struct aes_gcm_key_vaes_avx2 *key,
const u32 le_ctr[4], u8 ghash_acc[16],
const u8 *src, u8 *dst, int datalen);
asmlinkage void
-aes_gcm_dec_update_vaes_avx10_256(const struct aes_gcm_key_avx10 *key,
- const u32 le_ctr[4], u8 ghash_acc[16],
- const u8 *src, u8 *dst, int datalen);
-asmlinkage void
aes_gcm_dec_update_vaes_avx10_512(const struct aes_gcm_key_avx10 *key,
const u32 le_ctr[4], u8 ghash_acc[16],
const u8 *src, u8 *dst, int datalen);
/* __always_inline to optimize out the branches based on @flags */
@@ -1071,14 +1048,10 @@ aes_gcm_update(const struct aes_gcm_key *key,
if (flags & FLAG_ENC) {
if (flags & FLAG_AVX10_512)
aes_gcm_enc_update_vaes_avx10_512(AES_GCM_KEY_AVX10(key),
le_ctr, ghash_acc,
src, dst, datalen);
- else if (flags & FLAG_AVX10_256)
- aes_gcm_enc_update_vaes_avx10_256(AES_GCM_KEY_AVX10(key),
- le_ctr, ghash_acc,
- src, dst, datalen);
else if (flags & FLAG_VAES_AVX2)
aes_gcm_enc_update_vaes_avx2(AES_GCM_KEY_VAES_AVX2(key),
le_ctr, ghash_acc,
src, dst, datalen);
else if (flags & FLAG_AVX)
@@ -1091,14 +1064,10 @@ aes_gcm_update(const struct aes_gcm_key *key,
} else {
if (flags & FLAG_AVX10_512)
aes_gcm_dec_update_vaes_avx10_512(AES_GCM_KEY_AVX10(key),
le_ctr, ghash_acc,
src, dst, datalen);
- else if (flags & FLAG_AVX10_256)
- aes_gcm_dec_update_vaes_avx10_256(AES_GCM_KEY_AVX10(key),
- le_ctr, ghash_acc,
- src, dst, datalen);
else if (flags & FLAG_VAES_AVX2)
aes_gcm_dec_update_vaes_avx2(AES_GCM_KEY_VAES_AVX2(key),
le_ctr, ghash_acc,
src, dst, datalen);
else if (flags & FLAG_AVX)
@@ -1133,11 +1102,11 @@ aes_gcm_enc_final_vaes_avx10(const struct aes_gcm_key_avx10 *key,
static __always_inline void
aes_gcm_enc_final(const struct aes_gcm_key *key,
const u32 le_ctr[4], u8 ghash_acc[16],
u64 total_aadlen, u64 total_datalen, int flags)
{
- if (flags & (FLAG_AVX10_256 | FLAG_AVX10_512))
+ if (flags & FLAG_AVX10_512)
aes_gcm_enc_final_vaes_avx10(AES_GCM_KEY_AVX10(key),
le_ctr, ghash_acc,
total_aadlen, total_datalen);
else if (flags & FLAG_VAES_AVX2)
aes_gcm_enc_final_vaes_avx2(AES_GCM_KEY_VAES_AVX2(key),
@@ -1178,11 +1147,11 @@ aes_gcm_dec_final_vaes_avx10(const struct aes_gcm_key_avx10 *key,
static __always_inline bool __must_check
aes_gcm_dec_final(const struct aes_gcm_key *key, const u32 le_ctr[4],
u8 ghash_acc[16], u64 total_aadlen, u64 total_datalen,
u8 tag[16], int taglen, int flags)
{
- if (flags & (FLAG_AVX10_256 | FLAG_AVX10_512))
+ if (flags & FLAG_AVX10_512)
return aes_gcm_dec_final_vaes_avx10(AES_GCM_KEY_AVX10(key),
le_ctr, ghash_acc,
total_aadlen, total_datalen,
tag, taglen);
else if (flags & FLAG_VAES_AVX2)
@@ -1310,11 +1279,11 @@ static int gcm_setkey(struct crypto_aead *tfm, const u8 *raw_key,
/* Compute H^1 * x^-1 */
h = h1;
gf128mul_lle(&h, (const be128 *)x_to_the_minus1);
/* Compute the needed key powers */
- if (flags & (FLAG_AVX10_256 | FLAG_AVX10_512)) {
+ if (flags & FLAG_AVX10_512) {
struct aes_gcm_key_avx10 *k = AES_GCM_KEY_AVX10(key);
for (i = ARRAY_SIZE(k->h_powers) - 1; i >= 0; i--) {
k->h_powers[i][0] = be64_to_cpu(h.b);
k->h_powers[i][1] = be64_to_cpu(h.a);
@@ -1608,15 +1577,10 @@ DEFINE_GCM_ALGS(aesni_avx, FLAG_AVX,
/* aes_gcm_algs_vaes_avx2 */
DEFINE_GCM_ALGS(vaes_avx2, FLAG_VAES_AVX2,
"generic-gcm-vaes-avx2", "rfc4106-gcm-vaes-avx2",
AES_GCM_KEY_VAES_AVX2_SIZE, 600);
-/* aes_gcm_algs_vaes_avx10_256 */
-DEFINE_GCM_ALGS(vaes_avx10_256, FLAG_AVX10_256,
- "generic-gcm-vaes-avx10_256", "rfc4106-gcm-vaes-avx10_256",
- AES_GCM_KEY_AVX10_SIZE, 700);
-
/* aes_gcm_algs_vaes_avx10_512 */
DEFINE_GCM_ALGS(vaes_avx10_512, FLAG_AVX10_512,
"generic-gcm-vaes-avx10_512", "rfc4106-gcm-vaes-avx10_512",
AES_GCM_KEY_AVX10_SIZE, 800);
@@ -1660,15 +1624,10 @@ static int __init register_avx_algs(void)
!boot_cpu_has(X86_FEATURE_BMI2) ||
!cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM |
XFEATURE_MASK_AVX512, NULL))
return 0;
- err = crypto_register_aeads(aes_gcm_algs_vaes_avx10_256,
- ARRAY_SIZE(aes_gcm_algs_vaes_avx10_256));
- if (err)
- return err;
-
if (boot_cpu_has(X86_FEATURE_PREFER_YMM)) {
int i;
for (i = 0; i < ARRAY_SIZE(skcipher_algs_vaes_avx512); i++)
skcipher_algs_vaes_avx512[i].base.cra_priority = 1;
@@ -1700,11 +1659,10 @@ static void unregister_avx_algs(void)
unregister_skciphers(skcipher_algs_aesni_avx);
unregister_aeads(aes_gcm_algs_aesni_avx);
unregister_skciphers(skcipher_algs_vaes_avx2);
unregister_skciphers(skcipher_algs_vaes_avx512);
unregister_aeads(aes_gcm_algs_vaes_avx2);
- unregister_aeads(aes_gcm_algs_vaes_avx10_256);
unregister_aeads(aes_gcm_algs_vaes_avx10_512);
}
#else /* CONFIG_X86_64 */
static struct aead_alg aes_gcm_algs_aesni[0];
--
2.51.0
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 3/8] crypto: x86/aes-gcm - rename avx10 and avx10_512 to avx512
2025-10-02 2:31 [PATCH 0/8] VAES+AVX2 optimized implementation of AES-GCM Eric Biggers
2025-10-02 2:31 ` [PATCH 1/8] crypto: x86/aes-gcm - add VAES+AVX2 optimized code Eric Biggers
2025-10-02 2:31 ` [PATCH 2/8] crypto: x86/aes-gcm - remove VAES+AVX10/256 " Eric Biggers
@ 2025-10-02 2:31 ` Eric Biggers
2025-10-02 2:31 ` [PATCH 4/8] crypto: x86/aes-gcm - clean up AVX512 code to assume 512-bit vectors Eric Biggers
` (7 subsequent siblings)
10 siblings, 0 replies; 20+ messages in thread
From: Eric Biggers @ 2025-10-02 2:31 UTC (permalink / raw)
To: linux-crypto
Cc: linux-kernel, x86, Ard Biesheuvel, Jason A . Donenfeld,
Eric Biggers
With the "avx10_256" code removed and the AVX10 specification having
been changed to basically just be a re-packaged AVX512, the "avx10_512"
name no longer makes sense. Replace it with "avx512".
While doing this, also add the "vaes_" prefix in places that didn't
already have it. The result is that the two VAES optimized
implementations are consistently called vaes_avx2 and vaes_avx512.
(Also drop the "-x86_64" part of the assembly filename, to keep it from
getting too long. There's no 32-bit version of this code, and the fact
that it's 64-bit is unremarkable; it's the norm for new code.)
Note: although aes_gcm_aad_update_vaes_avx512() (previously called
aes_gcm_aad_update_vaes_avx10()) uses at most 256-bit vectors, it still
depends on the AVX512 CPU feature. So its new name is still accurate.
Also, a later commit will make it sometimes use 512-bit vectors anyway.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
---
arch/x86/crypto/Makefile | 4 +-
arch/x86/crypto/aes-gcm-aesni-x86_64.S | 12 +-
arch/x86/crypto/aes-gcm-vaes-avx2.S | 12 +-
...m-avx10-x86_64.S => aes-gcm-vaes-avx512.S} | 92 +++++--------
arch/x86/crypto/aesni-intel_glue.c | 123 +++++++++---------
5 files changed, 105 insertions(+), 138 deletions(-)
rename arch/x86/crypto/{aes-gcm-avx10-x86_64.S => aes-gcm-vaes-avx512.S} (92%)
diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index f6f7b2b8b853e..6409e3009524c 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -45,12 +45,12 @@ aegis128-aesni-y := aegis128-aesni-asm.o aegis128-aesni-glue.o
obj-$(CONFIG_CRYPTO_AES_NI_INTEL) += aesni-intel.o
aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o
aesni-intel-$(CONFIG_64BIT) += aes-ctr-avx-x86_64.o \
aes-gcm-aesni-x86_64.o \
aes-gcm-vaes-avx2.o \
- aes-xts-avx-x86_64.o \
- aes-gcm-avx10-x86_64.o
+ aes-gcm-vaes-avx512.o \
+ aes-xts-avx-x86_64.o
obj-$(CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL) += ghash-clmulni-intel.o
ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o
obj-$(CONFIG_CRYPTO_POLYVAL_CLMUL_NI) += polyval-clmulni.o
diff --git a/arch/x86/crypto/aes-gcm-aesni-x86_64.S b/arch/x86/crypto/aes-gcm-aesni-x86_64.S
index 45940e2883a0f..7c8a8a32bd3c6 100644
--- a/arch/x86/crypto/aes-gcm-aesni-x86_64.S
+++ b/arch/x86/crypto/aes-gcm-aesni-x86_64.S
@@ -59,19 +59,19 @@
//
// The specific CPU feature prerequisites are AES-NI and PCLMULQDQ, plus SSE4.1
// for the *_aesni functions or AVX for the *_aesni_avx ones. (But it seems
// there are no CPUs that support AES-NI without also PCLMULQDQ and SSE4.1.)
//
-// The design generally follows that of aes-gcm-avx10-x86_64.S, and that file is
+// The design generally follows that of aes-gcm-vaes-avx512.S, and that file is
// more thoroughly commented. This file has the following notable changes:
//
// - The vector length is fixed at 128-bit, i.e. xmm registers. This means
// there is only one AES block (and GHASH block) per register.
//
-// - Without AVX512 / AVX10, only 16 SIMD registers are available instead of
-// 32. We work around this by being much more careful about using
-// registers, relying heavily on loads to load values as they are needed.
+// - Without AVX512, only 16 SIMD registers are available instead of 32. We
+// work around this by being much more careful about using registers,
+// relying heavily on loads to load values as they are needed.
//
// - Masking is not available either. We work around this by implementing
// partial block loads and stores using overlapping scalar loads and stores
// combined with shifts and SSE4.1 insertion and extraction instructions.
//
@@ -88,12 +88,12 @@
//
// - We implement the GHASH multiplications in the main loop using Karatsuba
// multiplication instead of schoolbook multiplication. This saves one
// pclmulqdq instruction per block, at the cost of one 64-bit load, one
// pshufd, and 0.25 pxors per block. (This is without the three-argument
-// XOR support that would be provided by AVX512 / AVX10, which would be
-// more beneficial to schoolbook than Karatsuba.)
+// XOR support that would be provided by AVX512, which would be more
+// beneficial to schoolbook than Karatsuba.)
//
// As a rough approximation, we can assume that Karatsuba multiplication is
// faster than schoolbook multiplication in this context if one pshufd and
// 0.25 pxors are cheaper than a pclmulqdq. (We assume that the 64-bit
// load is "free" due to running in parallel with arithmetic instructions.)
diff --git a/arch/x86/crypto/aes-gcm-vaes-avx2.S b/arch/x86/crypto/aes-gcm-vaes-avx2.S
index e628dbb33c0e7..5ccbd85383cdd 100644
--- a/arch/x86/crypto/aes-gcm-vaes-avx2.S
+++ b/arch/x86/crypto/aes-gcm-vaes-avx2.S
@@ -47,16 +47,16 @@
// ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
// POSSIBILITY OF SUCH DAMAGE.
//
// -----------------------------------------------------------------------------
//
-// This is similar to aes-gcm-avx10-x86_64.S, but it uses AVX2 instead of
-// AVX512. This means it can only use 16 vector registers instead of 32, the
-// maximum vector length is 32 bytes, and some instructions such as vpternlogd
-// and masked loads/stores are unavailable. However, it is able to run on CPUs
-// that have VAES without AVX512, namely AMD Zen 3 (including "Milan" server
-// CPUs), various Intel client CPUs such as Alder Lake, and Intel Sierra Forest.
+// This is similar to aes-gcm-vaes-avx512.S, but it uses AVX2 instead of AVX512.
+// This means it can only use 16 vector registers instead of 32, the maximum
+// vector length is 32 bytes, and some instructions such as vpternlogd and
+// masked loads/stores are unavailable. However, it is able to run on CPUs that
+// have VAES without AVX512, namely AMD Zen 3 (including "Milan" server CPUs),
+// various Intel client CPUs such as Alder Lake, and Intel Sierra Forest.
//
// This implementation also uses Karatsuba multiplication instead of schoolbook
// multiplication for GHASH in its main loop. This does not help much on Intel,
// but it improves performance by ~5% on AMD Zen 3. Other factors weighing
// slightly in favor of Karatsuba multiplication in this implementation are the
diff --git a/arch/x86/crypto/aes-gcm-avx10-x86_64.S b/arch/x86/crypto/aes-gcm-vaes-avx512.S
similarity index 92%
rename from arch/x86/crypto/aes-gcm-avx10-x86_64.S
rename to arch/x86/crypto/aes-gcm-vaes-avx512.S
index 4fb04506d7932..be5c14d33acc7 100644
--- a/arch/x86/crypto/aes-gcm-avx10-x86_64.S
+++ b/arch/x86/crypto/aes-gcm-vaes-avx512.S
@@ -1,8 +1,9 @@
/* SPDX-License-Identifier: Apache-2.0 OR BSD-2-Clause */
//
-// VAES and VPCLMULQDQ optimized AES-GCM for x86_64
+// AES-GCM implementation for x86_64 CPUs that support the following CPU
+// features: VAES && VPCLMULQDQ && AVX512BW && AVX512VL && BMI2
//
// Copyright 2024 Google LLC
//
// Author: Eric Biggers <ebiggers@google.com>
//
@@ -43,45 +44,10 @@
// SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
// INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
// CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
// ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
// POSSIBILITY OF SUCH DAMAGE.
-//
-//------------------------------------------------------------------------------
-//
-// This file implements AES-GCM (Galois/Counter Mode) for x86_64 CPUs that
-// support VAES (vector AES), VPCLMULQDQ (vector carryless multiplication), and
-// either AVX512 or AVX10. Some of the functions, notably the encryption and
-// decryption update functions which are the most performance-critical, are
-// provided in two variants generated from a macro: one using 256-bit vectors
-// (suffix: vaes_avx10_256) and one using 512-bit vectors (vaes_avx10_512). The
-// other, "shared" functions (vaes_avx10) use at most 256-bit vectors.
-//
-// The functions that use 512-bit vectors are intended for CPUs that support
-// 512-bit vectors *and* where using them doesn't cause significant
-// downclocking. They require the following CPU features:
-//
-// VAES && VPCLMULQDQ && BMI2 && ((AVX512BW && AVX512VL) || AVX10/512)
-//
-// The other functions require the following CPU features:
-//
-// VAES && VPCLMULQDQ && BMI2 && ((AVX512BW && AVX512VL) || AVX10/256)
-//
-// All functions use the "System V" ABI. The Windows ABI is not supported.
-//
-// Note that we use "avx10" in the names of the functions as a shorthand to
-// really mean "AVX10 or a certain set of AVX512 features". Due to Intel's
-// introduction of AVX512 and then its replacement by AVX10, there doesn't seem
-// to be a simple way to name things that makes sense on all CPUs.
-//
-// Note that the macros that support both 256-bit and 512-bit vectors could
-// fairly easily be changed to support 128-bit too. However, this would *not*
-// be sufficient to allow the code to run on CPUs without AVX512 or AVX10,
-// because the code heavily uses several features of these extensions other than
-// the vector length: the increase in the number of SIMD registers from 16 to
-// 32, masking support, and new instructions such as vpternlogd (which can do a
-// three-argument XOR). These features are very useful for AES-GCM.
#include <linux/linkage.h>
.section .rodata
.p2align 6
@@ -310,11 +276,11 @@
vpclmulqdq $0x01, \mi, \gfpoly, \t0
vpshufd $0x4e, \mi, \mi
vpternlogd $0x96, \t0, \mi, \hi
.endm
-// void aes_gcm_precompute_##suffix(struct aes_gcm_key_avx10 *key);
+// void aes_gcm_precompute_vaes_avx512(struct aes_gcm_key_vaes_avx512 *key);
//
// Given the expanded AES key |key->aes_key|, this function derives the GHASH
// subkey and initializes |key->ghash_key_powers| with powers of it.
//
// The number of key powers initialized is NUM_H_POWERS, and they are stored in
@@ -586,13 +552,13 @@
vmovdqu8 GHASHDATA1, 1*VL(DST)
vmovdqu8 GHASHDATA2, 2*VL(DST)
vmovdqu8 GHASHDATA3, 3*VL(DST)
.endm
-// void aes_gcm_{enc,dec}_update_##suffix(const struct aes_gcm_key_avx10 *key,
-// const u32 le_ctr[4], u8 ghash_acc[16],
-// const u8 *src, u8 *dst, int datalen);
+// void aes_gcm_{enc,dec}_update_vaes_avx512(const struct aes_gcm_key_vaes_avx512 *key,
+// const u32 le_ctr[4], u8 ghash_acc[16],
+// const u8 *src, u8 *dst, int datalen);
//
// This macro generates a GCM encryption or decryption update function with the
// above prototype (with \enc selecting which one). This macro supports both
// VL=32 and VL=64. _set_veclen must have been invoked with the desired length.
//
@@ -942,18 +908,18 @@
vzeroupper // This is needed after using ymm or zmm registers.
RET
.endm
-// void aes_gcm_enc_final_vaes_avx10(const struct aes_gcm_key_avx10 *key,
-// const u32 le_ctr[4], u8 ghash_acc[16],
-// u64 total_aadlen, u64 total_datalen);
-// bool aes_gcm_dec_final_vaes_avx10(const struct aes_gcm_key_avx10 *key,
-// const u32 le_ctr[4],
-// const u8 ghash_acc[16],
-// u64 total_aadlen, u64 total_datalen,
-// const u8 tag[16], int taglen);
+// void aes_gcm_enc_final_vaes_avx512(const struct aes_gcm_key_vaes_avx512 *key,
+// const u32 le_ctr[4], u8 ghash_acc[16],
+// u64 total_aadlen, u64 total_datalen);
+// bool aes_gcm_dec_final_vaes_avx512(const struct aes_gcm_key_vaes_avx512 *key,
+// const u32 le_ctr[4],
+// const u8 ghash_acc[16],
+// u64 total_aadlen, u64 total_datalen,
+// const u8 tag[16], int taglen);
//
// This macro generates one of the above two functions (with \enc selecting
// which one). Both functions finish computing the GCM authentication tag by
// updating GHASH with the lengths block and encrypting the GHASH accumulator.
// |total_aadlen| and |total_datalen| must be the total length of the additional
@@ -1080,23 +1046,23 @@
// No need for vzeroupper here, since only used xmm registers were used.
RET
.endm
_set_veclen 64
-SYM_FUNC_START(aes_gcm_precompute_vaes_avx10_512)
+SYM_FUNC_START(aes_gcm_precompute_vaes_avx512)
_aes_gcm_precompute
-SYM_FUNC_END(aes_gcm_precompute_vaes_avx10_512)
-SYM_FUNC_START(aes_gcm_enc_update_vaes_avx10_512)
+SYM_FUNC_END(aes_gcm_precompute_vaes_avx512)
+SYM_FUNC_START(aes_gcm_enc_update_vaes_avx512)
_aes_gcm_update 1
-SYM_FUNC_END(aes_gcm_enc_update_vaes_avx10_512)
-SYM_FUNC_START(aes_gcm_dec_update_vaes_avx10_512)
+SYM_FUNC_END(aes_gcm_enc_update_vaes_avx512)
+SYM_FUNC_START(aes_gcm_dec_update_vaes_avx512)
_aes_gcm_update 0
-SYM_FUNC_END(aes_gcm_dec_update_vaes_avx10_512)
+SYM_FUNC_END(aes_gcm_dec_update_vaes_avx512)
-// void aes_gcm_aad_update_vaes_avx10(const struct aes_gcm_key_avx10 *key,
-// u8 ghash_acc[16],
-// const u8 *aad, int aadlen);
+// void aes_gcm_aad_update_vaes_avx512(const struct aes_gcm_key_vaes_avx512 *key,
+// u8 ghash_acc[16],
+// const u8 *aad, int aadlen);
//
// This function processes the AAD (Additional Authenticated Data) in GCM.
// Using the key |key|, it updates the GHASH accumulator |ghash_acc| with the
// data given by |aad| and |aadlen|. |key->ghash_key_powers| must have been
// initialized. On the first call, |ghash_acc| must be all zeroes. |aadlen|
@@ -1108,11 +1074,11 @@ SYM_FUNC_END(aes_gcm_dec_update_vaes_avx10_512)
// which uses 256-bit vectors (ymm registers) and only has a 1x-wide loop. This
// keeps the code size down, and it enables some micro-optimizations, e.g. using
// VEX-coded instructions instead of EVEX-coded to save some instruction bytes.
// To optimize for large amounts of AAD, we could implement a 4x-wide loop and
// provide a version using 512-bit vectors, but that doesn't seem to be useful.
-SYM_FUNC_START(aes_gcm_aad_update_vaes_avx10)
+SYM_FUNC_START(aes_gcm_aad_update_vaes_avx512)
// Function arguments
.set KEY, %rdi
.set GHASH_ACC_PTR, %rsi
.set AAD, %rdx
@@ -1176,13 +1142,13 @@ SYM_FUNC_START(aes_gcm_aad_update_vaes_avx10)
// Store the updated GHASH accumulator back to memory.
vmovdqu GHASH_ACC_XMM, (GHASH_ACC_PTR)
vzeroupper // This is needed after using ymm or zmm registers.
RET
-SYM_FUNC_END(aes_gcm_aad_update_vaes_avx10)
+SYM_FUNC_END(aes_gcm_aad_update_vaes_avx512)
-SYM_FUNC_START(aes_gcm_enc_final_vaes_avx10)
+SYM_FUNC_START(aes_gcm_enc_final_vaes_avx512)
_aes_gcm_final 1
-SYM_FUNC_END(aes_gcm_enc_final_vaes_avx10)
-SYM_FUNC_START(aes_gcm_dec_final_vaes_avx10)
+SYM_FUNC_END(aes_gcm_enc_final_vaes_avx512)
+SYM_FUNC_START(aes_gcm_dec_final_vaes_avx512)
_aes_gcm_final 0
-SYM_FUNC_END(aes_gcm_dec_final_vaes_avx10)
+SYM_FUNC_END(aes_gcm_dec_final_vaes_avx512)
diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index 1ed8513208d36..bb6e2c47ffc61 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -902,12 +902,12 @@ struct aes_gcm_key_vaes_avx2 {
#define AES_GCM_KEY_VAES_AVX2(key) \
container_of((key), struct aes_gcm_key_vaes_avx2, base)
#define AES_GCM_KEY_VAES_AVX2_SIZE \
(sizeof(struct aes_gcm_key_vaes_avx2) + (31 & ~(CRYPTO_MINALIGN - 1)))
-/* Key struct used by the VAES + AVX10 implementations of AES-GCM */
-struct aes_gcm_key_avx10 {
+/* Key struct used by the VAES + AVX512 implementation of AES-GCM */
+struct aes_gcm_key_vaes_avx512 {
/*
* Common part of the key. The assembly code prefers 16-byte alignment
* for the round keys; we get this by them being located at the start of
* the struct and the whole struct being 64-byte aligned.
*/
@@ -923,14 +923,14 @@ struct aes_gcm_key_avx10 {
u64 h_powers[16][2] __aligned(64);
/* Three padding blocks required by the assembly code */
u64 padding[3][2];
};
-#define AES_GCM_KEY_AVX10(key) \
- container_of((key), struct aes_gcm_key_avx10, base)
-#define AES_GCM_KEY_AVX10_SIZE \
- (sizeof(struct aes_gcm_key_avx10) + (63 & ~(CRYPTO_MINALIGN - 1)))
+#define AES_GCM_KEY_VAES_AVX512(key) \
+ container_of((key), struct aes_gcm_key_vaes_avx512, base)
+#define AES_GCM_KEY_VAES_AVX512_SIZE \
+ (sizeof(struct aes_gcm_key_vaes_avx512) + (63 & ~(CRYPTO_MINALIGN - 1)))
/*
* These flags are passed to the AES-GCM helper functions to specify the
* specific version of AES-GCM (RFC4106 or not), whether it's encryption or
* decryption, and which assembly functions should be called. Assembly
@@ -939,16 +939,16 @@ struct aes_gcm_key_avx10 {
*/
#define FLAG_RFC4106 BIT(0)
#define FLAG_ENC BIT(1)
#define FLAG_AVX BIT(2)
#define FLAG_VAES_AVX2 BIT(3)
-#define FLAG_AVX10_512 BIT(4)
+#define FLAG_VAES_AVX512 BIT(4)
static inline struct aes_gcm_key *
aes_gcm_key_get(struct crypto_aead *tfm, int flags)
{
- if (flags & FLAG_AVX10_512)
+ if (flags & FLAG_VAES_AVX512)
return PTR_ALIGN(crypto_aead_ctx(tfm), 64);
else if (flags & FLAG_VAES_AVX2)
return PTR_ALIGN(crypto_aead_ctx(tfm), 32);
else
return PTR_ALIGN(crypto_aead_ctx(tfm), 16);
@@ -959,16 +959,16 @@ aes_gcm_precompute_aesni(struct aes_gcm_key_aesni *key);
asmlinkage void
aes_gcm_precompute_aesni_avx(struct aes_gcm_key_aesni *key);
asmlinkage void
aes_gcm_precompute_vaes_avx2(struct aes_gcm_key_vaes_avx2 *key);
asmlinkage void
-aes_gcm_precompute_vaes_avx10_512(struct aes_gcm_key_avx10 *key);
+aes_gcm_precompute_vaes_avx512(struct aes_gcm_key_vaes_avx512 *key);
static void aes_gcm_precompute(struct aes_gcm_key *key, int flags)
{
- if (flags & FLAG_AVX10_512)
- aes_gcm_precompute_vaes_avx10_512(AES_GCM_KEY_AVX10(key));
+ if (flags & FLAG_VAES_AVX512)
+ aes_gcm_precompute_vaes_avx512(AES_GCM_KEY_VAES_AVX512(key));
else if (flags & FLAG_VAES_AVX2)
aes_gcm_precompute_vaes_avx2(AES_GCM_KEY_VAES_AVX2(key));
else if (flags & FLAG_AVX)
aes_gcm_precompute_aesni_avx(AES_GCM_KEY_AESNI(key));
else
@@ -983,19 +983,19 @@ aes_gcm_aad_update_aesni_avx(const struct aes_gcm_key_aesni *key,
u8 ghash_acc[16], const u8 *aad, int aadlen);
asmlinkage void
aes_gcm_aad_update_vaes_avx2(const struct aes_gcm_key_vaes_avx2 *key,
u8 ghash_acc[16], const u8 *aad, int aadlen);
asmlinkage void
-aes_gcm_aad_update_vaes_avx10(const struct aes_gcm_key_avx10 *key,
- u8 ghash_acc[16], const u8 *aad, int aadlen);
+aes_gcm_aad_update_vaes_avx512(const struct aes_gcm_key_vaes_avx512 *key,
+ u8 ghash_acc[16], const u8 *aad, int aadlen);
static void aes_gcm_aad_update(const struct aes_gcm_key *key, u8 ghash_acc[16],
const u8 *aad, int aadlen, int flags)
{
- if (flags & FLAG_AVX10_512)
- aes_gcm_aad_update_vaes_avx10(AES_GCM_KEY_AVX10(key), ghash_acc,
- aad, aadlen);
+ if (flags & FLAG_VAES_AVX512)
+ aes_gcm_aad_update_vaes_avx512(AES_GCM_KEY_VAES_AVX512(key),
+ ghash_acc, aad, aadlen);
else if (flags & FLAG_VAES_AVX2)
aes_gcm_aad_update_vaes_avx2(AES_GCM_KEY_VAES_AVX2(key),
ghash_acc, aad, aadlen);
else if (flags & FLAG_AVX)
aes_gcm_aad_update_aesni_avx(AES_GCM_KEY_AESNI(key), ghash_acc,
@@ -1016,13 +1016,13 @@ aes_gcm_enc_update_aesni_avx(const struct aes_gcm_key_aesni *key,
asmlinkage void
aes_gcm_enc_update_vaes_avx2(const struct aes_gcm_key_vaes_avx2 *key,
const u32 le_ctr[4], u8 ghash_acc[16],
const u8 *src, u8 *dst, int datalen);
asmlinkage void
-aes_gcm_enc_update_vaes_avx10_512(const struct aes_gcm_key_avx10 *key,
- const u32 le_ctr[4], u8 ghash_acc[16],
- const u8 *src, u8 *dst, int datalen);
+aes_gcm_enc_update_vaes_avx512(const struct aes_gcm_key_vaes_avx512 *key,
+ const u32 le_ctr[4], u8 ghash_acc[16],
+ const u8 *src, u8 *dst, int datalen);
asmlinkage void
aes_gcm_dec_update_aesni(const struct aes_gcm_key_aesni *key,
const u32 le_ctr[4], u8 ghash_acc[16],
const u8 *src, u8 *dst, int datalen);
@@ -1033,25 +1033,25 @@ aes_gcm_dec_update_aesni_avx(const struct aes_gcm_key_aesni *key,
asmlinkage void
aes_gcm_dec_update_vaes_avx2(const struct aes_gcm_key_vaes_avx2 *key,
const u32 le_ctr[4], u8 ghash_acc[16],
const u8 *src, u8 *dst, int datalen);
asmlinkage void
-aes_gcm_dec_update_vaes_avx10_512(const struct aes_gcm_key_avx10 *key,
- const u32 le_ctr[4], u8 ghash_acc[16],
- const u8 *src, u8 *dst, int datalen);
+aes_gcm_dec_update_vaes_avx512(const struct aes_gcm_key_vaes_avx512 *key,
+ const u32 le_ctr[4], u8 ghash_acc[16],
+ const u8 *src, u8 *dst, int datalen);
/* __always_inline to optimize out the branches based on @flags */
static __always_inline void
aes_gcm_update(const struct aes_gcm_key *key,
const u32 le_ctr[4], u8 ghash_acc[16],
const u8 *src, u8 *dst, int datalen, int flags)
{
if (flags & FLAG_ENC) {
- if (flags & FLAG_AVX10_512)
- aes_gcm_enc_update_vaes_avx10_512(AES_GCM_KEY_AVX10(key),
- le_ctr, ghash_acc,
- src, dst, datalen);
+ if (flags & FLAG_VAES_AVX512)
+ aes_gcm_enc_update_vaes_avx512(AES_GCM_KEY_VAES_AVX512(key),
+ le_ctr, ghash_acc,
+ src, dst, datalen);
else if (flags & FLAG_VAES_AVX2)
aes_gcm_enc_update_vaes_avx2(AES_GCM_KEY_VAES_AVX2(key),
le_ctr, ghash_acc,
src, dst, datalen);
else if (flags & FLAG_AVX)
@@ -1060,14 +1060,14 @@ aes_gcm_update(const struct aes_gcm_key *key,
src, dst, datalen);
else
aes_gcm_enc_update_aesni(AES_GCM_KEY_AESNI(key), le_ctr,
ghash_acc, src, dst, datalen);
} else {
- if (flags & FLAG_AVX10_512)
- aes_gcm_dec_update_vaes_avx10_512(AES_GCM_KEY_AVX10(key),
- le_ctr, ghash_acc,
- src, dst, datalen);
+ if (flags & FLAG_VAES_AVX512)
+ aes_gcm_dec_update_vaes_avx512(AES_GCM_KEY_VAES_AVX512(key),
+ le_ctr, ghash_acc,
+ src, dst, datalen);
else if (flags & FLAG_VAES_AVX2)
aes_gcm_dec_update_vaes_avx2(AES_GCM_KEY_VAES_AVX2(key),
le_ctr, ghash_acc,
src, dst, datalen);
else if (flags & FLAG_AVX)
@@ -1092,24 +1092,24 @@ aes_gcm_enc_final_aesni_avx(const struct aes_gcm_key_aesni *key,
asmlinkage void
aes_gcm_enc_final_vaes_avx2(const struct aes_gcm_key_vaes_avx2 *key,
const u32 le_ctr[4], u8 ghash_acc[16],
u64 total_aadlen, u64 total_datalen);
asmlinkage void
-aes_gcm_enc_final_vaes_avx10(const struct aes_gcm_key_avx10 *key,
- const u32 le_ctr[4], u8 ghash_acc[16],
- u64 total_aadlen, u64 total_datalen);
+aes_gcm_enc_final_vaes_avx512(const struct aes_gcm_key_vaes_avx512 *key,
+ const u32 le_ctr[4], u8 ghash_acc[16],
+ u64 total_aadlen, u64 total_datalen);
/* __always_inline to optimize out the branches based on @flags */
static __always_inline void
aes_gcm_enc_final(const struct aes_gcm_key *key,
const u32 le_ctr[4], u8 ghash_acc[16],
u64 total_aadlen, u64 total_datalen, int flags)
{
- if (flags & FLAG_AVX10_512)
- aes_gcm_enc_final_vaes_avx10(AES_GCM_KEY_AVX10(key),
- le_ctr, ghash_acc,
- total_aadlen, total_datalen);
+ if (flags & FLAG_VAES_AVX512)
+ aes_gcm_enc_final_vaes_avx512(AES_GCM_KEY_VAES_AVX512(key),
+ le_ctr, ghash_acc,
+ total_aadlen, total_datalen);
else if (flags & FLAG_VAES_AVX2)
aes_gcm_enc_final_vaes_avx2(AES_GCM_KEY_VAES_AVX2(key),
le_ctr, ghash_acc,
total_aadlen, total_datalen);
else if (flags & FLAG_AVX)
@@ -1136,26 +1136,26 @@ asmlinkage bool __must_check
aes_gcm_dec_final_vaes_avx2(const struct aes_gcm_key_vaes_avx2 *key,
const u32 le_ctr[4], const u8 ghash_acc[16],
u64 total_aadlen, u64 total_datalen,
const u8 tag[16], int taglen);
asmlinkage bool __must_check
-aes_gcm_dec_final_vaes_avx10(const struct aes_gcm_key_avx10 *key,
- const u32 le_ctr[4], const u8 ghash_acc[16],
- u64 total_aadlen, u64 total_datalen,
- const u8 tag[16], int taglen);
+aes_gcm_dec_final_vaes_avx512(const struct aes_gcm_key_vaes_avx512 *key,
+ const u32 le_ctr[4], const u8 ghash_acc[16],
+ u64 total_aadlen, u64 total_datalen,
+ const u8 tag[16], int taglen);
/* __always_inline to optimize out the branches based on @flags */
static __always_inline bool __must_check
aes_gcm_dec_final(const struct aes_gcm_key *key, const u32 le_ctr[4],
u8 ghash_acc[16], u64 total_aadlen, u64 total_datalen,
u8 tag[16], int taglen, int flags)
{
- if (flags & FLAG_AVX10_512)
- return aes_gcm_dec_final_vaes_avx10(AES_GCM_KEY_AVX10(key),
- le_ctr, ghash_acc,
- total_aadlen, total_datalen,
- tag, taglen);
+ if (flags & FLAG_VAES_AVX512)
+ return aes_gcm_dec_final_vaes_avx512(AES_GCM_KEY_VAES_AVX512(key),
+ le_ctr, ghash_acc,
+ total_aadlen, total_datalen,
+ tag, taglen);
else if (flags & FLAG_VAES_AVX2)
return aes_gcm_dec_final_vaes_avx2(AES_GCM_KEY_VAES_AVX2(key),
le_ctr, ghash_acc,
total_aadlen, total_datalen,
tag, taglen);
@@ -1243,14 +1243,14 @@ static int gcm_setkey(struct crypto_aead *tfm, const u8 *raw_key,
BUILD_BUG_ON(offsetof(struct aes_gcm_key_aesni, h_times_x64) != 688);
BUILD_BUG_ON(offsetof(struct aes_gcm_key_vaes_avx2, base.aes_key.key_enc) != 0);
BUILD_BUG_ON(offsetof(struct aes_gcm_key_vaes_avx2, base.aes_key.key_length) != 480);
BUILD_BUG_ON(offsetof(struct aes_gcm_key_vaes_avx2, h_powers) != 512);
BUILD_BUG_ON(offsetof(struct aes_gcm_key_vaes_avx2, h_powers_xored) != 640);
- BUILD_BUG_ON(offsetof(struct aes_gcm_key_avx10, base.aes_key.key_enc) != 0);
- BUILD_BUG_ON(offsetof(struct aes_gcm_key_avx10, base.aes_key.key_length) != 480);
- BUILD_BUG_ON(offsetof(struct aes_gcm_key_avx10, h_powers) != 512);
- BUILD_BUG_ON(offsetof(struct aes_gcm_key_avx10, padding) != 768);
+ BUILD_BUG_ON(offsetof(struct aes_gcm_key_vaes_avx512, base.aes_key.key_enc) != 0);
+ BUILD_BUG_ON(offsetof(struct aes_gcm_key_vaes_avx512, base.aes_key.key_length) != 480);
+ BUILD_BUG_ON(offsetof(struct aes_gcm_key_vaes_avx512, h_powers) != 512);
+ BUILD_BUG_ON(offsetof(struct aes_gcm_key_vaes_avx512, padding) != 768);
if (likely(crypto_simd_usable())) {
err = aes_check_keylen(keylen);
if (err)
return err;
@@ -1279,12 +1279,13 @@ static int gcm_setkey(struct crypto_aead *tfm, const u8 *raw_key,
/* Compute H^1 * x^-1 */
h = h1;
gf128mul_lle(&h, (const be128 *)x_to_the_minus1);
/* Compute the needed key powers */
- if (flags & FLAG_AVX10_512) {
- struct aes_gcm_key_avx10 *k = AES_GCM_KEY_AVX10(key);
+ if (flags & FLAG_VAES_AVX512) {
+ struct aes_gcm_key_vaes_avx512 *k =
+ AES_GCM_KEY_VAES_AVX512(key);
for (i = ARRAY_SIZE(k->h_powers) - 1; i >= 0; i--) {
k->h_powers[i][0] = be64_to_cpu(h.b);
k->h_powers[i][1] = be64_to_cpu(h.a);
gf128mul_lle(&h, &h1);
@@ -1577,14 +1578,14 @@ DEFINE_GCM_ALGS(aesni_avx, FLAG_AVX,
/* aes_gcm_algs_vaes_avx2 */
DEFINE_GCM_ALGS(vaes_avx2, FLAG_VAES_AVX2,
"generic-gcm-vaes-avx2", "rfc4106-gcm-vaes-avx2",
AES_GCM_KEY_VAES_AVX2_SIZE, 600);
-/* aes_gcm_algs_vaes_avx10_512 */
-DEFINE_GCM_ALGS(vaes_avx10_512, FLAG_AVX10_512,
- "generic-gcm-vaes-avx10_512", "rfc4106-gcm-vaes-avx10_512",
- AES_GCM_KEY_AVX10_SIZE, 800);
+/* aes_gcm_algs_vaes_avx512 */
+DEFINE_GCM_ALGS(vaes_avx512, FLAG_VAES_AVX512,
+ "generic-gcm-vaes-avx512", "rfc4106-gcm-vaes-avx512",
+ AES_GCM_KEY_VAES_AVX512_SIZE, 800);
static int __init register_avx_algs(void)
{
int err;
@@ -1629,20 +1630,20 @@ static int __init register_avx_algs(void)
if (boot_cpu_has(X86_FEATURE_PREFER_YMM)) {
int i;
for (i = 0; i < ARRAY_SIZE(skcipher_algs_vaes_avx512); i++)
skcipher_algs_vaes_avx512[i].base.cra_priority = 1;
- for (i = 0; i < ARRAY_SIZE(aes_gcm_algs_vaes_avx10_512); i++)
- aes_gcm_algs_vaes_avx10_512[i].base.cra_priority = 1;
+ for (i = 0; i < ARRAY_SIZE(aes_gcm_algs_vaes_avx512); i++)
+ aes_gcm_algs_vaes_avx512[i].base.cra_priority = 1;
}
err = crypto_register_skciphers(skcipher_algs_vaes_avx512,
ARRAY_SIZE(skcipher_algs_vaes_avx512));
if (err)
return err;
- err = crypto_register_aeads(aes_gcm_algs_vaes_avx10_512,
- ARRAY_SIZE(aes_gcm_algs_vaes_avx10_512));
+ err = crypto_register_aeads(aes_gcm_algs_vaes_avx512,
+ ARRAY_SIZE(aes_gcm_algs_vaes_avx512));
if (err)
return err;
return 0;
}
@@ -1659,11 +1660,11 @@ static void unregister_avx_algs(void)
unregister_skciphers(skcipher_algs_aesni_avx);
unregister_aeads(aes_gcm_algs_aesni_avx);
unregister_skciphers(skcipher_algs_vaes_avx2);
unregister_skciphers(skcipher_algs_vaes_avx512);
unregister_aeads(aes_gcm_algs_vaes_avx2);
- unregister_aeads(aes_gcm_algs_vaes_avx10_512);
+ unregister_aeads(aes_gcm_algs_vaes_avx512);
}
#else /* CONFIG_X86_64 */
static struct aead_alg aes_gcm_algs_aesni[0];
static int __init register_avx_algs(void)
--
2.51.0
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 4/8] crypto: x86/aes-gcm - clean up AVX512 code to assume 512-bit vectors
2025-10-02 2:31 [PATCH 0/8] VAES+AVX2 optimized implementation of AES-GCM Eric Biggers
` (2 preceding siblings ...)
2025-10-02 2:31 ` [PATCH 3/8] crypto: x86/aes-gcm - rename avx10 and avx10_512 to avx512 Eric Biggers
@ 2025-10-02 2:31 ` Eric Biggers
2025-10-02 2:31 ` [PATCH 5/8] crypto: x86/aes-gcm - reorder AVX512 precompute and aad_update functions Eric Biggers
` (6 subsequent siblings)
10 siblings, 0 replies; 20+ messages in thread
From: Eric Biggers @ 2025-10-02 2:31 UTC (permalink / raw)
To: linux-crypto
Cc: linux-kernel, x86, Ard Biesheuvel, Jason A . Donenfeld,
Eric Biggers
aes-gcm-vaes-avx512.S (originally aes-gcm-avx10-x86_64.S) was designed
to support multiple maximum vector lengths, while still utilizing AVX512
/ AVX10 features such as the increased number of vector registers.
However, the support for multiple maximum vector lengths turned out to
not be useful. Support for maximum vector lengths other than 512 bits
was removed from the AVX10 specification, which leaves "avoiding
overly-eager downclocking" as the only remaining use case for limiting
AVX512 / AVX10 code to 256-bit vectors. But this issue has gone away in
new CPUs, and the separate VAES+AVX2 code which I ended up having to
write anyway provides nearly as good 256-bit support.
Therefore, clean up this code to not be written in terms of a generic
vector length, but rather just assume 512-bit vectors.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
---
arch/x86/crypto/aes-gcm-vaes-avx512.S | 353 +++++++++++---------------
1 file changed, 153 insertions(+), 200 deletions(-)
diff --git a/arch/x86/crypto/aes-gcm-vaes-avx512.S b/arch/x86/crypto/aes-gcm-vaes-avx512.S
index be5c14d33acc7..3edf829c2ce07 100644
--- a/arch/x86/crypto/aes-gcm-vaes-avx512.S
+++ b/arch/x86/crypto/aes-gcm-vaes-avx512.S
@@ -68,20 +68,18 @@
// Same as above, but with the (1 << 64) bit set.
.Lgfpoly_and_internal_carrybit:
.octa 0xc2000000000000010000000000000001
- // The below constants are used for incrementing the counter blocks.
- // ctr_pattern points to the four 128-bit values [0, 1, 2, 3].
- // inc_2blocks and inc_4blocks point to the single 128-bit values 2 and
- // 4. Note that the same '2' is reused in ctr_pattern and inc_2blocks.
+ // Values needed to prepare the initial vector of counter blocks.
.Lctr_pattern:
.octa 0
.octa 1
-.Linc_2blocks:
.octa 2
.octa 3
+
+ // The number of AES blocks per vector, as a 128-bit value.
.Linc_4blocks:
.octa 4
// Number of powers of the hash key stored in the key struct. The powers are
// stored from highest (H^NUM_H_POWERS) to lowest (H^1).
@@ -94,33 +92,17 @@
#define OFFSETOF_H_POWERS 512
// Offset to end of hash key powers array in the key struct.
//
// This is immediately followed by three zeroized padding blocks, which are
-// included so that partial vectors can be handled more easily. E.g. if VL=64
-// and two blocks remain, we load the 4 values [H^2, H^1, 0, 0]. The most
-// padding blocks needed is 3, which occurs if [H^1, 0, 0, 0] is loaded.
+// included so that partial vectors can be handled more easily. E.g. if two
+// blocks remain, we load the 4 values [H^2, H^1, 0, 0]. The most padding
+// blocks needed is 3, which occurs if [H^1, 0, 0, 0] is loaded.
#define OFFSETOFEND_H_POWERS (OFFSETOF_H_POWERS + (NUM_H_POWERS * 16))
.text
-// Set the vector length in bytes. This sets the VL variable and defines
-// register aliases V0-V31 that map to the ymm or zmm registers.
-.macro _set_veclen vl
- .set VL, \vl
-.irp i, 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15, \
- 16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
-.if VL == 32
- .set V\i, %ymm\i
-.elseif VL == 64
- .set V\i, %zmm\i
-.else
- .error "Unsupported vector length"
-.endif
-.endr
-.endm
-
// The _ghash_mul_step macro does one step of GHASH multiplication of the
// 128-bit lanes of \a by the corresponding 128-bit lanes of \b and storing the
// reduced products in \dst. \t0, \t1, and \t2 are temporary registers of the
// same size as \a and \b. To complete all steps, this must invoked with \i=0
// through \i=9. The division into steps allows users of this macro to
@@ -284,35 +266,31 @@
// subkey and initializes |key->ghash_key_powers| with powers of it.
//
// The number of key powers initialized is NUM_H_POWERS, and they are stored in
// the order H^NUM_H_POWERS to H^1. The zeroized padding blocks after the key
// powers themselves are also initialized.
-//
-// This macro supports both VL=32 and VL=64. _set_veclen must have been invoked
-// with the desired length. In the VL=32 case, the function computes twice as
-// many key powers than are actually used by the VL=32 GCM update functions.
-// This is done to keep the key format the same regardless of vector length.
.macro _aes_gcm_precompute
// Function arguments
.set KEY, %rdi
- // Additional local variables. V0-V2 and %rax are used as temporaries.
+ // Additional local variables.
+ // %zmm[0-2] and %rax are used as temporaries.
.set POWERS_PTR, %rsi
.set RNDKEYLAST_PTR, %rdx
- .set H_CUR, V3
+ .set H_CUR, %zmm3
.set H_CUR_YMM, %ymm3
.set H_CUR_XMM, %xmm3
- .set H_INC, V4
+ .set H_INC, %zmm4
.set H_INC_YMM, %ymm4
.set H_INC_XMM, %xmm4
- .set GFPOLY, V5
+ .set GFPOLY, %zmm5
.set GFPOLY_YMM, %ymm5
.set GFPOLY_XMM, %xmm5
// Get pointer to lowest set of key powers (located at end of array).
- lea OFFSETOFEND_H_POWERS-VL(KEY), POWERS_PTR
+ lea OFFSETOFEND_H_POWERS-64(KEY), POWERS_PTR
// Encrypt an all-zeroes block to get the raw hash subkey.
movl OFFSETOF_AESKEYLEN(KEY), %eax
lea 6*16(KEY,%rax,4), RNDKEYLAST_PTR
vmovdqu (KEY), %xmm0 // Zero-th round key XOR all-zeroes block
@@ -327,12 +305,12 @@
// Reflect the bytes of the raw hash subkey.
vpshufb .Lbswap_mask(%rip), %xmm0, H_CUR_XMM
// Zeroize the padding blocks.
vpxor %xmm0, %xmm0, %xmm0
- vmovdqu %ymm0, VL(POWERS_PTR)
- vmovdqu %xmm0, VL+2*16(POWERS_PTR)
+ vmovdqu %ymm0, 64(POWERS_PTR)
+ vmovdqu %xmm0, 64+2*16(POWERS_PTR)
// Finish preprocessing the first key power, H^1. Since this GHASH
// implementation operates directly on values with the backwards bit
// order specified by the GCM standard, it's necessary to preprocess the
// raw key as follows. First, reflect its bytes. Second, multiply it
@@ -368,29 +346,26 @@
// Create H_CUR_YMM = [H^2, H^1] and H_INC_YMM = [H^2, H^2].
vinserti128 $1, H_CUR_XMM, H_INC_YMM, H_CUR_YMM
vinserti128 $1, H_INC_XMM, H_INC_YMM, H_INC_YMM
-.if VL == 64
// Create H_CUR = [H^4, H^3, H^2, H^1] and H_INC = [H^4, H^4, H^4, H^4].
_ghash_mul H_INC_YMM, H_CUR_YMM, H_INC_YMM, GFPOLY_YMM, \
%ymm0, %ymm1, %ymm2
vinserti64x4 $1, H_CUR_YMM, H_INC, H_CUR
vshufi64x2 $0, H_INC, H_INC, H_INC
-.endif
// Store the lowest set of key powers.
vmovdqu8 H_CUR, (POWERS_PTR)
- // Compute and store the remaining key powers. With VL=32, repeatedly
- // multiply [H^(i+1), H^i] by [H^2, H^2] to get [H^(i+3), H^(i+2)].
- // With VL=64, repeatedly multiply [H^(i+3), H^(i+2), H^(i+1), H^i] by
+ // Compute and store the remaining key powers.
+ // Repeatedly multiply [H^(i+3), H^(i+2), H^(i+1), H^i] by
// [H^4, H^4, H^4, H^4] to get [H^(i+7), H^(i+6), H^(i+5), H^(i+4)].
- mov $(NUM_H_POWERS*16/VL) - 1, %eax
+ mov $3, %eax
.Lprecompute_next\@:
- sub $VL, POWERS_PTR
- _ghash_mul H_INC, H_CUR, H_CUR, GFPOLY, V0, V1, V2
+ sub $64, POWERS_PTR
+ _ghash_mul H_INC, H_CUR, H_CUR, GFPOLY, %zmm0, %zmm1, %zmm2
vmovdqu8 H_CUR, (POWERS_PTR)
dec %eax
jnz .Lprecompute_next\@
vzeroupper // This is needed after using ymm or zmm registers.
@@ -399,20 +374,14 @@
// XOR together the 128-bit lanes of \src (whose low lane is \src_xmm) and store
// the result in \dst_xmm. This implicitly zeroizes the other lanes of dst.
.macro _horizontal_xor src, src_xmm, dst_xmm, t0_xmm, t1_xmm, t2_xmm
vextracti32x4 $1, \src, \t0_xmm
-.if VL == 32
- vpxord \t0_xmm, \src_xmm, \dst_xmm
-.elseif VL == 64
vextracti32x4 $2, \src, \t1_xmm
vextracti32x4 $3, \src, \t2_xmm
vpxord \t0_xmm, \src_xmm, \dst_xmm
vpternlogd $0x96, \t1_xmm, \t2_xmm, \dst_xmm
-.else
- .error "Unsupported vector length"
-.endif
.endm
// Do one step of the GHASH update of the data blocks given in the vector
// registers GHASHDATA[0-3]. \i specifies the step to do, 0 through 9. The
// division into steps allows users of this macro to optionally interleave the
@@ -422,29 +391,25 @@
// GHASHTMP[0-2] as temporaries. This macro handles the byte-reflection of the
// data blocks. The parameter registers must be preserved across steps.
//
// The GHASH update does: GHASH_ACC = H_POW4*(GHASHDATA0 + GHASH_ACC) +
// H_POW3*GHASHDATA1 + H_POW2*GHASHDATA2 + H_POW1*GHASHDATA3, where the
-// operations are vectorized operations on vectors of 16-byte blocks. E.g.,
-// with VL=32 there are 2 blocks per vector and the vectorized terms correspond
-// to the following non-vectorized terms:
-//
-// H_POW4*(GHASHDATA0 + GHASH_ACC) => H^8*(blk0 + GHASH_ACC_XMM) and H^7*(blk1 + 0)
-// H_POW3*GHASHDATA1 => H^6*blk2 and H^5*blk3
-// H_POW2*GHASHDATA2 => H^4*blk4 and H^3*blk5
-// H_POW1*GHASHDATA3 => H^2*blk6 and H^1*blk7
+// operations are vectorized operations on 512-bit vectors of 128-bit blocks.
+// The vectorized terms correspond to the following non-vectorized terms:
//
-// With VL=64, we use 4 blocks/vector, H^16 through H^1, and blk0 through blk15.
+// H_POW4*(GHASHDATA0 + GHASH_ACC) => H^16*(blk0 + GHASH_ACC_XMM),
+// H^15*(blk1 + 0), H^14*(blk2 + 0), and H^13*(blk3 + 0)
+// H_POW3*GHASHDATA1 => H^12*blk4, H^11*blk5, H^10*blk6, and H^9*blk7
+// H_POW2*GHASHDATA2 => H^8*blk8, H^7*blk9, H^6*blk10, and H^5*blk11
+// H_POW1*GHASHDATA3 => H^4*blk12, H^3*blk13, H^2*blk14, and H^1*blk15
//
// More concretely, this code does:
// - Do vectorized "schoolbook" multiplications to compute the intermediate
// 256-bit product of each block and its corresponding hash key power.
-// There are 4*VL/16 of these intermediate products.
-// - Sum (XOR) the intermediate 256-bit products across vectors. This leaves
-// VL/16 256-bit intermediate values.
+// - Sum (XOR) the intermediate 256-bit products across vectors.
// - Do a vectorized reduction of these 256-bit intermediate values to
-// 128-bits each. This leaves VL/16 128-bit intermediate values.
+// 128-bits each.
// - Sum (XOR) these values and store the 128-bit result in GHASH_ACC_XMM.
//
// See _ghash_mul_step for the full explanation of the operations performed for
// each individual finite field multiplication and reduction.
.macro _ghash_step_4x i
@@ -496,78 +461,76 @@
_horizontal_xor GHASH_ACC, GHASH_ACC_XMM, GHASH_ACC_XMM, \
GHASHDATA0_XMM, GHASHDATA1_XMM, GHASHDATA2_XMM
.endif
.endm
-// Do one non-last round of AES encryption on the counter blocks in V0-V3 using
-// the round key that has been broadcast to all 128-bit lanes of \round_key.
+// Do one non-last round of AES encryption on the blocks in %zmm[0-3] using the
+// round key that has been broadcast to all 128-bit lanes of \round_key.
.macro _vaesenc_4x round_key
- vaesenc \round_key, V0, V0
- vaesenc \round_key, V1, V1
- vaesenc \round_key, V2, V2
- vaesenc \round_key, V3, V3
+ vaesenc \round_key, %zmm0, %zmm0
+ vaesenc \round_key, %zmm1, %zmm1
+ vaesenc \round_key, %zmm2, %zmm2
+ vaesenc \round_key, %zmm3, %zmm3
.endm
// Start the AES encryption of four vectors of counter blocks.
.macro _ctr_begin_4x
// Increment LE_CTR four times to generate four vectors of little-endian
- // counter blocks, swap each to big-endian, and store them in V0-V3.
- vpshufb BSWAP_MASK, LE_CTR, V0
+ // counter blocks, swap each to big-endian, and store them in %zmm[0-3].
+ vpshufb BSWAP_MASK, LE_CTR, %zmm0
vpaddd LE_CTR_INC, LE_CTR, LE_CTR
- vpshufb BSWAP_MASK, LE_CTR, V1
+ vpshufb BSWAP_MASK, LE_CTR, %zmm1
vpaddd LE_CTR_INC, LE_CTR, LE_CTR
- vpshufb BSWAP_MASK, LE_CTR, V2
+ vpshufb BSWAP_MASK, LE_CTR, %zmm2
vpaddd LE_CTR_INC, LE_CTR, LE_CTR
- vpshufb BSWAP_MASK, LE_CTR, V3
+ vpshufb BSWAP_MASK, LE_CTR, %zmm3
vpaddd LE_CTR_INC, LE_CTR, LE_CTR
// AES "round zero": XOR in the zero-th round key.
- vpxord RNDKEY0, V0, V0
- vpxord RNDKEY0, V1, V1
- vpxord RNDKEY0, V2, V2
- vpxord RNDKEY0, V3, V3
+ vpxord RNDKEY0, %zmm0, %zmm0
+ vpxord RNDKEY0, %zmm1, %zmm1
+ vpxord RNDKEY0, %zmm2, %zmm2
+ vpxord RNDKEY0, %zmm3, %zmm3
.endm
-// Do the last AES round for four vectors of counter blocks V0-V3, XOR source
-// data with the resulting keystream, and write the result to DST and
+// Do the last AES round for four vectors of counter blocks %zmm[0-3], XOR
+// source data with the resulting keystream, and write the result to DST and
// GHASHDATA[0-3]. (Implementation differs slightly, but has the same effect.)
.macro _aesenclast_and_xor_4x
// XOR the source data with the last round key, saving the result in
// GHASHDATA[0-3]. This reduces latency by taking advantage of the
// property vaesenclast(key, a) ^ b == vaesenclast(key ^ b, a).
- vpxord 0*VL(SRC), RNDKEYLAST, GHASHDATA0
- vpxord 1*VL(SRC), RNDKEYLAST, GHASHDATA1
- vpxord 2*VL(SRC), RNDKEYLAST, GHASHDATA2
- vpxord 3*VL(SRC), RNDKEYLAST, GHASHDATA3
+ vpxord 0*64(SRC), RNDKEYLAST, GHASHDATA0
+ vpxord 1*64(SRC), RNDKEYLAST, GHASHDATA1
+ vpxord 2*64(SRC), RNDKEYLAST, GHASHDATA2
+ vpxord 3*64(SRC), RNDKEYLAST, GHASHDATA3
// Do the last AES round. This handles the XOR with the source data
// too, as per the optimization described above.
- vaesenclast GHASHDATA0, V0, GHASHDATA0
- vaesenclast GHASHDATA1, V1, GHASHDATA1
- vaesenclast GHASHDATA2, V2, GHASHDATA2
- vaesenclast GHASHDATA3, V3, GHASHDATA3
+ vaesenclast GHASHDATA0, %zmm0, GHASHDATA0
+ vaesenclast GHASHDATA1, %zmm1, GHASHDATA1
+ vaesenclast GHASHDATA2, %zmm2, GHASHDATA2
+ vaesenclast GHASHDATA3, %zmm3, GHASHDATA3
// Store the en/decrypted data to DST.
- vmovdqu8 GHASHDATA0, 0*VL(DST)
- vmovdqu8 GHASHDATA1, 1*VL(DST)
- vmovdqu8 GHASHDATA2, 2*VL(DST)
- vmovdqu8 GHASHDATA3, 3*VL(DST)
+ vmovdqu8 GHASHDATA0, 0*64(DST)
+ vmovdqu8 GHASHDATA1, 1*64(DST)
+ vmovdqu8 GHASHDATA2, 2*64(DST)
+ vmovdqu8 GHASHDATA3, 3*64(DST)
.endm
// void aes_gcm_{enc,dec}_update_vaes_avx512(const struct aes_gcm_key_vaes_avx512 *key,
// const u32 le_ctr[4], u8 ghash_acc[16],
// const u8 *src, u8 *dst, int datalen);
//
// This macro generates a GCM encryption or decryption update function with the
-// above prototype (with \enc selecting which one). This macro supports both
-// VL=32 and VL=64. _set_veclen must have been invoked with the desired length.
-//
-// This function computes the next portion of the CTR keystream, XOR's it with
-// |datalen| bytes from |src|, and writes the resulting encrypted or decrypted
-// data to |dst|. It also updates the GHASH accumulator |ghash_acc| using the
-// next |datalen| ciphertext bytes.
+// above prototype (with \enc selecting which one). The function computes the
+// next portion of the CTR keystream, XOR's it with |datalen| bytes from |src|,
+// and writes the resulting encrypted or decrypted data to |dst|. It also
+// updates the GHASH accumulator |ghash_acc| using the next |datalen| ciphertext
+// bytes.
//
// |datalen| must be a multiple of 16, except on the last call where it can be
// any length. The caller must do any buffering needed to ensure this. Both
// in-place and out-of-place en/decryption are supported.
//
@@ -598,73 +561,73 @@
.set AESKEYLEN64, %r10
// Pointer to the last AES round key for the chosen AES variant
.set RNDKEYLAST_PTR, %r11
- // In the main loop, V0-V3 are used as AES input and output. Elsewhere
- // they are used as temporary registers.
+ // In the main loop, %zmm[0-3] are used as AES input and output.
+ // Elsewhere they are used as temporary registers.
// GHASHDATA[0-3] hold the ciphertext blocks and GHASH input data.
- .set GHASHDATA0, V4
+ .set GHASHDATA0, %zmm4
.set GHASHDATA0_XMM, %xmm4
- .set GHASHDATA1, V5
+ .set GHASHDATA1, %zmm5
.set GHASHDATA1_XMM, %xmm5
- .set GHASHDATA2, V6
+ .set GHASHDATA2, %zmm6
.set GHASHDATA2_XMM, %xmm6
- .set GHASHDATA3, V7
+ .set GHASHDATA3, %zmm7
// BSWAP_MASK is the shuffle mask for byte-reflecting 128-bit values
// using vpshufb, copied to all 128-bit lanes.
- .set BSWAP_MASK, V8
+ .set BSWAP_MASK, %zmm8
// RNDKEY temporarily holds the next AES round key.
- .set RNDKEY, V9
+ .set RNDKEY, %zmm9
// GHASH_ACC is the accumulator variable for GHASH. When fully reduced,
// only the lowest 128-bit lane can be nonzero. When not fully reduced,
// more than one lane may be used, and they need to be XOR'd together.
- .set GHASH_ACC, V10
+ .set GHASH_ACC, %zmm10
.set GHASH_ACC_XMM, %xmm10
// LE_CTR_INC is the vector of 32-bit words that need to be added to a
// vector of little-endian counter blocks to advance it forwards.
- .set LE_CTR_INC, V11
+ .set LE_CTR_INC, %zmm11
// LE_CTR contains the next set of little-endian counter blocks.
- .set LE_CTR, V12
+ .set LE_CTR, %zmm12
// RNDKEY0, RNDKEYLAST, and RNDKEY_M[9-1] contain cached AES round keys,
// copied to all 128-bit lanes. RNDKEY0 is the zero-th round key,
// RNDKEYLAST the last, and RNDKEY_M\i the one \i-th from the last.
- .set RNDKEY0, V13
- .set RNDKEYLAST, V14
- .set RNDKEY_M9, V15
- .set RNDKEY_M8, V16
- .set RNDKEY_M7, V17
- .set RNDKEY_M6, V18
- .set RNDKEY_M5, V19
- .set RNDKEY_M4, V20
- .set RNDKEY_M3, V21
- .set RNDKEY_M2, V22
- .set RNDKEY_M1, V23
+ .set RNDKEY0, %zmm13
+ .set RNDKEYLAST, %zmm14
+ .set RNDKEY_M9, %zmm15
+ .set RNDKEY_M8, %zmm16
+ .set RNDKEY_M7, %zmm17
+ .set RNDKEY_M6, %zmm18
+ .set RNDKEY_M5, %zmm19
+ .set RNDKEY_M4, %zmm20
+ .set RNDKEY_M3, %zmm21
+ .set RNDKEY_M2, %zmm22
+ .set RNDKEY_M1, %zmm23
// GHASHTMP[0-2] are temporary variables used by _ghash_step_4x. These
// cannot coincide with anything used for AES encryption, since for
// performance reasons GHASH and AES encryption are interleaved.
- .set GHASHTMP0, V24
- .set GHASHTMP1, V25
- .set GHASHTMP2, V26
+ .set GHASHTMP0, %zmm24
+ .set GHASHTMP1, %zmm25
+ .set GHASHTMP2, %zmm26
- // H_POW[4-1] contain the powers of the hash key H^(4*VL/16)...H^1. The
+ // H_POW[4-1] contain the powers of the hash key H^16...H^1. The
// descending numbering reflects the order of the key powers.
- .set H_POW4, V27
- .set H_POW3, V28
- .set H_POW2, V29
- .set H_POW1, V30
+ .set H_POW4, %zmm27
+ .set H_POW3, %zmm28
+ .set H_POW2, %zmm29
+ .set H_POW1, %zmm30
// GFPOLY contains the .Lgfpoly constant, copied to all 128-bit lanes.
- .set GFPOLY, V31
+ .set GFPOLY, %zmm31
// Load some constants.
vbroadcasti32x4 .Lbswap_mask(%rip), BSWAP_MASK
vbroadcasti32x4 .Lgfpoly(%rip), GFPOLY
@@ -683,33 +646,27 @@
vbroadcasti32x4 (RNDKEYLAST_PTR), RNDKEYLAST
// Finish initializing LE_CTR by adding [0, 1, ...] to its low words.
vpaddd .Lctr_pattern(%rip), LE_CTR, LE_CTR
- // Initialize LE_CTR_INC to contain VL/16 in all 128-bit lanes.
-.if VL == 32
- vbroadcasti32x4 .Linc_2blocks(%rip), LE_CTR_INC
-.elseif VL == 64
+ // Load 4 into all 128-bit lanes of LE_CTR_INC.
vbroadcasti32x4 .Linc_4blocks(%rip), LE_CTR_INC
-.else
- .error "Unsupported vector length"
-.endif
- // If there are at least 4*VL bytes of data, then continue into the loop
- // that processes 4*VL bytes of data at a time. Otherwise skip it.
+ // If there are at least 256 bytes of data, then continue into the loop
+ // that processes 256 bytes of data at a time. Otherwise skip it.
//
- // Pre-subtracting 4*VL from DATALEN saves an instruction from the main
+ // Pre-subtracting 256 from DATALEN saves an instruction from the main
// loop and also ensures that at least one write always occurs to
// DATALEN, zero-extending it and allowing DATALEN64 to be used later.
- add $-4*VL, DATALEN // shorter than 'sub 4*VL' when VL=32
+ sub $256, DATALEN
jl .Lcrypt_loop_4x_done\@
// Load powers of the hash key.
- vmovdqu8 OFFSETOFEND_H_POWERS-4*VL(KEY), H_POW4
- vmovdqu8 OFFSETOFEND_H_POWERS-3*VL(KEY), H_POW3
- vmovdqu8 OFFSETOFEND_H_POWERS-2*VL(KEY), H_POW2
- vmovdqu8 OFFSETOFEND_H_POWERS-1*VL(KEY), H_POW1
+ vmovdqu8 OFFSETOFEND_H_POWERS-4*64(KEY), H_POW4
+ vmovdqu8 OFFSETOFEND_H_POWERS-3*64(KEY), H_POW3
+ vmovdqu8 OFFSETOFEND_H_POWERS-2*64(KEY), H_POW2
+ vmovdqu8 OFFSETOFEND_H_POWERS-1*64(KEY), H_POW1
// Main loop: en/decrypt and hash 4 vectors at a time.
//
// When possible, interleave the AES encryption of the counter blocks
// with the GHASH update of the ciphertext blocks. This improves
@@ -734,13 +691,13 @@
_vaesenc_4x RNDKEY
add $16, %rax
cmp %rax, RNDKEYLAST_PTR
jne 1b
_aesenclast_and_xor_4x
- sub $-4*VL, SRC // shorter than 'add 4*VL' when VL=32
- sub $-4*VL, DST
- add $-4*VL, DATALEN
+ add $256, SRC
+ add $256, DST
+ sub $256, DATALEN
jl .Lghash_last_ciphertext_4x\@
.endif
// Cache as many additional AES round keys as possible.
.irp i, 9,8,7,6,5,4,3,2,1
@@ -750,14 +707,14 @@
.Lcrypt_loop_4x\@:
// If decrypting, load more ciphertext blocks into GHASHDATA[0-3]. If
// encrypting, GHASHDATA[0-3] already contain the previous ciphertext.
.if !\enc
- vmovdqu8 0*VL(SRC), GHASHDATA0
- vmovdqu8 1*VL(SRC), GHASHDATA1
- vmovdqu8 2*VL(SRC), GHASHDATA2
- vmovdqu8 3*VL(SRC), GHASHDATA3
+ vmovdqu8 0*64(SRC), GHASHDATA0
+ vmovdqu8 1*64(SRC), GHASHDATA1
+ vmovdqu8 2*64(SRC), GHASHDATA2
+ vmovdqu8 3*64(SRC), GHASHDATA3
.endif
// Start the AES encryption of the counter blocks.
_ctr_begin_4x
cmp $24, AESKEYLEN
@@ -773,21 +730,22 @@
_vaesenc_4x RNDKEY
vbroadcasti32x4 -10*16(RNDKEYLAST_PTR), RNDKEY
_vaesenc_4x RNDKEY
128:
- // Finish the AES encryption of the counter blocks in V0-V3, interleaved
- // with the GHASH update of the ciphertext blocks in GHASHDATA[0-3].
+ // Finish the AES encryption of the counter blocks in %zmm[0-3],
+ // interleaved with the GHASH update of the ciphertext blocks in
+ // GHASHDATA[0-3].
.irp i, 9,8,7,6,5,4,3,2,1
_ghash_step_4x (9 - \i)
_vaesenc_4x RNDKEY_M\i
.endr
_ghash_step_4x 9
_aesenclast_and_xor_4x
- sub $-4*VL, SRC // shorter than 'add 4*VL' when VL=32
- sub $-4*VL, DST
- add $-4*VL, DATALEN
+ add $256, SRC
+ add $256, DST
+ sub $256, DATALEN
jge .Lcrypt_loop_4x\@
.if \enc
.Lghash_last_ciphertext_4x\@:
// Update GHASH with the last set of ciphertext blocks.
@@ -796,25 +754,26 @@
.endr
.endif
.Lcrypt_loop_4x_done\@:
- // Undo the extra subtraction by 4*VL and check whether data remains.
- sub $-4*VL, DATALEN // shorter than 'add 4*VL' when VL=32
+ // Undo the extra subtraction by 256 and check whether data remains.
+ add $256, DATALEN
jz .Ldone\@
- // The data length isn't a multiple of 4*VL. Process the remaining data
- // of length 1 <= DATALEN < 4*VL, up to one vector (VL bytes) at a time.
- // Going one vector at a time may seem inefficient compared to having
- // separate code paths for each possible number of vectors remaining.
- // However, using a loop keeps the code size down, and it performs
-	// surprisingly well; modern CPUs will start executing the next iteration
- // before the previous one finishes and also predict the number of loop
- // iterations. For a similar reason, we roll up the AES rounds.
+ // The data length isn't a multiple of 256 bytes. Process the remaining
+ // data of length 1 <= DATALEN < 256, up to one 64-byte vector at a
+ // time. Going one vector at a time may seem inefficient compared to
+ // having separate code paths for each possible number of vectors
+ // remaining. However, using a loop keeps the code size down, and it
+	// performs surprisingly well; modern CPUs will start executing the next
+ // iteration before the previous one finishes and also predict the
+ // number of loop iterations. For a similar reason, we roll up the AES
+ // rounds.
//
- // On the last iteration, the remaining length may be less than VL.
- // Handle this using masking.
+ // On the last iteration, the remaining length may be less than 64
+ // bytes. Handle this using masking.
//
// Since there are enough key powers available for all remaining data,
// there is no need to do a GHASH reduction after each iteration.
// Instead, multiply each remaining block by its own key power, and only
// do a GHASH reduction at the very end.
@@ -839,69 +798,64 @@
vpxor HI_XMM, HI_XMM, HI_XMM
.Lcrypt_loop_1x\@:
// Select the appropriate mask for this iteration: all 1's if
- // DATALEN >= VL, otherwise DATALEN 1's. Do this branchlessly using the
+ // DATALEN >= 64, otherwise DATALEN 1's. Do this branchlessly using the
// bzhi instruction from BMI2. (This relies on DATALEN <= 255.)
-.if VL < 64
- mov $-1, %eax
- bzhi DATALEN, %eax, %eax
- kmovd %eax, %k1
-.else
mov $-1, %rax
bzhi DATALEN64, %rax, %rax
kmovq %rax, %k1
-.endif
// Encrypt a vector of counter blocks. This does not need to be masked.
- vpshufb BSWAP_MASK, LE_CTR, V0
+ vpshufb BSWAP_MASK, LE_CTR, %zmm0
vpaddd LE_CTR_INC, LE_CTR, LE_CTR
- vpxord RNDKEY0, V0, V0
+ vpxord RNDKEY0, %zmm0, %zmm0
lea 16(KEY), %rax
1:
vbroadcasti32x4 (%rax), RNDKEY
- vaesenc RNDKEY, V0, V0
+ vaesenc RNDKEY, %zmm0, %zmm0
add $16, %rax
cmp %rax, RNDKEYLAST_PTR
jne 1b
- vaesenclast RNDKEYLAST, V0, V0
+ vaesenclast RNDKEYLAST, %zmm0, %zmm0
// XOR the data with the appropriate number of keystream bytes.
- vmovdqu8 (SRC), V1{%k1}{z}
- vpxord V1, V0, V0
- vmovdqu8 V0, (DST){%k1}
+ vmovdqu8 (SRC), %zmm1{%k1}{z}
+ vpxord %zmm1, %zmm0, %zmm0
+ vmovdqu8 %zmm0, (DST){%k1}
// Update GHASH with the ciphertext block(s), without reducing.
//
- // In the case of DATALEN < VL, the ciphertext is zero-padded to VL.
- // (If decrypting, it's done by the above masked load. If encrypting,
- // it's done by the below masked register-to-register move.) Note that
- // if DATALEN <= VL - 16, there will be additional padding beyond the
- // padding of the last block specified by GHASH itself; i.e., there may
- // be whole block(s) that get processed by the GHASH multiplication and
- // reduction instructions but should not actually be included in the
+ // In the case of DATALEN < 64, the ciphertext is zero-padded to 64
+ // bytes. (If decrypting, it's done by the above masked load. If
+ // encrypting, it's done by the below masked register-to-register move.)
+ // Note that if DATALEN <= 48, there will be additional padding beyond
+ // the padding of the last block specified by GHASH itself; i.e., there
+ // may be whole block(s) that get processed by the GHASH multiplication
+ // and reduction instructions but should not actually be included in the
// GHASH. However, any such blocks are all-zeroes, and the values that
// they're multiplied with are also all-zeroes. Therefore they just add
// 0 * 0 = 0 to the final GHASH result, which makes no difference.
vmovdqu8 (POWERS_PTR), H_POW1
.if \enc
- vmovdqu8 V0, V1{%k1}{z}
+ vmovdqu8 %zmm0, %zmm1{%k1}{z}
.endif
- vpshufb BSWAP_MASK, V1, V0
- vpxord GHASH_ACC, V0, V0
- _ghash_mul_noreduce H_POW1, V0, LO, MI, HI, GHASHDATA3, V1, V2, V3
+ vpshufb BSWAP_MASK, %zmm1, %zmm0
+ vpxord GHASH_ACC, %zmm0, %zmm0
+ _ghash_mul_noreduce H_POW1, %zmm0, LO, MI, HI, \
+ GHASHDATA3, %zmm1, %zmm2, %zmm3
vpxor GHASH_ACC_XMM, GHASH_ACC_XMM, GHASH_ACC_XMM
- add $VL, POWERS_PTR
- add $VL, SRC
- add $VL, DST
- sub $VL, DATALEN
+ add $64, POWERS_PTR
+ add $64, SRC
+ add $64, DST
+ sub $64, DATALEN
jg .Lcrypt_loop_1x\@
// Finally, do the GHASH reduction.
- _ghash_reduce LO, MI, HI, GFPOLY, V0
+ _ghash_reduce LO, MI, HI, GFPOLY, %zmm0
_horizontal_xor HI, HI_XMM, GHASH_ACC_XMM, %xmm0, %xmm1, %xmm2
.Ldone\@:
// Store the updated GHASH accumulator back to memory.
vmovdqu GHASH_ACC_XMM, (GHASH_ACC_PTR)
@@ -1045,11 +999,10 @@
.endif
// No need for vzeroupper here, since only used xmm registers were used.
RET
.endm
-_set_veclen 64
SYM_FUNC_START(aes_gcm_precompute_vaes_avx512)
_aes_gcm_precompute
SYM_FUNC_END(aes_gcm_precompute_vaes_avx512)
SYM_FUNC_START(aes_gcm_enc_update_vaes_avx512)
_aes_gcm_update 1
--
2.51.0
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 5/8] crypto: x86/aes-gcm - reorder AVX512 precompute and aad_update functions
2025-10-02 2:31 [PATCH 0/8] VAES+AVX2 optimized implementation of AES-GCM Eric Biggers
` (3 preceding siblings ...)
2025-10-02 2:31 ` [PATCH 4/8] crypto: x86/aes-gcm - clean up AVX512 code to assume 512-bit vectors Eric Biggers
@ 2025-10-02 2:31 ` Eric Biggers
2025-10-02 2:31 ` [PATCH 6/8] crypto: x86/aes-gcm - revise some comments in AVX512 code Eric Biggers
` (5 subsequent siblings)
10 siblings, 0 replies; 20+ messages in thread
From: Eric Biggers @ 2025-10-02 2:31 UTC (permalink / raw)
To: linux-crypto
Cc: linux-kernel, x86, Ard Biesheuvel, Jason A . Donenfeld,
Eric Biggers
Now that the _aes_gcm_precompute macro is instantiated only once,
replace it directly with a function definition.
Also, move aes_gcm_aad_update_vaes_avx512() to a different location in
the file so that it's consistent with aes-gcm-vaes-avx2.S and also the
BoringSSL port of this code.
No functional changes.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
---
arch/x86/crypto/aes-gcm-vaes-avx512.S | 187 +++++++++++++-------------
1 file changed, 92 insertions(+), 95 deletions(-)
diff --git a/arch/x86/crypto/aes-gcm-vaes-avx512.S b/arch/x86/crypto/aes-gcm-vaes-avx512.S
index 3edf829c2ce07..81a8a027cff8e 100644
--- a/arch/x86/crypto/aes-gcm-vaes-avx512.S
+++ b/arch/x86/crypto/aes-gcm-vaes-avx512.S
@@ -266,11 +266,11 @@
// subkey and initializes |key->ghash_key_powers| with powers of it.
//
// The number of key powers initialized is NUM_H_POWERS, and they are stored in
// the order H^NUM_H_POWERS to H^1. The zeroized padding blocks after the key
// powers themselves are also initialized.
-.macro _aes_gcm_precompute
+SYM_FUNC_START(aes_gcm_precompute_vaes_avx512)
// Function arguments
.set KEY, %rdi
// Additional local variables.
@@ -359,20 +359,20 @@
// Compute and store the remaining key powers.
// Repeatedly multiply [H^(i+3), H^(i+2), H^(i+1), H^i] by
// [H^4, H^4, H^4, H^4] to get [H^(i+7), H^(i+6), H^(i+5), H^(i+4)].
mov $3, %eax
-.Lprecompute_next\@:
+.Lprecompute_next:
sub $64, POWERS_PTR
_ghash_mul H_INC, H_CUR, H_CUR, GFPOLY, %zmm0, %zmm1, %zmm2
vmovdqu8 H_CUR, (POWERS_PTR)
dec %eax
- jnz .Lprecompute_next\@
+ jnz .Lprecompute_next
vzeroupper // This is needed after using ymm or zmm registers.
RET
-.endm
+SYM_FUNC_END(aes_gcm_precompute_vaes_avx512)
// XOR together the 128-bit lanes of \src (whose low lane is \src_xmm) and store
// the result in \dst_xmm. This implicitly zeroizes the other lanes of dst.
.macro _horizontal_xor src, src_xmm, dst_xmm, t0_xmm, t1_xmm, t2_xmm
vextracti32x4 $1, \src, \t0_xmm
@@ -461,10 +461,98 @@
_horizontal_xor GHASH_ACC, GHASH_ACC_XMM, GHASH_ACC_XMM, \
GHASHDATA0_XMM, GHASHDATA1_XMM, GHASHDATA2_XMM
.endif
.endm
+// void aes_gcm_aad_update_vaes_avx512(const struct aes_gcm_key_vaes_avx512 *key,
+// u8 ghash_acc[16],
+// const u8 *aad, int aadlen);
+//
+// This function processes the AAD (Additional Authenticated Data) in GCM.
+// Using the key |key|, it updates the GHASH accumulator |ghash_acc| with the
+// data given by |aad| and |aadlen|. |key->ghash_key_powers| must have been
+// initialized. On the first call, |ghash_acc| must be all zeroes. |aadlen|
+// must be a multiple of 16, except on the last call where it can be any length.
+// The caller must do any buffering needed to ensure this.
+//
+// AES-GCM is almost always used with small amounts of AAD, less than 32 bytes.
+// Therefore, for AAD processing we currently only provide this implementation
+// which uses 256-bit vectors (ymm registers) and only has a 1x-wide loop. This
+// keeps the code size down, and it enables some micro-optimizations, e.g. using
+// VEX-coded instructions instead of EVEX-coded to save some instruction bytes.
+// To optimize for large amounts of AAD, we could implement a 4x-wide loop and
+// provide a version using 512-bit vectors, but that doesn't seem to be useful.
+SYM_FUNC_START(aes_gcm_aad_update_vaes_avx512)
+
+ // Function arguments
+ .set KEY, %rdi
+ .set GHASH_ACC_PTR, %rsi
+ .set AAD, %rdx
+ .set AADLEN, %ecx
+ .set AADLEN64, %rcx // Zero-extend AADLEN before using!
+
+ // Additional local variables.
+ // %rax, %ymm0-%ymm3, and %k1 are used as temporary registers.
+ .set BSWAP_MASK, %ymm4
+ .set GFPOLY, %ymm5
+ .set GHASH_ACC, %ymm6
+ .set GHASH_ACC_XMM, %xmm6
+ .set H_POW1, %ymm7
+
+ // Load some constants.
+ vbroadcasti128 .Lbswap_mask(%rip), BSWAP_MASK
+ vbroadcasti128 .Lgfpoly(%rip), GFPOLY
+
+ // Load the GHASH accumulator.
+ vmovdqu (GHASH_ACC_PTR), GHASH_ACC_XMM
+
+ // Update GHASH with 32 bytes of AAD at a time.
+ //
+ // Pre-subtracting 32 from AADLEN saves an instruction from the loop and
+ // also ensures that at least one write always occurs to AADLEN,
+ // zero-extending it and allowing AADLEN64 to be used later.
+ sub $32, AADLEN
+ jl .Laad_loop_1x_done
+ vmovdqu8 OFFSETOFEND_H_POWERS-32(KEY), H_POW1 // [H^2, H^1]
+.Laad_loop_1x:
+ vmovdqu (AAD), %ymm0
+ vpshufb BSWAP_MASK, %ymm0, %ymm0
+ vpxor %ymm0, GHASH_ACC, GHASH_ACC
+ _ghash_mul H_POW1, GHASH_ACC, GHASH_ACC, GFPOLY, \
+ %ymm0, %ymm1, %ymm2
+ vextracti128 $1, GHASH_ACC, %xmm0
+ vpxor %xmm0, GHASH_ACC_XMM, GHASH_ACC_XMM
+ add $32, AAD
+ sub $32, AADLEN
+ jge .Laad_loop_1x
+.Laad_loop_1x_done:
+ add $32, AADLEN
+ jz .Laad_done
+
+ // Update GHASH with the remaining 1 <= AADLEN < 32 bytes of AAD.
+ mov $-1, %eax
+ bzhi AADLEN, %eax, %eax
+ kmovd %eax, %k1
+ vmovdqu8 (AAD), %ymm0{%k1}{z}
+ neg AADLEN64
+ and $~15, AADLEN64 // -round_up(AADLEN, 16)
+ vmovdqu8 OFFSETOFEND_H_POWERS(KEY,AADLEN64), H_POW1
+ vpshufb BSWAP_MASK, %ymm0, %ymm0
+ vpxor %ymm0, GHASH_ACC, GHASH_ACC
+ _ghash_mul H_POW1, GHASH_ACC, GHASH_ACC, GFPOLY, \
+ %ymm0, %ymm1, %ymm2
+ vextracti128 $1, GHASH_ACC, %xmm0
+ vpxor %xmm0, GHASH_ACC_XMM, GHASH_ACC_XMM
+
+.Laad_done:
+ // Store the updated GHASH accumulator back to memory.
+ vmovdqu GHASH_ACC_XMM, (GHASH_ACC_PTR)
+
+ vzeroupper // This is needed after using ymm or zmm registers.
+ RET
+SYM_FUNC_END(aes_gcm_aad_update_vaes_avx512)
+
// Do one non-last round of AES encryption on the blocks in %zmm[0-3] using the
// round key that has been broadcast to all 128-bit lanes of \round_key.
.macro _vaesenc_4x round_key
vaesenc \round_key, %zmm0, %zmm0
vaesenc \round_key, %zmm1, %zmm1
@@ -999,108 +1087,17 @@
.endif
// No need for vzeroupper here, since only used xmm registers were used.
RET
.endm
-SYM_FUNC_START(aes_gcm_precompute_vaes_avx512)
- _aes_gcm_precompute
-SYM_FUNC_END(aes_gcm_precompute_vaes_avx512)
SYM_FUNC_START(aes_gcm_enc_update_vaes_avx512)
_aes_gcm_update 1
SYM_FUNC_END(aes_gcm_enc_update_vaes_avx512)
SYM_FUNC_START(aes_gcm_dec_update_vaes_avx512)
_aes_gcm_update 0
SYM_FUNC_END(aes_gcm_dec_update_vaes_avx512)
-// void aes_gcm_aad_update_vaes_avx512(const struct aes_gcm_key_vaes_avx512 *key,
-// u8 ghash_acc[16],
-// const u8 *aad, int aadlen);
-//
-// This function processes the AAD (Additional Authenticated Data) in GCM.
-// Using the key |key|, it updates the GHASH accumulator |ghash_acc| with the
-// data given by |aad| and |aadlen|. |key->ghash_key_powers| must have been
-// initialized. On the first call, |ghash_acc| must be all zeroes. |aadlen|
-// must be a multiple of 16, except on the last call where it can be any length.
-// The caller must do any buffering needed to ensure this.
-//
-// AES-GCM is almost always used with small amounts of AAD, less than 32 bytes.
-// Therefore, for AAD processing we currently only provide this implementation
-// which uses 256-bit vectors (ymm registers) and only has a 1x-wide loop. This
-// keeps the code size down, and it enables some micro-optimizations, e.g. using
-// VEX-coded instructions instead of EVEX-coded to save some instruction bytes.
-// To optimize for large amounts of AAD, we could implement a 4x-wide loop and
-// provide a version using 512-bit vectors, but that doesn't seem to be useful.
-SYM_FUNC_START(aes_gcm_aad_update_vaes_avx512)
-
- // Function arguments
- .set KEY, %rdi
- .set GHASH_ACC_PTR, %rsi
- .set AAD, %rdx
- .set AADLEN, %ecx
- .set AADLEN64, %rcx // Zero-extend AADLEN before using!
-
- // Additional local variables.
- // %rax, %ymm0-%ymm3, and %k1 are used as temporary registers.
- .set BSWAP_MASK, %ymm4
- .set GFPOLY, %ymm5
- .set GHASH_ACC, %ymm6
- .set GHASH_ACC_XMM, %xmm6
- .set H_POW1, %ymm7
-
- // Load some constants.
- vbroadcasti128 .Lbswap_mask(%rip), BSWAP_MASK
- vbroadcasti128 .Lgfpoly(%rip), GFPOLY
-
- // Load the GHASH accumulator.
- vmovdqu (GHASH_ACC_PTR), GHASH_ACC_XMM
-
- // Update GHASH with 32 bytes of AAD at a time.
- //
- // Pre-subtracting 32 from AADLEN saves an instruction from the loop and
- // also ensures that at least one write always occurs to AADLEN,
- // zero-extending it and allowing AADLEN64 to be used later.
- sub $32, AADLEN
- jl .Laad_loop_1x_done
- vmovdqu8 OFFSETOFEND_H_POWERS-32(KEY), H_POW1 // [H^2, H^1]
-.Laad_loop_1x:
- vmovdqu (AAD), %ymm0
- vpshufb BSWAP_MASK, %ymm0, %ymm0
- vpxor %ymm0, GHASH_ACC, GHASH_ACC
- _ghash_mul H_POW1, GHASH_ACC, GHASH_ACC, GFPOLY, \
- %ymm0, %ymm1, %ymm2
- vextracti128 $1, GHASH_ACC, %xmm0
- vpxor %xmm0, GHASH_ACC_XMM, GHASH_ACC_XMM
- add $32, AAD
- sub $32, AADLEN
- jge .Laad_loop_1x
-.Laad_loop_1x_done:
- add $32, AADLEN
- jz .Laad_done
-
- // Update GHASH with the remaining 1 <= AADLEN < 32 bytes of AAD.
- mov $-1, %eax
- bzhi AADLEN, %eax, %eax
- kmovd %eax, %k1
- vmovdqu8 (AAD), %ymm0{%k1}{z}
- neg AADLEN64
- and $~15, AADLEN64 // -round_up(AADLEN, 16)
- vmovdqu8 OFFSETOFEND_H_POWERS(KEY,AADLEN64), H_POW1
- vpshufb BSWAP_MASK, %ymm0, %ymm0
- vpxor %ymm0, GHASH_ACC, GHASH_ACC
- _ghash_mul H_POW1, GHASH_ACC, GHASH_ACC, GFPOLY, \
- %ymm0, %ymm1, %ymm2
- vextracti128 $1, GHASH_ACC, %xmm0
- vpxor %xmm0, GHASH_ACC_XMM, GHASH_ACC_XMM
-
-.Laad_done:
- // Store the updated GHASH accumulator back to memory.
- vmovdqu GHASH_ACC_XMM, (GHASH_ACC_PTR)
-
- vzeroupper // This is needed after using ymm or zmm registers.
- RET
-SYM_FUNC_END(aes_gcm_aad_update_vaes_avx512)
-
SYM_FUNC_START(aes_gcm_enc_final_vaes_avx512)
_aes_gcm_final 1
SYM_FUNC_END(aes_gcm_enc_final_vaes_avx512)
SYM_FUNC_START(aes_gcm_dec_final_vaes_avx512)
_aes_gcm_final 0
--
2.51.0
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 6/8] crypto: x86/aes-gcm - revise some comments in AVX512 code
2025-10-02 2:31 [PATCH 0/8] VAES+AVX2 optimized implementation of AES-GCM Eric Biggers
` (4 preceding siblings ...)
2025-10-02 2:31 ` [PATCH 5/8] crypto: x86/aes-gcm - reorder AVX512 precompute and aad_update functions Eric Biggers
@ 2025-10-02 2:31 ` Eric Biggers
2025-10-02 2:31 ` [PATCH 7/8] crypto: x86/aes-gcm - optimize AVX512 precomputation of H^2 from H^1 Eric Biggers
` (4 subsequent siblings)
10 siblings, 0 replies; 20+ messages in thread
From: Eric Biggers @ 2025-10-02 2:31 UTC (permalink / raw)
To: linux-crypto
Cc: linux-kernel, x86, Ard Biesheuvel, Jason A . Donenfeld,
Eric Biggers
- Fix some references to field names in struct aes_gcm_key_vaes_avx512.
- Remove the mention of the counter having to start at 2. The assembly
code doesn't actually assume that it does.
Note that these changes improve consistency with aes-gcm-vaes-avx2.S.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
---
arch/x86/crypto/aes-gcm-vaes-avx512.S | 27 +++++++++++----------------
1 file changed, 11 insertions(+), 16 deletions(-)
diff --git a/arch/x86/crypto/aes-gcm-vaes-avx512.S b/arch/x86/crypto/aes-gcm-vaes-avx512.S
index 81a8a027cff8e..3cf0945a25170 100644
--- a/arch/x86/crypto/aes-gcm-vaes-avx512.S
+++ b/arch/x86/crypto/aes-gcm-vaes-avx512.S
@@ -260,16 +260,12 @@
vpternlogd $0x96, \t0, \mi, \hi
.endm
// void aes_gcm_precompute_vaes_avx512(struct aes_gcm_key_vaes_avx512 *key);
//
-// Given the expanded AES key |key->aes_key|, this function derives the GHASH
-// subkey and initializes |key->ghash_key_powers| with powers of it.
-//
-// The number of key powers initialized is NUM_H_POWERS, and they are stored in
-// the order H^NUM_H_POWERS to H^1. The zeroized padding blocks after the key
-// powers themselves are also initialized.
+// Given the expanded AES key |key->base.aes_key|, derive the GHASH subkey and
+// initialize |key->h_powers| and |key->padding|.
SYM_FUNC_START(aes_gcm_precompute_vaes_avx512)
// Function arguments
.set KEY, %rdi
@@ -467,14 +463,13 @@ SYM_FUNC_END(aes_gcm_precompute_vaes_avx512)
// u8 ghash_acc[16],
// const u8 *aad, int aadlen);
//
// This function processes the AAD (Additional Authenticated Data) in GCM.
// Using the key |key|, it updates the GHASH accumulator |ghash_acc| with the
-// data given by |aad| and |aadlen|. |key->ghash_key_powers| must have been
-// initialized. On the first call, |ghash_acc| must be all zeroes. |aadlen|
-// must be a multiple of 16, except on the last call where it can be any length.
-// The caller must do any buffering needed to ensure this.
+// data given by |aad| and |aadlen|. On the first call, |ghash_acc| must be all
+// zeroes. |aadlen| must be a multiple of 16, except on the last call where it
+// can be any length. The caller must do any buffering needed to ensure this.
//
// AES-GCM is almost always used with small amounts of AAD, less than 32 bytes.
// Therefore, for AAD processing we currently only provide this implementation
// which uses 256-bit vectors (ymm registers) and only has a 1x-wide loop. This
// keeps the code size down, and it enables some micro-optimizations, e.g. using
@@ -620,16 +615,16 @@ SYM_FUNC_END(aes_gcm_aad_update_vaes_avx512)
//
// |datalen| must be a multiple of 16, except on the last call where it can be
// any length. The caller must do any buffering needed to ensure this. Both
// in-place and out-of-place en/decryption are supported.
//
-// |le_ctr| must give the current counter in little-endian format. For a new
-// message, the low word of the counter must be 2. This function loads the
-// counter from |le_ctr| and increments the loaded counter as needed, but it
-// does *not* store the updated counter back to |le_ctr|. The caller must
-// update |le_ctr| if any more data segments follow. Internally, only the low
-// 32-bit word of the counter is incremented, following the GCM standard.
+// |le_ctr| must give the current counter in little-endian format. This
+// function loads the counter from |le_ctr| and increments the loaded counter as
+// needed, but it does *not* store the updated counter back to |le_ctr|. The
+// caller must update |le_ctr| if any more data segments follow. Internally,
+// only the low 32-bit word of the counter is incremented, following the GCM
+// standard.
.macro _aes_gcm_update enc
// Function arguments
.set KEY, %rdi
.set LE_CTR_PTR, %rsi
--
2.51.0
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 7/8] crypto: x86/aes-gcm - optimize AVX512 precomputation of H^2 from H^1
2025-10-02 2:31 [PATCH 0/8] VAES+AVX2 optimized implementation of AES-GCM Eric Biggers
` (5 preceding siblings ...)
2025-10-02 2:31 ` [PATCH 6/8] crypto: x86/aes-gcm - revise some comments in AVX512 code Eric Biggers
@ 2025-10-02 2:31 ` Eric Biggers
2025-10-02 2:31 ` [PATCH 8/8] crypto: x86/aes-gcm - optimize long AAD processing with AVX512 Eric Biggers
` (3 subsequent siblings)
10 siblings, 0 replies; 20+ messages in thread
From: Eric Biggers @ 2025-10-02 2:31 UTC (permalink / raw)
To: linux-crypto
Cc: linux-kernel, x86, Ard Biesheuvel, Jason A . Donenfeld,
Eric Biggers
Squaring in GF(2^128) requires fewer instructions than a generic
multiplication in GF(2^128). Take advantage of this when computing H^2
from H^1 in aes_gcm_precompute_vaes_avx512().
Note that aes_gcm_precompute_vaes_avx2() already uses this optimization.
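As an editorial aside (not part of the patch): the saving comes from GF(2) addition being XOR, so when squaring a = a_L + a_H*x^64 the two cross terms a_L*a_H and a_H*a_L cancel, leaving only the two lane squares; that is the "MI" term that the new _ghash_square macro skips. A minimal Python sketch of the identity, using a toy carry-less multiply:

import random

def clmul(a, b):
    # Carry-less multiplication: a and b viewed as polynomials over GF(2).
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

a = random.getrandbits(128)
a_lo, a_hi = a & ((1 << 64) - 1), a >> 64
# The full carry-less square of a ...
sq = clmul(a, a)
# ... equals the squares of the two 64-bit halves with no middle term,
# since the cross term a_lo*a_hi ^ a_hi*a_lo cancels over GF(2):
assert sq == clmul(a_lo, a_lo) ^ (clmul(a_hi, a_hi) << 128)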
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
---
arch/x86/crypto/aes-gcm-vaes-avx512.S | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/arch/x86/crypto/aes-gcm-vaes-avx512.S b/arch/x86/crypto/aes-gcm-vaes-avx512.S
index 3cf0945a25170..5c8301d275c66 100644
--- a/arch/x86/crypto/aes-gcm-vaes-avx512.S
+++ b/arch/x86/crypto/aes-gcm-vaes-avx512.S
@@ -258,10 +258,23 @@
vpclmulqdq $0x01, \mi, \gfpoly, \t0
vpshufd $0x4e, \mi, \mi
vpternlogd $0x96, \t0, \mi, \hi
.endm
+// This is a specialized version of _ghash_mul that computes \a * \a, i.e. it
+// squares \a. It skips computing MI = (a_L * a_H) + (a_H * a_L) = 0.
+.macro _ghash_square a, dst, gfpoly, t0, t1
+ vpclmulqdq $0x00, \a, \a, \t0 // LO = a_L * a_L
+ vpclmulqdq $0x11, \a, \a, \dst // HI = a_H * a_H
+ vpclmulqdq $0x01, \t0, \gfpoly, \t1 // LO_L*(x^63 + x^62 + x^57)
+ vpshufd $0x4e, \t0, \t0 // Swap halves of LO
+ vpxord \t0, \t1, \t1 // Fold LO into MI
+ vpclmulqdq $0x01, \t1, \gfpoly, \t0 // MI_L*(x^63 + x^62 + x^57)
+ vpshufd $0x4e, \t1, \t1 // Swap halves of MI
+ vpternlogd $0x96, \t0, \t1, \dst // Fold MI into HI
+.endm
+
// void aes_gcm_precompute_vaes_avx512(struct aes_gcm_key_vaes_avx512 *key);
//
// Given the expanded AES key |key->base.aes_key|, derive the GHASH subkey and
// initialize |key->h_powers| and |key->padding|.
SYM_FUNC_START(aes_gcm_precompute_vaes_avx512)
@@ -335,12 +348,11 @@ SYM_FUNC_START(aes_gcm_precompute_vaes_avx512)
// Note that as with H^1, all higher key powers also need an extra
// factor of x^-1 (or x using the natural interpretation). Nothing
// special needs to be done to make this happen, though: H^1 * H^1 would
// end up with two factors of x^-1, but the multiplication consumes one.
// So the product H^2 ends up with the desired one factor of x^-1.
- _ghash_mul H_CUR_XMM, H_CUR_XMM, H_INC_XMM, GFPOLY_XMM, \
- %xmm0, %xmm1, %xmm2
+ _ghash_square H_CUR_XMM, H_INC_XMM, GFPOLY_XMM, %xmm0, %xmm1
// Create H_CUR_YMM = [H^2, H^1] and H_INC_YMM = [H^2, H^2].
vinserti128 $1, H_CUR_XMM, H_INC_YMM, H_CUR_YMM
vinserti128 $1, H_INC_XMM, H_INC_YMM, H_INC_YMM
--
2.51.0
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 8/8] crypto: x86/aes-gcm - optimize long AAD processing with AVX512
2025-10-02 2:31 [PATCH 0/8] VAES+AVX2 optimized implementation of AES-GCM Eric Biggers
` (6 preceding siblings ...)
2025-10-02 2:31 ` [PATCH 7/8] crypto: x86/aes-gcm - optimize AVX512 precomputation of H^2 from H^1 Eric Biggers
@ 2025-10-02 2:31 ` Eric Biggers
2025-10-10 18:21 ` [PATCH 0/8] VAES+AVX2 optimized implementation of AES-GCM Ard Biesheuvel
` (2 subsequent siblings)
10 siblings, 0 replies; 20+ messages in thread
From: Eric Biggers @ 2025-10-02 2:31 UTC (permalink / raw)
To: linux-crypto
Cc: linux-kernel, x86, Ard Biesheuvel, Jason A . Donenfeld,
Eric Biggers
Improve the performance of aes_gcm_aad_update_vaes_avx512() on large AAD
(additional authenticated data) lengths by 4-8 times by making it use up
to 512-bit vectors and a 4-vector-wide loop. Previously, it used only
256-bit vectors and a 1-vector-wide loop.
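As background (an editorial sketch, not kernel code): a 4-vector-wide GHASH loop works because folding four blocks into the accumulator with four dependent multiplications by H is equivalent to four independent multiplications by the cached key powers H^4..H^1. A rough Python model of that identity, using the GF(2^128) multiplication defined in NIST SP 800-38D:

import random

def gf128_mul(x, y):
    # GF(2^128) multiplication in the bit-reflected GHASH convention
    # (reduction polynomial x^128 + x^7 + x^2 + x + 1), per NIST SP 800-38D.
    R = 0xE1000000000000000000000000000000
    z, v = 0, x
    for i in range(128):
        if (y >> (127 - i)) & 1:
            z ^= v
        v = (v >> 1) ^ R if v & 1 else v >> 1
    return z

def ghash_serial(h, acc, blocks):
    # One dependent multiplication by H per block.
    for b in blocks:
        acc = gf128_mul(acc ^ b, h)
    return acc

def ghash_4wide(h, acc, b):
    # Four independent multiplications by the cached powers H^4..H^1.
    h2 = gf128_mul(h, h)
    h3 = gf128_mul(h2, h)
    h4 = gf128_mul(h3, h)
    return (gf128_mul(acc ^ b[0], h4) ^ gf128_mul(b[1], h3) ^
            gf128_mul(b[2], h2) ^ gf128_mul(b[3], h))

h, acc = random.getrandbits(128), random.getrandbits(128)
blocks = [random.getrandbits(128) for _ in range(4)]
assert ghash_serial(h, acc, blocks) == ghash_4wide(h, acc, blocks)

The vectorized code applies the same idea with one block per 128-bit lane, which is why it caches H^16..H^1.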
Originally, I assumed that the case of large AADLEN was unimportant.
When reviewing the users of BoringSSL's AES-GCM code, I found that some
callers use BoringSSL's AES-GCM API to just compute GMAC, using large
amounts of AAD with zero bytes en/decrypted. Thus, I included a similar
optimization in the BoringSSL port of this code. I believe it's wise to
include this optimization in the kernel port too for similar reasons,
and to align it more closely with the BoringSSL port.
Another reason this function originally used 256-bit vectors was so that
separate *_avx10_256 and *_avx10_512 versions of it wouldn't be needed.
However, that's no longer applicable.
To avoid a slight performance regression in the common case of AADLEN <=
16, also add a fast path for that case which uses 128-bit vectors. In
fact, this case actually gets slightly faster too, since it saves a
couple instructions over the original 256-bit code.
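The fast path's masked load (vmovdqu8 with {%k1}{z} in the hunk below) builds its byte mask branchlessly with bzhi. A rough Python model of just that mask computation, as an editorial illustration rather than kernel code:

def bzhi(src, index, width=32):
    # Model of the BMI2 BZHI instruction: zero the bits of src at positions
    # >= index; if index >= the operand width, src passes through unchanged.
    index &= 0xff
    mask = (1 << width) - 1
    if index >= width:
        return src & mask
    return src & ((1 << index) - 1)

aadlen = 13                          # e.g. 13 bytes of AAD remain
k1 = bzhi(0xffffffff, aadlen)        # mov $-1, %eax; bzhi AADLEN, %eax, %eax
assert k1 == (1 << aadlen) - 1       # low 13 mask bits set: load only 13 bytes

Each mask bit enables one byte of the 16-byte load, and the {z} qualifier zeroes the disabled bytes, giving the zero-padded block that GHASH expects.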
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
---
arch/x86/crypto/aes-gcm-vaes-avx512.S | 146 +++++++++++++++++---------
1 file changed, 99 insertions(+), 47 deletions(-)
diff --git a/arch/x86/crypto/aes-gcm-vaes-avx512.S b/arch/x86/crypto/aes-gcm-vaes-avx512.S
index 5c8301d275c66..06b71314d65cc 100644
--- a/arch/x86/crypto/aes-gcm-vaes-avx512.S
+++ b/arch/x86/crypto/aes-gcm-vaes-avx512.S
@@ -469,88 +469,142 @@ SYM_FUNC_END(aes_gcm_precompute_vaes_avx512)
_horizontal_xor GHASH_ACC, GHASH_ACC_XMM, GHASH_ACC_XMM, \
GHASHDATA0_XMM, GHASHDATA1_XMM, GHASHDATA2_XMM
.endif
.endm
+// Update GHASH with four vectors of data blocks. See _ghash_step_4x for full
+// explanation.
+.macro _ghash_4x
+.irp i, 0,1,2,3,4,5,6,7,8,9
+ _ghash_step_4x \i
+.endr
+.endm
+
// void aes_gcm_aad_update_vaes_avx512(const struct aes_gcm_key_vaes_avx512 *key,
// u8 ghash_acc[16],
// const u8 *aad, int aadlen);
//
// This function processes the AAD (Additional Authenticated Data) in GCM.
// Using the key |key|, it updates the GHASH accumulator |ghash_acc| with the
// data given by |aad| and |aadlen|. On the first call, |ghash_acc| must be all
// zeroes. |aadlen| must be a multiple of 16, except on the last call where it
// can be any length. The caller must do any buffering needed to ensure this.
//
-// AES-GCM is almost always used with small amounts of AAD, less than 32 bytes.
-// Therefore, for AAD processing we currently only provide this implementation
-// which uses 256-bit vectors (ymm registers) and only has a 1x-wide loop. This
-// keeps the code size down, and it enables some micro-optimizations, e.g. using
-// VEX-coded instructions instead of EVEX-coded to save some instruction bytes.
-// To optimize for large amounts of AAD, we could implement a 4x-wide loop and
-// provide a version using 512-bit vectors, but that doesn't seem to be useful.
+// This handles large amounts of AAD efficiently, while also keeping overhead
+// low for small amounts which is the common case. TLS and IPsec use less than
+// one block of AAD, but (uncommonly) other use cases may use much more.
SYM_FUNC_START(aes_gcm_aad_update_vaes_avx512)
// Function arguments
.set KEY, %rdi
.set GHASH_ACC_PTR, %rsi
.set AAD, %rdx
.set AADLEN, %ecx
.set AADLEN64, %rcx // Zero-extend AADLEN before using!
// Additional local variables.
- // %rax, %ymm0-%ymm3, and %k1 are used as temporary registers.
- .set BSWAP_MASK, %ymm4
- .set GFPOLY, %ymm5
- .set GHASH_ACC, %ymm6
- .set GHASH_ACC_XMM, %xmm6
- .set H_POW1, %ymm7
-
- // Load some constants.
- vbroadcasti128 .Lbswap_mask(%rip), BSWAP_MASK
- vbroadcasti128 .Lgfpoly(%rip), GFPOLY
+ // %rax and %k1 are used as temporary registers.
+ .set GHASHDATA0, %zmm0
+ .set GHASHDATA0_XMM, %xmm0
+ .set GHASHDATA1, %zmm1
+ .set GHASHDATA1_XMM, %xmm1
+ .set GHASHDATA2, %zmm2
+ .set GHASHDATA2_XMM, %xmm2
+ .set GHASHDATA3, %zmm3
+ .set BSWAP_MASK, %zmm4
+ .set BSWAP_MASK_XMM, %xmm4
+ .set GHASH_ACC, %zmm5
+ .set GHASH_ACC_XMM, %xmm5
+ .set H_POW4, %zmm6
+ .set H_POW3, %zmm7
+ .set H_POW2, %zmm8
+ .set H_POW1, %zmm9
+ .set H_POW1_XMM, %xmm9
+ .set GFPOLY, %zmm10
+ .set GFPOLY_XMM, %xmm10
+ .set GHASHTMP0, %zmm11
+ .set GHASHTMP1, %zmm12
+ .set GHASHTMP2, %zmm13
// Load the GHASH accumulator.
vmovdqu (GHASH_ACC_PTR), GHASH_ACC_XMM
- // Update GHASH with 32 bytes of AAD at a time.
- //
- // Pre-subtracting 32 from AADLEN saves an instruction from the loop and
- // also ensures that at least one write always occurs to AADLEN,
- // zero-extending it and allowing AADLEN64 to be used later.
- sub $32, AADLEN
+ // Check for the common case of AADLEN <= 16, as well as AADLEN == 0.
+ cmp $16, AADLEN
+ jg .Laad_more_than_16bytes
+ test AADLEN, AADLEN
+ jz .Laad_done
+
+ // Fast path: update GHASH with 1 <= AADLEN <= 16 bytes of AAD.
+ vmovdqu .Lbswap_mask(%rip), BSWAP_MASK_XMM
+ vmovdqu .Lgfpoly(%rip), GFPOLY_XMM
+ mov $-1, %eax
+ bzhi AADLEN, %eax, %eax
+ kmovd %eax, %k1
+ vmovdqu8 (AAD), GHASHDATA0_XMM{%k1}{z}
+ vmovdqu OFFSETOFEND_H_POWERS-16(KEY), H_POW1_XMM
+ vpshufb BSWAP_MASK_XMM, GHASHDATA0_XMM, GHASHDATA0_XMM
+ vpxor GHASHDATA0_XMM, GHASH_ACC_XMM, GHASH_ACC_XMM
+ _ghash_mul H_POW1_XMM, GHASH_ACC_XMM, GHASH_ACC_XMM, GFPOLY_XMM, \
+ GHASHDATA0_XMM, GHASHDATA1_XMM, GHASHDATA2_XMM
+ jmp .Laad_done
+
+.Laad_more_than_16bytes:
+ vbroadcasti32x4 .Lbswap_mask(%rip), BSWAP_MASK
+ vbroadcasti32x4 .Lgfpoly(%rip), GFPOLY
+
+ // If AADLEN >= 256, update GHASH with 256 bytes of AAD at a time.
+ sub $256, AADLEN
+ jl .Laad_loop_4x_done
+ vmovdqu8 OFFSETOFEND_H_POWERS-4*64(KEY), H_POW4
+ vmovdqu8 OFFSETOFEND_H_POWERS-3*64(KEY), H_POW3
+ vmovdqu8 OFFSETOFEND_H_POWERS-2*64(KEY), H_POW2
+ vmovdqu8 OFFSETOFEND_H_POWERS-1*64(KEY), H_POW1
+.Laad_loop_4x:
+ vmovdqu8 0*64(AAD), GHASHDATA0
+ vmovdqu8 1*64(AAD), GHASHDATA1
+ vmovdqu8 2*64(AAD), GHASHDATA2
+ vmovdqu8 3*64(AAD), GHASHDATA3
+ _ghash_4x
+ add $256, AAD
+ sub $256, AADLEN
+ jge .Laad_loop_4x
+.Laad_loop_4x_done:
+
+ // If AADLEN >= 64, update GHASH with 64 bytes of AAD at a time.
+ add $192, AADLEN
jl .Laad_loop_1x_done
- vmovdqu8 OFFSETOFEND_H_POWERS-32(KEY), H_POW1 // [H^2, H^1]
+ vmovdqu8 OFFSETOFEND_H_POWERS-1*64(KEY), H_POW1
.Laad_loop_1x:
- vmovdqu (AAD), %ymm0
- vpshufb BSWAP_MASK, %ymm0, %ymm0
- vpxor %ymm0, GHASH_ACC, GHASH_ACC
+ vmovdqu8 (AAD), GHASHDATA0
+ vpshufb BSWAP_MASK, GHASHDATA0, GHASHDATA0
+ vpxord GHASHDATA0, GHASH_ACC, GHASH_ACC
_ghash_mul H_POW1, GHASH_ACC, GHASH_ACC, GFPOLY, \
- %ymm0, %ymm1, %ymm2
- vextracti128 $1, GHASH_ACC, %xmm0
- vpxor %xmm0, GHASH_ACC_XMM, GHASH_ACC_XMM
- add $32, AAD
- sub $32, AADLEN
+ GHASHDATA0, GHASHDATA1, GHASHDATA2
+ _horizontal_xor GHASH_ACC, GHASH_ACC_XMM, GHASH_ACC_XMM, \
+ GHASHDATA0_XMM, GHASHDATA1_XMM, GHASHDATA2_XMM
+ add $64, AAD
+ sub $64, AADLEN
jge .Laad_loop_1x
.Laad_loop_1x_done:
- add $32, AADLEN
- jz .Laad_done
- // Update GHASH with the remaining 1 <= AADLEN < 32 bytes of AAD.
- mov $-1, %eax
- bzhi AADLEN, %eax, %eax
- kmovd %eax, %k1
- vmovdqu8 (AAD), %ymm0{%k1}{z}
+ // Update GHASH with the remaining 0 <= AADLEN < 64 bytes of AAD.
+ add $64, AADLEN
+ jz .Laad_done
+ mov $-1, %rax
+ bzhi AADLEN64, %rax, %rax
+ kmovq %rax, %k1
+ vmovdqu8 (AAD), GHASHDATA0{%k1}{z}
neg AADLEN64
and $~15, AADLEN64 // -round_up(AADLEN, 16)
vmovdqu8 OFFSETOFEND_H_POWERS(KEY,AADLEN64), H_POW1
- vpshufb BSWAP_MASK, %ymm0, %ymm0
- vpxor %ymm0, GHASH_ACC, GHASH_ACC
+ vpshufb BSWAP_MASK, GHASHDATA0, GHASHDATA0
+ vpxord GHASHDATA0, GHASH_ACC, GHASH_ACC
_ghash_mul H_POW1, GHASH_ACC, GHASH_ACC, GFPOLY, \
- %ymm0, %ymm1, %ymm2
- vextracti128 $1, GHASH_ACC, %xmm0
- vpxor %xmm0, GHASH_ACC_XMM, GHASH_ACC_XMM
+ GHASHDATA0, GHASHDATA1, GHASHDATA2
+ _horizontal_xor GHASH_ACC, GHASH_ACC_XMM, GHASH_ACC_XMM, \
+ GHASHDATA0_XMM, GHASHDATA1_XMM, GHASHDATA2_XMM
.Laad_done:
// Store the updated GHASH accumulator back to memory.
vmovdqu GHASH_ACC_XMM, (GHASH_ACC_PTR)
@@ -842,13 +896,11 @@ SYM_FUNC_END(aes_gcm_aad_update_vaes_avx512)
jge .Lcrypt_loop_4x\@
.if \enc
.Lghash_last_ciphertext_4x\@:
// Update GHASH with the last set of ciphertext blocks.
-.irp i, 0,1,2,3,4,5,6,7,8,9
- _ghash_step_4x \i
-.endr
+ _ghash_4x
.endif
.Lcrypt_loop_4x_done\@:
// Undo the extra subtraction by 256 and check whether data remains.
--
2.51.0
^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH 0/8] VAES+AVX2 optimized implementation of AES-GCM
2025-10-02 2:31 [PATCH 0/8] VAES+AVX2 optimized implementation of AES-GCM Eric Biggers
` (7 preceding siblings ...)
2025-10-02 2:31 ` [PATCH 8/8] crypto: x86/aes-gcm - optimize long AAD processing with AVX512 Eric Biggers
@ 2025-10-10 18:21 ` Ard Biesheuvel
2025-10-14 0:31 ` Eric Biggers
2025-10-17 8:24 ` Herbert Xu
10 siblings, 0 replies; 20+ messages in thread
From: Ard Biesheuvel @ 2025-10-10 18:21 UTC (permalink / raw)
To: Eric Biggers; +Cc: linux-crypto, linux-kernel, x86, Jason A . Donenfeld
On Wed, 1 Oct 2025 at 19:34, Eric Biggers <ebiggers@kernel.org> wrote:
>
> This patchset replaces the 256-bit vector implementation of AES-GCM for
> x86_64 with one that requires AVX2 rather than AVX512. This greatly
> improves AES-GCM performance on CPUs that have VAES but not AVX512, for
> example by up to 74% on AMD Zen 3. For more details, see patch 1.
>
> This patchset also renames the 512-bit vector implementation of AES-GCM
> for x86_64 to be named after AVX512 rather than AVX10/512, then adds
> some additional optimizations to it.
>
> This patchset applies to next-20250929 and is targeting 6.19. Herbert,
> I'd prefer to just apply this myself. But let me know if you'd prefer
> to take it instead (considering that AES-GCM hasn't been librarified
> yet). Either way, there's no hurry, since this is targeting 6.19.
>
> Eric Biggers (8):
> crypto: x86/aes-gcm - add VAES+AVX2 optimized code
> crypto: x86/aes-gcm - remove VAES+AVX10/256 optimized code
> crypto: x86/aes-gcm - rename avx10 and avx10_512 to avx512
> crypto: x86/aes-gcm - clean up AVX512 code to assume 512-bit vectors
> crypto: x86/aes-gcm - reorder AVX512 precompute and aad_update
> functions
> crypto: x86/aes-gcm - revise some comments in AVX512 code
> crypto: x86/aes-gcm - optimize AVX512 precomputation of H^2 from H^1
> crypto: x86/aes-gcm - optimize long AAD processing with AVX512
>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Ard Biesheuvel <ardb@kernel.org>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/8] VAES+AVX2 optimized implementation of AES-GCM
2025-10-02 2:31 [PATCH 0/8] VAES+AVX2 optimized implementation of AES-GCM Eric Biggers
` (8 preceding siblings ...)
2025-10-10 18:21 ` [PATCH 0/8] VAES+AVX2 optimized implementation of AES-GCM Ard Biesheuvel
@ 2025-10-14 0:31 ` Eric Biggers
2025-10-17 8:25 ` Herbert Xu
2025-10-17 8:24 ` Herbert Xu
10 siblings, 1 reply; 20+ messages in thread
From: Eric Biggers @ 2025-10-14 0:31 UTC (permalink / raw)
To: linux-crypto; +Cc: linux-kernel, x86, Ard Biesheuvel, Jason A . Donenfeld
On Wed, Oct 01, 2025 at 07:31:09PM -0700, Eric Biggers wrote:
> This patchset replaces the 256-bit vector implementation of AES-GCM for
> x86_64 with one that requires AVX2 rather than AVX512. This greatly
> improves AES-GCM performance on CPUs that have VAES but not AVX512, for
> example by up to 74% on AMD Zen 3. For more details, see patch 1.
>
> This patchset also renames the 512-bit vector implementation of AES-GCM
> for x86_64 to be named after AVX512 rather than AVX10/512, then adds
> some additional optimizations to it.
>
> This patchset applies to next-20250929 and is targeting 6.19. Herbert,
> I'd prefer to just apply this myself. But let me know if you'd prefer
> to take it instead (considering that AES-GCM hasn't been librarified
> yet). Either way, there's no hurry, since this is targeting 6.19.
>
> Eric Biggers (8):
> crypto: x86/aes-gcm - add VAES+AVX2 optimized code
> crypto: x86/aes-gcm - remove VAES+AVX10/256 optimized code
> crypto: x86/aes-gcm - rename avx10 and avx10_512 to avx512
> crypto: x86/aes-gcm - clean up AVX512 code to assume 512-bit vectors
> crypto: x86/aes-gcm - reorder AVX512 precompute and aad_update
> functions
> crypto: x86/aes-gcm - revise some comments in AVX512 code
> crypto: x86/aes-gcm - optimize AVX512 precomputation of H^2 from H^1
> crypto: x86/aes-gcm - optimize long AAD processing with AVX512
>
> arch/x86/crypto/Makefile | 5 +-
> arch/x86/crypto/aes-gcm-aesni-x86_64.S | 12 +-
> arch/x86/crypto/aes-gcm-vaes-avx2.S | 1150 +++++++++++++++++
> ...m-avx10-x86_64.S => aes-gcm-vaes-avx512.S} | 722 +++++------
> arch/x86/crypto/aesni-intel_glue.c | 264 ++--
> 5 files changed, 1667 insertions(+), 486 deletions(-)
> create mode 100644 arch/x86/crypto/aes-gcm-vaes-avx2.S
> rename arch/x86/crypto/{aes-gcm-avx10-x86_64.S => aes-gcm-vaes-avx512.S} (69%)
>
> base-commit: 3b9b1f8df454caa453c7fb07689064edb2eda90a
Applied to https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=libcrypto-next
- Eric
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/8] VAES+AVX2 optimized implementation of AES-GCM
2025-10-02 2:31 [PATCH 0/8] VAES+AVX2 optimized implementation of AES-GCM Eric Biggers
` (9 preceding siblings ...)
2025-10-14 0:31 ` Eric Biggers
@ 2025-10-17 8:24 ` Herbert Xu
10 siblings, 0 replies; 20+ messages in thread
From: Herbert Xu @ 2025-10-17 8:24 UTC (permalink / raw)
To: Eric Biggers; +Cc: linux-crypto, linux-kernel, x86, ardb, Jason, ebiggers
Eric Biggers <ebiggers@kernel.org> wrote:
> This patchset replaces the 256-bit vector implementation of AES-GCM for
> x86_64 with one that requires AVX2 rather than AVX512. This greatly
> improves AES-GCM performance on CPUs that have VAES but not AVX512, for
> example by up to 74% on AMD Zen 3. For more details, see patch 1.
>
> This patchset also renames the 512-bit vector implementation of AES-GCM
> for x86_64 to be named after AVX512 rather than AVX10/512, then adds
> some additional optimizations to it.
>
> This patchset applies to next-20250929 and is targeting 6.19. Herbert,
> I'd prefer to just apply this myself. But let me know if you'd prefer
> to take it instead (considering that AES-GCM hasn't been librarified
> yet). Either way, there's no hurry, since this is targeting 6.19.
>
> Eric Biggers (8):
> crypto: x86/aes-gcm - add VAES+AVX2 optimized code
> crypto: x86/aes-gcm - remove VAES+AVX10/256 optimized code
> crypto: x86/aes-gcm - rename avx10 and avx10_512 to avx512
> crypto: x86/aes-gcm - clean up AVX512 code to assume 512-bit vectors
> crypto: x86/aes-gcm - reorder AVX512 precompute and aad_update
> functions
> crypto: x86/aes-gcm - revise some comments in AVX512 code
> crypto: x86/aes-gcm - optimize AVX512 precomputation of H^2 from H^1
> crypto: x86/aes-gcm - optimize long AAD processing with AVX512
>
> arch/x86/crypto/Makefile | 5 +-
> arch/x86/crypto/aes-gcm-aesni-x86_64.S | 12 +-
> arch/x86/crypto/aes-gcm-vaes-avx2.S | 1150 +++++++++++++++++
> ...m-avx10-x86_64.S => aes-gcm-vaes-avx512.S} | 722 +++++------
> arch/x86/crypto/aesni-intel_glue.c | 264 ++--
> 5 files changed, 1667 insertions(+), 486 deletions(-)
> create mode 100644 arch/x86/crypto/aes-gcm-vaes-avx2.S
> rename arch/x86/crypto/{aes-gcm-avx10-x86_64.S => aes-gcm-vaes-avx512.S} (69%)
>
> base-commit: 3b9b1f8df454caa453c7fb07689064edb2eda90a
All applied. Thanks.
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/8] VAES+AVX2 optimized implementation of AES-GCM
2025-10-14 0:31 ` Eric Biggers
@ 2025-10-17 8:25 ` Herbert Xu
2025-10-17 8:44 ` Ard Biesheuvel
0 siblings, 1 reply; 20+ messages in thread
From: Herbert Xu @ 2025-10-17 8:25 UTC (permalink / raw)
To: Eric Biggers; +Cc: linux-crypto, linux-kernel, x86, ardb, Jason
Eric Biggers <ebiggers@kernel.org> wrote:
> On Wed, Oct 01, 2025 at 07:31:09PM -0700, Eric Biggers wrote:
>> This patchset replaces the 256-bit vector implementation of AES-GCM for
>> x86_64 with one that requires AVX2 rather than AVX512. This greatly
>> improves AES-GCM performance on CPUs that have VAES but not AVX512, for
>> example by up to 74% on AMD Zen 3. For more details, see patch 1.
>>
>> This patchset also renames the 512-bit vector implementation of AES-GCM
>> for x86_64 to be named after AVX512 rather than AVX10/512, then adds
>> some additional optimizations to it.
>>
>> This patchset applies to next-20250929 and is targeting 6.19. Herbert,
>> I'd prefer to just apply this myself. But let me know if you'd prefer
>> to take it instead (considering that AES-GCM hasn't been librarified
>> yet). Either way, there's no hurry, since this is targeting 6.19.
>>
>> Eric Biggers (8):
>> crypto: x86/aes-gcm - add VAES+AVX2 optimized code
>> crypto: x86/aes-gcm - remove VAES+AVX10/256 optimized code
>> crypto: x86/aes-gcm - rename avx10 and avx10_512 to avx512
>> crypto: x86/aes-gcm - clean up AVX512 code to assume 512-bit vectors
>> crypto: x86/aes-gcm - reorder AVX512 precompute and aad_update
>> functions
>> crypto: x86/aes-gcm - revise some comments in AVX512 code
>> crypto: x86/aes-gcm - optimize AVX512 precomputation of H^2 from H^1
>> crypto: x86/aes-gcm - optimize long AAD processing with AVX512
>>
>> arch/x86/crypto/Makefile | 5 +-
>> arch/x86/crypto/aes-gcm-aesni-x86_64.S | 12 +-
>> arch/x86/crypto/aes-gcm-vaes-avx2.S | 1150 +++++++++++++++++
>> ...m-avx10-x86_64.S => aes-gcm-vaes-avx512.S} | 722 +++++------
>> arch/x86/crypto/aesni-intel_glue.c | 264 ++--
>> 5 files changed, 1667 insertions(+), 486 deletions(-)
>> create mode 100644 arch/x86/crypto/aes-gcm-vaes-avx2.S
>> rename arch/x86/crypto/{aes-gcm-avx10-x86_64.S => aes-gcm-vaes-avx512.S} (69%)
>>
>> base-commit: 3b9b1f8df454caa453c7fb07689064edb2eda90a
>
> Applied to https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=libcrypto-next
Oops, I didn't see this email until it was too late. Since the
patches should be identical I don't think it matters.
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/8] VAES+AVX2 optimized implementation of AES-GCM
2025-10-17 8:25 ` Herbert Xu
@ 2025-10-17 8:44 ` Ard Biesheuvel
2025-10-17 16:04 ` Eric Biggers
0 siblings, 1 reply; 20+ messages in thread
From: Ard Biesheuvel @ 2025-10-17 8:44 UTC (permalink / raw)
To: Herbert Xu; +Cc: Eric Biggers, linux-crypto, linux-kernel, x86, Jason
On Fri, 17 Oct 2025 at 10:25, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> Eric Biggers <ebiggers@kernel.org> wrote:
> > On Wed, Oct 01, 2025 at 07:31:09PM -0700, Eric Biggers wrote:
> >> This patchset replaces the 256-bit vector implementation of AES-GCM for
> >> x86_64 with one that requires AVX2 rather than AVX512. This greatly
> >> improves AES-GCM performance on CPUs that have VAES but not AVX512, for
> >> example by up to 74% on AMD Zen 3. For more details, see patch 1.
> >>
> >> This patchset also renames the 512-bit vector implementation of AES-GCM
> >> for x86_64 to be named after AVX512 rather than AVX10/512, then adds
> >> some additional optimizations to it.
> >>
> >> This patchset applies to next-20250929 and is targeting 6.19. Herbert,
> >> I'd prefer to just apply this myself. But let me know if you'd prefer
> >> to take it instead (considering that AES-GCM hasn't been librarified
> >> yet). Either way, there's no hurry, since this is targeting 6.19.
> >>
> >> Eric Biggers (8):
> >> crypto: x86/aes-gcm - add VAES+AVX2 optimized code
> >> crypto: x86/aes-gcm - remove VAES+AVX10/256 optimized code
> >> crypto: x86/aes-gcm - rename avx10 and avx10_512 to avx512
> >> crypto: x86/aes-gcm - clean up AVX512 code to assume 512-bit vectors
> >> crypto: x86/aes-gcm - reorder AVX512 precompute and aad_update
> >> functions
> >> crypto: x86/aes-gcm - revise some comments in AVX512 code
> >> crypto: x86/aes-gcm - optimize AVX512 precomputation of H^2 from H^1
> >> crypto: x86/aes-gcm - optimize long AAD processing with AVX512
> >>
> >> arch/x86/crypto/Makefile | 5 +-
> >> arch/x86/crypto/aes-gcm-aesni-x86_64.S | 12 +-
> >> arch/x86/crypto/aes-gcm-vaes-avx2.S | 1150 +++++++++++++++++
> >> ...m-avx10-x86_64.S => aes-gcm-vaes-avx512.S} | 722 +++++------
> >> arch/x86/crypto/aesni-intel_glue.c | 264 ++--
> >> 5 files changed, 1667 insertions(+), 486 deletions(-)
> >> create mode 100644 arch/x86/crypto/aes-gcm-vaes-avx2.S
> >> rename arch/x86/crypto/{aes-gcm-avx10-x86_64.S => aes-gcm-vaes-avx512.S} (69%)
> >>
> >> base-commit: 3b9b1f8df454caa453c7fb07689064edb2eda90a
> >
> > Applied to https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=libcrypto-next
>
> Oops, I didn't see this email until it was too late. Since the
> patches should be identical I don't think it matters.
>
You also failed to apply my acked-by/tested-by so perhaps you should
just drop the patches from your tree again.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/8] VAES+AVX2 optimized implementation of AES-GCM
2025-10-17 8:44 ` Ard Biesheuvel
@ 2025-10-17 16:04 ` Eric Biggers
2025-10-17 20:50 ` Eric Biggers
2025-10-20 4:13 ` Herbert Xu
0 siblings, 2 replies; 20+ messages in thread
From: Eric Biggers @ 2025-10-17 16:04 UTC (permalink / raw)
To: Ard Biesheuvel; +Cc: Herbert Xu, linux-crypto, linux-kernel, x86, Jason
On Fri, Oct 17, 2025 at 10:44:37AM +0200, Ard Biesheuvel wrote:
> On Fri, 17 Oct 2025 at 10:25, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> >
> > Eric Biggers <ebiggers@kernel.org> wrote:
> > > On Wed, Oct 01, 2025 at 07:31:09PM -0700, Eric Biggers wrote:
> > >> This patchset replaces the 256-bit vector implementation of AES-GCM for
> > >> x86_64 with one that requires AVX2 rather than AVX512. This greatly
> > >> improves AES-GCM performance on CPUs that have VAES but not AVX512, for
> > >> example by up to 74% on AMD Zen 3. For more details, see patch 1.
> > >>
> > >> This patchset also renames the 512-bit vector implementation of AES-GCM
> > >> for x86_64 to be named after AVX512 rather than AVX10/512, then adds
> > >> some additional optimizations to it.
> > >>
> > >> This patchset applies to next-20250929 and is targeting 6.19. Herbert,
> > >> I'd prefer to just apply this myself. But let me know if you'd prefer
> > >> to take it instead (considering that AES-GCM hasn't been librarified
> > >> yet). Either way, there's no hurry, since this is targeting 6.19.
> > >>
> > >> Eric Biggers (8):
> > >> crypto: x86/aes-gcm - add VAES+AVX2 optimized code
> > >> crypto: x86/aes-gcm - remove VAES+AVX10/256 optimized code
> > >> crypto: x86/aes-gcm - rename avx10 and avx10_512 to avx512
> > >> crypto: x86/aes-gcm - clean up AVX512 code to assume 512-bit vectors
> > >> crypto: x86/aes-gcm - reorder AVX512 precompute and aad_update
> > >> functions
> > >> crypto: x86/aes-gcm - revise some comments in AVX512 code
> > >> crypto: x86/aes-gcm - optimize AVX512 precomputation of H^2 from H^1
> > >> crypto: x86/aes-gcm - optimize long AAD processing with AVX512
> > >>
> > >> arch/x86/crypto/Makefile | 5 +-
> > >> arch/x86/crypto/aes-gcm-aesni-x86_64.S | 12 +-
> > >> arch/x86/crypto/aes-gcm-vaes-avx2.S | 1150 +++++++++++++++++
> > >> ...m-avx10-x86_64.S => aes-gcm-vaes-avx512.S} | 722 +++++------
> > >> arch/x86/crypto/aesni-intel_glue.c | 264 ++--
> > >> 5 files changed, 1667 insertions(+), 486 deletions(-)
> > >> create mode 100644 arch/x86/crypto/aes-gcm-vaes-avx2.S
> > >> rename arch/x86/crypto/{aes-gcm-avx10-x86_64.S => aes-gcm-vaes-avx512.S} (69%)
> > >>
> > >> base-commit: 3b9b1f8df454caa453c7fb07689064edb2eda90a
> > >
> > > Applied to https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=libcrypto-next
> >
> > Oops, I didn't see this email until it was too late. Since the
> > patches should be identical I don't think it matters.
Well, it seems you didn't read the patchset (even the cover letter) or
any of the replies to it. So maybe I should just take it, as I already
said I preferred, and later did, since you hadn't said you wanted to
take it. It would have been okay if you had volunteered to take this,
but you need to actually read the patches and replies.
As for the patches being identical, besides correctly applying Ard's
tags, I made a couple very minor changes that weren't worth sending a v2
for: clarifying one of the commit messages, and correcting two comments
and dropping some unused aliases from aes-gcm-vaes-avx2.S.
- Eric
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/8] crypto: x86/aes-gcm - add VAES+AVX2 optimized code
2025-10-02 2:31 ` [PATCH 1/8] crypto: x86/aes-gcm - add VAES+AVX2 optimized code Eric Biggers
@ 2025-10-17 18:34 ` Eric Biggers
0 siblings, 0 replies; 20+ messages in thread
From: Eric Biggers @ 2025-10-17 18:34 UTC (permalink / raw)
To: linux-crypto; +Cc: linux-kernel, x86, Ard Biesheuvel, Jason A . Donenfeld
On Wed, Oct 01, 2025 at 07:31:10PM -0700, Eric Biggers wrote:
> Add an implementation of AES-GCM that uses 256-bit vectors and the
> following CPU features: Vector AES (VAES), Vector Carryless
> Multiplication (VPCLMULQDQ), and AVX2.
A few non-functional cleanups I applied after reading over the assembly
file again (wasn't worth resending the whole patchset):
diff --git a/arch/x86/crypto/aes-gcm-vaes-avx2.S b/arch/x86/crypto/aes-gcm-vaes-avx2.S
index e628dbb33c0e..f58096a37342 100644
--- a/arch/x86/crypto/aes-gcm-vaes-avx2.S
+++ b/arch/x86/crypto/aes-gcm-vaes-avx2.S
@@ -231,11 +231,10 @@ SYM_FUNC_START(aes_gcm_precompute_vaes_avx2)
.set TMP2, %ymm2
.set TMP2_XMM, %xmm2
.set H_CUR, %ymm3
.set H_CUR_XMM, %xmm3
.set H_CUR2, %ymm4
- .set H_CUR2_XMM, %xmm4
.set H_INC, %ymm5
.set H_INC_XMM, %xmm5
.set GFPOLY, %ymm6
.set GFPOLY_XMM, %xmm6
@@ -576,11 +575,10 @@ SYM_FUNC_START(aes_gcm_aad_update_vaes_avx2)
jz .Laad_done
cmp $16, AADLEN
jle .Laad_lastblock
-.Laad_last2blocks:
// Update GHASH with the remaining 17 <= AADLEN <= 31 bytes of AAD.
mov AADLEN, AADLEN // Zero-extend AADLEN to AADLEN64.
vmovdqu (AAD), TMP0_XMM
vmovdqu -16(AAD, AADLEN64), TMP1_XMM
vpshufb BSWAP_MASK_XMM, TMP0_XMM, TMP0_XMM
@@ -632,11 +630,11 @@ SYM_FUNC_END(aes_gcm_aad_update_vaes_avx2)
vpxor RNDKEY0, AESDATA\i, AESDATA\i
.endr
.endm
// Generate and encrypt counter blocks in the given AESDATA vectors, excluding
-// the last AES round. Clobbers TMP0.
+// the last AES round. Clobbers %rax and TMP0.
.macro _aesenc_loop vecs:vararg
_ctr_begin \vecs
lea 16(KEY), %rax
.Laesenc_loop\@:
vbroadcasti128 (%rax), TMP0
@@ -687,11 +685,11 @@ SYM_FUNC_END(aes_gcm_aad_update_vaes_avx2)
.set KEY, %rdi
.set LE_CTR_PTR, %rsi
.set LE_CTR_PTR32, %esi
.set GHASH_ACC_PTR, %rdx
.set SRC, %rcx // Assumed to be %rcx.
- // See .Ltail_xor_and_ghash_partial_vec
+ // See .Ltail_xor_and_ghash_1to16bytes
.set DST, %r8
.set DATALEN, %r9d
.set DATALEN64, %r9 // Zero-extend DATALEN before using!
// Additional local variables
@@ -734,11 +732,10 @@ SYM_FUNC_END(aes_gcm_aad_update_vaes_avx2)
// H_POW[2-1]_XORED contain cached values from KEY->h_powers_xored. The
// descending numbering reflects the order of the key powers.
.set H_POW2_XORED, %ymm7
.set H_POW2_XORED_XMM, %xmm7
.set H_POW1_XORED, %ymm8
- .set H_POW1_XORED_XMM, %xmm8
// RNDKEY0 caches the zero-th round key, and RNDKEYLAST the last one.
.set RNDKEY0, %ymm9
.set RNDKEYLAST, %ymm10
@@ -749,13 +746,11 @@ SYM_FUNC_END(aes_gcm_aad_update_vaes_avx2)
.set AESDATA0, %ymm12
.set AESDATA0_XMM, %xmm12
.set AESDATA1, %ymm13
.set AESDATA1_XMM, %xmm13
.set AESDATA2, %ymm14
- .set AESDATA2_XMM, %xmm14
.set AESDATA3, %ymm15
- .set AESDATA3_XMM, %xmm15
.if \enc
.set GHASHDATA_PTR, DST
.else
.set GHASHDATA_PTR, SRC
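As an aside on the aad_update hunk above: the two vmovdqu loads use the
common overlapping-load trick for a 17 to 31 byte tail, reading the
first 16 bytes and the last 16 bytes of the buffer so that no
byte-granular loop is needed. A plain C sketch of just the load
pattern (illustration only, not kernel code; the assembly additionally
has to account for the overlapping middle bytes when folding both
blocks into GHASH):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Illustration only: cover a buffer of 17 <= len <= 31 bytes with two
 * 16-byte loads.  Bytes len-16 through 15 land in both lo and hi, so
 * the caller must avoid double-counting them.
 */
static void load_tail_17_to_31(const uint8_t *p, size_t len,
			       uint8_t lo[16], uint8_t hi[16])
{
	memcpy(lo, p, 16);            /* bytes 0..15         */
	memcpy(hi, p + len - 16, 16); /* bytes len-16..len-1 */
}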
^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH 0/8] VAES+AVX2 optimized implementation of AES-GCM
2025-10-17 16:04 ` Eric Biggers
@ 2025-10-17 20:50 ` Eric Biggers
2025-10-20 4:13 ` Herbert Xu
1 sibling, 0 replies; 20+ messages in thread
From: Eric Biggers @ 2025-10-17 20:50 UTC (permalink / raw)
To: Herbert Xu; +Cc: linux-crypto, Ard Biesheuvel, linux-kernel, x86, Jason
On Fri, Oct 17, 2025 at 09:04:37AM -0700, Eric Biggers wrote:
> On Fri, Oct 17, 2025 at 10:44:37AM +0200, Ard Biesheuvel wrote:
> > On Fri, 17 Oct 2025 at 10:25, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> > >
> > > Eric Biggers <ebiggers@kernel.org> wrote:
> > > > On Wed, Oct 01, 2025 at 07:31:09PM -0700, Eric Biggers wrote:
> > > >> This patchset replaces the 256-bit vector implementation of AES-GCM for
> > > >> x86_64 with one that requires AVX2 rather than AVX512. This greatly
> > > >> improves AES-GCM performance on CPUs that have VAES but not AVX512, for
> > > >> example by up to 74% on AMD Zen 3. For more details, see patch 1.
> > > >>
> > > >> This patchset also renames the 512-bit vector implementation of AES-GCM
> > > >> for x86_64 to be named after AVX512 rather than AVX10/512, then adds
> > > >> some additional optimizations to it.
> > > >>
> > > >> This patchset applies to next-20250929 and is targeting 6.19. Herbert,
> > > >> I'd prefer to just apply this myself. But let me know if you'd prefer
> > > >> to take it instead (considering that AES-GCM hasn't been librarified
> > > >> yet). Either way, there's no hurry, since this is targeting 6.19.
> > > >>
> > > >> Eric Biggers (8):
> > > >> crypto: x86/aes-gcm - add VAES+AVX2 optimized code
> > > >> crypto: x86/aes-gcm - remove VAES+AVX10/256 optimized code
> > > >> crypto: x86/aes-gcm - rename avx10 and avx10_512 to avx512
> > > >> crypto: x86/aes-gcm - clean up AVX512 code to assume 512-bit vectors
> > > >> crypto: x86/aes-gcm - reorder AVX512 precompute and aad_update
> > > >> functions
> > > >> crypto: x86/aes-gcm - revise some comments in AVX512 code
> > > >> crypto: x86/aes-gcm - optimize AVX512 precomputation of H^2 from H^1
> > > >> crypto: x86/aes-gcm - optimize long AAD processing with AVX512
> > > >>
> > > >> arch/x86/crypto/Makefile | 5 +-
> > > >> arch/x86/crypto/aes-gcm-aesni-x86_64.S | 12 +-
> > > >> arch/x86/crypto/aes-gcm-vaes-avx2.S | 1150 +++++++++++++++++
> > > >> ...m-avx10-x86_64.S => aes-gcm-vaes-avx512.S} | 722 +++++------
> > > >> arch/x86/crypto/aesni-intel_glue.c | 264 ++--
> > > >> 5 files changed, 1667 insertions(+), 486 deletions(-)
> > > >> create mode 100644 arch/x86/crypto/aes-gcm-vaes-avx2.S
> > > >> rename arch/x86/crypto/{aes-gcm-avx10-x86_64.S => aes-gcm-vaes-avx512.S} (69%)
> > > >>
> > > >> base-commit: 3b9b1f8df454caa453c7fb07689064edb2eda90a
> > > >
> > > > Applied to https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=libcrypto-next
> > >
> > > Oops, I didn't see this email until it was too late. Since the
> > > patches should be identical I don't think it matters.
>
> Well, it seems you didn't read the patchset (even the cover letter) or
> any of the replies to it. So maybe I should just take it, as I already
> said I preferred, and later did do since you hadn't said you wanted to
> take it. It would have been okay if you had volunteered to take this,
> but you need to actually read the patches and replies.
>
> As for the patches being identical, besides correctly applying Ard's
> tags, I made a couple very minor changes that weren't worth sending a v2
> for: clarifying one of the commit messages, and correcting two comments
> and dropping some unused aliases from aes-gcm-vaes-avx2.S.
And to be clear, these aren't going to go through two trees. That would
be silly. If you really want to take them after all, then ask me to
drop them first, and make sure to apply them properly with Acked-by and
Tested-by tags. Otherwise, please drop your duplicate copy.
Thanks,
- Eric
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/8] VAES+AVX2 optimized implementation of AES-GCM
2025-10-17 16:04 ` Eric Biggers
2025-10-17 20:50 ` Eric Biggers
@ 2025-10-20 4:13 ` Herbert Xu
2025-10-20 16:57 ` Eric Biggers
1 sibling, 1 reply; 20+ messages in thread
From: Herbert Xu @ 2025-10-20 4:13 UTC (permalink / raw)
To: Eric Biggers; +Cc: Ard Biesheuvel, linux-crypto, linux-kernel, x86, Jason
On Fri, Oct 17, 2025 at 09:04:37AM -0700, Eric Biggers wrote:
>
> Well, it seems you didn't read the patchset (even the cover letter) or
> any of the replies to it. So maybe I should just take it, as I already
> said I preferred, and later did do since you hadn't said you wanted to
> take it. It would have been okay if you had volunteered to take this,
> but you need to actually read the patches and replies.
The reason I didn't see your cover letter is that you didn't send it
to me. Your To/CC list was:
To: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org,
x86@kernel.org,
Ard Biesheuvel <ardb@kernel.org>,
"Jason A . Donenfeld" <Jason@zx2c4.com>,
Eric Biggers <ebiggers@kernel.org>
So all I get is the patches without the cover letter. Of course, any
replies to the cover letter won't be visible to me either.
Please consider adding my email address to the Cc list next time.
I will drop this patch-set.
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/8] VAES+AVX2 optimized implementation of AES-GCM
2025-10-20 4:13 ` Herbert Xu
@ 2025-10-20 16:57 ` Eric Biggers
2025-10-21 3:00 ` Herbert Xu
0 siblings, 1 reply; 20+ messages in thread
From: Eric Biggers @ 2025-10-20 16:57 UTC (permalink / raw)
To: Herbert Xu; +Cc: Ard Biesheuvel, linux-crypto, linux-kernel, x86, Jason
On Mon, Oct 20, 2025 at 12:13:48PM +0800, Herbert Xu wrote:
> On Fri, Oct 17, 2025 at 09:04:37AM -0700, Eric Biggers wrote:
> >
> > Well, it seems you didn't read the patchset (even the cover letter) or
> > any of the replies to it. So maybe I should just take it, as I already
> > said I preferred, and later did do since you hadn't said you wanted to
> > take it. It would have been okay if you had volunteered to take this,
> > but you need to actually read the patches and replies.
>
> The reason I didn't see your cover letter is that you didn't send it
> to me. Your To/CC list was:
>
> To: linux-crypto@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org,
> x86@kernel.org,
> Ard Biesheuvel <ardb@kernel.org>,
> "Jason A . Donenfeld" <Jason@zx2c4.com>,
> Eric Biggers <ebiggers@kernel.org>
>
> So all I get is the patches without the cover letter. Of course, any
> replies to the cover letter won't be visible to me either.
>
> Please consider adding my email address to the Cc list next time.
Well, one would think you would be subscribed to linux-crypto.
But whatever, I'll Cc you explicitly on future patches.
> I will drop this patch-set.
Thanks,
- Eric
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/8] VAES+AVX2 optimized implementation of AES-GCM
2025-10-20 16:57 ` Eric Biggers
@ 2025-10-21 3:00 ` Herbert Xu
0 siblings, 0 replies; 20+ messages in thread
From: Herbert Xu @ 2025-10-21 3:00 UTC (permalink / raw)
To: Eric Biggers; +Cc: Ard Biesheuvel, linux-crypto, linux-kernel, x86, Jason
On Mon, Oct 20, 2025 at 09:57:58AM -0700, Eric Biggers wrote:
>
> Well, one would think you would be subscribed to linux-crypto.
> But whatever, I'll Cc you explicitly on future patches.
I never said that I'm not subscribed to linux-crypto. I will
eventually see it, but it could be too late if you wanted me to act on
an item that's only in the cover letter.
Thanks,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads: [~2025-10-21 3:00 UTC | newest]
Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-02 2:31 [PATCH 0/8] VAES+AVX2 optimized implementation of AES-GCM Eric Biggers
2025-10-02 2:31 ` [PATCH 1/8] crypto: x86/aes-gcm - add VAES+AVX2 optimized code Eric Biggers
2025-10-17 18:34 ` Eric Biggers
2025-10-02 2:31 ` [PATCH 2/8] crypto: x86/aes-gcm - remove VAES+AVX10/256 " Eric Biggers
2025-10-02 2:31 ` [PATCH 3/8] crypto: x86/aes-gcm - rename avx10 and avx10_512 to avx512 Eric Biggers
2025-10-02 2:31 ` [PATCH 4/8] crypto: x86/aes-gcm - clean up AVX512 code to assume 512-bit vectors Eric Biggers
2025-10-02 2:31 ` [PATCH 5/8] crypto: x86/aes-gcm - reorder AVX512 precompute and aad_update functions Eric Biggers
2025-10-02 2:31 ` [PATCH 6/8] crypto: x86/aes-gcm - revise some comments in AVX512 code Eric Biggers
2025-10-02 2:31 ` [PATCH 7/8] crypto: x86/aes-gcm - optimize AVX512 precomputation of H^2 from H^1 Eric Biggers
2025-10-02 2:31 ` [PATCH 8/8] crypto: x86/aes-gcm - optimize long AAD processing with AVX512 Eric Biggers
2025-10-10 18:21 ` [PATCH 0/8] VAES+AVX2 optimized implementation of AES-GCM Ard Biesheuvel
2025-10-14 0:31 ` Eric Biggers
2025-10-17 8:25 ` Herbert Xu
2025-10-17 8:44 ` Ard Biesheuvel
2025-10-17 16:04 ` Eric Biggers
2025-10-17 20:50 ` Eric Biggers
2025-10-20 4:13 ` Herbert Xu
2025-10-20 16:57 ` Eric Biggers
2025-10-21 3:00 ` Herbert Xu
2025-10-17 8:24 ` Herbert Xu