* [RFC PATCH 0/2] Eliminate the no-SIMD en/decryption fallbacks on x86
From: Eric Biggers @ 2025-02-20 5:13 UTC
To: x86
Cc: linux-crypto, linux-kernel, Ard Biesheuvel, Ben Greear,
Xiao Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Andy Lutomirski, Jason A. Donenfeld
The patchset can also be retrieved from:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git x86-softirq-fpu-fix-v1
This patchset fixes a longstanding issue where kernel-mode FPU (i.e.,
SIMD) was not reliably usable in softirqs on x86, which created the
need for a fallback. The fallback was really bad for performance, and
it even hurt performance for users who never encountered the edge case
where kernel-mode FPU was not usable.
This patchset aligns x86 with other architectures such as arm, arm64,
and riscv by making kernel-mode FPU work in softirqs reliably. There
are a few possible ways to achieve that, and for now I just went with
the simplest way; see patch 1 for details.
Patch 2 eliminates all uses of the "crypto SIMD helper" from x86, as
patch 1 makes it unnecessary. For the RFC it is just one big patch;
I'll probably split patch 2 up if this progresses past RFC status.
Performance results have been positive. All en/decryption is now
slightly faster on x86, as it no longer takes a detour through
crypto/simd.c. For AES-XTS, for example, I measured a 23% improvement
for 512-byte messages and a 7% improvement for 4096-byte messages.
I also benchmarked bidirectional IPsec, which has been claimed to often
hit the edge case where kernel-mode FPU was previously not usable in
softirq context. I was not able to reproduce that edge case being
reached unless I reduced the number of CPUs to 1, in which case it was
occasionally reached. Regardless, even without that case being reached,
IPsec throughput still improved by 2%. In situations where that case is
reached, or where users require a synchronous algorithm, the improvement
should be much larger.
Eric Biggers (2):
x86/fpu: make kernel-mode FPU reliably usable in softirqs
crypto: x86 - stop using the SIMD helper
arch/x86/crypto/Kconfig | 14 --
arch/x86/crypto/aegis128-aesni-glue.c | 13 +-
arch/x86/crypto/aesni-intel_glue.c | 168 ++++++++-------------
arch/x86/crypto/aria_aesni_avx2_glue.c | 22 +--
arch/x86/crypto/aria_aesni_avx_glue.c | 20 +--
arch/x86/crypto/aria_gfni_avx512_glue.c | 22 +--
arch/x86/crypto/camellia_aesni_avx2_glue.c | 21 +--
arch/x86/crypto/camellia_aesni_avx_glue.c | 21 +--
arch/x86/crypto/cast5_avx_glue.c | 21 +--
arch/x86/crypto/cast6_avx_glue.c | 20 +--
arch/x86/crypto/serpent_avx2_glue.c | 21 +--
arch/x86/crypto/serpent_avx_glue.c | 21 +--
arch/x86/crypto/serpent_sse2_glue.c | 21 +--
arch/x86/crypto/sm4_aesni_avx2_glue.c | 30 ++--
arch/x86/crypto/sm4_aesni_avx_glue.c | 30 ++--
arch/x86/crypto/twofish_avx_glue.c | 21 +--
arch/x86/include/asm/fpu/api.h | 17 +--
arch/x86/kernel/fpu/core.c | 37 ++---
18 files changed, 180 insertions(+), 360 deletions(-)
base-commit: 0ad2507d5d93f39619fc42372c347d6006b64319
prerequisite-patch-id: ec1feea7e6f4d03e4e4c64c492197b89c957611a
--
2.48.1
* [RFC PATCH 1/2] x86/fpu: make kernel-mode FPU reliably usable in softirqs
From: Eric Biggers @ 2025-02-20 5:13 UTC
To: x86
Cc: linux-crypto, linux-kernel, Ard Biesheuvel, Ben Greear,
Xiao Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Andy Lutomirski, Jason A. Donenfeld
From: Eric Biggers <ebiggers@google.com>
Currently kernel-mode FPU is not always usable in softirq context on
x86, since softirqs can nest inside a kernel-mode FPU section in task
context, and nested use of kernel-mode FPU is not supported.
Therefore, x86 SIMD-optimized code that can be called in softirq context
sometimes has to fall back to non-SIMD code. There are two options for
the fallback, both of which are pretty terrible (a typical dispatch
pattern is sketched after this list):
(a) Use a scalar fallback. This can be 10-100x slower than vectorized
code because it cannot use specialized instructions like AES, SHA,
or carryless multiplication.
(b) Execute the request asynchronously using a kworker. In other
words, use the "crypto SIMD helper" in crypto/simd.c.
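As a sketch of the option (a) dispatch: crypto_simd_usable(),
kernel_fpu_begin(), and kernel_fpu_end() are real APIs, while the
my_cipher_*() helpers below are hypothetical stand-ins for a driver's
SIMD and scalar code paths:

	static void my_cipher_crypt_block(struct my_cipher_ctx *ctx,
					  u8 *dst, const u8 *src)
	{
		if (!crypto_simd_usable()) {
			/* E.g. a softirq nested inside a task-context
			 * kernel-mode FPU section: take the scalar
			 * path, which can be 10-100x slower. */
			my_cipher_crypt_scalar(ctx, dst, src);
			return;
		}
		kernel_fpu_begin();
		my_cipher_crypt_simd(ctx, dst, src);
		kernel_fpu_end();
	}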
Currently most of the x86 en/decryption code (skcipher and aead
algorithms) uses option (b), since this avoids the slow scalar fallback
and it is easier to wire up. But option (b) is still really bad for its
own reasons:
- Punting the request to a kworker is bad for performance too.
- It forces the algorithm to be marked as asynchronous
(CRYPTO_ALG_ASYNC), preventing it from being used by crypto API
users who request a synchronous algorithm. That's another huge
performance problem, which is especially unfortunate for users who
don't even do en/decryption in softirq context.
- It makes all en/decryption operations take a detour through
crypto/simd.c. That involves additional checks and an additional
indirect call, which slow down en/decryption for *everyone*.
Fortunately, the skcipher and aead APIs are only usable in task and
softirq context in the first place, and calling them with hardirqs
disabled is not supported either. Thus, if kernel-mode FPU were to be reliably
usable in softirq context, no fallback would be needed. Indeed, other
architectures such as arm, arm64, and riscv have already done this.
Therefore, this patch updates x86 accordingly to reliably support
kernel-mode FPU in softirqs (except when hardirqs are disabled).
This is done by just disabling softirq processing in kernel-mode FPU
sections, as that prevents the nesting that was problematic.
This will delay some softirqs slightly, but only ones that would have
otherwise been nested inside a task context kernel-mode FPU section.
Any such softirqs would have taken the slow fallback path before if they
tried to do any en/decryption. Now these softirqs will just run at the
end of the task context kernel-mode FPU section (since local_bh_enable()
runs pending softirqs) and will no longer take the slow fallback path.
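Conceptually, on !PREEMPT_RT kernels the begin/end pair now behaves as
in the following sketch (this is only an illustration; the actual
change below routes it through fpregs_lock()/fpregs_unlock(), which use
preempt_disable()/preempt_enable() instead on PREEMPT_RT):

	void kernel_fpu_begin(void)
	{
		local_bh_disable();  /* disables preemption and softirqs */
		/* ... prepare/save FPU state ... */
	}

	void kernel_fpu_end(void)
	{
		/* ... */
		local_bh_enable();   /* runs any pending softirqs */
	}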
To comply with the requirements of local_bh_disable() and
local_bh_enable(), this change also removes support for kernel-mode FPU
in hardirq context or with hardirqs disabled. This should not be a
problem, though. There
does not appear to be any use case for kernel-mode FPU in such contexts,
and notably arm64 and riscv already have these same conditions.
Alternatives considered:
- Make kernel-mode FPU sections fully preemptible. This would require
growing task_struct by another struct fpstate which is more than 2K.
- Make softirqs save/restore the kernel-mode FPU state to a per-CPU
struct fpstate when nested use is detected. Somewhat interesting, but
seems unnecessary when a simpler solution exists.
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
arch/x86/include/asm/fpu/api.h | 17 +++++++---------
arch/x86/kernel/fpu/core.c | 37 +++++++++++-----------------------
2 files changed, 19 insertions(+), 35 deletions(-)
diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index f86ad3335529d..f42de5f05e7eb 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -14,14 +14,13 @@
#include <asm/fpu/types.h>
/*
* Use kernel_fpu_begin/end() if you intend to use FPU in kernel context. It
- * disables preemption so be careful if you intend to use it for long periods
- * of time.
- * If you intend to use the FPU in irq/softirq you need to check first with
- * irq_fpu_usable() if it is possible.
+ * disables preemption and softirq processing, so be careful if you intend to
+ * use it for long periods of time. Kernel-mode FPU cannot be used in all
+ * contexts -- see irq_fpu_usable() for details.
*/
/* Kernel FPU states to initialize in kernel_fpu_begin_mask() */
#define KFPU_387 _BITUL(0) /* 387 state will be initialized */
#define KFPU_MXCSR _BITUL(1) /* MXCSR will be initialized */
@@ -48,25 +47,23 @@ static inline void kernel_fpu_begin(void)
kernel_fpu_begin_mask(KFPU_387 | KFPU_MXCSR);
#endif
}
/*
- * Use fpregs_lock() while editing CPU's FPU registers or fpu->fpstate.
- * A context switch will (and softirq might) save CPU's FPU registers to
- * fpu->fpstate.regs and set TIF_NEED_FPU_LOAD leaving CPU's FPU registers in
- * a random state.
+ * Use fpregs_lock() while editing CPU's FPU registers or fpu->fpstate, or while
+ * using the FPU in kernel mode. A context switch will (and softirq might) save
+ * CPU's FPU registers to fpu->fpstate.regs and set TIF_NEED_FPU_LOAD leaving
+ * CPU's FPU registers in a random state.
*
* local_bh_disable() protects against both preemption and soft interrupts
* on !RT kernels.
*
* On RT kernels local_bh_disable() is not sufficient because it only
* serializes soft interrupt related sections via a local lock, but stays
* preemptible. Disabling preemption is the right choice here as bottom
* half processing is always in thread context on RT kernels so it
* implicitly prevents bottom half processing as well.
- *
- * Disabling preemption also serializes against kernel_fpu_begin().
*/
static inline void fpregs_lock(void)
{
if (!IS_ENABLED(CONFIG_PREEMPT_RT))
local_bh_disable();
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 1209c7aebb211..0f7268452bf20 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -55,35 +55,22 @@ DEFINE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx);
* Can we use the FPU in kernel mode with the
* whole "kernel_fpu_begin/end()" sequence?
*/
bool irq_fpu_usable(void)
{
- if (WARN_ON_ONCE(in_nmi()))
- return false;
-
- /* In kernel FPU usage already active? */
- if (this_cpu_read(in_kernel_fpu))
- return false;
-
/*
- * When not in NMI or hard interrupt context, FPU can be used in:
- *
- * - Task context except from within fpregs_lock()'ed critical
- * regions.
- *
- * - Soft interrupt processing context which cannot happen
- * while in a fpregs_lock()'ed critical region.
+ * kernel_fpu_begin() takes fpregs_lock(), which disables preemption and
+ * softirq processing. That prevents any other task or softirq from
+ * trying to use the FPU. Therefore, kernel-mode FPU can always be used
+ * in task and softirq context, except when hardirqs are disabled which
+ * is not compatible with disabling and enabling softirq processing, or
+ * when kernel-mode FPU is explicitly nested (which should never
+ * happen). Disabling/enabling softirq processing is also not allowed
+ * in hardirq context. Thus, we get the following condition.
*/
- if (!in_hardirq())
- return true;
-
- /*
- * In hard interrupt context it's safe when soft interrupts
- * are enabled, which means the interrupt did not hit in
- * a fpregs_lock()'ed critical region.
- */
- return !softirq_count();
+ return !this_cpu_read(in_kernel_fpu) &&
+ !in_hardirq() && !irqs_disabled() && !in_nmi();
}
EXPORT_SYMBOL(irq_fpu_usable);
/*
* Track AVX512 state use because it is known to slow the max clock
@@ -418,11 +405,11 @@ int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf,
EXPORT_SYMBOL_GPL(fpu_copy_uabi_to_guest_fpstate);
#endif /* CONFIG_KVM */
void kernel_fpu_begin_mask(unsigned int kfpu_mask)
{
- preempt_disable();
+ fpregs_lock();
WARN_ON_FPU(!irq_fpu_usable());
WARN_ON_FPU(this_cpu_read(in_kernel_fpu));
this_cpu_write(in_kernel_fpu, true);
@@ -446,11 +433,11 @@ EXPORT_SYMBOL_GPL(kernel_fpu_begin_mask);
void kernel_fpu_end(void)
{
WARN_ON_FPU(!this_cpu_read(in_kernel_fpu));
this_cpu_write(in_kernel_fpu, false);
- preempt_enable();
+ fpregs_unlock();
}
EXPORT_SYMBOL_GPL(kernel_fpu_end);
/*
* Sync the FPU register state to current's memory register state when the
base-commit: 0ad2507d5d93f39619fc42372c347d6006b64319
prerequisite-patch-id: ec1feea7e6f4d03e4e4c64c492197b89c957611a
--
2.48.1
* [RFC PATCH 2/2] crypto: x86 - stop using the SIMD helper
From: Eric Biggers @ 2025-02-20 5:13 UTC
To: x86
Cc: linux-crypto, linux-kernel, Ard Biesheuvel, Ben Greear,
Xiao Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Andy Lutomirski, Jason A. Donenfeld
From: Eric Biggers <ebiggers@google.com>
(This is an RFC that just updates all algorithms in one big patch. I'll
split this up by file if it's actually going to be merged.)
Stop wrapping skcipher and aead algorithms with the crypto SIMD helper
(crypto/simd.c). The only purpose of doing so was to work around x86
not always supporting kernel-mode FPU in softirqs. Specifically, if a
hardirq interrupted a task context kernel-mode FPU section and then
softirqs were run at the end of that hardirq, those softirqs could not
use kernel-mode FPU. This has now been fixed. In combination with the
fact that the skcipher and aead APIs only support task and softirq
contexts, these can now just use kernel-mode FPU unconditionally on x86.
This simplifies the code and improves performance.
En/decryption gets at least somewhat faster for everyone, since the
crypto API functions such as crypto_skcipher_encrypt() now go directly
to the underlying algorithm rather than taking a detour through
crypto/simd.c which involved an extra indirect call. For example, on a
Ryzen 9 9950X desktop processor, AES-256-XTS is now 23% faster for
512-byte messages and 7% faster for 4096-byte messages (when accessed
through crypto_skcipher_encrypt() or crypto_skcipher_decrypt()).
There's also a much larger performance improvement for crypto API users
that only support synchronous algorithms. These users will now actually
use the x86 SIMD (e.g. AES-NI or VAES) optimized en/decryption modes,
which they couldn't before because they were marked as asynchronous.
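As an illustrative sketch (example_alloc_sync_xts() is a hypothetical
caller, but passing CRYPTO_ALG_ASYNC as the mask to
crypto_alloc_skcipher() is the normal way to request a synchronous
algorithm), such a user does:

	static int example_alloc_sync_xts(struct crypto_skcipher **tfm_ret)
	{
		struct crypto_skcipher *tfm;

		/* Request a synchronous xts(aes). Before this patch the
		 * x86 SIMD implementations were marked CRYPTO_ALG_ASYNC
		 * and were skipped here; now e.g. "xts-aes-aesni" or a
		 * VAES variant can be selected directly. */
		tfm = crypto_alloc_skcipher("xts(aes)", 0, CRYPTO_ALG_ASYNC);
		if (IS_ERR(tfm))
			return PTR_ERR(tfm);

		*tfm_ret = tfm;
		return 0;
	}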
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
arch/x86/crypto/Kconfig | 14 --
arch/x86/crypto/aegis128-aesni-glue.c | 13 +-
arch/x86/crypto/aesni-intel_glue.c | 168 ++++++++-------------
arch/x86/crypto/aria_aesni_avx2_glue.c | 22 +--
arch/x86/crypto/aria_aesni_avx_glue.c | 20 +--
arch/x86/crypto/aria_gfni_avx512_glue.c | 22 +--
arch/x86/crypto/camellia_aesni_avx2_glue.c | 21 +--
arch/x86/crypto/camellia_aesni_avx_glue.c | 21 +--
arch/x86/crypto/cast5_avx_glue.c | 21 +--
arch/x86/crypto/cast6_avx_glue.c | 20 +--
arch/x86/crypto/serpent_avx2_glue.c | 21 +--
arch/x86/crypto/serpent_avx_glue.c | 21 +--
arch/x86/crypto/serpent_sse2_glue.c | 21 +--
arch/x86/crypto/sm4_aesni_avx2_glue.c | 30 ++--
arch/x86/crypto/sm4_aesni_avx_glue.c | 30 ++--
arch/x86/crypto/twofish_avx_glue.c | 21 +--
16 files changed, 161 insertions(+), 325 deletions(-)
diff --git a/arch/x86/crypto/Kconfig b/arch/x86/crypto/Kconfig
index 4757bf922075b..5ef3c32d3d5f0 100644
--- a/arch/x86/crypto/Kconfig
+++ b/arch/x86/crypto/Kconfig
@@ -19,11 +19,10 @@ config CRYPTO_AES_NI_INTEL
select CRYPTO_AEAD
select CRYPTO_LIB_AES
select CRYPTO_LIB_GF128MUL
select CRYPTO_ALGAPI
select CRYPTO_SKCIPHER
- select CRYPTO_SIMD
help
Block cipher: AES cipher algorithms
AEAD cipher: AES with GCM
Length-preserving ciphers: AES with ECB, CBC, CTS, CTR, XCTR, XTS
@@ -60,11 +59,10 @@ config CRYPTO_CAMELLIA_X86_64
config CRYPTO_CAMELLIA_AESNI_AVX_X86_64
tristate "Ciphers: Camellia with modes: ECB, CBC (AES-NI/AVX)"
depends on X86 && 64BIT
select CRYPTO_SKCIPHER
select CRYPTO_CAMELLIA_X86_64
- select CRYPTO_SIMD
imply CRYPTO_XTS
help
Length-preserving ciphers: Camellia with ECB and CBC modes
Architecture: x86_64 using:
@@ -86,11 +84,10 @@ config CRYPTO_CAST5_AVX_X86_64
tristate "Ciphers: CAST5 with modes: ECB, CBC (AVX)"
depends on X86 && 64BIT
select CRYPTO_SKCIPHER
select CRYPTO_CAST5
select CRYPTO_CAST_COMMON
- select CRYPTO_SIMD
imply CRYPTO_CTR
help
Length-preserving ciphers: CAST5 (CAST-128) cipher algorithm
(RFC2144) with ECB and CBC modes
@@ -103,11 +100,10 @@ config CRYPTO_CAST6_AVX_X86_64
tristate "Ciphers: CAST6 with modes: ECB, CBC (AVX)"
depends on X86 && 64BIT
select CRYPTO_SKCIPHER
select CRYPTO_CAST6
select CRYPTO_CAST_COMMON
- select CRYPTO_SIMD
imply CRYPTO_XTS
imply CRYPTO_CTR
help
Length-preserving ciphers: CAST6 (CAST-256) cipher algorithm
(RFC2612) with ECB and CBC modes
@@ -134,11 +130,10 @@ config CRYPTO_DES3_EDE_X86_64
config CRYPTO_SERPENT_SSE2_X86_64
tristate "Ciphers: Serpent with modes: ECB, CBC (SSE2)"
depends on X86 && 64BIT
select CRYPTO_SKCIPHER
select CRYPTO_SERPENT
- select CRYPTO_SIMD
imply CRYPTO_CTR
help
Length-preserving ciphers: Serpent cipher algorithm
with ECB and CBC modes
@@ -150,11 +145,10 @@ config CRYPTO_SERPENT_SSE2_X86_64
config CRYPTO_SERPENT_SSE2_586
tristate "Ciphers: Serpent with modes: ECB, CBC (32-bit with SSE2)"
depends on X86 && !64BIT
select CRYPTO_SKCIPHER
select CRYPTO_SERPENT
- select CRYPTO_SIMD
imply CRYPTO_CTR
help
Length-preserving ciphers: Serpent cipher algorithm
with ECB and CBC modes
@@ -166,11 +160,10 @@ config CRYPTO_SERPENT_SSE2_586
config CRYPTO_SERPENT_AVX_X86_64
tristate "Ciphers: Serpent with modes: ECB, CBC (AVX)"
depends on X86 && 64BIT
select CRYPTO_SKCIPHER
select CRYPTO_SERPENT
- select CRYPTO_SIMD
imply CRYPTO_XTS
imply CRYPTO_CTR
help
Length-preserving ciphers: Serpent cipher algorithm
with ECB and CBC modes
@@ -195,11 +188,10 @@ config CRYPTO_SERPENT_AVX2_X86_64
config CRYPTO_SM4_AESNI_AVX_X86_64
tristate "Ciphers: SM4 with modes: ECB, CBC, CTR (AES-NI/AVX)"
depends on X86 && 64BIT
select CRYPTO_SKCIPHER
- select CRYPTO_SIMD
select CRYPTO_ALGAPI
select CRYPTO_SM4
help
Length-preserving ciphers: SM4 cipher algorithms
(OSCCA GB/T 32907-2016) with ECB, CBC, and CTR modes
@@ -216,11 +208,10 @@ config CRYPTO_SM4_AESNI_AVX_X86_64
config CRYPTO_SM4_AESNI_AVX2_X86_64
tristate "Ciphers: SM4 with modes: ECB, CBC, CTR (AES-NI/AVX2)"
depends on X86 && 64BIT
select CRYPTO_SKCIPHER
- select CRYPTO_SIMD
select CRYPTO_ALGAPI
select CRYPTO_SM4
select CRYPTO_SM4_AESNI_AVX_X86_64
help
Length-preserving ciphers: SM4 cipher algorithms
@@ -275,11 +266,10 @@ config CRYPTO_TWOFISH_X86_64_3WAY
config CRYPTO_TWOFISH_AVX_X86_64
tristate "Ciphers: Twofish with modes: ECB, CBC (AVX)"
depends on X86 && 64BIT
select CRYPTO_SKCIPHER
- select CRYPTO_SIMD
select CRYPTO_TWOFISH_COMMON
select CRYPTO_TWOFISH_X86_64
select CRYPTO_TWOFISH_X86_64_3WAY
imply CRYPTO_XTS
help
@@ -293,11 +283,10 @@ config CRYPTO_TWOFISH_AVX_X86_64
config CRYPTO_ARIA_AESNI_AVX_X86_64
tristate "Ciphers: ARIA with modes: ECB, CTR (AES-NI/AVX/GFNI)"
depends on X86 && 64BIT
select CRYPTO_SKCIPHER
- select CRYPTO_SIMD
select CRYPTO_ALGAPI
select CRYPTO_ARIA
help
Length-preserving cipher: ARIA cipher algorithms
(RFC 5794) with ECB and CTR modes
@@ -311,11 +300,10 @@ config CRYPTO_ARIA_AESNI_AVX_X86_64
config CRYPTO_ARIA_AESNI_AVX2_X86_64
tristate "Ciphers: ARIA with modes: ECB, CTR (AES-NI/AVX2/GFNI)"
depends on X86 && 64BIT
select CRYPTO_SKCIPHER
- select CRYPTO_SIMD
select CRYPTO_ALGAPI
select CRYPTO_ARIA
select CRYPTO_ARIA_AESNI_AVX_X86_64
help
Length-preserving cipher: ARIA cipher algorithms
@@ -330,11 +318,10 @@ config CRYPTO_ARIA_AESNI_AVX2_X86_64
config CRYPTO_ARIA_GFNI_AVX512_X86_64
tristate "Ciphers: ARIA with modes: ECB, CTR (AVX512/GFNI)"
depends on X86 && 64BIT && AS_AVX512 && AS_GFNI
select CRYPTO_SKCIPHER
- select CRYPTO_SIMD
select CRYPTO_ALGAPI
select CRYPTO_ARIA
select CRYPTO_ARIA_AESNI_AVX_X86_64
select CRYPTO_ARIA_AESNI_AVX2_X86_64
help
@@ -364,11 +351,10 @@ config CRYPTO_CHACHA20_X86_64
config CRYPTO_AEGIS128_AESNI_SSE2
tristate "AEAD ciphers: AEGIS-128 (AES-NI/SSE4.1)"
depends on X86 && 64BIT
select CRYPTO_AEAD
- select CRYPTO_SIMD
help
AEGIS-128 AEAD algorithm
Architecture: x86_64 using:
- AES-NI (AES New Instructions)
diff --git a/arch/x86/crypto/aegis128-aesni-glue.c b/arch/x86/crypto/aegis128-aesni-glue.c
index 01fa568dc5fc4..c937426abf6a0 100644
--- a/arch/x86/crypto/aegis128-aesni-glue.c
+++ b/arch/x86/crypto/aegis128-aesni-glue.c
@@ -6,11 +6,10 @@
* Copyright (c) 2017-2018 Ondrej Mosnacek <omosnacek@gmail.com>
* Copyright (C) 2017-2018 Red Hat, Inc. All rights reserved.
*/
#include <crypto/internal/aead.h>
-#include <crypto/internal/simd.h>
#include <crypto/internal/skcipher.h>
#include <crypto/scatterwalk.h>
#include <linux/module.h>
#include <asm/fpu/api.h>
#include <asm/cpu_device_id.h>
@@ -234,39 +233,35 @@ static struct aead_alg crypto_aegis128_aesni_alg = {
.ivsize = AEGIS128_NONCE_SIZE,
.maxauthsize = AEGIS128_MAX_AUTH_SIZE,
.chunksize = AEGIS128_BLOCK_SIZE,
.base = {
- .cra_flags = CRYPTO_ALG_INTERNAL,
.cra_blocksize = 1,
.cra_ctxsize = sizeof(struct aegis_ctx) +
__alignof__(struct aegis_ctx),
.cra_priority = 400,
- .cra_name = "__aegis128",
- .cra_driver_name = "__aegis128-aesni",
+ .cra_name = "aegis128",
+ .cra_driver_name = "aegis128-aesni",
.cra_module = THIS_MODULE,
}
};
-static struct simd_aead_alg *simd_alg;
-
static int __init crypto_aegis128_aesni_module_init(void)
{
if (!boot_cpu_has(X86_FEATURE_XMM4_1) ||
!boot_cpu_has(X86_FEATURE_AES) ||
!cpu_has_xfeatures(XFEATURE_MASK_SSE, NULL))
return -ENODEV;
- return simd_register_aeads_compat(&crypto_aegis128_aesni_alg, 1,
- &simd_alg);
+ return crypto_register_aead(&crypto_aegis128_aesni_alg);
}
static void __exit crypto_aegis128_aesni_module_exit(void)
{
- simd_unregister_aeads(&crypto_aegis128_aesni_alg, 1, &simd_alg);
+ crypto_unregister_aead(&crypto_aegis128_aesni_alg);
}
module_init(crypto_aegis128_aesni_module_init);
module_exit(crypto_aegis128_aesni_module_exit);
diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index 345c97db06f32..7d184fdbaa0ed 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -574,14 +574,13 @@ static struct crypto_alg aesni_cipher_alg = {
};
static struct skcipher_alg aesni_skciphers[] = {
{
.base = {
- .cra_name = "__ecb(aes)",
- .cra_driver_name = "__ecb-aes-aesni",
+ .cra_name = "ecb(aes)",
+ .cra_driver_name = "ecb-aes-aesni",
.cra_priority = 400,
- .cra_flags = CRYPTO_ALG_INTERNAL,
.cra_blocksize = AES_BLOCK_SIZE,
.cra_ctxsize = CRYPTO_AES_CTX_SIZE,
.cra_module = THIS_MODULE,
},
.min_keysize = AES_MIN_KEY_SIZE,
@@ -589,14 +588,13 @@ static struct skcipher_alg aesni_skciphers[] = {
.setkey = aesni_skcipher_setkey,
.encrypt = ecb_encrypt,
.decrypt = ecb_decrypt,
}, {
.base = {
- .cra_name = "__cbc(aes)",
- .cra_driver_name = "__cbc-aes-aesni",
+ .cra_name = "cbc(aes)",
+ .cra_driver_name = "cbc-aes-aesni",
.cra_priority = 400,
- .cra_flags = CRYPTO_ALG_INTERNAL,
.cra_blocksize = AES_BLOCK_SIZE,
.cra_ctxsize = CRYPTO_AES_CTX_SIZE,
.cra_module = THIS_MODULE,
},
.min_keysize = AES_MIN_KEY_SIZE,
@@ -605,14 +603,13 @@ static struct skcipher_alg aesni_skciphers[] = {
.setkey = aesni_skcipher_setkey,
.encrypt = cbc_encrypt,
.decrypt = cbc_decrypt,
}, {
.base = {
- .cra_name = "__cts(cbc(aes))",
- .cra_driver_name = "__cts-cbc-aes-aesni",
+ .cra_name = "cts(cbc(aes))",
+ .cra_driver_name = "cts-cbc-aes-aesni",
.cra_priority = 400,
- .cra_flags = CRYPTO_ALG_INTERNAL,
.cra_blocksize = AES_BLOCK_SIZE,
.cra_ctxsize = CRYPTO_AES_CTX_SIZE,
.cra_module = THIS_MODULE,
},
.min_keysize = AES_MIN_KEY_SIZE,
@@ -623,14 +620,13 @@ static struct skcipher_alg aesni_skciphers[] = {
.encrypt = cts_cbc_encrypt,
.decrypt = cts_cbc_decrypt,
#ifdef CONFIG_X86_64
}, {
.base = {
- .cra_name = "__ctr(aes)",
- .cra_driver_name = "__ctr-aes-aesni",
+ .cra_name = "ctr(aes)",
+ .cra_driver_name = "ctr-aes-aesni",
.cra_priority = 400,
- .cra_flags = CRYPTO_ALG_INTERNAL,
.cra_blocksize = 1,
.cra_ctxsize = CRYPTO_AES_CTX_SIZE,
.cra_module = THIS_MODULE,
},
.min_keysize = AES_MIN_KEY_SIZE,
@@ -641,14 +637,13 @@ static struct skcipher_alg aesni_skciphers[] = {
.encrypt = ctr_crypt_aesni,
.decrypt = ctr_crypt_aesni,
#endif
}, {
.base = {
- .cra_name = "__xts(aes)",
- .cra_driver_name = "__xts-aes-aesni",
+ .cra_name = "xts(aes)",
+ .cra_driver_name = "xts-aes-aesni",
.cra_priority = 401,
- .cra_flags = CRYPTO_ALG_INTERNAL,
.cra_blocksize = AES_BLOCK_SIZE,
.cra_ctxsize = XTS_AES_CTX_SIZE,
.cra_module = THIS_MODULE,
},
.min_keysize = 2 * AES_MIN_KEY_SIZE,
@@ -659,13 +654,10 @@ static struct skcipher_alg aesni_skciphers[] = {
.encrypt = xts_encrypt_aesni,
.decrypt = xts_decrypt_aesni,
}
};
-static
-struct simd_skcipher_alg *aesni_simd_skciphers[ARRAY_SIZE(aesni_skciphers)];
-
#ifdef CONFIG_X86_64
asmlinkage void aes_xts_encrypt_iv(const struct crypto_aes_ctx *tweak_key,
u8 iv[AES_BLOCK_SIZE]);
/* __always_inline to avoid indirect call */
@@ -800,14 +792,13 @@ static int xctr_crypt_##suffix(struct skcipher_request *req) \
{ \
return xctr_crypt(req, aes_xctr_crypt_##suffix); \
} \
\
static struct skcipher_alg skcipher_algs_##suffix[] = {{ \
- .base.cra_name = "__xts(aes)", \
- .base.cra_driver_name = "__xts-aes-" driver_name_suffix, \
+ .base.cra_name = "xts(aes)", \
+ .base.cra_driver_name = "xts-aes-" driver_name_suffix, \
.base.cra_priority = priority, \
- .base.cra_flags = CRYPTO_ALG_INTERNAL, \
.base.cra_blocksize = AES_BLOCK_SIZE, \
.base.cra_ctxsize = XTS_AES_CTX_SIZE, \
.base.cra_module = THIS_MODULE, \
.min_keysize = 2 * AES_MIN_KEY_SIZE, \
.max_keysize = 2 * AES_MAX_KEY_SIZE, \
@@ -815,14 +806,13 @@ static struct skcipher_alg skcipher_algs_##suffix[] = {{ \
.walksize = 2 * AES_BLOCK_SIZE, \
.setkey = xts_setkey_aesni, \
.encrypt = xts_encrypt_##suffix, \
.decrypt = xts_decrypt_##suffix, \
}, { \
- .base.cra_name = "__ctr(aes)", \
- .base.cra_driver_name = "__ctr-aes-" driver_name_suffix, \
+ .base.cra_name = "ctr(aes)", \
+ .base.cra_driver_name = "ctr-aes-" driver_name_suffix, \
.base.cra_priority = priority, \
- .base.cra_flags = CRYPTO_ALG_INTERNAL, \
.base.cra_blocksize = 1, \
.base.cra_ctxsize = CRYPTO_AES_CTX_SIZE, \
.base.cra_module = THIS_MODULE, \
.min_keysize = AES_MIN_KEY_SIZE, \
.max_keysize = AES_MAX_KEY_SIZE, \
@@ -830,29 +820,25 @@ static struct skcipher_alg skcipher_algs_##suffix[] = {{ \
.chunksize = AES_BLOCK_SIZE, \
.setkey = aesni_skcipher_setkey, \
.encrypt = ctr_crypt_##suffix, \
.decrypt = ctr_crypt_##suffix, \
}, { \
- .base.cra_name = "__xctr(aes)", \
- .base.cra_driver_name = "__xctr-aes-" driver_name_suffix, \
+ .base.cra_name = "xctr(aes)", \
+ .base.cra_driver_name = "xctr-aes-" driver_name_suffix, \
.base.cra_priority = priority, \
- .base.cra_flags = CRYPTO_ALG_INTERNAL, \
.base.cra_blocksize = 1, \
.base.cra_ctxsize = CRYPTO_AES_CTX_SIZE, \
.base.cra_module = THIS_MODULE, \
.min_keysize = AES_MIN_KEY_SIZE, \
.max_keysize = AES_MAX_KEY_SIZE, \
.ivsize = AES_BLOCK_SIZE, \
.chunksize = AES_BLOCK_SIZE, \
.setkey = aesni_skcipher_setkey, \
.encrypt = xctr_crypt_##suffix, \
.decrypt = xctr_crypt_##suffix, \
-}}; \
+}}
\
-static struct simd_skcipher_alg * \
-simd_skcipher_algs_##suffix[ARRAY_SIZE(skcipher_algs_##suffix)]
-
DEFINE_AVX_SKCIPHER_ALGS(aesni_avx, "aesni-avx", 500);
#if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ)
DEFINE_AVX_SKCIPHER_ALGS(vaes_avx2, "vaes-avx2", 600);
DEFINE_AVX_SKCIPHER_ALGS(vaes_avx10_256, "vaes-avx10_256", 700);
DEFINE_AVX_SKCIPHER_ALGS(vaes_avx10_512, "vaes-avx10_512", 800);
@@ -1508,14 +1494,13 @@ static struct aead_alg aes_gcm_algs_##suffix[] = { { \
.decrypt = gcm_decrypt_##suffix, \
.ivsize = GCM_AES_IV_SIZE, \
.chunksize = AES_BLOCK_SIZE, \
.maxauthsize = 16, \
.base = { \
- .cra_name = "__gcm(aes)", \
- .cra_driver_name = "__" generic_driver_name, \
+ .cra_name = "gcm(aes)", \
+ .cra_driver_name = generic_driver_name, \
.cra_priority = (priority), \
- .cra_flags = CRYPTO_ALG_INTERNAL, \
.cra_blocksize = 1, \
.cra_ctxsize = (ctxsize), \
.cra_module = THIS_MODULE, \
}, \
}, { \
@@ -1525,21 +1510,18 @@ static struct aead_alg aes_gcm_algs_##suffix[] = { { \
.decrypt = rfc4106_decrypt_##suffix, \
.ivsize = GCM_RFC4106_IV_SIZE, \
.chunksize = AES_BLOCK_SIZE, \
.maxauthsize = 16, \
.base = { \
- .cra_name = "__rfc4106(gcm(aes))", \
- .cra_driver_name = "__" rfc_driver_name, \
+ .cra_name = "rfc4106(gcm(aes))", \
+ .cra_driver_name = rfc_driver_name, \
.cra_priority = (priority), \
- .cra_flags = CRYPTO_ALG_INTERNAL, \
.cra_blocksize = 1, \
.cra_ctxsize = (ctxsize), \
.cra_module = THIS_MODULE, \
}, \
-} }; \
- \
-static struct simd_aead_alg *aes_gcm_simdalgs_##suffix[2] \
+} }
/* aes_gcm_algs_aesni */
DEFINE_GCM_ALGS(aesni, /* no flags */ 0,
"generic-gcm-aesni", "rfc4106-gcm-aesni",
AES_GCM_KEY_AESNI_SIZE, 400);
@@ -1585,18 +1567,16 @@ static int __init register_avx_algs(void)
{
int err;
if (!boot_cpu_has(X86_FEATURE_AVX))
return 0;
- err = simd_register_skciphers_compat(skcipher_algs_aesni_avx,
- ARRAY_SIZE(skcipher_algs_aesni_avx),
- simd_skcipher_algs_aesni_avx);
+ err = crypto_register_skciphers(skcipher_algs_aesni_avx,
+ ARRAY_SIZE(skcipher_algs_aesni_avx));
if (err)
return err;
- err = simd_register_aeads_compat(aes_gcm_algs_aesni_avx,
- ARRAY_SIZE(aes_gcm_algs_aesni_avx),
- aes_gcm_simdalgs_aesni_avx);
+ err = crypto_register_aeads(aes_gcm_algs_aesni_avx,
+ ARRAY_SIZE(aes_gcm_algs_aesni_avx));
if (err)
return err;
/*
* Note: not all the algorithms registered below actually require
* VPCLMULQDQ. But in practice every CPU with VAES also has VPCLMULQDQ.
@@ -1608,31 +1588,28 @@ static int __init register_avx_algs(void)
!boot_cpu_has(X86_FEATURE_VAES) ||
!boot_cpu_has(X86_FEATURE_VPCLMULQDQ) ||
!boot_cpu_has(X86_FEATURE_PCLMULQDQ) ||
!cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL))
return 0;
- err = simd_register_skciphers_compat(skcipher_algs_vaes_avx2,
- ARRAY_SIZE(skcipher_algs_vaes_avx2),
- simd_skcipher_algs_vaes_avx2);
+ err = crypto_register_skciphers(skcipher_algs_vaes_avx2,
+ ARRAY_SIZE(skcipher_algs_vaes_avx2));
if (err)
return err;
if (!boot_cpu_has(X86_FEATURE_AVX512BW) ||
!boot_cpu_has(X86_FEATURE_AVX512VL) ||
!boot_cpu_has(X86_FEATURE_BMI2) ||
!cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM |
XFEATURE_MASK_AVX512, NULL))
return 0;
- err = simd_register_skciphers_compat(skcipher_algs_vaes_avx10_256,
- ARRAY_SIZE(skcipher_algs_vaes_avx10_256),
- simd_skcipher_algs_vaes_avx10_256);
+ err = crypto_register_skciphers(skcipher_algs_vaes_avx10_256,
+ ARRAY_SIZE(skcipher_algs_vaes_avx10_256));
if (err)
return err;
- err = simd_register_aeads_compat(aes_gcm_algs_vaes_avx10_256,
- ARRAY_SIZE(aes_gcm_algs_vaes_avx10_256),
- aes_gcm_simdalgs_vaes_avx10_256);
+ err = crypto_register_aeads(aes_gcm_algs_vaes_avx10_256,
+ ARRAY_SIZE(aes_gcm_algs_vaes_avx10_256));
if (err)
return err;
if (x86_match_cpu(zmm_exclusion_list)) {
int i;
@@ -1641,60 +1618,43 @@ static int __init register_avx_algs(void)
skcipher_algs_vaes_avx10_512[i].base.cra_priority = 1;
for (i = 0; i < ARRAY_SIZE(aes_gcm_algs_vaes_avx10_512); i++)
aes_gcm_algs_vaes_avx10_512[i].base.cra_priority = 1;
}
- err = simd_register_skciphers_compat(skcipher_algs_vaes_avx10_512,
- ARRAY_SIZE(skcipher_algs_vaes_avx10_512),
- simd_skcipher_algs_vaes_avx10_512);
+ err = crypto_register_skciphers(skcipher_algs_vaes_avx10_512,
+ ARRAY_SIZE(skcipher_algs_vaes_avx10_512));
if (err)
return err;
- err = simd_register_aeads_compat(aes_gcm_algs_vaes_avx10_512,
- ARRAY_SIZE(aes_gcm_algs_vaes_avx10_512),
- aes_gcm_simdalgs_vaes_avx10_512);
+ err = crypto_register_aeads(aes_gcm_algs_vaes_avx10_512,
+ ARRAY_SIZE(aes_gcm_algs_vaes_avx10_512));
if (err)
return err;
#endif /* CONFIG_AS_VAES && CONFIG_AS_VPCLMULQDQ */
return 0;
}
+#define unregister_skciphers(A) \
+ if (refcount_read(&(A)[0].base.cra_refcnt) != 0) \
+ crypto_unregister_skciphers((A), ARRAY_SIZE(A))
+#define unregister_aeads(A) \
+ if (refcount_read(&(A)[0].base.cra_refcnt) != 0) \
+ crypto_unregister_aeads((A), ARRAY_SIZE(A))
+
static void unregister_avx_algs(void)
{
- if (simd_skcipher_algs_aesni_avx[0])
- simd_unregister_skciphers(skcipher_algs_aesni_avx,
- ARRAY_SIZE(skcipher_algs_aesni_avx),
- simd_skcipher_algs_aesni_avx);
- if (aes_gcm_simdalgs_aesni_avx[0])
- simd_unregister_aeads(aes_gcm_algs_aesni_avx,
- ARRAY_SIZE(aes_gcm_algs_aesni_avx),
- aes_gcm_simdalgs_aesni_avx);
+ unregister_skciphers(skcipher_algs_aesni_avx);
+ unregister_aeads(aes_gcm_algs_aesni_avx);
#if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ)
- if (simd_skcipher_algs_vaes_avx2[0])
- simd_unregister_skciphers(skcipher_algs_vaes_avx2,
- ARRAY_SIZE(skcipher_algs_vaes_avx2),
- simd_skcipher_algs_vaes_avx2);
- if (simd_skcipher_algs_vaes_avx10_256[0])
- simd_unregister_skciphers(skcipher_algs_vaes_avx10_256,
- ARRAY_SIZE(skcipher_algs_vaes_avx10_256),
- simd_skcipher_algs_vaes_avx10_256);
- if (aes_gcm_simdalgs_vaes_avx10_256[0])
- simd_unregister_aeads(aes_gcm_algs_vaes_avx10_256,
- ARRAY_SIZE(aes_gcm_algs_vaes_avx10_256),
- aes_gcm_simdalgs_vaes_avx10_256);
- if (simd_skcipher_algs_vaes_avx10_512[0])
- simd_unregister_skciphers(skcipher_algs_vaes_avx10_512,
- ARRAY_SIZE(skcipher_algs_vaes_avx10_512),
- simd_skcipher_algs_vaes_avx10_512);
- if (aes_gcm_simdalgs_vaes_avx10_512[0])
- simd_unregister_aeads(aes_gcm_algs_vaes_avx10_512,
- ARRAY_SIZE(aes_gcm_algs_vaes_avx10_512),
- aes_gcm_simdalgs_vaes_avx10_512);
+ unregister_skciphers(skcipher_algs_vaes_avx2);
+ unregister_skciphers(skcipher_algs_vaes_avx10_256);
+ unregister_skciphers(skcipher_algs_vaes_avx10_512);
+ unregister_aeads(aes_gcm_algs_vaes_avx10_256);
+ unregister_aeads(aes_gcm_algs_vaes_avx10_512);
#endif
}
#else /* CONFIG_X86_64 */
static struct aead_alg aes_gcm_algs_aesni[0];
-static struct simd_aead_alg *aes_gcm_simdalgs_aesni[0];
static int __init register_avx_algs(void)
{
return 0;
}
@@ -1719,19 +1679,17 @@ static int __init aesni_init(void)
err = crypto_register_alg(&aesni_cipher_alg);
if (err)
return err;
- err = simd_register_skciphers_compat(aesni_skciphers,
- ARRAY_SIZE(aesni_skciphers),
- aesni_simd_skciphers);
+ err = crypto_register_skciphers(aesni_skciphers,
+ ARRAY_SIZE(aesni_skciphers));
if (err)
goto unregister_cipher;
- err = simd_register_aeads_compat(aes_gcm_algs_aesni,
- ARRAY_SIZE(aes_gcm_algs_aesni),
- aes_gcm_simdalgs_aesni);
+ err = crypto_register_aeads(aes_gcm_algs_aesni,
+ ARRAY_SIZE(aes_gcm_algs_aesni));
if (err)
goto unregister_skciphers;
err = register_avx_algs();
if (err)
@@ -1739,28 +1697,26 @@ static int __init aesni_init(void)
return 0;
unregister_avx:
unregister_avx_algs();
- simd_unregister_aeads(aes_gcm_algs_aesni,
- ARRAY_SIZE(aes_gcm_algs_aesni),
- aes_gcm_simdalgs_aesni);
+ crypto_unregister_aeads(aes_gcm_algs_aesni,
+ ARRAY_SIZE(aes_gcm_algs_aesni));
unregister_skciphers:
- simd_unregister_skciphers(aesni_skciphers, ARRAY_SIZE(aesni_skciphers),
- aesni_simd_skciphers);
+ crypto_unregister_skciphers(aesni_skciphers,
+ ARRAY_SIZE(aesni_skciphers));
unregister_cipher:
crypto_unregister_alg(&aesni_cipher_alg);
return err;
}
static void __exit aesni_exit(void)
{
- simd_unregister_aeads(aes_gcm_algs_aesni,
- ARRAY_SIZE(aes_gcm_algs_aesni),
- aes_gcm_simdalgs_aesni);
- simd_unregister_skciphers(aesni_skciphers, ARRAY_SIZE(aesni_skciphers),
- aesni_simd_skciphers);
+ crypto_unregister_aeads(aes_gcm_algs_aesni,
+ ARRAY_SIZE(aes_gcm_algs_aesni));
+ crypto_unregister_skciphers(aesni_skciphers,
+ ARRAY_SIZE(aesni_skciphers));
crypto_unregister_alg(&aesni_cipher_alg);
unregister_avx_algs();
}
module_init(aesni_init);
diff --git a/arch/x86/crypto/aria_aesni_avx2_glue.c b/arch/x86/crypto/aria_aesni_avx2_glue.c
index 87a11804fc77f..b4bddcd584577 100644
--- a/arch/x86/crypto/aria_aesni_avx2_glue.c
+++ b/arch/x86/crypto/aria_aesni_avx2_glue.c
@@ -4,11 +4,10 @@
*
* Copyright (c) 2022 Taehee Yoo <ap420073@gmail.com>
*/
#include <crypto/algapi.h>
-#include <crypto/internal/simd.h>
#include <crypto/aria.h>
#include <linux/crypto.h>
#include <linux/err.h>
#include <linux/module.h>
#include <linux/types.h>
@@ -163,28 +162,26 @@ static int aria_avx2_init_tfm(struct crypto_skcipher *tfm)
return 0;
}
static struct skcipher_alg aria_algs[] = {
{
- .base.cra_name = "__ecb(aria)",
- .base.cra_driver_name = "__ecb-aria-avx2",
+ .base.cra_name = "ecb(aria)",
+ .base.cra_driver_name = "ecb-aria-avx2",
.base.cra_priority = 500,
- .base.cra_flags = CRYPTO_ALG_INTERNAL,
.base.cra_blocksize = ARIA_BLOCK_SIZE,
.base.cra_ctxsize = sizeof(struct aria_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = ARIA_MIN_KEY_SIZE,
.max_keysize = ARIA_MAX_KEY_SIZE,
.setkey = aria_avx2_set_key,
.encrypt = aria_avx2_ecb_encrypt,
.decrypt = aria_avx2_ecb_decrypt,
}, {
- .base.cra_name = "__ctr(aria)",
- .base.cra_driver_name = "__ctr-aria-avx2",
+ .base.cra_name = "ctr(aria)",
+ .base.cra_driver_name = "ctr-aria-avx2",
.base.cra_priority = 500,
- .base.cra_flags = CRYPTO_ALG_INTERNAL |
- CRYPTO_ALG_SKCIPHER_REQSIZE_LARGE,
+ .base.cra_flags = CRYPTO_ALG_SKCIPHER_REQSIZE_LARGE,
.base.cra_blocksize = 1,
.base.cra_ctxsize = sizeof(struct aria_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = ARIA_MIN_KEY_SIZE,
.max_keysize = ARIA_MAX_KEY_SIZE,
@@ -195,12 +192,10 @@ static struct skcipher_alg aria_algs[] = {
.decrypt = aria_avx2_ctr_encrypt,
.init = aria_avx2_init_tfm,
}
};
-static struct simd_skcipher_alg *aria_simd_algs[ARRAY_SIZE(aria_algs)];
-
static int __init aria_avx2_init(void)
{
const char *feature_name;
if (!boot_cpu_has(X86_FEATURE_AVX) ||
@@ -231,19 +226,16 @@ static int __init aria_avx2_init(void)
aria_ops.aria_encrypt_32way = aria_aesni_avx2_encrypt_32way;
aria_ops.aria_decrypt_32way = aria_aesni_avx2_decrypt_32way;
aria_ops.aria_ctr_crypt_32way = aria_aesni_avx2_ctr_crypt_32way;
}
- return simd_register_skciphers_compat(aria_algs,
- ARRAY_SIZE(aria_algs),
- aria_simd_algs);
+ return crypto_register_skciphers(aria_algs, ARRAY_SIZE(aria_algs));
}
static void __exit aria_avx2_exit(void)
{
- simd_unregister_skciphers(aria_algs, ARRAY_SIZE(aria_algs),
- aria_simd_algs);
+ crypto_unregister_skciphers(aria_algs, ARRAY_SIZE(aria_algs));
}
module_init(aria_avx2_init);
module_exit(aria_avx2_exit);
diff --git a/arch/x86/crypto/aria_aesni_avx_glue.c b/arch/x86/crypto/aria_aesni_avx_glue.c
index 4e1516b76669e..ab9b38d05332a 100644
--- a/arch/x86/crypto/aria_aesni_avx_glue.c
+++ b/arch/x86/crypto/aria_aesni_avx_glue.c
@@ -4,11 +4,10 @@
*
* Copyright (c) 2022 Taehee Yoo <ap420073@gmail.com>
*/
#include <crypto/algapi.h>
-#include <crypto/internal/simd.h>
#include <crypto/aria.h>
#include <linux/crypto.h>
#include <linux/err.h>
#include <linux/module.h>
#include <linux/types.h>
@@ -150,27 +149,25 @@ static int aria_avx_init_tfm(struct crypto_skcipher *tfm)
return 0;
}
static struct skcipher_alg aria_algs[] = {
{
- .base.cra_name = "__ecb(aria)",
- .base.cra_driver_name = "__ecb-aria-avx",
+ .base.cra_name = "ecb(aria)",
+ .base.cra_driver_name = "ecb-aria-avx",
.base.cra_priority = 400,
- .base.cra_flags = CRYPTO_ALG_INTERNAL,
.base.cra_blocksize = ARIA_BLOCK_SIZE,
.base.cra_ctxsize = sizeof(struct aria_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = ARIA_MIN_KEY_SIZE,
.max_keysize = ARIA_MAX_KEY_SIZE,
.setkey = aria_avx_set_key,
.encrypt = aria_avx_ecb_encrypt,
.decrypt = aria_avx_ecb_decrypt,
}, {
- .base.cra_name = "__ctr(aria)",
- .base.cra_driver_name = "__ctr-aria-avx",
+ .base.cra_name = "ctr(aria)",
+ .base.cra_driver_name = "ctr-aria-avx",
.base.cra_priority = 400,
- .base.cra_flags = CRYPTO_ALG_INTERNAL,
.base.cra_blocksize = 1,
.base.cra_ctxsize = sizeof(struct aria_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = ARIA_MIN_KEY_SIZE,
.max_keysize = ARIA_MAX_KEY_SIZE,
@@ -182,12 +179,10 @@ static struct skcipher_alg aria_algs[] = {
.decrypt = aria_avx_ctr_encrypt,
.init = aria_avx_init_tfm,
}
};
-static struct simd_skcipher_alg *aria_simd_algs[ARRAY_SIZE(aria_algs)];
-
static int __init aria_avx_init(void)
{
const char *feature_name;
if (!boot_cpu_has(X86_FEATURE_AVX) ||
@@ -211,19 +206,16 @@ static int __init aria_avx_init(void)
aria_ops.aria_encrypt_16way = aria_aesni_avx_encrypt_16way;
aria_ops.aria_decrypt_16way = aria_aesni_avx_decrypt_16way;
aria_ops.aria_ctr_crypt_16way = aria_aesni_avx_ctr_crypt_16way;
}
- return simd_register_skciphers_compat(aria_algs,
- ARRAY_SIZE(aria_algs),
- aria_simd_algs);
+ return crypto_register_skciphers(aria_algs, ARRAY_SIZE(aria_algs));
}
static void __exit aria_avx_exit(void)
{
- simd_unregister_skciphers(aria_algs, ARRAY_SIZE(aria_algs),
- aria_simd_algs);
+ crypto_unregister_skciphers(aria_algs, ARRAY_SIZE(aria_algs));
}
module_init(aria_avx_init);
module_exit(aria_avx_exit);
diff --git a/arch/x86/crypto/aria_gfni_avx512_glue.c b/arch/x86/crypto/aria_gfni_avx512_glue.c
index f4a2208d26383..363cbf4399cca 100644
--- a/arch/x86/crypto/aria_gfni_avx512_glue.c
+++ b/arch/x86/crypto/aria_gfni_avx512_glue.c
@@ -4,11 +4,10 @@
*
* Copyright (c) 2022 Taehee Yoo <ap420073@gmail.com>
*/
#include <crypto/algapi.h>
-#include <crypto/internal/simd.h>
#include <crypto/aria.h>
#include <linux/crypto.h>
#include <linux/err.h>
#include <linux/module.h>
#include <linux/types.h>
@@ -163,28 +162,26 @@ static int aria_avx512_init_tfm(struct crypto_skcipher *tfm)
return 0;
}
static struct skcipher_alg aria_algs[] = {
{
- .base.cra_name = "__ecb(aria)",
- .base.cra_driver_name = "__ecb-aria-avx512",
+ .base.cra_name = "ecb(aria)",
+ .base.cra_driver_name = "ecb-aria-avx512",
.base.cra_priority = 600,
- .base.cra_flags = CRYPTO_ALG_INTERNAL,
.base.cra_blocksize = ARIA_BLOCK_SIZE,
.base.cra_ctxsize = sizeof(struct aria_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = ARIA_MIN_KEY_SIZE,
.max_keysize = ARIA_MAX_KEY_SIZE,
.setkey = aria_avx512_set_key,
.encrypt = aria_avx512_ecb_encrypt,
.decrypt = aria_avx512_ecb_decrypt,
}, {
- .base.cra_name = "__ctr(aria)",
- .base.cra_driver_name = "__ctr-aria-avx512",
+ .base.cra_name = "ctr(aria)",
+ .base.cra_driver_name = "ctr-aria-avx512",
.base.cra_priority = 600,
- .base.cra_flags = CRYPTO_ALG_INTERNAL |
- CRYPTO_ALG_SKCIPHER_REQSIZE_LARGE,
+ .base.cra_flags = CRYPTO_ALG_SKCIPHER_REQSIZE_LARGE,
.base.cra_blocksize = 1,
.base.cra_ctxsize = sizeof(struct aria_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = ARIA_MIN_KEY_SIZE,
.max_keysize = ARIA_MAX_KEY_SIZE,
@@ -195,12 +192,10 @@ static struct skcipher_alg aria_algs[] = {
.decrypt = aria_avx512_ctr_encrypt,
.init = aria_avx512_init_tfm,
}
};
-static struct simd_skcipher_alg *aria_simd_algs[ARRAY_SIZE(aria_algs)];
-
static int __init aria_avx512_init(void)
{
const char *feature_name;
if (!boot_cpu_has(X86_FEATURE_AVX) ||
@@ -227,19 +222,16 @@ static int __init aria_avx512_init(void)
aria_ops.aria_ctr_crypt_32way = aria_aesni_avx2_gfni_ctr_crypt_32way;
aria_ops.aria_encrypt_64way = aria_gfni_avx512_encrypt_64way;
aria_ops.aria_decrypt_64way = aria_gfni_avx512_decrypt_64way;
aria_ops.aria_ctr_crypt_64way = aria_gfni_avx512_ctr_crypt_64way;
- return simd_register_skciphers_compat(aria_algs,
- ARRAY_SIZE(aria_algs),
- aria_simd_algs);
+ return crypto_register_skciphers(aria_algs, ARRAY_SIZE(aria_algs));
}
static void __exit aria_avx512_exit(void)
{
- simd_unregister_skciphers(aria_algs, ARRAY_SIZE(aria_algs),
- aria_simd_algs);
+ crypto_unregister_skciphers(aria_algs, ARRAY_SIZE(aria_algs));
}
module_init(aria_avx512_init);
module_exit(aria_avx512_exit);
diff --git a/arch/x86/crypto/camellia_aesni_avx2_glue.c b/arch/x86/crypto/camellia_aesni_avx2_glue.c
index e7e4d64e9577e..2d2f4e16537c4 100644
--- a/arch/x86/crypto/camellia_aesni_avx2_glue.c
+++ b/arch/x86/crypto/camellia_aesni_avx2_glue.c
@@ -4,11 +4,10 @@
*
* Copyright © 2013 Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
*/
#include <crypto/algapi.h>
-#include <crypto/internal/simd.h>
#include <linux/crypto.h>
#include <linux/err.h>
#include <linux/module.h>
#include <linux/types.h>
@@ -67,27 +66,25 @@ static int cbc_decrypt(struct skcipher_request *req)
CBC_WALK_END();
}
static struct skcipher_alg camellia_algs[] = {
{
- .base.cra_name = "__ecb(camellia)",
- .base.cra_driver_name = "__ecb-camellia-aesni-avx2",
+ .base.cra_name = "ecb(camellia)",
+ .base.cra_driver_name = "ecb-camellia-aesni-avx2",
.base.cra_priority = 500,
- .base.cra_flags = CRYPTO_ALG_INTERNAL,
.base.cra_blocksize = CAMELLIA_BLOCK_SIZE,
.base.cra_ctxsize = sizeof(struct camellia_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = CAMELLIA_MIN_KEY_SIZE,
.max_keysize = CAMELLIA_MAX_KEY_SIZE,
.setkey = camellia_setkey,
.encrypt = ecb_encrypt,
.decrypt = ecb_decrypt,
}, {
- .base.cra_name = "__cbc(camellia)",
- .base.cra_driver_name = "__cbc-camellia-aesni-avx2",
+ .base.cra_name = "cbc(camellia)",
+ .base.cra_driver_name = "cbc-camellia-aesni-avx2",
.base.cra_priority = 500,
- .base.cra_flags = CRYPTO_ALG_INTERNAL,
.base.cra_blocksize = CAMELLIA_BLOCK_SIZE,
.base.cra_ctxsize = sizeof(struct camellia_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = CAMELLIA_MIN_KEY_SIZE,
.max_keysize = CAMELLIA_MAX_KEY_SIZE,
@@ -96,12 +93,10 @@ static struct skcipher_alg camellia_algs[] = {
.encrypt = cbc_encrypt,
.decrypt = cbc_decrypt,
},
};
-static struct simd_skcipher_alg *camellia_simd_algs[ARRAY_SIZE(camellia_algs)];
-
static int __init camellia_aesni_init(void)
{
const char *feature_name;
if (!boot_cpu_has(X86_FEATURE_AVX) ||
@@ -116,19 +111,17 @@ static int __init camellia_aesni_init(void)
&feature_name)) {
pr_info("CPU feature '%s' is not supported.\n", feature_name);
return -ENODEV;
}
- return simd_register_skciphers_compat(camellia_algs,
- ARRAY_SIZE(camellia_algs),
- camellia_simd_algs);
+ return crypto_register_skciphers(camellia_algs,
+ ARRAY_SIZE(camellia_algs));
}
static void __exit camellia_aesni_fini(void)
{
- simd_unregister_skciphers(camellia_algs, ARRAY_SIZE(camellia_algs),
- camellia_simd_algs);
+ crypto_unregister_skciphers(camellia_algs, ARRAY_SIZE(camellia_algs));
}
module_init(camellia_aesni_init);
module_exit(camellia_aesni_fini);
diff --git a/arch/x86/crypto/camellia_aesni_avx_glue.c b/arch/x86/crypto/camellia_aesni_avx_glue.c
index c7ccf63e741e1..a7d1623881424 100644
--- a/arch/x86/crypto/camellia_aesni_avx_glue.c
+++ b/arch/x86/crypto/camellia_aesni_avx_glue.c
@@ -4,11 +4,10 @@
*
* Copyright © 2012-2013 Jussi Kivilinna <jussi.kivilinna@iki.fi>
*/
#include <crypto/algapi.h>
-#include <crypto/internal/simd.h>
#include <linux/crypto.h>
#include <linux/err.h>
#include <linux/module.h>
#include <linux/types.h>
@@ -67,27 +66,25 @@ static int cbc_decrypt(struct skcipher_request *req)
CBC_WALK_END();
}
static struct skcipher_alg camellia_algs[] = {
{
- .base.cra_name = "__ecb(camellia)",
- .base.cra_driver_name = "__ecb-camellia-aesni",
+ .base.cra_name = "ecb(camellia)",
+ .base.cra_driver_name = "ecb-camellia-aesni",
.base.cra_priority = 400,
- .base.cra_flags = CRYPTO_ALG_INTERNAL,
.base.cra_blocksize = CAMELLIA_BLOCK_SIZE,
.base.cra_ctxsize = sizeof(struct camellia_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = CAMELLIA_MIN_KEY_SIZE,
.max_keysize = CAMELLIA_MAX_KEY_SIZE,
.setkey = camellia_setkey,
.encrypt = ecb_encrypt,
.decrypt = ecb_decrypt,
}, {
- .base.cra_name = "__cbc(camellia)",
- .base.cra_driver_name = "__cbc-camellia-aesni",
+ .base.cra_name = "cbc(camellia)",
+ .base.cra_driver_name = "cbc-camellia-aesni",
.base.cra_priority = 400,
- .base.cra_flags = CRYPTO_ALG_INTERNAL,
.base.cra_blocksize = CAMELLIA_BLOCK_SIZE,
.base.cra_ctxsize = sizeof(struct camellia_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = CAMELLIA_MIN_KEY_SIZE,
.max_keysize = CAMELLIA_MAX_KEY_SIZE,
@@ -96,12 +93,10 @@ static struct skcipher_alg camellia_algs[] = {
.encrypt = cbc_encrypt,
.decrypt = cbc_decrypt,
}
};
-static struct simd_skcipher_alg *camellia_simd_algs[ARRAY_SIZE(camellia_algs)];
-
static int __init camellia_aesni_init(void)
{
const char *feature_name;
if (!boot_cpu_has(X86_FEATURE_AVX) ||
@@ -115,19 +110,17 @@ static int __init camellia_aesni_init(void)
&feature_name)) {
pr_info("CPU feature '%s' is not supported.\n", feature_name);
return -ENODEV;
}
- return simd_register_skciphers_compat(camellia_algs,
- ARRAY_SIZE(camellia_algs),
- camellia_simd_algs);
+ return crypto_register_skciphers(camellia_algs,
+ ARRAY_SIZE(camellia_algs));
}
static void __exit camellia_aesni_fini(void)
{
- simd_unregister_skciphers(camellia_algs, ARRAY_SIZE(camellia_algs),
- camellia_simd_algs);
+ crypto_unregister_skciphers(camellia_algs, ARRAY_SIZE(camellia_algs));
}
module_init(camellia_aesni_init);
module_exit(camellia_aesni_fini);
diff --git a/arch/x86/crypto/cast5_avx_glue.c b/arch/x86/crypto/cast5_avx_glue.c
index 3976a87f92ad5..3aca04d43b34a 100644
--- a/arch/x86/crypto/cast5_avx_glue.c
+++ b/arch/x86/crypto/cast5_avx_glue.c
@@ -6,11 +6,10 @@
* <Johannes.Goetzfried@informatik.stud.uni-erlangen.de>
*/
#include <crypto/algapi.h>
#include <crypto/cast5.h>
-#include <crypto/internal/simd.h>
#include <linux/crypto.h>
#include <linux/err.h>
#include <linux/module.h>
#include <linux/types.h>
@@ -62,27 +61,25 @@ static int cbc_decrypt(struct skcipher_request *req)
CBC_WALK_END();
}
static struct skcipher_alg cast5_algs[] = {
{
- .base.cra_name = "__ecb(cast5)",
- .base.cra_driver_name = "__ecb-cast5-avx",
+ .base.cra_name = "ecb(cast5)",
+ .base.cra_driver_name = "ecb-cast5-avx",
.base.cra_priority = 200,
- .base.cra_flags = CRYPTO_ALG_INTERNAL,
.base.cra_blocksize = CAST5_BLOCK_SIZE,
.base.cra_ctxsize = sizeof(struct cast5_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = CAST5_MIN_KEY_SIZE,
.max_keysize = CAST5_MAX_KEY_SIZE,
.setkey = cast5_setkey_skcipher,
.encrypt = ecb_encrypt,
.decrypt = ecb_decrypt,
}, {
- .base.cra_name = "__cbc(cast5)",
- .base.cra_driver_name = "__cbc-cast5-avx",
+ .base.cra_name = "cbc(cast5)",
+ .base.cra_driver_name = "cbc-cast5-avx",
.base.cra_priority = 200,
- .base.cra_flags = CRYPTO_ALG_INTERNAL,
.base.cra_blocksize = CAST5_BLOCK_SIZE,
.base.cra_ctxsize = sizeof(struct cast5_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = CAST5_MIN_KEY_SIZE,
.max_keysize = CAST5_MAX_KEY_SIZE,
@@ -91,31 +88,27 @@ static struct skcipher_alg cast5_algs[] = {
.encrypt = cbc_encrypt,
.decrypt = cbc_decrypt,
}
};
-static struct simd_skcipher_alg *cast5_simd_algs[ARRAY_SIZE(cast5_algs)];
-
static int __init cast5_init(void)
{
const char *feature_name;
if (!cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM,
&feature_name)) {
pr_info("CPU feature '%s' is not supported.\n", feature_name);
return -ENODEV;
}
- return simd_register_skciphers_compat(cast5_algs,
- ARRAY_SIZE(cast5_algs),
- cast5_simd_algs);
+ return crypto_register_skciphers(cast5_algs,
+ ARRAY_SIZE(cast5_algs));
}
static void __exit cast5_exit(void)
{
- simd_unregister_skciphers(cast5_algs, ARRAY_SIZE(cast5_algs),
- cast5_simd_algs);
+ crypto_unregister_skciphers(cast5_algs, ARRAY_SIZE(cast5_algs));
}
module_init(cast5_init);
module_exit(cast5_exit);
diff --git a/arch/x86/crypto/cast6_avx_glue.c b/arch/x86/crypto/cast6_avx_glue.c
index 7e2aea3723490..c4dd28c303036 100644
--- a/arch/x86/crypto/cast6_avx_glue.c
+++ b/arch/x86/crypto/cast6_avx_glue.c
@@ -12,11 +12,10 @@
#include <linux/types.h>
#include <linux/crypto.h>
#include <linux/err.h>
#include <crypto/algapi.h>
#include <crypto/cast6.h>
-#include <crypto/internal/simd.h>
#include "ecb_cbc_helpers.h"
#define CAST6_PARALLEL_BLOCKS 8
@@ -62,27 +61,25 @@ static int cbc_decrypt(struct skcipher_request *req)
CBC_WALK_END();
}
static struct skcipher_alg cast6_algs[] = {
{
- .base.cra_name = "__ecb(cast6)",
- .base.cra_driver_name = "__ecb-cast6-avx",
+ .base.cra_name = "ecb(cast6)",
+ .base.cra_driver_name = "ecb-cast6-avx",
.base.cra_priority = 200,
- .base.cra_flags = CRYPTO_ALG_INTERNAL,
.base.cra_blocksize = CAST6_BLOCK_SIZE,
.base.cra_ctxsize = sizeof(struct cast6_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = CAST6_MIN_KEY_SIZE,
.max_keysize = CAST6_MAX_KEY_SIZE,
.setkey = cast6_setkey_skcipher,
.encrypt = ecb_encrypt,
.decrypt = ecb_decrypt,
}, {
- .base.cra_name = "__cbc(cast6)",
- .base.cra_driver_name = "__cbc-cast6-avx",
+ .base.cra_name = "cbc(cast6)",
+ .base.cra_driver_name = "cbc-cast6-avx",
.base.cra_priority = 200,
- .base.cra_flags = CRYPTO_ALG_INTERNAL,
.base.cra_blocksize = CAST6_BLOCK_SIZE,
.base.cra_ctxsize = sizeof(struct cast6_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = CAST6_MIN_KEY_SIZE,
.max_keysize = CAST6_MAX_KEY_SIZE,
@@ -91,31 +88,26 @@ static struct skcipher_alg cast6_algs[] = {
.encrypt = cbc_encrypt,
.decrypt = cbc_decrypt,
},
};
-static struct simd_skcipher_alg *cast6_simd_algs[ARRAY_SIZE(cast6_algs)];
-
static int __init cast6_init(void)
{
const char *feature_name;
if (!cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM,
&feature_name)) {
pr_info("CPU feature '%s' is not supported.\n", feature_name);
return -ENODEV;
}
- return simd_register_skciphers_compat(cast6_algs,
- ARRAY_SIZE(cast6_algs),
- cast6_simd_algs);
+ return crypto_register_skciphers(cast6_algs, ARRAY_SIZE(cast6_algs));
}
static void __exit cast6_exit(void)
{
- simd_unregister_skciphers(cast6_algs, ARRAY_SIZE(cast6_algs),
- cast6_simd_algs);
+ crypto_unregister_skciphers(cast6_algs, ARRAY_SIZE(cast6_algs));
}
module_init(cast6_init);
module_exit(cast6_exit);
diff --git a/arch/x86/crypto/serpent_avx2_glue.c b/arch/x86/crypto/serpent_avx2_glue.c
index 347e97f4b713b..f5f2121b79567 100644
--- a/arch/x86/crypto/serpent_avx2_glue.c
+++ b/arch/x86/crypto/serpent_avx2_glue.c
@@ -8,11 +8,10 @@
#include <linux/module.h>
#include <linux/types.h>
#include <linux/crypto.h>
#include <linux/err.h>
#include <crypto/algapi.h>
-#include <crypto/internal/simd.h>
#include <crypto/serpent.h>
#include "serpent-avx.h"
#include "ecb_cbc_helpers.h"
@@ -63,27 +62,25 @@ static int cbc_decrypt(struct skcipher_request *req)
CBC_WALK_END();
}
static struct skcipher_alg serpent_algs[] = {
{
- .base.cra_name = "__ecb(serpent)",
- .base.cra_driver_name = "__ecb-serpent-avx2",
+ .base.cra_name = "ecb(serpent)",
+ .base.cra_driver_name = "ecb-serpent-avx2",
.base.cra_priority = 600,
- .base.cra_flags = CRYPTO_ALG_INTERNAL,
.base.cra_blocksize = SERPENT_BLOCK_SIZE,
.base.cra_ctxsize = sizeof(struct serpent_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = SERPENT_MIN_KEY_SIZE,
.max_keysize = SERPENT_MAX_KEY_SIZE,
.setkey = serpent_setkey_skcipher,
.encrypt = ecb_encrypt,
.decrypt = ecb_decrypt,
}, {
- .base.cra_name = "__cbc(serpent)",
- .base.cra_driver_name = "__cbc-serpent-avx2",
+ .base.cra_name = "cbc(serpent)",
+ .base.cra_driver_name = "cbc-serpent-avx2",
.base.cra_priority = 600,
- .base.cra_flags = CRYPTO_ALG_INTERNAL,
.base.cra_blocksize = SERPENT_BLOCK_SIZE,
.base.cra_ctxsize = sizeof(struct serpent_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = SERPENT_MIN_KEY_SIZE,
.max_keysize = SERPENT_MAX_KEY_SIZE,
@@ -92,12 +89,10 @@ static struct skcipher_alg serpent_algs[] = {
.encrypt = cbc_encrypt,
.decrypt = cbc_decrypt,
},
};
-static struct simd_skcipher_alg *serpent_simd_algs[ARRAY_SIZE(serpent_algs)];
-
static int __init serpent_avx2_init(void)
{
const char *feature_name;
if (!boot_cpu_has(X86_FEATURE_AVX2) || !boot_cpu_has(X86_FEATURE_OSXSAVE)) {
@@ -108,19 +103,17 @@ static int __init serpent_avx2_init(void)
&feature_name)) {
pr_info("CPU feature '%s' is not supported.\n", feature_name);
return -ENODEV;
}
- return simd_register_skciphers_compat(serpent_algs,
- ARRAY_SIZE(serpent_algs),
- serpent_simd_algs);
+ return crypto_register_skciphers(serpent_algs,
+ ARRAY_SIZE(serpent_algs));
}
static void __exit serpent_avx2_fini(void)
{
- simd_unregister_skciphers(serpent_algs, ARRAY_SIZE(serpent_algs),
- serpent_simd_algs);
+ crypto_unregister_skciphers(serpent_algs, ARRAY_SIZE(serpent_algs));
}
module_init(serpent_avx2_init);
module_exit(serpent_avx2_fini);
diff --git a/arch/x86/crypto/serpent_avx_glue.c b/arch/x86/crypto/serpent_avx_glue.c
index 6c248e1ea4ef7..e640abc1cb8a7 100644
--- a/arch/x86/crypto/serpent_avx_glue.c
+++ b/arch/x86/crypto/serpent_avx_glue.c
@@ -11,11 +11,10 @@
#include <linux/module.h>
#include <linux/types.h>
#include <linux/crypto.h>
#include <linux/err.h>
#include <crypto/algapi.h>
-#include <crypto/internal/simd.h>
#include <crypto/serpent.h>
#include "serpent-avx.h"
#include "ecb_cbc_helpers.h"
@@ -69,27 +68,25 @@ static int cbc_decrypt(struct skcipher_request *req)
CBC_WALK_END();
}
static struct skcipher_alg serpent_algs[] = {
{
- .base.cra_name = "__ecb(serpent)",
- .base.cra_driver_name = "__ecb-serpent-avx",
+ .base.cra_name = "ecb(serpent)",
+ .base.cra_driver_name = "ecb-serpent-avx",
.base.cra_priority = 500,
- .base.cra_flags = CRYPTO_ALG_INTERNAL,
.base.cra_blocksize = SERPENT_BLOCK_SIZE,
.base.cra_ctxsize = sizeof(struct serpent_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = SERPENT_MIN_KEY_SIZE,
.max_keysize = SERPENT_MAX_KEY_SIZE,
.setkey = serpent_setkey_skcipher,
.encrypt = ecb_encrypt,
.decrypt = ecb_decrypt,
}, {
- .base.cra_name = "__cbc(serpent)",
- .base.cra_driver_name = "__cbc-serpent-avx",
+ .base.cra_name = "cbc(serpent)",
+ .base.cra_driver_name = "cbc-serpent-avx",
.base.cra_priority = 500,
- .base.cra_flags = CRYPTO_ALG_INTERNAL,
.base.cra_blocksize = SERPENT_BLOCK_SIZE,
.base.cra_ctxsize = sizeof(struct serpent_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = SERPENT_MIN_KEY_SIZE,
.max_keysize = SERPENT_MAX_KEY_SIZE,
@@ -98,31 +95,27 @@ static struct skcipher_alg serpent_algs[] = {
.encrypt = cbc_encrypt,
.decrypt = cbc_decrypt,
},
};
-static struct simd_skcipher_alg *serpent_simd_algs[ARRAY_SIZE(serpent_algs)];
-
static int __init serpent_init(void)
{
const char *feature_name;
if (!cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM,
&feature_name)) {
pr_info("CPU feature '%s' is not supported.\n", feature_name);
return -ENODEV;
}
- return simd_register_skciphers_compat(serpent_algs,
- ARRAY_SIZE(serpent_algs),
- serpent_simd_algs);
+ return crypto_register_skciphers(serpent_algs,
+ ARRAY_SIZE(serpent_algs));
}
static void __exit serpent_exit(void)
{
- simd_unregister_skciphers(serpent_algs, ARRAY_SIZE(serpent_algs),
- serpent_simd_algs);
+ crypto_unregister_skciphers(serpent_algs, ARRAY_SIZE(serpent_algs));
}
module_init(serpent_init);
module_exit(serpent_exit);
diff --git a/arch/x86/crypto/serpent_sse2_glue.c b/arch/x86/crypto/serpent_sse2_glue.c
index d78f37e9b2cf7..80ee17ec21b46 100644
--- a/arch/x86/crypto/serpent_sse2_glue.c
+++ b/arch/x86/crypto/serpent_sse2_glue.c
@@ -16,11 +16,10 @@
#include <linux/types.h>
#include <linux/crypto.h>
#include <linux/err.h>
#include <crypto/algapi.h>
#include <crypto/b128ops.h>
-#include <crypto/internal/simd.h>
#include <crypto/serpent.h>
#include "serpent-sse2.h"
#include "ecb_cbc_helpers.h"
@@ -72,27 +71,25 @@ static int cbc_decrypt(struct skcipher_request *req)
CBC_WALK_END();
}
static struct skcipher_alg serpent_algs[] = {
{
- .base.cra_name = "__ecb(serpent)",
- .base.cra_driver_name = "__ecb-serpent-sse2",
+ .base.cra_name = "ecb(serpent)",
+ .base.cra_driver_name = "ecb-serpent-sse2",
.base.cra_priority = 400,
- .base.cra_flags = CRYPTO_ALG_INTERNAL,
.base.cra_blocksize = SERPENT_BLOCK_SIZE,
.base.cra_ctxsize = sizeof(struct serpent_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = SERPENT_MIN_KEY_SIZE,
.max_keysize = SERPENT_MAX_KEY_SIZE,
.setkey = serpent_setkey_skcipher,
.encrypt = ecb_encrypt,
.decrypt = ecb_decrypt,
}, {
- .base.cra_name = "__cbc(serpent)",
- .base.cra_driver_name = "__cbc-serpent-sse2",
+ .base.cra_name = "cbc(serpent)",
+ .base.cra_driver_name = "cbc-serpent-sse2",
.base.cra_priority = 400,
- .base.cra_flags = CRYPTO_ALG_INTERNAL,
.base.cra_blocksize = SERPENT_BLOCK_SIZE,
.base.cra_ctxsize = sizeof(struct serpent_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = SERPENT_MIN_KEY_SIZE,
.max_keysize = SERPENT_MAX_KEY_SIZE,
@@ -101,28 +98,24 @@ static struct skcipher_alg serpent_algs[] = {
.encrypt = cbc_encrypt,
.decrypt = cbc_decrypt,
},
};
-static struct simd_skcipher_alg *serpent_simd_algs[ARRAY_SIZE(serpent_algs)];
-
static int __init serpent_sse2_init(void)
{
if (!boot_cpu_has(X86_FEATURE_XMM2)) {
printk(KERN_INFO "SSE2 instructions are not detected.\n");
return -ENODEV;
}
- return simd_register_skciphers_compat(serpent_algs,
- ARRAY_SIZE(serpent_algs),
- serpent_simd_algs);
+ return crypto_register_skciphers(serpent_algs,
+ ARRAY_SIZE(serpent_algs));
}
static void __exit serpent_sse2_exit(void)
{
- simd_unregister_skciphers(serpent_algs, ARRAY_SIZE(serpent_algs),
- serpent_simd_algs);
+ crypto_unregister_skciphers(serpent_algs, ARRAY_SIZE(serpent_algs));
}
module_init(serpent_sse2_init);
module_exit(serpent_sse2_exit);
diff --git a/arch/x86/crypto/sm4_aesni_avx2_glue.c b/arch/x86/crypto/sm4_aesni_avx2_glue.c
index 1148fd4cd57f8..14596091560d6 100644
--- a/arch/x86/crypto/sm4_aesni_avx2_glue.c
+++ b/arch/x86/crypto/sm4_aesni_avx2_glue.c
@@ -9,12 +9,10 @@
*/
#include <linux/module.h>
#include <linux/crypto.h>
#include <linux/kernel.h>
-#include <asm/simd.h>
-#include <crypto/internal/simd.h>
#include <crypto/internal/skcipher.h>
#include <crypto/sm4.h>
#include "sm4-avx.h"
#define SM4_CRYPT16_BLOCK_SIZE (SM4_BLOCK_SIZE * 16)
@@ -46,14 +44,13 @@ static int ctr_crypt(struct skcipher_request *req)
}
static struct skcipher_alg sm4_aesni_avx2_skciphers[] = {
{
.base = {
- .cra_name = "__ecb(sm4)",
- .cra_driver_name = "__ecb-sm4-aesni-avx2",
+ .cra_name = "ecb(sm4)",
+ .cra_driver_name = "ecb-sm4-aesni-avx2",
.cra_priority = 500,
- .cra_flags = CRYPTO_ALG_INTERNAL,
.cra_blocksize = SM4_BLOCK_SIZE,
.cra_ctxsize = sizeof(struct sm4_ctx),
.cra_module = THIS_MODULE,
},
.min_keysize = SM4_KEY_SIZE,
@@ -62,14 +59,13 @@ static struct skcipher_alg sm4_aesni_avx2_skciphers[] = {
.setkey = sm4_skcipher_setkey,
.encrypt = sm4_avx_ecb_encrypt,
.decrypt = sm4_avx_ecb_decrypt,
}, {
.base = {
- .cra_name = "__cbc(sm4)",
- .cra_driver_name = "__cbc-sm4-aesni-avx2",
+ .cra_name = "cbc(sm4)",
+ .cra_driver_name = "cbc-sm4-aesni-avx2",
.cra_priority = 500,
- .cra_flags = CRYPTO_ALG_INTERNAL,
.cra_blocksize = SM4_BLOCK_SIZE,
.cra_ctxsize = sizeof(struct sm4_ctx),
.cra_module = THIS_MODULE,
},
.min_keysize = SM4_KEY_SIZE,
@@ -79,14 +75,13 @@ static struct skcipher_alg sm4_aesni_avx2_skciphers[] = {
.setkey = sm4_skcipher_setkey,
.encrypt = sm4_cbc_encrypt,
.decrypt = cbc_decrypt,
}, {
.base = {
- .cra_name = "__ctr(sm4)",
- .cra_driver_name = "__ctr-sm4-aesni-avx2",
+ .cra_name = "ctr(sm4)",
+ .cra_driver_name = "ctr-sm4-aesni-avx2",
.cra_priority = 500,
- .cra_flags = CRYPTO_ALG_INTERNAL,
.cra_blocksize = 1,
.cra_ctxsize = sizeof(struct sm4_ctx),
.cra_module = THIS_MODULE,
},
.min_keysize = SM4_KEY_SIZE,
@@ -98,13 +93,10 @@ static struct skcipher_alg sm4_aesni_avx2_skciphers[] = {
.encrypt = ctr_crypt,
.decrypt = ctr_crypt,
}
};
-static struct simd_skcipher_alg *
-simd_sm4_aesni_avx2_skciphers[ARRAY_SIZE(sm4_aesni_avx2_skciphers)];
-
static int __init sm4_init(void)
{
const char *feature_name;
if (!boot_cpu_has(X86_FEATURE_AVX) ||
@@ -119,20 +111,18 @@ static int __init sm4_init(void)
&feature_name)) {
pr_info("CPU feature '%s' is not supported.\n", feature_name);
return -ENODEV;
}
- return simd_register_skciphers_compat(sm4_aesni_avx2_skciphers,
- ARRAY_SIZE(sm4_aesni_avx2_skciphers),
- simd_sm4_aesni_avx2_skciphers);
+ return crypto_register_skciphers(sm4_aesni_avx2_skciphers,
+ ARRAY_SIZE(sm4_aesni_avx2_skciphers));
}
static void __exit sm4_exit(void)
{
- simd_unregister_skciphers(sm4_aesni_avx2_skciphers,
- ARRAY_SIZE(sm4_aesni_avx2_skciphers),
- simd_sm4_aesni_avx2_skciphers);
+ crypto_unregister_skciphers(sm4_aesni_avx2_skciphers,
+ ARRAY_SIZE(sm4_aesni_avx2_skciphers));
}
module_init(sm4_init);
module_exit(sm4_exit);
diff --git a/arch/x86/crypto/sm4_aesni_avx_glue.c b/arch/x86/crypto/sm4_aesni_avx_glue.c
index 85b4ca78b47b5..d8289a8fa3807 100644
--- a/arch/x86/crypto/sm4_aesni_avx_glue.c
+++ b/arch/x86/crypto/sm4_aesni_avx_glue.c
@@ -9,12 +9,10 @@
*/
#include <linux/module.h>
#include <linux/crypto.h>
#include <linux/kernel.h>
-#include <asm/simd.h>
-#include <crypto/internal/simd.h>
#include <crypto/internal/skcipher.h>
#include <crypto/sm4.h>
#include "sm4-avx.h"
#define SM4_CRYPT8_BLOCK_SIZE (SM4_BLOCK_SIZE * 8)
@@ -261,14 +259,13 @@ static int ctr_crypt(struct skcipher_request *req)
}
static struct skcipher_alg sm4_aesni_avx_skciphers[] = {
{
.base = {
- .cra_name = "__ecb(sm4)",
- .cra_driver_name = "__ecb-sm4-aesni-avx",
+ .cra_name = "ecb(sm4)",
+ .cra_driver_name = "ecb-sm4-aesni-avx",
.cra_priority = 400,
- .cra_flags = CRYPTO_ALG_INTERNAL,
.cra_blocksize = SM4_BLOCK_SIZE,
.cra_ctxsize = sizeof(struct sm4_ctx),
.cra_module = THIS_MODULE,
},
.min_keysize = SM4_KEY_SIZE,
@@ -277,14 +274,13 @@ static struct skcipher_alg sm4_aesni_avx_skciphers[] = {
.setkey = sm4_skcipher_setkey,
.encrypt = sm4_avx_ecb_encrypt,
.decrypt = sm4_avx_ecb_decrypt,
}, {
.base = {
- .cra_name = "__cbc(sm4)",
- .cra_driver_name = "__cbc-sm4-aesni-avx",
+ .cra_name = "cbc(sm4)",
+ .cra_driver_name = "cbc-sm4-aesni-avx",
.cra_priority = 400,
- .cra_flags = CRYPTO_ALG_INTERNAL,
.cra_blocksize = SM4_BLOCK_SIZE,
.cra_ctxsize = sizeof(struct sm4_ctx),
.cra_module = THIS_MODULE,
},
.min_keysize = SM4_KEY_SIZE,
@@ -294,14 +290,13 @@ static struct skcipher_alg sm4_aesni_avx_skciphers[] = {
.setkey = sm4_skcipher_setkey,
.encrypt = sm4_cbc_encrypt,
.decrypt = cbc_decrypt,
}, {
.base = {
- .cra_name = "__ctr(sm4)",
- .cra_driver_name = "__ctr-sm4-aesni-avx",
+ .cra_name = "ctr(sm4)",
+ .cra_driver_name = "ctr-sm4-aesni-avx",
.cra_priority = 400,
- .cra_flags = CRYPTO_ALG_INTERNAL,
.cra_blocksize = 1,
.cra_ctxsize = sizeof(struct sm4_ctx),
.cra_module = THIS_MODULE,
},
.min_keysize = SM4_KEY_SIZE,
@@ -313,13 +308,10 @@ static struct skcipher_alg sm4_aesni_avx_skciphers[] = {
.encrypt = ctr_crypt,
.decrypt = ctr_crypt,
}
};
-static struct simd_skcipher_alg *
-simd_sm4_aesni_avx_skciphers[ARRAY_SIZE(sm4_aesni_avx_skciphers)];
-
static int __init sm4_init(void)
{
const char *feature_name;
if (!boot_cpu_has(X86_FEATURE_AVX) ||
@@ -333,20 +325,18 @@ static int __init sm4_init(void)
&feature_name)) {
pr_info("CPU feature '%s' is not supported.\n", feature_name);
return -ENODEV;
}
- return simd_register_skciphers_compat(sm4_aesni_avx_skciphers,
- ARRAY_SIZE(sm4_aesni_avx_skciphers),
- simd_sm4_aesni_avx_skciphers);
+ return crypto_register_skciphers(sm4_aesni_avx_skciphers,
+ ARRAY_SIZE(sm4_aesni_avx_skciphers));
}
static void __exit sm4_exit(void)
{
- simd_unregister_skciphers(sm4_aesni_avx_skciphers,
- ARRAY_SIZE(sm4_aesni_avx_skciphers),
- simd_sm4_aesni_avx_skciphers);
+ crypto_unregister_skciphers(sm4_aesni_avx_skciphers,
+ ARRAY_SIZE(sm4_aesni_avx_skciphers));
}
module_init(sm4_init);
module_exit(sm4_exit);
diff --git a/arch/x86/crypto/twofish_avx_glue.c b/arch/x86/crypto/twofish_avx_glue.c
index 3eb3440b477a8..9e20db0137501 100644
--- a/arch/x86/crypto/twofish_avx_glue.c
+++ b/arch/x86/crypto/twofish_avx_glue.c
@@ -11,11 +11,10 @@
#include <linux/module.h>
#include <linux/types.h>
#include <linux/crypto.h>
#include <linux/err.h>
#include <crypto/algapi.h>
-#include <crypto/internal/simd.h>
#include <crypto/twofish.h>
#include "twofish.h"
#include "ecb_cbc_helpers.h"
@@ -72,27 +71,25 @@ static int cbc_decrypt(struct skcipher_request *req)
CBC_WALK_END();
}
static struct skcipher_alg twofish_algs[] = {
{
- .base.cra_name = "__ecb(twofish)",
- .base.cra_driver_name = "__ecb-twofish-avx",
+ .base.cra_name = "ecb(twofish)",
+ .base.cra_driver_name = "ecb-twofish-avx",
.base.cra_priority = 400,
- .base.cra_flags = CRYPTO_ALG_INTERNAL,
.base.cra_blocksize = TF_BLOCK_SIZE,
.base.cra_ctxsize = sizeof(struct twofish_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = TF_MIN_KEY_SIZE,
.max_keysize = TF_MAX_KEY_SIZE,
.setkey = twofish_setkey_skcipher,
.encrypt = ecb_encrypt,
.decrypt = ecb_decrypt,
}, {
- .base.cra_name = "__cbc(twofish)",
- .base.cra_driver_name = "__cbc-twofish-avx",
+ .base.cra_name = "cbc(twofish)",
+ .base.cra_driver_name = "cbc-twofish-avx",
.base.cra_priority = 400,
- .base.cra_flags = CRYPTO_ALG_INTERNAL,
.base.cra_blocksize = TF_BLOCK_SIZE,
.base.cra_ctxsize = sizeof(struct twofish_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = TF_MIN_KEY_SIZE,
.max_keysize = TF_MAX_KEY_SIZE,
@@ -101,30 +98,26 @@ static struct skcipher_alg twofish_algs[] = {
.encrypt = cbc_encrypt,
.decrypt = cbc_decrypt,
},
};
-static struct simd_skcipher_alg *twofish_simd_algs[ARRAY_SIZE(twofish_algs)];
-
static int __init twofish_init(void)
{
const char *feature_name;
if (!cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, &feature_name)) {
pr_info("CPU feature '%s' is not supported.\n", feature_name);
return -ENODEV;
}
- return simd_register_skciphers_compat(twofish_algs,
- ARRAY_SIZE(twofish_algs),
- twofish_simd_algs);
+ return crypto_register_skciphers(twofish_algs,
+ ARRAY_SIZE(twofish_algs));
}
static void __exit twofish_exit(void)
{
- simd_unregister_skciphers(twofish_algs, ARRAY_SIZE(twofish_algs),
- twofish_simd_algs);
+ crypto_unregister_skciphers(twofish_algs, ARRAY_SIZE(twofish_algs));
}
module_init(twofish_init);
module_exit(twofish_exit);
--
2.48.1
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 0/2] Eliminate the no-SIMD en/decryption fallbacks on x86
2025-02-20 5:13 [RFC PATCH 0/2] Eliminate the no-SIMD en/decryption fallbacks on x86 Eric Biggers
2025-02-20 5:13 ` [RFC PATCH 1/2] x86/fpu: make kernel-mode FPU reliably usable in softirqs Eric Biggers
2025-02-20 5:13 ` [RFC PATCH 2/2] crypto: x86 - stop using the SIMD helper Eric Biggers
@ 2025-02-21 3:53 ` Herbert Xu
2025-02-24 18:57 ` Eric Biggers
3 siblings, 0 replies; 12+ messages in thread
From: Herbert Xu @ 2025-02-21 3:53 UTC (permalink / raw)
To: Eric Biggers
Cc: x86, linux-crypto, linux-kernel, ardb, greearb, shaw.leon, tglx,
mingo, bp, dave.hansen, luto, Jason
Eric Biggers <ebiggers@kernel.org> wrote:
> The patchset can also be retrieved from:
>
> git fetch https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git x86-softirq-fpu-fix-v1
>
> This patchset fixes a longstanding issue where kernel-mode FPU (i.e.,
> SIMD) was not reliably usable in softirqs in x86, which was creating the
> need for a fallback. The fallback was really bad for performance, and
> it even hurt performance for users that never encountered the edge case
> where kernel-mode FPU was not usable.
Great work!
> I also benchmarked bidirectional IPsec, which has been claimed to often
> hit the edge case where kernel-mode FPU was previously not usable in
> softirq context. Ultimately, I was not actually able to reproduce that
> edge case being reached unless I reduced the number of CPUs to 1, in
> which case it then started being occasionally reached. Regardless, even
> without that case being reached, IPsec throughput still improved by 2%.
> In situations where that case was being reached, or where users required
> a synchronous algorithm, a much larger improvement should be seen.
You would need a situation where your CPU is maxed out by your
bandwidth, so on a physical box these days you would need 10GbE
at the minimum.
However, I used to be able to easily reproduce this using virtualisation
because there the bandwidth is essentially unlimited. So perhaps
a KVM guest with a single CPU doing bidirectional IPsec to the host
should be enough to reproduce this case.
Thanks,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 1/2] x86/fpu: make kernel-mode FPU reliably usable in softirqs
2025-02-20 5:13 ` [RFC PATCH 1/2] x86/fpu: make kernel-mode FPU reliably usable in softirqs Eric Biggers
@ 2025-02-21 7:38 ` Xiao Liang
2025-02-21 19:31 ` Eric Biggers
2025-02-28 3:59 ` Eric Biggers
1 sibling, 1 reply; 12+ messages in thread
From: Xiao Liang @ 2025-02-21 7:38 UTC (permalink / raw)
To: Eric Biggers
Cc: x86, linux-crypto, linux-kernel, Ard Biesheuvel, Ben Greear,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Andy Lutomirski, Jason A . Donenfeld
On Thu, Feb 20, 2025 at 1:16 PM Eric Biggers <ebiggers@kernel.org> wrote:
>
> From: Eric Biggers <ebiggers@google.com>
>
> Currently kernel-mode FPU is not always usable in softirq context on
> x86, since softirqs can nest inside a kernel-mode FPU section in task
> context, and nested use of kernel-mode FPU is not supported.
>
> Therefore, x86 SIMD-optimized code that can be called in softirq context
> has to sometimes fall back to non-SIMD code. There are two options for
> the fallback, both of which are pretty terrible:
>
> (a) Use a scalar fallback. This can be 10-100x slower than vectorized
> code because it cannot use specialized instructions like AES, SHA,
> or carryless multiplication.
>
> (b) Execute the request asynchronously using a kworker. In other
> words, use the "crypto SIMD helper" in crypto/simd.c.
>
> Currently most of the x86 en/decryption code (skcipher and aead
> algorithms) uses option (b), since this avoids the slow scalar fallback
> and it is easier to wire up. But option (b) is still really bad for its
> own reasons:
>
> - Punting the request to a kworker is bad for performance too.
>
> - It forces the algorithm to be marked as asynchronous
> (CRYPTO_ALG_ASYNC), preventing it from being used by crypto API
> users who request a synchronous algorithm. That's another huge
> performance problem, which is especially unfortunate for users who
> don't even do en/decryption in softirq context.
>
> - It makes all en/decryption operations take a detour through
> crypto/simd.c. That involves additional checks and an additional
> indirect call, which slow down en/decryption for *everyone*.
Thank you for the detailed information.
> Fortunately, the skcipher and aead APIs are only usable in task and
> softirq context in the first place, nor is it supported to call them
> with hardirqs disabled. Thus, if kernel-mode FPU were to be reliably
> usable in softirq context, no fallback would be needed. Indeed, other
> architectures such as arm, arm64, and riscv have already done this.
>
> Therefore, this patch updates x86 accordingly to reliably support
> kernel-mode FPU in softirqs (except when hardirqs are disabled).
>
> This is done by just disabling softirq processing in kernel-mode FPU
> sections, as that prevents the nesting that was problematic.
>
> This will delay some softirqs slightly, but only ones that would have
> otherwise been nested inside a task context kernel-mode FPU section.
> Any such softirqs would have taken the slow fallback path before if they
> tried to do any en/decryption. Now these softirqs will just run at the
> end of the task context kernel-mode FPU section (since local_bh_enable()
> runs pending softirqs) and will no longer take the slow fallback path.
I think this will delay all softirqs, including those that don't use
FPU. Will there be a performance impact?
(I guess you've noticed the patch I submitted last year. And this is
the main reason why it was implemented in the way you mentioned as
the second alternative.)
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 1/2] x86/fpu: make kernel-mode FPU reliably usable in softirqs
2025-02-21 7:38 ` Xiao Liang
@ 2025-02-21 19:31 ` Eric Biggers
2025-02-25 22:21 ` David Laight
0 siblings, 1 reply; 12+ messages in thread
From: Eric Biggers @ 2025-02-21 19:31 UTC (permalink / raw)
To: Xiao Liang
Cc: x86, linux-crypto, linux-kernel, Ard Biesheuvel, Ben Greear,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Andy Lutomirski, Jason A . Donenfeld
Hi Xiao,
On Fri, Feb 21, 2025 at 03:38:27PM +0800, Xiao Liang wrote:
> > Therefore, this patch updates x86 accordingly to reliably support
> > kernel-mode FPU in softirqs (except when hardirqs are disabled).
> >
> > This is done by just disabling softirq processing in kernel-mode FPU
> > sections, as that prevents the nesting that was problematic.
> >
> > This will delay some softirqs slightly, but only ones that would have
> > otherwise been nested inside a task context kernel-mode FPU section.
> > Any such softirqs would have taken the slow fallback path before if they
> > tried to do any en/decryption. Now these softirqs will just run at the
> > end of the task context kernel-mode FPU section (since local_bh_enable()
> > runs pending softirqs) and will no longer take the slow fallback path.
>
> I think this will delay all softirqs, including those that don't use
> FPU. Will there be a performance impact?
> (I guess you've noticed the patch I submitted last year. And this is
> the main reason why it was implemented in the way you mentioned as
> the second alternative.)
Thanks for taking a look at this patch! It's true that this patch makes all
softirqs on the same CPU be delayed until the end of the current kernel-mode FPU
section. But, I'm a bit skeptical that it actually matters enough on x86 to go
with a more complex solution that would allow nested kernel-mode FPU.
Kernel-mode FPU sections are generally short; the usual cases are en/decrypting
disk sectors or network packets that are 4 KiB or less.
Even if a longer buffer is passed in, most of the x86 SIMD-optimized code
already divides the buffer into chunks of at most 4 KiB and uses a separate
kernel-mode FPU section for each chunk. This happens either explicitly, or
implicitly via the skcipher_walk_* functions which never return more than
PAGE_SIZE (i.e. 4 KiB on x86) in a single step. There is some code that does
not do this, e.g. the CRC code, but that could easily be fixed.
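For reference, the per-step pattern in most of the skcipher glue code looks
roughly like the following sketch. The ctx type, block size, and SIMD helper
here are made-up placeholders rather than any particular driver, and the usual
header includes are assumed:

#include <crypto/internal/skcipher.h>
#include <asm/fpu/api.h>

#define MY_BLOCK_SIZE	16			/* placeholder block size */

struct my_cipher_ctx { u8 key[32]; };		/* placeholder tfm context */

/* Placeholder for a SIMD assembly routine; not a real symbol. */
void my_simd_ecb_crypt(const struct my_cipher_ctx *ctx, u8 *dst,
		       const u8 *src, unsigned int nbytes);

static int ecb_crypt_sketch(struct skcipher_request *req)
{
	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
	const struct my_cipher_ctx *ctx = crypto_skcipher_ctx(tfm);
	struct skcipher_walk walk;
	int err;

	err = skcipher_walk_virt(&walk, req, false);

	while (walk.nbytes) {
		/* Process only whole blocks in this walk step. */
		unsigned int nbytes = walk.nbytes & ~(MY_BLOCK_SIZE - 1);

		/*
		 * Each walk step maps at most one page, so each kernel-mode
		 * FPU section (and thus each softirq-off region) stays short.
		 */
		kernel_fpu_begin();
		my_simd_ecb_crypt(ctx, walk.dst.virt.addr,
				  walk.src.virt.addr, nbytes);
		kernel_fpu_end();

		err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
	}
	return err;
}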
The commonly-used x86 SIMD-optimized code is also super fast these days, and is
only getting faster. For example, on an AMD desktop processor from this year I
get roughly 35 GB/s AES-256-XTS, 25 GB/s AES-256-GCM, or 80 GB/s any CRC,
courtesy of VAES and VPCLMULQDQ (measuring single-threaded throughput at max
frequency). That works out to 50-165 nanoseconds per 4 KiB. Increasingly these
algorithms can be thought of as similar to memcpy() in speed.
Of course, the worst case is probably about 100x slower -- consider a CPU that
is much older, and from a low-voltage product line (e.g. Intel Atom), and not
running at its max frequency, and computing a much slower crypto algorithm that
lacks hardware acceleration like Serpent-XTS, or even AES-something if the CPU
is so old (over 15 years) as to lack AES-NI.
But, the super slow crypto algorithms are becoming increasingly rare. The
crypto algorithms in use these days tend to have hardware acceleration on x86
(via AES-NI, PCLMULQDQ, or SHA extensions) or at least be fast with SSE / AVX.
So while the worst case is likely about 20 microseconds on certain systems where
everything lines up the wrong way, realistically the worst case on most systems
based on what's actually being used is probably less than 1 microsecond.
That is probably short enough to be acceptable? Remember that preemption was
already being disabled during this time. And this is only on one CPU.
I think it's also important to note that when local_bh_enable() re-enables
softirq processing (when called from kernel_fpu_end()), it also immediately
runs any pending softirqs. Thus there would be no additional delay; the CPU
will *immediately* run any pending softirqs.
As for supporting nested kernel-mode FPU if we wanted to go that way: yes, your
patch from last year
https://lore.kernel.org/lkml/20240403140138.393825-1-shaw.leon@gmail.com/
ostensibly did that. However, I found some bugs in it; e.g., it didn't take
into account that struct fpu is variable-length. So it didn't turn out as
simple as that patch made it seem. Just extending fpregs_{lock,unlock}() to
kernel-mode FPU is a simpler solution with fewer edge cases, and it avoids
increasing the memory usage of the kernel. So I thought I'd propose that first.
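To make that concrete, here is a conceptual sketch rather than the literal
patch. fpregs_lock()/fpregs_unlock() already exist in <asm/fpu/api.h> and look
essentially like this; the change is to have kernel_fpu_begin() and
kernel_fpu_end() use them in place of preempt_disable()/preempt_enable():

static inline void fpregs_lock(void)
{
	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
		local_bh_disable();	/* also disables preemption */
	else
		preempt_disable();
}

static inline void fpregs_unlock(void)
{
	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
		local_bh_enable();	/* immediately runs any pending softirqs */
	else
		preempt_enable();
}

With that in place, a softirq can never nest inside a task-context
kernel-mode FPU section, so no fallback path is needed.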
- Eric
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 0/2] Eliminate the no-SIMD en/decryption fallbacks on x86
2025-02-20 5:13 [RFC PATCH 0/2] Eliminate the no-SIMD en/decryption fallbacks on x86 Eric Biggers
` (2 preceding siblings ...)
2025-02-21 3:53 ` [RFC PATCH 0/2] Eliminate the no-SIMD en/decryption fallbacks on x86 Herbert Xu
@ 2025-02-24 18:57 ` Eric Biggers
3 siblings, 0 replies; 12+ messages in thread
From: Eric Biggers @ 2025-02-24 18:57 UTC (permalink / raw)
To: x86
Cc: linux-crypto, linux-kernel, Ard Biesheuvel, Ben Greear,
Xiao Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Andy Lutomirski, Jason A . Donenfeld
On Wed, Feb 19, 2025 at 09:13:23PM -0800, Eric Biggers wrote:
> The patchset can also be retrieved from:
>
> git fetch https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git x86-softirq-fpu-fix-v1
>
> This patchset fixes a longstanding issue where kernel-mode FPU (i.e.,
> SIMD) was not reliably usable in softirqs in x86, which was creating the
> need for a fallback. The fallback was really bad for performance, and
> it even hurt performance for users that never encountered the edge case
> where kernel-mode FPU was not usable.
>
> This patchset aligns x86 with other architectures such as arm, arm64,
> and riscv by making kernel-mode FPU work in softirqs reliably. There
> are a few possible ways to achieve that, and for now I just went with
> the simplest way; see patch 1 for details.
>
> Patch 2 eliminates all uses of the "crypto SIMD helper" from x86, as
> patch 1 makes it unnecessary. For the RFC it is just one big patch;
> I'll probably split patch 2 up if this progresses past RFC status.
>
> Performance results have been positive. All en/decryption is now
> slightly faster on x86, as it no longer take a detour through
> crypto/simd.c. I get a 7% or 23% improvement for AES-XTS, for example.
>
> I also benchmarked bidirectional IPsec, which has been claimed to often
> hit the edge case where kernel-mode FPU was previously not usable in
> softirq context. Ultimately, I was not actually able to reproduce that
> edge case being reached unless I reduced the number of CPUs to 1, in
> which case it then started being occasionally reached. Regardless, even
> without that case being reached, IPsec throughput still improved by 2%.
> In situations where that case was being reached, or where users required
> a synchronous algorithm, a much larger improvement should be seen.
>
> Eric Biggers (2):
> x86/fpu: make kernel-mode FPU reliably usable in softirqs
> crypto: x86 - stop using the SIMD helper
Any thoughts on this from the x86 folks?
- Eric
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 1/2] x86/fpu: make kernel-mode FPU reliably usable in softirqs
2025-02-21 19:31 ` Eric Biggers
@ 2025-02-25 22:21 ` David Laight
2025-02-25 22:59 ` Eric Biggers
0 siblings, 1 reply; 12+ messages in thread
From: David Laight @ 2025-02-25 22:21 UTC (permalink / raw)
To: Eric Biggers
Cc: Xiao Liang, x86, linux-crypto, linux-kernel, Ard Biesheuvel,
Ben Greear, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Andy Lutomirski
On Fri, 21 Feb 2025 19:31:24 +0000
Eric Biggers <ebiggers@kernel.org> wrote:
> Hi Xiao,
>
> On Fri, Feb 21, 2025 at 03:38:27PM +0800, Xiao Liang wrote:
> > > Therefore, this patch updates x86 accordingly to reliably support
> > > kernel-mode FPU in softirqs (except when hardirqs are disabled).
> > >
> > > This is done by just disabling softirq processing in kernel-mode FPU
> > > sections, as that prevents the nesting that was problematic.
> > >
> > > This will delay some softirqs slightly, but only ones that would have
> > > otherwise been nested inside a task context kernel-mode FPU section.
> > > Any such softirqs would have taken the slow fallback path before if they
> > > tried to do any en/decryption. Now these softirqs will just run at the
> > > end of the task context kernel-mode FPU section (since local_bh_enable()
> > > runs pending softirqs) and will no longer take the slow fallback path.
> >
> > I think this will delay all softirqs, including those that don't use
> > FPU. Will there be a performance impact?
> > (I guess you've noticed the patch I submitted last year. And this is
> > the main reason why it was implemented in the way you mentioned as
> > the second alternative.)
>
> Thanks for taking a look at this patch! It's true that this patch makes all
> softirqs on the same CPU be delayed until the end of the current kernel-mode FPU
> section. But, I'm a bit skeptical that it actually matters enough on x86 to go
> with a more complex solution that would allow nested kernel-mode FPU.
> Kernel-mode FPU sections are generally short; the usual cases are en/decrypting
> disk sectors or network packets that are 4 KiB or less.
>
....
> I think it's also important to note that when local_bh_enable() re-enables
> softirq processing (when called from kernel_fpu_end()), it also immediately
> runs any pending softirqs. Thus there would be no additional delay; the CPU
> will *immediately* run any pending softirqs.
I'd also have thought that anything time-critical shouldn't rely on softirq.
The network stack will run a lot of code in softirq context; a bit of time
with softirq disabled isn't going to make any difference to real-world latency.
I do wonder though whether the network NAPI code should be running in softirq
context at all.
With the amount of data it is trivial to push through a single 'consumer' ethernet
interface, it can easily cause the scheduler to misbehave.
I'd guess that Google (etc.) use threaded NAPI, multiple rx queues, and RFS to get
the network processing spread out and not contending with process code.
>
> As for supporting nested kernel-mode FPU if we wanted to go that way: yes, your
> patch from last year
> https://lore.kernel.org/lkml/20240403140138.393825-1-shaw.leon@gmail.com/
> ostensibly did that. However, I found some bugs in it; e.g., it didn't take
> into account that struct fpu is variable-length. So it didn't turn out as
> simple as that patch made it seem. Just extending fpregs_{lock,unlock}() to
> kernel-mode FPU is a simpler solution with fewer edge cases, and it avoids
> increasing the memory usage of the kernel. So I thought I'd propose that first.
Since many kernel users don't want the traditional FPU and just need to use
an instruction that requires an AVX register or two, is it possible for code
to specify a small save area for just two or four registers and then use only
those registers (treating them all as caller-saved)?
I know that won't work with anything that affects the fpu status register,
but if you want a single wide register for a PCIe read (to generate a big TLP)
it is more than enough.
I'm sure there are horrid pitfalls, especially if IPIs are still used for
deferred saving of FPU state.
David
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 1/2] x86/fpu: make kernel-mode FPU reliably usable in softirqs
2025-02-25 22:21 ` David Laight
@ 2025-02-25 22:59 ` Eric Biggers
2025-02-26 17:09 ` Dave Hansen
0 siblings, 1 reply; 12+ messages in thread
From: Eric Biggers @ 2025-02-25 22:59 UTC (permalink / raw)
To: David Laight
Cc: Xiao Liang, x86, linux-crypto, linux-kernel, Ard Biesheuvel,
Ben Greear, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Andy Lutomirski
On Tue, Feb 25, 2025 at 10:21:33PM +0000, David Laight wrote:
> > As for supporting nested kernel-mode FPU if we wanted to go that way: yes, your
> > patch from last year
> > https://lore.kernel.org/lkml/20240403140138.393825-1-shaw.leon@gmail.com/
> > ostensibly did that. However, I found some bugs in it; e.g., it didn't take
> > into account that struct fpu is variable-length. So it didn't turn out as
> > simple as that patch made it seem. Just extending fpregs_{lock,unlock}() to
> > kernel-mode FPU is a simpler solution with fewer edge cases, and it avoids
> > increasing the memory usage of the kernel. So I thought I'd propose that first.
>
> Since many kernel users don't want the traditional fpu, they just need to use
> an instruction that requires an AVX register or two, is it possible for code
> to specify a small save area for just two or four registers and then use just
> those registers? (so treating then all as caller-saved).
> I know that won't work with anything that affects the fpu status register,
> but if you want a single wide register for a PCIe read (to generate a big TLP)
> it is more than enough.
>
> I'm sure there are horrid pitfalls, especially if IPI are still used to for
> deferred save of fpu state.
I'm afraid that's not an accurate summary of what uses the vector registers in
kernel mode. The main use case is crypto, and most of the crypto code uses a
lot of vector registers. Some of the older crypto code uses at most 8 vector
registers (xmm0-xmm7) for 32-bit compatibility, but newer code uses 16 or even
up to 32 YMM or ZMM registers. The new AES-GCM code for example uses all 32
vector registers, and the new AES-XTS code uses 30.
In general, taking full advantage of the vector register set improves
performance, and the trend has very much been towards using more registers --
not fewer. (And the registers have been getting larger too!) AES by itself
tends to need about 8 registers to take advantage of the CPU's full AES
throughput, but there are other computations like GHASH or tweak computation
that need to be interleaved with AES, using more registers. And various
constants and round keys can be cached in registers to improve performance.
If we had to save/restore a large number of vector registers in every crypto
function call (not amortized to one save/restore per return to userspace), that
would be a big performance problem.
Most of the crypto code certainly could be written to use fewer registers. But
it would reduce performance, especially if we tried to squeeze it down to use a
really small number of registers like 2-4. Plus any such efforts would
complicate efforts to port crypto code between the kernel and userspace, as
userspace does not have such constraints on the number of registers.
- Eric
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 1/2] x86/fpu: make kernel-mode FPU reliably usable in softirqs
2025-02-25 22:59 ` Eric Biggers
@ 2025-02-26 17:09 ` Dave Hansen
0 siblings, 0 replies; 12+ messages in thread
From: Dave Hansen @ 2025-02-26 17:09 UTC (permalink / raw)
To: Eric Biggers, David Laight
Cc: Xiao Liang, x86, linux-crypto, linux-kernel, Ard Biesheuvel,
Ben Greear, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Andy Lutomirski
On 2/25/25 14:59, Eric Biggers wrote:
> If we had to save/restore a large number of vector registers in every crypto
> function call (not amortized to one save/restore per return to userspace), that
> would be a big performance problem.
I just did a quick trace on my laptop. Looks like I have two main
kernel_fpu_begin() users: LUKS and networking. They both very much seem
to do a bunch of kernel_fpu_begin() operations but very few actual XSAVEs:
26 : save_fpregs_to_fpstate <-kernel_fpu_begin_mask
818 : kernel_fpu_begin_mask <-crc32c_pcl_intel_update
4192 : kernel_fpu_begin_mask <-xts_encrypt_vaes_avx10_256
This is at least _one_ data point very much in favor of Eric's argument
here. It appears that the cost of one XSAVE is amortized across a
bunch of kernel_fpu_begin()s.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 1/2] x86/fpu: make kernel-mode FPU reliably usable in softirqs
2025-02-20 5:13 ` [RFC PATCH 1/2] x86/fpu: make kernel-mode FPU reliably usable in softirqs Eric Biggers
2025-02-21 7:38 ` Xiao Liang
@ 2025-02-28 3:59 ` Eric Biggers
2025-02-28 12:39 ` Ard Biesheuvel
1 sibling, 1 reply; 12+ messages in thread
From: Eric Biggers @ 2025-02-28 3:59 UTC (permalink / raw)
To: x86
Cc: linux-crypto, linux-kernel, Ard Biesheuvel, Ben Greear,
Xiao Liang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Andy Lutomirski, Jason A . Donenfeld
On Wed, Feb 19, 2025 at 09:13:24PM -0800, Eric Biggers wrote:
> To comply with the requirements of local_bh_disable and local_bh_enable,
> this change also removes support for kernel-mode FPU in hardirq context
> or with hardirqs disabled. This should not be a problem, though. There
> does not appear to be any use case for kernel-mode FPU in such contexts,
> and notably arm64 and riscv already have these same conditions.
I found a problem with this assumption: the system suspend and resume code calls
kernel_fpu_begin() and kernel_fpu_end() with hardirqs disabled. See
__save_processor_state() and __restore_processor_state() in
arch/x86/power/cpu.c. That triggers the WARN_ON_FPU(!irq_fpu_usable()).
I think there are two directions we could go with this: either choose a solution
that keeps kernel_fpu_begin() usable with hardirqs disabled; or change
__save_processor_state() and __restore_processor_state() to save/restore the FPU
registers directly, e.g. via save_fpregs_to_fpstate() and
restore_fpregs_from_fpstate(). (Kernel-mode FPU isn't actually being used in
this case, so a more direct save/restore might make sense here.)
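A rough sketch of the second option follows; the helpers are the ones named
above, but the exact signatures, the XFEATURE_MASK_FPSTATE usage, and where
the state lives are assumptions based on the current fpu core code, not a
tested patch:

/*
 * Hypothetical sketch only: save/restore the FPU registers directly in
 * __save_processor_state()/__restore_processor_state() instead of using
 * kernel_fpu_begin()/kernel_fpu_end(), since hardirqs are disabled there
 * and no SIMD computation is actually being done.
 */
static void sketch_save_fpu_for_suspend(void)
{
	struct fpu *fpu = &current->thread.fpu;	/* assumed location */

	save_fpregs_to_fpstate(fpu);
}

static void sketch_restore_fpu_after_resume(void)
{
	struct fpu *fpu = &current->thread.fpu;

	/* Assumed signature: (struct fpstate *, u64 feature mask). */
	restore_fpregs_from_fpstate(fpu->fpstate, XFEATURE_MASK_FPSTATE);
}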
- Eric
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC PATCH 1/2] x86/fpu: make kernel-mode FPU reliably usable in softirqs
2025-02-28 3:59 ` Eric Biggers
@ 2025-02-28 12:39 ` Ard Biesheuvel
0 siblings, 0 replies; 12+ messages in thread
From: Ard Biesheuvel @ 2025-02-28 12:39 UTC (permalink / raw)
To: Eric Biggers
Cc: x86, linux-crypto, linux-kernel, Ben Greear, Xiao Liang,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Andy Lutomirski, Jason A . Donenfeld
On Fri, 28 Feb 2025 at 04:59, Eric Biggers <ebiggers@kernel.org> wrote:
>
> On Wed, Feb 19, 2025 at 09:13:24PM -0800, Eric Biggers wrote:
> > To comply with the requirements of local_bh_disable and local_bh_enable,
> > this change also removes support for kernel-mode FPU in hardirq context
> > or with hardirqs disabled. This should not be a problem, though. There
> > does not appear to be any use case for kernel-mode FPU in such contexts,
> > and notably arm64 and riscv already have these same conditions.
>
> I found a problem with this assumption: the system suspend and resume code calls
> kernel_fpu_begin() and kernel_fpu_end() with hardirqs disabled. See
> __save_processor_state() and __restore_processor_state() in
> arch/x86/power/cpu.c. That triggers the WARN_ON_FPU(!irq_fpu_usable()).
>
> I think there are two directions we could go with this: either choose a solution
> that keeps kernel_fpu_begin() usable with hardirqs disabled;
I still owe you an investigation into how this interoperates with EFI
runtime services, but it appears there are cases (efi-pstore on an
OOPS) where EFI SetVariable() might be invoked with IRQs disabled.
arm64 has a special case for EFI runtime calls made under conditions
where SIMD may not be used, and essentially just preserves and
restores the entire state.
It is rather unfortunate that this is needed, but the UEFI spec
permits runtime service implementations to use XMM registers, so there
is no way around this AFAIK.
> or change
> __save_processor_state() and __restore_processor_state() to save/restore the FPU
> registers directly, e.g. via save_fpregs_to_fpstate() and
> restore_fpregs_from_fpstate(). (Kernel-mode FPU isn't actually being used in
> this case, so a more direct save/restore might make sense here.)
>
> - Eric
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2025-02-28 12:40 UTC | newest]
Thread overview: 12+ messages
2025-02-20 5:13 [RFC PATCH 0/2] Eliminate the no-SIMD en/decryption fallbacks on x86 Eric Biggers
2025-02-20 5:13 ` [RFC PATCH 1/2] x86/fpu: make kernel-mode FPU reliably usable in softirqs Eric Biggers
2025-02-21 7:38 ` Xiao Liang
2025-02-21 19:31 ` Eric Biggers
2025-02-25 22:21 ` David Laight
2025-02-25 22:59 ` Eric Biggers
2025-02-26 17:09 ` Dave Hansen
2025-02-28 3:59 ` Eric Biggers
2025-02-28 12:39 ` Ard Biesheuvel
2025-02-20 5:13 ` [RFC PATCH 2/2] crypto: x86 - stop using the SIMD helper Eric Biggers
2025-02-21 3:53 ` [RFC PATCH 0/2] Eliminate the no-SIMD en/decryption fallbacks on x86 Herbert Xu
2025-02-24 18:57 ` Eric Biggers