Linux cryptographic layer development

Linux cryptographic layer development
 help / color / mirror / Atom feed

* [PATCH 2/2] crypto: qat - add KUnit coverage for the migration import parser
From: Michael Bommarito @ 2026-06-14 13:06 UTC (permalink / raw)
  To: Giovanni Cabiddu, Herbert Xu
  Cc: David S . Miller, Kees Cook, qat-linux, linux-crypto,
	linux-kernel
In-Reply-To: <20260614130619.2519534-1-michael.bommarito@gmail.com>

Add KUnit coverage for the remote (migration import) path of the QAT
live migration state manager. The cases drive the real
adf_mstate_mgr_init_from_remote() -> adf_mstate_sect_validate() parser
on a buffer sized as the GEN4 VFIO migration backend allocates it, and
include the preh_len == buffer-size boundary case that is the
regression oracle for the preceding fix. The test file is included from
adf_mstate_mgr.c so it can reach the file-local preamble and section
types, gated by CONFIG_CRYPTO_DEV_QAT_KUNIT_TEST. It is offered as a
separate patch so it can be taken or dropped independently of the fix.

Reproduced under KASAN on QEMU; the boundary case reports the
slab-out-of-bounds read on an unfixed tree and passes once the fix is
in place.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
---
Three cases under KASAN on QEMU x86_64: a valid empty preamble, a valid
in-bounds section header, and the preh_len == buffer-size (4096) boundary
case. The boundary case is the regression oracle for patch 1: it reports
the slab-out-of-bounds read on an unfixed tree and returns -EINVAL with
the fix in place. The two valid cases drive the same parser path with no
out-of-bounds access and pass on both trees.

 drivers/crypto/intel/qat/Kconfig              | 16 ++++
 .../intel/qat/qat_common/adf_mstate_mgr.c     |  4 +
 .../qat/qat_common/adf_mstate_mgr_test.c      | 81 +++++++++++++++++++
 3 files changed, 101 insertions(+)
 create mode 100644 drivers/crypto/intel/qat/qat_common/adf_mstate_mgr_test.c

diff --git a/drivers/crypto/intel/qat/Kconfig b/drivers/crypto/intel/qat/Kconfig
index 9d6e6f52d2dcb..116f7f94c9a64 100644
--- a/drivers/crypto/intel/qat/Kconfig
+++ b/drivers/crypto/intel/qat/Kconfig
@@ -133,3 +133,19 @@ config CRYPTO_DEV_QAT_ERROR_INJECTION
 
 	  This functionality is available via debugfs entry of the Intel(R)
 	  QuickAssist device
+
+config CRYPTO_DEV_QAT_KUNIT_TEST
+	bool "KUnit tests for Intel(R) QAT live migration state manager" if !KUNIT_ALL_TESTS
+	depends on CRYPTO_DEV_QAT && KUNIT=y
+	default KUNIT_ALL_TESTS
+	help
+	  Build KUnit tests for the Intel(R) QAT live migration state
+	  manager remote-import parser (adf_mstate_mgr.c). The tests drive
+	  the migration setup parser on a buffer sized as the QAT GEN4
+	  VFIO migration backend allocates it and check that a malformed
+	  remote preamble is rejected before any out-of-bounds access.
+
+	  For more information on KUnit and unit tests in general, please
+	  refer to the KUnit documentation in Documentation/dev-tools/kunit/.
+
+	  If unsure, say N.
diff --git a/drivers/crypto/intel/qat/qat_common/adf_mstate_mgr.c b/drivers/crypto/intel/qat/qat_common/adf_mstate_mgr.c
index 7370b87f72a2f..701279409c32c 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_mstate_mgr.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_mstate_mgr.c
@@ -326,3 +326,7 @@ found:
 
 	return sect;
 }
+
+#if IS_ENABLED(CONFIG_CRYPTO_DEV_QAT_KUNIT_TEST)
+#include "adf_mstate_mgr_test.c"
+#endif
diff --git a/drivers/crypto/intel/qat/qat_common/adf_mstate_mgr_test.c b/drivers/crypto/intel/qat/qat_common/adf_mstate_mgr_test.c
new file mode 100644
index 0000000000000..e067c13f4a9dd
--- /dev/null
+++ b/drivers/crypto/intel/qat/qat_common/adf_mstate_mgr_test.c
@@ -0,0 +1,81 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2026 Intel Corporation */
+
+/*
+ * KUnit coverage for the QAT live migration remote-import parser. The cases
+ * drive the real adf_mstate_mgr_init_from_remote() on a buffer sized as the
+ * GEN4 VFIO migration backend allocates it (4096 bytes), including the
+ * preh_len == buffer-size boundary case. Included from adf_mstate_mgr.c to
+ * reach the file-local preamble and section types.
+ */
+
+#include <kunit/test.h>
+
+#define ADF_MSTATE_TEST_BUF_SIZE 4096
+
+static void qat_mstate_remote_run(struct kunit *test, u16 preh_len,
+				  u16 n_sects, u32 sect0_size, int expect)
+{
+	struct adf_mstate_mgr mgr;
+	struct adf_mstate_preh *pre;
+	u8 *buf;
+	int ret;
+
+	buf = kzalloc(ADF_MSTATE_TEST_BUF_SIZE, GFP_KERNEL);
+	KUNIT_ASSERT_NOT_NULL(test, buf);
+
+	pre = (struct adf_mstate_preh *)buf;
+	pre->magic = ADF_MSTATE_MAGIC;
+	pre->version = ADF_MSTATE_VERSION;
+	pre->preh_len = preh_len;
+	pre->n_sects = n_sects;
+	pre->size = 0;
+
+	/* Place an in-bounds section header when there is room for one. */
+	if (n_sects &&
+	    (u32)preh_len + sizeof(struct adf_mstate_sect_h) <= ADF_MSTATE_TEST_BUF_SIZE) {
+		struct adf_mstate_sect_h *s =
+			(struct adf_mstate_sect_h *)(buf + preh_len);
+
+		s->size = sect0_size;
+		s->sub_sects = 0;
+	}
+
+	ret = adf_mstate_mgr_init_from_remote(&mgr, buf, ADF_MSTATE_TEST_BUF_SIZE,
+					      NULL, NULL);
+	KUNIT_EXPECT_EQ(test, ret, expect);
+
+	kfree(buf);
+}
+
+/* Valid empty preamble: the validation loop never runs. */
+static void qat_mstate_remote_empty(struct kunit *test)
+{
+	qat_mstate_remote_run(test, sizeof(struct adf_mstate_preh), 0, 0, 0);
+}
+
+/* Valid in-bounds section header: same parser path, no out-of-bounds read. */
+static void qat_mstate_remote_inbounds_sect(struct kunit *test)
+{
+	qat_mstate_remote_run(test, sizeof(struct adf_mstate_preh), 1, 0, 0);
+}
+
+/* preh_len == buffer size puts the cursor past the allocation; expect -EINVAL. */
+static void qat_mstate_remote_oob_header(struct kunit *test)
+{
+	qat_mstate_remote_run(test, ADF_MSTATE_TEST_BUF_SIZE, 1, 0, -EINVAL);
+}
+
+static struct kunit_case qat_mstate_remote_cases[] = {
+	KUNIT_CASE(qat_mstate_remote_empty),
+	KUNIT_CASE(qat_mstate_remote_inbounds_sect),
+	KUNIT_CASE(qat_mstate_remote_oob_header),
+	{}
+};
+
+static struct kunit_suite qat_mstate_remote_suite = {
+	.name = "qat_mstate_remote",
+	.test_cases = qat_mstate_remote_cases,
+};
+
+kunit_test_suite(qat_mstate_remote_suite);
-- 
2.53.0


^ permalink raw reply related

* [PATCH 1/2] crypto: qat - validate migration section header is in bounds
From: Michael Bommarito @ 2026-06-14 13:06 UTC (permalink / raw)
  To: Giovanni Cabiddu, Herbert Xu
  Cc: David S . Miller, Kees Cook, qat-linux, linux-crypto,
	linux-kernel
In-Reply-To: <20260614130619.2519534-1-michael.bommarito@gmail.com>

adf_mstate_mgr_init_from_remote() sets the section-walk cursor to
mgr->buf + preh_len from the remote migration preamble. The default
preamble checker only rejects preh_len > mgr->size, so preh_len ==
mgr->size (the 4096-byte QAT VF state buffer) puts mgr->state one
region past the allocation while n_sects is still honoured.
adf_mstate_sect_validate() then reads sect->size from that cursor
before proving the section header is in the buffer. The remote stream
reaches this parser from the destination-host VFIO migration path
(qat_vf_resume_write), so a malformed import reads out of bounds.

Reject section headers not fully contained in the state buffer before
dereferencing any of their fields.

Reproduced under KASAN on QEMU via the KUnit case in patch 2; the
slab-out-of-bounds read is gone after this change.

Fixes: f0bbfc391aa7 ("crypto: qat - implement interface for live migration")
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
---
The patch 2 KUnit case drives the real parser on a kmalloc(4096) buffer
under KASAN on QEMU x86_64. Trigger {preh_len=4096, n_sects=1}: stock
tree reports

  BUG: KASAN: slab-out-of-bounds in qat_mstate_remote_run

reading sect->size 8 bytes past the allocation; patched it returns
-EINVAL and KASAN is silent. Two benign controls (empty preamble,
in-bounds section header) drive the same path with no OOB and pass on
both trees. No in-tree selftest exercises adf_mstate_mgr.c; patch 2 is
the coverage offered. KASAN build of the touched object is warning clean.

 .../crypto/intel/qat/qat_common/adf_mstate_mgr.c   | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/crypto/intel/qat/qat_common/adf_mstate_mgr.c b/drivers/crypto/intel/qat/qat_common/adf_mstate_mgr.c
index f9017e03ec0f2..7370b87f72a2f 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_mstate_mgr.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_mstate_mgr.c
@@ -231,8 +231,18 @@ static int adf_mstate_sect_validate(struct adf_mstate_mgr *mgr)
 
 	end = (uintptr_t)mgr->buf + mgr->size;
 	for (i = 0; i < mgr->n_sects; i++) {
-		uintptr_t s_start = (uintptr_t)sect->state;
-		uintptr_t s_end = s_start + sect->size;
+		uintptr_t s_start, s_end;
+
+		/* The section header must be in the buffer before it is read. */
+		if ((uintptr_t)sect < (uintptr_t)mgr->buf ||
+		    (uintptr_t)sect > end - sizeof(*sect)) {
+			pr_debug("QAT: LM - Section header out of bounds (index=%u) in state_mgr (size=%u, secs=%u)\n",
+				 i, mgr->size, mgr->n_sects);
+			return -EINVAL;
+		}
+
+		s_start = (uintptr_t)sect->state;
+		s_end = s_start + sect->size;
 
 		if (s_end < s_start || s_end > end) {
 			pr_debug("QAT: LM - Corrupted state section (index=%u, size=%u) in state_mgr (size=%u, secs=%u)\n",
-- 
2.53.0


^ permalink raw reply related

* [PATCH 0/2] crypto: qat - bound the live migration import parser
From: Michael Bommarito @ 2026-06-14 13:06 UTC (permalink / raw)
  To: Giovanni Cabiddu, Herbert Xu
  Cc: David S . Miller, Kees Cook, qat-linux, linux-crypto,
	linux-kernel

adf_mstate_mgr_init_from_remote() sets the section-walk cursor to
mgr->buf + preh_len from a remote-supplied preh_len, and the default
preamble checker only rejects preh_len > mgr->size. A remote preamble
with preh_len == mgr->size moves the cursor one region past the
allocation while n_sects is still honoured, so adf_mstate_sect_validate()
reads sect->size before the section header is proven in bounds. The
remote stream reaches this parser from the destination-host VFIO
migration path (qat_vf_resume_write), so a malformed import reads out of
bounds in the destination host kernel (fatal under KASAN / panic_on_warn).

Patch 1 rejects section headers not fully contained in the state buffer.
Patch 2 adds KUnit coverage and is offered separately so it can be taken
or dropped on its own. The parser was driven on QEMU x86_64 under KASAN
via the patch 2 suite (Level-2: buggy code unchanged, surrounding VFIO/PF
environment synthesized); the boundary trigger reports the out-of-bounds
read on the unfixed parser and is gone after patch 1, with two benign
controls passing on both trees.

Michael Bommarito (2):
  crypto: qat - validate migration section header is in bounds
  crypto: qat - add KUnit coverage for the migration import parser

 drivers/crypto/intel/qat/Kconfig              | 16 ++++
 .../intel/qat/qat_common/adf_mstate_mgr.c     | 18 ++++-
 .../qat/qat_common/adf_mstate_mgr_test.c      | 81 +++++++++++++++++++
 3 files changed, 113 insertions(+), 2 deletions(-)
 create mode 100644 drivers/crypto/intel/qat/qat_common/adf_mstate_mgr_test.c

-- 
2.53.0

^ permalink raw reply

* Re: [PATCH v2] lib/raid/xor: x86: Add AVX-512 optimized xor_gen()
From: David Laight @ 2026-06-14 10:16 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Andrew Morton, linux-kernel, Christoph Hellwig, linux-crypto, x86,
	linux-raid
In-Reply-To: <20260614010357.69416-1-ebiggers@kernel.org>

On Sat, 13 Jun 2026 18:03:57 -0700
Eric Biggers <ebiggers@kernel.org> wrote:

> Add an implementation of xor_gen() using AVX-512.
> 
> It uses 512-bit vectors, i.e. ZMM registers.  It also uses the
> vpternlogq instruction to do three-input XORs when applicable.
> 
> It's enabled on x86_64 CPUs that have AVX512F && !PREFER_YMM.  In
> practice that means:
> 
>     - AMD Zen 4 and later (client and server)

Doesn't zen4 only have a 256bit bus between the cpu and cache?
So avx512 reads take two clocks.
Since this is memory limited it is unlikely to run faster than the
avx256 version.
OTOH if it doesn't cause down-clocking as well then it won't be slower.

>     - Intel Sapphire Rapids and later (server)
>     - Intel Rocket Lake (client)
>     - Intel Nova Lake and later (client)
> 
> The !PREFER_YMM condition excludes the older AVX-512 implementations in
> Intel Skylake Server and Intel Ice Lake.  They could run this code, but
> they're known to have overly-eager downclocking when ZMM registers are
> used.  This is the same policy that the crypto and CRC code uses.
> 
> Benchmark on AMD Ryzen 9 9950X (Zen 5):
> 
>     src_cnt    avx          avx512       Improvement
>     =======    ==========   ==========   ===========
>     1          56353 MB/s   75388 MB/s   33%
>     2          54274 MB/s   68409 MB/s   26%
>     3          44649 MB/s   64042 MB/s   43%
>     4          41315 MB/s   55002 MB/s   33%
> 
> Note: for now I omitted the cpu_has_xfeatures() check that the AVX-512
> optimized crypto and CRC code does, since it's not implemented on
> User-Mode Linux and it's never been present in the RAID6 code either.
> 
> Signed-off-by: Eric Biggers <ebiggers@kernel.org>

Since I suggested it :-)

Reviewed-By: David Laight <david.laight.linux@gmail.com>

Some 'not very important' comments:

I did wonder whether moving the loop into the asm() would help.
gcc has a nasty habit of pessimising loops when you try to be clever.
It is certainly safer for tight loops like these.
That does have the side effect of making p0 be %1 which doesn't improve
readability. Either used named parameters or possibly just change p0 to p1 (etc)
so they match.

The code should be limited by the memory reads, so the 3-argument xor and
the interleave of the unroll may make no difference.

Some cpu do have constraints on the cache alignment in order to do two
reads per clock, but I've forgotten them and they got better before AVX-512.
If that were affecting this code (on the tested cpu) then I'd expect the
interleaved unroll would improve the _4 and -5 functions.
So it probably doesn't affect this code.

Using the same loop for the avx-256 and sse (and even smaller) functions could
well generate code that runs 'pretty much as fast as possible' on older cpu.
Intel cpu (going back to Sandy bridge) are likely to execute the loop in the
same number of clocks - but clearly copying half or a quarter of the data.
But I've no experience of zen1.

Might be worth doing for avx-256, does any care about anything older :-)

	David


> ---
> 
> Changed in v2:
>     - Fixed build on UML
>     - Reworked the implementation
> 
>  lib/raid/xor/Makefile         |   2 +-
>  lib/raid/xor/x86/xor-avx512.c | 121 ++++++++++++++++++++++++++++++++++
>  lib/raid/xor/x86/xor_arch.h   |  26 ++++----
>  3 files changed, 137 insertions(+), 12 deletions(-)
>  create mode 100644 lib/raid/xor/x86/xor-avx512.c
> 
> diff --git a/lib/raid/xor/Makefile b/lib/raid/xor/Makefile
> index 4d633dfd5b90..4af945861a51 100644
> --- a/lib/raid/xor/Makefile
> +++ b/lib/raid/xor/Makefile
> @@ -26,11 +26,11 @@ xor-$(CONFIG_ALTIVEC)		+= powerpc/xor_vmx.o powerpc/xor_vmx_glue.o
>  xor-$(CONFIG_RISCV_ISA_V)	+= riscv/xor.o riscv/xor-glue.o
>  xor-$(CONFIG_SPARC32)		+= sparc/xor-sparc32.o
>  xor-$(CONFIG_SPARC64)		+= sparc/xor-sparc64.o sparc/xor-sparc64-glue.o
>  xor-$(CONFIG_S390)		+= s390/xor.o
>  xor-$(CONFIG_X86_32)		+= x86/xor-avx.o x86/xor-sse.o x86/xor-mmx.o
> -xor-$(CONFIG_X86_64)		+= x86/xor-avx.o x86/xor-sse.o
> +xor-$(CONFIG_X86_64)		+= x86/xor-avx.o x86/xor-sse.o x86/xor-avx512.o
>  obj-y				+= tests/
>  
>  CFLAGS_arm/xor-neon.o		+= $(CC_FLAGS_FPU)
>  CFLAGS_REMOVE_arm/xor-neon.o	+= $(CC_FLAGS_NO_FPU)
>  
> diff --git a/lib/raid/xor/x86/xor-avx512.c b/lib/raid/xor/x86/xor-avx512.c
> new file mode 100644
> index 000000000000..87b981d74c90
> --- /dev/null
> +++ b/lib/raid/xor/x86/xor-avx512.c
> @@ -0,0 +1,121 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * AVX-512 optimized implementation of xor_gen()
> + *
> + * Copyright 2026 Google LLC
> + */
> +
> +#include <linux/types.h>
> +#include <asm/fpu/api.h>
> +#include "xor_impl.h"
> +#include "xor_arch.h"
> +
> +/*
> + * Implementation notes:
> + *
> + * Unrolling by the number of buffers (2-5) is very important.
> + *
> + * Unrolling by length is less important, especially when using register-indexed
> + * addressing with negative indices from the end of the buffers.  That approach
> + * results in just two loop control instructions being needed per iteration,
> + * regardless of the number of buffers.
> + *
> + * In fact, benchmarks showed that the 2 and 3 buffer cases require only 2x
> + * unrolling by length, while the 4 and 5 buffer cases don't require any
> + * unrolling by length.  Benchmarks also showed that the register-indexed
> + * addressing isn't a bottleneck either; i.e., we can't do any better by
> + * incrementing the pointers as we go along, even with more unrolling.
> + */
> +
> +static void xor_avx512_2(long bytes, u8 *p0, const u8 *p1)
> +{
> +	long i = -bytes;
> +
> +	asm volatile("1: vmovdqa64 (%0,%1), %%zmm0\n"
> +		     "vmovdqa64 64(%0,%1), %%zmm1\n"
> +		     "vpxorq (%0,%2), %%zmm0, %%zmm0\n"
> +		     "vpxorq 64(%0,%2), %%zmm1, %%zmm1\n"
> +		     "vmovdqa64 %%zmm0, (%0,%1)\n"
> +		     "vmovdqa64 %%zmm1, 64(%0,%1)\n"
> +		     "add $128, %0\n"
> +		     "jnz 1b\n"
> +		     : "+&r"(i)
> +		     : "r"(p0 + bytes), "r"(p1 + bytes)
> +		     : "memory", "cc");
> +}
> +
> +static void xor_avx512_3(long bytes, u8 *p0, const u8 *p1, const u8 *p2)
> +{
> +	long i = -bytes;
> +
> +	asm volatile("1: vmovdqa64 (%0,%1), %%zmm0\n"
> +		     "vmovdqa64 64(%0,%1), %%zmm1\n"
> +		     "vmovdqa64 (%0,%2), %%zmm2\n"
> +		     "vmovdqa64 64(%0,%2), %%zmm3\n"
> +		     "vpternlogq $0x96, (%0,%3), %%zmm2, %%zmm0\n"
> +		     "vpternlogq $0x96, 64(%0,%3), %%zmm3, %%zmm1\n"
> +		     "vmovdqa64 %%zmm0, (%0,%1)\n"
> +		     "vmovdqa64 %%zmm1, 64(%0,%1)\n"
> +		     "add $128, %0\n"
> +		     "jnz 1b\n"
> +		     : "+&r"(i)
> +		     : "r"(p0 + bytes), "r"(p1 + bytes), "r"(p2 + bytes)
> +		     : "memory", "cc");
> +}
> +
> +static void xor_avx512_4(long bytes, u8 *p0, const u8 *p1, const u8 *p2,
> +			 const u8 *p3)
> +{
> +	long i = -bytes;
> +
> +	asm volatile("1: vmovdqa64 (%0,%1), %%zmm0\n"
> +		     "vmovdqa64 (%0,%2), %%zmm1\n"
> +		     "vpxorq (%0,%3), %%zmm0, %%zmm0\n"
> +		     "vpternlogq $0x96, (%0,%4), %%zmm1, %%zmm0\n"
> +		     "vmovdqa64 %%zmm0, (%0,%1)\n"
> +		     "add $64, %0\n"
> +		     "jnz 1b\n"
> +		     : "+&r"(i)
> +		     : "r"(p0 + bytes), "r"(p1 + bytes), "r"(p2 + bytes),
> +		       "r"(p3 + bytes)
> +		     : "memory", "cc");
> +}
> +
> +static void xor_avx512_5(long bytes, u8 *p0, const u8 *p1, const u8 *p2,
> +			 const u8 *p3, const u8 *p4)
> +{
> +	long i = -bytes;
> +
> +	asm volatile("1: vmovdqa64 (%0,%1), %%zmm0\n"
> +		     "vmovdqa64 (%0,%2), %%zmm1\n"
> +		     "vpternlogq $0x96, (%0,%3), %%zmm1, %%zmm0\n"
> +		     "vmovdqa64 (%0,%4), %%zmm1\n"
> +		     "vpternlogq $0x96, (%0,%5), %%zmm1, %%zmm0\n"
> +		     "vmovdqa64 %%zmm0, (%0,%1)\n"
> +		     "add $64, %0\n"
> +		     "jnz 1b\n"
> +		     : "+&r"(i)
> +		     : "r"(p0 + bytes), "r"(p1 + bytes), "r"(p2 + bytes),
> +		       "r"(p3 + bytes), "r"(p4 + bytes)
> +		     : "memory", "cc");
> +}
> +
> +DO_XOR_BLOCKS(avx512_inner, xor_avx512_2, xor_avx512_3, xor_avx512_4,
> +	      xor_avx512_5);
> +
> +/*
> + * Preconditions: bytes is a nonzero multiple of 512, and all buffers are
> + * 64-byte aligned.
> + */
> +static void xor_gen_avx512(void *dest, void **srcs, unsigned int src_cnt,
> +			   unsigned int bytes)
> +{
> +	kernel_fpu_begin();
> +	xor_gen_avx512_inner(dest, srcs, src_cnt, bytes);
> +	kernel_fpu_end();
> +}
> +
> +struct xor_block_template xor_block_avx512 = {
> +	.name = "avx512",
> +	.xor_gen = xor_gen_avx512,
> +};
> diff --git a/lib/raid/xor/x86/xor_arch.h b/lib/raid/xor/x86/xor_arch.h
> index 99fe85a213c6..b5d49376fc97 100644
> --- a/lib/raid/xor/x86/xor_arch.h
> +++ b/lib/raid/xor/x86/xor_arch.h
> @@ -4,26 +4,30 @@
>  extern struct xor_block_template xor_block_pII_mmx;
>  extern struct xor_block_template xor_block_p5_mmx;
>  extern struct xor_block_template xor_block_sse;
>  extern struct xor_block_template xor_block_sse_pf64;
>  extern struct xor_block_template xor_block_avx;
> +extern struct xor_block_template xor_block_avx512;
>  
> -/*
> - * When SSE is available, use it as it can write around L2.  We may also be able
> - * to load into the L1 only depending on how the cpu deals with a load to a line
> - * that is being prefetched.
> - *
> - * When AVX2 is available, force using it as it is better by all measures.
> - *
> - * 32-bit without MMX can fall back to the generic routines.
> - */
>  static __always_inline void __init arch_xor_init(void)
>  {
> -	if (boot_cpu_has(X86_FEATURE_AVX) &&
> -	    boot_cpu_has(X86_FEATURE_OSXSAVE)) {
> +	if (IS_ENABLED(CONFIG_X86_64) && boot_cpu_has(X86_FEATURE_AVX512F) &&
> +	    boot_cpu_has(X86_FEATURE_OSXSAVE) &&
> +	    !boot_cpu_has(X86_FEATURE_PREFER_YMM)) {
> +		/* AVX-512 will be the best; no need to try others. */
> +		/* !PREFER_YMM excludes CPUs with overly-eager downclocking. */
> +		xor_force(&xor_block_avx512);
> +	} else if (boot_cpu_has(X86_FEATURE_AVX) &&
> +		   boot_cpu_has(X86_FEATURE_OSXSAVE)) {
> +		/* AVX will be the best; no need to try others. */
>  		xor_force(&xor_block_avx);
>  	} else if (IS_ENABLED(CONFIG_X86_64) || boot_cpu_has(X86_FEATURE_XMM)) {
> +		/*
> +		 * When SSE is available, use it as it can write around L2.  We
> +		 * may also be able to load into the L1 only depending on how
> +		 * the cpu deals with a load to a line that is being prefetched.
> +		 */
>  		xor_register(&xor_block_sse);
>  		xor_register(&xor_block_sse_pf64);
>  	} else if (boot_cpu_has(X86_FEATURE_MMX)) {
>  		xor_register(&xor_block_pII_mmx);
>  		xor_register(&xor_block_p5_mmx);
> 
> base-commit: 2b07ea76fd28989bde5993532d7a943a6f90e246


^ permalink raw reply

* Re: [RFC] ML-KEM (FIPS 203) implementation with reusable decapsulation pool
From: kstzavertaylo @ 2026-06-14  7:50 UTC (permalink / raw)
  To: Eric Biggers; +Cc: linux-crypto, herbert
In-Reply-To: <20260612183240.GA2157807@google.com>

Thank you for the detailed feedback and for outlining the historical
context regarding pools in the crypto subsystem.

I understand your point of view and the preference for keeping the
core implementation simple with per-operation allocations (or
caller-provided workspaces), especially given the lack of precedent
for pool-based designs in lib/crypto. My approach with the reusable
decapsulation pool was driven by a focus on constrained environments
where minimizing stack usage and relying on reusable preallocated
working memory during the hot path can be particularly valuable.
However, I fully agree that concrete data is needed to properly
evaluate the trade-offs.

I see your point regarding preallocated workspaces and caller-managed
caching. One of the goals of my prototype was to explore a design
where decapsulation operates on reusable preallocated contexts rather
than per-call working memory, primarily to reduce stack requirements
and move memory management into an initialization phase. I need to
analyze more carefully how much of this can already be achieved
through a caller-provided workspace model and whether the additional
complexity of a dedicated pool is actually justified.

I am currently working on benchmarks that compare stack consumption,
allocation behavior, memory footprint, and performance between the
different approaches. Once I have solid numbers, I will share the
results and my conclusions.

I also appreciate the clarification regarding KPP. My original
prototype used KPP because it appeared to be the closest existing
interface for key establishment, but I am not specifically attached to
that approach and will spend some time evaluating how the same ideas
could fit into the lib/crypto model as well. In the meantime, I will
also look into how the pre-allocated workspace support you suggested
could be integrated.

Best regards,
K. Zavertailo

On Fri, Jun 12, 2026 at 9:32 PM Eric Biggers <ebiggers@kernel.org> wrote:
>
> On Fri, Jun 12, 2026 at 05:14:54PM +0300, kstzavertaylo wrote:
> > Thank you for the detailed reply and for pointing me to the existing
> > ML-KEM/X-Wing patchset. I spent some time reviewing the implementation
> > to better understand the design choices and how they compare to the
> > approach I took in my own work.
> >
> > After reviewing the patchset, I can see several strengths in the
> > implementation. It integrates cleanly into the existing lib/crypto
> > infrastructure, reuses kernel cryptographic primitives, avoids large
> > stack allocations, and includes KUnit-based validation. The
> > implementation also appears intentionally compact and well aligned
> > with existing kernel conventions.
> >
> > While reviewing the implementation, I noticed that decapsulation
> > allocates a temporary workspace for each operation. This is one of the
> > areas where my design diverged, which is what originally motivated the
> > reusable pool approach.
> >
> > My implementation was developed with a somewhat different goal in
> > mind. I experimented with a reusable decapsulation workspace model
> > where memory is allocated during key initialization and then reused
> > across subsequent decapsulation operations. The main motivation was
> > reducing allocation frequency and minimizing both stack usage and
> > repeated memory management during decapsulation.
> >
> > As a result, the implementation avoids allocations during
> > decapsulation entirely by reusing preallocated workspaces associated
> > with the key context. My original hypothesis was that moving memory
> > allocation to key initialization, thereby eliminating allocations from
> > the decapsulation path, could reduce allocation overhead during
> > repeated decapsulation operations and be beneficial in environments
> > where allocation activity is considered undesirable.
>
> In my ML-KEM code, all the decapsulation memory is consolidated into
> struct mlkem_decap_workspace.  It would be straightforward to support
> the caller providing a pre-allocated workspace.
>
> In the case of X-Wing, we could also support pre-expanding the
> decapsulation key.
>
> It just depends on what is actually going to be needed by the kernel
> feature(s) that are going to use this.  Which we don't really know yet.
>
> We do know that it hasn't been found to be useful for the crypto
> subsystem to provide pools for any other algorithm in the kernel, for a
> variety of reasons.  Usually callers can just allocate per-operation, or
> they have some sort of object (inode, block device, socket, etc.) that's
> a natural place for them to cache whatever they need anyway.  In the
> rare cases where some sort of pool is needed it's implemented in the
> caller, optimized for the particular use case.  So I think there's a
> good chance your pool idea is going off on the wrong track.
>
> > Another difference is the integration level. My prototype explored
> > direct integration through the KPP interface, whereas the patchset
> > focuses on providing a reusable cryptographic library component within
> > lib/crypto. These approaches address somewhat different layers of the
> > kernel crypto stack.
>
> We don't need crypto_kpp support, as it's much more complex and harder
> to use than the crypto library
> (https://docs.kernel.org/crypto/libcrypto.html).  Also it seems it's not
> really possible anyway, since crypto_kpp is an old design that works for
> Diffie-Hellman but not KEMs.
>
> - Eric

^ permalink raw reply

* Re: [PATCH] crypto: s5p-sss - correct CONFIG_CRYPTO_DEV_EXYNOS_RNG macro name in comment
From: Ethan Nelson-Moore @ 2026-06-14  1:36 UTC (permalink / raw)
  To: Eric Biggers
  Cc: linux-crypto, linux-samsung-soc, Krzysztof Kozlowski,
	Vladimir Zapolskiy, Herbert Xu, David S. Miller
In-Reply-To: <20260614005044.GA1808@sol>

Hi, Eric,

On Sat, Jun 13, 2026 at 5:52 PM Eric Biggers <ebiggers@kernel.org> wrote:
> CONFIG_CRYPTO_DEV_EXYNOS_RNG was already removed by
> https://lore.kernel.org/linux-crypto/20260531175932.32171-1-ebiggers@kernel.org/

Thanks for letting me know.

> I didn't want to touch this comment which is nonsense anyway.  But if
> you're going to try to update it, it should be updated to correctly
> explain that the driver is working around broken devicetree bindings.

Yes, that comment definitely needs rewriting - I had no idea that is
what it is referring to.

Ethan

^ permalink raw reply

* [PATCH] crypto: amcc: move ioremapping up
From: Rosen Penev @ 2026-06-14  1:29 UTC (permalink / raw)
  To: linux-crypto; +Cc: Herbert Xu, David S. Miller, open list

There's no need for devm_platform_ioremap_resource() to be so far down.
In fact, putting it up allows direct return instead of having to goto
some branch. Also, remove the error message as the function complains
loudly itself. No need to duplicate.

Signed-off-by: Rosen Penev <rosenp@gmail.com>
---
 drivers/crypto/amcc/crypto4xx_core.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/crypto/amcc/crypto4xx_core.c b/drivers/crypto/amcc/crypto4xx_core.c
index 001da785af07..0271b5e4d923 100644
--- a/drivers/crypto/amcc/crypto4xx_core.c
+++ b/drivers/crypto/amcc/crypto4xx_core.c
@@ -1251,6 +1251,10 @@ static int crypto4xx_probe(struct platform_device *ofdev)
 	if (!core_dev->dev)
 		return -ENOMEM;
 
+	core_dev->dev->ce_base = devm_platform_ioremap_resource(ofdev, 0);
+	if (IS_ERR(core_dev->dev->ce_base))
+		return PTR_ERR(core_dev->dev->ce_base);
+
 	/*
 	 * Older version of 460EX/GT have a hardware bug.
 	 * Hence they do not support H/W based security intr coalescing
@@ -1286,13 +1290,6 @@ static int crypto4xx_probe(struct platform_device *ofdev)
 	tasklet_init(&core_dev->tasklet, crypto4xx_bh_tasklet_cb,
 		     (unsigned long) dev);
 
-	core_dev->dev->ce_base = devm_platform_ioremap_resource(ofdev, 0);
-	if (IS_ERR(core_dev->dev->ce_base)) {
-		dev_err(&ofdev->dev, "failed to ioremap resource");
-		rc = PTR_ERR(core_dev->dev->ce_base);
-		goto err_build_sdr;
-	}
-
 	/* Register for Crypto isr, Crypto Engine IRQ */
 	core_dev->irq = platform_get_irq(ofdev, 0);
 	if (core_dev->irq < 0) {
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH] crypto: amcc - embed pdr_uinfo as flexible array in crypto4xx_device
From: Rosen Penev @ 2026-06-14  1:27 UTC (permalink / raw)
  To: linux-crypto; +Cc: Herbert Xu, David S. Miller, open list
In-Reply-To: <20260613234559.20934-1-rosenp@gmail.com>

On Sat, Jun 13, 2026 at 4:46 PM Rosen Penev <rosenp@gmail.com> wrote:
>
> No need to allocate and free separately.
On further review, it makes more sense to change to a static array.
>
> This keeps crypto4xx_destroy_pdr dedicated to dma freeing only.
>
> Assisted-by: opencode:big-pickle
> Signed-off-by: Rosen Penev <rosenp@gmail.com>
> ---
>  drivers/crypto/amcc/crypto4xx_core.c | 12 +-----------
>  drivers/crypto/amcc/crypto4xx_core.h |  3 ++-
>  2 files changed, 3 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/crypto/amcc/crypto4xx_core.c b/drivers/crypto/amcc/crypto4xx_core.c
> index 001da785af07..ea1e40b3184b 100644
> --- a/drivers/crypto/amcc/crypto4xx_core.c
> +++ b/drivers/crypto/amcc/crypto4xx_core.c
> @@ -171,14 +171,6 @@ static u32 crypto4xx_build_pdr(struct crypto4xx_device *dev)
>         if (!dev->pdr)
>                 return -ENOMEM;
>
> -       dev->pdr_uinfo = kzalloc_objs(struct pd_uinfo, PPC4XX_NUM_PD);
> -       if (!dev->pdr_uinfo) {
> -               dma_free_coherent(dev->core_dev->device,
> -                                 sizeof(struct ce_pd) * PPC4XX_NUM_PD,
> -                                 dev->pdr,
> -                                 dev->pdr_pa);
> -               return -ENOMEM;
> -       }
>         dev->shadow_sa_pool = dma_alloc_coherent(dev->core_dev->device,
>                                    sizeof(union shadow_sa_buf) * PPC4XX_NUM_PD,
>                                    &dev->shadow_sa_pool_pa,
> @@ -226,8 +218,6 @@ static void crypto4xx_destroy_pdr(struct crypto4xx_device *dev)
>                 dma_free_coherent(dev->core_dev->device,
>                         sizeof(struct sa_state_record) * PPC4XX_NUM_PD,
>                         dev->shadow_sr_pool, dev->shadow_sr_pool_pa);
> -
> -       kfree(dev->pdr_uinfo);
>  }
>
>  static u32 crypto4xx_get_pd_from_pdr_nolock(struct crypto4xx_device *dev)
> @@ -1247,7 +1237,7 @@ static int crypto4xx_probe(struct platform_device *ofdev)
>         dev_set_drvdata(dev, core_dev);
>         core_dev->ofdev = ofdev;
>         core_dev->dev = devm_kzalloc(
> -               &ofdev->dev, sizeof(struct crypto4xx_device), GFP_KERNEL);
> +               &ofdev->dev, struct_size(core_dev->dev, pdr_uinfo, PPC4XX_NUM_PD), GFP_KERNEL);
>         if (!core_dev->dev)
>                 return -ENOMEM;
>
> diff --git a/drivers/crypto/amcc/crypto4xx_core.h b/drivers/crypto/amcc/crypto4xx_core.h
> index 66a95733c86d..bd4a286514a4 100644
> --- a/drivers/crypto/amcc/crypto4xx_core.h
> +++ b/drivers/crypto/amcc/crypto4xx_core.h
> @@ -93,11 +93,12 @@ struct crypto4xx_device {
>         u32 gdr_head;
>         u32 sdr_tail;
>         u32 sdr_head;
> -       struct pd_uinfo *pdr_uinfo;
>         struct list_head alg_list;      /* List of algorithm supported
>                                         by this device */
>         struct ratelimit_state aead_ratelimit;
>         bool is_revb;
> +
> +       struct pd_uinfo pdr_uinfo[];
>  };
>
>  struct crypto4xx_core_device {
> --
> 2.54.0
>

^ permalink raw reply

* [PATCH v2] lib/raid/xor: x86: Add AVX-512 optimized xor_gen()
From: Eric Biggers @ 2026-06-14  1:03 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: Christoph Hellwig, linux-crypto, x86, Eric Biggers, David Laight,
	linux-raid

Add an implementation of xor_gen() using AVX-512.

It uses 512-bit vectors, i.e. ZMM registers.  It also uses the
vpternlogq instruction to do three-input XORs when applicable.

It's enabled on x86_64 CPUs that have AVX512F && !PREFER_YMM.  In
practice that means:

    - AMD Zen 4 and later (client and server)
    - Intel Sapphire Rapids and later (server)
    - Intel Rocket Lake (client)
    - Intel Nova Lake and later (client)

The !PREFER_YMM condition excludes the older AVX-512 implementations in
Intel Skylake Server and Intel Ice Lake.  They could run this code, but
they're known to have overly-eager downclocking when ZMM registers are
used.  This is the same policy that the crypto and CRC code uses.

Benchmark on AMD Ryzen 9 9950X (Zen 5):

    src_cnt    avx          avx512       Improvement
    =======    ==========   ==========   ===========
    1          56353 MB/s   75388 MB/s   33%
    2          54274 MB/s   68409 MB/s   26%
    3          44649 MB/s   64042 MB/s   43%
    4          41315 MB/s   55002 MB/s   33%

Note: for now I omitted the cpu_has_xfeatures() check that the AVX-512
optimized crypto and CRC code does, since it's not implemented on
User-Mode Linux and it's never been present in the RAID6 code either.

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
---

Changed in v2:
    - Fixed build on UML
    - Reworked the implementation

 lib/raid/xor/Makefile         |   2 +-
 lib/raid/xor/x86/xor-avx512.c | 121 ++++++++++++++++++++++++++++++++++
 lib/raid/xor/x86/xor_arch.h   |  26 ++++----
 3 files changed, 137 insertions(+), 12 deletions(-)
 create mode 100644 lib/raid/xor/x86/xor-avx512.c

diff --git a/lib/raid/xor/Makefile b/lib/raid/xor/Makefile
index 4d633dfd5b90..4af945861a51 100644
--- a/lib/raid/xor/Makefile
+++ b/lib/raid/xor/Makefile
@@ -26,11 +26,11 @@ xor-$(CONFIG_ALTIVEC)		+= powerpc/xor_vmx.o powerpc/xor_vmx_glue.o
 xor-$(CONFIG_RISCV_ISA_V)	+= riscv/xor.o riscv/xor-glue.o
 xor-$(CONFIG_SPARC32)		+= sparc/xor-sparc32.o
 xor-$(CONFIG_SPARC64)		+= sparc/xor-sparc64.o sparc/xor-sparc64-glue.o
 xor-$(CONFIG_S390)		+= s390/xor.o
 xor-$(CONFIG_X86_32)		+= x86/xor-avx.o x86/xor-sse.o x86/xor-mmx.o
-xor-$(CONFIG_X86_64)		+= x86/xor-avx.o x86/xor-sse.o
+xor-$(CONFIG_X86_64)		+= x86/xor-avx.o x86/xor-sse.o x86/xor-avx512.o
 obj-y				+= tests/
 
 CFLAGS_arm/xor-neon.o		+= $(CC_FLAGS_FPU)
 CFLAGS_REMOVE_arm/xor-neon.o	+= $(CC_FLAGS_NO_FPU)
 
diff --git a/lib/raid/xor/x86/xor-avx512.c b/lib/raid/xor/x86/xor-avx512.c
new file mode 100644
index 000000000000..87b981d74c90
--- /dev/null
+++ b/lib/raid/xor/x86/xor-avx512.c
@@ -0,0 +1,121 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * AVX-512 optimized implementation of xor_gen()
+ *
+ * Copyright 2026 Google LLC
+ */
+
+#include <linux/types.h>
+#include <asm/fpu/api.h>
+#include "xor_impl.h"
+#include "xor_arch.h"
+
+/*
+ * Implementation notes:
+ *
+ * Unrolling by the number of buffers (2-5) is very important.
+ *
+ * Unrolling by length is less important, especially when using register-indexed
+ * addressing with negative indices from the end of the buffers.  That approach
+ * results in just two loop control instructions being needed per iteration,
+ * regardless of the number of buffers.
+ *
+ * In fact, benchmarks showed that the 2 and 3 buffer cases require only 2x
+ * unrolling by length, while the 4 and 5 buffer cases don't require any
+ * unrolling by length.  Benchmarks also showed that the register-indexed
+ * addressing isn't a bottleneck either; i.e., we can't do any better by
+ * incrementing the pointers as we go along, even with more unrolling.
+ */
+
+static void xor_avx512_2(long bytes, u8 *p0, const u8 *p1)
+{
+	long i = -bytes;
+
+	asm volatile("1: vmovdqa64 (%0,%1), %%zmm0\n"
+		     "vmovdqa64 64(%0,%1), %%zmm1\n"
+		     "vpxorq (%0,%2), %%zmm0, %%zmm0\n"
+		     "vpxorq 64(%0,%2), %%zmm1, %%zmm1\n"
+		     "vmovdqa64 %%zmm0, (%0,%1)\n"
+		     "vmovdqa64 %%zmm1, 64(%0,%1)\n"
+		     "add $128, %0\n"
+		     "jnz 1b\n"
+		     : "+&r"(i)
+		     : "r"(p0 + bytes), "r"(p1 + bytes)
+		     : "memory", "cc");
+}
+
+static void xor_avx512_3(long bytes, u8 *p0, const u8 *p1, const u8 *p2)
+{
+	long i = -bytes;
+
+	asm volatile("1: vmovdqa64 (%0,%1), %%zmm0\n"
+		     "vmovdqa64 64(%0,%1), %%zmm1\n"
+		     "vmovdqa64 (%0,%2), %%zmm2\n"
+		     "vmovdqa64 64(%0,%2), %%zmm3\n"
+		     "vpternlogq $0x96, (%0,%3), %%zmm2, %%zmm0\n"
+		     "vpternlogq $0x96, 64(%0,%3), %%zmm3, %%zmm1\n"
+		     "vmovdqa64 %%zmm0, (%0,%1)\n"
+		     "vmovdqa64 %%zmm1, 64(%0,%1)\n"
+		     "add $128, %0\n"
+		     "jnz 1b\n"
+		     : "+&r"(i)
+		     : "r"(p0 + bytes), "r"(p1 + bytes), "r"(p2 + bytes)
+		     : "memory", "cc");
+}
+
+static void xor_avx512_4(long bytes, u8 *p0, const u8 *p1, const u8 *p2,
+			 const u8 *p3)
+{
+	long i = -bytes;
+
+	asm volatile("1: vmovdqa64 (%0,%1), %%zmm0\n"
+		     "vmovdqa64 (%0,%2), %%zmm1\n"
+		     "vpxorq (%0,%3), %%zmm0, %%zmm0\n"
+		     "vpternlogq $0x96, (%0,%4), %%zmm1, %%zmm0\n"
+		     "vmovdqa64 %%zmm0, (%0,%1)\n"
+		     "add $64, %0\n"
+		     "jnz 1b\n"
+		     : "+&r"(i)
+		     : "r"(p0 + bytes), "r"(p1 + bytes), "r"(p2 + bytes),
+		       "r"(p3 + bytes)
+		     : "memory", "cc");
+}
+
+static void xor_avx512_5(long bytes, u8 *p0, const u8 *p1, const u8 *p2,
+			 const u8 *p3, const u8 *p4)
+{
+	long i = -bytes;
+
+	asm volatile("1: vmovdqa64 (%0,%1), %%zmm0\n"
+		     "vmovdqa64 (%0,%2), %%zmm1\n"
+		     "vpternlogq $0x96, (%0,%3), %%zmm1, %%zmm0\n"
+		     "vmovdqa64 (%0,%4), %%zmm1\n"
+		     "vpternlogq $0x96, (%0,%5), %%zmm1, %%zmm0\n"
+		     "vmovdqa64 %%zmm0, (%0,%1)\n"
+		     "add $64, %0\n"
+		     "jnz 1b\n"
+		     : "+&r"(i)
+		     : "r"(p0 + bytes), "r"(p1 + bytes), "r"(p2 + bytes),
+		       "r"(p3 + bytes), "r"(p4 + bytes)
+		     : "memory", "cc");
+}
+
+DO_XOR_BLOCKS(avx512_inner, xor_avx512_2, xor_avx512_3, xor_avx512_4,
+	      xor_avx512_5);
+
+/*
+ * Preconditions: bytes is a nonzero multiple of 512, and all buffers are
+ * 64-byte aligned.
+ */
+static void xor_gen_avx512(void *dest, void **srcs, unsigned int src_cnt,
+			   unsigned int bytes)
+{
+	kernel_fpu_begin();
+	xor_gen_avx512_inner(dest, srcs, src_cnt, bytes);
+	kernel_fpu_end();
+}
+
+struct xor_block_template xor_block_avx512 = {
+	.name = "avx512",
+	.xor_gen = xor_gen_avx512,
+};
diff --git a/lib/raid/xor/x86/xor_arch.h b/lib/raid/xor/x86/xor_arch.h
index 99fe85a213c6..b5d49376fc97 100644
--- a/lib/raid/xor/x86/xor_arch.h
+++ b/lib/raid/xor/x86/xor_arch.h
@@ -4,26 +4,30 @@
 extern struct xor_block_template xor_block_pII_mmx;
 extern struct xor_block_template xor_block_p5_mmx;
 extern struct xor_block_template xor_block_sse;
 extern struct xor_block_template xor_block_sse_pf64;
 extern struct xor_block_template xor_block_avx;
+extern struct xor_block_template xor_block_avx512;
 
-/*
- * When SSE is available, use it as it can write around L2.  We may also be able
- * to load into the L1 only depending on how the cpu deals with a load to a line
- * that is being prefetched.
- *
- * When AVX2 is available, force using it as it is better by all measures.
- *
- * 32-bit without MMX can fall back to the generic routines.
- */
 static __always_inline void __init arch_xor_init(void)
 {
-	if (boot_cpu_has(X86_FEATURE_AVX) &&
-	    boot_cpu_has(X86_FEATURE_OSXSAVE)) {
+	if (IS_ENABLED(CONFIG_X86_64) && boot_cpu_has(X86_FEATURE_AVX512F) &&
+	    boot_cpu_has(X86_FEATURE_OSXSAVE) &&
+	    !boot_cpu_has(X86_FEATURE_PREFER_YMM)) {
+		/* AVX-512 will be the best; no need to try others. */
+		/* !PREFER_YMM excludes CPUs with overly-eager downclocking. */
+		xor_force(&xor_block_avx512);
+	} else if (boot_cpu_has(X86_FEATURE_AVX) &&
+		   boot_cpu_has(X86_FEATURE_OSXSAVE)) {
+		/* AVX will be the best; no need to try others. */
 		xor_force(&xor_block_avx);
 	} else if (IS_ENABLED(CONFIG_X86_64) || boot_cpu_has(X86_FEATURE_XMM)) {
+		/*
+		 * When SSE is available, use it as it can write around L2.  We
+		 * may also be able to load into the L1 only depending on how
+		 * the cpu deals with a load to a line that is being prefetched.
+		 */
 		xor_register(&xor_block_sse);
 		xor_register(&xor_block_sse_pf64);
 	} else if (boot_cpu_has(X86_FEATURE_MMX)) {
 		xor_register(&xor_block_pII_mmx);
 		xor_register(&xor_block_p5_mmx);

base-commit: 2b07ea76fd28989bde5993532d7a943a6f90e246
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH] crypto: s5p-sss - correct CONFIG_CRYPTO_DEV_EXYNOS_RNG macro name in comment
From: Eric Biggers @ 2026-06-14  0:50 UTC (permalink / raw)
  To: Ethan Nelson-Moore
  Cc: linux-crypto, linux-samsung-soc, Krzysztof Kozlowski,
	Vladimir Zapolskiy, Herbert Xu, David S. Miller
In-Reply-To: <20260613223648.119694-1-enelsonmoore@gmail.com>

On Sat, Jun 13, 2026 at 03:36:47PM -0700, Ethan Nelson-Moore wrote:
> A comment in drivers/crypto/s5p-sss.c incorrectly refers to
> CONFIG_EXYNOS_RNG instead of CONFIG_CRYPTO_DEV_EXYNOS_RNG. Correct it.
> 
> Discovered while searching for CONFIG_* symbols referenced in code but
> not defined in any Kconfig file.
> 
> Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
> ---
>  drivers/crypto/s5p-sss.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/crypto/s5p-sss.c b/drivers/crypto/s5p-sss.c
> index bdda7b39af85..9bb1b1661174 100644
> --- a/drivers/crypto/s5p-sss.c
> +++ b/drivers/crypto/s5p-sss.c
> @@ -2151,8 +2151,8 @@ static int s5p_aes_probe(struct platform_device *pdev)
>  
>  	/*
>  	 * Note: HASH and PRNG uses the same registers in secss, avoid
> -	 * overwrite each other. This will drop HASH when CONFIG_EXYNOS_RNG
> -	 * is enabled in config. We need larger size for HASH registers in
> +	 * overwrite each other. This will drop HASH when CONFIG_CRYPTO_DEV_EXYNOS_RNG
> +	 * is enabled. We need larger size for HASH registers in
>  	 * secss, current describe only AES/DES
>  	 */
>  	if (IS_ENABLED(CONFIG_CRYPTO_DEV_EXYNOS_HASH)) {

CONFIG_CRYPTO_DEV_EXYNOS_RNG was already removed by
https://lore.kernel.org/linux-crypto/20260531175932.32171-1-ebiggers@kernel.org/

I didn't want to touch this comment which is nonsense anyway.  But if
you're going to try to update it, it should be updated to correctly
explain that the driver is working around broken devicetree bindings.

- Eric

^ permalink raw reply

* [PATCH] crypto: amcc - embed pdr_uinfo as flexible array in crypto4xx_device
From: Rosen Penev @ 2026-06-13 23:45 UTC (permalink / raw)
  To: linux-crypto; +Cc: Herbert Xu, David S. Miller, open list

No need to allocate and free separately.

This keeps crypto4xx_destroy_pdr dedicated to dma freeing only.

Assisted-by: opencode:big-pickle
Signed-off-by: Rosen Penev <rosenp@gmail.com>
---
 drivers/crypto/amcc/crypto4xx_core.c | 12 +-----------
 drivers/crypto/amcc/crypto4xx_core.h |  3 ++-
 2 files changed, 3 insertions(+), 12 deletions(-)

diff --git a/drivers/crypto/amcc/crypto4xx_core.c b/drivers/crypto/amcc/crypto4xx_core.c
index 001da785af07..ea1e40b3184b 100644
--- a/drivers/crypto/amcc/crypto4xx_core.c
+++ b/drivers/crypto/amcc/crypto4xx_core.c
@@ -171,14 +171,6 @@ static u32 crypto4xx_build_pdr(struct crypto4xx_device *dev)
 	if (!dev->pdr)
 		return -ENOMEM;
 
-	dev->pdr_uinfo = kzalloc_objs(struct pd_uinfo, PPC4XX_NUM_PD);
-	if (!dev->pdr_uinfo) {
-		dma_free_coherent(dev->core_dev->device,
-				  sizeof(struct ce_pd) * PPC4XX_NUM_PD,
-				  dev->pdr,
-				  dev->pdr_pa);
-		return -ENOMEM;
-	}
 	dev->shadow_sa_pool = dma_alloc_coherent(dev->core_dev->device,
 				   sizeof(union shadow_sa_buf) * PPC4XX_NUM_PD,
 				   &dev->shadow_sa_pool_pa,
@@ -226,8 +218,6 @@ static void crypto4xx_destroy_pdr(struct crypto4xx_device *dev)
 		dma_free_coherent(dev->core_dev->device,
 			sizeof(struct sa_state_record) * PPC4XX_NUM_PD,
 			dev->shadow_sr_pool, dev->shadow_sr_pool_pa);
-
-	kfree(dev->pdr_uinfo);
 }
 
 static u32 crypto4xx_get_pd_from_pdr_nolock(struct crypto4xx_device *dev)
@@ -1247,7 +1237,7 @@ static int crypto4xx_probe(struct platform_device *ofdev)
 	dev_set_drvdata(dev, core_dev);
 	core_dev->ofdev = ofdev;
 	core_dev->dev = devm_kzalloc(
-		&ofdev->dev, sizeof(struct crypto4xx_device), GFP_KERNEL);
+		&ofdev->dev, struct_size(core_dev->dev, pdr_uinfo, PPC4XX_NUM_PD), GFP_KERNEL);
 	if (!core_dev->dev)
 		return -ENOMEM;
 
diff --git a/drivers/crypto/amcc/crypto4xx_core.h b/drivers/crypto/amcc/crypto4xx_core.h
index 66a95733c86d..bd4a286514a4 100644
--- a/drivers/crypto/amcc/crypto4xx_core.h
+++ b/drivers/crypto/amcc/crypto4xx_core.h
@@ -93,11 +93,12 @@ struct crypto4xx_device {
 	u32 gdr_head;
 	u32 sdr_tail;
 	u32 sdr_head;
-	struct pd_uinfo *pdr_uinfo;
 	struct list_head alg_list;	/* List of algorithm supported
 					by this device */
 	struct ratelimit_state aead_ratelimit;
 	bool is_revb;
+
+	struct pd_uinfo pdr_uinfo[];
 };
 
 struct crypto4xx_core_device {
-- 
2.54.0


^ permalink raw reply related

* [PATCH] crypto: s5p-sss - correct CONFIG_CRYPTO_DEV_EXYNOS_RNG macro name in comment
From: Ethan Nelson-Moore @ 2026-06-13 22:36 UTC (permalink / raw)
  To: linux-crypto, linux-samsung-soc
  Cc: Ethan Nelson-Moore, Krzysztof Kozlowski, Vladimir Zapolskiy,
	Herbert Xu, David S. Miller

A comment in drivers/crypto/s5p-sss.c incorrectly refers to
CONFIG_EXYNOS_RNG instead of CONFIG_CRYPTO_DEV_EXYNOS_RNG. Correct it.

Discovered while searching for CONFIG_* symbols referenced in code but
not defined in any Kconfig file.

Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
---
 drivers/crypto/s5p-sss.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/crypto/s5p-sss.c b/drivers/crypto/s5p-sss.c
index bdda7b39af85..9bb1b1661174 100644
--- a/drivers/crypto/s5p-sss.c
+++ b/drivers/crypto/s5p-sss.c
@@ -2151,8 +2151,8 @@ static int s5p_aes_probe(struct platform_device *pdev)
 
 	/*
 	 * Note: HASH and PRNG uses the same registers in secss, avoid
-	 * overwrite each other. This will drop HASH when CONFIG_EXYNOS_RNG
-	 * is enabled in config. We need larger size for HASH registers in
+	 * overwrite each other. This will drop HASH when CONFIG_CRYPTO_DEV_EXYNOS_RNG
+	 * is enabled. We need larger size for HASH registers in
 	 * secss, current describe only AES/DES
 	 */
 	if (IS_ENABLED(CONFIG_CRYPTO_DEV_EXYNOS_HASH)) {
-- 
2.43.0


^ permalink raw reply related

* [PATCH v3 1/1] crypto: atmel-sha204a - fix heap info leak on I2C transfer failure
From: Lothar Rubusch @ 2026-06-13 20:20 UTC (permalink / raw)
  To: thorsten.blum, herbert, davem, nicolas.ferre, alexandre.belloni,
	claudiu.beznea, ardb, krzk+dt
  Cc: linux-crypto, linux-arm-kernel, linux-kernel, l.rubusch

The nonblocking RNG path allocates a work_data structure to track the
state of an in-flight asynchronous I2C request. This pointer is stored
in rng->priv and later consumed by the read path once the transaction
completes.

If the underlying I2C transfer fails, the completion callback is invoked
with a non-zero status. In this case, the allocated work_data is not
usable for producing RNG output and must not remain associated with the
hwrng state.

Previously, the failure path only logged a warning but left the pointer
state uncleared, which can result in subsequent read attempts observing
stale state and interpreting it as valid completion data.

Fix this by freeing the pending work_data. The I2C transaction reports
an error. This ensures that failed requests do not leave residual state
behind that could be interpreted as valid RNG data on later reads.
Clearing rng->priv is done at the subsequent call to nonblocking read.

Fixes: da001fb651b0 ("crypto: atmel-i2c - add support for SHA204A random number generator")
Signed-off-by: Lothar Rubusch <l.rubusch@gmail.com>
Assisted-by: Gemini:1.5 Pro [google]
Reviewed-by: Thorsten Blum <thorsten.blum@linux.dev>
---
v2 -> v3:
- remove existing error-path cleanup behavior [`rng->priv = 0;`],
  update commit msg
- rebased

v1 -> v2:
- reword commit message for clarity and precision
- keep existing error-path cleanup behavior unchanged, update commit msg

 drivers/crypto/atmel-sha204a.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/crypto/atmel-sha204a.c b/drivers/crypto/atmel-sha204a.c
index 4c9af737b33a..5eb76245347d 100644
--- a/drivers/crypto/atmel-sha204a.c
+++ b/drivers/crypto/atmel-sha204a.c
@@ -31,10 +31,14 @@ static void atmel_sha204a_rng_done(struct atmel_i2c_work_data *work_data,
 	struct atmel_i2c_client_priv *i2c_priv = work_data->ctx;
 	struct hwrng *rng = areq;

-	if (status)
+	if (status) {
 		dev_warn_ratelimited(&i2c_priv->client->dev,
 				     "i2c transaction failed (%d)\n",
 				     status);
+		kfree(work_data);
+		atomic_dec(&i2c_priv->tfm_count);
+		return;
+	}

 	rng->priv = (unsigned long)work_data;
 	atomic_dec(&i2c_priv->tfm_count);

base-commit: 6ea0ce3a19f9c37a014099e2b0a46b27fa164564
-- 
2.53.0

^ permalink raw reply related

* Re: [PATCH] crypto: ti - Use list_first_entry_or_null() in dthe_get_dev()
From: T Pratham @ 2026-06-13 19:30 UTC (permalink / raw)
  To: Mert Seftali, Herbert Xu
  Cc: David S . Miller, Dan Carpenter, linux-crypto, linux-kernel,
	kernel test robot
In-Reply-To: <20260613085858.32580-1-mertsftl@gmail.com>

On 13-06-2026 14:28, Mert Seftali wrote:
> dthe_get_dev() fetches a device from the global device list with
> list_first_entry() and then checks the result for NULL. However,
> list_first_entry() never returns NULL: on an empty list it returns a
> bogus pointer computed from the list head. The NULL check is therefore
> dead code, and an empty list would be treated as a valid entry and
> moved around as if it were a real device.
> 
> Use list_first_entry_or_null() so the existing NULL check works as
> intended and an empty list is handled gracefully.
> 
> Fixes: 52f641bc63a4 ("crypto: ti - Add driver for DTHE V2 AES Engine (ECB, CBC)")
> Reported-by: kernel test robot <lkp@intel.com>
> Reported-by: Dan Carpenter <error27@gmail.com>
> Closes: https://lore.kernel.org/r/202606111933.69GGTKxr-lkp@intel.com/
> Signed-off-by: Mert Seftali <mertsftl@gmail.com>
> ---
>  drivers/crypto/ti/dthev2-common.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/crypto/ti/dthev2-common.c b/drivers/crypto/ti/dthev2-common.c
> index a2ad79bec105..cc0244938267 100644
> --- a/drivers/crypto/ti/dthev2-common.c
> +++ b/drivers/crypto/ti/dthev2-common.c
> @@ -40,7 +40,7 @@ struct dthe_data *dthe_get_dev(struct dthe_tfm_ctx *ctx)
>  		return ctx->dev_data;
>  
>  	spin_lock_bh(&dthe_dev_list.lock);
> -	dev_data = list_first_entry(&dthe_dev_list.dev_list, struct dthe_data, list);
> +	dev_data = list_first_entry_or_null(&dthe_dev_list.dev_list, struct dthe_data, list);
>  	if (dev_data)
>  		list_move_tail(&dev_data->list, &dthe_dev_list.dev_list);
>  	spin_unlock_bh(&dthe_dev_list.lock);

LGTM.

Reviewed-by: T Pratham <t-pratham@ti.com>

-- 
Regards
T Pratham <t-pratham@ti.com>

^ permalink raw reply

* Re: [PATCH] crypto: atmel-ecc - drop unused curve id from atmel_ecdh_ctx
From: Thorsten Blum @ 2026-06-13 14:23 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Nicolas Ferre, Alexandre Belloni,
	Claudiu Beznea
  Cc: linux-crypto, linux-arm-kernel, linux-kernel
In-Reply-To: <20260611105159.460794-3-thorsten.blum@linux.dev>

On Thu, Jun 11, 2026 at 12:52:01PM +0200, Thorsten Blum wrote:
> ->curve_id is only set once, but never used - remove it.
> 
> Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
> ---
>  drivers/crypto/atmel-ecc.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/crypto/atmel-ecc.c b/drivers/crypto/atmel-ecc.c
> index 9da9dd6585df..93f219558c2f 100644
> --- a/drivers/crypto/atmel-ecc.c
> +++ b/drivers/crypto/atmel-ecc.c
> @@ -33,7 +33,6 @@ static struct atmel_ecc_driver_data driver_data;
>   * @public_key : generated when calling set_secret(). It's the responsibility
>   *               of the user to not call set_secret() while
>   *               generate_public_key() or compute_shared_secret() are in flight.
> - * @curve_id   : elliptic curve id
>   * @do_fallback: true when the device doesn't support the curve or when the user
>   *               wants to use its own private key.
>   */
> @@ -41,7 +40,6 @@ struct atmel_ecdh_ctx {
>  	struct i2c_client *client;
>  	struct crypto_kpp *fallback;
>  	const u8 *public_key;
> -	unsigned int curve_id;
>  	bool do_fallback;
>  };
>  
> @@ -250,7 +248,6 @@ static int atmel_ecdh_init_tfm(struct crypto_kpp *tfm)
>  	struct crypto_kpp *fallback;
>  	struct atmel_ecdh_ctx *ctx = kpp_tfm_ctx(tfm);
>  
> -	ctx->curve_id = ECC_CURVE_NIST_P256;
>  	ctx->client = atmel_ecc_i2c_client_alloc();
>  	if (IS_ERR(ctx->client)) {
>  		pr_err("tfm - i2c_client binding failed\n");

I'll need to rebase and resend this assuming [1] is applied first, as it
currently doesn't apply cleanly.

[1] https://lore.kernel.org/lkml/20260609100552.233494-3-thorsten.blum@linux.dev/

^ permalink raw reply

* Re: [PATCH] crypto: atmel-ecc - reject hardware ECDH without a public key
From: Thorsten Blum @ 2026-06-13 14:21 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, Nicolas Ferre, Alexandre Belloni,
	Claudiu Beznea, Tudor Ambarus
  Cc: linux-crypto, linux-arm-kernel, linux-kernel
In-Reply-To: <20260611213617.463552-2-thorsten.blum@linux.dev>

On Thu, Jun 11, 2026 at 11:36:17PM +0200, Thorsten Blum wrote:
> The hardware ECDH path in atmel_ecdh_compute_shared_secret() uses the
> private key stored in the device. However, the public key is cached only
> after atmel_ecdh_set_secret() successfully generated that private key
> for the current tfm.
> 
> atmel_ecdh_generate_public_key() already rejects requests when no public
> key is cached. Add the same check to atmel_ecdh_compute_shared_secret()
> to prevent the device from using a private key that was not generated
> for the current tfm.
> 
> Fixes: 11105693fa05 ("crypto: atmel-ecc - introduce Microchip / Atmel ECC driver")
> Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
> ---
>  drivers/crypto/atmel-ecc.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/crypto/atmel-ecc.c b/drivers/crypto/atmel-ecc.c
> index 93f219558c2f..542c8cc13a0f 100644
> --- a/drivers/crypto/atmel-ecc.c
> +++ b/drivers/crypto/atmel-ecc.c
> @@ -173,6 +173,9 @@ static int atmel_ecdh_compute_shared_secret(struct kpp_request *req)
>  		return crypto_kpp_compute_shared_secret(req);
>  	}
>  
> +	if (!ctx->public_key)
> +		return -EINVAL;
> +
>  	/* must have exactly two points to be on the curve */
>  	if (req->src_len != ATMEL_ECC_PUBKEY_SIZE)
>  		return -EINVAL;

I'll need to rebase and resend this assuming [1] is applied first, as it
currently doesn't apply cleanly.

[1] https://lore.kernel.org/lkml/20260609100552.233494-3-thorsten.blum@linux.dev/

^ permalink raw reply

* i.MX95: EdgeLock Enclave secure storage
From: Fabio Estevam @ 2026-06-13 13:58 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: Schrempf Frieder,
	moderated list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE,
	open list:HARDWARE RANDOM NUMBER GENERATOR CORE, Peng Fan,
	Stefano Babic, Frank Li

Hi Pankaj,

First of all, thank you for your work on upstreaming the
EdgeLock Enclave (ELE) support. It is great to finally see the
ELE framework landing upstream after a long development effort.

I am currently evaluating the state of i.MX95 secure-boot and
storage-security support based on current linux-next, with the
goal of understanding what can already be achieved using
upstream software and what pieces are still under development.

From my review, it appears that the following infrastructure is
already available upstream:

- ELE/V2X mailbox support for i.MX95.
- OCOTP/ELE nvmem support for fuse access.
- Secure-enclave bindings documenting the i.MX95 ELE HSM.

However, I could not find upstream support for several
capabilities that would be useful for secure storage
deployments on i.MX95, including:

- An ELE-backed trusted-key provider for the Linux trusted key
framework.
- Integration allowing Linux to use ELE as a key-sealing/
unsealing backend.
- i.MX95-specific crypto acceleration exposed through the Linux
crypto API for dm-crypt use cases.

Are you aware of any ongoing upstream or planned development
activities in these areas, particularly for i.MX95?

Any information about the upstream roadmap, ongoing
development, or expected direction for these features would be
greatly appreciated.

Thanks again for your work and for any insights you can share.

Regards,

Fabio Estevam

^ permalink raw reply

* Re: [PATCH RESEND v2 1/1] crypto: atmel-sha204a - fix heap info leak on I2C transfer failure
From: Herbert Xu @ 2026-06-13 12:28 UTC (permalink / raw)
  To: Lothar Rubusch
  Cc: thorsten.blum, davem, nicolas.ferre, alexandre.belloni,
	claudiu.beznea, ardb, krzk+dt, linux-crypto, linux-arm-kernel,
	linux-kernel
In-Reply-To: <CAFXKEHYcp-0+uCA47mtDe_+LUAZucEPbDJzoh5+e3Q3R20mN9Q@mail.gmail.com>

On Sat, Jun 13, 2026 at 10:52:25AM +0200, Lothar Rubusch wrote:
> On Thu, Jun 11, 2026 at 6:59 AM Herbert Xu <herbert@gondor.apana.org.au> wrote:
> >
> > On Tue, Jun 09, 2026 at 09:47:23AM +0000, Lothar Rubusch wrote:
> > >
> > > diff --git a/drivers/crypto/atmel-sha204a.c b/drivers/crypto/atmel-sha204a.c
> > > index 4c9af737b33a..20cd915ea8a3 100644
> > > --- a/drivers/crypto/atmel-sha204a.c
> > > +++ b/drivers/crypto/atmel-sha204a.c
> > > @@ -31,10 +31,15 @@ static void atmel_sha204a_rng_done(struct atmel_i2c_work_data *work_data,
> > >       struct atmel_i2c_client_priv *i2c_priv = work_data->ctx;
> > >       struct hwrng *rng = areq;
> > >
> > > -     if (status)
> > > +     if (status) {
> > >               dev_warn_ratelimited(&i2c_priv->client->dev,
> > >                                    "i2c transaction failed (%d)\n",
> > >                                    status);
> > > +             kfree(work_data);
> > > +             rng->priv = 0;
> >
> > Why is this necessary? It appears that rng_read_nonblocking already
> > zeroes rng->priv.
> >
> 
> IMHO this is not the same. The patch targets the error path. If the
> `status` in `atmel_sha204a_rng_done()` is failed, then failed `work_data` is
> still assigned and `rng->priv` is not zeroed at the moment. Only a
> subsequent call to `rng_read_nonblocking()` will set `rng->priv = 0;`

Right, the rng->priv gets set on the error path prior to your patch.
But with your patch, there is no need to clear rng->priv because it
never gets set on the error path.

All I'm asking for is to remove the rng->priv = 0 because it only
causes confusion.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH] crypto: atmel-ecc - remove stale comments in atmel_ecc_remove
From: Lothar Rubusch @ 2026-06-13 10:06 UTC (permalink / raw)
  To: linux-crypto, davem, nicolas.ferre, alexandre.belloni
  Cc: thorsten.blum, herbert, linux-arm-kernel, linux-kernel, l.rubusch
In-Reply-To: <aiqUBXIybgHXA6uj@linux.dev>

> From linux-crypto-vger  Thu Jun 11 10:55:01 2026
> From: Thorsten Blum <thorsten.blum () linux ! dev>
> Date: Thu, 11 Jun 2026 10:55:01 +0000
> To: linux-crypto-vger
> Subject: Re: [PATCH] crypto: atmel-ecc - remove stale comments in atmel_ecc_remove
> Message-Id: <aiqUBXIybgHXA6uj () linux ! dev>
> X-MARC-Message: https://marc.info/?l=linux-crypto-vger&m=178117527182807
> 
> On Thu, Jun 11, 2026 at 01:29:52PM +0800, Herbert Xu wrote:
> > On Tue, Jun 02, 2026 at 06:52:49PM +0200, Thorsten Blum wrote:
> > > atmel_ecc_remove() no longer returns -EBUSY since commit 7df7563b16aa
> > > ("crypto: atmel-ecc - Remove duplicated error reporting in .remove()")
> > > and is a void function since commit ed5c2f5fd10d ("i2c: Make remove
> > > callback return void").
> > > 
> > > Remove and update the outdated comments.
> > > 
> > > Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
> > > ---
> > >  drivers/crypto/atmel-ecc.c | 6 ++----
> > >  1 file changed, 2 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/drivers/crypto/atmel-ecc.c b/drivers/crypto/atmel-ecc.c
> > > index 9c380351d2f9..e6068dc0a0c1 100644
> > > --- a/drivers/crypto/atmel-ecc.c
> > > +++ b/drivers/crypto/atmel-ecc.c
> > > @@ -347,13 +347,11 @@ static void atmel_ecc_remove(struct i2c_client *client)
> > >  {
> > >  	struct atmel_i2c_client_priv *i2c_priv = i2c_get_clientdata(client);
> > >  
> > > -	/* Return EBUSY if i2c client already allocated. */
> > >  	if (atomic_read(&i2c_priv->tfm_count)) {
> > >  		/*
> > >  		 * After we return here, the memory backing the device is freed.
> > > -		 * That happens no matter what the return value of this function
> > > -		 * is because in the Linux device model there is no error
> > > -		 * handling for unbinding a driver.
> > > +		 * That happens because in the Linux device model there is no
> > > +		 * error handling for unbinding a driver.
> > >  		 * If there is still some action pending, it probably involves
> > >  		 * accessing the freed memory.
> > >  		 */
> > 
> > Please fix this properly rather than fiddling with the comments.
> > 
> > Drivers should always fail gracefully if the hardware disappears.
> 
> Yes, I'm working on a fix, but it's not ready yet.
> 

Hi guys, since this is going towards some work I already presented here and
still waiting on answer/request for comment from maintainer(s).
https://marc.info/?l=linux-kernel&m=178099821038957&w=2

The issue in the remove() arises when working with devres in combination with
asynch slow bus hardware, as we do here. AFAIK in the remove() are mainly two
options, either give a timeout to solve communication gracefully, then cut; or
wait indefinitely on the device to clear, in case forever.

When we cut off after timeout (first case) and still something arrives, it
would probably access freed memory resources. In the second case, simply
waiting on the device to resolve, might contain the risk of an infinite
waiting at driver removal. The other alternative would be to manage kmallocs
manually, i.e. to move away from devres (probably not what we want).
Currently, the driver just simply cuts off and has this problematic situation
very well spotted by the original author and commented.

Further, related to this situation in the remove() is using the global driver
data, which then might be overriden, and thus leak, when still around, and
this connects to dealing with synchronizing adding to the i2c_clientList and
algo registration, both happening in probe().

I tried to address all three issues. That's why the patch ended with such a
lengthy comment. The patch is reviewed by sashiko complaining only the above
dilemma.
https://sashiko.dev/#/patchset/20260609092927.47222-1-l.rubusch%40gmail.com

I hope I did not interfere too much with Thorstens fixes here. Since I assumed
you were active on rather different topics. Pls, let me know if so. I just
want to see this issue out of my way for the refac patch series.

Best,
L

^ permalink raw reply

* [PATCH] crypto: ti - Use list_first_entry_or_null() in dthe_get_dev()
From: Mert Seftali @ 2026-06-13  8:58 UTC (permalink / raw)
  To: T Pratham, Herbert Xu
  Cc: David S . Miller, Dan Carpenter, linux-crypto, linux-kernel,
	Mert Seftali, kernel test robot

dthe_get_dev() fetches a device from the global device list with
list_first_entry() and then checks the result for NULL. However,
list_first_entry() never returns NULL: on an empty list it returns a
bogus pointer computed from the list head. The NULL check is therefore
dead code, and an empty list would be treated as a valid entry and
moved around as if it were a real device.

Use list_first_entry_or_null() so the existing NULL check works as
intended and an empty list is handled gracefully.

Fixes: 52f641bc63a4 ("crypto: ti - Add driver for DTHE V2 AES Engine (ECB, CBC)")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Closes: https://lore.kernel.org/r/202606111933.69GGTKxr-lkp@intel.com/
Signed-off-by: Mert Seftali <mertsftl@gmail.com>
---
 drivers/crypto/ti/dthev2-common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/crypto/ti/dthev2-common.c b/drivers/crypto/ti/dthev2-common.c
index a2ad79bec105..cc0244938267 100644
--- a/drivers/crypto/ti/dthev2-common.c
+++ b/drivers/crypto/ti/dthev2-common.c
@@ -40,7 +40,7 @@ struct dthe_data *dthe_get_dev(struct dthe_tfm_ctx *ctx)
 		return ctx->dev_data;
 
 	spin_lock_bh(&dthe_dev_list.lock);
-	dev_data = list_first_entry(&dthe_dev_list.dev_list, struct dthe_data, list);
+	dev_data = list_first_entry_or_null(&dthe_dev_list.dev_list, struct dthe_data, list);
 	if (dev_data)
 		list_move_tail(&dev_data->list, &dthe_dev_list.dev_list);
 	spin_unlock_bh(&dthe_dev_list.lock);
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH RESEND v2 1/1] crypto: atmel-sha204a - fix heap info leak on I2C transfer failure
From: Lothar Rubusch @ 2026-06-13  8:52 UTC (permalink / raw)
  To: Herbert Xu
  Cc: thorsten.blum, davem, nicolas.ferre, alexandre.belloni,
	claudiu.beznea, ardb, krzk+dt, linux-crypto, linux-arm-kernel,
	linux-kernel
In-Reply-To: <aipAf_uZnX_gwZnl@gondor.apana.org.au>

On Thu, Jun 11, 2026 at 6:59 AM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Tue, Jun 09, 2026 at 09:47:23AM +0000, Lothar Rubusch wrote:
> >
> > diff --git a/drivers/crypto/atmel-sha204a.c b/drivers/crypto/atmel-sha204a.c
> > index 4c9af737b33a..20cd915ea8a3 100644
> > --- a/drivers/crypto/atmel-sha204a.c
> > +++ b/drivers/crypto/atmel-sha204a.c
> > @@ -31,10 +31,15 @@ static void atmel_sha204a_rng_done(struct atmel_i2c_work_data *work_data,
> >       struct atmel_i2c_client_priv *i2c_priv = work_data->ctx;
> >       struct hwrng *rng = areq;
> >
> > -     if (status)
> > +     if (status) {
> >               dev_warn_ratelimited(&i2c_priv->client->dev,
> >                                    "i2c transaction failed (%d)\n",
> >                                    status);
> > +             kfree(work_data);
> > +             rng->priv = 0;
>
> Why is this necessary? It appears that rng_read_nonblocking already
> zeroes rng->priv.
>

IMHO this is not the same. The patch targets the error path. If the
`status` in `atmel_sha204a_rng_done()` is failed, then failed `work_data` is
still assigned and `rng->priv` is not zeroed at the moment. Only a
subsequent call to `rng_read_nonblocking()` will set `rng->priv = 0;`

The call order is something like this:
1. atmel_sha204a_init // module setup
2. atmel_sha204a_rng_read_nonblocking // call 1
3. atmel_sha204a_rng_done             // if fail, still copies
work_data <-- patch clears here
...
4. atmel_sha204a_rng_read_nonblocking // call 2, clears rng->priv = 0

Originally this was a sashiko finding, when I move the RNG part into the
common driver. Reason: Actually all Atmel ECC and Atmel SHA204a devices
support the same RNG mech. Thus part of my refactoring is moving it to the
common core driver atmel_i2c. I was advised by the maintainer to use also
sashiko's feedback. So, I went on identifying sashiko issues and have a
look into it, if I can provide a fix for it. This is one of them.

Sashiko asked:
"If the I2C transaction fails here, we still assign the work_data to
rng->priv. Since kmalloc_obj() uses GFP_ATOMIC and does not zero memory,
does this risk leaking uninitialized slab memory or stale data from
previous reads when the next non-blocking read copies from
work_data->cmd.data?"

ref: https://sashiko.dev/#/patchset/20260512224349.64621-1-l.rubusch%40gmail.com
[search for `atmel_i2c_rng_done` on that link]

I'm not sure about the risk or the (real) severity sashiko mentiones
here. But it seems to
be correct, when atmel_sha204a_rng_done() fails in the status, it
continues assigning the
failed result in the work_data:

    static void atmel_sha204a_rng_done(struct atmel_i2c_work_data *work_data,
                       void *areq, int status)
    {
        struct atmel_i2c_client_priv *i2c_priv = work_data->ctx;
        struct hwrng *rng = areq;

        if (status)
            dev_warn_ratelimited(&i2c_priv->client->dev,
                         "i2c transaction failed (%d)\n",
                         status);

        rng->priv = (unsigned long)work_data;
        atomic_dec(&i2c_priv->tfm_count);
    }

Hence, my proposed patch will stop it passing work_data, if status is
failed. It will not
assign rng->priv anymore then containing old data, but clear it. It
will free the `work_data`
to provoke a new allocation happening in `atmel_sha204a_rng_read_nonblocking()`.

The patch is sashiko and maintainer reviewed and solves sashikos complaints.
ref: https://sashiko.dev/#/patchset/20260609094723.47237-1-l.rubusch%40gmail.com
Setting `rng->priv = 0;` is rather safety here.

Thank you for asking. Accept, drop or modification needed - please,
leave me a note,
I'd highly appreciate.

Best,
L

> Thanks,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH] lib/raid/xor: x86: Add AVX-512 optimized xor_gen()
From: David Laight @ 2026-06-13  8:48 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Christoph Hellwig, Andrew Morton, linux-kernel, linux-crypto, x86,
	Andrea Mazzoleni
In-Reply-To: <20260613002704.GA11922@sol>

On Fri, 12 Jun 2026 17:27:04 -0700
Eric Biggers <ebiggers@kernel.org> wrote:

> On Fri, Jun 12, 2026 at 10:04:32AM +0100, David Laight wrote:
> > On Fri, 12 Jun 2026 07:22:47 +0200
> > Christoph Hellwig <hch@lst.de> wrote:
> >   
> > > On Thu, Jun 11, 2026 at 09:40:34PM -0700, Eric Biggers wrote:  
> > > > Add an implementation of xor_gen() using AVX-512.    
> > >   
> > > > Benchmark on AMD Ryzen 9 9950X (Zen 5):    
> > > 
> > > Can you share the benchmark?
> > > 
> > > In my local tree I have ports of the AVX2 and AVX512 implementations
> > > from snapraid (https://github.com/amadvance/snapraid), which in userspace
> > > give really good performance.  On my Laptop with a AMD Ryzen AI 7 PRO 350
> > > (which is a Zen5 with the slower double pumped AVX512 unit), both of
> > > them get over 1GB/s throughput on the snapraid benchmarks.  I've been
> > > holding them back as I don't have a good kernel benchmarking harness,
> > > and it's missing the quirks for old AVX512 or the newer AMD special
> > > cases.  
> > 
> > From my experiments on Intel cpu (and I don't remember the zen-5 being
> > that different - but I've done less testing on it) you don't need to
> > unroll loops very much at all.
> > 
> > A reasonable model seems to be that the uops generated by the instruction
> > decoder get executed when all the prerequisite registers and the required
> > execution unit are available.
> > So for a memory copy (and the xor is basically a copy) the control loop
> > can run way ahead of the read/write instructions.
> > This means you can get the control loop 'for free' and unrolling further
> > makes no/little difference.
> > 
> > Each xor is two memory reads and one memory write.
> > The cpu I was using could only do one write/clock - so you can only do one
> > xor each clock. I think some of the newer ones can to two writes/clock but
> > I'm not sure how many reads/clock they can do - might still be 2, don't
> > think it s 4.
> > So you should be able to get one xor per clock, but I doubt you'll get two
> > (and possibly not even 1.3 - which would require 4 memory accesses per clock).
> > 
> > The best loop construct is the one that uses negative offsets from the
> > end of the buffers, basically:
> > 	buf += len;
> > 	offset = -len;
> > 	do
> > 		f(buf[offset]);
> > 	while (offset += size);
> > that reduces the loop control to just an 'add' and 'jnz' (which can
> > get merged into a single u-op).
> > 
> > The cpu have enough execution units to execute two memory reads,
> > a memory write, an xor the add and jnz every clock.
> > So even the 'rolled up' loop might run at one xor per clock.
> > While I think I got a 'one clock loop' on my zen-5 (testing
> > word-at-a-time strlen) I only managed a two clock loop on the newest
> > Intel cpu I've got (which isn't that new).
> > So put two xor in the loop and it shouldn't be limited by the loop
> > control, but will be limited by the memory accesses instead.
> > 
> > Further unrolling shouldn't help and may make things worse.
> > The Intel cpu have logic to directly forward the result of an
> > ALU instruction into the next few instructions, but after that you can
> > get a stall because of the 'round trip' via the register file.
> > So part way down an unrolled nn(%reg) sequence you can get a stall.
> > An extra 'add $0,%reg' in the middle of the unrolled loop will
> > 'refresh' the register and speed things up.
> > (I hit that with a loop that needed a rather more complicated control
> > structure.)
> > 
> > You definitely need to use the pmc clock counter and data dependencies
> > against the rdpmc instruction to get sensible performance figures.
> > The can reasonably reliably measure down to less than 20 clocks.  
> 
> The version at the end of this email is what you're suggesting, I think.

Looks about right (I wouldn't have found 'vpternlogq $0x96').
Should be read limited on both cpu.
I think zen5 can do two avx-512 reads and (maybe or) two avx-512 writes per clock.
(zen4 reads take two clocks so you might as well use avx-256.)
Sapphire raids might be the same, but I recall some cpu supports 3 reads/clock.

> On Sapphire Rapids and Ryzen 9 9950X it's about the same speed as mine,
> just a few percent slower on Sapphire with src_cnt == 1.

Is that 512 bytes?
The minimum block size for these in 128 bytes.
As you know a smaller block size is generally better if you need to support
arbitrary lengths.

> 
> So we could use it.  It's just a bit fragile since it assumes the loop
> overhead and indexed addressing will never be a bottleneck on any
> current or future CPU.  Unrolling by more gives something more robust
> that "just works", without having to analyze whether the loops are okay
> on each CPU model individually based on microarchitectural details.

Unrolling further is likely to be slower in 'real life' because of the
effects on the I-cache.
Not to mention D-cache effects - the exact cache alignment can matter.
It is also unlikely that future cpu will be significantly slower.

The more usual problem is that some older cpu (usually zen1) ends up
running the code significantly slower than other algorithms.
That might matter for the avx-128 version of this code.

I might try putting these functions through some user-space clock count
measuring code I've written.
You done the hard bit of getting the asm syntax right.

	David


> 
> // SPDX-License-Identifier: GPL-2.0-or-later
> /*
>  * AVX-512 optimized implementation of xor_gen()
>  *
>  * Copyright 2026 Google LLC
>  */
> 
> #include <linux/compiler.h>
> #include <linux/types.h>
> #include <linux/unroll.h>
> #include <asm/fpu/api.h>
> #include "xor_impl.h"
> #include "xor_arch.h"
> 
> static void xor_avx512_2(long bytes, u8 *p0, const u8 *p1)
> {
> 	long i = -bytes; /* Use negative indexing to minimize loop overhead. */
> 
> 	p0 += bytes;
> 	p1 += bytes;
> 	unrolled_none
> 	do {
> 		/* unroll by 2x to reduce loop overhead */
> 		asm volatile("vmovdqa64 (%2,%0), %%zmm0\n"
> 			     "vmovdqa64 64(%2,%0), %%zmm1\n"
> 			     "vpxorq (%2,%1), %%zmm0, %%zmm0\n"
> 			     "vpxorq 64(%2,%1), %%zmm1, %%zmm1\n"
> 			     "vmovdqa64 %%zmm0, (%2,%0)\n"
> 			     "vmovdqa64 %%zmm1, 64(%2,%0)\n"
> 			     :
> 			     : "r"(p0), "r"(p1), "r"(i)
> 			     : "memory");
> 	} while ((i += 128) != 0);
> }
> 
> static void xor_avx512_3(long bytes, u8 *p0, const u8 *p1, const u8 *p2)
> {
> 	long i = -bytes; /* Use negative indexing to minimize loop overhead. */
> 
> 	p0 += bytes;
> 	p1 += bytes;
> 	p2 += bytes;
> 	unrolled_none
> 	do {
> 		/* unroll by 2x to reduce loop overhead */
> 		asm volatile("vmovdqa64 (%3,%0), %%zmm0\n"
> 			     "vmovdqa64 64(%3,%0), %%zmm1\n"
> 			     "vmovdqa64 (%3,%1), %%zmm2\n"
> 			     "vmovdqa64 64(%3,%1), %%zmm3\n"
> 			     "vpternlogq $0x96, (%3,%2), %%zmm2, %%zmm0\n"
> 			     "vpternlogq $0x96, 64(%3,%2), %%zmm3, %%zmm1\n"
> 			     "vmovdqa64 %%zmm0, (%3,%0)\n"
> 			     "vmovdqa64 %%zmm1, 64(%3,%0)\n"
> 			     :
> 			     : "r"(p0), "r"(p1), "r"(p2), "r"(i)
> 			     : "memory");
> 	} while ((i += 128) != 0);
> }
> 
> static void xor_avx512_4(long bytes, u8 *p0, const u8 *p1, const u8 *p2,
> 			 const u8 *p3)
> {
> 	long i = -bytes; /* Use negative indexing to minimize loop overhead. */
> 
> 	p0 += bytes;
> 	p1 += bytes;
> 	p2 += bytes;
> 	p3 += bytes;
> 	unrolled_none
> 	do {
> 		asm volatile("vmovdqa64 (%4,%0), %%zmm0\n"
> 			     "vmovdqa64 (%4,%1), %%zmm1\n"
> 			     "vpxorq (%4,%2), %%zmm0, %%zmm0\n"
> 			     "vpternlogq $0x96, (%4,%3), %%zmm1, %%zmm0\n"
> 			     "vmovdqa64 %%zmm0, (%4,%0)\n"
> 			     :
> 			     : "r"(p0), "r"(p1), "r"(p2), "r"(p3), "r"(i)
> 			     : "memory");
> 	} while ((i += 64) != 0);
> }
> 
> static void xor_avx512_5(long bytes, u8 *p0, const u8 *p1, const u8 *p2,
> 			 const u8 *p3, const u8 *p4)
> {
> 	long i = -bytes; /* Use negative indexing to minimize loop overhead. */
> 
> 	p0 += bytes;
> 	p1 += bytes;
> 	p2 += bytes;
> 	p3 += bytes;
> 	p4 += bytes;
> 	unrolled_none
> 	do {
> 		asm volatile("vmovdqa64 (%5,%0), %%zmm0\n"
> 			     "vmovdqa64 (%5,%1), %%zmm1\n"
> 			     "vpternlogq $0x96, (%5,%2), %%zmm1, %%zmm0\n"
> 			     "vmovdqa64 (%5,%3), %%zmm1\n"
> 			     "vpternlogq $0x96, (%5,%4), %%zmm1, %%zmm0\n"
> 			     "vmovdqa64 %%zmm0, (%5,%0)\n"
> 			     :
> 			     : "r"(p0), "r"(p1), "r"(p2), "r"(p3), "r"(p4),
> 			       "r"(i)
> 			     : "memory");
> 	} while ((i += 64) != 0);
> }
> 
> DO_XOR_BLOCKS(avx512_inner, xor_avx512_2, xor_avx512_3, xor_avx512_4,
> 	      xor_avx512_5);
> 
> /*
>  * Preconditions: bytes is a nonzero multiple of 512, and all buffers are
>  * 64-byte aligned.
>  */
> static void xor_gen_avx512(void *dest, void **srcs, unsigned int src_cnt,
> 			   unsigned int bytes)
> {
> 	kernel_fpu_begin();
> 	xor_gen_avx512_inner(dest, srcs, src_cnt, bytes);
> 	kernel_fpu_end();
> }
> 
> struct xor_block_template xor_block_avx512 = {
> 	.name = "avx512",
> 	.xor_gen = xor_gen_avx512,
> };


^ permalink raw reply

* Re: [PATCH] lib/raid/xor: x86: Add AVX-512 optimized xor_gen()
From: Eric Biggers @ 2026-06-13  0:27 UTC (permalink / raw)
  To: David Laight
  Cc: Christoph Hellwig, Andrew Morton, linux-kernel, linux-crypto, x86,
	Andrea Mazzoleni
In-Reply-To: <20260612100432.1f1c8c7a@pumpkin>

On Fri, Jun 12, 2026 at 10:04:32AM +0100, David Laight wrote:
> On Fri, 12 Jun 2026 07:22:47 +0200
> Christoph Hellwig <hch@lst.de> wrote:
> 
> > On Thu, Jun 11, 2026 at 09:40:34PM -0700, Eric Biggers wrote:
> > > Add an implementation of xor_gen() using AVX-512.  
> > 
> > > Benchmark on AMD Ryzen 9 9950X (Zen 5):  
> > 
> > Can you share the benchmark?
> > 
> > In my local tree I have ports of the AVX2 and AVX512 implementations
> > from snapraid (https://github.com/amadvance/snapraid), which in userspace
> > give really good performance.  On my Laptop with a AMD Ryzen AI 7 PRO 350
> > (which is a Zen5 with the slower double pumped AVX512 unit), both of
> > them get over 1GB/s throughput on the snapraid benchmarks.  I've been
> > holding them back as I don't have a good kernel benchmarking harness,
> > and it's missing the quirks for old AVX512 or the newer AMD special
> > cases.
> 
> From my experiments on Intel cpu (and I don't remember the zen-5 being
> that different - but I've done less testing on it) you don't need to
> unroll loops very much at all.
> 
> A reasonable model seems to be that the uops generated by the instruction
> decoder get executed when all the prerequisite registers and the required
> execution unit are available.
> So for a memory copy (and the xor is basically a copy) the control loop
> can run way ahead of the read/write instructions.
> This means you can get the control loop 'for free' and unrolling further
> makes no/little difference.
> 
> Each xor is two memory reads and one memory write.
> The cpu I was using could only do one write/clock - so you can only do one
> xor each clock. I think some of the newer ones can to two writes/clock but
> I'm not sure how many reads/clock they can do - might still be 2, don't
> think it s 4.
> So you should be able to get one xor per clock, but I doubt you'll get two
> (and possibly not even 1.3 - which would require 4 memory accesses per clock).
> 
> The best loop construct is the one that uses negative offsets from the
> end of the buffers, basically:
> 	buf += len;
> 	offset = -len;
> 	do
> 		f(buf[offset]);
> 	while (offset += size);
> that reduces the loop control to just an 'add' and 'jnz' (which can
> get merged into a single u-op).
> 
> The cpu have enough execution units to execute two memory reads,
> a memory write, an xor the add and jnz every clock.
> So even the 'rolled up' loop might run at one xor per clock.
> While I think I got a 'one clock loop' on my zen-5 (testing
> word-at-a-time strlen) I only managed a two clock loop on the newest
> Intel cpu I've got (which isn't that new).
> So put two xor in the loop and it shouldn't be limited by the loop
> control, but will be limited by the memory accesses instead.
> 
> Further unrolling shouldn't help and may make things worse.
> The Intel cpu have logic to directly forward the result of an
> ALU instruction into the next few instructions, but after that you can
> get a stall because of the 'round trip' via the register file.
> So part way down an unrolled nn(%reg) sequence you can get a stall.
> An extra 'add $0,%reg' in the middle of the unrolled loop will
> 'refresh' the register and speed things up.
> (I hit that with a loop that needed a rather more complicated control
> structure.)
> 
> You definitely need to use the pmc clock counter and data dependencies
> against the rdpmc instruction to get sensible performance figures.
> The can reasonably reliably measure down to less than 20 clocks.

The version at the end of this email is what you're suggesting, I think.
On Sapphire Rapids and Ryzen 9 9950X it's about the same speed as mine,
just a few percent slower on Sapphire with src_cnt == 1.

So we could use it.  It's just a bit fragile since it assumes the loop
overhead and indexed addressing will never be a bottleneck on any
current or future CPU.  Unrolling by more gives something more robust
that "just works", without having to analyze whether the loops are okay
on each CPU model individually based on microarchitectural details.

// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * AVX-512 optimized implementation of xor_gen()
 *
 * Copyright 2026 Google LLC
 */

#include <linux/compiler.h>
#include <linux/types.h>
#include <linux/unroll.h>
#include <asm/fpu/api.h>
#include "xor_impl.h"
#include "xor_arch.h"

static void xor_avx512_2(long bytes, u8 *p0, const u8 *p1)
{
	long i = -bytes; /* Use negative indexing to minimize loop overhead. */

	p0 += bytes;
	p1 += bytes;
	unrolled_none
	do {
		/* unroll by 2x to reduce loop overhead */
		asm volatile("vmovdqa64 (%2,%0), %%zmm0\n"
			     "vmovdqa64 64(%2,%0), %%zmm1\n"
			     "vpxorq (%2,%1), %%zmm0, %%zmm0\n"
			     "vpxorq 64(%2,%1), %%zmm1, %%zmm1\n"
			     "vmovdqa64 %%zmm0, (%2,%0)\n"
			     "vmovdqa64 %%zmm1, 64(%2,%0)\n"
			     :
			     : "r"(p0), "r"(p1), "r"(i)
			     : "memory");
	} while ((i += 128) != 0);
}

static void xor_avx512_3(long bytes, u8 *p0, const u8 *p1, const u8 *p2)
{
	long i = -bytes; /* Use negative indexing to minimize loop overhead. */

	p0 += bytes;
	p1 += bytes;
	p2 += bytes;
	unrolled_none
	do {
		/* unroll by 2x to reduce loop overhead */
		asm volatile("vmovdqa64 (%3,%0), %%zmm0\n"
			     "vmovdqa64 64(%3,%0), %%zmm1\n"
			     "vmovdqa64 (%3,%1), %%zmm2\n"
			     "vmovdqa64 64(%3,%1), %%zmm3\n"
			     "vpternlogq $0x96, (%3,%2), %%zmm2, %%zmm0\n"
			     "vpternlogq $0x96, 64(%3,%2), %%zmm3, %%zmm1\n"
			     "vmovdqa64 %%zmm0, (%3,%0)\n"
			     "vmovdqa64 %%zmm1, 64(%3,%0)\n"
			     :
			     : "r"(p0), "r"(p1), "r"(p2), "r"(i)
			     : "memory");
	} while ((i += 128) != 0);
}

static void xor_avx512_4(long bytes, u8 *p0, const u8 *p1, const u8 *p2,
			 const u8 *p3)
{
	long i = -bytes; /* Use negative indexing to minimize loop overhead. */

	p0 += bytes;
	p1 += bytes;
	p2 += bytes;
	p3 += bytes;
	unrolled_none
	do {
		asm volatile("vmovdqa64 (%4,%0), %%zmm0\n"
			     "vmovdqa64 (%4,%1), %%zmm1\n"
			     "vpxorq (%4,%2), %%zmm0, %%zmm0\n"
			     "vpternlogq $0x96, (%4,%3), %%zmm1, %%zmm0\n"
			     "vmovdqa64 %%zmm0, (%4,%0)\n"
			     :
			     : "r"(p0), "r"(p1), "r"(p2), "r"(p3), "r"(i)
			     : "memory");
	} while ((i += 64) != 0);
}

static void xor_avx512_5(long bytes, u8 *p0, const u8 *p1, const u8 *p2,
			 const u8 *p3, const u8 *p4)
{
	long i = -bytes; /* Use negative indexing to minimize loop overhead. */

	p0 += bytes;
	p1 += bytes;
	p2 += bytes;
	p3 += bytes;
	p4 += bytes;
	unrolled_none
	do {
		asm volatile("vmovdqa64 (%5,%0), %%zmm0\n"
			     "vmovdqa64 (%5,%1), %%zmm1\n"
			     "vpternlogq $0x96, (%5,%2), %%zmm1, %%zmm0\n"
			     "vmovdqa64 (%5,%3), %%zmm1\n"
			     "vpternlogq $0x96, (%5,%4), %%zmm1, %%zmm0\n"
			     "vmovdqa64 %%zmm0, (%5,%0)\n"
			     :
			     : "r"(p0), "r"(p1), "r"(p2), "r"(p3), "r"(p4),
			       "r"(i)
			     : "memory");
	} while ((i += 64) != 0);
}

DO_XOR_BLOCKS(avx512_inner, xor_avx512_2, xor_avx512_3, xor_avx512_4,
	      xor_avx512_5);

/*
 * Preconditions: bytes is a nonzero multiple of 512, and all buffers are
 * 64-byte aligned.
 */
static void xor_gen_avx512(void *dest, void **srcs, unsigned int src_cnt,
			   unsigned int bytes)
{
	kernel_fpu_begin();
	xor_gen_avx512_inner(dest, srcs, src_cnt, bytes);
	kernel_fpu_end();
}

struct xor_block_template xor_block_avx512 = {
	.name = "avx512",
	.xor_gen = xor_gen_avx512,
};

^ permalink raw reply

* Re: [RFC] ML-KEM (FIPS 203) implementation with reusable decapsulation pool
From: Eric Biggers @ 2026-06-12 18:32 UTC (permalink / raw)
  To: kstzavertaylo; +Cc: linux-crypto, herbert
In-Reply-To: <CAMho2Rem-B908oaFQzTx8Mg895LuvPcfN9+ANoHW+XfGW+wB6A@mail.gmail.com>

On Fri, Jun 12, 2026 at 05:14:54PM +0300, kstzavertaylo wrote:
> Thank you for the detailed reply and for pointing me to the existing
> ML-KEM/X-Wing patchset. I spent some time reviewing the implementation
> to better understand the design choices and how they compare to the
> approach I took in my own work.
> 
> After reviewing the patchset, I can see several strengths in the
> implementation. It integrates cleanly into the existing lib/crypto
> infrastructure, reuses kernel cryptographic primitives, avoids large
> stack allocations, and includes KUnit-based validation. The
> implementation also appears intentionally compact and well aligned
> with existing kernel conventions.
> 
> While reviewing the implementation, I noticed that decapsulation
> allocates a temporary workspace for each operation. This is one of the
> areas where my design diverged, which is what originally motivated the
> reusable pool approach.
> 
> My implementation was developed with a somewhat different goal in
> mind. I experimented with a reusable decapsulation workspace model
> where memory is allocated during key initialization and then reused
> across subsequent decapsulation operations. The main motivation was
> reducing allocation frequency and minimizing both stack usage and
> repeated memory management during decapsulation.
>
> As a result, the implementation avoids allocations during
> decapsulation entirely by reusing preallocated workspaces associated
> with the key context. My original hypothesis was that moving memory
> allocation to key initialization, thereby eliminating allocations from
> the decapsulation path, could reduce allocation overhead during
> repeated decapsulation operations and be beneficial in environments
> where allocation activity is considered undesirable.

In my ML-KEM code, all the decapsulation memory is consolidated into
struct mlkem_decap_workspace.  It would be straightforward to support
the caller providing a pre-allocated workspace.

In the case of X-Wing, we could also support pre-expanding the
decapsulation key.

It just depends on what is actually going to be needed by the kernel
feature(s) that are going to use this.  Which we don't really know yet.

We do know that it hasn't been found to be useful for the crypto
subsystem to provide pools for any other algorithm in the kernel, for a
variety of reasons.  Usually callers can just allocate per-operation, or
they have some sort of object (inode, block device, socket, etc.) that's
a natural place for them to cache whatever they need anyway.  In the
rare cases where some sort of pool is needed it's implemented in the
caller, optimized for the particular use case.  So I think there's a
good chance your pool idea is going off on the wrong track.

> Another difference is the integration level. My prototype explored
> direct integration through the KPP interface, whereas the patchset
> focuses on providing a reusable cryptographic library component within
> lib/crypto. These approaches address somewhat different layers of the
> kernel crypto stack.

We don't need crypto_kpp support, as it's much more complex and harder
to use than the crypto library
(https://docs.kernel.org/crypto/libcrypto.html).  Also it seems it's not
really possible anyway, since crypto_kpp is an old design that works for
Diffie-Hellman but not KEMs.

- Eric

^ permalink raw reply

* Re: [PATCH] crypto: ccp: Fix SNP range list bounds check
From: Tycho Andersen @ 2026-06-12 15:18 UTC (permalink / raw)
  To: ZongYao.Chen
  Cc: Ashish Kalra, Tom Lendacky, John Allen, Herbert Xu,
	David S. Miller, Michael Roth, Jarkko Sakkinen,
	Borislav Petkov (AMD), Brijesh Singh, Tianjia Zhang, linux-crypto,
	linux-kernel, stable
In-Reply-To: <20260612092525.1203150-1-ZongYao.Chen@linux.alibaba.com>

On Fri, Jun 12, 2026 at 05:25:25PM +0800, ZongYao.Chen@linux.alibaba.com wrote:
> From: Zongyao Chen <ZongYao.Chen@linux.alibaba.com>
> 
> snp_filter_reserved_mem_regions() checks the range list size before
> adding a new entry. If the page-sized SNP_INIT_EX buffer is already
> full, the next matching resource can still write one entry past the end
> of the buffer.
> 
> Check that there is room for the next entry before appending it, and
> compute the next entry pointer only after the bounds check.

> Fixes: 1ca5614b84ee ("crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP")
> Cc: stable@vger.kernel.org
> Signed-off-by: Zongyao Chen <ZongYao.Chen@linux.alibaba.com>

I believe there is a version of this in the crypto tree already as
1b864b6cb213 ("crypto: ccp - Fix snp_filter_reserved_mem_regions()
off-by-one").

Thanks,

Tycho

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox