Linux cryptographic layer development
* Re: [PATCH RESEND v2 1/1] crypto: atmel-sha204a - fix heap info leak on I2C transfer failure
From: Herbert Xu @ 2026-06-13 12:28 UTC
  To: Lothar Rubusch
  Cc: thorsten.blum, davem, nicolas.ferre, alexandre.belloni,
	claudiu.beznea, ardb, krzk+dt, linux-crypto, linux-arm-kernel,
	linux-kernel
In-Reply-To: <CAFXKEHYcp-0+uCA47mtDe_+LUAZucEPbDJzoh5+e3Q3R20mN9Q@mail.gmail.com>

On Sat, Jun 13, 2026 at 10:52:25AM +0200, Lothar Rubusch wrote:
> On Thu, Jun 11, 2026 at 6:59 AM Herbert Xu <herbert@gondor.apana.org.au> wrote:
> >
> > On Tue, Jun 09, 2026 at 09:47:23AM +0000, Lothar Rubusch wrote:
> > >
> > > diff --git a/drivers/crypto/atmel-sha204a.c b/drivers/crypto/atmel-sha204a.c
> > > index 4c9af737b33a..20cd915ea8a3 100644
> > > --- a/drivers/crypto/atmel-sha204a.c
> > > +++ b/drivers/crypto/atmel-sha204a.c
> > > @@ -31,10 +31,15 @@ static void atmel_sha204a_rng_done(struct atmel_i2c_work_data *work_data,
> > >       struct atmel_i2c_client_priv *i2c_priv = work_data->ctx;
> > >       struct hwrng *rng = areq;
> > >
> > > -     if (status)
> > > +     if (status) {
> > >               dev_warn_ratelimited(&i2c_priv->client->dev,
> > >                                    "i2c transaction failed (%d)\n",
> > >                                    status);
> > > +             kfree(work_data);
> > > +             rng->priv = 0;
> >
> > Why is this necessary? It appears that rng_read_nonblocking already
> > zeroes rng->priv.
> >
> 
> IMHO this is not the same. The patch targets the error path. If `status`
> indicates a failure in `atmel_sha204a_rng_done()`, the failed `work_data` is
> still assigned and `rng->priv` is not zeroed at that point. Only a
> subsequent call to `rng_read_nonblocking()` will set `rng->priv = 0;`.

Right, the rng->priv gets set on the error path prior to your patch.
But with your patch, there is no need to clear rng->priv because it
never gets set on the error path.

All I'm asking for is to remove the rng->priv = 0 because it only
causes confusion.
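
To illustrate the point about the error path, here is a minimal userspace
sketch (not the driver code). It assumes the patched callback frees the buffer
and returns before the assignment, and that `priv` is already 0 when the
callback runs. `demo_rng`, `demo_work` and `demo_rng_done()` are hypothetical
stand-ins for `struct hwrng`, `struct atmel_i2c_work_data` and
`atmel_sha204a_rng_done()`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct atmel_i2c_work_data. */
struct demo_work {
	unsigned char data[32];
};

/* Hypothetical stand-in for struct hwrng plus the in-flight counter. */
struct demo_rng {
	uintptr_t priv;   /* the kernel field is an unsigned long */
	int inflight;
};

/* Completion callback: publish the buffer only when the transfer succeeded. */
void demo_rng_done(struct demo_rng *rng, struct demo_work *work, int status)
{
	assert(rng && work);

	if (status) {
		free(work);        /* models kfree(work_data) */
		rng->inflight--;
		return;            /* priv is never written on this path */
	}

	rng->priv = (uintptr_t)work;
	rng->inflight--;
}
```

Under these assumptions the error branch leaves `priv` at 0 without an
explicit store, so an additional `rng->priv = 0` there would be a no-op.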

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH] crypto: atmel-ecc - remove stale comments in atmel_ecc_remove
From: Lothar Rubusch @ 2026-06-13 10:06 UTC
  To: linux-crypto, davem, nicolas.ferre, alexandre.belloni
  Cc: thorsten.blum, herbert, linux-arm-kernel, linux-kernel, l.rubusch
In-Reply-To: <aiqUBXIybgHXA6uj@linux.dev>

> From linux-crypto-vger  Thu Jun 11 10:55:01 2026
> From: Thorsten Blum <thorsten.blum () linux ! dev>
> Date: Thu, 11 Jun 2026 10:55:01 +0000
> To: linux-crypto-vger
> Subject: Re: [PATCH] crypto: atmel-ecc - remove stale comments in atmel_ecc_remove
> Message-Id: <aiqUBXIybgHXA6uj () linux ! dev>
> X-MARC-Message: https://marc.info/?l=linux-crypto-vger&m=178117527182807
> 
> On Thu, Jun 11, 2026 at 01:29:52PM +0800, Herbert Xu wrote:
> > On Tue, Jun 02, 2026 at 06:52:49PM +0200, Thorsten Blum wrote:
> > > atmel_ecc_remove() no longer returns -EBUSY since commit 7df7563b16aa
> > > ("crypto: atmel-ecc - Remove duplicated error reporting in .remove()")
> > > and is a void function since commit ed5c2f5fd10d ("i2c: Make remove
> > > callback return void").
> > > 
> > > Remove and update the outdated comments.
> > > 
> > > Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
> > > ---
> > >  drivers/crypto/atmel-ecc.c | 6 ++----
> > >  1 file changed, 2 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/drivers/crypto/atmel-ecc.c b/drivers/crypto/atmel-ecc.c
> > > index 9c380351d2f9..e6068dc0a0c1 100644
> > > --- a/drivers/crypto/atmel-ecc.c
> > > +++ b/drivers/crypto/atmel-ecc.c
> > > @@ -347,13 +347,11 @@ static void atmel_ecc_remove(struct i2c_client *client)
> > >  {
> > >  	struct atmel_i2c_client_priv *i2c_priv = i2c_get_clientdata(client);
> > >  
> > > -	/* Return EBUSY if i2c client already allocated. */
> > >  	if (atomic_read(&i2c_priv->tfm_count)) {
> > >  		/*
> > >  		 * After we return here, the memory backing the device is freed.
> > > -		 * That happens no matter what the return value of this function
> > > -		 * is because in the Linux device model there is no error
> > > -		 * handling for unbinding a driver.
> > > +		 * That happens because in the Linux device model there is no
> > > +		 * error handling for unbinding a driver.
> > >  		 * If there is still some action pending, it probably involves
> > >  		 * accessing the freed memory.
> > >  		 */
> > 
> > Please fix this properly rather than fiddling with the comments.
> > 
> > Drivers should always fail gracefully if the hardware disappears.
> 
> Yes, I'm working on a fix, but it's not ready yet.
> 

Hi guys, this touches on work I already presented here, for which I am still
waiting on an answer/request for comments from the maintainer(s):
https://marc.info/?l=linux-kernel&m=178099821038957&w=2

The issue in remove() arises when devres is combined with asynchronous, slow
bus hardware, as we have here. AFAIK there are mainly two options in remove():
either allow a timeout for communication to finish gracefully and then cut
off, or wait indefinitely for the device to clear, possibly forever.

When we cut off after a timeout (first case) and something still arrives, it
would probably access freed memory. In the second case, simply waiting for the
device to resolve carries the risk of waiting forever at driver removal. The
other alternative would be to manage kmallocs manually, i.e. to move away from
devres (probably not what we want). Currently, the driver simply cuts off; the
original author spotted this problematic situation very well and commented on
it.
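
The first option (bounded wait, then cut off) can be sketched as a drain loop.
This is a userspace illustration only: `demo_drain()` and
`demo_complete_one()` are hypothetical names, the in-flight counter only
loosely mirrors `tfm_count`, and a real driver would sleep or wait on a
completion rather than call a step function.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical hook that lets pending work make progress between polls. */
typedef void (*demo_step_fn)(atomic_int *inflight);

/*
 * Poll the in-flight counter at most max_polls times.
 * Returns 0 once it reaches zero, -1 if the budget runs out first.
 */
int demo_drain(atomic_int *inflight, unsigned int max_polls, demo_step_fn step)
{
	unsigned int i;

	assert(inflight && step);

	for (i = 0; i < max_polls; i++) {
		if (atomic_load(inflight) == 0)
			return 0;
		step(inflight);
	}
	return atomic_load(inflight) == 0 ? 0 : -1;
}

/* Models one outstanding transfer finishing. */
void demo_complete_one(atomic_int *inflight)
{
	if (atomic_load(inflight) > 0)
		atomic_fetch_sub(inflight, 1);
}
```

A return value of -1 is exactly the problematic case: work is still pending
when remove() gives up, so anything it touches afterwards must not have been
freed yet.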

Further, related to this situation in remove() is the use of the global driver
data, which might then be overridden, and thus leak, while still around. This
connects to synchronizing the addition to the i2c_clientList with the algo
registration, both of which happen in probe().

I tried to address all three issues. That's why the patch ended up with such a
lengthy comment. The patch was reviewed by sashiko, which complained only about
the above dilemma.
https://sashiko.dev/#/patchset/20260609092927.47222-1-l.rubusch%40gmail.com

I hope I did not interfere too much with Thorsten's fixes here, since I assumed
you were active on rather different topics. Please let me know if so. I just
want to get this issue out of the way for the refactoring patch series.

Best,
L


* [PATCH] crypto: ti - Use list_first_entry_or_null() in dthe_get_dev()
From: Mert Seftali @ 2026-06-13  8:58 UTC
  To: T Pratham, Herbert Xu
  Cc: David S . Miller, Dan Carpenter, linux-crypto, linux-kernel,
	Mert Seftali, kernel test robot

dthe_get_dev() fetches a device from the global device list with
list_first_entry() and then checks the result for NULL. However,
list_first_entry() never returns NULL: on an empty list it returns a
bogus pointer computed from the list head. The NULL check is therefore
dead code, and an empty list would be treated as a valid entry and
moved around as if it were a real device.

Use list_first_entry_or_null() so the existing NULL check works as
intended and an empty list is handled gracefully.

Fixes: 52f641bc63a4 ("crypto: ti - Add driver for DTHE V2 AES Engine (ECB, CBC)")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Closes: https://lore.kernel.org/r/202606111933.69GGTKxr-lkp@intel.com/
Signed-off-by: Mert Seftali <mertsftl@gmail.com>
---
 drivers/crypto/ti/dthev2-common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/crypto/ti/dthev2-common.c b/drivers/crypto/ti/dthev2-common.c
index a2ad79bec105..cc0244938267 100644
--- a/drivers/crypto/ti/dthev2-common.c
+++ b/drivers/crypto/ti/dthev2-common.c
@@ -40,7 +40,7 @@ struct dthe_data *dthe_get_dev(struct dthe_tfm_ctx *ctx)
 		return ctx->dev_data;
 
 	spin_lock_bh(&dthe_dev_list.lock);
-	dev_data = list_first_entry(&dthe_dev_list.dev_list, struct dthe_data, list);
+	dev_data = list_first_entry_or_null(&dthe_dev_list.dev_list, struct dthe_data, list);
 	if (dev_data)
 		list_move_tail(&dev_data->list, &dthe_dev_list.dev_list);
 	spin_unlock_bh(&dthe_dev_list.lock);
-- 
2.54.0
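
The difference between the two helpers can be reproduced in userspace. The
sketch below re-implements simplified versions of the list macros (the
kernel's real definitions live in include/linux/list.h and differ in detail);
`demo_dev` is a hypothetical stand-in for `struct dthe_data`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Simplified userspace versions of the kernel macros. */
#define container_of(ptr, type, member) \
	((type *)((uintptr_t)(ptr) - offsetof(type, member)))
#define list_first_entry(head, type, member) \
	container_of((head)->next, type, member)
#define list_first_entry_or_null(head, type, member) \
	((head)->next != (head) ? list_first_entry(head, type, member) : NULL)

/* Hypothetical stand-in for struct dthe_data. */
struct demo_dev {
	int id;
	struct list_head list;
};

/* An empty list is a head that points at itself. */
void demo_list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

void demo_list_add_tail(struct list_head *entry, struct list_head *head)
{
	assert(entry && head);
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}
```

On an empty list, `list_first_entry()` yields the address of the head minus
the member offset, which is never NULL, so only the `_or_null` variant makes an
`if (dev_data)` check meaningful.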



* Re: [PATCH RESEND v2 1/1] crypto: atmel-sha204a - fix heap info leak on I2C transfer failure
From: Lothar Rubusch @ 2026-06-13  8:52 UTC
  To: Herbert Xu
  Cc: thorsten.blum, davem, nicolas.ferre, alexandre.belloni,
	claudiu.beznea, ardb, krzk+dt, linux-crypto, linux-arm-kernel,
	linux-kernel
In-Reply-To: <aipAf_uZnX_gwZnl@gondor.apana.org.au>

On Thu, Jun 11, 2026 at 6:59 AM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Tue, Jun 09, 2026 at 09:47:23AM +0000, Lothar Rubusch wrote:
> >
> > diff --git a/drivers/crypto/atmel-sha204a.c b/drivers/crypto/atmel-sha204a.c
> > index 4c9af737b33a..20cd915ea8a3 100644
> > --- a/drivers/crypto/atmel-sha204a.c
> > +++ b/drivers/crypto/atmel-sha204a.c
> > @@ -31,10 +31,15 @@ static void atmel_sha204a_rng_done(struct atmel_i2c_work_data *work_data,
> >       struct atmel_i2c_client_priv *i2c_priv = work_data->ctx;
> >       struct hwrng *rng = areq;
> >
> > -     if (status)
> > +     if (status) {
> >               dev_warn_ratelimited(&i2c_priv->client->dev,
> >                                    "i2c transaction failed (%d)\n",
> >                                    status);
> > +             kfree(work_data);
> > +             rng->priv = 0;
>
> Why is this necessary? It appears that rng_read_nonblocking already
> zeroes rng->priv.
>

IMHO this is not the same. The patch targets the error path. If `status`
indicates a failure in `atmel_sha204a_rng_done()`, the failed `work_data` is
still assigned and `rng->priv` is not zeroed at that point. Only a
subsequent call to `rng_read_nonblocking()` will set `rng->priv = 0;`.


The call order is something like this:
1. atmel_sha204a_init                 // module setup
2. atmel_sha204a_rng_read_nonblocking // call 1
3. atmel_sha204a_rng_done             // if fail, still copies work_data <-- patch clears here
...
4. atmel_sha204a_rng_read_nonblocking // call 2, clears rng->priv = 0
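
The sequence above can be modelled in userspace to show what the second read
hands back before the patch. This is a simplified sketch, not the driver:
`demo_hwrng`, `demo_work`, `demo_done_prepatch()` and
`demo_read_nonblocking()` are hypothetical stand-ins, and the 0xAA fill is an
assumption standing in for whatever stale bytes a non-zeroing allocator leaves
in the buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_DATA_LEN 32

struct demo_work {
	unsigned char data[DEMO_DATA_LEN];
};

struct demo_hwrng {
	uintptr_t priv;   /* the kernel field is an unsigned long */
};

/* Pre-patch completion: publishes the buffer even when status != 0. */
void demo_done_prepatch(struct demo_hwrng *rng, struct demo_work *work,
			int status)
{
	(void)status;     /* the failure is only logged in this model */
	rng->priv = (uintptr_t)work;
}

/*
 * Non-blocking read: copy out a previously published buffer if there is
 * one, otherwise allocate a fresh buffer and return 0 bytes. *next receives
 * the buffer that the next transfer would fill.
 */
size_t demo_read_nonblocking(struct demo_hwrng *rng, void *out, size_t max,
			     struct demo_work **next)
{
	struct demo_work *work;

	assert(rng && out && next);

	if (rng->priv) {
		work = (struct demo_work *)rng->priv;
		if (max > sizeof(work->data))
			max = sizeof(work->data);
		memcpy(out, work->data, max);
		rng->priv = 0;
	} else {
		work = malloc(sizeof(*work));
		if (!work) {
			*next = NULL;
			return 0;
		}
		/* Assumption: 0xAA models stale, never-zeroed heap contents. */
		memset(work->data, 0xAA, sizeof(work->data));
		max = 0;
	}

	*next = work;
	return max;
}
```

In this model, call 1 returns 0 bytes, the failed transfer writes nothing into
the buffer, and call 2 still returns 32 bytes of the 0xAA filler, which is the
behaviour sashiko flagged.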

Originally this was a sashiko finding from when I moved the RNG part into the
common driver. The reason for the move: all Atmel ECC and Atmel SHA204a devices
support the same RNG mechanism, so part of my refactoring moves it to the
common core driver atmel_i2c. The maintainer advised me to take sashiko's
feedback into account as well, so I went through the sashiko issues to see
whether I could provide fixes. This is one of them.

Sashiko asked:
"If the I2C transaction fails here, we still assign the work_data to
rng->priv. Since kmalloc_obj() uses GFP_ATOMIC and does not zero memory,
does this risk leaking uninitialized slab memory or stale data from
previous reads when the next non-blocking read copies from
work_data->cmd.data?"

ref: https://sashiko.dev/#/patchset/20260512224349.64621-1-l.rubusch%40gmail.com
[search for `atmel_i2c_rng_done` on that link]

I'm not sure about the risk or the (real) severity sashiko mentions here. But
it seems to be correct: when atmel_sha204a_rng_done() gets a failed status, it
still assigns the failed work_data to rng->priv:

    static void atmel_sha204a_rng_done(struct atmel_i2c_work_data *work_data,
                       void *areq, int status)
    {
        struct atmel_i2c_client_priv *i2c_priv = work_data->ctx;
        struct hwrng *rng = areq;

        if (status)
            dev_warn_ratelimited(&i2c_priv->client->dev,
                         "i2c transaction failed (%d)\n",
                         status);

        rng->priv = (unsigned long)work_data;
        atomic_dec(&i2c_priv->tfm_count);
    }

Hence, my proposed patch stops passing work_data on if status indicates a
failure. It no longer assigns rng->priv (which would then point to old data),
but clears it. It frees the `work_data` to provoke a new allocation in
`atmel_sha204a_rng_read_nonblocking()`.

The patch has been reviewed by sashiko and the maintainer, and resolves
sashiko's complaints.
ref: https://sashiko.dev/#/patchset/20260609094723.47237-1-l.rubusch%40gmail.com
Setting `rng->priv = 0;` is more of a safety measure here.

Thank you for asking. Whether it is accept, drop, or modification needed,
please leave me a note; I'd highly appreciate it.

Best,
L

> Thanks,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH] lib/raid/xor: x86: Add AVX-512 optimized xor_gen()
From: David Laight @ 2026-06-13  8:48 UTC
  To: Eric Biggers
  Cc: Christoph Hellwig, Andrew Morton, linux-kernel, linux-crypto, x86,
	Andrea Mazzoleni
In-Reply-To: <20260613002704.GA11922@sol>

On Fri, 12 Jun 2026 17:27:04 -0700
Eric Biggers <ebiggers@kernel.org> wrote:

> On Fri, Jun 12, 2026 at 10:04:32AM +0100, David Laight wrote:
> > On Fri, 12 Jun 2026 07:22:47 +0200
> > Christoph Hellwig <hch@lst.de> wrote:
> >   
> > > On Thu, Jun 11, 2026 at 09:40:34PM -0700, Eric Biggers wrote:  
> > > > Add an implementation of xor_gen() using AVX-512.    
> > >   
> > > > Benchmark on AMD Ryzen 9 9950X (Zen 5):    
> > > 
> > > Can you share the benchmark?
> > > 
> > > In my local tree I have ports of the AVX2 and AVX512 implementations
> > > from snapraid (https://github.com/amadvance/snapraid), which in userspace
> > > give really good performance.  On my Laptop with a AMD Ryzen AI 7 PRO 350
> > > (which is a Zen5 with the slower double pumped AVX512 unit), both of
> > > them get over 1GB/s throughput on the snapraid benchmarks.  I've been
> > > holding them back as I don't have a good kernel benchmarking harness,
> > > and it's missing the quirks for old AVX512 or the newer AMD special
> > > cases.  
> > 
> > From my experiments on Intel cpu (and I don't remember the zen-5 being
> > that different - but I've done less testing on it) you don't need to
> > unroll loops very much at all.
> > 
> > A reasonable model seems to be that the uops generated by the instruction
> > decoder get executed when all the prerequisite registers and the required
> > execution unit are available.
> > So for a memory copy (and the xor is basically a copy) the control loop
> > can run way ahead of the read/write instructions.
> > This means you can get the control loop 'for free' and unrolling further
> > makes no/little difference.
> > 
> > Each xor is two memory reads and one memory write.
> > The cpu I was using could only do one write/clock - so you can only do one
> > xor each clock. I think some of the newer ones can do two writes/clock but
> > I'm not sure how many reads/clock they can do - might still be 2, don't
> > think it's 4.
> > So you should be able to get one xor per clock, but I doubt you'll get two
> > (and possibly not even 1.3 - which would require 4 memory accesses per clock).
> > 
> > The best loop construct is the one that uses negative offsets from the
> > end of the buffers, basically:
> > 	buf += len;
> > 	offset = -len;
> > 	do
> > 		f(buf[offset]);
> > 	while (offset += size);
> > that reduces the loop control to just an 'add' and 'jnz' (which can
> > get merged into a single u-op).
> > 
> > The cpu have enough execution units to execute two memory reads,
> > a memory write, an xor the add and jnz every clock.
> > So even the 'rolled up' loop might run at one xor per clock.
> > While I think I got a 'one clock loop' on my zen-5 (testing
> > word-at-a-time strlen) I only managed a two clock loop on the newest
> > Intel cpu I've got (which isn't that new).
> > So put two xor in the loop and it shouldn't be limited by the loop
> > control, but will be limited by the memory accesses instead.
> > 
> > Further unrolling shouldn't help and may make things worse.
> > The Intel cpu have logic to directly forward the result of an
> > ALU instruction into the next few instructions, but after that you can
> > get a stall because of the 'round trip' via the register file.
> > So part way down an unrolled nn(%reg) sequence you can get a stall.
> > An extra 'add $0,%reg' in the middle of the unrolled loop will
> > 'refresh' the register and speed things up.
> > (I hit that with a loop that needed a rather more complicated control
> > structure.)
> > 
> > You definitely need to use the pmc clock counter and data dependencies
> > against the rdpmc instruction to get sensible performance figures.
> > They can reasonably reliably measure down to less than 20 clocks.
> 
> The version at the end of this email is what you're suggesting, I think.

Looks about right (I wouldn't have found 'vpternlogq $0x96').
Should be read limited on both cpu.
I think zen5 can do two avx-512 reads and (maybe or) two avx-512 writes per clock.
(zen4 reads take two clocks so you might as well use avx-256.)
Sapphire Rapids might be the same, but I recall some cpu supports 3 reads/clock.
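
For readers wondering where 0x96 comes from: the immediate of vpternlogq is an
8-entry truth table indexed by the three input bits, and a three-way XOR is
true exactly for the indices with an odd number of set bits (1, 2, 4, 7),
which gives 0b10010110 = 0x96. A small scalar sketch follows; the function
names are made up for illustration, it emulates a single 64-bit lane only, and
the operand-to-index-bit order is an arbitrary choice here (it does not matter
for XOR, which is symmetric).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Emulate one 64-bit lane of a three-input bitwise lookup: for every bit
 * position, the three input bits form an index 0..7 into imm8.
 */
uint64_t ternlog64(uint8_t imm8, uint64_t a, uint64_t b, uint64_t c)
{
	uint64_t result = 0;
	unsigned int bit;

	for (bit = 0; bit < 64; bit++) {
		unsigned int idx = (unsigned int)((((a >> bit) & 1) << 2) |
						  (((b >> bit) & 1) << 1) |
						  ((c >> bit) & 1));

		result |= (uint64_t)((imm8 >> idx) & 1) << bit;
	}
	return result;
}

/* Build the truth table for a ^ b ^ c by enumerating all eight inputs. */
uint8_t ternlog_imm8_xor3(void)
{
	uint8_t imm8 = 0;
	unsigned int idx;

	for (idx = 0; idx < 8; idx++) {
		unsigned int a = (idx >> 2) & 1;
		unsigned int b = (idx >> 1) & 1;
		unsigned int c = idx & 1;

		if (a ^ b ^ c)
			imm8 |= (uint8_t)(1u << idx);
	}
	return imm8;
}

/* Usage: the derived table is 0x96 and behaves like a plain XOR. */
void ternlog_demo(void)
{
	assert(ternlog_imm8_xor3() == 0x96);
	assert(ternlog64(0x96, 0xF0F0u, 0xCCCCu, 0xAAAAu) ==
	       (0xF0F0u ^ 0xCCCCu ^ 0xAAAAu));
}
```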

> On Sapphire Rapids and Ryzen 9 9950X it's about the same speed as mine,
> just a few percent slower on Sapphire with src_cnt == 1.

Is that 512 bytes?
The minimum block size for these is 128 bytes.
As you know a smaller block size is generally better if you need to support
arbitrary lengths.

> 
> So we could use it.  It's just a bit fragile since it assumes the loop
> overhead and indexed addressing will never be a bottleneck on any
> current or future CPU.  Unrolling by more gives something more robust
> that "just works", without having to analyze whether the loops are okay
> on each CPU model individually based on microarchitectural details.

Unrolling further is likely to be slower in 'real life' because of the
effects on the I-cache.
Not to mention D-cache effects - the exact cache alignment can matter.
It is also unlikely that future cpu will be significantly slower.

The more usual problem is that some older cpu (usually zen1) ends up
running the code significantly slower than other algorithms.
That might matter for the avx-128 version of this code.

I might try putting these functions through some user-space clock count
measuring code I've written.
You've done the hard bit of getting the asm syntax right.

	David


> 
> // SPDX-License-Identifier: GPL-2.0-or-later
> /*
>  * AVX-512 optimized implementation of xor_gen()
>  *
>  * Copyright 2026 Google LLC
>  */
> 
> #include <linux/compiler.h>
> #include <linux/types.h>
> #include <linux/unroll.h>
> #include <asm/fpu/api.h>
> #include "xor_impl.h"
> #include "xor_arch.h"
> 
> static void xor_avx512_2(long bytes, u8 *p0, const u8 *p1)
> {
> 	long i = -bytes; /* Use negative indexing to minimize loop overhead. */
> 
> 	p0 += bytes;
> 	p1 += bytes;
> 	unrolled_none
> 	do {
> 		/* unroll by 2x to reduce loop overhead */
> 		asm volatile("vmovdqa64 (%2,%0), %%zmm0\n"
> 			     "vmovdqa64 64(%2,%0), %%zmm1\n"
> 			     "vpxorq (%2,%1), %%zmm0, %%zmm0\n"
> 			     "vpxorq 64(%2,%1), %%zmm1, %%zmm1\n"
> 			     "vmovdqa64 %%zmm0, (%2,%0)\n"
> 			     "vmovdqa64 %%zmm1, 64(%2,%0)\n"
> 			     :
> 			     : "r"(p0), "r"(p1), "r"(i)
> 			     : "memory");
> 	} while ((i += 128) != 0);
> }
> 
> static void xor_avx512_3(long bytes, u8 *p0, const u8 *p1, const u8 *p2)
> {
> 	long i = -bytes; /* Use negative indexing to minimize loop overhead. */
> 
> 	p0 += bytes;
> 	p1 += bytes;
> 	p2 += bytes;
> 	unrolled_none
> 	do {
> 		/* unroll by 2x to reduce loop overhead */
> 		asm volatile("vmovdqa64 (%3,%0), %%zmm0\n"
> 			     "vmovdqa64 64(%3,%0), %%zmm1\n"
> 			     "vmovdqa64 (%3,%1), %%zmm2\n"
> 			     "vmovdqa64 64(%3,%1), %%zmm3\n"
> 			     "vpternlogq $0x96, (%3,%2), %%zmm2, %%zmm0\n"
> 			     "vpternlogq $0x96, 64(%3,%2), %%zmm3, %%zmm1\n"
> 			     "vmovdqa64 %%zmm0, (%3,%0)\n"
> 			     "vmovdqa64 %%zmm1, 64(%3,%0)\n"
> 			     :
> 			     : "r"(p0), "r"(p1), "r"(p2), "r"(i)
> 			     : "memory");
> 	} while ((i += 128) != 0);
> }
> 
> static void xor_avx512_4(long bytes, u8 *p0, const u8 *p1, const u8 *p2,
> 			 const u8 *p3)
> {
> 	long i = -bytes; /* Use negative indexing to minimize loop overhead. */
> 
> 	p0 += bytes;
> 	p1 += bytes;
> 	p2 += bytes;
> 	p3 += bytes;
> 	unrolled_none
> 	do {
> 		asm volatile("vmovdqa64 (%4,%0), %%zmm0\n"
> 			     "vmovdqa64 (%4,%1), %%zmm1\n"
> 			     "vpxorq (%4,%2), %%zmm0, %%zmm0\n"
> 			     "vpternlogq $0x96, (%4,%3), %%zmm1, %%zmm0\n"
> 			     "vmovdqa64 %%zmm0, (%4,%0)\n"
> 			     :
> 			     : "r"(p0), "r"(p1), "r"(p2), "r"(p3), "r"(i)
> 			     : "memory");
> 	} while ((i += 64) != 0);
> }
> 
> static void xor_avx512_5(long bytes, u8 *p0, const u8 *p1, const u8 *p2,
> 			 const u8 *p3, const u8 *p4)
> {
> 	long i = -bytes; /* Use negative indexing to minimize loop overhead. */
> 
> 	p0 += bytes;
> 	p1 += bytes;
> 	p2 += bytes;
> 	p3 += bytes;
> 	p4 += bytes;
> 	unrolled_none
> 	do {
> 		asm volatile("vmovdqa64 (%5,%0), %%zmm0\n"
> 			     "vmovdqa64 (%5,%1), %%zmm1\n"
> 			     "vpternlogq $0x96, (%5,%2), %%zmm1, %%zmm0\n"
> 			     "vmovdqa64 (%5,%3), %%zmm1\n"
> 			     "vpternlogq $0x96, (%5,%4), %%zmm1, %%zmm0\n"
> 			     "vmovdqa64 %%zmm0, (%5,%0)\n"
> 			     :
> 			     : "r"(p0), "r"(p1), "r"(p2), "r"(p3), "r"(p4),
> 			       "r"(i)
> 			     : "memory");
> 	} while ((i += 64) != 0);
> }
> 
> DO_XOR_BLOCKS(avx512_inner, xor_avx512_2, xor_avx512_3, xor_avx512_4,
> 	      xor_avx512_5);
> 
> /*
>  * Preconditions: bytes is a nonzero multiple of 512, and all buffers are
>  * 64-byte aligned.
>  */
> static void xor_gen_avx512(void *dest, void **srcs, unsigned int src_cnt,
> 			   unsigned int bytes)
> {
> 	kernel_fpu_begin();
> 	xor_gen_avx512_inner(dest, srcs, src_cnt, bytes);
> 	kernel_fpu_end();
> }
> 
> struct xor_block_template xor_block_avx512 = {
> 	.name = "avx512",
> 	.xor_gen = xor_gen_avx512,
> };



* Re: [PATCH] lib/raid/xor: x86: Add AVX-512 optimized xor_gen()
From: Eric Biggers @ 2026-06-13  0:27 UTC
  To: David Laight
  Cc: Christoph Hellwig, Andrew Morton, linux-kernel, linux-crypto, x86,
	Andrea Mazzoleni
In-Reply-To: <20260612100432.1f1c8c7a@pumpkin>

On Fri, Jun 12, 2026 at 10:04:32AM +0100, David Laight wrote:
> On Fri, 12 Jun 2026 07:22:47 +0200
> Christoph Hellwig <hch@lst.de> wrote:
> 
> > On Thu, Jun 11, 2026 at 09:40:34PM -0700, Eric Biggers wrote:
> > > Add an implementation of xor_gen() using AVX-512.  
> > 
> > > Benchmark on AMD Ryzen 9 9950X (Zen 5):  
> > 
> > Can you share the benchmark?
> > 
> > In my local tree I have ports of the AVX2 and AVX512 implementations
> > from snapraid (https://github.com/amadvance/snapraid), which in userspace
> > give really good performance.  On my Laptop with a AMD Ryzen AI 7 PRO 350
> > (which is a Zen5 with the slower double pumped AVX512 unit), both of
> > them get over 1GB/s throughput on the snapraid benchmarks.  I've been
> > holding them back as I don't have a good kernel benchmarking harness,
> > and it's missing the quirks for old AVX512 or the newer AMD special
> > cases.
> 
> From my experiments on Intel cpu (and I don't remember the zen-5 being
> that different - but I've done less testing on it) you don't need to
> unroll loops very much at all.
> 
> A reasonable model seems to be that the uops generated by the instruction
> decoder get executed when all the prerequisite registers and the required
> execution unit are available.
> So for a memory copy (and the xor is basically a copy) the control loop
> can run way ahead of the read/write instructions.
> This means you can get the control loop 'for free' and unrolling further
> makes no/little difference.
> 
> Each xor is two memory reads and one memory write.
> The cpu I was using could only do one write/clock - so you can only do one
> xor each clock. I think some of the newer ones can do two writes/clock but
> I'm not sure how many reads/clock they can do - might still be 2, don't
> think it's 4.
> So you should be able to get one xor per clock, but I doubt you'll get two
> (and possibly not even 1.3 - which would require 4 memory accesses per clock).
> 
> The best loop construct is the one that uses negative offsets from the
> end of the buffers, basically:
> 	buf += len;
> 	offset = -len;
> 	do
> 		f(buf[offset]);
> 	while (offset += size);
> that reduces the loop control to just an 'add' and 'jnz' (which can
> get merged into a single u-op).
> 
> The cpu have enough execution units to execute two memory reads,
> a memory write, an xor the add and jnz every clock.
> So even the 'rolled up' loop might run at one xor per clock.
> While I think I got a 'one clock loop' on my zen-5 (testing
> word-at-a-time strlen) I only managed a two clock loop on the newest
> Intel cpu I've got (which isn't that new).
> So put two xor in the loop and it shouldn't be limited by the loop
> control, but will be limited by the memory accesses instead.
> 
> Further unrolling shouldn't help and may make things worse.
> The Intel cpu have logic to directly forward the result of an
> ALU instruction into the next few instructions, but after that you can
> get a stall because of the 'round trip' via the register file.
> So part way down an unrolled nn(%reg) sequence you can get a stall.
> An extra 'add $0,%reg' in the middle of the unrolled loop will
> 'refresh' the register and speed things up.
> (I hit that with a loop that needed a rather more complicated control
> structure.)
> 
> You definitely need to use the pmc clock counter and data dependencies
> against the rdpmc instruction to get sensible performance figures.
> They can reasonably reliably measure down to less than 20 clocks.

The version at the end of this email is what you're suggesting, I think.
On Sapphire Rapids and Ryzen 9 9950X it's about the same speed as mine,
just a few percent slower on Sapphire with src_cnt == 1.

So we could use it.  It's just a bit fragile since it assumes the loop
overhead and indexed addressing will never be a bottleneck on any
current or future CPU.  Unrolling by more gives something more robust
that "just works", without having to analyze whether the loops are okay
on each CPU model individually based on microarchitectural details.
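
For reference, the negative-offset loop described in the quoted text can be
written in portable C. This is a minimal sketch, not the proposed kernel code:
`xor_neg_index()` is a hypothetical name, it works on 8-byte words rather than
zmm registers, and it assumes `len` is a nonzero multiple of 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define XOR_STEP sizeof(uint64_t)

/* dst[i] ^= src[i] for i in [0, len), indexed backwards from the buffer end. */
void xor_neg_index(uint8_t *dst, const uint8_t *src, size_t len)
{
	ptrdiff_t off;

	assert(len != 0 && len % XOR_STEP == 0);

	/* Point both buffers at their ends and count up from -len to 0. */
	dst += len;
	src += len;
	off = -(ptrdiff_t)len;

	do {
		uint64_t a, b;

		/* memcpy keeps the word accesses free of alignment issues. */
		memcpy(&a, dst + off, sizeof(a));
		memcpy(&b, src + off, sizeof(b));
		a ^= b;
		memcpy(dst + off, &a, sizeof(a));

		/* The loop test is the add itself reaching zero. */
		off += (ptrdiff_t)XOR_STEP;
	} while (off != 0);
}
```

Because the index ends at exactly zero, the loop needs no separate compare
against a length, which is the property the quoted 'add' plus 'jnz' remark
relies on. Whether a compiler emits that exact pair for this C code is not
guaranteed.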

// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * AVX-512 optimized implementation of xor_gen()
 *
 * Copyright 2026 Google LLC
 */

#include <linux/compiler.h>
#include <linux/types.h>
#include <linux/unroll.h>
#include <asm/fpu/api.h>
#include "xor_impl.h"
#include "xor_arch.h"

static void xor_avx512_2(long bytes, u8 *p0, const u8 *p1)
{
	long i = -bytes; /* Use negative indexing to minimize loop overhead. */

	p0 += bytes;
	p1 += bytes;
	unrolled_none
	do {
		/* unroll by 2x to reduce loop overhead */
		asm volatile("vmovdqa64 (%2,%0), %%zmm0\n"
			     "vmovdqa64 64(%2,%0), %%zmm1\n"
			     "vpxorq (%2,%1), %%zmm0, %%zmm0\n"
			     "vpxorq 64(%2,%1), %%zmm1, %%zmm1\n"
			     "vmovdqa64 %%zmm0, (%2,%0)\n"
			     "vmovdqa64 %%zmm1, 64(%2,%0)\n"
			     :
			     : "r"(p0), "r"(p1), "r"(i)
			     : "memory");
	} while ((i += 128) != 0);
}

static void xor_avx512_3(long bytes, u8 *p0, const u8 *p1, const u8 *p2)
{
	long i = -bytes; /* Use negative indexing to minimize loop overhead. */

	p0 += bytes;
	p1 += bytes;
	p2 += bytes;
	unrolled_none
	do {
		/* unroll by 2x to reduce loop overhead */
		asm volatile("vmovdqa64 (%3,%0), %%zmm0\n"
			     "vmovdqa64 64(%3,%0), %%zmm1\n"
			     "vmovdqa64 (%3,%1), %%zmm2\n"
			     "vmovdqa64 64(%3,%1), %%zmm3\n"
			     "vpternlogq $0x96, (%3,%2), %%zmm2, %%zmm0\n"
			     "vpternlogq $0x96, 64(%3,%2), %%zmm3, %%zmm1\n"
			     "vmovdqa64 %%zmm0, (%3,%0)\n"
			     "vmovdqa64 %%zmm1, 64(%3,%0)\n"
			     :
			     : "r"(p0), "r"(p1), "r"(p2), "r"(i)
			     : "memory");
	} while ((i += 128) != 0);
}

static void xor_avx512_4(long bytes, u8 *p0, const u8 *p1, const u8 *p2,
			 const u8 *p3)
{
	long i = -bytes; /* Use negative indexing to minimize loop overhead. */

	p0 += bytes;
	p1 += bytes;
	p2 += bytes;
	p3 += bytes;
	unrolled_none
	do {
		asm volatile("vmovdqa64 (%4,%0), %%zmm0\n"
			     "vmovdqa64 (%4,%1), %%zmm1\n"
			     "vpxorq (%4,%2), %%zmm0, %%zmm0\n"
			     "vpternlogq $0x96, (%4,%3), %%zmm1, %%zmm0\n"
			     "vmovdqa64 %%zmm0, (%4,%0)\n"
			     :
			     : "r"(p0), "r"(p1), "r"(p2), "r"(p3), "r"(i)
			     : "memory");
	} while ((i += 64) != 0);
}

static void xor_avx512_5(long bytes, u8 *p0, const u8 *p1, const u8 *p2,
			 const u8 *p3, const u8 *p4)
{
	long i = -bytes; /* Use negative indexing to minimize loop overhead. */

	p0 += bytes;
	p1 += bytes;
	p2 += bytes;
	p3 += bytes;
	p4 += bytes;
	unrolled_none
	do {
		asm volatile("vmovdqa64 (%5,%0), %%zmm0\n"
			     "vmovdqa64 (%5,%1), %%zmm1\n"
			     "vpternlogq $0x96, (%5,%2), %%zmm1, %%zmm0\n"
			     "vmovdqa64 (%5,%3), %%zmm1\n"
			     "vpternlogq $0x96, (%5,%4), %%zmm1, %%zmm0\n"
			     "vmovdqa64 %%zmm0, (%5,%0)\n"
			     :
			     : "r"(p0), "r"(p1), "r"(p2), "r"(p3), "r"(p4),
			       "r"(i)
			     : "memory");
	} while ((i += 64) != 0);
}

DO_XOR_BLOCKS(avx512_inner, xor_avx512_2, xor_avx512_3, xor_avx512_4,
	      xor_avx512_5);

/*
 * Preconditions: bytes is a nonzero multiple of 512, and all buffers are
 * 64-byte aligned.
 */
static void xor_gen_avx512(void *dest, void **srcs, unsigned int src_cnt,
			   unsigned int bytes)
{
	kernel_fpu_begin();
	xor_gen_avx512_inner(dest, srcs, src_cnt, bytes);
	kernel_fpu_end();
}

struct xor_block_template xor_block_avx512 = {
	.name = "avx512",
	.xor_gen = xor_gen_avx512,
};


* Re: [RFC] ML-KEM (FIPS 203) implementation with reusable decapsulation pool
From: Eric Biggers @ 2026-06-12 18:32 UTC
  To: kstzavertaylo; +Cc: linux-crypto, herbert
In-Reply-To: <CAMho2Rem-B908oaFQzTx8Mg895LuvPcfN9+ANoHW+XfGW+wB6A@mail.gmail.com>

On Fri, Jun 12, 2026 at 05:14:54PM +0300, kstzavertaylo wrote:
> Thank you for the detailed reply and for pointing me to the existing
> ML-KEM/X-Wing patchset. I spent some time reviewing the implementation
> to better understand the design choices and how they compare to the
> approach I took in my own work.
> 
> After reviewing the patchset, I can see several strengths in the
> implementation. It integrates cleanly into the existing lib/crypto
> infrastructure, reuses kernel cryptographic primitives, avoids large
> stack allocations, and includes KUnit-based validation. The
> implementation also appears intentionally compact and well aligned
> with existing kernel conventions.
> 
> While reviewing the implementation, I noticed that decapsulation
> allocates a temporary workspace for each operation. This is one of the
> areas where my design diverged, which is what originally motivated the
> reusable pool approach.
> 
> My implementation was developed with a somewhat different goal in
> mind. I experimented with a reusable decapsulation workspace model
> where memory is allocated during key initialization and then reused
> across subsequent decapsulation operations. The main motivation was
> reducing allocation frequency and minimizing both stack usage and
> repeated memory management during decapsulation.
>
> As a result, the implementation avoids allocations during
> decapsulation entirely by reusing preallocated workspaces associated
> with the key context. My original hypothesis was that moving memory
> allocation to key initialization, thereby eliminating allocations from
> the decapsulation path, could reduce allocation overhead during
> repeated decapsulation operations and be beneficial in environments
> where allocation activity is considered undesirable.

In my ML-KEM code, all the decapsulation memory is consolidated into
struct mlkem_decap_workspace.  It would be straightforward to support
the caller providing a pre-allocated workspace.

In the case of X-Wing, we could also support pre-expanding the
decapsulation key.

It just depends on what is actually going to be needed by the kernel
feature(s) that are going to use this.  Which we don't really know yet.

We do know that it hasn't been found to be useful for the crypto
subsystem to provide pools for any other algorithm in the kernel, for a
variety of reasons.  Usually callers can just allocate per-operation, or
they have some sort of object (inode, block device, socket, etc.) that's
a natural place for them to cache whatever they need anyway.  In the
rare cases where some sort of pool is needed it's implemented in the
caller, optimized for the particular use case.  So I think there's a
good chance your pool idea is going off on the wrong track.

> Another difference is the integration level. My prototype explored
> direct integration through the KPP interface, whereas the patchset
> focuses on providing a reusable cryptographic library component within
> lib/crypto. These approaches address somewhat different layers of the
> kernel crypto stack.

We don't need crypto_kpp support, as it's much more complex and harder
to use than the crypto library
(https://docs.kernel.org/crypto/libcrypto.html).  Also it seems it's not
really possible anyway, since crypto_kpp is an old design that works for
Diffie-Hellman but not KEMs.

- Eric

^ permalink raw reply

* Re: [PATCH] crypto: ccp: Fix SNP range list bounds check
From: Tycho Andersen @ 2026-06-12 15:18 UTC (permalink / raw)
  To: ZongYao.Chen
  Cc: Ashish Kalra, Tom Lendacky, John Allen, Herbert Xu,
	David S. Miller, Michael Roth, Jarkko Sakkinen,
	Borislav Petkov (AMD), Brijesh Singh, Tianjia Zhang, linux-crypto,
	linux-kernel, stable
In-Reply-To: <20260612092525.1203150-1-ZongYao.Chen@linux.alibaba.com>

On Fri, Jun 12, 2026 at 05:25:25PM +0800, ZongYao.Chen@linux.alibaba.com wrote:
> From: Zongyao Chen <ZongYao.Chen@linux.alibaba.com>
> 
> snp_filter_reserved_mem_regions() checks the range list size before
> adding a new entry. If the page-sized SNP_INIT_EX buffer is already
> full, the next matching resource can still write one entry past the end
> of the buffer.
> 
> Check that there is room for the next entry before appending it, and
> compute the next entry pointer only after the bounds check.

> Fixes: 1ca5614b84ee ("crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP")
> Cc: stable@vger.kernel.org
> Signed-off-by: Zongyao Chen <ZongYao.Chen@linux.alibaba.com>

I believe there is a version of this in the crypto tree already as
1b864b6cb213 ("crypto: ccp - Fix snp_filter_reserved_mem_regions()
off-by-one").

Thanks,

Tycho

^ permalink raw reply

* Re: [RFC] ML-KEM (FIPS 203) implementation with reusable decapsulation pool
From: kstzavertaylo @ 2026-06-12 14:14 UTC (permalink / raw)
  To: Eric Biggers; +Cc: linux-crypto, herbert
In-Reply-To: <20260609192542.GA3811606@google.com>

Thank you for the detailed reply and for pointing me to the existing
ML-KEM/X-Wing patchset. I spent some time reviewing the implementation
to better understand the design choices and how they compare to the
approach I took in my own work.

After reviewing the patchset, I can see several strengths in the
implementation. It integrates cleanly into the existing lib/crypto
infrastructure, reuses kernel cryptographic primitives, avoids large
stack allocations, and includes KUnit-based validation. The
implementation also appears intentionally compact and well aligned
with existing kernel conventions.

While reviewing the implementation, I noticed that decapsulation
allocates a temporary workspace for each operation. This is one of the
areas where my design diverged, which is what originally motivated the
reusable pool approach.

My implementation was developed with a somewhat different goal in
mind. I experimented with a reusable decapsulation workspace model
where memory is allocated during key initialization and then reused
across subsequent decapsulation operations. The main motivation was
reducing allocation frequency and minimizing both stack usage and
repeated memory management during decapsulation.

As a result, the implementation avoids allocations during
decapsulation entirely by reusing preallocated workspaces associated
with the key context. My original hypothesis was that moving memory
allocation to key initialization, thereby eliminating allocations from
the decapsulation path, could reduce allocation overhead during
repeated decapsulation operations and be beneficial in environments
where allocation activity is considered undesirable.

Another difference is the integration level. My prototype explored
direct integration through the KPP interface, whereas the patchset
focuses on providing a reusable cryptographic library component within
lib/crypto. These approaches address somewhat different layers of the
kernel crypto stack.

The primary reason I initially started working on this implementation
was to explore whether a reusable-workspace architecture could be
useful in environments where allocation frequency and memory reuse are
considered important design factors. I therefore wanted to understand
whether such an approach might offer any practical value within the
kernel context, even if the overall implementation strategy differs
from the existing patchset.

The goal is to analyze the results and understand whether the
reusable-workspace approach actually achieves its intended goals in
terms of memory usage, allocation behavior, throughput, and related
metrics. In particular, I am interested in understanding whether such
an approach may provide practical benefits in environments where stack
space is constrained or where reducing allocation activity is
desirable. To better evaluate these tradeoffs, I am currently
preparing a comparison against several established ML-KEM
implementations. If such data would be useful for the discussion, I
would be happy to share the results once they are available.

Best regards,
K. Zavertailo


On Tue, Jun 9, 2026 at 10:25 PM Eric Biggers <ebiggers@kernel.org> wrote:
>
> On Tue, Jun 09, 2026 at 10:45:48AM +0300, kstzavertaylo wrote:
> > Hello,
> > I have been working on an ML-KEM (FIPS 203) implementation for the
> > Linux kernel. This is an early RFC to solicit feedback on the overall
> > design and architecture before further polishing.
> >
> > The implementation consists of two closely related variants sharing
> > the same core cryptographic logic:
> >     1. A userspace implementation accompanied by a set of validation
> > programs, including NIST KAT vectors, timing-leakage testing (dudect),
> > pool stress tests, and additional functional tests.
> >     2. A Linux kernel module implementing the KPP interface and
> > reusing the same core architecture where possible.
> >
> > Key features include:
> >    1. Support for all three parameter sets: ML-KEM-512, ML-KEM-768,
> > and ML-KEM-1024.
> >    2. The implementation uses a reusable decapsulation pool consisting
> > of preallocated slots associated with a key context. The goal of this
> > design is to move memory allocation to key initialization and avoid
> > per-decapsulation allocations.
> >    3. Explicit zeroization of sensitive data and constant-time
> > operations where required.
> >    4. Portable C11 codebase with minimal differences between userspace
> > and kernel versions.
> >
> > I am aware that some aspects (local SHA3/SHAKE implementation, coding
> > style, etc.) will likely need adjustment to align with upstream
> > expectations.
> >
> > At this stage, I would like to ask for feedback on the following points:
> >    1. Is the general direction (KPP integration + reusable
> > decapsulation pool) acceptable?
> >    2. Are there any fundamental concerns with the pool-based architecture?
> >    3. Would you prefer to reuse kernel crypto primitives for
> > SHA3/SHAKE, or is the current embedded approach acceptable at this
> > stage?
> >
> > The implementation is available at: repository - https://github.com/kstzv/ml-kem
> >
> > Documentation and implementation details are available in the repository.
> >
> > Any feedback, criticism or suggestions would be greatly appreciated.
>
> There's already a kernel patchset for ML-KEM and X-Wing ready to go:
> https://lore.kernel.org/linux-crypto/20260525184403.101818-1-ebiggers@kernel.org/T/#u
> It's a high quality implementation that fully follows kernel conventions
> already.  There just hasn't been a reason to merge it yet, since there's
> no user yet.
>
> We could consider replacing my ML-KEM implementation (patch 1 of that
> series) with a different one.  But it would have to be a high-quality
> implementation that brings something substantially new to the table.
>
> I think only an integration of
> https://github.com/pq-code-package/mlkem-native *might* have a chance at
> passing that bar.  However, it would be way more code than my
> implementation, would have significant integration challenges, and would
> need some fixing up to work in the kernel.  The main benefit would be
> getting the assembly code, but it's not clear that will be needed.  So
> those are some of the reasons I didn't reach for that initially.
>
> I don't think integrating https://github.com/kstzv/ml-kem would be
> beneficial, for a number of reasons.
>
> Anyway, I suggest you review the pre-existing patchset
> https://lore.kernel.org/linux-crypto/20260525184403.101818-1-ebiggers@kernel.org/
> and give feedback on that, if you have any.
>
> - Eric

^ permalink raw reply

* Re: [PATCH] crypto: ccp: Fix SNP range list bounds check
From: Tom Lendacky @ 2026-06-12 13:05 UTC (permalink / raw)
  To: ZongYao.Chen, Ashish Kalra, John Allen, Herbert Xu,
	David S. Miller
  Cc: Michael Roth, Jarkko Sakkinen, Borislav Petkov (AMD),
	Brijesh Singh, Tianjia Zhang, linux-crypto, linux-kernel, stable
In-Reply-To: <20260612092525.1203150-1-ZongYao.Chen@linux.alibaba.com>

On 6/12/26 04:25, ZongYao.Chen@linux.alibaba.com wrote:
> From: Zongyao Chen <ZongYao.Chen@linux.alibaba.com>
> 
> snp_filter_reserved_mem_regions() checks the range list size before
> adding a new entry. If the page-sized SNP_INIT_EX buffer is already
> full, the next matching resource can still write one entry past the end
> of the buffer.
> 
> Check that there is room for the next entry before appending it, and
> compute the next entry pointer only after the bounds check.
> 
> Fixes: 1ca5614b84ee ("crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP")
> Cc: stable@vger.kernel.org
> Signed-off-by: Zongyao Chen <ZongYao.Chen@linux.alibaba.com>

Thanks for the submission, but this has already been fixed with
1b864b6cb213 ("crypto: ccp - Fix snp_filter_reserved_mem_regions()
off-by-one")

Thanks,
Tom

> ---
>  drivers/crypto/ccp/sev-dev.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
> index d1e9e0ac63b6..9e6efb3ec175 100644
> --- a/drivers/crypto/ccp/sev-dev.c
> +++ b/drivers/crypto/ccp/sev-dev.c
> @@ -1324,17 +1324,19 @@ static int snp_get_platform_data(struct sev_device *sev, int *error)
>  static int snp_filter_reserved_mem_regions(struct resource *rs, void *arg)
>  {
>         struct sev_data_range_list *range_list = arg;
> -       struct sev_data_range *range = &range_list->ranges[range_list->num_elements];
> +       struct sev_data_range *range;
>         size_t size;
> 
>         /*
>          * Ensure the list of HV_FIXED pages that will be passed to firmware
>          * do not exceed the page-sized argument buffer.
>          */
> -       if ((range_list->num_elements * sizeof(struct sev_data_range) +
> +       if (((range_list->num_elements + 1) * sizeof(struct sev_data_range) +
>              sizeof(struct sev_data_range_list)) > PAGE_SIZE)
>                 return -E2BIG;
> 
> +       range = &range_list->ranges[range_list->num_elements];
> +
>         switch (rs->desc) {
>         case E820_TYPE_RESERVED:
>         case E820_TYPE_PMEM:
> --
> 2.47.3
> 


^ permalink raw reply

* [PATCH] crypto: ccp: Fix SNP range list bounds check
From: ZongYao.Chen @ 2026-06-12  9:25 UTC (permalink / raw)
  To: Ashish Kalra, Tom Lendacky, John Allen, Herbert Xu,
	David S. Miller
  Cc: Michael Roth, Jarkko Sakkinen, Borislav Petkov (AMD),
	Brijesh Singh, Tianjia Zhang, linux-crypto, linux-kernel,
	Zongyao Chen, stable

From: Zongyao Chen <ZongYao.Chen@linux.alibaba.com>

snp_filter_reserved_mem_regions() checks the range list size before
adding a new entry. If the page-sized SNP_INIT_EX buffer is already
full, the next matching resource can still write one entry past the end
of the buffer.

Check that there is room for the next entry before appending it, and
compute the next entry pointer only after the bounds check.

Fixes: 1ca5614b84ee ("crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP")
Cc: stable@vger.kernel.org
Signed-off-by: Zongyao Chen <ZongYao.Chen@linux.alibaba.com>
---
 drivers/crypto/ccp/sev-dev.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index d1e9e0ac63b6..9e6efb3ec175 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -1324,17 +1324,19 @@ static int snp_get_platform_data(struct sev_device *sev, int *error)
 static int snp_filter_reserved_mem_regions(struct resource *rs, void *arg)
 {
 	struct sev_data_range_list *range_list = arg;
-	struct sev_data_range *range = &range_list->ranges[range_list->num_elements];
+	struct sev_data_range *range;
 	size_t size;
 
 	/*
 	 * Ensure the list of HV_FIXED pages that will be passed to firmware
 	 * do not exceed the page-sized argument buffer.
 	 */
-	if ((range_list->num_elements * sizeof(struct sev_data_range) +
+	if (((range_list->num_elements + 1) * sizeof(struct sev_data_range) +
 	     sizeof(struct sev_data_range_list)) > PAGE_SIZE)
 		return -E2BIG;
 
+	range = &range_list->ranges[range_list->num_elements];
+
 	switch (rs->desc) {
 	case E820_TYPE_RESERVED:
 	case E820_TYPE_PMEM:
-- 
2.47.3
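
For readers following the arithmetic, the off-by-one can be reproduced with
a small userspace model.  This is only a sketch: the demo_* structs are
simplified stand-ins whose layout is an assumption, and DEMO_PAGE_SIZE is
fixed at 4096 for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u

/* Simplified stand-ins; the field layout is assumed for illustration. */
struct demo_range {
	uint64_t base;
	uint32_t page_count;
	uint32_t reserved;
};

struct demo_range_list {
	uint32_t num_elements;
	uint32_t reserved;
	struct demo_range ranges[];
};

/* Old check: only verifies that the entries already present fit. */
static bool old_check_rejects(uint32_t num_elements)
{
	return num_elements * sizeof(struct demo_range) +
	       sizeof(struct demo_range_list) > DEMO_PAGE_SIZE;
}

/* New check: verifies that one more entry still fits. */
static bool new_check_rejects(uint32_t num_elements)
{
	return (num_elements + 1) * sizeof(struct demo_range) +
	       sizeof(struct demo_range_list) > DEMO_PAGE_SIZE;
}

/* Largest number of entries that fit in the page-sized buffer. */
static uint32_t demo_capacity(void)
{
	return (DEMO_PAGE_SIZE - sizeof(struct demo_range_list)) /
	       sizeof(struct demo_range);
}
```

When num_elements equals demo_capacity(), the old check still passes and
the next append would land past the end of the buffer, whereas the new
check returns the error before the entry pointer is even computed.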


^ permalink raw reply related

* Re: [PATCH] lib/raid/xor: x86: Add AVX-512 optimized xor_gen()
From: David Laight @ 2026-06-12  9:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Eric Biggers, Andrew Morton, linux-kernel, linux-crypto, x86,
	Andrea Mazzoleni
In-Reply-To: <20260612052247.GA8848@lst.de>

On Fri, 12 Jun 2026 07:22:47 +0200
Christoph Hellwig <hch@lst.de> wrote:

> On Thu, Jun 11, 2026 at 09:40:34PM -0700, Eric Biggers wrote:
> > Add an implementation of xor_gen() using AVX-512.  
> 
> > Benchmark on AMD Ryzen 9 9950X (Zen 5):  
> 
> Can you share the benchmark?
> 
> In my local tree I have ports of the AVX2 and AVX512 implementations
> from snapraid (https://github.com/amadvance/snapraid), which in userspace
> give really good performance.  On my laptop with an AMD Ryzen AI 7 PRO 350
> (which is a Zen5 with the slower double pumped AVX512 unit), both of
> them get over 1GB/s throughput on the snapraid benchmarks.  I've been
> holding them back as I don't have a good kernel benchmarking harness,
> and it's missing the quirks for old AVX512 or the newer AMD special
> cases.

From my experiments on Intel CPUs (and I don't remember the zen-5 being
that different - but I've done less testing on it) you don't need to
unroll loops very much at all.

A reasonable model seems to be that the uops generated by the instruction
decoder get executed when all the prerequisite registers and the required
execution unit are available.
So for a memory copy (and the xor is basically a copy) the control loop
can run way ahead of the read/write instructions.
This means you can get the control loop 'for free' and unrolling further
makes no/little difference.

Each xor is two memory reads and one memory write.
The CPU I was using could only do one write/clock - so you can only do one
xor each clock. I think some of the newer ones can do two writes/clock but
I'm not sure how many reads/clock they can do - might still be 2, I don't
think it's 4.
So you should be able to get one xor per clock, but I doubt you'll get two
(and possibly not even 1.3 - which would require 4 memory accesses per clock).

The best loop construct is the one that uses negative offsets from the
end of the buffers, basically:
	buf += len;
	offset = -len;
	do
		f(buf[offset]);
	while (offset += size);
that reduces the loop control to just an 'add' and 'jnz' (which can
get merged into a single u-op).
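
As a concrete (scalar, userspace) sketch of that loop shape, assuming a
single source buffer and 64-bit words; xor_neg_offset is a made-up name,
and whether the compiler actually emits just an add and a jnz for the loop
control depends on the compiler and flags:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * XOR src into dst, indexing both buffers from their ends with one
 * negative offset that counts up to zero.  len must be a non-zero
 * multiple of sizeof(uint64_t).
 */
static void xor_neg_offset(uint64_t *dst, const uint64_t *src, size_t len)
{
	const char *s = (const char *)src + len;	/* one past the end */
	char *d = (char *)dst + len;			/* one past the end */
	ptrdiff_t offset = -(ptrdiff_t)len;

	do {
		*(uint64_t *)(d + offset) ^= *(const uint64_t *)(s + offset);
	} while ((offset += (ptrdiff_t)sizeof(uint64_t)) != 0);
}
```

The loop terminates when the offset reaches zero, so no separate compare
against a length or end pointer is needed.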

The CPUs have enough execution units to execute two memory reads,
a memory write, an xor, the add and the jnz every clock.
So even the 'rolled up' loop might run at one xor per clock.
While I think I got a 'one clock loop' on my zen-5 (testing
word-at-a-time strlen) I only managed a two clock loop on the newest
Intel cpu I've got (which isn't that new).
So put two xor in the loop and it shouldn't be limited by the loop
control, but will be limited by the memory accesses instead.

Further unrolling shouldn't help and may make things worse.
The Intel CPUs have logic to directly forward the result of an
ALU instruction into the next few instructions, but after that you can
get a stall because of the 'round trip' via the register file.
So part way down an unrolled nn(%reg) sequence you can get a stall.
An extra 'add $0,%reg' in the middle of the unrolled loop will
'refresh' the register and speed things up.
(I hit that with a loop that needed a rather more complicated control
structure.)

You definitely need to use the pmc clock counter and data dependencies
against the rdpmc instruction to get sensible performance figures.
That can reasonably reliably measure down to less than 20 clocks.

	David
 

^ permalink raw reply

* Re: [PATCH 1/2] crypto: qce: Fix xts-aes-qce for weak keys
From: Dmitry Baryshkov @ 2026-06-12  6:11 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Kuldeep Singh, Thara Gopinath, David S. Miller,
	Bartosz Golaszewski, Eric Biggers, Thara Gopinath, linux-crypto,
	linux-arm-msm, linux-kernel
In-Reply-To: <aiuA8CCGcfP6MdLy@gondor.apana.org.au>

On Fri, Jun 12, 2026 at 11:45:52AM +0800, Herbert Xu wrote:
> On Fri, Jun 12, 2026 at 03:40:49AM +0300, Dmitry Baryshkov wrote:
> >
> > > Fix xts-aes-qce behavior by using generic helper xts_verify_key() to
> > > reject keys early with -EINVAL for FIPS mode active(or FORBID_WEAK_KEYS
> > > set). For non-FIPS mode, since QCE hardware cannot accept the keys, use
> > > software fallback mechanism to encrypt the data.
> > 
> > No, if it is a hardware driver, there should be no software fallback.
> 
> The driver must support everything that the software implementation
> supports.  So if the hardware can't do something, it has to use a
> fallback.

It's unexpected. But you know it better than I do.

-- 
With best wishes
Dmitry

^ permalink raw reply

* Re: [PATCH] lib/raid/xor: x86: Add AVX-512 optimized xor_gen()
From: Eric Biggers @ 2026-06-12  5:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, linux-kernel, linux-crypto, x86, Andrea Mazzoleni
In-Reply-To: <20260612052247.GA8848@lst.de>

On Fri, Jun 12, 2026 at 07:22:47AM +0200, Christoph Hellwig wrote:
> On Thu, Jun 11, 2026 at 09:40:34PM -0700, Eric Biggers wrote:
> > Add an implementation of xor_gen() using AVX-512.
> 
> > Benchmark on AMD Ryzen 9 9950X (Zen 5):
> 
> Can you share the benchmark?

For now I had just hacked up do_xor_speed() as follows and changed
xor_force() to xor_register().  There should be a benchmark added to the
KUnit test similar to the one in the crypto and CRC tests, though.

diff --git a/lib/raid/xor/xor-core.c b/lib/raid/xor/xor-core.c
index bd4e6e434418..8c5814af03d5 100644
--- a/lib/raid/xor/xor-core.c
+++ b/lib/raid/xor/xor-core.c
@@ -76,15 +76,24 @@ void __init xor_force(struct xor_block_template *tmpl)
 #define REPS		800U
 
 static void __init
-do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2)
+do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2,
+	     void *b3, void *b4, void *b5)
 {
+	for (int src_cnt = 1; src_cnt <= 4; src_cnt++) {
 	int speed;
 	unsigned long reps;
 	ktime_t min, start, t0;
-	void *srcs[1] = { b2 };
+	void *srcs[4] = { b2, b3, b4, b5 };
 
 	preempt_disable();
 
+	/* warm-up */
+	for (int i = 0; i < 8000; i++) {
+		mb(); /* prevent loop optimization */
+		tmpl->xor_gen(b1, srcs, src_cnt, BENCH_SIZE);
+		mb();
+	}
+
 	reps = 0;
 	t0 = ktime_get();
 	/* delay start until time has advanced */
@@ -92,7 +101,7 @@ do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2)
 		cpu_relax();
 	do {
 		mb(); /* prevent loop optimization */
-		tmpl->xor_gen(b1, srcs, 1, BENCH_SIZE);
+		tmpl->xor_gen(b1, srcs, src_cnt, BENCH_SIZE);
 		mb();
 	} while (reps++ < REPS || (t0 = ktime_get()) == start);
 	min = ktime_sub(t0, start);
@@ -105,26 +114,30 @@ do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2)
 
 	pr_info("   %-16s: %5d MB/sec\n", tmpl->name, speed);
 }
+}
 
 static int __init calibrate_xor_blocks(void)
 {
-	void *b1, *b2;
+	void *b1, *b2, *b3, *b4, *b5;
 	struct xor_block_template *f, *fastest;
 
 	if (forced_template)
 		return 0;
 
-	b1 = (void *) __get_free_pages(GFP_KERNEL, 2);
+	b1 = (void *) __get_free_pages(GFP_KERNEL, 4);
 	if (!b1) {
 		pr_warn("xor: Yikes!  No memory available.\n");
 		return -ENOMEM;
 	}
 	b2 = b1 + 2*PAGE_SIZE + BENCH_SIZE;
+	b3 = b2 + 2*PAGE_SIZE + BENCH_SIZE;
+	b4 = b3 + 2*PAGE_SIZE + BENCH_SIZE;
+	b5 = b4 + 2*PAGE_SIZE + BENCH_SIZE;
 
 	pr_info("xor: measuring software checksum speed\n");
 	fastest = template_list;
 	for (f = template_list; f; f = f->next) {
-		do_xor_speed(f, b1, b2);
+		do_xor_speed(f, b1, b2, b3, b4, b5);
 		if (f->speed > fastest->speed)
 			fastest = f;
 	}

> In my local tree I have ports of the AVX2 and AVX512 implementations
> from snapraid (https://github.com/amadvance/snapraid), which in userspace
> give really good performance.  On my laptop with an AMD Ryzen AI 7 PRO 350
> (which is a Zen5 with the slower double pumped AVX512 unit), both of
> them get over 1GB/s throughput on the snapraid benchmarks.  I've been
> holding them back as I don't have a good kernel benchmarking harness,
> and it's missing the quirks for old AVX512 or the newer AMD special
> cases.
> 
> Attached for reference.
> 
> Note that either way I'd prefer if we could get away from the strange
> old code organization with the DO{1-4} helpers which don't really
> help.

Well, doing the same on your avx512bw version and adding a column to my
table for it (by the way, I think it really just needs avx512f), I get:

        src_cnt    avx          avx512       avx512bw
        =======    ==========   ==========   ==========
        1          68423 MB/s   81940 MB/s   12067 MB/s
        2          56035 MB/s   74112 MB/s   10958 MB/s
        3          49396 MB/s   67011 MB/s   8608 MB/s
        4          43056 MB/s   60823 MB/s   8069 MB/s

So, your version isn't great, I'm afraid.  Making the inner loop be over
src_cnt does simplify the code a lot, but it destroys performance since
it turns into 9 instructions for each 64 bytes in each 3 buffers:

      5b:   89 c1                   mov    %eax,%ecx
      5d:   8d 70 01                lea    0x1(%rax),%esi
      60:   48 8b 0c cb             mov    (%rbx,%rcx,8),%rcx
      64:   48 8b 34 f3             mov    (%rbx,%rsi,8),%rsi
      68:   62 f1 fd 48 6f 0c 11    vmovdqa64 (%rcx,%rdx,1),%zmm1
      6f:   62 f3 f5 48 25 04 16    vpternlogq $0x96,(%rsi,%rdx,1),%zmm1,%zmm0
      76:   96 
      77:   83 c0 02                add    $0x2,%eax
      7a:   39 f8                   cmp    %edi,%eax
      7c:   72 dd                   jb     5b <xor_gen_avx512bw+0x4b>

You could try unrolling by 512 bytes, which should help.
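
As an aside on the $0x96 immediate in the disassembly above: it is the
truth table that makes vpternlogq a three-input XOR.  The scalar model
below (ternlog64 is an illustrative helper, not kernel code) shows how the
immediate is interpreted bit by bit:

```c
#include <stdint.h>

/*
 * Scalar model of vpternlogq on one 64-bit lane: at each bit position
 * the three input bits form a 3-bit index into the 8-bit immediate,
 * and the selected immediate bit becomes the result bit.
 */
static uint64_t ternlog64(uint64_t a, uint64_t b, uint64_t c, uint8_t imm)
{
	uint64_t r = 0;

	for (int i = 0; i < 64; i++) {
		unsigned int idx = (((a >> i) & 1) << 2) |
				   (((b >> i) & 1) << 1) |
				   ((c >> i) & 1);

		r |= (uint64_t)((imm >> idx) & 1) << i;
	}
	return r;
}
```

0x96 is 10010110 in binary, i.e. the bits at indices 1, 2, 4 and 7 are
set, which are exactly the indices with an odd number of one bits, so the
result is a ^ b ^ c regardless of operand order.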

- Eric

^ permalink raw reply related

* Re: [PATCH] lib/raid/xor: x86: Add AVX-512 optimized xor_gen()
From: Christoph Hellwig @ 2026-06-12  5:22 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Andrew Morton, linux-kernel, Christoph Hellwig, linux-crypto, x86,
	Andrea Mazzoleni
In-Reply-To: <20260612044034.117442-1-ebiggers@kernel.org>

On Thu, Jun 11, 2026 at 09:40:34PM -0700, Eric Biggers wrote:
> Add an implementation of xor_gen() using AVX-512.

> Benchmark on AMD Ryzen 9 9950X (Zen 5):

Can you share the benchmark?

In my local tree I have ports of the AVX2 and AVX512 implementations
from snapraid (https://github.com/amadvance/snapraid), which in userspace
give really good performance.  On my laptop with an AMD Ryzen AI 7 PRO 350
(which is a Zen5 with the slower double pumped AVX512 unit), both of
them get over 1GB/s throughput on the snapraid benchmarks.  I've been
holding them back as I don't have a good kernel benchmarking harness,
and it's missing the quirks for old AVX512 or the newer AMD special
cases.

Attached for reference.

Note that either way I'd prefer if we could get away from the strange
old code organization with the DO{1-4} helpers which don't really
help.

diff --git a/lib/raid/xor/Makefile b/lib/raid/xor/Makefile
index 4d633dfd5b90..3d5ebeda241e 100644
--- a/lib/raid/xor/Makefile
+++ b/lib/raid/xor/Makefile
@@ -28,7 +28,7 @@ xor-$(CONFIG_SPARC32)		+= sparc/xor-sparc32.o
 xor-$(CONFIG_SPARC64)		+= sparc/xor-sparc64.o sparc/xor-sparc64-glue.o
 xor-$(CONFIG_S390)		+= s390/xor.o
 xor-$(CONFIG_X86_32)		+= x86/xor-avx.o x86/xor-sse.o x86/xor-mmx.o
-xor-$(CONFIG_X86_64)		+= x86/xor-avx.o x86/xor-sse.o
+xor-$(CONFIG_X86_64)		+= x86/xor-avx512.o x86/xor-avx.o x86/xor-sse.o
 obj-y				+= tests/
 
 CFLAGS_arm/xor-neon.o		+= $(CC_FLAGS_FPU)
diff --git a/lib/raid/xor/x86/xor-avx.c b/lib/raid/xor/x86/xor-avx.c
index f7777d7aa269..cd376a7c52d3 100644
--- a/lib/raid/xor/x86/xor-avx.c
+++ b/lib/raid/xor/x86/xor-avx.c
@@ -1,152 +1,31 @@
-// SPDX-License-Identifier: GPL-2.0-only
+// SPDX-License-Identifier: GPL-2.0-or-later
 /*
- * Optimized XOR parity functions for AVX
- *
- * Copyright (C) 2012 Intel Corporation
- * Author: Jim Kukunas <james.t.kukunas@linux.intel.com>
- *
- * Based on Ingo Molnar and Zach Brown's respective MMX and SSE routines
+ * Copyright (C) 2026 Andrea Mazzoleni
  */
-#include <linux/compiler.h>
 #include <asm/fpu/api.h>
 #include "xor_impl.h"
 #include "xor_arch.h"
 
-#define BLOCK4(i) \
-		BLOCK(32 * i, 0) \
-		BLOCK(32 * (i + 1), 1) \
-		BLOCK(32 * (i + 2), 2) \
-		BLOCK(32 * (i + 3), 3)
-
-#define BLOCK16() \
-		BLOCK4(0) \
-		BLOCK4(4) \
-		BLOCK4(8) \
-		BLOCK4(12)
-
-static void xor_avx_2(unsigned long bytes, unsigned long * __restrict p0,
-		      const unsigned long * __restrict p1)
-{
-	unsigned long lines = bytes >> 9;
-
-	while (lines--) {
-#undef BLOCK
-#define BLOCK(i, reg) \
-do { \
-	asm volatile("vmovdqa %0, %%ymm" #reg : : "m" (p1[i / sizeof(*p1)])); \
-	asm volatile("vxorps %0, %%ymm" #reg ", %%ymm"  #reg : : \
-		"m" (p0[i / sizeof(*p0)])); \
-	asm volatile("vmovdqa %%ymm" #reg ", %0" : \
-		"=m" (p0[i / sizeof(*p0)])); \
-} while (0);
-
-		BLOCK16()
-
-		p0 = (unsigned long *)((uintptr_t)p0 + 512);
-		p1 = (unsigned long *)((uintptr_t)p1 + 512);
-	}
-}
-
-static void xor_avx_3(unsigned long bytes, unsigned long * __restrict p0,
-		      const unsigned long * __restrict p1,
-		      const unsigned long * __restrict p2)
-{
-	unsigned long lines = bytes >> 9;
-
-	while (lines--) {
-#undef BLOCK
-#define BLOCK(i, reg) \
-do { \
-	asm volatile("vmovdqa %0, %%ymm" #reg : : "m" (p2[i / sizeof(*p2)])); \
-	asm volatile("vxorps %0, %%ymm" #reg ", %%ymm" #reg : : \
-		"m" (p1[i / sizeof(*p1)])); \
-	asm volatile("vxorps %0, %%ymm" #reg ", %%ymm" #reg : : \
-		"m" (p0[i / sizeof(*p0)])); \
-	asm volatile("vmovdqa %%ymm" #reg ", %0" : \
-		"=m" (p0[i / sizeof(*p0)])); \
-} while (0);
-
-		BLOCK16()
-
-		p0 = (unsigned long *)((uintptr_t)p0 + 512);
-		p1 = (unsigned long *)((uintptr_t)p1 + 512);
-		p2 = (unsigned long *)((uintptr_t)p2 + 512);
-	}
-}
-
-static void xor_avx_4(unsigned long bytes, unsigned long * __restrict p0,
-		      const unsigned long * __restrict p1,
-		      const unsigned long * __restrict p2,
-		      const unsigned long * __restrict p3)
-{
-	unsigned long lines = bytes >> 9;
-
-	while (lines--) {
-#undef BLOCK
-#define BLOCK(i, reg) \
-do { \
-	asm volatile("vmovdqa %0, %%ymm" #reg : : "m" (p3[i / sizeof(*p3)])); \
-	asm volatile("vxorps %0, %%ymm" #reg ", %%ymm" #reg : : \
-		"m" (p2[i / sizeof(*p2)])); \
-	asm volatile("vxorps %0, %%ymm" #reg ", %%ymm" #reg : : \
-		"m" (p1[i / sizeof(*p1)])); \
-	asm volatile("vxorps %0, %%ymm" #reg ", %%ymm" #reg : : \
-		"m" (p0[i / sizeof(*p0)])); \
-	asm volatile("vmovdqa %%ymm" #reg ", %0" : \
-		"=m" (p0[i / sizeof(*p0)])); \
-} while (0);
-
-		BLOCK16();
-
-		p0 = (unsigned long *)((uintptr_t)p0 + 512);
-		p1 = (unsigned long *)((uintptr_t)p1 + 512);
-		p2 = (unsigned long *)((uintptr_t)p2 + 512);
-		p3 = (unsigned long *)((uintptr_t)p3 + 512);
-	}
-}
-
-static void xor_avx_5(unsigned long bytes, unsigned long * __restrict p0,
-	     const unsigned long * __restrict p1,
-	     const unsigned long * __restrict p2,
-	     const unsigned long * __restrict p3,
-	     const unsigned long * __restrict p4)
-{
-	unsigned long lines = bytes >> 9;
-
-	while (lines--) {
-#undef BLOCK
-#define BLOCK(i, reg) \
-do { \
-	asm volatile("vmovdqa %0, %%ymm" #reg : : "m" (p4[i / sizeof(*p4)])); \
-	asm volatile("vxorps %0, %%ymm" #reg ", %%ymm" #reg : : \
-		"m" (p3[i / sizeof(*p3)])); \
-	asm volatile("vxorps %0, %%ymm" #reg ", %%ymm" #reg : : \
-		"m" (p2[i / sizeof(*p2)])); \
-	asm volatile("vxorps %0, %%ymm" #reg ", %%ymm" #reg : : \
-		"m" (p1[i / sizeof(*p1)])); \
-	asm volatile("vxorps %0, %%ymm" #reg ", %%ymm" #reg : : \
-		"m" (p0[i / sizeof(*p0)])); \
-	asm volatile("vmovdqa %%ymm" #reg ", %0" : \
-		"=m" (p0[i / sizeof(*p0)])); \
-} while (0);
-
-		BLOCK16()
-
-		p0 = (unsigned long *)((uintptr_t)p0 + 512);
-		p1 = (unsigned long *)((uintptr_t)p1 + 512);
-		p2 = (unsigned long *)((uintptr_t)p2 + 512);
-		p3 = (unsigned long *)((uintptr_t)p3 + 512);
-		p4 = (unsigned long *)((uintptr_t)p4 + 512);
-	}
-}
-
-DO_XOR_BLOCKS(avx_inner, xor_avx_2, xor_avx_3, xor_avx_4, xor_avx_5);
-
 static void xor_gen_avx(void *dest, void **srcs, unsigned int src_cnt,
 			unsigned int bytes)
 {
+	u8 **v = (u8 **)srcs;
+	u8 *p = dest;
+	unsigned int i, d;
+
 	kernel_fpu_begin();
-	xor_gen_avx_inner(dest, srcs, src_cnt, bytes);
+	for (i = 0; i < bytes; i += 64) {
+		asm volatile ("vmovdqa %0,%%ymm0" : : "m" (p[i]));
+		asm volatile ("vmovdqa %0,%%ymm1" : : "m" (p[i + 32]));
+		for (d = 0; d < src_cnt; ++d) {
+			asm volatile ("vpxor %0,%%ymm0,%%ymm0"
+				: : "m" (v[d][i]));
+			asm volatile ("vpxor %0,%%ymm1,%%ymm1"
+				: : "m" (v[d][i + 32]));
+		}
+		asm volatile ("vmovntdq %%ymm0,%0" : "=m" (p[i]));
+		asm volatile ("vmovntdq %%ymm1,%0" : "=m" (p[i + 32]));
+	}
 	kernel_fpu_end();
 }
 
diff --git a/lib/raid/xor/x86/xor-avx512.c b/lib/raid/xor/x86/xor-avx512.c
new file mode 100644
index 000000000000..9b323a0e1821
--- /dev/null
+++ b/lib/raid/xor/x86/xor-avx512.c
@@ -0,0 +1,34 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2026 Andrea Mazzoleni
+ */
+#include <asm/fpu/api.h>
+#include "xor_impl.h"
+#include "xor_arch.h"
+
+static void xor_gen_avx512bw(void *dest, void **srcs, unsigned int src_cnt,
+		unsigned int bytes)
+{
+	unsigned int last = src_cnt - 1, i, d;
+	u8 **v = (u8 **)srcs;
+	u8 *p = dest;
+
+	kernel_fpu_begin();
+	for (i = 0; i < bytes; i += 64) {
+		asm volatile("vmovdqa64 %0,%%zmm0" : : "m" (p[i]));
+		for (d = 0; d < last; d += 2)
+			asm volatile("vmovdqa64 %0,%%zmm1\n\t"
+				     "vpternlogq $0x96,%1,%%zmm1,%%zmm0"
+				     : : "m" (v[d][i]), "m" (v[d + 1][i]));
+		if (d == last)
+			asm volatile("vpxorq %0,%%zmm0,%%zmm0"
+				     : : "m" (v[last][i]));
+		asm volatile("vmovntdq %%zmm0,%0" : "=m" (p[i]));
+	}
+	kernel_fpu_end();
+}
+
+struct xor_block_template xor_block_avx512bw = {
+	.name		= "avx512bw",
+	.xor_gen	= xor_gen_avx512bw,
+};
diff --git a/lib/raid/xor/x86/xor_arch.h b/lib/raid/xor/x86/xor_arch.h
index 99fe85a213c6..73c81221fc01 100644
--- a/lib/raid/xor/x86/xor_arch.h
+++ b/lib/raid/xor/x86/xor_arch.h
@@ -6,6 +6,7 @@ extern struct xor_block_template xor_block_p5_mmx;
 extern struct xor_block_template xor_block_sse;
 extern struct xor_block_template xor_block_sse_pf64;
 extern struct xor_block_template xor_block_avx;
+extern struct xor_block_template xor_block_avx512bw;
 
 /*
  * When SSE is available, use it as it can write around L2.  We may also be able
@@ -20,7 +21,12 @@ static __always_inline void __init arch_xor_init(void)
 {
 	if (boot_cpu_has(X86_FEATURE_AVX) &&
 	    boot_cpu_has(X86_FEATURE_OSXSAVE)) {
-		xor_force(&xor_block_avx);
+		if (boot_cpu_has(X86_FEATURE_AVX2) &&
+		    boot_cpu_has(X86_FEATURE_AVX512F) &&
+		    boot_cpu_has(X86_FEATURE_AVX512BW))
+			xor_force(&xor_block_avx512bw);
+		else
+			xor_force(&xor_block_avx);
 	} else if (IS_ENABLED(CONFIG_X86_64) || boot_cpu_has(X86_FEATURE_XMM)) {
 		xor_register(&xor_block_sse);
 		xor_register(&xor_block_sse_pf64);

^ permalink raw reply related
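
For readers following the loop structure in xor_gen_avx512bw() above: the
inner loop folds sources into the accumulator two at a time with a
three-input XOR, then handles an odd leftover source with a plain XOR.  A
minimal scalar sketch of that control flow follows.  It is a userspace
model, not the patch's code: xor3() and xor_gen_model() are hypothetical
names, buffers are counted in 64-bit words, and the src_cnt == 0 guard is
an addition of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar stand-in for a three-input XOR (what vpternlogq $0x96 computes). */
static uint64_t xor3(uint64_t a, uint64_t b, uint64_t c)
{
	return a ^ b ^ c;
}

/*
 * Hypothetical scalar model: XOR src_cnt source buffers into dest,
 * consuming sources in pairs and handling an odd leftover separately.
 * 'words' is the length of every buffer in 64-bit words.
 */
static void xor_gen_model(uint64_t *dest, uint64_t *const *srcs,
			  unsigned int src_cnt, size_t words)
{
	unsigned int last, d;
	size_t i;

	if (src_cnt == 0)	/* guard added for the sketch only */
		return;
	last = src_cnt - 1;

	for (i = 0; i < words; i++) {
		uint64_t acc = dest[i];

		/* Pairs of sources: one three-input XOR per pair. */
		for (d = 0; d < last; d += 2)
			acc = xor3(acc, srcs[d][i], srcs[d + 1][i]);
		/* Odd source count: one source is left over. */
		if (d == last)
			acc ^= srcs[last][i];
		dest[i] = acc;
	}
}
```

With an even src_cnt the pair loop exits with d == src_cnt, so the
leftover branch is skipped; with an odd src_cnt it exits with d == last.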

* [PATCH] lib/raid/xor: x86: Add AVX-512 optimized xor_gen()
From: Eric Biggers @ 2026-06-12  4:40 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: Christoph Hellwig, linux-crypto, x86, Eric Biggers

Add an implementation of xor_gen() using AVX-512.

It uses 512-bit vectors, i.e. ZMM registers.  It also uses the
vpternlogq instruction to do three-input XORs when applicable.

It's enabled on x86_64 CPUs that have AVX512F && !PREFER_YMM.  In
practice that means:

    - AMD Zen 4 and later (client and server)
    - Intel Sapphire Rapids and later (server)
    - Intel Rocket Lake (client)
    - Intel Nova Lake and later (client)

The !PREFER_YMM condition excludes the older AVX-512 implementations in
Intel Skylake Server and Intel Ice Lake.  They could run this code, but
they're known to have overly-eager downclocking when ZMM registers are
used.  This is the same policy that the crypto and CRC code uses.

Benchmark on AMD Ryzen 9 9950X (Zen 5):

    src_cnt    avx2         avx512       Improvement
    =======    ==========   ==========   ===========
    1          68423 MB/s   81940 MB/s   19%
    2          56035 MB/s   74112 MB/s   32%
    3          49396 MB/s   67011 MB/s   35%
    4          43056 MB/s   60823 MB/s   41%

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
---
 lib/raid/xor/Makefile         |   2 +-
 lib/raid/xor/x86/xor-avx512.c | 155 ++++++++++++++++++++++++++++++++++
 lib/raid/xor/x86/xor_arch.h   |  27 +++---
 3 files changed, 172 insertions(+), 12 deletions(-)
 create mode 100644 lib/raid/xor/x86/xor-avx512.c

diff --git a/lib/raid/xor/Makefile b/lib/raid/xor/Makefile
index 4d633dfd5b90..4af945861a51 100644
--- a/lib/raid/xor/Makefile
+++ b/lib/raid/xor/Makefile
@@ -26,11 +26,11 @@ xor-$(CONFIG_ALTIVEC)		+= powerpc/xor_vmx.o powerpc/xor_vmx_glue.o
 xor-$(CONFIG_RISCV_ISA_V)	+= riscv/xor.o riscv/xor-glue.o
 xor-$(CONFIG_SPARC32)		+= sparc/xor-sparc32.o
 xor-$(CONFIG_SPARC64)		+= sparc/xor-sparc64.o sparc/xor-sparc64-glue.o
 xor-$(CONFIG_S390)		+= s390/xor.o
 xor-$(CONFIG_X86_32)		+= x86/xor-avx.o x86/xor-sse.o x86/xor-mmx.o
-xor-$(CONFIG_X86_64)		+= x86/xor-avx.o x86/xor-sse.o
+xor-$(CONFIG_X86_64)		+= x86/xor-avx.o x86/xor-sse.o x86/xor-avx512.o
 obj-y				+= tests/
 
 CFLAGS_arm/xor-neon.o		+= $(CC_FLAGS_FPU)
 CFLAGS_REMOVE_arm/xor-neon.o	+= $(CC_FLAGS_NO_FPU)
 
diff --git a/lib/raid/xor/x86/xor-avx512.c b/lib/raid/xor/x86/xor-avx512.c
new file mode 100644
index 000000000000..d2b54aa2be98
--- /dev/null
+++ b/lib/raid/xor/x86/xor-avx512.c
@@ -0,0 +1,155 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * AVX-512 optimized implementation of xor_gen()
+ *
+ * Copyright 2026 Google LLC
+ */
+
+#include <linux/compiler.h>
+#include <linux/types.h>
+#include <asm/fpu/api.h>
+#include "xor_impl.h"
+#include "xor_arch.h"
+
+struct block64 {
+	u8 x[64];
+} __aligned(64);
+
+/*
+ * Use different registers for each unrolled iteration just in case it helps,
+ * though the hardware register renamer should make it unnecessary.
+ */
+
+#define DO_XOR2(i, reg0)                                   \
+	asm volatile("vmovdqa64 %0, %%" reg0 "\n"          \
+		     "vpxorq %1, %%" reg0 ", %%" reg0 "\n" \
+		     "vmovdqa64 %%" reg0 ", %0\n"          \
+		     : "+m"(p0[i])                         \
+		     : "m"(p1[i]))
+
+#define DO_XOR3(i, reg0, reg1)                                        \
+	asm volatile("vmovdqa64 %0, %%" reg0 "\n"                     \
+		     "vmovdqa64 %1, %%" reg1 "\n"                     \
+		     "vpternlogq $0x96, %2, %%" reg1 ", %%" reg0 "\n" \
+		     "vmovdqa64 %%" reg0 ", %0\n"                     \
+		     : "+m"(p0[i])                                    \
+		     : "m"(p1[i]), "m"(p2[i]))
+
+#define DO_XOR4(i, reg0, reg1)                                        \
+	asm volatile("vmovdqa64 %0, %%" reg0 "\n"                     \
+		     "vmovdqa64 %1, %%" reg1 "\n"                     \
+		     "vpxorq %2, %%" reg0 ", %%" reg0 "\n"            \
+		     "vpternlogq $0x96, %3, %%" reg1 ", %%" reg0 "\n" \
+		     "vmovdqa64 %%" reg0 ", %0\n"                     \
+		     : "+m"(p0[i])                                    \
+		     : "m"(p1[i]), "m"(p2[i]), "m"(p3[i]))
+
+#define DO_XOR5(i, reg0, reg1)                                        \
+	asm volatile("vmovdqa64 %0, %%" reg0 "\n"                     \
+		     "vmovdqa64 %1, %%" reg1 "\n"                     \
+		     "vpternlogq $0x96, %2, %%" reg1 ", %%" reg0 "\n" \
+		     "vmovdqa64 %3, %%" reg1 "\n"                     \
+		     "vpternlogq $0x96, %4, %%" reg1 ", %%" reg0 "\n" \
+		     "vmovdqa64 %%" reg0 ", %0\n"                     \
+		     : "+m"(p0[i])                                    \
+		     : "m"(p1[i]), "m"(p2[i]), "m"(p3[i]), "m"(p4[i]))
+
+static void xor_avx512_2(size_t bytes, struct block64 *p0,
+			 const struct block64 *p1)
+{
+	do {
+		DO_XOR2(0, "zmm0");
+		DO_XOR2(1, "zmm1");
+		DO_XOR2(2, "zmm2");
+		DO_XOR2(3, "zmm3");
+		DO_XOR2(4, "zmm4");
+		DO_XOR2(5, "zmm5");
+		DO_XOR2(6, "zmm6");
+		DO_XOR2(7, "zmm7");
+		p0 += 512 / sizeof(*p0);
+		p1 += 512 / sizeof(*p1);
+		bytes -= 512;
+	} while (bytes);
+}
+
+static void xor_avx512_3(size_t bytes, struct block64 *p0,
+			 const struct block64 *p1, const struct block64 *p2)
+{
+	do {
+		DO_XOR3(0, "zmm0", "zmm1");
+		DO_XOR3(1, "zmm2", "zmm3");
+		DO_XOR3(2, "zmm4", "zmm5");
+		DO_XOR3(3, "zmm6", "zmm7");
+		DO_XOR3(4, "zmm8", "zmm9");
+		DO_XOR3(5, "zmm10", "zmm11");
+		DO_XOR3(6, "zmm12", "zmm13");
+		DO_XOR3(7, "zmm14", "zmm15");
+		p0 += 512 / sizeof(*p0);
+		p1 += 512 / sizeof(*p1);
+		p2 += 512 / sizeof(*p2);
+		bytes -= 512;
+	} while (bytes);
+}
+
+static void xor_avx512_4(size_t bytes, struct block64 *p0,
+			 const struct block64 *p1, const struct block64 *p2,
+			 const struct block64 *p3)
+{
+	do {
+		DO_XOR4(0, "zmm0", "zmm1");
+		DO_XOR4(1, "zmm2", "zmm3");
+		DO_XOR4(2, "zmm4", "zmm5");
+		DO_XOR4(3, "zmm6", "zmm7");
+		DO_XOR4(4, "zmm8", "zmm9");
+		DO_XOR4(5, "zmm10", "zmm11");
+		DO_XOR4(6, "zmm12", "zmm13");
+		DO_XOR4(7, "zmm14", "zmm15");
+		p0 += 512 / sizeof(*p0);
+		p1 += 512 / sizeof(*p1);
+		p2 += 512 / sizeof(*p2);
+		p3 += 512 / sizeof(*p3);
+		bytes -= 512;
+	} while (bytes);
+}
+
+static void xor_avx512_5(size_t bytes, struct block64 *p0,
+			 const struct block64 *p1, const struct block64 *p2,
+			 const struct block64 *p3, const struct block64 *p4)
+{
+	do {
+		DO_XOR5(0, "zmm0", "zmm1");
+		DO_XOR5(1, "zmm2", "zmm3");
+		DO_XOR5(2, "zmm4", "zmm5");
+		DO_XOR5(3, "zmm6", "zmm7");
+		DO_XOR5(4, "zmm8", "zmm9");
+		DO_XOR5(5, "zmm10", "zmm11");
+		DO_XOR5(6, "zmm12", "zmm13");
+		DO_XOR5(7, "zmm14", "zmm15");
+		p0 += 512 / sizeof(*p0);
+		p1 += 512 / sizeof(*p1);
+		p2 += 512 / sizeof(*p2);
+		p3 += 512 / sizeof(*p3);
+		p4 += 512 / sizeof(*p4);
+		bytes -= 512;
+	} while (bytes);
+}
+
+DO_XOR_BLOCKS(avx512_inner, xor_avx512_2, xor_avx512_3, xor_avx512_4,
+	      xor_avx512_5);
+
+/*
+ * Preconditions: bytes is a nonzero multiple of 512, and all buffers are
+ * 64-byte aligned.
+ */
+static void xor_gen_avx512(void *dest, void **srcs, unsigned int src_cnt,
+			   unsigned int bytes)
+{
+	kernel_fpu_begin();
+	xor_gen_avx512_inner(dest, srcs, src_cnt, bytes);
+	kernel_fpu_end();
+}
+
+struct xor_block_template xor_block_avx512 = {
+	.name = "avx512",
+	.xor_gen = xor_gen_avx512,
+};
diff --git a/lib/raid/xor/x86/xor_arch.h b/lib/raid/xor/x86/xor_arch.h
index 99fe85a213c6..199124e32c27 100644
--- a/lib/raid/xor/x86/xor_arch.h
+++ b/lib/raid/xor/x86/xor_arch.h
@@ -1,29 +1,34 @@
 /* SPDX-License-Identifier: GPL-2.0-or-later */
 #include <asm/cpufeature.h>
+#include <asm/fpu/api.h>
 
 extern struct xor_block_template xor_block_pII_mmx;
 extern struct xor_block_template xor_block_p5_mmx;
 extern struct xor_block_template xor_block_sse;
 extern struct xor_block_template xor_block_sse_pf64;
 extern struct xor_block_template xor_block_avx;
+extern struct xor_block_template xor_block_avx512;
 
-/*
- * When SSE is available, use it as it can write around L2.  We may also be able
- * to load into the L1 only depending on how the cpu deals with a load to a line
- * that is being prefetched.
- *
- * When AVX2 is available, force using it as it is better by all measures.
- *
- * 32-bit without MMX can fall back to the generic routines.
- */
 static __always_inline void __init arch_xor_init(void)
 {
-	if (boot_cpu_has(X86_FEATURE_AVX) &&
-	    boot_cpu_has(X86_FEATURE_OSXSAVE)) {
+	if (IS_ENABLED(CONFIG_X86_64) && boot_cpu_has(X86_FEATURE_AVX512F) &&
+	    !boot_cpu_has(X86_FEATURE_PREFER_YMM) &&
+	    cpu_has_xfeatures(XFEATURE_MASK_AVX512, NULL)) {
+		/* AVX-512 will be the best; no need to try others. */
+		/* !PREFER_YMM excludes CPUs with overly-eager downclocking. */
+		xor_force(&xor_block_avx512);
+	} else if (boot_cpu_has(X86_FEATURE_AVX) &&
+		   boot_cpu_has(X86_FEATURE_OSXSAVE)) {
+		/* AVX will be the best; no need to try others. */
 		xor_force(&xor_block_avx);
 	} else if (IS_ENABLED(CONFIG_X86_64) || boot_cpu_has(X86_FEATURE_XMM)) {
+		/*
+		 * When SSE is available, use it as it can write around L2.  We
+		 * may also be able to load into the L1 only depending on how
+		 * the cpu deals with a load to a line that is being prefetched.
+		 */
 		xor_register(&xor_block_sse);
 		xor_register(&xor_block_sse_pf64);
 	} else if (boot_cpu_has(X86_FEATURE_MMX)) {
 		xor_register(&xor_block_pII_mmx);
 		xor_register(&xor_block_p5_mmx);

base-commit: 9716c086c8e8b141d35aa61f2e96a2e83de212a7
-- 
2.54.0


^ permalink raw reply related
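
As background on the $0x96 immediate used with vpternlogq in the patch
above: the instruction treats its 8-bit immediate as a truth table indexed
by the three input bits, and 0x96 is the table for a ^ b ^ c.  The sketch
below is a userspace illustration under that assumption; ternlog64() and
xor3_imm8() are hypothetical helper names, not kernel functions.

```c
#include <stdint.h>

/*
 * Bitwise model of a three-input lookup: for every bit position, the
 * result bit is imm8[(a << 2) | (b << 1) | c].
 */
static uint64_t ternlog64(uint64_t a, uint64_t b, uint64_t c, uint8_t imm8)
{
	uint64_t r = 0;
	int bit;

	for (bit = 0; bit < 64; bit++) {
		unsigned int idx = (unsigned int)((((a >> bit) & 1) << 2) |
						  (((b >> bit) & 1) << 1) |
						  ((c >> bit) & 1));

		r |= (uint64_t)((imm8 >> idx) & 1) << bit;
	}
	return r;
}

/* Derive the immediate for f(a, b, c) = a ^ b ^ c from its truth table. */
static uint8_t xor3_imm8(void)
{
	uint8_t imm = 0;
	unsigned int idx;

	for (idx = 0; idx < 8; idx++) {
		unsigned int a = (idx >> 2) & 1;
		unsigned int b = (idx >> 1) & 1;
		unsigned int c = idx & 1;

		imm |= (uint8_t)((a ^ b ^ c) << idx);
	}
	return imm;
}
```

The set bits are at indices 1, 2, 4 and 7 (the inputs with odd parity),
which gives 2 + 4 + 16 + 128 = 0x96.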

* Re: [PATCH 1/2] crypto: qce: Fix xts-aes-qce for weak keys
From: Herbert Xu @ 2026-06-12  3:45 UTC (permalink / raw)
  To: Dmitry Baryshkov
  Cc: Kuldeep Singh, Thara Gopinath, David S. Miller,
	Bartosz Golaszewski, Eric Biggers, Thara Gopinath, linux-crypto,
	linux-arm-msm, linux-kernel
In-Reply-To: <533motquixnbence674lawbnlnxevcrcnysymwncjis46j5uoq@wcemraangg63>

On Fri, Jun 12, 2026 at 03:40:49AM +0300, Dmitry Baryshkov wrote:
>
> > Fix xts-aes-qce behavior by using generic helper xts_verify_key() to
> > reject keys early with -EINVAL for FIPS mode active(or FORBID_WEAK_KEYS
> > set). For non-FIPS mode, since QCE hardware cannot accept the keys, use
> > software fallback mechanism to encrypt the data.
> 
> No, if it is a hardware driver, there should be no software fallback.

The driver must support everything that the software implementation
supports.  So if the hardware can't do something, it has to use a
fallback.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 0/4] Xilinx TRNG fix and simplification
From: Herbert Xu @ 2026-06-12  1:58 UTC (permalink / raw)
  To: Eric Biggers
  Cc: linux-crypto, linux-kernel, Mounika Botcha, Harsh Jain,
	Olivia Mackall, Michal Simek, linux-arm-kernel
In-Reply-To: <20260611204702.GB1747@quark>

On Thu, Jun 11, 2026 at 01:47:02PM -0700, Eric Biggers wrote:
>
> Can you re-add the following to "hwrng: xilinx - Move xilinx-rng into
> drivers/char/hw_random/"?  It seems you applied this before the qcom-rng
> series, then dropped the drivers/char/hw_random/Makefile change rather
> than resolve it.
> 
> diff --git a/drivers/char/hw_random/Makefile b/drivers/char/hw_random/Makefile
> index 3e655d6e116b..95b5adb49560 100644
> --- a/drivers/char/hw_random/Makefile
> +++ b/drivers/char/hw_random/Makefile
> @@ -51,5 +51,6 @@ obj-$(CONFIG_HW_RANDOM_XIPHERA) += xiphera-trng.o
>  obj-$(CONFIG_HW_RANDOM_ARM_SMCCC_TRNG) += arm_smccc_trng.o
>  obj-$(CONFIG_HW_RANDOM_CN10K) += cn10k-rng.o
>  obj-$(CONFIG_HW_RANDOM_POLARFIRE_SOC) += mpfs-rng.o
>  obj-$(CONFIG_HW_RANDOM_ROCKCHIP) += rockchip-rng.o
>  obj-$(CONFIG_HW_RANDOM_JH7110) += jh7110-trng.o
> +obj-$(CONFIG_HW_RANDOM_XILINX) += xilinx-trng.o

Thanks for checking.  It should be fixed now.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 0/2] Fix Qualcomm Crypto engine self tests failures
From: Dmitry Baryshkov @ 2026-06-12  0:43 UTC (permalink / raw)
  To: Kuldeep Singh
  Cc: Eric Biggers, Bartosz Golaszewski, Thara Gopinath, Herbert Xu,
	David S. Miller, Thara Gopinath, linux-crypto, linux-arm-msm,
	linux-kernel
In-Reply-To: <1abc518e-e24e-44ff-9b15-1766dcecd8a2@oss.qualcomm.com>

On Thu, Jun 11, 2026 at 03:17:24PM +0530, Kuldeep Singh wrote:
> On 11-06-2026 00:12, Eric Biggers wrote:
> > On Wed, Jun 10, 2026 at 11:24:03AM +0530, Kuldeep Singh wrote:
> >> Steps followed:
> >>   - Enable EXPERT and CRYPTO_SELFTESTS config.
> > 
> > So the full tests (CRYPTO_SELFTESTS_FULL) still haven't been run?
> 
> Crypto_selftests was only run as there's some discussion ongoing with
> Bartosz on removal of deprecated/unsafe algos.

pointer?

> 
> Seems Bartosz will be sending patches for algorithm removal changes.
> The rest relevant selftests issues we'll fix accordingly.

So, the old kernels will remain broken? Or do we expect to backport the
cipher removal patches too?

-- 
With best wishes
Dmitry

^ permalink raw reply

* Re: [PATCH 1/2] crypto: qce: Fix xts-aes-qce for weak keys
From: Dmitry Baryshkov @ 2026-06-12  0:40 UTC (permalink / raw)
  To: Kuldeep Singh
  Cc: Thara Gopinath, Herbert Xu, David S. Miller, Bartosz Golaszewski,
	Eric Biggers, Thara Gopinath, linux-crypto, linux-arm-msm,
	linux-kernel
In-Reply-To: <20260610-qce_selftest_fix-v1-1-1b0504783a46@oss.qualcomm.com>

On Wed, Jun 10, 2026 at 11:24:04AM +0530, Kuldeep Singh wrote:
> The QCE hardware does not support AES XTS mode when key1 and key2 are
> equal. The driver was handling this by unconditionally rejecting the
> keys with -ENOKEY(-126), regardless of whether FIPS mode is active or
> the FORBID_WEAK_KEYS flag is set.
> [    5.599170] alg: skcipher: xts-aes-qce setkey failed on test vector 0; expected_error=0, actual_error=-126, flags=0x1
> [    5.599184] alg: self-tests for xts(aes) using xts-aes-qce failed (rc=-126)
> 
> In general for weak keys,
> - If FIPS mode is active or FORBID_WEAK_KEYS is set: return -EINVAL.
> - In non-FIPS mode, Accept the key and encrypt successfully.
> 
> Since QCE was returning -ENOKEY for non-FIPS mode whereas the
> expectation is to encrypt content and return success, the selftest saw a
> mismatch and failed.
> 
> There are two problems in QCE behavior:
>   * -ENOKEY is returned instead of -EINVAL for the FIPS/weak-key
>     rejection case.
>   * key1 == key2 is rejected even in non-FIPS mode

Please rewrite this commit message as plain English prose rather than a
mix of different bullet lists. For example:

QCE hardware can't support the insecure setup of the AES XTS cipher
mode, where key1 and key2 are equal. Currently the driver unconditionally
returns -ENOKEY, while the rest of the system expects to get -EINVAL in
FIPS mode or if FORBID_WEAK_KEYS is true. Correct the driver to return
-EINVAL instead of -ENOKEY.

Then another commit to crypto testmgr to let crypto drivers fail for
AES-XTS (and also another commit with docs update).

> 
> Fix xts-aes-qce behavior by using generic helper xts_verify_key() to
> reject keys early with -EINVAL for FIPS mode active(or FORBID_WEAK_KEYS
> set). For non-FIPS mode, since QCE hardware cannot accept the keys, use
> software fallback mechanism to encrypt the data.

No, if it is a hardware driver, there should be no software fallback.

> 
> Fixes: f0d078dd6c49 ("crypto: qce - Return unsupported if key1 and key 2 are same for AES XTS algorithm")
> Signed-off-by: Kuldeep Singh <kuldeep.singh@oss.qualcomm.com>
> ---
>  drivers/crypto/qce/cipher.h   |  1 +
>  drivers/crypto/qce/skcipher.c | 20 +++++++++++++-------
>  2 files changed, 14 insertions(+), 7 deletions(-)
> 

-- 
With best wishes
Dmitry

^ permalink raw reply
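
The weak-key policy described in the quoted commit message (reject with
-EINVAL in FIPS mode or when FORBID_WEAK_KEYS is set, otherwise accept)
can be modelled in a few lines.  This is a hedged userspace sketch, not
the driver code or the kernel helper: xts_key_policy() and
xts_needs_fallback() are hypothetical names, and memcmp() stands in for
the constant-time comparison a kernel implementation would be expected to
use.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model of the policy: an XTS key is two equal-length
 * halves, and identical halves make it "weak".
 * Returns 0 if the key may be used, -EINVAL if it must be rejected.
 */
static int xts_key_policy(const unsigned char *key, size_t keylen,
			  bool fips_mode, bool forbid_weak_keys)
{
	if (keylen == 0 || keylen % 2)
		return -EINVAL;

	/* Weak keys are rejected only when one of the two switches is on. */
	if ((fips_mode || forbid_weak_keys) &&
	    memcmp(key, key + keylen / 2, keylen / 2) == 0)
		return -EINVAL;

	return 0;
}

/*
 * True when an accepted key has identical halves, i.e. the case the
 * thread says the QCE hardware cannot process itself.
 */
static bool xts_needs_fallback(const unsigned char *key, size_t keylen)
{
	return memcmp(key, key + keylen / 2, keylen / 2) == 0;
}
```

Whether the accepted-but-weak case should go to a software fallback or be
handled some other way is exactly what the thread is debating; the second
helper only identifies that case.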

* [PATCH] crypto: atmel-ecc - reject hardware ECDH without a public key
From: Thorsten Blum @ 2026-06-11 21:36 UTC (permalink / raw)
  To: Thorsten Blum, Herbert Xu, David S. Miller, Nicolas Ferre,
	Alexandre Belloni, Claudiu Beznea, Tudor Ambarus
  Cc: linux-crypto, linux-arm-kernel, linux-kernel

The hardware ECDH path in atmel_ecdh_compute_shared_secret() uses the
private key stored in the device. However, the public key is cached only
after atmel_ecdh_set_secret() successfully generated that private key
for the current tfm.

atmel_ecdh_generate_public_key() already rejects requests when no public
key is cached. Add the same check to atmel_ecdh_compute_shared_secret()
to prevent the device from using a private key that was not generated
for the current tfm.

Fixes: 11105693fa05 ("crypto: atmel-ecc - introduce Microchip / Atmel ECC driver")
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
---
 drivers/crypto/atmel-ecc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/crypto/atmel-ecc.c b/drivers/crypto/atmel-ecc.c
index 93f219558c2f..542c8cc13a0f 100644
--- a/drivers/crypto/atmel-ecc.c
+++ b/drivers/crypto/atmel-ecc.c
@@ -173,6 +173,9 @@ static int atmel_ecdh_compute_shared_secret(struct kpp_request *req)
 		return crypto_kpp_compute_shared_secret(req);
 	}
 
+	if (!ctx->public_key)
+		return -EINVAL;
+
 	/* must have exactly two points to be on the curve */
 	if (req->src_len != ATMEL_ECC_PUBKEY_SIZE)
 		return -EINVAL;

^ permalink raw reply related

* Re: [PATCH 0/4] Xilinx TRNG fix and simplification
From: Eric Biggers @ 2026-06-11 20:47 UTC (permalink / raw)
  To: Herbert Xu
  Cc: linux-crypto, linux-kernel, Mounika Botcha, Harsh Jain,
	Olivia Mackall, Michal Simek, linux-arm-kernel
In-Reply-To: <aip2l1pwMY4UDBdA@gondor.apana.org.au>

On Thu, Jun 11, 2026 at 04:49:27PM +0800, Herbert Xu wrote:
> On Sun, May 31, 2026 at 12:17:34PM -0700, Eric Biggers wrote:
> > This series fixes and greatly simplifies the Xilinx TRNG driver by:
> > 
> > - Removing the gratuitous crypto_rng interface, leaving just hwrng which
> >   is the one that actually matters.
> > 
> > - Replacing the really complicated AES based entropy extraction
> >   algorithm with a much simpler one.
> > 
> > Note that this mirrors similar changes in other drivers.
> > 
> > Eric Biggers (4):
> >   crypto: xilinx-trng - Remove crypto_rng interface
> >   crypto: xilinx-trng - Fix return value of xtrng_hwrng_trng_read()
> >   crypto: xilinx-trng - Replace crypto_drbg_ctr_df() with HMAC-SHA512
> >   hwrng: xilinx - Move xilinx-rng into drivers/char/hw_random/
> > 
> >  MAINTAINERS                                   |   2 +-
> >  arch/arm64/configs/defconfig                  |   2 +-
> >  crypto/Kconfig                                |   5 -
> >  crypto/Makefile                               |   2 -
> >  crypto/df_sp80090a.c                          | 222 ------------------
> >  drivers/char/hw_random/Kconfig                |  11 +
> >  drivers/char/hw_random/Makefile               |   1 +
> >  .../xilinx => char/hw_random}/xilinx-trng.c   | 134 ++---------
> >  drivers/crypto/Kconfig                        |  13 -
> >  drivers/crypto/xilinx/Makefile                |   1 -
> >  include/crypto/df_sp80090a.h                  |  53 -----
> >  11 files changed, 37 insertions(+), 409 deletions(-)
> >  delete mode 100644 crypto/df_sp80090a.c
> >  rename drivers/{crypto/xilinx => char/hw_random}/xilinx-trng.c (75%)
> >  delete mode 100644 include/crypto/df_sp80090a.h
> > 
> > 
> > base-commit: 5624ea54f3ba5c83d2e5503411a31a8be0278c1e
> > prerequisite-patch-id: 07e982b663ac3f8312ca524f6b91b5b38661df5e
> > prerequisite-patch-id: 72064361a8f36e015ab0b7e1fa4d364b40d90506
> > prerequisite-patch-id: 8978b8e0db7f47935e5f6f0aff14a97f55d3073c
> > prerequisite-patch-id: 6aa0e3e93a008279d71e535a3d0cf48643f55e19
> > -- 
> > 2.54.0
> 
> All applied.  Thanks.

Can you re-add the following to "hwrng: xilinx - Move xilinx-rng into
drivers/char/hw_random/"?  It seems you applied this before the qcom-rng
series, then dropped the drivers/char/hw_random/Makefile change rather
than resolve it.

diff --git a/drivers/char/hw_random/Makefile b/drivers/char/hw_random/Makefile
index 3e655d6e116b..95b5adb49560 100644
--- a/drivers/char/hw_random/Makefile
+++ b/drivers/char/hw_random/Makefile
@@ -51,5 +51,6 @@ obj-$(CONFIG_HW_RANDOM_XIPHERA) += xiphera-trng.o
 obj-$(CONFIG_HW_RANDOM_ARM_SMCCC_TRNG) += arm_smccc_trng.o
 obj-$(CONFIG_HW_RANDOM_CN10K) += cn10k-rng.o
 obj-$(CONFIG_HW_RANDOM_POLARFIRE_SOC) += mpfs-rng.o
 obj-$(CONFIG_HW_RANDOM_ROCKCHIP) += rockchip-rng.o
 obj-$(CONFIG_HW_RANDOM_JH7110) += jh7110-trng.o
+obj-$(CONFIG_HW_RANDOM_XILINX) += xilinx-trng.o


^ permalink raw reply related

* Re: [PATCH] lib/crypto: gf128hash: mark clmul32() as noinline_for_stack
From: Eric Biggers @ 2026-06-11 20:06 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jason A. Donenfeld, Ard Biesheuvel, Nathan Chancellor,
	Arnd Bergmann, Nick Desaulniers, Bill Wendling, Justin Stitt,
	linux-crypto, linux-kernel, llvm
In-Reply-To: <20260611125952.3387258-1-arnd@kernel.org>

On Thu, Jun 11, 2026 at 02:59:39PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> 
> During randconfig testing, I came across a lot of warnings for the newly
> added carryless multiplication function triggering excessive stack usage
> from spilling temporary variables to the stack:
> 
> lib/crypto/gf128hash.c:166:1: error: stack frame size (1192) exceeds limit (1024) in 'polyval_mul_generic' [-Werror,-Wframe-larger-than]
> 
> In addition to the possible risk of overflowing the kernel stack,
> the generated object code surely performs very poorly.
> 
> This only happens on architectures that don't provide uint128_t
> (which should be all 32-bit architectures on modern compilers), but
> though I tested random x86 and arm configs, I only saw this with arm's
> CONFIG_THUMB2_KERNEL, which adds more pressure to the register allocator.
> 
> The testing was done using clang-22, I don't know if gcc has the same
> problem. Marking clmul32() as noinline_for_stack experimentally shows
> all of the affected builds to completely solve the problem, reducing
> the stack usage to a few bytes as expected.
> 
> Since u64 arithmetic frequently leads to compilers badly optimizing
> 32-bit targets, keeping clmul32 out of line is likely to help on
> other 32-bit configurations as well when they run into this problem,
> though it may also result in a small performance degradation in
> configurations that would benefit from inlining.
> 
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---

Applied to https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=libcrypto-next

- Eric

^ permalink raw reply

* Re: [PATCH v4] crypto/ccp: Introduce SNP_VERIFY_MITIGATION command
From: Pratik R. Sampat @ 2026-06-11 13:44 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: ashish.kalra, thomas.lendacky, john.allen, herbert, davem,
	linux-crypto, linux-kernel, aik, nikunj, michael.roth
In-Reply-To: <aihsp-uQrd2g5vJ0@tycho.pizza>



On 6/9/26 3:48 PM, Tycho Andersen wrote:
> Hi Pratik,
> 
>>
>> See SEV-SNP Firmware ABI specifications 1.58, SNP_VERIFY_MITIGATION for
>> more details.
>>
>> Signed-off-by: Pratik R. Sampat <prsampat@amd.com>
> 
> Reviewed-by: Tycho Andersen (AMD) <tycho@kernel.org>
> 
>> +	if (dst.mit_failure_status) {
>> +		dev_err(sev->dev, "Verify Mitigation - failure status: 0x%x\n",
>> +			dst.mit_failure_status);
>> +		return -EIO;
> 
> Elsewhere the CCP uses EIO to represent a failure to communicate with
> the PSP, but here things worked, it was just in an invalid state.
> Maybe worth a different errno here, -EINVAL or so.
> 

-EIO is a bit awkward here for sure. -EINVAL seems to make more sense.

Thanks!
--Pratik

^ permalink raw reply

* [PATCH] lib/crypto: gf128hash: mark clmul32() as noinline_for_stack
From: Arnd Bergmann @ 2026-06-11 12:59 UTC (permalink / raw)
  To: Eric Biggers, Jason A. Donenfeld, Ard Biesheuvel,
	Nathan Chancellor
  Cc: Arnd Bergmann, Nick Desaulniers, Bill Wendling, Justin Stitt,
	linux-crypto, linux-kernel, llvm

From: Arnd Bergmann <arnd@arndb.de>

During randconfig testing, I came across a lot of warnings for the newly
added carryless multiplication function triggering excessive stack usage
from spilling temporary variables to the stack:

lib/crypto/gf128hash.c:166:1: error: stack frame size (1192) exceeds limit (1024) in 'polyval_mul_generic' [-Werror,-Wframe-larger-than]

In addition to the possible risk of overflowing the kernel stack,
the generated object code surely performs very poorly.

This only happens on architectures that don't provide uint128_t
(which should be all 32-bit architectures on modern compilers), but
though I tested random x86 and arm configs, I only saw this with arm's
CONFIG_THUMB2_KERNEL, which adds more pressure to the register allocator.

The testing was done using clang-22; I don't know if gcc has the same
problem. Experimentally, marking clmul32() as noinline_for_stack
completely solves the problem in all of the affected builds, reducing
the stack usage to a few bytes as expected.

Since u64 arithmetic frequently leads to compilers badly optimizing
32-bit targets, keeping clmul32 out of line is likely to help on
other 32-bit configurations as well when they run into this problem,
though it may also result in a small performance degradation in
configurations that would benefit from inlining.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 lib/crypto/gf128hash.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/crypto/gf128hash.c b/lib/crypto/gf128hash.c
index 2650603d8ba8..8dcdf5ec98be 100644
--- a/lib/crypto/gf128hash.c
+++ b/lib/crypto/gf128hash.c
@@ -109,7 +109,7 @@ static void clmul64(u64 a, u64 b, u64 *out_lo, u64 *out_hi)
 #else /* CONFIG_ARCH_SUPPORTS_INT128 */
 
 /* Do a 32 x 32 => 64 bit carryless multiplication. */
-static u64 clmul32(u32 a, u32 b)
+static noinline_for_stack u64 clmul32(u32 a, u32 b)
 {
 	/*
 	 * With 32-bit multiplicands and one term every 4 bits, there are up to
-- 
2.39.5


^ permalink raw reply related
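
For context on what the patch above keeps out of line: a 32 x 32 => 64 bit
carryless multiplication is a multiplication in which partial products are
combined with XOR instead of addition.  The sketch below is a bit-by-bit
userspace reference, assuming GCC or Clang attribute syntax; it is not the
algorithm used in lib/crypto/gf128hash.c, and clmul32_ref() is a
hypothetical name.

```c
#include <stdint.h>

/*
 * Reference 32 x 32 => 64 bit carryless multiplication.  The noinline
 * attribute plays the role that noinline_for_stack plays in the kernel:
 * it keeps the helper's temporaries out of the caller's stack frame.
 */
__attribute__((noinline))
static uint64_t clmul32_ref(uint32_t a, uint32_t b)
{
	uint64_t r = 0;
	int i;

	/* XOR in a shifted copy of 'a' for every set bit of 'b'. */
	for (i = 0; i < 32; i++)
		if ((b >> i) & 1)
			r ^= (uint64_t)a << i;
	return r;
}
```

For example, clmul32_ref(3, 3) is 5 rather than 9, because
(x + 1) * (x + 1) = x^2 + 1 when coefficients are reduced modulo 2.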

