Linux cryptographic layer development

Linux cryptographic layer development
 help / color / mirror / Atom feed

* Re: [PATCH v2] lib/raid/xor: x86: Add AVX-512 optimized xor_gen()
From: David Laight @ 2026-06-15 22:57 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Andrew Morton, linux-kernel, Christoph Hellwig, linux-crypto, x86,
	linux-raid
In-Reply-To: <20260615184435.GA17731@quark>

On Mon, 15 Jun 2026 11:44:35 -0700
Eric Biggers <ebiggers@kernel.org> wrote:

> On Sun, Jun 14, 2026 at 11:16:28AM +0100, David Laight wrote:
> > On Sat, 13 Jun 2026 18:03:57 -0700
> > Eric Biggers <ebiggers@kernel.org> wrote:
...
> > Some 'not very important' comments:
> > 
> > I did wonder whether moving the loop into the asm() would help.
> > gcc has a nasty habit of pessimising loops when you try to be clever.
> > It is certainly safer for tight loops like these.  
> 
> I originally tried leaving the loops to the compiler, but gcc unrolled
> the 1x ones by 2x, despite it having no visibility into the asm block.
> That broke the intent with the indexed addressing, since to achieve the
> unrolling it generated code that incremented the pointers.

I did suspect that might happen.

> So I just ended up moving the loop to the asm, which reliably gives us
> the code we want.

Yep...

...
> > The code should be limited by the memory reads, so the 3-argument xor and
> > the interleave of the unroll may make no difference.  
> 
> The unroll by 2x in the 2 and 3-buffer cases helped a little bit on
> Sapphire Rapids.  I don't know exactly why, but it makes sense that
> those cases are where the loop overhead is most likely to matter.

Each iteration does 2 (or 3) reads and a write.
The cpu can do two reads and a write every clock.
However Intel cpu can only execute a branch every other clock,
so the shortest loop is two clocks.
That means you need need to unroll once to keep the memory logic busy.

The zen5 seems to be able to execute 1-clock loops, so wouldn't need
the unroll.

> > Some cpu do have constraints on the cache alignment in order to do two
> > reads per clock, but I've forgotten them and they got better before AVX-512.
> > If that were affecting this code (on the tested cpu) then I'd expect the
> > interleaved unroll would improve the _4 and -5 functions.
> > So it probably doesn't affect this code.  
> 
> The buffers are always 64-byte aligned here, as documented.

It is all more complex that that.
Whether you can do two reads/clock depends on whether the reads manage to
avoid needing the same buffers (etc) in the cache logic.
For instance it might not work if the addresses differ by the size of the
cache (one of Agner's books might have the answer).
(It was pretty hard to get two reads/clock on Sandy Bridge.)

Then there are some really strange effects.
On zen5 (at least on the one I've got) 'rep movsb' is very slow (setup and copy)
if (IIRC) (%di - %si) mod 4k is between 1 and 127.
The only other alignment that makes much difference is 64byte aligning %di (which
doubles throughput).

-- David

^ permalink raw reply

* Re: [PATCH v3] lib/raid/xor: x86: Add AVX-512 optimized xor_gen()
From: Borislav Petkov @ 2026-06-15 23:53 UTC (permalink / raw)
  To: Eric Biggers, Richard Weinberger
  Cc: x86, Christoph Hellwig, linux-crypto, David Laight, linux-raid,
	Andrew Morton, linux-kernel, linux-um
In-Reply-To: <20260615212922.GA28589@quark>

On Mon, Jun 15, 2026 at 02:29:22PM -0700, Eric Biggers wrote:
> On Mon, Jun 15, 2026 at 09:16:55PM +0000, Borislav Petkov wrote:
> > On June 15, 2026 8:10:50 PM UTC, Eric Biggers <ebiggers@kernel.org> wrote:
> > >
> > >But I wanted to ask: do we really care about the case where features are
> > >"supported" but their XCR0 bits aren't set?  Perhaps the kernel just
> > >doesn't/shouldn't support weird cases like "-cpu max,xsave=off"?
> > >
> > 
> > Yes, our aim is to support only configurations which are actually
> > present in real hardware and not a "oh, it would be good if it did
> > that, just because..."
> 
> Seems reasonable to me.  Would the same apply to UML here?

Good question.

Richi?

> > >If this case indeed needs to be handled, could we make things easier for
> > >the kernel's AVX and AVX-512 optimized code?  Currently AVX-512 needs:
> > >
> > >        if (boot_cpu_has(X86_FEATURE_AVX512F) &&
> > >            cpu_has_xfeatures(XFEATURE_MASK_FP | XFEATURE_MASK_SSE |
> > >                              XFEATURE_MASK_YMM | XFEATURE_MASK_AVX512, NULL))
> > >
> > >How about we make X86_FEATURE_AVX512F depend on XCR0=111xx111, and
> > >X86_FEATURE_AVX depend on XCR0=xxxxx111?  Then the cpu_has_xfeatures()
> > >check wouldn't be needed.  Is there any reason not to do that?
> > 
> >  How do you want to accomplish that? Very early during boot on the BSP
> >  you sanity-check XCR0 and clear feature flags if components are not
> >  set? 
> 
> That would be the idea.  Something similar to what
> arch/x86/kernel/cpu/cpuid-deps.c does. 

Yap.

> Except that seems to only enforce the dependencies when the kernel itself is
> disabling things; if the hypervisor is broken then it just warns.

Not the kernel's problem. We deliberately don't want to maintain a zoo of
options which are not present in real hw. If HV is doing funny things, oh
well...


> In any case, I'd like these to go away:
> 
>     $ git grep cpu_has_xfeatures | wc -l
>     31

Yeah, all in crypto. I can certainly see why.

@dhansen, any other thoughts?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply

* Re: [PATCH v3] lib/raid/xor: x86: Add AVX-512 optimized xor_gen()
From: Dave Hansen @ 2026-06-16  0:29 UTC (permalink / raw)
  To: Borislav Petkov, Eric Biggers, Richard Weinberger
  Cc: x86, Christoph Hellwig, linux-crypto, David Laight, linux-raid,
	Andrew Morton, linux-kernel, linux-um
In-Reply-To: <20260615235318.GBajCQbuy9dBgKH8L_@fat_crate.local>

On 6/15/26 16:53, Borislav Petkov wrote:
> 
>> In any case, I'd like these to go away:
>>
>>     $ git grep cpu_has_xfeatures | wc -l
>>     31
> Yeah, all in crypto. I can certainly see why.
> 
> @dhansen, any other thoughts?

If we can get rid of cpu_has_xfeatures(), I'm all for it. I'm not quite
sure how the code would look so I'm reserving judgement until I see the
patches. But it's worth a try.

^ permalink raw reply

* [PATCH] crypto: sa2ul: stop probe if context pool creation fails
From: Pengpeng Hou @ 2026-06-16  0:46 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, linux-crypto, linux-kernel; +Cc: Pengpeng Hou

sa_ul_probe() calls sa_init_mem() to create the DMA pool used for
security context buffers, but ignores its return value. If pool creation
fails, probe still continues with DMA setup, algorithm registration and
child population even though later request setup depends on that pool.

Stop probing when sa_init_mem() fails, and route that failure to the PM
cleanup path without attempting to destroy an uncreated DMA pool.

Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
---
 drivers/crypto/sa2ul.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/crypto/sa2ul.c b/drivers/crypto/sa2ul.c
index 965a03d5b27a..d865fd4a098c 100644
--- a/drivers/crypto/sa2ul.c
+++ b/drivers/crypto/sa2ul.c
@@ -2395,7 +2395,10 @@ static int sa_ul_probe(struct platform_device *pdev)
 		return ret;
 	}

-	sa_init_mem(dev_data);
+	ret = sa_init_mem(dev_data);
+	if (ret)
+		goto disable_pm;
+
 	ret = sa_dma_init(dev_data);
 	if (ret)
 		goto destroy_dma_pool;
@@ -2430,6 +2433,7 @@ static int sa_ul_probe(struct platform_device *pdev)
 destroy_dma_pool:
 	dma_pool_destroy(dev_data->sc_pool);

+disable_pm:
 	pm_runtime_put_sync(dev);
 	pm_runtime_disable(dev);

-- 
2.50.1 (Apple Git-155)

^ permalink raw reply related

* [PATCH] crypto: caam: depopulate job rings on populate failure
From: Pengpeng Hou @ 2026-06-16  0:52 UTC (permalink / raw)
  To: Horia Geantă, Pankaj Gupta, Gaurav Jain, Herbert Xu,
	David S. Miller, linux-crypto, linux-kernel
  Cc: Pengpeng Hou

devm_of_platform_populate() only registers its automatic cleanup action
after child population succeeds.  If CAAM job-ring population fails after
creating some job-ring devices, the probe error path reports the error
but leaves the partial children registered.

Explicitly depopulate the job-ring children on the populate failure path.

Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
---
 drivers/crypto/caam/ctrl.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/crypto/caam/ctrl.c b/drivers/crypto/caam/ctrl.c
index 320be5d77737..716c57b9f89b 100644
--- a/drivers/crypto/caam/ctrl.c
+++ b/drivers/crypto/caam/ctrl.c
@@ -1150,8 +1150,10 @@ static int caam_probe(struct platform_device *pdev)
 		 ctrlpriv->total_jobrs, ctrlpriv->qi_present);
 
 	ret = devm_of_platform_populate(dev);
-	if (ret)
+	if (ret) {
 		dev_err(dev, "JR platform devices creation error\n");
+		of_platform_depopulate(dev);
+	}
 
 	return ret;
 }
-- 
2.50.1 (Apple Git-155)


^ permalink raw reply related

* Re: [GIT PULL] Crypto Update for 7.2
From: pr-tracker-bot @ 2026-06-16  3:57 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Linus Torvalds, David S. Miller, Linux Kernel Mailing List,
	Linux Crypto Mailing List
In-Reply-To: <ai9zXTKk9fhCZoKd@gondor.apana.org.au>

The pull request you sent on Mon, 15 Jun 2026 11:37:01 +0800:

> git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 tags/v7.2-p1

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/0d8c1134936f1fb6678156ab4248ac740d274525

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply

* Re: [GIT PULL] CRC updates for 7.2
From: pr-tracker-bot @ 2026-06-16  3:57 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Linus Torvalds, linux-crypto, linux-arm-kernel, linux-kernel,
	Ard Biesheuvel, Christoph Hellwig
In-Reply-To: <20260615174807.GA1831@quark>

The pull request you sent on Mon, 15 Jun 2026 10:48:07 -0700:

> https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git tags/crc-for-linus

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/ef3b7426a63c930c51d60d6c2428663d52a84e2f

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply

* Re: [GIT PULL] Crypto library updates for 7.2
From: pr-tracker-bot @ 2026-06-16  3:57 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Linus Torvalds, linux-crypto, linux-kernel, Ard Biesheuvel,
	Jason A. Donenfeld, Herbert Xu, Arnd Bergmann, Christophe Leroy
In-Reply-To: <20260615175458.GB1831@quark>

The pull request you sent on Mon, 15 Jun 2026 10:54:58 -0700:

> https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git tags/libcrypto-for-linus

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/673f72944dde6614a56b37a61c6097f584c90ce6

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply

* Re: [PATCH v4 0/3] crypto: skcipher - per-request multi-data-unit batching
From: Herbert Xu @ 2026-06-16  4:13 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Leonid Ravich, Alasdair Kergon, Ard Biesheuvel, Jens Axboe,
	Horia Geanta, Gilad Ben-Yossef, linux-crypto, dm-devel,
	linux-block
In-Reply-To: <20260615225317.GB28589@quark>

On Mon, Jun 15, 2026 at 03:53:17PM -0700, Eric Biggers wrote:
>
> So in other words, this series slows down dm-crypt and crypto_skcipher
> for everyone to optimize for an out-of-tree driver.  And there's also no
> benchmark showing that your driver is even worth it over just using the
> CPU.

There is no reason why the software fallback should be slower
than the status quo.  Existing callers of the Crypto API will
be issuing one indirect function call per data unit.  With the
new scheme, the indirect calls per unit moves from from the caller
into the Crypto API.

In fact, we could move it down further and improve upon the
status quo by splitting the data in each algorithm implemntation
so that the calls per unit become direct function calls and only
the overall call into the Crypto API remains indirect.

But yes it would be nice to provide numbers for the fallback
path to verify that we didn't get this case terribly wrong.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH v4 0/3] crypto: skcipher - per-request multi-data-unit batching
From: Eric Biggers @ 2026-06-16  4:50 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Leonid Ravich, Alasdair Kergon, Ard Biesheuvel, Jens Axboe,
	Horia Geanta, Gilad Ben-Yossef, linux-crypto, dm-devel,
	linux-block
In-Reply-To: <ajDNT5jVGgRtiNH6@gondor.apana.org.au>

On Tue, Jun 16, 2026 at 12:13:03PM +0800, Herbert Xu wrote:
> On Mon, Jun 15, 2026 at 03:53:17PM -0700, Eric Biggers wrote:
> >
> > So in other words, this series slows down dm-crypt and crypto_skcipher
> > for everyone to optimize for an out-of-tree driver.  And there's also no
> > benchmark showing that your driver is even worth it over just using the
> > CPU.
> 
> There is no reason why the software fallback should be slower
> than the status quo.  Existing callers of the Crypto API will
> be issuing one indirect function call per data unit.  With the
> new scheme, the indirect calls per unit moves from from the caller
> into the Crypto API.

Have you checked the code?  This patchset adds overhead in multiple
places.  Dynamically allocating multiple scatterlists and then parsing
them, adding a new field to skcipher_request for everyone, new checks in
crypto_skcipher_en/decrypt for everyone, new checks to validate the data
unit size that the caller knew was valid in the first place, etc.

> In fact, we could move it down further and improve upon the
> status quo by splitting the data in each algorithm implemntation
> so that the calls per unit become direct function calls and only
> the overall call into the Crypto API remains indirect.

That's not what this patchset does.  But also, as we know, a better way
to eliminate "Crypto API" overhead is to call the algorithms directly.

- Eric

^ permalink raw reply

* Re: [PATCH v4 0/3] crypto: skcipher - per-request multi-data-unit batching
From: Herbert Xu @ 2026-06-16  4:53 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Leonid Ravich, Alasdair Kergon, Ard Biesheuvel, Jens Axboe,
	Horia Geanta, Gilad Ben-Yossef, linux-crypto, dm-devel,
	linux-block
In-Reply-To: <20260616045023.GA113934@sol>

On Mon, Jun 15, 2026 at 09:50:23PM -0700, Eric Biggers wrote:
>
> Have you checked the code?  This patchset adds overhead in multiple
> places.  Dynamically allocating multiple scatterlists and then parsing
> them, adding a new field to skcipher_request for everyone, new checks in
> crypto_skcipher_en/decrypt for everyone, new checks to validate the data
> unit size that the caller knew was valid in the first place, etc.

No I have not :)

I'm just stating the general principle.

Of course I will not apply the patch-set until I have reviewed it
properly.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH v2 1/8] crypto: qce - Remove unsafe/deprecated algorithms
From: Eric Biggers @ 2026-06-16  5:18 UTC (permalink / raw)
  To: Bartosz Golaszewski
  Cc: Thara Gopinath, Herbert Xu, David S. Miller, Stanimir Varbanov,
	Eneas U de Queiroz, Kuldeep Singh, linux-crypto, linux-arm-msm,
	linux-kernel, brgl, stable
In-Reply-To: <20260615-qce-fix-self-tests-v2-1-dc911f1aad42@oss.qualcomm.com>

On Mon, Jun 15, 2026 at 05:49:52PM +0200, Bartosz Golaszewski wrote:
> Remove algorithms that are either unsafe or deprecated and have no
> in-kernel users that cannot be served by the ARM CE implementations.
> 
> AES-ECB reveals plaintext patterns (identical plaintext blocks produce
> identical ciphertext blocks) and should not be exposed as a hardware-
> accelerated primitive. DES, Triple DES and HMAC-SHA1 have been
> deprecated for years.
> 
> Remove ecb(aes), cbc(des), ecb(des3_ede), cbc(des3_ede), hmac(sha1) and
> all AEAD variants built on these primitives. Also clean up the - now dead
> - code, flags and constants.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>

What is the rationale for still supporting the following?

    sha1
    ecb(des)
    authenc(hmac(sha256),cbc(des))

- Eric

^ permalink raw reply

* Re: [PATCH v8 2/7] x86/sev: Initialize RMPOPT configuration MSRs
From: K Prateek Nayak @ 2026-06-16  6:03 UTC (permalink / raw)
  To: Ashish Kalra, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, Tycho.Andersen, Nathan.Fontenot,
	ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <6a4d0ec9e037d91c0fdba9c525942ca288e1b1b2.1781419998.git.ashish.kalra@amd.com>

Hello Ashish,

On 6/16/2026 1:18 AM, Ashish Kalra wrote:
> diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
> index 8bcdce98f6dc..1b5c18408f0b 100644
> --- a/arch/x86/virt/svm/sev.c
> +++ b/arch/x86/virt/svm/sev.c
> @@ -124,6 +124,10 @@ static void *rmp_bookkeeping __ro_after_init;
>  
>  static u64 probed_rmp_base, probed_rmp_size;
>  
> +static cpumask_t rmpopt_cpumask;

nit.

I believe you can use cpumask_var_t here and do a zalloc_cpumask_var()
during snp_setup_rmpopt(). That way !X86_FEATURE_RMPOPT configs don't
have to needlessly waste space to keep a redundant cpumask around.

Same comment for rmpopt_report_cpumask in Patch 7 which can be
allocated dynamically during rmpopt_debugfs_setup().

> +static phys_addr_t rmpopt_pa_start;
> +static bool rmpopt_configured;
> +
>  static LIST_HEAD(snp_leaked_pages_list);
>  static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
>  
> @@ -490,7 +494,12 @@ static bool __init setup_rmptable(void)
>  	if (rmp_cfg & MSR_AMD64_SEG_RMP_ENABLED) {
>  		if (!setup_segmented_rmptable())
>  			return false;
> +		rmpopt_configured = true;
>  	} else {
> +		/*
> +		 * RMPOPT requires a segmented RMP table, so leave
> +		 * rmpopt_configured clear on contiguous RMP systems.
> +		 */
>  		if (!setup_contiguous_rmptable())
>  			return false;
>  	}
> @@ -555,6 +564,21 @@ int snp_prepare(void)
>  }
>  EXPORT_SYMBOL_FOR_MODULES(snp_prepare, "ccp");
>  
> +static void rmpopt_cleanup(void)
> +{
> +	int cpu;
> +
> +	cpus_read_lock();

nit.

You can use guard(cpus_read_lock)() unless there is a complicated
locking pattern where you need to drop and re-acquire the read lock.

> +
> +	for_each_cpu(cpu, &rmpopt_cpumask)
> +		WARN_ON_ONCE(wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, 0));
> +
> +	cpus_read_unlock();
> +
> +	cpumask_clear(&rmpopt_cpumask);
> +	rmpopt_pa_start = 0;
> +}
> +
>  void snp_shutdown(void)
>  {
>  	u64 syscfg;

-- 
Thanks and Regards,
Prateek


^ permalink raw reply

* Re: [PATCH v8 4/7] x86/sev: Add support to perform RMP optimizations asynchronously
From: K Prateek Nayak @ 2026-06-16  7:27 UTC (permalink / raw)
  To: Ashish Kalra, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, Tycho.Andersen, Nathan.Fontenot,
	ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <de274c2fb3f794ff1f19f0c96184ee50d04d1282.1781419998.git.ashish.kalra@amd.com>

Hello Ashish,

On 6/16/2026 1:19 AM, Ashish Kalra wrote:
> +	/*
> +	 * RMPOPT scans the RMP table, stores the result of the scan in the
> +	 * reserved processor memory. The RMP scan is the most expensive
> +	 * part. If a second RMPOPT occurs, it can skip the expensive scan
> +	 * if they can see a cached result in the reserved processor memory.
> +	 *
> +	 * Do RMPOPT on one CPU alone. Then, follow that up with RMPOPT
> +	 * on every other primary thread. Followers are "designed to"
> +	 * skip the scan if they see the "cached" scan results.
> +	 */
> +	cpumask_copy(follower_mask, &rmpopt_cpumask);

rmpopt_cpumask is constructed after hotplug is disabled but ...

> +
> +	/*
> +	 * Pin the worker to the current CPU for the leader loop so that
> +	 * this_cpu remains valid and the RMPOPT instruction executes on
> +	 * the correct CPU.
> +	 *
> +	 * Use migrate_disable() rather than get_cpu() to prevent
> +	 * migration while still allowing preemption.
> +	 */
> +	migrate_disable();
> +	this_cpu = smp_processor_id();
> +
> +	if (cpumask_test_cpu(this_cpu, follower_mask)) {
> +		/*
> +		 * Current CPU is a primary thread in rmpopt_cpumask.
> +		 * Run leader locally and remove from follower mask.
> +		 */
> +		cpumask_clear_cpu(this_cpu, follower_mask);
> +
> +		for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
> +			rmpopt(pa);
> +			cond_resched();
> +		}
> +	} else if (cpumask_intersects(topology_sibling_cpumask(this_cpu),
> +				      follower_mask)) {
> +		/*
> +		 * Current CPU is a sibling thread whose primary is in
> +		 * rmpopt_cpumask.  RMPOPT_BASE MSR is per-core, so it
> +		 * is safe to run the leader locally.  Remove the sibling's
> +		 * primary from the follower mask as this core is already
> +		 * covered by the leader.
> +		 */
> +		cpumask_andnot(follower_mask, follower_mask,
> +			       topology_sibling_cpumask(this_cpu));
> +
> +		for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
> +			rmpopt(pa);
> +			cond_resched();
> +		}
> +	} else {
> +		/*
> +		 * Current CPU does not have RMPOPT_BASE MSR programmed.
> +		 * Pick an explicit leader from the cpumask to avoid #UD.
> +		 * Use work_on_cpu() to run in process context on the leader,
> +		 * avoiding IPI latency.
> +		 */

... this_cpu is neither in the "rmpopt_cpumask", nor is any of its
siblings on "rmpopt_cpumask".

How does that happen?

> +		int leader_cpu = cpumask_first(follower_mask);
> +
> +		if (WARN_ON_ONCE(leader_cpu >= nr_cpu_ids)) {
> +			migrate_enable();
> +			goto out;
> +		}
> +
> +		cpumask_clear_cpu(leader_cpu, follower_mask);
> +
> +		/* Release migration pin before work_on_cpu(). */
> +		migrate_enable();
> +
> +		work_on_cpu(leader_cpu, rmpopt_leader_fn, NULL);

This creates a delayed work and also waits for it to finish execution
which will add more latency than a simple IPI if the comment about IPI
latency above is accurate.

I think there is some corner case in construction of the
"rmpopt_cpumask" that requires this not-so-pretty else block. Can you
elaborate why this is required?

Perhaps the "rmpopt_cpumask" construction needs:

    for_each_online_cpu(cpu) {
        /* Nominate the first CPU on the sibling mask for RMPOPT */
        if (cpu != cpumask_first(topology_sibling_cpumask(cpu)))
            continue;
        cpumask_set_cpu(cpu, &rmpopt_cpumask);
    }


and all you need here is:

    /* Do RMPOPt for local core */
    for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
        rmpopt(pa);

    /* Skip this core from concurrent RMPOPT */
    cpumask_and_not(follower_mask, &rmpopt_cpumask, topology_sibling_cpumask(cpu));

No?

> +
> +		goto followers;
> +	}
> +
> +	migrate_enable();
> +
-- 
Thanks and Regards,
Prateek


^ permalink raw reply

* Re: [PATCH] crypto: qce - drop unused scatterlist traversal in qce_ahash_update
From: Bartosz Golaszewski @ 2026-06-16  7:49 UTC (permalink / raw)
  To: Thorsten Blum
  Cc: linux-crypto, linux-arm-msm, linux-kernel, Bartosz Golaszewski,
	Herbert Xu, David S. Miller
In-Reply-To: <20260614152605.701754-3-thorsten.blum@linux.dev>

On Sun, 14 Jun 2026 17:26:07 +0200, Thorsten Blum
<thorsten.blum@linux.dev> said:
> Commit df12ef60c87b ("crypto: qce/sha - Do not modify scatterlist passed
> along with request") removed the only use of sg_last, rendering the
> scatterlist traversal useless. Remove it and its local variables.
>
> Also remove the redundant hash_later check, inline the source offset,
> and assign the number of complete blocks directly to req->nbytes.
>
> Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
> ---

Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>

^ permalink raw reply

* Re: [PATCH v3] lib/raid/xor: x86: Add AVX-512 optimized xor_gen()
From: David Laight @ 2026-06-16  8:13 UTC (permalink / raw)
  To: Eric Biggers
  Cc: x86, Christoph Hellwig, linux-crypto, linux-raid, Andrew Morton,
	linux-kernel
In-Reply-To: <20260615201050.GB1764@quark>

On Mon, 15 Jun 2026 13:10:50 -0700
Eric Biggers <ebiggers@kernel.org> wrote:

> On Mon, Jun 15, 2026 at 12:03:38PM -0700, Eric Biggers wrote:
> > Note: for now I omitted the cpu_has_xfeatures() check that the AVX-512
> > optimized crypto and CRC code does, since it's not implemented on
> > User-Mode Linux and it's never been present in the RAID6 code either.  
> 
> By the way, Sashiko keeps complaining about this decision.
> 
> Maybe the x86 maintainers have some advice here?
> 
> For context: on x86 processors, executing AVX or AVX512 instructions
> requires not just that the CPU supports the feature, but also that the
> operating system has set certain bits in XCR0.  For example all EVEX
> coded instructions (i.e. AVX-512) require XCR0=111xx111b.  (See Intel
> manual "2.6.11.1 State Dependent #UD".)
> 
> Therefore most of the kernel's AVX and AVX512 optimized code checks not
> just X86_FEATURE_AVX* but also calls cpu_has_xfeatures() to check XCR0.
> 
> But "most" isn't all.  The RAID6 code for example doesn't check
> cpu_has_xfeatures().  So if you e.g. boot a kernel in QEMU using
> "-cpu max,xsave=off", it already crashes when the RAID6 code does its
> boot-time benchmark.
> 
> Part of the reason for that omission probably is that UML doesn't
> provide an implementation of cpu_has_xfeatures().  And the x86 RAID (XOR
> and RAID6) code is enabled on UML.
> 
> It could be implemented for UML by using the xgetbv instruction, like
> what userspace programs do.  (We'd also need to copy the XFEATURE_MASK_*
> constants, as UML can't include arch/x86/include/asm/fpu/types.h)
> 
> But I wanted to ask: do we really care about the case where features are
> "supported" but their XCR0 bits aren't set?  Perhaps the kernel just
> doesn't/shouldn't support weird cases like "-cpu max,xsave=off"?

I think that case definitely matters for userspace.
Isn't it what happens when you run an old OS on a new cpu?
I remember cases where people were compiling programs that used AVX
(possibly from gcc's cpu=native) but the os hadn't been updated to
actually save the relevant registers.
The programs 'sort of worked' until a process switch failed to preserve
the registers.

So the check you need to do is looking at XCR0 rather than anything else.

> 
> If this case indeed needs to be handled, could we make things easier for
> the kernel's AVX and AVX-512 optimized code?  Currently AVX-512 needs:
> 
>         if (boot_cpu_has(X86_FEATURE_AVX512F) &&
>             cpu_has_xfeatures(XFEATURE_MASK_FP | XFEATURE_MASK_SSE |
>                               XFEATURE_MASK_YMM | XFEATURE_MASK_AVX512, NULL))
> 
> How about we make X86_FEATURE_AVX512F depend on XCR0=111xx111, and
> X86_FEATURE_AVX depend on XCR0=xxxxx111?  Then the cpu_has_xfeatures()
> check wouldn't be needed.  Is there any reason not to do that?

If cpu_has_xfeatures() is checking (a copy of) XCR0 isn't it enough
to just check that XFEATURE_MASK_AVX512 is set - it doesn't make any
sense for the other bits to be clear at the same time.
If the XCR0 copy is sane/sanitised you only need to check one bit.
That would let you #define the constant to 0 if the kernel is built without
the feature and the compiler will optimise the code away.

Then the test would just be:
	if (can_use_xfeature(XFEATURE_AVX512))

-- David

> 
> - Eric


^ permalink raw reply

* [PATCH] crypto: ti-dthev2 - Fix potential invalid access when device list is empty
From: Hongling Zeng @ 2026-06-16  9:40 UTC (permalink / raw)
  To: t-pratham, herbert, davem
  Cc: linux-crypto, linux-kernel, zhongling0719, Hongling Zeng

list_first_entry() never returns NULL - if the list is empty, it still
returns a pointer to an invalid object, leading to potential invalid
memory access when dereferenced.

Fix this by using list_first_entry_or_null instead of list_first_entry.

Fixes: 52f641bc63a4 ("crypto: ti - Add driver for DTHE V2 AES Engine (ECB, CBC)")
Signed-off-by: Hongling Zeng <zenghongling@kylinos.cn>
---
 drivers/crypto/ti/dthev2-common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/crypto/ti/dthev2-common.c b/drivers/crypto/ti/dthev2-common.c
index a2ad79bec105..cc0244938267 100644
--- a/drivers/crypto/ti/dthev2-common.c
+++ b/drivers/crypto/ti/dthev2-common.c
@@ -40,7 +40,7 @@ struct dthe_data *dthe_get_dev(struct dthe_tfm_ctx *ctx)
 		return ctx->dev_data;
 
 	spin_lock_bh(&dthe_dev_list.lock);
-	dev_data = list_first_entry(&dthe_dev_list.dev_list, struct dthe_data, list);
+	dev_data = list_first_entry_or_null(&dthe_dev_list.dev_list, struct dthe_data, list);
 	if (dev_data)
 		list_move_tail(&dev_data->list, &dthe_dev_list.dev_list);
 	spin_unlock_bh(&dthe_dev_list.lock);
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH v2 1/8] crypto: qce - Remove unsafe/deprecated algorithms
From: Bartosz Golaszewski @ 2026-06-16 10:41 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Thara Gopinath, Herbert Xu, David S. Miller, Stanimir Varbanov,
	Eneas U de Queiroz, Kuldeep Singh, linux-crypto, linux-arm-msm,
	linux-kernel, brgl, stable, Bartosz Golaszewski
In-Reply-To: <20260616051820.GA127019@sol>

On Tue, 16 Jun 2026 07:18:20 +0200, Eric Biggers <ebiggers@kernel.org> said:
> On Mon, Jun 15, 2026 at 05:49:52PM +0200, Bartosz Golaszewski wrote:
>> Remove algorithms that are either unsafe or deprecated and have no
>> in-kernel users that cannot be served by the ARM CE implementations.
>>
>> AES-ECB reveals plaintext patterns (identical plaintext blocks produce
>> identical ciphertext blocks) and should not be exposed as a hardware-
>> accelerated primitive. DES, Triple DES and HMAC-SHA1 have been
>> deprecated for years.
>>
>> Remove ecb(aes), cbc(des), ecb(des3_ede), cbc(des3_ede), hmac(sha1) and
>> all AEAD variants built on these primitives. Also clean up the - now dead
>> - code, flags and constants.
>>
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
>
> What is the rationale for still supporting the following?
>
>     sha1
>     ecb(des)
>     authenc(hmac(sha256),cbc(des))
>

No, I should have removed those too. I'll update it in v3.

Bart

^ permalink raw reply

* Re: [PATCH v8 4/7] x86/sev: Add support to perform RMP optimizations asynchronously
From: Kalra, Ashish @ 2026-06-16 19:56 UTC (permalink / raw)
  To: K Prateek Nayak, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, Tycho.Andersen, Nathan.Fontenot,
	ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <0fa0bc95-ff31-40c5-b083-3c885d09d0ab@amd.com>

Hello Prateek,

On 6/16/2026 2:27 AM, K Prateek Nayak wrote:
> Hello Ashish,
> 
> On 6/16/2026 1:19 AM, Ashish Kalra wrote:
>> +	/*
>> +	 * RMPOPT scans the RMP table, stores the result of the scan in the
>> +	 * reserved processor memory. The RMP scan is the most expensive
>> +	 * part. If a second RMPOPT occurs, it can skip the expensive scan
>> +	 * if they can see a cached result in the reserved processor memory.
>> +	 *
>> +	 * Do RMPOPT on one CPU alone. Then, follow that up with RMPOPT
>> +	 * on every other primary thread. Followers are "designed to"
>> +	 * skip the scan if they see the "cached" scan results.
>> +	 */
>> +	cpumask_copy(follower_mask, &rmpopt_cpumask);
> 
> rmpopt_cpumask is constructed after hotplug is disabled but ...
> 
>> +
>> +	/*
>> +	 * Pin the worker to the current CPU for the leader loop so that
>> +	 * this_cpu remains valid and the RMPOPT instruction executes on
>> +	 * the correct CPU.
>> +	 *
>> +	 * Use migrate_disable() rather than get_cpu() to prevent
>> +	 * migration while still allowing preemption.
>> +	 */
>> +	migrate_disable();
>> +	this_cpu = smp_processor_id();
>> +
>> +	if (cpumask_test_cpu(this_cpu, follower_mask)) {
>> +		/*
>> +		 * Current CPU is a primary thread in rmpopt_cpumask.
>> +		 * Run leader locally and remove from follower mask.
>> +		 */
>> +		cpumask_clear_cpu(this_cpu, follower_mask);
>> +
>> +		for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
>> +			rmpopt(pa);
>> +			cond_resched();
>> +		}
>> +	} else if (cpumask_intersects(topology_sibling_cpumask(this_cpu),
>> +				      follower_mask)) {
>> +		/*
>> +		 * Current CPU is a sibling thread whose primary is in
>> +		 * rmpopt_cpumask.  RMPOPT_BASE MSR is per-core, so it
>> +		 * is safe to run the leader locally.  Remove the sibling's
>> +		 * primary from the follower mask as this core is already
>> +		 * covered by the leader.
>> +		 */
>> +		cpumask_andnot(follower_mask, follower_mask,
>> +			       topology_sibling_cpumask(this_cpu));
>> +
>> +		for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
>> +			rmpopt(pa);
>> +			cond_resched();
>> +		}
>> +	} else {
>> +		/*
>> +		 * Current CPU does not have RMPOPT_BASE MSR programmed.
>> +		 * Pick an explicit leader from the cpumask to avoid #UD.
>> +		 * Use work_on_cpu() to run in process context on the leader,
>> +		 * avoiding IPI latency.
>> +		 */
> 
> ... this_cpu is neither in the "rmpopt_cpumask", nor is any of its
> siblings on "rmpopt_cpumask".
> 
> How does that happen?

Actually, this was the implementation before the CPU hotplug disable enforcement code was implemented and added in v8,
and i should have fixed this rmpopt_work_handler() accordingly for v8.

With the enforced cpu hotplug disable support, case #3 here (above) is now dead code, and removing it lets
cases #1 and #2 collapse too.

snp_prepare() requires cpu_online_mask == cpu_present_mask before SNP init — so when snp_setup_rmpopt() programs the MSRs, every
core's primary is online -> every core is in rmpopt_cpumask.
  
So now the work handler always runs on a CPU whose core is programmed. topology_sibling_cpumask(this_cpu) therefore always intersects
rmpopt_cpumask -> case #1 or #2 always matches.

So i should actually drop case #3 here - which is: "this_cpu is neither in the "rmpopt_cpumask", nor is any of its
siblings on rmpopt_cpumask"


> 
>> +		int leader_cpu = cpumask_first(follower_mask);
>> +
>> +		if (WARN_ON_ONCE(leader_cpu >= nr_cpu_ids)) {
>> +			migrate_enable();
>> +			goto out;
>> +		}
>> +
>> +		cpumask_clear_cpu(leader_cpu, follower_mask);
>> +
>> +		/* Release migration pin before work_on_cpu(). */
>> +		migrate_enable();
>> +
>> +		work_on_cpu(leader_cpu, rmpopt_leader_fn, NULL);
> 
> This creates a delayed work and also waits for it to finish execution
> which will add more latency than a simple IPI if the comment about IPI
> latency above is accurate.
> 
> I think there is some corner case in construction of the
> "rmpopt_cpumask" that requires this not-so-pretty else block. Can you
> elaborate why this is required?
> 
> Perhaps the "rmpopt_cpumask" construction needs:
> 
>     for_each_online_cpu(cpu) {
>         /* Nominate the first CPU on the sibling mask for RMPOPT */
>         if (cpu != cpumask_first(topology_sibling_cpumask(cpu)))
>             continue;
>         cpumask_set_cpu(cpu, &rmpopt_cpumask);
>     }
> 
> 
> and all you need here is:
> 
>     /* Do RMPOPt for local core */
>     for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
>         rmpopt(pa);
> 
>     /* Skip this core from concurrent RMPOPT */
>     cpumask_and_not(follower_mask, &rmpopt_cpumask, topology_sibling_cpumask(cpu));
> 
> No?
> 

Yes, a simpler implementation will be like this: 
...

 	if (!alloc_cpumask_var(&follower_mask, GFP_KERNEL))
                return;

 	cpumask_copy(follower_mask, &rmpopt_cpumask);

        /*
         * The current CPU's core always has RMPOPT_BASE programmed
         * (snp_prepare() required all CPUs online at setup and CPU hotplug
         * is disabled while SNP is active), so it can always be the leader.
         * RMPOPT_BASE is per-core; exclude this core from the followers.
         */
        migrate_disable();
        cpumask_andnot(follower_mask, follower_mask,
                       topology_sibling_cpumask(smp_processor_id()));

        for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
                rmpopt(pa);
                cond_resched();
        }
        migrate_enable();

        cpus_read_lock();
        for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
                on_each_cpu_mask(follower_mask, rmpopt_smp, (void *)pa, true);
                cond_resched();
        }
        cpus_read_unlock();

        free_cpumask_var(follower_mask);


 Here, the leader exclusion must use the sibling mask, not clear_cpu(this_cpu). That's why my collapsed version uses:

        cpumask_andnot(follower_mask, follower_mask,
                       topology_sibling_cpumask(smp_processor_id()));

  - If this_cpu is a primary: its sibling mask contains itself (the primary) -> andnot removes this core's primary from the followers.
  
  - If this_cpu is a secondary: it isn't in follower_mask at all, but its sibling mask contains its primary, which is in
  follower_mask -> andnot still removes this core's primary. 

  So either way the current core is dropped from the followers. (The old code needed two branches because case #1 used
  clear_cpu(this_cpu) — only correct when this_cpu is the primary — while case #2 used the sibling andnot. The single andnot works for
  both cases).

Thanks,
Ashish

>> +		goto followers;
>> +	}
>> +
>> +	migrate_enable();
>> +

^ permalink raw reply

* [PATCH v2] crypto: ti-dthev2：Fix potential invalid access when device list is empty
From: Hongling Zeng @ 2026-06-17  1:36 UTC (permalink / raw)
  To: t-pratham, herbert, davem
  Cc: linux-crypto, linux-kernel, zhongling0719, Hongling Zeng

list_first_entry() never returns NULL - if the list is empty, it still
returns a pointer to an invalid object, leading to potential invalid
memory access when dereferenced.

Fix this by using list_first_entry_or_null instead of list_first_entry.

Fixes: 52f641bc63a4 ("crypto: ti - Add driver for DTHE V2 AES Engine (ECB, CBC)")
Signed-off-by: Hongling Zeng <zenghongling@kylinos.cn>

---
Change in v2
 -Reorder dthe_remove(): unregister algorithms before removing from list
  This prevents new allocations during removal.
---
 drivers/crypto/ti/dthev2-common.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/crypto/ti/dthev2-common.c b/drivers/crypto/ti/dthev2-common.c
index a2ad79bec105..9480b32b8111 100644
--- a/drivers/crypto/ti/dthev2-common.c
+++ b/drivers/crypto/ti/dthev2-common.c
@@ -40,7 +40,7 @@ struct dthe_data *dthe_get_dev(struct dthe_tfm_ctx *ctx)
 		return ctx->dev_data;
 
 	spin_lock_bh(&dthe_dev_list.lock);
-	dev_data = list_first_entry(&dthe_dev_list.dev_list, struct dthe_data, list);
+	dev_data = list_first_entry_or_null(&dthe_dev_list.dev_list, struct dthe_data, list);
 	if (dev_data)
 		list_move_tail(&dev_data->list, &dthe_dev_list.dev_list);
 	spin_unlock_bh(&dthe_dev_list.lock);
@@ -201,12 +201,12 @@ static void dthe_remove(struct platform_device *pdev)
 {
 	struct dthe_data *dev_data = platform_get_drvdata(pdev);
 
-	spin_lock(&dthe_dev_list.lock);
-	list_del(&dev_data->list);
-	spin_unlock(&dthe_dev_list.lock);
-
 	dthe_unregister_algs();
 
+        spin_lock(&dthe_dev_list.lock);
+        list_del(&dev_data->list);
+        spin_unlock(&dthe_dev_list.lock);
+
 	crypto_engine_exit(dev_data->engine);
 
 	dma_release_channel(dev_data->dma_aes_rx);
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH v8 4/7] x86/sev: Add support to perform RMP optimizations asynchronously
From: K Prateek Nayak @ 2026-06-17  4:20 UTC (permalink / raw)
  To: Kalra, Ashish, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, Tycho.Andersen, Nathan.Fontenot,
	ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <8c5f4082-e3a5-4f65-b058-33938a7ee324@amd.com>

Hello Ashish,

On 6/17/2026 1:26 AM, Kalra, Ashish wrote:
> Hello Prateek,
> 
> On 6/16/2026 2:27 AM, K Prateek Nayak wrote:
>> Hello Ashish,
>>
>> On 6/16/2026 1:19 AM, Ashish Kalra wrote:
>>> +	/*
>>> +	 * RMPOPT scans the RMP table, stores the result of the scan in the
>>> +	 * reserved processor memory. The RMP scan is the most expensive
>>> +	 * part. If a second RMPOPT occurs, it can skip the expensive scan
>>> +	 * if they can see a cached result in the reserved processor memory.
>>> +	 *
>>> +	 * Do RMPOPT on one CPU alone. Then, follow that up with RMPOPT
>>> +	 * on every other primary thread. Followers are "designed to"
>>> +	 * skip the scan if they see the "cached" scan results.
>>> +	 */
>>> +	cpumask_copy(follower_mask, &rmpopt_cpumask);
>>
>> rmpopt_cpumask is constructed after hotplug is disabled but ...
>>
>>> +
>>> +	/*
>>> +	 * Pin the worker to the current CPU for the leader loop so that
>>> +	 * this_cpu remains valid and the RMPOPT instruction executes on
>>> +	 * the correct CPU.
>>> +	 *
>>> +	 * Use migrate_disable() rather than get_cpu() to prevent
>>> +	 * migration while still allowing preemption.
>>> +	 */
>>> +	migrate_disable();
>>> +	this_cpu = smp_processor_id();
>>> +
>>> +	if (cpumask_test_cpu(this_cpu, follower_mask)) {
>>> +		/*
>>> +		 * Current CPU is a primary thread in rmpopt_cpumask.
>>> +		 * Run leader locally and remove from follower mask.
>>> +		 */
>>> +		cpumask_clear_cpu(this_cpu, follower_mask);
>>> +
>>> +		for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
>>> +			rmpopt(pa);
>>> +			cond_resched();
>>> +		}
>>> +	} else if (cpumask_intersects(topology_sibling_cpumask(this_cpu),
>>> +				      follower_mask)) {
>>> +		/*
>>> +		 * Current CPU is a sibling thread whose primary is in
>>> +		 * rmpopt_cpumask.  RMPOPT_BASE MSR is per-core, so it
>>> +		 * is safe to run the leader locally.  Remove the sibling's
>>> +		 * primary from the follower mask as this core is already
>>> +		 * covered by the leader.
>>> +		 */
>>> +		cpumask_andnot(follower_mask, follower_mask,
>>> +			       topology_sibling_cpumask(this_cpu));
>>> +
>>> +		for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
>>> +			rmpopt(pa);
>>> +			cond_resched();
>>> +		}
>>> +	} else {
>>> +		/*
>>> +		 * Current CPU does not have RMPOPT_BASE MSR programmed.
>>> +		 * Pick an explicit leader from the cpumask to avoid #UD.
>>> +		 * Use work_on_cpu() to run in process context on the leader,
>>> +		 * avoiding IPI latency.
>>> +		 */
>>
>> ... this_cpu is neither in the "rmpopt_cpumask", nor is any of its
>> siblings on "rmpopt_cpumask".
>>
>> How does that happen?
> 
> Actually, this was the implementation before the CPU hotplug disable enforcement code was implemented and added in v8,
> and i should have fixed this rmpopt_work_handler() accordingly for v8.
> 
> With the enforced cpu hotplug disable support, case #3 here (above) is now dead code, and removing it lets
> cases #1 and #2 collapse too.
> 
> snp_prepare() requires cpu_online_mask == cpu_present_mask before SNP init — so when snp_setup_rmpopt() programs the MSRs, every
> core's primary is online -> every core is in rmpopt_cpumask.
>   
> So now the work handler always runs on a CPU whose core is programmed. topology_sibling_cpumask(this_cpu) therefore always intersects
> rmpopt_cpumask -> case #1 or #2 always matches.
> 
> So i should actually drop case #3 here - which is: "this_cpu is neither in the "rmpopt_cpumask", nor is any of its
> siblings on rmpopt_cpumask"

Ack.

Also the fact that cpu_mark_primary_thread() uses LSBs of APICID and if
you have some insanely weird configuration - like boot with maxcpus=1,
online all the secondary threads (CPUs 256-511 on a 256C/512T system),
launch an SNP guest - it can actually leave everything except CORE0 out
of the "rmpopt_cpumask".

> 
> 
>>
>>> +		int leader_cpu = cpumask_first(follower_mask);
>>> +
>>> +		if (WARN_ON_ONCE(leader_cpu >= nr_cpu_ids)) {
>>> +			migrate_enable();
>>> +			goto out;
>>> +		}
>>> +
>>> +		cpumask_clear_cpu(leader_cpu, follower_mask);
>>> +
>>> +		/* Release migration pin before work_on_cpu(). */
>>> +		migrate_enable();
>>> +
>>> +		work_on_cpu(leader_cpu, rmpopt_leader_fn, NULL);
>>
>> This creates a delayed work and also waits for it to finish execution
>> which will add more latency than a simple IPI if the comment about IPI
>> latency above is accurate.
>>
>> I think there is some corner case in construction of the
>> "rmpopt_cpumask" that requires this not-so-pretty else block. Can you
>> elaborate why this is required?
>>
>> Perhaps the "rmpopt_cpumask" construction needs:
>>
>>     for_each_online_cpu(cpu) {
>>         /* Nominate the first CPU on the sibling mask for RMPOPT */
>>         if (cpu != cpumask_first(topology_sibling_cpumask(cpu)))
>>             continue;
>>         cpumask_set_cpu(cpu, &rmpopt_cpumask);
>>     }
>>
>>
>> and all you need here is:
>>
>>     /* Do RMPOPt for local core */
>>     for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
>>         rmpopt(pa);
>>
>>     /* Skip this core from concurrent RMPOPT */
>>     cpumask_and_not(follower_mask, &rmpopt_cpumask, topology_sibling_cpumask(cpu));
>>
>> No?
>>
> 
> Yes, a simpler implementation will be like this: 
> ...
> 
>  	if (!alloc_cpumask_var(&follower_mask, GFP_KERNEL))
>                 return;
> 

If you move the migrate_disable() here, you can simply do an andnot
without needing to copy the rmpopt_cpumask beforehand and save on one
cpumask iteration.

>  	cpumask_copy(follower_mask, &rmpopt_cpumask);
> 
>         /*
>          * The current CPU's core always has RMPOPT_BASE programmed
>          * (snp_prepare() required all CPUs online at setup and CPU hotplug
>          * is disabled while SNP is active), so it can always be the leader.
>          * RMPOPT_BASE is per-core; exclude this core from the followers.
>          */
>         migrate_disable();
>         cpumask_andnot(follower_mask, follower_mask,
>                        topology_sibling_cpumask(smp_processor_id()));
> 
>         for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
>                 rmpopt(pa);
>                 cond_resched();
>         }
>         migrate_enable();
> 
>         cpus_read_lock();

I think you can even skip the cpus_read_lock() since we know for a
fact that hotplug is disabled when we are here.

Perhaps we can have a lockdep_assert_cpu_hotplug_disabled() which
ensures we'll get a splat if that assumption ever changes when
running with LOCKDEP?

I'll let others comment if that is a good idea or not.

>         for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
>                 on_each_cpu_mask(follower_mask, rmpopt_smp, (void *)pa, true);
>                 cond_resched();
>         }
>         cpus_read_unlock();
> 
>         free_cpumask_var(follower_mask);
> 
> 
>  Here, the leader exclusion must use the sibling mask, not clear_cpu(this_cpu). That's why my collapsed version uses:
> 
>         cpumask_andnot(follower_mask, follower_mask,
>                        topology_sibling_cpumask(smp_processor_id()));
> 
>   - If this_cpu is a primary: its sibling mask contains itself (the primary) -> andnot removes this core's primary from the followers.
>   
>   - If this_cpu is a secondary: it isn't in follower_mask at all, but its sibling mask contains its primary, which is in
>   follower_mask -> andnot still removes this core's primary. 
> 
>   So either way the current core is dropped from the followers. (The old code needed two branches because case #1 used
>   clear_cpu(this_cpu) — only correct when this_cpu is the primary — while case #2 used the sibling andnot. The single andnot works for
>   both cases).

Ack! And I think this looks much cleaner (to my eyes at least ;-)

-- 
Thanks and Regards,
Prateek


^ permalink raw reply

* Re: [PATCH v8 3/7] crypto/ccp: Disable CPU hotplug while SNP is active
From: K Prateek Nayak @ 2026-06-17  4:33 UTC (permalink / raw)
  To: Ashish Kalra, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, Tycho.Andersen, Nathan.Fontenot,
	ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <1feccf6e2a56d949b30f403c0ca7949f580e5982.1781419998.git.ashish.kalra@amd.com>

Hello Ashish,

On 6/16/2026 1:19 AM, Ashish Kalra wrote:
> From: Ashish Kalra <ashish.kalra@amd.com>
> 
> The SEV firmware enumerates the CPUs at SNP initialization and is not
> aware of the OS bringing CPUs online or offline afterwards, so OS CPU
> hotplug can diverge from the firmware's expectations and break SNP.
> Disable CPU hotplug while SNP is active.

Dumb question: Is this specific to RMPOPT? Otherwise ...

> 
> SNP is fully torn down only on the SNP_SHUTDOWN_EX x86_snp_shutdown
> path; the legacy path leaves SNP enabled in hardware while clearing
> snp_initialized, so __sev_snp_init_locked() can run again.  Track the
> disable with a flag so it is balanced by a matching enable rather than
> stacked, and re-enable hotplug only on the x86_snp_shutdown path, after
> snp_shutdown() has cleared the per-core RMPOPT_BASE MSRs with hotplug
> still disabled.
> 
> This also keeps the CPU set stable for the asynchronous RMPOPT scan
> added later in this series, and ensures cpus_read_lock() in the scan
> is uncontended.
> 
> Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
> ---
>  drivers/crypto/ccp/sev-dev.c | 29 ++++++++++++++++++++++++++++-
>  1 file changed, 28 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
> index 217b6b19802e..c8c3c577463c 100644
> --- a/drivers/crypto/ccp/sev-dev.c
> +++ b/drivers/crypto/ccp/sev-dev.c
> @@ -106,6 +106,9 @@ struct snp_hv_fixed_pages_entry {
>  
>  static LIST_HEAD(snp_hv_fixed_pages);
>  
> +/* Set while SNP has CPU hotplug disabled. */
> +static bool snp_cpu_hotplug_disabled;
> +
>  /* Trusted Memory Region (TMR):
>   *   The TMR is a 1MB area that must be 1MB aligned.  Use the page allocator
>   *   to allocate the memory, which will return aligned memory for the specified
> @@ -1479,6 +1482,17 @@ static int __sev_snp_init_locked(int *error, unsigned int max_snp_asid)
>  
>  	snp_hv_fixed_pages_state_update(sev, HV_FIXED);
>  
> +	/*
> +	 * Disable CPU hotplug while SNP is active.  Guard against stacking
> +	 * the disable count: the legacy SNP_SHUTDOWN_EX path clears
> +	 * snp_initialized without re-enabling hotplug, so this can run
> +	 * again while hotplug is already disabled.
> +	 */
> +	if (!snp_cpu_hotplug_disabled) {
> +		cpu_hotplug_disable();
> +		snp_cpu_hotplug_disabled = true;
> +	}
> +

... should this be done before __sev_do_cmd_locked(SEV_CMD_SNP_INIT_EX)
is issued?

I'm assuming that is when the firmware enumerates the CPUs during SNP
initialization and any hotplug after that should be disallowed?

>  	snp_setup_rmpopt();
>  
>  	sev->snp_initialized = true;
-- 
Thanks and Regards,
Prateek


^ permalink raw reply

* Re: [PATCH v3] lib/raid/xor: x86: Add AVX-512 optimized xor_gen()
From: Christoph Hellwig @ 2026-06-17  5:44 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Borislav Petkov, Eric Biggers, Richard Weinberger, x86,
	Christoph Hellwig, linux-crypto, David Laight, linux-raid,
	Andrew Morton, linux-kernel, linux-um
In-Reply-To: <a832eee3-55ec-4cf4-907f-346ff98870ca@intel.com>

On Mon, Jun 15, 2026 at 05:29:58PM -0700, Dave Hansen wrote:
> On 6/15/26 16:53, Borislav Petkov wrote:
> > 
> >> In any case, I'd like these to go away:
> >>
> >>     $ git grep cpu_has_xfeatures | wc -l
> >>     31
> > Yeah, all in crypto. I can certainly see why.
> > 
> > @dhansen, any other thoughts?
> 
> If we can get rid of cpu_has_xfeatures(), I'm all for it. I'm not quite
> sure how the code would look so I'm reserving judgement until I see the
> patches. But it's worth a try.

I think the most important part is to be consistent.  Either use it
everywhere or not at all.

^ permalink raw reply

* Re: [PATCH v2] lib/raid/xor: x86: Add AVX-512 optimized xor_gen()
From: Christoph Hellwig @ 2026-06-17  5:52 UTC (permalink / raw)
  To: Eric Biggers
  Cc: David Laight, Andrew Morton, linux-kernel, Christoph Hellwig,
	linux-crypto, x86, linux-raid
In-Reply-To: <20260615184435.GA17731@quark>

On Mon, Jun 15, 2026 at 11:44:35AM -0700, Eric Biggers wrote:
> > Doesn't zen4 only have a 256bit bus between the cpu and cache?
> > So avx512 reads take two clocks.
> > Since this is memory limited it is unlikely to run faster than the
> > avx256 version.
> 
> On AMD Genoa (Zen 4 server processor), the AVX-512 code added by this
> patch is indeed about the same speed as the existing AVX-2 code.

The same is true for Zen 5 mobile which has the same AVX-512 limitations.
I don't think it's the bus width, but I'll leave the details to the
experts.

> 
> > OTOH if it doesn't cause down-clocking as well then it won't be slower.
> 
> Yes, as far as I know that's not an issue on AMD processors, even Zen 4.
> The "avoid AVX-512 due to downclocking" rule is historical guidance for
> Intel processors that had a bad implementation of AVX-512.  There's no
> reason to exclude Zen 4 from executing AVX-512 optimized code.  At worst
> it will just be the same, as we're seeing here.

It does not cause down clocking.  But for some of the more complicated
code I've seen AVX512 being significantly slower than AVX2 on these.
So we need to watch out and not automatically assume AVX512 is faster.


^ permalink raw reply

* Re: [PATCH v3] lib/raid/xor: x86: Add AVX-512 optimized xor_gen()
From: Christoph Hellwig @ 2026-06-17  5:56 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Andrew Morton, linux-kernel, Christoph Hellwig, linux-crypto, x86,
	David Laight, linux-raid
In-Reply-To: <20260615190338.26581-1-ebiggers@kernel.org>

Can use the xor: prefix used for all other commits to lib/raid/xor?

> Benchmark on AMD Ryzen 9 9950X (Zen 5):
> 
>     src_cnt    avx          avx512       Improvement
>     =======    ==========   ==========   ===========
>     1          56353 MB/s   75388 MB/s   33%
>     2          54274 MB/s   68409 MB/s   26%
>     3          44649 MB/s   64042 MB/s   43%
>     4          41315 MB/s   55002 MB/s   33%

On my Zen 5 mobile (AMD Ryzen AI 7 PRO 350) both the existing
AVX2 and this AVX512 code give numbers in the 200+ GB/s range.  Not
sure if is just the different benchmarking or something else going on.

FYI, one or 2 sources are basically useless as they RAID5 configs
that have no benefits over simple mirroring and thus the numbers
aren't too interesting.

> +DO_XOR_BLOCKS(avx512_inner, xor_avx512_2, xor_avx512_3, xor_avx512_4,
> +	      xor_avx512_5);

Is there really much of a benefit of doing the historic DO_XOR_BLOCKS
vs doing the loop manually?  Especially as the common cases for a
modern RAID will usually loop over more disks than this was built
for.  I.e., in practice one or two source buffers only happen at the
end of a loop over more disks.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox