Re: [PATCH] crypto: ahash - Stop legacy tfms from using the set_virt fallback path

From: Eric Biggers <ebiggers@kernel.org>
To: Ard Biesheuvel <ardb@kernel.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>,
	linux-crypto@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-mips@vger.kernel.org,
	linux-riscv@lists.infradead.org, linux-s390@vger.kernel.org,
	sparclinux@vger.kernel.org, x86@kernel.org, Jason@zx2c4.com,
	torvalds@linux-foundation.org
Subject: Re: [PATCH] crypto: ahash - Stop legacy tfms from using the set_virt fallback path
Date: Sun, 15 Jun 2025 11:46:38 -0700
Message-ID: <20250615184638.GA1480@sol>
In-Reply-To: <CAMj1kXGd93Kg0Vs8ExLhK=fxhRBASU9sOPfgYUogv+rwVqgUsg@mail.gmail.com>

On Sun, Jun 15, 2025 at 09:22:51AM +0200, Ard Biesheuvel wrote:
> On Sun, 15 Jun 2025 at 05:18, Eric Biggers <ebiggers@kernel.org> wrote:
> >
> ...
> > After disabling the crypto self-tests, I was then able to run a benchmark of
> > SHA-256 hashing 4096-byte messages, which fortunately didn't encounter the
> > recursion bug.  I got the following results:
> >
> >     ARMv8 crypto extensions: 1864 MB/s
> >     Generic C code: 358 MB/s
> >     Qualcomm Crypto Engine: 55 MB/s
> >
> > So just to clarify, you believe that asynchronous hash drivers like the Qualcomm
> > Crypto Engine one are useful, and the changes that you're requiring to the
> > CPU-based code are to support these drivers?
> >
> 
> And this offload engine only has one internal queue, right? Whereas
> the CPU results may be multiplied by the number of cores on the SoC.
> It would still be interesting to know how much of this is due to latency
> rather than limited throughput but it seems highly unlikely that there
> are any message sizes large enough where QCE would catch up with the
> CPUs. (AIUI, the only use case we have in the kernel today for message
> sizes that are substantially larger than this is kTLS, but I'm not
> sure how well it works with crypto_aead compared to offload at a more
> suitable level in the networking stack, and this driver does not
> implement GCM in the first place)
> 
> On ARM SoCs, these offload engines usually exist primarily for the
> benefit of the verified boot implementation in mask ROM, which
> obviously needs to be minimal but doesn't have to be very fast in
> order to get past the first boot stages and hand over to software.
> Then, since the IP block is there, it's listed as a feature in the
> data sheet, even though it is not very useful when running under the
> OS.

With 1 MiB messages, I get 1913 MB/s with ARMv8 CE and 142 MB/s with QCE.

(BTW, that's single-buffer ARMv8 CE.  My two-buffer code is over 3000 MB/s.)

I then changed my benchmark code to take full advantage of the async API and
submit as many requests as the hardware can handle.  (This would be a best-case
scenario for QCE; in many real use cases this is not possible.)  The result with
QCE was 58 MB/s with 4 KiB messages and 155 MB/s with 1 MiB messages.

So yes, QCE seems to have only one queue, and even that one queue is *much*
slower than just using the CPU.  It's even slower than the generic C code.
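
To put a rough number on the latency question raised above, the two
single-request QCE data points (55 MB/s at 4096 bytes, 142 MB/s at 1 MiB) can
be fitted to a simple linear cost model, time = fixed_overhead + bytes /
peak_rate.  This is only a back-of-the-envelope sketch, not part of the
benchmark: the linear model, the helper name fit_linear_cost, and reading "MB"
as 10^6 bytes are all assumptions.

```python
def fit_linear_cost(n1, rate1, n2, rate2):
    """Fit time = overhead + n / peak_rate through two (size, throughput) points.

    Sizes are in bytes and rates in MB/s.  MB is assumed to mean 10**6 bytes,
    so bytes / rate gives a time in microseconds.
    """
    t1 = n1 / rate1                    # microseconds per request at size n1
    t2 = n2 / rate2                    # microseconds per request at size n2
    per_byte = (t2 - t1) / (n2 - n1)   # marginal cost in microseconds per byte
    overhead_us = t1 - per_byte * n1   # fixed per-request cost (intercept)
    peak_rate = 1.0 / per_byte         # asymptotic throughput in MB/s
    return overhead_us, peak_rate


# QCE figures quoted in this thread: 55 MB/s at 4096 bytes, 142 MB/s at 1 MiB.
overhead_us, peak_rate = fit_linear_cost(4096, 55.0, 1 << 20, 142.0)
print(f"fixed overhead ~{overhead_us:.0f} us/request, peak ~{peak_rate:.0f} MB/s")
```

Under these assumptions the fit gives about 46 us of fixed overhead per request
and an asymptotic rate of about 143 MB/s.  So latency would account for most of
the time at 4096 bytes, but even the fitted peak rate stays below the 358 MB/s
measured for the generic C code at 4096 bytes.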

And until I fixed it recently, the Crypto API defaulted to using QCE instead of
ARMv8 CE.

But this seems to be a common pattern among offload engines.  I noticed a
similar issue with Intel QAT, which I elaborate on in this patch:
https://lore.kernel.org/r/20250615045145.224567-1-ebiggers@kernel.org

- Eric