All of lore.kernel.org
 help / color / mirror / Atom feed
From: Eric Biggers <ebiggers@kernel.org>
To: Ard Biesheuvel <ardb@kernel.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>,
	linux-crypto@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-mips@vger.kernel.org,
	linux-riscv@lists.infradead.org, linux-s390@vger.kernel.org,
	sparclinux@vger.kernel.org, x86@kernel.org, Jason@zx2c4.com,
	torvalds@linux-foundation.org
Subject: Re: [PATCH] crypto: ahash - Stop legacy tfms from using the set_virt fallback path
Date: Sun, 15 Jun 2025 11:46:38 -0700	[thread overview]
Message-ID: <20250615184638.GA1480@sol> (raw)
In-Reply-To: <CAMj1kXGd93Kg0Vs8ExLhK=fxhRBASU9sOPfgYUogv+rwVqgUsg@mail.gmail.com>

On Sun, Jun 15, 2025 at 09:22:51AM +0200, Ard Biesheuvel wrote:
> On Sun, 15 Jun 2025 at 05:18, Eric Biggers <ebiggers@kernel.org> wrote:
> >
> ...
> > After disabling the crypto self-tests, I was then able to run a benchmark of
> > SHA-256 hashing 4096-byte messages, which fortunately didn't encounter the
> > recursion bug.  I got the following results:
> >
> >     ARMv8 crypto extensions: 1864 MB/s
> >     Generic C code: 358 MB/s
> >     Qualcomm Crypto Engine: 55 MB/s
> >
> > So just to clarify, you believe that asynchronous hash drivers like the Qualcomm
> > Crypto Engine one are useful, and the changes that you're requiring to the
> > CPU-based code are to support these drivers?
> >
> 
> And this offload engine only has one internal queue, right? Whereas
> the CPU results may be multiplied by the number of cores on the soc.
> It would still be interesting how much of this is due to latency
> rather than limited throughput but it seems highly unlikely that there
> are any message sizes large enough where QCE would catch up with the
> CPUs. (AIUI, the only use case we have in the kernel today for message
> sizes that are substantially larger than this is kTLS, but I'm not
> sure how well it works with crypto_aead compared to offload at a more
> suitable level in the networking stack, and this driver does not
> implement GCM in the first place)
> 
> On ARM socs, these offload engines usually exist primarily for the
> benefit of the verified boot implementation in mask ROM, which
> obviously needs to be minimal but doesn't have to be very fast in
> order to get past the first boot stages and hand over to software.
> Then, since the IP block is there, it's listed as a feature in the
> data sheet, even though it is not very useful when running under the
> OS.

With 1 MiB messages, I get 1913 MB/s with ARMv8 CE and 142 MB/s with QCE.

(BTW, that's single-buffer ARMv8 CE.  My two-buffer code is over 3000 MB/s.)

I then changed my benchmark code to take full advantage of the async API and
submit as many requests as the hardware can handle.  (This would be a best-case
scenario for QCE; in many real use cases this is not possible.)  Result with QCE
was 58 MB/s with 4 KiB messages or 155 MB/s for 1 MiB messages.

So yes, QCE seems to have only one queue, and even that one queue is *much*
slower than just using the CPU.  It's even slower than the generic C code.

And until I fixed it recently, the Crypto API defaulted to using QCE instead of
ARMv8 CE.

But this seems to be a common pattern among the offload engines.
I noticed a similar issue with Intel QAT, which I elaborate on in this patch:
https://lore.kernel.org/r/20250615045145.224567-1-ebiggers@kernel.org

- Eric


WARNING: multiple messages have this Message-ID (diff)
From: Eric Biggers <ebiggers@kernel.org>
To: Ard Biesheuvel <ardb@kernel.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>,
	linux-crypto@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-mips@vger.kernel.org,
	linux-riscv@lists.infradead.org, linux-s390@vger.kernel.org,
	sparclinux@vger.kernel.org, x86@kernel.org, Jason@zx2c4.com,
	torvalds@linux-foundation.org
Subject: Re: [PATCH] crypto: ahash - Stop legacy tfms from using the set_virt fallback path
Date: Sun, 15 Jun 2025 11:46:38 -0700	[thread overview]
Message-ID: <20250615184638.GA1480@sol> (raw)
In-Reply-To: <CAMj1kXGd93Kg0Vs8ExLhK=fxhRBASU9sOPfgYUogv+rwVqgUsg@mail.gmail.com>

On Sun, Jun 15, 2025 at 09:22:51AM +0200, Ard Biesheuvel wrote:
> On Sun, 15 Jun 2025 at 05:18, Eric Biggers <ebiggers@kernel.org> wrote:
> >
> ...
> > After disabling the crypto self-tests, I was then able to run a benchmark of
> > SHA-256 hashing 4096-byte messages, which fortunately didn't encounter the
> > recursion bug.  I got the following results:
> >
> >     ARMv8 crypto extensions: 1864 MB/s
> >     Generic C code: 358 MB/s
> >     Qualcomm Crypto Engine: 55 MB/s
> >
> > So just to clarify, you believe that asynchronous hash drivers like the Qualcomm
> > Crypto Engine one are useful, and the changes that you're requiring to the
> > CPU-based code are to support these drivers?
> >
> 
> And this offload engine only has one internal queue, right? Whereas
> the CPU results may be multiplied by the number of cores on the soc.
> It would still be interesting how much of this is due to latency
> rather than limited throughput but it seems highly unlikely that there
> are any message sizes large enough where QCE would catch up with the
> CPUs. (AIUI, the only use case we have in the kernel today for message
> sizes that are substantially larger than this is kTLS, but I'm not
> sure how well it works with crypto_aead compared to offload at a more
> suitable level in the networking stack, and this driver does not
> implement GCM in the first place)
> 
> On ARM socs, these offload engines usually exist primarily for the
> benefit of the verified boot implementation in mask ROM, which
> obviously needs to be minimal but doesn't have to be very fast in
> order to get past the first boot stages and hand over to software.
> Then, since the IP block is there, it's listed as a feature in the
> data sheet, even though it is not very useful when running under the
> OS.

With 1 MiB messages, I get 1913 MB/s with ARMv8 CE and 142 MB/s with QCE.

(BTW, that's single-buffer ARMv8 CE.  My two-buffer code is over 3000 MB/s.)

I then changed my benchmark code to take full advantage of the async API and
submit as many requests as the hardware can handle.  (This would be a best-case
scenario for QCE; in many real use cases this is not possible.)  Result with QCE
was 58 MB/s with 4 KiB messages or 155 MB/s for 1 MiB messages.

So yes, QCE seems to have only one queue, and even that one queue is *much*
slower than just using the CPU.  It's even slower than the generic C code.

And until I fixed it recently, the Crypto API defaulted to using QCE instead of
ARMv8 CE.

But this seems to be a common pattern among the offload engines.
I noticed a similar issue with Intel QAT, which I elaborate on in this patch:
https://lore.kernel.org/r/20250615045145.224567-1-ebiggers@kernel.org

- Eric

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

  reply	other threads:[~2025-06-15 18:49 UTC|newest]

Thread overview: 68+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-11  2:09 [PATCH 00/16] SHA-512 library functions Eric Biggers
2025-06-11  2:09 ` Eric Biggers
2025-06-11  2:09 ` [PATCH 01/16] crypto: sha512 - rename conflicting symbols Eric Biggers
2025-06-11  2:09   ` Eric Biggers
2025-06-11  2:09 ` [PATCH 02/16] lib/crypto/sha512: add support for SHA-384 and SHA-512 Eric Biggers
2025-06-11  2:09   ` Eric Biggers
2025-06-11  2:09 ` [PATCH 03/16] lib/crypto/sha512: add HMAC-SHA384 and HMAC-SHA512 support Eric Biggers
2025-06-11  2:09   ` Eric Biggers
2025-06-11  2:09 ` [PATCH 04/16] lib/crypto/sha512: add KUnit tests for SHA-384 and SHA-512 Eric Biggers
2025-06-11  2:09   ` Eric Biggers
2025-06-11  2:09 ` [PATCH 05/16] lib/crypto/sha256: add KUnit tests for SHA-224 and SHA-256 Eric Biggers
2025-06-11  2:09   ` Eric Biggers
2025-06-11  2:09 ` [PATCH 06/16] crypto: riscv/sha512 - stop depending on sha512_generic_block_fn Eric Biggers
2025-06-11  2:09   ` Eric Biggers
2025-06-11  2:09 ` [PATCH 07/16] crypto: sha512 - replace sha512_generic with wrapper around SHA-512 library Eric Biggers
2025-06-11  2:09   ` Eric Biggers
2025-06-11  2:24   ` Herbert Xu
2025-06-11  2:24     ` Herbert Xu
2025-06-11  3:39     ` Eric Biggers
2025-06-11  3:39       ` Eric Biggers
2025-06-11  3:46       ` Herbert Xu
2025-06-11  3:46         ` Herbert Xu
2025-06-11  3:58         ` Eric Biggers
2025-06-11  3:58           ` Eric Biggers
2025-06-13  5:36           ` Eric Biggers
2025-06-13  5:36             ` Eric Biggers
2025-06-13  5:38             ` Herbert Xu
2025-06-13  5:38               ` Herbert Xu
2025-06-13  5:54               ` Eric Biggers
2025-06-13  5:54                 ` Eric Biggers
2025-06-13  7:38                 ` Ard Biesheuvel
2025-06-13  7:38                   ` Ard Biesheuvel
2025-06-13  8:39                   ` Herbert Xu
2025-06-13  8:39                     ` Herbert Xu
2025-06-13 14:51                     ` Eric Biggers
2025-06-13 14:51                       ` Eric Biggers
2025-06-13 16:35                     ` Linus Torvalds
2025-06-13 16:35                       ` Linus Torvalds
2025-06-13  8:51                 ` [PATCH] crypto: ahash - Stop legacy tfms from using the set_virt fallback path Herbert Xu
2025-06-13  8:51                   ` Herbert Xu
2025-06-15  3:18                   ` Eric Biggers
2025-06-15  3:18                     ` Eric Biggers
2025-06-15  7:22                     ` Ard Biesheuvel
2025-06-15  7:22                       ` Ard Biesheuvel
2025-06-15 18:46                       ` Eric Biggers [this message]
2025-06-15 18:46                         ` Eric Biggers
2025-06-15 19:37                         ` Linus Torvalds
2025-06-15 19:37                           ` Linus Torvalds
2025-06-16  4:09                     ` [PATCH] crypto: ahash - Fix infinite recursion in ahash_def_finup Herbert Xu
2025-06-16  4:09                       ` Herbert Xu
2025-06-11  2:09 ` [PATCH 08/16] lib/crypto/sha512: migrate arm-optimized SHA-512 code to library Eric Biggers
2025-06-11  2:09   ` Eric Biggers
2025-06-11  2:09 ` [PATCH 09/16] lib/crypto/sha512: migrate arm64-optimized " Eric Biggers
2025-06-11  2:09   ` Eric Biggers
2025-06-11  2:09 ` [PATCH 10/16] mips: cavium-octeon: move octeon-crypto.h into asm directory Eric Biggers
2025-06-11  2:09   ` Eric Biggers
2025-06-11  2:09 ` [PATCH 11/16] lib/crypto/sha512: migrate mips-optimized SHA-512 code to library Eric Biggers
2025-06-11  2:09   ` Eric Biggers
2025-06-11  2:09 ` [PATCH 12/16] lib/crypto/sha512: migrate riscv-optimized " Eric Biggers
2025-06-11  2:09   ` Eric Biggers
2025-06-11  2:09 ` [PATCH 13/16] lib/crypto/sha512: migrate s390-optimized " Eric Biggers
2025-06-11  2:09   ` Eric Biggers
2025-06-11  2:09 ` [PATCH 14/16] lib/crypto/sha512: migrate sparc-optimized " Eric Biggers
2025-06-11  2:09   ` Eric Biggers
2025-06-11  2:09 ` [PATCH 15/16] lib/crypto/sha512: migrate x86-optimized " Eric Biggers
2025-06-11  2:09   ` Eric Biggers
2025-06-11  2:09 ` [PATCH 16/16] crypto: sha512 - remove sha512_base.h Eric Biggers
2025-06-11  2:09   ` Eric Biggers

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250615184638.GA1480@sol \
    --to=ebiggers@kernel.org \
    --cc=Jason@zx2c4.com \
    --cc=ardb@kernel.org \
    --cc=herbert@gondor.apana.org.au \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-crypto@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mips@vger.kernel.org \
    --cc=linux-riscv@lists.infradead.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=sparclinux@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.