From: Ard Biesheuvel <ardb@kernel.org>
To: Eric Biggers <ebiggers@kernel.org>
Cc: linux-crypto@vger.kernel.org, linux-kernel@vger.kernel.org,
"Jason A . Donenfeld" <Jason@zx2c4.com>,
Herbert Xu <herbert@gondor.apana.org.au>,
linux-arm-kernel@lists.infradead.org, x86@kernel.org
Subject: Re: [PATCH 2/9] lib/crypto: polyval: Add POLYVAL library
Date: Tue, 11 Nov 2025 08:42:29 +0100
Message-ID: <CAMj1kXHoH2K0dpqyrgFJ-OBmP2QrUWZD3aCfaA_eoPzGsLbPMw@mail.gmail.com>
In-Reply-To: <CAMj1kXE1mhu7u5RwhCBA_RUGV6JSDV-GQPpq+thE-0-oVxrmfw@mail.gmail.com>
On Mon, 10 Nov 2025 at 16:21, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> Hi,
>
> On Mon, 10 Nov 2025 at 00:49, Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > Add support for POLYVAL to lib/crypto/.
> >
> > This will replace the polyval crypto_shash algorithm and its use in the
> > hctr2 template, simplifying the code and reducing overhead.
> >
> > Specifically, this commit introduces the POLYVAL library API and a
> > generic implementation of it. Later commits will migrate the existing
> > architecture-optimized implementations of POLYVAL into lib/crypto/ and
> > add a KUnit test suite.
> >
> > I've also rewritten the generic implementation completely, using a more
> > modern approach instead of the traditional table-based approach. It's
> > now constant-time, requires no precomputation or dynamic memory
> > allocations, decreases the per-key memory usage from 4096 bytes to 16
> > bytes, and is faster than the old polyval-generic even on bulk data
> > reusing the same key (at least on x86_64, where I measured a 15% speedup).
> > We should do this for GHASH too, but for now just do it for POLYVAL.
> >
>
> Very nice.
>
> GHASH might suffer on 32-bit, I suppose, but taking this approach for
> GHASH too, at least on 64-bit, would be a huge improvement.
>
> I had a stab at replacing the int128 arithmetic with
> __builtin_bitreverse64(), but it seems to make little difference (and
> GCC does not support it [yet]). I've tried both arm64 and x86, and the
> perf delta (using your kunit benchmark) is negligible in either case.

Sigh. I intended to apply only the generic patch and the kunit test,
but ended up applying the whole series, which explains perfectly why
the x86_64 and arm64 performance was unaffected: the generic code
isn't even being used.

So, trying this again on a Cortex-A72 without Crypto Extensions, I do
get a ~30% performance improvement with the change below. I haven't
re-tested x86, but given that it does not appear to have a native
scalar bit-reverse instruction (or __builtin_bitreverse64() is broken
for it), there is probably no point in finding out.

I'm not saying we should do this for POLYVAL, but it is something to
keep in mind for gf128mul.c perhaps.

--- a/lib/crypto/polyval.c
+++ b/lib/crypto/polyval.c
@@ -42,11 +42,48 @@
* 256-bit => 128-bit reduction algorithm.
*/
-#ifdef CONFIG_ARCH_SUPPORTS_INT128
+#if defined(CONFIG_ARCH_SUPPORTS_INT128) || __has_builtin(__builtin_bitreverse64)
/* Do a 64 x 64 => 128 bit carryless multiplication. */
static void clmul64(u64 a, u64 b, u64 *out_lo, u64 *out_hi)
{
+ u64 a0 = a & 0x1111111111111111;
+ u64 a1 = a & 0x2222222222222222;
+ u64 a2 = a & 0x4444444444444444;
+ u64 a3 = a & 0x8888888888888888;
+
+ u64 b0 = b & 0x1111111111111111;
+ u64 b1 = b & 0x2222222222222222;
+ u64 b2 = b & 0x4444444444444444;
+ u64 b3 = b & 0x8888888888888888;
+
+#if __has_builtin(__builtin_bitreverse64)
+#define brev64 __builtin_bitreverse64
+ u64 c0 = (a0 * b0) ^ (a1 * b3) ^ (a2 * b2) ^ (a3 * b1);
+ u64 c1 = (a0 * b1) ^ (a1 * b0) ^ (a2 * b3) ^ (a3 * b2);
+ u64 c2 = (a0 * b2) ^ (a1 * b1) ^ (a2 * b0) ^ (a3 * b3);
+ u64 c3 = (a0 * b3) ^ (a1 * b2) ^ (a2 * b1) ^ (a3 * b0);
+
+ a0 = brev64(a0);
+ a1 = brev64(a1);
+ a2 = brev64(a2);
+ a3 = brev64(a3);
+
+ b0 = brev64(b0);
+ b1 = brev64(b1);
+ b2 = brev64(b2);
+ b3 = brev64(b3);
+
+ u64 d0 = (a0 * b0) ^ (a1 * b3) ^ (a2 * b2) ^ (a3 * b1);
+ u64 d1 = (a0 * b1) ^ (a1 * b0) ^ (a2 * b3) ^ (a3 * b2);
+ u64 d2 = (a0 * b2) ^ (a1 * b1) ^ (a2 * b0) ^ (a3 * b3);
+ u64 d3 = (a0 * b3) ^ (a1 * b2) ^ (a2 * b1) ^ (a3 * b0);
+
+ *out_hi = ((brev64(d0) >> 1) & 0x1111111111111111) ^
+ ((brev64(d1) >> 1) & 0x2222222222222222) ^
+ ((brev64(d2) >> 1) & 0x4444444444444444) ^
+ ((brev64(d3) >> 1) & 0x8888888888888888);
+#else
/*
* With 64-bit multiplicands and one term every 4 bits, there would be
* up to 64 / 4 = 16 one bits per column when each multiplication is
@@ -60,15 +97,10 @@ static void clmul64(u64 a, u64 b, u64 *out_lo, u64 *out_hi)
* Instead, mask off 4 bits from one multiplicand, giving a max of 15
* one bits per column. Then handle those 4 bits separately.
*/
- u64 a0 = a & 0x1111111111111110;
- u64 a1 = a & 0x2222222222222220;
- u64 a2 = a & 0x4444444444444440;
- u64 a3 = a & 0x8888888888888880;
-
- u64 b0 = b & 0x1111111111111111;
- u64 b1 = b & 0x2222222222222222;
- u64 b2 = b & 0x4444444444444444;
- u64 b3 = b & 0x8888888888888888;
+ a0 &= ~0xfULL;
+ a1 &= ~0xfULL;
+ a2 &= ~0xfULL;
+ a3 &= ~0xfULL;
/* Multiply the high 60 bits of @a by @b. */
u128 c0 = (a0 * (u128)b0) ^ (a1 * (u128)b3) ^
@@ -85,18 +117,20 @@ static void clmul64(u64 a, u64 b, u64 *out_lo, u64 *out_hi)
u64 e1 = -((a >> 1) & 1) & b;
u64 e2 = -((a >> 2) & 1) & b;
u64 e3 = -((a >> 3) & 1) & b;
- u64 extra_lo = e0 ^ (e1 << 1) ^ (e2 << 2) ^ (e3 << 3);
- u64 extra_hi = (e1 >> 63) ^ (e2 >> 62) ^ (e3 >> 61);
/* Add all the intermediate products together. */
- *out_lo = (((u64)c0) & 0x1111111111111111) ^
- (((u64)c1) & 0x2222222222222222) ^
- (((u64)c2) & 0x4444444444444444) ^
- (((u64)c3) & 0x8888888888888888) ^ extra_lo;
*out_hi = (((u64)(c0 >> 64)) & 0x1111111111111111) ^
(((u64)(c1 >> 64)) & 0x2222222222222222) ^
(((u64)(c2 >> 64)) & 0x4444444444444444) ^
- (((u64)(c3 >> 64)) & 0x8888888888888888) ^ extra_hi;
+ (((u64)(c3 >> 64)) & 0x8888888888888888) ^
+ (e1 >> 63) ^ (e2 >> 62) ^ (e3 >> 61);
+
+ *out_lo = e0 ^ (e1 << 1) ^ (e2 << 2) ^ (e3 << 3);
+#endif
+ *out_lo ^= (((u64)c0) & 0x1111111111111111) ^
+ (((u64)c1) & 0x2222222222222222) ^
+ (((u64)c2) & 0x4444444444444444) ^
+ (((u64)c3) & 0x8888888888888888);
}
#else /* CONFIG_ARCH_SUPPORTS_INT128 */
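
For reference, the #if branch above relies on the usual bit-reversal
identity for carryless multiplication: the high 64 bits of clmul(a, b)
are the bit-reversed low 64 bits of clmul(brev64(a), brev64(b)),
shifted right by one. A quick self-contained sanity check of that
identity (throwaway names of my own, and clang only, given that GCC
lacks __builtin_bitreverse64()):

#include <stdint.h>
#include <stdio.h>

/* Reference low/high halves of a 64 x 64 carryless multiplication. */
static uint64_t clmul_lo(uint64_t a, uint64_t b)
{
	uint64_t r = 0;

	for (int i = 0; i < 64; i++)
		if ((b >> i) & 1)
			r ^= a << i;
	return r;
}

static uint64_t clmul_hi(uint64_t a, uint64_t b)
{
	uint64_t r = 0;

	for (int i = 1; i < 64; i++)
		if ((b >> i) & 1)
			r ^= a >> (64 - i);
	return r;
}

int main(void)
{
	uint64_t a = 0x0123456789abcdefULL;
	uint64_t b = 0xfedcba9876543210ULL;

	/* brev64(clmul_lo(brev64(a), brev64(b))) >> 1 == clmul_hi(a, b) */
	uint64_t hi_via_brev =
		__builtin_bitreverse64(
			clmul_lo(__builtin_bitreverse64(a),
				 __builtin_bitreverse64(b))) >> 1;

	printf("identity holds: %d\n", hi_via_brev == clmul_hi(a, b));
	return 0;
}
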
Thread overview: 16+ messages
2025-11-09 23:47 [PATCH 0/9] POLYVAL library Eric Biggers
2025-11-09 23:47 ` [PATCH 1/9] crypto: polyval - Rename conflicting functions Eric Biggers
2025-11-09 23:47 ` [PATCH 2/9] lib/crypto: polyval: Add POLYVAL library Eric Biggers
2025-11-10 15:21 ` Ard Biesheuvel
2025-11-11 7:42 ` Ard Biesheuvel [this message]
2025-11-11 19:46 ` Eric Biggers
2025-11-12 10:32 ` Ard Biesheuvel
2025-11-09 23:47 ` [PATCH 3/9] lib/crypto: tests: Add KUnit tests for POLYVAL Eric Biggers
2025-11-09 23:47 ` [PATCH 4/9] lib/crypto: arm64/polyval: Migrate optimized code into library Eric Biggers
2025-11-09 23:47 ` [PATCH 5/9] lib/crypto: x86/polyval: " Eric Biggers
2025-11-09 23:47 ` [PATCH 6/9] crypto: hctr2 - Convert to use POLYVAL library Eric Biggers
2025-11-09 23:47 ` [PATCH 7/9] crypto: polyval - Remove the polyval crypto_shash Eric Biggers
2025-11-09 23:47 ` [PATCH 8/9] crypto: testmgr - Remove polyval tests Eric Biggers
2025-11-09 23:47 ` [PATCH 9/9] fscrypt: Drop obsolete recommendation to enable optimized POLYVAL Eric Biggers
2025-11-10 15:51 ` [PATCH 0/9] POLYVAL library Ard Biesheuvel
2025-11-11 19:28 ` Eric Biggers