* [PATCH] crypto: ecc - Optimize vli additive operations using compiler builtins
@ 2026-06-07 11:24 Fabian Blatter
2026-06-09 18:58 ` Stefan Berger
2026-06-10 17:25 ` Stefan Berger
0 siblings, 2 replies; 6+ messages in thread
From: Fabian Blatter @ 2026-06-07 11:24 UTC (permalink / raw)
To: lukas, ignat, herbert, davem
Cc: stefanb, linux-crypto, linux-kernel, Fabian Blatter
Replace the software carry flag emulation with compiler builtins.
Even the newest compilers struggle with taking advantage of the
hardware carry flag. Compiler builtins allow the compiler to
much more easily achieve this while still remaining constant-time.
This yields an approximately 6-7% performance improvement
on the ecc_gen_privkey, ecc_make_pub_key and crypto_ecdh_shared_secret
functions on x86_64 on all curve sizes.
Additionally, the code becomes much more readable.
Signed-off-by: Fabian Blatter <fabianblatter09@gmail.com>
---
Hi,
I'd like to expand on the benchmarks, compare the generated assembly,
and clarify some things.
Use of compiler builtins:
This patch uses __builtin_addcll, __builtin_subcll when available and
otherwise __builtin_uaddll_overflow, __builtin_usubll_overflow. the
latter have existed since ancient gcc versions, so no third fallback
is needed.
I have put the add_carry and sub_borrow inline functions with the
preprocessor logic for builtin selection directly in crypto/ecc.c.
Please let me know if you would like them to be somewhere else.
They do not emit data-dependent branches, and so remain constant-time.
Benchmarks:
All benchmarks were run single-threaded on my AMD 7700X CPU limited to
5.6Ghz. I have measured both nanoseconds and clock cycles, since their
combination can hint at downclocking issues and allows calculation of
the clock speed during the benchmark.
I have omitted the raw output from the benchmarking code, as they much
exceed the 72 character limit.
I have calculated the percent differences, included clock speed
calculations and relevant summaries.
Macro benchmarks:
These were run in a virtualized environment using virtme-ng on the
compiled linux kernel image compiled with default flags.
(the first value is the original time per operation, the second the
patched one. cc is short for clock cycles)
Curve keypair generation (ecc_gen_privkey + ecc_make_pub_key):
P256:
- 646963ns/op -> 600632ns/op = -7.71%
- 2911300cc/op -> 2702854cc/op = -7.71%
- 4.4999Ghz -> 4.5000Ghz = no difference
P384:
- 1239160ns/op -> 1153940ns/op = -7.38%
- 5576250cc/op -> 5192749cc/op = -7.38%
- 4.5000Ghz -> 4.5000Ghz = no difference
Shared secret generation (crypto_ecdh_shared_secret):
P256:
- 320114ns/op -> 297548ns/op = -7.58%
- 1440521cc/op -> 1338972cc/op = -7.58%
- 4.5000Ghz -> 4.5000Ghz = no difference
P384:
- 620768ns/op -> 582560ns/op = -6.55%
- 2793467cc/op -> 2621529cc/op = -6.55%
- 4.5000Ghz -> 4.5000Ghz = no difference
The benchmarks clearly indicate a roughly 6-7% performance increase on
the public API functions. It also appears that virtme-ng limited the
clock speed to 4.5Ghz
Micro benchmarks:
Since the vli additive functions only rely on u64 being defined, these
were run without virtualization and with varying compilers and
compiler flags.
The microbenchmarks show much more mixed results, depending
heavily on the compiler and optimization level used.
For instance, on gcc and O2, the vli_add present in the
patch is actually 25.3% slower than the original one. I have tracked
this down to gcc using a weird way to restore the carry flag after
each iteration, causing way more dependent instructions, preventing
ILP from executing multiple at once.
This is quite interesting, since, as far as I know, the kernel compiles
with gcc and O2 by default, yet the macro-level benchmarks still show a
performance increase. The effect seems to be reversed when crypto/ecc.c
gets compiled. Or maybe the linux kernel uses some additional
optimization flags, I am unsure.
However, most of the time, the patched version outperforms the original
one by a wide margin:
- On clang -O2 or -O3, vli_add and vli_uadd show a 4.074x and 5.384x
speedup.
- On gcc, vli_uadd shows a 74% performance increase at O2,
and a 2.07x speedup at O3.
The performance profile of vli_sub and vli_usub is almost identical to
that of vli_add and vli_uadd.
Assembly comparison:
I have put together a piece of code on Compiler explorer, to make sure
it compiles on old gcc versions, view instructions and play around with
compiler settings.
If you would like, you can play around yourself here:
https://godbolt.org/z/1jT5zesz8
When using clang 22.1 at -O3 -march=lunarlake, the difference between
the patched and original version is particularly clear. The patched
version produces this assembly in the unrolled vli_add loop:
mov rax, qword ptr [rsi + 8*rcx + 16]
adc rax, qword ptr [rdx + 8*rcx + 16]
mov qword ptr [rdi + 8*rcx + 16], rax
mov rax, qword ptr [rsi + 8*rcx + 24]
adc rax, qword ptr [rdx + 8*rcx + 24]
mov qword ptr [rdi + 8*rcx + 24], rax
mov rax, qword ptr [rsi + 8*rcx + 32]
adc rax, qword ptr [rdx + 8*rcx + 32]
mov qword ptr [rdi + 8*rcx + 32], rax
mov rax, qword ptr [rsi + 8*rcx + 40]
adc rax, qword ptr [rdx + 8*rcx + 40]
mov qword ptr [rdi + 8*rcx + 40], rax
mov rax, qword ptr [rsi + 8*rcx + 48]
adc rax, qword ptr [rdx + 8*rcx + 48]
This is basically optimal for an inner loop. It's pure adc and mov
instructions. The loop counting part is still nowhere near perfect,
and still uses setc instructions. But it is still better than what
the original version produces with the same compiler and flags:
mov r10, qword ptr [rsi + 8*rcx]
lea r11, [r10 + rax]
add r11, qword ptr [rdx + 8*rcx]
xor ebx, ebx
cmp r11, r10
setb bl
cmove rbx, rax
mov qword ptr [rdi + 8*rcx], r11
mov rax, qword ptr [rsi + 8*rcx + 8]
lea r10, [rax + rbx]
add r10, qword ptr [rdx + 8*rcx + 8]
xor r11d, r11d
cmp r10, rax
setb r11b
cmove r11, rbx
mov qword ptr [rdi + 8*rcx + 8], r10
mov rax, qword ptr [rsi + 8*rcx + 16]
lea r10, [rax + r11]
add r10, qword ptr [rdx + 8*rcx + 16]
xor ebx, ebx
cmp r10, rax
setb bl
cmove rbx, r11
mov qword ptr [rdi + 8*rcx + 16], r10
mov rax, qword ptr [rsi + 8*rcx + 24]
lea r10, [rax + rbx]
add r10, qword ptr [rdx + 8*rcx + 24]
xor r11d, r11d
cmp r10, rax
setb r11b
cmove r11, rbx
This is downright horrendous. that entire block of processes only 4
limbs, thats 8 instructions per limb! The add instructions
are also not adc instructions, showing that the carry flag is
being fully emulated. This demonstrates how even on the newest
compilers and at the highest optimization level, still cannot
generate hardware carry chains without explicit use of builtins.
I should note that not just clang 22.1.0 with -O3 -march=lunarlake
does this. Gcc and clang show this behaviour on every version i have
tested, regardless of target architecture.
I am not very familiar with ARM or RISC-V assembly, but looking at
compiler explorer, the effect clearly persists, and in the case of
RISC-V actually gets much worse.
This affects all architectures across all compilers and compiler
flags.
If you have gotten this far, thank you for reading this and I am looking
forward to any feedback! If you would like any changes to this patch,
I am very happy to send a v2.
crypto/ecc.c | 98 ++++++++++++++++++++++++++++++++--------------------
1 file changed, 60 insertions(+), 38 deletions(-)
diff --git a/crypto/ecc.c b/crypto/ecc.c
index 43b0def3a225..4f7bb6f424d8 100644
--- a/crypto/ecc.c
+++ b/crypto/ecc.c
@@ -279,6 +279,48 @@ static void vli_rshift1(u64 *vli, unsigned int ndigits)
}
}
+#ifdef __has_builtin
+#if __has_builtin(__builtin_addcll)
+#define USE_BUILTIN_ADDC
+#endif
+#endif
+
+/* Computes result = left + right + carry_in and updates carry_out */
+static inline void add_carry(u64 left, u64 right, u64 *result, u64 carry_in,
+ u64 *carry_out)
+{
+#ifdef USE_BUILTIN_ADDC
+ *result = __builtin_addcll(left, right, carry_in, carry_out);
+#else
+ u64 sum1, sum2;
+ u64 c1 = __builtin_uaddll_overflow(left, right, &sum1);
+ u64 c2 = __builtin_uaddll_overflow(sum1, carry_in, &sum2);
+ *result = sum2;
+ *carry_out = c1 | c2;
+#endif
+}
+
+#ifdef __has_builtin
+#if __has_builtin(__builtin_subcll)
+#define USE_BUILTIN_SUBC
+#endif
+#endif
+
+/* Computes result = left - right - borrow_in and updates borrow_out */
+static inline void sub_borrow(u64 left, u64 right, u64 *result, u64 borrow_in,
+ u64 *borrow_out)
+{
+#ifdef USE_BUILTIN_SUBC
+ *result = __builtin_subcll(left, right, borrow_in, borrow_out);
+#else
+ u64 diff1, diff2;
+ u64 b1 = __builtin_usubll_overflow(left, right, &diff1);
+ u64 b2 = __builtin_usubll_overflow(diff1, borrow_in, &diff2);
+ *result = diff2;
+ *borrow_out = b1 | b2;
+#endif
+}
+
/* Computes result = left + right, returning carry. Can modify in place. */
static u64 vli_add(u64 *result, const u64 *left, const u64 *right,
unsigned int ndigits)
@@ -286,15 +328,8 @@ static u64 vli_add(u64 *result, const u64 *left, const u64 *right,
u64 carry = 0;
int i;
- for (i = 0; i < ndigits; i++) {
- u64 sum;
-
- sum = left[i] + right[i] + carry;
- if (sum != left[i])
- carry = (sum < left[i]);
-
- result[i] = sum;
- }
+ for (i = 0; i < ndigits; i++)
+ add_carry(left[i], right[i], &result[i], carry, &carry);
return carry;
}
@@ -303,40 +338,29 @@ static u64 vli_add(u64 *result, const u64 *left, const u64 *right,
static u64 vli_uadd(u64 *result, const u64 *left, u64 right,
unsigned int ndigits)
{
- u64 carry = right;
+ u64 carry;
int i;
- for (i = 0; i < ndigits; i++) {
- u64 sum;
+ if (ndigits == 0)
+ return right;
- sum = left[i] + carry;
- if (sum != left[i])
- carry = (sum < left[i]);
- else
- carry = !!carry;
+ carry = __builtin_uaddll_overflow(left[0], right, &result[0]);
- result[i] = sum;
- }
+ for (i = 1; i < ndigits; i++)
+ carry = __builtin_uaddll_overflow(left[i], carry, &result[i]);
return carry;
}
/* Computes result = left - right, returning borrow. Can modify in place. */
u64 vli_sub(u64 *result, const u64 *left, const u64 *right,
- unsigned int ndigits)
+ unsigned int ndigits)
{
u64 borrow = 0;
int i;
- for (i = 0; i < ndigits; i++) {
- u64 diff;
-
- diff = left[i] - right[i] - borrow;
- if (diff != left[i])
- borrow = (diff > left[i]);
-
- result[i] = diff;
- }
+ for (i = 0; i < ndigits; i++)
+ sub_borrow(left[i], right[i], &result[i], borrow, &borrow);
return borrow;
}
@@ -344,20 +368,18 @@ EXPORT_SYMBOL(vli_sub);
/* Computes result = left - right, returning borrow. Can modify in place. */
static u64 vli_usub(u64 *result, const u64 *left, u64 right,
- unsigned int ndigits)
+ unsigned int ndigits)
{
- u64 borrow = right;
+ u64 borrow;
int i;
- for (i = 0; i < ndigits; i++) {
- u64 diff;
+ if (ndigits == 0)
+ return right;
- diff = left[i] - borrow;
- if (diff != left[i])
- borrow = (diff > left[i]);
+ borrow = __builtin_usubll_overflow(left[0], right, &result[0]);
- result[i] = diff;
- }
+ for (i = 1; i < ndigits; i++)
+ borrow = __builtin_usubll_overflow(left[i], borrow, &result[i]);
return borrow;
}
--
2.54.0
^ permalink raw reply related [flat|nested] 6+ messages in thread* Re: [PATCH] crypto: ecc - Optimize vli additive operations using compiler builtins
2026-06-07 11:24 [PATCH] crypto: ecc - Optimize vli additive operations using compiler builtins Fabian Blatter
@ 2026-06-09 18:58 ` Stefan Berger
2026-06-09 20:51 ` Fabian
2026-06-10 17:25 ` Stefan Berger
1 sibling, 1 reply; 6+ messages in thread
From: Stefan Berger @ 2026-06-09 18:58 UTC (permalink / raw)
To: Fabian Blatter, lukas, ignat, herbert, davem; +Cc: linux-crypto, linux-kernel
On 6/7/26 7:24 AM, Fabian Blatter wrote:
> Replace the software carry flag emulation with compiler builtins.
>
> Even the newest compilers struggle with taking advantage of the
> hardware carry flag. Compiler builtins allow the compiler to
> much more easily achieve this while still remaining constant-time.
It looks like you made vli_usub and vli_uadd constant-time now because
otherwise the loops could be ended early once borrow == 0 or carry == 0
respectively. Are all the other functions that operate on the private
keys constant-time?
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] crypto: ecc - Optimize vli additive operations using compiler builtins
2026-06-09 18:58 ` Stefan Berger
@ 2026-06-09 20:51 ` Fabian
2026-06-10 14:52 ` Stefan Berger
0 siblings, 1 reply; 6+ messages in thread
From: Fabian @ 2026-06-09 20:51 UTC (permalink / raw)
To: Stefan Berger; +Cc: lukas, ignat, herbert, davem, linux-crypto, linux-kernel
On Tue, 9 Jun 2026 at 20:58, Stefan Berger <stefanb@linux.ibm.com> wrote:
>
>
>
> On 6/7/26 7:24 AM, Fabian Blatter wrote:
> > Replace the software carry flag emulation with compiler builtins.
> >
> > Even the newest compilers struggle with taking advantage of the
> > hardware carry flag. Compiler builtins allow the compiler to
> > much more easily achieve this while still remaining constant-time.
>
> It looks like you made vli_usub and vli_uadd constant-time now because
> otherwise the loops could be ended early once borrow == 0 or carry == 0
> respectively. Are all the other functions that operate on the private
> keys constant-time?
>
Thanks for the reply,
My primary goal with this patch was performance optimization.
I did not add early exiting because the original version didn't either.
To answer your question: No, some other functions in ecc.c
are not constant-time. For example, vli_is_zero and vli_cmp both
contain early exits.
My patch does remove the branches in the inner loop,
however, the original ones were already constant-time in practice,
because the compiler replaces the branches with cmov's.
I am happy to make any changes to this patch if you like.
I could also look into making `vli_cmp` and `vli_is_zero`,
or others constant-time in a future patch.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] crypto: ecc - Optimize vli additive operations using compiler builtins
2026-06-09 20:51 ` Fabian
@ 2026-06-10 14:52 ` Stefan Berger
2026-06-10 16:57 ` Fabian
0 siblings, 1 reply; 6+ messages in thread
From: Stefan Berger @ 2026-06-10 14:52 UTC (permalink / raw)
To: Fabian; +Cc: lukas, ignat, herbert, davem, linux-crypto, linux-kernel
On 6/9/26 4:51 PM, Fabian wrote:
> On Tue, 9 Jun 2026 at 20:58, Stefan Berger <stefanb@linux.ibm.com> wrote:
>>
>>
>>
>> On 6/7/26 7:24 AM, Fabian Blatter wrote:
>>> Replace the software carry flag emulation with compiler builtins.
>>>
>>> Even the newest compilers struggle with taking advantage of the
>>> hardware carry flag. Compiler builtins allow the compiler to
>>> much more easily achieve this while still remaining constant-time.
>>
>> It looks like you made vli_usub and vli_uadd constant-time now because
>> otherwise the loops could be ended early once borrow == 0 or carry == 0
>> respectively. Are all the other functions that operate on the private
>> keys constant-time?
>>
>
> Thanks for the reply,
>
> My primary goal with this patch was performance optimization.
> I did not add early exiting because the original version didn't either.
>
> To answer your question: No, some other functions in ecc.c
> are not constant-time. For example, vli_is_zero and vli_cmp both
> contain early exits.
> > My patch does remove the branches in the inner loop,
> however, the original ones were already constant-time in practice,
> because the compiler replaces the branches with cmov's.
> > I am happy to make any changes to this patch if you like.
ecrdsa calls ecc_is_pubkey_valid_partial -> vli_mod_square_fast ->
vli_mmod_fast and then may call vli_usub or vli_uadd via
vli_mmod_special2 or vli_mmod_barret.
ecc_point_mult operates on a private key and will call vli_mmod_fast and
for some non-NIST keys it may call either one of vli_usub or vli_uadd
via vli_mmod_special2 or vli_mmod_barret.
Due to the private key operations it's probably better to keep the
functions constant-time for now.
> I could also look into making `vli_cmp` and `vli_is_zero`,
> or others constant-time in a future patch.
I wonder whether it would be practical to suffix constant-time functions
with _ct so that it becomes visible whether the call paths of functions
operation on private keys only call _ct functions? Sometimes one could
optimize functions shared by private and public key operations for
performance -- call them with 'bool ct' in this case and suffix them
with _oct (o=optimized + ct) indicating that they support both variants?
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] crypto: ecc - Optimize vli additive operations using compiler builtins
2026-06-10 14:52 ` Stefan Berger
@ 2026-06-10 16:57 ` Fabian
0 siblings, 0 replies; 6+ messages in thread
From: Fabian @ 2026-06-10 16:57 UTC (permalink / raw)
To: Stefan Berger; +Cc: lukas, ignat, herbert, davem, linux-crypto, linux-kernel
On Wed, 10 Jun 2026 at 16:52, Stefan Berger <stefanb@linux.ibm.com> wrote:
>
>
>
> On 6/9/26 4:51 PM, Fabian wrote:
> > On Tue, 9 Jun 2026 at 20:58, Stefan Berger <stefanb@linux.ibm.com> wrote:
> >>
> >>
> >>
> >> On 6/7/26 7:24 AM, Fabian Blatter wrote:
> >>> Replace the software carry flag emulation with compiler builtins.
> >>>
> >>> Even the newest compilers struggle with taking advantage of the
> >>> hardware carry flag. Compiler builtins allow the compiler to
> >>> much more easily achieve this while still remaining constant-time.
> >>
> >> It looks like you made vli_usub and vli_uadd constant-time now because
> >> otherwise the loops could be ended early once borrow == 0 or carry == 0
> >> respectively. Are all the other functions that operate on the private
> >> keys constant-time?
> >>
> >
> > Thanks for the reply,
> >
> > My primary goal with this patch was performance optimization.
> > I did not add early exiting because the original version didn't either.
> >
> > To answer your question: No, some other functions in ecc.c
> > are not constant-time. For example, vli_is_zero and vli_cmp both
> > contain early exits.
> > > My patch does remove the branches in the inner loop,
> > however, the original ones were already constant-time in practice,
> > because the compiler replaces the branches with cmov's.
> > > I am happy to make any changes to this patch if you like.
>
> ecrdsa calls ecc_is_pubkey_valid_partial -> vli_mod_square_fast ->
> vli_mmod_fast and then may call vli_usub or vli_uadd via
> vli_mmod_special2 or vli_mmod_barret.
>
> ecc_point_mult operates on a private key and will call vli_mmod_fast and
> for some non-NIST keys it may call either one of vli_usub or vli_uadd
> via vli_mmod_special2 or vli_mmod_barret.
>
> Due to the private key operations it's probably better to keep the
> functions constant-time for now.
Agreed.
>
> > I could also look into making `vli_cmp` and `vli_is_zero`,
> > or others constant-time in a future patch.
>
> I wonder whether it would be practical to suffix constant-time functions
> with _ct so that it becomes visible whether the call paths of functions
> operation on private keys only call _ct functions? Sometimes one could
> optimize functions shared by private and public key operations for
> performance -- call them with 'bool ct' in this case and suffix them
> with _oct (o=optimized + ct) indicating that they support both variants?
>
I am unsure whether constant-time operations are even slower than
early-exit variants by a significant margin. vli_uadd and vli_usub
don't seem to be called that often anyways. I believe this probably
doesn't justify the resulting binary size and complexity increase.
Let me know how you'd like to move forward.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] crypto: ecc - Optimize vli additive operations using compiler builtins
2026-06-07 11:24 [PATCH] crypto: ecc - Optimize vli additive operations using compiler builtins Fabian Blatter
2026-06-09 18:58 ` Stefan Berger
@ 2026-06-10 17:25 ` Stefan Berger
1 sibling, 0 replies; 6+ messages in thread
From: Stefan Berger @ 2026-06-10 17:25 UTC (permalink / raw)
To: Fabian Blatter, lukas, ignat, herbert, davem; +Cc: linux-crypto, linux-kernel
On 6/7/26 7:24 AM, Fabian Blatter wrote:
> Replace the software carry flag emulation with compiler builtins.
>
> Even the newest compilers struggle with taking advantage of the
> hardware carry flag. Compiler builtins allow the compiler to
> much more easily achieve this while still remaining constant-time.
>
> This yields an approximately 6-7% performance improvement
> on the ecc_gen_privkey, ecc_make_pub_key and crypto_ecdh_shared_secret
> functions on x86_64 on all curve sizes.
>
> Additionally, the code becomes much more readable.
>
> Signed-off-by: Fabian Blatter <fabianblatter09@gmail.com>
Reviewed-by: Stefan Berger <stefanb@linux.ibm.com>
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2026-06-10 17:25 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-07 11:24 [PATCH] crypto: ecc - Optimize vli additive operations using compiler builtins Fabian Blatter
2026-06-09 18:58 ` Stefan Berger
2026-06-09 20:51 ` Fabian
2026-06-10 14:52 ` Stefan Berger
2026-06-10 16:57 ` Fabian
2026-06-10 17:25 ` Stefan Berger
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox