* [PATCH 0/2] crypto: arm64/ghash-ce - performance improvements
@ 2018-08-04 18:46 Ard Biesheuvel
2018-08-04 18:46 ` [PATCH 1/2] crypto: arm64/ghash-ce - replace NEON yield check with block limit Ard Biesheuvel
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: Ard Biesheuvel @ 2018-08-04 18:46 UTC (permalink / raw)
To: linux-arm-kernel
Another bit of performance work on the GHASH driver: this time it is not
the combined AES/GCM algorithm but the bare GHASH driver that gets updated.
Even though ARM cores that implement the polynomial multiplication
instructions that these routines depend on are guaranteed to also support
the AES instructions, and can thus use the AES/GCM driver, there could
be reasons to use the accelerated GHASH in isolation, e.g., with another
symmetric blockcipher, with a faster h/w accelerator, or potentially with
an accelerator that does not expose the AES key to the OS.
The resulting code runs at 1.1 cycles per byte on Cortex-A53 (down from
2.4 cycles per byte).
Ard Biesheuvel (2):
crypto: arm64/ghash-ce - replace NEON yield check with block limit
crypto: arm64/ghash-ce - implement 4-way aggregation
arch/arm64/crypto/ghash-ce-core.S | 153 ++++++++++++++------
arch/arm64/crypto/ghash-ce-glue.c | 87 ++++++-----
2 files changed, 161 insertions(+), 79 deletions(-)
--
2.18.0
* [PATCH 1/2] crypto: arm64/ghash-ce - replace NEON yield check with block limit
2018-08-04 18:46 [PATCH 0/2] crypto: arm64/ghash-ce - performance improvements Ard Biesheuvel
@ 2018-08-04 18:46 ` Ard Biesheuvel
2018-08-04 18:46 ` [PATCH 2/2] crypto: arm64/ghash-ce - implement 4-way aggregation Ard Biesheuvel
2018-08-07 9:53 ` [PATCH 0/2] crypto: arm64/ghash-ce - performance improvements Herbert Xu
2 siblings, 0 replies; 4+ messages in thread
From: Ard Biesheuvel @ 2018-08-04 18:46 UTC (permalink / raw)
To: linux-arm-kernel
Checking the TIF_NEED_RESCHED flag is disproportionately costly on cores
with fast crypto instructions and comparatively slow memory accesses.
For algorithms such as GHASH, which execute at ~1 cycle per byte on
cores that implement support for 64-bit polynomial multiplication,
there is really no need to check the TIF_NEED_RESCHED flag particularly
often, and so we can remove the NEON yield check from the assembler
routines.
However, unlike the AEAD or skcipher APIs, the shash/ahash APIs take
arbitrary input lengths, and so there needs to be some sanity check
to ensure that we don't hog the CPU for excessive amounts of time.
So let's simply cap the maximum input size that is processed in one go
to 64 KB.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
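(As a sketch of the idea only, not part of the patch: the glue-code side of
this change amounts to a bounded chunking loop around the existing NEON
helper. The standalone function below merely mirrors the loop added to
ghash_update() in ghash-ce-glue.c; ghash_update_capped() is a hypothetical
name, and struct ghash_key / ghash_do_update() are the existing definitions
from that file.)

    #include <linux/kernel.h>               /* min()  */
    #include <linux/sizes.h>                /* SZ_64K */

    #define GHASH_BLOCK_SIZE    16          /* as in ghash-ce-glue.c */

    /* avoid hogging the CPU for too long: at most 64 KiB per NEON pass */
    #define MAX_BLOCKS          (SZ_64K / GHASH_BLOCK_SIZE)

    static void ghash_update_capped(int blocks, u64 dg[], const char *src,
                                    struct ghash_key *key, const char *head)
    {
            do {
                    int chunk = min(blocks, MAX_BLOCKS);

                    /* existing helper that invokes the NEON routine */
                    ghash_do_update(chunk, dg, src, key, head);

                    blocks -= chunk;
                    src += chunk * GHASH_BLOCK_SIZE;
                    head = NULL;    /* head block applies to the first pass only */
            } while (blocks > 0);
    }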
arch/arm64/crypto/ghash-ce-core.S | 39 ++++++--------------
arch/arm64/crypto/ghash-ce-glue.c | 16 ++++++--
2 files changed, 23 insertions(+), 32 deletions(-)
diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
index 913e49932ae6..344811c6a0ca 100644
--- a/arch/arm64/crypto/ghash-ce-core.S
+++ b/arch/arm64/crypto/ghash-ce-core.S
@@ -213,31 +213,23 @@
.endm
.macro __pmull_ghash, pn
- frame_push 5
-
- mov x19, x0
- mov x20, x1
- mov x21, x2
- mov x22, x3
- mov x23, x4
-
-0: ld1 {SHASH.2d}, [x22]
- ld1 {XL.2d}, [x20]
+ ld1 {SHASH.2d}, [x3]
+ ld1 {XL.2d}, [x1]
ext SHASH2.16b, SHASH.16b, SHASH.16b, #8
eor SHASH2.16b, SHASH2.16b, SHASH.16b
__pmull_pre_\pn
/* do the head block first, if supplied */
- cbz x23, 1f
- ld1 {T1.2d}, [x23]
- mov x23, xzr
- b 2f
+ cbz x4, 0f
+ ld1 {T1.2d}, [x4]
+ mov x4, xzr
+ b 1f
-1: ld1 {T1.2d}, [x21], #16
- sub w19, w19, #1
+0: ld1 {T1.2d}, [x2], #16
+ sub w0, w0, #1
-2: /* multiply XL by SHASH in GF(2^128) */
+1: /* multiply XL by SHASH in GF(2^128) */
CPU_LE( rev64 T1.16b, T1.16b )
ext T2.16b, XL.16b, XL.16b, #8
@@ -259,18 +251,9 @@ CPU_LE( rev64 T1.16b, T1.16b )
eor T2.16b, T2.16b, XH.16b
eor XL.16b, XL.16b, T2.16b
- cbz w19, 3f
-
- if_will_cond_yield_neon
- st1 {XL.2d}, [x20]
- do_cond_yield_neon
- b 0b
- endif_yield_neon
-
- b 1b
+ cbnz w0, 0b
-3: st1 {XL.2d}, [x20]
- frame_pop
+ st1 {XL.2d}, [x1]
ret
.endm
diff --git a/arch/arm64/crypto/ghash-ce-glue.c b/arch/arm64/crypto/ghash-ce-glue.c
index 88e3d93fa7c7..03ce71ea81a2 100644
--- a/arch/arm64/crypto/ghash-ce-glue.c
+++ b/arch/arm64/crypto/ghash-ce-glue.c
@@ -113,6 +113,9 @@ static void ghash_do_update(int blocks, u64 dg[], const char *src,
}
}
+/* avoid hogging the CPU for too long */
+#define MAX_BLOCKS (SZ_64K / GHASH_BLOCK_SIZE)
+
static int ghash_update(struct shash_desc *desc, const u8 *src,
unsigned int len)
{
@@ -136,11 +139,16 @@ static int ghash_update(struct shash_desc *desc, const u8 *src,
blocks = len / GHASH_BLOCK_SIZE;
len %= GHASH_BLOCK_SIZE;
- ghash_do_update(blocks, ctx->digest, src, key,
- partial ? ctx->buf : NULL);
+ do {
+ int chunk = min(blocks, MAX_BLOCKS);
+
+ ghash_do_update(chunk, ctx->digest, src, key,
+ partial ? ctx->buf : NULL);
- src += blocks * GHASH_BLOCK_SIZE;
- partial = 0;
+ blocks -= chunk;
+ src += chunk * GHASH_BLOCK_SIZE;
+ partial = 0;
+ } while (unlikely(blocks > 0));
}
if (len)
memcpy(ctx->buf + partial, src, len);
--
2.18.0
* [PATCH 2/2] crypto: arm64/ghash-ce - implement 4-way aggregation
2018-08-04 18:46 [PATCH 0/2] crypto: arm64/ghash-ce - performance improvements Ard Biesheuvel
2018-08-04 18:46 ` [PATCH 1/2] crypto: arm64/ghash-ce - replace NEON yield check with block limit Ard Biesheuvel
@ 2018-08-04 18:46 ` Ard Biesheuvel
2018-08-07 9:53 ` [PATCH 0/2] crypto: arm64/ghash-ce - performance improvements Herbert Xu
2 siblings, 0 replies; 4+ messages in thread
From: Ard Biesheuvel @ 2018-08-04 18:46 UTC (permalink / raw)
To: linux-arm-kernel
Enhance the GHASH implementation that uses 64-bit polynomial
multiplication by adding support for 4-way aggregation. This
more than doubles the performance, from 2.4 cycles per byte
down to 1.1 cycles per byte on Cortex-A53.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
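(As a sketch of the idea only, not part of the patch: 4-way aggregation
exploits the identity

    ((((D ^ B1)*H ^ B2)*H ^ B3)*H ^ B4)*H
        = (D ^ B1)*H^4 ^ B2*H^3 ^ B3*H^2 ^ B4*H

so the four GF(2^128) multiplications become independent of each other and
only a single reduction is needed per four blocks. Expressed with the
generic gf128mul_lle() helper - the same one the glue code uses below to
precompute H^2..H^4 - plus be128_xor() from <crypto/b128ops.h>, one
aggregated step looks roughly as follows; ghash_agg4() is a hypothetical
name, and blk[] is assumed to hold the four input blocks in the driver's
be128 representation.)

    #include <crypto/b128ops.h>
    #include <crypto/gf128mul.h>

    static void ghash_agg4(be128 *dg, const be128 blk[4],
                           const be128 *h, const be128 *h2,
                           const be128 *h3, const be128 *h4)
    {
            be128 t1 = *dg, t2 = blk[1], t3 = blk[2], t4 = blk[3];

            be128_xor(&t1, &t1, &blk[0]);   /* fold B1 into the digest */

            gf128mul_lle(&t1, h4);          /* (dg ^ B1) * H^4 */
            gf128mul_lle(&t2, h3);          /*        B2 * H^3 */
            gf128mul_lle(&t3, h2);          /*        B3 * H^2 */
            gf128mul_lle(&t4, h);           /*        B4 * H   */

            be128_xor(dg, &t1, &t2);        /* combine the four products */
            be128_xor(dg, dg, &t3);
            be128_xor(dg, dg, &t4);
    }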
arch/arm64/crypto/ghash-ce-core.S | 122 +++++++++++++++++---
arch/arm64/crypto/ghash-ce-glue.c | 71 ++++++------
2 files changed, 142 insertions(+), 51 deletions(-)
diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
index 344811c6a0ca..1b319b716d5e 100644
--- a/arch/arm64/crypto/ghash-ce-core.S
+++ b/arch/arm64/crypto/ghash-ce-core.S
@@ -46,6 +46,19 @@
ss3 .req v26
ss4 .req v27
+ XL2 .req v8
+ XM2 .req v9
+ XH2 .req v10
+ XL3 .req v11
+ XM3 .req v12
+ XH3 .req v13
+ TT3 .req v14
+ TT4 .req v15
+ HH .req v16
+ HH3 .req v17
+ HH4 .req v18
+ HH34 .req v19
+
.text
.arch armv8-a+crypto
@@ -134,11 +147,25 @@
.endm
.macro __pmull_pre_p64
+ add x8, x3, #16
+ ld1 {HH.2d-HH4.2d}, [x8]
+
+ trn1 SHASH2.2d, SHASH.2d, HH.2d
+ trn2 T1.2d, SHASH.2d, HH.2d
+ eor SHASH2.16b, SHASH2.16b, T1.16b
+
+ trn1 HH34.2d, HH3.2d, HH4.2d
+ trn2 T1.2d, HH3.2d, HH4.2d
+ eor HH34.16b, HH34.16b, T1.16b
+
movi MASK.16b, #0xe1
shl MASK.2d, MASK.2d, #57
.endm
.macro __pmull_pre_p8
+ ext SHASH2.16b, SHASH.16b, SHASH.16b, #8
+ eor SHASH2.16b, SHASH2.16b, SHASH.16b
+
// k00_16 := 0x0000000000000000_000000000000ffff
// k32_48 := 0x00000000ffffffff_0000ffffffffffff
movi k32_48.2d, #0xffffffff
@@ -215,8 +242,6 @@
.macro __pmull_ghash, pn
ld1 {SHASH.2d}, [x3]
ld1 {XL.2d}, [x1]
- ext SHASH2.16b, SHASH.16b, SHASH.16b, #8
- eor SHASH2.16b, SHASH2.16b, SHASH.16b
__pmull_pre_\pn
@@ -224,12 +249,79 @@
cbz x4, 0f
ld1 {T1.2d}, [x4]
mov x4, xzr
- b 1f
+ b 3f
+
+0: .ifc \pn, p64
+ tbnz w0, #0, 2f // skip until #blocks is a
+ tbnz w0, #1, 2f // round multiple of 4
+
+1: ld1 {XM3.16b-TT4.16b}, [x2], #64
+
+ sub w0, w0, #4
+
+ rev64 T1.16b, XM3.16b
+ rev64 T2.16b, XH3.16b
+ rev64 TT4.16b, TT4.16b
+ rev64 TT3.16b, TT3.16b
+
+ ext IN1.16b, TT4.16b, TT4.16b, #8
+ ext XL3.16b, TT3.16b, TT3.16b, #8
+
+ eor TT4.16b, TT4.16b, IN1.16b
+ pmull2 XH2.1q, SHASH.2d, IN1.2d // a1 * b1
+ pmull XL2.1q, SHASH.1d, IN1.1d // a0 * b0
+ pmull XM2.1q, SHASH2.1d, TT4.1d // (a1 + a0)(b1 + b0)
+
+ eor TT3.16b, TT3.16b, XL3.16b
+ pmull2 XH3.1q, HH.2d, XL3.2d // a1 * b1
+ pmull XL3.1q, HH.1d, XL3.1d // a0 * b0
+ pmull2 XM3.1q, SHASH2.2d, TT3.2d // (a1 + a0)(b1 + b0)
+
+ ext IN1.16b, T2.16b, T2.16b, #8
+ eor XL2.16b, XL2.16b, XL3.16b
+ eor XH2.16b, XH2.16b, XH3.16b
+ eor XM2.16b, XM2.16b, XM3.16b
+
+ eor T2.16b, T2.16b, IN1.16b
+ pmull2 XH3.1q, HH3.2d, IN1.2d // a1 * b1
+ pmull XL3.1q, HH3.1d, IN1.1d // a0 * b0
+ pmull XM3.1q, HH34.1d, T2.1d // (a1 + a0)(b1 + b0)
-0: ld1 {T1.2d}, [x2], #16
+ eor XL2.16b, XL2.16b, XL3.16b
+ eor XH2.16b, XH2.16b, XH3.16b
+ eor XM2.16b, XM2.16b, XM3.16b
+
+ ext IN1.16b, T1.16b, T1.16b, #8
+ ext TT3.16b, XL.16b, XL.16b, #8
+ eor XL.16b, XL.16b, IN1.16b
+ eor T1.16b, T1.16b, TT3.16b
+
+ pmull2 XH.1q, HH4.2d, XL.2d // a1 * b1
+ eor T1.16b, T1.16b, XL.16b
+ pmull XL.1q, HH4.1d, XL.1d // a0 * b0
+ pmull2 XM.1q, HH34.2d, T1.2d // (a1 + a0)(b1 + b0)
+
+ eor XL.16b, XL.16b, XL2.16b
+ eor XH.16b, XH.16b, XH2.16b
+ eor XM.16b, XM.16b, XM2.16b
+
+ eor T2.16b, XL.16b, XH.16b
+ ext T1.16b, XL.16b, XH.16b, #8
+ eor XM.16b, XM.16b, T2.16b
+
+ __pmull_reduce_p64
+
+ eor T2.16b, T2.16b, XH.16b
+ eor XL.16b, XL.16b, T2.16b
+
+ cbz w0, 5f
+ b 1b
+ .endif
+
+2: ld1 {T1.2d}, [x2], #16
sub w0, w0, #1
-1: /* multiply XL by SHASH in GF(2^128) */
+3: /* multiply XL by SHASH in GF(2^128) */
CPU_LE( rev64 T1.16b, T1.16b )
ext T2.16b, XL.16b, XL.16b, #8
@@ -242,7 +334,7 @@ CPU_LE( rev64 T1.16b, T1.16b )
__pmull_\pn XL, XL, SHASH // a0 * b0
__pmull_\pn XM, T1, SHASH2 // (a1 + a0)(b1 + b0)
- eor T2.16b, XL.16b, XH.16b
+4: eor T2.16b, XL.16b, XH.16b
ext T1.16b, XL.16b, XH.16b, #8
eor XM.16b, XM.16b, T2.16b
@@ -253,7 +345,7 @@ CPU_LE( rev64 T1.16b, T1.16b )
cbnz w0, 0b
- st1 {XL.2d}, [x1]
+5: st1 {XL.2d}, [x1]
ret
.endm
@@ -269,14 +361,10 @@ ENTRY(pmull_ghash_update_p8)
__pmull_ghash p8
ENDPROC(pmull_ghash_update_p8)
- KS0 .req v8
- KS1 .req v9
- INP0 .req v10
- INP1 .req v11
- HH .req v12
- XL2 .req v13
- XM2 .req v14
- XH2 .req v15
+ KS0 .req v12
+ KS1 .req v13
+ INP0 .req v14
+ INP1 .req v15
.macro load_round_keys, rounds, rk
cmp \rounds, #12
@@ -310,8 +398,8 @@ ENDPROC(pmull_ghash_update_p8)
.endm
.macro pmull_gcm_do_crypt, enc
- ld1 {HH.2d}, [x4], #16
- ld1 {SHASH.2d}, [x4]
+ ld1 {SHASH.2d}, [x4], #16
+ ld1 {HH.2d}, [x4]
ld1 {XL.2d}, [x1]
ldr x8, [x5, #8] // load lower counter
diff --git a/arch/arm64/crypto/ghash-ce-glue.c b/arch/arm64/crypto/ghash-ce-glue.c
index 03ce71ea81a2..08b49fd621cb 100644
--- a/arch/arm64/crypto/ghash-ce-glue.c
+++ b/arch/arm64/crypto/ghash-ce-glue.c
@@ -33,9 +33,12 @@ MODULE_ALIAS_CRYPTO("ghash");
#define GCM_IV_SIZE 12
struct ghash_key {
- u64 a;
- u64 b;
- be128 k;
+ u64 h[2];
+ u64 h2[2];
+ u64 h3[2];
+ u64 h4[2];
+
+ be128 k;
};
struct ghash_desc_ctx {
@@ -46,7 +49,6 @@ struct ghash_desc_ctx {
struct gcm_aes_ctx {
struct crypto_aes_ctx aes_key;
- u64 h2[2];
struct ghash_key ghash_key;
};
@@ -63,11 +65,12 @@ static void (*pmull_ghash_update)(int blocks, u64 dg[], const char *src,
const char *head);
asmlinkage void pmull_gcm_encrypt(int blocks, u64 dg[], u8 dst[],
- const u8 src[], u64 const *k, u8 ctr[],
- u32 const rk[], int rounds, u8 ks[]);
+ const u8 src[], struct ghash_key const *k,
+ u8 ctr[], u32 const rk[], int rounds,
+ u8 ks[]);
asmlinkage void pmull_gcm_decrypt(int blocks, u64 dg[], u8 dst[],
- const u8 src[], u64 const *k,
+ const u8 src[], struct ghash_key const *k,
u8 ctr[], u32 const rk[], int rounds);
asmlinkage void pmull_gcm_encrypt_block(u8 dst[], u8 const src[],
@@ -174,23 +177,36 @@ static int ghash_final(struct shash_desc *desc, u8 *dst)
return 0;
}
+static void ghash_reflect(u64 h[], const be128 *k)
+{
+ u64 carry = be64_to_cpu(k->a) & BIT(63) ? 1 : 0;
+
+ h[0] = (be64_to_cpu(k->b) << 1) | carry;
+ h[1] = (be64_to_cpu(k->a) << 1) | (be64_to_cpu(k->b) >> 63);
+
+ if (carry)
+ h[1] ^= 0xc200000000000000UL;
+}
+
static int __ghash_setkey(struct ghash_key *key,
const u8 *inkey, unsigned int keylen)
{
- u64 a, b;
+ be128 h;
/* needed for the fallback */
memcpy(&key->k, inkey, GHASH_BLOCK_SIZE);
- /* perform multiplication by 'x' in GF(2^128) */
- b = get_unaligned_be64(inkey);
- a = get_unaligned_be64(inkey + 8);
+ ghash_reflect(key->h, &key->k);
+
+ h = key->k;
+ gf128mul_lle(&h, &key->k);
+ ghash_reflect(key->h2, &h);
- key->a = (a << 1) | (b >> 63);
- key->b = (b << 1) | (a >> 63);
+ gf128mul_lle(&h, &key->k);
+ ghash_reflect(key->h3, &h);
- if (b >> 63)
- key->b ^= 0xc200000000000000UL;
+ gf128mul_lle(&h, &key->k);
+ ghash_reflect(key->h4, &h);
return 0;
}
@@ -241,8 +257,7 @@ static int gcm_setkey(struct crypto_aead *tfm, const u8 *inkey,
unsigned int keylen)
{
struct gcm_aes_ctx *ctx = crypto_aead_ctx(tfm);
- be128 h1, h2;
- u8 *key = (u8 *)&h1;
+ u8 key[GHASH_BLOCK_SIZE];
int ret;
ret = crypto_aes_expand_key(&ctx->aes_key, inkey, keylen);
@@ -254,19 +269,7 @@ static int gcm_setkey(struct crypto_aead *tfm, const u8 *inkey,
__aes_arm64_encrypt(ctx->aes_key.key_enc, key, (u8[AES_BLOCK_SIZE]){},
num_rounds(&ctx->aes_key));
- __ghash_setkey(&ctx->ghash_key, key, sizeof(be128));
-
- /* calculate H^2 (used for 2-way aggregation) */
- h2 = h1;
- gf128mul_lle(&h2, &h1);
-
- ctx->h2[0] = (be64_to_cpu(h2.b) << 1) | (be64_to_cpu(h2.a) >> 63);
- ctx->h2[1] = (be64_to_cpu(h2.a) << 1) | (be64_to_cpu(h2.b) >> 63);
-
- if (be64_to_cpu(h2.a) >> 63)
- ctx->h2[1] ^= 0xc200000000000000UL;
-
- return 0;
+ return __ghash_setkey(&ctx->ghash_key, key, sizeof(be128));
}
static int gcm_setauthsize(struct crypto_aead *tfm, unsigned int authsize)
@@ -402,8 +405,8 @@ static int gcm_encrypt(struct aead_request *req)
kernel_neon_begin();
pmull_gcm_encrypt(blocks, dg, walk.dst.virt.addr,
- walk.src.virt.addr, ctx->h2, iv,
- rk, nrounds, ks);
+ walk.src.virt.addr, &ctx->ghash_key,
+ iv, rk, nrounds, ks);
kernel_neon_end();
err = skcipher_walk_done(&walk,
@@ -513,8 +516,8 @@ static int gcm_decrypt(struct aead_request *req)
kernel_neon_begin();
pmull_gcm_decrypt(blocks, dg, walk.dst.virt.addr,
- walk.src.virt.addr, ctx->h2, iv,
- rk, nrounds);
+ walk.src.virt.addr, &ctx->ghash_key,
+ iv, rk, nrounds);
/* check if this is the final iteration of the loop */
if (rem < (2 * AES_BLOCK_SIZE)) {
--
2.18.0
* [PATCH 0/2] crypto: arm64/ghash-ce - performance improvements
2018-08-04 18:46 [PATCH 0/2] crypto: arm64/ghash-ce - performance improvements Ard Biesheuvel
2018-08-04 18:46 ` [PATCH 1/2] crypto: arm64/ghash-ce - replace NEON yield check with block limit Ard Biesheuvel
2018-08-04 18:46 ` [PATCH 2/2] crypto: arm64/ghash-ce - implement 4-way aggregation Ard Biesheuvel
@ 2018-08-07 9:53 ` Herbert Xu
2 siblings, 0 replies; 4+ messages in thread
From: Herbert Xu @ 2018-08-07 9:53 UTC (permalink / raw)
To: linux-arm-kernel
On Sat, Aug 04, 2018 at 08:46:23PM +0200, Ard Biesheuvel wrote:
> Another bit of performance work on the GHASH driver: this time it is not
> the combined AES/GCM algorithm but the bare GHASH driver that gets updated.
>
> Even though ARM cores that implement the polynomial multiplication
> instructions that these routines depend on are guaranteed to also support
> the AES instructions, and can thus use the AES/GCM driver, there could
> be reasons to use the accelerated GHASH in isolation, e.g., with another
> symmetric blockcipher, with a faster h/w accelerator, or potentially with
> an accelerator that does not expose the AES key to the OS.
>
> The resulting code runs at 1.1 cycles per byte on Cortex-A53 (down from
> 2.4 cycles per byte).
>
> Ard Biesheuvel (2):
> crypto: arm64/ghash-ce - replace NEON yield check with block limit
> crypto: arm64/ghash-ce - implement 4-way aggregation
>
> arch/arm64/crypto/ghash-ce-core.S | 153 ++++++++++++++------
> arch/arm64/crypto/ghash-ce-glue.c | 87 ++++++-----
> 2 files changed, 161 insertions(+), 79 deletions(-)
All applied. Thanks.
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt