* [PATCH 0/8] crypto: Clean up arm64 AES-CCM code
From: Ard Biesheuvel @ 2024-01-11 12:33 UTC (permalink / raw)
  To: linux-crypto; +Cc: ebiggers, herbert, Ard Biesheuvel

From: Ard Biesheuvel <ardb@kernel.org>

The AES-CCM driver was written 10+ years ago, based on the very first
kernel mode NEON API for arm64, which eagerly preserved and restored the
NEON registers on each call to kernel_neon_begin() and
kernel_neon_end(), respectively.

For this reason, the asm helpers were constructed in a way that used
only 6 NEON registers, as the kernel mode NEON API at the time
implemented an optimization where kernel_neon_begin() took an int
denoting the number of NEON registers to preserve/restore. Given that no
actual hardware existed at the time (except perhaps for APM X-Gene 1,
which did not implement the crypto instructions), all of this was based
on premature assumptions.

These days, the NEON API is a bit more sophisticated, and does not
bother to preserve/restore anything unless it is needed (e.g., when
context switching or returning to user space). It also no longer
disables preemption. Finally, we've developed some code patterns in the
meantime to deal with tail blocks more cleanly and efficiently.

So let's bring the CCM driver up to date with all of this.
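
For reference, the usage pattern the modern API permits boils down to a
simple bracketed sequence (a minimal sketch, not code from this series):

    kernel_neon_begin();        /* no register count, no preempt_disable() */
    /* ... NEON / Crypto Extensions instructions ... */
    kernel_neon_end();          /* registers are preserved/restored lazily */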

Ard Biesheuvel (8):
  crypto: arm64/aes-ccm - Revert "Rewrite skcipher walker loop"
  crypto: arm64/aes-ccm - Keep NEON enabled during skcipher walk
  crypto: arm64/aes-ccm - Pass short inputs via stack buffer
  crypto: arm64/aes-ccm - Replace bytewise tail handling with NEON
    permute
  crypto: arm64/aes-ccm - Reuse existing MAC update for AAD input
  crypto: arm64/aes-ccm - Cache round keys and unroll AES loops
  crypto: arm64/aes-ccm - Merge encrypt and decrypt asm routines
  crypto: arm64/aes-ccm - Merge finalization into en/decrypt asm helper

 arch/arm64/crypto/Kconfig           |   1 +
 arch/arm64/crypto/aes-ce-ccm-core.S | 270 +++++++-------------
 arch/arm64/crypto/aes-ce-ccm-glue.c | 154 +++++++----
 arch/arm64/crypto/aes-glue.c        |   1 +
 4 files changed, 199 insertions(+), 227 deletions(-)

-- 
2.43.0.275.g3460e3d667-goog



* [PATCH 1/8] crypto: arm64/aes-ccm - Revert "Rewrite skcipher walker loop"
From: Ard Biesheuvel @ 2024-01-11 12:33 UTC (permalink / raw)
  To: linux-crypto; +Cc: ebiggers, herbert, Ard Biesheuvel

From: Ard Biesheuvel <ardb@kernel.org>

This reverts commit 57ead1bf1c54, which updated the CCM code to rely
only on walk.nbytes to check for failures returned from the skcipher
walk API, mostly for the common good rather than to fix a particular
problem in the code.

This change introduces a problem of its own: the skcipher walk is
started with the 'atomic' argument set to false, which means that the
skcipher walk API is permitted to sleep. Subsequently, it invokes
skcipher_walk_done() with preemption disabled on the final iteration of
the loop. This appears to work by accident, but it is arguably a bad
example, and providing a better example was the point of the original
patch.
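
Schematically, the hazard is the following (a simplified sketch, not the
literal driver code; at the time, kernel_neon_begin() still implied
preempt_disable()):

    err = skcipher_walk_aead_encrypt(&walk, req, false); /* atomic == false */
    kernel_neon_begin();                  /* preemption disabled here        */
    /* ... process walk.nbytes bytes with NEON ... */
    err = skcipher_walk_done(&walk, 0);   /* may sleep, preemption still off */
    kernel_neon_end();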

Given that future changes to the CCM code will rely on the original
behavior of entering the loop even for zero-sized inputs, let's just
revert this change entirely, and proceed from there.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/crypto/aes-ce-ccm-glue.c | 57 +++++++++++---------
 1 file changed, 31 insertions(+), 26 deletions(-)

diff --git a/arch/arm64/crypto/aes-ce-ccm-glue.c b/arch/arm64/crypto/aes-ce-ccm-glue.c
index 25cd3808ecbe..c4f14415f5f0 100644
--- a/arch/arm64/crypto/aes-ce-ccm-glue.c
+++ b/arch/arm64/crypto/aes-ce-ccm-glue.c
@@ -161,39 +161,43 @@ static int ccm_encrypt(struct aead_request *req)
 	memcpy(buf, req->iv, AES_BLOCK_SIZE);
 
 	err = skcipher_walk_aead_encrypt(&walk, req, false);
+	if (unlikely(err))
+		return err;
 
 	kernel_neon_begin();
 
 	if (req->assoclen)
 		ccm_calculate_auth_mac(req, mac);
 
-	while (walk.nbytes) {
+	do {
 		u32 tail = walk.nbytes % AES_BLOCK_SIZE;
-		bool final = walk.nbytes == walk.total;
 
-		if (final)
+		if (walk.nbytes == walk.total)
 			tail = 0;
 
 		ce_aes_ccm_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
 				   walk.nbytes - tail, ctx->key_enc,
 				   num_rounds(ctx), mac, walk.iv);
 
-		if (!final)
-			kernel_neon_end();
-		err = skcipher_walk_done(&walk, tail);
-		if (!final)
-			kernel_neon_begin();
-	}
+		if (walk.nbytes == walk.total)
+			ce_aes_ccm_final(mac, buf, ctx->key_enc, num_rounds(ctx));
 
-	ce_aes_ccm_final(mac, buf, ctx->key_enc, num_rounds(ctx));
+		kernel_neon_end();
 
-	kernel_neon_end();
+		if (walk.nbytes) {
+			err = skcipher_walk_done(&walk, tail);
+			if (unlikely(err))
+				return err;
+			if (unlikely(walk.nbytes))
+				kernel_neon_begin();
+		}
+	} while (walk.nbytes);
 
 	/* copy authtag to end of dst */
 	scatterwalk_map_and_copy(mac, req->dst, req->assoclen + req->cryptlen,
 				 crypto_aead_authsize(aead), 1);
 
-	return err;
+	return 0;
 }
 
 static int ccm_decrypt(struct aead_request *req)
@@ -215,36 +219,37 @@ static int ccm_decrypt(struct aead_request *req)
 	memcpy(buf, req->iv, AES_BLOCK_SIZE);
 
 	err = skcipher_walk_aead_decrypt(&walk, req, false);
+	if (unlikely(err))
+		return err;
 
 	kernel_neon_begin();
 
 	if (req->assoclen)
 		ccm_calculate_auth_mac(req, mac);
 
-	while (walk.nbytes) {
+	do {
 		u32 tail = walk.nbytes % AES_BLOCK_SIZE;
-		bool final = walk.nbytes == walk.total;
 
-		if (final)
+		if (walk.nbytes == walk.total)
 			tail = 0;
 
 		ce_aes_ccm_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
 				   walk.nbytes - tail, ctx->key_enc,
 				   num_rounds(ctx), mac, walk.iv);
 
-		if (!final)
-			kernel_neon_end();
-		err = skcipher_walk_done(&walk, tail);
-		if (!final)
-			kernel_neon_begin();
-	}
+		if (walk.nbytes == walk.total)
+			ce_aes_ccm_final(mac, buf, ctx->key_enc, num_rounds(ctx));
 
-	ce_aes_ccm_final(mac, buf, ctx->key_enc, num_rounds(ctx));
+		kernel_neon_end();
 
-	kernel_neon_end();
-
-	if (unlikely(err))
-		return err;
+		if (walk.nbytes) {
+			err = skcipher_walk_done(&walk, tail);
+			if (unlikely(err))
+				return err;
+			if (unlikely(walk.nbytes))
+				kernel_neon_begin();
+		}
+	} while (walk.nbytes);
 
 	/* compare calculated auth tag with the stored one */
 	scatterwalk_map_and_copy(buf, req->src,
-- 
2.43.0.275.g3460e3d667-goog



* [PATCH 2/8] crypto: arm64/aes-ccm - Keep NEON enabled during skcipher walk
From: Ard Biesheuvel @ 2024-01-11 12:33 UTC (permalink / raw)
  To: linux-crypto; +Cc: ebiggers, herbert, Ard Biesheuvel

From: Ard Biesheuvel <ardb@kernel.org>

Now that kernel mode NEON no longer disables preemption, we no longer
have to take care to disable and re-enable use of the NEON when calling
into the skcipher walk API. So just keep it enabled until done.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/crypto/aes-ce-ccm-glue.c | 22 +++++++++-----------
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/crypto/aes-ce-ccm-glue.c b/arch/arm64/crypto/aes-ce-ccm-glue.c
index c4f14415f5f0..b177ebea7d09 100644
--- a/arch/arm64/crypto/aes-ce-ccm-glue.c
+++ b/arch/arm64/crypto/aes-ce-ccm-glue.c
@@ -182,17 +182,16 @@ static int ccm_encrypt(struct aead_request *req)
 		if (walk.nbytes == walk.total)
 			ce_aes_ccm_final(mac, buf, ctx->key_enc, num_rounds(ctx));
 
-		kernel_neon_end();
-
 		if (walk.nbytes) {
 			err = skcipher_walk_done(&walk, tail);
-			if (unlikely(err))
-				return err;
-			if (unlikely(walk.nbytes))
-				kernel_neon_begin();
 		}
 	} while (walk.nbytes);
 
+	kernel_neon_end();
+
+	if (unlikely(err))
+		return err;
+
 	/* copy authtag to end of dst */
 	scatterwalk_map_and_copy(mac, req->dst, req->assoclen + req->cryptlen,
 				 crypto_aead_authsize(aead), 1);
@@ -240,17 +239,16 @@ static int ccm_decrypt(struct aead_request *req)
 		if (walk.nbytes == walk.total)
 			ce_aes_ccm_final(mac, buf, ctx->key_enc, num_rounds(ctx));
 
-		kernel_neon_end();
-
 		if (walk.nbytes) {
 			err = skcipher_walk_done(&walk, tail);
-			if (unlikely(err))
-				return err;
-			if (unlikely(walk.nbytes))
-				kernel_neon_begin();
 		}
 	} while (walk.nbytes);
 
+	kernel_neon_end();
+
+	if (unlikely(err))
+		return err;
+
 	/* compare calculated auth tag with the stored one */
 	scatterwalk_map_and_copy(buf, req->src,
 				 req->assoclen + req->cryptlen - authsize,
-- 
2.43.0.275.g3460e3d667-goog



* [PATCH 3/8] crypto: arm64/aes-ccm - Pass short inputs via stack buffer
From: Ard Biesheuvel @ 2024-01-11 12:33 UTC (permalink / raw)
  To: linux-crypto; +Cc: ebiggers, herbert, Ard Biesheuvel

From: Ard Biesheuvel <ardb@kernel.org>

In preparation for optimizing the CCM core asm code using permutation
vectors and overlapping loads and stores, ensure that inputs shorter
than the size of an AES block are passed via a buffer on the stack, in a
way that positions the data at the end of a 16-byte buffer. This removes
the need for the asm code to reason about a rare corner case where the
tail of the data cannot be read/written using a single NEON load/store
instruction.
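
Condensed, the idea is the following (a sketch; 'len' stands for the
short input length handled by this path):

    u8 buf[AES_BLOCK_SIZE];     /* 16-byte stack buffer */

    /* a short input ends up in the last 'len' bytes of buf, so a single
       16-byte NEON load/store always covers the whole message */
    src = dst = memcpy(buf + sizeof(buf) - len, src, len);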

While at it, tweak the copyright header and authorship to bring it up to
date.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/crypto/aes-ce-ccm-glue.c | 57 ++++++++++++++------
 1 file changed, 40 insertions(+), 17 deletions(-)

diff --git a/arch/arm64/crypto/aes-ce-ccm-glue.c b/arch/arm64/crypto/aes-ce-ccm-glue.c
index b177ebea7d09..2f4e6a318fcd 100644
--- a/arch/arm64/crypto/aes-ce-ccm-glue.c
+++ b/arch/arm64/crypto/aes-ce-ccm-glue.c
@@ -1,8 +1,11 @@
 // SPDX-License-Identifier: GPL-2.0-only
 /*
- * aes-ccm-glue.c - AES-CCM transform for ARMv8 with Crypto Extensions
+ * aes-ce-ccm-glue.c - AES-CCM transform for ARMv8 with Crypto Extensions
  *
- * Copyright (C) 2013 - 2017 Linaro Ltd <ard.biesheuvel@linaro.org>
+ * Copyright (C) 2013 - 2017 Linaro Ltd.
+ * Copyright (C) 2024 Google LLC
+ *
+ * Author: Ard Biesheuvel <ardb@kernel.org>
  */
 
 #include <asm/neon.h>
@@ -149,7 +152,7 @@ static int ccm_encrypt(struct aead_request *req)
 	struct crypto_aes_ctx *ctx = crypto_aead_ctx(aead);
 	struct skcipher_walk walk;
 	u8 __aligned(8) mac[AES_BLOCK_SIZE];
-	u8 buf[AES_BLOCK_SIZE];
+	u8 orig_iv[AES_BLOCK_SIZE];
 	u32 len = req->cryptlen;
 	int err;
 
@@ -158,7 +161,7 @@ static int ccm_encrypt(struct aead_request *req)
 		return err;
 
 	/* preserve the original iv for the final round */
-	memcpy(buf, req->iv, AES_BLOCK_SIZE);
+	memcpy(orig_iv, req->iv, AES_BLOCK_SIZE);
 
 	err = skcipher_walk_aead_encrypt(&walk, req, false);
 	if (unlikely(err))
@@ -171,16 +174,26 @@ static int ccm_encrypt(struct aead_request *req)
 
 	do {
 		u32 tail = walk.nbytes % AES_BLOCK_SIZE;
+		const u8 *src = walk.src.virt.addr;
+		u8 *dst = walk.dst.virt.addr;
+		u8 buf[AES_BLOCK_SIZE];
 
 		if (walk.nbytes == walk.total)
 			tail = 0;
 
-		ce_aes_ccm_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
-				   walk.nbytes - tail, ctx->key_enc,
-				   num_rounds(ctx), mac, walk.iv);
+		if (unlikely(walk.total < AES_BLOCK_SIZE))
+			src = dst = memcpy(buf + sizeof(buf) - walk.total,
+					   src, walk.total);
+
+		ce_aes_ccm_encrypt(dst, src, walk.nbytes - tail,
+				   ctx->key_enc, num_rounds(ctx),
+				   mac, walk.iv);
+
+		if (unlikely(walk.total < AES_BLOCK_SIZE))
+			memcpy(walk.dst.virt.addr, dst, walk.total);
 
 		if (walk.nbytes == walk.total)
-			ce_aes_ccm_final(mac, buf, ctx->key_enc, num_rounds(ctx));
+			ce_aes_ccm_final(mac, orig_iv, ctx->key_enc, num_rounds(ctx));
 
 		if (walk.nbytes) {
 			err = skcipher_walk_done(&walk, tail);
@@ -206,7 +219,7 @@ static int ccm_decrypt(struct aead_request *req)
 	unsigned int authsize = crypto_aead_authsize(aead);
 	struct skcipher_walk walk;
 	u8 __aligned(8) mac[AES_BLOCK_SIZE];
-	u8 buf[AES_BLOCK_SIZE];
+	u8 orig_iv[AES_BLOCK_SIZE];
 	u32 len = req->cryptlen - authsize;
 	int err;
 
@@ -215,7 +228,7 @@ static int ccm_decrypt(struct aead_request *req)
 		return err;
 
 	/* preserve the original iv for the final round */
-	memcpy(buf, req->iv, AES_BLOCK_SIZE);
+	memcpy(orig_iv, req->iv, AES_BLOCK_SIZE);
 
 	err = skcipher_walk_aead_decrypt(&walk, req, false);
 	if (unlikely(err))
@@ -228,16 +241,26 @@ static int ccm_decrypt(struct aead_request *req)
 
 	do {
 		u32 tail = walk.nbytes % AES_BLOCK_SIZE;
+		const u8 *src = walk.src.virt.addr;
+		u8 *dst = walk.dst.virt.addr;
+		u8 buf[AES_BLOCK_SIZE];
 
 		if (walk.nbytes == walk.total)
 			tail = 0;
 
-		ce_aes_ccm_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
-				   walk.nbytes - tail, ctx->key_enc,
-				   num_rounds(ctx), mac, walk.iv);
+		if (unlikely(walk.total < AES_BLOCK_SIZE))
+			src = dst = memcpy(buf + sizeof(buf) - walk.total,
+					   src, walk.total);
+
+		ce_aes_ccm_decrypt(dst, src, walk.nbytes - tail,
+				   ctx->key_enc, num_rounds(ctx),
+				   mac, walk.iv);
+
+		if (unlikely(walk.total < AES_BLOCK_SIZE))
+			memcpy(walk.dst.virt.addr, dst, walk.total);
 
 		if (walk.nbytes == walk.total)
-			ce_aes_ccm_final(mac, buf, ctx->key_enc, num_rounds(ctx));
+			ce_aes_ccm_final(mac, orig_iv, ctx->key_enc, num_rounds(ctx));
 
 		if (walk.nbytes) {
 			err = skcipher_walk_done(&walk, tail);
@@ -250,11 +273,11 @@ static int ccm_decrypt(struct aead_request *req)
 		return err;
 
 	/* compare calculated auth tag with the stored one */
-	scatterwalk_map_and_copy(buf, req->src,
+	scatterwalk_map_and_copy(orig_iv, req->src,
 				 req->assoclen + req->cryptlen - authsize,
 				 authsize, 0);
 
-	if (crypto_memneq(mac, buf, authsize))
+	if (crypto_memneq(mac, orig_iv, authsize))
 		return -EBADMSG;
 	return 0;
 }
@@ -293,6 +316,6 @@ module_init(aes_mod_init);
 module_exit(aes_mod_exit);
 
 MODULE_DESCRIPTION("Synchronous AES in CCM mode using ARMv8 Crypto Extensions");
-MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
+MODULE_AUTHOR("Ard Biesheuvel <ardb@kernel.org>");
 MODULE_LICENSE("GPL v2");
 MODULE_ALIAS_CRYPTO("ccm(aes)");
-- 
2.43.0.275.g3460e3d667-goog



* [PATCH 4/8] crypto: arm64/aes-ccm - Replace bytewise tail handling with NEON permute
From: Ard Biesheuvel @ 2024-01-11 12:33 UTC (permalink / raw)
  To: linux-crypto; +Cc: ebiggers, herbert, Ard Biesheuvel

From: Ard Biesheuvel <ardb@kernel.org>

Implement the CCM tail handling using a single sequence that uses
permute vectors and overlapping loads and stores, rather than processing
the tail byte by byte in a loop using scalar operations. This is
more efficient, even though the measured speedup is only around 1-2% on
the CPUs I have tried.
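
In C-like pseudocode, the approach is roughly the following (schematic
only; load16(), store16() and permute() are hypothetical stand-ins for
the NEON loads/stores and TBL/TBX permutes used in the asm below):

    /* N-byte tail, 1 <= N < 16 */
    in  += N - 16;                 /* rewind so the 16-byte window overlaps */
    out += N - 16;                 /* the previously processed block        */
    blk  = permute(load16(in), N); /* one overlapping load + byte shuffle   */
    store16(out, blk);             /* one overlapping store                 */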

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/crypto/aes-ce-ccm-core.S | 59 +++++++++++++-------
 arch/arm64/crypto/aes-ce-ccm-glue.c | 20 +++----
 2 files changed, 48 insertions(+), 31 deletions(-)

diff --git a/arch/arm64/crypto/aes-ce-ccm-core.S b/arch/arm64/crypto/aes-ce-ccm-core.S
index b03f7f71f893..b21a9b759ab2 100644
--- a/arch/arm64/crypto/aes-ce-ccm-core.S
+++ b/arch/arm64/crypto/aes-ce-ccm-core.S
@@ -1,8 +1,11 @@
 /* SPDX-License-Identifier: GPL-2.0-only */
 /*
- * aesce-ccm-core.S - AES-CCM transform for ARMv8 with Crypto Extensions
+ * aes-ce-ccm-core.S - AES-CCM transform for ARMv8 with Crypto Extensions
  *
- * Copyright (C) 2013 - 2017 Linaro Ltd <ard.biesheuvel@linaro.org>
+ * Copyright (C) 2013 - 2017 Linaro Ltd.
+ * Copyright (C) 2024 Google LLC
+ *
+ * Author: Ard Biesheuvel <ardb@kernel.org>
  */
 
 #include <linux/linkage.h>
@@ -168,13 +171,13 @@ CPU_LE(	rev	x8, x8			)	/* keep swabbed ctr in reg */
 	ld1	{v2.16b}, [x1], #16		/* load next input block */
 	.if	\enc == 1
 	eor	v2.16b, v2.16b, v5.16b		/* final round enc+mac */
-	eor	v1.16b, v1.16b, v2.16b		/* xor with crypted ctr */
+	eor	v6.16b, v1.16b, v2.16b		/* xor with crypted ctr */
 	.else
 	eor	v2.16b, v2.16b, v1.16b		/* xor with crypted ctr */
-	eor	v1.16b, v2.16b, v5.16b		/* final round enc */
+	eor	v6.16b, v2.16b, v5.16b		/* final round enc */
 	.endif
 	eor	v0.16b, v0.16b, v2.16b		/* xor mac with pt ^ rk[last] */
-	st1	{v1.16b}, [x0], #16		/* write output block */
+	st1	{v6.16b}, [x0], #16		/* write output block */
 	bne	0b
 CPU_LE(	rev	x8, x8			)
 	st1	{v0.16b}, [x5]			/* store mac */
@@ -183,25 +186,31 @@ CPU_LE(	rev	x8, x8			)
 
 6:	eor	v0.16b, v0.16b, v5.16b		/* final round mac */
 	eor	v1.16b, v1.16b, v5.16b		/* final round enc */
-	st1	{v0.16b}, [x5]			/* store mac */
-	add	w2, w2, #16			/* process partial tail block */
-7:	ldrb	w9, [x1], #1			/* get 1 byte of input */
-	umov	w6, v1.b[0]			/* get top crypted ctr byte */
-	umov	w7, v0.b[0]			/* get top mac byte */
+
+	add	x1, x1, w2, sxtw		/* rewind the input pointer (w2 < 0) */
+	add	x0, x0, w2, sxtw		/* rewind the output pointer */
+
+	adr_l	x8, .Lpermute			/* load permute vectors */
+	add	x9, x8, w2, sxtw
+	sub	x8, x8, w2, sxtw
+	ld1	{v7.16b-v8.16b}, [x9]
+	ld1	{v9.16b}, [x8]
+
+	ld1	{v2.16b}, [x1]			/* load a full block of input */
+	tbl	v1.16b, {v1.16b}, v7.16b	/* move keystream to end of register */
 	.if	\enc == 1
-	eor	w7, w7, w9
-	eor	w9, w9, w6
+	tbl	v7.16b, {v2.16b}, v9.16b	/* copy plaintext to start of v7 */
+	eor	v2.16b, v2.16b, v1.16b		/* encrypt partial input block */
 	.else
-	eor	w9, w9, w6
-	eor	w7, w7, w9
+	eor	v2.16b, v2.16b, v1.16b		/* decrypt partial input block */
+	tbl	v7.16b, {v2.16b}, v9.16b	/* copy plaintext to start of v7 */
 	.endif
-	strb	w9, [x0], #1			/* store out byte */
-	strb	w7, [x5], #1			/* store mac byte */
-	subs	w2, w2, #1
-	beq	5b
-	ext	v0.16b, v0.16b, v0.16b, #1	/* shift out mac byte */
-	ext	v1.16b, v1.16b, v1.16b, #1	/* shift out ctr byte */
-	b	7b
+	eor	v0.16b, v0.16b, v7.16b		/* fold plaintext into mac */
+	tbx	v2.16b, {v6.16b}, v8.16b	/* insert output from previous iteration */
+
+	st1	{v0.16b}, [x5]			/* store mac */
+	st1	{v2.16b}, [x0]			/* store output block */
+	ret
 	.endm
 
 	/*
@@ -219,3 +228,11 @@ SYM_FUNC_END(ce_aes_ccm_encrypt)
 SYM_FUNC_START(ce_aes_ccm_decrypt)
 	aes_ccm_do_crypt	0
 SYM_FUNC_END(ce_aes_ccm_decrypt)
+
+	.section ".rodata", "a"
+	.align	6
+	.fill	15, 1, 0xff
+.Lpermute:
+	.byte	0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7
+	.byte	0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf
+	.fill	15, 1, 0xff
diff --git a/arch/arm64/crypto/aes-ce-ccm-glue.c b/arch/arm64/crypto/aes-ce-ccm-glue.c
index 2f4e6a318fcd..4710e59075f5 100644
--- a/arch/arm64/crypto/aes-ce-ccm-glue.c
+++ b/arch/arm64/crypto/aes-ce-ccm-glue.c
@@ -181,16 +181,16 @@ static int ccm_encrypt(struct aead_request *req)
 		if (walk.nbytes == walk.total)
 			tail = 0;
 
-		if (unlikely(walk.total < AES_BLOCK_SIZE))
-			src = dst = memcpy(buf + sizeof(buf) - walk.total,
-					   src, walk.total);
+		if (unlikely(walk.nbytes < AES_BLOCK_SIZE))
+			src = dst = memcpy(&buf[sizeof(buf) - walk.nbytes],
+					   src, walk.nbytes);
 
 		ce_aes_ccm_encrypt(dst, src, walk.nbytes - tail,
 				   ctx->key_enc, num_rounds(ctx),
 				   mac, walk.iv);
 
-		if (unlikely(walk.total < AES_BLOCK_SIZE))
-			memcpy(walk.dst.virt.addr, dst, walk.total);
+		if (unlikely(walk.nbytes < AES_BLOCK_SIZE))
+			memcpy(walk.dst.virt.addr, dst, walk.nbytes);
 
 		if (walk.nbytes == walk.total)
 			ce_aes_ccm_final(mac, orig_iv, ctx->key_enc, num_rounds(ctx));
@@ -248,16 +248,16 @@ static int ccm_decrypt(struct aead_request *req)
 		if (walk.nbytes == walk.total)
 			tail = 0;
 
-		if (unlikely(walk.total < AES_BLOCK_SIZE))
-			src = dst = memcpy(buf + sizeof(buf) - walk.total,
-					   src, walk.total);
+		if (unlikely(walk.nbytes < AES_BLOCK_SIZE))
+			src = dst = memcpy(&buf[sizeof(buf) - walk.nbytes],
+					   src, walk.nbytes);
 
 		ce_aes_ccm_decrypt(dst, src, walk.nbytes - tail,
 				   ctx->key_enc, num_rounds(ctx),
 				   mac, walk.iv);
 
-		if (unlikely(walk.total < AES_BLOCK_SIZE))
-			memcpy(walk.dst.virt.addr, dst, walk.total);
+		if (unlikely(walk.nbytes < AES_BLOCK_SIZE))
+			memcpy(walk.dst.virt.addr, dst, walk.nbytes);
 
 		if (walk.nbytes == walk.total)
 			ce_aes_ccm_final(mac, orig_iv, ctx->key_enc, num_rounds(ctx));
-- 
2.43.0.275.g3460e3d667-goog



* [PATCH 5/8] crypto: arm64/aes-ccm - Reuse existing MAC update for AAD input
From: Ard Biesheuvel @ 2024-01-11 12:33 UTC (permalink / raw)
  To: linux-crypto; +Cc: ebiggers, herbert, Ard Biesheuvel

From: Ard Biesheuvel <ardb@kernel.org>

CCM combines the counter (CTR) encryption mode with a MAC based on the
same block cipher. This MAC construction is a bit clunky: it invokes the
block cipher in a way that cannot be parallelized, resulting in poor CPU
pipeline efficiency.
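
The serial dependency looks roughly like this (schematic CBC-MAC
pseudocode, not the driver code; aes_encrypt() and xor_block() are
placeholders):

    /* block i+1 cannot start before block i has fully completed */
    for (i = 0; i < nblocks; i++)
            mac = aes_encrypt(key, xor_block(mac, msg[i]));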

The arm64 CCM code mitigates this by interleaving the encryption and MAC
at the AES round level, resulting in a substantial speedup. But this
approach does not apply to the additional authenticated data (AAD) which
is not encrypted.

This means the special asm routine dealing with the AAD is not any
better than the MAC update routine used by the arm64 AES block
encryption driver, so let's reuse that, and drop the special AES-CCM
version.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/crypto/Kconfig           |  1 +
 arch/arm64/crypto/aes-ce-ccm-core.S | 71 --------------------
 arch/arm64/crypto/aes-ce-ccm-glue.c | 49 +++++++++++---
 arch/arm64/crypto/aes-glue.c        |  1 +
 4 files changed, 43 insertions(+), 79 deletions(-)

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index eb7b423ba463..e7d9bd8e4709 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -268,6 +268,7 @@ config CRYPTO_AES_ARM64_CE_CCM
 	depends on ARM64 && KERNEL_MODE_NEON
 	select CRYPTO_ALGAPI
 	select CRYPTO_AES_ARM64_CE
+	select CRYPTO_AES_ARM64_CE_BLK
 	select CRYPTO_AEAD
 	select CRYPTO_LIB_AES
 	help
diff --git a/arch/arm64/crypto/aes-ce-ccm-core.S b/arch/arm64/crypto/aes-ce-ccm-core.S
index b21a9b759ab2..0132872bd780 100644
--- a/arch/arm64/crypto/aes-ce-ccm-core.S
+++ b/arch/arm64/crypto/aes-ce-ccm-core.S
@@ -14,77 +14,6 @@
 	.text
 	.arch	armv8-a+crypto
 
-	/*
-	 * u32 ce_aes_ccm_auth_data(u8 mac[], u8 const in[], u32 abytes,
-	 *			    u32 macp, u8 const rk[], u32 rounds);
-	 */
-SYM_FUNC_START(ce_aes_ccm_auth_data)
-	ld1	{v0.16b}, [x0]			/* load mac */
-	cbz	w3, 1f
-	sub	w3, w3, #16
-	eor	v1.16b, v1.16b, v1.16b
-0:	ldrb	w7, [x1], #1			/* get 1 byte of input */
-	subs	w2, w2, #1
-	add	w3, w3, #1
-	ins	v1.b[0], w7
-	ext	v1.16b, v1.16b, v1.16b, #1	/* rotate in the input bytes */
-	beq	8f				/* out of input? */
-	cbnz	w3, 0b
-	eor	v0.16b, v0.16b, v1.16b
-1:	ld1	{v3.4s}, [x4]			/* load first round key */
-	prfm	pldl1strm, [x1]
-	cmp	w5, #12				/* which key size? */
-	add	x6, x4, #16
-	sub	w7, w5, #2			/* modified # of rounds */
-	bmi	2f
-	bne	5f
-	mov	v5.16b, v3.16b
-	b	4f
-2:	mov	v4.16b, v3.16b
-	ld1	{v5.4s}, [x6], #16		/* load 2nd round key */
-3:	aese	v0.16b, v4.16b
-	aesmc	v0.16b, v0.16b
-4:	ld1	{v3.4s}, [x6], #16		/* load next round key */
-	aese	v0.16b, v5.16b
-	aesmc	v0.16b, v0.16b
-5:	ld1	{v4.4s}, [x6], #16		/* load next round key */
-	subs	w7, w7, #3
-	aese	v0.16b, v3.16b
-	aesmc	v0.16b, v0.16b
-	ld1	{v5.4s}, [x6], #16		/* load next round key */
-	bpl	3b
-	aese	v0.16b, v4.16b
-	subs	w2, w2, #16			/* last data? */
-	eor	v0.16b, v0.16b, v5.16b		/* final round */
-	bmi	6f
-	ld1	{v1.16b}, [x1], #16		/* load next input block */
-	eor	v0.16b, v0.16b, v1.16b		/* xor with mac */
-	bne	1b
-6:	st1	{v0.16b}, [x0]			/* store mac */
-	beq	10f
-	adds	w2, w2, #16
-	beq	10f
-	mov	w3, w2
-7:	ldrb	w7, [x1], #1
-	umov	w6, v0.b[0]
-	eor	w6, w6, w7
-	strb	w6, [x0], #1
-	subs	w2, w2, #1
-	beq	10f
-	ext	v0.16b, v0.16b, v0.16b, #1	/* rotate out the mac bytes */
-	b	7b
-8:	cbz	w3, 91f
-	mov	w7, w3
-	add	w3, w3, #16
-9:	ext	v1.16b, v1.16b, v1.16b, #1
-	adds	w7, w7, #1
-	bne	9b
-91:	eor	v0.16b, v0.16b, v1.16b
-	st1	{v0.16b}, [x0]
-10:	mov	w0, w3
-	ret
-SYM_FUNC_END(ce_aes_ccm_auth_data)
-
 	/*
 	 * void ce_aes_ccm_final(u8 mac[], u8 const ctr[], u8 const rk[],
 	 * 			 u32 rounds);
diff --git a/arch/arm64/crypto/aes-ce-ccm-glue.c b/arch/arm64/crypto/aes-ce-ccm-glue.c
index 4710e59075f5..ed3d79e05112 100644
--- a/arch/arm64/crypto/aes-ce-ccm-glue.c
+++ b/arch/arm64/crypto/aes-ce-ccm-glue.c
@@ -18,6 +18,8 @@
 
 #include "aes-ce-setkey.h"
 
+MODULE_IMPORT_NS(CRYPTO_INTERNAL);
+
 static int num_rounds(struct crypto_aes_ctx *ctx)
 {
 	/*
@@ -30,8 +32,9 @@ static int num_rounds(struct crypto_aes_ctx *ctx)
 	return 6 + ctx->key_length / 4;
 }
 
-asmlinkage u32 ce_aes_ccm_auth_data(u8 mac[], u8 const in[], u32 abytes,
-				    u32 macp, u32 const rk[], u32 rounds);
+asmlinkage u32 ce_aes_mac_update(u8 const in[], u32 const rk[], int rounds,
+				 int blocks, u8 dg[], int enc_before,
+				 int enc_after);
 
 asmlinkage void ce_aes_ccm_encrypt(u8 out[], u8 const in[], u32 cbytes,
 				   u32 const rk[], u32 rounds, u8 mac[],
@@ -97,6 +100,41 @@ static int ccm_init_mac(struct aead_request *req, u8 maciv[], u32 msglen)
 	return 0;
 }
 
+static u32 ce_aes_ccm_auth_data(u8 mac[], u8 const in[], u32 abytes,
+				u32 macp, u32 const rk[], u32 rounds)
+{
+	int enc_after = (macp + abytes) % AES_BLOCK_SIZE;
+
+	do {
+		u32 blocks = abytes / AES_BLOCK_SIZE;
+
+		if (macp == AES_BLOCK_SIZE || (!macp && blocks > 0)) {
+			u32 rem = ce_aes_mac_update(in, rk, rounds, blocks, mac,
+						    macp, enc_after);
+			u32 adv = (blocks - rem) * AES_BLOCK_SIZE;
+
+			macp = enc_after ? 0 : AES_BLOCK_SIZE;
+			in += adv;
+			abytes -= adv;
+
+			if (unlikely(rem)) {
+				kernel_neon_end();
+				kernel_neon_begin();
+				macp = 0;
+			}
+		} else {
+			u32 l = min(AES_BLOCK_SIZE - macp, abytes);
+
+			crypto_xor(&mac[macp], in, l);
+			in += l;
+			macp += l;
+			abytes -= l;
+		}
+	} while (abytes > 0);
+
+	return macp;
+}
+
 static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[])
 {
 	struct crypto_aead *aead = crypto_aead_reqtfm(req);
@@ -104,7 +142,7 @@ static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[])
 	struct __packed { __be16 l; __be32 h; u16 len; } ltag;
 	struct scatter_walk walk;
 	u32 len = req->assoclen;
-	u32 macp = 0;
+	u32 macp = AES_BLOCK_SIZE;
 
 	/* prepend the AAD with a length tag */
 	if (len < 0xff00) {
@@ -128,16 +166,11 @@ static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[])
 			scatterwalk_start(&walk, sg_next(walk.sg));
 			n = scatterwalk_clamp(&walk, len);
 		}
-		n = min_t(u32, n, SZ_4K); /* yield NEON at least every 4k */
 		p = scatterwalk_map(&walk);
 
 		macp = ce_aes_ccm_auth_data(mac, p, n, macp, ctx->key_enc,
 					    num_rounds(ctx));
 
-		if (len / SZ_4K > (len - n) / SZ_4K) {
-			kernel_neon_end();
-			kernel_neon_begin();
-		}
 		len -= n;
 
 		scatterwalk_unmap(p);
diff --git a/arch/arm64/crypto/aes-glue.c b/arch/arm64/crypto/aes-glue.c
index 162787c7aa86..a147e847a5a1 100644
--- a/arch/arm64/crypto/aes-glue.c
+++ b/arch/arm64/crypto/aes-glue.c
@@ -1048,6 +1048,7 @@ static int __init aes_init(void)
 
 #ifdef USE_V8_CRYPTO_EXTENSIONS
 module_cpu_feature_match(AES, aes_init);
+EXPORT_SYMBOL_NS(ce_aes_mac_update, CRYPTO_INTERNAL);
 #else
 module_init(aes_init);
 EXPORT_SYMBOL(neon_aes_ecb_encrypt);
-- 
2.43.0.275.g3460e3d667-goog



* [PATCH 6/8] crypto: arm64/aes-ccm - Cache round keys and unroll AES loops
From: Ard Biesheuvel @ 2024-01-11 12:33 UTC (permalink / raw)
  To: linux-crypto; +Cc: ebiggers, herbert, Ard Biesheuvel

From: Ard Biesheuvel <ardb@kernel.org>

The CCM code as originally written attempted to use as few NEON
registers as possible, to avoid having to eagerly preserve/restore the
entire NEON register file at every call to kernel_neon_begin/end. At
that time, this API took the number of NEON registers to preserve as a
parameter, and only preserved/restored that many registers.

Today, the NEON register file is restored lazily, and the old API is
long gone. This means we can use as many NEON registers as we can make
meaningful use of; for AES, it means we can keep all of the round keys
in registers rather than reloading each of them for every AES block
processed.

On Cortex-A53, this results in a speedup of more than 50% (from 4
cycles per byte to 2.6 cycles per byte).

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/crypto/aes-ce-ccm-core.S | 95 ++++++++------------
 1 file changed, 38 insertions(+), 57 deletions(-)

diff --git a/arch/arm64/crypto/aes-ce-ccm-core.S b/arch/arm64/crypto/aes-ce-ccm-core.S
index 0132872bd780..0ec59fc4ef3e 100644
--- a/arch/arm64/crypto/aes-ce-ccm-core.S
+++ b/arch/arm64/crypto/aes-ce-ccm-core.S
@@ -14,40 +14,46 @@
 	.text
 	.arch	armv8-a+crypto
 
+	.macro	load_round_keys, rk, nr, tmp
+	sub	w\tmp, \nr, #10
+	add	\tmp, \rk, w\tmp, sxtw #4
+	ld1	{v10.4s-v13.4s}, [\rk]
+	ld1	{v14.4s-v17.4s}, [\tmp], #64
+	ld1	{v18.4s-v21.4s}, [\tmp], #64
+	ld1	{v3.4s-v5.4s}, [\tmp]
+	.endm
+
+	.macro	dround, va, vb, vk
+	aese	\va\().16b, \vk\().16b
+	aesmc	\va\().16b, \va\().16b
+	aese	\vb\().16b, \vk\().16b
+	aesmc	\vb\().16b, \vb\().16b
+	.endm
+
+	.macro	aes_encrypt, va, vb, nr
+	tbz	\nr, #2, .L\@
+	dround	\va, \vb, v10
+	dround	\va, \vb, v11
+	tbz	\nr, #1, .L\@
+	dround	\va, \vb, v12
+	dround	\va, \vb, v13
+.L\@:	.irp	v, v14, v15, v16, v17, v18, v19, v20, v21, v3
+	dround	\va, \vb, \v
+	.endr
+	aese	\va\().16b, v4.16b
+	aese	\vb\().16b, v4.16b
+	.endm
+
 	/*
 	 * void ce_aes_ccm_final(u8 mac[], u8 const ctr[], u8 const rk[],
 	 * 			 u32 rounds);
 	 */
 SYM_FUNC_START(ce_aes_ccm_final)
-	ld1	{v3.4s}, [x2], #16		/* load first round key */
 	ld1	{v0.16b}, [x0]			/* load mac */
-	cmp	w3, #12				/* which key size? */
-	sub	w3, w3, #2			/* modified # of rounds */
 	ld1	{v1.16b}, [x1]			/* load 1st ctriv */
-	bmi	0f
-	bne	3f
-	mov	v5.16b, v3.16b
-	b	2f
-0:	mov	v4.16b, v3.16b
-1:	ld1	{v5.4s}, [x2], #16		/* load next round key */
-	aese	v0.16b, v4.16b
-	aesmc	v0.16b, v0.16b
-	aese	v1.16b, v4.16b
-	aesmc	v1.16b, v1.16b
-2:	ld1	{v3.4s}, [x2], #16		/* load next round key */
-	aese	v0.16b, v5.16b
-	aesmc	v0.16b, v0.16b
-	aese	v1.16b, v5.16b
-	aesmc	v1.16b, v1.16b
-3:	ld1	{v4.4s}, [x2], #16		/* load next round key */
-	subs	w3, w3, #3
-	aese	v0.16b, v3.16b
-	aesmc	v0.16b, v0.16b
-	aese	v1.16b, v3.16b
-	aesmc	v1.16b, v1.16b
-	bpl	1b
-	aese	v0.16b, v4.16b
-	aese	v1.16b, v4.16b
+
+	aes_encrypt	v0, v1, w3
+
 	/* final round key cancels out */
 	eor	v0.16b, v0.16b, v1.16b		/* en-/decrypt the mac */
 	st1	{v0.16b}, [x0]			/* store result */
@@ -55,6 +61,8 @@ SYM_FUNC_START(ce_aes_ccm_final)
 SYM_FUNC_END(ce_aes_ccm_final)
 
 	.macro	aes_ccm_do_crypt,enc
+	load_round_keys	x3, w4, x10
+
 	cbz	x2, 5f
 	ldr	x8, [x6, #8]			/* load lower ctr */
 	ld1	{v0.16b}, [x5]			/* load mac */
@@ -64,37 +72,10 @@ CPU_LE(	rev	x8, x8			)	/* keep swabbed ctr in reg */
 	prfm	pldl1strm, [x1]
 	add	x8, x8, #1
 	rev	x9, x8
-	cmp	w4, #12				/* which key size? */
-	sub	w7, w4, #2			/* get modified # of rounds */
 	ins	v1.d[1], x9			/* no carry in lower ctr */
-	ld1	{v3.4s}, [x3]			/* load first round key */
-	add	x10, x3, #16
-	bmi	1f
-	bne	4f
-	mov	v5.16b, v3.16b
-	b	3f
-1:	mov	v4.16b, v3.16b
-	ld1	{v5.4s}, [x10], #16		/* load 2nd round key */
-2:	/* inner loop: 3 rounds, 2x interleaved */
-	aese	v0.16b, v4.16b
-	aesmc	v0.16b, v0.16b
-	aese	v1.16b, v4.16b
-	aesmc	v1.16b, v1.16b
-3:	ld1	{v3.4s}, [x10], #16		/* load next round key */
-	aese	v0.16b, v5.16b
-	aesmc	v0.16b, v0.16b
-	aese	v1.16b, v5.16b
-	aesmc	v1.16b, v1.16b
-4:	ld1	{v4.4s}, [x10], #16		/* load next round key */
-	subs	w7, w7, #3
-	aese	v0.16b, v3.16b
-	aesmc	v0.16b, v0.16b
-	aese	v1.16b, v3.16b
-	aesmc	v1.16b, v1.16b
-	ld1	{v5.4s}, [x10], #16		/* load next round key */
-	bpl	2b
-	aese	v0.16b, v4.16b
-	aese	v1.16b, v4.16b
+
+	aes_encrypt	v0, v1, w4
+
 	subs	w2, w2, #16
 	bmi	6f				/* partial block? */
 	ld1	{v2.16b}, [x1], #16		/* load next input block */
-- 
2.43.0.275.g3460e3d667-goog



* [PATCH 7/8] crypto: arm64/aes-ccm - Merge encrypt and decrypt asm routines
From: Ard Biesheuvel @ 2024-01-11 12:33 UTC (permalink / raw)
  To: linux-crypto; +Cc: ebiggers, herbert, Ard Biesheuvel

From: Ard Biesheuvel <ardb@kernel.org>

The encryption and decryption code paths are mostly identical, except
for a small difference where the plaintext input into the MAC is taken
from either the input or the output block.

We can handle this difference quite easily using a vector bit select and a
few additional XORs, without the need for branches. This way, we can use the
same asm helper on the encrypt and decrypt code paths.
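
In C terms, the vector bit select amounts to the following (a sketch;
'mask' is all ones for encryption and all zeroes for decryption,
matching the movi values at the end of the patch):

    /* pick the plaintext from the input block on encrypt, and from the
       decrypted output block on decrypt, without branching */
    plaintext = (mask & input_block) | (~mask & output_block);
    mac ^= plaintext;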

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/crypto/aes-ce-ccm-core.S | 41 +++++++++-----------
 1 file changed, 18 insertions(+), 23 deletions(-)

diff --git a/arch/arm64/crypto/aes-ce-ccm-core.S b/arch/arm64/crypto/aes-ce-ccm-core.S
index 0ec59fc4ef3e..75be3157bae1 100644
--- a/arch/arm64/crypto/aes-ce-ccm-core.S
+++ b/arch/arm64/crypto/aes-ce-ccm-core.S
@@ -60,7 +60,7 @@ SYM_FUNC_START(ce_aes_ccm_final)
 	ret
 SYM_FUNC_END(ce_aes_ccm_final)
 
-	.macro	aes_ccm_do_crypt,enc
+SYM_FUNC_START_LOCAL(aes_ccm_do_crypt)
 	load_round_keys	x3, w4, x10
 
 	cbz	x2, 5f
@@ -76,28 +76,24 @@ CPU_LE(	rev	x8, x8			)	/* keep swabbed ctr in reg */
 
 	aes_encrypt	v0, v1, w4
 
+	eor	v0.16b, v0.16b, v5.16b		/* final round mac */
+	eor	v1.16b, v1.16b, v5.16b		/* final round enc */
 	subs	w2, w2, #16
 	bmi	6f				/* partial block? */
 	ld1	{v2.16b}, [x1], #16		/* load next input block */
-	.if	\enc == 1
-	eor	v2.16b, v2.16b, v5.16b		/* final round enc+mac */
-	eor	v6.16b, v1.16b, v2.16b		/* xor with crypted ctr */
-	.else
-	eor	v2.16b, v2.16b, v1.16b		/* xor with crypted ctr */
-	eor	v6.16b, v2.16b, v5.16b		/* final round enc */
-	.endif
-	eor	v0.16b, v0.16b, v2.16b		/* xor mac with pt ^ rk[last] */
+	eor	v6.16b, v2.16b, v1.16b		/* en/decrypt input block */
+	mov	v23.16b, v22.16b
+	bsl	v23.16b, v2.16b, v6.16b		/* select plaintext */
 	st1	{v6.16b}, [x0], #16		/* write output block */
+	eor	v0.16b, v0.16b, v23.16b		/* fold plaintext into mac */
+
 	bne	0b
 CPU_LE(	rev	x8, x8			)
 	st1	{v0.16b}, [x5]			/* store mac */
 	str	x8, [x6, #8]			/* store lsb end of ctr (BE) */
 5:	ret
 
-6:	eor	v0.16b, v0.16b, v5.16b		/* final round mac */
-	eor	v1.16b, v1.16b, v5.16b		/* final round enc */
-
-	add	x1, x1, w2, sxtw		/* rewind the input pointer (w2 < 0) */
+6:	add	x1, x1, w2, sxtw		/* rewind the input pointer (w2 < 0) */
 	add	x0, x0, w2, sxtw		/* rewind the output pointer */
 
 	adr_l	x8, .Lpermute			/* load permute vectors */
@@ -108,20 +104,17 @@ CPU_LE(	rev	x8, x8			)
 
 	ld1	{v2.16b}, [x1]			/* load a full block of input */
 	tbl	v1.16b, {v1.16b}, v7.16b	/* move keystream to end of register */
-	.if	\enc == 1
-	tbl	v7.16b, {v2.16b}, v9.16b	/* copy plaintext to start of v7 */
+	tbl	v7.16b, {v2.16b}, v9.16b	/* copy input block to start of v7 */
 	eor	v2.16b, v2.16b, v1.16b		/* encrypt partial input block */
-	.else
-	eor	v2.16b, v2.16b, v1.16b		/* decrypt partial input block */
-	tbl	v7.16b, {v2.16b}, v9.16b	/* copy plaintext to start of v7 */
-	.endif
-	eor	v0.16b, v0.16b, v7.16b		/* fold plaintext into mac */
+	tbl	v9.16b, {v2.16b}, v9.16b	/* copy output block to start of v9 */
+	bsl	v22.16b, v7.16b, v9.16b		/* select plaintext */
+	eor	v0.16b, v0.16b, v22.16b		/* fold plaintext into mac */
 	tbx	v2.16b, {v6.16b}, v8.16b	/* insert output from previous iteration */
 
 	st1	{v0.16b}, [x5]			/* store mac */
 	st1	{v2.16b}, [x0]			/* store output block */
 	ret
-	.endm
+SYM_FUNC_END(aes_ccm_do_crypt)
 
 	/*
 	 * void ce_aes_ccm_encrypt(u8 out[], u8 const in[], u32 cbytes,
@@ -132,11 +125,13 @@ CPU_LE(	rev	x8, x8			)
 	 * 			   u8 ctr[]);
 	 */
 SYM_FUNC_START(ce_aes_ccm_encrypt)
-	aes_ccm_do_crypt	1
+	movi	v22.16b, #255
+	b	aes_ccm_do_crypt
 SYM_FUNC_END(ce_aes_ccm_encrypt)
 
 SYM_FUNC_START(ce_aes_ccm_decrypt)
-	aes_ccm_do_crypt	0
+	movi	v22.16b, #0
+	b	aes_ccm_do_crypt
 SYM_FUNC_END(ce_aes_ccm_decrypt)
 
 	.section ".rodata", "a"
-- 
2.43.0.275.g3460e3d667-goog



* [PATCH 8/8] crypto: arm64/aes-ccm - Merge finalization into en/decrypt asm helper
From: Ard Biesheuvel @ 2024-01-11 12:33 UTC (permalink / raw)
  To: linux-crypto; +Cc: ebiggers, herbert, Ard Biesheuvel

From: Ard Biesheuvel <ardb@kernel.org>

The C glue code already infers whether the current iteration is the
final one by comparing walk.nbytes with walk.total. This means we
can easily inform the asm helper of this as well, by conditionally
passing a pointer to the original IV, which is used in the finalization
of the MAC. This removes the need for a separate call into the asm code
to perform the finalization.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/crypto/aes-ce-ccm-core.S | 32 ++++++++------------
 arch/arm64/crypto/aes-ce-ccm-glue.c | 27 ++++++++---------
 2 files changed, 24 insertions(+), 35 deletions(-)

diff --git a/arch/arm64/crypto/aes-ce-ccm-core.S b/arch/arm64/crypto/aes-ce-ccm-core.S
index 75be3157bae1..c0d89f8ae4c4 100644
--- a/arch/arm64/crypto/aes-ce-ccm-core.S
+++ b/arch/arm64/crypto/aes-ce-ccm-core.S
@@ -44,28 +44,12 @@
 	aese	\vb\().16b, v4.16b
 	.endm
 
-	/*
-	 * void ce_aes_ccm_final(u8 mac[], u8 const ctr[], u8 const rk[],
-	 * 			 u32 rounds);
-	 */
-SYM_FUNC_START(ce_aes_ccm_final)
-	ld1	{v0.16b}, [x0]			/* load mac */
-	ld1	{v1.16b}, [x1]			/* load 1st ctriv */
-
-	aes_encrypt	v0, v1, w3
-
-	/* final round key cancels out */
-	eor	v0.16b, v0.16b, v1.16b		/* en-/decrypt the mac */
-	st1	{v0.16b}, [x0]			/* store result */
-	ret
-SYM_FUNC_END(ce_aes_ccm_final)
-
 SYM_FUNC_START_LOCAL(aes_ccm_do_crypt)
 	load_round_keys	x3, w4, x10
 
+	ld1	{v0.16b}, [x5]			/* load mac */
 	cbz	x2, 5f
 	ldr	x8, [x6, #8]			/* load lower ctr */
-	ld1	{v0.16b}, [x5]			/* load mac */
 CPU_LE(	rev	x8, x8			)	/* keep swabbed ctr in reg */
 0:	/* outer loop */
 	ld1	{v1.8b}, [x6]			/* load upper ctr */
@@ -89,9 +73,9 @@ CPU_LE(	rev	x8, x8			)	/* keep swabbed ctr in reg */
 
 	bne	0b
 CPU_LE(	rev	x8, x8			)
-	st1	{v0.16b}, [x5]			/* store mac */
 	str	x8, [x6, #8]			/* store lsb end of ctr (BE) */
-5:	ret
+5:	cbz	x7, 8f
+	b	7f
 
 6:	add	x1, x1, w2, sxtw		/* rewind the input pointer (w2 < 0) */
 	add	x0, x0, w2, sxtw		/* rewind the output pointer */
@@ -111,8 +95,16 @@ CPU_LE(	rev	x8, x8			)
 	eor	v0.16b, v0.16b, v22.16b		/* fold plaintext into mac */
 	tbx	v2.16b, {v6.16b}, v8.16b	/* insert output from previous iteration */
 
-	st1	{v0.16b}, [x5]			/* store mac */
 	st1	{v2.16b}, [x0]			/* store output block */
+	cbz	x7, 8f				/* time to finalize MAC? */
+7:	ld1	{v1.16b}, [x7]			/* load 1st ctriv */
+
+	aes_encrypt	v0, v1, w4
+
+	/* final round key cancels out */
+	eor	v0.16b, v0.16b, v1.16b		/* en-/decrypt the mac */
+
+8:	st1	{v0.16b}, [x5]			/* store mac */
 	ret
 SYM_FUNC_END(aes_ccm_do_crypt)
 
diff --git a/arch/arm64/crypto/aes-ce-ccm-glue.c b/arch/arm64/crypto/aes-ce-ccm-glue.c
index ed3d79e05112..ce9b28e3c7d6 100644
--- a/arch/arm64/crypto/aes-ce-ccm-glue.c
+++ b/arch/arm64/crypto/aes-ce-ccm-glue.c
@@ -38,14 +38,11 @@ asmlinkage u32 ce_aes_mac_update(u8 const in[], u32 const rk[], int rounds,
 
 asmlinkage void ce_aes_ccm_encrypt(u8 out[], u8 const in[], u32 cbytes,
 				   u32 const rk[], u32 rounds, u8 mac[],
-				   u8 ctr[]);
+				   u8 ctr[], u8 const final_iv[]);
 
 asmlinkage void ce_aes_ccm_decrypt(u8 out[], u8 const in[], u32 cbytes,
 				   u32 const rk[], u32 rounds, u8 mac[],
-				   u8 ctr[]);
-
-asmlinkage void ce_aes_ccm_final(u8 mac[], u8 const ctr[], u32 const rk[],
-				 u32 rounds);
+				   u8 ctr[], u8 const final_iv[]);
 
 static int ccm_setkey(struct crypto_aead *tfm, const u8 *in_key,
 		      unsigned int key_len)
@@ -210,9 +207,12 @@ static int ccm_encrypt(struct aead_request *req)
 		const u8 *src = walk.src.virt.addr;
 		u8 *dst = walk.dst.virt.addr;
 		u8 buf[AES_BLOCK_SIZE];
+		u8 *final_iv = NULL;
 
-		if (walk.nbytes == walk.total)
+		if (walk.nbytes == walk.total) {
 			tail = 0;
+			final_iv = orig_iv;
+		}
 
 		if (unlikely(walk.nbytes < AES_BLOCK_SIZE))
 			src = dst = memcpy(&buf[sizeof(buf) - walk.nbytes],
@@ -220,14 +220,11 @@ static int ccm_encrypt(struct aead_request *req)
 
 		ce_aes_ccm_encrypt(dst, src, walk.nbytes - tail,
 				   ctx->key_enc, num_rounds(ctx),
-				   mac, walk.iv);
+				   mac, walk.iv, final_iv);
 
 		if (unlikely(walk.nbytes < AES_BLOCK_SIZE))
 			memcpy(walk.dst.virt.addr, dst, walk.nbytes);
 
-		if (walk.nbytes == walk.total)
-			ce_aes_ccm_final(mac, orig_iv, ctx->key_enc, num_rounds(ctx));
-
 		if (walk.nbytes) {
 			err = skcipher_walk_done(&walk, tail);
 		}
@@ -277,9 +274,12 @@ static int ccm_decrypt(struct aead_request *req)
 		const u8 *src = walk.src.virt.addr;
 		u8 *dst = walk.dst.virt.addr;
 		u8 buf[AES_BLOCK_SIZE];
+		u8 *final_iv = NULL;
 
-		if (walk.nbytes == walk.total)
+		if (walk.nbytes == walk.total) {
 			tail = 0;
+			final_iv = orig_iv;
+		}
 
 		if (unlikely(walk.nbytes < AES_BLOCK_SIZE))
 			src = dst = memcpy(&buf[sizeof(buf) - walk.nbytes],
@@ -287,14 +287,11 @@ static int ccm_decrypt(struct aead_request *req)
 
 		ce_aes_ccm_decrypt(dst, src, walk.nbytes - tail,
 				   ctx->key_enc, num_rounds(ctx),
-				   mac, walk.iv);
+				   mac, walk.iv, final_iv);
 
 		if (unlikely(walk.nbytes < AES_BLOCK_SIZE))
 			memcpy(walk.dst.virt.addr, dst, walk.nbytes);
 
-		if (walk.nbytes == walk.total)
-			ce_aes_ccm_final(mac, orig_iv, ctx->key_enc, num_rounds(ctx));
-
 		if (walk.nbytes) {
 			err = skcipher_walk_done(&walk, tail);
 		}
-- 
2.43.0.275.g3460e3d667-goog



* Re: [PATCH 4/8] crypto: arm64/aes-ccm - Replace bytewise tail handling with NEON permute
From: Ard Biesheuvel @ 2024-01-11 16:35 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: linux-crypto, ebiggers, herbert

On Thu, 11 Jan 2024 at 13:33, Ard Biesheuvel <ardb+git@google.com> wrote:
>
> From: Ard Biesheuvel <ardb@kernel.org>
>
> Implement the CCM tail handling using a single sequence that uses
> permute vectors and overlapping loads and stores, rather than going over
> the tail byte by byte in a loop, and using scalar operations. This is
> more efficient, even though the measured speedup is only around 1-2% on
> the CPUs I have tried.
>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  arch/arm64/crypto/aes-ce-ccm-core.S | 59 +++++++++++++-------
>  arch/arm64/crypto/aes-ce-ccm-glue.c | 20 +++----
>  2 files changed, 48 insertions(+), 31 deletions(-)
>
...

The hunks below don't belong here: they were supposed to be squashed
into the previous patch.

I will fix that up for the next revision.


> diff --git a/arch/arm64/crypto/aes-ce-ccm-glue.c b/arch/arm64/crypto/aes-ce-ccm-glue.c
> index 2f4e6a318fcd..4710e59075f5 100644
> --- a/arch/arm64/crypto/aes-ce-ccm-glue.c
> +++ b/arch/arm64/crypto/aes-ce-ccm-glue.c
> @@ -181,16 +181,16 @@ static int ccm_encrypt(struct aead_request *req)
>                 if (walk.nbytes == walk.total)
>                         tail = 0;
>
> -               if (unlikely(walk.total < AES_BLOCK_SIZE))
> -                       src = dst = memcpy(buf + sizeof(buf) - walk.total,
> -                                          src, walk.total);
> +               if (unlikely(walk.nbytes < AES_BLOCK_SIZE))
> +                       src = dst = memcpy(&buf[sizeof(buf) - walk.nbytes],
> +                                          src, walk.nbytes);
>
>                 ce_aes_ccm_encrypt(dst, src, walk.nbytes - tail,
>                                    ctx->key_enc, num_rounds(ctx),
>                                    mac, walk.iv);
>
> -               if (unlikely(walk.total < AES_BLOCK_SIZE))
> -                       memcpy(walk.dst.virt.addr, dst, walk.total);
> +               if (unlikely(walk.nbytes < AES_BLOCK_SIZE))
> +                       memcpy(walk.dst.virt.addr, dst, walk.nbytes);
>
>                 if (walk.nbytes == walk.total)
>                         ce_aes_ccm_final(mac, orig_iv, ctx->key_enc, num_rounds(ctx));
> @@ -248,16 +248,16 @@ static int ccm_decrypt(struct aead_request *req)
>                 if (walk.nbytes == walk.total)
>                         tail = 0;
>
> -               if (unlikely(walk.total < AES_BLOCK_SIZE))
> -                       src = dst = memcpy(buf + sizeof(buf) - walk.total,
> -                                          src, walk.total);
> +               if (unlikely(walk.nbytes < AES_BLOCK_SIZE))
> +                       src = dst = memcpy(&buf[sizeof(buf) - walk.nbytes],
> +                                          src, walk.nbytes);
>
>                 ce_aes_ccm_decrypt(dst, src, walk.nbytes - tail,
>                                    ctx->key_enc, num_rounds(ctx),
>                                    mac, walk.iv);
>
> -               if (unlikely(walk.total < AES_BLOCK_SIZE))
> -                       memcpy(walk.dst.virt.addr, dst, walk.total);
> +               if (unlikely(walk.nbytes < AES_BLOCK_SIZE))
> +                       memcpy(walk.dst.virt.addr, dst, walk.nbytes);
>
>                 if (walk.nbytes == walk.total)
>                         ce_aes_ccm_final(mac, orig_iv, ctx->key_enc, num_rounds(ctx));
> --
> 2.43.0.275.g3460e3d667-goog
>

