Re: [PATCH v4 7/8] crypto: arm64/polyval: Add PMULL accelerated implementation of POLYVAL

From: Eric Biggers <ebiggers@kernel.org>
To: Nathan Huckleberry <nhuck@google.com>
Cc: linux-crypto@vger.kernel.org,
	Herbert Xu <herbert@gondor.apana.org.au>,
	"David S. Miller" <davem@davemloft.net>,
	linux-arm-kernel@lists.infradead.org,
	Paul Crowley <paulcrowley@google.com>,
	Sami Tolvanen <samitolvanen@google.com>,
	Ard Biesheuvel <ardb@kernel.org>
Subject: Re: [PATCH v4 7/8] crypto: arm64/polyval: Add PMULL accelerated implementation of POLYVAL
Date: Tue, 12 Apr 2022 22:53:34 -0700	[thread overview]
Message-ID: <YlZlXq96e96Bg6tL@sol.localdomain> (raw)
In-Reply-To: <20220412172816.917723-8-nhuck@google.com>

Many of the comments I made on the x86 version apply to this too, but a few
unique comments below:

On Tue, Apr 12, 2022 at 05:28:15PM +0000, Nathan Huckleberry wrote:
> Add hardware accelerated version of POLYVAL for ARM64 CPUs with
> Crypto Extension support.

As I've mentioned, it seems to be "Crypto Extensions", not "Crypto Extension".
That's what everything else in the kernel says, at least.  It's just a nit, but
it's unclear whether you intentionally decided to not change this, or missed it.

> diff --git a/arch/arm64/crypto/polyval-ce-core.S b/arch/arm64/crypto/polyval-ce-core.S
[..]
> +	.text
> +	.align	4
> +
> +	.arch	armv8-a+crypto
> +	.align	4

Having '.align' twice is redundant, right?

> +
> +.Lgstar:
> +	.quad	0xc200000000000000, 0xc200000000000000
> +
> +/*
> + * Computes the product of two 128-bit polynomials in X and Y and XORs the
> + * components of the 256-bit product into LO, MI, HI.
> + *
> + * X = [X_1 : X_0]
> + * Y = [Y_1 : Y_0]
> + *
> + * The multiplication produces four parts:
> + *   LOW: The polynomial given by performing carryless multiplication of X_0 and
> + *   Y_0
> + *   MID: The polynomial given by performing carryless multiplication of (X_0 +
> + *   X_1) and (Y_0 + Y_1)
> + *   HIGH: The polynomial given by performing carryless multiplication of X_1
> + *   and Y_1
> + *
> + * We compute:
> + *  LO += LOW
> + *  MI += MID
> + *  HI += HIGH
> + *
> + * Later, the 256-bit result can be extracted as:
> + *   [HI_1 : HI_0 + HI_1 + MI_1 + LO_1 :: LO_1 + HI_0 + MI_0 + LO_0 : LO_0]

What does the double colon mean?

> + * This step is done when computing the polynomial reduction for efficiency
> + * reasons.
> + */
> +.macro karatsuba1 X Y

A super brief explanation of why Karatsuba multiplication is used instead of
schoolbook multiplication would be helpful.

> +	X .req \X
> +	Y .req \Y
> +	ext	v25.16b, X.16b, X.16b, #8
> +	ext	v26.16b, Y.16b, Y.16b, #8
> +	eor	v25.16b, v25.16b, X.16b
> +	eor	v26.16b, v26.16b, Y.16b
> +	pmull	v27.1q, v25.1d, v26.1d
> +	pmull2	v28.1q, X.2d, Y.2d
> +	pmull	v29.1q, X.1d, Y.1d
> +	eor	MI.16b, MI.16b, v27.16b
> +	eor	HI.16b, HI.16b, v28.16b
> +	eor	LO.16b, LO.16b, v29.16b
> +	.unreq X
> +	.unreq Y
> +.endm

Maybe move the pmull into v27 and the eor into MI down a bit, since they have
more dependencies?  I.e.

	ext	v25.16b, X.16b, X.16b, #8
	ext	v26.16b, Y.16b, Y.16b, #8
	eor	v25.16b, v25.16b, X.16b
	eor	v26.16b, v26.16b, Y.16b
	pmull2	v28.1q, X.2d, Y.2d
	pmull	v29.1q, X.1d, Y.1d
	pmull	v27.1q, v25.1d, v26.1d
	eor	HI.16b, HI.16b, v28.16b
	eor	LO.16b, LO.16b, v29.16b
	eor	MI.16b, MI.16b, v27.16b

I don't know whether it will actually matter on arm64, but on lower-powered
arm32 CPUs this sort of thing definitely can matter.

> +.macro karatsuba1_store X Y
> +	X .req \X
> +	Y .req \Y
> +	ext	v25.16b, X.16b, X.16b, #8
> +	ext	v26.16b, Y.16b, Y.16b, #8
> +	eor	v25.16b, v25.16b, X.16b
> +	eor	v26.16b, v26.16b, Y.16b
> +	pmull	MI.1q, v25.1d, v26.1d
> +	pmull2	HI.1q, X.2d, Y.2d
> +	pmull	LO.1q, X.1d, Y.1d
> +	.unreq X
> +	.unreq Y
> +.endm

Likewise above.

> diff --git a/arch/arm64/crypto/polyval-ce-glue.c b/arch/arm64/crypto/polyval-ce-glue.c
[...]
> +struct polyval_async_ctx {
> +	struct cryptd_ahash *cryptd_tfm;
> +};

struct polyval_async_ctx is no longer used.

> +static int polyval_setkey(struct crypto_shash *tfm,
> +			const u8 *key, unsigned int keylen)
> +{
> +	struct polyval_ctx *ctx = crypto_shash_ctx(tfm);
> +	int i;
> +
> +	if (keylen != POLYVAL_BLOCK_SIZE)
> +		return -EINVAL;
> +
> +	BUILD_BUG_ON(sizeof(u128) != POLYVAL_BLOCK_SIZE);

Where does 'sizeof(u128)' come from?  This file doesn't use u128 at all.

- Eric

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel