[PATCH] poly1305: generic C can be faster on chips with slow unaligned access

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
@ 2016-11-02 17:58 Jason A. Donenfeld
  2016-11-02 20:09 ` Herbert Xu
  2016-11-07 19:12 ` [PATCH v2] " Jason A. Donenfeld
  0 siblings, 2 replies; 42+ messages in thread
From: Jason A. Donenfeld @ 2016-11-02 17:58 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, linux-crypto, linux-kernel,
	Martin Willi
  Cc: Jason A. Donenfeld

On MIPS chips commonly found in inexpensive routers, this makes a big
difference in performance.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 crypto/poly1305_generic.c | 29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/crypto/poly1305_generic.c b/crypto/poly1305_generic.c
index 2df9835d..186e33d 100644
--- a/crypto/poly1305_generic.c
+++ b/crypto/poly1305_generic.c
@@ -65,11 +65,24 @@ EXPORT_SYMBOL_GPL(crypto_poly1305_setkey);
 static void poly1305_setrkey(struct poly1305_desc_ctx *dctx, const u8 *key)
 {
 	/* r &= 0xffffffc0ffffffc0ffffffc0fffffff */
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
 	dctx->r[0] = (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
 	dctx->r[1] = (le32_to_cpuvp(key +  3) >> 2) & 0x3ffff03;
 	dctx->r[2] = (le32_to_cpuvp(key +  6) >> 4) & 0x3ffc0ff;
 	dctx->r[3] = (le32_to_cpuvp(key +  9) >> 6) & 0x3f03fff;
 	dctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;
+#else
+	u32 t0, t1, t2, t3;
+	t0 = le32_to_cpuvp(key +  0);
+	t1 = le32_to_cpuvp(key +  4);
+	t2 = le32_to_cpuvp(key +  8);
+	t3 = le32_to_cpuvp(key + 12);
+	dctx->r[0] = t0 & 0x3ffffff; t0 >>= 26; t0 |= t1 << 6;
+	dctx->r[1] = t0 & 0x3ffff03; t1 >>= 20; t1 |= t2 << 12;
+	dctx->r[2] = t1 & 0x3ffc0ff; t2 >>= 14; t2 |= t3 << 18;
+	dctx->r[3] = t2 & 0x3f03fff; t3 >>= 8;
+	dctx->r[4] = t3 & 0x00fffff;
+#endif
 }
 
 static void poly1305_setskey(struct poly1305_desc_ctx *dctx, const u8 *key)
@@ -109,6 +122,9 @@ static unsigned int poly1305_blocks(struct poly1305_desc_ctx *dctx,
 	u32 s1, s2, s3, s4;
 	u32 h0, h1, h2, h3, h4;
 	u64 d0, d1, d2, d3, d4;
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+	u32 t0, t1, t2, t3;
+#endif
 	unsigned int datalen;
 
 	if (unlikely(!dctx->sset)) {
@@ -135,13 +151,24 @@ static unsigned int poly1305_blocks(struct poly1305_desc_ctx *dctx,
 	h4 = dctx->h[4];
 
 	while (likely(srclen >= POLY1305_BLOCK_SIZE)) {
-
 		/* h += m[i] */
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
 		h0 += (le32_to_cpuvp(src +  0) >> 0) & 0x3ffffff;
 		h1 += (le32_to_cpuvp(src +  3) >> 2) & 0x3ffffff;
 		h2 += (le32_to_cpuvp(src +  6) >> 4) & 0x3ffffff;
 		h3 += (le32_to_cpuvp(src +  9) >> 6) & 0x3ffffff;
 		h4 += (le32_to_cpuvp(src + 12) >> 8) | hibit;
+#else
+		t0 = le32_to_cpuvp(src +  0);
+		t1 = le32_to_cpuvp(src +  4);
+		t2 = le32_to_cpuvp(src +  8);
+		t3 = le32_to_cpuvp(src + 12);
+		h0 += t0 & 0x3ffffff;
+		h1 += sr((((u64)t1 << 32) | t0), 26) & 0x3ffffff;
+		h2 += sr((((u64)t2 << 32) | t1), 20) & 0x3ffffff;
+		h3 += sr((((u64)t3 << 32) | t2), 14) & 0x3ffffff;
+		h4 += (t3 >> 8) | hibit;
+#endif
 
 		/* h *= r */
 		d0 = mlt(h0, r0) + mlt(h1, s4) + mlt(h2, s3) +
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-02 17:58 [PATCH] poly1305: generic C can be faster on chips with slow unaligned access Jason A. Donenfeld
@ 2016-11-02 20:09 ` Herbert Xu
  2016-11-02 20:47   ` Sandy Harris
  2016-11-02 21:06   ` Jason A. Donenfeld
  2016-11-07 19:12 ` [PATCH v2] " Jason A. Donenfeld
  1 sibling, 2 replies; 42+ messages in thread
From: Herbert Xu @ 2016-11-02 20:09 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: David S. Miller, linux-crypto, linux-kernel, Martin Willi

On Wed, Nov 02, 2016 at 06:58:10PM +0100, Jason A. Donenfeld wrote:
> On MIPS chips commonly found in inexpensive routers, this makes a big
> difference in performance.
> 
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

Can you give some numbers please? What about other architectures
that your patch impacts?

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-02 20:09 ` Herbert Xu
@ 2016-11-02 20:47   ` Sandy Harris
  2016-11-02 21:06   ` Jason A. Donenfeld
  1 sibling, 0 replies; 42+ messages in thread
From: Sandy Harris @ 2016-11-02 20:47 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Jason A. Donenfeld, David S. Miller, linux-crypto, LKML,
	Martin Willi

On Wed, Nov 2, 2016 at 4:09 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:

> On Wed, Nov 02, 2016 at 06:58:10PM +0100, Jason A. Donenfeld wrote:
>> On MIPS chips commonly found in inexpensive routers, this makes a big
>> difference in performance.
>>
>> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
>
> Can you give some numbers please? What about other architectures
> that your patch impacts?

In general it is not always clear that using whatever hardware crypto
is available is a good idea. Not all such hardware is fast, some CPUs
are, some CPUs have hardware for AES, and even if the hardware is
faster than the CPU, the context switch overheads may exceed the
advantage.

Ideally the patch development or acceptance process would be
testing this, but I think it might be difficult to reach that ideal.

The exception is a hardware RNG; that should always be used unless
it is clearly awful. It cannot do harm, speed is not much of an issue,
and it solves the hardest problem in the random(4) driver, making
sure of correct initialisation before any use.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-02 20:09 ` Herbert Xu
  2016-11-02 20:47   ` Sandy Harris
@ 2016-11-02 21:06   ` Jason A. Donenfeld
  2016-11-02 21:08     ` Herbert Xu
  1 sibling, 1 reply; 42+ messages in thread
From: Jason A. Donenfeld @ 2016-11-02 21:06 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David S. Miller, linux-crypto, LKML, Martin Willi

On Wed, Nov 2, 2016 at 9:09 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> Can you give some numbers please? What about other architectures
> that your patch impacts?

Per [1], the patch gives a 181% speed up on MIPS32r2.

[1] https://lists.zx2c4.com/pipermail/wireguard/2016-September/000398.html

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-02 21:06   ` Jason A. Donenfeld
@ 2016-11-02 21:08     ` Herbert Xu
  2016-11-02 21:25       ` Jason A. Donenfeld
  0 siblings, 1 reply; 42+ messages in thread
From: Herbert Xu @ 2016-11-02 21:08 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: David S. Miller, linux-crypto, LKML, Martin Willi

On Wed, Nov 02, 2016 at 10:06:39PM +0100, Jason A. Donenfeld wrote:
> On Wed, Nov 2, 2016 at 9:09 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> > Can you give some numbers please? What about other architectures
> > that your patch impacts?
> 
> Per [1], the patch gives a 181% speed up on MIPS32r2.
> 
> [1] https://lists.zx2c4.com/pipermail/wireguard/2016-September/000398.html

What about architectures? In particular, what if we just use your
new code for all architectures.  How much would we lose?

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-02 21:08     ` Herbert Xu
@ 2016-11-02 21:25       ` Jason A. Donenfeld
  2016-11-02 21:26         ` Herbert Xu
  0 siblings, 1 reply; 42+ messages in thread
From: Jason A. Donenfeld @ 2016-11-02 21:25 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David S. Miller, linux-crypto, LKML, Martin Willi

These architectures select HAVE_EFFICIENT_UNALIGNED_ACCESS:

s390 arm arm64 powerpc x86 x86_64

So, these will use the original old code.

The architectures that will thus use the new code are:

alpha arc avr32 blackfin c6x cris frv h7300 hexagon ia64 m32r m68k
metag microblaze mips mn10300 nios2 openrisc parisc score sh sparc
tile um unicore32 xtensa

Unfortunately, of these, the only machines I have access to are MIPS.
My SPARC access went cold a few years ago.

If you insist on a data-motivated approach approach, then I fear my
test of 1 out of 26 different architectures is woefully insufficient.
Does anybody else on the list have access to more hardware and is
interested in benchmarking?

If not, is there a reasonable way to decide on this by considering the
added complexity of code? Are we able to reason best and worst cases
of instruction latency vs unalignment stalls for most CPU designs?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-02 21:25       ` Jason A. Donenfeld
@ 2016-11-02 21:26         ` Herbert Xu
  2016-11-02 22:00           ` Jason A. Donenfeld
  0 siblings, 1 reply; 42+ messages in thread
From: Herbert Xu @ 2016-11-02 21:26 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: David S. Miller, linux-crypto, LKML, Martin Willi

On Wed, Nov 02, 2016 at 10:25:00PM +0100, Jason A. Donenfeld wrote:
> These architectures select HAVE_EFFICIENT_UNALIGNED_ACCESS:
> 
> s390 arm arm64 powerpc x86 x86_64
> 
> So, these will use the original old code.

What I'm interested in is whether the new code is sufficiently
close in performance to the old code, particularonly on x86.

I'd much rather only have a single set of code for all architectures.
After all, this is meant to be a generic implementation.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-02 21:26         ` Herbert Xu
@ 2016-11-02 22:00           ` Jason A. Donenfeld
  2016-11-03  0:49             ` Herbert Xu
  0 siblings, 1 reply; 42+ messages in thread
From: Jason A. Donenfeld @ 2016-11-02 22:00 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David S. Miller, linux-crypto, LKML, Martin Willi

On Wed, Nov 2, 2016 at 10:26 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> What I'm interested in is whether the new code is sufficiently
> close in performance to the old code, particularonly on x86.
>
> I'd much rather only have a single set of code for all architectures.
> After all, this is meant to be a generic implementation.

Just tested. I get a 6% slowdown on my Skylake. No good. I think it's
probably best to have the two paths in there, and not reduce it to
one.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-02 22:00           ` Jason A. Donenfeld
@ 2016-11-03  0:49             ` Herbert Xu
  2016-11-03  7:24               ` Jason A. Donenfeld
  0 siblings, 1 reply; 42+ messages in thread
From: Herbert Xu @ 2016-11-03  0:49 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: David S. Miller, linux-crypto, LKML, Martin Willi

On Wed, Nov 02, 2016 at 11:00:00PM +0100, Jason A. Donenfeld wrote:
>
> Just tested. I get a 6% slowdown on my Skylake. No good. I think it's
> probably best to have the two paths in there, and not reduce it to
> one.

FWIW I'd rather live with a 6% slowdown than having two different
code paths in the generic code.  Anyone who cares about 6% would
be much better off writing an assembly version of the code.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-03  0:49             ` Herbert Xu
@ 2016-11-03  7:24               ` Jason A. Donenfeld
  2016-11-03 17:08                 ` David Miller
  0 siblings, 1 reply; 42+ messages in thread
From: Jason A. Donenfeld @ 2016-11-03  7:24 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David S. Miller, linux-crypto, LKML, Martin Willi

Hi Herbert,

On Thu, Nov 3, 2016 at 1:49 AM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> FWIW I'd rather live with a 6% slowdown than having two different
> code paths in the generic code.  Anyone who cares about 6% would
> be much better off writing an assembly version of the code.

Please think twice before deciding that the generic C "is allowed to
be slow". It turns out to be used far more often than might be
obvious. For example, crypto is commonly done on the netdev layer --
like the case with mac80211-based drivers. At this layer, the FPU on
x86 isn't always available, depending on the path used. Some
combinations of drivers, packet family, and workload can result in the
generic C being used instead of the vectorized assembly for a massive
percentage of time. So, I think we do have a good motivation for
wanting the generic C to be as fast as possible.

In the particular case of poly1305, these are the only spots where
unaligned accesses take place, and they're rather small, and I think
it's pretty obvious what's happening in the two different cases of
code from a quick glance. This isn't the "two different paths case" in
which there's a significant future-facing maintenance burden.

Jason

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-03  7:24               ` Jason A. Donenfeld
@ 2016-11-03 17:08                 ` David Miller
  2016-11-03 22:20                     ` [WireGuard] " Jason A. Donenfeld
  0 siblings, 1 reply; 42+ messages in thread
From: David Miller @ 2016-11-03 17:08 UTC (permalink / raw)
  To: Jason; +Cc: herbert, linux-crypto, linux-kernel, martin

From: "Jason A. Donenfeld" <Jason@zx2c4.com>
Date: Thu, 3 Nov 2016 08:24:57 +0100

> Hi Herbert,
> 
> On Thu, Nov 3, 2016 at 1:49 AM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>> FWIW I'd rather live with a 6% slowdown than having two different
>> code paths in the generic code.  Anyone who cares about 6% would
>> be much better off writing an assembly version of the code.
> 
> Please think twice before deciding that the generic C "is allowed to
> be slow".

In any event no piece of code should be doing 32-bit word reads from
addresses like "x + 3" without, at a very minimum, going through the
kernel unaligned access handlers.

Yet that is what the generic C poly1305 code is doing, all over the
place.

We know explicitly that these offsets will not be 32-bit aligned, so
it is required that we use the helpers, or alternatively do things to
avoid these unaligned accesses such as using temporary storage when
the HAVE_EFFICIENT_UNALIGNED_ACCESS kconfig value is not set.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-03 17:08                 ` David Miller
  2016-11-03 22:20                     ` [WireGuard] " Jason A. Donenfeld
@ 2016-11-03 22:20                     ` Jason A. Donenfeld
  0 siblings, 0 replies; 42+ messages in thread
From: Jason A. Donenfeld @ 2016-11-03 22:20 UTC (permalink / raw)
  To: David Miller
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto,
	WireGuard mailing list

Hi David,

On Thu, Nov 3, 2016 at 6:08 PM, David Miller <davem@davemloft.net> wrote:
> In any event no piece of code should be doing 32-bit word reads from
> addresses like "x + 3" without, at a very minimum, going through the
> kernel unaligned access handlers.

Excellent point. In otherwords,

    ctx->r[0] = (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
    ctx->r[1] = (le32_to_cpuvp(key +  3) >> 2) & 0x3ffff03;
    ctx->r[2] = (le32_to_cpuvp(key +  6) >> 4) & 0x3ffc0ff;
    ctx->r[3] = (le32_to_cpuvp(key +  9) >> 6) & 0x3f03fff;
    ctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;

should change to:

    ctx->r[0] = (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
    ctx->r[1] = (get_unaligned_le32(key +  3) >> 2) & 0x3ffff03;
    ctx->r[2] = (get_unaligned_le32(key +  6) >> 4) & 0x3ffc0ff;
    ctx->r[3] = (get_unaligned_le32(key +  9) >> 6) & 0x3f03fff;
    ctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;

> We know explicitly that these offsets will not be 32-bit aligned, so
> it is required that we use the helpers, or alternatively do things to
> avoid these unaligned accesses such as using temporary storage when
> the HAVE_EFFICIENT_UNALIGNED_ACCESS kconfig value is not set.

So the question is: is the clever avoidance of unaligned accesses of
the original patch faster or slower than changing the unaligned
accesses to use the helper function?

I've put a little test harness together for playing with this:

    $ git clone git://git.zx2c4.com/polybench
    $ cd polybench
    $ make run

To test with one method, do as normal. To test with the other, remove
"#define USE_FIRST_METHOD" from the source code.

@René: do you think you could retest on your MIPS32r2 hardware and
report back which is faster?

And if anybody else has other hardware and would like to try, this
could be nice.

Regards,
Jason
_______________________________________________
WireGuard mailing list
WireGuard@lists.zx2c4.com
http://lists.zx2c4.com/mailman/listinfo/wireguard

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
@ 2016-11-03 22:20                     ` Jason A. Donenfeld
  0 siblings, 0 replies; 42+ messages in thread
From: Jason A. Donenfeld @ 2016-11-03 22:20 UTC (permalink / raw)
  To: David Miller
  Cc: Herbert Xu, linux-crypto, LKML, Martin Willi,
	WireGuard mailing list, René van Dorst

Hi David,

On Thu, Nov 3, 2016 at 6:08 PM, David Miller <davem@davemloft.net> wrote:
> In any event no piece of code should be doing 32-bit word reads from
> addresses like "x + 3" without, at a very minimum, going through the
> kernel unaligned access handlers.

Excellent point. In otherwords,

    ctx->r[0] = (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
    ctx->r[1] = (le32_to_cpuvp(key +  3) >> 2) & 0x3ffff03;
    ctx->r[2] = (le32_to_cpuvp(key +  6) >> 4) & 0x3ffc0ff;
    ctx->r[3] = (le32_to_cpuvp(key +  9) >> 6) & 0x3f03fff;
    ctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;

should change to:

    ctx->r[0] = (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
    ctx->r[1] = (get_unaligned_le32(key +  3) >> 2) & 0x3ffff03;
    ctx->r[2] = (get_unaligned_le32(key +  6) >> 4) & 0x3ffc0ff;
    ctx->r[3] = (get_unaligned_le32(key +  9) >> 6) & 0x3f03fff;
    ctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;

> We know explicitly that these offsets will not be 32-bit aligned, so
> it is required that we use the helpers, or alternatively do things to
> avoid these unaligned accesses such as using temporary storage when
> the HAVE_EFFICIENT_UNALIGNED_ACCESS kconfig value is not set.

So the question is: is the clever avoidance of unaligned accesses of
the original patch faster or slower than changing the unaligned
accesses to use the helper function?

I've put a little test harness together for playing with this:

    $ git clone git://git.zx2c4.com/polybench
    $ cd polybench
    $ make run

To test with one method, do as normal. To test with the other, remove
"#define USE_FIRST_METHOD" from the source code.

@René: do you think you could retest on your MIPS32r2 hardware and
report back which is faster?

And if anybody else has other hardware and would like to try, this
could be nice.

Regards,
Jason

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [WireGuard] [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
@ 2016-11-03 22:20                     ` Jason A. Donenfeld
  0 siblings, 0 replies; 42+ messages in thread
From: Jason A. Donenfeld @ 2016-11-03 22:20 UTC (permalink / raw)
  To: David Miller
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto,
	WireGuard mailing list

Hi David,

On Thu, Nov 3, 2016 at 6:08 PM, David Miller <davem@davemloft.net> wrote:
> In any event no piece of code should be doing 32-bit word reads from
> addresses like "x + 3" without, at a very minimum, going through the
> kernel unaligned access handlers.

Excellent point. In otherwords,

    ctx->r[0] =3D (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
    ctx->r[1] =3D (le32_to_cpuvp(key +  3) >> 2) & 0x3ffff03;
    ctx->r[2] =3D (le32_to_cpuvp(key +  6) >> 4) & 0x3ffc0ff;
    ctx->r[3] =3D (le32_to_cpuvp(key +  9) >> 6) & 0x3f03fff;
    ctx->r[4] =3D (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;

should change to:

    ctx->r[0] =3D (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
    ctx->r[1] =3D (get_unaligned_le32(key +  3) >> 2) & 0x3ffff03;
    ctx->r[2] =3D (get_unaligned_le32(key +  6) >> 4) & 0x3ffc0ff;
    ctx->r[3] =3D (get_unaligned_le32(key +  9) >> 6) & 0x3f03fff;
    ctx->r[4] =3D (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;

> We know explicitly that these offsets will not be 32-bit aligned, so
> it is required that we use the helpers, or alternatively do things to
> avoid these unaligned accesses such as using temporary storage when
> the HAVE_EFFICIENT_UNALIGNED_ACCESS kconfig value is not set.

So the question is: is the clever avoidance of unaligned accesses of
the original patch faster or slower than changing the unaligned
accesses to use the helper function?

I've put a little test harness together for playing with this:

    $ git clone git://git.zx2c4.com/polybench
    $ cd polybench
    $ make run

To test with one method, do as normal. To test with the other, remove
"#define USE_FIRST_METHOD" from the source code.

@Ren=C3=A9: do you think you could retest on your MIPS32r2 hardware and
report back which is faster?

And if anybody else has other hardware and would like to try, this
could be nice.

Regards,
Jason

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-03 22:20                     ` [WireGuard] " Jason A. Donenfeld
@ 2016-11-04 17:37                       ` Eric Biggers
  -1 siblings, 0 replies; 42+ messages in thread
From: Eric Biggers @ 2016-11-04 17:37 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: David Miller, Herbert Xu, linux-crypto, LKML, Martin Willi,
	WireGuard mailing list, René van Dorst

On Thu, Nov 03, 2016 at 11:20:08PM +0100, Jason A. Donenfeld wrote:
> Hi David,
> 
> On Thu, Nov 3, 2016 at 6:08 PM, David Miller <davem@davemloft.net> wrote:
> > In any event no piece of code should be doing 32-bit word reads from
> > addresses like "x + 3" without, at a very minimum, going through the
> > kernel unaligned access handlers.
> 
> Excellent point. In otherwords,
> 
>     ctx->r[0] = (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
>     ctx->r[1] = (le32_to_cpuvp(key +  3) >> 2) & 0x3ffff03;
>     ctx->r[2] = (le32_to_cpuvp(key +  6) >> 4) & 0x3ffc0ff;
>     ctx->r[3] = (le32_to_cpuvp(key +  9) >> 6) & 0x3f03fff;
>     ctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;
> 
> should change to:
> 
>     ctx->r[0] = (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
>     ctx->r[1] = (get_unaligned_le32(key +  3) >> 2) & 0x3ffff03;
>     ctx->r[2] = (get_unaligned_le32(key +  6) >> 4) & 0x3ffc0ff;
>     ctx->r[3] = (get_unaligned_le32(key +  9) >> 6) & 0x3f03fff;
>     ctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;
> 

I agree, and the current code is wrong; but do note that this proposal is
correct for poly1305_setrkey() but not for poly1305_setskey() and
poly1305_blocks().  In the latter two cases, 4-byte alignment of the source
buffer is *not* guaranteed.  Although crypto_poly1305_update() will be called
with a 4-byte aligned buffer due to the alignmask set on poly1305_alg, the
algorithm operates on 16-byte blocks and therefore has to buffer partial blocks.
If some number of bytes that is not 0 mod 4 is buffered, then the buffer will
fall out of alignment on the next update call.  Hence, get_unaligned_le32() is
actually needed on all the loads, since the buffer will, in general, be of
unknown alignment.

Note: some other shash algorithms have this problem too and do not handle it
correctly.  It seems to be a common mistake.

Eric

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [WireGuard] [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
@ 2016-11-04 17:37                       ` Eric Biggers
  0 siblings, 0 replies; 42+ messages in thread
From: Eric Biggers @ 2016-11-04 17:37 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto, David Miller,
	WireGuard mailing list

On Thu, Nov 03, 2016 at 11:20:08PM +0100, Jason A. Donenfeld wrote:
> Hi David,
> 
> On Thu, Nov 3, 2016 at 6:08 PM, David Miller <davem@davemloft.net> wrote:
> > In any event no piece of code should be doing 32-bit word reads from
> > addresses like "x + 3" without, at a very minimum, going through the
> > kernel unaligned access handlers.
> 
> Excellent point. In otherwords,
> 
>     ctx->r[0] = (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
>     ctx->r[1] = (le32_to_cpuvp(key +  3) >> 2) & 0x3ffff03;
>     ctx->r[2] = (le32_to_cpuvp(key +  6) >> 4) & 0x3ffc0ff;
>     ctx->r[3] = (le32_to_cpuvp(key +  9) >> 6) & 0x3f03fff;
>     ctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;
> 
> should change to:
> 
>     ctx->r[0] = (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
>     ctx->r[1] = (get_unaligned_le32(key +  3) >> 2) & 0x3ffff03;
>     ctx->r[2] = (get_unaligned_le32(key +  6) >> 4) & 0x3ffc0ff;
>     ctx->r[3] = (get_unaligned_le32(key +  9) >> 6) & 0x3f03fff;
>     ctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;
> 

I agree, and the current code is wrong; but do note that this proposal is
correct for poly1305_setrkey() but not for poly1305_setskey() and
poly1305_blocks().  In the latter two cases, 4-byte alignment of the source
buffer is *not* guaranteed.  Although crypto_poly1305_update() will be called
with a 4-byte aligned buffer due to the alignmask set on poly1305_alg, the
algorithm operates on 16-byte blocks and therefore has to buffer partial blocks.
If some number of bytes that is not 0 mod 4 is buffered, then the buffer will
fall out of alignment on the next update call.  Hence, get_unaligned_le32() is
actually needed on all the loads, since the buffer will, in general, be of
unknown alignment.

Note: some other shash algorithms have this problem too and do not handle it
correctly.  It seems to be a common mistake.

Eric

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-04 17:37                       ` [WireGuard] " Eric Biggers
  (?)
@ 2016-11-07 18:08                         ` Jason A. Donenfeld
  -1 siblings, 0 replies; 42+ messages in thread
From: Jason A. Donenfeld @ 2016-11-07 18:08 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto, David Miller,
	WireGuard mailing list

Hi Eric,

On Fri, Nov 4, 2016 at 6:37 PM, Eric Biggers <ebiggers@google.com> wrote:
> I agree, and the current code is wrong; but do note that this proposal is
> correct for poly1305_setrkey() but not for poly1305_setskey() and
> poly1305_blocks().  In the latter two cases, 4-byte alignment of the source
> buffer is *not* guaranteed.  Although crypto_poly1305_update() will be called
> with a 4-byte aligned buffer due to the alignmask set on poly1305_alg, the
> algorithm operates on 16-byte blocks and therefore has to buffer partial blocks.
> If some number of bytes that is not 0 mod 4 is buffered, then the buffer will
> fall out of alignment on the next update call.  Hence, get_unaligned_le32() is
> actually needed on all the loads, since the buffer will, in general, be of
> unknown alignment.

Hmm... The general data flow that strikes me as most pertinent is
something like:

struct sk_buff *skb = get_it_from_somewhere();
skb = skb_share_check(skb, GFP_ATOMIC);
num_frags = skb_cow_data(skb, ..., ...);
struct scatterlist sg[num_frags];
sg_init_table(sg, num_frags);
skb_to_sgvec(skb, sg, ..., ...);
blkcipher_walk_init(&walk, sg, sg, len);
blkcipher_walk_virt_block(&desc, &walk, BLOCK_SIZE);
while (walk.nbytes >= BLOCK_SIZE) {
    size_t chunk_len = rounddown(walk.nbytes, BLOCK_SIZE);
    poly1305_update(&poly1305_state, walk.src.virt.addr, chunk_len);
    blkcipher_walk_done(&desc, &walk, walk.nbytes % BLOCK_SIZE);
}
if (walk.nbytes) {
    poly1305_update(&poly1305_state, walk.src.virt.addr, walk.nbytes);
    blkcipher_walk_done(&desc, &walk, 0);
}

Is your suggestion that that in the final if block, walk.src.virt.addr
might be unaligned? Like in the case of the last fragment being 67
bytes long?

If so, what a hassle. I hope the performance overhead isn't too
awful... I'll resubmit taking into account your suggestions.

By the way -- offlist benchmarks sent to me concluded that using the
unaligned load helpers like David suggested is just as fast as that
handrolled bit magic in the v1.

Regards,
Jason

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
@ 2016-11-07 18:08                         ` Jason A. Donenfeld
  0 siblings, 0 replies; 42+ messages in thread
From: Jason A. Donenfeld @ 2016-11-07 18:08 UTC (permalink / raw)
  To: Eric Biggers
  Cc: David Miller, Herbert Xu, linux-crypto, LKML, Martin Willi,
	WireGuard mailing list, René van Dorst

Hi Eric,

On Fri, Nov 4, 2016 at 6:37 PM, Eric Biggers <ebiggers@google.com> wrote:
> I agree, and the current code is wrong; but do note that this proposal is
> correct for poly1305_setrkey() but not for poly1305_setskey() and
> poly1305_blocks().  In the latter two cases, 4-byte alignment of the source
> buffer is *not* guaranteed.  Although crypto_poly1305_update() will be called
> with a 4-byte aligned buffer due to the alignmask set on poly1305_alg, the
> algorithm operates on 16-byte blocks and therefore has to buffer partial blocks.
> If some number of bytes that is not 0 mod 4 is buffered, then the buffer will
> fall out of alignment on the next update call.  Hence, get_unaligned_le32() is
> actually needed on all the loads, since the buffer will, in general, be of
> unknown alignment.

Hmm... The general data flow that strikes me as most pertinent is
something like:

struct sk_buff *skb = get_it_from_somewhere();
skb = skb_share_check(skb, GFP_ATOMIC);
num_frags = skb_cow_data(skb, ..., ...);
struct scatterlist sg[num_frags];
sg_init_table(sg, num_frags);
skb_to_sgvec(skb, sg, ..., ...);
blkcipher_walk_init(&walk, sg, sg, len);
blkcipher_walk_virt_block(&desc, &walk, BLOCK_SIZE);
while (walk.nbytes >= BLOCK_SIZE) {
    size_t chunk_len = rounddown(walk.nbytes, BLOCK_SIZE);
    poly1305_update(&poly1305_state, walk.src.virt.addr, chunk_len);
    blkcipher_walk_done(&desc, &walk, walk.nbytes % BLOCK_SIZE);
}
if (walk.nbytes) {
    poly1305_update(&poly1305_state, walk.src.virt.addr, walk.nbytes);
    blkcipher_walk_done(&desc, &walk, 0);
}

Is your suggestion that that in the final if block, walk.src.virt.addr
might be unaligned? Like in the case of the last fragment being 67
bytes long?

If so, what a hassle. I hope the performance overhead isn't too
awful... I'll resubmit taking into account your suggestions.

By the way -- offlist benchmarks sent to me concluded that using the
unaligned load helpers like David suggested is just as fast as that
handrolled bit magic in the v1.

Regards,
Jason

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [WireGuard] [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
@ 2016-11-07 18:08                         ` Jason A. Donenfeld
  0 siblings, 0 replies; 42+ messages in thread
From: Jason A. Donenfeld @ 2016-11-07 18:08 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto, David Miller,
	WireGuard mailing list

Hi Eric,

On Fri, Nov 4, 2016 at 6:37 PM, Eric Biggers <ebiggers@google.com> wrote:
> I agree, and the current code is wrong; but do note that this proposal is
> correct for poly1305_setrkey() but not for poly1305_setskey() and
> poly1305_blocks().  In the latter two cases, 4-byte alignment of the source
> buffer is *not* guaranteed.  Although crypto_poly1305_update() will be called
> with a 4-byte aligned buffer due to the alignmask set on poly1305_alg, the
> algorithm operates on 16-byte blocks and therefore has to buffer partial blocks.
> If some number of bytes that is not 0 mod 4 is buffered, then the buffer will
> fall out of alignment on the next update call.  Hence, get_unaligned_le32() is
> actually needed on all the loads, since the buffer will, in general, be of
> unknown alignment.

Hmm... The general data flow that strikes me as most pertinent is
something like:

struct sk_buff *skb = get_it_from_somewhere();
skb = skb_share_check(skb, GFP_ATOMIC);
num_frags = skb_cow_data(skb, ..., ...);
struct scatterlist sg[num_frags];
sg_init_table(sg, num_frags);
skb_to_sgvec(skb, sg, ..., ...);
blkcipher_walk_init(&walk, sg, sg, len);
blkcipher_walk_virt_block(&desc, &walk, BLOCK_SIZE);
while (walk.nbytes >= BLOCK_SIZE) {
    size_t chunk_len = rounddown(walk.nbytes, BLOCK_SIZE);
    poly1305_update(&poly1305_state, walk.src.virt.addr, chunk_len);
    blkcipher_walk_done(&desc, &walk, walk.nbytes % BLOCK_SIZE);
}
if (walk.nbytes) {
    poly1305_update(&poly1305_state, walk.src.virt.addr, walk.nbytes);
    blkcipher_walk_done(&desc, &walk, 0);
}

Is your suggestion that that in the final if block, walk.src.virt.addr
might be unaligned? Like in the case of the last fragment being 67
bytes long?

If so, what a hassle. I hope the performance overhead isn't too
awful... I'll resubmit taking into account your suggestions.

By the way -- offlist benchmarks sent to me concluded that using the
unaligned load helpers like David suggested is just as fast as that
handrolled bit magic in the v1.

Regards,
Jason

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-07 18:08                         ` [WireGuard] " Jason A. Donenfeld
  (?)
@ 2016-11-07 18:23                           ` Jason A. Donenfeld
  -1 siblings, 0 replies; 42+ messages in thread
From: Jason A. Donenfeld @ 2016-11-07 18:23 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto, David Miller,
	WireGuard mailing list

On Mon, Nov 7, 2016 at 7:08 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Hmm... The general data flow that strikes me as most pertinent is
> something like:
>
> struct sk_buff *skb = get_it_from_somewhere();
> skb = skb_share_check(skb, GFP_ATOMIC);
> num_frags = skb_cow_data(skb, ..., ...);
> struct scatterlist sg[num_frags];
> sg_init_table(sg, num_frags);
> skb_to_sgvec(skb, sg, ..., ...);
> blkcipher_walk_init(&walk, sg, sg, len);
> blkcipher_walk_virt_block(&desc, &walk, BLOCK_SIZE);
> while (walk.nbytes >= BLOCK_SIZE) {
>     size_t chunk_len = rounddown(walk.nbytes, BLOCK_SIZE);
>     poly1305_update(&poly1305_state, walk.src.virt.addr, chunk_len);
>     blkcipher_walk_done(&desc, &walk, walk.nbytes % BLOCK_SIZE);
> }
> if (walk.nbytes) {
>     poly1305_update(&poly1305_state, walk.src.virt.addr, walk.nbytes);
>     blkcipher_walk_done(&desc, &walk, 0);
> }
>
> Is your suggestion that that in the final if block, walk.src.virt.addr
> might be unaligned? Like in the case of the last fragment being 67
> bytes long?

In fact, I'm not so sure this happens here. In the while loop, each
new walk.src.virt.addr will be aligned to BLOCK_SIZE or be aligned by
virtue of being at the start of a new page. In the subsequent if
block, walk.src.virt.addr will either be
some_aligned_address+BLOCK_SIZE, which will be aligned, or it will be
a start of a new page, which will be aligned.

So what did you have in mind exactly?

I don't think anybody is running code like:

for (size_t i = 0; i < len; i += 17)
    poly1305_update(&poly, &buffer[i], 17);

(And if so, those consumers should be fixed.)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
@ 2016-11-07 18:23                           ` Jason A. Donenfeld
  0 siblings, 0 replies; 42+ messages in thread
From: Jason A. Donenfeld @ 2016-11-07 18:23 UTC (permalink / raw)
  To: Eric Biggers
  Cc: David Miller, Herbert Xu, linux-crypto, LKML, Martin Willi,
	WireGuard mailing list, René van Dorst

On Mon, Nov 7, 2016 at 7:08 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Hmm... The general data flow that strikes me as most pertinent is
> something like:
>
> struct sk_buff *skb = get_it_from_somewhere();
> skb = skb_share_check(skb, GFP_ATOMIC);
> num_frags = skb_cow_data(skb, ..., ...);
> struct scatterlist sg[num_frags];
> sg_init_table(sg, num_frags);
> skb_to_sgvec(skb, sg, ..., ...);
> blkcipher_walk_init(&walk, sg, sg, len);
> blkcipher_walk_virt_block(&desc, &walk, BLOCK_SIZE);
> while (walk.nbytes >= BLOCK_SIZE) {
>     size_t chunk_len = rounddown(walk.nbytes, BLOCK_SIZE);
>     poly1305_update(&poly1305_state, walk.src.virt.addr, chunk_len);
>     blkcipher_walk_done(&desc, &walk, walk.nbytes % BLOCK_SIZE);
> }
> if (walk.nbytes) {
>     poly1305_update(&poly1305_state, walk.src.virt.addr, walk.nbytes);
>     blkcipher_walk_done(&desc, &walk, 0);
> }
>
> Is your suggestion that that in the final if block, walk.src.virt.addr
> might be unaligned? Like in the case of the last fragment being 67
> bytes long?

In fact, I'm not so sure this happens here. In the while loop, each
new walk.src.virt.addr will be aligned to BLOCK_SIZE or be aligned by
virtue of being at the start of a new page. In the subsequent if
block, walk.src.virt.addr will either be
some_aligned_address+BLOCK_SIZE, which will be aligned, or it will be
a start of a new page, which will be aligned.

So what did you have in mind exactly?

I don't think anybody is running code like:

for (size_t i = 0; i < len; i += 17)
    poly1305_update(&poly, &buffer[i], 17);

(And if so, those consumers should be fixed.)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [WireGuard] [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
@ 2016-11-07 18:23                           ` Jason A. Donenfeld
  0 siblings, 0 replies; 42+ messages in thread
From: Jason A. Donenfeld @ 2016-11-07 18:23 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto, David Miller,
	WireGuard mailing list

On Mon, Nov 7, 2016 at 7:08 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Hmm... The general data flow that strikes me as most pertinent is
> something like:
>
> struct sk_buff *skb = get_it_from_somewhere();
> skb = skb_share_check(skb, GFP_ATOMIC);
> num_frags = skb_cow_data(skb, ..., ...);
> struct scatterlist sg[num_frags];
> sg_init_table(sg, num_frags);
> skb_to_sgvec(skb, sg, ..., ...);
> blkcipher_walk_init(&walk, sg, sg, len);
> blkcipher_walk_virt_block(&desc, &walk, BLOCK_SIZE);
> while (walk.nbytes >= BLOCK_SIZE) {
>     size_t chunk_len = rounddown(walk.nbytes, BLOCK_SIZE);
>     poly1305_update(&poly1305_state, walk.src.virt.addr, chunk_len);
>     blkcipher_walk_done(&desc, &walk, walk.nbytes % BLOCK_SIZE);
> }
> if (walk.nbytes) {
>     poly1305_update(&poly1305_state, walk.src.virt.addr, walk.nbytes);
>     blkcipher_walk_done(&desc, &walk, 0);
> }
>
> Is your suggestion that that in the final if block, walk.src.virt.addr
> might be unaligned? Like in the case of the last fragment being 67
> bytes long?

In fact, I'm not so sure this happens here. In the while loop, each
new walk.src.virt.addr will be aligned to BLOCK_SIZE or be aligned by
virtue of being at the start of a new page. In the subsequent if
block, walk.src.virt.addr will either be
some_aligned_address+BLOCK_SIZE, which will be aligned, or it will be
a start of a new page, which will be aligned.

So what did you have in mind exactly?

I don't think anybody is running code like:

for (size_t i = 0; i < len; i += 17)
    poly1305_update(&poly, &buffer[i], 17);

(And if so, those consumers should be fixed.)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-07 18:08                         ` [WireGuard] " Jason A. Donenfeld
  (?)
@ 2016-11-07 18:26                           ` Eric Biggers
  -1 siblings, 0 replies; 42+ messages in thread
From: Eric Biggers @ 2016-11-07 18:26 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto, David Miller,
	WireGuard mailing list

On Mon, Nov 07, 2016 at 07:08:22PM +0100, Jason A. Donenfeld wrote:
> Hmm... The general data flow that strikes me as most pertinent is
> something like:
> 
> struct sk_buff *skb = get_it_from_somewhere();
> skb = skb_share_check(skb, GFP_ATOMIC);
> num_frags = skb_cow_data(skb, ..., ...);
> struct scatterlist sg[num_frags];
> sg_init_table(sg, num_frags);
> skb_to_sgvec(skb, sg, ..., ...);
> blkcipher_walk_init(&walk, sg, sg, len);
> blkcipher_walk_virt_block(&desc, &walk, BLOCK_SIZE);
> while (walk.nbytes >= BLOCK_SIZE) {
>     size_t chunk_len = rounddown(walk.nbytes, BLOCK_SIZE);
>     poly1305_update(&poly1305_state, walk.src.virt.addr, chunk_len);
>     blkcipher_walk_done(&desc, &walk, walk.nbytes % BLOCK_SIZE);
> }
> if (walk.nbytes) {
>     poly1305_update(&poly1305_state, walk.src.virt.addr, walk.nbytes);
>     blkcipher_walk_done(&desc, &walk, 0);
> }
> 
> Is your suggestion that that in the final if block, walk.src.virt.addr
> might be unaligned? Like in the case of the last fragment being 67
> bytes long?

I was not referring to any users in particular, only what users could do.  As an
example, if you did crypto_shash_update() with 32, 15, then 17 bytes, and the
underlying algorithm is poly1305-generic, the last block would end up
misaligned.  This doesn't appear possible with your pseudocode because it only
passes in multiples of the block size until the very end.  However I don't see
it claimed anywhere that shash API users have to do that.

Eric

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
@ 2016-11-07 18:26                           ` Eric Biggers
  0 siblings, 0 replies; 42+ messages in thread
From: Eric Biggers @ 2016-11-07 18:26 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: David Miller, Herbert Xu, linux-crypto, LKML, Martin Willi,
	WireGuard mailing list, René van Dorst

On Mon, Nov 07, 2016 at 07:08:22PM +0100, Jason A. Donenfeld wrote:
> Hmm... The general data flow that strikes me as most pertinent is
> something like:
> 
> struct sk_buff *skb = get_it_from_somewhere();
> skb = skb_share_check(skb, GFP_ATOMIC);
> num_frags = skb_cow_data(skb, ..., ...);
> struct scatterlist sg[num_frags];
> sg_init_table(sg, num_frags);
> skb_to_sgvec(skb, sg, ..., ...);
> blkcipher_walk_init(&walk, sg, sg, len);
> blkcipher_walk_virt_block(&desc, &walk, BLOCK_SIZE);
> while (walk.nbytes >= BLOCK_SIZE) {
>     size_t chunk_len = rounddown(walk.nbytes, BLOCK_SIZE);
>     poly1305_update(&poly1305_state, walk.src.virt.addr, chunk_len);
>     blkcipher_walk_done(&desc, &walk, walk.nbytes % BLOCK_SIZE);
> }
> if (walk.nbytes) {
>     poly1305_update(&poly1305_state, walk.src.virt.addr, walk.nbytes);
>     blkcipher_walk_done(&desc, &walk, 0);
> }
> 
> Is your suggestion that that in the final if block, walk.src.virt.addr
> might be unaligned? Like in the case of the last fragment being 67
> bytes long?

I was not referring to any users in particular, only what users could do.  As an
example, if you did crypto_shash_update() with 32, 15, then 17 bytes, and the
underlying algorithm is poly1305-generic, the last block would end up
misaligned.  This doesn't appear possible with your pseudocode because it only
passes in multiples of the block size until the very end.  However I don't see
it claimed anywhere that shash API users have to do that.

Eric

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [WireGuard] [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
@ 2016-11-07 18:26                           ` Eric Biggers
  0 siblings, 0 replies; 42+ messages in thread
From: Eric Biggers @ 2016-11-07 18:26 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto, David Miller,
	WireGuard mailing list

On Mon, Nov 07, 2016 at 07:08:22PM +0100, Jason A. Donenfeld wrote:
> Hmm... The general data flow that strikes me as most pertinent is
> something like:
> 
> struct sk_buff *skb = get_it_from_somewhere();
> skb = skb_share_check(skb, GFP_ATOMIC);
> num_frags = skb_cow_data(skb, ..., ...);
> struct scatterlist sg[num_frags];
> sg_init_table(sg, num_frags);
> skb_to_sgvec(skb, sg, ..., ...);
> blkcipher_walk_init(&walk, sg, sg, len);
> blkcipher_walk_virt_block(&desc, &walk, BLOCK_SIZE);
> while (walk.nbytes >= BLOCK_SIZE) {
>     size_t chunk_len = rounddown(walk.nbytes, BLOCK_SIZE);
>     poly1305_update(&poly1305_state, walk.src.virt.addr, chunk_len);
>     blkcipher_walk_done(&desc, &walk, walk.nbytes % BLOCK_SIZE);
> }
> if (walk.nbytes) {
>     poly1305_update(&poly1305_state, walk.src.virt.addr, walk.nbytes);
>     blkcipher_walk_done(&desc, &walk, 0);
> }
> 
> Is your suggestion that that in the final if block, walk.src.virt.addr
> might be unaligned? Like in the case of the last fragment being 67
> bytes long?

I was not referring to any users in particular, only what users could do.  As an
example, if you did crypto_shash_update() with 32, 15, then 17 bytes, and the
underlying algorithm is poly1305-generic, the last block would end up
misaligned.  This doesn't appear possible with your pseudocode because it only
passes in multiples of the block size until the very end.  However I don't see
it claimed anywhere that shash API users have to do that.

Eric

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-07 18:26                           ` [WireGuard] " Eric Biggers
  (?)
@ 2016-11-07 19:02                             ` Jason A. Donenfeld
  -1 siblings, 0 replies; 42+ messages in thread
From: Jason A. Donenfeld @ 2016-11-07 19:02 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto, David Miller,
	WireGuard mailing list

On Mon, Nov 7, 2016 at 7:26 PM, Eric Biggers <ebiggers@google.com> wrote:
>
> I was not referring to any users in particular, only what users could do.  As an
> example, if you did crypto_shash_update() with 32, 15, then 17 bytes, and the
> underlying algorithm is poly1305-generic, the last block would end up
> misaligned.  This doesn't appear possible with your pseudocode because it only
> passes in multiples of the block size until the very end.  However I don't see
> it claimed anywhere that shash API users have to do that.

Actually it appears that crypto/poly1305_generic.c already buffers
incoming blocks to a buffer that definitely looks aligned, to prevent
this condition!

I'll submit a v2 with only the inner unaligned operations changed.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
@ 2016-11-07 19:02                             ` Jason A. Donenfeld
  0 siblings, 0 replies; 42+ messages in thread
From: Jason A. Donenfeld @ 2016-11-07 19:02 UTC (permalink / raw)
  To: Eric Biggers
  Cc: David Miller, Herbert Xu, linux-crypto, LKML, Martin Willi,
	WireGuard mailing list, René van Dorst

On Mon, Nov 7, 2016 at 7:26 PM, Eric Biggers <ebiggers@google.com> wrote:
>
> I was not referring to any users in particular, only what users could do.  As an
> example, if you did crypto_shash_update() with 32, 15, then 17 bytes, and the
> underlying algorithm is poly1305-generic, the last block would end up
> misaligned.  This doesn't appear possible with your pseudocode because it only
> passes in multiples of the block size until the very end.  However I don't see
> it claimed anywhere that shash API users have to do that.

Actually it appears that crypto/poly1305_generic.c already buffers
incoming blocks to a buffer that definitely looks aligned, to prevent
this condition!

I'll submit a v2 with only the inner unaligned operations changed.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [WireGuard] [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
@ 2016-11-07 19:02                             ` Jason A. Donenfeld
  0 siblings, 0 replies; 42+ messages in thread
From: Jason A. Donenfeld @ 2016-11-07 19:02 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto, David Miller,
	WireGuard mailing list

On Mon, Nov 7, 2016 at 7:26 PM, Eric Biggers <ebiggers@google.com> wrote:
>
> I was not referring to any users in particular, only what users could do.  As an
> example, if you did crypto_shash_update() with 32, 15, then 17 bytes, and the
> underlying algorithm is poly1305-generic, the last block would end up
> misaligned.  This doesn't appear possible with your pseudocode because it only
> passes in multiples of the block size until the very end.  However I don't see
> it claimed anywhere that shash API users have to do that.

Actually it appears that crypto/poly1305_generic.c already buffers
incoming blocks to a buffer that definitely looks aligned, to prevent
this condition!

I'll submit a v2 with only the inner unaligned operations changed.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-07 19:02                             ` [WireGuard] " Jason A. Donenfeld
  (?)
@ 2016-11-07 19:25                               ` Eric Biggers
  -1 siblings, 0 replies; 42+ messages in thread
From: Eric Biggers @ 2016-11-07 19:25 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto, David Miller,
	WireGuard mailing list

On Mon, Nov 07, 2016 at 08:02:35PM +0100, Jason A. Donenfeld wrote:
> On Mon, Nov 7, 2016 at 7:26 PM, Eric Biggers <ebiggers@google.com> wrote:
> >
> > I was not referring to any users in particular, only what users could do.  As an
> > example, if you did crypto_shash_update() with 32, 15, then 17 bytes, and the
> > underlying algorithm is poly1305-generic, the last block would end up
> > misaligned.  This doesn't appear possible with your pseudocode because it only
> > passes in multiples of the block size until the very end.  However I don't see
> > it claimed anywhere that shash API users have to do that.
> 
> Actually it appears that crypto/poly1305_generic.c already buffers
> incoming blocks to a buffer that definitely looks aligned, to prevent
> this condition!
> 

No it does *not* buffer all incoming blocks, which is why the source pointer can
fall out of alignment.  Yes, I actually tested this.  In fact this situation is
even hit, in both possible places, in the self-tests.

Eric

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
@ 2016-11-07 19:25                               ` Eric Biggers
  0 siblings, 0 replies; 42+ messages in thread
From: Eric Biggers @ 2016-11-07 19:25 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: David Miller, Herbert Xu, linux-crypto, LKML, Martin Willi,
	WireGuard mailing list, René van Dorst

On Mon, Nov 07, 2016 at 08:02:35PM +0100, Jason A. Donenfeld wrote:
> On Mon, Nov 7, 2016 at 7:26 PM, Eric Biggers <ebiggers@google.com> wrote:
> >
> > I was not referring to any users in particular, only what users could do.  As an
> > example, if you did crypto_shash_update() with 32, 15, then 17 bytes, and the
> > underlying algorithm is poly1305-generic, the last block would end up
> > misaligned.  This doesn't appear possible with your pseudocode because it only
> > passes in multiples of the block size until the very end.  However I don't see
> > it claimed anywhere that shash API users have to do that.
> 
> Actually it appears that crypto/poly1305_generic.c already buffers
> incoming blocks to a buffer that definitely looks aligned, to prevent
> this condition!
> 

No it does *not* buffer all incoming blocks, which is why the source pointer can
fall out of alignment.  Yes, I actually tested this.  In fact this situation is
even hit, in both possible places, in the self-tests.

Eric

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [WireGuard] [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
@ 2016-11-07 19:25                               ` Eric Biggers
  0 siblings, 0 replies; 42+ messages in thread
From: Eric Biggers @ 2016-11-07 19:25 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto, David Miller,
	WireGuard mailing list

On Mon, Nov 07, 2016 at 08:02:35PM +0100, Jason A. Donenfeld wrote:
> On Mon, Nov 7, 2016 at 7:26 PM, Eric Biggers <ebiggers@google.com> wrote:
> >
> > I was not referring to any users in particular, only what users could do.  As an
> > example, if you did crypto_shash_update() with 32, 15, then 17 bytes, and the
> > underlying algorithm is poly1305-generic, the last block would end up
> > misaligned.  This doesn't appear possible with your pseudocode because it only
> > passes in multiples of the block size until the very end.  However I don't see
> > it claimed anywhere that shash API users have to do that.
> 
> Actually it appears that crypto/poly1305_generic.c already buffers
> incoming blocks to a buffer that definitely looks aligned, to prevent
> this condition!
> 

No it does *not* buffer all incoming blocks, which is why the source pointer can
fall out of alignment.  Yes, I actually tested this.  In fact this situation is
even hit, in both possible places, in the self-tests.

Eric

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-07 19:25                               ` [WireGuard] " Eric Biggers
  (?)
@ 2016-11-07 19:41                                 ` Jason A. Donenfeld
  -1 siblings, 0 replies; 42+ messages in thread
From: Jason A. Donenfeld @ 2016-11-07 19:41 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto, David Miller,
	WireGuard mailing list

On Mon, Nov 7, 2016 at 8:25 PM, Eric Biggers <ebiggers@google.com> wrote:
> No it does *not* buffer all incoming blocks, which is why the source pointer can
> fall out of alignment.  Yes, I actually tested this.  In fact this situation is
> even hit, in both possible places, in the self-tests.

Urgh! v3 coming right up...

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
@ 2016-11-07 19:41                                 ` Jason A. Donenfeld
  0 siblings, 0 replies; 42+ messages in thread
From: Jason A. Donenfeld @ 2016-11-07 19:41 UTC (permalink / raw)
  To: Eric Biggers
  Cc: David Miller, Herbert Xu, linux-crypto, LKML, Martin Willi,
	WireGuard mailing list, René van Dorst

On Mon, Nov 7, 2016 at 8:25 PM, Eric Biggers <ebiggers@google.com> wrote:
> No it does *not* buffer all incoming blocks, which is why the source pointer can
> fall out of alignment.  Yes, I actually tested this.  In fact this situation is
> even hit, in both possible places, in the self-tests.

Urgh! v3 coming right up...

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [WireGuard] [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
@ 2016-11-07 19:41                                 ` Jason A. Donenfeld
  0 siblings, 0 replies; 42+ messages in thread
From: Jason A. Donenfeld @ 2016-11-07 19:41 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto, David Miller,
	WireGuard mailing list

On Mon, Nov 7, 2016 at 8:25 PM, Eric Biggers <ebiggers@google.com> wrote:
> No it does *not* buffer all incoming blocks, which is why the source pointer can
> fall out of alignment.  Yes, I actually tested this.  In fact this situation is
> even hit, in both possible places, in the self-tests.

Urgh! v3 coming right up...

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v2] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-02 17:58 [PATCH] poly1305: generic C can be faster on chips with slow unaligned access Jason A. Donenfeld
  2016-11-02 20:09 ` Herbert Xu
@ 2016-11-07 19:12 ` Jason A. Donenfeld
  2016-11-07 19:43   ` [PATCH v3] " Jason A. Donenfeld
  2016-11-07 19:47   ` [PATCH v4] " Jason A. Donenfeld
  1 sibling, 2 replies; 42+ messages in thread
From: Jason A. Donenfeld @ 2016-11-07 19:12 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, linux-crypto, linux-kernel,
	Martin Willi, Eric Biggers, René van Dorst
  Cc: Jason A. Donenfeld

By using the unaligned access helpers, we drastically improve
performance on small MIPS routers that have to go through the exception
fix-up handler for these unaligned accesses.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 crypto/poly1305_generic.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/crypto/poly1305_generic.c b/crypto/poly1305_generic.c
index 2df9835d..0b86f4e 100644
--- a/crypto/poly1305_generic.c
+++ b/crypto/poly1305_generic.c
@@ -66,9 +66,9 @@ static void poly1305_setrkey(struct poly1305_desc_ctx *dctx, const u8 *key)
 {
 	/* r &= 0xffffffc0ffffffc0ffffffc0fffffff */
 	dctx->r[0] = (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
-	dctx->r[1] = (le32_to_cpuvp(key +  3) >> 2) & 0x3ffff03;
-	dctx->r[2] = (le32_to_cpuvp(key +  6) >> 4) & 0x3ffc0ff;
-	dctx->r[3] = (le32_to_cpuvp(key +  9) >> 6) & 0x3f03fff;
+	dctx->r[1] = (get_unaligned_le32(key +  3) >> 2) & 0x3ffff03;
+	dctx->r[2] = (get_unaligned_le32(key +  6) >> 4) & 0x3ffc0ff;
+	dctx->r[3] = (get_unaligned_le32(key +  9) >> 6) & 0x3f03fff;
 	dctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;
 }
 
@@ -138,9 +138,9 @@ static unsigned int poly1305_blocks(struct poly1305_desc_ctx *dctx,
 
 		/* h += m[i] */
 		h0 += (le32_to_cpuvp(src +  0) >> 0) & 0x3ffffff;
-		h1 += (le32_to_cpuvp(src +  3) >> 2) & 0x3ffffff;
-		h2 += (le32_to_cpuvp(src +  6) >> 4) & 0x3ffffff;
-		h3 += (le32_to_cpuvp(src +  9) >> 6) & 0x3ffffff;
+		h1 += (get_unaligned_le32(src +  3) >> 2) & 0x3ffffff;
+		h2 += (get_unaligned_le32(src +  6) >> 4) & 0x3ffffff;
+		h3 += (get_unaligned_le32(src +  9) >> 6) & 0x3ffffff;
 		h4 += (le32_to_cpuvp(src + 12) >> 8) | hibit;
 
 		/* h *= r */
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v3] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-07 19:12 ` [PATCH v2] " Jason A. Donenfeld
@ 2016-11-07 19:43   ` Jason A. Donenfeld
  2016-11-12 23:27     ` kbuild test robot
  2016-11-07 19:47   ` [PATCH v4] " Jason A. Donenfeld
  1 sibling, 1 reply; 42+ messages in thread
From: Jason A. Donenfeld @ 2016-11-07 19:43 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, linux-crypto, linux-kernel,
	Martin Willi, Eric Biggers, René van Dorst
  Cc: Jason A. Donenfeld

By using the unaligned access helpers, we drastically improve
performance on small MIPS routers that have to go through the exception
fix-up handler for these unaligned accesses.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 crypto/poly1305_generic.c | 33 ++++++++++++++-------------------
 1 file changed, 14 insertions(+), 19 deletions(-)

diff --git a/crypto/poly1305_generic.c b/crypto/poly1305_generic.c
index 2df9835d..7a77cfc 100644
--- a/crypto/poly1305_generic.c
+++ b/crypto/poly1305_generic.c
@@ -33,11 +33,6 @@ static inline u32 and(u32 v, u32 mask)
 	return v & mask;
 }
 
-static inline u32 le32_to_cpuvp(const void *p)
-{
-	return le32_to_cpup(p);
-}
-
 int crypto_poly1305_init(struct shash_desc *desc)
 {
 	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
@@ -65,19 +60,19 @@ EXPORT_SYMBOL_GPL(crypto_poly1305_setkey);
 static void poly1305_setrkey(struct poly1305_desc_ctx *dctx, const u8 *key)
 {
 	/* r &= 0xffffffc0ffffffc0ffffffc0fffffff */
-	dctx->r[0] = (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
-	dctx->r[1] = (le32_to_cpuvp(key +  3) >> 2) & 0x3ffff03;
-	dctx->r[2] = (le32_to_cpuvp(key +  6) >> 4) & 0x3ffc0ff;
-	dctx->r[3] = (le32_to_cpuvp(key +  9) >> 6) & 0x3f03fff;
-	dctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;
+	dctx->r[0] = (get_unaligned_le32(key +  0) >> 0) & 0x3ffffff;
+	dctx->r[1] = (get_unaligned_le32(key +  3) >> 2) & 0x3ffff03;
+	dctx->r[2] = (get_unaligned_le32(key +  6) >> 4) & 0x3ffc0ff;
+	dctx->r[3] = (get_unaligned_le32(key +  9) >> 6) & 0x3f03fff;
+	dctx->r[4] = (get_unaligned_le32(key + 12) >> 8) & 0x00fffff;
 }
 
 static void poly1305_setskey(struct poly1305_desc_ctx *dctx, const u8 *key)
 {
-	dctx->s[0] = le32_to_cpuvp(key +  0);
-	dctx->s[1] = le32_to_cpuvp(key +  4);
-	dctx->s[2] = le32_to_cpuvp(key +  8);
-	dctx->s[3] = le32_to_cpuvp(key + 12);
+	dctx->s[0] = get_unaligned_le32(key +  0);
+	dctx->s[1] = get_unaligned_le32(key +  4);
+	dctx->s[2] = get_unaligned_le32(key +  8);
+	dctx->s[3] = get_unaligned_le32(key + 12);
 }
 
 unsigned int crypto_poly1305_setdesckey(struct poly1305_desc_ctx *dctx,
@@ -137,11 +132,11 @@ static unsigned int poly1305_blocks(struct poly1305_desc_ctx *dctx,
 	while (likely(srclen >= POLY1305_BLOCK_SIZE)) {
 
 		/* h += m[i] */
-		h0 += (le32_to_cpuvp(src +  0) >> 0) & 0x3ffffff;
-		h1 += (le32_to_cpuvp(src +  3) >> 2) & 0x3ffffff;
-		h2 += (le32_to_cpuvp(src +  6) >> 4) & 0x3ffffff;
-		h3 += (le32_to_cpuvp(src +  9) >> 6) & 0x3ffffff;
-		h4 += (le32_to_cpuvp(src + 12) >> 8) | hibit;
+		h0 += (get_unaligned_le32(src +  0) >> 0) & 0x3ffffff;
+		h1 += (get_unaligned_le32(src +  3) >> 2) & 0x3ffffff;
+		h2 += (get_unaligned_le32(src +  6) >> 4) & 0x3ffffff;
+		h3 += (get_unaligned_le32(src +  9) >> 6) & 0x3ffffff;
+		h4 += (get_unaligned_le32(src + 12) >> 8) | hibit;
 
 		/* h *= r */
 		d0 = mlt(h0, r0) + mlt(h1, s4) + mlt(h2, s3) +
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v3] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-07 19:43   ` [PATCH v3] " Jason A. Donenfeld
@ 2016-11-12 23:27     ` kbuild test robot
  0 siblings, 0 replies; 42+ messages in thread
From: kbuild test robot @ 2016-11-12 23:27 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: kbuild-all, Herbert Xu, David S. Miller, linux-crypto,
	linux-kernel, Martin Willi, Eric Biggers, René van Dorst,
	Jason A. Donenfeld

[-- Attachment #1: Type: text/plain, Size: 1708 bytes --]

Hi Jason,

[auto build test ERROR on cryptodev/master]
[also build test ERROR on v4.9-rc4 next-20161111]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Jason-A-Donenfeld/poly1305-generic-C-can-be-faster-on-chips-with-slow-unaligned-access/20161108-053912
base:   https://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git master
config: openrisc-allmodconfig (attached as .config)
compiler: or32-linux-gcc (GCC) 4.5.1-or32-1.0rc1
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=openrisc 

All errors (new ones prefixed by >>):

   crypto/poly1305_generic.c: In function 'poly1305_setrkey':
>> crypto/poly1305_generic.c:63:2: error: implicit declaration of function 'get_unaligned_le32'

vim +/get_unaligned_le32 +63 crypto/poly1305_generic.c

    57	}
    58	EXPORT_SYMBOL_GPL(crypto_poly1305_setkey);
    59	
    60	static void poly1305_setrkey(struct poly1305_desc_ctx *dctx, const u8 *key)
    61	{
    62		/* r &= 0xffffffc0ffffffc0ffffffc0fffffff */
  > 63		dctx->r[0] = (get_unaligned_le32(key +  0) >> 0) & 0x3ffffff;
    64		dctx->r[1] = (get_unaligned_le32(key +  3) >> 2) & 0x3ffff03;
    65		dctx->r[2] = (get_unaligned_le32(key +  6) >> 4) & 0x3ffc0ff;
    66		dctx->r[3] = (get_unaligned_le32(key +  9) >> 6) & 0x3f03fff;

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 39242 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v4] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-07 19:12 ` [PATCH v2] " Jason A. Donenfeld
  2016-11-07 19:43   ` [PATCH v3] " Jason A. Donenfeld
@ 2016-11-07 19:47   ` Jason A. Donenfeld
  2016-11-07 20:40     ` Eric Biggers
                       ` (2 more replies)
  1 sibling, 3 replies; 42+ messages in thread
From: Jason A. Donenfeld @ 2016-11-07 19:47 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, linux-crypto, linux-kernel,
	Martin Willi, Eric Biggers, René van Dorst
  Cc: Jason A. Donenfeld

By using the unaligned access helpers, we drastically improve
performance on small MIPS routers that have to go through the exception
fix-up handler for these unaligned accesses.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 crypto/poly1305_generic.c | 34 +++++++++++++++-------------------
 1 file changed, 15 insertions(+), 19 deletions(-)

diff --git a/crypto/poly1305_generic.c b/crypto/poly1305_generic.c
index 2df9835d..b1c2d57 100644
--- a/crypto/poly1305_generic.c
+++ b/crypto/poly1305_generic.c
@@ -17,6 +17,7 @@
 #include <linux/crypto.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
+#include <asm/unaligned.h>
 
 static inline u64 mlt(u64 a, u64 b)
 {
@@ -33,11 +34,6 @@ static inline u32 and(u32 v, u32 mask)
 	return v & mask;
 }
 
-static inline u32 le32_to_cpuvp(const void *p)
-{
-	return le32_to_cpup(p);
-}
-
 int crypto_poly1305_init(struct shash_desc *desc)
 {
 	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
@@ -65,19 +61,19 @@ EXPORT_SYMBOL_GPL(crypto_poly1305_setkey);
 static void poly1305_setrkey(struct poly1305_desc_ctx *dctx, const u8 *key)
 {
 	/* r &= 0xffffffc0ffffffc0ffffffc0fffffff */
-	dctx->r[0] = (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
-	dctx->r[1] = (le32_to_cpuvp(key +  3) >> 2) & 0x3ffff03;
-	dctx->r[2] = (le32_to_cpuvp(key +  6) >> 4) & 0x3ffc0ff;
-	dctx->r[3] = (le32_to_cpuvp(key +  9) >> 6) & 0x3f03fff;
-	dctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;
+	dctx->r[0] = (get_unaligned_le32(key +  0) >> 0) & 0x3ffffff;
+	dctx->r[1] = (get_unaligned_le32(key +  3) >> 2) & 0x3ffff03;
+	dctx->r[2] = (get_unaligned_le32(key +  6) >> 4) & 0x3ffc0ff;
+	dctx->r[3] = (get_unaligned_le32(key +  9) >> 6) & 0x3f03fff;
+	dctx->r[4] = (get_unaligned_le32(key + 12) >> 8) & 0x00fffff;
 }
 
 static void poly1305_setskey(struct poly1305_desc_ctx *dctx, const u8 *key)
 {
-	dctx->s[0] = le32_to_cpuvp(key +  0);
-	dctx->s[1] = le32_to_cpuvp(key +  4);
-	dctx->s[2] = le32_to_cpuvp(key +  8);
-	dctx->s[3] = le32_to_cpuvp(key + 12);
+	dctx->s[0] = get_unaligned_le32(key +  0);
+	dctx->s[1] = get_unaligned_le32(key +  4);
+	dctx->s[2] = get_unaligned_le32(key +  8);
+	dctx->s[3] = get_unaligned_le32(key + 12);
 }
 
 unsigned int crypto_poly1305_setdesckey(struct poly1305_desc_ctx *dctx,
@@ -137,11 +133,11 @@ static unsigned int poly1305_blocks(struct poly1305_desc_ctx *dctx,
 	while (likely(srclen >= POLY1305_BLOCK_SIZE)) {
 
 		/* h += m[i] */
-		h0 += (le32_to_cpuvp(src +  0) >> 0) & 0x3ffffff;
-		h1 += (le32_to_cpuvp(src +  3) >> 2) & 0x3ffffff;
-		h2 += (le32_to_cpuvp(src +  6) >> 4) & 0x3ffffff;
-		h3 += (le32_to_cpuvp(src +  9) >> 6) & 0x3ffffff;
-		h4 += (le32_to_cpuvp(src + 12) >> 8) | hibit;
+		h0 += (get_unaligned_le32(src +  0) >> 0) & 0x3ffffff;
+		h1 += (get_unaligned_le32(src +  3) >> 2) & 0x3ffffff;
+		h2 += (get_unaligned_le32(src +  6) >> 4) & 0x3ffffff;
+		h3 += (get_unaligned_le32(src +  9) >> 6) & 0x3ffffff;
+		h4 += (get_unaligned_le32(src + 12) >> 8) | hibit;
 
 		/* h *= r */
 		d0 = mlt(h0, r0) + mlt(h1, s4) + mlt(h2, s3) +
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v4] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-07 19:47   ` [PATCH v4] " Jason A. Donenfeld
@ 2016-11-07 20:40     ` Eric Biggers
  2016-11-08  7:52     ` Martin Willi
  2016-11-13 11:29     ` Herbert Xu
  2 siblings, 0 replies; 42+ messages in thread
From: Eric Biggers @ 2016-11-07 20:40 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Herbert Xu, David S. Miller, linux-crypto, linux-kernel,
	Martin Willi, René van Dorst

On Mon, Nov 07, 2016 at 08:47:09PM +0100, Jason A. Donenfeld wrote:
> By using the unaligned access helpers, we drastically improve
> performance on small MIPS routers that have to go through the exception
> fix-up handler for these unaligned accesses.
> 
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---

Reviewed-by: Eric Biggers <ebiggers@google.com>

Nit: the subject line is a little unclear about what was changed.
"make generic C faster on chips with slow unaligned access" would be better.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-07 19:47   ` [PATCH v4] " Jason A. Donenfeld
  2016-11-07 20:40     ` Eric Biggers
@ 2016-11-08  7:52     ` Martin Willi
  2016-11-08 17:26       ` Eric Biggers
  2016-11-13 11:29     ` Herbert Xu
  2 siblings, 1 reply; 42+ messages in thread
From: Martin Willi @ 2016-11-08  7:52 UTC (permalink / raw)
  To: Jason A. Donenfeld, Herbert Xu, David S. Miller, linux-crypto,
	linux-kernel, Eric Biggers, René van Dorst


> By using the unaligned access helpers, we drastically improve
> performance on small MIPS routers that have to go through the
> exception fix-up handler for these unaligned accesses.

I couldn't measure any slowdown here, so:

Acked-by: Martin Willi <martin@strongswan.org>

> -       dctx->s[0] = le32_to_cpuvp(key +  0);
> +       dctx->s[0] = get_unaligned_le32(key +  0);

Not sure what the exact alignment rules for key/iv are, but maybe we
want to replace the same function in chacha20_generic.c as well?

Martin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-08  7:52     ` Martin Willi
@ 2016-11-08 17:26       ` Eric Biggers
  0 siblings, 0 replies; 42+ messages in thread
From: Eric Biggers @ 2016-11-08 17:26 UTC (permalink / raw)
  To: Martin Willi
  Cc: Jason A. Donenfeld, Herbert Xu, David S. Miller, linux-crypto,
	linux-kernel, René van Dorst

On Tue, Nov 08, 2016 at 08:52:39AM +0100, Martin Willi wrote:
> 
> 
> Not sure what the exact alignment rules for key/iv are, but maybe we
> want to replace the same function in chacha20_generic.c as well?
> 
> Martin

chacha20-generic provides a blkcipher API and sets an alignmask of sizeof(u32)
- 1.  This applies to the key buffer for ->setkey() and to the mapped addresses
for the input/output buffers and IV during the blkcipher walk.  So it does not
appear to have the problems that poly1305 has.  

I do however see one small problem which is that
'u8 stream[CHACHA20_BLOCK_SIZE];' is, strictly speaking, not guaranteed to be
aligned to u32.

Eric

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4] poly1305: generic C can be faster on chips with slow unaligned access
  2016-11-07 19:47   ` [PATCH v4] " Jason A. Donenfeld
  2016-11-07 20:40     ` Eric Biggers
  2016-11-08  7:52     ` Martin Willi
@ 2016-11-13 11:29     ` Herbert Xu
  2 siblings, 0 replies; 42+ messages in thread
From: Herbert Xu @ 2016-11-13 11:29 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: David S. Miller, linux-crypto, linux-kernel, Martin Willi,
	Eric Biggers, René van Dorst

On Mon, Nov 07, 2016 at 08:47:09PM +0100, Jason A. Donenfeld wrote:
> By using the unaligned access helpers, we drastically improve
> performance on small MIPS routers that have to go through the exception
> fix-up handler for these unaligned accesses.
> 
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

Patch applied.  Thanks.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2016-11-13 11:29 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-11-02 17:58 [PATCH] poly1305: generic C can be faster on chips with slow unaligned access Jason A. Donenfeld
2016-11-02 20:09 ` Herbert Xu
2016-11-02 20:47   ` Sandy Harris
2016-11-02 21:06   ` Jason A. Donenfeld
2016-11-02 21:08     ` Herbert Xu
2016-11-02 21:25       ` Jason A. Donenfeld
2016-11-02 21:26         ` Herbert Xu
2016-11-02 22:00           ` Jason A. Donenfeld
2016-11-03  0:49             ` Herbert Xu
2016-11-03  7:24               ` Jason A. Donenfeld
2016-11-03 17:08                 ` David Miller
2016-11-03 22:20                   ` Jason A. Donenfeld
2016-11-03 22:20                     ` Jason A. Donenfeld
2016-11-03 22:20                     ` [WireGuard] " Jason A. Donenfeld
2016-11-04 17:37                     ` Eric Biggers
2016-11-04 17:37                       ` [WireGuard] " Eric Biggers
2016-11-07 18:08                       ` Jason A. Donenfeld
2016-11-07 18:08                         ` Jason A. Donenfeld
2016-11-07 18:08                         ` [WireGuard] " Jason A. Donenfeld
2016-11-07 18:23                         ` Jason A. Donenfeld
2016-11-07 18:23                           ` Jason A. Donenfeld
2016-11-07 18:23                           ` [WireGuard] " Jason A. Donenfeld
2016-11-07 18:26                         ` Eric Biggers
2016-11-07 18:26                           ` Eric Biggers
2016-11-07 18:26                           ` [WireGuard] " Eric Biggers
2016-11-07 19:02                           ` Jason A. Donenfeld
2016-11-07 19:02                             ` Jason A. Donenfeld
2016-11-07 19:02                             ` [WireGuard] " Jason A. Donenfeld
2016-11-07 19:25                             ` Eric Biggers
2016-11-07 19:25                               ` Eric Biggers
2016-11-07 19:25                               ` [WireGuard] " Eric Biggers
2016-11-07 19:41                               ` Jason A. Donenfeld
2016-11-07 19:41                                 ` Jason A. Donenfeld
2016-11-07 19:41                                 ` [WireGuard] " Jason A. Donenfeld
2016-11-07 19:12 ` [PATCH v2] " Jason A. Donenfeld
2016-11-07 19:43   ` [PATCH v3] " Jason A. Donenfeld
2016-11-12 23:27     ` kbuild test robot
2016-11-07 19:47   ` [PATCH v4] " Jason A. Donenfeld
2016-11-07 20:40     ` Eric Biggers
2016-11-08  7:52     ` Martin Willi
2016-11-08 17:26       ` Eric Biggers
2016-11-13 11:29     ` Herbert Xu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.