linux-crypto.vger.kernel.org archive mirror
From: "Jason A. Donenfeld" <Jason@zx2c4.com>
To: David Laight <David.Laight@ACULAB.COM>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	Peter Zijlstra <peterz@infradead.org>,
	Borislav Petkov <bp@alien8.de>,
	LKML <linux-kernel@vger.kernel.org>,
	"x86@kernel.org" <x86@kernel.org>,
	Filipe Manana <fdmanana@suse.com>,
	"linux-crypto@vger.kernel.org" <linux-crypto@vger.kernel.org>
Subject: Re: [patch 3/3] x86/fpu: Make FPU protection more robust
Date: Sat, 7 May 2022 00:34:18 +0200
Message-ID: <YnWiasChfzbEP67C@zx2c4.com>
In-Reply-To: <1f4918f734d14e3896071d3c7de1441d@AcuMS.aculab.com>

Hi David,

On Thu, May 05, 2022 at 11:34:40AM +0000, David Laight wrote:
> OTOH the entropy mixing is very likely to be 'cold cache'
> and all the unrolling in blakes7 will completely kill
> performance.

I've seen you mention the BLAKE2s unrolling in like 8 different threads
now, and I'm not convinced that you're entirely wrong, nor am I
convinced that you're entirely right. My response to you is the same as
always: please send a patch with some measurements! I'd love to get this
worked out in a real way.

The last time I benchmarked these, the unrolled code was, if I recall
correctly, ~100 cycles faster than the rolled code when used from
WireGuard's hot path. I don't doubt that a cold path would be more
fraught, though, since that's a decent amount of code. So the question is
how to re-roll the rounds without sacrificing those 100 cycles.

In order to begin to figure that out, we have to look at why the
re-rolled loop is slow and the unrolled loop fast. It's not because of
complicated pipeline things. It's because the BLAKE2s permutation is
actually 10 different permutations, one for each round. Take a look at
the core function, G, and its uses of the round number, r:

    #define G(r, i, a, b, c, d) do { \
        a += b + m[blake2s_sigma[r][2 * i + 0]]; \
        d = ror32(d ^ a, 16); \
        c += d; \
        b = ror32(b ^ c, 12); \
        a += b + m[blake2s_sigma[r][2 * i + 1]]; \
        d = ror32(d ^ a, 8); \
        c += d; \
        b = ror32(b ^ c, 7); \
    } while (0)

The blake2s_sigma array is a `static const u8 blake2s_sigma[10][16]`,
with a row for every one of the 10 rounds. What this is actually doing
is reading the message words in a different order each round, so that
the whole permutation is different.
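
For a concrete sense of it, here are the first two rows of that table
(values as in RFC 7693): round 0 reads the message words in their natural
order, while round 1 already reads them in a completely shuffled one:

    static const u8 blake2s_sigma[10][16] = {
        {  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 },
        { 14, 10,  4,  8,  9, 15, 13,  6,  1, 12,  0,  2, 11,  7,  5,  3 },
        /* ...and eight more rows, each a different permutation. */
    };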

When the loop is unrolled, blake2s_sigma gets inlined, and then there
are no memory accesses. When it's re-rolled, every round accesses
blake2s_sigma 16 times, which hinders performance.
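
For reference, the generic code is shaped roughly like this (a sketch from
memory, not the exact kernel source), with ROUND(r) expanding to eight G()
calls over the 16-word working state v[]:

    #define ROUND(r) do { \
        G(r, 0, v[0], v[ 4], v[ 8], v[12]); \
        G(r, 1, v[1], v[ 5], v[ 9], v[13]); \
        G(r, 2, v[2], v[ 6], v[10], v[14]); \
        G(r, 3, v[3], v[ 7], v[11], v[15]); \
        G(r, 4, v[0], v[ 5], v[10], v[15]); \
        G(r, 5, v[1], v[ 6], v[11], v[12]); \
        G(r, 6, v[2], v[ 7], v[ 8], v[13]); \
        G(r, 7, v[3], v[ 4], v[ 9], v[14]); \
    } while (0)

    /* Unrolled: r is a compile-time constant in every G(), so the
     * blake2s_sigma[r][...] lookups fold into immediate indices. */
    ROUND(0); ROUND(1); ROUND(2); ROUND(3); ROUND(4);
    ROUND(5); ROUND(6); ROUND(7); ROUND(8); ROUND(9);

    /* Re-rolled: r is a runtime variable, so every G() does a real load
     * from blake2s_sigma: 16 loads per round, 160 per block. */
    for (r = 0; r < 10; ++r)
        ROUND(r);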

You'll notice, on the other hand, that the hand-coded SIMD assembly
implementations do not unroll. The trick there is to hide the cost of the
blake2s_sigma indirection in the data dependencies, so that performance
isn't affected. Naively re-rolling the generic code does not inspire the
compiler to do that. But maybe you can figure something out?
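
One direction that might coax the compiler into doing that (purely an
untested sketch on my end) is to gather the round's message schedule into
locals up front, so those sixteen loads can overlap with the previous
round's adds and rotates instead of sitting inside each G() on the
critical path. The G_LOCAL() helper below is hypothetical; it would be G()
with m[blake2s_sigma[r][...]] replaced by s[...]:

    /* Untested sketch: pre-permute the message words for round r into a
     * local array, then run the eight G steps against those locals.
     * G_LOCAL() is a hypothetical variant of G() that reads
     * s[2 * i + 0] and s[2 * i + 1] directly. */
    for (r = 0; r < 10; ++r) {
        u32 s[16];
        int i;

        for (i = 0; i < 16; ++i)
            s[i] = m[blake2s_sigma[r][i]];

        G_LOCAL(0, v[0], v[ 4], v[ 8], v[12]);
        G_LOCAL(1, v[1], v[ 5], v[ 9], v[13]);
        G_LOCAL(2, v[2], v[ 6], v[10], v[14]);
        G_LOCAL(3, v[3], v[ 7], v[11], v[15]);
        G_LOCAL(4, v[0], v[ 5], v[10], v[15]);
        G_LOCAL(5, v[1], v[ 6], v[11], v[12]);
        G_LOCAL(6, v[2], v[ 7], v[ 8], v[13]);
        G_LOCAL(7, v[3], v[ 4], v[ 9], v[14]);
    }

Whether the compiler actually turns that into overlapping loads rather
than just more stack traffic is another question, of course.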

Anyway, that's about where my thinking is on this, but I'd love to see
some patches from you at some point if you're interested.

Jason

Thread overview: 17+ messages
     [not found] <20220501192740.203963477@linutronix.de>
     [not found] ` <20220501193102.704267030@linutronix.de>
     [not found]   ` <Ym/sHqKqmLOJubgE@zn.tnic>
     [not found]     ` <87k0b4lydr.ffs@tglx>
     [not found]       ` <YnDwjjdiSQ5Yml6E@hirez.programming.kicks-ass.net>
2022-05-04 15:36         ` [patch 3/3] x86/fpu: Make FPU protection more robust Thomas Gleixner
2022-05-04 15:55           ` Jason A. Donenfeld
2022-05-04 16:45             ` Thomas Gleixner
2022-05-04 19:05               ` Jason A. Donenfeld
2022-05-04 21:04                 ` Thomas Gleixner
2022-05-04 23:52                   ` Jason A. Donenfeld
2022-05-05  0:55                     ` Thomas Gleixner
2022-05-05  1:11                       ` Jason A. Donenfeld
2022-05-05  1:21                         ` Thomas Gleixner
2022-05-05 11:02                           ` Jason A. Donenfeld
2022-05-05 11:34                             ` David Laight
2022-05-05 11:35                               ` Jason A. Donenfeld
2022-05-05 11:53                                 ` David Laight
2022-05-06 22:34                               ` Jason A. Donenfeld [this message]
2022-05-07 13:50                                 ` David Laight
2022-05-05 13:48                             ` Jason A. Donenfeld
2022-05-06 22:15                 ` Jason A. Donenfeld
