public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Ingo Molnar <mingo@kernel.org>
To: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Eric Biggers <ebiggers3@gmail.com>,
	x86@kernel.org, linux-kernel@vger.kernel.org,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Mathias Krause <minipli@googlemail.com>,
	Chandramouli Narayanan <mouli@linux.intel.com>,
	Jussi Kivilinna <jussi.kivilinna@iki.fi>,
	Peter Zijlstra <peterz@infradead.org>,
	Herbert Xu <herbert@gondor.apana.org.au>,
	"David S. Miller" <davem@davemloft.net>,
	linux-crypto@vger.kernel.org, Eric Biggers <ebiggers@google.com>,
	Andy Lutomirski <luto@kernel.org>, Jiri Slaby <jslaby@suse.cz>
Subject: Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
Date: Thu, 14 Sep 2017 11:16:12 +0200	[thread overview]
Message-ID: <20170914091612.ck33coyubzevru2i@gmail.com> (raw)
In-Reply-To: <20170913212428.kibwbqs2f7dkeslb@treble>


* Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> I'm still looking at the other one (sha512-avx2), but so far I haven't
> found a way to speed it back up.

Here's a couple of very quick observations with possible optimizations:

AFAICS the main effect of the RBP fixes is the introduction of a memory load into 
the critical path, into the body unrolled loop:

+       mov frame_TBL(%rsp), TBL
        vpaddq  (TBL), Y_0, XFER
        vmovdqa XFER, frame_XFER(%rsp)
        FOUR_ROUNDS_AND_SCHED

Both 'TLB' and 'T1' are mapped to R12, which is why TBL has to be spilled to be 
reloaded from the stack.

1)

Note how R12 is used immediately, right in the next instruction:

        vpaddq  (TBL), Y_0, XFER

I.e. the RBP fixes lengthen the program order data dependencies - that's a new 
constraint and a few extra cycles per loop iteration if the workload is 
address-generator bandwidth limited on that.

A simple way to ease that constraint would be to move the 'TLB' load up into the 
loop, body, to the point where 'T1' is used for the last time - which is:


        mov     a, T1           # T1 = a                                # MAJB
        and     c, T1           # T1 = a&c                              # MAJB

        add     y0, y2          # y2 = S1 + CH                          # --
        or      T1, y3          # y3 = MAJ = (a|c)&b)|(a&c)             # MAJ

+       mov frame_TBL(%rsp), TBL

        add     y1, h           # h = k + w + h + S0                    # --

        add     y2, d           # d = k + w + h + d + S1 + CH = d + t1  # --

        add     y2, h           # h = k + w + h + S0 + S1 + CH = t1 + S0# --
        add     y3, h           # h = t1 + S0 + MAJ                     # --

Note how this moves up the 'TLB' reload by 4 instructions.

2)

If this does not get back performance, then maybe another reason is that it's 
cache access latency limited, in which case a more involved optimization would be 
to look at the register patterns and usages:


			first-use	last-use		use-length
	a:		#10		#29			20
	b:		#24		#24			 1
	c:		#14		#30			17
	d:		#23		#34			12
	e:		#11		#20			10
	f:		#15		#15			 1
	g:		#18		#27			10
	h:		#13		#36			24

	y0:		#11		#31			21
	y1:		#12		#33			22
	y2:		#15		#35			21
	y3:		#10		#36			27

	T1:		#16		#32			17

The 'first-use' colums shows the number of the instruction within the loop body 
that the register gets used - with '#1' denoting the first instruction ad #36 the 
last instruction, the 'last-use' column is showing the last instruction, and the 
'use-length' colum shows the 'window' in which a register is used.

What we want are the registers that are used the most tightly, i.e. these two:

	b:		#24		#24			 1
	f:		#15		#15			 1

Of these two 'f' is the best one, because it has an earlier use and longer 
cooldown.

If alias 'TBL' with 'f' then we could reload 'TLB' for the next iteration very 
early on:

        mov     f, y2           # y2 = f                                # CH
+       mov frame_TBL(%rsp), TBL
        rorx    $34, a, T1      # T1 = a >> 34                          # S0B

And there will be 21 instructions that don't depend on TLB after this, plenty of 
time for the load to be generated and propagated.

NOTE: my pseudo-patch is naive, due to the complication caused by the RotateState 
macro name rotation. It's still fundamentally possible I believe, it's just that 
'TBL' has to be rotated too, together with the other varibles.

3)

If even this does not help, because the workload is ucode-cache limited, and the 
extra reloads pushed the critical path just beyond some cache limit, then another 
experiment to try would be to roll _back_ the loop some more: instead of 4x 
FOUR_ROUNDS_AND_SCHED unrolled loops, try just having 2.

The CPU should still be smart enough with 2x interleaving of the loop body, and 
the extra branches should be relatively small and we could get back some 
performance.

In theory ...

4)

If the workload is fundamentally cache-port bandwidth limited, then the extra 
loads from memory to reload 'TLB' take away valuable bandwidth. There's no easy 
fix for that, but to find an unused register.

Here's the (initial, pre-rotation) integer register mappings:

	a:	RAX
	b:	RBX
	c:	RCX
	d:	R8
	e:	RDX
	f:	R9
	g:	R10
	h:	R11

	y0:	R13
	y1:	R14
	y2:	R15
	y3:	RSI

	T1:	R12

	TLB:	R12 # aliased to T1

Look what's missing: I don't see RDI being used in the loop.

RDI is allocated to 'CTX', but that's only used in higher level glue code, it does 
not appear to be used in the inner loops (explicitly at least).

So if this observation of mine is true we could go back to the old code for the 
hotpath, but use RDI for TBL and not reload it in the hotpath.

Thanks,

	Ingo

  parent reply	other threads:[~2017-09-14  9:16 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-08-29 18:05 [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files Josh Poimboeuf
2017-08-29 18:05 ` [PATCH 01/12] x86/crypto: Fix RBP usage in blowfish-x86_64-asm_64.S Josh Poimboeuf
2017-08-29 18:05 ` [PATCH 02/12] x86/crypto: Fix RBP usage in camellia-x86_64-asm_64.S Josh Poimboeuf
2017-08-29 18:05 ` [PATCH 03/12] x86/crypto: Fix RBP usage in cast5-avx-x86_64-asm_64.S Josh Poimboeuf
2017-08-29 18:05 ` [PATCH 04/12] x86/crypto: Fix RBP usage in cast6-avx-x86_64-asm_64.S Josh Poimboeuf
2017-08-29 18:05 ` [PATCH 05/12] x86/crypto: Fix RBP usage in des3_ede-asm_64.S Josh Poimboeuf
2017-08-29 18:05 ` [PATCH 06/12] x86/crypto: Fix RBP usage in sha1_avx2_x86_64_asm.S Josh Poimboeuf
2017-09-06 16:11   ` Tim Chen
2017-08-29 18:05 ` [PATCH 07/12] x86/crypto: Fix RBP usage in sha1_ssse3_asm.S Josh Poimboeuf
2017-08-29 18:05 ` [PATCH 08/12] x86/crypto: Fix RBP usage in sha256-avx-asm.S Josh Poimboeuf
2017-08-29 18:05 ` [PATCH 09/12] x86/crypto: Fix RBP usage in sha256-avx2-asm.S Josh Poimboeuf
2017-08-29 18:05 ` [PATCH 10/12] x86/crypto: Fix RBP usage in sha256-ssse3-asm.S Josh Poimboeuf
2017-08-29 18:05 ` [PATCH 11/12] x86/crypto: Fix RBP usage in sha512-avx2-asm.S Josh Poimboeuf
2017-08-29 18:05 ` [PATCH 12/12] x86/crypto: Fix RBP usage in twofish-avx-x86_64-asm_64.S Josh Poimboeuf
2017-09-02  0:09 ` [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files Eric Biggers
2017-09-07  0:15   ` Josh Poimboeuf
2017-09-07  7:15   ` Ingo Molnar
2017-09-07 17:58     ` Eric Biggers
2017-09-07 21:26       ` Ingo Molnar
2017-09-08 17:57         ` Eric Biggers
2017-09-13 21:24           ` Josh Poimboeuf
2017-09-13 22:33             ` Josh Poimboeuf
2017-09-15  4:54               ` Eric Biggers
2017-09-15  5:34                 ` Ingo Molnar
2017-09-15 16:07                   ` Eric Biggers
2017-09-15 21:06                     ` Ingo Molnar
2017-09-19  3:00                       ` Herbert Xu
2017-09-14  9:16             ` Ingo Molnar [this message]
2017-09-14  9:28               ` Ingo Molnar
2017-09-14 13:28               ` Josh Poimboeuf
2017-09-15  5:37                 ` Ingo Molnar

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170914091612.ck33coyubzevru2i@gmail.com \
    --to=mingo@kernel.org \
    --cc=davem@davemloft.net \
    --cc=ebiggers3@gmail.com \
    --cc=ebiggers@google.com \
    --cc=herbert@gondor.apana.org.au \
    --cc=jpoimboe@redhat.com \
    --cc=jslaby@suse.cz \
    --cc=jussi.kivilinna@iki.fi \
    --cc=linux-crypto@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=minipli@googlemail.com \
    --cc=mouli@linux.intel.com \
    --cc=peterz@infradead.org \
    --cc=tim.c.chen@linux.intel.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox