RE: [PATCH v4 6/6] crypto: lib/sha256 - Unroll LOAD and BLEND loops

Linux cryptographic layer development

From: David Laight <David.Laight@ACULAB.COM>
To: 'Arvind Sankar' <nivedita@alum.mit.edu>,
	Herbert Xu <herbert@gondor.apana.org.au>,
	"David S. Miller" <davem@davemloft.net>,
	"linux-crypto@vger.kernel.org" <linux-crypto@vger.kernel.org>,
	Eric Biggers <ebiggers@kernel.org>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Eric Biggers" <ebiggers@google.com>
Subject: RE: [PATCH v4 6/6] crypto: lib/sha256 - Unroll LOAD and BLEND loops
Date: Sun, 25 Oct 2020 18:51:18 +0000
Message-ID: <05150bdb3a4c4b2682ab9cb8fb2ed411@AcuMS.aculab.com>
In-Reply-To: <20201025143119.1054168-7-nivedita@alum.mit.edu>

From: Arvind Sankar
> Sent: 25 October 2020 14:31
> 
> Unrolling the LOAD and BLEND loops improves performance by ~8% on x86_64
> (tested on Broadwell Xeon) while not increasing code size too much.

I can't believe unrolling the BLEND loop makes any difference.

Unrolling the LOAD one might - but you don't need 8 times,
once should be more than enough.
The LOAD loop needs a memory read, memory write and BSWAP per iteration.
The loop control is add + compare + jmp.
On sandy bridge and later the compare and jmp become a single u-op.
So the loop has the read, write (can happen together) and 3 other u-ops.
That won't run at 1 clock per iteration on Sandy Bridge.
However just unroll once and you need 4 non-memory u-ops per loop iteration.
That might run at 2 clocks per 8 bytes.
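
For illustration, a minimal userspace sketch of the rolled LOAD loop next to
one unrolled once (two words per iteration). The helper and function names
here are hypothetical, not the ones in lib/crypto/sha256.c, and whether the
compiler emits exactly the u-op mix described above depends on the compiler
and flags.

```c
#include <stdint.h>

/* Hypothetical helper: read a big-endian 32-bit word from a byte buffer.
 * On x86 a compiler will typically turn this into a load plus BSWAP. */
static inline uint32_t load_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Rolled LOAD loop: one read, one byte swap and one write per iteration,
 * plus the loop control (add + compare + jump). */
static void load_rolled(uint32_t W[16], const uint8_t *input)
{
	for (int i = 0; i < 16; i++)
		W[i] = load_be32(input + 4 * i);
}

/* LOAD loop unrolled once: two words per iteration, so the loop control
 * cost is shared between two byte swaps. */
static void load_unrolled2(uint32_t W[16], const uint8_t *input)
{
	for (int i = 0; i < 16; i += 2) {
		W[i] = load_be32(input + 4 * i);
		W[i + 1] = load_be32(input + 4 * (i + 1));
	}
}
```

Both functions fill W[0..15] identically; only the loop shape differs.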

Fiddling the loop to remove the compare (ie run from -64 to 0)
should merge the 'add' and 'jnz' into a single u-op.
That might be enough to get the 'rolled up' loop to run in 1 clock
on sandy bridge, certainly on slightly later cpu.
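
A sketch of that loop shape, assuming the same 16-word block: the index runs
from -16 words (-64 bytes) up to 0 against pointers to the end of each
buffer, so the loop can terminate on the flags set by the increment instead
of a separate compare. The names are hypothetical, and a compiler is free to
transform the loop, so the generated code needs checking.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read a big-endian 32-bit word from a byte buffer. */
static inline uint32_t be32_at(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* LOAD loop indexed from -16 up to 0 relative to the end of both buffers.
 * The exit test is "index reached zero", which the increment itself can
 * signal, so no separate compare against a limit is needed. */
static void load_negative_index(uint32_t W[16], const uint8_t *input)
{
	const uint8_t *in_end = input + 64;	/* one past the last input byte */
	uint32_t *w_end = W + 16;		/* one past the last word of W */

	for (ptrdiff_t i = -16; i != 0; i++)
		w_end[i] = be32_at(in_end + 4 * i);
}
```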

That is theoretical for intel cpu sandy bridge onwards.
I've an i7-7700 (Kaby Lake?) that I believe has an extra
instruction pipeline and might run the initial loop in 1 clock.

I don't have any recent AMD cpu, nor any ARM or PPC ones.
But fully out-of-order cpu are likely to be similar.

One of the other test systems I've got is an Atom C2758.
This is 8 core but mostly in-order.
Running sha256_transform() on that tends to give one of two
TSC counts, one of which is double the other!
That is pretty consistent even for 100 iterations.

WRT patch 5.
On the C2758 the original unrolled code is slightly faster.
On the i7-7700 the 8 unroll is a bit faster 'hot cache',
but slower 'cold cache' - probably because of the d-cache
loads for K[].

Non-x86 architectures might need to use d-cache reads for
the 32bit 'K' constants even in the unrolled loop.
X86 can use 'lea' with a 32bit offset to avoid data reads.
So the cold-cache case for the old code may be similar.

Interestingly I had to write an asm ror32() to get reasonable
code (in userspace). The C version the kernel uses didn't work.
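
The asm version is not shown in this message, so the following is only a
guess at what such a helper could look like. ror32_c() follows the shape of
the kernel's C ror32(); ror32_asm() is a hypothetical GCC/Clang inline-asm
variant for x86 that falls back to the C version elsewhere.

```c
#include <stdint.h>

/* Portable rotate-right in C, written in the same shape as the kernel's
 * ror32() helper. */
static inline uint32_t ror32_c(uint32_t word, unsigned int shift)
{
	return (word >> (shift & 31)) | (word << ((-shift) & 31));
}

/* Hypothetical asm variant: forces a single ROR with the count in %cl.
 * Only used with GCC-style inline asm on x86; otherwise falls back to
 * the C version above. */
static inline uint32_t ror32_asm(uint32_t word, unsigned int shift)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	__asm__("rorl %%cl, %0" : "+r"(word) : "c"(shift) : "cc");
	return word;
#else
	return ror32_c(word, shift);
#endif
}
```

Comparing the compiler output for the two (for example with objdump -d) shows
whether the C form is already recognised as a rotate by a given compiler.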

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


Thread overview: 19+ messages
2020-10-25 14:31 [PATCH v4 0/6] crypto: lib/sha256 - cleanup/optimization Arvind Sankar
2020-10-25 14:31 ` [PATCH v4 1/6] crypto: lib/sha256 - Use memzero_explicit() for clearing state Arvind Sankar
2020-10-26  7:59   ` Ard Biesheuvel
2020-10-25 14:31 ` [PATCH v4 2/6] crypto: " Arvind Sankar
2020-10-26  7:58   ` Ard Biesheuvel
2020-10-25 14:31 ` [PATCH v4 3/6] crypto: lib/sha256 - Don't clear temporary variables Arvind Sankar
2020-10-26  7:59   ` Ard Biesheuvel
2020-10-25 14:31 ` [PATCH v4 4/6] crypto: lib/sha256 - Clear W[] in sha256_update() instead of sha256_transform() Arvind Sankar
2020-10-26  8:00   ` Ard Biesheuvel
2020-10-25 14:31 ` [PATCH v4 5/6] crypto: lib/sha256 - Unroll SHA256 loop 8 times intead of 64 Arvind Sankar
2020-10-26  8:00   ` Ard Biesheuvel
2020-10-25 14:31 ` [PATCH v4 6/6] crypto: lib/sha256 - Unroll LOAD and BLEND loops Arvind Sankar
2020-10-25 18:51   ` David Laight [this message]
2020-10-25 20:18     ` Arvind Sankar
2020-10-25 23:23       ` David Laight
2020-10-25 23:53         ` Arvind Sankar
2020-10-26 10:06           ` David Laight
2020-10-26  8:02   ` Ard Biesheuvel
2020-10-30  6:53 ` [PATCH v4 0/6] crypto: lib/sha256 - cleanup/optimization Herbert Xu
