public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] Add optimized SHA-1 implementations for x86 and x86_64
@ 2007-06-08 21:42 Benjamin Gilbert
  2007-06-08 21:42 ` [PATCH 1/3] [CRYPTO] Move sha_init() into cryptohash.h Benjamin Gilbert
                   ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Benjamin Gilbert @ 2007-06-08 21:42 UTC (permalink / raw)
  To: akpm; +Cc: herbert, linux-crypto, linux-kernel

The following 3-part series adds assembly implementations of the SHA-1
transform for x86 and x86_64.  For x86_64 the optimized code is always
selected; on x86 it is selected if the kernel is compiled for i486 or above
(since the code needs BSWAP).  These changes primarily improve the
performance of the CryptoAPI SHA-1 module and of /dev/urandom.  I've
included some performance data from my test boxes below.

This version incorporates feedback from Herbert Xu.  Andrew, I'm sending
this to you because of the (admittedly tiny) intersection with arm and s390
in part 1.

-

tcrypt performance tests:

=== Pentium IV in 32-bit mode, average of 5 trials ===
Test#  Bytes/  Bytes/  Cyc/B  Cyc/B  Change
        block  update    (C)  (asm)
    0      16      16    229    114     50%
    1      64      16    142     76     46%
    2      64      64     79     35     56%
    3     256      16     59     34     42%
    4     256      64     44     24     45%
    5     256     256     43     17     60%
    6    1024      16     51     36     29%
    7    1024     256     30     13     57%
    8    1024    1024     28     12     57%
    9    2048      16     66     30     55%
   10    2048     256     31     12     61%
   11    2048    1024     27     13     52%
   12    2048    2048     26     13     50%
   13    4096      16     49     30     39%
   14    4096     256     28     12     57%
   15    4096    1024     28     11     61%
   16    4096    4096     26     13     50%
   17    8192      16     49     29     41%
   18    8192     256     27     11     59%
   19    8192    1024     26     11     58%
   20    8192    4096     25     10     60%
   21    8192    8192     25     10     60%

=== Intel Core 2 in 64-bit mode, average of 5 trials ===
Test#  Bytes/  Bytes/  Cyc/B  Cyc/B  Change
        block  update    (C)  (asm)
    0      16      16    112     81     28%
    1      64      16     55     39     29%
    2      64      64     42     27     36%
    3     256      16     35     25     29%
    4     256      64     24     14     42%
    5     256     256     22     12     45%
    6    1024      16     31     22     29%
    7    1024     256     17      9     47%
    8    1024    1024     16      9     44%
    9    2048      16     30     22     27%
   10    2048     256     16      8     50%
   11    2048    1024     16      8     50%
   12    2048    2048     16      8     50%
   13    4096      16     29     21     28%
   14    4096     256     16      8     50%
   15    4096    1024     15      8     47%
   16    4096    4096     15      7     53%
   17    8192      16     29     22     24%
   18    8192     256     16      8     50%
   19    8192    1024     15      7     53%
   20    8192    4096     15      7     53%
   21    8192    8192     15      7     53%

I've also done informal tests on other boxes, and the performance
improvement has been in the same ballpark.

On the aforementioned Pentium IV, /dev/urandom throughput goes from 3.7 MB/s
to 5.6 MB/s with the patches; on the Core 2, it increases from 5.5 MB/s to
8.1 MB/s.

Signed-off-by: Benjamin Gilbert <bgilbert@cs.cmu.edu>

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
@ 2007-06-11  7:53 linux
  2007-06-11 19:17 ` Benjamin Gilbert
  0 siblings, 1 reply; 26+ messages in thread
From: linux @ 2007-06-11  7:53 UTC (permalink / raw)
  To: bgilbert, linux-kernel; +Cc: linux

+#define F3(x,y,z)						\
+	movl	x, TMP2;					\
+	andl	y, TMP2;					\
+	movl	x, TMP;						\
+	orl	y, TMP;						\
+	andl	z, TMP;						\
+       orl	TMP2, TMP

*Sigh*.  You don't need TMP2 to compute the majority function.

You're implementing it as (x & y) | ((x | y) & z).
Look at the rephrasing in lib/sha1.c:
#define f3(x,y,z)	((x & y) + (z & (x ^ y)))	/* majority */

By changing the second OR to x^y, you ensure that the two halves
of the first disjunciton are distinct, so you can replace the OR
with XOR, or better yet, +.

Then you can just do two adds to e.  That is, write:

/* Bitwise select: x ? y : z, which is (z ^ (x & (y ^ z))) */
#define F1(x,y,z,dest)		\
	movl	z, TMP;		\
	xorl	y, TMP;		\
	andl	x, TMP;		\
	xorl	z, TMP;		\
	addl	TMP, dest

/* Three-way XOR (x ^ y ^ z) */
#define F2(x,y,z,dest)		\
	movl	z, TMP;		\
	xorl	x, TMP;		\
	xorl	y, TMP;		\
	addl	TMP, dest

/* Majority: (x^y)|(y&z)|(z&x) = (x & z) + ((x ^ z) & y)
#define F3(x,y,z,dest)		\
	movl	z, TMP;		\
	andl	x, TMP;		\
	addl	TMP, dest;	\
	movl	z, TMP;		\
	xorl	x, TMP;		\
	andl	y, TMP;		\
	addl	TMP, dest

Since y is the most recently computed result (it's rotated in the
previous round), I arranged the code to delay its use as late as
possible.


Now you have one more register to play with.


I thought I had some good sha1 asm code lying around, but I
can't seem to find it.  (I have some excellent PowerPC asm if anyone
wants it.)  Here's a basic implementation question:

SHA-1 is made up of 80 rounds, 20 of each of 4 types.
There are 5 working variables, a through e.

The basic round is:
	t = F(b, c, d) + K + rol32(a, 5) + e + W[i];
	e = d; d = c; c = rol32(b, 30); b = a; a = t;
where W[] is the input array.  W[0..15] are the input words, and
W[16..79] are computed by a sort of LFSR from W[0..15].

Each group of 20 rounds has a different F() and K.
This is the smallest way to write the function, but all the
register shuffling makes for a bit of a speed penalty.

A faster way is to unroll 5 iterations and do:
	e += F(b, c, d) + K + rol32(a, 5) + W[i  ]; b = rol32(b, 30);
	d += F(a, b, c) + K + rol32(e, 5) + W[i+1]; a = rol32(a, 30);
	c += F(e, a, b) + K + rol32(d, 5) + W[i+2]; e = rol32(e, 30);
	b += F(d, e, a) + K + rol32(c, 5) + W[i+3]; d = rol32(d, 30);
	a += F(c, d, e) + K + rol32(b, 5) + W[i+4]; c = rol32(c, 30);
then loop over that 4 times each.  This is somewhat larger, but
still reasonably compact; only 20 of the 80 rounds are written out
long-hand.

Faster yet is to unroll all 80 rounds directly.  But it also takes the
most code space, and as we have learned, when your code is not the
execution time hot spot, less cache is faster code.

Is there a preferred implementation?


Another implementation choice has to do with the computation of W[].
W[i] is a function of W[i-3], W[i-8], W[i-14] and W[i-16].
It is possible to keep a 16-word circular buffer with only the most
recent 16 values of W[i%16] and compute each new word as it is needed.
However, the offsets i%16 repeat every 16 rounds, which is an awkward
fit with the 5-round repeating pattern of the main computation.

One option is to compute all the W[] values in a pre-pass beforehand.
Simple and small, but uses 320 bytes of data on the stack or wherever.
An intermediate one is to keep a 20-word buffer, and compute 20 words
at a time just before each of the 20 round groups.

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2007-06-13  6:46 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-06-08 21:42 [PATCH 0/3] Add optimized SHA-1 implementations for x86 and x86_64 Benjamin Gilbert
2007-06-08 21:42 ` [PATCH 1/3] [CRYPTO] Move sha_init() into cryptohash.h Benjamin Gilbert
2007-06-08 21:42 ` [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+ Benjamin Gilbert
2007-06-09  7:32   ` Jan Engelhardt
2007-06-10  1:15     ` Benjamin Gilbert
2007-06-11 19:47       ` Benjamin Gilbert
2007-06-11 19:50         ` [PATCH] " Benjamin Gilbert
2007-06-11 19:52         ` [PATCH] [CRYPTO] Add optimized SHA-1 implementation for x86_64 Benjamin Gilbert
2007-06-09 20:11   ` [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+ Matt Mackall
2007-06-09 20:23     ` Jeff Garzik
2007-06-09 21:34       ` Matt Mackall
2007-06-10  0:33       ` Benjamin Gilbert
2007-06-10 13:59         ` Matt Mackall
2007-06-10 16:47           ` Benjamin Gilbert
2007-06-10 17:33             ` Matt Mackall
2007-06-11 17:39           ` Benjamin Gilbert
2007-06-11 12:04     ` Andi Kleen
2007-06-08 21:42 ` [PATCH 3/3] [CRYPTO] Add optimized SHA-1 implementation for x86_64 Benjamin Gilbert
2007-06-11 12:01   ` Andi Kleen
2007-06-11 19:45     ` Benjamin Gilbert
2007-06-11 20:30 ` [PATCH 0/3] Add optimized SHA-1 implementations for x86 and x86_64 Adrian Bunk
  -- strict thread matches above, loose matches on Subject: below --
2007-06-11  7:53 [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+ linux
2007-06-11 19:17 ` Benjamin Gilbert
2007-06-12  5:05   ` linux
2007-06-13  5:50     ` Matt Mackall
2007-06-13  6:46       ` linux

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox