Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+

All of lore.kernel.org
 help / color / mirror / Atom feed

From: linux@horizon.com
To: bgilbert@cs.cmu.edu, linux@horizon.com
Cc: linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
Date: 12 Jun 2007 01:05:44 -0400	[thread overview]
Message-ID: <20070612050544.23957.qmail@science.horizon.com> (raw)
In-Reply-To: <466D9FDB.2010305@cs.cmu.edu>

> I got this code from Nettle, originally, and I never looked at the SHA-1 
> round structure very closely.  I'll give that approach a try.

Attached is some (tested, working, and public domain) assembly code for
three different sha_transform implementations.  Compared to C code, the
timings to hash 10 MiB on a 600 MHz PIII are:

  One: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 564819 us
 Four: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 391086 us
  Two: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 399134 us
Three: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 345986 us
 Five: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 301152 us

  One: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 558652 us
 Four: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 390980 us
  Two: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 407661 us
Three: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 412434 us
 Five: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 266809 us

  One: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 559053 us
 Four: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 396506 us
  Two: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 401661 us
Three: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 349668 us
 Five: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 265861 us

  One: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 556082 us
 Four: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 392967 us
  Two: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 406381 us
Three: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 338959 us
 Five: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 274712 us

Um.. some more runs, nice --19, that come out a bit more stable:
  One: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 552971 us
 Four: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 388167 us
  Two: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 398721 us
Three: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 337220 us
 Five: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 259790 us

  One: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 551240 us
 Four: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 387812 us
  Two: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 398519 us
Three: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 336903 us
 Five: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 260161 us

  One: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 551934 us
 Four: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 387639 us
  Two: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 398094 us
Three: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 335860 us
 Five: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 259805 us

This is hot-cache testing; I haven't got around to writing macro
tricks that exapnd to a megabyte of object code.  The challenge is to
purge not only the I- and D-caches, but also the branch predictor!


The names are the order they were written in.  "One" is the lib/sha1.c
code (547 bytes with -Os).  "Four" is a 5x unrolled C version (1106 bytes).

"Two" is a space-optimized ASM version, 266 bytes long.  "Three" is 5x
unrolled, 722 bytes long.  "Five" is a fully unrolled version, 3558
bytes long.

(Further space savings are possible, but it doesn't seem worth it.)

I have noticed that every caller of sha_transform in the kernel tree
allocates the W[] array on the stack, so we might as well do that inside
sha_transform.  The point of passing in the buffer is to amortize the
wiping afterwards, but see sha_stackwipe for ideas on how to do that.
(It can even be done mostly portably in C, given a good guess about the
C function's stack usage.)


I also noticed a glaring BUG in the folding at the end of extract_buf at
drivers/char/random.c:797.  That should be:

	/*
	 * In case the hash function has some recognizable
	 * output pattern, we fold it in half.
	 */

	buf[0] ^= buf[4];
	buf[1] ^= buf[3];
	buf[2] ^= rol32(buf[2], 16);	// <--- Bug was here
	memcpy(out, buf, EXTRACT_SIZE);
	memset(buf, 0, sizeof(buf));

if the code is to match the comment.



=== sha1asm.S ===
#define A %eax
#define B %ebx
#define C %ecx
#define D %edx
#define E %ebp
#define I %esi
#define T %edi

# f1(x,y,z) = bitwise x ? y : z = (z ^ (x & (y ^ z)))
#define F1(x,y,z,dest)	\
	movl	z,T;	\
	xorl	y,T;	\
	andl	x,T;	\
	xorl	z,T

# f2(x,y,z) = x ^ y ^ z
#define F2(x,y,z,dest)	\
	movl	z,T;	\
	xorl	x,T;	\
	xorl	y,T

# f3(x,y,z) = majority(x,y,z) = ((x & z) + (y & (x ^ z)))
#define F3(x,y,z,dest)	\
	movl	z,T;	\
	andl	x,T;	\
	addl	T,dest; \
	movl	z,T;	\
	xorl	x,T;	\
	andl	y,T

#define K1  0x5A827999			/* Rounds  0-19: sqrt(2) * 2^30 */
#define K2  0x6ED9EBA1			/* Rounds 20-39: sqrt(3) * 2^30 */
#define K3  0x8F1BBCDC			/* Rounds 40-59: sqrt(5) * 2^30 */
#define K4  0xCA62C1D6			/* Rounds 60-79: sqrt(10) * 2^30 */

# e += W[i] + K + f(b, c, d) + rol32(a, 5); b = rol32(b, 30); i++;
#define ROUND(a,b,c,d,e,f,k)	\
	addl (%esp,I,4),e;	\
	incl I;			\
	f(b,c,d,e);		\
	leal k(T,e),e;		\
	movl a,T;		\
	roll $5,T;		\
	rorl $2,b;		\
	addl T,e

	.text
.globl sha_transform3
	.type	sha_transform3, @function
# void sha_transform3(__u32 digest[5], const char in[64])
sha_transform3:
	pushl	%ebp
	pushl	%edi
	pushl	%esi
	pushl	%ebx
# Args start at 20(%esp)
	xorl	I,I
	movl	24(%esp),B	# B = in
	movl	20(%esp),T	# T = digest
	subl	$320,%esp
1:
	movl	(B,I,4),A
	bswap	A
	movl	A,(%esp,I,4)
	incl	I
	cmpl	$16,I
	jne	1b

2:
	movl	-64(%esp,I,4),A
	xorl	-56(%esp,I,4),A
	xorl	-32(%esp,I,4),A
	xorl	-12(%esp,I,4),A
	roll	$1,A
	movl	A,(%esp,I,4)
	incl	I
	cmpl	$80,I
	jne	2b

	movl	(T),A
	movl	4(T),B
	movl	8(T),C
	movl	12(T),D
	movl	16(T),E

	xorl	I,I
3:
	ROUND(A,B,C,D,E,F1,K1)
	ROUND(E,A,B,C,D,F1,K1)
	ROUND(D,E,A,B,C,F1,K1)
	ROUND(C,D,E,A,B,F1,K1)
	ROUND(B,C,D,E,A,F1,K1)
	cmp	$20,I
	jne	3b
4:
	ROUND(A,B,C,D,E,F2,K2)
	ROUND(E,A,B,C,D,F2,K2)
	ROUND(D,E,A,B,C,F2,K2)
	ROUND(C,D,E,A,B,F2,K2)
	ROUND(B,C,D,E,A,F2,K2)
	cmp	$40,I
	jne	4b
5:
	ROUND(A,B,C,D,E,F3,K3)
	ROUND(E,A,B,C,D,F3,K3)
	ROUND(D,E,A,B,C,F3,K3)
	ROUND(C,D,E,A,B,F3,K3)
	ROUND(B,C,D,E,A,F3,K3)
	cmp	$60,I
	jne	5b
6:
	ROUND(A,B,C,D,E,F2,K4)
	ROUND(E,A,B,C,D,F2,K4)
	ROUND(D,E,A,B,C,F2,K4)
	ROUND(C,D,E,A,B,F2,K4)
	ROUND(B,C,D,E,A,F2,K4)
	cmp	$80,I
	jne	6b

	addl	$320,%esp
	movl	20(%esp),T

	addl	A,(T)
	addl	B,4(T)
	addl	C,8(T)
	addl	D,12(T)
	addl	E,16(T)

	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp

	ret
	.size	sha_transform3, .-sha_transform3
	# Size is 0x2D2 = 722 bytes

# A smaller variant
#define ROUND2(a,b,c,d,e,f,k)	\
	addl (%esp,I,4),e;	\
	incl I;			\
	f(b,c,d,e);		\
	leal k(T,e),T;		\
	movl d,e;		\
	movl c,d;		\
	movl b,c;		\
	rorl $2,c;		\
	movl a,b;		\
	roll $5,a;		\
	addl T,a

.globl sha_transform2
	.type	sha_transform2, @function
# void sha_transform2(__u32 digest[5], const char in[64])
sha_transform2:
	pushl	%ebp
	pushl	%edi
	pushl	%esi
	pushl	%ebx
# Args start at 20(%esp)
	xorl	I,I
	movl	24(%esp),B	# B = in
	movl	20(%esp),T	# T = digest
	subl	$320,%esp
1:
	movl	(B,I,4),A
	bswap	A
	movl	A,(%esp,I,4)
	incl	I
	cmpl	$16,I
	jne	1b

2:
	movl	-64(%esp,I,4),A
	xorl	-56(%esp,I,4),A
	xorl	-32(%esp,I,4),A
	xorl	-12(%esp,I,4),A
	roll	$1,A
	movl	A,(%esp,I,4)
	incl	I
	cmpl	$80,I
	jne	2b

	movl	(T),A
	movl	4(T),B
	movl	8(T),C
	movl	12(T),D
	movl	16(T),E

	xorl	I,I
3:
	ROUND2(A,B,C,D,E,F1,K1)
	cmp	$20,I
	jne	3b
4:
	ROUND2(A,B,C,D,E,F2,K2)
	cmp	$40,I
	jne	4b
5:
	ROUND2(A,B,C,D,E,F3,K3)
	cmp	$60,I
	jne	5b
6:
	ROUND2(A,B,C,D,E,F2,K4)
	cmp	$80,I
	jne	6b

	addl	$320,%esp
	movl	20(%esp),T
	addl	A,(T)
	addl	B,4(T)
	addl	C,8(T)
	addl	D,12(T)
	addl	E,16(T)

	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp

	ret
	.size	sha_transform2, .-sha_transform2
	# Size is 0x10A = 266 bytes

# The three cases of the next input word...
# Fetch big-endian (first 16 rounds)
#define FETCH(i)		\
	movl	4*i(I),T;	\
	bswap	T;		\
	movl	T,4*i(%esp)

# Calculate but don't store (last 3 rounds)
#define CALCX(i)			\
	movl	4*(i&15)(%esp),T;	\
	xorl	4*((i+2)&15)(%esp),T;	\
	xorl	4*((i+8)&15)(%esp),T;	\
	xorl	4*((i+13)&15)(%esp),T;	\
	roll	$1,T

# Calculate and store on stack (middle 61 rounds)
#define CALC(i)			\
	CALCX(i);			\
	movl	T,4*(i&15)(%esp)

# e += W[i] + K + f(b, c, d) + rol32(a, 5); b = rol32(b, 30); i++;
#define ROUND5a(a,b,c,d,e,f,k)	\
	leal k(T,e),e;		\
	f(b,c,d,e);		\
	addl T,e;		\
	movl a,T;		\
	roll $5,T;		\
	rorl $2,b;		\
	addl T,e

# A variant that assumes that k is stored in I
#define ROUND5b(a,b,c,d,e,f)	\
	addl I,e;		\
	addl T,e;		\
	f(b,c,d,e);		\
	addl T,e;		\
	movl a,T;		\
	roll $5,T;		\
	rorl $2,b;		\
	addl T,e

.globl sha_transform5
	.type	sha_transform5, @function
# void sha_transform5(__u32 digest[5], const char in[64])
sha_transform5:
	pushl	%ebp
	pushl	%edi
	pushl	%esi
	pushl	%ebx
# Args start at 20(%esp)
	movl	24(%esp),I	# I = in
	movl	20(%esp),T	# T = digest
	subl	$64,%esp

	movl	(T),A
	movl	4(T),B
	movl	8(T),C
	movl	12(T),D
	movl	16(T),E

	FETCH(0); ROUND5a(A,B,C,D,E,F1,K1)
	FETCH(1); ROUND5a(E,A,B,C,D,F1,K1)
	FETCH(2); ROUND5a(D,E,A,B,C,F1,K1)
	FETCH(3); ROUND5a(C,D,E,A,B,F1,K1)
	FETCH(4); ROUND5a(B,C,D,E,A,F1,K1)

	FETCH(5); ROUND5a(A,B,C,D,E,F1,K1)
	FETCH(6); ROUND5a(E,A,B,C,D,F1,K1)
	FETCH(7); ROUND5a(D,E,A,B,C,F1,K1)
	FETCH(8); ROUND5a(C,D,E,A,B,F1,K1)
	FETCH(9); ROUND5a(B,C,D,E,A,F1,K1)

	FETCH(10); ROUND5a(A,B,C,D,E,F1,K1)
	FETCH(11); ROUND5a(E,A,B,C,D,F1,K1)
	FETCH(12); ROUND5a(D,E,A,B,C,F1,K1)
	FETCH(13); ROUND5a(C,D,E,A,B,F1,K1)
	FETCH(14); ROUND5a(B,C,D,E,A,F1,K1)

	FETCH(15); movl $K1,I; ROUND5b(A,B,C,D,E,F1)
	CALC(16); ROUND5b(E,A,B,C,D,F1)
	CALC(17); ROUND5b(D,E,A,B,C,F1)
	CALC(18); ROUND5b(C,D,E,A,B,F1)
	CALC(19); ROUND5b(B,C,D,E,A,F1)

	movl $K2,I

	CALC(20); ROUND5b(A,B,C,D,E,F2)
	CALC(21); ROUND5b(E,A,B,C,D,F2)
	CALC(22); ROUND5b(D,E,A,B,C,F2)
	CALC(23); ROUND5b(C,D,E,A,B,F2)
	CALC(24); ROUND5b(B,C,D,E,A,F2)

	CALC(25); ROUND5b(A,B,C,D,E,F2)
	CALC(26); ROUND5b(E,A,B,C,D,F2)
	CALC(27); ROUND5b(D,E,A,B,C,F2)
	CALC(28); ROUND5b(C,D,E,A,B,F2)
	CALC(29); ROUND5b(B,C,D,E,A,F2)

	CALC(30); ROUND5b(A,B,C,D,E,F2)
	CALC(31); ROUND5b(E,A,B,C,D,F2)
	CALC(32); ROUND5b(D,E,A,B,C,F2)
	CALC(33); ROUND5b(C,D,E,A,B,F2)
	CALC(34); ROUND5b(B,C,D,E,A,F2)

	CALC(35); ROUND5b(A,B,C,D,E,F2)
	CALC(36); ROUND5b(E,A,B,C,D,F2)
	CALC(37); ROUND5b(D,E,A,B,C,F2)
	CALC(38); ROUND5b(C,D,E,A,B,F2)
	CALC(39); ROUND5b(B,C,D,E,A,F2)

	movl $K3,I

	CALC(40); ROUND5b(A,B,C,D,E,F3)
	CALC(41); ROUND5b(E,A,B,C,D,F3)
	CALC(42); ROUND5b(D,E,A,B,C,F3)
	CALC(43); ROUND5b(C,D,E,A,B,F3)
	CALC(44); ROUND5b(B,C,D,E,A,F3)

	CALC(45); ROUND5b(A,B,C,D,E,F3)
	CALC(46); ROUND5b(E,A,B,C,D,F3)
	CALC(47); ROUND5b(D,E,A,B,C,F3)
	CALC(48); ROUND5b(C,D,E,A,B,F3)
	CALC(49); ROUND5b(B,C,D,E,A,F3)

	CALC(50); ROUND5b(A,B,C,D,E,F3)
	CALC(51); ROUND5b(E,A,B,C,D,F3)
	CALC(52); ROUND5b(D,E,A,B,C,F3)
	CALC(53); ROUND5b(C,D,E,A,B,F3)
	CALC(54); ROUND5b(B,C,D,E,A,F3)

	CALC(55); ROUND5b(A,B,C,D,E,F3)
	CALC(56); ROUND5b(E,A,B,C,D,F3)
	CALC(57); ROUND5b(D,E,A,B,C,F3)
	CALC(58); ROUND5b(C,D,E,A,B,F3)
	CALC(59); ROUND5b(B,C,D,E,A,F3)

	movl $K4,I

	CALC(60); ROUND5b(A,B,C,D,E,F2)
	CALC(61); ROUND5b(E,A,B,C,D,F2)
	CALC(62); ROUND5b(D,E,A,B,C,F2)
	CALC(63); ROUND5b(C,D,E,A,B,F2)
	CALC(64); ROUND5b(B,C,D,E,A,F2)

	CALC(65); ROUND5b(A,B,C,D,E,F2)
	CALC(66); ROUND5b(E,A,B,C,D,F2)
	CALC(67); ROUND5b(D,E,A,B,C,F2)
	CALC(68); ROUND5b(C,D,E,A,B,F2)
	CALC(69); ROUND5b(B,C,D,E,A,F2)

	CALC(70); ROUND5b(A,B,C,D,E,F2)
	CALC(71); ROUND5b(E,A,B,C,D,F2)
	CALC(72); ROUND5b(D,E,A,B,C,F2)
	CALC(73); ROUND5b(C,D,E,A,B,F2)
	CALC(74); ROUND5b(B,C,D,E,A,F2)

	CALC(75); ROUND5b(A,B,C,D,E,F2)
	CALC(76); ROUND5b(E,A,B,C,D,F2)
	CALCX(77); ROUND5b(D,E,A,B,C,F2)
	CALCX(78); ROUND5b(C,D,E,A,B,F2)
	CALCX(79); ROUND5b(B,C,D,E,A,F2)

	addl	$64,%esp
	movl	20(%esp),I

#if 1
	addl	A,(I)
	addl	B,4(I)
	addl	C,8(I)
	addl	D,12(I)
	addl	E,16(I)
#else
	movl	A,(I)
	movl	B,4(I)
	movl	C,8(I)
	movl	D,12(I)
	movl	E,16(I)
#endif

	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp

	ret
	.size	sha_transform5, .-sha_transform5
	# Size is 0xDE6 = 3558 bytes


.globl sha_stackwipe
	.type	sha_stackwipe, @function
# void sha_stackwipe(void)
# After one or more sha_transform calls, we have left the contents of W[]
# on the stack, and from any 16 of those 80 words, the entire input
# can be reconstructed.  If the caller cares, this function obliterates
# the relevant portion of the stack.
# 2 words of argument + 4 woirds of saved registers + 80 words of W[]
sha_stackwipe:
	xorl	%eax,%eax
	movl	$86,%ecx
# Damn, I had hoped that loop; pushl %eax would work..
1:
	decl	%ecx
	pushl	%eax
	jne	1b

	addl	$4*86,%esp
	ret
	.size	sha_stackwipe, .-sha_stackwipe

next prev parent reply	other threads:[~2007-06-12  5:05 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-06-11  7:53 [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+ linux
2007-06-11 19:17 ` Benjamin Gilbert
2007-06-12  5:05   ` linux [this message]
2007-06-13  5:29     ` [PATCH] random: fix folding Matt Mackall
2007-06-13  5:45       ` linux
2007-06-13  6:08         ` Matt Mackall
2007-06-13  5:50     ` [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+ Matt Mackall
2007-06-13  6:46       ` linux
  -- strict thread matches above, loose matches on Subject: below --
2007-06-08 21:42 [PATCH 0/3] Add optimized SHA-1 implementations for x86 and x86_64 Benjamin Gilbert
2007-06-08 21:42 ` [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+ Benjamin Gilbert
2007-06-09  7:32   ` Jan Engelhardt
2007-06-10  1:15     ` Benjamin Gilbert
2007-06-11 19:47       ` Benjamin Gilbert
2007-06-09 20:11   ` Matt Mackall
2007-06-09 20:23     ` Jeff Garzik
2007-06-09 21:34       ` Matt Mackall
2007-06-10  0:33       ` Benjamin Gilbert
2007-06-10 13:59         ` Matt Mackall
2007-06-10 16:47           ` Benjamin Gilbert
2007-06-10 17:33             ` Matt Mackall
2007-06-11 17:39           ` Benjamin Gilbert
2007-06-11 12:04     ` Andi Kleen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070612050544.23957.qmail@science.horizon.com \
    --to=linux@horizon.com \
    --cc=bgilbert@cs.cmu.edu \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.