* Re: Performance issue of 'git branch'
@ 2009-07-26 23:21 George Spelvin
2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin
0 siblings, 1 reply; 60+ messages in thread
From: George Spelvin @ 2009-07-26 23:21 UTC (permalink / raw)
To: git, torvalds; +Cc: linux
> It's a bit sad, since the _only_ thing we load all of libcrypto for is the
> (fairly trivial) SHA1 code.
>
> But at the same time, last time I benchmarked the different SHA1
> libraries, the openssl one was the fastest. I think it has tuned assembly
> language for most architectures. Our regular mozilla-based C code is
> perfectly fine, but it doesn't hold a candle to assembler tuning.
Actually, openssl only has assembly for x86, x86_64, and ia64.
Truthfully, once you have 32 registers, SHA1 is awfully easy to
compile near-optimally.
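For reference, the three round functions SHA-1 cycles through are tiny; here they are in portable C (textbook definitions, not any particular implementation):

```c
#include <stdint.h>

/* Rounds 0..19: bit-select -- b chooses between c and d */
uint32_t sha1_f1(uint32_t b, uint32_t c, uint32_t d)
{
	return d ^ (b & (c ^ d));	/* same as (b & c) | (~b & d) */
}

/* Rounds 20..39 and 60..79: parity */
uint32_t sha1_f2(uint32_t b, uint32_t c, uint32_t d)
{
	return b ^ c ^ d;
}

/* Rounds 40..59: majority.  (c & d) and (b & (c ^ d)) never have a bit
 * set in common, so the two terms may be added instead of ORed. */
uint32_t sha1_f3(uint32_t b, uint32_t c, uint32_t d)
{
	return (c & d) + (b & (c ^ d));	/* same as (b&c) | (b&d) | (c&d) */
}
```

These rewritten forms need fewer temporaries than the naive three-term
versions, which is what makes two scratch registers enough on x86.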
Git currently includes some hand-tuned assembly that isn't in OpenSSL:
- ARM (only 16 registers, and the rotate+op support can be used nicely)
- PPC (a 3-way superscalar *without* out-of-order execution; it benefits
  from careful scheduling)
Further, all of the core hand-tuned SHA1 assembly code in OpenSSL is by
Andy Polyakov and is dual-licensed GPL/3-clause BSD *in addition to*
the OpenSSL license. So we can just import it:
See http://www.openssl.org/~appro/cryptogams/
and http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz
(Ooh, look, he has PPC code in there, too. Does anyone with a PPC machine
want to compare it with Git's?)
It'll take some massaging because that's just the core SHA1_Transform
function and not the wrappers, but it's hardly a heroic effort.
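For anyone who wants to see exactly what that core function computes, here is a portable C sketch of the same block transform for a single 64-byte block (the name and state layout are illustrative; this is the textbook algorithm, not the OpenSSL or Git source):

```c
#include <stdint.h>

static uint32_t rol(uint32_t x, int n)
{
	return (x << n) | (x >> (32 - n));
}

/* Process one 64-byte block; h[] is the five-word chaining state. */
void sha1_block(uint32_t h[5], const uint8_t p[64])
{
	uint32_t w[80], a, b, c, d, e, f, k, t;
	int i;

	/* Message schedule: 16 big-endian words, expanded to 80 */
	for (i = 0; i < 16; i++)
		w[i] = (uint32_t)p[4*i] << 24 | (uint32_t)p[4*i+1] << 16 |
		       (uint32_t)p[4*i+2] << 8 | p[4*i+3];
	for (; i < 80; i++)
		w[i] = rol(w[i-3] ^ w[i-8] ^ w[i-14] ^ w[i-16], 1);

	a = h[0]; b = h[1]; c = h[2]; d = h[3]; e = h[4];

	for (i = 0; i < 80; i++) {
		if (i < 20) {
			f = d ^ (b & (c ^ d));       k = 0x5a827999;
		} else if (i < 40) {
			f = b ^ c ^ d;               k = 0x6ed9eba1;
		} else if (i < 60) {
			f = (c & d) | (b & (c ^ d)); k = 0x8f1bbcdc;
		} else {
			f = b ^ c ^ d;               k = 0xca62c1d6;
		}
		t = rol(a, 5) + f + e + k + w[i];
		e = d; d = c; c = rol(b, 30); b = a; a = t;
	}

	h[0] += a; h[1] += b; h[2] += c; h[3] += d; h[4] += e;
}
```

The wrappers Git would need on top are only the usual init/update/final
padding logic; this inner loop is the entire performance-critical part.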
I'm pretty deep in the weeds at $DAY_JOB and can't get to it for a while,
but would a patch be appreciated?
^ permalink raw reply [flat|nested] 60+ messages in thread
* Request for benchmarking: x86 SHA1 code 2009-07-26 23:21 Performance issue of 'git branch' George Spelvin @ 2009-07-31 10:46 ` George Spelvin 2009-07-31 11:11 ` Erik Faye-Lund ` (8 more replies) 0 siblings, 9 replies; 60+ messages in thread From: George Spelvin @ 2009-07-31 10:46 UTC (permalink / raw) To: git; +Cc: linux After studying Andy Polyakov's optimized x86 SHA-1 in OpenSSL, I've got a version that's 1.6% slower on a P4 and 15% faster on a Phenom. I'm curious about the performance on other CPUs I don't have access to, particularly Core 2 duo and i7. Could someone do some benchmarking for me? Old (486/Pentium/P2/P3) machines are also interesting, but I'm optimizing for newer ones. I haven't packaged this nicely, but it's not that complicated. - Download Andy's original code from http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz - Unpack and cd to the cryptogams-0/x86 directory - "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile", and "sha1-x86.pl". - "make" - Run ./586test (before) and ./x86test (after) and note the timings. Thank you! --- /dev/null 2009-05-12 02:55:38.579106460 -0400 +++ Makefile 2009-07-31 06:22:42.000000000 -0400 @@ -0,0 +1,16 @@ +CC := gcc +CFLAGS := -m32 -W -Wall -Os -g +ASFLAGS := --32 + +all: 586test x86test + +586test : sha1test.c sha1-586.o + $(CC) $(CFLAGS) -o $@ sha1test.c sha1-586.o + +x86test : sha1test.c sha1-x86.o + $(CC) $(CFLAGS) -o $@ sha1test.c sha1-x86.o + +586test x86test : sha1-586.h + +%.s : %.pl x86asm.pl x86unix.pl + perl $< elf > $@ --- /dev/null 2009-05-12 02:55:38.579106460 -0400 +++ sha1test.c 2009-07-28 09:24:09.000000000 -0400 @@ -0,0 +1,67 @@ +#include <stdint.h> +#include <stdlib.h> +#include <stdio.h> +#include <sys/time.h> + +#include "sha1-586.h" + +#define SIZE 1000000 + +#if SIZE % 64 +# error SIZE must be a multiple of 64! 
+#endif + +int +main(void) +{ + uint32_t iv[5] = { + 0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476, 0xc3d2e1f0 + }; + /* Simplest known-answer test, "abc" */ + static uint8_t const testbuf[64] = { + 'a','b','c', 0x80, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 24 + }; + /* Expected: A9993E364706816ABA3E25717850C26C9CD0D89D */ + static uint32_t const expected[5] = { + 0xa9993e36, 0x4706816a, 0xba3e2571, 0x7850c26c, 0x9cd0d89d }; + unsigned i; + char *p = malloc(SIZE); + struct timeval tv0, tv1; + + if (!p) { + perror("malloc"); + return 1; + } + + sha1_block_data_order(iv, testbuf, 1); + printf("Result: %08x %08x %08x %08x %08x\n" + "Expected:%08x %08x %08x %08x %08x\n", + iv[0], iv[1], iv[2], iv[3], iv[4], expected[0], + expected[1], expected[2], expected[3], expected[4]); + for (i = 0; i < 5; i++) + if (iv[i] != expected[i]) + printf("MISMATCH in word %u\n", i); + + if (gettimeofday(&tv0, NULL) < 0) { + perror("gettimeofday"); + return 1; + } + for (i = 0; i < 500; i++) + sha1_block_data_order(iv, p, SIZE/64); + if (gettimeofday(&tv1, NULL) < 0) { + perror("gettimeofday"); + return 1; + } + tv1.tv_sec -= tv0.tv_sec; + tv1.tv_usec -= tv0.tv_usec; + if (tv1.tv_usec < 0) { + tv1.tv_sec--; + tv1.tv_usec += 1000000; + } + printf("%u bytes: %u.%06u s\n", i * SIZE, (unsigned)tv1.tv_sec, + (unsigned)tv1.tv_usec); + return 0; +} --- /dev/null 2009-05-12 02:55:38.579106460 -0400 +++ sha1-586.h 2009-07-27 09:34:03.000000000 -0400 @@ -0,0 +1,3 @@ +#include <stdint.h> + +void sha1_block_data_order(uint32_t iv[5], void const *in, unsigned len); --- /dev/null 2009-05-12 02:55:38.579106460 -0400 +++ sha1-x86.pl 2009-07-31 06:10:18.000000000 -0400 @@ -0,0 +1,398 @@ +#!/usr/bin/env perl + +# ==================================================================== +# [Re]written by Andy Polyakov <appro@fy.chalmers.se> for the OpenSSL +# project. 
The module is, however, dual licensed under OpenSSL and +# CRYPTOGAMS licenses depending on where you obtain it. For further +# details see http://www.openssl.org/~appro/cryptogams/. +# ==================================================================== + +# "[Re]written" was achieved in two major overhauls. In 2004 BODY_* +# functions were re-implemented to address P4 performance issue [see +# commentary below], and in 2006 the rest was rewritten in order to +# gain freedom to liberate licensing terms. + +# It was noted that Intel IA-32 C compiler generates code which +# performs ~30% *faster* on P4 CPU than original *hand-coded* +# SHA1 assembler implementation. To address this problem (and +# prove that humans are still better than machines:-), the +# original code was overhauled, which resulted in following +# performance changes: +# +# compared with original compared with Intel cc +# assembler impl. generated code +# Pentium -16% +48% +# PIII/AMD +8% +16% +# P4 +85%(!) +45% +# +# As you can see Pentium came out as looser:-( Yet I reckoned that +# improvement on P4 outweights the loss and incorporate this +# re-tuned code to 0.9.7 and later. +# ---------------------------------------------------------------- +# <appro@fy.chalmers.se> + +$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1; +push(@INC,"${dir}","${dir}../../perlasm"); +require "x86asm.pl"; + +&asm_init($ARGV[0],"sha1-586.pl",$ARGV[$#ARGV] eq "386"); + +$A="eax"; +$B="ebx"; +$C="ecx"; +$D="edx"; +$E="ebp"; + +# Two temporaries +$S="esi"; +$T="edi"; + +# The round constants +use constant K1 => 0x5a827999; +use constant K2 => 0x6ED9EBA1; +use constant K3 => 0x8F1BBCDC; +use constant K4 => 0xCA62C1D6; + +@V=($A,$B,$C,$D,$E); + +# Given unlimited registers and functional units, it would be +# possible to compute SHA-1 at two cycles per round, using 7 +# operations per round. Remember, each round computes a new +# value for E, which is used as A in the following round and B +# in the round after that. 
There are two critical paths: +# - A must be rotated and added to the output E +# - B must go through two boolean operations before being added +# to the result E. Since this latter addition can't be done +# in the same-cycle as the critical addition of a<<<5, this is +# a total of 2+1+1 = 4 cycles. +# Additionally, if you want to avoid copying B, you have to +# rotate it soon after use in this round so it is ready for use +# as the following round's C. + +# f = (a <<< 5) + e + K + in[i] + (d^(b&(c^d))) (0..19) +# f = (a <<< 5) + e + K + in[i] + (b^c^d) (20..39, 60..79) +# f = (a <<< 5) + e + K + in[i] + (c&d) + (b&(c^d)) (40..59) +# The hard part is doing this with only two temporary registers. +# Let's divide it into 4 parts. These can be executed in a 7-cycle +# loop, assuming triple (quadruple counting xor separately) issue: +# +# in[i] F1(c,d) F2(b,c,d) a<<<5 +# mov in[i],T (addl S,A) (movl B,S) +# xor in[i+1],T (rorl 5,S) +# xor in[i+2],T movl D,S (addl S,A) +# xor in[i+3],T andl C,S +# rotl 1,T addl S,E movl D,S +# (movl T,in[i]) xorl C,S +# addl T+K,E andl B,S rorl 2,B // +# (mov in[i],T) addl S,E movl A,S +# (xor in[i+1],T) rorl 5,S +# (xor in[i+2],T) (movl C,S) addl S,E +# +# (The last 3 rounds can omit the store of T.) +# The "addl T+K,E" line is actually implemented using the lea instruction, +# which (on a Pentium) requires that neither T not K was modified on the +# previous cycle. +# +# The other two rounds are a bit simpler, and can therefore be "pulled in" +# one cycle, to 6. 
The bit-select function (0..19): +# in[i] F(b,c,d) a<<<5 +# mov in[i],T (xorl E,S) (addl T+K,A) +# xor in[i+1],T (addl S,A) (movl A,S) +# xor in[i+2],T (rorl 2,C) (roll 5,S) +# xor in[i+3],T movl D,S (addl S,A) +# roll 1,T xorl C,S +# movl T,in[i] andl B,S rorl 2,B +# addl T+K,E xorl D,S // (mov in[i],T) +# (xor in[i+1],T) addl S,E movl A,S +# (xor in[i+2],T) roll 5,S +# (xor in[i+3],T) (movl C,S) addl S,E +# +# And the XOR function (also 6, limited by the in[i] forming) used in +# rounds 20..39 and 60..79: +# in[i] F(b,c,d) a<<<5 +# mov in[i],T (xorl C,S) (addl T+K,A) +# xor in[i+1],T (addl S,A) (movl A,S) +# xor in[i+2],T (roll 5,S) +# xor in[i+3],T (addl S,A) +# roll 1,T movl D,S +# movl T,in[i] xorl B,S rorl 2,B +# addl T+K,E xorl C,S // (mov in[i],T) +# (xor in[i+1],T) addl S,E movl A,S +# (xor in[i+2],T) roll 5,S +# (xor in[i+3],T) addl S,E +# +# The first 16 rounds don't need to form the in[i] equation, letting +# us pull it in another 2 cycles, to 4, after some reassignment of +# temporaries: +# in[i] F(b,c,d) a<<<5 +# movl D,S (roll 5,T) (addl S,A) +# mov in[i],T xorl C,S (addl T,A) +# andl B,S rorl 2,B +# addl T+K,E xorl D,S movl A,T +# addl S,E roll 5,T (movl C,S) +# (mov in[i],T) (xorl B,S) addl T,E +# + +# The transition between rounds 15 and 16 will be a bit tricky... the best +# thing to do is to delay the computation of a<<<5 one cycle and move it back +# to the S register. That way, T is free as early as possible. 
+# in[i] F(b,c,d) a<<<5 +# (addl T+K,A) (xorl E,S) (movl A,T) +# movl D,S (roll 5,T) (addl S,A) +# mov in[i],T xorl C,S (addl T,A) +# andl B,S rorl 2,B +# addl T+K,E xorl D,S (movl in[1],T) +# (xor in[i+1],T) addl S,E movl A,S +# (xor in[i+2],T) rorl 2,B roll 5,S +# (xor in[i+3],T) (movl C,S) addl S,E + + + + + +# This expects the first copy of D to S to have been done already +# movl D,S (roll 5,T) (addl S,A) // +# mov in[i],T xorl C,S (addl T,A) +# andl B,S rorl 2,B +# addl T+K,E xorl D,S movl A,T +# addl S,E roll 5,T (movl C,S) // +# (mov in[i],T) (xorl B,S) addl T,E + +sub BODY_00_15 +{ + local($n,$a,$b,$c,$d,$e)=@_; + + &comment("00_15 $n"); + &mov($S,$d) if ($n == 0); + &mov($T,&swtmp($n%16)); # Load Xi. + &xor($S,$c); # Continue computing F() = d^(b&(c^d)) + &and($S,$b); + &rotr($b,2); + &lea($e,&DWP(K1,$e,$T)); # Add Xi and K + if ($n < 15) { + &mov($T,$a); + &xor($S,$d); + &rotl($T,5); + &add($e,$S); + &mov($S,$c); # Start of NEXT round's F() + &add($e,$T); + } else { + # This version provides the correct start for BODY_20_39 + &mov($T,&swtmp(($n+1)%16)); # Start computing mext Xi. + &xor($S,$d); + &xor($T,&swtmp(($n+3)%16)); + &add($e,$S); # Add F() + &mov($S,$a); # Start computing a<<<5 + &xor($T,&swtmp(($n+9)%16)); + &rotl($S,5); + } + +} + +# The transition between rounds 15 and 16 will be a bit tricky... the best +# thing to do is to delay the computation of a<<<5 one cycle and move it back +# to the S register. That way, T is free as early as possible. +# in[i] F(b,c,d) a<<<5 +# (addl T+K,A) (xorl E,S) (movl B,T) +# movl D,S (roll 5,T) (addl S,A) // +# mov in[i],T xorl C,S (addl T,A) +# andl B,S rorl 2,B +# addl T+K,E xorl D,S (movl in[1],T) +# (xor in[i+1],T) addl S,E movl A,S +# (xor in[i+2],T) rorl 2,B roll 5,S // +# (xor in[i+3],T) (movl C,S) addl S,E + + +# This starts just before starting to compute F(); the Xi should have XORed +# the first three values together. 
(Break is at //) +# +# in[i] F(b,c,d) a<<<5 +# mov in[i],T (xorl E,S) (addl T+K,A) +# xor in[i+1],T (addl S,A) (movl B,S) +# xor in[i+2],T (roll 5,S) // +# xor in[i+3],T movl D,S (addl S,A) +# roll 1,T xorl C,S +# movl T,in[i] andl B,S rorl 2,B +# addl T+K,E xorl D,S (mov in[i],T) +# (xor in[i+1],T) addl S,E movl A,S +# (xor in[i+2],T) roll 5,S // +# (xor in[i+3],T) (movl C,S) addl S,E + +sub BODY_16_19 +{ + local($n,$a,$b,$c,$d,$e)=@_; + + &comment("16_20 $n"); + + &xor($T,&swtmp(($n+13)%16)); + &add($a,$S); # End of previous round + &mov($S,$d); # Start current round's F() + &rotl($T,1); + &xor($S,$c); + &mov(&swtmp($n%16),$T); # Store computed Xi. + &and($S,$b); + &rotr($b,2); + &lea($e,&DWP(K1,$e,$T)); # Add Xi and K + &mov($T,&swtmp(($n+1)%16)); # Start computing mext Xi. + &xor($S,$d); + &xor($T,&swtmp(($n+3)%16)); + &add($e,$S); # Add F() + &mov($S,$a); # Start computing a<<<5 + &xor($T,&swtmp(($n+9)%16)); + &rotl($S,5); +} + +# This is just like BODY_16_19, but computes a different F() = b^c^d +# +# in[i] F(b,c,d) a<<<5 +# mov in[i],T (xorl E,S) (addl T+K,A) +# xor in[i+1],T (addl S,A) (movl B,S) +# xor in[i+2],T (roll 5,S) // +# xor in[i+3],T (addl S,A) +# roll 1,T movl C,S +# movl T,in[i] xorl B,S rorl 2,B +# addl T+K,E xorl C,S (mov in[i],T) +# (xor in[i+1],T) addl S,E movl A,S +# (xor in[i+2],T) roll 5,S // +# (xor in[i+3],T) (movl C,S) addl S,E + +sub BODY_20_39 # And 61..79 +{ + local($n,$a,$b,$c,$d,$e)=@_; + local $K=($n<40) ? K2 : K4; + + &comment("21_30 $n"); + + &xor($T,&swtmp(($n+13)%16)); + &add($a,$S); # End of previous round + &mov($S,$d) + &rotl($T,1); + &mov($S,$d); # Start current round's F() + &mov(&swtmp($n%16),$T) if ($n < 77); # Store computed Xi. + &xor($S,$b); + &rotr($b,2); + &lea($e,&DWP($K,$e,$T)); # Add Xi and K + &mov($T,&swtmp(($n+1)%16)) if ($n < 79); # Start computing next Xi. 
+ &xor($S,$c); + &xor($T,&swtmp(($n+3)%16)) if ($n < 79); + &add($e,$S); # Add F1() + &mov($S,$a); # Start computing a<<<5 + &xor($T,&swtmp(($n+9)%16)) if ($n < 79); + &rotl($S,5); + + &add($e,$S) if ($n == 79); +} + + +# This starts immediately after the LEA, and expects to need to finish +# the previous round. (break is at //) +# +# in[i] F1(c,d) F2(b,c,d) a<<<5 +# (addl T+K,E) (andl C,S) (rorl 2,C) +# mov in[i],T (addl S,A) (movl B,S) +# xor in[i+1],T (rorl 5,S) +# xor in[i+2],T / movl D,S (addl S,A) +# xor in[i+3],T andl C,S +# rotl 1,T addl S,E movl D,S +# (movl T,in[i]) xorl C,S +# addl T+K,E andl B,S rorl 2,B +# (mov in[i],T) addl S,E movl A,S +# (xor in[i+1],T) rorl 5,S +# (xor in[i+2],T) // (movl C,S) addl S,E + +sub BODY_40_59 +{ + local($n,$a,$b,$c,$d,$e)=@_; + + &comment("41_59 $n"); + + &add($a,$S); # End of previous round + &mov($S,$d); # Start current round's F(1) + &xor($T,&swtmp(($n+13)%16)); + &and($S,$c); + &rotl($T,1); + &add($e,$S); # Add F1() + &mov($S,$d); # Start current round's F2() + &mov(&swtmp($n%16),$T); # Store computed Xi. + &xor($S,$c); + &lea($e,&DWP(K3,$e,$T)); + &and($S,$b); + &rotr($b,2); + &mov($T,&swtmp(($n+1)%16)); # Start computing next Xi. + &add($e,$S); # Add F2() + &mov($S,$a); # Start computing a<<<5 + &xor($T,&swtmp(($n+3)%16)); + &rotl($S,5); + &xor($T,&swtmp(($n+9)%16)); +} + +&function_begin("sha1_block_data_order",16); + &mov($S,&wparam(0)); # SHA_CTX *c + &mov($T,&wparam(1)); # const void *input + &mov($A,&wparam(2)); # size_t num + &stack_push(16); # allocate X[16] + &shl($A,6); + &add($A,$T); + &mov(&wparam(2),$A); # pointer beyond the end of input + &mov($E,&DWP(16,$S));# pre-load E + + &set_label("loop",16); + + # copy input chunk to X, but reversing byte order! 
+ for ($i=0; $i<16; $i+=4) + { + &mov($A,&DWP(4*($i+0),$T)); + &mov($B,&DWP(4*($i+1),$T)); + &mov($C,&DWP(4*($i+2),$T)); + &mov($D,&DWP(4*($i+3),$T)); + &bswap($A); + &bswap($B); + &bswap($C); + &bswap($D); + &mov(&swtmp($i+0),$A); + &mov(&swtmp($i+1),$B); + &mov(&swtmp($i+2),$C); + &mov(&swtmp($i+3),$D); + } + &mov(&wparam(1),$T); # redundant in 1st spin + + &mov($A,&DWP(0,$S)); # load SHA_CTX + &mov($B,&DWP(4,$S)); + &mov($C,&DWP(8,$S)); + &mov($D,&DWP(12,$S)); + # E is pre-loaded + + for($i=0;$i<16;$i++) { &BODY_00_15($i,@V); unshift(@V,pop(@V)); } + for(;$i<16;$i++) { &BODY_15($i,@V); unshift(@V,pop(@V)); } + for(;$i<20;$i++) { &BODY_16_19($i,@V); unshift(@V,pop(@V)); } + for(;$i<40;$i++) { &BODY_20_39($i,@V); unshift(@V,pop(@V)); } + for(;$i<60;$i++) { &BODY_40_59($i,@V); unshift(@V,pop(@V)); } + for(;$i<80;$i++) { &BODY_20_39($i,@V); unshift(@V,pop(@V)); } + + (($V[4] eq $E) and ($V[0] eq $A)) or die; # double-check + + &comment("Loop trailer"); + + &mov($S,&wparam(0)); # re-load SHA_CTX* + &mov($T,&wparam(1)); # re-load input pointer + + &add($A,&DWP(0,$S)); # E is last "A"... + &add($B,&DWP(4,$S)); + &add($C,&DWP(8,$S)); + &add($D,&DWP(12,$S)); + &add($E,&DWP(16,$S)); + + &mov(&DWP(0,$S),$A); # update SHA_CTX + &add($T,64); # advance input pointer + &mov(&DWP(4,$S),$B); + &cmp($T,&wparam(2)); # have we reached the end yet? + &mov(&DWP(8,$S),$C); + &mov(&DWP(12,$S),$D); + &mov(&DWP(16,$S),$E); + &jb(&label("loop")); + + &stack_pop(16); +&function_end("sha1_block_data_order"); +&asciz("SHA1 block transform for x86, CRYPTOGAMS by <appro\@openssl.org>"); + +&asm_finish(); ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin
@ 2009-07-31 11:11   ` Erik Faye-Lund
  2009-07-31 11:31     ` George Spelvin
  2009-07-31 11:37     ` Michael J Gruber
  2009-07-31 11:21   ` Michael J Gruber
  ` (7 subsequent siblings)
  8 siblings, 2 replies; 60+ messages in thread
From: Erik Faye-Lund @ 2009-07-31 11:11 UTC (permalink / raw)
  To: George Spelvin; +Cc: git

On Fri, Jul 31, 2009 at 12:46 PM, George Spelvin<linux@horizon.com> wrote:
> I haven't packaged this nicely, but it's not that complicated.
> - Download Andy's original code from
>   http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz
> - Unpack and cd to the cryptogams-0/x86 directory
> - "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile",
>   and "sha1-x86.pl".
> - "make"

$ patch < ../sha1-opt.patch.eml
patching file `Makefile'
patching file `sha1test.c'
patching file `sha1-586.h'
patching file `sha1-x86.pl'

$ make
make: *** No rule to make target `sha1-586.o', needed by `586test'.  Stop.

What did I do wrong? :)
Would it be easier if you pushed it out somewhere?

--
Erik "kusma" Faye-Lund
kusmabite@gmail.com
(+47) 986 59 656

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 11:11 ` Erik Faye-Lund
@ 2009-07-31 11:31   ` George Spelvin
  2009-07-31 11:37   ` Michael J Gruber
  1 sibling, 0 replies; 60+ messages in thread
From: George Spelvin @ 2009-07-31 11:31 UTC (permalink / raw)
  To: kusmabite, linux; +Cc: git

> $ make
> make: *** No rule to make target `sha1-586.o', needed by `586test'.  Stop.
>
> What did I do wrong? :)
> Would it be easier if you pushed it out somewhere?

H'm.... It *should* do

	perl sha1-586.pl elf > sha1-586.s
	as --32 -o sha1-586.o sha1-586.s
	gcc -m32 -W -Wall -Os -g -o 586test sha1test.c sha1-586.o

(And likewise for the "x86test" binary.)

which is what happened when I tested it. Obviously I have something
non-portable in the Makefile. You could try "make sha1-586.s" and
"sha1-586.o" and see which rule is f*ed up.

Um, you *are* in a directory which contains a sha1-586.pl file, right?

Thanks!

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 11:11 ` Erik Faye-Lund
@ 2009-07-31 11:37   ` Michael J Gruber
  2009-07-31 12:24     ` Erik Faye-Lund
  1 sibling, 1 reply; 60+ messages in thread
From: Michael J Gruber @ 2009-07-31 11:37 UTC (permalink / raw)
  To: Erik Faye-Lund; +Cc: George Spelvin, git

Erik Faye-Lund venit, vidit, dixit 31.07.2009 13:11:
> On Fri, Jul 31, 2009 at 12:46 PM, George Spelvin<linux@horizon.com> wrote:
>> I haven't packaged this nicely, but it's not that complicated.
>> - Download Andy's original code from
>>   http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz
>> - Unpack and cd to the cryptogams-0/x86 directory
>> - "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile",
>>   and "sha1-x86.pl".
>> - "make"
>
> $ patch < ../sha1-opt.patch.eml
> patching file `Makefile'
> patching file `sha1test.c'
> patching file `sha1-586.h'
> patching file `sha1-x86.pl'
>
> $ make
> make: *** No rule to make target `sha1-586.o', needed by `586test'.  Stop.
>
> What did I do wrong? :)
> Would it be easier if you pushed it out somewhere?

You need to go to the x86 directory, apply the patch and run make there.
(I made the same mistake.) Also, you need i586 (32bit) glibc-devel if
you're on a 64 bit system.

Michael

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 11:37 ` Michael J Gruber
@ 2009-07-31 12:24   ` Erik Faye-Lund
  2009-07-31 12:29     ` Johannes Schindelin
  2009-07-31 12:32     ` George Spelvin
  0 siblings, 2 replies; 60+ messages in thread
From: Erik Faye-Lund @ 2009-07-31 12:24 UTC (permalink / raw)
  To: Michael J Gruber; +Cc: George Spelvin, git

On Fri, Jul 31, 2009 at 1:37 PM, Michael J
Gruber<git@drmicha.warpmail.net> wrote:
>> What did I do wrong? :)
>
> You need to go to the x86 directory, apply the patch and run make there.
> (I made the same mistake.) Also, you i586 (32bit) glibc-devel if you're
> on a 64 bit system.

Aha, thanks :)

Now I'm getting a different error:
$ make
as -o sha1-586.o sha1-586.s
sha1-586.s: Assembler messages:
sha1-586.s:4: Warning: .type pseudo-op used outside of .def/.endef ignored.
sha1-586.s:4: Error: junk at end of line, first unrecognized character is `s'
sha1-586.s:1438: Warning: .size pseudo-op used outside of .def/.endef ignored.
sha1-586.s:1438: Error: junk at end of line, first unrecognized character is `s'

make: *** [sha1-586.o] Error 1

What might be relevant is that I'm trying this on Windows (Vista
64bit). I'd still think GNU as should be able to assemble the source,
though. I've got an i7, so I thought the result might be interesting.

--
Erik "kusma" Faye-Lund
kusmabite@gmail.com
(+47) 986 59 656

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 12:24 ` Erik Faye-Lund
@ 2009-07-31 12:29   ` Johannes Schindelin
  2009-07-31 12:32   ` George Spelvin
  1 sibling, 0 replies; 60+ messages in thread
From: Johannes Schindelin @ 2009-07-31 12:29 UTC (permalink / raw)
  To: Erik Faye-Lund; +Cc: Michael J Gruber, George Spelvin, git

Hi,

On Fri, 31 Jul 2009, Erik Faye-Lund wrote:

> On Fri, Jul 31, 2009 at 1:37 PM, Michael J
> Gruber<git@drmicha.warpmail.net> wrote:
> >> What did I do wrong? :)
> >
> > You need to go to the x86 directory, apply the patch and run make there.
> > (I made the same mistake.) Also, you i586 (32bit) glibc-devel if you're
> > on a 64 bit system.
>
> Aha, thanks :)
>
> Now I'm getting a different error:
> $ make
> as -o sha1-586.o sha1-586.s
> sha1-586.s: Assembler messages:
> sha1-586.s:4: Warning: .type pseudo-op used outside of .def/.endef ignored.
> sha1-586.s:4: Error: junk at end of line, first unrecognized character is `s'
> sha1-586.s:1438: Warning: .size pseudo-op used outside of .def/.endef ignored.
> sha1-586.s:1438: Error: junk at end of line, first unrecognized character is `s'
>
> make: *** [sha1-586.o] Error 1
>
> What might be relevant, is that I'm trying this on Windows (Vista
> 64bit).

Probably using msysGit? Then you're still using the 32-bit environment,
as MSys is 32-bit only for now.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 12:24 ` Erik Faye-Lund
@ 2009-07-31 12:32   ` George Spelvin
  2009-07-31 12:45     ` Erik Faye-Lund
  1 sibling, 1 reply; 60+ messages in thread
From: George Spelvin @ 2009-07-31 12:32 UTC (permalink / raw)
  To: git, kusmabite; +Cc: git, linux

> Now I'm getting a different error:
> $ make
> as -o sha1-586.o sha1-586.s
> sha1-586.s: Assembler messages:
> sha1-586.s:4: Warning: .type pseudo-op used outside of .def/.endef ignored.
> sha1-586.s:4: Error: junk at end of line, first unrecognized character is `s'
> sha1-586.s:1438: Warning: .size pseudo-op used outside of .def/.endef ignored.
> sha1-586.s:1438: Error: junk at end of line, first unrecognized character is `s'
>
> make: *** [sha1-586.o] Error 1

> What might be relevant, is that I'm trying this on Windows (Vista
> 64bit). I'd still think GNU as should be able to assemble the source,
> though. I've got an i7, so I thought the result might be interresting.

Ah... what assembler? The perl preprocessor supports multiple assemblers:

	elf	- Linux, FreeBSD, Solaris x86, etc.
	a.out	- DJGPP, elder OpenBSD, etc.
	coff	- GAS/COFF such as Win32 targets
	win32n	- Windows 95/Windows NT NASM format
	nw-nasm	- NetWare NASM format
	nw-mwasm- NetWare Metrowerks Assembler

Maybe you need to replace "elf" with "coff"?

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 12:32 ` George Spelvin
@ 2009-07-31 12:45   ` Erik Faye-Lund
  2009-07-31 13:02     ` George Spelvin
  0 siblings, 1 reply; 60+ messages in thread
From: Erik Faye-Lund @ 2009-07-31 12:45 UTC (permalink / raw)
  To: George Spelvin; +Cc: git, git

On Fri, Jul 31, 2009 at 2:32 PM, George Spelvin<linux@horizon.com> wrote:
> Maybe you need to replace "elf" with "coff"?

That did the trick, thanks!

Best of 6 runs on an Intel Core i7 920 @ 2.67GHz:

before (586test): 1.415
after (x86test): 1.470

--
Erik "kusma" Faye-Lund
kusmabite@gmail.com
(+47) 986 59 656

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 12:45 ` Erik Faye-Lund
@ 2009-07-31 13:02   ` George Spelvin
  0 siblings, 0 replies; 60+ messages in thread
From: George Spelvin @ 2009-07-31 13:02 UTC (permalink / raw)
  To: kusmabite, linux; +Cc: git, git

> That did the trick, thanks!
>
> Best of 6 runs on an Intel Core i7 920 @ 2.67GHz:
>
> before (586test): 1.415
> after (x86test): 1.470

So it's slower. Bummer. :-( Obviously I have some work to do.

But thank you very much for the result!

^ permalink raw reply	[flat|nested] 60+ messages in thread
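To put a number on that "slower": a throwaway helper for turning two of the reported timings into a relative change (not part of the patch; purely illustrative):

```c
/* Percent change of the new timing relative to the old;
 * positive means the new code is slower. */
double pct_change(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

With the i7 figures reported above, pct_change(1.415, 1.470) comes out
around +3.9%, i.e. the re-tuned code loses roughly 4% on Nehalem, while
it gains elsewhere.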
* Re: Request for benchmarking: x86 SHA1 code 2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin 2009-07-31 11:11 ` Erik Faye-Lund @ 2009-07-31 11:21 ` Michael J Gruber 2009-07-31 11:26 ` Michael J Gruber ` (6 subsequent siblings) 8 siblings, 0 replies; 60+ messages in thread From: Michael J Gruber @ 2009-07-31 11:21 UTC (permalink / raw) To: George Spelvin; +Cc: git George Spelvin venit, vidit, dixit 31.07.2009 12:46: > After studying Andy Polyakov's optimized x86 SHA-1 in OpenSSL, I've > got a version that's 1.6% slower on a P4 and 15% faster on a Phenom. > I'm curious about the performance on other CPUs I don't have access to, > particularly Core 2 duo and i7. > > Could someone do some benchmarking for me? Old (486/Pentium/P2/P3) > machines are also interesting, but I'm optimizing for newer ones. > > I haven't packaged this nicely, but it's not that complicated. > - Download Andy's original code from > http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz > - Unpack and cd to the cryptogams-0/x86 directory > - "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile", > and "sha1-x86.pl". > - "make" > - Run ./586test (before) and ./x86test (after) and note the timings. > > Thank you! 
Best of 6 runs: ./586test Result: a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d 500000000 bytes: 1.642336 s ./x86test Result: a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d 500000000 bytes: 1.532153 s System: uname -a Linux localhost.localdomain 2.6.29.6-213.fc11.x86_64 #1 SMP Tue Jul 7 21:02:57 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz stepping : 11 cpu MHz : 800.000 cache size : 4096 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 2 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm ida tpr_shadow vnmi flexpriority bogomips : 4389.20 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz stepping : 11 cpu MHz : 800.000 cache size : 4096 KB physical id : 0 siblings : 2 core id : 1 cpu cores : 2 apicid : 1 initial apicid : 1 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm ida tpr_shadow vnmi flexpriority bogomips : 4388.78 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: ^ permalink raw reply [flat|nested] 60+ messages in thread
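Since each run hashes 500 × 1,000,000 bytes, the "bytes: seconds" lines the test program prints convert directly to throughput; a trivial helper (illustrative only):

```c
/* Convert the benchmark's "<bytes> bytes: <t> s" report to MB/s
 * (decimal MB, i.e. 10^6 bytes). */
double mb_per_s(double bytes, double seconds)
{
	return bytes / seconds / 1e6;
}
```

The T7500 numbers above work out to roughly 304 MB/s before and
326 MB/s after.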
* Re: Request for benchmarking: x86 SHA1 code 2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin 2009-07-31 11:11 ` Erik Faye-Lund 2009-07-31 11:21 ` Michael J Gruber @ 2009-07-31 11:26 ` Michael J Gruber 2009-07-31 12:31 ` Carlos R. Mafra ` (5 subsequent siblings) 8 siblings, 0 replies; 60+ messages in thread From: Michael J Gruber @ 2009-07-31 11:26 UTC (permalink / raw) To: George Spelvin; +Cc: git George Spelvin venit, vidit, dixit 31.07.2009 12:46: > After studying Andy Polyakov's optimized x86 SHA-1 in OpenSSL, I've > got a version that's 1.6% slower on a P4 and 15% faster on a Phenom. > I'm curious about the performance on other CPUs I don't have access to, > particularly Core 2 duo and i7. > > Could someone do some benchmarking for me? Old (486/Pentium/P2/P3) > machines are also interesting, but I'm optimizing for newer ones. > > I haven't packaged this nicely, but it's not that complicated. > - Download Andy's original code from > http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz > - Unpack and cd to the cryptogams-0/x86 directory > - "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile", > and "sha1-x86.pl". > - "make" > - Run ./586test (before) and ./x86test (after) and note the timings. > > Thank you! 
Best of 6 runs: ./586test Result: a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d 500000000 bytes: 1.258031 s ./x86test Result: a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d 500000000 bytes: 1.171770 s System: uname -a Linux whatever 2.6.22-14-generic #1 SMP Tue Feb 12 07:42:25 UTC 2008 i686 GNU/Linux cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz stepping : 10 cpu MHz : 2000.000 cache size : 6144 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 2 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr lahf_lm bogomips : 5988.92 clflush size : 64 processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz stepping : 10 cpu MHz : 2000.000 cache size : 6144 KB physical id : 0 siblings : 2 core id : 1 cpu cores : 2 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr lahf_lm bogomips : 5984.92 clflush size : 64 ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin
                     ` (2 preceding siblings ...)
  2009-07-31 11:26 ` Michael J Gruber
@ 2009-07-31 12:31 ` Carlos R. Mafra
  2009-07-31 13:27 ` Brian Ristuccia
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 60+ messages in thread
From: Carlos R. Mafra @ 2009-07-31 12:31 UTC (permalink / raw)
  To: George Spelvin; +Cc: git

On Fri 31.Jul'09 at 6:46:02 -0400, George Spelvin wrote:
> - Run ./586test (before) and ./x86test (after) and note the timings.

For 8 runs on an Intel(R) Core(TM)2 Duo CPU T7250 @ 2.00GHz,

before: 1.75 +- 0.02
after:  1.62 +- 0.02

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code 2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin ` (3 preceding siblings ...) 2009-07-31 12:31 ` Carlos R. Mafra @ 2009-07-31 13:27 ` Brian Ristuccia 2009-07-31 14:05 ` George Spelvin 2009-07-31 13:27 ` Jakub Narebski ` (3 subsequent siblings) 8 siblings, 1 reply; 60+ messages in thread From: Brian Ristuccia @ 2009-07-31 13:27 UTC (permalink / raw) To: git; +Cc: George Spelvin The revised code is faster on Intel Atom N270 by around 15% (results below typical of several runs): $ ./586test ; ./x86test Result: a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d 500000000 bytes: 4.981760 s Result: a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d 500000000 bytes: 4.323324 s $ cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 28 model name : Intel(R) Atom(TM) CPU N270 @ 1.60GHz stepping : 2 cpu MHz : 800.000 cache size : 512 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 1 apicid : 0 initial apicid : 0 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx constant_tsc arch_perfmon pebs bts pni dtes64 monitor ds_cpl est tm2 ssse3 xtpr pdcm lahf_lm bogomips : 3191.59 clflush size : 64 power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 28 model name : Intel(R) Atom(TM) CPU N270 @ 1.60GHz stepping : 2 cpu MHz : 800.000 cache size : 512 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 1 apicid : 1 initial apicid : 1 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx 
constant_tsc arch_perfmon pebs bts pni dtes64 monitor ds_cpl est tm2 ssse3 xtpr pdcm lahf_lm bogomips : 3191.91 clflush size : 64 power management: -- Brian Ristuccia brian@ristuccia.com ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 13:27 ` Brian Ristuccia
@ 2009-07-31 14:05 ` George Spelvin
  0 siblings, 0 replies; 60+ messages in thread
From: George Spelvin @ 2009-07-31 14:05 UTC (permalink / raw)
  To: brian, git; +Cc: linux

> The revised code is faster on Intel Atom N270 by around 15% (results below
> typical of several runs):
>
> $ ./586test ; ./x86test
> Result: a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
> Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
> 500000000 bytes: 4.981760 s
> Result: a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
> Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
> 500000000 bytes: 4.323324 s

Cool, thanks!  I hadn't optimized it at all for Atom's in-order pipe,
so I'm pleasantly surprised.

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code 2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin ` (4 preceding siblings ...) 2009-07-31 13:27 ` Brian Ristuccia @ 2009-07-31 13:27 ` Jakub Narebski 2009-07-31 15:05 ` Peter Harris ` (2 subsequent siblings) 8 siblings, 0 replies; 60+ messages in thread From: Jakub Narebski @ 2009-07-31 13:27 UTC (permalink / raw) To: George Spelvin; +Cc: git "George Spelvin" <linux@horizon.com> writes: > After studying Andy Polyakov's optimized x86 SHA-1 in OpenSSL, I've > got a version that's 1.6% slower on a P4 and 15% faster on a Phenom. > I'm curious about the performance on other CPUs I don't have access to, > particularly Core 2 duo and i7. > > Could someone do some benchmarking for me? Old (486/Pentium/P2/P3) > machines are also interesting, but I'm optimizing for newer ones. ---------- $ [time] ./586test Result: a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d 500000000 bytes: 5.376325 s real 0m5.384s user 0m5.108s sys 0m0.008s 500000000 bytes: 5.367261 s 5.09user 0.00system 0:05.38elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+378minor)pagefaults 0swaps ---------- $ [time] ./x86test Result: a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d 500000000 bytes: 5.312238 s real 0m5.325s user 0m5.060s sys 0m0.008s 500000000 bytes: 5.323783 s 5.06user 0.00system 0:05.34elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+377minor)pagefaults 0swaps ========== System: $ uname -a Linux roke 2.6.14-11.1.aur.2 #1 Tue Jan 31 16:05:05 CET 2006 \ i686 athlon i386 GNU/Linux $ cat /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 6 model : 4 model name : AMD Athlon(tm) processor stepping : 2 cpu MHz : 1000.188 cache size : 256 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse 
tsc msr pae mce cx8 mtrr pge \ mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow bogomips : 2002.43 $ free total used free shared buffers cached Mem: 515616 495812 19804 0 6004 103160 -/+ buffers/cache: 386648 128968 Swap: 1052248 279544 772704 -- Jakub Narebski Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code 2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin ` (5 preceding siblings ...) 2009-07-31 13:27 ` Jakub Narebski @ 2009-07-31 15:05 ` Peter Harris 2009-07-31 15:22 ` Peter Harris 2009-08-03 3:47 ` x86 SHA1: Faster than OpenSSL George Spelvin 8 siblings, 0 replies; 60+ messages in thread From: Peter Harris @ 2009-07-31 15:05 UTC (permalink / raw) To: George Spelvin; +Cc: git On Fri, Jul 31, 2009 at 6:46 AM, George Spelvin wrote: > Could someone do some benchmarking for me? Old (486/Pentium/P2/P3) > machines are also interesting, but I'm optimizing for newer ones. The new code appears to be marginally faster on a Pentium 3 Xeon: Best of five runs: 586test: 11.658661 s x86test: 11.209024 s $ cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 7 model name : Pentium III (Katmai) stepping : 3 cpu MHz : 547.630 cache size : 1024 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse bogomips : 1097.12 clflush size : 32 power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 7 model name : Pentium III (Katmai) stepping : 3 cpu MHz : 547.630 cache size : 1024 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse bogomips : 1095.37 clflush size : 32 power management: ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code 2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin ` (6 preceding siblings ...) 2009-07-31 15:05 ` Peter Harris @ 2009-07-31 15:22 ` Peter Harris 2009-08-03 3:47 ` x86 SHA1: Faster than OpenSSL George Spelvin 8 siblings, 0 replies; 60+ messages in thread From: Peter Harris @ 2009-07-31 15:22 UTC (permalink / raw) To: George Spelvin; +Cc: git On Fri, Jul 31, 2009 at 6:46 AM, George Spelvin wrote: > Could someone do some benchmarking for me? Old (486/Pentium/P2/P3) > machines are also interesting, but I'm optimizing for newer ones. My Geode isn't old in age, but I admit it's old in design (and the vendor switched to Atom right after I bought it...) Best of three runs: 586test: 26.536597 s x86test: 26.111148 s $ cat /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 5 model : 10 model name : Geode(TM) Integrated Processor by AMD PCS stepping : 2 cpu MHz : 499.927 cache size : 128 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu de pse tsc msr cx8 sep pge cmov clflush mmx mmxext 3dnowext 3dnow bogomips : 1001.72 clflush size : 32 power management: Peter Harris ^ permalink raw reply [flat|nested] 60+ messages in thread
* x86 SHA1: Faster than OpenSSL
  2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin
                     ` (7 preceding siblings ...)
  2009-07-31 15:22 ` Peter Harris
@ 2009-08-03  3:47 ` George Spelvin
  2009-08-03  7:36   ` Jonathan del Strother
                     ` (3 more replies)
  8 siblings, 4 replies; 60+ messages in thread
From: George Spelvin @ 2009-08-03  3:47 UTC (permalink / raw)
  To: git; +Cc: appro, appro, linux

(Work in progress, state dump to mailing list archives.)

This started when discussing git startup overhead due to the dynamic
linker.  One big contributor is the openssl library, which is used only
for its optimized x86 SHA-1 implementation.  So I took a look at it,
with an eye to importing the code directly into the git source tree,
and decided that I felt like trying to do better.

The original code was excellent, but it was optimized when the P4 was new.
After a bit of tweaking, I've inflicted a slight (1.4%) slowdown on the
P4, but a small-but-noticeable speedup on a variety of other processors.

Before     After      Gain    Processor
1.585248   1.353314   +17%    2500 MHz Phenom
3.249614   3.295619   -1.4%   1594 MHz P4
1.414512   1.352843   +4.5%   2.66 GHz i7
3.460635   3.284221   +5.4%   1596 MHz Athlon XP
4.077993   3.891826   +4.8%   1144 MHz Athlon
1.912161   1.623212   +17%    2100 MHz Athlon 64 X2
2.956432   2.940210   +0.55%  1794 MHz Mobile Celeron (fam 15 model 2)

(Seconds to hash 500x 1 MB, best of 10 runs in all cases.)

This is based on Andy Polyakov's GPL/BSD licensed cryptogams code, and
(for now) uses the same perl preprocessor.  To test it, do the following:
- Download Andy's original code from
  http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz
- "tar xzf cryptogams-0.tar.gz"
- "cd cryptogams-0/x86"
- "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile",
  and "sha1-x86.pl".
- "make"
- Run ./586test (before) and ./x86test (after) and note the timings.

The code is currently only the core SHA transform.  Adding the
appropriate init/update/finish wrappers is straightforward.
An open question is how to add appropriate CPU detection to the git
build scripts.  (Note that `uname -m`, which it currently uses to select
the ARM code, does NOT produce the right answer if you're using a 32-bit
compiler on a 64-bit platform.)

I try to explain it in the comments, but with all the software
pipelining that makes the rounds overlap (and there are at least 4
different kinds of rounds, which overlap with each other), it's a bit
intricate.  If you feel really masochistic, make commenting suggestions...

--- /dev/null	2009-05-12 02:55:38.579106460 -0400
+++ Makefile	2009-08-02 06:44:44.000000000 -0400
@@ -0,0 +1,16 @@
+CC := gcc
+CFLAGS := -m32 -W -Wall -Os -g
+ASFLAGS := --32
+
+all: 586test x86test
+
+586test : sha1test.c sha1-586.o
+	$(CC) $(CFLAGS) -o $@ sha1test.c sha1-586.o
+
+x86test : sha1test.c sha1-x86.o
+	$(CC) $(CFLAGS) -o $@ sha1test.c sha1-x86.o
+
+586test x86test : sha1-586.h
+
+%.s : %.pl x86asm.pl x86unix.pl
+	perl $< elf > $@
--- /dev/null	2009-05-12 02:55:38.579106460 -0400
+++ sha1-586.h	2009-08-02 06:44:17.000000000 -0400
@@ -0,0 +1,3 @@
+#include <stdint.h>
+
+void sha1_block_data_order(uint32_t iv[5], void const *in, unsigned len);
--- /dev/null	2009-05-12 02:55:38.579106460 -0400
+++ sha1test.c	2009-08-02 08:27:48.449609504 -0400
@@ -0,0 +1,85 @@
+#include <stdint.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <sys/time.h>
+
+#include "sha1-586.h"
+
+#define SIZE 1000000
+
+#if SIZE % 64
+# error SIZE must be a multiple of 64!
+#endif
+
+static unsigned long
+timing_test(uint32_t iv[5], unsigned iter)
+{
+	unsigned i;
+	struct timeval tv0, tv1;
+	static char *p;		/* Very large buffer */
+
+	if (!p) {
+		p = malloc(SIZE);
+		if (!p) {
+			perror("malloc");
+			exit(1);
+		}
+	}
+
+	if (gettimeofday(&tv0, NULL) < 0) {
+		perror("gettimeofday");
+		exit(1);
+	}
+	for (i = 0; i < iter; i++)
+		sha1_block_data_order(iv, p, SIZE/64);
+	if (gettimeofday(&tv1, NULL) < 0) {
+		perror("gettimeofday");
+		exit(1);
+	}
+	return 1000000ul * (tv1.tv_sec-tv0.tv_sec) + tv1.tv_usec-tv0.tv_usec;
+}
+
+int
+main(void)
+{
+	uint32_t iv[5] = {
+		0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476, 0xc3d2e1f0
+	};
+	/* Simplest known-answer test, "abc" */
+	static uint8_t const testbuf[64] = {
+		'a','b','c', 0x80, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 24
+	};
+	/* Expected: A9993E364706816ABA3E25717850C26C9CD0D89D */
+	static uint32_t const expected[5] = {
+		0xa9993e36, 0x4706816a, 0xba3e2571, 0x7850c26c, 0x9cd0d89d };
+	unsigned i;
+	unsigned long min_usec = -1ul;
+
+	/* Quick correct-answer test.  Silent unless it fails. */
+	sha1_block_data_order(iv, testbuf, 1);
+	for (i = 0; i < 5; i++) {
+		if (iv[i] != expected[i]) {
+			printf("Result: %08x %08x %08x %08x %08x\n"
+			       "Expected:%08x %08x %08x %08x %08x\n",
+			       iv[0], iv[1], iv[2], iv[3], iv[4], expected[0],
+			       expected[1], expected[2], expected[3],
+			       expected[4]);
+			break;
+		}
+	}
+
+	for (i = 0; i < 10; i++) {
+		unsigned long usec = timing_test(iv, 500);
+		printf("%2u/10: %u.%06u s\n", i+1, (unsigned)(usec/1000000),
+		       (unsigned)(usec % 1000000));
+		if (usec < min_usec)
+			min_usec = usec;
+	}
+	printf("Minimum time to hash %u bytes: %u.%06u\n",
+	       500 * SIZE, (unsigned)(min_usec/1000000),
+	       (unsigned)(min_usec % 1000000));
+	return 0;
+}
--- /dev/null	2009-05-12 02:55:38.579106460 -0400
+++ sha1-x86.pl	2009-08-02 08:51:01.069614130 -0400
@@ -0,0 +1,389 @@
+#!/usr/bin/env perl
+
+# ====================================================================
+# [Re]written by Andy Polyakov <appro@fy.chalmers.se> for the OpenSSL
+# project. The module is, however, dual licensed under OpenSSL and
+# CRYPTOGAMS licenses depending on where you obtain it. For further
+# details see http://www.openssl.org/~appro/cryptogams/.
+# ====================================================================
+
+# "[Re]written" was achieved in two major overhauls. In 2004 BODY_*
+# functions were re-implemented to address P4 performance issue [see
+# commentary below], and in 2006 the rest was rewritten in order to
+# gain freedom to liberate licensing terms.
+
+# It was noted that Intel IA-32 C compiler generates code which
+# performs ~30% *faster* on P4 CPU than original *hand-coded*
+# SHA1 assembler implementation. To address this problem (and
+# prove that humans are still better than machines:-), the
+# original code was overhauled, which resulted in following
+# performance changes:
+#
+#		compared with original	compared with Intel cc
+#		assembler impl.		generated code
+# Pentium	-16%			+48%
+# PIII/AMD	+8%			+16%
+# P4		+85%(!)			+45%
+#
+# As you can see Pentium came out as loser:-( Yet I reckoned that
+# improvement on P4 outweighs the loss and incorporated this
+# re-tuned code into 0.9.7 and later.
+# ----------------------------------------------------------------
+#					<appro@fy.chalmers.se>
+
+$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
+push(@INC,"${dir}","${dir}../../perlasm");
+require "x86asm.pl";
+
+&asm_init($ARGV[0],"sha1-586.pl",$ARGV[$#ARGV] eq "386");
+
+$A="eax";
+$B="ebx";
+$C="ecx";
+$D="edx";
+$E="ebp";
+
+# Two temporaries
+$S="edi";
+$T="esi";
+
+# The round constants
+use constant K1 => 0x5a827999;
+use constant K2 => 0x6ED9EBA1;
+use constant K3 => 0x8F1BBCDC;
+use constant K4 => 0xCA62C1D6;
+
+# Given unlimited registers and functional units, it would be possible to
+# compute SHA-1 at two cycles per round, using 7 operations per round.
+# Remember, each round computes a new value for E, which is used as A
+# in the following round and B in the round after that.  There are two
+# critical paths:
+# - A must be rotated and added to the output E (2 cycles between rounds)
+# - B must go through two boolean operations before being added to
+#   the result E.  Since this latter addition can't be done in the
+#   same cycle as the critical addition of a<<<5, this is a total of
+#   2+1+1 = 4 cycles per 2 rounds.
+# Additionally, if you want to avoid copying B, you have to rotate it
+# soon after use in this round so it is ready for use as the following
+# round's C.
+
+# e += (a <<< 5) + K + in[i] + (d^(b&(c^d)))		(0..19)
+# e += (a <<< 5) + K + in[i] + (b^c^d)			(20..39, 60..79)
+# e += (a <<< 5) + K + in[i] + (c&d) + (b&(c^d))	(40..59)
+#
+# The hard part is doing this with only two temporary registers.
+# Taking the most complex F(b,c,d) function, writing it as above
+# breaks it into two parts which can be accumulated into e separately.
+# Let's divide it into 4 parts.  These can be executed in a 7-cycle
+# loop, assuming an in-order triple issue machine
+# (quadruple counting xor-from-memory as 2):
+#
+#	in[i]		F1(c,d)		F2(b,c,d)	a<<<5
+#	mov in[i],T	(addl S,A)	(movl B,S)
+#	xor in[i+1],T	(rorl 5,S)
+#	xor in[i+2],T	movl D,S	(addl S,A)
+#	xor in[i+3],T	andl C,S
+#	rotl 1,T	addl S,E	movl D,S
+#	movl T,in[i]	xorl C,S
+#	addl T+K,E	andl B,S	rorl 2,B	//
+#	(mov in[i],T)	addl S,E	movl A,S
+#	(xor in[i+1],T)	rorl 5,S
+#	(xor in[i+2],T)	(movl C,S)	addl S,E
+#
+# In the above, we routinely read and write a register on the same cycle,
+# overlapping the beginning of one computation with the end of another.
+# I've tried to place the reads to the left of the writes, but some of the
+# overlapping operations from adjacent rounds (given in parentheses)
+# violate that.
+#
+# The "addl T+K,E" line is actually implemented using the lea instruction,
+# which (on a Pentium) requires that neither T nor K was modified on the
+# previous cycle.
+#
+# As you can see, in the absence of out-of-order execution, the first
+# column takes a minimum of 6 cycles (fetch, 3 XORs, rotate, add to E),
+# and I reserve a seventh cycle before the add to E so that I can use a
+# Pentium's lea instruction.
+#
+# The other three columns take 3, 4 and 3 cycles, respectively.
+# These can all be overlapped by 1 cycle assuming a superscalar
+# processor, for a total of 2+2+3 = 7 cycles.
+#
+# The other F() functions require 5 and 4 cycles, respectively.
+# Overlapped with the 3-cycle a<<<5 computation, that makes a total of 6
+# and 5 cycles, respectively.  If we overlap the beginning and end of the
+# Xi computation, we can get it down to 6 cycles, but below that, we'll
+# just have to waste a cycle.
+#
+# For the first 16 rounds, forming Xi is just a fetch, and the F()
+# function only requires 5 cycles, so the whole round can be pulled in
+# to 4 cycles.
+
+
+# Atom pairing rules (not yet optimized):
+# The Atom has a dual-issue in-order pipeline, similar to the
+# original Pentium.  However, the issue restrictions are different.
+# In particular, all memory source operations must use port 0,
+# as must all rotates.
+#
+# Given that a round uses 4 fetches and 3 rotates, that's going to
+# require significant care to pair well.  It may take a completely
+# different implementation.
+#
+# LEA must use port 1, but apparently it has even worse address generation
+# interlock latency than the Pentium.  Oh well, it's still the best way
+# to do a 3-way add with a 32-bit immediate.
+
+
+# The first 16 rounds use the simplest F(b,c,d) = d^(b&(c^d)), and
+# don't need to form the in[i] equation, letting us reduce the round to
+# 4 cycles, after some reassignment of temporaries:
+
+#	movl D,S	(roll 5,T)	(addl S,A)	//
+#	mov in[i],T	xorl C,S	(addl T,A)
+#	andl B,S	rorl 2,B
+#	addl T+K,E	xorl D,S	movl A,T
+#	addl S,E	roll 5,T	(movl C,S)	//
+#	(mov in[i],T)	(xorl B,S)	addl T,E
+
+# The // marks where the round function starts.  Each round expects the
+# first copy of D to S to have been done already.
+
+# The transition between rounds 15 and 16 is a bit tricky... the best
+# thing to do is to delay the computation of a<<<5 one cycle and move it back
+# to the S register.  That way, T is free as early as possible.
+#	in[i]		F(b,c,d)	a<<<5
+#	(addl T+K,A)	(xorl E,S)	(movl B,T)
+#	movl D,S	(roll 5,T)	(addl S,A)	//
+#	mov in[i],T	xorl C,S	(addl T,A)
+#	andl B,S	rorl 2,B
+#	addl T+K,E	xorl D,S	(movl in[1],T)
+#	(xor in[i+1],T)	addl S,E	movl A,S
+#	(xor in[i+2],T)	rorl 2,B	roll 5,S	//
+#	(xor in[i+3],T)	(movl C,S)	addl S,E
+
+sub BODY_00_15
+{
+    local($n,$a,$b,$c,$d,$e)=@_;
+
+	&comment("00_15 $n");
+	&mov($S,$d) if ($n == 0);
+	&mov($T,&swtmp($n%16));		# V Load Xi.
+	&xor($S,$c);			# U Continue F() = d^(b&(c^d))
+	&and($S,$b);			# V
+	&rotr($b,2);			# NP
+	&lea($e,&DWP(K1,$e,$T));	# U Add Xi and K
+	if ($n < 15) {
+		&mov($T,$a);		# V
+		&xor($S,$d);		# U
+		&rotl($T,5);		# NP
+		&add($e,$S);		# U
+		&mov($S,$c);		# V Start of NEXT round's F()
+		&add($e,$T);		# U
+	} else {
+		# This version provides the correct start for BODY_20_39
+		&xor($S,$d);		# V
+		&mov($T,&swtmp(($n+1)%16));	# U Start computing next Xi.
+		&add($e,$S);		# V Add F()
+		&mov($S,$a);		# U Start computing a<<<5
+		&xor($T,&swtmp(($n+3)%16));	# V
+		&rotl($S,5);		# U
+		&xor($T,&swtmp(($n+9)%16));	# V
+	}
+}
+
+
+# A full round using F(b,c,d) = d^(b&(c^d)).  6 cycles of dependency chain.
+# This starts just before starting to compute F(); the Xi should have XORed
+# the first three values together.  (Break is at //)
+#
+#	in[i]		F(b,c,d)	a<<<5
+#	mov in[i],T	(xorl E,S)	(addl T+K,A)
+#	xor in[i+1],T	(addl S,A)	(movl B,S)
+#	xor in[i+2],T	(roll 5,S)	//
+#	xor in[i+3],T	movl D,S	(addl S,A)
+#	roll 1,T	xorl C,S
+#	movl T,in[i]	andl B,S	rorl 2,B
+#	addl T+K,E	xorl D,S	(mov in[i],T)
+#	(xor in[i+1],T)	addl S,E	movl A,S
+#	(xor in[i+2],T)	roll 5,S	//
+#	(xor in[i+3],T)	(movl C,S)	addl S,E
+
+sub BODY_16_19
+{
+    local($n,$a,$b,$c,$d,$e)=@_;
+
+	&comment("16_19 $n");
+
+	&xor($T,&swtmp(($n+13)%16));	# U
+	&add($a,$S);			# V End of previous round
+	&mov($S,$d);			# U Start current round's F()
+	&rotl($T,1);			# V
+	&xor($S,$c);			# U
+	&mov(&swtmp($n%16),$T);		# U Store computed Xi.
+	&and($S,$b);			# V
+	&rotr($b,2);			# NP
+	&lea($e,&DWP(K1,$e,$T));	# U Add Xi and K
+	&mov($T,&swtmp(($n+1)%16));	# V Start computing next Xi.
+	&xor($S,$d);			# U
+	&xor($T,&swtmp(($n+3)%16));	# V
+	&add($e,$S);			# U Add F()
+	&mov($S,$a);			# V Start computing a<<<5
+	&xor($T,&swtmp(($n+9)%16));	# U
+	&rotl($S,5);			# NP
+}
+
+# This is just like BODY_16_19, but computes a different F() = b^c^d
+#
+#	in[i]		F(b,c,d)	a<<<5
+#	mov in[i],T	(xorl E,S)	(addl T+K,A)
+#	xor in[i+1],T	(addl S,A)	(movl B,S)
+#	xor in[i+2],T	(roll 5,S)	//
+#	xor in[i+3],T	(addl S,A)
+#	roll 1,T	movl D,S
+#	movl T,in[i]	xorl B,S	rorl 2,B
+#	addl T+K,E	xorl C,S	(mov in[i],T)
+#	(xor in[i+1],T)	addl S,E	movl A,S
+#	(xor in[i+2],T)	roll 5,S	//
+#	(xor in[i+3],T)	(movl C,S)	addl S,E
+
+sub BODY_20_39	# And 60..79
+{
+    local($n,$a,$b,$c,$d,$e)=@_;
+    local $K=($n<40) ? K2 : K4;
+
+	&comment("20_39 $n");
+
+	&xor($T,&swtmp(($n+13)%16));	# U
+	&add($a,$S);			# V End of previous round
+	&rotl($T,1);			# U
+	&mov($S,$d);			# V Start current round's F()
+	&mov(&swtmp($n%16),$T) if ($n < 77);	# Store computed Xi.
+	&xor($S,$b);			# V
+	&rotr($b,2);			# NP
+	&lea($e,&DWP($K,$e,$T));	# U Add Xi and K
+	&mov($T,&swtmp(($n+1)%16)) if ($n < 79);	# Start computing next Xi.
+	&xor($S,$c);			# U
+	&xor($T,&swtmp(($n+3)%16)) if ($n < 79);
+	&add($e,$S);			# U Add F1()
+	&mov($S,$a);			# V Start computing a<<<5
+	&xor($T,&swtmp(($n+9)%16)) if ($n < 79);
+	&rotl($S,5);			# NP
+
+	&add($e,$S) if ($n == 79);
+}
+
+
+# This starts immediately after the LEA, and expects to need to finish
+# the previous round.  (Break is at //)
+#
+#	in[i]		F1(c,d)		F2(b,c,d)	a<<<5
+#	(addl T+K,E)	(andl C,S)	(rorl 2,C)
+#	mov in[i],T	(addl S,A)	(movl B,S)
+#	xor in[i+1],T	(rorl 5,S)
+#	xor in[i+2],T	/ movl D,S	(addl S,A)
+#	xor in[i+3],T	andl C,S
+#	rotl 1,T	addl S,E	movl D,S
+#	(movl T,in[i])	xorl C,S
+#	addl T+K,E	andl B,S	rorl 2,B
+#	(mov in[i],T)	addl S,E	movl A,S
+#	(xor in[i+1],T)	rorl 5,S
+#	(xor in[i+2],T)	// (movl C,S)	addl S,E
+
+sub BODY_40_59
+{
+    local($n,$a,$b,$c,$d,$e)=@_;
+
+	&comment("40_59 $n");
+
+	&add($a,$S);			# V End of previous round
+	&mov($S,$d);			# U Start current round's F1()
+	&xor($T,&swtmp(($n+13)%16));	# V
+	&and($S,$c);			# U
+	&rotl($T,1);			# U XXX Missed pairing
+	&add($e,$S);			# V Add F1()
+	&mov($S,$d);			# U Start current round's F2()
+	&mov(&swtmp($n%16),$T);		# V Store computed Xi.
+	&xor($S,$c);			# U
+	&lea($e,&DWP(K3,$e,$T));	# V
+	&and($S,$b);			# U XXX Missed pairing
+	&rotr($b,2);			# NP
+	&mov($T,&swtmp(($n+1)%16));	# U Start computing next Xi.
+	&add($e,$S);			# V Add F2()
+	&mov($S,$a);			# U Start computing a<<<5
+	&xor($T,&swtmp(($n+3)%16));	# V
+	&rotl($S,5);			# NP
+	&xor($T,&swtmp(($n+9)%16));	# U
+}
+# The above code is NOT optimally paired for the Pentium.  (And thus,
+# presumably, Atom, which has a very similar dual-issue in-order pipeline.)
+# However, attempts to improve it make it slower on Phenom & i7.
+
+&function_begin("sha1_block_data_order",16);
+
+    local @V = ($A,$B,$C,$D,$E);
+    local @W = ($A,$B,$C);
+
+	&mov($S,&wparam(0));	# SHA_CTX *c
+	&mov($T,&wparam(1));	# const void *input
+	&mov($A,&wparam(2));	# size_t num
+	&stack_push(16);	# allocate X[16]
+	&shl($A,6);
+	&add($A,$T);
+	&mov(&wparam(2),$A);	# pointer beyond the end of input
+	&mov($E,&DWP(16,$S));	# pre-load E
+	&mov($D,&DWP(12,$S));	# pre-load D
+
+	&set_label("loop",16);
+
+	# copy input chunk to X, but reversing byte order!
+	&mov($W[2],&DWP(4*(0),$T));
+	&mov($W[1],&DWP(4*(1),$T));
+	&bswap($W[2]);
+	for ($i=0; $i<14; $i++) {
+		&mov($W[0],&DWP(4*($i+2),$T));
+		&bswap($W[1]);
+		&mov(&swtmp($i+0),$W[2]);
+		unshift(@W,pop(@W));
+	}
+	&bswap($W[1]);
+	&mov(&swtmp($i+0),$W[2]);
+	&mov(&swtmp($i+1),$W[1]);
+
+	&mov(&wparam(1),$T);	# redundant in 1st spin
+
+	# Reload A, B and C, which we use as temporaries in the copying
+	&mov($C,&DWP(8,$S));
+	&mov($B,&DWP(4,$S));
+	&mov($A,&DWP(0,$S));
+
+	for($i=0;$i<16;$i++)	{ &BODY_00_15($i,@V); unshift(@V,pop(@V)); }
+	for(;$i<20;$i++)	{ &BODY_16_19($i,@V); unshift(@V,pop(@V)); }
+	for(;$i<40;$i++)	{ &BODY_20_39($i,@V); unshift(@V,pop(@V)); }
+	for(;$i<60;$i++)	{ &BODY_40_59($i,@V); unshift(@V,pop(@V)); }
+	for(;$i<80;$i++)	{ &BODY_20_39($i,@V); unshift(@V,pop(@V)); }
+
+	(($V[4] eq $E) and ($V[0] eq $A)) or die;	# double-check
+
+	&comment("Loop trailer");
+
+	&mov($S,&wparam(0));	# re-load SHA_CTX*
+	&mov($T,&wparam(1));	# re-load input pointer
+
+	&add($E,&DWP(16,$S));
+	&add($D,&DWP(12,$S));
+	&add(&DWP(8,$S),$C);
+	&add(&DWP(4,$S),$B);
+	&add($T,64);		# advance input pointer
+	&add(&DWP(0,$S),$A);
+	&mov(&DWP(12,$S),$D);
+	&mov(&DWP(16,$S),$E);
+
+	&cmp($T,&wparam(2));	# have we reached the end yet?
+	&jb(&label("loop"));
+
+	&stack_pop(16);
+&function_end("sha1_block_data_order");
+&asciz("SHA1 block transform for x86, CRYPTOGAMS by <appro\@openssl.org>");
+
+&asm_finish();

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-03 3:47 ` x86 SHA1: Faster than OpenSSL George Spelvin @ 2009-08-03 7:36 ` Jonathan del Strother 2009-08-04 1:40 ` Mark Lodato ` (2 subsequent siblings) 3 siblings, 0 replies; 60+ messages in thread From: Jonathan del Strother @ 2009-08-03 7:36 UTC (permalink / raw) To: George Spelvin; +Cc: git, appro, appro On Mon, Aug 3, 2009 at 4:47 AM, George Spelvin<linux@horizon.com> wrote: > (Work in progress, state dump to mailing list archives.) > > This started when discussing git startup overhead due to the dynamic > linker. One big contributor is the openssl library, which is used only > for its optimized x86 SHA-1 implementation. So I took a look at it, > with an eye to importing the code directly into the git source tree, > and decided that I felt like trying to do better. > FWIW, this doesn't work on OS X / Darwin. 'as' doesn't take a --32 flag, it takes an -arch i386 flag. After changing that, I get: as -arch i386 -o sha1-586.o sha1-586.s sha1-586.s:4:Unknown pseudo-op: .type sha1-586.s:4:Rest of line ignored. 1st junk character valued 115 (s). sha1-586.s:5:Alignment too large: 15. assumed. sha1-586.s:19:Alignment too large: 15. assumed. sha1-586.s:1438:Unknown pseudo-op: .size sha1-586.s:1438:Rest of line ignored. 1st junk character valued 115 (s). make: *** [sha1-586.o] Error 1 - at which point I have no idea how to fix it. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-03 3:47 ` x86 SHA1: Faster than OpenSSL George Spelvin 2009-08-03 7:36 ` Jonathan del Strother @ 2009-08-04 1:40 ` Mark Lodato 2009-08-04 2:30 ` Linus Torvalds 2009-08-18 21:26 ` Andy Polyakov 3 siblings, 0 replies; 60+ messages in thread From: Mark Lodato @ 2009-08-04 1:40 UTC (permalink / raw) To: George Spelvin; +Cc: git, appro, appro On Sun, Aug 2, 2009 at 11:47 PM, George Spelvin<linux@horizon.com> wrote: > Before After Gain Processor > 1.585248 1.353314 +17% 2500 MHz Phenom > 3.249614 3.295619 -1.4% 1594 MHz P4 > 1.414512 1.352843 +4.5% 2.66 GHz i7 > 3.460635 3.284221 +5.4% 1596 MHz Athlon XP > 4.077993 3.891826 +4.8% 1144 MHz Athlon > 1.912161 1.623212 +17% 2100 MHz Athlon 64 X2 > 2.956432 2.940210 +0.55% 1794 MHz Mobile Celeron (fam 15 model 2) > > (Seconds to hash 500x 1 MB, best of 10 runs in all cases.) > > This is based on Andy Polyakov's GPL/BSD licensed cryptogams code, and > (for now) uses the same perl preprocessor. To test it, do the following: > - Download Andy's original code from > http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz > - "tar xz cryptogams-0.tar.gz" > - "cd cryptogams-0/x86" > - "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile", > and "sha1-x86.pl". > - "make" > - Run ./586test (before) and ./x86test (after) and note the timings. 
Note: to compile this on Ubuntu x86-64, I had to:

$ sudo apt-get install libc6-dev-i386

$ ./586test
1/10: 2.016621 s
2/10: 2.030742 s
3/10: 2.027333 s
4/10: 2.024018 s
5/10: 2.022306 s
6/10: 2.022418 s
7/10: 2.047103 s
8/10: 2.035467 s
9/10: 2.032237 s
10/10: 2.029231 s
Minimum time to hash 500000000 bytes: 2.016621

$ ./x86test
1/10: 1.818661 s
2/10: 1.814856 s
3/10: 1.816232 s
4/10: 1.815208 s
5/10: 1.834047 s
6/10: 1.843020 s
7/10: 1.819564 s
8/10: 1.815560 s
9/10: 1.824232 s
10/10: 1.820943 s
Minimum time to hash 500000000 bytes: 1.814856

$ python -c 'print 2.016621 / 1.814856'
1.11117410968

$ cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 15
model name	: Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz
stepping	: 2
cpu MHz		: 1861.825
cache size	: 2048 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow
bogomips	: 3723.65
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 15
model name	: Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz
stepping	: 2
cpu MHz		: 1861.825
cache size	: 2048 KB
physical id	: 0
siblings	: 2
core id		: 1
cpu cores	: 2
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow
bogomips	: 3724.01
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

I imagine that you can get a bigger speedup by making a 64-bit version
(but maybe not). Either way, it would be nice if x86-64 users did not
have to install an additional package to compile.

Cheers,
Mark

^ permalink raw reply	[flat|nested] 60+ messages in thread
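[The numbers above come from sha1test.c's best-of-ten methodology: hash the same buffers repeatedly and keep the minimum wall time, which filters out scheduler and cache noise. A rough Python equivalent of that harness; `bench_sha1` and its defaults are invented for illustration, and the C version presumably times the assembly routine directly:]

```python
import hashlib
import time

def bench_sha1(block_size=1_000_000, blocks=500, runs=10):
    """Hash `blocks` buffers of `block_size` bytes, `runs` times over;
    report each run and return the minimum (best-of-N) wall time."""
    buf = b"\x00" * block_size
    best = float("inf")
    for r in range(1, runs + 1):
        start = time.perf_counter()
        h = hashlib.sha1()
        for _ in range(blocks):
            h.update(buf)
        elapsed = time.perf_counter() - start
        print(f"{r}/{runs}: {elapsed:.6f} s")
        best = min(best, elapsed)
    print(f"Minimum time to hash {block_size * blocks} bytes: {best:.6f}")
    return best
```

[Taking the minimum rather than the mean is deliberate: anything other than a clean run can only add time, so the smallest observation is the best estimate of the code's true cost.]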
* Re: x86 SHA1: Faster than OpenSSL
  2009-08-03  3:47 ` x86 SHA1: Faster than OpenSSL George Spelvin
  2009-08-03  7:36   ` Jonathan del Strother
  2009-08-04  1:40   ` Mark Lodato
@ 2009-08-04  2:30   ` Linus Torvalds
  2009-08-04  2:51     ` Linus Torvalds
  2009-08-04  4:48     ` George Spelvin
  2009-08-18 21:26   ` Andy Polyakov
  3 siblings, 2 replies; 60+ messages in thread
From: Linus Torvalds @ 2009-08-04  2:30 UTC (permalink / raw)
To: George Spelvin; +Cc: git, appro, appro

On Sun, 2 Aug 2009, George Spelvin wrote:
>
> The original code was excellent, but it was optimized when the P4 was new.
> After a bit of tweaking, I've inflicted a slight (1.4%) slowdown on the
> P4, but a small-but-noticeable speedup on a variety of other processors.
>
> Before    After     Gain    Processor
> 1.585248  1.353314  +17%    2500 MHz Phenom
> 3.249614  3.295619  -1.4%   1594 MHz P4
> 1.414512  1.352843  +4.5%   2.66 GHz i7
> 3.460635  3.284221  +5.4%   1596 MHz Athlon XP
> 4.077993  3.891826  +4.8%   1144 MHz Athlon
> 1.912161  1.623212  +17%    2100 MHz Athlon 64 X2
> 2.956432  2.940210  +0.55%  1794 MHz Mobile Celeron (fam 15 model 2)

It would be better to have a more git-centric benchmark that actually
shows some real git load, rather than a sha1-only microbenchmark.

The thing that I'd prefer is simply

	git fsck --full

on the Linux kernel archive. For me (with a fast machine), it takes about
4m30s with the OpenSSL SHA1, and takes 6m40s with the Mozilla SHA1 (ie
using a NO_OPENSSL=1 build).

So that's an example of a load that is actually very sensitive to SHA1
performance (more so than _most_ git loads, I suspect), and at the same
time is a real git load rather than some SHA1-only microbenchmark. It also
shows very clearly why we default to the OpenSSL version over the Mozilla
one.

NOTE! I didn't do multiple runs to see how stable the numbers are, and so
it's possible that I exaggerated the OpenSSL advantage over the
Mozilla-SHA1 code. Or vice versa.
My point is really only that I don't know how meaningful a "50 x 1M SHA1"
benchmark is, while I know that a "git fsck" benchmark has at least _some_
real-life value.

		Linus

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  2:30 ` Linus Torvalds
@ 2009-08-04  2:51   ` Linus Torvalds
  2009-08-04  3:07     ` Jon Smirl
  2009-08-18 21:50     ` Andy Polyakov
  2009-08-04  4:48   ` George Spelvin
  1 sibling, 2 replies; 60+ messages in thread
From: Linus Torvalds @ 2009-08-04  2:51 UTC (permalink / raw)
To: George Spelvin; +Cc: git, appro, appro

On Mon, 3 Aug 2009, Linus Torvalds wrote:
>
> The thing that I'd prefer is simply
>
> 	git fsck --full
>
> on the Linux kernel archive. For me (with a fast machine), it takes about
> 4m30s with the OpenSSL SHA1, and takes 6m40s with the Mozilla SHA1 (ie
> using a NO_OPENSSL=1 build).
>
> So that's an example of a load that is actually very sensitive to SHA1
> performance (more so than _most_ git loads, I suspect), and at the same
> time is a real git load rather than some SHA1-only microbenchmark. It also
> shows very clearly why we default to the OpenSSL version over the Mozilla
> one.

"perf report --sort comm,dso,symbol" profiling shows the following for
'git fsck --full' on the kernel repo, using the Mozilla SHA1:

    47.69%  git  /home/torvalds/git/git  [.] moz_SHA1_Update
    22.98%  git  /lib64/libz.so.1.2.3    [.] inflate_fast
     7.32%  git  /lib64/libc-2.10.1.so   [.] __GI_memcpy
     4.66%  git  /lib64/libz.so.1.2.3    [.] inflate
     3.76%  git  /lib64/libz.so.1.2.3    [.] adler32
     2.86%  git  /lib64/libz.so.1.2.3    [.] inflate_table
     2.41%  git  /home/torvalds/git/git  [.] lookup_object
     1.31%  git  /lib64/libc-2.10.1.so   [.] _int_malloc
     0.84%  git  /home/torvalds/git/git  [.] patch_delta
     0.78%  git  [kernel]                [k] hpet_next_event

so yeah, SHA1 performance matters. Judging by the OpenSSL numbers, the
OpenSSL SHA1 implementation must be about twice as fast as the C version
we use.

That said, under "normal" git usage models, the SHA1 costs are almost
invisible. So git-fsck is definitely a fairly unusual case that stresses
the SHA1 performance more than most git loads.

		Linus

^ permalink raw reply	[flat|nested] 60+ messages in thread
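[As a sanity check on those numbers: with moz_SHA1_Update at 47.69% of the Mozilla-build profile, Amdahl's law relates the hash routine's local speedup to the overall wall-clock ratio (6m40s vs. roughly 4m30s). A back-of-the-envelope calculation using only the figures quoted in the two messages above:]

```python
def overall_speedup(hot_frac, local_speedup):
    # Amdahl's law: only `hot_frac` of the runtime enjoys `local_speedup`.
    return 1.0 / ((1.0 - hot_frac) + hot_frac / local_speedup)

hot = 0.4769           # moz_SHA1_Update's share of the Mozilla-build profile
observed = 400 / 270   # 6m40s (Mozilla SHA1) vs. ~4m30s (OpenSSL SHA1)

# Local speedup of the SHA-1 routine implied by the observed wall-clock ratio:
implied = hot / (1.0 / observed - (1.0 - hot))
print(f"implied SHA-1 routine speedup: {implied:.2f}x")  # roughly 3x
```

[Taken at face value, the wall times imply the OpenSSL routine is roughly 3x the Mozilla C code on this machine, the same ballpark as the "about twice" estimate; the profile and the timings came from different runs, so the exact factor is soft.]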
* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  2:51 ` Linus Torvalds
@ 2009-08-04  3:07   ` Jon Smirl
  2009-08-04  5:01     ` George Spelvin
  2009-08-18 21:50   ` Andy Polyakov
  1 sibling, 1 reply; 60+ messages in thread
From: Jon Smirl @ 2009-08-04  3:07 UTC (permalink / raw)
To: Linus Torvalds; +Cc: George Spelvin, git, appro, appro

On Mon, Aug 3, 2009 at 10:51 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Mon, 3 Aug 2009, Linus Torvalds wrote:
>>
>> The thing that I'd prefer is simply
>>
>> 	git fsck --full
>>
>> on the Linux kernel archive. For me (with a fast machine), it takes about
>> 4m30s with the OpenSSL SHA1, and takes 6m40s with the Mozilla SHA1 (ie
>> using a NO_OPENSSL=1 build).
>>
>> So that's an example of a load that is actually very sensitive to SHA1
>> performance (more so than _most_ git loads, I suspect), and at the same
>> time is a real git load rather than some SHA1-only microbenchmark. It also
>> shows very clearly why we default to the OpenSSL version over the Mozilla
>> one.
>
> "perf report --sort comm,dso,symbol" profiling shows the following for
> 'git fsck --full' on the kernel repo, using the Mozilla SHA1:
>
>     47.69%  git  /home/torvalds/git/git  [.] moz_SHA1_Update
>     22.98%  git  /lib64/libz.so.1.2.3    [.] inflate_fast
>      7.32%  git  /lib64/libc-2.10.1.so   [.] __GI_memcpy
>      4.66%  git  /lib64/libz.so.1.2.3    [.] inflate
>      3.76%  git  /lib64/libz.so.1.2.3    [.] adler32
>      2.86%  git  /lib64/libz.so.1.2.3    [.] inflate_table
>      2.41%  git  /home/torvalds/git/git  [.] lookup_object
>      1.31%  git  /lib64/libc-2.10.1.so   [.] _int_malloc
>      0.84%  git  /home/torvalds/git/git  [.] patch_delta
>      0.78%  git  [kernel]                [k] hpet_next_event
>
> so yeah, SHA1 performance matters. Judging by the OpenSSL numbers, the
> OpenSSL SHA1 implementation must be about twice as fast as the C version
> we use.

Would there happen to be a SHA1 implementation around that can compute
the SHA1 without first decompressing the data? Databases gain a lot of
speed by using special algorithms that can directly operate on the
compressed data.

> That said, under "normal" git usage models, the SHA1 costs are almost
> invisible. So git-fsck is definitely a fairly unusual case that stresses
> the SHA1 performance more than most git loads.
>
> 	Linus
> --
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  3:07 ` Jon Smirl
@ 2009-08-04  5:01   ` George Spelvin
  2009-08-04 12:56     ` Jon Smirl
  0 siblings, 1 reply; 60+ messages in thread
From: George Spelvin @ 2009-08-04  5:01 UTC (permalink / raw)
To: jonsmirl; +Cc: git, linux

> Would there happen to be a SHA1 implementation around that can compute
> the SHA1 without first decompressing the data? Databases gain a lot of
> speed by using special algorithms that can directly operate on the
> compressed data.

I can't imagine how. In general, this requires that the compression
be carefully designed to be compatible with the algorithms, and SHA1
is specifically designed to depend on every bit of the input in
an un-analyzable way.

Also, git normally avoids hashing objects that it doesn't need
uncompressed for some other reason. git-fsck is a notable exception,
but I think the idea of creating special optimized code paths for that
interferes with its reliability and robustness goals.

^ permalink raw reply	[flat|nested] 60+ messages in thread
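[George's point that SHA-1 "depends on every bit of the input" is the avalanche property, and it is easy to observe: flip any single input bit and, on average, half of the 160 digest bits change. A quick illustration; the helper name is invented:]

```python
import hashlib

def bit_flip_distance(msg, bit):
    """Hamming distance between SHA-1(msg) and SHA-1 of msg with one bit flipped."""
    flipped = bytearray(msg)
    flipped[bit // 8] ^= 1 << (bit % 8)
    a = int.from_bytes(hashlib.sha1(msg).digest(), "big")
    b = int.from_bytes(hashlib.sha1(bytes(flipped)).digest(), "big")
    return bin(a ^ b).count("1")

# Each single-bit flip scrambles roughly 80 of the 160 digest bits.
for bit in range(4):
    print(bit, bit_flip_distance(b"hello, world", bit))
```

[This is exactly why no transform of the compressed stream can yield the hash of the uncompressed bytes: the digest is defined over the plaintext, bit for bit.]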
* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  5:01 ` George Spelvin
@ 2009-08-04 12:56   ` Jon Smirl
  2009-08-04 14:29     ` Dmitry Potapov
  0 siblings, 1 reply; 60+ messages in thread
From: Jon Smirl @ 2009-08-04 12:56 UTC (permalink / raw)
To: George Spelvin; +Cc: git

On Tue, Aug 4, 2009 at 1:01 AM, George Spelvin <linux@horizon.com> wrote:
>> Would there happen to be a SHA1 implementation around that can compute
>> the SHA1 without first decompressing the data? Databases gain a lot of
>> speed by using special algorithms that can directly operate on the
>> compressed data.
>
> I can't imagine how. In general, this requires that the compression
> be carefully designed to be compatible with the algorithms, and SHA1
> is specifically designed to depend on every bit of the input in
> an un-analyzable way.

A simple start would be to feed each byte as it is decompressed
directly into the sha code and avoid the intermediate buffer. Removing
the buffer reduces cache pressure.

> Also, git normally avoids hashing objects that it doesn't need
> uncompressed for some other reason. git-fsck is a notable exception,
> but I think the idea of creating special optimized code paths for that
> interferes with its reliability and robustness goals.

Agreed that there is no real need for this, just something to play
with if you are trying for a speed record. I'd much rather have a
solution for the rebase problem where one side of the diff has moved
to a different file and rebase can't figure it out.

--
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 60+ messages in thread
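[The buffer-avoiding half of this idea, streaming each inflated chunk straight into the hash rather than materializing the whole object, is simple at the source level, whatever one thinks of fusing the two in assembler. A hedged Python sketch with zlib and hashlib; `sha1_of_deflated` is an invented name, and git's real object hash also covers a "type length\0" header, omitted here:]

```python
import hashlib
import zlib

def sha1_of_deflated(compressed, chunk=4096):
    """Inflate `compressed` a chunk at a time, feeding each decompressed
    piece straight into SHA-1 instead of building one big buffer."""
    inflater = zlib.decompressobj()
    h = hashlib.sha1()
    for pos in range(0, len(compressed), chunk):
        h.update(inflater.decompress(compressed[pos:pos + chunk]))
    h.update(inflater.flush())  # any decompressed bytes still buffered
    return h.hexdigest()
```

[Dmitry's reply makes the case that this buys little in practice, since typical git objects fit in L2 cache anyway.]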
* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04 12:56 ` Jon Smirl
@ 2009-08-04 14:29   ` Dmitry Potapov
  0 siblings, 0 replies; 60+ messages in thread
From: Dmitry Potapov @ 2009-08-04 14:29 UTC (permalink / raw)
To: Jon Smirl; +Cc: George Spelvin, git

On Tue, Aug 04, 2009 at 08:56:48AM -0400, Jon Smirl wrote:
>
> A simple start would be to feed each byte as it is decompressed
> directly into the sha code and avoid the intermediate buffer. Removing
> the buffer reduces cache pressure.

First, you still have to keep every decoded byte in the compression
window, which is 32Kb by default. Typical files in Git repositories are
not so big; many are under 32Kb, and practically all of them fit into
the L2 cache of modern processors.

Second, the complication of the assembler code from coupling the two
algorithms would be enormous. There are not enough registers on x86
for SHA-1 alone.

Third, SHA-1 is very computationally intensive and has a predictable
(linear) access pattern, so you do not wait on L2, because the data
will already be in L1.

So, I don't see where you can gain significantly. Perhaps you could win
something by rewriting inflate in assembler, but I do not expect any
significant gains beyond that. And the coupling has obvious
disadvantages when it comes to maintenance...

Dmitry

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  2:51 ` Linus Torvalds
  2009-08-04  3:07   ` Jon Smirl
@ 2009-08-18 21:50   ` Andy Polyakov
  1 sibling, 0 replies; 60+ messages in thread
From: Andy Polyakov @ 2009-08-18 21:50 UTC (permalink / raw)
To: Linus Torvalds; +Cc: George Spelvin, git

> On Mon, 3 Aug 2009, Linus Torvalds wrote:
>> The thing that I'd prefer is simply
>>
>> 	git fsck --full
>>
>> on the Linux kernel archive. For me (with a fast machine), it takes about
>> 4m30s with the OpenSSL SHA1, and takes 6m40s with the Mozilla SHA1 (ie
>> using a NO_OPENSSL=1 build).
>>
>> So that's an example of a load that is actually very sensitive to SHA1
>> performance (more so than _most_ git loads, I suspect), and at the same
>> time is a real git load rather than some SHA1-only microbenchmark.

I couldn't agree more that real-life benchmarks are of greater value
than an algorithm-specific micro-benchmark. And given the provided
profiling data, one can argue that a +17% (or my +12%) improvement on a
micro-benchmark isn't really worth bothering about. But it's kind of a
sport [at least for me], so don't judge too harshly :-)

>> It also
>> shows very clearly why we default to the OpenSSL version over the Mozilla
>> one.

As George implicitly mentioned, most OpenSSL assembler modules are
available under a more permissive license, and if there is interest I'm
ready to assist...

> "perf report --sort comm,dso,symbol" profiling shows the following for
> 'git fsck --full' on the kernel repo, using the Mozilla SHA1:
>
>     47.69%  git  /home/torvalds/git/git  [.] moz_SHA1_Update
>     22.98%  git  /lib64/libz.so.1.2.3    [.] inflate_fast
>      7.32%  git  /lib64/libc-2.10.1.so   [.] __GI_memcpy
>      4.66%  git  /lib64/libz.so.1.2.3    [.] inflate
>      3.76%  git  /lib64/libz.so.1.2.3    [.] adler32
>      2.86%  git  /lib64/libz.so.1.2.3    [.] inflate_table
>      2.41%  git  /home/torvalds/git/git  [.] lookup_object
>      1.31%  git  /lib64/libc-2.10.1.so   [.] _int_malloc
>      0.84%  git  /home/torvalds/git/git  [.] patch_delta
>      0.78%  git  [kernel]                [k] hpet_next_event
>
> so yeah, SHA1 performance matters. Judging by the OpenSSL numbers, the
> OpenSSL SHA1 implementation must be about twice as fast as the C version
> we use.

And given the /lib64 path, this is 64-bit compiler-generated C code
compared to 32-bit assembler? Either way, in this context I have an
extra comment addressing an earlier poster, Mark Lodato, who effectively
wondered how 64-bit assembler would compare to 32-bit. First of all,
there *is* a 64-bit assembler version. But as SHA-1 is essentially a
32-bit algorithm, the 64-bit implementation is only nominally faster,
+20% at most. It is faster thanks to the larger register bank, which
facilitates more efficient instruction scheduling.

Cheers. A.

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  2:30 ` Linus Torvalds
  2009-08-04  2:51   ` Linus Torvalds
@ 2009-08-04  4:48   ` George Spelvin
  2009-08-04  6:30     ` Linus Torvalds
  2009-08-04  6:40     ` Linus Torvalds
  1 sibling, 2 replies; 60+ messages in thread
From: George Spelvin @ 2009-08-04  4:48 UTC (permalink / raw)
To: torvalds; +Cc: git, linux

> It would be better to have a more git-centric benchmark that actually
> shows some real git load, rather than a sha1-only microbenchmark.
>
> The thing that I'd prefer is simply
>
> 	git fsck --full
>
> on the Linux kernel archive. For me (with a fast machine), it takes about
> 4m30s with the OpenSSL SHA1, and takes 6m40s with the Mozilla SHA1 (ie
> using a NO_OPENSSL=1 build).

The actual goal of this effort is to address the dynamic linker startup
time issues by removing the second-largest contributor after libcurl,
namely openssl. Optimizing the assembly code is just the fun part. ;-)

Anyway, on the git repository:

[1273]$ time x/git-fsck --full		(New SHA1 code)
dangling tree 524973049a7e4593df4af41e0564912f678a41ac
dangling tree 7da7d73185a1df5c2a477d2ee5599ac8a58cad56

real	0m59.306s
user	0m58.760s
sys	0m0.550s

[1274]$ time ./git-fsck --full		(OpenSSL)
dangling tree 524973049a7e4593df4af41e0564912f678a41ac
dangling tree 7da7d73185a1df5c2a477d2ee5599ac8a58cad56

real	1m0.364s
user	0m59.970s
sys	0m0.400s

1.6% is a pretty minor difference, especially as the machine was running
a backup at the time (but it's a quad-core with near-zero CPU usage; the
business is all I/O).

On the full Linux repository, I repacked it first to make sure that
everything was in RAM, and I have the first result:

[517]$ time ~/git/x/git-fsck --full	(New SHA1 code)

real	10m12.702s
user	9m48.410s
sys	0m23.350s

[518]$ time ~/git/git-fsck --full	(OpenSSL)

real	10m26.083s
user	10m2.800s
sys	0m22.000s

Again, 2.2% is not a huge improvement. But my only goal was not to be
worse.
> So that's an example of a load that is actually very sensitive to SHA1
> performance (more so than _most_ git loads, I suspect), and at the same
> time is a real git load rather than some SHA1-only microbenchmark. It also
> shows very clearly why we default to the OpenSSL version over the Mozilla
> one.

I wasn't questioning *that*. As I said, I was just doing the fun part
of importing a heavily-optimized OpenSSL-like SHA1 implementation into
the git source tree. (The un-fun part is modifying the build process to
detect the target processor and include the right asm automatically.)

Anyway, if you want to test it, here's a crude x86_32-only patch to the
git tree. "make NO_OPENSSL=1" to use the new code.

diff --git a/Makefile b/Makefile
index daf4296..8531c39 100644
--- a/Makefile
+++ b/Makefile
@@ -1176,8 +1176,10 @@ ifdef ARM_SHA1
 	LIB_OBJS += arm/sha1.o arm/sha1_arm.o
 else
 ifdef MOZILLA_SHA1
-	SHA1_HEADER = "mozilla-sha1/sha1.h"
-	LIB_OBJS += mozilla-sha1/sha1.o
+#	SHA1_HEADER = "mozilla-sha1/sha1.h"
+#	LIB_OBJS += mozilla-sha1/sha1.o
+	SHA1_HEADER = "x86/sha1.h"
+	LIB_OBJS += x86/sha1.o x86/sha1-x86.o
 else
 	SHA1_HEADER = <openssl/sha.h>
 	EXTLIBS += $(LIB_4_CRYPTO)
diff --git a/x86/sha1-x86.s b/x86/sha1-x86.s
new file mode 100644
index 0000000..96796d4
--- /dev/null
+++ b/x86/sha1-x86.s
@@ -0,0 +1,1372 @@
+.file	"sha1-586.s"
+.text
+.globl	sha1_block_data_order
+.type	sha1_block_data_order,@function
+.align	16
+sha1_block_data_order:
+	pushl	%ebp
+	pushl	%ebx
+	pushl	%esi
+	pushl	%edi
+	movl	20(%esp),%edi
+	movl	24(%esp),%esi
+	movl	28(%esp),%eax
+	subl	$64,%esp
+	shll	$6,%eax
+	addl	%esi,%eax
+	movl	%eax,92(%esp)
+	movl	16(%edi),%ebp
+	movl	12(%edi),%edx
+.align	16
+.L000loop:
+	movl	(%esi),%ecx
+	movl	4(%esi),%ebx
+	bswap	%ecx
+	movl	8(%esi),%eax
+	bswap	%ebx
+	movl	%ecx,(%esp)
+	movl	12(%esi),%ecx
+	bswap	%eax
+	movl	%ebx,4(%esp)
+	movl	16(%esi),%ebx
+	bswap	%ecx
+	movl	%eax,8(%esp)
+	movl	20(%esi),%eax
+	bswap	%ebx
+	movl	%ecx,12(%esp)
+	movl	24(%esi),%ecx
bswap %eax + movl %ebx,16(%esp) + movl 28(%esi),%ebx + bswap %ecx + movl %eax,20(%esp) + movl 32(%esi),%eax + bswap %ebx + movl %ecx,24(%esp) + movl 36(%esi),%ecx + bswap %eax + movl %ebx,28(%esp) + movl 40(%esi),%ebx + bswap %ecx + movl %eax,32(%esp) + movl 44(%esi),%eax + bswap %ebx + movl %ecx,36(%esp) + movl 48(%esi),%ecx + bswap %eax + movl %ebx,40(%esp) + movl 52(%esi),%ebx + bswap %ecx + movl %eax,44(%esp) + movl 56(%esi),%eax + bswap %ebx + movl %ecx,48(%esp) + movl 60(%esi),%ecx + bswap %eax + movl %ebx,52(%esp) + bswap %ecx + movl %eax,56(%esp) + movl %ecx,60(%esp) + movl %esi,88(%esp) + movl 8(%edi),%ecx + movl 4(%edi),%ebx + movl (%edi),%eax + /* 00_15 0 */ + movl %edx,%edi + movl (%esp),%esi + xorl %ecx,%edi + andl %ebx,%edi + rorl $2,%ebx + leal 1518500249(%ebp,%esi,1),%ebp + movl %eax,%esi + xorl %edx,%edi + roll $5,%esi + addl %edi,%ebp + movl %ecx,%edi + addl %esi,%ebp + /* 00_15 1 */ + movl 4(%esp),%esi + xorl %ebx,%edi + andl %eax,%edi + rorl $2,%eax + leal 1518500249(%edx,%esi,1),%edx + movl %ebp,%esi + xorl %ecx,%edi + roll $5,%esi + addl %edi,%edx + movl %ebx,%edi + addl %esi,%edx + /* 00_15 2 */ + movl 8(%esp),%esi + xorl %eax,%edi + andl %ebp,%edi + rorl $2,%ebp + leal 1518500249(%ecx,%esi,1),%ecx + movl %edx,%esi + xorl %ebx,%edi + roll $5,%esi + addl %edi,%ecx + movl %eax,%edi + addl %esi,%ecx + /* 00_15 3 */ + movl 12(%esp),%esi + xorl %ebp,%edi + andl %edx,%edi + rorl $2,%edx + leal 1518500249(%ebx,%esi,1),%ebx + movl %ecx,%esi + xorl %eax,%edi + roll $5,%esi + addl %edi,%ebx + movl %ebp,%edi + addl %esi,%ebx + /* 00_15 4 */ + movl 16(%esp),%esi + xorl %edx,%edi + andl %ecx,%edi + rorl $2,%ecx + leal 1518500249(%eax,%esi,1),%eax + movl %ebx,%esi + xorl %ebp,%edi + roll $5,%esi + addl %edi,%eax + movl %edx,%edi + addl %esi,%eax + /* 00_15 5 */ + movl 20(%esp),%esi + xorl %ecx,%edi + andl %ebx,%edi + rorl $2,%ebx + leal 1518500249(%ebp,%esi,1),%ebp + movl %eax,%esi + xorl %edx,%edi + roll $5,%esi + addl %edi,%ebp + movl %ecx,%edi + addl 
%esi,%ebp + /* 00_15 6 */ + movl 24(%esp),%esi + xorl %ebx,%edi + andl %eax,%edi + rorl $2,%eax + leal 1518500249(%edx,%esi,1),%edx + movl %ebp,%esi + xorl %ecx,%edi + roll $5,%esi + addl %edi,%edx + movl %ebx,%edi + addl %esi,%edx + /* 00_15 7 */ + movl 28(%esp),%esi + xorl %eax,%edi + andl %ebp,%edi + rorl $2,%ebp + leal 1518500249(%ecx,%esi,1),%ecx + movl %edx,%esi + xorl %ebx,%edi + roll $5,%esi + addl %edi,%ecx + movl %eax,%edi + addl %esi,%ecx + /* 00_15 8 */ + movl 32(%esp),%esi + xorl %ebp,%edi + andl %edx,%edi + rorl $2,%edx + leal 1518500249(%ebx,%esi,1),%ebx + movl %ecx,%esi + xorl %eax,%edi + roll $5,%esi + addl %edi,%ebx + movl %ebp,%edi + addl %esi,%ebx + /* 00_15 9 */ + movl 36(%esp),%esi + xorl %edx,%edi + andl %ecx,%edi + rorl $2,%ecx + leal 1518500249(%eax,%esi,1),%eax + movl %ebx,%esi + xorl %ebp,%edi + roll $5,%esi + addl %edi,%eax + movl %edx,%edi + addl %esi,%eax + /* 00_15 10 */ + movl 40(%esp),%esi + xorl %ecx,%edi + andl %ebx,%edi + rorl $2,%ebx + leal 1518500249(%ebp,%esi,1),%ebp + movl %eax,%esi + xorl %edx,%edi + roll $5,%esi + addl %edi,%ebp + movl %ecx,%edi + addl %esi,%ebp + /* 00_15 11 */ + movl 44(%esp),%esi + xorl %ebx,%edi + andl %eax,%edi + rorl $2,%eax + leal 1518500249(%edx,%esi,1),%edx + movl %ebp,%esi + xorl %ecx,%edi + roll $5,%esi + addl %edi,%edx + movl %ebx,%edi + addl %esi,%edx + /* 00_15 12 */ + movl 48(%esp),%esi + xorl %eax,%edi + andl %ebp,%edi + rorl $2,%ebp + leal 1518500249(%ecx,%esi,1),%ecx + movl %edx,%esi + xorl %ebx,%edi + roll $5,%esi + addl %edi,%ecx + movl %eax,%edi + addl %esi,%ecx + /* 00_15 13 */ + movl 52(%esp),%esi + xorl %ebp,%edi + andl %edx,%edi + rorl $2,%edx + leal 1518500249(%ebx,%esi,1),%ebx + movl %ecx,%esi + xorl %eax,%edi + roll $5,%esi + addl %edi,%ebx + movl %ebp,%edi + addl %esi,%ebx + /* 00_15 14 */ + movl 56(%esp),%esi + xorl %edx,%edi + andl %ecx,%edi + rorl $2,%ecx + leal 1518500249(%eax,%esi,1),%eax + movl %ebx,%esi + xorl %ebp,%edi + roll $5,%esi + addl %edi,%eax + movl %edx,%edi + 
addl %esi,%eax + /* 00_15 15 */ + movl 60(%esp),%esi + xorl %ecx,%edi + andl %ebx,%edi + rorl $2,%ebx + leal 1518500249(%ebp,%esi,1),%ebp + xorl %edx,%edi + movl (%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 8(%esp),%esi + roll $5,%edi + xorl 32(%esp),%esi + /* 16_19 16 */ + xorl 52(%esp),%esi + addl %edi,%ebp + movl %ecx,%edi + roll $1,%esi + xorl %ebx,%edi + movl %esi,(%esp) + andl %eax,%edi + rorl $2,%eax + leal 1518500249(%edx,%esi,1),%edx + movl 4(%esp),%esi + xorl %ecx,%edi + xorl 12(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 36(%esp),%esi + roll $5,%edi + /* 16_19 17 */ + xorl 56(%esp),%esi + addl %edi,%edx + movl %ebx,%edi + roll $1,%esi + xorl %eax,%edi + movl %esi,4(%esp) + andl %ebp,%edi + rorl $2,%ebp + leal 1518500249(%ecx,%esi,1),%ecx + movl 8(%esp),%esi + xorl %ebx,%edi + xorl 16(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 40(%esp),%esi + roll $5,%edi + /* 16_19 18 */ + xorl 60(%esp),%esi + addl %edi,%ecx + movl %eax,%edi + roll $1,%esi + xorl %ebp,%edi + movl %esi,8(%esp) + andl %edx,%edi + rorl $2,%edx + leal 1518500249(%ebx,%esi,1),%ebx + movl 12(%esp),%esi + xorl %eax,%edi + xorl 20(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 44(%esp),%esi + roll $5,%edi + /* 16_19 19 */ + xorl (%esp),%esi + addl %edi,%ebx + movl %ebp,%edi + roll $1,%esi + xorl %edx,%edi + movl %esi,12(%esp) + andl %ecx,%edi + rorl $2,%ecx + leal 1518500249(%eax,%esi,1),%eax + movl 16(%esp),%esi + xorl %ebp,%edi + xorl 24(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 48(%esp),%esi + roll $5,%edi + /* 20_39 20 */ + xorl 4(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,16(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 1859775393(%ebp,%esi,1),%ebp + movl 20(%esp),%esi + xorl %ecx,%edi + xorl 28(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 52(%esp),%esi + roll $5,%edi + /* 20_39 21 */ + xorl 8(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,20(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 
1859775393(%edx,%esi,1),%edx + movl 24(%esp),%esi + xorl %ebx,%edi + xorl 32(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 56(%esp),%esi + roll $5,%edi + /* 20_39 22 */ + xorl 12(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,24(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 1859775393(%ecx,%esi,1),%ecx + movl 28(%esp),%esi + xorl %eax,%edi + xorl 36(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 60(%esp),%esi + roll $5,%edi + /* 20_39 23 */ + xorl 16(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,28(%esp) + xorl %edx,%edi + rorl $2,%edx + leal 1859775393(%ebx,%esi,1),%ebx + movl 32(%esp),%esi + xorl %ebp,%edi + xorl 40(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl (%esp),%esi + roll $5,%edi + /* 20_39 24 */ + xorl 20(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl %esi,32(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 1859775393(%eax,%esi,1),%eax + movl 36(%esp),%esi + xorl %edx,%edi + xorl 44(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 4(%esp),%esi + roll $5,%edi + /* 20_39 25 */ + xorl 24(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,36(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 1859775393(%ebp,%esi,1),%ebp + movl 40(%esp),%esi + xorl %ecx,%edi + xorl 48(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 8(%esp),%esi + roll $5,%edi + /* 20_39 26 */ + xorl 28(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,40(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 1859775393(%edx,%esi,1),%edx + movl 44(%esp),%esi + xorl %ebx,%edi + xorl 52(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 12(%esp),%esi + roll $5,%edi + /* 20_39 27 */ + xorl 32(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,44(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 1859775393(%ecx,%esi,1),%ecx + movl 48(%esp),%esi + xorl %eax,%edi + xorl 56(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 16(%esp),%esi + roll $5,%edi + /* 20_39 28 
*/ + xorl 36(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,48(%esp) + xorl %edx,%edi + rorl $2,%edx + leal 1859775393(%ebx,%esi,1),%ebx + movl 52(%esp),%esi + xorl %ebp,%edi + xorl 60(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 20(%esp),%esi + roll $5,%edi + /* 20_39 29 */ + xorl 40(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl %esi,52(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 1859775393(%eax,%esi,1),%eax + movl 56(%esp),%esi + xorl %edx,%edi + xorl (%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 24(%esp),%esi + roll $5,%edi + /* 20_39 30 */ + xorl 44(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,56(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 1859775393(%ebp,%esi,1),%ebp + movl 60(%esp),%esi + xorl %ecx,%edi + xorl 4(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 28(%esp),%esi + roll $5,%edi + /* 20_39 31 */ + xorl 48(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,60(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 1859775393(%edx,%esi,1),%edx + movl (%esp),%esi + xorl %ebx,%edi + xorl 8(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 32(%esp),%esi + roll $5,%edi + /* 20_39 32 */ + xorl 52(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 1859775393(%ecx,%esi,1),%ecx + movl 4(%esp),%esi + xorl %eax,%edi + xorl 12(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 36(%esp),%esi + roll $5,%edi + /* 20_39 33 */ + xorl 56(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,4(%esp) + xorl %edx,%edi + rorl $2,%edx + leal 1859775393(%ebx,%esi,1),%ebx + movl 8(%esp),%esi + xorl %ebp,%edi + xorl 16(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 40(%esp),%esi + roll $5,%edi + /* 20_39 34 */ + xorl 60(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl %esi,8(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 1859775393(%eax,%esi,1),%eax + movl 
12(%esp),%esi + xorl %edx,%edi + xorl 20(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 44(%esp),%esi + roll $5,%edi + /* 20_39 35 */ + xorl (%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,12(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 1859775393(%ebp,%esi,1),%ebp + movl 16(%esp),%esi + xorl %ecx,%edi + xorl 24(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 48(%esp),%esi + roll $5,%edi + /* 20_39 36 */ + xorl 4(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,16(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 1859775393(%edx,%esi,1),%edx + movl 20(%esp),%esi + xorl %ebx,%edi + xorl 28(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 52(%esp),%esi + roll $5,%edi + /* 20_39 37 */ + xorl 8(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,20(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 1859775393(%ecx,%esi,1),%ecx + movl 24(%esp),%esi + xorl %eax,%edi + xorl 32(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 56(%esp),%esi + roll $5,%edi + /* 20_39 38 */ + xorl 12(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,24(%esp) + xorl %edx,%edi + rorl $2,%edx + leal 1859775393(%ebx,%esi,1),%ebx + movl 28(%esp),%esi + xorl %ebp,%edi + xorl 36(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 60(%esp),%esi + roll $5,%edi + /* 20_39 39 */ + xorl 16(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl %esi,28(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 1859775393(%eax,%esi,1),%eax + movl 32(%esp),%esi + xorl %edx,%edi + xorl 40(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl (%esp),%esi + roll $5,%edi + /* 40_59 40 */ + addl %edi,%eax + movl %edx,%edi + xorl 20(%esp),%esi + andl %ecx,%edi + roll $1,%esi + addl %edi,%ebp + movl %edx,%edi + movl %esi,32(%esp) + xorl %ecx,%edi + leal 2400959708(%ebp,%esi,1),%ebp + andl %ebx,%edi + rorl $2,%ebx + movl 36(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 44(%esp),%esi + roll $5,%edi + xorl 4(%esp),%esi + 
/* 40_59 41 */ + addl %edi,%ebp + movl %ecx,%edi + xorl 24(%esp),%esi + andl %ebx,%edi + roll $1,%esi + addl %edi,%edx + movl %ecx,%edi + movl %esi,36(%esp) + xorl %ebx,%edi + leal 2400959708(%edx,%esi,1),%edx + andl %eax,%edi + rorl $2,%eax + movl 40(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 48(%esp),%esi + roll $5,%edi + xorl 8(%esp),%esi + /* 40_59 42 */ + addl %edi,%edx + movl %ebx,%edi + xorl 28(%esp),%esi + andl %eax,%edi + roll $1,%esi + addl %edi,%ecx + movl %ebx,%edi + movl %esi,40(%esp) + xorl %eax,%edi + leal 2400959708(%ecx,%esi,1),%ecx + andl %ebp,%edi + rorl $2,%ebp + movl 44(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 52(%esp),%esi + roll $5,%edi + xorl 12(%esp),%esi + /* 40_59 43 */ + addl %edi,%ecx + movl %eax,%edi + xorl 32(%esp),%esi + andl %ebp,%edi + roll $1,%esi + addl %edi,%ebx + movl %eax,%edi + movl %esi,44(%esp) + xorl %ebp,%edi + leal 2400959708(%ebx,%esi,1),%ebx + andl %edx,%edi + rorl $2,%edx + movl 48(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 56(%esp),%esi + roll $5,%edi + xorl 16(%esp),%esi + /* 40_59 44 */ + addl %edi,%ebx + movl %ebp,%edi + xorl 36(%esp),%esi + andl %edx,%edi + roll $1,%esi + addl %edi,%eax + movl %ebp,%edi + movl %esi,48(%esp) + xorl %edx,%edi + leal 2400959708(%eax,%esi,1),%eax + andl %ecx,%edi + rorl $2,%ecx + movl 52(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 60(%esp),%esi + roll $5,%edi + xorl 20(%esp),%esi + /* 40_59 45 */ + addl %edi,%eax + movl %edx,%edi + xorl 40(%esp),%esi + andl %ecx,%edi + roll $1,%esi + addl %edi,%ebp + movl %edx,%edi + movl %esi,52(%esp) + xorl %ecx,%edi + leal 2400959708(%ebp,%esi,1),%ebp + andl %ebx,%edi + rorl $2,%ebx + movl 56(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl (%esp),%esi + roll $5,%edi + xorl 24(%esp),%esi + /* 40_59 46 */ + addl %edi,%ebp + movl %ecx,%edi + xorl 44(%esp),%esi + andl %ebx,%edi + roll $1,%esi + addl %edi,%edx + movl %ecx,%edi + movl %esi,56(%esp) + xorl %ebx,%edi + leal 2400959708(%edx,%esi,1),%edx + andl 
%eax,%edi + rorl $2,%eax + movl 60(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 4(%esp),%esi + roll $5,%edi + xorl 28(%esp),%esi + /* 40_59 47 */ + addl %edi,%edx + movl %ebx,%edi + xorl 48(%esp),%esi + andl %eax,%edi + roll $1,%esi + addl %edi,%ecx + movl %ebx,%edi + movl %esi,60(%esp) + xorl %eax,%edi + leal 2400959708(%ecx,%esi,1),%ecx + andl %ebp,%edi + rorl $2,%ebp + movl (%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 8(%esp),%esi + roll $5,%edi + xorl 32(%esp),%esi + /* 40_59 48 */ + addl %edi,%ecx + movl %eax,%edi + xorl 52(%esp),%esi + andl %ebp,%edi + roll $1,%esi + addl %edi,%ebx + movl %eax,%edi + movl %esi,(%esp) + xorl %ebp,%edi + leal 2400959708(%ebx,%esi,1),%ebx + andl %edx,%edi + rorl $2,%edx + movl 4(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 12(%esp),%esi + roll $5,%edi + xorl 36(%esp),%esi + /* 40_59 49 */ + addl %edi,%ebx + movl %ebp,%edi + xorl 56(%esp),%esi + andl %edx,%edi + roll $1,%esi + addl %edi,%eax + movl %ebp,%edi + movl %esi,4(%esp) + xorl %edx,%edi + leal 2400959708(%eax,%esi,1),%eax + andl %ecx,%edi + rorl $2,%ecx + movl 8(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 16(%esp),%esi + roll $5,%edi + xorl 40(%esp),%esi + /* 40_59 50 */ + addl %edi,%eax + movl %edx,%edi + xorl 60(%esp),%esi + andl %ecx,%edi + roll $1,%esi + addl %edi,%ebp + movl %edx,%edi + movl %esi,8(%esp) + xorl %ecx,%edi + leal 2400959708(%ebp,%esi,1),%ebp + andl %ebx,%edi + rorl $2,%ebx + movl 12(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 20(%esp),%esi + roll $5,%edi + xorl 44(%esp),%esi + /* 40_59 51 */ + addl %edi,%ebp + movl %ecx,%edi + xorl (%esp),%esi + andl %ebx,%edi + roll $1,%esi + addl %edi,%edx + movl %ecx,%edi + movl %esi,12(%esp) + xorl %ebx,%edi + leal 2400959708(%edx,%esi,1),%edx + andl %eax,%edi + rorl $2,%eax + movl 16(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 24(%esp),%esi + roll $5,%edi + xorl 48(%esp),%esi + /* 40_59 52 */ + addl %edi,%edx + movl %ebx,%edi + xorl 4(%esp),%esi + andl %eax,%edi + roll 
$1,%esi + addl %edi,%ecx + movl %ebx,%edi + movl %esi,16(%esp) + xorl %eax,%edi + leal 2400959708(%ecx,%esi,1),%ecx + andl %ebp,%edi + rorl $2,%ebp + movl 20(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 28(%esp),%esi + roll $5,%edi + xorl 52(%esp),%esi + /* 40_59 53 */ + addl %edi,%ecx + movl %eax,%edi + xorl 8(%esp),%esi + andl %ebp,%edi + roll $1,%esi + addl %edi,%ebx + movl %eax,%edi + movl %esi,20(%esp) + xorl %ebp,%edi + leal 2400959708(%ebx,%esi,1),%ebx + andl %edx,%edi + rorl $2,%edx + movl 24(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 32(%esp),%esi + roll $5,%edi + xorl 56(%esp),%esi + /* 40_59 54 */ + addl %edi,%ebx + movl %ebp,%edi + xorl 12(%esp),%esi + andl %edx,%edi + roll $1,%esi + addl %edi,%eax + movl %ebp,%edi + movl %esi,24(%esp) + xorl %edx,%edi + leal 2400959708(%eax,%esi,1),%eax + andl %ecx,%edi + rorl $2,%ecx + movl 28(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 36(%esp),%esi + roll $5,%edi + xorl 60(%esp),%esi + /* 40_59 55 */ + addl %edi,%eax + movl %edx,%edi + xorl 16(%esp),%esi + andl %ecx,%edi + roll $1,%esi + addl %edi,%ebp + movl %edx,%edi + movl %esi,28(%esp) + xorl %ecx,%edi + leal 2400959708(%ebp,%esi,1),%ebp + andl %ebx,%edi + rorl $2,%ebx + movl 32(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 40(%esp),%esi + roll $5,%edi + xorl (%esp),%esi + /* 40_59 56 */ + addl %edi,%ebp + movl %ecx,%edi + xorl 20(%esp),%esi + andl %ebx,%edi + roll $1,%esi + addl %edi,%edx + movl %ecx,%edi + movl %esi,32(%esp) + xorl %ebx,%edi + leal 2400959708(%edx,%esi,1),%edx + andl %eax,%edi + rorl $2,%eax + movl 36(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 44(%esp),%esi + roll $5,%edi + xorl 4(%esp),%esi + /* 40_59 57 */ + addl %edi,%edx + movl %ebx,%edi + xorl 24(%esp),%esi + andl %eax,%edi + roll $1,%esi + addl %edi,%ecx + movl %ebx,%edi + movl %esi,36(%esp) + xorl %eax,%edi + leal 2400959708(%ecx,%esi,1),%ecx + andl %ebp,%edi + rorl $2,%ebp + movl 40(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 48(%esp),%esi 
+ roll $5,%edi + xorl 8(%esp),%esi + /* 40_59 58 */ + addl %edi,%ecx + movl %eax,%edi + xorl 28(%esp),%esi + andl %ebp,%edi + roll $1,%esi + addl %edi,%ebx + movl %eax,%edi + movl %esi,40(%esp) + xorl %ebp,%edi + leal 2400959708(%ebx,%esi,1),%ebx + andl %edx,%edi + rorl $2,%edx + movl 44(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 52(%esp),%esi + roll $5,%edi + xorl 12(%esp),%esi + /* 40_59 59 */ + addl %edi,%ebx + movl %ebp,%edi + xorl 32(%esp),%esi + andl %edx,%edi + roll $1,%esi + addl %edi,%eax + movl %ebp,%edi + movl %esi,44(%esp) + xorl %edx,%edi + leal 2400959708(%eax,%esi,1),%eax + andl %ecx,%edi + rorl $2,%ecx + movl 48(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 56(%esp),%esi + roll $5,%edi + xorl 16(%esp),%esi + /* 20_39 60 */ + xorl 36(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,48(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 3395469782(%ebp,%esi,1),%ebp + movl 52(%esp),%esi + xorl %ecx,%edi + xorl 60(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 20(%esp),%esi + roll $5,%edi + /* 20_39 61 */ + xorl 40(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,52(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 3395469782(%edx,%esi,1),%edx + movl 56(%esp),%esi + xorl %ebx,%edi + xorl (%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 24(%esp),%esi + roll $5,%edi + /* 20_39 62 */ + xorl 44(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,56(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 3395469782(%ecx,%esi,1),%ecx + movl 60(%esp),%esi + xorl %eax,%edi + xorl 4(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 28(%esp),%esi + roll $5,%edi + /* 20_39 63 */ + xorl 48(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,60(%esp) + xorl %edx,%edi + rorl $2,%edx + leal 3395469782(%ebx,%esi,1),%ebx + movl (%esp),%esi + xorl %ebp,%edi + xorl 8(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 32(%esp),%esi + roll $5,%edi + /* 20_39 64 */ + xorl 52(%esp),%esi 
+ addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl %esi,(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 3395469782(%eax,%esi,1),%eax + movl 4(%esp),%esi + xorl %edx,%edi + xorl 12(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 36(%esp),%esi + roll $5,%edi + /* 20_39 65 */ + xorl 56(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,4(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 3395469782(%ebp,%esi,1),%ebp + movl 8(%esp),%esi + xorl %ecx,%edi + xorl 16(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 40(%esp),%esi + roll $5,%edi + /* 20_39 66 */ + xorl 60(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,8(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 3395469782(%edx,%esi,1),%edx + movl 12(%esp),%esi + xorl %ebx,%edi + xorl 20(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 44(%esp),%esi + roll $5,%edi + /* 20_39 67 */ + xorl (%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,12(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 3395469782(%ecx,%esi,1),%ecx + movl 16(%esp),%esi + xorl %eax,%edi + xorl 24(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 48(%esp),%esi + roll $5,%edi + /* 20_39 68 */ + xorl 4(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,16(%esp) + xorl %edx,%edi + rorl $2,%edx + leal 3395469782(%ebx,%esi,1),%ebx + movl 20(%esp),%esi + xorl %ebp,%edi + xorl 28(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 52(%esp),%esi + roll $5,%edi + /* 20_39 69 */ + xorl 8(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl %esi,20(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 3395469782(%eax,%esi,1),%eax + movl 24(%esp),%esi + xorl %edx,%edi + xorl 32(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 56(%esp),%esi + roll $5,%edi + /* 20_39 70 */ + xorl 12(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,24(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 3395469782(%ebp,%esi,1),%ebp + movl 28(%esp),%esi + xorl %ecx,%edi + 
xorl 36(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 60(%esp),%esi + roll $5,%edi + /* 20_39 71 */ + xorl 16(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,28(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 3395469782(%edx,%esi,1),%edx + movl 32(%esp),%esi + xorl %ebx,%edi + xorl 40(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl (%esp),%esi + roll $5,%edi + /* 20_39 72 */ + xorl 20(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,32(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 3395469782(%ecx,%esi,1),%ecx + movl 36(%esp),%esi + xorl %eax,%edi + xorl 44(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 4(%esp),%esi + roll $5,%edi + /* 20_39 73 */ + xorl 24(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,36(%esp) + xorl %edx,%edi + rorl $2,%edx + leal 3395469782(%ebx,%esi,1),%ebx + movl 40(%esp),%esi + xorl %ebp,%edi + xorl 48(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 8(%esp),%esi + roll $5,%edi + /* 20_39 74 */ + xorl 28(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl %esi,40(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 3395469782(%eax,%esi,1),%eax + movl 44(%esp),%esi + xorl %edx,%edi + xorl 52(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 12(%esp),%esi + roll $5,%edi + /* 20_39 75 */ + xorl 32(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,44(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 3395469782(%ebp,%esi,1),%ebp + movl 48(%esp),%esi + xorl %ecx,%edi + xorl 56(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 16(%esp),%esi + roll $5,%edi + /* 20_39 76 */ + xorl 36(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,48(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 3395469782(%edx,%esi,1),%edx + movl 52(%esp),%esi + xorl %ebx,%edi + xorl 60(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 20(%esp),%esi + roll $5,%edi + /* 20_39 77 */ + xorl 40(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl 
%ebx,%edi + xorl %ebp,%edi + rorl $2,%ebp + leal 3395469782(%ecx,%esi,1),%ecx + movl 56(%esp),%esi + xorl %eax,%edi + xorl (%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 24(%esp),%esi + roll $5,%edi + /* 20_39 78 */ + xorl 44(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + xorl %edx,%edi + rorl $2,%edx + leal 3395469782(%ebx,%esi,1),%ebx + movl 60(%esp),%esi + xorl %ebp,%edi + xorl 4(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 28(%esp),%esi + roll $5,%edi + /* 20_39 79 */ + xorl 48(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + xorl %ecx,%edi + rorl $2,%ecx + leal 3395469782(%eax,%esi,1),%eax + xorl %edx,%edi + addl %edi,%eax + movl %ebx,%edi + roll $5,%edi + addl %edi,%eax + /* Loop trailer */ + movl 84(%esp),%edi + movl 88(%esp),%esi + addl 16(%edi),%ebp + addl 12(%edi),%edx + addl %ecx,8(%edi) + addl %ebx,4(%edi) + addl $64,%esi + addl %eax,(%edi) + movl %edx,12(%edi) + movl %ebp,16(%edi) + cmpl 92(%esp),%esi + jb .L000loop + addl $64,%esp + popl %edi + popl %esi + popl %ebx + popl %ebp + ret +.L_sha1_block_data_order_end: +.size sha1_block_data_order,.L_sha1_block_data_order_end-sha1_block_data_order +.byte 83,72,65,49,32,98,108,111,99,107,32,116,114,97,110,115,102,111,114,109,32,102,111,114,32,120,56,54,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0 diff --git a/x86/sha1.c b/x86/sha1.c new file mode 100644 index 0000000..4c1a569 --- /dev/null +++ b/x86/sha1.c @@ -0,0 +1,81 @@ +/* + * SHA-1 implementation. + * + * Copyright (C) 2005 Paul Mackerras <paulus@samba.org> + * + * This version assumes we are running on a big-endian machine. + * It calls an external sha1_core() to process blocks of 64 bytes. 
+ */
+#include <stdio.h>
+#include <string.h>
+#include <arpa/inet.h>	/* For htonl */
+#include "sha1.h"
+
+#define x86_sha1_core sha1_block_data_order
+extern void x86_sha1_core(uint32_t hash[5], const unsigned char *p,
+	unsigned int nblocks);
+
+void x86_SHA1_Init(x86_SHA_CTX *c)
+{
+	/* Matches prefix of scontext structure */
+	static struct {
+		uint32_t hash[5];
+		uint64_t len;
+	} const iv = {
+		{ 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0 },
+		0
+	};
+
+	memcpy(c, &iv, sizeof iv);
+}
+
+void x86_SHA1_Update(x86_SHA_CTX *c, const void *p, unsigned long n)
+{
+	unsigned pos = (unsigned)c->len & 63;
+	unsigned long nb;
+
+	c->len += n;
+
+	/* Initial partial block */
+	if (pos) {
+		unsigned space = 64 - pos;
+		if (space > n)
+			goto end;
+		memcpy(c->buf + pos, p, space);
+		p += space;
+		n -= space;
+		x86_sha1_core(c->hash, c->buf, 1);
+	}
+
+	/* The big impressive middle */
+	nb = n >> 6;
+	if (nb) {
+		x86_sha1_core(c->hash, p, nb);
+		p += nb << 6;
+		n &= 63;
+	}
+	pos = 0;
+end:
+	/* Final partial block */
+	memcpy(c->buf + pos, p, n);
+}
+
+void x86_SHA1_Final(unsigned char *hash, x86_SHA_CTX *c)
+{
+	unsigned pos = (unsigned)c->len & 63;
+
+	c->buf[pos++] = 0x80;
+	if (pos > 56) {
+		memset(c->buf + pos, 0, 64 - pos);
+		x86_sha1_core(c->hash, c->buf, 1);
+		pos = 0;
+	}
+	memset(c->buf + pos, 0, 56 - pos);
+	/* Last two words are 64-bit *bit* count */
+	*(uint32_t *)(c->buf + 56) = htonl((uint32_t)(c->len >> 29));
+	*(uint32_t *)(c->buf + 60) = htonl((uint32_t)c->len << 3);
+	x86_sha1_core(c->hash, c->buf, 1);
+
+	for (pos = 0; pos < 5; pos++)
+		((uint32_t *)hash)[pos] = htonl(c->hash[pos]);
+}
diff --git a/x86/sha1.h b/x86/sha1.h
new file mode 100644
index 0000000..8988da9
--- /dev/null
+++ b/x86/sha1.h
@@ -0,0 +1,21 @@
+/*
+ * SHA-1 implementation.
+ *
+ * Copyright (C) 2005 Paul Mackerras <paulus@samba.org>
+ */
+#include <stdint.h>
+
+typedef struct {
+	uint32_t hash[5];
+	uint64_t len;
+	unsigned char buf[64];	/* Keep this aligned */
+} x86_SHA_CTX;
+
+void x86_SHA1_Init(x86_SHA_CTX *c);
+void x86_SHA1_Update(x86_SHA_CTX *c, const void *p, unsigned long n);
+void x86_SHA1_Final(unsigned char *hash, x86_SHA_CTX *c);
+
+#define git_SHA_CTX	x86_SHA_CTX
+#define git_SHA1_Init	x86_SHA1_Init
+#define git_SHA1_Update	x86_SHA1_Update
+#define git_SHA1_Final	x86_SHA1_Final

^ permalink raw reply related	[flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  4:48           ` George Spelvin
@ 2009-08-04  6:30             ` Linus Torvalds
  2009-08-04  8:01               ` George Spelvin
  2009-08-04  6:40             ` Linus Torvalds
  1 sibling, 1 reply; 60+ messages in thread
From: Linus Torvalds @ 2009-08-04 6:30 UTC (permalink / raw)
To: George Spelvin; +Cc: Git Mailing List

On Mon, 4 Aug 2009, George Spelvin wrote:
>
> The actual goal of this effort is to address the dynamic linker startup
> time issues by removing the second-largest contributor after libcurl,
> namely openssl.  Optimizing the assembly code is just the fun part.  ;-)

Now, I agree that it would be wonderful to get rid of the linker startup,
but the startup costs of openssl are very low compared to the equivalent
curl ones.  So we can't lose _too_ much performance - especially for
long-running jobs where startup costs really don't even matter - in the
quest to get rid of those.

That said, your numbers are impressive.  Improving fsck by 1.1-2.2% is
very good.  That means that you not only avoided the startup costs, you
actually improved on the openssl code.  So it's a win-win situation.

That said, it would be even better if the SHA1 code was also somewhat
portable to other environments (it looks like your current patch is very
GNU as-specific), and if you had a solution for x86-64 too ;)

Yeah, I'm a whiny little b*tch, aren't I?

		Linus

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  6:30             ` Linus Torvalds
@ 2009-08-04  8:01               ` George Spelvin
  2009-08-04 20:41                 ` Junio C Hamano
  0 siblings, 1 reply; 60+ messages in thread
From: George Spelvin @ 2009-08-04 8:01 UTC (permalink / raw)
To: linux, torvalds; +Cc: git

> Now, I agree that it would be wonderful to get rid of the linker startup,
> but the startup costs of openssl are very low compared to the equivalent
> curl ones.  So we can't lose _too_ much performance - especially for
> long-running jobs where startup costs really don't even matter - in the
> quest to get rid of those.
>
> That said, your numbers are impressive.  Improving fsck by 1.1-2.2% is
> very good.  That means that you not only avoided the startup costs, you
> actually improved on the openssl code.  So it's a win-win situation.

Er, yes, that *is* what the subject line is advertising.  I started with
the OpenSSL core SHA1 code (which is BSD/GPL dual-licensed by its author)
and tweaked it some more for more recent processors.

> That said, it would be even better if the SHA1 code was also somewhat
> portable to other environments (it looks like your current patch is very
> GNU as-specific), and if you had a solution for x86-64 too ;)

Done and will be done.  The code is *actually* written (see the first
e-mail in this thread) in the Perl preprocessor that OpenSSL uses, which
can generate quite a few output syntaxes (including Intel).  I just
included the preprocessed version to reduce the complexity of the
rough-draft patch.

The one question I have is that currently perl is not a critical
compile-time dependency; it's needed for some extra stuff, but AFAIK
you can get most of git working without it.  Whether to add that
dependency is a Junio question.

As for x86-64, I haven't actually *written* it yet, but it'll be a very
simple adaptation.  Mostly it's just a matter of using the additional
registers effectively.

> Yeah, I'm a whiny little b*tch, aren't I?

Not at all; I expected all of that.
Getting rid of OpenSSL kind of requires those things.

> Hmm.  Does it really help to do the bswap as a separate initial phase?
>
> As far as I can tell, you load the result of the bswap just a single time
> for each value.  So the initial "bswap all 64 bytes" seems pointless.

>> +	/* 00_15 0 */
>> +	movl	%edx,%edi
>> +	movl	(%esp),%esi

> Why not do the bswap here instead?
>
> Is it because you're running out of registers for scheduling, and want to
> use the stack pointer rather than the original source?

Exactly.  I looked hard at it, but that means that I'd have to write the
first 16 rounds with only one temp register, because the other is being
used as an input pointer.

Here's the pipelined loop for the first 16 rounds (when in[i] is the
stack buffer), showing parallel operations on the same line.
(Operations in parens belong to adjacent rounds.)

#			movl D,S	(roll 5,T)	(addl S,A)	//
# mov in[i],T		xorl C,S	(addl T,A)
#			andl B,S	rorl 2,B
# addl T+K,E		xorl D,S	movl A,T
# addl S,E		roll 5,T	(movl C,S)	//
# (mov in[i],T)		(xorl B,S)	addl T,E

which translates in perl code to:

sub BODY_00_15
{
	local($n,$a,$b,$c,$d,$e)=@_;

	&comment("00_15 $n");

	&mov($S,$d) if ($n == 0);
	&mov($T,&swtmp($n%16));		# V Load Xi.
	&xor($S,$c);			# U Continue F() = d^(b&(c^d))
	&and($S,$b);			# V
	&rotr($b,2);			# NP
	&lea($e,&DWP(K1,$e,$T));	# U Add Xi and K
	if ($n < 15) {
		&mov($T,$a);		# V
		&xor($S,$d);		# U
		&rotl($T,5);		# NP
		&add($e,$S);		# U
		&mov($S,$c);		# V Start of NEXT round's F()
		&add($e,$T);		# U
	} else {
		# This version provides the correct start for BODY_20_39
		&xor($S,$d);			# V
		&mov($T,&swtmp(($n+1)%16));	# U Start computing next Xi.
		&add($e,$S);			# V Add F()
		&mov($S,$a);			# U Start computing a<<<5
		&xor($T,&swtmp(($n+3)%16));	# V
		&rotl($S,5);			# U
		&xor($T,&swtmp(($n+9)%16));	# V
	}
}

Anyway, the round is:

#define K1 0x5a827999
	e += bswap(in[i]) + K1 + (d^(b&(c^d))) + ROTL(a,5);
	b = ROTR(b,2);

Notice how I use one temp (T) for in[i] and ROTL(a,5), and the other (S)
for F1(b,c,d) = d^(b&(c^d)).
If I only had one temporary, I'd have to seriously un-overlap it:

	mov	S[i],T
	bswap	T
	mov	T,in[i]
	lea	K1(T,e),e
	mov	d,T
	xor	c,T
	and	b,T
	xor	d,T
	add	T,e
	mov	a,T
	roll	5,T
	add	T,e

Current processors probably have enough out-of-order scheduling resources
to find the parallelism there, but something like an Atom would be doomed.

I just cobbled together a test implementation, and it looks pretty
similar on my Phenom here (minimum of 30 runs):

Separate copy loop:	1.355603
In-line:		1.350444 (+0.4% faster)

A hint of being faster, but not much.  It is a couple of percent faster
on a P4:

Separate copy loop:	3.297174
In-line:		3.237354 (+1.8% faster)

And on an i7:

Separate copy loop:	1.353641
In-line:		1.336766 (+1.2% faster)

but I worry about in-order machines.  An Athlon XP:

Separate copy loop:	3.252682
In-line:		3.313870 (-1.8% slower)

H'm... it's not bad.  And the code is smaller.  Maybe I'll work on it
a bit.  If you want to try it, the modified sha1-x86.s file is appended.

--- /dev/null	2009-05-12 02:55:38.579106460 -0400
+++ sha1-x86.s	2009-08-04 03:42:31.073284734 -0400
@@ -0,0 +1,1359 @@
+.file	"sha1-586.s"
+.text
+.globl	sha1_block_data_order
+.type	sha1_block_data_order,@function
+.align	16
+sha1_block_data_order:
+	pushl	%ebp
+	pushl	%ebx
+	pushl	%esi
+	pushl	%edi
+	movl	20(%esp),%edi
+	movl	24(%esp),%esi
+	movl	28(%esp),%eax
+	subl	$64,%esp
+	shll	$6,%eax
+	addl	%esi,%eax
+	movl	%eax,92(%esp)
+	movl	16(%edi),%ebp
+	movl	12(%edi),%edx
+	movl	8(%edi),%ecx
+	movl	4(%edi),%ebx
+	movl	(%edi),%eax
+.align	16
+.L000loop:
+	movl	%esi,88(%esp)
+	/* 00_15 0 */
+	movl	(%esi),%edi
+	bswap	%edi
+	movl	%edi,(%esp)
+	leal	1518500249(%ebp,%edi,1),%ebp
+	movl	%edx,%edi
+	xorl	%ecx,%edi
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	xorl	%edx,%edi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	roll	$5,%edi
+	addl	%edi,%ebp
+	/* 00_15 1 */
+	movl	4(%esi),%edi
+	bswap	%edi
+	movl	%edi,4(%esp)
+	leal	1518500249(%edx,%edi,1),%edx
+	movl	%ecx,%edi
+	xorl	%ebx,%edi
+	andl	%eax,%edi
+	rorl	$2,%eax
+	xorl	%ecx,%edi
+	addl
%edi,%edx + movl %ebp,%edi + roll $5,%edi + addl %edi,%edx + /* 00_15 2 */ + movl 8(%esi),%edi + bswap %edi + movl %edi,8(%esp) + leal 1518500249(%ecx,%edi,1),%ecx + movl %ebx,%edi + xorl %eax,%edi + andl %ebp,%edi + rorl $2,%ebp + xorl %ebx,%edi + addl %edi,%ecx + movl %edx,%edi + roll $5,%edi + addl %edi,%ecx + /* 00_15 3 */ + movl 12(%esi),%edi + bswap %edi + movl %edi,12(%esp) + leal 1518500249(%ebx,%edi,1),%ebx + movl %eax,%edi + xorl %ebp,%edi + andl %edx,%edi + rorl $2,%edx + xorl %eax,%edi + addl %edi,%ebx + movl %ecx,%edi + roll $5,%edi + addl %edi,%ebx + /* 00_15 4 */ + movl 16(%esi),%edi + bswap %edi + movl %edi,16(%esp) + leal 1518500249(%eax,%edi,1),%eax + movl %ebp,%edi + xorl %edx,%edi + andl %ecx,%edi + rorl $2,%ecx + xorl %ebp,%edi + addl %edi,%eax + movl %ebx,%edi + roll $5,%edi + addl %edi,%eax + /* 00_15 5 */ + movl 20(%esi),%edi + bswap %edi + movl %edi,20(%esp) + leal 1518500249(%ebp,%edi,1),%ebp + movl %edx,%edi + xorl %ecx,%edi + andl %ebx,%edi + rorl $2,%ebx + xorl %edx,%edi + addl %edi,%ebp + movl %eax,%edi + roll $5,%edi + addl %edi,%ebp + /* 00_15 6 */ + movl 24(%esi),%edi + bswap %edi + movl %edi,24(%esp) + leal 1518500249(%edx,%edi,1),%edx + movl %ecx,%edi + xorl %ebx,%edi + andl %eax,%edi + rorl $2,%eax + xorl %ecx,%edi + addl %edi,%edx + movl %ebp,%edi + roll $5,%edi + addl %edi,%edx + /* 00_15 7 */ + movl 28(%esi),%edi + bswap %edi + movl %edi,28(%esp) + leal 1518500249(%ecx,%edi,1),%ecx + movl %ebx,%edi + xorl %eax,%edi + andl %ebp,%edi + rorl $2,%ebp + xorl %ebx,%edi + addl %edi,%ecx + movl %edx,%edi + roll $5,%edi + addl %edi,%ecx + /* 00_15 8 */ + movl 32(%esi),%edi + bswap %edi + movl %edi,32(%esp) + leal 1518500249(%ebx,%edi,1),%ebx + movl %eax,%edi + xorl %ebp,%edi + andl %edx,%edi + rorl $2,%edx + xorl %eax,%edi + addl %edi,%ebx + movl %ecx,%edi + roll $5,%edi + addl %edi,%ebx + /* 00_15 9 */ + movl 36(%esi),%edi + bswap %edi + movl %edi,36(%esp) + leal 1518500249(%eax,%edi,1),%eax + movl %ebp,%edi + xorl %edx,%edi + andl 
%ecx,%edi + rorl $2,%ecx + xorl %ebp,%edi + addl %edi,%eax + movl %ebx,%edi + roll $5,%edi + addl %edi,%eax + /* 00_15 10 */ + movl 40(%esi),%edi + bswap %edi + movl %edi,40(%esp) + leal 1518500249(%ebp,%edi,1),%ebp + movl %edx,%edi + xorl %ecx,%edi + andl %ebx,%edi + rorl $2,%ebx + xorl %edx,%edi + addl %edi,%ebp + movl %eax,%edi + roll $5,%edi + addl %edi,%ebp + /* 00_15 11 */ + movl 44(%esi),%edi + bswap %edi + movl %edi,44(%esp) + leal 1518500249(%edx,%edi,1),%edx + movl %ecx,%edi + xorl %ebx,%edi + andl %eax,%edi + rorl $2,%eax + xorl %ecx,%edi + addl %edi,%edx + movl %ebp,%edi + roll $5,%edi + addl %edi,%edx + /* 00_15 12 */ + movl 48(%esi),%edi + bswap %edi + movl %edi,48(%esp) + leal 1518500249(%ecx,%edi,1),%ecx + movl %ebx,%edi + xorl %eax,%edi + andl %ebp,%edi + rorl $2,%ebp + xorl %ebx,%edi + addl %edi,%ecx + movl %edx,%edi + roll $5,%edi + addl %edi,%ecx + /* 00_15 13 */ + movl 52(%esi),%edi + bswap %edi + movl %edi,52(%esp) + leal 1518500249(%ebx,%edi,1),%ebx + movl %eax,%edi + xorl %ebp,%edi + andl %edx,%edi + rorl $2,%edx + xorl %eax,%edi + addl %edi,%ebx + movl %ecx,%edi + roll $5,%edi + addl %edi,%ebx + /* 00_15 14 */ + movl 56(%esi),%edi + movl 60(%esi),%esi + bswap %edi + movl %edi,56(%esp) + leal 1518500249(%eax,%edi,1),%eax + movl %ebp,%edi + xorl %edx,%edi + andl %ecx,%edi + rorl $2,%ecx + xorl %ebp,%edi + addl %edi,%eax + movl %ebx,%edi + roll $5,%edi + addl %edi,%eax + /* 00_15 15 */ + movl %edx,%edi + bswap %esi + xorl %ecx,%edi + movl %esi,60(%esp) + andl %ebx,%edi + rorl $2,%ebx + xorl %edx,%edi + leal 1518500249(%ebp,%esi,1),%ebp + movl (%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 8(%esp),%esi + roll $5,%edi + xorl 32(%esp),%esi + /* 16_19 16 */ + xorl 52(%esp),%esi + addl %edi,%ebp + movl %ecx,%edi + roll $1,%esi + xorl %ebx,%edi + movl %esi,(%esp) + andl %eax,%edi + rorl $2,%eax + leal 1518500249(%edx,%esi,1),%edx + movl 4(%esp),%esi + xorl %ecx,%edi + xorl 12(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 36(%esp),%esi + 
roll $5,%edi + /* 16_19 17 */ + xorl 56(%esp),%esi + addl %edi,%edx + movl %ebx,%edi + roll $1,%esi + xorl %eax,%edi + movl %esi,4(%esp) + andl %ebp,%edi + rorl $2,%ebp + leal 1518500249(%ecx,%esi,1),%ecx + movl 8(%esp),%esi + xorl %ebx,%edi + xorl 16(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 40(%esp),%esi + roll $5,%edi + /* 16_19 18 */ + xorl 60(%esp),%esi + addl %edi,%ecx + movl %eax,%edi + roll $1,%esi + xorl %ebp,%edi + movl %esi,8(%esp) + andl %edx,%edi + rorl $2,%edx + leal 1518500249(%ebx,%esi,1),%ebx + movl 12(%esp),%esi + xorl %eax,%edi + xorl 20(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 44(%esp),%esi + roll $5,%edi + /* 16_19 19 */ + xorl (%esp),%esi + addl %edi,%ebx + movl %ebp,%edi + roll $1,%esi + xorl %edx,%edi + movl %esi,12(%esp) + andl %ecx,%edi + rorl $2,%ecx + leal 1518500249(%eax,%esi,1),%eax + movl 16(%esp),%esi + xorl %ebp,%edi + xorl 24(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 48(%esp),%esi + roll $5,%edi + /* 20_39 20 */ + xorl 4(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,16(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 1859775393(%ebp,%esi,1),%ebp + movl 20(%esp),%esi + xorl %ecx,%edi + xorl 28(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 52(%esp),%esi + roll $5,%edi + /* 20_39 21 */ + xorl 8(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,20(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 1859775393(%edx,%esi,1),%edx + movl 24(%esp),%esi + xorl %ebx,%edi + xorl 32(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 56(%esp),%esi + roll $5,%edi + /* 20_39 22 */ + xorl 12(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,24(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 1859775393(%ecx,%esi,1),%ecx + movl 28(%esp),%esi + xorl %eax,%edi + xorl 36(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 60(%esp),%esi + roll $5,%edi + /* 20_39 23 */ + xorl 16(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,28(%esp) + 
xorl %edx,%edi + rorl $2,%edx + leal 1859775393(%ebx,%esi,1),%ebx + movl 32(%esp),%esi + xorl %ebp,%edi + xorl 40(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl (%esp),%esi + roll $5,%edi + /* 20_39 24 */ + xorl 20(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl %esi,32(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 1859775393(%eax,%esi,1),%eax + movl 36(%esp),%esi + xorl %edx,%edi + xorl 44(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 4(%esp),%esi + roll $5,%edi + /* 20_39 25 */ + xorl 24(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,36(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 1859775393(%ebp,%esi,1),%ebp + movl 40(%esp),%esi + xorl %ecx,%edi + xorl 48(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 8(%esp),%esi + roll $5,%edi + /* 20_39 26 */ + xorl 28(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,40(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 1859775393(%edx,%esi,1),%edx + movl 44(%esp),%esi + xorl %ebx,%edi + xorl 52(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 12(%esp),%esi + roll $5,%edi + /* 20_39 27 */ + xorl 32(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,44(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 1859775393(%ecx,%esi,1),%ecx + movl 48(%esp),%esi + xorl %eax,%edi + xorl 56(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 16(%esp),%esi + roll $5,%edi + /* 20_39 28 */ + xorl 36(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,48(%esp) + xorl %edx,%edi + rorl $2,%edx + leal 1859775393(%ebx,%esi,1),%ebx + movl 52(%esp),%esi + xorl %ebp,%edi + xorl 60(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 20(%esp),%esi + roll $5,%edi + /* 20_39 29 */ + xorl 40(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl %esi,52(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 1859775393(%eax,%esi,1),%eax + movl 56(%esp),%esi + xorl %edx,%edi + xorl (%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 
24(%esp),%esi + roll $5,%edi + /* 20_39 30 */ + xorl 44(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,56(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 1859775393(%ebp,%esi,1),%ebp + movl 60(%esp),%esi + xorl %ecx,%edi + xorl 4(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 28(%esp),%esi + roll $5,%edi + /* 20_39 31 */ + xorl 48(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,60(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 1859775393(%edx,%esi,1),%edx + movl (%esp),%esi + xorl %ebx,%edi + xorl 8(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 32(%esp),%esi + roll $5,%edi + /* 20_39 32 */ + xorl 52(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 1859775393(%ecx,%esi,1),%ecx + movl 4(%esp),%esi + xorl %eax,%edi + xorl 12(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 36(%esp),%esi + roll $5,%edi + /* 20_39 33 */ + xorl 56(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,4(%esp) + xorl %edx,%edi + rorl $2,%edx + leal 1859775393(%ebx,%esi,1),%ebx + movl 8(%esp),%esi + xorl %ebp,%edi + xorl 16(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 40(%esp),%esi + roll $5,%edi + /* 20_39 34 */ + xorl 60(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl %esi,8(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 1859775393(%eax,%esi,1),%eax + movl 12(%esp),%esi + xorl %edx,%edi + xorl 20(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 44(%esp),%esi + roll $5,%edi + /* 20_39 35 */ + xorl (%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,12(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 1859775393(%ebp,%esi,1),%ebp + movl 16(%esp),%esi + xorl %ecx,%edi + xorl 24(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 48(%esp),%esi + roll $5,%edi + /* 20_39 36 */ + xorl 4(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,16(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 
1859775393(%edx,%esi,1),%edx + movl 20(%esp),%esi + xorl %ebx,%edi + xorl 28(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 52(%esp),%esi + roll $5,%edi + /* 20_39 37 */ + xorl 8(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,20(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 1859775393(%ecx,%esi,1),%ecx + movl 24(%esp),%esi + xorl %eax,%edi + xorl 32(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 56(%esp),%esi + roll $5,%edi + /* 20_39 38 */ + xorl 12(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,24(%esp) + xorl %edx,%edi + rorl $2,%edx + leal 1859775393(%ebx,%esi,1),%ebx + movl 28(%esp),%esi + xorl %ebp,%edi + xorl 36(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 60(%esp),%esi + roll $5,%edi + /* 20_39 39 */ + xorl 16(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl %esi,28(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 1859775393(%eax,%esi,1),%eax + movl 32(%esp),%esi + xorl %edx,%edi + xorl 40(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl (%esp),%esi + roll $5,%edi + /* 40_59 40 */ + addl %edi,%eax + movl %edx,%edi + xorl 20(%esp),%esi + andl %ecx,%edi + roll $1,%esi + addl %edi,%ebp + movl %edx,%edi + movl %esi,32(%esp) + xorl %ecx,%edi + leal 2400959708(%ebp,%esi,1),%ebp + andl %ebx,%edi + rorl $2,%ebx + movl 36(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 44(%esp),%esi + roll $5,%edi + xorl 4(%esp),%esi + /* 40_59 41 */ + addl %edi,%ebp + movl %ecx,%edi + xorl 24(%esp),%esi + andl %ebx,%edi + roll $1,%esi + addl %edi,%edx + movl %ecx,%edi + movl %esi,36(%esp) + xorl %ebx,%edi + leal 2400959708(%edx,%esi,1),%edx + andl %eax,%edi + rorl $2,%eax + movl 40(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 48(%esp),%esi + roll $5,%edi + xorl 8(%esp),%esi + /* 40_59 42 */ + addl %edi,%edx + movl %ebx,%edi + xorl 28(%esp),%esi + andl %eax,%edi + roll $1,%esi + addl %edi,%ecx + movl %ebx,%edi + movl %esi,40(%esp) + xorl %eax,%edi + leal 2400959708(%ecx,%esi,1),%ecx + andl 
%ebp,%edi + rorl $2,%ebp + movl 44(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 52(%esp),%esi + roll $5,%edi + xorl 12(%esp),%esi + /* 40_59 43 */ + addl %edi,%ecx + movl %eax,%edi + xorl 32(%esp),%esi + andl %ebp,%edi + roll $1,%esi + addl %edi,%ebx + movl %eax,%edi + movl %esi,44(%esp) + xorl %ebp,%edi + leal 2400959708(%ebx,%esi,1),%ebx + andl %edx,%edi + rorl $2,%edx + movl 48(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 56(%esp),%esi + roll $5,%edi + xorl 16(%esp),%esi + /* 40_59 44 */ + addl %edi,%ebx + movl %ebp,%edi + xorl 36(%esp),%esi + andl %edx,%edi + roll $1,%esi + addl %edi,%eax + movl %ebp,%edi + movl %esi,48(%esp) + xorl %edx,%edi + leal 2400959708(%eax,%esi,1),%eax + andl %ecx,%edi + rorl $2,%ecx + movl 52(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 60(%esp),%esi + roll $5,%edi + xorl 20(%esp),%esi + /* 40_59 45 */ + addl %edi,%eax + movl %edx,%edi + xorl 40(%esp),%esi + andl %ecx,%edi + roll $1,%esi + addl %edi,%ebp + movl %edx,%edi + movl %esi,52(%esp) + xorl %ecx,%edi + leal 2400959708(%ebp,%esi,1),%ebp + andl %ebx,%edi + rorl $2,%ebx + movl 56(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl (%esp),%esi + roll $5,%edi + xorl 24(%esp),%esi + /* 40_59 46 */ + addl %edi,%ebp + movl %ecx,%edi + xorl 44(%esp),%esi + andl %ebx,%edi + roll $1,%esi + addl %edi,%edx + movl %ecx,%edi + movl %esi,56(%esp) + xorl %ebx,%edi + leal 2400959708(%edx,%esi,1),%edx + andl %eax,%edi + rorl $2,%eax + movl 60(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 4(%esp),%esi + roll $5,%edi + xorl 28(%esp),%esi + /* 40_59 47 */ + addl %edi,%edx + movl %ebx,%edi + xorl 48(%esp),%esi + andl %eax,%edi + roll $1,%esi + addl %edi,%ecx + movl %ebx,%edi + movl %esi,60(%esp) + xorl %eax,%edi + leal 2400959708(%ecx,%esi,1),%ecx + andl %ebp,%edi + rorl $2,%ebp + movl (%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 8(%esp),%esi + roll $5,%edi + xorl 32(%esp),%esi + /* 40_59 48 */ + addl %edi,%ecx + movl %eax,%edi + xorl 52(%esp),%esi + andl %ebp,%edi 
+ roll $1,%esi + addl %edi,%ebx + movl %eax,%edi + movl %esi,(%esp) + xorl %ebp,%edi + leal 2400959708(%ebx,%esi,1),%ebx + andl %edx,%edi + rorl $2,%edx + movl 4(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 12(%esp),%esi + roll $5,%edi + xorl 36(%esp),%esi + /* 40_59 49 */ + addl %edi,%ebx + movl %ebp,%edi + xorl 56(%esp),%esi + andl %edx,%edi + roll $1,%esi + addl %edi,%eax + movl %ebp,%edi + movl %esi,4(%esp) + xorl %edx,%edi + leal 2400959708(%eax,%esi,1),%eax + andl %ecx,%edi + rorl $2,%ecx + movl 8(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 16(%esp),%esi + roll $5,%edi + xorl 40(%esp),%esi + /* 40_59 50 */ + addl %edi,%eax + movl %edx,%edi + xorl 60(%esp),%esi + andl %ecx,%edi + roll $1,%esi + addl %edi,%ebp + movl %edx,%edi + movl %esi,8(%esp) + xorl %ecx,%edi + leal 2400959708(%ebp,%esi,1),%ebp + andl %ebx,%edi + rorl $2,%ebx + movl 12(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 20(%esp),%esi + roll $5,%edi + xorl 44(%esp),%esi + /* 40_59 51 */ + addl %edi,%ebp + movl %ecx,%edi + xorl (%esp),%esi + andl %ebx,%edi + roll $1,%esi + addl %edi,%edx + movl %ecx,%edi + movl %esi,12(%esp) + xorl %ebx,%edi + leal 2400959708(%edx,%esi,1),%edx + andl %eax,%edi + rorl $2,%eax + movl 16(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 24(%esp),%esi + roll $5,%edi + xorl 48(%esp),%esi + /* 40_59 52 */ + addl %edi,%edx + movl %ebx,%edi + xorl 4(%esp),%esi + andl %eax,%edi + roll $1,%esi + addl %edi,%ecx + movl %ebx,%edi + movl %esi,16(%esp) + xorl %eax,%edi + leal 2400959708(%ecx,%esi,1),%ecx + andl %ebp,%edi + rorl $2,%ebp + movl 20(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 28(%esp),%esi + roll $5,%edi + xorl 52(%esp),%esi + /* 40_59 53 */ + addl %edi,%ecx + movl %eax,%edi + xorl 8(%esp),%esi + andl %ebp,%edi + roll $1,%esi + addl %edi,%ebx + movl %eax,%edi + movl %esi,20(%esp) + xorl %ebp,%edi + leal 2400959708(%ebx,%esi,1),%ebx + andl %edx,%edi + rorl $2,%edx + movl 24(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 
32(%esp),%esi + roll $5,%edi + xorl 56(%esp),%esi + /* 40_59 54 */ + addl %edi,%ebx + movl %ebp,%edi + xorl 12(%esp),%esi + andl %edx,%edi + roll $1,%esi + addl %edi,%eax + movl %ebp,%edi + movl %esi,24(%esp) + xorl %edx,%edi + leal 2400959708(%eax,%esi,1),%eax + andl %ecx,%edi + rorl $2,%ecx + movl 28(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 36(%esp),%esi + roll $5,%edi + xorl 60(%esp),%esi + /* 40_59 55 */ + addl %edi,%eax + movl %edx,%edi + xorl 16(%esp),%esi + andl %ecx,%edi + roll $1,%esi + addl %edi,%ebp + movl %edx,%edi + movl %esi,28(%esp) + xorl %ecx,%edi + leal 2400959708(%ebp,%esi,1),%ebp + andl %ebx,%edi + rorl $2,%ebx + movl 32(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 40(%esp),%esi + roll $5,%edi + xorl (%esp),%esi + /* 40_59 56 */ + addl %edi,%ebp + movl %ecx,%edi + xorl 20(%esp),%esi + andl %ebx,%edi + roll $1,%esi + addl %edi,%edx + movl %ecx,%edi + movl %esi,32(%esp) + xorl %ebx,%edi + leal 2400959708(%edx,%esi,1),%edx + andl %eax,%edi + rorl $2,%eax + movl 36(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 44(%esp),%esi + roll $5,%edi + xorl 4(%esp),%esi + /* 40_59 57 */ + addl %edi,%edx + movl %ebx,%edi + xorl 24(%esp),%esi + andl %eax,%edi + roll $1,%esi + addl %edi,%ecx + movl %ebx,%edi + movl %esi,36(%esp) + xorl %eax,%edi + leal 2400959708(%ecx,%esi,1),%ecx + andl %ebp,%edi + rorl $2,%ebp + movl 40(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 48(%esp),%esi + roll $5,%edi + xorl 8(%esp),%esi + /* 40_59 58 */ + addl %edi,%ecx + movl %eax,%edi + xorl 28(%esp),%esi + andl %ebp,%edi + roll $1,%esi + addl %edi,%ebx + movl %eax,%edi + movl %esi,40(%esp) + xorl %ebp,%edi + leal 2400959708(%ebx,%esi,1),%ebx + andl %edx,%edi + rorl $2,%edx + movl 44(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 52(%esp),%esi + roll $5,%edi + xorl 12(%esp),%esi + /* 40_59 59 */ + addl %edi,%ebx + movl %ebp,%edi + xorl 32(%esp),%esi + andl %edx,%edi + roll $1,%esi + addl %edi,%eax + movl %ebp,%edi + movl %esi,44(%esp) + xorl 
%edx,%edi + leal 2400959708(%eax,%esi,1),%eax + andl %ecx,%edi + rorl $2,%ecx + movl 48(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 56(%esp),%esi + roll $5,%edi + xorl 16(%esp),%esi + /* 20_39 60 */ + xorl 36(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,48(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 3395469782(%ebp,%esi,1),%ebp + movl 52(%esp),%esi + xorl %ecx,%edi + xorl 60(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 20(%esp),%esi + roll $5,%edi + /* 20_39 61 */ + xorl 40(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,52(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 3395469782(%edx,%esi,1),%edx + movl 56(%esp),%esi + xorl %ebx,%edi + xorl (%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 24(%esp),%esi + roll $5,%edi + /* 20_39 62 */ + xorl 44(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,56(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 3395469782(%ecx,%esi,1),%ecx + movl 60(%esp),%esi + xorl %eax,%edi + xorl 4(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 28(%esp),%esi + roll $5,%edi + /* 20_39 63 */ + xorl 48(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,60(%esp) + xorl %edx,%edi + rorl $2,%edx + leal 3395469782(%ebx,%esi,1),%ebx + movl (%esp),%esi + xorl %ebp,%edi + xorl 8(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 32(%esp),%esi + roll $5,%edi + /* 20_39 64 */ + xorl 52(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl %esi,(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 3395469782(%eax,%esi,1),%eax + movl 4(%esp),%esi + xorl %edx,%edi + xorl 12(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 36(%esp),%esi + roll $5,%edi + /* 20_39 65 */ + xorl 56(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,4(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 3395469782(%ebp,%esi,1),%ebp + movl 8(%esp),%esi + xorl %ecx,%edi + xorl 16(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 40(%esp),%esi + 
roll $5,%edi + /* 20_39 66 */ + xorl 60(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,8(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 3395469782(%edx,%esi,1),%edx + movl 12(%esp),%esi + xorl %ebx,%edi + xorl 20(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 44(%esp),%esi + roll $5,%edi + /* 20_39 67 */ + xorl (%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,12(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 3395469782(%ecx,%esi,1),%ecx + movl 16(%esp),%esi + xorl %eax,%edi + xorl 24(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 48(%esp),%esi + roll $5,%edi + /* 20_39 68 */ + xorl 4(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,16(%esp) + xorl %edx,%edi + rorl $2,%edx + leal 3395469782(%ebx,%esi,1),%ebx + movl 20(%esp),%esi + xorl %ebp,%edi + xorl 28(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 52(%esp),%esi + roll $5,%edi + /* 20_39 69 */ + xorl 8(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl %esi,20(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 3395469782(%eax,%esi,1),%eax + movl 24(%esp),%esi + xorl %edx,%edi + xorl 32(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 56(%esp),%esi + roll $5,%edi + /* 20_39 70 */ + xorl 12(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,24(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 3395469782(%ebp,%esi,1),%ebp + movl 28(%esp),%esi + xorl %ecx,%edi + xorl 36(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 60(%esp),%esi + roll $5,%edi + /* 20_39 71 */ + xorl 16(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,28(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 3395469782(%edx,%esi,1),%edx + movl 32(%esp),%esi + xorl %ebx,%edi + xorl 40(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl (%esp),%esi + roll $5,%edi + /* 20_39 72 */ + xorl 20(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,32(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 
3395469782(%ecx,%esi,1),%ecx + movl 36(%esp),%esi + xorl %eax,%edi + xorl 44(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 4(%esp),%esi + roll $5,%edi + /* 20_39 73 */ + xorl 24(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,36(%esp) + xorl %edx,%edi + rorl $2,%edx + leal 3395469782(%ebx,%esi,1),%ebx + movl 40(%esp),%esi + xorl %ebp,%edi + xorl 48(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 8(%esp),%esi + roll $5,%edi + /* 20_39 74 */ + xorl 28(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl %esi,40(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 3395469782(%eax,%esi,1),%eax + movl 44(%esp),%esi + xorl %edx,%edi + xorl 52(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 12(%esp),%esi + roll $5,%edi + /* 20_39 75 */ + xorl 32(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,44(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 3395469782(%ebp,%esi,1),%ebp + movl 48(%esp),%esi + xorl %ecx,%edi + xorl 56(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 16(%esp),%esi + roll $5,%edi + /* 20_39 76 */ + xorl 36(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,48(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 3395469782(%edx,%esi,1),%edx + movl 52(%esp),%esi + xorl %ebx,%edi + xorl 60(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 20(%esp),%esi + roll $5,%edi + /* 20_39 77 */ + xorl 40(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + xorl %ebp,%edi + rorl $2,%ebp + leal 3395469782(%ecx,%esi,1),%ecx + movl 56(%esp),%esi + xorl %eax,%edi + xorl (%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 24(%esp),%esi + roll $5,%edi + /* 20_39 78 */ + xorl 44(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + xorl %edx,%edi + rorl $2,%edx + leal 3395469782(%ebx,%esi,1),%ebx + movl 60(%esp),%esi + xorl %ebp,%edi + xorl 4(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 28(%esp),%esi + roll $5,%edi + /* 20_39 79 */ + xorl 48(%esp),%esi + addl %edi,%ebx + 
roll $1,%esi + movl %ebp,%edi + xorl %ecx,%edi + rorl $2,%ecx + leal 3395469782(%eax,%esi,1),%eax + xorl %edx,%edi + addl %edi,%eax + movl %ebx,%edi + roll $5,%edi + addl %edi,%eax + /* Loop trailer */ + movl 84(%esp),%edi + movl 88(%esp),%esi + addl 16(%edi),%ebp + addl 12(%edi),%edx + addl 8(%edi),%ecx + addl 4(%edi),%ebx + addl (%edi),%eax + addl $64,%esi + movl %ebp,16(%edi) + movl %edx,12(%edi) + cmpl 92(%esp),%esi + movl %ecx,8(%edi) + movl %ebx,4(%edi) + movl %eax,(%edi) + jb .L000loop + addl $64,%esp + popl %edi + popl %esi + popl %ebx + popl %ebp + ret +.L_sha1_block_data_order_end: +.size sha1_block_data_order,.L_sha1_block_data_order_end-sha1_block_data_order +.byte 83,72,65,49,32,98,108,111,99,107,32,116,114,97,110,115,102,111,114,109,32,102,111,114,32,120,56,54,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0 ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  8:01 ` George Spelvin
@ 2009-08-04 20:41 ` Junio C Hamano
  2009-08-05 18:17   ` George Spelvin
  0 siblings, 1 reply; 60+ messages in thread
From: Junio C Hamano @ 2009-08-04 20:41 UTC (permalink / raw)
  To: George Spelvin; +Cc: torvalds, git

"George Spelvin" <linux@horizon.com> writes:

> The one question I have is that currently perl is not a critical
> compile-time dependency; it's needed for some extra stuff, but AFAIK you
> can get most of git working without it.  Whether to add that dependency
> or what is a Junio question.

I actually feel a lot more uneasy applying a patch signed off by
somebody who calls himself George Spelvin, though.

Three classes of people compile git from the source:

 * People who want to be on the bleeding edge and compile git for
   themselves, even though they are on mainstream platforms where they
   could choose a distro-packaged one;

 * People who produce binary packages for distribution;

 * People who are on minority platforms and have no other way to get git
   than compiling it for themselves.

We do not have to worry about the first two groups of people.  It won't
be too involved for them to install Perl on their system; after all they
are already coping with asciidoc and xmlto ;-)

We can continue shipping the Mozilla one to help the last group.  In the
Makefile, we say:

  # Define NO_OPENSSL environment variable if you do not have OpenSSL.
  # This also implies MOZILLA_SHA1.

and with your change, we would start implying STANDALONE_OPENSSL_SHA1
instead.  But if MOZILLA_SHA1 was given explicitly, we could use that.

If they really really really want the extra performance out of a
statically linked OpenSSL derivative, they could prepare a preprocessed
assembly file on some other machine and use it as the last resort if
they do not have/want Perl.  The situation is exactly the same as for
the documentation set.  They are using HTML/man pages prepared on
another machine (namely, mine) as the last resort if they do not
have/want the AsciiDoc toolchain.

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04 20:41 ` Junio C Hamano
@ 2009-08-05 18:17   ` George Spelvin
  2009-08-05 20:36     ` Johannes Schindelin
                        ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: George Spelvin @ 2009-08-05 18:17 UTC (permalink / raw)
  To: gitster; +Cc: git, linux, torvalds

> Three classes of people compile git from the source:
>
>  * People who want to be on the bleeding edge and compile git for
>    themselves, even though they are on mainstream platforms where they
>    could choose a distro-packaged one;
>
>  * People who produce binary packages for distribution;
>
>  * People who are on minority platforms and have no other way to get git
>    than compiling for themselves.
>
> We do not have to worry about the first two groups of people.  It won't
> be too involved for them to install Perl on their system; after all they
> are already coping with asciidoc and xmlto ;-)

Actually, I'd get rid of the perl entirely, but I'm not sure whether the
other-assembler-syntax features are needed by the folks on MacOS X and
Windows (msysgit).

> We can continue shipping the mozilla one to help the last group.

Of course, we always need a C fallback.  Would you like a faster one?

> In the Makefile, we say:
>
>   # Define NO_OPENSSL environment variable if you do not have OpenSSL.
>   # This also implies MOZILLA_SHA1.
>
> and with your change, we would start implying STANDALONE_OPENSSL_SHA1
> instead.  But if MOZILLA_SHA1 was given explicitly, we could use that.

Well, I'd really like to auto-detect the processor.  Current gcc's
"gcc -v" output includes a "Target: " line that will do nicely.  I can,
of course, fall back to C if it fails, but is there a significant user
base using a non-GCC compiler?

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-05 18:17 ` George Spelvin @ 2009-08-05 20:36 ` Johannes Schindelin 2009-08-05 20:44 ` Junio C Hamano 2009-08-05 20:55 ` Linus Torvalds 2 siblings, 0 replies; 60+ messages in thread From: Johannes Schindelin @ 2009-08-05 20:36 UTC (permalink / raw) To: George Spelvin; +Cc: gitster, git, torvalds Hi, On Wed, 5 Aug 2009, George Spelvin wrote: > > Three classes of people compile git from the source: > > > > * People who want to be on the bleeding edge and compile git for > > themselves, even though they are on mainstream platforms where they > > could choose distro-packaged one; > > > > * People who produce binary packages for distribution. > > > > * People who are on minority platforms and have no other way to get git > > than compiling for themselves; > > > > We do not have to worry about the first two groups of people. It won't > > be too involved for them to install Perl on their system; after all they > > are already coping with asciidoc and xmlto ;-) > > Actually, I'd get rid of the perl entirely, but I'm not sure how > necessary the other-assembler-syntax features are needed by the > folks on MacOS X and Windows (msysgit). Don't worry for MacOSX and msysGit (or Cygwin, for that matter): all of them use GCC. > > We can continue shipping mozilla one to help the last group. > > Of course, we always need a C fallback. Would you like a faster one? Is that a trick question? :-) > > In the Makefile, we say: > > > > # Define NO_OPENSSL environment variable if you do not have OpenSSL. > > # This also implies MOZILLA_SHA1. > > > > and with your change, we would start implying STANDALONE_OPENSSL_SHA1 > > instead. But if MOZILLA_SHA1 was given explicitly, we could use that. > > Well, I'd really like to auto-detect the processor. Current gcc's > "gcc -v" output includes a "Target: " line that will do nicely. I can, > of course, fall back to C if it fails, but is there a significant user > base using a non-GCC compiler? 
Do you really want to determine which processor to optimize for at compile time? Build system and target system are often different... Ciao, Dscho ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL
  2009-08-05 18:17   ` George Spelvin
  2009-08-05 20:36     ` Johannes Schindelin
@ 2009-08-05 20:44     ` Junio C Hamano
  2009-08-05 20:55     ` Linus Torvalds
  2 siblings, 0 replies; 60+ messages in thread
From: Junio C Hamano @ 2009-08-05 20:44 UTC (permalink / raw)
  To: George Spelvin; +Cc: git, torvalds

"George Spelvin" <linux@horizon.com> writes:

>> We can continue shipping mozilla one to help the last group.
>
> Of course, we always need a C fallback.  Would you like a faster one?

No.  I'd rather keep the tested-and-tried one while a better
alternative is in a work-in-progress state.

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL
  2009-08-05 18:17   ` George Spelvin
  2009-08-05 20:36     ` Johannes Schindelin
  2009-08-05 20:44     ` Junio C Hamano
@ 2009-08-05 20:55     ` Linus Torvalds
  2009-08-05 23:13       ` Linus Torvalds
  2 siblings, 1 reply; 60+ messages in thread
From: Linus Torvalds @ 2009-08-05 20:55 UTC (permalink / raw)
  To: George Spelvin; +Cc: gitster, git

On Wed, 5 Aug 2009, George Spelvin wrote:
>
> > We can continue shipping mozilla one to help the last group.
>
> Of course, we always need a C fallback.  Would you like a faster one?

I actually looked at code generation (on x86-64) for the C fallback, and
it should be quite doable to re-write the C one to generate good code on
x86-64.

On 32-bit x86, I suspect the register pressures are so intense that it's
unrealistic to expect gcc to do a good job, but the Mozilla SHA1 C code
really seems _designed_ to be slow in stupid ways (that whole "byte at a
time into a word buffer with shifts" is a really really sucky way to
handle the endianness issues).

So if you'd like to look at the C version, that's definitely worth it.
Much bigger bang for the buck than trying to schedule asm language and
having to deal with different assemblers/linkers/whatnot.

		Linus

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL
  2009-08-05 20:55     ` Linus Torvalds
@ 2009-08-05 23:13       ` Linus Torvalds
  2009-08-06  1:18         ` Linus Torvalds
  0 siblings, 1 reply; 60+ messages in thread
From: Linus Torvalds @ 2009-08-05 23:13 UTC (permalink / raw)
  To: George Spelvin; +Cc: gitster, git

On Wed, 5 Aug 2009, Linus Torvalds wrote:
>
> I actually looked at code generation (on x86-64) for the C fallback, and
> it should be quite doable to re-write the C one to generate good code on
> x86-64.

Ok, here's a try.  It's based on the mozilla SHA1 code, but with quite a
bit of surgery.  Enable with "make BLK_SHA1=1".

Timings for "git fsck --full" on the git directory:

 - Mozilla SHA1 portable C-code (sucky sucky): MOZILLA_SHA1=1

	real	0m38.194s
	user	0m37.838s
	sys	0m0.356s

 - This code ("half-portable C code"): BLK_SHA1=1

	real	0m28.120s
	user	0m27.930s
	sys	0m0.192s

 - OpenSSL assembler code:

	real	0m26.327s
	user	0m26.194s
	sys	0m0.136s

ie this is slightly slower than the OpenSSL SHA1 routines, but that's
only true on something very SHA1-intensive like "git fsck", and this is
_almost_ portable code.

I say "almost" because it really does require that we can do unaligned
word loads, and do a good job of 'htonl()', and it assumes that
'unsigned int' is 32-bit (the latter would be easy to change by using
'uint32_t', but since it's not the relevant portability issue, I don't
think it matters).

In other words, unlike the Mozilla SHA1, this one doesn't suck.  It's
certainly not great either, but it's probably good enough in practice,
without the headaches of actually making people use an assembler
version.

And maybe somebody can see how to improve it further?

		Linus

---
From: Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH] Add new optimized C 'block-sha1' routines

Based on the mozilla SHA1 routine, but doing the input data accesses a
word at a time and with 'htonl()' instead of loading bytes and shifting.
It requires an architecture that is ok with unaligned 32-bit loads and
a fast htonl().

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 Makefile          |    9 +++
 block-sha1/sha1.c |  145 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 block-sha1/sha1.h |   21 ++++++++
 3 files changed, 175 insertions(+), 0 deletions(-)

diff --git a/Makefile b/Makefile
index d7669b1..f12024c 100644
--- a/Makefile
+++ b/Makefile
@@ -84,6 +84,10 @@ all::
 # specify your own (or DarwinPort's) include directories and
 # library directories by defining CFLAGS and LDFLAGS appropriately.
 #
+# Define BLK_SHA1 environment variable if you want the C version
+# of the SHA1 that assumes you can do unaligned 32-bit loads and
+# have a fast htonl() function.
+#
 # Define PPC_SHA1 environment variable when running make to make use of
 # a bundled SHA1 routine optimized for PowerPC.
 #
@@ -1166,6 +1170,10 @@ ifdef NO_DEFLATE_BOUND
 	BASIC_CFLAGS += -DNO_DEFLATE_BOUND
 endif
 
+ifdef BLK_SHA1
+	SHA1_HEADER = "block-sha1/sha1.h"
+	LIB_OBJS += block-sha1/sha1.o
+else
 ifdef PPC_SHA1
 	SHA1_HEADER = "ppc/sha1.h"
 	LIB_OBJS += ppc/sha1.o ppc/sha1ppc.o
@@ -1183,6 +1191,7 @@ else
 endif
 endif
 endif
+endif
 ifdef NO_PERL_MAKEMAKER
 	export NO_PERL_MAKEMAKER
 endif
diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
new file mode 100644
index 0000000..8fd90b0
--- /dev/null
+++ b/block-sha1/sha1.c
@@ -0,0 +1,145 @@
+/*
+ * Based on the Mozilla SHA1 (see mozilla-sha1/sha1.c),
+ * optimized to do word accesses rather than byte accesses,
+ * and to avoid unnecessary copies into the context array.
+ */
+
+#include <string.h>
+#include <arpa/inet.h>
+
+#include "sha1.h"
+
+/* Hash one 64-byte block of data */
+static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data);
+
+void blk_SHA1_Init(blk_SHA_CTX *ctx)
+{
+	ctx->lenW = 0;
+	ctx->size = 0;
+
+	/* Initialize H with the magic constants (see FIPS180 for constants)
+	 */
+	ctx->H[0] = 0x67452301;
+	ctx->H[1] = 0xefcdab89;
+	ctx->H[2] = 0x98badcfe;
+	ctx->H[3] = 0x10325476;
+	ctx->H[4] = 0xc3d2e1f0;
+}
+
+
+void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, int len)
+{
+	int lenW = ctx->lenW;
+
+	ctx->size += len << 3;
+
+	/* Read the data into W and process blocks as they get full
+	 */
+	if (lenW) {
+		int left = 64 - lenW;
+		if (len < left)
+			left = len;
+		memcpy(lenW + (char *)ctx->W, data, left);
+		lenW = (lenW + left) & 63;
+		len -= left;
+		data += left;
+		ctx->lenW = lenW;
+		if (lenW)
+			return;
+		blk_SHA1Block(ctx, ctx->W);
+	}
+	while (len >= 64) {
+		blk_SHA1Block(ctx, data);
+		data += 64;
+		len -= 64;
+	}
+	if (len) {
+		memcpy(ctx->W, data, len);
+		ctx->lenW = len;
+	}
+}
+
+
+void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
+{
+	static const unsigned char pad[64] = { 0x80 };
+	unsigned int padlen[2];
+	int i;
+
+	/* Pad with a binary 1 (ie 0x80), then zeroes, then length
+	 */
+	padlen[0] = htonl(ctx->size >> 32);
+	padlen[1] = htonl(ctx->size);
+
+	blk_SHA1_Update(ctx, pad, 1+ (63 & (55 - ctx->lenW)));
+	blk_SHA1_Update(ctx, padlen, 8);
+
+	/* Output hash
+	 */
+	for (i = 0; i < 5; i++)
+		((unsigned int *)hashout)[i] = htonl(ctx->H[i]);
+}
+
+#define SHA_ROT(X,n) (((X) << (n)) | ((X) >> (32-(n))))
+
+static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
+{
+	int t;
+	unsigned int A,B,C,D,E,TEMP;
+	unsigned int W[80];
+
+	for (t = 0; t < 16; t++)
+		W[t] = htonl(data[t]);
+
+	/* Unroll it?
+	 */
+	for (t = 16; t <= 79; t++)
+		W[t] = SHA_ROT(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
+
+	A = ctx->H[0];
+	B = ctx->H[1];
+	C = ctx->H[2];
+	D = ctx->H[3];
+	E = ctx->H[4];
+
+#define T_0_19(t) \
+	TEMP = SHA_ROT(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \
+	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+
+	T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4);
+	T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9);
+	T_0_19(10); T_0_19(11); T_0_19(12); T_0_19(13); T_0_19(14);
+	T_0_19(15); T_0_19(16); T_0_19(17); T_0_19(18); T_0_19(19);
+
+#define T_20_39(t) \
+	TEMP = SHA_ROT(A,5) + (B^C^D) + E + W[t] + 0x6ed9eba1; \
+	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+
+	T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24);
+	T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29);
+	T_20_39(30); T_20_39(31); T_20_39(32); T_20_39(33); T_20_39(34);
+	T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39);
+
+#define T_40_59(t) \
+	TEMP = SHA_ROT(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \
+	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+
+	T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44);
+	T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49);
+	T_40_59(50); T_40_59(51); T_40_59(52); T_40_59(53); T_40_59(54);
+	T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59);
+
+#define T_60_79(t) \
+	TEMP = SHA_ROT(A,5) + (B^C^D) + E + W[t] + 0xca62c1d6; \
+	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+
+	T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64);
+	T_60_79(65); T_60_79(66); T_60_79(67); T_60_79(68); T_60_79(69);
+	T_60_79(70); T_60_79(71); T_60_79(72); T_60_79(73); T_60_79(74);
+	T_60_79(75); T_60_79(76); T_60_79(77); T_60_79(78); T_60_79(79);
+
+	ctx->H[0] += A;
+	ctx->H[1] += B;
+	ctx->H[2] += C;
+	ctx->H[3] += D;
+	ctx->H[4] += E;
+}
diff --git a/block-sha1/sha1.h b/block-sha1/sha1.h
new file mode 100644
index 0000000..dbc719f
--- /dev/null
+++ b/block-sha1/sha1.h
@@ -0,0 +1,21 @@
+/*
+ * Based on the Mozilla SHA1 (see mozilla-sha1/sha1.h),
+ * optimized to do word accesses rather than byte accesses,
+ * and to avoid unnecessary copies into the context array.
+ */
+
+typedef struct {
+	unsigned int H[5];
+	unsigned int W[16];
+	int lenW;
+	unsigned long long size;
+} blk_SHA_CTX;
+
+void blk_SHA1_Init(blk_SHA_CTX *ctx);
+void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *dataIn, int len);
+void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx);
+
+#define git_SHA_CTX	blk_SHA_CTX
+#define git_SHA1_Init	blk_SHA1_Init
+#define git_SHA1_Update	blk_SHA1_Update
+#define git_SHA1_Final	blk_SHA1_Final

^ permalink raw reply related	[flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL
  2009-08-05 23:13       ` Linus Torvalds
@ 2009-08-06  1:18         ` Linus Torvalds
  2009-08-06  1:52           ` Nicolas Pitre
  2009-08-06 18:49           ` Erik Faye-Lund
  0 siblings, 2 replies; 60+ messages in thread
From: Linus Torvalds @ 2009-08-06  1:18 UTC (permalink / raw)
  To: George Spelvin; +Cc: gitster, git

On Wed, 5 Aug 2009, Linus Torvalds wrote:
>
> Timings for "git fsck --full" on the git directory:
>
>  - Mozilla SHA1 portable C-code (sucky sucky): MOZILLA_SHA1=1
>
>	real	0m38.194s
>	user	0m37.838s
>	sys	0m0.356s
>
>  - This code ("half-portable C code"): BLK_SHA1=1
>
>	real	0m28.120s
>	user	0m27.930s
>	sys	0m0.192s
>
>  - OpenSSL assembler code:
>
>	real	0m26.327s
>	user	0m26.194s
>	sys	0m0.136s

Ok, I installed the 32-bit libraries too, to see what it looks like for
that case.  As expected, the compiler is not able to do a great job due
to it being somewhat register starved, but on the other hand, the old
Mozilla code did even worse, so..

 - Mozilla SHA:

	real	0m47.063s
	user	0m46.815s
	sys	0m0.252s

 - BLK_SHA1=1

	real	0m34.705s
	user	0m34.394s
	sys	0m0.312s

 - OPENSSL:

	real	0m29.754s
	user	0m29.446s
	sys	0m0.288s

so the tuned asm from OpenSSL does kick ass, but the C code version
isn't _that_ far away.  It's quite a reasonable alternative if you
don't have the OpenSSL libraries installed, for example.  I note that
MINGW does NO_OPENSSL by default, for example, and maybe the MINGW
people want to test the patch out and enable BLK_SHA1 rather than the
original Mozilla one.

But while looking at 32-bit issues, I noticed that I really should also
cast 'len' when shifting it.  Otherwise the thing is limited to fairly
small areas (28 bits - 256MB).  This is not just a 32-bit problem
("int" is a signed 32-bit thing even in a 64-bit build), but I only
noticed it when looking at 32-bit issues.

So here's an incremental patch to fix that.

		Linus

---
 block-sha1/sha1.c |    4 ++--
 block-sha1/sha1.h |    2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index 8fd90b0..eef32f7 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -27,11 +27,11 @@ void blk_SHA1_Init(blk_SHA_CTX *ctx)
 }
 
 
-void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, int len)
+void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
 {
 	int lenW = ctx->lenW;
 
-	ctx->size += len << 3;
+	ctx->size += (unsigned long long) len << 3;
 
 	/* Read the data into W and process blocks as they get full
 	 */
diff --git a/block-sha1/sha1.h b/block-sha1/sha1.h
index dbc719f..7be2d93 100644
--- a/block-sha1/sha1.h
+++ b/block-sha1/sha1.h
@@ -12,7 +12,7 @@ typedef struct {
 } blk_SHA_CTX;
 
 void blk_SHA1_Init(blk_SHA_CTX *ctx);
-void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *dataIn, int len);
+void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *dataIn, unsigned long len);
 void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx);
 
 #define git_SHA_CTX blk_SHA_CTX

^ permalink raw reply related	[flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  1:18         ` Linus Torvalds
@ 2009-08-06  1:52           ` Nicolas Pitre
  2009-08-06  2:04             ` Junio C Hamano
  2009-08-06  2:08             ` Linus Torvalds
  0 siblings, 2 replies; 60+ messages in thread
From: Nicolas Pitre @ 2009-08-06  1:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: George Spelvin, Junio C Hamano, git

On Wed, 5 Aug 2009, Linus Torvalds wrote:

> But while looking at 32-bit issues, I noticed that I really should also
> cast 'len' when shifting it.  Otherwise the thing is limited to fairly
> small areas (28 bits - 256MB).  This is not just a 32-bit problem
> ("int" is a signed 32-bit thing even in a 64-bit build), but I only
> noticed it when looking at 32-bit issues.

Even better is to not shift len at all in SHA_update() but shift
ctx->size only at the end in SHA_final().  It is not as if SHA_update()
could operate on partial bytes, so counting total bytes instead of
total bits is all you need.  This way you need no cast there and make
the code slightly faster.


Nicolas

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  1:52           ` Nicolas Pitre
@ 2009-08-06  2:04             ` Junio C Hamano
  2009-08-06  2:10               ` Linus Torvalds
  2009-08-06  2:20               ` Nicolas Pitre
  1 sibling, 2 replies; 60+ messages in thread
From: Junio C Hamano @ 2009-08-06  2:04 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Linus Torvalds, George Spelvin, git

Nicolas Pitre <nico@cam.org> writes:

> On Wed, 5 Aug 2009, Linus Torvalds wrote:
>
>> But while looking at 32-bit issues, I noticed that I really should also
>> cast 'len' when shifting it.  Otherwise the thing is limited to fairly
>> small areas (28 bits - 256MB).  This is not just a 32-bit problem ("int" is
>> a signed 32-bit thing even in a 64-bit build), but I only noticed it when
>> looking at 32-bit issues.
>
> Even better is to not shift len at all in SHA_update() but shift
> ctx->size only at the end in SHA_final().  It is not like if
> SHA_update() could operate on partial bytes, so counting total bytes
> instead of total bits is all you need.  This way you need no cast there
> and make the code slightly faster.

Like this?

By the way, the Mozilla one calls Init at the end of Final but
block-sha1 doesn't.  I do not think it matters for our callers, but on
the other hand Final is not a performance-critical part nor is Init
heavy, so it may not be a bad idea to imitate them as well.  Or am I
missing something?

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index eef32f7..8293f7b 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -31,7 +31,7 @@ void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
 {
 	int lenW = ctx->lenW;
 
-	ctx->size += (unsigned long long) len << 3;
+	ctx->size += (unsigned long long) len;
 
 	/* Read the data into W and process blocks as they get full
 	 */
@@ -68,6 +68,7 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
 
 	/* Pad with a binary 1 (ie 0x80), then zeroes, then length
 	 */
+	ctx->size <<= 3;	/* bytes to bits */
 	padlen[0] = htonl(ctx->size >> 32);
 	padlen[1] = htonl(ctx->size);
 
^ permalink raw reply related	[flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 2:04 ` Junio C Hamano @ 2009-08-06 2:10 ` Linus Torvalds 2009-08-06 2:20 ` Nicolas Pitre 1 sibling, 0 replies; 60+ messages in thread From: Linus Torvalds @ 2009-08-06 2:10 UTC (permalink / raw) To: Junio C Hamano; +Cc: Nicolas Pitre, George Spelvin, git On Wed, 5 Aug 2009, Junio C Hamano wrote: > > Like this? No, combine it with the other shifts: Yes: > - ctx->size += (unsigned long long) len << 3; > + ctx->size += (unsigned long long) len; No: > + ctx->size <<= 3; /* bytes to bits */ > padlen[0] = htonl(ctx->size >> 32); > padlen[1] = htonl(ctx->size); Do padlen[0] = htonl(ctx->size >> 29); padlen[1] = htonl(ctx->size << 3); instead. Or whatever. Linus ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 2:04 ` Junio C Hamano 2009-08-06 2:10 ` Linus Torvalds @ 2009-08-06 2:20 ` Nicolas Pitre 1 sibling, 0 replies; 60+ messages in thread From: Nicolas Pitre @ 2009-08-06 2:20 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, George Spelvin, git On Wed, 5 Aug 2009, Junio C Hamano wrote: > Nicolas Pitre <nico@cam.org> writes: > > > On Wed, 5 Aug 2009, Linus Torvalds wrote: > > > >> But while looking at 32-bit issues, I noticed that I really should also > >> cast 'len' when shifting it. Otherwise the thing is limited to fairly > >> small areas (28 bits - 256MB). This is not just a 32-bit problem ("int" is > >> a signed 32-bit thing even in a 64-bit build), but I only noticed it when > >> looking at 32-bit issues. > > > > Even better is to not shift len at all in SHA_update() but shift > > ctx->size only at the end in SHA_final(). It is not like if > > SHA_update() could operate on partial bytes, so counting total bytes > > instead of total bits is all you need. This way you need no cast there > > and make the code slightly faster. > > Like this? Almost (see below). > By the way, Mozilla one calls Init at the end of Final but block-sha1 > doesn't. I do not think it matters for our callers, but on the other hand > FInal is not performance critical part nor Init is heavy, so it may not be > a bad idea to imitate them as well. Or am I missing something? It is done only to make sure potentially crypto sensitive information is wiped out of the ctx structure instance. In our case we have no such concerns. > diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c > index eef32f7..8293f7b 100644 > --- a/block-sha1/sha1.c > +++ b/block-sha1/sha1.c > @@ -31,7 +31,7 @@ void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len) > { > int lenW = ctx->lenW; > > - ctx->size += (unsigned long long) len << 3; > + ctx->size += (unsigned long long) len; You can get rid of the cast as well now. 
> /* Read the data into W and process blocks as they get full > */ > @@ -68,6 +68,7 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx) > > /* Pad with a binary 1 (ie 0x80), then zeroes, then length > */ > + ctx->size <<= 3; /* bytes to bits */ > padlen[0] = htonl(ctx->size >> 32); > padlen[1] = htonl(ctx->size); Instead, I'd do: padlen[0] = htonl(ctx->size >> (32 - 3)); padlen[1] = htonl(ctx->size << 3); That would eliminate a redundant write back of ctx->size. Nicolas ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 1:52 ` Nicolas Pitre 2009-08-06 2:04 ` Junio C Hamano @ 2009-08-06 2:08 ` Linus Torvalds 2009-08-06 3:19 ` Artur Skawina 1 sibling, 1 reply; 60+ messages in thread From: Linus Torvalds @ 2009-08-06 2:08 UTC (permalink / raw) To: Nicolas Pitre; +Cc: George Spelvin, Junio C Hamano, git On Wed, 5 Aug 2009, Nicolas Pitre wrote: > > Even better is to not shift len at all in SHA_update() but shift > ctx->size only at the end in SHA_final(). It is not like if > SHA_update() could operate on partial bytes, so counting total bytes > instead of total bits is all you need. This way you need no cast there > and make the code slightly faster. Yeah, I tried it, but it's not noticeable. The bigger issue seems to be that it's shifter-limited, or that's what I take away from my profiles. I suspect it's even _more_ shifter-limited on some other micro-architectures, because gcc is being stupid, and generates ror $31,%eax from the "left shift + right shift" combination. It seems to -always- generate a "ror", rather than trying to generate 'rol' if the shift count would be smaller that way. And I know _some_ old micro-architectures will literally internally loop on the rol/ror counts, so "ror $31" can be _much_ more expensive than "rol $1". That isn't the case on my Nehalem, though. But I can't seem to get gcc to generate better code without actually using inline asm.. (So to clarify: this patch makes no difference that I can see to performance, but I suspect it could matter on other CPU's like an old Pentium or maybe an Atom). 
Linus --- block-sha1/sha1.c | 36 ++++++++++++++++++++++++------------ block-sha1/sha1.h | 2 +- 2 files changed, 25 insertions(+), 13 deletions(-) diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c index 8fd90b0..a45a3de 100644 --- a/block-sha1/sha1.c +++ b/block-sha1/sha1.c @@ -80,7 +80,19 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx) ((unsigned int *)hashout)[i] = htonl(ctx->H[i]); } -#define SHA_ROT(X,n) (((X) << (n)) | ((X) >> (32-(n)))) +#if defined(__i386__) || defined(__x86_64__) + +#define SHA_ASM(op, x, n) ({ unsigned int __res; asm(op " %1,%0":"=r" (__res):"i" (n), "0" (x)); __res; }) +#define SHA_ROL(x,n) SHA_ASM("rol", x, n) +#define SHA_ROR(x,n) SHA_ASM("ror", x, n) + +#else + +#define SHA_ROT(X,l,r) (((X) << (l)) | ((X) >> (r))) +#define SHA_ROL(X,n) SHA_ROT(X,n,32-(n)) +#define SHA_ROR(X,n) SHA_ROT(X,32-(n),n) + +#endif static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) { @@ -93,7 +105,7 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) /* Unroll it? 
*/ for (t = 16; t <= 79; t++) - W[t] = SHA_ROT(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); + W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); A = ctx->H[0]; B = ctx->H[1]; @@ -102,8 +114,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) E = ctx->H[4]; #define T_0_19(t) \ - TEMP = SHA_ROT(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \ - E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP; + TEMP = SHA_ROL(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \ + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4); T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9); @@ -111,8 +123,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) T_0_19(15); T_0_19(16); T_0_19(17); T_0_19(18); T_0_19(19); #define T_20_39(t) \ - TEMP = SHA_ROT(A,5) + (B^C^D) + E + W[t] + 0x6ed9eba1; \ - E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP; + TEMP = SHA_ROL(A,5) + (B^C^D) + E + W[t] + 0x6ed9eba1; \ + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24); T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29); @@ -120,8 +132,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39); #define T_40_59(t) \ - TEMP = SHA_ROT(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \ - E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP; + TEMP = SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \ + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44); T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49); @@ -129,8 +141,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59); #define T_60_79(t) \ - TEMP = SHA_ROT(A,5) + (B^C^D) + E + W[t] + 0xca62c1d6; \ - E = D; D = C; C = SHA_ROT(B, 30); 
B = A; A = TEMP; + TEMP = SHA_ROL(A,5) + (B^C^D) + E + W[t] + 0xca62c1d6; \ + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64); T_60_79(65); T_60_79(66); T_60_79(67); T_60_79(68); T_60_79(69); ^ permalink raw reply related [flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 2:08 ` Linus Torvalds @ 2009-08-06 3:19 ` Artur Skawina 2009-08-06 3:31 ` Linus Torvalds 0 siblings, 1 reply; 60+ messages in thread From: Artur Skawina @ 2009-08-06 3:19 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git Linus Torvalds wrote: > > The bigger issue seems to be that it's shifter-limited, or that's what I > take away from my profiles. I suspect it's even _more_ shifter-limited on > some other micro-architectures, because gcc is being stupid, and generates > > ror $31,%eax > > from the "left shift + right shift" combination. It seems to -always- > generate a "ror", rather than trying to generate 'rot' if the shift count > would be smaller that way. > > And I know _some_ old micro-architectures will literally internally loop > on the rol/ror counts, so "ror $31" can be _much_ more expensive than "rol > $1". > > That isn't the case on my Nehalem, though. But I can't seem to get gcc to > generate better code without actually using inline asm.. The compiler does the right thing w/ something like this: +#if __GNUC__>1 && defined(__i386) +#define SHA_ROT(data,bits) ({ \ + unsigned d = (data); \ + if (bits<16) \ + __asm__ ("roll %1,%0" : "=r" (d) : "I" (bits), "0" (d)); \ + else \ + __asm__ ("rorl %1,%0" : "=r" (d) : "I" (32-bits), "0" (d)); \ + d; \ + }) +#else #define SHA_ROT(X,n) (((X) << (n)) | ((X) >> (32-(n)))) +#endif which doesn't obfuscate the code as much. (I needed the asm on p4 anyway, as w/o it the mozilla version is even slower than an rfc3174 one. rol vs ror makes no measurable difference) > static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) > { > @@ -93,7 +105,7 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) > > /* Unroll it? 
*/ > for (t = 16; t <= 79; t++) > - W[t] = SHA_ROT(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); > + W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); unrolling this once (but not more) is a win, at least on p4. > #define T_0_19(t) \ > - TEMP = SHA_ROT(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \ > - E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP; > + TEMP = SHA_ROL(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \ > + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; > > T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4); > T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9); unrolling these otoh is a clear loss (iirc ~10%). artur ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 3:19 ` Artur Skawina @ 2009-08-06 3:31 ` Linus Torvalds 2009-08-06 3:48 ` Linus Torvalds 2009-08-06 4:08 ` Artur Skawina 0 siblings, 2 replies; 60+ messages in thread From: Linus Torvalds @ 2009-08-06 3:31 UTC (permalink / raw) To: Artur Skawina; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git On Thu, 6 Aug 2009, Artur Skawina wrote: > > > #define T_0_19(t) \ > > - TEMP = SHA_ROT(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \ > > - E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP; > > + TEMP = SHA_ROL(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \ > > + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; > > > > T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4); > > T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9); > > unrolling these otoh is a clear loss (iirc ~10%). I can well imagine. The P4 decode bandwidth is abysmal unless you get things into the trace cache, and the trace cache is of a very limited size. However, on at least Nehalem, unrolling it all is quite a noticeable win. The way it's written, I can easily make it do one or the other by just turning the macro inside a loop (and we can have a preprocessor flag to choose one or the other), but let me work on it a bit more first. I'm trying to move the htonl() inside the loops (the same way I suggested George do with his assembly), and it seems to help a tiny bit. But I may be measuring noise. However, right now my biggest profile hit is on this irritating loop: /* Unroll it? */ for (t = 16; t <= 79; t++) W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); and I haven't been able to move _that_ into the other iterations yet. Here's my micro-optimization update. It does the first 16 rounds (of the first 20-round thing) specially, and takes the data directly from the input array. I'm _this_ close to breaking the 28-second barrier on git-fsck, but not quite yet. 
Linus --- From: Linus Torvalds <torvalds@linux-foundation.org> Subject: [PATCH] block-sha1: make the 'ntohl()' part of the first SHA1 loop This helps a teeny bit. But what I -really- want to do is to avoid the whole 80-array loop, and do the xor updates as I go along.. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- block-sha1/sha1.c | 28 ++++++++++++++++------------ 1 files changed, 16 insertions(+), 12 deletions(-) diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c index a45a3de..39a5bbb 100644 --- a/block-sha1/sha1.c +++ b/block-sha1/sha1.c @@ -100,27 +100,31 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) unsigned int A,B,C,D,E,TEMP; unsigned int W[80]; - for (t = 0; t < 16; t++) - W[t] = htonl(data[t]); - - /* Unroll it? */ - for (t = 16; t <= 79; t++) - W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); - A = ctx->H[0]; B = ctx->H[1]; C = ctx->H[2]; D = ctx->H[3]; E = ctx->H[4]; -#define T_0_19(t) \ +#define T_0_15(t) \ + TEMP = htonl(data[t]); W[t] = TEMP; \ + TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \ + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \ + + T_0_15( 0); T_0_15( 1); T_0_15( 2); T_0_15( 3); T_0_15( 4); + T_0_15( 5); T_0_15( 6); T_0_15( 7); T_0_15( 8); T_0_15( 9); + T_0_15(10); T_0_15(11); T_0_15(12); T_0_15(13); T_0_15(14); + T_0_15(15); + + /* Unroll it? 
*/ + for (t = 16; t <= 79; t++) + W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); + +#define T_16_19(t) \ TEMP = SHA_ROL(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \ E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; - T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4); - T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9); - T_0_19(10); T_0_19(11); T_0_19(12); T_0_19(13); T_0_19(14); - T_0_19(15); T_0_19(16); T_0_19(17); T_0_19(18); T_0_19(19); + T_16_19(16); T_16_19(17); T_16_19(18); T_16_19(19); #define T_20_39(t) \ TEMP = SHA_ROL(A,5) + (B^C^D) + E + W[t] + 0x6ed9eba1; \ ^ permalink raw reply related [flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 3:31 ` Linus Torvalds @ 2009-08-06 3:48 ` Linus Torvalds 2009-08-06 4:01 ` Linus Torvalds 2009-08-06 4:52 ` George Spelvin 2009-08-06 4:08 ` Artur Skawina 1 sibling, 2 replies; 60+ messages in thread From: Linus Torvalds @ 2009-08-06 3:48 UTC (permalink / raw) To: Artur Skawina; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git On Wed, 5 Aug 2009, Linus Torvalds wrote: > > However, right now my biggest profile hit is on this irritating loop: > > /* Unroll it? */ > for (t = 16; t <= 79; t++) > W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); > > and I haven't been able to move _that_ into the other iterations yet. Oh yes I have. Here's the patch that gets me sub-28s git-fsck times. In fact, it gives me sub-27s times. In fact, it's really close to the OpenSSL times. And all using plain C. Again - this is all on x86-64. I suspect 32-bit code ends up having spills due to register pressure. That said, I did get rid of that big temporary array, and it now basically only uses that 512-bit array as one circular queue. Linus PS. Ok, so my definition of "plain C" is a bit odd. There's nothing plain about it. It's disgusting C preprocessor misuse. But dang, it's kind of fun to abuse the compiler this way. 
--- block-sha1/sha1.c | 28 ++++++++++++++++------------ 1 files changed, 16 insertions(+), 12 deletions(-) diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c index 39a5bbb..80193d4 100644 --- a/block-sha1/sha1.c +++ b/block-sha1/sha1.c @@ -96,9 +96,8 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx) static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) { - int t; unsigned int A,B,C,D,E,TEMP; - unsigned int W[80]; + unsigned int array[16]; A = ctx->H[0]; B = ctx->H[1]; @@ -107,8 +106,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) E = ctx->H[4]; #define T_0_15(t) \ - TEMP = htonl(data[t]); W[t] = TEMP; \ - TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \ + TEMP = htonl(data[t]); array[t] = TEMP; \ + TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \ E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \ T_0_15( 0); T_0_15( 1); T_0_15( 2); T_0_15( 3); T_0_15( 4); @@ -116,18 +115,21 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) T_0_15(10); T_0_15(11); T_0_15(12); T_0_15(13); T_0_15(14); T_0_15(15); - /* Unroll it? 
*/ - for (t = 16; t <= 79; t++) - W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); +/* This "rolls" over the 512-bit array */ +#define W(x) (array[(x)&15]) +#define SHA_XOR(t) \ + TEMP = SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1); W(t) = TEMP; #define T_16_19(t) \ - TEMP = SHA_ROL(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \ - E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; + SHA_XOR(t); \ + TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \ + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \ T_16_19(16); T_16_19(17); T_16_19(18); T_16_19(19); #define T_20_39(t) \ - TEMP = SHA_ROL(A,5) + (B^C^D) + E + W[t] + 0x6ed9eba1; \ + SHA_XOR(t); \ + TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0x6ed9eba1; \ E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24); @@ -136,7 +138,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39); #define T_40_59(t) \ - TEMP = SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \ + SHA_XOR(t); \ + TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + 0x8f1bbcdc; \ E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44); @@ -145,7 +148,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59); #define T_60_79(t) \ - TEMP = SHA_ROL(A,5) + (B^C^D) + E + W[t] + 0xca62c1d6; \ + SHA_XOR(t); \ + TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0xca62c1d6; \ E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64); ^ permalink raw reply related [flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 3:48 ` Linus Torvalds @ 2009-08-06 4:01 ` Linus Torvalds 2009-08-06 4:28 ` Artur Skawina 2009-08-06 4:52 ` George Spelvin 1 sibling, 1 reply; 60+ messages in thread From: Linus Torvalds @ 2009-08-06 4:01 UTC (permalink / raw) To: Artur Skawina; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git On Wed, 5 Aug 2009, Linus Torvalds wrote: > > Here's the patch that gets me sub-28s git-fsck times. In fact, it gives me > sub-27s times. In fact, it's really close to the OpenSSL times. Just to back that up: - OpenSSL: real 0m26.363s user 0m26.174s sys 0m0.188s - This C implementation: real 0m26.594s user 0m26.310s sys 0m0.256s so I'm still slower, but now you really have to look closely to see the difference. In fact, you have to do multiple runs to make sure, because the error bars are bigger than the difference - but openssl definitely edges my C code out by a small amount, and the above numbers are fairly normal. Linus ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 4:01 ` Linus Torvalds @ 2009-08-06 4:28 ` Artur Skawina 2009-08-06 4:50 ` Linus Torvalds 0 siblings, 1 reply; 60+ messages in thread From: Artur Skawina @ 2009-08-06 4:28 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git Linus Torvalds wrote: > > On Wed, 5 Aug 2009, Linus Torvalds wrote: >> Here's the patch that gets me sub-28s git-fsck times. In fact, it gives me >> sub-27s times. In fact, it's really close to the OpenSSL times. > > Just to back that up: > > - OpenSSL: > > real 0m26.363s > user 0m26.174s > sys 0m0.188s > > - This C implementation: > > real 0m26.594s > user 0m26.310s > sys 0m0.256s > > so I'm still slower, but now you really have to look closely to see the > difference. In fact, you have to do multiple runs to make sure, because > the error bars are bigger thant he difference - but openssl definitely > edges my C code out by a small amount, and the above numbers are rairly > normal. nice, the p4 microbenchmark #s: # TIME[s] SPEED[MB/s] rfc3174 1.357 44.99 rfc3174 1.352 45.13 mozilla 1.509 40.44 mozillaas 1.133 53.87 linus 0.5818 104.9 so it's more than twice as fast as the mozilla implementation. artur ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 4:28 ` Artur Skawina @ 2009-08-06 4:50 ` Linus Torvalds 2009-08-06 5:19 ` Artur Skawina 0 siblings, 1 reply; 60+ messages in thread From: Linus Torvalds @ 2009-08-06 4:50 UTC (permalink / raw) To: Artur Skawina; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git On Thu, 6 Aug 2009, Artur Skawina wrote: > > # TIME[s] SPEED[MB/s] > rfc3174 1.357 44.99 > rfc3174 1.352 45.13 > mozilla 1.509 40.44 > mozillaas 1.133 53.87 > linus 0.5818 104.9 > > so it's more than twice as fast as the mozilla implementation. So that's some general SHA1 benchmark you have? I hope it tests correctness too. Although I can't imagine it being wrong - I've made mistakes (oh, yes, many mistakes) when trying to convert the code to something efficient, and even the smallest mistake results in 'git fsck' immediately complaining about every single object. But still. I literally haven't tested it any other way (well, the git test-suite ends up doing a fair amount of testing too, and I _have_ run that). As to my atom testing: my poor little atom is a sad little thing, and it's almost painful to benchmark that thing. But it's worth it to look at how the 32-bit code compares to the openssl asm code too: - BLK_SHA1: real 2m27.160s user 2m23.651s sys 0m2.392s - OpenSSL: real 2m12.580s user 2m9.998s sys 0m1.811s - Mozilla-SHA1: real 3m21.836s user 3m18.369s sys 0m2.862s As expected, the hand-tuned assembly does better (and by a bigger margin). Probably partly because scheduling is important when in-order, and partly because gcc will have a harder time with the small register set. But it's still a big improvement over the mozilla one. (This is, as always, 'git fsck --full'. It spends about 50% on that SHA1 calculation, so the SHA1 speedup is larger than you see from just the numbers) Linus ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 4:50 ` Linus Torvalds @ 2009-08-06 5:19 ` Artur Skawina 2009-08-06 7:03 ` George Spelvin 0 siblings, 1 reply; 60+ messages in thread From: Artur Skawina @ 2009-08-06 5:19 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git Linus Torvalds wrote: > > On Thu, 6 Aug 2009, Artur Skawina wrote: >> # TIME[s] SPEED[MB/s] >> rfc3174 1.357 44.99 >> rfc3174 1.352 45.13 >> mozilla 1.509 40.44 >> mozillaas 1.133 53.87 >> linus 0.5818 104.9 >> >> so it's more than twice as fast as the mozilla implementation. > > So that's some general SHA1 benchmark you have? > > I hope it tests correctness too. yep, sort of, i just check that all versions return the same result when hashing some pseudorandom data. > As to my atom testing: my poor little atom is a sad little thing, and > it's almost painful to benchmark that thing. But it's worth it to look at > how the 32-bit code compares to the openssl asm code too: > > - BLK_SHA1: > real 2m27.160s > - OpenSSL: > real 2m12.580s > - Mozilla-SHA1: > real 3m21.836s > > As expected, the hand-tuned assembly does better (and by a bigger margin). > Probably partly because scheduling is important when in-order, and partly > because gcc will have a harder time with the small register set. > > But it's still a big improvement over mozilla one. > > (This is, as always, 'git fsck --full'. It spends about 50% on that SHA1 > calculation, so the SHA1 speedup is larger than you see from just th > enumbers) I'll start looking at other cpus once i integrate the asm versions into my benchmark. P4s really are "special". 
Even something as simple as this on top of your version: @@ -129,8 +133,8 @@ #define T_20_39(t) \ SHA_XOR(t); \ - TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0x6ed9eba1; \ - E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; + TEMP += SHA_ROL(A,5) + (B^C^D) + E; \ + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP + 0x6ed9eba1; T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24); T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29); @@ -139,8 +143,8 @@ #define T_40_59(t) \ SHA_XOR(t); \ - TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + 0x8f1bbcdc; \ - E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; + TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E; \ + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP + 0x8f1bbcdc; T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44); T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49); saves another 10% or so: #Initializing... Rounds: 1000000, size: 62500K, time: 1.421s, speed: 42.97MB/s # TIME[s] SPEED[MB/s] rfc3174 1.403 43.5 # New hash result: b747042d9f4f1fdabd2ac53076f8f830dea7fe0f rfc3174 1.403 43.51 linus 0.5891 103.6 linusas 0.5337 114.4 mozilla 1.535 39.76 mozillaas 1.128 54.13 artur ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 5:19 ` Artur Skawina @ 2009-08-06 7:03 ` George Spelvin 0 siblings, 0 replies; 60+ messages in thread From: George Spelvin @ 2009-08-06 7:03 UTC (permalink / raw) To: art.08.09, torvalds; +Cc: git, gitster, linux, nico > On Thu, 6 Aug 2009, Artur Skawina wrote: >> # TIME[s] SPEED[MB/s] >> rfc3174 1.357 44.99 >> rfc3174 1.352 45.13 >> mozilla 1.509 40.44 >> mozillaas 1.133 53.87 >> linus 0.5818 104.9 > #Initializing... Rounds: 1000000, size: 62500K, time: 1.421s, speed: 42.97MB/s > # TIME[s] SPEED[MB/s] > rfc3174 1.403 43.5 > # New hash result: b747042d9f4f1fdabd2ac53076f8f830dea7fe0f > rfc3174 1.403 43.51 > linus 0.5891 103.6 > linusas 0.5337 114.4 > mozilla 1.535 39.76 > mozillaas 1.128 54.13 I'm trying to absorb what you're learning about P4 performance, but I'm getting confused... what is what in these benchmarks? The major architectural decisions I see are: 1) Three possible ways to compute the W[] array for rounds 16..79: 1a) Compute W[16..79] in a loop beforehand (you noted that unrolling two copies helped significantly.) 1b) Compute W[16..79] as part of hash rounds 16..79. 1c) Compute W[0..15] in-place as part of hash rounds 16..79 2) The main hashing can be rolled up or unrolled: 2a) Four 20-round loops. (In case of options 1b and 1c, the first one might be split into a 16 and a 4.) 2b) Four 4-round loops, each unrolled 5x. (See the ARM assembly.) 2c) all 80 rounds unrolled. As Linus noted, 1c is not friends with options 2a and 2b, because the W() indexing math is no longer a compile-time constant. Linus has posted 1a+2c and 1c+2c. You posted some code that could be 2a or 2c depending on an UNROLL preprocessor #define. Which combinations are your "linus" and "linusas" code? You talk about "and my atom seems to like the compact loops too", but I'm not sure which loops those are. Thanks. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 3:48 ` Linus Torvalds 2009-08-06 4:01 ` Linus Torvalds @ 2009-08-06 4:52 ` George Spelvin 1 sibling, 0 replies; 60+ messages in thread From: George Spelvin @ 2009-08-06 4:52 UTC (permalink / raw) To: art.08.09, torvalds; +Cc: git, gitster, linux, nico On Wed, 5 Aug 2009, Linus Torvalds wrote: > Oh yes I have. > > Here's the patch that gets me sub-28s git-fsck times. In fact, it gives me > sub-27s times. In fact, it's really close to the OpenSSL times. > > And all using plain C. > > Again - this is all on x86-64. I suspect 32-bit code ends up having > spills due to register pressure. That said, I did get rid of that big > temporary array, and it now basically only uses that 512-bit array as one > circular queue. > > Linus > > PS. Ok, so my definition of "plain C" is a bit odd. There's nothing plain > about it. It's disgusting C preprocessor misuse. But dang, it's kind of > fun to abuse the compiler this way. You're still missing three tricks, which give a slight speedup on my machine: 1) (major) Instead of reassigning all those variables all the time, make the round function E += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + W[t] + 0x8f1bbcdc; \ B = SHA_ROR(B, 2); and rename the variables between rounds. 2) (minor) One of the round functions ((B&C)|(D&(B|C))) can be rewritten (B&C) | (C&D) | (D&B) = (B&C) | (D&(B|C)) = (B&C) | (D&(B^C)) = (B&C) ^ (D&(B^C)) = (B&C) + (D&(B^C)) to expose more associativity (and thus scheduling flexibility) to the compiler. 3) (minor) ctx->lenW is always simply a copy of the low 6 bits of ctx->size, so there's no need to bother with it. Actually, looking at the code, GCC manages to figure out the first (major) one by itself. Way to go, GCC authors! But avoiding the extra temporary in trick 2 also gets rid of some extra REX prefixes, saving 240 bytes in blk_SHA1Block, which is kind of nice in an inner loop. Here's my modified version of your earlier code. 
I haven't incorporated the W[] formation into the round functions as in your latest version. I'm sure you can bash the two together in very little time. Or I'll get to it later; I really should attend to $DAY_JOB at the moment. diff --git a/Makefile b/Makefile index daf4296..e6df8ec 100644 --- a/Makefile +++ b/Makefile @@ -84,6 +84,10 @@ all:: # specify your own (or DarwinPort's) include directories and # library directories by defining CFLAGS and LDFLAGS appropriately. # +# Define BLK_SHA1 environment variable if you want the C version +# of the SHA1 that assumes you can do unaligned 32-bit loads and +# have a fast htonl() function. +# # Define PPC_SHA1 environment variable when running make to make use of # a bundled SHA1 routine optimized for PowerPC. # @@ -1167,6 +1171,10 @@ ifdef NO_DEFLATE_BOUND BASIC_CFLAGS += -DNO_DEFLATE_BOUND endif +ifdef BLK_SHA1 + SHA1_HEADER = "block-sha1/sha1.h" + LIB_OBJS += block-sha1/sha1.o +else ifdef PPC_SHA1 SHA1_HEADER = "ppc/sha1.h" LIB_OBJS += ppc/sha1.o ppc/sha1ppc.o @@ -1184,6 +1192,7 @@ else endif endif endif +endif ifdef NO_PERL_MAKEMAKER export NO_PERL_MAKEMAKER endif diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c new file mode 100644 index 0000000..261eae7 --- /dev/null +++ b/block-sha1/sha1.c @@ -0,0 +1,141 @@ +/* + * Based on the Mozilla SHA1 (see mozilla-sha1/sha1.c), + * optimized to do word accesses rather than byte accesses, + * and to avoid unnecessary copies into the context array. 
+ */ + +#include <string.h> +#include <arpa/inet.h> + +#include "sha1.h" + +/* Hash one 64-byte block of data */ +static void blk_SHA1Block(blk_SHA_CTX *ctx, const uint32_t *data); + +void blk_SHA1_Init(blk_SHA_CTX *ctx) +{ + /* Initialize H with the magic constants (see FIPS180 for constants) + */ + ctx->H[0] = 0x67452301; + ctx->H[1] = 0xefcdab89; + ctx->H[2] = 0x98badcfe; + ctx->H[3] = 0x10325476; + ctx->H[4] = 0xc3d2e1f0; + ctx->size = 0; +} + + +void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, int len) +{ + int lenW = (int)ctx->size & 63; + + ctx->size += len; + + /* Read the data into W and process blocks as they get full + */ + if (lenW) { + int left = 64 - lenW; + if (len < left) + left = len; + memcpy(lenW + (char *)ctx->W, data, left); + if (left + lenW != 64) + return; + len -= left; + data += left; + blk_SHA1Block(ctx, ctx->W); + } + while (len >= 64) { + blk_SHA1Block(ctx, data); + data += 64; + len -= 64; + } + memcpy(ctx->W, data, len); +} + + +void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx) +{ + int i, lenW = (int)ctx->size & 63; + + /* Pad with a binary 1 (ie 0x80), then zeroes, then length + */ + ((char *)ctx->W)[lenW++] = 0x80; + if (lenW > 56) { + memset((char *)ctx->W + lenW, 0, 64 - lenW); + blk_SHA1Block(ctx, ctx->W); + lenW = 0; + } + memset((char *)ctx->W + lenW, 0, 56 - lenW); + ctx->W[14] = htonl(ctx->size >> 29); + ctx->W[15] = htonl((uint32_t)ctx->size << 3); + blk_SHA1Block(ctx, ctx->W); + + /* Output hash + */ + for (i = 0; i < 5; i++) + ((unsigned int *)hashout)[i] = htonl(ctx->H[i]); +} + +/* SHA-1 helper macros */ +#define SHA_ROT(X,n) (((X) << (n)) | ((X) >> (32-(n)))) +#define F1(b,c,d) (((d^c)&b)^d) +#define F2(b,c,d) (b^c^d) +/* This version lets the compiler use the fact that + is associative. 
*/ +#define F3(b,c,d) (c&d) + (b & (c^d)) + +/* The basic SHA-1 round */ +#define ROUND(a, b, c, d, e, f, k, t) \ + e += SHA_ROT(a,5) + f(b,c,d) + W[t] + k; b = SHA_ROT(b, 30) +/* Five SHA-1 rounds */ +#define FIVE(f, k, t) \ + ROUND(A, B, C, D, E, f, k, t ); \ + ROUND(E, A, B, C, D, f, k, t+1); \ + ROUND(D, E, A, B, C, f, k, t+2); \ + ROUND(C, D, E, A, B, f, k, t+3); \ + ROUND(B, C, D, E, A, f, k, t+4) + +static void blk_SHA1Block(blk_SHA_CTX *ctx, const uint32_t *data) +{ + int t; + uint32_t A,B,C,D,E; + uint32_t W[80]; + + for (t = 0; t < 16; t++) + W[t] = htonl(data[t]); + + /* Unroll it? */ + for (t = 16; t <= 79; t++) + W[t] = SHA_ROT(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); + + A = ctx->H[0]; + B = ctx->H[1]; + C = ctx->H[2]; + D = ctx->H[3]; + E = ctx->H[4]; + + FIVE(F1, 0x5a827999, 0); + FIVE(F1, 0x5a827999, 5); + FIVE(F1, 0x5a827999, 10); + FIVE(F1, 0x5a827999, 15); + + FIVE(F2, 0x6ed9eba1, 20); + FIVE(F2, 0x6ed9eba1, 25); + FIVE(F2, 0x6ed9eba1, 30); + FIVE(F2, 0x6ed9eba1, 35); + + FIVE(F3, 0x8f1bbcdc, 40); + FIVE(F3, 0x8f1bbcdc, 45); + FIVE(F3, 0x8f1bbcdc, 50); + FIVE(F3, 0x8f1bbcdc, 55); + + FIVE(F2, 0xca62c1d6, 60); + FIVE(F2, 0xca62c1d6, 65); + FIVE(F2, 0xca62c1d6, 70); + FIVE(F2, 0xca62c1d6, 75); + + ctx->H[0] += A; + ctx->H[1] += B; + ctx->H[2] += C; + ctx->H[3] += D; + ctx->H[4] += E; +} diff --git a/block-sha1/sha1.h b/block-sha1/sha1.h new file mode 100644 index 0000000..c9dc156 --- /dev/null +++ b/block-sha1/sha1.h @@ -0,0 +1,21 @@ +/* + * Based on the Mozilla SHA1 (see mozilla-sha1/sha1.h), + * optimized to do word accesses rather than byte accesses, + * and to avoid unnecessary copies into the context array. 
+ */ + #include <stdint.h> + +typedef struct { + uint32_t H[5]; + uint64_t size; + uint32_t W[16]; +} blk_SHA_CTX; + +void blk_SHA1_Init(blk_SHA_CTX *ctx); +void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *dataIn, int len); +void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx); + +#define git_SHA_CTX blk_SHA_CTX +#define git_SHA1_Init blk_SHA1_Init +#define git_SHA1_Update blk_SHA1_Update +#define git_SHA1_Final blk_SHA1_Final ^ permalink raw reply related [flat|nested] 60+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL
From: Artur Skawina @ 2009-08-06  4:08 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git

Linus Torvalds wrote:
>
> On Thu, 6 Aug 2009, Artur Skawina wrote:
>>> #define T_0_19(t) \
>>> -	TEMP = SHA_ROT(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \
>>> -	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
>>> +	TEMP = SHA_ROL(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \
>>> +	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
>>>
>>> 	T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4);
>>> 	T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9);
>>
>> unrolling these otoh is a clear loss (iirc ~10%).
>
> I can well imagine. The P4 decode bandwidth is abysmal unless you get
> things into the trace cache, and the trace cache is of a very limited
> size.
>
> However, on at least Nehalem, unrolling it all is quite a noticeable win.
>
> The way it's written, I can easily make it do one or the other by just
> turning the macro inside a loop (and we can have a preprocessor flag to
> choose one or the other), but let me work on it a bit more first.

that's of course how i measured it.. :)

> I'm trying to move the htonl() inside the loops (the same way I suggested
> George do with his assembly), and it seems to help a tiny bit. But I may
> be measuring noise.

i haven't tried your version at all yet (just applied the rol/ror and
unrolling changes, but neither was a win on p4)

> However, right now my biggest profile hit is on this irritating loop:
>
> 	/* Unroll it? */
> 	for (t = 16; t <= 79; t++)
> 		W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
>
> and I haven't been able to move _that_ into the other iterations yet.

i've done that before -- was a small loss -- maybe because of the small
trace cache. deleted that attempt while cleaning up the #if mess, so
don't have the patch, but it was basically

	#define newW(t) (W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1))

and then s/W[t]/newW(t)/ in rounds 16..79.

I've only tested on p4 and there the winner so far is still:

-	for (t = 16; t <= 79; t++)
+	for (t = 16; t <= 79; t+=2) {
 		ctx->W[t] =
-		    SHA_ROT(ctx->W[t-3] ^ ctx->W[t-8] ^ ctx->W[t-14] ^ ctx->W[t-16], 1);
+		    SHA_ROT(ctx->W[t-16] ^ ctx->W[t-14] ^ ctx->W[t-8] ^ ctx->W[t-3], 1);
+		ctx->W[t+1] =
+		    SHA_ROT(ctx->W[t-15] ^ ctx->W[t-13] ^ ctx->W[t-7] ^ ctx->W[t-2], 1);
+	}

> Here's my micro-optimization update. It does the first 16 rounds (of the
> first 20-round thing) specially, and takes the data directly from the
> input array. I'm _this_ close to breaking the 28s second barrier on
> git-fsck, but not quite yet.

tried this before too -- doesn't help. Not much of a surprise -- if
unrolling didn't help, adding another loop (for rounds 17..20) won't.

 artur
* Re: x86 SHA1: Faster than OpenSSL
From: Linus Torvalds @ 2009-08-06  4:27 UTC (permalink / raw)
To: Artur Skawina; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git

On Thu, 6 Aug 2009, Artur Skawina wrote:
>
> > The way it's written, I can easily make it do one or the other by just
> > turning the macro inside a loop (and we can have a preprocessor flag to
> > choose one or the other), but let me work on it a bit more first.
>
> that's of course how i measured it.. :)

Well, with my "rolling 512-bit array" I can't do that easily any more.

Now it actually depends on the compiler being able to statically do that
circular list calculation. If I were to turn it back into the chunks of
loops, my new code would suck, because it would have all those nasty
dynamic address calculations.

> I've only tested on p4 and there the winner so far is still:

Yeah, well, I refuse to touch that crappy micro-architecture any more. I
complained to Intel people for years that their best CPU was only
available as a laptop chip (Pentium-M), and I'm really happy to have
gotten rid of all my horrid P4's.

(Ok, so it was great when the P4 ran at 2x the frequency of the
competition, and then it smoked them all. Except on OS loads, where the
P4 exception handling took ten times longer than anything else).

So I'm a bit biased against P4.

I'll try it on my Atoms, though. They're pretty crappy CPUs, but they
have a fairly good _reason_ to be crappy.

		Linus
* Re: x86 SHA1: Faster than OpenSSL
From: Artur Skawina @ 2009-08-06  5:44 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git

Linus Torvalds wrote:
>
> On Thu, 6 Aug 2009, Artur Skawina wrote:
>>> The way it's written, I can easily make it do one or the other by just
>>> turning the macro inside a loop (and we can have a preprocessor flag to
>>> choose one or the other), but let me work on it a bit more first.
>> that's of course how i measured it.. :)
>
> Well, with my "rolling 512-bit array" I can't do that easily any more.
>
> Now it actually depends on the compiler being able to statically do that
> circular list calculation. If I were to turn it back into the chunks of
> loops, my new code would suck, because it would have all those nasty
> dynamic address calculations.

i did try (obvious patch below) and in fact the loops still win on p4:

#Initializing... Rounds: 1000000, size: 62500K, time: 1.428s, speed: 42.76MB/s
#           TIME[s]   SPEED[MB/s]
rfc3174      1.437     42.47
rfc3174      1.438     42.45
linus        0.5791   105.4
linusas      0.5052   120.8
mozilla      1.525     40.01
mozillaas    1.192     51.19

 artur

--- block-sha1/sha1.c	2009-08-06 06:45:03.407322970 +0200
+++ block-sha1/sha1as.c	2009-08-06 07:36:41.332318683 +0200
@@ -107,13 +107,17 @@
 
 #define T_0_15(t) \
 	TEMP = htonl(data[t]); array[t] = TEMP; \
-	TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \
-	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \
+	TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E; \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP + 0x5a827999; \
 
+#if UNROLL
 	T_0_15( 0); T_0_15( 1); T_0_15( 2); T_0_15( 3); T_0_15( 4);
 	T_0_15( 5); T_0_15( 6); T_0_15( 7); T_0_15( 8); T_0_15( 9);
 	T_0_15(10); T_0_15(11); T_0_15(12); T_0_15(13); T_0_15(14);
 	T_0_15(15);
+#else
+	for (int t = 0; t <= 15; t++) { T_0_15(t); }
+#endif
 
 	/* This "rolls" over the 512-bit array */
 #define W(x) (array[(x)&15])
@@ -125,37 +129,53 @@
 	TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \
 	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \
 
+#if UNROLL
 	T_16_19(16); T_16_19(17); T_16_19(18); T_16_19(19);
+#else
+	for (int t = 16; t <= 19; t++) { T_16_19(t); }
+#endif
 
 #define T_20_39(t) \
 	SHA_XOR(t); \
-	TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0x6ed9eba1; \
-	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
+	TEMP += SHA_ROL(A,5) + (B^C^D) + E; \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP + 0x6ed9eba1;
 
+#if UNROLL
 	T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24);
 	T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29);
 	T_20_39(30); T_20_39(31); T_20_39(32); T_20_39(33); T_20_39(34);
 	T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39);
+#else
+	for (int t = 20; t <= 39; t++) { T_20_39(t); }
+#endif
 
 #define T_40_59(t) \
 	SHA_XOR(t); \
-	TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + 0x8f1bbcdc; \
-	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
+	TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E; \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP + 0x8f1bbcdc;
 
+#if UNROLL
 	T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44);
 	T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49);
 	T_40_59(50); T_40_59(51); T_40_59(52); T_40_59(53); T_40_59(54);
 	T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59);
+#else
+	for (int t = 40; t <= 59; t++) { T_40_59(t); }
+#endif
 
 #define T_60_79(t) \
 	SHA_XOR(t); \
 	TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0xca62c1d6; \
 	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
 
+#if UNROLL
 	T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64);
 	T_60_79(65); T_60_79(66); T_60_79(67); T_60_79(68); T_60_79(69);
 	T_60_79(70); T_60_79(71); T_60_79(72); T_60_79(73); T_60_79(74);
 	T_60_79(75); T_60_79(76); T_60_79(77); T_60_79(78); T_60_79(79);
+#else
+	for (int t = 60; t <= 79; t++) { T_60_79(t); }
+#endif
 
 	ctx->H[0] += A;
 	ctx->H[1] += B;
* Re: x86 SHA1: Faster than OpenSSL
From: Artur Skawina @ 2009-08-06  5:56 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git

Artur Skawina wrote:
> i did try (obvious patch below) and in fact the loops still win on p4:
>
> #Initializing... Rounds: 1000000, size: 62500K, time: 1.428s, speed: 42.76MB/s
> #           TIME[s]   SPEED[MB/s]
> rfc3174      1.437     42.47
> rfc3174      1.438     42.45
> linus        0.5791   105.4
> linusas      0.5052   120.8
> mozilla      1.525     40.01
> mozillaas    1.192     51.19

and my atom seems to like the compact loops too:

#Initializing... Rounds: 1000000, size: 62500K, time: 4.379s, speed: 13.94MB/s
#           TIME[s]   SPEED[MB/s]
rfc3174      4.429     13.78
rfc3174      4.414     13.83
linus        1.733     35.22
linusas      1.5       40.7
mozilla      2.818     21.66
mozillaas    2.539     24.04

 artur
* Re: x86 SHA1: Faster than OpenSSL
From: Artur Skawina @ 2009-08-06  7:45 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git

Artur Skawina wrote:
>
> and my atom seems to like the compact loops too:

no, that was wrong, i forgot to turn off the ondemand governor... the
unrolled loops are in fact much faster and the numbers look more
reasonable, after a few tweaks even on a P4. Now i just need to check
how well it does compared to the asm implementations...

 artur

#           TIME[s]   SPEED[MB/s]
# ATOM
rfc3174      2.199     27.75
linus        0.8642    70.62
linusas      1.606     38.01
linusas2     0.8763    69.65
mozilla      2.813     21.7
mozillaas    2.539     24.04
# P4
rfc3174      1.402     43.53
linus        0.5835   104.6
linusas      0.4625   132
linusas2     0.4456   137
mozilla      1.529     39.91
mozillaas    1.131     53.96
# P3
rfc3174      5.019     12.16
linus        1.86      32.81
linusas      3.108     19.64
linusas2     1.812     33.68
mozilla      6.431      9.49
mozillaas    5.868     10.4
* Re: x86 SHA1: Faster than OpenSSL
From: Erik Faye-Lund @ 2009-08-06 18:49 UTC (permalink / raw)
To: Linus Torvalds; +Cc: George Spelvin, gitster, git

On Thu, Aug 6, 2009 at 3:18 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> I note that MINGW does NO_OPENSSL by default, for example, and maybe the
> MINGW people want to test the patch out and enable BLK_SHA1 rather than
> the original Mozilla one.

We recently got OpenSSL in msysgit. The NO_OPENSSL switch hasn't been
flipped yet, though. (We did OpenSSL to get https support in cURL...)

--
Erik "kusma" Faye-Lund
kusmabite@gmail.com
(+47) 986 59 656
* Re: x86 SHA1: Faster than OpenSSL
From: Linus Torvalds @ 2009-08-04  6:40 UTC (permalink / raw)
To: George Spelvin; +Cc: Git Mailing List

On Mon, 4 Aug 2009, George Spelvin wrote:
> +sha1_block_data_order:
> +	pushl	%ebp
> +	pushl	%ebx
> +	pushl	%esi
> +	pushl	%edi
> +	movl	20(%esp),%edi
> +	movl	24(%esp),%esi
> +	movl	28(%esp),%eax
> +	subl	$64,%esp
> +	shll	$6,%eax
> +	addl	%esi,%eax
> +	movl	%eax,92(%esp)
> +	movl	16(%edi),%ebp
> +	movl	12(%edi),%edx
> +.align	16
> +.L000loop:
> +	movl	(%esi),%ecx
> +	movl	4(%esi),%ebx
> +	bswap	%ecx
> +	movl	8(%esi),%eax
> +	bswap	%ebx
> +	movl	%ecx,(%esp)
...

Hmm. Does it really help to do the bswap as a separate initial phase?

As far as I can tell, you load the result of the bswap just a single
time for each value. So the initial "bswap all 64 bytes" seems
pointless.

> +	/* 00_15 0 */
> +	movl	%edx,%edi
> +	movl	(%esp),%esi

Why not do the bswap here instead? Is it because you're running out of
registers for scheduling, and want to use the stack pointer rather than
the original source?

Or does the data dependency end up being so much better that you're
better off doing a separate bswap loop? Or is it just because the code
was written that way?

Intriguing, either way.

		Linus
* Re: x86 SHA1: Faster than OpenSSL
From: Andy Polyakov @ 2009-08-18 21:26 UTC (permalink / raw)
To: George Spelvin; +Cc: git

George Spelvin wrote:
> (Work in progress, state dump to mailing list archives.)
>
> This started when discussing git startup overhead due to the dynamic
> linker. One big contributor is the openssl library, which is used only
> for its optimized x86 SHA-1 implementation. So I took a look at it,
> with an eye to importing the code directly into the git source tree,
> and decided that I felt like trying to do better.
>
> The original code was excellent, but it was optimized when the P4 was new.

Even though the last revision took place when "the P4 was new," and was
even triggered by its appearance, *all-round* performance was and will
always be the prime goal. This means that an improvement on some
particular micro-architecture is always weighed against losses on
others [and a compromise is considered if so required]. Please note
that I'm *not* trying to diminish George's effort by saying that the
proposed code is inappropriate; on the contrary, I'm nothing but
grateful! Thanks, George! I'm only saying that it will be given
thorough consideration. Well, I've actually given it that
consideration, and the outcome is already committed:-) See
http://cvs.openssl.org/chngview?cn=18513. I don't deliver +17%, only
+12%, but at the cost of Intel Atom-specific optimizations. I used this
opportunity to optimize even for the Intel Atom core, something I was
planning to do at some point anyway...

> http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz
> - "tar xz cryptogams-0.tar.gz"

If there is interest I can pack a new tar ball with updated modules.

> An open question is how to add appropriate CPU detection to the git
> build scripts. (Note that `uname -m`, which it currently uses to select
> the ARM code, does NOT produce the right answer if you're using a 32-bit
> compiler on a 64-bit platform.)

It's not only that. As another subscriber noted, there is a problem on
Mac OS X: it uses a slightly different assembler convention, so ELF
modules can't be assembled there. The OpenSSL perlasm framework takes
care of several assembler flavors and executable formats, including
Mac OS X. I'm talking about

> +++ Makefile	2009-08-02 06:44:44.000000000 -0400

> +%.s : %.pl x86asm.pl x86unix.pl
> +	perl $< elf > $@
	          ^^^
this argument.

Cheers. A.