* Re: Performance issue of 'git branch'
@ 2009-07-26 23:21 George Spelvin
From: George Spelvin @ 2009-07-26 23:21 UTC (permalink / raw)
To: git, torvalds; +Cc: linux
> It's a bit sad, since the _only_ thing we load all of libcrypto for is the
> (fairly trivial) SHA1 code.
>
> But at the same time, last time I benchmarked the different SHA1
> libraries, the openssl one was the fastest. I think it has tuned assembly
> language for most architectures. Our regular mozilla-based C code is
> perfectly fine, but it doesn't hold a candle to assembler tuning.
Actually, openssl only has assembly for x86, x86_64, and ia64.
Truthfully, once you have 32 registers, SHA1 is awfully easy to
compile near-optimally.
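For concreteness, here is one round of the compression function in portable
C (an illustration of the data flow only, not git's or OpenSSL's actual
code); with the five state words, the temporary, and much of the message
schedule held in registers, a compiler can schedule this close to optimally:

    #include <stdint.h>

    /* One SHA-1 round, rounds 0..19 variant: F(b,c,d) = d ^ (b & (c ^ d)),
     * round constant K1 = 0x5a827999.  Each round folds the message word
     * and F() into e and rotates b; with ~32 general registers the whole
     * working state stays register-resident. */
    static inline uint32_t rotl32(uint32_t x, unsigned n)
    {
            return (x << n) | (x >> (32 - n));
    }

    static inline void sha1_round_00_19(uint32_t a, uint32_t *b, uint32_t c,
                                        uint32_t d, uint32_t *e, uint32_t w)
    {
            *e += rotl32(a, 5) + (d ^ (*b & (c ^ d))) + 0x5a827999 + w;
            *b = rotl32(*b, 30);
    }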
Git currently includes some hand-tuned assembly that isn't in OpenSSL:
- ARM (only 16 registers, and the rotate+op support can be used nicely)
- PPC (3-way superscalar *without* out-of-order execution; it benefits
  from careful scheduling)
Further, all of the core hand-tuned SHA1 assembly code in OpenSSL is by
Andy Polyakov and is dual-licensed GPL/3-clause BSD *in addition to*
the OpenSSL license. So we can just import it:
See http://www.openssl.org/~appro/cryptogams/
and http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz
(Ooh, look, he has PPC code in there, too. Does anyone with a PPC machine
want to compare it with Git's?)
It'll take some massaging because that's just the core SHA1_Transform
function and not the wrappers, but it's hardly a heroic effort.
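For reference, the wrapper is little more than buffer management around the
block function.  A minimal sketch, assuming the sha1_block_data_order()
signature used later in this thread (last argument = count of 64-byte
blocks); this is an illustration, not a finished git patch:

    #include <stdint.h>
    #include <string.h>

    /* Core transform provided by the assembly. */
    void sha1_block_data_order(uint32_t iv[5], const void *in, unsigned blocks);

    struct sha1_ctx {
            uint32_t h[5];
            uint64_t len;            /* total bytes hashed */
            uint8_t buf[64];         /* partial block */
    };

    static void sha1_init(struct sha1_ctx *c)
    {
            static const uint32_t iv[5] = {
                    0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476, 0xc3d2e1f0
            };
            memcpy(c->h, iv, sizeof c->h);
            c->len = 0;
    }

    static void sha1_update(struct sha1_ctx *c, const void *data, size_t n)
    {
            const uint8_t *p = data;
            size_t fill = c->len % 64;

            c->len += n;
            if (fill) {                      /* top up a partial block first */
                    size_t take = 64 - fill;
                    if (take > n)
                            take = n;
                    memcpy(c->buf + fill, p, take);
                    p += take;
                    n -= take;
                    if (fill + take == 64)
                            sha1_block_data_order(c->h, c->buf, 1);
                    else
                            return;
            }
            if (n >= 64) {                   /* bulk of the data, unbuffered */
                    sha1_block_data_order(c->h, p, (unsigned)(n / 64));
                    p += n & ~(size_t)63;
                    n &= 63;
            }
            if (n)
                    memcpy(c->buf, p, n);    /* stash the tail */
    }

    static void sha1_final(struct sha1_ctx *c, uint8_t out[20])
    {
            uint64_t bits = c->len * 8;
            size_t fill = c->len % 64;
            static const uint8_t pad[64] = { 0x80 };
            uint8_t lenbuf[8];
            int i;

            for (i = 0; i < 8; i++)          /* big-endian bit count */
                    lenbuf[i] = (uint8_t)(bits >> (56 - 8 * i));
            sha1_update(c, pad, (fill < 56 ? 56 : 120) - fill);
            sha1_update(c, lenbuf, 8);
            for (i = 0; i < 20; i++)         /* big-endian word output */
                    out[i] = (uint8_t)(c->h[i / 4] >> (24 - 8 * (i % 4)));
    }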
I'm pretty deep in the weeds at $DAY_JOB and can't get to it for a while,
but would a patch be appreciated?
* Request for benchmarking: x86 SHA1 code 2009-07-26 23:21 Performance issue of 'git branch' George Spelvin @ 2009-07-31 10:46 ` George Spelvin 2009-07-31 11:11 ` Erik Faye-Lund ` (8 more replies) 0 siblings, 9 replies; 129+ messages in thread From: George Spelvin @ 2009-07-31 10:46 UTC (permalink / raw) To: git; +Cc: linux After studying Andy Polyakov's optimized x86 SHA-1 in OpenSSL, I've got a version that's 1.6% slower on a P4 and 15% faster on a Phenom. I'm curious about the performance on other CPUs I don't have access to, particularly Core 2 duo and i7. Could someone do some benchmarking for me? Old (486/Pentium/P2/P3) machines are also interesting, but I'm optimizing for newer ones. I haven't packaged this nicely, but it's not that complicated. - Download Andy's original code from http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz - Unpack and cd to the cryptogams-0/x86 directory - "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile", and "sha1-x86.pl". - "make" - Run ./586test (before) and ./x86test (after) and note the timings. Thank you! --- /dev/null 2009-05-12 02:55:38.579106460 -0400 +++ Makefile 2009-07-31 06:22:42.000000000 -0400 @@ -0,0 +1,16 @@ +CC := gcc +CFLAGS := -m32 -W -Wall -Os -g +ASFLAGS := --32 + +all: 586test x86test + +586test : sha1test.c sha1-586.o + $(CC) $(CFLAGS) -o $@ sha1test.c sha1-586.o + +x86test : sha1test.c sha1-x86.o + $(CC) $(CFLAGS) -o $@ sha1test.c sha1-x86.o + +586test x86test : sha1-586.h + +%.s : %.pl x86asm.pl x86unix.pl + perl $< elf > $@ --- /dev/null 2009-05-12 02:55:38.579106460 -0400 +++ sha1test.c 2009-07-28 09:24:09.000000000 -0400 @@ -0,0 +1,67 @@ +#include <stdint.h> +#include <stdlib.h> +#include <stdio.h> +#include <sys/time.h> + +#include "sha1-586.h" + +#define SIZE 1000000 + +#if SIZE % 64 +# error SIZE must be a multiple of 64! 
+#endif + +int +main(void) +{ + uint32_t iv[5] = { + 0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476, 0xc3d2e1f0 + }; + /* Simplest known-answer test, "abc" */ + static uint8_t const testbuf[64] = { + 'a','b','c', 0x80, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 24 + }; + /* Expected: A9993E364706816ABA3E25717850C26C9CD0D89D */ + static uint32_t const expected[5] = { + 0xa9993e36, 0x4706816a, 0xba3e2571, 0x7850c26c, 0x9cd0d89d }; + unsigned i; + char *p = malloc(SIZE); + struct timeval tv0, tv1; + + if (!p) { + perror("malloc"); + return 1; + } + + sha1_block_data_order(iv, testbuf, 1); + printf("Result: %08x %08x %08x %08x %08x\n" + "Expected:%08x %08x %08x %08x %08x\n", + iv[0], iv[1], iv[2], iv[3], iv[4], expected[0], + expected[1], expected[2], expected[3], expected[4]); + for (i = 0; i < 5; i++) + if (iv[i] != expected[i]) + printf("MISMATCH in word %u\n", i); + + if (gettimeofday(&tv0, NULL) < 0) { + perror("gettimeofday"); + return 1; + } + for (i = 0; i < 500; i++) + sha1_block_data_order(iv, p, SIZE/64); + if (gettimeofday(&tv1, NULL) < 0) { + perror("gettimeofday"); + return 1; + } + tv1.tv_sec -= tv0.tv_sec; + tv1.tv_usec -= tv0.tv_usec; + if (tv1.tv_usec < 0) { + tv1.tv_sec--; + tv1.tv_usec += 1000000; + } + printf("%u bytes: %u.%06u s\n", i * SIZE, (unsigned)tv1.tv_sec, + (unsigned)tv1.tv_usec); + return 0; +} --- /dev/null 2009-05-12 02:55:38.579106460 -0400 +++ sha1-586.h 2009-07-27 09:34:03.000000000 -0400 @@ -0,0 +1,3 @@ +#include <stdint.h> + +void sha1_block_data_order(uint32_t iv[5], void const *in, unsigned len); --- /dev/null 2009-05-12 02:55:38.579106460 -0400 +++ sha1-x86.pl 2009-07-31 06:10:18.000000000 -0400 @@ -0,0 +1,398 @@ +#!/usr/bin/env perl + +# ==================================================================== +# [Re]written by Andy Polyakov <appro@fy.chalmers.se> for the OpenSSL +# project. The module is, however, dual licensed under OpenSSL and +# CRYPTOGAMS licenses depending on where you obtain it. For further +# details see http://www.openssl.org/~appro/cryptogams/. +# ==================================================================== + +# "[Re]written" was achieved in two major overhauls. In 2004 BODY_* +# functions were re-implemented to address P4 performance issue [see +# commentary below], and in 2006 the rest was rewritten in order to +# gain freedom to liberate licensing terms. + +# It was noted that Intel IA-32 C compiler generates code which +# performs ~30% *faster* on P4 CPU than original *hand-coded* +# SHA1 assembler implementation. To address this problem (and +# prove that humans are still better than machines:-), the +# original code was overhauled, which resulted in following +# performance changes: +# +# compared with original compared with Intel cc +# assembler impl. generated code +# Pentium -16% +48% +# PIII/AMD +8% +16% +# P4 +85%(!) +45% +# +# As you can see Pentium came out as looser:-( Yet I reckoned that +# improvement on P4 outweights the loss and incorporate this +# re-tuned code to 0.9.7 and later. 
+# ---------------------------------------------------------------- +# <appro@fy.chalmers.se> + +$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1; +push(@INC,"${dir}","${dir}../../perlasm"); +require "x86asm.pl"; + +&asm_init($ARGV[0],"sha1-586.pl",$ARGV[$#ARGV] eq "386"); + +$A="eax"; +$B="ebx"; +$C="ecx"; +$D="edx"; +$E="ebp"; + +# Two temporaries +$S="esi"; +$T="edi"; + +# The round constants +use constant K1 => 0x5a827999; +use constant K2 => 0x6ED9EBA1; +use constant K3 => 0x8F1BBCDC; +use constant K4 => 0xCA62C1D6; + +@V=($A,$B,$C,$D,$E); + +# Given unlimited registers and functional units, it would be +# possible to compute SHA-1 at two cycles per round, using 7 +# operations per round. Remember, each round computes a new +# value for E, which is used as A in the following round and B +# in the round after that. There are two critical paths: +# - A must be rotated and added to the output E +# - B must go through two boolean operations before being added +# to the result E. Since this latter addition can't be done +# in the same-cycle as the critical addition of a<<<5, this is +# a total of 2+1+1 = 4 cycles. +# Additionally, if you want to avoid copying B, you have to +# rotate it soon after use in this round so it is ready for use +# as the following round's C. + +# f = (a <<< 5) + e + K + in[i] + (d^(b&(c^d))) (0..19) +# f = (a <<< 5) + e + K + in[i] + (b^c^d) (20..39, 60..79) +# f = (a <<< 5) + e + K + in[i] + (c&d) + (b&(c^d)) (40..59) +# The hard part is doing this with only two temporary registers. +# Let's divide it into 4 parts. These can be executed in a 7-cycle +# loop, assuming triple (quadruple counting xor separately) issue: +# +# in[i] F1(c,d) F2(b,c,d) a<<<5 +# mov in[i],T (addl S,A) (movl B,S) +# xor in[i+1],T (rorl 5,S) +# xor in[i+2],T movl D,S (addl S,A) +# xor in[i+3],T andl C,S +# rotl 1,T addl S,E movl D,S +# (movl T,in[i]) xorl C,S +# addl T+K,E andl B,S rorl 2,B // +# (mov in[i],T) addl S,E movl A,S +# (xor in[i+1],T) rorl 5,S +# (xor in[i+2],T) (movl C,S) addl S,E +# +# (The last 3 rounds can omit the store of T.) +# The "addl T+K,E" line is actually implemented using the lea instruction, +# which (on a Pentium) requires that neither T not K was modified on the +# previous cycle. +# +# The other two rounds are a bit simpler, and can therefore be "pulled in" +# one cycle, to 6. 
The bit-select function (0..19): +# in[i] F(b,c,d) a<<<5 +# mov in[i],T (xorl E,S) (addl T+K,A) +# xor in[i+1],T (addl S,A) (movl A,S) +# xor in[i+2],T (rorl 2,C) (roll 5,S) +# xor in[i+3],T movl D,S (addl S,A) +# roll 1,T xorl C,S +# movl T,in[i] andl B,S rorl 2,B +# addl T+K,E xorl D,S // (mov in[i],T) +# (xor in[i+1],T) addl S,E movl A,S +# (xor in[i+2],T) roll 5,S +# (xor in[i+3],T) (movl C,S) addl S,E +# +# And the XOR function (also 6, limited by the in[i] forming) used in +# rounds 20..39 and 60..79: +# in[i] F(b,c,d) a<<<5 +# mov in[i],T (xorl C,S) (addl T+K,A) +# xor in[i+1],T (addl S,A) (movl A,S) +# xor in[i+2],T (roll 5,S) +# xor in[i+3],T (addl S,A) +# roll 1,T movl D,S +# movl T,in[i] xorl B,S rorl 2,B +# addl T+K,E xorl C,S // (mov in[i],T) +# (xor in[i+1],T) addl S,E movl A,S +# (xor in[i+2],T) roll 5,S +# (xor in[i+3],T) addl S,E +# +# The first 16 rounds don't need to form the in[i] equation, letting +# us pull it in another 2 cycles, to 4, after some reassignment of +# temporaries: +# in[i] F(b,c,d) a<<<5 +# movl D,S (roll 5,T) (addl S,A) +# mov in[i],T xorl C,S (addl T,A) +# andl B,S rorl 2,B +# addl T+K,E xorl D,S movl A,T +# addl S,E roll 5,T (movl C,S) +# (mov in[i],T) (xorl B,S) addl T,E +# + +# The transition between rounds 15 and 16 will be a bit tricky... the best +# thing to do is to delay the computation of a<<<5 one cycle and move it back +# to the S register. That way, T is free as early as possible. +# in[i] F(b,c,d) a<<<5 +# (addl T+K,A) (xorl E,S) (movl A,T) +# movl D,S (roll 5,T) (addl S,A) +# mov in[i],T xorl C,S (addl T,A) +# andl B,S rorl 2,B +# addl T+K,E xorl D,S (movl in[1],T) +# (xor in[i+1],T) addl S,E movl A,S +# (xor in[i+2],T) rorl 2,B roll 5,S +# (xor in[i+3],T) (movl C,S) addl S,E + + + + + +# This expects the first copy of D to S to have been done already +# movl D,S (roll 5,T) (addl S,A) // +# mov in[i],T xorl C,S (addl T,A) +# andl B,S rorl 2,B +# addl T+K,E xorl D,S movl A,T +# addl S,E roll 5,T (movl C,S) // +# (mov in[i],T) (xorl B,S) addl T,E + +sub BODY_00_15 +{ + local($n,$a,$b,$c,$d,$e)=@_; + + &comment("00_15 $n"); + &mov($S,$d) if ($n == 0); + &mov($T,&swtmp($n%16)); # Load Xi. + &xor($S,$c); # Continue computing F() = d^(b&(c^d)) + &and($S,$b); + &rotr($b,2); + &lea($e,&DWP(K1,$e,$T)); # Add Xi and K + if ($n < 15) { + &mov($T,$a); + &xor($S,$d); + &rotl($T,5); + &add($e,$S); + &mov($S,$c); # Start of NEXT round's F() + &add($e,$T); + } else { + # This version provides the correct start for BODY_20_39 + &mov($T,&swtmp(($n+1)%16)); # Start computing mext Xi. + &xor($S,$d); + &xor($T,&swtmp(($n+3)%16)); + &add($e,$S); # Add F() + &mov($S,$a); # Start computing a<<<5 + &xor($T,&swtmp(($n+9)%16)); + &rotl($S,5); + } + +} + +# The transition between rounds 15 and 16 will be a bit tricky... the best +# thing to do is to delay the computation of a<<<5 one cycle and move it back +# to the S register. That way, T is free as early as possible. +# in[i] F(b,c,d) a<<<5 +# (addl T+K,A) (xorl E,S) (movl B,T) +# movl D,S (roll 5,T) (addl S,A) // +# mov in[i],T xorl C,S (addl T,A) +# andl B,S rorl 2,B +# addl T+K,E xorl D,S (movl in[1],T) +# (xor in[i+1],T) addl S,E movl A,S +# (xor in[i+2],T) rorl 2,B roll 5,S // +# (xor in[i+3],T) (movl C,S) addl S,E + + +# This starts just before starting to compute F(); the Xi should have XORed +# the first three values together. 
(Break is at //) +# +# in[i] F(b,c,d) a<<<5 +# mov in[i],T (xorl E,S) (addl T+K,A) +# xor in[i+1],T (addl S,A) (movl B,S) +# xor in[i+2],T (roll 5,S) // +# xor in[i+3],T movl D,S (addl S,A) +# roll 1,T xorl C,S +# movl T,in[i] andl B,S rorl 2,B +# addl T+K,E xorl D,S (mov in[i],T) +# (xor in[i+1],T) addl S,E movl A,S +# (xor in[i+2],T) roll 5,S // +# (xor in[i+3],T) (movl C,S) addl S,E + +sub BODY_16_19 +{ + local($n,$a,$b,$c,$d,$e)=@_; + + &comment("16_20 $n"); + + &xor($T,&swtmp(($n+13)%16)); + &add($a,$S); # End of previous round + &mov($S,$d); # Start current round's F() + &rotl($T,1); + &xor($S,$c); + &mov(&swtmp($n%16),$T); # Store computed Xi. + &and($S,$b); + &rotr($b,2); + &lea($e,&DWP(K1,$e,$T)); # Add Xi and K + &mov($T,&swtmp(($n+1)%16)); # Start computing mext Xi. + &xor($S,$d); + &xor($T,&swtmp(($n+3)%16)); + &add($e,$S); # Add F() + &mov($S,$a); # Start computing a<<<5 + &xor($T,&swtmp(($n+9)%16)); + &rotl($S,5); +} + +# This is just like BODY_16_19, but computes a different F() = b^c^d +# +# in[i] F(b,c,d) a<<<5 +# mov in[i],T (xorl E,S) (addl T+K,A) +# xor in[i+1],T (addl S,A) (movl B,S) +# xor in[i+2],T (roll 5,S) // +# xor in[i+3],T (addl S,A) +# roll 1,T movl C,S +# movl T,in[i] xorl B,S rorl 2,B +# addl T+K,E xorl C,S (mov in[i],T) +# (xor in[i+1],T) addl S,E movl A,S +# (xor in[i+2],T) roll 5,S // +# (xor in[i+3],T) (movl C,S) addl S,E + +sub BODY_20_39 # And 61..79 +{ + local($n,$a,$b,$c,$d,$e)=@_; + local $K=($n<40) ? K2 : K4; + + &comment("21_30 $n"); + + &xor($T,&swtmp(($n+13)%16)); + &add($a,$S); # End of previous round + &mov($S,$d) + &rotl($T,1); + &mov($S,$d); # Start current round's F() + &mov(&swtmp($n%16),$T) if ($n < 77); # Store computed Xi. + &xor($S,$b); + &rotr($b,2); + &lea($e,&DWP($K,$e,$T)); # Add Xi and K + &mov($T,&swtmp(($n+1)%16)) if ($n < 79); # Start computing next Xi. + &xor($S,$c); + &xor($T,&swtmp(($n+3)%16)) if ($n < 79); + &add($e,$S); # Add F1() + &mov($S,$a); # Start computing a<<<5 + &xor($T,&swtmp(($n+9)%16)) if ($n < 79); + &rotl($S,5); + + &add($e,$S) if ($n == 79); +} + + +# This starts immediately after the LEA, and expects to need to finish +# the previous round. (break is at //) +# +# in[i] F1(c,d) F2(b,c,d) a<<<5 +# (addl T+K,E) (andl C,S) (rorl 2,C) +# mov in[i],T (addl S,A) (movl B,S) +# xor in[i+1],T (rorl 5,S) +# xor in[i+2],T / movl D,S (addl S,A) +# xor in[i+3],T andl C,S +# rotl 1,T addl S,E movl D,S +# (movl T,in[i]) xorl C,S +# addl T+K,E andl B,S rorl 2,B +# (mov in[i],T) addl S,E movl A,S +# (xor in[i+1],T) rorl 5,S +# (xor in[i+2],T) // (movl C,S) addl S,E + +sub BODY_40_59 +{ + local($n,$a,$b,$c,$d,$e)=@_; + + &comment("41_59 $n"); + + &add($a,$S); # End of previous round + &mov($S,$d); # Start current round's F(1) + &xor($T,&swtmp(($n+13)%16)); + &and($S,$c); + &rotl($T,1); + &add($e,$S); # Add F1() + &mov($S,$d); # Start current round's F2() + &mov(&swtmp($n%16),$T); # Store computed Xi. + &xor($S,$c); + &lea($e,&DWP(K3,$e,$T)); + &and($S,$b); + &rotr($b,2); + &mov($T,&swtmp(($n+1)%16)); # Start computing next Xi. 
+ &add($e,$S); # Add F2() + &mov($S,$a); # Start computing a<<<5 + &xor($T,&swtmp(($n+3)%16)); + &rotl($S,5); + &xor($T,&swtmp(($n+9)%16)); +} + +&function_begin("sha1_block_data_order",16); + &mov($S,&wparam(0)); # SHA_CTX *c + &mov($T,&wparam(1)); # const void *input + &mov($A,&wparam(2)); # size_t num + &stack_push(16); # allocate X[16] + &shl($A,6); + &add($A,$T); + &mov(&wparam(2),$A); # pointer beyond the end of input + &mov($E,&DWP(16,$S));# pre-load E + + &set_label("loop",16); + + # copy input chunk to X, but reversing byte order! + for ($i=0; $i<16; $i+=4) + { + &mov($A,&DWP(4*($i+0),$T)); + &mov($B,&DWP(4*($i+1),$T)); + &mov($C,&DWP(4*($i+2),$T)); + &mov($D,&DWP(4*($i+3),$T)); + &bswap($A); + &bswap($B); + &bswap($C); + &bswap($D); + &mov(&swtmp($i+0),$A); + &mov(&swtmp($i+1),$B); + &mov(&swtmp($i+2),$C); + &mov(&swtmp($i+3),$D); + } + &mov(&wparam(1),$T); # redundant in 1st spin + + &mov($A,&DWP(0,$S)); # load SHA_CTX + &mov($B,&DWP(4,$S)); + &mov($C,&DWP(8,$S)); + &mov($D,&DWP(12,$S)); + # E is pre-loaded + + for($i=0;$i<16;$i++) { &BODY_00_15($i,@V); unshift(@V,pop(@V)); } + for(;$i<16;$i++) { &BODY_15($i,@V); unshift(@V,pop(@V)); } + for(;$i<20;$i++) { &BODY_16_19($i,@V); unshift(@V,pop(@V)); } + for(;$i<40;$i++) { &BODY_20_39($i,@V); unshift(@V,pop(@V)); } + for(;$i<60;$i++) { &BODY_40_59($i,@V); unshift(@V,pop(@V)); } + for(;$i<80;$i++) { &BODY_20_39($i,@V); unshift(@V,pop(@V)); } + + (($V[4] eq $E) and ($V[0] eq $A)) or die; # double-check + + &comment("Loop trailer"); + + &mov($S,&wparam(0)); # re-load SHA_CTX* + &mov($T,&wparam(1)); # re-load input pointer + + &add($A,&DWP(0,$S)); # E is last "A"... + &add($B,&DWP(4,$S)); + &add($C,&DWP(8,$S)); + &add($D,&DWP(12,$S)); + &add($E,&DWP(16,$S)); + + &mov(&DWP(0,$S),$A); # update SHA_CTX + &add($T,64); # advance input pointer + &mov(&DWP(4,$S),$B); + &cmp($T,&wparam(2)); # have we reached the end yet? + &mov(&DWP(8,$S),$C); + &mov(&DWP(12,$S),$D); + &mov(&DWP(16,$S),$E); + &jb(&label("loop")); + + &stack_pop(16); +&function_end("sha1_block_data_order"); +&asciz("SHA1 block transform for x86, CRYPTOGAMS by <appro\@openssl.org>"); + +&asm_finish(); ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code 2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin @ 2009-07-31 11:11 ` Erik Faye-Lund 2009-07-31 11:31 ` George Spelvin 2009-07-31 11:37 ` Michael J Gruber 2009-07-31 11:21 ` Michael J Gruber ` (7 subsequent siblings) 8 siblings, 2 replies; 129+ messages in thread From: Erik Faye-Lund @ 2009-07-31 11:11 UTC (permalink / raw) To: George Spelvin; +Cc: git On Fri, Jul 31, 2009 at 12:46 PM, George Spelvin<linux@horizon.com> wrote: > I haven't packaged this nicely, but it's not that complicated. > - Download Andy's original code from > http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz > - Unpack and cd to the cryptogams-0/x86 directory > - "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile", > and "sha1-x86.pl". > - "make" $ patch < ../sha1-opt.patch.eml patching file `Makefile' patching file `sha1test.c' patching file `sha1-586.h' patching file `sha1-x86.pl' $ make make: *** No rule to make target `sha1-586.o', needed by `586test'. Stop. What did I do wrong? :) Would it be easier if you pushed it out somewhere? -- Erik "kusma" Faye-Lund kusmabite@gmail.com (+47) 986 59 656 ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code 2009-07-31 11:11 ` Erik Faye-Lund @ 2009-07-31 11:31 ` George Spelvin 2009-07-31 11:37 ` Michael J Gruber 1 sibling, 0 replies; 129+ messages in thread From: George Spelvin @ 2009-07-31 11:31 UTC (permalink / raw) To: kusmabite, linux; +Cc: git > $ make > make: *** No rule to make target `sha1-586.o', needed by `586test'. Stop. > > What did I do wrong? :) > Would it be easier if you pushed it out somewhere? H'm.... It *should* do perl sha1-586.pl elf > sha1-586.s as --32 -o sha1-586.o sha1-586.s gcc -m32 -W -Wall -Os -g -o 586test sha1test.c sha1-586.o (And likewise for the "x86test" binary.) which is what happened when I tested it. Obviously I have something non-portable in the Makefile. You could try "make sha1-586.s" and "sha1-586.o" and see which rule is f*ed up. Um, you *are* in a directory which contains a sha1-586.pl file, right? Thanks! ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code 2009-07-31 11:11 ` Erik Faye-Lund 2009-07-31 11:31 ` George Spelvin @ 2009-07-31 11:37 ` Michael J Gruber 2009-07-31 12:24 ` Erik Faye-Lund 1 sibling, 1 reply; 129+ messages in thread From: Michael J Gruber @ 2009-07-31 11:37 UTC (permalink / raw) To: Erik Faye-Lund; +Cc: George Spelvin, git Erik Faye-Lund venit, vidit, dixit 31.07.2009 13:11: > On Fri, Jul 31, 2009 at 12:46 PM, George Spelvin<linux@horizon.com> wrote: >> I haven't packaged this nicely, but it's not that complicated. >> - Download Andy's original code from >> http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz >> - Unpack and cd to the cryptogams-0/x86 directory >> - "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile", >> and "sha1-x86.pl". >> - "make" > > $ patch < ../sha1-opt.patch.eml > patching file `Makefile' > patching file `sha1test.c' > patching file `sha1-586.h' > patching file `sha1-x86.pl' > > $ make > make: *** No rule to make target `sha1-586.o', needed by `586test'. Stop. > > What did I do wrong? :) > Would it be easier if you pushed it out somewhere? > You need to go to the x86 directory, apply the patch and run make there. (I made the same mistake.) Also, you i586 (32bit) glibc-devel if you're on a 64 bit system. Michael ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code 2009-07-31 11:37 ` Michael J Gruber @ 2009-07-31 12:24 ` Erik Faye-Lund 2009-07-31 12:29 ` Johannes Schindelin 2009-07-31 12:32 ` George Spelvin 0 siblings, 2 replies; 129+ messages in thread From: Erik Faye-Lund @ 2009-07-31 12:24 UTC (permalink / raw) To: Michael J Gruber; +Cc: George Spelvin, git On Fri, Jul 31, 2009 at 1:37 PM, Michael J Gruber<git@drmicha.warpmail.net> wrote: >> What did I do wrong? :) > > You need to go to the x86 directory, apply the patch and run make there. > (I made the same mistake.) Also, you i586 (32bit) glibc-devel if you're > on a 64 bit system. Aha, thanks :) Now I'm getting a different error: $ make as -o sha1-586.o sha1-586.s sha1-586.s: Assembler messages: sha1-586.s:4: Warning: .type pseudo-op used outside of .def/.endef ignored. sha1-586.s:4: Error: junk at end of line, first unrecognized character is `s' sha1-586.s:1438: Warning: .size pseudo-op used outside of .def/.endef ignored. sha1-586.s:1438: Error: junk at end of line, first unrecognized character is `s' make: *** [sha1-586.o] Error 1 What might be relevant, is that I'm trying this on Windows (Vista 64bit). I'd still think GNU as should be able to assemble the source, though. I've got an i7, so I thought the result might be interresting. -- Erik "kusma" Faye-Lund kusmabite@gmail.com (+47) 986 59 656 ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code 2009-07-31 12:24 ` Erik Faye-Lund @ 2009-07-31 12:29 ` Johannes Schindelin 2009-07-31 12:32 ` George Spelvin 1 sibling, 0 replies; 129+ messages in thread From: Johannes Schindelin @ 2009-07-31 12:29 UTC (permalink / raw) To: Erik Faye-Lund; +Cc: Michael J Gruber, George Spelvin, git Hi, On Fri, 31 Jul 2009, Erik Faye-Lund wrote: > On Fri, Jul 31, 2009 at 1:37 PM, Michael J > Gruber<git@drmicha.warpmail.net> wrote: > >> What did I do wrong? :) > > > > You need to go to the x86 directory, apply the patch and run make there. > > (I made the same mistake.) Also, you i586 (32bit) glibc-devel if you're > > on a 64 bit system. > > Aha, thanks :) > > Now I'm getting a different error: > $ make > as -o sha1-586.o sha1-586.s > sha1-586.s: Assembler messages: > sha1-586.s:4: Warning: .type pseudo-op used outside of .def/.endef ignored. > sha1-586.s:4: Error: junk at end of line, first unrecognized character is `s' > sha1-586.s:1438: Warning: .size pseudo-op used outside of .def/.endef ignored. > sha1-586.s:1438: Error: junk at end of line, first unrecognized character is `s' > > make: *** [sha1-586.o] Error 1 > > What might be relevant, is that I'm trying this on Windows (Vista > 64bit). Probably using msysGit? Then you're still using the 32-bit environment, as MSys is 32-bit only for now. Ciao, Dscho ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code 2009-07-31 12:24 ` Erik Faye-Lund 2009-07-31 12:29 ` Johannes Schindelin @ 2009-07-31 12:32 ` George Spelvin 2009-07-31 12:45 ` Erik Faye-Lund 1 sibling, 1 reply; 129+ messages in thread From: George Spelvin @ 2009-07-31 12:32 UTC (permalink / raw) To: git, kusmabite; +Cc: git, linux > Now I'm getting a different error: > $ make > as -o sha1-586.o sha1-586.s > sha1-586.s: Assembler messages: > sha1-586.s:4: Warning: .type pseudo-op used outside of .def/.endef ignored. > sha1-586.s:4: Error: junk at end of line, first unrecognized character is `s' > sha1-586.s:1438: Warning: .size pseudo-op used outside of .def/.endef ignored. > sha1-586.s:1438: Error: junk at end of line, first unrecognized character is `s' > > make: *** [sha1-586.o] Error 1 > What might be relevant, is that I'm trying this on Windows (Vista > 64bit). I'd still think GNU as should be able to assemble the source, > though. I've got an i7, so I thought the result might be interresting. Ah... what assembler? the perl proprocessor supports multiple assemblers: elf - Linux, FreeBSD, Solaris x86, etc. a.out - DJGPP, elder OpenBSD, etc. coff - GAS/COFF such as Win32 targets win32n - Windows 95/Windows NT NASM format nw-nasm - NetWare NASM format nw-mwasm- NetWare Metrowerks Assembler Maybe you need to replace "elf" with "coff"? ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code 2009-07-31 12:32 ` George Spelvin @ 2009-07-31 12:45 ` Erik Faye-Lund 2009-07-31 13:02 ` George Spelvin 0 siblings, 1 reply; 129+ messages in thread From: Erik Faye-Lund @ 2009-07-31 12:45 UTC (permalink / raw) To: George Spelvin; +Cc: git, git On Fri, Jul 31, 2009 at 2:32 PM, George Spelvin<linux@horizon.com> wrote: > Maybe you need to replace "elf" with "coff"? That did the trick, thanks! Best of 6 runs on an Intel Core i7 920 @ 2.67GHz: before (586test): 1.415 after (x86test): 1.470 -- Erik "kusma" Faye-Lund kusmabite@gmail.com (+47) 986 59 656 ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code 2009-07-31 12:45 ` Erik Faye-Lund @ 2009-07-31 13:02 ` George Spelvin 0 siblings, 0 replies; 129+ messages in thread From: George Spelvin @ 2009-07-31 13:02 UTC (permalink / raw) To: kusmabite, linux; +Cc: git, git > That did the trick, thanks! > > Best of 6 runs on an Intel Core i7 920 @ 2.67GHz: > > before (586test): 1.415 > after (x86test): 1.470 So it's slower. Bummer. :-( Obviously I have some work to do. But thank you very much for the result! ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code 2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin 2009-07-31 11:11 ` Erik Faye-Lund @ 2009-07-31 11:21 ` Michael J Gruber 2009-07-31 11:26 ` Michael J Gruber ` (6 subsequent siblings) 8 siblings, 0 replies; 129+ messages in thread From: Michael J Gruber @ 2009-07-31 11:21 UTC (permalink / raw) To: George Spelvin; +Cc: git George Spelvin venit, vidit, dixit 31.07.2009 12:46: > After studying Andy Polyakov's optimized x86 SHA-1 in OpenSSL, I've > got a version that's 1.6% slower on a P4 and 15% faster on a Phenom. > I'm curious about the performance on other CPUs I don't have access to, > particularly Core 2 duo and i7. > > Could someone do some benchmarking for me? Old (486/Pentium/P2/P3) > machines are also interesting, but I'm optimizing for newer ones. > > I haven't packaged this nicely, but it's not that complicated. > - Download Andy's original code from > http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz > - Unpack and cd to the cryptogams-0/x86 directory > - "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile", > and "sha1-x86.pl". > - "make" > - Run ./586test (before) and ./x86test (after) and note the timings. > > Thank you! Best of 6 runs: ./586test Result: a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d 500000000 bytes: 1.642336 s ./x86test Result: a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d 500000000 bytes: 1.532153 s System: uname -a Linux localhost.localdomain 2.6.29.6-213.fc11.x86_64 #1 SMP Tue Jul 7 21:02:57 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz stepping : 11 cpu MHz : 800.000 cache size : 4096 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 2 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm ida tpr_shadow vnmi flexpriority bogomips : 4389.20 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz stepping : 11 cpu MHz : 800.000 cache size : 4096 KB physical id : 0 siblings : 2 core id : 1 cpu cores : 2 apicid : 1 initial apicid : 1 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm ida tpr_shadow vnmi flexpriority bogomips : 4388.78 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code 2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin 2009-07-31 11:11 ` Erik Faye-Lund 2009-07-31 11:21 ` Michael J Gruber @ 2009-07-31 11:26 ` Michael J Gruber 2009-07-31 12:31 ` Carlos R. Mafra ` (5 subsequent siblings) 8 siblings, 0 replies; 129+ messages in thread From: Michael J Gruber @ 2009-07-31 11:26 UTC (permalink / raw) To: George Spelvin; +Cc: git George Spelvin venit, vidit, dixit 31.07.2009 12:46: > After studying Andy Polyakov's optimized x86 SHA-1 in OpenSSL, I've > got a version that's 1.6% slower on a P4 and 15% faster on a Phenom. > I'm curious about the performance on other CPUs I don't have access to, > particularly Core 2 duo and i7. > > Could someone do some benchmarking for me? Old (486/Pentium/P2/P3) > machines are also interesting, but I'm optimizing for newer ones. > > I haven't packaged this nicely, but it's not that complicated. > - Download Andy's original code from > http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz > - Unpack and cd to the cryptogams-0/x86 directory > - "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile", > and "sha1-x86.pl". > - "make" > - Run ./586test (before) and ./x86test (after) and note the timings. > > Thank you! Best of 6 runs: ./586test Result: a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d 500000000 bytes: 1.258031 s ./x86test Result: a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d 500000000 bytes: 1.171770 s System: uname -a Linux whatever 2.6.22-14-generic #1 SMP Tue Feb 12 07:42:25 UTC 2008 i686 GNU/Linux cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz stepping : 10 cpu MHz : 2000.000 cache size : 6144 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 2 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr lahf_lm bogomips : 5988.92 clflush size : 64 processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz stepping : 10 cpu MHz : 2000.000 cache size : 6144 KB physical id : 0 siblings : 2 core id : 1 cpu cores : 2 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr lahf_lm bogomips : 5984.92 clflush size : 64 ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code 2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin ` (2 preceding siblings ...) 2009-07-31 11:26 ` Michael J Gruber @ 2009-07-31 12:31 ` Carlos R. Mafra 2009-07-31 13:27 ` Brian Ristuccia ` (4 subsequent siblings) 8 siblings, 0 replies; 129+ messages in thread From: Carlos R. Mafra @ 2009-07-31 12:31 UTC (permalink / raw) To: George Spelvin; +Cc: git On Fri 31.Jul'09 at 6:46:02 -0400, George Spelvin wrote: > - Run ./586test (before) and ./x86test (after) and note the timings. For 8 runs in a Intel(R) Core(TM)2 Duo CPU T7250 @ 2.00GHz, before: 1.75 +- 0.02 after: 1.62 +- 0.02 ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code 2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin ` (3 preceding siblings ...) 2009-07-31 12:31 ` Carlos R. Mafra @ 2009-07-31 13:27 ` Brian Ristuccia 2009-07-31 14:05 ` George Spelvin 2009-07-31 13:27 ` Jakub Narebski ` (3 subsequent siblings) 8 siblings, 1 reply; 129+ messages in thread From: Brian Ristuccia @ 2009-07-31 13:27 UTC (permalink / raw) To: git; +Cc: George Spelvin The revised code is faster on Intel Atom N270 by around 15% (results below typical of several runs): $ ./586test ; ./x86test Result: a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d 500000000 bytes: 4.981760 s Result: a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d 500000000 bytes: 4.323324 s $ cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 28 model name : Intel(R) Atom(TM) CPU N270 @ 1.60GHz stepping : 2 cpu MHz : 800.000 cache size : 512 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 1 apicid : 0 initial apicid : 0 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx constant_tsc arch_perfmon pebs bts pni dtes64 monitor ds_cpl est tm2 ssse3 xtpr pdcm lahf_lm bogomips : 3191.59 clflush size : 64 power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 28 model name : Intel(R) Atom(TM) CPU N270 @ 1.60GHz stepping : 2 cpu MHz : 800.000 cache size : 512 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 1 apicid : 1 initial apicid : 1 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx constant_tsc arch_perfmon pebs bts pni dtes64 monitor ds_cpl est tm2 ssse3 xtpr pdcm lahf_lm bogomips : 3191.91 clflush size : 64 power management: -- Brian Ristuccia brian@ristuccia.com ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code 2009-07-31 13:27 ` Brian Ristuccia @ 2009-07-31 14:05 ` George Spelvin 0 siblings, 0 replies; 129+ messages in thread From: George Spelvin @ 2009-07-31 14:05 UTC (permalink / raw) To: brian, git; +Cc: linux > The revised code is faster on Intel Atom N270 by around 15% (results below > typical of several runs): > > $ ./586test ; ./x86test > Result: a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d > Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d > 500000000 bytes: 4.981760 s > Result: a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d > Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d > 500000000 bytes: 4.323324 s Cool, thanks! I hadn't optimized it at all for Atom's in-order pipe, so I'm pleasantly surprised. ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code 2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin ` (4 preceding siblings ...) 2009-07-31 13:27 ` Brian Ristuccia @ 2009-07-31 13:27 ` Jakub Narebski 2009-07-31 15:05 ` Peter Harris ` (2 subsequent siblings) 8 siblings, 0 replies; 129+ messages in thread From: Jakub Narebski @ 2009-07-31 13:27 UTC (permalink / raw) To: George Spelvin; +Cc: git "George Spelvin" <linux@horizon.com> writes: > After studying Andy Polyakov's optimized x86 SHA-1 in OpenSSL, I've > got a version that's 1.6% slower on a P4 and 15% faster on a Phenom. > I'm curious about the performance on other CPUs I don't have access to, > particularly Core 2 duo and i7. > > Could someone do some benchmarking for me? Old (486/Pentium/P2/P3) > machines are also interesting, but I'm optimizing for newer ones. ---------- $ [time] ./586test Result: a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d 500000000 bytes: 5.376325 s real 0m5.384s user 0m5.108s sys 0m0.008s 500000000 bytes: 5.367261 s 5.09user 0.00system 0:05.38elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+378minor)pagefaults 0swaps ---------- $ [time] ./x86test Result: a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d 500000000 bytes: 5.312238 s real 0m5.325s user 0m5.060s sys 0m0.008s 500000000 bytes: 5.323783 s 5.06user 0.00system 0:05.34elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+377minor)pagefaults 0swaps ========== System: $ uname -a Linux roke 2.6.14-11.1.aur.2 #1 Tue Jan 31 16:05:05 CET 2006 \ i686 athlon i386 GNU/Linux $ cat /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 6 model : 4 model name : AMD Athlon(tm) processor stepping : 2 cpu MHz : 1000.188 cache size : 256 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 mtrr pge \ mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow bogomips : 2002.43 $ free total used free shared buffers cached Mem: 515616 495812 19804 0 6004 103160 -/+ buffers/cache: 386648 128968 Swap: 1052248 279544 772704 -- Jakub Narebski Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code 2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin ` (5 preceding siblings ...) 2009-07-31 13:27 ` Jakub Narebski @ 2009-07-31 15:05 ` Peter Harris 2009-07-31 15:22 ` Peter Harris 2009-08-03 3:47 ` x86 SHA1: Faster than OpenSSL George Spelvin 8 siblings, 0 replies; 129+ messages in thread From: Peter Harris @ 2009-07-31 15:05 UTC (permalink / raw) To: George Spelvin; +Cc: git On Fri, Jul 31, 2009 at 6:46 AM, George Spelvin wrote: > Could someone do some benchmarking for me? Old (486/Pentium/P2/P3) > machines are also interesting, but I'm optimizing for newer ones. The new code appears to be marginally faster on a Pentium 3 Xeon: Best of five runs: 586test: 11.658661 s x86test: 11.209024 s $ cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 7 model name : Pentium III (Katmai) stepping : 3 cpu MHz : 547.630 cache size : 1024 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse bogomips : 1097.12 clflush size : 32 power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 7 model name : Pentium III (Katmai) stepping : 3 cpu MHz : 547.630 cache size : 1024 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse bogomips : 1095.37 clflush size : 32 power management: ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Request for benchmarking: x86 SHA1 code 2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin ` (6 preceding siblings ...) 2009-07-31 15:05 ` Peter Harris @ 2009-07-31 15:22 ` Peter Harris 2009-08-03 3:47 ` x86 SHA1: Faster than OpenSSL George Spelvin 8 siblings, 0 replies; 129+ messages in thread From: Peter Harris @ 2009-07-31 15:22 UTC (permalink / raw) To: George Spelvin; +Cc: git On Fri, Jul 31, 2009 at 6:46 AM, George Spelvin wrote: > Could someone do some benchmarking for me? Old (486/Pentium/P2/P3) > machines are also interesting, but I'm optimizing for newer ones. My Geode isn't old in age, but I admit it's old in design (and the vendor switched to Atom right after I bought it...) Best of three runs: 586test: 26.536597 s x86test: 26.111148 s $ cat /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 5 model : 10 model name : Geode(TM) Integrated Processor by AMD PCS stepping : 2 cpu MHz : 499.927 cache size : 128 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu de pse tsc msr cx8 sep pge cmov clflush mmx mmxext 3dnowext 3dnow bogomips : 1001.72 clflush size : 32 power management: Peter Harris ^ permalink raw reply [flat|nested] 129+ messages in thread
* x86 SHA1: Faster than OpenSSL 2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin ` (7 preceding siblings ...) 2009-07-31 15:22 ` Peter Harris @ 2009-08-03 3:47 ` George Spelvin 2009-08-03 7:36 ` Jonathan del Strother ` (3 more replies) 8 siblings, 4 replies; 129+ messages in thread From: George Spelvin @ 2009-08-03 3:47 UTC (permalink / raw) To: git; +Cc: appro, appro, linux (Work in progress, state dump to mailing list archives.) This started when discussing git startup overhead due to the dynamic linker. One big contributor is the openssl library, which is used only for its optimized x86 SHA-1 implementation. So I took a look at it, with an eye to importing the code directly into the git source tree, and decided that I felt like trying to do better. The original code was excellent, but it was optimized when the P4 was new. After a bit of tweaking, I've inflicted a slight (1.4%) slowdown on the P4, but a small-but-noticeable speedup on a variety of other processors. Before After Gain Processor 1.585248 1.353314 +17% 2500 MHz Phenom 3.249614 3.295619 -1.4% 1594 MHz P4 1.414512 1.352843 +4.5% 2.66 GHz i7 3.460635 3.284221 +5.4% 1596 MHz Athlon XP 4.077993 3.891826 +4.8% 1144 MHz Athlon 1.912161 1.623212 +17% 2100 MHz Athlon 64 X2 2.956432 2.940210 +0.55% 1794 MHz Mobile Celeron (fam 15 model 2) (Seconds to hash 500x 1 MB, best of 10 runs in all cases.) This is based on Andy Polyakov's GPL/BSD licensed cryptogams code, and (for now) uses the same perl preprocessor. To test it, do the following: - Download Andy's original code from http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz - "tar xz cryptogams-0.tar.gz" - "cd cryptogams-0/x86" - "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile", and "sha1-x86.pl". - "make" - Run ./586test (before) and ./x86test (after) and note the timings. The code is currenty only the core SHA transform. Adding the appropriate init/uodate/finish wrappers is straightforward. An open question is how to add appropriate CPU detection to the git build scripts. (Note that `uname -m`, which it currently uses to select the ARM code, does NOT produce the right answer if you're using a 32-bit compiler on a 64-bit platform.) I try to explain it in the comments, but with all the software pipelining that makes the rounds overlap (and there are at least 4 different kinds of rounds, which overlap with each other), it's a bit intricate. If you feel really masochistic, make commenting suggestions... --- /dev/null 2009-05-12 02:55:38.579106460 -0400 +++ Makefile 2009-08-02 06:44:44.000000000 -0400 @@ -0,0 +1,16 @@ +CC := gcc +CFLAGS := -m32 -W -Wall -Os -g +ASFLAGS := --32 + +all: 586test x86test + +586test : sha1test.c sha1-586.o + $(CC) $(CFLAGS) -o $@ sha1test.c sha1-586.o + +x86test : sha1test.c sha1-x86.o + $(CC) $(CFLAGS) -o $@ sha1test.c sha1-x86.o + +586test x86test : sha1-586.h + +%.s : %.pl x86asm.pl x86unix.pl + perl $< elf > $@ --- /dev/null 2009-05-12 02:55:38.579106460 -0400 +++ sha1-586.h 2009-08-02 06:44:17.000000000 -0400 @@ -0,0 +1,3 @@ +#include <stdint.h> + +void sha1_block_data_order(uint32_t iv[5], void const *in, unsigned len); --- /dev/null 2009-05-12 02:55:38.579106460 -0400 +++ sha1test.c 2009-08-02 08:27:48.449609504 -0400 @@ -0,0 +1,85 @@ +#include <stdint.h> +#include <stdlib.h> +#include <stdio.h> +#include <sys/time.h> + +#include "sha1-586.h" + +#define SIZE 1000000 + +#if SIZE % 64 +# error SIZE must be a multiple of 64! 
+#endif + +static unsigned long +timing_test(uint32_t iv[5], unsigned iter) +{ + unsigned i; + struct timeval tv0, tv1; + static char *p; /* Very large buffer */ + + if (!p) { + p = malloc(SIZE); + if (!p) { + perror("malloc"); + exit(1); + } + } + + if (gettimeofday(&tv0, NULL) < 0) { + perror("gettimeofday"); + exit(1); + } + for (i = 0; i < iter; i++) + sha1_block_data_order(iv, p, SIZE/64); + if (gettimeofday(&tv1, NULL) < 0) { + perror("gettimeofday"); + exit(1); + } + return 1000000ul * (tv1.tv_sec-tv0.tv_sec) + tv1.tv_usec-tv0.tv_usec; +} + +int +main(void) +{ + uint32_t iv[5] = { + 0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476, 0xc3d2e1f0 + }; + /* Simplest known-answer test, "abc" */ + static uint8_t const testbuf[64] = { + 'a','b','c', 0x80, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 24 + }; + /* Expected: A9993E364706816ABA3E25717850C26C9CD0D89D */ + static uint32_t const expected[5] = { + 0xa9993e36, 0x4706816a, 0xba3e2571, 0x7850c26c, 0x9cd0d89d }; + unsigned i; + unsigned long min_usec = -1ul; + + /* quick correct-answer test. silent unless successful. */ + sha1_block_data_order(iv, testbuf, 1); + for (i = 0; i < 5; i++) { + if (iv[i] != expected[i]) { + printf("Result: %08x %08x %08x %08x %08x\n" + "Expected:%08x %08x %08x %08x %08x\n", + iv[0], iv[1], iv[2], iv[3], iv[4], expected[0], + expected[1], expected[2], expected[3], + expected[4]); + break; + } + } + + for (i = 0; i < 10; i++) { + unsigned long usec = timing_test(iv, 500); + printf("%2u/10: %u.%06u s\n", i+1, (unsigned)(usec/1000000), + (unsigned)(usec % 1000000)); + if (usec < min_usec) + min_usec = usec; + } + printf("Minimum time to hash %u bytes: %u.%06u\n", + 500 * SIZE, (unsigned)(min_usec/1000000), + (unsigned)(min_usec % 1000000)); + return 0; +} --- /dev/null 2009-05-12 02:55:38.579106460 -0400 +++ sha1-x86.pl 2009-08-02 08:51:01.069614130 -0400 @@ -0,0 +1,389 @@ +#!/usr/bin/env perl + +# ==================================================================== +# [Re]written by Andy Polyakov <appro@fy.chalmers.se> for the OpenSSL +# project. The module is, however, dual licensed under OpenSSL and +# CRYPTOGAMS licenses depending on where you obtain it. For further +# details see http://www.openssl.org/~appro/cryptogams/. +# ==================================================================== + +# "[Re]written" was achieved in two major overhauls. In 2004 BODY_* +# functions were re-implemented to address P4 performance issue [see +# commentary below], and in 2006 the rest was rewritten in order to +# gain freedom to liberate licensing terms. + +# It was noted that Intel IA-32 C compiler generates code which +# performs ~30% *faster* on P4 CPU than original *hand-coded* +# SHA1 assembler implementation. To address this problem (and +# prove that humans are still better than machines:-), the +# original code was overhauled, which resulted in following +# performance changes: +# +# compared with original compared with Intel cc +# assembler impl. generated code +# Pentium -16% +48% +# PIII/AMD +8% +16% +# P4 +85%(!) +45% +# +# As you can see Pentium came out as looser:-( Yet I reckoned that +# improvement on P4 outweights the loss and incorporate this +# re-tuned code to 0.9.7 and later. 
+# ---------------------------------------------------------------- +# <appro@fy.chalmers.se> + +$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1; +push(@INC,"${dir}","${dir}../../perlasm"); +require "x86asm.pl"; + +&asm_init($ARGV[0],"sha1-586.pl",$ARGV[$#ARGV] eq "386"); + +$A="eax"; +$B="ebx"; +$C="ecx"; +$D="edx"; +$E="ebp"; + +# Two temporaries +$S="edi"; +$T="esi"; + +# The round constants +use constant K1 => 0x5a827999; +use constant K2 => 0x6ED9EBA1; +use constant K3 => 0x8F1BBCDC; +use constant K4 => 0xCA62C1D6; + +# Given unlimited registers and functional units, it would be possible to +# compute SHA-1 at two cycles per round, using 7 operations per round. +# Remember, each round computes a new value for E, which is used as A +# in the following round and B in the round after that. There are two +# critical paths: +# - A must be rotated and added to the output E (2 cycles between rounds) +# - B must go through two boolean operations before being added to +# the result E. Since this latter addition can't be done in the +# same-cycle as the critical addition of a<<<5, this is a total of +# 2+1+1 = 4 cycles per 2 rounds. +# Additionally, if you want to avoid copying B, you have to rotate it +# soon after use in this round so it is ready for use as the following +# round's C. + +# e += (a <<< 5) + K + in[i] + (d^(b&(c^d))) (0..19) +# e += (a <<< 5) + K + in[i] + (b^c^d) (20..39, 60..79) +# e += (a <<< 5) + K + in[i] + (c&d) + (b&(c^d)) (40..59) +# +# The hard part is doing this with only two temporary registers. +# Taking the most complex F(b,c,d) function, writing it as above +# breaks it into two parts which can be accumulated into e separately. +# Let's divide it into 4 parts. These can be executed in a 7-cycle +# loop, assuming an in-order triple issue machine +# (quadruple counting xor-from-memory as 2): +# +# in[i] F1(c,d) F2(b,c,d) a<<<5 +# mov in[i],T (addl S,A) (movl B,S) +# xor in[i+1],T (rorl 5,S) +# xor in[i+2],T movl D,S (addl S,A) +# xor in[i+3],T andl C,S +# rotl 1,T addl S,E movl D,S +# movl T,in[i] xorl C,S +# addl T+K,E andl B,S rorl 2,B // +# (mov in[i],T) addl S,E movl A,S +# (xor in[i+1],T) rorl 5,S +# (xor in[i+2],T) (movl C,S) addl S,E +# +# In the above, we routinely read and write a register on the same cycle, +# overlapping the beginning of one computation with the end of another. +# I've tried to place the reads to the left of the writes, but some of the +# overlapping operations from adjacent rounds (given in parentheses) +# violate that. +# +# The "addl T+K,E" line is actually implemented using the lea instruction, +# which (on a Pentium) requires that neither T not K was modified on the +# previous cycle. +# +# As you can see, in the absence of out-of-order execution, the first +# column takes a minimum of 6 cycles (fetch, 3 XORs, rotate, add to E), +# and I reserve a seventh cycle before the add to E so that I can use a +# Pentium's lea instruction. +# +# The other three columns take 3, 4 and 3 cycles, respectively. +# These can all be overlapped by 1 cycle assuming a superscalar +# processor, for a total of 2+2+3 = 7 cycles. +# +# The other F() functions require 5 and 4 cycles, respectively. +# overlapped with the 3-cycle a<<<5 computation, that makes a total of 6 +# and 5 cycles, respectively. If we overlap the beginning and end of the +# Xi computation, we can get it down to 6 cycles, but below that, we'll +# just have to waste a cycle. 
+# +# For the first 16 rounds, forming Xi is just a fetch, and the F() +# function only requires 5 cycles, so the whole round can be pulled in +# to 4 cycles. + + +# Atom pairing rules (not yet optimized): +# The Atom has a dial-issue in-order pipeline, similar to the +# original Pentium. However, the issue restrictions are different. +# In particular, all memory source operations must use st use port 0, +# as must all rotates. +# +# Given that a round uses 4 fetches and 3 rotates, that's going to +# require significant care to pair well. It may take a completely +# different implementation. +# +# LEA must use port 1, but apparently it has even worse address generation +# interlock latency than the Pentium. Oh well, it's still the best way +# to do a 3-way add with a 32-bit immediate. + + +# The first 16 rounds use s simple simplest F(b,c,d) = d^(b&(c^d)), and +# don't need to form the in[i] equation, letting us reduce the round to +# 4 cycles, after some reassignment of temporaries: + +# movl D,S (roll 5,T) (addl S,A) // +# mov in[i],T xorl C,S (addl T,A) +# andl B,S rorl 2,B +# addl T+K,E xorl D,S movl A,T +# addl S,E roll 5,T (movl C,S) // +# (mov in[i],T) (xorl B,S) addl T,E + +# The // mark where the round function starts. Each round expects the +# first copy of D to S to have been done already. + +# The transition between rounds 15 and 16 is a bit tricky... the best +# thing to do is to delay the computation of a<<<5 one cycle and move it back +# to the S register. That way, T is free as early as possible. +# in[i] F(b,c,d) a<<<5 +# (addl T+K,A) (xorl E,S) (movl B,T) +# movl D,S (roll 5,T) (addl S,A) // +# mov in[i],T xorl C,S (addl T,A) +# andl B,S rorl 2,B +# addl T+K,E xorl D,S (movl in[1],T) +# (xor in[i+1],T) addl S,E movl A,S +# (xor in[i+2],T) rorl 2,B roll 5,S // +# (xor in[i+3],T) (movl C,S) addl S,E + +sub BODY_00_15 +{ + local($n,$a,$b,$c,$d,$e)=@_; + + &comment("00_15 $n"); + &mov($S,$d) if ($n == 0); + &mov($T,&swtmp($n%16)); # V Load Xi. + &xor($S,$c); # U Continue F() = d^(b&(c^d)) + &and($S,$b); # V + &rotr($b,2); # NP + &lea($e,&DWP(K1,$e,$T)); # U Add Xi and K + if ($n < 15) { + &mov($T,$a); # V + &xor($S,$d); # U + &rotl($T,5); # NP + &add($e,$S); # U + &mov($S,$c); # V Start of NEXT round's F() + &add($e,$T); # U + } else { + # This version provides the correct start for BODY_20_39 + &xor($S,$d); # V + &mov($T,&swtmp(($n+1)%16)); # U Start computing mext Xi. + &add($e,$S); # V Add F() + &mov($S,$a); # U Start computing a<<<5 + &xor($T,&swtmp(($n+3)%16)); # V + &rotl($S,5); # U + &xor($T,&swtmp(($n+9)%16)); # V + } +} + + +# A full round using F(b,c,d) = b^c^d. 6 cycles of dependency chain. +# This starts just before starting to compute F(); the Xi should have XORed +# the first three values together. (Break is at //) +# +# in[i] F(b,c,d) a<<<5 +# mov in[i],T (xorl E,S) (addl T+K,A) +# xor in[i+1],T (addl S,A) (movl B,S) +# xor in[i+2],T (roll 5,S) // +# xor in[i+3],T movl D,S (addl S,A) +# roll 1,T xorl C,S +# movl T,in[i] andl B,S rorl 2,B +# addl T+K,E xorl D,S (mov in[i],T) +# (xor in[i+1],T) addl S,E movl A,S +# (xor in[i+2],T) roll 5,S // +# (xor in[i+3],T) (movl C,S) addl S,E + +sub BODY_16_19 +{ + local($n,$a,$b,$c,$d,$e)=@_; + + &comment("16_19 $n"); + + &xor($T,&swtmp(($n+13)%16)); # U + &add($a,$S); # V End of previous round + &mov($S,$d); # U Start current round's F() + &rotl($T,1); # V + &xor($S,$c); # U + &mov(&swtmp($n%16),$T); # U Store computed Xi. 
+ &and($S,$b); # V + &rotr($b,2); # NP + &lea($e,&DWP(K1,$e,$T)); # U Add Xi and K + &mov($T,&swtmp(($n+1)%16)); # V Start computing mext Xi. + &xor($S,$d); # U + &xor($T,&swtmp(($n+3)%16)); # V + &add($e,$S); # U Add F() + &mov($S,$a); # V Start computing a<<<5 + &xor($T,&swtmp(($n+9)%16)); # U + &rotl($S,5); # NP +} + +# This is just like BODY_16_19, but computes a different F() = b^c^d +# +# in[i] F(b,c,d) a<<<5 +# mov in[i],T (xorl E,S) (addl T+K,A) +# xor in[i+1],T (addl S,A) (movl B,S) +# xor in[i+2],T (roll 5,S) // +# xor in[i+3],T (addl S,A) +# roll 1,T movl D,S +# movl T,in[i] xorl B,S rorl 2,B +# addl T+K,E xorl C,S (mov in[i],T) +# (xor in[i+1],T) addl S,E movl A,S +# (xor in[i+2],T) roll 5,S // +# (xor in[i+3],T) (movl C,S) addl S,E + +sub BODY_20_39 # And 61..79 +{ + local($n,$a,$b,$c,$d,$e)=@_; + local $K=($n<40) ? K2 : K4; + + &comment("20_39 $n"); + + &xor($T,&swtmp(($n+13)%16)); # U + &add($a,$S); # V End of previous round + &rotl($T,1); # U + &mov($S,$d); # V Start current round's F() + &mov(&swtmp($n%16),$T) if ($n < 77); # Store computed Xi. + &xor($S,$b); # V + &rotr($b,2); # NP + &lea($e,&DWP($K,$e,$T)); # U Add Xi and K + &mov($T,&swtmp(($n+1)%16)) if ($n < 79); # Start computing next Xi. + &xor($S,$c); # U + &xor($T,&swtmp(($n+3)%16)) if ($n < 79); + &add($e,$S); # U Add F1() + &mov($S,$a); # V Start computing a<<<5 + &xor($T,&swtmp(($n+9)%16)) if ($n < 79); + &rotl($S,5); # NP + + &add($e,$S) if ($n == 79); +} + + +# This starts immediately after the LEA, and expects to need to finish +# the previous round. (break is at //) +# +# in[i] F1(c,d) F2(b,c,d) a<<<5 +# (addl T+K,E) (andl C,S) (rorl 2,C) +# mov in[i],T (addl S,A) (movl B,S) +# xor in[i+1],T (rorl 5,S) +# xor in[i+2],T / movl D,S (addl S,A) +# xor in[i+3],T andl C,S +# rotl 1,T addl S,E movl D,S +# (movl T,in[i]) xorl C,S +# addl T+K,E andl B,S rorl 2,B +# (mov in[i],T) addl S,E movl A,S +# (xor in[i+1],T) rorl 5,S +# (xor in[i+2],T) // (movl C,S) addl S,E + +sub BODY_40_59 +{ + local($n,$a,$b,$c,$d,$e)=@_; + + &comment("40_59 $n"); + + &add($a,$S); # V End of previous round + &mov($S,$d); # U Start current round's F(1) + &xor($T,&swtmp(($n+13)%16)); # V + &and($S,$c); # U + &rotl($T,1); # U XXX Missed pairing + &add($e,$S); # V Add F1() + &mov($S,$d); # U Start current round's F2() + &mov(&swtmp($n%16),$T); # V Store computed Xi. + &xor($S,$c); # U + &lea($e,&DWP(K3,$e,$T)); # V + &and($S,$b); # U XXX Missed pairing + &rotr($b,2); # NP + &mov($T,&swtmp(($n+1)%16)); # U Start computing next Xi. + &add($e,$S); # V Add F2() + &mov($S,$a); # U Start computing a<<<5 + &xor($T,&swtmp(($n+3)%16)); # V + &rotl($S,5); # NP + &xor($T,&swtmp(($n+9)%16)); # U +} +# The above code is NOT optimally paired for the Pentium. (And thus, +# presumably, Atom, which has a very similar dual-issue in-order pipeline.) +# However, attempts to improve it make it slower on Phenom & i7. + +&function_begin("sha1_block_data_order",16); + + local @V = ($A,$B,$C,$D,$E); + local @W = ($A,$B,$C); + + &mov($S,&wparam(0)); # SHA_CTX *c + &mov($T,&wparam(1)); # const void *input + &mov($A,&wparam(2)); # size_t num + &stack_push(16); # allocate X[16] + &shl($A,6); + &add($A,$T); + &mov(&wparam(2),$A); # pointer beyond the end of input + &mov($E,&DWP(16,$S));# pre-load E + &mov($D,&DWP(12,$S));# pre-load D + + &set_label("loop",16); + + # copy input chunk to X, but reversing byte order! 
+ &mov($W[2],&DWP(4*(0),$T)); + &mov($W[1],&DWP(4*(1),$T)); + &bswap($W[2]); + for ($i=0; $i<14; $i++) { + &mov($W[0],&DWP(4*($i+2),$T)); + &bswap($W[1]); + &mov(&swtmp($i+0),$W[2]); + unshift(@W,pop(@W)); + } + &bswap($W[1]); + &mov(&swtmp($i+0),$W[2]); + &mov(&swtmp($i+1),$W[1]); + + &mov(&wparam(1),$T); # redundant in 1st spin + + # Reload A, B and C, which we use as temporaries in the copying + &mov($C,&DWP(8,$S)); + &mov($B,&DWP(4,$S)); + &mov($A,&DWP(0,$S)); + + for($i=0;$i<16;$i++) { &BODY_00_15($i,@V); unshift(@V,pop(@V)); } + for(;$i<20;$i++) { &BODY_16_19($i,@V); unshift(@V,pop(@V)); } + for(;$i<40;$i++) { &BODY_20_39($i,@V); unshift(@V,pop(@V)); } + for(;$i<60;$i++) { &BODY_40_59($i,@V); unshift(@V,pop(@V)); } + for(;$i<80;$i++) { &BODY_20_39($i,@V); unshift(@V,pop(@V)); } + + (($V[4] eq $E) and ($V[0] eq $A)) or die; # double-check + + &comment("Loop trailer"); + + &mov($S,&wparam(0)); # re-load SHA_CTX* + &mov($T,&wparam(1)); # re-load input pointer + + &add($E,&DWP(16,$S)); + &add($D,&DWP(12,$S)); + &add(&DWP(8,$S),$C); + &add(&DWP(4,$S),$B); + &add($T,64); # advance input pointer + &add(&DWP(0,$S),$A); + &mov(&DWP(12,$S),$D); + &mov(&DWP(16,$S),$E); + + &cmp($T,&wparam(2)); # have we reached the end yet? + &jb(&label("loop")); + + &stack_pop(16); +&function_end("sha1_block_data_order"); +&asciz("SHA1 block transform for x86, CRYPTOGAMS by <appro\@openssl.org>"); + +&asm_finish(); ^ permalink raw reply [flat|nested] 129+ messages in thread
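For orientation while reading the scheduling comments in the patch above: the perl generator is interleaving the four standard SHA-1 round types, which in plain C (reference only, not part of the patch; the macro names here are mine) look like this:

#define ROTL(x,n) (((x) << (n)) | ((x) >> (32 - (n))))

/* Round constants; these match the decimal literals in the generated
 * assembly (1518500249, 1859775393, 2400959708, 3395469782). */
#define K1 0x5a827999	/* rounds  0..19 */
#define K2 0x6ed9eba1	/* rounds 20..39 */
#define K3 0x8f1bbcdc	/* rounds 40..59 */
#define K4 0xca62c1d6	/* rounds 60..79 */

/* Every round is:  e += F(b,c,d) + W[i] + K;  e += ROTL(a,5);  b = ROTL(b,30);
 * after which the five state words rotate.  Only F() and K change by group: */
#define F_00_19(b,c,d) ((d) ^ ((b) & ((c) ^ (d))))		/* "choice"  */
#define F_20_39(b,c,d) ((b) ^ (c) ^ (d))			/* parity    */
#define F_40_59(b,c,d) (((c) & (d)) + ((b) & ((c) ^ (d))))	/* majority, */
	/* written as two disjoint addends, which is exactly why BODY_40_59
	 * above has two separate "addl S,E" steps */
#define F_60_79(b,c,d) ((b) ^ (c) ^ (d))

/* Message schedule, kept in a 16-word circular buffer (the swtmp(0..15) slots): */
#define W_NEXT(W,i) \
	(W[(i) % 16] = ROTL(W[((i)+13) % 16] ^ W[((i)+8) % 16] ^ \
	                    W[((i)+2) % 16]  ^ W[(i) % 16], 1))

The (n+1)/(n+3)/(n+9) swtmp offsets inside each round body, plus the (n+13) at the top of the following round, are this same recurrence shifted by one, because each round body starts prefetching the next round's Xi.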
* Re: x86 SHA1: Faster than OpenSSL 2009-08-03 3:47 ` x86 SHA1: Faster than OpenSSL George Spelvin @ 2009-08-03 7:36 ` Jonathan del Strother 2009-08-04 1:40 ` Mark Lodato ` (2 subsequent siblings) 3 siblings, 0 replies; 129+ messages in thread From: Jonathan del Strother @ 2009-08-03 7:36 UTC (permalink / raw) To: George Spelvin; +Cc: git, appro, appro On Mon, Aug 3, 2009 at 4:47 AM, George Spelvin<linux@horizon.com> wrote: > (Work in progress, state dump to mailing list archives.) > > This started when discussing git startup overhead due to the dynamic > linker. One big contributor is the openssl library, which is used only > for its optimized x86 SHA-1 implementation. So I took a look at it, > with an eye to importing the code directly into the git source tree, > and decided that I felt like trying to do better. > FWIW, this doesn't work on OS X / Darwin. 'as' doesn't take a --32 flag, it takes an -arch i386 flag. After changing that, I get: as -arch i386 -o sha1-586.o sha1-586.s sha1-586.s:4:Unknown pseudo-op: .type sha1-586.s:4:Rest of line ignored. 1st junk character valued 115 (s). sha1-586.s:5:Alignment too large: 15. assumed. sha1-586.s:19:Alignment too large: 15. assumed. sha1-586.s:1438:Unknown pseudo-op: .size sha1-586.s:1438:Rest of line ignored. 1st junk character valued 115 (s). make: *** [sha1-586.o] Error 1 - at which point I have no idea how to fix it. ^ permalink raw reply [flat|nested] 129+ messages in thread
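For what it's worth, the failing pseudo-ops above are the ELF-only ones, so one possible (untested, sketch-only) way to keep a single source for both toolchains would be to rename the output to sha1-586.S, let gcc run the C preprocessor over it, and hide the ELF directives behind a guard; the __ELF__ test below is an assumption about what the compilers predefine:

#ifdef __ELF__
# define FUNC_TYPE(name)	.type name,@function
# define FUNC_SIZE(name,end)	.size name,end-name
#else
# define FUNC_TYPE(name)	/* Mach-O 'as' has no .type ... */
# define FUNC_SIZE(name,end)	/* ... and no .size             */
#endif

Darwin's 'as' also reads ".align n" as 2^n bytes, so the ".align 16" lines would have to become ".align 4" there; and driving the file through "gcc -m32 -c" (or "-arch i386" on OS X) instead of invoking 'as' directly should sidestep the --32 flag problem as well.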
* Re: x86 SHA1: Faster than OpenSSL 2009-08-03 3:47 ` x86 SHA1: Faster than OpenSSL George Spelvin 2009-08-03 7:36 ` Jonathan del Strother @ 2009-08-04 1:40 ` Mark Lodato 2009-08-04 2:30 ` Linus Torvalds 2009-08-18 21:26 ` Andy Polyakov 3 siblings, 0 replies; 129+ messages in thread From: Mark Lodato @ 2009-08-04 1:40 UTC (permalink / raw) To: George Spelvin; +Cc: git, appro, appro On Sun, Aug 2, 2009 at 11:47 PM, George Spelvin<linux@horizon.com> wrote: > Before After Gain Processor > 1.585248 1.353314 +17% 2500 MHz Phenom > 3.249614 3.295619 -1.4% 1594 MHz P4 > 1.414512 1.352843 +4.5% 2.66 GHz i7 > 3.460635 3.284221 +5.4% 1596 MHz Athlon XP > 4.077993 3.891826 +4.8% 1144 MHz Athlon > 1.912161 1.623212 +17% 2100 MHz Athlon 64 X2 > 2.956432 2.940210 +0.55% 1794 MHz Mobile Celeron (fam 15 model 2) > > (Seconds to hash 500x 1 MB, best of 10 runs in all cases.) > > This is based on Andy Polyakov's GPL/BSD licensed cryptogams code, and > (for now) uses the same perl preprocessor. To test it, do the following: > - Download Andy's original code from > http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz > - "tar xz cryptogams-0.tar.gz" > - "cd cryptogams-0/x86" > - "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile", > and "sha1-x86.pl". > - "make" > - Run ./586test (before) and ./x86test (after) and note the timings. Note, to compile this on Ubuntu x86-64, I had to: $ sudo apt-get install libc6-dev-i386 $ ./586test 1/10: 2.016621 s 2/10: 2.030742 s 3/10: 2.027333 s 4/10: 2.024018 s 5/10: 2.022306 s 6/10: 2.022418 s 7/10: 2.047103 s 8/10: 2.035467 s 9/10: 2.032237 s 10/10: 2.029231 s Minimum time to hash 500000000 bytes: 2.016621 $ ./x86test 1/10: 1.818661 s 2/10: 1.814856 s 3/10: 1.816232 s 4/10: 1.815208 s 5/10: 1.834047 s 6/10: 1.843020 s 7/10: 1.819564 s 8/10: 1.815560 s 9/10: 1.824232 s 10/10: 1.820943 s Minimum time to hash 500000000 bytes: 1.814856 $ python -c 'print 2.016621 / 1.814856' 1.11117410968 $ cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz stepping : 2 cpu MHz : 1861.825 cache size : 2048 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 2 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant _tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow bogomips : 3723.65 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz stepping : 2 cpu MHz : 1861.825 cache size : 2048 KB physical id : 0 siblings : 2 core id : 1 cpu cores : 2 apicid : 1 initial apicid : 1 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant _tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow bogomips : 3724.01 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: I imagine that you can get a bigger speedup by making a 64-bit version (but maybe not). 
Either way, it would be nice if x86-64 users did not have to install an additional package to compile. Cheers, Mark ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-03 3:47 ` x86 SHA1: Faster than OpenSSL George Spelvin 2009-08-03 7:36 ` Jonathan del Strother 2009-08-04 1:40 ` Mark Lodato @ 2009-08-04 2:30 ` Linus Torvalds 2009-08-04 2:51 ` Linus Torvalds 2009-08-04 4:48 ` George Spelvin 2009-08-18 21:26 ` Andy Polyakov 3 siblings, 2 replies; 129+ messages in thread From: Linus Torvalds @ 2009-08-04 2:30 UTC (permalink / raw) To: George Spelvin; +Cc: git, appro, appro On Sun, 2 Aug 2009, George Spelvin wrote: > > The original code was excellent, but it was optimized when the P4 was new. > After a bit of tweaking, I've inflicted a slight (1.4%) slowdown on the > P4, but a small-but-noticeable speedup on a variety of other processors. > > Before After Gain Processor > 1.585248 1.353314 +17% 2500 MHz Phenom > 3.249614 3.295619 -1.4% 1594 MHz P4 > 1.414512 1.352843 +4.5% 2.66 GHz i7 > 3.460635 3.284221 +5.4% 1596 MHz Athlon XP > 4.077993 3.891826 +4.8% 1144 MHz Athlon > 1.912161 1.623212 +17% 2100 MHz Athlon 64 X2 > 2.956432 2.940210 +0.55% 1794 MHz Mobile Celeron (fam 15 model 2) It would be better to have a more git-centric benchmark that actually shows some real git load, rather than a sha1-only microbenchmark. The thing that I'd prefer is simply git fsck --full on the Linux kernel archive. For me (with a fast machine), it takes about 4m30s with the OpenSSL SHA1, and takes 6m40s with the Mozilla SHA1 (ie using a NO_OPENSSL=1 build). So that's an example of a load that is actually very sensitive to SHA1 performance (more so than _most_ git loads, I suspect), and at the same time is a real git load rather than some SHA1-only microbenchmark. It also shows very clearly why we default to the OpenSSL version over the Mozilla one. NOTE! I didn't do multiple runs to see how stable the numbers are, and so it's possible that I exaggerated the OpenSSL advantage over the Mozilla-SHA1 code. Or vice versa. My point is really only that I don't know how meaningful a "50 x 1M SHA1" benchmark is, while I know that a "git fsck" benchmark has at least _some_ real life value. Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-04 2:30 ` Linus Torvalds @ 2009-08-04 2:51 ` Linus Torvalds 2009-08-04 3:07 ` Jon Smirl 2009-08-18 21:50 ` Andy Polyakov 2009-08-04 4:48 ` George Spelvin 1 sibling, 2 replies; 129+ messages in thread
From: Linus Torvalds @ 2009-08-04 2:51 UTC (permalink / raw)
To: George Spelvin; +Cc: git, appro, appro

On Mon, 3 Aug 2009, Linus Torvalds wrote:
>
> The thing that I'd prefer is simply
>
> git fsck --full
>
> on the Linux kernel archive. For me (with a fast machine), it takes about
> 4m30s with the OpenSSL SHA1, and takes 6m40s with the Mozilla SHA1 (ie
> using a NO_OPENSSL=1 build).
>
> So that's an example of a load that is actually very sensitive to SHA1
> performance (more so than _most_ git loads, I suspect), and at the same
> time is a real git load rather than some SHA1-only microbenchmark. It also
> shows very clearly why we default to the OpenSSL version over the Mozilla
> one.

"perf report --sort comm,dso,symbol" profiling shows the following for
'git fsck --full' on the kernel repo, using the Mozilla SHA1:

47.69% git /home/torvalds/git/git [.] moz_SHA1_Update
22.98% git /lib64/libz.so.1.2.3 [.] inflate_fast
7.32% git /lib64/libc-2.10.1.so [.] __GI_memcpy
4.66% git /lib64/libz.so.1.2.3 [.] inflate
3.76% git /lib64/libz.so.1.2.3 [.] adler32
2.86% git /lib64/libz.so.1.2.3 [.] inflate_table
2.41% git /home/torvalds/git/git [.] lookup_object
1.31% git /lib64/libc-2.10.1.so [.] _int_malloc
0.84% git /home/torvalds/git/git [.] patch_delta
0.78% git [kernel] [k] hpet_next_event

so yeah, SHA1 performance matters. Judging by the OpenSSL numbers, the
OpenSSL SHA1 implementation must be about twice as fast as the C version
we use.

That said, under "normal" git usage models, the SHA1 costs are almost
invisible. So git-fsck is definitely a fairly unusual case that stresses
the SHA1 performance more than most git loads.

Linus
^ permalink raw reply	[flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-04 2:51 ` Linus Torvalds @ 2009-08-04 3:07 ` Jon Smirl 2009-08-04 5:01 ` George Spelvin 2009-08-18 21:50 ` Andy Polyakov 1 sibling, 1 reply; 129+ messages in thread From: Jon Smirl @ 2009-08-04 3:07 UTC (permalink / raw) To: Linus Torvalds; +Cc: George Spelvin, git, appro, appro On Mon, Aug 3, 2009 at 10:51 PM, Linus Torvalds<torvalds@linux-foundation.org> wrote: > > > On Mon, 3 Aug 2009, Linus Torvalds wrote: >> >> The thing that I'd prefer is simply >> >> git fsck --full >> >> on the Linux kernel archive. For me (with a fast machine), it takes about >> 4m30s with the OpenSSL SHA1, and takes 6m40s with the Mozilla SHA1 (ie >> using a NO_OPENSSL=1 build). >> >> So that's an example of a load that is actually very sensitive to SHA1 >> performance (more so than _most_ git loads, I suspect), and at the same >> time is a real git load rather than some SHA1-only microbenchmark. It also >> shows very clearly why we default to the OpenSSL version over the Mozilla >> one. > > "perf report --sort comm,dso,symbol" profiling shows the following for > 'git fsck --full' on the kernel repo, using the Mozilla SHA1: > > 47.69% git /home/torvalds/git/git [.] moz_SHA1_Update > 22.98% git /lib64/libz.so.1.2.3 [.] inflate_fast > 7.32% git /lib64/libc-2.10.1.so [.] __GI_memcpy > 4.66% git /lib64/libz.so.1.2.3 [.] inflate > 3.76% git /lib64/libz.so.1.2.3 [.] adler32 > 2.86% git /lib64/libz.so.1.2.3 [.] inflate_table > 2.41% git /home/torvalds/git/git [.] lookup_object > 1.31% git /lib64/libc-2.10.1.so [.] _int_malloc > 0.84% git /home/torvalds/git/git [.] patch_delta > 0.78% git [kernel] [k] hpet_next_event > > so yeah, SHA1 performance matters. Judging by the OpenSSL numbers, the > OpenSSL SHA1 implementation must be about twice as fast as the C version > we use. Would there happen to be a SHA1 implementation around that can compute the SHA1 without first decompressing the data? Databases gain a lot of speed by using special algorithms that can directly operate on the compressed data. > > That said, under "normal" git usage models, the SHA1 costs are almost > invisible. So git-fsck is definitely a fairly unusual case that stresses > the SHA1 performance more than most git lods. > > Linus > -- > To unsubscribe from this list: send the line "unsubscribe git" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-04 3:07 ` Jon Smirl @ 2009-08-04 5:01 ` George Spelvin 2009-08-04 12:56 ` Jon Smirl 0 siblings, 1 reply; 129+ messages in thread From: George Spelvin @ 2009-08-04 5:01 UTC (permalink / raw) To: jonsmirl; +Cc: git, linux > Would there happen to be a SHA1 implementation around that can compute > the SHA1 without first decompressing the data? Databases gain a lot of > speed by using special algorithms that can directly operate on the > compressed data. I can't imagine how. In general, this requires that the compression be carefully designed to be compatible with the algorithms, and SHA1 is specifically designed to depend on every bit of the input in an un-analyzable way. Also, git normally avoids hashing objects that it doesn't need uncompressed for some other reason. git-fsck is a notable exception, but I think the idea of creating special optimized code paths for that interferes with its reliability and robustness goals. ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-04 5:01 ` George Spelvin @ 2009-08-04 12:56 ` Jon Smirl 2009-08-04 14:29 ` Dmitry Potapov 0 siblings, 1 reply; 129+ messages in thread From: Jon Smirl @ 2009-08-04 12:56 UTC (permalink / raw) To: George Spelvin; +Cc: git On Tue, Aug 4, 2009 at 1:01 AM, George Spelvin<linux@horizon.com> wrote: >> Would there happen to be a SHA1 implementation around that can compute >> the SHA1 without first decompressing the data? Databases gain a lot of >> speed by using special algorithms that can directly operate on the >> compressed data. > > I can't imagine how. In general, this requires that the compression > be carefully designed to be compatible with the algorithms, and SHA1 > is specifically designed to depend on every bit of the input in > an un-analyzable way. A simple start would be to feed each byte as it is decompressed directly into the sha code and avoid the intermediate buffer. Removing the buffer reduces cache pressure. > Also, git normally avoids hashing objects that it doesn't need > uncompressed for some other reason. git-fsck is a notable exception, > but I think the idea of creating special optimized code paths for that > interferes with its reliability and robustness goals. Agreed that there is no real need for this, just something to play with if you are trying for a speed record. I'd much rather have a solution for the rebase problem where one side of the diff has moved to a different file and rebase can't figure it out. > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 129+ messages in thread
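To make the suggestion concrete, here is a rough sketch (not a git patch, and not claiming this is how git actually reads objects) of hashing the inflated bytes chunk by chunk as they come out of zlib, using the git_SHA1_* wrappers; git's real code also hashes the "type size\0" object header first and has proper error handling, both omitted here, and the cache.h include is an assumption:

#include <string.h>
#include <zlib.h>
#include "cache.h"	/* assumed to provide git_SHA_CTX and friends */

static int hash_while_inflating(const unsigned char *zdata, unsigned long zlen,
				unsigned char sha1[20])
{
	git_SHA_CTX c;
	z_stream s;
	unsigned char buf[4096];	/* small chunk instead of one big buffer */
	int ret;

	memset(&s, 0, sizeof(s));
	if (inflateInit(&s) != Z_OK)
		return -1;
	s.next_in = (unsigned char *)zdata;
	s.avail_in = (uInt)zlen;

	git_SHA1_Init(&c);
	do {
		s.next_out = buf;
		s.avail_out = sizeof(buf);
		ret = inflate(&s, Z_NO_FLUSH);
		if (ret != Z_OK && ret != Z_STREAM_END) {
			inflateEnd(&s);
			return -1;
		}
		/* hash whatever this inflate() call produced */
		git_SHA1_Update(&c, buf, sizeof(buf) - s.avail_out);
	} while (ret != Z_STREAM_END);

	inflateEnd(&s);
	git_SHA1_Final(sha1, &c);
	return 0;
}

Whether this actually wins anything is a separate question; see the reply below.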
* Re: x86 SHA1: Faster than OpenSSL 2009-08-04 12:56 ` Jon Smirl @ 2009-08-04 14:29 ` Dmitry Potapov 0 siblings, 0 replies; 129+ messages in thread
From: Dmitry Potapov @ 2009-08-04 14:29 UTC (permalink / raw)
To: Jon Smirl; +Cc: George Spelvin, git

On Tue, Aug 04, 2009 at 08:56:48AM -0400, Jon Smirl wrote:
>
> A simple start would be to feed each byte as it is decompressed
> directly into the sha code and avoid the intermediate buffer. Removing
> the buffer reduces cache pressure.

First, you still have to preserve every decoded byte in the compression
window, which is 32Kb by default. Typical files in Git repositories are
not so big; many are under 32Kb, and practically all of them fit into the
L2 cache of modern processors.

Second, the complication of the assembler code from coupling the two
algorithms would be enormous. There are not enough registers on x86 for
SHA-1 alone.

Third, SHA-1 is very computationally intensive and has a predictable
(linear) access pattern, so you do not wait on L2, because the data will
already be in L1.

So, I don't see where you can gain significantly. Perhaps you could win
something just from rewriting inflate in assembler, but I do not expect
any significant gains other than that. And coupling has obvious
disadvantages when it comes to maintenance...

Dmitry
^ permalink raw reply	[flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-04 2:51 ` Linus Torvalds 2009-08-04 3:07 ` Jon Smirl @ 2009-08-18 21:50 ` Andy Polyakov 1 sibling, 0 replies; 129+ messages in thread
From: Andy Polyakov @ 2009-08-18 21:50 UTC (permalink / raw)
To: Linus Torvalds; +Cc: George Spelvin, git

> On Mon, 3 Aug 2009, Linus Torvalds wrote:
>> The thing that I'd prefer is simply
>>
>> git fsck --full
>>
>> on the Linux kernel archive. For me (with a fast machine), it takes about
>> 4m30s with the OpenSSL SHA1, and takes 6m40s with the Mozilla SHA1 (ie
>> using a NO_OPENSSL=1 build).
>>
>> So that's an example of a load that is actually very sensitive to SHA1
>> performance (more so than _most_ git loads, I suspect), and at the same
>> time is a real git load rather than some SHA1-only microbenchmark.

I couldn't agree more that real-life benchmarks are of greater value than
a specific-algorithm micro-benchmark. And given the provided profiling data,
one can argue that a +17% (or my +12%) improvement on a micro-benchmark
isn't really worth bothering about. But it's kind of a sport [at least for
me], so don't judge too harshly :-)

>> It also
>> shows very clearly why we default to the OpenSSL version over the Mozilla
>> one.

As George implicitly mentioned, most OpenSSL assembler modules are
available under a more permissive license, and if there is interest I'm
ready to assist...

> "perf report --sort comm,dso,symbol" profiling shows the following for
> 'git fsck --full' on the kernel repo, using the Mozilla SHA1:
>
> 47.69% git /home/torvalds/git/git [.] moz_SHA1_Update
> 22.98% git /lib64/libz.so.1.2.3 [.] inflate_fast
> 7.32% git /lib64/libc-2.10.1.so [.] __GI_memcpy
> 4.66% git /lib64/libz.so.1.2.3 [.] inflate
> 3.76% git /lib64/libz.so.1.2.3 [.] adler32
> 2.86% git /lib64/libz.so.1.2.3 [.] inflate_table
> 2.41% git /home/torvalds/git/git [.] lookup_object
> 1.31% git /lib64/libc-2.10.1.so [.] _int_malloc
> 0.84% git /home/torvalds/git/git [.] patch_delta
> 0.78% git [kernel] [k] hpet_next_event
>
> so yeah, SHA1 performance matters. Judging by the OpenSSL numbers, the
> OpenSSL SHA1 implementation must be about twice as fast as the C version
> we use.

And given the /lib64 path, this is 64-bit compiler-generated C code compared
to 32-bit assembler? Either way, in this context I have an extra comment
addressing a previous subscriber, Mark Lodato, who effectively wondered how
64-bit assembler would compare to the 32-bit one. First of all, there *is*
even a 64-bit assembler version. But as SHA1 is essentially a 32-bit
algorithm, the 64-bit implementation is only nominally faster, +20% at most.
It is faster thanks to the larger register bank, which facilitates more
efficient instruction scheduling.

Cheers. A.
^ permalink raw reply	[flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-04 2:30 ` Linus Torvalds 2009-08-04 2:51 ` Linus Torvalds @ 2009-08-04 4:48 ` George Spelvin 2009-08-04 6:30 ` Linus Torvalds 2009-08-04 6:40 ` Linus Torvalds 1 sibling, 2 replies; 129+ messages in thread From: George Spelvin @ 2009-08-04 4:48 UTC (permalink / raw) To: torvalds; +Cc: git, linux > It would be better to have a more git-centric benchmark that actually > shows some real git load, rather than a sha1-only microbenchmark. > > The thing that I'd prefer is simply > > git fsck --full > > on the Linux kernel archive. For me (with a fast machine), it takes about > 4m30s with the OpenSSL SHA1, and takes 6m40s with the Mozilla SHA1 (ie > using a NO_OPENSSL=1 build). The actual goal of this effort is to address the dynamic linker startup time issues by removing the second-largest contributor after libcurl, namely openssl. Optimizing the assembly code is just the fun part. ;-) Anyway, on the git repository: [1273]$ time x/git-fsck --full (New SHA1 code) dangling tree 524973049a7e4593df4af41e0564912f678a41ac dangling tree 7da7d73185a1df5c2a477d2ee5599ac8a58cad56 real 0m59.306s user 0m58.760s sys 0m0.550s [1274]$ time ./git-fsck --full (OpenSSL) dangling tree 524973049a7e4593df4af41e0564912f678a41ac dangling tree 7da7d73185a1df5c2a477d2ee5599ac8a58cad56 real 1m0.364s user 0m59.970s sys 0m0.400s 1.6% is a pretty minor difference, especially as the machine is running a backup at the time (but it's a quad-core, with near-zero CPU usage; the business is all I/O). On the full Linux repository, I repacked it first to make sure that everything was in RAM, and I have the first result: [517]$ time ~/git/x/git-fsck --full (New SHA1 code) real 10m12.702s user 9m48.410s sys 0m23.350s [518]$ time ~/git/git-fsck --full (OpenSSL) real 10m26.083s user 10m2.800s sys 0m22.000s Again, 2.2% is not a huge improvement. But my only goal was not to be worse. > So that's an example of a load that is actually very sensitive to SHA1 > performance (more so than _most_ git loads, I suspect), and at the same > time is a real git load rather than some SHA1-only microbenchmark. It also > shows very clearly why we default to the OpenSSL version over the Mozilla > one. I wasn't questioning *that*. As I said, I was just doing the fun part of importing a heavily-optimized OpenSSL-like SHA1 implementation into the git source tree. (The un-fun part is modifying the build process to detect the target processor and include the right asm automatically.) Anyway, if you want to test it, here's a crude x86_32-only patch to the git tree. "make NO_OPENSSL=1" to use the new code. 
diff --git a/Makefile b/Makefile index daf4296..8531c39 100644 --- a/Makefile +++ b/Makefile @@ -1176,8 +1176,10 @@ ifdef ARM_SHA1 LIB_OBJS += arm/sha1.o arm/sha1_arm.o else ifdef MOZILLA_SHA1 - SHA1_HEADER = "mozilla-sha1/sha1.h" - LIB_OBJS += mozilla-sha1/sha1.o +# SHA1_HEADER = "mozilla-sha1/sha1.h" +# LIB_OBJS += mozilla-sha1/sha1.o + SHA1_HEADER = "x86/sha1.h" + LIB_OBJS += x86/sha1.o x86/sha1-x86.o else SHA1_HEADER = <openssl/sha.h> EXTLIBS += $(LIB_4_CRYPTO) diff --git a/x86/sha1-x86.s b/x86/sha1-x86.s new file mode 100644 index 0000000..96796d4 --- /dev/null +++ b/x86/sha1-x86.s @@ -0,0 +1,1372 @@ +.file "sha1-586.s" +.text +.globl sha1_block_data_order +.type sha1_block_data_order,@function +.align 16 +sha1_block_data_order: + pushl %ebp + pushl %ebx + pushl %esi + pushl %edi + movl 20(%esp),%edi + movl 24(%esp),%esi + movl 28(%esp),%eax + subl $64,%esp + shll $6,%eax + addl %esi,%eax + movl %eax,92(%esp) + movl 16(%edi),%ebp + movl 12(%edi),%edx +.align 16 +.L000loop: + movl (%esi),%ecx + movl 4(%esi),%ebx + bswap %ecx + movl 8(%esi),%eax + bswap %ebx + movl %ecx,(%esp) + movl 12(%esi),%ecx + bswap %eax + movl %ebx,4(%esp) + movl 16(%esi),%ebx + bswap %ecx + movl %eax,8(%esp) + movl 20(%esi),%eax + bswap %ebx + movl %ecx,12(%esp) + movl 24(%esi),%ecx + bswap %eax + movl %ebx,16(%esp) + movl 28(%esi),%ebx + bswap %ecx + movl %eax,20(%esp) + movl 32(%esi),%eax + bswap %ebx + movl %ecx,24(%esp) + movl 36(%esi),%ecx + bswap %eax + movl %ebx,28(%esp) + movl 40(%esi),%ebx + bswap %ecx + movl %eax,32(%esp) + movl 44(%esi),%eax + bswap %ebx + movl %ecx,36(%esp) + movl 48(%esi),%ecx + bswap %eax + movl %ebx,40(%esp) + movl 52(%esi),%ebx + bswap %ecx + movl %eax,44(%esp) + movl 56(%esi),%eax + bswap %ebx + movl %ecx,48(%esp) + movl 60(%esi),%ecx + bswap %eax + movl %ebx,52(%esp) + bswap %ecx + movl %eax,56(%esp) + movl %ecx,60(%esp) + movl %esi,88(%esp) + movl 8(%edi),%ecx + movl 4(%edi),%ebx + movl (%edi),%eax + /* 00_15 0 */ + movl %edx,%edi + movl (%esp),%esi + xorl %ecx,%edi + andl %ebx,%edi + rorl $2,%ebx + leal 1518500249(%ebp,%esi,1),%ebp + movl %eax,%esi + xorl %edx,%edi + roll $5,%esi + addl %edi,%ebp + movl %ecx,%edi + addl %esi,%ebp + /* 00_15 1 */ + movl 4(%esp),%esi + xorl %ebx,%edi + andl %eax,%edi + rorl $2,%eax + leal 1518500249(%edx,%esi,1),%edx + movl %ebp,%esi + xorl %ecx,%edi + roll $5,%esi + addl %edi,%edx + movl %ebx,%edi + addl %esi,%edx + /* 00_15 2 */ + movl 8(%esp),%esi + xorl %eax,%edi + andl %ebp,%edi + rorl $2,%ebp + leal 1518500249(%ecx,%esi,1),%ecx + movl %edx,%esi + xorl %ebx,%edi + roll $5,%esi + addl %edi,%ecx + movl %eax,%edi + addl %esi,%ecx + /* 00_15 3 */ + movl 12(%esp),%esi + xorl %ebp,%edi + andl %edx,%edi + rorl $2,%edx + leal 1518500249(%ebx,%esi,1),%ebx + movl %ecx,%esi + xorl %eax,%edi + roll $5,%esi + addl %edi,%ebx + movl %ebp,%edi + addl %esi,%ebx + /* 00_15 4 */ + movl 16(%esp),%esi + xorl %edx,%edi + andl %ecx,%edi + rorl $2,%ecx + leal 1518500249(%eax,%esi,1),%eax + movl %ebx,%esi + xorl %ebp,%edi + roll $5,%esi + addl %edi,%eax + movl %edx,%edi + addl %esi,%eax + /* 00_15 5 */ + movl 20(%esp),%esi + xorl %ecx,%edi + andl %ebx,%edi + rorl $2,%ebx + leal 1518500249(%ebp,%esi,1),%ebp + movl %eax,%esi + xorl %edx,%edi + roll $5,%esi + addl %edi,%ebp + movl %ecx,%edi + addl %esi,%ebp + /* 00_15 6 */ + movl 24(%esp),%esi + xorl %ebx,%edi + andl %eax,%edi + rorl $2,%eax + leal 1518500249(%edx,%esi,1),%edx + movl %ebp,%esi + xorl %ecx,%edi + roll $5,%esi + addl %edi,%edx + movl %ebx,%edi + addl %esi,%edx + /* 00_15 7 */ + movl 28(%esp),%esi + 
xorl %eax,%edi + andl %ebp,%edi + rorl $2,%ebp + leal 1518500249(%ecx,%esi,1),%ecx + movl %edx,%esi + xorl %ebx,%edi + roll $5,%esi + addl %edi,%ecx + movl %eax,%edi + addl %esi,%ecx + /* 00_15 8 */ + movl 32(%esp),%esi + xorl %ebp,%edi + andl %edx,%edi + rorl $2,%edx + leal 1518500249(%ebx,%esi,1),%ebx + movl %ecx,%esi + xorl %eax,%edi + roll $5,%esi + addl %edi,%ebx + movl %ebp,%edi + addl %esi,%ebx + /* 00_15 9 */ + movl 36(%esp),%esi + xorl %edx,%edi + andl %ecx,%edi + rorl $2,%ecx + leal 1518500249(%eax,%esi,1),%eax + movl %ebx,%esi + xorl %ebp,%edi + roll $5,%esi + addl %edi,%eax + movl %edx,%edi + addl %esi,%eax + /* 00_15 10 */ + movl 40(%esp),%esi + xorl %ecx,%edi + andl %ebx,%edi + rorl $2,%ebx + leal 1518500249(%ebp,%esi,1),%ebp + movl %eax,%esi + xorl %edx,%edi + roll $5,%esi + addl %edi,%ebp + movl %ecx,%edi + addl %esi,%ebp + /* 00_15 11 */ + movl 44(%esp),%esi + xorl %ebx,%edi + andl %eax,%edi + rorl $2,%eax + leal 1518500249(%edx,%esi,1),%edx + movl %ebp,%esi + xorl %ecx,%edi + roll $5,%esi + addl %edi,%edx + movl %ebx,%edi + addl %esi,%edx + /* 00_15 12 */ + movl 48(%esp),%esi + xorl %eax,%edi + andl %ebp,%edi + rorl $2,%ebp + leal 1518500249(%ecx,%esi,1),%ecx + movl %edx,%esi + xorl %ebx,%edi + roll $5,%esi + addl %edi,%ecx + movl %eax,%edi + addl %esi,%ecx + /* 00_15 13 */ + movl 52(%esp),%esi + xorl %ebp,%edi + andl %edx,%edi + rorl $2,%edx + leal 1518500249(%ebx,%esi,1),%ebx + movl %ecx,%esi + xorl %eax,%edi + roll $5,%esi + addl %edi,%ebx + movl %ebp,%edi + addl %esi,%ebx + /* 00_15 14 */ + movl 56(%esp),%esi + xorl %edx,%edi + andl %ecx,%edi + rorl $2,%ecx + leal 1518500249(%eax,%esi,1),%eax + movl %ebx,%esi + xorl %ebp,%edi + roll $5,%esi + addl %edi,%eax + movl %edx,%edi + addl %esi,%eax + /* 00_15 15 */ + movl 60(%esp),%esi + xorl %ecx,%edi + andl %ebx,%edi + rorl $2,%ebx + leal 1518500249(%ebp,%esi,1),%ebp + xorl %edx,%edi + movl (%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 8(%esp),%esi + roll $5,%edi + xorl 32(%esp),%esi + /* 16_19 16 */ + xorl 52(%esp),%esi + addl %edi,%ebp + movl %ecx,%edi + roll $1,%esi + xorl %ebx,%edi + movl %esi,(%esp) + andl %eax,%edi + rorl $2,%eax + leal 1518500249(%edx,%esi,1),%edx + movl 4(%esp),%esi + xorl %ecx,%edi + xorl 12(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 36(%esp),%esi + roll $5,%edi + /* 16_19 17 */ + xorl 56(%esp),%esi + addl %edi,%edx + movl %ebx,%edi + roll $1,%esi + xorl %eax,%edi + movl %esi,4(%esp) + andl %ebp,%edi + rorl $2,%ebp + leal 1518500249(%ecx,%esi,1),%ecx + movl 8(%esp),%esi + xorl %ebx,%edi + xorl 16(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 40(%esp),%esi + roll $5,%edi + /* 16_19 18 */ + xorl 60(%esp),%esi + addl %edi,%ecx + movl %eax,%edi + roll $1,%esi + xorl %ebp,%edi + movl %esi,8(%esp) + andl %edx,%edi + rorl $2,%edx + leal 1518500249(%ebx,%esi,1),%ebx + movl 12(%esp),%esi + xorl %eax,%edi + xorl 20(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 44(%esp),%esi + roll $5,%edi + /* 16_19 19 */ + xorl (%esp),%esi + addl %edi,%ebx + movl %ebp,%edi + roll $1,%esi + xorl %edx,%edi + movl %esi,12(%esp) + andl %ecx,%edi + rorl $2,%ecx + leal 1518500249(%eax,%esi,1),%eax + movl 16(%esp),%esi + xorl %ebp,%edi + xorl 24(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 48(%esp),%esi + roll $5,%edi + /* 20_39 20 */ + xorl 4(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,16(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 1859775393(%ebp,%esi,1),%ebp + movl 20(%esp),%esi + xorl %ecx,%edi + xorl 28(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 
52(%esp),%esi + roll $5,%edi + /* 20_39 21 */ + xorl 8(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,20(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 1859775393(%edx,%esi,1),%edx + movl 24(%esp),%esi + xorl %ebx,%edi + xorl 32(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 56(%esp),%esi + roll $5,%edi + /* 20_39 22 */ + xorl 12(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,24(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 1859775393(%ecx,%esi,1),%ecx + movl 28(%esp),%esi + xorl %eax,%edi + xorl 36(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 60(%esp),%esi + roll $5,%edi + /* 20_39 23 */ + xorl 16(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,28(%esp) + xorl %edx,%edi + rorl $2,%edx + leal 1859775393(%ebx,%esi,1),%ebx + movl 32(%esp),%esi + xorl %ebp,%edi + xorl 40(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl (%esp),%esi + roll $5,%edi + /* 20_39 24 */ + xorl 20(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl %esi,32(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 1859775393(%eax,%esi,1),%eax + movl 36(%esp),%esi + xorl %edx,%edi + xorl 44(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 4(%esp),%esi + roll $5,%edi + /* 20_39 25 */ + xorl 24(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,36(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 1859775393(%ebp,%esi,1),%ebp + movl 40(%esp),%esi + xorl %ecx,%edi + xorl 48(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 8(%esp),%esi + roll $5,%edi + /* 20_39 26 */ + xorl 28(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,40(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 1859775393(%edx,%esi,1),%edx + movl 44(%esp),%esi + xorl %ebx,%edi + xorl 52(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 12(%esp),%esi + roll $5,%edi + /* 20_39 27 */ + xorl 32(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,44(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 1859775393(%ecx,%esi,1),%ecx + movl 48(%esp),%esi + xorl %eax,%edi + xorl 56(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 16(%esp),%esi + roll $5,%edi + /* 20_39 28 */ + xorl 36(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,48(%esp) + xorl %edx,%edi + rorl $2,%edx + leal 1859775393(%ebx,%esi,1),%ebx + movl 52(%esp),%esi + xorl %ebp,%edi + xorl 60(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 20(%esp),%esi + roll $5,%edi + /* 20_39 29 */ + xorl 40(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl %esi,52(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 1859775393(%eax,%esi,1),%eax + movl 56(%esp),%esi + xorl %edx,%edi + xorl (%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 24(%esp),%esi + roll $5,%edi + /* 20_39 30 */ + xorl 44(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,56(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 1859775393(%ebp,%esi,1),%ebp + movl 60(%esp),%esi + xorl %ecx,%edi + xorl 4(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 28(%esp),%esi + roll $5,%edi + /* 20_39 31 */ + xorl 48(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,60(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 1859775393(%edx,%esi,1),%edx + movl (%esp),%esi + xorl %ebx,%edi + xorl 8(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 32(%esp),%esi + roll $5,%edi + /* 20_39 32 */ + xorl 52(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 1859775393(%ecx,%esi,1),%ecx + movl 
4(%esp),%esi + xorl %eax,%edi + xorl 12(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 36(%esp),%esi + roll $5,%edi + /* 20_39 33 */ + xorl 56(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,4(%esp) + xorl %edx,%edi + rorl $2,%edx + leal 1859775393(%ebx,%esi,1),%ebx + movl 8(%esp),%esi + xorl %ebp,%edi + xorl 16(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 40(%esp),%esi + roll $5,%edi + /* 20_39 34 */ + xorl 60(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl %esi,8(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 1859775393(%eax,%esi,1),%eax + movl 12(%esp),%esi + xorl %edx,%edi + xorl 20(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 44(%esp),%esi + roll $5,%edi + /* 20_39 35 */ + xorl (%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,12(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 1859775393(%ebp,%esi,1),%ebp + movl 16(%esp),%esi + xorl %ecx,%edi + xorl 24(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 48(%esp),%esi + roll $5,%edi + /* 20_39 36 */ + xorl 4(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,16(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 1859775393(%edx,%esi,1),%edx + movl 20(%esp),%esi + xorl %ebx,%edi + xorl 28(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 52(%esp),%esi + roll $5,%edi + /* 20_39 37 */ + xorl 8(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,20(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 1859775393(%ecx,%esi,1),%ecx + movl 24(%esp),%esi + xorl %eax,%edi + xorl 32(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 56(%esp),%esi + roll $5,%edi + /* 20_39 38 */ + xorl 12(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,24(%esp) + xorl %edx,%edi + rorl $2,%edx + leal 1859775393(%ebx,%esi,1),%ebx + movl 28(%esp),%esi + xorl %ebp,%edi + xorl 36(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 60(%esp),%esi + roll $5,%edi + /* 20_39 39 */ + xorl 16(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl %esi,28(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 1859775393(%eax,%esi,1),%eax + movl 32(%esp),%esi + xorl %edx,%edi + xorl 40(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl (%esp),%esi + roll $5,%edi + /* 40_59 40 */ + addl %edi,%eax + movl %edx,%edi + xorl 20(%esp),%esi + andl %ecx,%edi + roll $1,%esi + addl %edi,%ebp + movl %edx,%edi + movl %esi,32(%esp) + xorl %ecx,%edi + leal 2400959708(%ebp,%esi,1),%ebp + andl %ebx,%edi + rorl $2,%ebx + movl 36(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 44(%esp),%esi + roll $5,%edi + xorl 4(%esp),%esi + /* 40_59 41 */ + addl %edi,%ebp + movl %ecx,%edi + xorl 24(%esp),%esi + andl %ebx,%edi + roll $1,%esi + addl %edi,%edx + movl %ecx,%edi + movl %esi,36(%esp) + xorl %ebx,%edi + leal 2400959708(%edx,%esi,1),%edx + andl %eax,%edi + rorl $2,%eax + movl 40(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 48(%esp),%esi + roll $5,%edi + xorl 8(%esp),%esi + /* 40_59 42 */ + addl %edi,%edx + movl %ebx,%edi + xorl 28(%esp),%esi + andl %eax,%edi + roll $1,%esi + addl %edi,%ecx + movl %ebx,%edi + movl %esi,40(%esp) + xorl %eax,%edi + leal 2400959708(%ecx,%esi,1),%ecx + andl %ebp,%edi + rorl $2,%ebp + movl 44(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 52(%esp),%esi + roll $5,%edi + xorl 12(%esp),%esi + /* 40_59 43 */ + addl %edi,%ecx + movl %eax,%edi + xorl 32(%esp),%esi + andl %ebp,%edi + roll $1,%esi + addl %edi,%ebx + movl %eax,%edi + movl %esi,44(%esp) + xorl %ebp,%edi + leal 2400959708(%ebx,%esi,1),%ebx + andl %edx,%edi + rorl 
$2,%edx + movl 48(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 56(%esp),%esi + roll $5,%edi + xorl 16(%esp),%esi + /* 40_59 44 */ + addl %edi,%ebx + movl %ebp,%edi + xorl 36(%esp),%esi + andl %edx,%edi + roll $1,%esi + addl %edi,%eax + movl %ebp,%edi + movl %esi,48(%esp) + xorl %edx,%edi + leal 2400959708(%eax,%esi,1),%eax + andl %ecx,%edi + rorl $2,%ecx + movl 52(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 60(%esp),%esi + roll $5,%edi + xorl 20(%esp),%esi + /* 40_59 45 */ + addl %edi,%eax + movl %edx,%edi + xorl 40(%esp),%esi + andl %ecx,%edi + roll $1,%esi + addl %edi,%ebp + movl %edx,%edi + movl %esi,52(%esp) + xorl %ecx,%edi + leal 2400959708(%ebp,%esi,1),%ebp + andl %ebx,%edi + rorl $2,%ebx + movl 56(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl (%esp),%esi + roll $5,%edi + xorl 24(%esp),%esi + /* 40_59 46 */ + addl %edi,%ebp + movl %ecx,%edi + xorl 44(%esp),%esi + andl %ebx,%edi + roll $1,%esi + addl %edi,%edx + movl %ecx,%edi + movl %esi,56(%esp) + xorl %ebx,%edi + leal 2400959708(%edx,%esi,1),%edx + andl %eax,%edi + rorl $2,%eax + movl 60(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 4(%esp),%esi + roll $5,%edi + xorl 28(%esp),%esi + /* 40_59 47 */ + addl %edi,%edx + movl %ebx,%edi + xorl 48(%esp),%esi + andl %eax,%edi + roll $1,%esi + addl %edi,%ecx + movl %ebx,%edi + movl %esi,60(%esp) + xorl %eax,%edi + leal 2400959708(%ecx,%esi,1),%ecx + andl %ebp,%edi + rorl $2,%ebp + movl (%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 8(%esp),%esi + roll $5,%edi + xorl 32(%esp),%esi + /* 40_59 48 */ + addl %edi,%ecx + movl %eax,%edi + xorl 52(%esp),%esi + andl %ebp,%edi + roll $1,%esi + addl %edi,%ebx + movl %eax,%edi + movl %esi,(%esp) + xorl %ebp,%edi + leal 2400959708(%ebx,%esi,1),%ebx + andl %edx,%edi + rorl $2,%edx + movl 4(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 12(%esp),%esi + roll $5,%edi + xorl 36(%esp),%esi + /* 40_59 49 */ + addl %edi,%ebx + movl %ebp,%edi + xorl 56(%esp),%esi + andl %edx,%edi + roll $1,%esi + addl %edi,%eax + movl %ebp,%edi + movl %esi,4(%esp) + xorl %edx,%edi + leal 2400959708(%eax,%esi,1),%eax + andl %ecx,%edi + rorl $2,%ecx + movl 8(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 16(%esp),%esi + roll $5,%edi + xorl 40(%esp),%esi + /* 40_59 50 */ + addl %edi,%eax + movl %edx,%edi + xorl 60(%esp),%esi + andl %ecx,%edi + roll $1,%esi + addl %edi,%ebp + movl %edx,%edi + movl %esi,8(%esp) + xorl %ecx,%edi + leal 2400959708(%ebp,%esi,1),%ebp + andl %ebx,%edi + rorl $2,%ebx + movl 12(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 20(%esp),%esi + roll $5,%edi + xorl 44(%esp),%esi + /* 40_59 51 */ + addl %edi,%ebp + movl %ecx,%edi + xorl (%esp),%esi + andl %ebx,%edi + roll $1,%esi + addl %edi,%edx + movl %ecx,%edi + movl %esi,12(%esp) + xorl %ebx,%edi + leal 2400959708(%edx,%esi,1),%edx + andl %eax,%edi + rorl $2,%eax + movl 16(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 24(%esp),%esi + roll $5,%edi + xorl 48(%esp),%esi + /* 40_59 52 */ + addl %edi,%edx + movl %ebx,%edi + xorl 4(%esp),%esi + andl %eax,%edi + roll $1,%esi + addl %edi,%ecx + movl %ebx,%edi + movl %esi,16(%esp) + xorl %eax,%edi + leal 2400959708(%ecx,%esi,1),%ecx + andl %ebp,%edi + rorl $2,%ebp + movl 20(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 28(%esp),%esi + roll $5,%edi + xorl 52(%esp),%esi + /* 40_59 53 */ + addl %edi,%ecx + movl %eax,%edi + xorl 8(%esp),%esi + andl %ebp,%edi + roll $1,%esi + addl %edi,%ebx + movl %eax,%edi + movl %esi,20(%esp) + xorl %ebp,%edi + leal 2400959708(%ebx,%esi,1),%ebx + andl %edx,%edi + rorl $2,%edx + 
movl 24(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 32(%esp),%esi + roll $5,%edi + xorl 56(%esp),%esi + /* 40_59 54 */ + addl %edi,%ebx + movl %ebp,%edi + xorl 12(%esp),%esi + andl %edx,%edi + roll $1,%esi + addl %edi,%eax + movl %ebp,%edi + movl %esi,24(%esp) + xorl %edx,%edi + leal 2400959708(%eax,%esi,1),%eax + andl %ecx,%edi + rorl $2,%ecx + movl 28(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 36(%esp),%esi + roll $5,%edi + xorl 60(%esp),%esi + /* 40_59 55 */ + addl %edi,%eax + movl %edx,%edi + xorl 16(%esp),%esi + andl %ecx,%edi + roll $1,%esi + addl %edi,%ebp + movl %edx,%edi + movl %esi,28(%esp) + xorl %ecx,%edi + leal 2400959708(%ebp,%esi,1),%ebp + andl %ebx,%edi + rorl $2,%ebx + movl 32(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 40(%esp),%esi + roll $5,%edi + xorl (%esp),%esi + /* 40_59 56 */ + addl %edi,%ebp + movl %ecx,%edi + xorl 20(%esp),%esi + andl %ebx,%edi + roll $1,%esi + addl %edi,%edx + movl %ecx,%edi + movl %esi,32(%esp) + xorl %ebx,%edi + leal 2400959708(%edx,%esi,1),%edx + andl %eax,%edi + rorl $2,%eax + movl 36(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 44(%esp),%esi + roll $5,%edi + xorl 4(%esp),%esi + /* 40_59 57 */ + addl %edi,%edx + movl %ebx,%edi + xorl 24(%esp),%esi + andl %eax,%edi + roll $1,%esi + addl %edi,%ecx + movl %ebx,%edi + movl %esi,36(%esp) + xorl %eax,%edi + leal 2400959708(%ecx,%esi,1),%ecx + andl %ebp,%edi + rorl $2,%ebp + movl 40(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 48(%esp),%esi + roll $5,%edi + xorl 8(%esp),%esi + /* 40_59 58 */ + addl %edi,%ecx + movl %eax,%edi + xorl 28(%esp),%esi + andl %ebp,%edi + roll $1,%esi + addl %edi,%ebx + movl %eax,%edi + movl %esi,40(%esp) + xorl %ebp,%edi + leal 2400959708(%ebx,%esi,1),%ebx + andl %edx,%edi + rorl $2,%edx + movl 44(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 52(%esp),%esi + roll $5,%edi + xorl 12(%esp),%esi + /* 40_59 59 */ + addl %edi,%ebx + movl %ebp,%edi + xorl 32(%esp),%esi + andl %edx,%edi + roll $1,%esi + addl %edi,%eax + movl %ebp,%edi + movl %esi,44(%esp) + xorl %edx,%edi + leal 2400959708(%eax,%esi,1),%eax + andl %ecx,%edi + rorl $2,%ecx + movl 48(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 56(%esp),%esi + roll $5,%edi + xorl 16(%esp),%esi + /* 20_39 60 */ + xorl 36(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,48(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 3395469782(%ebp,%esi,1),%ebp + movl 52(%esp),%esi + xorl %ecx,%edi + xorl 60(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 20(%esp),%esi + roll $5,%edi + /* 20_39 61 */ + xorl 40(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,52(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 3395469782(%edx,%esi,1),%edx + movl 56(%esp),%esi + xorl %ebx,%edi + xorl (%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 24(%esp),%esi + roll $5,%edi + /* 20_39 62 */ + xorl 44(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,56(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 3395469782(%ecx,%esi,1),%ecx + movl 60(%esp),%esi + xorl %eax,%edi + xorl 4(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 28(%esp),%esi + roll $5,%edi + /* 20_39 63 */ + xorl 48(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,60(%esp) + xorl %edx,%edi + rorl $2,%edx + leal 3395469782(%ebx,%esi,1),%ebx + movl (%esp),%esi + xorl %ebp,%edi + xorl 8(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 32(%esp),%esi + roll $5,%edi + /* 20_39 64 */ + xorl 52(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl 
%esi,(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 3395469782(%eax,%esi,1),%eax + movl 4(%esp),%esi + xorl %edx,%edi + xorl 12(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 36(%esp),%esi + roll $5,%edi + /* 20_39 65 */ + xorl 56(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,4(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 3395469782(%ebp,%esi,1),%ebp + movl 8(%esp),%esi + xorl %ecx,%edi + xorl 16(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 40(%esp),%esi + roll $5,%edi + /* 20_39 66 */ + xorl 60(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,8(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 3395469782(%edx,%esi,1),%edx + movl 12(%esp),%esi + xorl %ebx,%edi + xorl 20(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 44(%esp),%esi + roll $5,%edi + /* 20_39 67 */ + xorl (%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,12(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 3395469782(%ecx,%esi,1),%ecx + movl 16(%esp),%esi + xorl %eax,%edi + xorl 24(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 48(%esp),%esi + roll $5,%edi + /* 20_39 68 */ + xorl 4(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,16(%esp) + xorl %edx,%edi + rorl $2,%edx + leal 3395469782(%ebx,%esi,1),%ebx + movl 20(%esp),%esi + xorl %ebp,%edi + xorl 28(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 52(%esp),%esi + roll $5,%edi + /* 20_39 69 */ + xorl 8(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl %esi,20(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 3395469782(%eax,%esi,1),%eax + movl 24(%esp),%esi + xorl %edx,%edi + xorl 32(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 56(%esp),%esi + roll $5,%edi + /* 20_39 70 */ + xorl 12(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,24(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 3395469782(%ebp,%esi,1),%ebp + movl 28(%esp),%esi + xorl %ecx,%edi + xorl 36(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 60(%esp),%esi + roll $5,%edi + /* 20_39 71 */ + xorl 16(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,28(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 3395469782(%edx,%esi,1),%edx + movl 32(%esp),%esi + xorl %ebx,%edi + xorl 40(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl (%esp),%esi + roll $5,%edi + /* 20_39 72 */ + xorl 20(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,32(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 3395469782(%ecx,%esi,1),%ecx + movl 36(%esp),%esi + xorl %eax,%edi + xorl 44(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 4(%esp),%esi + roll $5,%edi + /* 20_39 73 */ + xorl 24(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,36(%esp) + xorl %edx,%edi + rorl $2,%edx + leal 3395469782(%ebx,%esi,1),%ebx + movl 40(%esp),%esi + xorl %ebp,%edi + xorl 48(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 8(%esp),%esi + roll $5,%edi + /* 20_39 74 */ + xorl 28(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl %esi,40(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 3395469782(%eax,%esi,1),%eax + movl 44(%esp),%esi + xorl %edx,%edi + xorl 52(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 12(%esp),%esi + roll $5,%edi + /* 20_39 75 */ + xorl 32(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,44(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 3395469782(%ebp,%esi,1),%ebp + movl 48(%esp),%esi + xorl %ecx,%edi + xorl 56(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 16(%esp),%esi + roll $5,%edi + 
/* 20_39 76 */ + xorl 36(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,48(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 3395469782(%edx,%esi,1),%edx + movl 52(%esp),%esi + xorl %ebx,%edi + xorl 60(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 20(%esp),%esi + roll $5,%edi + /* 20_39 77 */ + xorl 40(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + xorl %ebp,%edi + rorl $2,%ebp + leal 3395469782(%ecx,%esi,1),%ecx + movl 56(%esp),%esi + xorl %eax,%edi + xorl (%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 24(%esp),%esi + roll $5,%edi + /* 20_39 78 */ + xorl 44(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + xorl %edx,%edi + rorl $2,%edx + leal 3395469782(%ebx,%esi,1),%ebx + movl 60(%esp),%esi + xorl %ebp,%edi + xorl 4(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 28(%esp),%esi + roll $5,%edi + /* 20_39 79 */ + xorl 48(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + xorl %ecx,%edi + rorl $2,%ecx + leal 3395469782(%eax,%esi,1),%eax + xorl %edx,%edi + addl %edi,%eax + movl %ebx,%edi + roll $5,%edi + addl %edi,%eax + /* Loop trailer */ + movl 84(%esp),%edi + movl 88(%esp),%esi + addl 16(%edi),%ebp + addl 12(%edi),%edx + addl %ecx,8(%edi) + addl %ebx,4(%edi) + addl $64,%esi + addl %eax,(%edi) + movl %edx,12(%edi) + movl %ebp,16(%edi) + cmpl 92(%esp),%esi + jb .L000loop + addl $64,%esp + popl %edi + popl %esi + popl %ebx + popl %ebp + ret +.L_sha1_block_data_order_end: +.size sha1_block_data_order,.L_sha1_block_data_order_end-sha1_block_data_order +.byte 83,72,65,49,32,98,108,111,99,107,32,116,114,97,110,115,102,111,114,109,32,102,111,114,32,120,56,54,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0 diff --git a/x86/sha1.c b/x86/sha1.c new file mode 100644 index 0000000..4c1a569 --- /dev/null +++ b/x86/sha1.c @@ -0,0 +1,81 @@ +/* + * SHA-1 implementation. + * + * Copyright (C) 2005 Paul Mackerras <paulus@samba.org> + * + * This version assumes we are running on a big-endian machine. + * It calls an external sha1_core() to process blocks of 64 bytes. 
+ */ +#include <stdio.h> +#include <string.h> +#include <arpa/inet.h> /* For htonl */ +#include "sha1.h" + +#define x86_sha1_core sha1_block_data_order +extern void x86_sha1_core(uint32_t hash[5], const unsigned char *p, + unsigned int nblocks); + +void x86_SHA1_Init(x86_SHA_CTX *c) +{ + /* Matches prefix of scontext structure */ + static struct { + uint32_t hash[5]; + uint64_t len; + } const iv = { + { 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0 }, + 0 + }; + + memcpy(c, &iv, sizeof iv); +} + +void x86_SHA1_Update(x86_SHA_CTX *c, const void *p, unsigned long n) +{ + unsigned pos = (unsigned)c->len & 63; + unsigned long nb; + + c->len += n; + + /* Initial partial block */ + if (pos) { + unsigned space = 64 - pos; + if (space > n) + goto end; + memcpy(c->buf + pos, p, space); + p += space; + n -= space; + x86_sha1_core(c->hash, c->buf, 1); + } + + /* The big impressive middle */ + nb = n >> 6; + if (nb) { + x86_sha1_core(c->hash, p, nb); + p += nb << 6; + n &= 63; + } + pos = 0; +end: + /* Final partial block */ + memcpy(c->buf + pos, p, n); +} + +void x86_SHA1_Final(unsigned char *hash, x86_SHA_CTX *c) +{ + unsigned pos = (unsigned)c->len & 63; + + c->buf[pos++] = 0x80; + if (pos > 56) { + memset(c->buf + pos, 0, 64 - pos); + x86_sha1_core(c->hash, c->buf, 1); + pos = 0; + } + memset(c->buf + pos, 0, 56 - pos); + /* Last two words are 64-bit *bit* count */ + *(uint32_t *)(c->buf + 56) = htonl((uint32_t)(c->len >> 29)); + *(uint32_t *)(c->buf + 60) = htonl((uint32_t)c->len << 3); + x86_sha1_core(c->hash, c->buf, 1); + + for (pos = 0; pos < 5; pos++) + ((uint32_t *)hash)[pos] = htonl(c->hash[pos]); +} diff --git a/x86/sha1.h b/x86/sha1.h new file mode 100644 index 0000000..8988da9 --- /dev/null +++ b/x86/sha1.h @@ -0,0 +1,21 @@ +/* + * SHA-1 implementation. + * + * Copyright (C) 2005 Paul Mackerras <paulus@samba.org> + */ +#include <stdint.h> + +typedef struct { + uint32_t hash[5]; + uint64_t len; + unsigned char buf[64]; /* Keep this aligned */ +} x86_SHA_CTX; + +void x86_SHA1_Init(x86_SHA_CTX *c); +void x86_SHA1_Update(x86_SHA_CTX *c, const void *p, unsigned long n); +void x86_SHA1_Final(unsigned char *hash, x86_SHA_CTX *c); + +#define git_SHA_CTX x86_SHA_CTX +#define git_SHA1_Init x86_SHA1_Init +#define git_SHA1_Update x86_SHA1_Update +#define git_SHA1_Final x86_SHA1_Final ^ permalink raw reply related [flat|nested] 129+ messages in thread
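Not part of the patch, but a quick sanity check worth running next to it: hash the standard FIPS "abc" vector once in a single call and once split across Update calls, so both the bulk path and the partial-block path in x86_SHA1_Update above get exercised. A minimal sketch, assuming it is linked against x86/sha1.o and the assembly, and that the include path below is right:

#include <stdio.h>
#include <string.h>
#include "x86/sha1.h"

/* SHA1("abc") = a9993e364706816aba3e25717850c26c9cd0d89d */
static const unsigned char abc_digest[20] = {
	0xa9,0x99,0x3e,0x36,0x47,0x06,0x81,0x6a,0xba,0x3e,
	0x25,0x71,0x78,0x50,0xc2,0x6c,0x9c,0xd0,0xd8,0x9d
};

int main(void)
{
	x86_SHA_CTX c;
	unsigned char out[20];

	x86_SHA1_Init(&c);
	x86_SHA1_Update(&c, "abc", 3);
	x86_SHA1_Final(out, &c);
	if (memcmp(out, abc_digest, 20)) {
		puts("one-shot: FAIL");
		return 1;
	}

	x86_SHA1_Init(&c);
	x86_SHA1_Update(&c, "a", 1);	/* leaves a partial block buffered */
	x86_SHA1_Update(&c, "bc", 2);
	x86_SHA1_Final(out, &c);
	if (memcmp(out, abc_digest, 20)) {
		puts("split: FAIL");
		return 1;
	}

	puts("ok");
	return 0;
}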
* Re: x86 SHA1: Faster than OpenSSL 2009-08-04 4:48 ` George Spelvin @ 2009-08-04 6:30 ` Linus Torvalds 2009-08-04 8:01 ` George Spelvin 2009-08-04 6:40 ` Linus Torvalds 1 sibling, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2009-08-04 6:30 UTC (permalink / raw)
To: George Spelvin; +Cc: Git Mailing List

On Mon, 4 Aug 2009, George Spelvin wrote:
>
> The actual goal of this effort is to address the dynamic linker startup
> time issues by removing the second-largest contributor after libcurl,
> namely openssl. Optimizing the assembly code is just the fun part. ;-)

Now, I agree that it would be wonderful to get rid of the linker startup,
but the startup costs of openssl are very low compared to the equivalent
curl ones. So we can't lose _too_ much performance - especially for
long-running jobs where startup costs really don't even matter - in the
quest to get rid of those.

That said, your numbers are impressive. Improving fsck by 1.1-2.2% is very
good. That means that you not only avoided the startup costs, you actually
improved on the openssl code. So it's a win-win situation.

That said, it would be even better if the SHA1 code was also somewhat
portable to other environments (it looks like your current patch is very
GNU-as specific), and if you had a solution for x86-64 too ;)

Yeah, I'm a whiny little b*tch, aren't I?

Linus
^ permalink raw reply	[flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-04 6:30 ` Linus Torvalds @ 2009-08-04 8:01 ` George Spelvin 2009-08-04 20:41 ` Junio C Hamano 0 siblings, 1 reply; 129+ messages in thread From: George Spelvin @ 2009-08-04 8:01 UTC (permalink / raw) To: linux, torvalds; +Cc: git > Now, I agree that it would be wonderful to get rid of the linker startup, > but the startup costs of openssl are very low compared to the equivalent > curl ones. So we can't lose _too_ much performance - especially for > long-running jobs where startup costs really don't even matter - in the > quest to get rid of those. > > That said, your numbers are impressive. Improving fsck by 1.1-2.2% is very > good. That means that you not only avodied the startup costs, you actually > improved on the openssl code. So it's a win-win situation. Er, yes, that *is* what the subject line is advertising. I started with the OpenSSL core SHA1 code (which is BSD/GPL dual-licensed by its author) and tweaked it some more for more recent processors. > That said, it would be even better if the SHA1 code was also somewhat > portable to other environments (it looks like your current patch is very > GNU as specific), and if you had a solution for x86-64 too ;) Done and will be done. The code is *actually* written (see the first e-mail in this thread) in the perl-preprocessor that OpenSSL uses, which can generate quite a few output syntaxes (including Intel). I just included the preprocessed version to reduce the complexity of the rough-draft patch. The one question I have is that currently perl is not a critical compile-time dependency; it's needed for some extra stuff, but AFAIK you can get most of git working without it. Whether to add that dependency or what is a Junio question. As for x86-64, I haven't actually *written* it yet, but it'll be a very simple adaptation. Mostly it's just a matter of using the additional registers effectively. > Yeah, I'm a whiny little b*tch, aren't I? Not at all; I expected all of that. Getting rid of OpenSSL kind of requires those things. > Hmm. Does it really help to do the bswap as a separate initial phase? > > As far as I can tell, you load the result of the bswap just a single time > for each value. So the initial "bswap all 64 bytes" seems pointless. >> + /* 00_15 0 */ >> + movl %edx,%edi >> + movl (%esp),%esi > Why not do the bswap here instead? > > Is it because you're running out of registers for scheduling, and want to > use the stack pointer rather than the original source? Exactly. I looked hard at it, but that means that I'd have to write the first 16 rounds with only one temp register, because the other is being used as an input pointer. Here's the pipelined loop for the first 16 rounds (when in[i] is the stack buffer), showing parallel operations on the same line. (Operations in parens belong to adjacent rounds.) # movl D,S (roll 5,T) (addl S,A) // # mov in[i],T xorl C,S (addl T,A) # andl B,S rorl 2,B # addl T+K,E xorl D,S movl A,T # addl S,E roll 5,T (movl C,S) // # (mov in[i],T) (xorl B,S) addl T,E which translates in perl code to: sub BODY_00_15 { local($n,$a,$b,$c,$d,$e)=@_; &comment("00_15 $n"); &mov($S,$d) if ($n == 0); &mov($T,&swtmp($n%16)); # V Load Xi. 
&xor($S,$c); # U Continue F() = d^(b&(c^d)) &and($S,$b); # V &rotr($b,2); # NP &lea($e,&DWP(K1,$e,$T)); # U Add Xi and K if ($n < 15) { &mov($T,$a); # V &xor($S,$d); # U &rotl($T,5); # NP &add($e,$S); # U &mov($S,$c); # V Start of NEXT round's F() &add($e,$T); # U } else { # This version provides the correct start for BODY_20_39 &xor($S,$d); # V &mov($T,&swtmp(($n+1)%16)); # U Start computing mext Xi. &add($e,$S); # V Add F() &mov($S,$a); # U Start computing a<<<5 &xor($T,&swtmp(($n+3)%16)); # V &rotl($S,5); # U &xor($T,&swtmp(($n+9)%16)); # V } } Anyway, the round is: #define K1 0x5a827999 e += bswap(in[i]) + K1 + (d^(b&(c^d))) + ROTL(a,5). b = ROTR(b,2); Notice how I use one temp (T) for in[i] and ROTL(a,5), and the other (S) for F1(b,c,d) = d^(b&(c^d)). If I only had one temporary, I'd have to seriously un-overlap it: mov S[i],T bswap T mov T,in[i] lea K1(T,e),e mov d,T xor c,T and b,T xor d,T add T,e mov a,T roll 5,T add T,e Current processors probably have enough out-of-order scheduling resources to find the parallelism there, but something like an Atom would be doomed. I just cobbled together a test implementation, and it looks pretty similar on my Phenom here (minimum of 30 runs): Separate copy loop: 1.355603 In-line: 1.350444 (+0.4% faster) A hint of being faster, but not much. It is a couple of percent faster on a P4: Separate copy loop: 3.297174 In-line: 3.237354 (+1.8% faster) And on an i7: Separate copy loop: 1.353641 In-line: 1.336766 (+1.2% faster) but I worry about in-order machines. An Athlon XP: Separate copy loop: 3.252682 In-line: 3.313870 (-1.8% slower) H'm... it's not bad. And the code is smaller. Maybe I'll work on it a bit. If you want to try it, the modified sha1-x86.s file is appended. --- /dev/null 2009-05-12 02:55:38.579106460 -0400 +++ sha1-x86.s 2009-08-04 03:42:31.073284734 -0400 @@ -0,0 +1,1359 @@ +.file "sha1-586.s" +.text +.globl sha1_block_data_order +.type sha1_block_data_order,@function +.align 16 +sha1_block_data_order: + pushl %ebp + pushl %ebx + pushl %esi + pushl %edi + movl 20(%esp),%edi + movl 24(%esp),%esi + movl 28(%esp),%eax + subl $64,%esp + shll $6,%eax + addl %esi,%eax + movl %eax,92(%esp) + movl 16(%edi),%ebp + movl 12(%edi),%edx + movl 8(%edi),%ecx + movl 4(%edi),%ebx + movl (%edi),%eax +.align 16 +.L000loop: + movl %esi,88(%esp) + /* 00_15 0 */ + movl (%esi),%edi + bswap %edi + movl %edi,(%esp) + leal 1518500249(%ebp,%edi,1),%ebp + movl %edx,%edi + xorl %ecx,%edi + andl %ebx,%edi + rorl $2,%ebx + xorl %edx,%edi + addl %edi,%ebp + movl %eax,%edi + roll $5,%edi + addl %edi,%ebp + /* 00_15 1 */ + movl 4(%esi),%edi + bswap %edi + movl %edi,4(%esp) + leal 1518500249(%edx,%edi,1),%edx + movl %ecx,%edi + xorl %ebx,%edi + andl %eax,%edi + rorl $2,%eax + xorl %ecx,%edi + addl %edi,%edx + movl %ebp,%edi + roll $5,%edi + addl %edi,%edx + /* 00_15 2 */ + movl 8(%esi),%edi + bswap %edi + movl %edi,8(%esp) + leal 1518500249(%ecx,%edi,1),%ecx + movl %ebx,%edi + xorl %eax,%edi + andl %ebp,%edi + rorl $2,%ebp + xorl %ebx,%edi + addl %edi,%ecx + movl %edx,%edi + roll $5,%edi + addl %edi,%ecx + /* 00_15 3 */ + movl 12(%esi),%edi + bswap %edi + movl %edi,12(%esp) + leal 1518500249(%ebx,%edi,1),%ebx + movl %eax,%edi + xorl %ebp,%edi + andl %edx,%edi + rorl $2,%edx + xorl %eax,%edi + addl %edi,%ebx + movl %ecx,%edi + roll $5,%edi + addl %edi,%ebx + /* 00_15 4 */ + movl 16(%esi),%edi + bswap %edi + movl %edi,16(%esp) + leal 1518500249(%eax,%edi,1),%eax + movl %ebp,%edi + xorl %edx,%edi + andl %ecx,%edi + rorl $2,%ecx + xorl %ebp,%edi + addl %edi,%eax + movl 
%ebx,%edi + roll $5,%edi + addl %edi,%eax + /* 00_15 5 */ + movl 20(%esi),%edi + bswap %edi + movl %edi,20(%esp) + leal 1518500249(%ebp,%edi,1),%ebp + movl %edx,%edi + xorl %ecx,%edi + andl %ebx,%edi + rorl $2,%ebx + xorl %edx,%edi + addl %edi,%ebp + movl %eax,%edi + roll $5,%edi + addl %edi,%ebp + /* 00_15 6 */ + movl 24(%esi),%edi + bswap %edi + movl %edi,24(%esp) + leal 1518500249(%edx,%edi,1),%edx + movl %ecx,%edi + xorl %ebx,%edi + andl %eax,%edi + rorl $2,%eax + xorl %ecx,%edi + addl %edi,%edx + movl %ebp,%edi + roll $5,%edi + addl %edi,%edx + /* 00_15 7 */ + movl 28(%esi),%edi + bswap %edi + movl %edi,28(%esp) + leal 1518500249(%ecx,%edi,1),%ecx + movl %ebx,%edi + xorl %eax,%edi + andl %ebp,%edi + rorl $2,%ebp + xorl %ebx,%edi + addl %edi,%ecx + movl %edx,%edi + roll $5,%edi + addl %edi,%ecx + /* 00_15 8 */ + movl 32(%esi),%edi + bswap %edi + movl %edi,32(%esp) + leal 1518500249(%ebx,%edi,1),%ebx + movl %eax,%edi + xorl %ebp,%edi + andl %edx,%edi + rorl $2,%edx + xorl %eax,%edi + addl %edi,%ebx + movl %ecx,%edi + roll $5,%edi + addl %edi,%ebx + /* 00_15 9 */ + movl 36(%esi),%edi + bswap %edi + movl %edi,36(%esp) + leal 1518500249(%eax,%edi,1),%eax + movl %ebp,%edi + xorl %edx,%edi + andl %ecx,%edi + rorl $2,%ecx + xorl %ebp,%edi + addl %edi,%eax + movl %ebx,%edi + roll $5,%edi + addl %edi,%eax + /* 00_15 10 */ + movl 40(%esi),%edi + bswap %edi + movl %edi,40(%esp) + leal 1518500249(%ebp,%edi,1),%ebp + movl %edx,%edi + xorl %ecx,%edi + andl %ebx,%edi + rorl $2,%ebx + xorl %edx,%edi + addl %edi,%ebp + movl %eax,%edi + roll $5,%edi + addl %edi,%ebp + /* 00_15 11 */ + movl 44(%esi),%edi + bswap %edi + movl %edi,44(%esp) + leal 1518500249(%edx,%edi,1),%edx + movl %ecx,%edi + xorl %ebx,%edi + andl %eax,%edi + rorl $2,%eax + xorl %ecx,%edi + addl %edi,%edx + movl %ebp,%edi + roll $5,%edi + addl %edi,%edx + /* 00_15 12 */ + movl 48(%esi),%edi + bswap %edi + movl %edi,48(%esp) + leal 1518500249(%ecx,%edi,1),%ecx + movl %ebx,%edi + xorl %eax,%edi + andl %ebp,%edi + rorl $2,%ebp + xorl %ebx,%edi + addl %edi,%ecx + movl %edx,%edi + roll $5,%edi + addl %edi,%ecx + /* 00_15 13 */ + movl 52(%esi),%edi + bswap %edi + movl %edi,52(%esp) + leal 1518500249(%ebx,%edi,1),%ebx + movl %eax,%edi + xorl %ebp,%edi + andl %edx,%edi + rorl $2,%edx + xorl %eax,%edi + addl %edi,%ebx + movl %ecx,%edi + roll $5,%edi + addl %edi,%ebx + /* 00_15 14 */ + movl 56(%esi),%edi + movl 60(%esi),%esi + bswap %edi + movl %edi,56(%esp) + leal 1518500249(%eax,%edi,1),%eax + movl %ebp,%edi + xorl %edx,%edi + andl %ecx,%edi + rorl $2,%ecx + xorl %ebp,%edi + addl %edi,%eax + movl %ebx,%edi + roll $5,%edi + addl %edi,%eax + /* 00_15 15 */ + movl %edx,%edi + bswap %esi + xorl %ecx,%edi + movl %esi,60(%esp) + andl %ebx,%edi + rorl $2,%ebx + xorl %edx,%edi + leal 1518500249(%ebp,%esi,1),%ebp + movl (%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 8(%esp),%esi + roll $5,%edi + xorl 32(%esp),%esi + /* 16_19 16 */ + xorl 52(%esp),%esi + addl %edi,%ebp + movl %ecx,%edi + roll $1,%esi + xorl %ebx,%edi + movl %esi,(%esp) + andl %eax,%edi + rorl $2,%eax + leal 1518500249(%edx,%esi,1),%edx + movl 4(%esp),%esi + xorl %ecx,%edi + xorl 12(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 36(%esp),%esi + roll $5,%edi + /* 16_19 17 */ + xorl 56(%esp),%esi + addl %edi,%edx + movl %ebx,%edi + roll $1,%esi + xorl %eax,%edi + movl %esi,4(%esp) + andl %ebp,%edi + rorl $2,%ebp + leal 1518500249(%ecx,%esi,1),%ecx + movl 8(%esp),%esi + xorl %ebx,%edi + xorl 16(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 40(%esp),%esi + roll $5,%edi + /* 
16_19 18 */ + xorl 60(%esp),%esi + addl %edi,%ecx + movl %eax,%edi + roll $1,%esi + xorl %ebp,%edi + movl %esi,8(%esp) + andl %edx,%edi + rorl $2,%edx + leal 1518500249(%ebx,%esi,1),%ebx + movl 12(%esp),%esi + xorl %eax,%edi + xorl 20(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 44(%esp),%esi + roll $5,%edi + /* 16_19 19 */ + xorl (%esp),%esi + addl %edi,%ebx + movl %ebp,%edi + roll $1,%esi + xorl %edx,%edi + movl %esi,12(%esp) + andl %ecx,%edi + rorl $2,%ecx + leal 1518500249(%eax,%esi,1),%eax + movl 16(%esp),%esi + xorl %ebp,%edi + xorl 24(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 48(%esp),%esi + roll $5,%edi + /* 20_39 20 */ + xorl 4(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,16(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 1859775393(%ebp,%esi,1),%ebp + movl 20(%esp),%esi + xorl %ecx,%edi + xorl 28(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 52(%esp),%esi + roll $5,%edi + /* 20_39 21 */ + xorl 8(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,20(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 1859775393(%edx,%esi,1),%edx + movl 24(%esp),%esi + xorl %ebx,%edi + xorl 32(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 56(%esp),%esi + roll $5,%edi + /* 20_39 22 */ + xorl 12(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,24(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 1859775393(%ecx,%esi,1),%ecx + movl 28(%esp),%esi + xorl %eax,%edi + xorl 36(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 60(%esp),%esi + roll $5,%edi + /* 20_39 23 */ + xorl 16(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,28(%esp) + xorl %edx,%edi + rorl $2,%edx + leal 1859775393(%ebx,%esi,1),%ebx + movl 32(%esp),%esi + xorl %ebp,%edi + xorl 40(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl (%esp),%esi + roll $5,%edi + /* 20_39 24 */ + xorl 20(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl %esi,32(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 1859775393(%eax,%esi,1),%eax + movl 36(%esp),%esi + xorl %edx,%edi + xorl 44(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 4(%esp),%esi + roll $5,%edi + /* 20_39 25 */ + xorl 24(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,36(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 1859775393(%ebp,%esi,1),%ebp + movl 40(%esp),%esi + xorl %ecx,%edi + xorl 48(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 8(%esp),%esi + roll $5,%edi + /* 20_39 26 */ + xorl 28(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,40(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 1859775393(%edx,%esi,1),%edx + movl 44(%esp),%esi + xorl %ebx,%edi + xorl 52(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 12(%esp),%esi + roll $5,%edi + /* 20_39 27 */ + xorl 32(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,44(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 1859775393(%ecx,%esi,1),%ecx + movl 48(%esp),%esi + xorl %eax,%edi + xorl 56(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 16(%esp),%esi + roll $5,%edi + /* 20_39 28 */ + xorl 36(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,48(%esp) + xorl %edx,%edi + rorl $2,%edx + leal 1859775393(%ebx,%esi,1),%ebx + movl 52(%esp),%esi + xorl %ebp,%edi + xorl 60(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 20(%esp),%esi + roll $5,%edi + /* 20_39 29 */ + xorl 40(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl %esi,52(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 1859775393(%eax,%esi,1),%eax + 
movl 56(%esp),%esi + xorl %edx,%edi + xorl (%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 24(%esp),%esi + roll $5,%edi + /* 20_39 30 */ + xorl 44(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,56(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 1859775393(%ebp,%esi,1),%ebp + movl 60(%esp),%esi + xorl %ecx,%edi + xorl 4(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 28(%esp),%esi + roll $5,%edi + /* 20_39 31 */ + xorl 48(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,60(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 1859775393(%edx,%esi,1),%edx + movl (%esp),%esi + xorl %ebx,%edi + xorl 8(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 32(%esp),%esi + roll $5,%edi + /* 20_39 32 */ + xorl 52(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 1859775393(%ecx,%esi,1),%ecx + movl 4(%esp),%esi + xorl %eax,%edi + xorl 12(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 36(%esp),%esi + roll $5,%edi + /* 20_39 33 */ + xorl 56(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,4(%esp) + xorl %edx,%edi + rorl $2,%edx + leal 1859775393(%ebx,%esi,1),%ebx + movl 8(%esp),%esi + xorl %ebp,%edi + xorl 16(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 40(%esp),%esi + roll $5,%edi + /* 20_39 34 */ + xorl 60(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl %esi,8(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 1859775393(%eax,%esi,1),%eax + movl 12(%esp),%esi + xorl %edx,%edi + xorl 20(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 44(%esp),%esi + roll $5,%edi + /* 20_39 35 */ + xorl (%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,12(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 1859775393(%ebp,%esi,1),%ebp + movl 16(%esp),%esi + xorl %ecx,%edi + xorl 24(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 48(%esp),%esi + roll $5,%edi + /* 20_39 36 */ + xorl 4(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,16(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 1859775393(%edx,%esi,1),%edx + movl 20(%esp),%esi + xorl %ebx,%edi + xorl 28(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 52(%esp),%esi + roll $5,%edi + /* 20_39 37 */ + xorl 8(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,20(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 1859775393(%ecx,%esi,1),%ecx + movl 24(%esp),%esi + xorl %eax,%edi + xorl 32(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 56(%esp),%esi + roll $5,%edi + /* 20_39 38 */ + xorl 12(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,24(%esp) + xorl %edx,%edi + rorl $2,%edx + leal 1859775393(%ebx,%esi,1),%ebx + movl 28(%esp),%esi + xorl %ebp,%edi + xorl 36(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 60(%esp),%esi + roll $5,%edi + /* 20_39 39 */ + xorl 16(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl %esi,28(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 1859775393(%eax,%esi,1),%eax + movl 32(%esp),%esi + xorl %edx,%edi + xorl 40(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl (%esp),%esi + roll $5,%edi + /* 40_59 40 */ + addl %edi,%eax + movl %edx,%edi + xorl 20(%esp),%esi + andl %ecx,%edi + roll $1,%esi + addl %edi,%ebp + movl %edx,%edi + movl %esi,32(%esp) + xorl %ecx,%edi + leal 2400959708(%ebp,%esi,1),%ebp + andl %ebx,%edi + rorl $2,%ebx + movl 36(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 44(%esp),%esi + roll $5,%edi + xorl 4(%esp),%esi + /* 40_59 41 */ + addl %edi,%ebp + movl 
%ecx,%edi + xorl 24(%esp),%esi + andl %ebx,%edi + roll $1,%esi + addl %edi,%edx + movl %ecx,%edi + movl %esi,36(%esp) + xorl %ebx,%edi + leal 2400959708(%edx,%esi,1),%edx + andl %eax,%edi + rorl $2,%eax + movl 40(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 48(%esp),%esi + roll $5,%edi + xorl 8(%esp),%esi + /* 40_59 42 */ + addl %edi,%edx + movl %ebx,%edi + xorl 28(%esp),%esi + andl %eax,%edi + roll $1,%esi + addl %edi,%ecx + movl %ebx,%edi + movl %esi,40(%esp) + xorl %eax,%edi + leal 2400959708(%ecx,%esi,1),%ecx + andl %ebp,%edi + rorl $2,%ebp + movl 44(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 52(%esp),%esi + roll $5,%edi + xorl 12(%esp),%esi + /* 40_59 43 */ + addl %edi,%ecx + movl %eax,%edi + xorl 32(%esp),%esi + andl %ebp,%edi + roll $1,%esi + addl %edi,%ebx + movl %eax,%edi + movl %esi,44(%esp) + xorl %ebp,%edi + leal 2400959708(%ebx,%esi,1),%ebx + andl %edx,%edi + rorl $2,%edx + movl 48(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 56(%esp),%esi + roll $5,%edi + xorl 16(%esp),%esi + /* 40_59 44 */ + addl %edi,%ebx + movl %ebp,%edi + xorl 36(%esp),%esi + andl %edx,%edi + roll $1,%esi + addl %edi,%eax + movl %ebp,%edi + movl %esi,48(%esp) + xorl %edx,%edi + leal 2400959708(%eax,%esi,1),%eax + andl %ecx,%edi + rorl $2,%ecx + movl 52(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 60(%esp),%esi + roll $5,%edi + xorl 20(%esp),%esi + /* 40_59 45 */ + addl %edi,%eax + movl %edx,%edi + xorl 40(%esp),%esi + andl %ecx,%edi + roll $1,%esi + addl %edi,%ebp + movl %edx,%edi + movl %esi,52(%esp) + xorl %ecx,%edi + leal 2400959708(%ebp,%esi,1),%ebp + andl %ebx,%edi + rorl $2,%ebx + movl 56(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl (%esp),%esi + roll $5,%edi + xorl 24(%esp),%esi + /* 40_59 46 */ + addl %edi,%ebp + movl %ecx,%edi + xorl 44(%esp),%esi + andl %ebx,%edi + roll $1,%esi + addl %edi,%edx + movl %ecx,%edi + movl %esi,56(%esp) + xorl %ebx,%edi + leal 2400959708(%edx,%esi,1),%edx + andl %eax,%edi + rorl $2,%eax + movl 60(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 4(%esp),%esi + roll $5,%edi + xorl 28(%esp),%esi + /* 40_59 47 */ + addl %edi,%edx + movl %ebx,%edi + xorl 48(%esp),%esi + andl %eax,%edi + roll $1,%esi + addl %edi,%ecx + movl %ebx,%edi + movl %esi,60(%esp) + xorl %eax,%edi + leal 2400959708(%ecx,%esi,1),%ecx + andl %ebp,%edi + rorl $2,%ebp + movl (%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 8(%esp),%esi + roll $5,%edi + xorl 32(%esp),%esi + /* 40_59 48 */ + addl %edi,%ecx + movl %eax,%edi + xorl 52(%esp),%esi + andl %ebp,%edi + roll $1,%esi + addl %edi,%ebx + movl %eax,%edi + movl %esi,(%esp) + xorl %ebp,%edi + leal 2400959708(%ebx,%esi,1),%ebx + andl %edx,%edi + rorl $2,%edx + movl 4(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 12(%esp),%esi + roll $5,%edi + xorl 36(%esp),%esi + /* 40_59 49 */ + addl %edi,%ebx + movl %ebp,%edi + xorl 56(%esp),%esi + andl %edx,%edi + roll $1,%esi + addl %edi,%eax + movl %ebp,%edi + movl %esi,4(%esp) + xorl %edx,%edi + leal 2400959708(%eax,%esi,1),%eax + andl %ecx,%edi + rorl $2,%ecx + movl 8(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 16(%esp),%esi + roll $5,%edi + xorl 40(%esp),%esi + /* 40_59 50 */ + addl %edi,%eax + movl %edx,%edi + xorl 60(%esp),%esi + andl %ecx,%edi + roll $1,%esi + addl %edi,%ebp + movl %edx,%edi + movl %esi,8(%esp) + xorl %ecx,%edi + leal 2400959708(%ebp,%esi,1),%ebp + andl %ebx,%edi + rorl $2,%ebx + movl 12(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 20(%esp),%esi + roll $5,%edi + xorl 44(%esp),%esi + /* 40_59 51 */ + addl %edi,%ebp + movl 
%ecx,%edi + xorl (%esp),%esi + andl %ebx,%edi + roll $1,%esi + addl %edi,%edx + movl %ecx,%edi + movl %esi,12(%esp) + xorl %ebx,%edi + leal 2400959708(%edx,%esi,1),%edx + andl %eax,%edi + rorl $2,%eax + movl 16(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 24(%esp),%esi + roll $5,%edi + xorl 48(%esp),%esi + /* 40_59 52 */ + addl %edi,%edx + movl %ebx,%edi + xorl 4(%esp),%esi + andl %eax,%edi + roll $1,%esi + addl %edi,%ecx + movl %ebx,%edi + movl %esi,16(%esp) + xorl %eax,%edi + leal 2400959708(%ecx,%esi,1),%ecx + andl %ebp,%edi + rorl $2,%ebp + movl 20(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 28(%esp),%esi + roll $5,%edi + xorl 52(%esp),%esi + /* 40_59 53 */ + addl %edi,%ecx + movl %eax,%edi + xorl 8(%esp),%esi + andl %ebp,%edi + roll $1,%esi + addl %edi,%ebx + movl %eax,%edi + movl %esi,20(%esp) + xorl %ebp,%edi + leal 2400959708(%ebx,%esi,1),%ebx + andl %edx,%edi + rorl $2,%edx + movl 24(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 32(%esp),%esi + roll $5,%edi + xorl 56(%esp),%esi + /* 40_59 54 */ + addl %edi,%ebx + movl %ebp,%edi + xorl 12(%esp),%esi + andl %edx,%edi + roll $1,%esi + addl %edi,%eax + movl %ebp,%edi + movl %esi,24(%esp) + xorl %edx,%edi + leal 2400959708(%eax,%esi,1),%eax + andl %ecx,%edi + rorl $2,%ecx + movl 28(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 36(%esp),%esi + roll $5,%edi + xorl 60(%esp),%esi + /* 40_59 55 */ + addl %edi,%eax + movl %edx,%edi + xorl 16(%esp),%esi + andl %ecx,%edi + roll $1,%esi + addl %edi,%ebp + movl %edx,%edi + movl %esi,28(%esp) + xorl %ecx,%edi + leal 2400959708(%ebp,%esi,1),%ebp + andl %ebx,%edi + rorl $2,%ebx + movl 32(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 40(%esp),%esi + roll $5,%edi + xorl (%esp),%esi + /* 40_59 56 */ + addl %edi,%ebp + movl %ecx,%edi + xorl 20(%esp),%esi + andl %ebx,%edi + roll $1,%esi + addl %edi,%edx + movl %ecx,%edi + movl %esi,32(%esp) + xorl %ebx,%edi + leal 2400959708(%edx,%esi,1),%edx + andl %eax,%edi + rorl $2,%eax + movl 36(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 44(%esp),%esi + roll $5,%edi + xorl 4(%esp),%esi + /* 40_59 57 */ + addl %edi,%edx + movl %ebx,%edi + xorl 24(%esp),%esi + andl %eax,%edi + roll $1,%esi + addl %edi,%ecx + movl %ebx,%edi + movl %esi,36(%esp) + xorl %eax,%edi + leal 2400959708(%ecx,%esi,1),%ecx + andl %ebp,%edi + rorl $2,%ebp + movl 40(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 48(%esp),%esi + roll $5,%edi + xorl 8(%esp),%esi + /* 40_59 58 */ + addl %edi,%ecx + movl %eax,%edi + xorl 28(%esp),%esi + andl %ebp,%edi + roll $1,%esi + addl %edi,%ebx + movl %eax,%edi + movl %esi,40(%esp) + xorl %ebp,%edi + leal 2400959708(%ebx,%esi,1),%ebx + andl %edx,%edi + rorl $2,%edx + movl 44(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 52(%esp),%esi + roll $5,%edi + xorl 12(%esp),%esi + /* 40_59 59 */ + addl %edi,%ebx + movl %ebp,%edi + xorl 32(%esp),%esi + andl %edx,%edi + roll $1,%esi + addl %edi,%eax + movl %ebp,%edi + movl %esi,44(%esp) + xorl %edx,%edi + leal 2400959708(%eax,%esi,1),%eax + andl %ecx,%edi + rorl $2,%ecx + movl 48(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 56(%esp),%esi + roll $5,%edi + xorl 16(%esp),%esi + /* 20_39 60 */ + xorl 36(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,48(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 3395469782(%ebp,%esi,1),%ebp + movl 52(%esp),%esi + xorl %ecx,%edi + xorl 60(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 20(%esp),%esi + roll $5,%edi + /* 20_39 61 */ + xorl 40(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl 
%esi,52(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 3395469782(%edx,%esi,1),%edx + movl 56(%esp),%esi + xorl %ebx,%edi + xorl (%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 24(%esp),%esi + roll $5,%edi + /* 20_39 62 */ + xorl 44(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,56(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 3395469782(%ecx,%esi,1),%ecx + movl 60(%esp),%esi + xorl %eax,%edi + xorl 4(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 28(%esp),%esi + roll $5,%edi + /* 20_39 63 */ + xorl 48(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,60(%esp) + xorl %edx,%edi + rorl $2,%edx + leal 3395469782(%ebx,%esi,1),%ebx + movl (%esp),%esi + xorl %ebp,%edi + xorl 8(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 32(%esp),%esi + roll $5,%edi + /* 20_39 64 */ + xorl 52(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl %esi,(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 3395469782(%eax,%esi,1),%eax + movl 4(%esp),%esi + xorl %edx,%edi + xorl 12(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 36(%esp),%esi + roll $5,%edi + /* 20_39 65 */ + xorl 56(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,4(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 3395469782(%ebp,%esi,1),%ebp + movl 8(%esp),%esi + xorl %ecx,%edi + xorl 16(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 40(%esp),%esi + roll $5,%edi + /* 20_39 66 */ + xorl 60(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,8(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 3395469782(%edx,%esi,1),%edx + movl 12(%esp),%esi + xorl %ebx,%edi + xorl 20(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 44(%esp),%esi + roll $5,%edi + /* 20_39 67 */ + xorl (%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,12(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 3395469782(%ecx,%esi,1),%ecx + movl 16(%esp),%esi + xorl %eax,%edi + xorl 24(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 48(%esp),%esi + roll $5,%edi + /* 20_39 68 */ + xorl 4(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,16(%esp) + xorl %edx,%edi + rorl $2,%edx + leal 3395469782(%ebx,%esi,1),%ebx + movl 20(%esp),%esi + xorl %ebp,%edi + xorl 28(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 52(%esp),%esi + roll $5,%edi + /* 20_39 69 */ + xorl 8(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl %esi,20(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 3395469782(%eax,%esi,1),%eax + movl 24(%esp),%esi + xorl %edx,%edi + xorl 32(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 56(%esp),%esi + roll $5,%edi + /* 20_39 70 */ + xorl 12(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,24(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 3395469782(%ebp,%esi,1),%ebp + movl 28(%esp),%esi + xorl %ecx,%edi + xorl 36(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 60(%esp),%esi + roll $5,%edi + /* 20_39 71 */ + xorl 16(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,28(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 3395469782(%edx,%esi,1),%edx + movl 32(%esp),%esi + xorl %ebx,%edi + xorl 40(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl (%esp),%esi + roll $5,%edi + /* 20_39 72 */ + xorl 20(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + movl %esi,32(%esp) + xorl %ebp,%edi + rorl $2,%ebp + leal 3395469782(%ecx,%esi,1),%ecx + movl 36(%esp),%esi + xorl %eax,%edi + xorl 44(%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 4(%esp),%esi + roll $5,%edi + /* 
20_39 73 */ + xorl 24(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + movl %esi,36(%esp) + xorl %edx,%edi + rorl $2,%edx + leal 3395469782(%ebx,%esi,1),%ebx + movl 40(%esp),%esi + xorl %ebp,%edi + xorl 48(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 8(%esp),%esi + roll $5,%edi + /* 20_39 74 */ + xorl 28(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + movl %esi,40(%esp) + xorl %ecx,%edi + rorl $2,%ecx + leal 3395469782(%eax,%esi,1),%eax + movl 44(%esp),%esi + xorl %edx,%edi + xorl 52(%esp),%esi + addl %edi,%eax + movl %ebx,%edi + xorl 12(%esp),%esi + roll $5,%edi + /* 20_39 75 */ + xorl 32(%esp),%esi + addl %edi,%eax + roll $1,%esi + movl %edx,%edi + movl %esi,44(%esp) + xorl %ebx,%edi + rorl $2,%ebx + leal 3395469782(%ebp,%esi,1),%ebp + movl 48(%esp),%esi + xorl %ecx,%edi + xorl 56(%esp),%esi + addl %edi,%ebp + movl %eax,%edi + xorl 16(%esp),%esi + roll $5,%edi + /* 20_39 76 */ + xorl 36(%esp),%esi + addl %edi,%ebp + roll $1,%esi + movl %ecx,%edi + movl %esi,48(%esp) + xorl %eax,%edi + rorl $2,%eax + leal 3395469782(%edx,%esi,1),%edx + movl 52(%esp),%esi + xorl %ebx,%edi + xorl 60(%esp),%esi + addl %edi,%edx + movl %ebp,%edi + xorl 20(%esp),%esi + roll $5,%edi + /* 20_39 77 */ + xorl 40(%esp),%esi + addl %edi,%edx + roll $1,%esi + movl %ebx,%edi + xorl %ebp,%edi + rorl $2,%ebp + leal 3395469782(%ecx,%esi,1),%ecx + movl 56(%esp),%esi + xorl %eax,%edi + xorl (%esp),%esi + addl %edi,%ecx + movl %edx,%edi + xorl 24(%esp),%esi + roll $5,%edi + /* 20_39 78 */ + xorl 44(%esp),%esi + addl %edi,%ecx + roll $1,%esi + movl %eax,%edi + xorl %edx,%edi + rorl $2,%edx + leal 3395469782(%ebx,%esi,1),%ebx + movl 60(%esp),%esi + xorl %ebp,%edi + xorl 4(%esp),%esi + addl %edi,%ebx + movl %ecx,%edi + xorl 28(%esp),%esi + roll $5,%edi + /* 20_39 79 */ + xorl 48(%esp),%esi + addl %edi,%ebx + roll $1,%esi + movl %ebp,%edi + xorl %ecx,%edi + rorl $2,%ecx + leal 3395469782(%eax,%esi,1),%eax + xorl %edx,%edi + addl %edi,%eax + movl %ebx,%edi + roll $5,%edi + addl %edi,%eax + /* Loop trailer */ + movl 84(%esp),%edi + movl 88(%esp),%esi + addl 16(%edi),%ebp + addl 12(%edi),%edx + addl 8(%edi),%ecx + addl 4(%edi),%ebx + addl (%edi),%eax + addl $64,%esi + movl %ebp,16(%edi) + movl %edx,12(%edi) + cmpl 92(%esp),%esi + movl %ecx,8(%edi) + movl %ebx,4(%edi) + movl %eax,(%edi) + jb .L000loop + addl $64,%esp + popl %edi + popl %esi + popl %ebx + popl %ebp + ret +.L_sha1_block_data_order_end: +.size sha1_block_data_order,.L_sha1_block_data_order_end-sha1_block_data_order +.byte 83,72,65,49,32,98,108,111,99,107,32,116,114,97,110,115,102,111,114,109,32,102,111,114,32,120,56,54,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0 ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04 8:01 ` George Spelvin
@ 2009-08-04 20:41 ` Junio C Hamano
  2009-08-05 18:17 ` George Spelvin
  0 siblings, 1 reply; 129+ messages in thread
From: Junio C Hamano @ 2009-08-04 20:41 UTC (permalink / raw)
To: George Spelvin; +Cc: torvalds, git

"George Spelvin" <linux@horizon.com> writes:

> The one question I have is that currently perl is not a critical
> compile-time dependency; it's needed for some extra stuff, but AFAIK you
> can get most of git working without it.  Whether to add that dependency
> or what is a Junio question.

Actually, I feel a lot more uneasy about applying a patch signed off by
somebody who calls himself George Spelvin, though.

Three classes of people compile git from the source:

 * People who want to be on the bleeding edge and compile git for
   themselves, even though they are on mainstream platforms where they
   could choose distro-packaged one;

 * People who produce binary packages for distribution.

 * People who are on minority platforms and have no other way to get git
   than compiling for themselves;

We do not have to worry about the first two groups of people.  It won't
be too involved for them to install Perl on their system; after all they
are already coping with asciidoc and xmlto ;-)

We can continue shipping mozilla one to help the last group.

In the Makefile, we say:

    # Define NO_OPENSSL environment variable if you do not have OpenSSL.
    # This also implies MOZILLA_SHA1.

and with your change, we would start implying STANDALONE_OPENSSL_SHA1
instead.  But if MOZILLA_SHA1 was given explicitly, we could use that.

If they really, really, really want the extra performance out of the
statically linked OpenSSL derivative, they could prepare preprocessed
assembly on some other machine and use it as the last resort if they do
not have/want Perl.  The situation is exactly the same as with the
documentation set.  They are using HTML/man pages prepared on another
machine (namely, mine) as the last resort if they do not have/want the
AsciiDoc toolchain.

^ permalink raw reply	[flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04 20:41 ` Junio C Hamano
@ 2009-08-05 18:17 ` George Spelvin
  2009-08-05 20:36 ` Johannes Schindelin
  ` (2 more replies)
  0 siblings, 3 replies; 129+ messages in thread
From: George Spelvin @ 2009-08-05 18:17 UTC (permalink / raw)
To: gitster; +Cc: git, linux, torvalds

> Three classes of people compile git from the source:
>
>  * People who want to be on the bleeding edge and compile git for
>    themselves, even though they are on mainstream platforms where they
>    could choose distro-packaged one;
>
>  * People who produce binary packages for distribution.
>
>  * People who are on minority platforms and have no other way to get git
>    than compiling for themselves;
>
> We do not have to worry about the first two groups of people.  It won't
> be too involved for them to install Perl on their system; after all they
> are already coping with asciidoc and xmlto ;-)

Actually, I'd get rid of the perl entirely, but I'm not sure whether the
other-assembler-syntax features are needed by the folks on MacOS X and
Windows (msysgit).

> We can continue shipping mozilla one to help the last group.

Of course, we always need a C fallback.  Would you like a faster one?

> In the Makefile, we say:
>
>     # Define NO_OPENSSL environment variable if you do not have OpenSSL.
>     # This also implies MOZILLA_SHA1.
>
> and with your change, we would start implying STANDALONE_OPENSSL_SHA1
> instead.  But if MOZILLA_SHA1 was given explicitly, we could use that.

Well, I'd really like to auto-detect the processor.  Current gcc's
"gcc -v" output includes a "Target: " line that will do nicely.  I can,
of course, fall back to C if it fails, but is there a significant user
base using a non-GCC compiler?

^ permalink raw reply	[flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL
  2009-08-05 18:17 ` George Spelvin
@ 2009-08-05 20:36 ` Johannes Schindelin
  2009-08-05 20:44 ` Junio C Hamano
  2009-08-05 20:55 ` Linus Torvalds
  2 siblings, 0 replies; 129+ messages in thread
From: Johannes Schindelin @ 2009-08-05 20:36 UTC (permalink / raw)
To: George Spelvin; +Cc: gitster, git, torvalds

Hi,

On Wed, 5 Aug 2009, George Spelvin wrote:

> > Three classes of people compile git from the source:
> >
> >  * People who want to be on the bleeding edge and compile git for
> >    themselves, even though they are on mainstream platforms where they
> >    could choose distro-packaged one;
> >
> >  * People who produce binary packages for distribution.
> >
> >  * People who are on minority platforms and have no other way to get git
> >    than compiling for themselves;
> >
> > We do not have to worry about the first two groups of people.  It won't
> > be too involved for them to install Perl on their system; after all they
> > are already coping with asciidoc and xmlto ;-)
>
> Actually, I'd get rid of the perl entirely, but I'm not sure whether the
> other-assembler-syntax features are needed by the folks on MacOS X and
> Windows (msysgit).

Don't worry about MacOSX and msysGit (or Cygwin, for that matter): all of
them use GCC.

> > We can continue shipping mozilla one to help the last group.
>
> Of course, we always need a C fallback.  Would you like a faster one?

Is that a trick question? :-)

> > In the Makefile, we say:
> >
> >     # Define NO_OPENSSL environment variable if you do not have OpenSSL.
> >     # This also implies MOZILLA_SHA1.
> >
> > and with your change, we would start implying STANDALONE_OPENSSL_SHA1
> > instead.  But if MOZILLA_SHA1 was given explicitly, we could use that.
>
> Well, I'd really like to auto-detect the processor.  Current gcc's
> "gcc -v" output includes a "Target: " line that will do nicely.  I can,
> of course, fall back to C if it fails, but is there a significant user
> base using a non-GCC compiler?

Do you really want to determine which processor to optimize for at
compile time?  Build system and target system are often different...

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL
  2009-08-05 18:17 ` George Spelvin
  2009-08-05 20:36 ` Johannes Schindelin
@ 2009-08-05 20:44 ` Junio C Hamano
  2009-08-05 20:55 ` Linus Torvalds
  2 siblings, 0 replies; 129+ messages in thread
From: Junio C Hamano @ 2009-08-05 20:44 UTC (permalink / raw)
To: George Spelvin; +Cc: git, torvalds

"George Spelvin" <linux@horizon.com> writes:

>> We can continue shipping mozilla one to help the last group.
>
> Of course, we always need a C fallback.  Would you like a faster one?

No.  I'd rather keep the tested-and-tried one while a better alternative
is in a work-in-progress state.

^ permalink raw reply	[flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL
  2009-08-05 18:17 ` George Spelvin
  2009-08-05 20:36 ` Johannes Schindelin
  2009-08-05 20:44 ` Junio C Hamano
@ 2009-08-05 20:55 ` Linus Torvalds
  2009-08-05 23:13 ` Linus Torvalds
  2 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2009-08-05 20:55 UTC (permalink / raw)
To: George Spelvin; +Cc: gitster, git

On Wed, 5 Aug 2009, George Spelvin wrote:
>
> > We can continue shipping mozilla one to help the last group.
>
> Of course, we always need a C fallback.  Would you like a faster one?

I actually looked at code generation (on x86-64) for the C fallback, and
it should be quite doable to re-write the C one to generate good code on
x86-64.

On 32-bit x86, I suspect the register pressure is so intense that it's
unrealistic to expect gcc to do a good job, but the Mozilla SHA1 C code
really seems _designed_ to be slow in stupid ways (that whole "byte at a
time into a word buffer with shifts" is a really really sucky way to
handle the endianness issues).

So if you'd like to look at the C version, that's definitely worth it.
Much bigger bang for the buck than trying to schedule asm language and
having to deal with different assemblers/linkers/whatnot.

		Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread
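To make the contrast concrete, here is a small self-contained C sketch of the two loading styles. The byte-at-a-time loop is representative of the pattern being criticized, not a verbatim copy of the Mozilla routine, and the word-at-a-time version assumes unaligned 32-bit loads and a 32-bit unsigned int, the same assumptions the block-sha1 patch below makes.

#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>

/* Byte-at-a-time: shift each input byte into the word buffer.  The four
 * 8-bit shifts push any previous contents of the word out the top, so
 * each word ends up holding the big-endian value of its four bytes. */
static void load_bytewise(unsigned int W[16], const unsigned char *in)
{
	int i;
	for (i = 0; i < 64; i++) {
		W[i / 4] <<= 8;
		W[i / 4] |= in[i];
	}
}

/* Word-at-a-time: one (possibly unaligned) 32-bit load plus a byte swap. */
static void load_wordwise(unsigned int W[16], const unsigned char *in)
{
	int i;
	for (i = 0; i < 16; i++)
		W[i] = htonl(((const unsigned int *)in)[i]);
}

int main(void)
{
	unsigned char block[64];
	unsigned int a[16], b[16];
	int i;

	for (i = 0; i < 64; i++)
		block[i] = (unsigned char)(i * 7 + 1);

	load_bytewise(a, block);
	load_wordwise(b, block);
	printf("results %s\n", memcmp(a, b, sizeof(a)) ? "differ" : "match");
	return 0;
}

Both produce the same big-endian word buffer; the difference is four loads, three shifts and three ORs per word versus a single load and byte swap.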
* Re: x86 SHA1: Faster than OpenSSL 2009-08-05 20:55 ` Linus Torvalds @ 2009-08-05 23:13 ` Linus Torvalds 2009-08-06 1:18 ` Linus Torvalds 0 siblings, 1 reply; 129+ messages in thread From: Linus Torvalds @ 2009-08-05 23:13 UTC (permalink / raw) To: George Spelvin; +Cc: gitster, git On Wed, 5 Aug 2009, Linus Torvalds wrote: > > I actually looked at code generation (on x86-64) for the C fallback, and > it should be quite doable to re-write the C one to generate good code on > x86-64. Ok, here's a try. It's based on the mozilla SHA1 code, but with quite a bit of surgery. Enable with "make BLK_SHA1=1". Timings for "git fsck --full" on the git directory: - Mozilla SHA1 portable C-code (sucky sucky): MOZILLA_SHA1=1 real 0m38.194s user 0m37.838s sys 0m0.356s - This code ("half-portable C code"): BLK_SHA1=1 real 0m28.120s user 0m27.930s sys 0m0.192s - OpenSSL assembler code: real 0m26.327s user 0m26.194s sys 0m0.136s ie this is slightly slower than the openssh SHA1 routines, but that's only true on something very SHA1-intensive like "git fsck", and this is _almost_ portable code. I say "almost" because it really does require that we can do unaligned word loads, and do a good job of 'htonl()', and it assumes that 'unsigned int' is 32-bit (the latter would be easy to change by using 'uint32_t', but since it's not the relevant portability issue, I don't think it matters). In other words, unlike the Mozilla SHA1, this one doesn't suck. It's certainly not great either, but it's probably good enough in practice, without the headaches of actually making people use an assembler version. And maybe somebody can see how to improve it further? Linus --- From: Linus Torvalds <torvalds@linux-foundation.org> Subject: [PATCH] Add new optimized C 'block-sha1' routines Based on the mozilla SHA1 routine, but doing the input data accesses a word at a time and with 'htonl()' instead of loading bytes and shifting. It requires an architecture that is ok with unaligned 32-bit loads and a fast htonl(). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- Makefile | 9 +++ block-sha1/sha1.c | 145 +++++++++++++++++++++++++++++++++++++++++++++++++++++ block-sha1/sha1.h | 21 ++++++++ 3 files changed, 175 insertions(+), 0 deletions(-) diff --git a/Makefile b/Makefile index d7669b1..f12024c 100644 --- a/Makefile +++ b/Makefile @@ -84,6 +84,10 @@ all:: # specify your own (or DarwinPort's) include directories and # library directories by defining CFLAGS and LDFLAGS appropriately. # +# Define BLK_SHA1 environment variable if you want the C version +# of the SHA1 that assumes you can do unaligned 32-bit loads and +# have a fast htonl() function. +# # Define PPC_SHA1 environment variable when running make to make use of # a bundled SHA1 routine optimized for PowerPC. # @@ -1166,6 +1170,10 @@ ifdef NO_DEFLATE_BOUND BASIC_CFLAGS += -DNO_DEFLATE_BOUND endif +ifdef BLK_SHA1 + SHA1_HEADER = "block-sha1/sha1.h" + LIB_OBJS += block-sha1/sha1.o +else ifdef PPC_SHA1 SHA1_HEADER = "ppc/sha1.h" LIB_OBJS += ppc/sha1.o ppc/sha1ppc.o @@ -1183,6 +1191,7 @@ else endif endif endif +endif ifdef NO_PERL_MAKEMAKER export NO_PERL_MAKEMAKER endif diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c new file mode 100644 index 0000000..8fd90b0 --- /dev/null +++ b/block-sha1/sha1.c @@ -0,0 +1,145 @@ +/* + * Based on the Mozilla SHA1 (see mozilla-sha1/sha1.c), + * optimized to do word accesses rather than byte accesses, + * and to avoid unnecessary copies into the context array. 
+ */ + +#include <string.h> +#include <arpa/inet.h> + +#include "sha1.h" + +/* Hash one 64-byte block of data */ +static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data); + +void blk_SHA1_Init(blk_SHA_CTX *ctx) +{ + ctx->lenW = 0; + ctx->size = 0; + + /* Initialize H with the magic constants (see FIPS180 for constants) + */ + ctx->H[0] = 0x67452301; + ctx->H[1] = 0xefcdab89; + ctx->H[2] = 0x98badcfe; + ctx->H[3] = 0x10325476; + ctx->H[4] = 0xc3d2e1f0; +} + + +void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, int len) +{ + int lenW = ctx->lenW; + + ctx->size += len << 3; + + /* Read the data into W and process blocks as they get full + */ + if (lenW) { + int left = 64 - lenW; + if (len < left) + left = len; + memcpy(lenW + (char *)ctx->W, data, left); + lenW = (lenW + left) & 63; + len -= left; + data += left; + ctx->lenW = lenW; + if (lenW) + return; + blk_SHA1Block(ctx, ctx->W); + } + while (len >= 64) { + blk_SHA1Block(ctx, data); + data += 64; + len -= 64; + } + if (len) { + memcpy(ctx->W, data, len); + ctx->lenW = len; + } +} + + +void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx) +{ + static const unsigned char pad[64] = { 0x80 }; + unsigned int padlen[2]; + int i; + + /* Pad with a binary 1 (ie 0x80), then zeroes, then length + */ + padlen[0] = htonl(ctx->size >> 32); + padlen[1] = htonl(ctx->size); + + blk_SHA1_Update(ctx, pad, 1+ (63 & (55 - ctx->lenW))); + blk_SHA1_Update(ctx, padlen, 8); + + /* Output hash + */ + for (i = 0; i < 5; i++) + ((unsigned int *)hashout)[i] = htonl(ctx->H[i]); +} + +#define SHA_ROT(X,n) (((X) << (n)) | ((X) >> (32-(n)))) + +static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) +{ + int t; + unsigned int A,B,C,D,E,TEMP; + unsigned int W[80]; + + for (t = 0; t < 16; t++) + W[t] = htonl(data[t]); + + /* Unroll it? 
*/ + for (t = 16; t <= 79; t++) + W[t] = SHA_ROT(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); + + A = ctx->H[0]; + B = ctx->H[1]; + C = ctx->H[2]; + D = ctx->H[3]; + E = ctx->H[4]; + +#define T_0_19(t) \ + TEMP = SHA_ROT(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \ + E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP; + + T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4); + T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9); + T_0_19(10); T_0_19(11); T_0_19(12); T_0_19(13); T_0_19(14); + T_0_19(15); T_0_19(16); T_0_19(17); T_0_19(18); T_0_19(19); + +#define T_20_39(t) \ + TEMP = SHA_ROT(A,5) + (B^C^D) + E + W[t] + 0x6ed9eba1; \ + E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP; + + T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24); + T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29); + T_20_39(30); T_20_39(31); T_20_39(32); T_20_39(33); T_20_39(34); + T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39); + +#define T_40_59(t) \ + TEMP = SHA_ROT(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \ + E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP; + + T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44); + T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49); + T_40_59(50); T_40_59(51); T_40_59(52); T_40_59(53); T_40_59(54); + T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59); + +#define T_60_79(t) \ + TEMP = SHA_ROT(A,5) + (B^C^D) + E + W[t] + 0xca62c1d6; \ + E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP; + + T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64); + T_60_79(65); T_60_79(66); T_60_79(67); T_60_79(68); T_60_79(69); + T_60_79(70); T_60_79(71); T_60_79(72); T_60_79(73); T_60_79(74); + T_60_79(75); T_60_79(76); T_60_79(77); T_60_79(78); T_60_79(79); + + ctx->H[0] += A; + ctx->H[1] += B; + ctx->H[2] += C; + ctx->H[3] += D; + ctx->H[4] += E; +} diff --git a/block-sha1/sha1.h b/block-sha1/sha1.h new file mode 100644 index 0000000..dbc719f --- /dev/null +++ b/block-sha1/sha1.h @@ -0,0 +1,21 @@ +/* + * Based on the Mozilla SHA1 (see mozilla-sha1/sha1.h), + * optimized to do word accesses rather than byte accesses, + * and to avoid unnecessary copies into the context array. + */ + +typedef struct { + unsigned int H[5]; + unsigned int W[16]; + int lenW; + unsigned long long size; +} blk_SHA_CTX; + +void blk_SHA1_Init(blk_SHA_CTX *ctx); +void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *dataIn, int len); +void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx); + +#define git_SHA_CTX blk_SHA_CTX +#define git_SHA1_Init blk_SHA1_Init +#define git_SHA1_Update blk_SHA1_Update +#define git_SHA1_Final blk_SHA1_Final ^ permalink raw reply related [flat|nested] 129+ messages in thread
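For anyone who wants to try the patch by hand, a throwaway self-test (not part of the patch, assuming it is dropped at the top of the git tree next to the new block-sha1 directory) can be built with something like "gcc -O2 -o sha1test sha1test.c block-sha1/sha1.c". The expected output is the standard FIPS-180 test vector for "abc".

#include <stdio.h>
#include <string.h>
#include "block-sha1/sha1.h"

int main(void)
{
	static const char expect[] = "a9993e364706816aba3e25717850c26c9cd0d89d";
	unsigned char hash[20];
	char hex[41];
	blk_SHA_CTX ctx;
	int i;

	blk_SHA1_Init(&ctx);
	blk_SHA1_Update(&ctx, "abc", 3);
	blk_SHA1_Final(hash, &ctx);

	/* print the digest as lowercase hex and compare to the known vector */
	for (i = 0; i < 20; i++)
		sprintf(hex + 2 * i, "%02x", hash[i]);

	printf("%s (%s)\n", hex, strcmp(hex, expect) ? "MISMATCH" : "ok");
	return !!strcmp(hex, expect);
}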
* Re: x86 SHA1: Faster than OpenSSL 2009-08-05 23:13 ` Linus Torvalds @ 2009-08-06 1:18 ` Linus Torvalds 2009-08-06 1:52 ` Nicolas Pitre 2009-08-06 18:49 ` Erik Faye-Lund 0 siblings, 2 replies; 129+ messages in thread From: Linus Torvalds @ 2009-08-06 1:18 UTC (permalink / raw) To: George Spelvin; +Cc: gitster, git On Wed, 5 Aug 2009, Linus Torvalds wrote: > > Timings for "git fsck --full" on the git directory: > > - Mozilla SHA1 portable C-code (sucky sucky): MOZILLA_SHA1=1 > > real 0m38.194s > user 0m37.838s > sys 0m0.356s > > - This code ("half-portable C code"): BLK_SHA1=1 > > real 0m28.120s > user 0m27.930s > sys 0m0.192s > > - OpenSSL assembler code: > > real 0m26.327s > user 0m26.194s > sys 0m0.136s Ok, I installed the 32-bit libraries too, to see what it looks like for that case. As expected, the compiler is not able to do a great job due to it being somewhat register starved, but on the other hand, the old Mozilla code did even worse, so.. - Mozilla SHA: real 0m47.063s user 0m46.815s sys 0m0.252s - BLK_SHA1=1 real 0m34.705s user 0m34.394s sys 0m0.312s - OPENSSL: real 0m29.754s user 0m29.446s sys 0m0.288s so the tuned asm from OpenSSL does kick ass, but the C code version isn't _that_ far away. It's quite a reasonable alternative if you don't have the OpenSSL libraries installed, for example. I note that MINGW does NO_OPENSSL by default, for example, and maybe the MINGW people want to test the patch out and enable BLK_SHA1 rather than the original Mozilla one. But while looking at 32-bit issues, I noticed that I really should also cast 'len' when shifting it. Otherwise the thing is limited to fairly small areas (28 bits - 256MB). This is not just a 32-bit problem ("int" is a signed 32-bit thing even in a 64-bit build), but I only noticed it when looking at 32-bit issues. So here's an incremental patch to fix that. Linus --- block-sha1/sha1.c | 4 ++-- block-sha1/sha1.h | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c index 8fd90b0..eef32f7 100644 --- a/block-sha1/sha1.c +++ b/block-sha1/sha1.c @@ -27,11 +27,11 @@ void blk_SHA1_Init(blk_SHA_CTX *ctx) } -void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, int len) +void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len) { int lenW = ctx->lenW; - ctx->size += len << 3; + ctx->size += (unsigned long long) len << 3; /* Read the data into W and process blocks as they get full */ diff --git a/block-sha1/sha1.h b/block-sha1/sha1.h index dbc719f..7be2d93 100644 --- a/block-sha1/sha1.h +++ b/block-sha1/sha1.h @@ -12,7 +12,7 @@ typedef struct { } blk_SHA_CTX; void blk_SHA1_Init(blk_SHA_CTX *ctx); -void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *dataIn, int len); +void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *dataIn, unsigned long len); void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx); #define git_SHA_CTX blk_SHA_CTX ^ permalink raw reply related [flat|nested] 129+ messages in thread
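A tiny standalone illustration of the limit being described: the shift is done in 32-bit arithmetic before the value ever reaches the 64-bit size field, so the bit count goes wrong for large updates (with the original signed int it is undefined behaviour above 2^28 bytes, i.e. 256MB; the example below uses an unsigned count so the wrong result is at least well defined).

#include <stdio.h>

int main(void)
{
	unsigned int len = 600u << 20;	/* a 600MB update, as a 32-bit count */

	/* Shifted in 32-bit arithmetic: the top bits of the bit count are
	 * lost before the value is widened to 64 bits. */
	unsigned long long bad = len << 3;

	/* Widened to 64 bits first, as in the fix above: */
	unsigned long long good = (unsigned long long) len << 3;

	printf("bad  = %llu bits\ngood = %llu bits\n", bad, good);
	return 0;
}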
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 1:18 ` Linus Torvalds @ 2009-08-06 1:52 ` Nicolas Pitre 2009-08-06 2:04 ` Junio C Hamano 2009-08-06 2:08 ` Linus Torvalds 2009-08-06 18:49 ` Erik Faye-Lund 1 sibling, 2 replies; 129+ messages in thread From: Nicolas Pitre @ 2009-08-06 1:52 UTC (permalink / raw) To: Linus Torvalds; +Cc: George Spelvin, Junio C Hamano, git On Wed, 5 Aug 2009, Linus Torvalds wrote: > But while looking at 32-bit issues, I noticed that I really should also > cast 'len' when shifting it. Otherwise the thing is limited to fairly > small areas (28 bits - 256MB). This is not just a 32-bit problem ("int" is > a signed 32-bit thing even in a 64-bit build), but I only noticed it when > looking at 32-bit issues. Even better is to not shift len at all in SHA_update() but shift ctx->size only at the end in SHA_final(). It is not like if SHA_update() could operate on partial bytes, so counting total bytes instead of total bits is all you need. This way you need no cast there and make the code slightly faster. Nicolas ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06 1:52 ` Nicolas Pitre
@ 2009-08-06 2:04 ` Junio C Hamano
  2009-08-06 2:10 ` Linus Torvalds
  2009-08-06 2:20 ` Nicolas Pitre
  2009-08-06 2:08 ` Linus Torvalds
  1 sibling, 2 replies; 129+ messages in thread
From: Junio C Hamano @ 2009-08-06 2:04 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Linus Torvalds, George Spelvin, git

Nicolas Pitre <nico@cam.org> writes:

> On Wed, 5 Aug 2009, Linus Torvalds wrote:
>
>> But while looking at 32-bit issues, I noticed that I really should also
>> cast 'len' when shifting it.  Otherwise the thing is limited to fairly
>> small areas (28 bits - 256MB).  This is not just a 32-bit problem ("int" is
>> a signed 32-bit thing even in a 64-bit build), but I only noticed it when
>> looking at 32-bit issues.
>
> Even better is to not shift len at all in SHA_update() but shift
> ctx->size only at the end in SHA_final().  It is not like if
> SHA_update() could operate on partial bytes, so counting total bytes
> instead of total bits is all you need.  This way you need no cast there
> and make the code slightly faster.

Like this?

By the way, the Mozilla one calls Init at the end of Final but block-sha1
doesn't.  I do not think it matters for our callers, but on the other hand
Final is not a performance-critical part, nor is Init heavy, so it may not
be a bad idea to imitate them as well.  Or am I missing something?

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index eef32f7..8293f7b 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -31,7 +31,7 @@ void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
 {
 	int lenW = ctx->lenW;
 
-	ctx->size += (unsigned long long) len << 3;
+	ctx->size += (unsigned long long) len;
 
 	/* Read the data into W and process blocks as they get full
 	 */
@@ -68,6 +68,7 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
 
 	/* Pad with a binary 1 (ie 0x80), then zeroes, then length
 	 */
+	ctx->size <<= 3;	/* bytes to bits */
 	padlen[0] = htonl(ctx->size >> 32);
 	padlen[1] = htonl(ctx->size);

^ permalink raw reply related	[flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 2:04 ` Junio C Hamano @ 2009-08-06 2:10 ` Linus Torvalds 2009-08-06 2:20 ` Nicolas Pitre 1 sibling, 0 replies; 129+ messages in thread From: Linus Torvalds @ 2009-08-06 2:10 UTC (permalink / raw) To: Junio C Hamano; +Cc: Nicolas Pitre, George Spelvin, git On Wed, 5 Aug 2009, Junio C Hamano wrote: > > Like this? No, combine it with the other shifts: Yes: > - ctx->size += (unsigned long long) len << 3; > + ctx->size += (unsigned long long) len; No: > + ctx->size <<= 3; /* bytes to bits */ > padlen[0] = htonl(ctx->size >> 32); > padlen[1] = htonl(ctx->size); Do padlen[0] = htonl(ctx->size >> 29); padlen[1] = htonl(ctx->size << 3); instead. Or whatever. Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06 2:04 ` Junio C Hamano
  2009-08-06 2:10 ` Linus Torvalds
@ 2009-08-06 2:20 ` Nicolas Pitre
  1 sibling, 0 replies; 129+ messages in thread
From: Nicolas Pitre @ 2009-08-06 2:20 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Linus Torvalds, George Spelvin, git

On Wed, 5 Aug 2009, Junio C Hamano wrote:

> Nicolas Pitre <nico@cam.org> writes:
>
> > On Wed, 5 Aug 2009, Linus Torvalds wrote:
> >
> >> But while looking at 32-bit issues, I noticed that I really should also
> >> cast 'len' when shifting it.  Otherwise the thing is limited to fairly
> >> small areas (28 bits - 256MB).  This is not just a 32-bit problem ("int" is
> >> a signed 32-bit thing even in a 64-bit build), but I only noticed it when
> >> looking at 32-bit issues.
> >
> > Even better is to not shift len at all in SHA_update() but shift
> > ctx->size only at the end in SHA_final().  It is not like if
> > SHA_update() could operate on partial bytes, so counting total bytes
> > instead of total bits is all you need.  This way you need no cast there
> > and make the code slightly faster.
>
> Like this?

Almost (see below).

> By the way, the Mozilla one calls Init at the end of Final but block-sha1
> doesn't.  I do not think it matters for our callers, but on the other hand
> Final is not a performance-critical part, nor is Init heavy, so it may not
> be a bad idea to imitate them as well.  Or am I missing something?

It is done only to make sure potentially crypto-sensitive information is
wiped out of the ctx structure instance.  In our case we have no such
concerns.

> diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
> index eef32f7..8293f7b 100644
> --- a/block-sha1/sha1.c
> +++ b/block-sha1/sha1.c
> @@ -31,7 +31,7 @@ void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
>  {
>  	int lenW = ctx->lenW;
>  
> -	ctx->size += (unsigned long long) len << 3;
> +	ctx->size += (unsigned long long) len;

You can get rid of the cast as well now.

>  	/* Read the data into W and process blocks as they get full
>  	 */
> @@ -68,6 +68,7 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
>  
>  	/* Pad with a binary 1 (ie 0x80), then zeroes, then length
>  	 */
> +	ctx->size <<= 3;	/* bytes to bits */
>  	padlen[0] = htonl(ctx->size >> 32);
>  	padlen[1] = htonl(ctx->size);

Instead, I'd do:

	padlen[0] = htonl(ctx->size >> (32 - 3));
	padlen[1] = htonl(ctx->size << 3);

That would eliminate a redundant write-back of ctx->size.


Nicolas

^ permalink raw reply	[flat|nested] 129+ messages in thread
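The ">> 29 / << 3" form does produce exactly the high and low words of the 64-bit bit count derived from a byte count; a tiny standalone check of that identity (illustrative only, the variable names are not from the patch):

#include <assert.h>
#include <stdio.h>

int main(void)
{
	/* An example total length a bit over 5GB, so the high word is nonzero. */
	unsigned long long bytes = (5ULL << 30) + 123;
	unsigned long long bits  = bytes << 3;

	/* The two halves of the big-endian 64-bit *bit* count, computed
	 * directly from the byte count without rewriting it: */
	unsigned int hi = (unsigned int)(bytes >> 29);	/* == bits >> 32 */
	unsigned int lo = (unsigned int)(bytes << 3);	/* == low 32 bits of bits */

	assert(hi == (unsigned int)(bits >> 32));
	assert(lo == (unsigned int)bits);
	printf("hi=%u lo=%u\n", hi, lo);
	return 0;
}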
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 1:52 ` Nicolas Pitre 2009-08-06 2:04 ` Junio C Hamano @ 2009-08-06 2:08 ` Linus Torvalds 2009-08-06 3:19 ` Artur Skawina 1 sibling, 1 reply; 129+ messages in thread From: Linus Torvalds @ 2009-08-06 2:08 UTC (permalink / raw) To: Nicolas Pitre; +Cc: George Spelvin, Junio C Hamano, git On Wed, 5 Aug 2009, Nicolas Pitre wrote: > > Even better is to not shift len at all in SHA_update() but shift > ctx->size only at the end in SHA_final(). It is not like if > SHA_update() could operate on partial bytes, so counting total bytes > instead of total bits is all you need. This way you need no cast there > and make the code slightly faster. Yeah, I tried it, but it's not noticeable. The bigger issue seems to be that it's shifter-limited, or that's what I take away from my profiles. I suspect it's even _more_ shifter-limited on some other micro-architectures, because gcc is being stupid, and generates ror $31,%eax from the "left shift + right shift" combination. It seems to -always- generate a "ror", rather than trying to generate 'rot' if the shift count would be smaller that way. And I know _some_ old micro-architectures will literally internally loop on the rol/ror counts, so "ror $31" can be _much_ more expensive than "rol $1". That isn't the case on my Nehalem, though. But I can't seem to get gcc to generate better code without actually using inline asm.. (So to clarify: this patch makes no difference that I can see to performance, but I suspect it could matter on other CPU's like an old Pentium or maybe an Atom). Linus --- block-sha1/sha1.c | 36 ++++++++++++++++++++++++------------ block-sha1/sha1.h | 2 +- 2 files changed, 25 insertions(+), 13 deletions(-) diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c index 8fd90b0..a45a3de 100644 --- a/block-sha1/sha1.c +++ b/block-sha1/sha1.c @@ -80,7 +80,19 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx) ((unsigned int *)hashout)[i] = htonl(ctx->H[i]); } -#define SHA_ROT(X,n) (((X) << (n)) | ((X) >> (32-(n)))) +#if defined(__i386__) || defined(__x86_64__) + +#define SHA_ASM(op, x, n) ({ unsigned int __res; asm(op " %1,%0":"=r" (__res):"i" (n), "0" (x)); __res; }) +#define SHA_ROL(x,n) SHA_ASM("rol", x, n) +#define SHA_ROR(x,n) SHA_ASM("ror", x, n) + +#else + +#define SHA_ROT(X,n) (((X) << (l)) | ((X) >> (r))) +#define SHA_ROL(X,n) SHA_ROT(X,n,32-(n)) +#define SHA_ROR(X,n) SHA_ROT(X,32-(n),n) + +#endif static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) { @@ -93,7 +105,7 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) /* Unroll it? 
*/ for (t = 16; t <= 79; t++) - W[t] = SHA_ROT(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); + W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); A = ctx->H[0]; B = ctx->H[1]; @@ -102,8 +114,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) E = ctx->H[4]; #define T_0_19(t) \ - TEMP = SHA_ROT(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \ - E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP; + TEMP = SHA_ROL(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \ + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4); T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9); @@ -111,8 +123,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) T_0_19(15); T_0_19(16); T_0_19(17); T_0_19(18); T_0_19(19); #define T_20_39(t) \ - TEMP = SHA_ROT(A,5) + (B^C^D) + E + W[t] + 0x6ed9eba1; \ - E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP; + TEMP = SHA_ROL(A,5) + (B^C^D) + E + W[t] + 0x6ed9eba1; \ + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24); T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29); @@ -120,8 +132,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39); #define T_40_59(t) \ - TEMP = SHA_ROT(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \ - E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP; + TEMP = SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \ + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44); T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49); @@ -129,8 +141,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59); #define T_60_79(t) \ - TEMP = SHA_ROT(A,5) + (B^C^D) + E + W[t] + 0xca62c1d6; \ - E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP; + TEMP = SHA_ROL(A,5) + (B^C^D) + E + W[t] + 0xca62c1d6; \ + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64); T_60_79(65); T_60_79(66); T_60_79(67); T_60_79(68); T_60_79(69); ^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 2:08 ` Linus Torvalds @ 2009-08-06 3:19 ` Artur Skawina 2009-08-06 3:31 ` Linus Torvalds 0 siblings, 1 reply; 129+ messages in thread From: Artur Skawina @ 2009-08-06 3:19 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git Linus Torvalds wrote: > > The bigger issue seems to be that it's shifter-limited, or that's what I > take away from my profiles. I suspect it's even _more_ shifter-limited on > some other micro-architectures, because gcc is being stupid, and generates > > ror $31,%eax > > from the "left shift + right shift" combination. It seems to -always- > generate a "ror", rather than trying to generate 'rot' if the shift count > would be smaller that way. > > And I know _some_ old micro-architectures will literally internally loop > on the rol/ror counts, so "ror $31" can be _much_ more expensive than "rol > $1". > > That isn't the case on my Nehalem, though. But I can't seem to get gcc to > generate better code without actually using inline asm.. The compiler does the right thing w/ something like this: +#if __GNUC__>1 && defined(__i386) +#define SHA_ROT(data,bits) ({ \ + unsigned d = (data); \ + if (bits<16) \ + __asm__ ("roll %1,%0" : "=r" (d) : "I" (bits), "0" (d)); \ + else \ + __asm__ ("rorl %1,%0" : "=r" (d) : "I" (32-bits), "0" (d)); \ + d; \ + }) +#else #define SHA_ROT(X,n) (((X) << (n)) | ((X) >> (32-(n)))) +#endif which doesn't obfuscate the code as much. (I needed the asm on p4 anyway, as w/o it the mozilla version is even slower than an rfc3174 one. rol vs ror makes no measurable difference) > static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) > { > @@ -93,7 +105,7 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) > > /* Unroll it? */ > for (t = 16; t <= 79; t++) > - W[t] = SHA_ROT(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); > + W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); unrolling this once (but not more) is a win, at least on p4. > #define T_0_19(t) \ > - TEMP = SHA_ROT(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \ > - E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP; > + TEMP = SHA_ROL(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \ > + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; > > T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4); > T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9); unrolling these otoh is a clear loss (iirc ~10%). artur ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 3:19 ` Artur Skawina @ 2009-08-06 3:31 ` Linus Torvalds 2009-08-06 3:48 ` Linus Torvalds 2009-08-06 4:08 ` Artur Skawina 0 siblings, 2 replies; 129+ messages in thread From: Linus Torvalds @ 2009-08-06 3:31 UTC (permalink / raw) To: Artur Skawina; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git On Thu, 6 Aug 2009, Artur Skawina wrote: > > > #define T_0_19(t) \ > > - TEMP = SHA_ROT(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \ > > - E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP; > > + TEMP = SHA_ROL(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \ > > + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; > > > > T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4); > > T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9); > > unrolling these otoh is a clear loss (iirc ~10%). I can well imagine. The P4 decode bandwidth is abysmal unless you get things into the trace cache, and the trace cache is of a very limited size. However, on at least Nehalem, unrolling it all is quite a noticeable win. The way it's written, I can easily make it do one or the other by just turning the macro inside a loop (and we can have a preprocessor flag to choose one or the other), but let me work on it a bit more first. I'm trying to move the htonl() inside the loops (the same way I suggested George do with his assembly), and it seems to help a tiny bit. But I may be measuring noise. However, right now my biggest profile hit is on this irritating loop: /* Unroll it? */ for (t = 16; t <= 79; t++) W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); and I haven't been able to move _that_ into the other iterations yet. Here's my micro-optimization update. It does the first 16 rounds (of the first 20-round thing) specially, and takes the data directly from the input array. I'm _this_ close to breaking the 28s second barrier on git-fsck, but not quite yet. Linus --- From: Linus Torvalds <torvalds@linux-foundation.org> Subject: [PATCH] block-sha1: make the 'ntohl()' part of the first SHA1 loop This helps a teeny bit. But what I -really- want to do is to avoid the whole 80-array loop, and do the xor updates as I go along.. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- block-sha1/sha1.c | 28 ++++++++++++++++------------ 1 files changed, 16 insertions(+), 12 deletions(-) diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c index a45a3de..39a5bbb 100644 --- a/block-sha1/sha1.c +++ b/block-sha1/sha1.c @@ -100,27 +100,31 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) unsigned int A,B,C,D,E,TEMP; unsigned int W[80]; - for (t = 0; t < 16; t++) - W[t] = htonl(data[t]); - - /* Unroll it? */ - for (t = 16; t <= 79; t++) - W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); - A = ctx->H[0]; B = ctx->H[1]; C = ctx->H[2]; D = ctx->H[3]; E = ctx->H[4]; -#define T_0_19(t) \ +#define T_0_15(t) \ + TEMP = htonl(data[t]); W[t] = TEMP; \ + TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \ + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \ + + T_0_15( 0); T_0_15( 1); T_0_15( 2); T_0_15( 3); T_0_15( 4); + T_0_15( 5); T_0_15( 6); T_0_15( 7); T_0_15( 8); T_0_15( 9); + T_0_15(10); T_0_15(11); T_0_15(12); T_0_15(13); T_0_15(14); + T_0_15(15); + + /* Unroll it? 
*/ + for (t = 16; t <= 79; t++) + W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); + +#define T_16_19(t) \ TEMP = SHA_ROL(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \ E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; - T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4); - T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9); - T_0_19(10); T_0_19(11); T_0_19(12); T_0_19(13); T_0_19(14); - T_0_19(15); T_0_19(16); T_0_19(17); T_0_19(18); T_0_19(19); + T_16_19(16); T_16_19(17); T_16_19(18); T_16_19(19); #define T_20_39(t) \ TEMP = SHA_ROL(A,5) + (B^C^D) + E + W[t] + 0x6ed9eba1; \ ^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 3:31 ` Linus Torvalds @ 2009-08-06 3:48 ` Linus Torvalds 2009-08-06 4:01 ` Linus Torvalds 2009-08-06 4:52 ` George Spelvin 2009-08-06 4:08 ` Artur Skawina 1 sibling, 2 replies; 129+ messages in thread From: Linus Torvalds @ 2009-08-06 3:48 UTC (permalink / raw) To: Artur Skawina; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git On Wed, 5 Aug 2009, Linus Torvalds wrote: > > However, right now my biggest profile hit is on this irritating loop: > > /* Unroll it? */ > for (t = 16; t <= 79; t++) > W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); > > and I haven't been able to move _that_ into the other iterations yet. Oh yes I have. Here's the patch that gets me sub-28s git-fsck times. In fact, it gives me sub-27s times. In fact, it's really close to the OpenSSL times. And all using plain C. Again - this is all on x86-64. I suspect 32-bit code ends up having spills due to register pressure. That said, I did get rid of that big temporary array, and it now basically only uses that 512-bit array as one circular queue. Linus PS. Ok, so my definition of "plain C" is a bit odd. There's nothing plain about it. It's disgusting C preprocessor misuse. But dang, it's kind of fun to abuse the compiler this way. --- block-sha1/sha1.c | 28 ++++++++++++++++------------ 1 files changed, 16 insertions(+), 12 deletions(-) diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c index 39a5bbb..80193d4 100644 --- a/block-sha1/sha1.c +++ b/block-sha1/sha1.c @@ -96,9 +96,8 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx) static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) { - int t; unsigned int A,B,C,D,E,TEMP; - unsigned int W[80]; + unsigned int array[16]; A = ctx->H[0]; B = ctx->H[1]; @@ -107,8 +106,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) E = ctx->H[4]; #define T_0_15(t) \ - TEMP = htonl(data[t]); W[t] = TEMP; \ - TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \ + TEMP = htonl(data[t]); array[t] = TEMP; \ + TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \ E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \ T_0_15( 0); T_0_15( 1); T_0_15( 2); T_0_15( 3); T_0_15( 4); @@ -116,18 +115,21 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) T_0_15(10); T_0_15(11); T_0_15(12); T_0_15(13); T_0_15(14); T_0_15(15); - /* Unroll it? 
*/ - for (t = 16; t <= 79; t++) - W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); +/* This "rolls" over the 512-bit array */ +#define W(x) (array[(x)&15]) +#define SHA_XOR(t) \ + TEMP = SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1); W(t) = TEMP; #define T_16_19(t) \ - TEMP = SHA_ROL(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \ - E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; + SHA_XOR(t); \ + TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \ + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \ T_16_19(16); T_16_19(17); T_16_19(18); T_16_19(19); #define T_20_39(t) \ - TEMP = SHA_ROL(A,5) + (B^C^D) + E + W[t] + 0x6ed9eba1; \ + SHA_XOR(t); \ + TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0x6ed9eba1; \ E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24); @@ -136,7 +138,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39); #define T_40_59(t) \ - TEMP = SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \ + SHA_XOR(t); \ + TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + 0x8f1bbcdc; \ E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44); @@ -145,7 +148,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59); #define T_60_79(t) \ - TEMP = SHA_ROL(A,5) + (B^C^D) + E + W[t] + 0xca62c1d6; \ + SHA_XOR(t); \ + TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0xca62c1d6; \ E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64); ^ permalink raw reply related [flat|nested] 129+ messages in thread
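The idea the W(x) macro above captures is that round t of SHA-1 only ever reads W[t-3], W[t-8], W[t-14] and W[t-16], so the 80-entry message schedule can live in a 16-word window addressed mod 16: W(t-16) and W(t) are the same slot, and each new word overwrites the one that is no longer needed. A stripped-down sketch of just that part (illustrative, not the git code; SHA_ROL assumed as before):

#include <stdint.h>

#define SHA_ROL(X,n) (((X) << (n)) | ((X) >> (32-(n))))

/* 16-word circular window over the message schedule */
#define W(x) (array[(x) & 15])

/* Compute the round-t schedule word, for t >= 16, given that array[0..15]
 * already holds the byte-swapped input block. */
static uint32_t sha1_schedule(uint32_t array[16], int t)
{
	uint32_t w = SHA_ROL(W(t-3) ^ W(t-8) ^ W(t-14) ^ W(t-16), 1);
	W(t) = w;	/* overwrites the slot that held W(t-16) */
	return w;
}

Note that "(x) & 15" only folds into a fixed address when t is a compile-time constant, which is why this layout really wants the rounds fully unrolled - a point that comes up again below when loops are compared against unrolling on the P4 and Atom.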
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 3:48 ` Linus Torvalds @ 2009-08-06 4:01 ` Linus Torvalds 2009-08-06 4:28 ` Artur Skawina 2009-08-06 4:52 ` George Spelvin 1 sibling, 1 reply; 129+ messages in thread From: Linus Torvalds @ 2009-08-06 4:01 UTC (permalink / raw) To: Artur Skawina; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git On Wed, 5 Aug 2009, Linus Torvalds wrote: > > Here's the patch that gets me sub-28s git-fsck times. In fact, it gives me > sub-27s times. In fact, it's really close to the OpenSSL times. Just to back that up: - OpenSSL: real 0m26.363s user 0m26.174s sys 0m0.188s - This C implementation: real 0m26.594s user 0m26.310s sys 0m0.256s so I'm still slower, but now you really have to look closely to see the difference. In fact, you have to do multiple runs to make sure, because the error bars are bigger than the difference - but openssl definitely edges my C code out by a small amount, and the above numbers are fairly normal. Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 4:01 ` Linus Torvalds @ 2009-08-06 4:28 ` Artur Skawina 2009-08-06 4:50 ` Linus Torvalds 0 siblings, 1 reply; 129+ messages in thread From: Artur Skawina @ 2009-08-06 4:28 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git Linus Torvalds wrote: > > On Wed, 5 Aug 2009, Linus Torvalds wrote: >> Here's the patch that gets me sub-28s git-fsck times. In fact, it gives me >> sub-27s times. In fact, it's really close to the OpenSSL times. > > Just to back that up: > > - OpenSSL: > > real 0m26.363s > user 0m26.174s > sys 0m0.188s > > - This C implementation: > > real 0m26.594s > user 0m26.310s > sys 0m0.256s > > so I'm still slower, but now you really have to look closely to see the > difference. In fact, you have to do multiple runs to make sure, because > the error bars are bigger thant he difference - but openssl definitely > edges my C code out by a small amount, and the above numbers are rairly > normal. nice, the p4 microbenchmark #s: # TIME[s] SPEED[MB/s] rfc3174 1.357 44.99 rfc3174 1.352 45.13 mozilla 1.509 40.44 mozillaas 1.133 53.87 linus 0.5818 104.9 so it's more than twice as fast as the mozilla implementation. artur ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 4:28 ` Artur Skawina @ 2009-08-06 4:50 ` Linus Torvalds 2009-08-06 5:19 ` Artur Skawina 0 siblings, 1 reply; 129+ messages in thread From: Linus Torvalds @ 2009-08-06 4:50 UTC (permalink / raw) To: Artur Skawina; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git On Thu, 6 Aug 2009, Artur Skawina wrote: > > # TIME[s] SPEED[MB/s] > rfc3174 1.357 44.99 > rfc3174 1.352 45.13 > mozilla 1.509 40.44 > mozillaas 1.133 53.87 > linus 0.5818 104.9 > > so it's more than twice as fast as the mozilla implementation. So that's some general SHA1 benchmark you have? I hope it tests correctness too. Although I can't imagine it being wrong - I've made mistakes (oh, yes, many mistakes) when trying to convert the code to something efficient, and even the smallest mistake results in 'git fsck' immediately complaining about every single object. But still. I literally haven't tested it any other way (well, the git test-suite ends up doing a fair amount of testing too, and I _have_ run that). As to my atom testing: my poor little atom is a sad little thing, and it's almost painful to benchmark that thing. But it's worth it to look at how the 32-bit code compares to the openssl asm code too: - BLK_SHA1: real 2m27.160s user 2m23.651s sys 0m2.392s - OpenSSL: real 2m12.580s user 2m9.998s sys 0m1.811s - Mozilla-SHA1: real 3m21.836s user 3m18.369s sys 0m2.862s As expected, the hand-tuned assembly does better (and by a bigger margin). Probably partly because scheduling is important when in-order, and partly because gcc will have a harder time with the small register set. But it's still a big improvement over the mozilla one. (This is, as always, 'git fsck --full'. It spends about 50% on that SHA1 calculation, so the SHA1 speedup is larger than you see from just the numbers) Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 4:50 ` Linus Torvalds @ 2009-08-06 5:19 ` Artur Skawina 2009-08-06 7:03 ` George Spelvin 0 siblings, 1 reply; 129+ messages in thread From: Artur Skawina @ 2009-08-06 5:19 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git Linus Torvalds wrote: > > On Thu, 6 Aug 2009, Artur Skawina wrote: >> # TIME[s] SPEED[MB/s] >> rfc3174 1.357 44.99 >> rfc3174 1.352 45.13 >> mozilla 1.509 40.44 >> mozillaas 1.133 53.87 >> linus 0.5818 104.9 >> >> so it's more than twice as fast as the mozilla implementation. > > So that's some general SHA1 benchmark you have? > > I hope it tests correctness too. yep, sort of, i just check that all versions return the same result when hashing some pseudorandom data. > As to my atom testing: my poor little atom is a sad little thing, and > it's almost painful to benchmark that thing. But it's worth it to look at > how the 32-bit code compares to the openssl asm code too: > > - BLK_SHA1: > real 2m27.160s > - OpenSSL: > real 2m12.580s > - Mozilla-SHA1: > real 3m21.836s > > As expected, the hand-tuned assembly does better (and by a bigger margin). > Probably partly because scheduling is important when in-order, and partly > because gcc will have a harder time with the small register set. > > But it's still a big improvement over mozilla one. > > (This is, as always, 'git fsck --full'. It spends about 50% on that SHA1 > calculation, so the SHA1 speedup is larger than you see from just th > enumbers) I'll start looking at other cpus once i integrate the asm versions into my benchmark. P4s really are "special". Even something as simple as this on top of your version: @@ -129,8 +133,8 @@ #define T_20_39(t) \ SHA_XOR(t); \ - TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0x6ed9eba1; \ - E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; + TEMP += SHA_ROL(A,5) + (B^C^D) + E; \ + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP + 0x6ed9eba1; T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24); T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29); @@ -139,8 +143,8 @@ #define T_40_59(t) \ SHA_XOR(t); \ - TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + 0x8f1bbcdc; \ - E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; + TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E; \ + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP + 0x8f1bbcdc; T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44); T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49); saves another 10% or so: #Initializing... Rounds: 1000000, size: 62500K, time: 1.421s, speed: 42.97MB/s # TIME[s] SPEED[MB/s] rfc3174 1.403 43.5 # New hash result: b747042d9f4f1fdabd2ac53076f8f830dea7fe0f rfc3174 1.403 43.51 linus 0.5891 103.6 linusas 0.5337 114.4 mozilla 1.535 39.76 mozillaas 1.128 54.13 artur ^ permalink raw reply [flat|nested] 129+ messages in thread
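Cross-checking the implementations against each other catches divergence but not a bug shared by all of them, so it is worth also pinning one known-answer test. A minimal sketch against the blk_SHA1_* entry points used elsewhere in this thread (the well-known SHA-1 test vector for "abc" is a9993e364706816aba3e25717850c26c9cd0d89d):

#include <string.h>
#include "sha1.h"	/* blk_SHA1_Init/Update/Final from block-sha1/ */

int main(void)
{
	static const unsigned char expect[20] = {
		0xa9,0x99,0x3e,0x36,0x47,0x06,0x81,0x6a,0xba,0x3e,
		0x25,0x71,0x78,0x50,0xc2,0x6c,0x9c,0xd0,0xd8,0x9d
	};
	unsigned char out[20];
	blk_SHA_CTX ctx;

	blk_SHA1_Init(&ctx);
	blk_SHA1_Update(&ctx, "abc", 3);
	blk_SHA1_Final(out, &ctx);
	return memcmp(out, expect, 20) ? 1 : 0;	/* 0 means the hash matches */
}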
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 5:19 ` Artur Skawina @ 2009-08-06 7:03 ` George Spelvin 0 siblings, 0 replies; 129+ messages in thread From: George Spelvin @ 2009-08-06 7:03 UTC (permalink / raw) To: art.08.09, torvalds; +Cc: git, gitster, linux, nico > On Thu, 6 Aug 2009, Artur Skawina wrote: >> # TIME[s] SPEED[MB/s] >> rfc3174 1.357 44.99 >> rfc3174 1.352 45.13 >> mozilla 1.509 40.44 >> mozillaas 1.133 53.87 >> linus 0.5818 104.9 > #Initializing... Rounds: 1000000, size: 62500K, time: 1.421s, speed: 42.97MB/s > # TIME[s] SPEED[MB/s] > rfc3174 1.403 43.5 > # New hash result: b747042d9f4f1fdabd2ac53076f8f830dea7fe0f > rfc3174 1.403 43.51 > linus 0.5891 103.6 > linusas 0.5337 114.4 > mozilla 1.535 39.76 > mozillaas 1.128 54.13 I'm trying to absorb what you're learning about P4 performance, but I'm getting confused... what is what in these benchmarks? The major architectural decisions I see are: 1) Three possible ways to compute the W[] array for rounds 16..79: 1a) Compute W[16..79] in a loop beforehand (you noted that unrolling two copies helped significantly.) 1b) Compute W[16..79] as part of hash rounds 16..79. 1c) Compute W[0..15] in-place as part of hash rounds 16..79. 2) The main hashing can be rolled up or unrolled: 2a) Four 20-round loops. (In case of options 1b and 1c, the first one might be split into a 16 and a 4.) 2b) Four 4-round loops, each unrolled 5x. (See the ARM assembly.) 2c) All 80 rounds unrolled. As Linus noted, 1c is not friends with options 2a and 2b, because the W() indexing math is no longer a compile-time constant. Linus has posted 1a+2c and 1c+2c. You posted some code that could be 2a or 2c depending on an UNROLL preprocessor #define. Which combinations are your "linus" and "linusas" code? You talk about "and my atom seems to like the compact loops too", but I'm not sure which loops those are. Thanks. ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 3:48 ` Linus Torvalds 2009-08-06 4:01 ` Linus Torvalds @ 2009-08-06 4:52 ` George Spelvin 1 sibling, 0 replies; 129+ messages in thread From: George Spelvin @ 2009-08-06 4:52 UTC (permalink / raw) To: art.08.09, torvalds; +Cc: git, gitster, linux, nico On Wed, 5 Aug 2009, Linus Torvalds wrote: > Oh yes I have. > > Here's the patch that gets me sub-28s git-fsck times. In fact, it gives me > sub-27s times. In fact, it's really close to the OpenSSL times. > > And all using plain C. > > Again - this is all on x86-64. I suspect 32-bit code ends up having > spills due to register pressure. That said, I did get rid of that big > temporary array, and it now basically only uses that 512-bit array as one > circular queue. > > Linus > > PS. Ok, so my definition of "plain C" is a bit odd. There's nothing plain > about it. It's disgusting C preprocessor misuse. But dang, it's kind of > fun to abuse the compiler this way. You're still missing three tricks, which give a slight speedup on my machine: 1) (major) Instead of reassigning all those variable all the time, make the round function E += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + W[t] + 0x8f1bbcdc; \ B = SHA_ROR(B, 2); and rename the variables between rounds. 2 (minor) One of the round functions ((B&C)|(D&(B|C))) can be rewritten (B&C) | (C&D) | (D&B) = (B&C) | (D&(B|C)) = (B&C) | (D&(B^C)) = (B&C) ^ (D&(B^C)) = (B&C) + (D&(B^C)) to expose more associativty (and thus scheduling flexibility) to the compiler. 3) (minor) ctx->lenW is always simply a copy of the low 6 bits of ctx->size, so there's no need to bother with it. Actually, looking at the code, GCC manages to figure out the first (major) one by itself. Way to go, GCC authors! But getting avoiding the extra temporary in trick 2 also gets rid of some extra REX prefixes, saving 240 bytes in blk_SHA1Block, which is kind of nice in an inner loop. Here's my modified version of your earlier code. I haven't incoporated the W[] formation into the round functions as in your latest version. I'm sure you can bash the two together in very little time. Or I'll get to it later; I really should attend to $DAY_JOB at the moment. diff --git a/Makefile b/Makefile index daf4296..e6df8ec 100644 --- a/Makefile +++ b/Makefile @@ -84,6 +84,10 @@ all:: # specify your own (or DarwinPort's) include directories and # library directories by defining CFLAGS and LDFLAGS appropriately. # +# Define BLK_SHA1 environment variable if you want the C version +# of the SHA1 that assumes you can do unaligned 32-bit loads and +# have a fast htonl() function. +# # Define PPC_SHA1 environment variable when running make to make use of # a bundled SHA1 routine optimized for PowerPC. # @@ -1167,6 +1171,10 @@ ifdef NO_DEFLATE_BOUND BASIC_CFLAGS += -DNO_DEFLATE_BOUND endif +ifdef BLK_SHA1 + SHA1_HEADER = "block-sha1/sha1.h" + LIB_OBJS += block-sha1/sha1.o +else ifdef PPC_SHA1 SHA1_HEADER = "ppc/sha1.h" LIB_OBJS += ppc/sha1.o ppc/sha1ppc.o @@ -1184,6 +1192,7 @@ else endif endif endif +endif ifdef NO_PERL_MAKEMAKER export NO_PERL_MAKEMAKER endif diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c new file mode 100644 index 0000000..261eae7 --- /dev/null +++ b/block-sha1/sha1.c @@ -0,0 +1,141 @@ +/* + * Based on the Mozilla SHA1 (see mozilla-sha1/sha1.c), + * optimized to do word accesses rather than byte accesses, + * and to avoid unnecessary copies into the context array. 
+ */ + +#include <string.h> +#include <arpa/inet.h> + +#include "sha1.h" + +/* Hash one 64-byte block of data */ +static void blk_SHA1Block(blk_SHA_CTX *ctx, const uint32_t *data); + +void blk_SHA1_Init(blk_SHA_CTX *ctx) +{ + /* Initialize H with the magic constants (see FIPS180 for constants) + */ + ctx->H[0] = 0x67452301; + ctx->H[1] = 0xefcdab89; + ctx->H[2] = 0x98badcfe; + ctx->H[3] = 0x10325476; + ctx->H[4] = 0xc3d2e1f0; + ctx->size = 0; +} + + +void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, int len) +{ + int lenW = (int)ctx->size & 63; + + ctx->size += len; + + /* Read the data into W and process blocks as they get full + */ + if (lenW) { + int left = 64 - lenW; + if (len < left) + left = len; + memcpy(lenW + (char *)ctx->W, data, left); + if (left + lenW != 64) + return; + len -= left; + data += left; + blk_SHA1Block(ctx, ctx->W); + } + while (len >= 64) { + blk_SHA1Block(ctx, data); + data += 64; + len -= 64; + } + memcpy(ctx->W, data, len); +} + + +void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx) +{ + int i, lenW = (int)ctx->size & 63; + + /* Pad with a binary 1 (ie 0x80), then zeroes, then length + */ + ((char *)ctx->W)[lenW++] = 0x80; + if (lenW > 56) { + memset((char *)ctx->W + lenW, 0, 64 - lenW); + blk_SHA1Block(ctx, ctx->W); + lenW = 0; + } + memset((char *)ctx->W + lenW, 0, 56 - lenW); + ctx->W[14] = htonl(ctx->size >> 29); + ctx->W[15] = htonl((uint32_t)ctx->size << 3); + blk_SHA1Block(ctx, ctx->W); + + /* Output hash + */ + for (i = 0; i < 5; i++) + ((unsigned int *)hashout)[i] = htonl(ctx->H[i]); +} + +/* SHA-1 helper macros */ +#define SHA_ROT(X,n) (((X) << (n)) | ((X) >> (32-(n)))) +#define F1(b,c,d) (((d^c)&b)^d) +#define F2(b,c,d) (b^c^d) +/* This version lets the compiler use the fact that + is associative. */ +#define F3(b,c,d) (c&d) + (b & (c^d)) + +/* The basic SHA-1 round */ +#define ROUND(a, b, c, d, e, f, k, t) \ + e += SHA_ROT(a,5) + f(b,c,d) + W[t] + k; b = SHA_ROT(b, 30) +/* Five SHA-1 rounds */ +#define FIVE(f, k, t) \ + ROUND(A, B, C, D, E, f, k, t ); \ + ROUND(E, A, B, C, D, f, k, t+1); \ + ROUND(D, E, A, B, C, f, k, t+2); \ + ROUND(C, D, E, A, B, f, k, t+3); \ + ROUND(B, C, D, E, A, f, k, t+4) + +static void blk_SHA1Block(blk_SHA_CTX *ctx, const uint32_t *data) +{ + int t; + uint32_t A,B,C,D,E; + uint32_t W[80]; + + for (t = 0; t < 16; t++) + W[t] = htonl(data[t]); + + /* Unroll it? */ + for (t = 16; t <= 79; t++) + W[t] = SHA_ROT(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); + + A = ctx->H[0]; + B = ctx->H[1]; + C = ctx->H[2]; + D = ctx->H[3]; + E = ctx->H[4]; + + FIVE(F1, 0x5a827999, 0); + FIVE(F1, 0x5a827999, 5); + FIVE(F1, 0x5a827999, 10); + FIVE(F1, 0x5a827999, 15); + + FIVE(F2, 0x6ed9eba1, 20); + FIVE(F2, 0x6ed9eba1, 25); + FIVE(F2, 0x6ed9eba1, 30); + FIVE(F2, 0x6ed9eba1, 35); + + FIVE(F3, 0x8f1bbcdc, 40); + FIVE(F3, 0x8f1bbcdc, 45); + FIVE(F3, 0x8f1bbcdc, 50); + FIVE(F3, 0x8f1bbcdc, 55); + + FIVE(F2, 0xca62c1d6, 60); + FIVE(F2, 0xca62c1d6, 65); + FIVE(F2, 0xca62c1d6, 70); + FIVE(F2, 0xca62c1d6, 75); + + ctx->H[0] += A; + ctx->H[1] += B; + ctx->H[2] += C; + ctx->H[3] += D; + ctx->H[4] += E; +} diff --git a/block-sha1/sha1.h b/block-sha1/sha1.h new file mode 100644 index 0000000..c9dc156 --- /dev/null +++ b/block-sha1/sha1.h @@ -0,0 +1,21 @@ +/* + * Based on the Mozilla SHA1 (see mozilla-sha1/sha1.h), + * optimized to do word accesses rather than byte accesses, + * and to avoid unnecessary copies into the context array. 
+ */ + #include <stdint.h> + +typedef struct { + uint32_t H[5]; + uint64_t size; + uint32_t W[16]; +} blk_SHA_CTX; + +void blk_SHA1_Init(blk_SHA_CTX *ctx); +void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *dataIn, int len); +void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx); + +#define git_SHA_CTX blk_SHA_CTX +#define git_SHA1_Init blk_SHA1_Init +#define git_SHA1_Update blk_SHA1_Update +#define git_SHA1_Final blk_SHA1_Final ^ permalink raw reply related [flat|nested] 129+ messages in thread
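The F3 rewrite in the patch above relies on an identity of the SHA-1 majority function: (b&c)|(d&(b|c)) selects the majority bit, and the two terms (b&c) and (d&(b^c)) are never set in the same bit position, so the OR can equally be written as XOR or as +, which hands the compiler an associative addition chain to reschedule. Since the claim is per bit, it can be checked exhaustively over single bits (a quick verification sketch, not part of the patch):

#include <assert.h>

int main(void)
{
	unsigned b, c, d;

	/* all 8 combinations of one bit each of b, c, d */
	for (b = 0; b < 2; b++)
	for (c = 0; c < 2; c++)
	for (d = 0; d < 2; d++) {
		unsigned maj = (b & c) | (d & (b | c));	/* form used in T_40_59 */
		unsigned alt_or  = (b & c) | (d & (b ^ c));
		unsigned alt_xor = (b & c) ^ (d & (b ^ c));
		unsigned alt_add = (b & c) + (d & (b ^ c));

		assert(maj == alt_or && maj == alt_xor && maj == alt_add);
	}
	return 0;
}

Because the two terms are bitwise disjoint, the + form also stays carry-free on full 32-bit words, so the single-bit check covers the word-wide case as well.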
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 3:31 ` Linus Torvalds 2009-08-06 3:48 ` Linus Torvalds @ 2009-08-06 4:08 ` Artur Skawina 2009-08-06 4:27 ` Linus Torvalds 1 sibling, 1 reply; 129+ messages in thread From: Artur Skawina @ 2009-08-06 4:08 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git Linus Torvalds wrote: > > On Thu, 6 Aug 2009, Artur Skawina wrote: >>> #define T_0_19(t) \ >>> - TEMP = SHA_ROT(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \ >>> - E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP; >>> + TEMP = SHA_ROL(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \ >>> + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; >>> >>> T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4); >>> T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9); >> unrolling these otoh is a clear loss (iirc ~10%). > > I can well imagine. The P4 decode bandwidth is abysmal unless you get > things into the trace cache, and the trace cache is of a very limited > size. > > However, on at least Nehalem, unrolling it all is quite a noticeable win. > > The way it's written, I can easily make it do one or the other by just > turning the macro inside a loop (and we can have a preprocessor flag to > choose one or the other), but let me work on it a bit more first. that's of course how i measured it.. :) > I'm trying to move the htonl() inside the loops (the same way I suggested > George do with his assembly), and it seems to help a tiny bit. But I may > be measuring noise. i haven't tried your version at all yet (just applied the rol/ror and unrolling changes, but neither was a win on p4) > However, right now my biggest profile hit is on this irritating loop: > > /* Unroll it? */ > for (t = 16; t <= 79; t++) > W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); > > and I haven't been able to move _that_ into the other iterations yet. i've done that before -- was a small loss -- maybe because of the small trace cache. deleted that attempt while cleaning up the #if mess, so don't have the patch, but it was basically #define newW(t) (W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1)) and than s/W[t]/newW(t)/ in rounds 16..79. I've only tested on p4 and there the winner so far is still: - for (t = 16; t <= 79; t++) + for (t = 16; t <= 79; t+=2) { ctx->W[t] = - SHA_ROT(ctx->W[t-3] ^ ctx->W[t-8] ^ ctx->W[t-14] ^ ctx->W[t-16], 1); + SHA_ROT(ctx->W[t-16] ^ ctx->W[t-14] ^ ctx->W[t-8] ^ ctx->W[t-3], 1); + ctx->W[t+1] = + SHA_ROT(ctx->W[t-15] ^ ctx->W[t-13] ^ ctx->W[t-7] ^ ctx->W[t-2], 1); + } > Here's my micro-optimization update. It does the first 16 rounds (of the > first 20-round thing) specially, and takes the data directly from the > input array. I'm _this_ close to breaking the 28s second barrier on > git-fsck, but not quite yet. tried this before too -- doesn't help. Not much a of a surprise -- if unrolling didn't help adding another loop (for rounds 17..20) won't. artur ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 4:08 ` Artur Skawina @ 2009-08-06 4:27 ` Linus Torvalds 2009-08-06 5:44 ` Artur Skawina 0 siblings, 1 reply; 129+ messages in thread From: Linus Torvalds @ 2009-08-06 4:27 UTC (permalink / raw) To: Artur Skawina; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git On Thu, 6 Aug 2009, Artur Skawina wrote: > > > > The way it's written, I can easily make it do one or the other by just > > turning the macro inside a loop (and we can have a preprocessor flag to > > choose one or the other), but let me work on it a bit more first. > > that's of course how i measured it.. :) Well, with my "rolling 512-bit array" I can't do that easily any more. Now it actually depends on the compiler being able to statically do that circular list calculation. If I were to turn it back into the chunks of loops, my new code would suck, because it would have all those nasty dynamic address calculations. > I've only tested on p4 and there the winner so far is still: Yeah, well, I refuse to touch that crappy micro-architecture any more. I complained to Intel people for years that their best CPU was only available as a laptop chip (Pentium-M), and I'm really happy to have gotten rid of all my horrid P4s. (Ok, so it was great when the P4 ran at 2x the frequency of the competition, and then it smoked them all. Except on OS loads, where the P4 exception handling took ten times longer than anything else). So I'm a bit biased against P4. I'll try it on my Atoms, though. They're pretty crappy CPUs, but they have a fairly good _reason_ to be crappy. Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 4:27 ` Linus Torvalds @ 2009-08-06 5:44 ` Artur Skawina 2009-08-06 5:56 ` Artur Skawina 0 siblings, 1 reply; 129+ messages in thread From: Artur Skawina @ 2009-08-06 5:44 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git Linus Torvalds wrote: > > On Thu, 6 Aug 2009, Artur Skawina wrote: >>> The way it's written, I can easily make it do one or the other by just >>> turning the macro inside a loop (and we can have a preprocessor flag to >>> choose one or the other), but let me work on it a bit more first. >> that's of course how i measured it.. :) > > Well, with my "rolling 512-bit array" I can't do that easily any more. > > Now it actually depends on the compiler being able to statically do that > circular list calculation. If I were to turn it back into the chunks of > loops, my new code would suck, because it would have all those nasty > dynamic address calculations. i did try (obvious patch below) and in fact the loops still win on p4: #Initializing... Rounds: 1000000, size: 62500K, time: 1.428s, speed: 42.76MB/s # TIME[s] SPEED[MB/s] rfc3174 1.437 42.47 rfc3174 1.438 42.45 linus 0.5791 105.4 linusas 0.5052 120.8 mozilla 1.525 40.01 mozillaas 1.192 51.19 artur --- block-sha1/sha1.c 2009-08-06 06:45:03.407322970 +0200 +++ block-sha1/sha1as.c 2009-08-06 07:36:41.332318683 +0200 @@ -107,13 +107,17 @@ #define T_0_15(t) \ TEMP = htonl(data[t]); array[t] = TEMP; \ - TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \ - E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \ + TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E; \ + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP + 0x5a827999; \ +#if UNROLL T_0_15( 0); T_0_15( 1); T_0_15( 2); T_0_15( 3); T_0_15( 4); T_0_15( 5); T_0_15( 6); T_0_15( 7); T_0_15( 8); T_0_15( 9); T_0_15(10); T_0_15(11); T_0_15(12); T_0_15(13); T_0_15(14); T_0_15(15); +#else + for (int t = 0; t <= 15; t++) { T_0_15(t); } +#endif /* This "rolls" over the 512-bit array */ #define W(x) (array[(x)&15]) @@ -125,37 +129,53 @@ TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \ E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \ +#if UNROLL T_16_19(16); T_16_19(17); T_16_19(18); T_16_19(19); +#else + for (int t = 16; t <= 19; t++) { T_16_19(t); } +#endif #define T_20_39(t) \ SHA_XOR(t); \ - TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0x6ed9eba1; \ - E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; + TEMP += SHA_ROL(A,5) + (B^C^D) + E; \ + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP + 0x6ed9eba1; +#if UNROLL T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24); T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29); T_20_39(30); T_20_39(31); T_20_39(32); T_20_39(33); T_20_39(34); T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39); +#else + for (int t = 20; t <= 39; t++) { T_20_39(t); } +#endif #define T_40_59(t) \ SHA_XOR(t); \ - TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + 0x8f1bbcdc; \ - E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; + TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E; \ + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP + 0x8f1bbcdc; +#if UNROLL T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44); T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49); T_40_59(50); T_40_59(51); T_40_59(52); T_40_59(53); T_40_59(54); T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59); +#else + for (int t = 40; t <= 59; t++) { T_40_59(t); } +#endif #define T_60_79(t) \ SHA_XOR(t); \ TEMP += SHA_ROL(A,5) + (B^C^D) + E + 
0xca62c1d6; \ E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; +#if UNROLL T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64); T_60_79(65); T_60_79(66); T_60_79(67); T_60_79(68); T_60_79(69); T_60_79(70); T_60_79(71); T_60_79(72); T_60_79(73); T_60_79(74); T_60_79(75); T_60_79(76); T_60_79(77); T_60_79(78); T_60_79(79); +#else + for (int t = 60; t <= 79; t++) { T_60_79(t); } +#endif ctx->H[0] += A; ctx->H[1] += B; ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 5:44 ` Artur Skawina @ 2009-08-06 5:56 ` Artur Skawina 2009-08-06 7:45 ` Artur Skawina 0 siblings, 1 reply; 129+ messages in thread From: Artur Skawina @ 2009-08-06 5:56 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git Artur Skawina wrote: > i did try (obvious patch below) and in fact the loops still win on p4: > > #Initializing... Rounds: 1000000, size: 62500K, time: 1.428s, speed: 42.76MB/s > # TIME[s] SPEED[MB/s] > rfc3174 1.437 42.47 > rfc3174 1.438 42.45 > linus 0.5791 105.4 > linusas 0.5052 120.8 > mozilla 1.525 40.01 > mozillaas 1.192 51.19 and my atom seems to like the compact loops too: #Initializing... Rounds: 1000000, size: 62500K, time: 4.379s, speed: 13.94MB/s # TIME[s] SPEED[MB/s] rfc3174 4.429 13.78 rfc3174 4.414 13.83 linus 1.733 35.22 linusas 1.5 40.7 mozilla 2.818 21.66 mozillaas 2.539 24.04 artur ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 5:56 ` Artur Skawina @ 2009-08-06 7:45 ` Artur Skawina 0 siblings, 0 replies; 129+ messages in thread From: Artur Skawina @ 2009-08-06 7:45 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git Artur Skawina wrote: > > and my atom seems to like the compact loops too: no, that was wrong, i forgot to turn off the ondemand governor... the unrolled loops are in fact much faster and the numbers look more reasonable, after a few tweaks even on a P4. Now i just need to check how well it does compared to the asm implementations... artur # TIME[s] SPEED[MB/s] # ATOM rfc3174 2.199 27.75 linus 0.8642 70.62 linusas 1.606 38.01 linusas2 0.8763 69.65 mozilla 2.813 21.7 mozillaas 2.539 24.04 # P4 rfc3174 1.402 43.53 linus 0.5835 104.6 linusas 0.4625 132 linusas2 0.4456 137 mozilla 1.529 39.91 mozillaas 1.131 53.96 # P3 rfc3174 5.019 12.16 linus 1.86 32.81 linusas 3.108 19.64 linusas2 1.812 33.68 mozilla 6.431 9.49 mozillaas 5.868 10.4 ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-06 1:18 ` Linus Torvalds 2009-08-06 1:52 ` Nicolas Pitre @ 2009-08-06 18:49 ` Erik Faye-Lund 1 sibling, 0 replies; 129+ messages in thread From: Erik Faye-Lund @ 2009-08-06 18:49 UTC (permalink / raw) To: Linus Torvalds; +Cc: George Spelvin, gitster, git On Thu, Aug 6, 2009 at 3:18 AM, Linus Torvalds<torvalds@linux-foundation.org> wrote: > I note that MINGW does NO_OPENSSL by default, for example, and maybe the > MINGW people want to test the patch out and enable BLK_SHA1 rather than > the original Mozilla one. We recently got OpenSSL in msysgit. The NO_OPENSSL-switch hasn't been flipped yet, though. (We did OpenSSL to get https-support in cURL...) -- Erik "kusma" Faye-Lund kusmabite@gmail.com (+47) 986 59 656 ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-04 4:48 ` George Spelvin 2009-08-04 6:30 ` Linus Torvalds @ 2009-08-04 6:40 ` Linus Torvalds 1 sibling, 0 replies; 129+ messages in thread From: Linus Torvalds @ 2009-08-04 6:40 UTC (permalink / raw) To: George Spelvin; +Cc: Git Mailing List On Mon, 4 Aug 2009, George Spelvin wrote: > +sha1_block_data_order: > + pushl %ebp > + pushl %ebx > + pushl %esi > + pushl %edi > + movl 20(%esp),%edi > + movl 24(%esp),%esi > + movl 28(%esp),%eax > + subl $64,%esp > + shll $6,%eax > + addl %esi,%eax > + movl %eax,92(%esp) > + movl 16(%edi),%ebp > + movl 12(%edi),%edx > +.align 16 > +.L000loop: > + movl (%esi),%ecx > + movl 4(%esi),%ebx > + bswap %ecx > + movl 8(%esi),%eax > + bswap %ebx > + movl %ecx,(%esp) ... Hmm. Does it really help to do the bswap as a separate initial phase? As far as I can tell, you load the result of the bswap just a single time for each value. So the initial "bswap all 64 bytes" seems pointless. > + /* 00_15 0 */ > + movl %edx,%edi > + movl (%esp),%esi Why not do the bswap here instead? Is it because you're running out of registers for scheduling, and want to use the stack pointer rather than the original source? Or does the data dependency end up being so much better that you're better off doing a separate bswap loop? Or is it just because the code was written that way? Intriguing, either way. Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: x86 SHA1: Faster than OpenSSL 2009-08-03 3:47 ` x86 SHA1: Faster than OpenSSL George Spelvin ` (2 preceding siblings ...) 2009-08-04 2:30 ` Linus Torvalds @ 2009-08-18 21:26 ` Andy Polyakov 3 siblings, 0 replies; 129+ messages in thread From: Andy Polyakov @ 2009-08-18 21:26 UTC (permalink / raw) To: George Spelvin; +Cc: git George Spelvin wrote: > (Work in progress, state dump to mailing list archives.) > > This started when discussing git startup overhead due to the dynamic > linker. One big contributor is the openssl library, which is used only > for its optimized x86 SHA-1 implementation. So I took a look at it, > with an eye to importing the code directly into the git source tree, > and decided that I felt like trying to do better. > > The original code was excellent, but it was optimized when the P4 was new. Even though the last revision took place when "the P4 was new", and was even triggered by its appearance, *all-round* performance was and will always be the prime goal. This means that improvements on some particular micro-architecture are always weighed against losses on others [and a compromise is considered if so required]. Please note that I'm *not* trying to diminish George's effort by saying that the proposed code is inappropriate; on the contrary, I'm nothing but grateful! Thanks, George! I'm only saying that it will be given thorough consideration. Well, I've actually given it that consideration, and the outcome is already committed :-) See http://cvs.openssl.org/chngview?cn=18513. I don't deliver +17%, only +12%, but that is the price of the Intel Atom-specific optimizations: I used this opportunity to optimize even for the Intel Atom core, something I was planning to do at some point anyway... > http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz > - "tar xz cryptogams-0.tar.gz" If there is interest, I can pack a new tar ball with the updated modules. > An open question is how to add appropriate CPU detection to the git > build scripts. (Note that `uname -m`, which it currently uses to select > the ARM code, does NOT produce the right answer if you're using a 32-bit > compiler on a 64-bit platform.) It's not only that. As the next subscriber noted, there is a problem on MacOS X: it uses a slightly different assembler convention, so ELF modules can't be compiled there. The OpenSSL perlasm framework takes care of several assembler flavors and executable formats, including MacOS X. I'm talking about > +++ Makefile 2009-08-02 06:44:44.000000000 -0400 > +%.s : %.pl x86asm.pl x86unix.pl > + perl $< elf > $@ ^^^ this argument. Cheers. A. ^ permalink raw reply [flat|nested] 129+ messages in thread
* Performance issue of 'git branch' @ 2009-07-22 23:59 Carlos R. Mafra 2009-07-23 0:21 ` Linus Torvalds 2009-07-23 0:23 ` SZEDER Gábor 0 siblings, 2 replies; 129+ messages in thread From: Carlos R. Mafra @ 2009-07-22 23:59 UTC (permalink / raw) To: git Hi, When I run 'git branch' in the linux-2.6 repo I think it takes too long to finish (with cold cache): [mafra@Pilar:linux-2.6]$ time git branch 27-stable 28-stable 29-stable 30-stable dev-private * master option sparse stern 0.00user 0.05system 0:05.73elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (209major+1380minor)pagefaults 0swaps This is with git 1.6.4.rc1.10.g2a67 and the kernel is 2.6.31-rc3+. The machine is a 64bit Vaio laptop which is 1+ year old (so it is not "slow"). Repeating the command a second time takes basically zero seconds, but this is more or less what I would expect in the first time too. I use git to track linux-2.6 for 2 years now, and I remember that 'git branch' is slow for quite some time, so it is not a regression or something. It is just now that I took the courage to report this small issue. I did a 'strace' and this is where it spent most of the time: 1248301060.654911 open(".git/refs/heads/sparse", O_RDONLY) = 6 1248301060.654985 read(6, "60afdf6a4065a170ad829b4d79a86ec0"..., 255) = 41 1248301060.655056 read(6, "", 214) = 0 1248301060.655116 close(6) = 0 1248301060.680754 lstat(".git/refs/heads/stern", 0x7fff80bfa8d0) = -1 ENOENT (No such file or directory) 1248301064.018491 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0 1248301064.018641 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f409ffa7000 1248301064.018722 write(1, " 27-stable\33[m\n", 15) = 15 I don't know why .git/refs/heads/stern does not exist and why it takes so long with it. That branch is functional ('git checkout stern' succeeds), as well as all the others. But strangely .git/refs/heads/ contains only [mafra@Pilar:linux-2.6]$ ls .git/refs/heads/ dev-private master sparse which, apart from "master", are the last branches that I created. I occasionally run 'git gc --aggressive --prune" to optimize the repo, but other than that I don't do anything fancy, just 'pull' almost every day and 'bisect' (which is becoming a rare event now :-) So I would like to ask what should I do to recover the missing files in .git/refs/heads/ (which apparently is the cause for my issue) and how I can avoid losing them in the first place. Also, is there a way to "fix" the 4-secs pause in that lstat() in case the files in .git/refs/heads/ get lost again? Thanks in advance, Carlos ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-22 23:59 Performance issue of 'git branch' Carlos R. Mafra @ 2009-07-23 0:21 ` Linus Torvalds 2009-07-23 0:51 ` Linus Torvalds 2009-07-23 1:22 ` Carlos R. Mafra 2009-07-23 0:23 ` SZEDER Gábor 1 sibling, 2 replies; 129+ messages in thread From: Linus Torvalds @ 2009-07-23 0:21 UTC (permalink / raw) To: Carlos R. Mafra; +Cc: git On Thu, 23 Jul 2009, Carlos R. Mafra wrote: > > When I run 'git branch' in the linux-2.6 repo I think it takes > too long to finish (with cold cache): > > [mafra@Pilar:linux-2.6]$ time git branch > 27-stable > 28-stable > 29-stable > 30-stable > dev-private > * master > option > sparse > stern > 0.00user 0.05system 0:05.73elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k > 0inputs+0outputs (209major+1380minor)pagefaults 0swaps > > This is with git 1.6.4.rc1.10.g2a67 and the kernel is 2.6.31-rc3+. The > machine is a 64bit Vaio laptop which is 1+ year old (so it is not "slow"). When have you last repacked the repository? What you're descibing is basically IO overhead, and if you don't have packed references, it's going to read a lot of small files. > I use git to track linux-2.6 for 2 years now, and I remember that > 'git branch' is slow for quite some time, so it is not a regression > or something. It is just now that I took the courage to report this > small issue. > > I did a 'strace' and this is where it spent most of the time: > > 1248301060.654911 open(".git/refs/heads/sparse", O_RDONLY) = 6 > 1248301060.654985 read(6, "60afdf6a4065a170ad829b4d79a86ec0"..., 255) = 41 > 1248301060.655056 read(6, "", 214) = 0 > 1248301060.655116 close(6) = 0 > 1248301060.680754 lstat(".git/refs/heads/stern", 0x7fff80bfa8d0) = -1 ENOENT (No such file or directory) > 1248301064.018491 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0 > 1248301064.018641 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f409ffa7000 > 1248301064.018722 write(1, " 27-stable\33[m\n", 15) = 15 > > I don't know why .git/refs/heads/stern does not exist and why it takes > so long with it. That branch is functional ('git checkout stern' succeeds), > as well as all the others. But strangely .git/refs/heads/ contains only > > [mafra@Pilar:linux-2.6]$ ls .git/refs/heads/ > dev-private master sparse > > which, apart from "master", are the last branches that I created. Ok, this actually means that you _have_ repacked the repo, and the rest of the branches are all nicely packed in .git/packed-refs. But that four _second_ lstat() is really disgusting. Let me guess: if you do a "ls -ld .git/refs/heads" you get a very big directory, despite it only having three entries in it. And your filesystem doesn't have name hashing enabled, so searching for a non-existent file involves looking through _all_ of the empty slots. Try this: git pack-refs --all rmdir .git/refs/heads rmdir .git/refs/tags mkdir .git/refs/heads mkdir .git/refs/tags and see if it magically speeds up. Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 0:21 ` Linus Torvalds @ 2009-07-23 0:51 ` Linus Torvalds 2009-07-23 0:55 ` Linus Torvalds 2009-07-23 1:22 ` Carlos R. Mafra 1 sibling, 1 reply; 129+ messages in thread From: Linus Torvalds @ 2009-07-23 0:51 UTC (permalink / raw) To: Carlos R. Mafra; +Cc: git On Wed, 22 Jul 2009, Linus Torvalds wrote: > > Try this: > > git pack-refs --all > > rmdir .git/refs/heads > rmdir .git/refs/tags > > mkdir .git/refs/heads > mkdir .git/refs/tags > > and see if it magically speeds up. In fact, you could also just try mv .git/refs .git/temp-refs && cp -a .git/temp-refs .git/refs && rm -rf .git/temp-refs which will re-create other subdirectories too (like .git/refs/remotes etc). Of course, depending on your particular filesystem, a better fix might be to enable filename hashing, which gets rid of the whole "look through all the old empty stale directory entries to see if there's a filename there" issue. That won't fix 'readdir()' performance, but it should fix your insane 4-second lstat() thing. If you have ext3, you'd do something like tune2fs -O dir_index /dev/<node-of-your-filesystem-goes-here> but as mentioned, even with directory indexing it can actually make sense to recreate directories that at some point _used_ to be large, but got shrunk down to something much smaller. It's a generic directory problem (not just ext3, not just unix, it's a common issue across filesystems. It's not _universal_ - some smarter filesystems really do shrink their directories - but it's certainly not unusual). Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 0:51 ` Linus Torvalds @ 2009-07-23 0:55 ` Linus Torvalds 2009-07-23 2:02 ` Carlos R. Mafra 0 siblings, 1 reply; 129+ messages in thread From: Linus Torvalds @ 2009-07-23 0:55 UTC (permalink / raw) To: Carlos R. Mafra; +Cc: git On Wed, 22 Jul 2009, Linus Torvalds wrote: > > If you have ext3, you'd do something like > > tune2fs -O dir_index /dev/<node-of-your-filesystem-goes-here> One last email note on this subject. Really. Promise. If you do that "tune2fs -O dir_index" thing, it will only take effect for _newly_ created directories. So you'll still need to do that whole "mv+cp+rm" dance, just to make sure that the refs directories are all new. I think you can also force all directories to be indexed by using fsck, but I forget the details. I'm sure man-pages will have it. Or google. Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
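For what it's worth, the fsck option being half-remembered here is most likely e2fsck's directory optimization pass. Something along these lines should rebuild the directory indexes on an unmounted ext3 filesystem (check the e2fsck man page on your distribution before running it):

	e2fsck -fD /dev/<node-of-your-filesystem-goes-here>

-f forces a check even if the filesystem is marked clean, and -D asks e2fsck to reindex and compact all directories.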
* Re: Performance issue of 'git branch' 2009-07-23 0:55 ` Linus Torvalds @ 2009-07-23 2:02 ` Carlos R. Mafra 2009-07-23 2:28 ` Linus Torvalds 0 siblings, 1 reply; 129+ messages in thread From: Carlos R. Mafra @ 2009-07-23 2:02 UTC (permalink / raw) To: Linus Torvalds; +Cc: git On Wed 22.Jul'09 at 17:55:51 -0700, Linus Torvalds wrote: > On Wed, 22 Jul 2009, Linus Torvalds wrote: > > > > If you have ext3, you'd do something like > > > > tune2fs -O dir_index /dev/<node-of-your-filesystem-goes-here> > > One last email note on this subject. Really. Promise. > > If you do that "tune2fs -O dir_index" thing, it will only take effect for > _newly_ created directories. So you'll still need to do that whole > "mv+cp+rm" dance, just to make sure that the refs directories are all new. Ok, now I also did the "dir_index" thing followed by the mv+cp+rm instructions. It doesn't change the 3.5 secs delay in that single line, 1248313742.355195 lstat(".git/refs/heads/sparse", 0x7fff0c663ab0) = -1 ENOENT (No such file or directory) 1248313742.381178 lstat(".git/refs/heads/stern", 0x7fff0c663ab0) = -1 ENOENT (No such file or directory) 1248313745.804637 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0 Just to double check, [root@Pilar linux-2.6]# tune2fs -l /dev/sda5 |grep dir_index Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file (and I did the mv+cp+rm after setting "dir_index") Is there another way to check what is going on with that anomalous lstat()? [ perhaps I will try 'perf' after I read how to use it ] Thanks, Carlos ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 2:02 ` Carlos R. Mafra @ 2009-07-23 2:28 ` Linus Torvalds 2009-07-23 12:42 ` Jakub Narebski 0 siblings, 1 reply; 129+ messages in thread From: Linus Torvalds @ 2009-07-23 2:28 UTC (permalink / raw) To: Carlos R. Mafra; +Cc: git On Thu, 23 Jul 2009, Carlos R. Mafra wrote: > > Is there another way to check what is going on with that anomalous lstat()? I really don't think it's the lstat any more. Your directories look small and simple, and clearly the indexing made no difference. See earlier email about using "strace -T" instead of "-tt". Also, I sent you a patch to try out just a minute ago, I think that may be it. > [ perhaps I will try 'perf' after I read how to use it ] I really like 'perf' (it does what oprofile did for me, but without the headaches), but it doesn't help with IO profiling. I've actually often wanted to have a 'strace' that shows page faults as special system calls, but it's sadly nontrivial ;( Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
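Concretely, -tt only stamps when each call starts, while -T appends the time spent inside the call, so a slow syscall shows up on its own line instead of as a gap before the next one. An illustrative invocation (adjust the filtering to taste):

	strace -T -o branch.trace git branch
	grep lstat branch.trace | sort -t'<' -k2 -rn | head

which writes the trace to branch.trace and lists the slowest lstat() calls first, using the "<seconds>" suffix that -T adds to every line.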
* Re: Performance issue of 'git branch' 2009-07-23 2:28 ` Linus Torvalds @ 2009-07-23 12:42 ` Jakub Narebski 2009-07-23 14:45 ` Carlos R. Mafra 2009-07-23 16:25 ` Linus Torvalds 0 siblings, 2 replies; 129+ messages in thread From: Jakub Narebski @ 2009-07-23 12:42 UTC (permalink / raw) To: Linus Torvalds; +Cc: Carlos R. Mafra, git Linus Torvalds <torvalds@linux-foundation.org> writes: > On Thu, 23 Jul 2009, Carlos R. Mafra wrote: > > > > Is there another way to check what is going on with that anomalous lstat()? > > I really don't think it's the lstat any more. Your directories look small > and simple, and clearly the indexing made no difference. > > See earlier email about using "strace -T" instead of "-tt". Also, I sent > you a patch to try out just a minute ago, I think that may be it. > > > [ perhaps I will try 'perf' after I read how to use it ] > > I really like 'perf' (it does what oprofile did for me, but without the > headaches), but it doesn't help with IO profiling. > > I've actually often wanted to have a 'strace' that shows page faults as > special system calls, but it's sadly nontrivial ;( BTW. Would SystemTap help there? Among contributed scripts there is iotimes, so perhaps it would be possible to have iotrace... -- Jakub Narebski Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 12:42 ` Jakub Narebski @ 2009-07-23 14:45 ` Carlos R. Mafra 2009-07-23 16:25 ` Linus Torvalds 1 sibling, 0 replies; 129+ messages in thread From: Carlos R. Mafra @ 2009-07-23 14:45 UTC (permalink / raw) To: Jakub Narebski; +Cc: Linus Torvalds, git On Thu 23.Jul'09 at 5:42:03 -0700, Jakub Narebski wrote: > Linus Torvalds <torvalds@linux-foundation.org> writes: > > > On Thu, 23 Jul 2009, Carlos R. Mafra wrote: > > > > > > Is there another way to check what is going on with that anomalous lstat()? > > > > I really don't think it's the lstat any more. Your directories look small > > and simple, and clearly the indexing made no difference. > > > > See earlier email about using "strace -T" instead of "-tt". Also, I sent > > you a patch to try out just a minute ago, I think that may be it. > > > > > [ perhaps I will try 'perf' after I read how to use it ] > > > > I really like 'perf' (it does what oprofile did for me, but without the > > headaches), but it doesn't help with IO profiling. > > > > I've actually often wanted to have a 'strace' that shows page faults as > > special system calls, but it's sadly nontrivial ;( > > BTW. Would SystemTap help there? Among contributed scripts there is > iotimes, so perhaps it would be possible to have iotrace... I played a bit with 'blktrace' and 'btrace' and had two terminals open side by side, one with 'strace git branch' and the other with 'blktrace'. It was pretty obvious that exactly at the point where 'git branch' was stalling (without Linus' patch) -- which I thought had to do with lstat() -- there was a flurry of activity going on in 'btrace' output. It would be nice if 'btrace' could be somehow unified with 'strace', if that makes any sense. Here are some numbers from my tests with blktrace (blkparse and btrace): [root@Pilar mafra]# grep git blkparse-patch.txt |wc -l 811 [root@Pilar mafra]# grep git blkparse-nopatch.txt |wc -l 3479 where those lines with 'git' are something like 8,5 0 677 1.787350654 18591 I R 204488479 + 40 [git] 8,0 0 678 1.787370489 18591 A R 204488783 + 96 <- (8,5) 137529800 8,5 0 679 1.787371886 18591 Q R 204488783 + 96 [git] 8,5 0 680 1.787375378 18591 G R 204488783 + 96 [git] 8,5 0 681 1.787377613 18591 I R 204488783 + 96 [git] And the summary lines also indicate that the non-patched git makes the disc work much harder: *************** Without Linus' patch ****************************************** Total (8,5): Reads Queued: 764, 20,008KiB Writes Queued: 0, 0KiB Read Dispatches: 764, 20,008KiB Write Dispatches: 0, 0KiB Reads Requeued: 0 Writes Requeued: 0 Reads Completed: 764, 20,008KiB Writes Completed: 0, 0KiB Read Merges: 0, 0KiB Write Merges: 0, 0KiB IO unplugs: 299 Timer unplugs: 2 Throughput (R/W): 4,003KiB/s / 0KiB/s Events (8,5): 5,266 entries Skips: 0 forward (0 - 0.0%) ************** With Linus' patch ********************************************** Total (sda5): Reads Queued: 171, 3,128KiB Writes Queued: 6, 24KiB Read Dispatches: 171, 3,128KiB Write Dispatches: 2, 24KiB Reads Requeued: 0 Writes Requeued: 0 Reads Completed: 171, 3,128KiB Writes Completed: 2, 24KiB Read Merges: 0, 0KiB Write Merges: 4, 16KiB IO unplugs: 80 Timer unplugs: 0 Throughput (R/W): 1,632KiB/s / 12KiB/s Events (sda5): 1,226 entries Skips: 0 forward (0 - 0.0%) ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 12:42 ` Jakub Narebski 2009-07-23 14:45 ` Carlos R. Mafra @ 2009-07-23 16:25 ` Linus Torvalds 1 sibling, 0 replies; 129+ messages in thread From: Linus Torvalds @ 2009-07-23 16:25 UTC (permalink / raw) To: Jakub Narebski; +Cc: Carlos R. Mafra, git On Thu, 23 Jul 2009, Jakub Narebski wrote: > > BTW. Would SystemTap help there? Among contributed scripts there is > iotimes, so perhaps it would be possible to have iotrace... The problem I've had with all iotracers is that it's easy enough to get an IO trace, but it's basically almost impossible to integrate it with what actually _caused_ the IO. Using 'strace -T' shows very clearly what operations are taking a long time. It's very useful for seeing what you should not do for good performance - including IO - and where it comes from. It's just that page faults are invisible to it. Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 0:21 ` Linus Torvalds 2009-07-23 0:51 ` Linus Torvalds @ 2009-07-23 1:22 ` Carlos R. Mafra 2009-07-23 2:20 ` Linus Torvalds 1 sibling, 1 reply; 129+ messages in thread From: Carlos R. Mafra @ 2009-07-23 1:22 UTC (permalink / raw) To: Linus Torvalds; +Cc: git On Wed 22.Jul'09 at 17:21:48 -0700, Linus Torvalds wrote: > > When have you last repacked the repository? Last week or so, with 'git repack -d -a' > > [mafra@Pilar:linux-2.6]$ ls .git/refs/heads/ > > dev-private master sparse > > > > which, apart from "master", are the last branches that I created. > > Ok, this actually means that you _have_ repacked the repo, and the rest of > the branches are all nicely packed in .git/packed-refs. Yes, now I saw the other branches inside packed-refs. > But that four _second_ lstat() is really disgusting. > > Let me guess: if you do a "ls -ld .git/refs/heads" you get a very big > directory, despite it only having three entries in it. [mafra@Pilar:linux-2.6]$ ls -ld .git/refs/heads drwxr-xr-x 2 mafra mafra 4096 2009-07-22 23:01 .git/refs/heads/ > And your filesystem > doesn't have name hashing enabled, so searching for a non-existent file > involves looking through _all_ of the empty slots. I use ext3 without changing any defaults that I know of (I simply compile and boot the kernel of the day), and I have no idea if name hashing is enabled here. > Try this: > > git pack-refs --all > > rmdir .git/refs/heads > rmdir .git/refs/tags > > mkdir .git/refs/heads > mkdir .git/refs/tags > > and see if it magically speeds up. It didn't change things, unfortunately. After 'echo 3 > /proc/sys/vm/drop_caches' it still takes too long, 1248310449.693085 munmap(0x7f50bcd11000, 164) = 0 1248310449.693187 lstat(".git/refs/heads/sparse", 0x7fff618c0960) = -1 ENOENT (No such file or directory) 1248310449.719112 lstat(".git/refs/heads/stern", 0x7fff618c0960) = -1 ENOENT (No such file or directory) 1248310453.014041 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 3), ...}) = 0 1248310453.014183 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f50bcd11000 Perhaps I should delete the "stern" branch, but I would like to learn why it is slowing things, because it also happened before (in fact it is always like this, afaicr) Do you have another theory? (now .git/refs/heads is empty) Thanks, Carlos ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 1:22 ` Carlos R. Mafra @ 2009-07-23 2:20 ` Linus Torvalds 2009-07-23 2:23 ` Linus Torvalds 0 siblings, 1 reply; 129+ messages in thread From: Linus Torvalds @ 2009-07-23 2:20 UTC (permalink / raw) To: Carlos R. Mafra; +Cc: Git Mailing List, Junio C Hamano On Thu, 23 Jul 2009, Carlos R. Mafra wrote: > > Let me guess: if you do a "ls -ld .git/refs/heads" you get a very big > > directory, despite it only having three entries in it. > > [mafra@Pilar:linux-2.6]$ ls -ld .git/refs/heads > drwxr-xr-x 2 mafra mafra 4096 2009-07-22 23:01 .git/refs/heads/ Hmm. That's just a single block. Then I really don't see why the lstat takes so long. > After 'echo 3 > /proc/sys/vm/drop_caches' it still takes too long, > > 1248310449.693085 munmap(0x7f50bcd11000, 164) = 0 > 1248310449.693187 lstat(".git/refs/heads/sparse", 0x7fff618c0960) = -1 ENOENT (No such file or directory) > 1248310449.719112 lstat(".git/refs/heads/stern", 0x7fff618c0960) = -1 ENOENT (No such file or directory) > 1248310453.014041 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 3), ...}) = 0 > 1248310453.014183 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f50bcd11000 Use 'strace -T', which shows how long the actual system calls take, rather than '-tt' which just shows when they started. Maybe the four seconds is something else than the lstat - page faults on the pack-file in between the lstat and the fstat, for example. > Perhaps I should delete the "stern" branch, but I would like to learn why > it is slowing things, because it also happened before (in fact it is always > like this, afaicr) Absolutely. Don't delete it until we figure out what takes so long there. > Do you have another theory? (now .git/refs/heads is empty) Clearly it's IO, but if that 'lstat()' was just a red herring, then I suspect it's IO on the pack-file. If so, I'd further guess that your VAIO has some pitiful 4200rpm harddisk that is slow as hell and has horrible seek latencies, and the CPU is way overpowered compared to the cruddy disk. It probably does the object lookup. You can see some debug output if you do GIT_DEBUG_LOOKUP=1 git branch and that will show you the patterns. It won't be very pretty, especially if you have several pack-files, but maybe we can figure out what's up. Hmm. I wonder.. I suspect 'git branch' looks up _all_ refs, and then afterwards it filters them. So even though it only prints out a few branches, maybe it will look at all the tags etc of the whole repository. Ooh yes. That would do it. It's going to peel and look up every single ref it finds, so it's going to look up _hundreds_ of objects (all the tags, all the commits they point to, etc etc). Even if it then only shows a couple of branches. Junio, any ideas? Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 2:20 ` Linus Torvalds @ 2009-07-23 2:23 ` Linus Torvalds 2009-07-23 3:08 ` Linus Torvalds ` (3 more replies) 0 siblings, 4 replies; 129+ messages in thread From: Linus Torvalds @ 2009-07-23 2:23 UTC (permalink / raw) To: Carlos R. Mafra; +Cc: Git Mailing List, Junio C Hamano On Wed, 22 Jul 2009, Linus Torvalds wrote: > > Ooh yes. That would do it. It's going to peel and look up every single ref > it finds, so it's going to look up _hundreds_ of objects (all the tags, > all the commits they point to, etc etc). Even if it then only shows a > couple of branches. > > Junio, any ideas? I had one of my own. Does this fix it? It uses the "raw" version of 'for_each_ref()' (which doesn't verify that the ref is valid), and then does the "type verification" before it starts doing any gentle commit lookup. That should hopefully mean that it no longer does tons of object lookups on refs that it's not actually interested in. Linus --- builtin-branch.c | 10 +++++----- 1 files changed, 5 insertions(+), 5 deletions(-) diff --git a/builtin-branch.c b/builtin-branch.c index 5687d60..54a89ff 100644 --- a/builtin-branch.c +++ b/builtin-branch.c @@ -240,6 +240,10 @@ static int append_ref(const char *refname, const unsigned char *sha1, int flags, if (ARRAY_SIZE(ref_kind) <= i) return 0; + /* Don't add types the caller doesn't want */ + if ((kind & ref_list->kinds) == 0) + return 0; + commit = lookup_commit_reference_gently(sha1, 1); if (!commit) return error("branch '%s' does not point at a commit", refname); @@ -248,10 +252,6 @@ static int append_ref(const char *refname, const unsigned char *sha1, int flags, if (!is_descendant_of(commit, ref_list->with_commit)) return 0; - /* Don't add types the caller doesn't want */ - if ((kind & ref_list->kinds) == 0) - return 0; - if (merge_filter != NO_FILTER) add_pending_object(&ref_list->revs, (struct object *)commit, refname); @@ -426,7 +426,7 @@ static void print_ref_list(int kinds, int detached, int verbose, int abbrev, str ref_list.with_commit = with_commit; if (merge_filter != NO_FILTER) init_revisions(&ref_list.revs, NULL); - for_each_ref(append_ref, &ref_list); + for_each_rawref(append_ref, &ref_list); if (merge_filter != NO_FILTER) { struct commit *filter; filter = lookup_commit_reference_gently(merge_filter_ref, 0); ^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 2:23 ` Linus Torvalds @ 2009-07-23 3:08 ` Linus Torvalds 2009-07-23 3:21 ` Linus Torvalds 2009-07-23 3:18 ` Carlos R. Mafra ` (2 subsequent siblings) 3 siblings, 1 reply; 129+ messages in thread From: Linus Torvalds @ 2009-07-23 3:08 UTC (permalink / raw) To: Carlos R. Mafra; +Cc: Git Mailing List, Junio C Hamano On Wed, 22 Jul 2009, Linus Torvalds wrote: > > It uses the "raw" version of 'for_each_ref()' (which doesn't verify that > the ref is valid), and then does the "type verification" before it starts > doing any gentle commit lookup. > > That should hopefully mean that it no longer does tons of object lookups > on refs that it's not actually interested in. Hmm. On my kernel repo, doing GIT_DEBUG_LOOKUP=1 git branch | wc -l I get - before: 2121 - after: 39 (where two of the lines are the actual 'git branch' output). So yeah, this should make a big difference. It now looks up just two objects (one of them duplicated because it checks "HEAD" - but the duplicate lookup won't result in any extra IO, so it's only two _uncached_ accesses). The GIT_DEBUG_LOOKUP debug output probably does match the number of cold-cache IO's fairly well for something like this (at least to a first approximation), so I really hope my patch will fix your problem. Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 3:08 ` Linus Torvalds @ 2009-07-23 3:21 ` Linus Torvalds 2009-07-23 17:47 ` Tony Finch 0 siblings, 1 reply; 129+ messages in thread From: Linus Torvalds @ 2009-07-23 3:21 UTC (permalink / raw) To: Carlos R. Mafra; +Cc: Git Mailing List, Junio C Hamano On Wed, 22 Jul 2009, Linus Torvalds wrote: > > The GIT_DEBUG_LOOKUP debug output probably does match the number of > cold-cache IO's fairly well for something like this (at least to a first > approximation), so I really hope my patch will fix your problem. Side note: the object lookup binary search we do is simple and reasonably efficient, but it is _not_ very cache-friendly (where "cache-friendly" also in this case means IO caches). There are more cache-friendly ways of searching, although the really clever ones would require us to switch the format of the pack-file index around. Which would be a fairly big pain (in addition to making the lookup a lot more complex). The _simpler_ cache-friendly alternative is likely to try the "guess location by assuming the SHA1's are evenly spread out" thing, which doesn't jump back-and-forth like a binary search does. We tried it a few years ago, but didn't do cold-cache numbers. And repositories were smaller too. With something like the kernel repo, with 1.2+ million objects, a binary search needs about 21 comparisons for each object we look up. The index has a first-level fan-out of 256, so that takes away 8 of them, but we're still talking about 13 comparisons. With bad locality except for the very last ones. Assuming a 4kB page-size, and about 170 index entries per page (~7 binary search levels), that's 6 pages we have to page-fault in for each search. And we probably won't start seeing lots of cache reuse until we hit hundreds or thousands of objects searched for. With something like "three iterations of newton-raphson + linear search", we might end up with more index entries looked at, but we'd quite possibly get much better locality. I suspect the old newton-raphson patches we had (Discussions and patches back in April 2007 on this list) could be resurrected pretty easily. Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
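A rough way to see the locality argument in code: the sketch below is a self-contained toy in plain C, not git's pack index or sha1-lookup.c. It builds a sorted table of 20-byte keys and compares a plain binary search against an interpolation-style first guess that starts near where an evenly-distributed key must sit. The key count, the window size and all names here are made up for illustration; the point is only that the guessed starting window touches a page or two of adjacent entries instead of bouncing across the whole table.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NKEYS 1200000                  /* roughly kernel-repo-sized */

static unsigned char (*keys)[20];

static int cmp_key(const void *a, const void *b)
{
        return memcmp(a, b, 20);
}

/* Plain binary search: ~21 probes for 1.2M entries, poor IO locality. */
static long binary_lookup(const unsigned char *want, int *probes)
{
        long lo = 0, hi = NKEYS;
        while (lo < hi) {
                long mid = lo + (hi - lo) / 2;
                int c = memcmp(keys[mid], want, 20);
                (*probes)++;
                if (!c)
                        return mid;
                if (c < 0)
                        lo = mid + 1;
                else
                        hi = mid;
        }
        return -1;
}

/*
 * Interpolation-style lookup: guess the position from the first four
 * bytes (hashes are evenly spread), then binary-search a small window
 * around the guess.  Git's real sha1-lookup.c refines the estimate from
 * the keys at the interval ends; this is only the rough idea.
 */
static long interp_lookup(const unsigned char *want, int *probes)
{
        uint32_t prefix = ((uint32_t)want[0] << 24) | (want[1] << 16) |
                          (want[2] << 8) | want[3];
        long guess = (long)((double)prefix / 4294967296.0 * NKEYS);
        long lo = guess - 2000, hi = guess + 2000;

        if (lo < 0)
                lo = 0;
        if (hi > NKEYS)
                hi = NKEYS;
        /* If the window somehow missed, fall back to the full range. */
        if (memcmp(keys[lo], want, 20) > 0)
                lo = 0;
        if (hi < NKEYS && memcmp(keys[hi - 1], want, 20) < 0)
                hi = NKEYS;

        while (lo < hi) {
                long mid = lo + (hi - lo) / 2;
                int c = memcmp(keys[mid], want, 20);
                (*probes)++;
                if (!c)
                        return mid;
                if (c < 0)
                        lo = mid + 1;
                else
                        hi = mid;
        }
        return -1;
}

int main(void)
{
        keys = malloc(sizeof(*keys) * NKEYS);
        if (!keys)
                return 1;
        srand(1);
        for (long i = 0; i < NKEYS; i++)
                for (int j = 0; j < 20; j++)
                        keys[i][j] = rand() & 0xff;
        qsort(keys, NKEYS, sizeof(*keys), cmp_key);

        const unsigned char *want = keys[123456];   /* an existing key */
        int p1 = 0, p2 = 0;
        printf("binary search: index %ld, %d probes\n", binary_lookup(want, &p1), p1);
        printf("interpolation: index %ld, %d probes\n", interp_lookup(want, &p2), p2);
        free(keys);
        return 0;
}

On uniformly distributed keys the first guess typically lands within a few hundred entries of the right slot, so the follow-up search stays inside one or two pages. The total comparison count is similar to the binary search; the win being discussed above is the IO locality, not the comparisons.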
* Re: Performance issue of 'git branch' 2009-07-23 3:21 ` Linus Torvalds @ 2009-07-23 17:47 ` Tony Finch 2009-07-23 18:57 ` Linus Torvalds 0 siblings, 1 reply; 129+ messages in thread From: Tony Finch @ 2009-07-23 17:47 UTC (permalink / raw) To: Linus Torvalds; +Cc: git On Wed, 22 Jul 2009, Linus Torvalds wrote: > > I suspect the old newton-raphson patches we had (Discussions and patches > back in April 2007 on this list) could be resurrected pretty easily. That sounds interesting, but I can't find the thread you are referring to. Do you have a URL or a subject I can feed to Google? Tony. -- f.anthony.n.finch <dot@dotat.at> http://dotat.at/ GERMAN BIGHT HUMBER: SOUTHWEST 5 TO 7. MODERATE OR ROUGH. SQUALLY SHOWERS. MODERATE OR GOOD. ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 17:47 ` Tony Finch @ 2009-07-23 18:57 ` Linus Torvalds 0 siblings, 0 replies; 129+ messages in thread From: Linus Torvalds @ 2009-07-23 18:57 UTC (permalink / raw) To: Tony Finch; +Cc: git On Thu, 23 Jul 2009, Tony Finch wrote: > On Wed, 22 Jul 2009, Linus Torvalds wrote: > > > > I suspect the old newton-raphson patches we had (Discussions and patches > > back in April 2007 on this list) could be resurrected pretty easily. > > That sounds interesting, but I can't find the thread you are referring to. > Do you have a URL or a subject I can feed to Google? Some googling found this: http://marc.info/?l=git&m=117537594112450&w=2 but what got merged (half a year later) was a much fancier thing by Junio. See sha1-lookup.c. That original "single iteration of newton-raphson" patch was buggy, but it's perhaps interesting as a concept patch. Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 2:23 ` Linus Torvalds 2009-07-23 3:08 ` Linus Torvalds @ 2009-07-23 3:18 ` Carlos R. Mafra 2009-07-23 3:27 ` Carlos R. Mafra ` (2 more replies) 2009-07-23 4:40 ` Junio C Hamano 2009-07-23 16:48 ` Anders Kaseorg 3 siblings, 3 replies; 129+ messages in thread From: Carlos R. Mafra @ 2009-07-23 3:18 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano First of all: * yes, my VAIO has a slow 4200 rpm disc :-( * strace -T indeed showed that lstat() was not guilty * GIT_DEBUG_LOOKUP=1 git branch produced ugly 2200+ lines Now to the patch, On Wed 22.Jul'09 at 19:23:39 -0700, Linus Torvalds wrote: > > Ooh yes. That would do it. It's going to peel and look up every single ref > > it finds, so it's going to look up _hundreds_ of objects (all the tags, > > all the commits they point to, etc etc). Even if it then only shows a > > couple of branches. > > > > Junio, any ideas? > > I had one of my own. > > Does this fix it? Yes! [mafra@Pilar:linux-2.6]$ time git branch 27-stable 28-stable 29-stable 30-stable dev-private * master option sparse stern 0.00user 0.01system 0:01.50elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (42major+757minor)pagefaults 0swaps 01.50 is not that good, but it doesn't "feel" terrible as 4 seconds. [ It is incredible how 4 secs feels really bad while 2 is acceptable... ] So thank you very much, Linus! A 50% improvement here! And I am happy to have finally reported it, after quietly suffering for so long thinking that "git is as fast as possible, so it is probably my fault". PS: Out of curiosity, how many femtoseconds does it take in your state-of-the-art machine? :-) ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 3:18 ` Carlos R. Mafra @ 2009-07-23 3:27 ` Carlos R. Mafra 2009-07-23 3:40 ` Carlos R. Mafra 2009-07-23 3:47 ` Linus Torvalds 2 siblings, 0 replies; 129+ messages in thread From: Carlos R. Mafra @ 2009-07-23 3:27 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano On Thu 23.Jul'09 at 5:18:44 +0200, Carlos R. Mafra wrote: > * GIT_DEBUG_LOOKUP=1 git branch produced ugly 2200+ lines With your patch applied it went down to 132 lines. ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 3:18 ` Carlos R. Mafra 2009-07-23 3:27 ` Carlos R. Mafra @ 2009-07-23 3:40 ` Carlos R. Mafra 2009-07-23 3:47 ` Linus Torvalds 2 siblings, 0 replies; 129+ messages in thread From: Carlos R. Mafra @ 2009-07-23 3:40 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano On Thu 23.Jul'09 at 5:18:44 +0200, Carlos R. Mafra wrote: > 0.00user 0.01system 0:01.50elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k > 0inputs+0outputs (42major+757minor)pagefaults 0swaps > > 01.50 is not that good, but it doesn't "feel" terrible as 4 seconds. > [ It is incredible how 4 secs feels really bad while 2 is acceptable... ] I need to sleep, as the number 4 seconds got stuck in my head. In my original report it was much worse 0.00user 0.05system 0:05.73elapsed So now it was a 75% improvement! ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 3:18 ` Carlos R. Mafra 2009-07-23 3:27 ` Carlos R. Mafra 2009-07-23 3:40 ` Carlos R. Mafra @ 2009-07-23 3:47 ` Linus Torvalds 2009-07-23 4:10 ` Linus Torvalds 2 siblings, 1 reply; 129+ messages in thread From: Linus Torvalds @ 2009-07-23 3:47 UTC (permalink / raw) To: Carlos R. Mafra; +Cc: Git Mailing List, Junio C Hamano On Thu, 23 Jul 2009, Carlos R. Mafra wrote: > > PS: Out of curiosity, how many femtoseconds does it take in your > state-of-the-art machine? :-) Cold cache? 0.15s before the patch. 0.03s after. So we're not talking femto-seconds, but I've got Intel SSD's that do random reads in well under a millisecond. Your pitiful 4200rpm drive probably takes 20ms for each seek. You don't really need that many IO's for it to take a second or two. Or four. The kernel will do IO in bigger chunks than a single page, and there is _some_ locality to it all, so you won't see IO for each lookup. But with 2000+ lines of GIT_DEBUG_LOOKUP, you probably do end up having a noticeable fraction of them being IO-causing, and another fraction causing seeks. But I'll see if I can dig up my non-binary-search patch and see if I can make it go faster. My machine is fast, but not so fast that I can't measure it ;) Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
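To put those figures together: at roughly 20 ms per random read, four to five seconds of stall corresponds to only about 4 s / 20 ms, i.e. 200-250 IO-causing accesses. So of the ~2100 object probes the unpatched 'git branch' was doing, only a modest fraction has to miss the page cache and actually seek for the original delay to be fully explained.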
* Re: Performance issue of 'git branch' 2009-07-23 3:47 ` Linus Torvalds @ 2009-07-23 4:10 ` Linus Torvalds 2009-07-23 5:13 ` Junio C Hamano 2009-07-23 5:17 ` Carlos R. Mafra 0 siblings, 2 replies; 129+ messages in thread From: Linus Torvalds @ 2009-07-23 4:10 UTC (permalink / raw) To: Carlos R. Mafra; +Cc: Git Mailing List, Junio C Hamano On Wed, 22 Jul 2009, Linus Torvalds wrote: > > But I'll see if I can dig up my non-binary-search patch and see if I can > make it go faster. My machine is fast, but not so fast that I can't > measure it ;) Oh. We actually merged a fixed version of it. I'd completely forgotten. Enabled with 'GIT_USE_LOOKUP'. But it seems to give worse performance, despite giving me fewer searches: I get 2121 probes with binary searching, but only 1325 with the newton-raphson method (for the non-fixed 'git branch' case). Using GIT_USE_LOOKUP actually results in fewer pagefaults (1391 vs 1473), but it's still slower. Interesting. Carlos, try it on your machine (just do export GIT_USE_LOOKUP=1 time git branch to try it, and 'unset GIT_USE_LOOKUP' to disable it. (And note that the "=1" part isn't important - the only thing that matters is whether the environment variable is set or not - setting it to '0' will _not_ disable it, you need to 'unset' it). With my fix to 'git branch', it doesn't matter. I get the same performance, and same number of page faults (676) regardless. So my patch makes the GIT_USE_LOOKUP=1 thing irrelevant. Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 4:10 ` Linus Torvalds @ 2009-07-23 5:13 ` Junio C Hamano 2009-07-23 5:17 ` Carlos R. Mafra 1 sibling, 0 replies; 129+ messages in thread From: Junio C Hamano @ 2009-07-23 5:13 UTC (permalink / raw) To: Linus Torvalds; +Cc: Carlos R. Mafra, Git Mailing List Linus Torvalds <torvalds@linux-foundation.org> writes: > On Wed, 22 Jul 2009, Linus Torvalds wrote: >> >> But I'll see if I can dig up my non-binary-search patch and see if I can >> make it go faster. My machine is fast, but not so fast that I can't >> measure it ;) > > Oh. We actually merged a fixed version of it. I'd completely forgotten. As the commit message of 628522e (sha1-lookup: more memory efficient search in sorted list of SHA-1, 2007-12-29) shows, it didn't get any great performance improvements, even though it did make the probing quite a lot less memory intensive. Perhaps you can spot obvious inefficiency in the code that I failed to see, just like you recently did for "show --cc" codepath? ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 4:10 ` Linus Torvalds 2009-07-23 5:13 ` Junio C Hamano @ 2009-07-23 5:17 ` Carlos R. Mafra 1 sibling, 0 replies; 129+ messages in thread From: Carlos R. Mafra @ 2009-07-23 5:17 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano On Wed 22.Jul'09 at 21:10:49 -0700, Linus Torvalds wrote: > Enabled with 'GIT_USE_LOOKUP'. But it seems to give worse performance, > despite giving me fewer searches: I get 2121 probes with binary searching, > but only 1325 with the newton-raphson method (for the non-fixed 'git > branch' case). > > Using GIT_USE_LOOKUP actually results in fewer pagefaults (1391 vs 1473), > but it's still slower. Interesting. Carlos, try it on your machine (just > do > > export GIT_USE_LOOKUP=1 > time git branch > > to try it, and 'unset GIT_USE_LOOKUP' to disable it. GIT_USE_LOOKUP=1 makes it a bit slower overall. Without your patch, I get fewer pagefaults (1254 vs 1404) when it is set, but it takes ~0.5s longer (it varies a bit). > With my fix to 'git branch', it doesn't matter. I get the same > performance, and same number of page faults (676) regardless. So my patch > makes the GIT_USE_LOOKUP=1 thing irrelevant. With your patch and GIT_USE_LOOKUP=1 I get 751 pagefaults, versus 775 if GIT_USE_LOOKUP is unset, but it is faster when unset. So your patch without GIT_USE_LOOKUP=1 is the fastest option. ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 2:23 ` Linus Torvalds 2009-07-23 3:08 ` Linus Torvalds 2009-07-23 3:18 ` Carlos R. Mafra @ 2009-07-23 4:40 ` Junio C Hamano 2009-07-23 5:36 ` Linus Torvalds 2009-07-23 16:07 ` Carlos R. Mafra 2009-07-23 16:48 ` Anders Kaseorg 3 siblings, 2 replies; 129+ messages in thread From: Junio C Hamano @ 2009-07-23 4:40 UTC (permalink / raw) To: Linus Torvalds; +Cc: Carlos R. Mafra, Git Mailing List Linus Torvalds <torvalds@linux-foundation.org> writes: > On Wed, 22 Jul 2009, Linus Torvalds wrote: >> >> Ooh yes. That would do it. It's going to peel and look up every single ref >> it finds, so it's going to look up _hundreds_ of objects (all the tags, >> all the commits they point to, etc etc). Even if it then only shows a >> couple of branches. >> >> Junio, any ideas? > > I had one of my own. It seems that I missed all the fun while going out to dinner. > It uses the "raw" version of 'for_each_ref()' (which doesn't verify that > the ref is valid), and then does the "type verification" before it starts > doing any gentle commit lookup. Hmm, we now have to remember what this patch did, if we ever wanted to introduce negative refs later (see ef06b91 do_for_each_ref: perform the same sanity check for leftovers., 2006-11-18). Not exactly nice to spread the codepaths that need to be updated. Is the cold cache performance of "git branch" to list your local branches that important? ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 4:40 ` Junio C Hamano @ 2009-07-23 5:36 ` Linus Torvalds 2009-07-23 5:52 ` Junio C Hamano 0 siblings, 1 reply; 129+ messages in thread From: Linus Torvalds @ 2009-07-23 5:36 UTC (permalink / raw) To: Junio C Hamano; +Cc: Carlos R. Mafra, Git Mailing List On Wed, 22 Jul 2009, Junio C Hamano wrote: > > Hmm, we now have to remember what this patch did, if we ever wanted to > introduce negative refs later (see ef06b91 do_for_each_ref: perform the > same sanity check for leftovers., 2006-11-18). Not exactly nice to spread > the codepaths that need to be updated. Is the cold cache performance of > "git branch" to list your local branches that important? Hmm. I do think that 7.5s is _way_ too long to wait for something as simple as "what branches do I have?". And yes, it's also an operation that I'd expect to be quite possibly the first one you do when moving to a new repo, so cold-cache is realistic. And the 'rawref' thing is exactly the same as the 'ref' version, except it doesn't do the null_sha1 check and the 'has_sha1_file()' check. And since git branch will do something _better_ than the 'has_sha1_file()' check (by virtue of actually looking up the commit), I don't think that part is an issue. So the only issue is the is_null_sha1() thing. And quite frankly, while the null-sha1 check may make sense, the way the flag is named right now (DO_FOR_EACH_INCLUDE_BROKEN), I think we might be better off re-thinking things later if we ever end up caring. That 'is_null_sha1()' check should possibly be under a separate flag. That said, while I think my patch was the simplest and most straightforward one, the problem could certainly have been fixed differently. For example, instead of using 'for_each_ref()' and then splitting them by kind with that "detect kind" loop, it could instead have done two loops, ie if (kinds & REF_LOCAL_BRANCH) for_each_ref_in("refs/heads/", append_local, &ref_list); if (kinds & REF_REMOTE_BRANCH) for_each_ref_in("refs/remotes/", append_remote, &ref_list); and avoided the other refs we aren't interested in _that_ way instead. But it would be a bigger and more involved patch. It gets really messy too (I tried), because when you use 'for_each_ref_in()' it removes the prefix as it goes along, but then the code in builtin-branch.c wants the prefix after all. Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
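For what it's worth, the prefix problem mentioned at the end can be worked around with a small adapter callback that glues the stripped prefix back on before handing the name to the existing code. The snippet below is a self-contained toy, not a patch to builtin-branch.c: the callback type is simplified (git's each_ref_fn also passes the sha1 and flags), and for_each_ref_in() is stubbed out with a static list purely so the example runs.

#include <stdio.h>
#include <string.h>

/* Simplified callback type; git's each_ref_fn also gets sha1 and flags. */
typedef int (*each_ref_fn)(const char *refname, void *cb_data);

/*
 * Stand-in for git's for_each_ref_in(): walk refs under a prefix and
 * hand the callback the name with the prefix already stripped off,
 * which is exactly the behaviour that gets in the way here.
 */
static int for_each_ref_in(const char *prefix, each_ref_fn fn, void *cb_data)
{
        static const char *all_refs[] = {
                "refs/heads/master", "refs/heads/sparse",
                "refs/remotes/origin/master", "refs/tags/v2.6.30",
        };
        size_t i, plen = strlen(prefix);

        for (i = 0; i < sizeof(all_refs) / sizeof(all_refs[0]); i++) {
                if (!strncmp(all_refs[i], prefix, plen)) {
                        int ret = fn(all_refs[i] + plen, cb_data);
                        if (ret)
                                return ret;
                }
        }
        return 0;
}

struct prefix_cb {
        const char *prefix;     /* what to glue back on */
        each_ref_fn fn;         /* the real callback, which wants full names */
        void *cb_data;
};

/* Adapter: re-add the prefix, then call the real callback. */
static int readd_prefix(const char *shortname, void *cb_data)
{
        struct prefix_cb *p = cb_data;
        char full[256];

        snprintf(full, sizeof(full), "%s%s", p->prefix, shortname);
        return p->fn(full, p->cb_data);
}

static int show_ref(const char *refname, void *cb_data)
{
        (void)cb_data;
        return printf("%s\n", refname) < 0;
}

int main(void)
{
        struct prefix_cb heads   = { "refs/heads/",   show_ref, NULL };
        struct prefix_cb remotes = { "refs/remotes/", show_ref, NULL };

        /* Two loops: tags are never walked, let alone peeled. */
        for_each_ref_in("refs/heads/",   readd_prefix, &heads);
        for_each_ref_in("refs/remotes/", readd_prefix, &remotes);
        return 0;
}

The same adapter could wrap append_ref() so the two for_each_ref_in() calls never see tags at all, which is the whole point of the two-loop variant.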
* Re: Performance issue of 'git branch' 2009-07-23 5:36 ` Linus Torvalds @ 2009-07-23 5:52 ` Junio C Hamano 2009-07-23 6:04 ` Junio C Hamano 0 siblings, 1 reply; 129+ messages in thread From: Junio C Hamano @ 2009-07-23 5:52 UTC (permalink / raw) To: Linus Torvalds; +Cc: Carlos R. Mafra, Git Mailing List Linus Torvalds <torvalds@linux-foundation.org> writes: > On Wed, 22 Jul 2009, Junio C Hamano wrote: >> >> Hmm, we now have to remember what this patch did, if we ever wanted to >> introduce negative refs later (see ef06b91 do_for_each_ref: perform the >> same sanity check for leftovers., 2006-11-18). Not exactly nice to spread >> the codepaths that need to be updated. > ... > And since git branch will do something _better_ than the 'has_sha1_file()' > check (by virtue of actually looking up the commit), I don't think that > part is an issue. So the only issue is the is_null_sha1() thing. Exactly. That is_null_sha1() thing was a remnant of your idea to represent deleted ref that has a packed counterpart by storing 0{40} in a loose ref, so that we can implement deletion efficiently. Since we currently implement deletion by repacking packed refs if the ref has a packed (possibly stale) one, we do not use such a "negative ref", and skipping 0{40} done by the normal (i.e. non-raw) for_each_ref() family is not necessary. I was inclined to say that, because I never saw anybody complained that deleting refs was too slow, we declare that we would forever stick to the current implementation of ref deletion, and remove the is_null_sha1() check from the do_one_ref() function, even for include-broken case. But after thinking about it again, I'd say "if null, then skip" should be outside the DO_FOR_EACH_INCLUDE_BROKEN anyway, because the null check is not about brokenness of the ref, but is about a possible future expansion to represent deleted ref with such a "negative ref" entry. If we remove is_null_sha1() from do_one_ref(), or if we move it out of the "include broken" thing, my "Not exactly nice" comment can be rescinded, as doing the former (i.e. removal of is_null_sha1() check) is a promise that we will never have to worry about negative refs, and doing the latter will still protect callers of do_for_each_rawref() from negative refs if we ever introduce them in some future. ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 5:52 ` Junio C Hamano @ 2009-07-23 6:04 ` Junio C Hamano 2009-07-23 17:19 ` Linus Torvalds 0 siblings, 1 reply; 129+ messages in thread From: Junio C Hamano @ 2009-07-23 6:04 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, Carlos R. Mafra, Git Mailing List Junio C Hamano <gitster@pobox.com> writes: > Exactly. > > That is_null_sha1() thing was a remnant of your idea to represent deleted > ref that has a packed counterpart by storing 0{40} in a loose ref, so that > we can implement deletion efficiently. > > Since we currently implement deletion by repacking packed refs if the ref > has a packed (possibly stale) one, we do not use such a "negative ref", > and skipping 0{40} done by the normal (i.e. non-raw) for_each_ref() family > is not necessary. > > I was inclined to say that, because I never saw anybody complained that > deleting refs was too slow, we declare that we would forever stick to the > current implementation of ref deletion, and remove the is_null_sha1() > check from the do_one_ref() function, even for include-broken case. > > But after thinking about it again, I'd say "if null, then skip" should be > outside the DO_FOR_EACH_INCLUDE_BROKEN anyway, because the null check is > not about brokenness of the ref, but is about a possible future expansion > to represent deleted ref with such a "negative ref" entry. > > If we remove is_null_sha1() from do_one_ref(), or if we move it out of the > "include broken" thing, my "Not exactly nice" comment can be rescinded, as > doing the former (i.e. removal of is_null_sha1() check) is a promise that > we will never have to worry about negative refs, and doing the latter will > still protect callers of do_for_each_rawref() from negative refs if we > ever introduce them in some future. That is, a patch like this (this should go to 'maint'), and my worries will go away. -- >8 -- Subject: do_one_ref(): null_sha1 check is not about broken ref f8948e2 (remote prune: warn dangling symrefs, 2009-02-08) introduced a more dangerous variant of for_each_ref() family that skips the check for dangling refs, but it also made another unrelated check optional by mistake. The check to see if a ref points at 0{40} is not about brokenness, but is about a possible future plan to represent a deleted ref by writing 40 "0" in a loose ref when there is a stale version of the same ref already in .git/packed-refs, so that we can implement deletion of a ref without having to rewrite the packed refs file excluding the ref being deleted. This check has to be outside of the conditional. Signed-off-by: Junio C Hamano <gitster@pobox.com> --- refs.c | 5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/refs.c b/refs.c index bb0762e..3da3c8c 100644 --- a/refs.c +++ b/refs.c @@ -531,9 +531,10 @@ static int do_one_ref(const char *base, each_ref_fn fn, int trim, { if (strncmp(base, entry->name, trim)) return 0; + /* Is this a "negative ref" that represents a deleted ref? */ + if (is_null_sha1(entry->sha1)) + return 0; if (!(flags & DO_FOR_EACH_INCLUDE_BROKEN)) { - if (is_null_sha1(entry->sha1)) - return 0; if (!has_sha1_file(entry->sha1)) { error("%s does not point to a valid object!", entry->name); return 0; ^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 6:04 ` Junio C Hamano @ 2009-07-23 17:19 ` Linus Torvalds 0 siblings, 0 replies; 129+ messages in thread From: Linus Torvalds @ 2009-07-23 17:19 UTC (permalink / raw) To: Junio C Hamano; +Cc: Carlos R. Mafra, Git Mailing List On Wed, 22 Jul 2009, Junio C Hamano wrote: > > Subject: do_one_ref(): null_sha1 check is not about broken ref Ack. If we want to make it conditional at some point, we'd want to use a different flag. I do wonder if we should simply remove the code entirely? Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 4:40 ` Junio C Hamano 2009-07-23 5:36 ` Linus Torvalds @ 2009-07-23 16:07 ` Carlos R. Mafra 2009-07-23 16:19 ` Linus Torvalds 1 sibling, 1 reply; 129+ messages in thread From: Carlos R. Mafra @ 2009-07-23 16:07 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, Git Mailing List On Wed 22.Jul'09 at 21:40:36 -0700, Junio C Hamano wrote: > Is the cold cache performance of "git branch" to list your > local branches that important? I simply felt like something not optimal was going on, and in some sense I still feel it even with Linus' patch applied... Don't get me wrong, I am super happy that Linus fixed it so quickly and I am grateful for that, but I am surely missing some git internal reason why 'git branch' is not instantaneous as I _naively_ expected. Having learned about .git/packed-refs last night, today I tried this (with cold cache), [mafra@Pilar:linux-2.6]$ time awk '{print $2}' .git/packed-refs |grep heads| awk -F "/" '{print $3}' 0.00user 0.00system 0:00.12elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (3major+311minor)pagefaults 0swaps 27-stable 28-stable 29-stable 30-stable dev-private master option sparse stern and notice how that makes my pitiful harddisc look like Linus' SSD! And the result is the same. [ If some branches are not inside .git/packed-refs but are listed in .git/refs/heads (like some of them were last night), it would require some modification to the script, but it would still be faster ] However, I know that I am missing something here and I would be happy to learn what. Thanks in advance, Carlos ^ permalink raw reply [flat|nested] 129+ messages in thread
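For readers who have not looked inside it: .git/packed-refs is just a sorted text file of "<object name> <refname>" lines, plus a "^<object name>" line after an annotated tag giving the commit it peels to. The sample below is purely illustrative (the object names are placeholders, not real hashes), but it shows why taking the second field and then the third slash-separated component yields exactly the branch list - and also why 'git branch' still has more work to do than the awk pipeline: nothing in this file says whether those objects actually exist and are commits.

# pack-refs with: peeled
1111111111111111111111111111111111111111 refs/heads/27-stable
2222222222222222222222222222222222222222 refs/heads/master
3333333333333333333333333333333333333333 refs/remotes/origin/master
4444444444444444444444444444444444444444 refs/tags/v2.6.30
^5555555555555555555555555555555555555555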
* Re: Performance issue of 'git branch' 2009-07-23 16:07 ` Carlos R. Mafra @ 2009-07-23 16:19 ` Linus Torvalds 2009-07-23 16:53 ` Carlos R. Mafra 0 siblings, 1 reply; 129+ messages in thread From: Linus Torvalds @ 2009-07-23 16:19 UTC (permalink / raw) To: Carlos R. Mafra; +Cc: Junio C Hamano, Git Mailing List On Thu, 23 Jul 2009, Carlos R. Mafra wrote: > > Having learned about .git/packed-refs last night, today I tried > this (with cold cache), > > [mafra@Pilar:linux-2.6]$ time awk '{print $2}' .git/packed-refs |grep heads| awk -F "/" '{print $3}' > 0.00user 0.00system 0:00.12elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k > 0inputs+0outputs (3major+311minor)pagefaults 0swaps > 27-stable > 28-stable > 29-stable > 30-stable > dev-private > master > option > sparse > stern > > and notice how that makes my pitiful harddisc look like Linus' SSD! And the > result is the same. The result is the same, yes, but it doesn't do error checking. What "git branch" does over and beyond just looking at the heads is to also look at the commits those heads point to. And the reason it sucks for you is that the commits are pretty spread out (particularly in the index file, but also in the pack-file) on disk. So each "verify this head" will likely involve at least one seek, and possibly four or five. And on your disk, five seeks is a tenth of a second. You can run hdparm, and it will probably say that you get 30MB/s off that laptop drive - but when doing small random reads you'll probably get performance in the order of a few tens of kilobytes, not megabytes. (With read-ahead and read-around it's probably going to be mostly ~64kB IO's and you'll probably get hundreds of kB per second, but you're going to care about just a few kB total of those). So we _could_ make 'git branch' not actually read and verify the commits. It doesn't strictly _need_ to, unless you use 'git branch -v' or something. That would speed it up further, but the verification is nice, and as long as performance isn't _horrible_ I think we're better off doing it. After all, you'll see the problem only once. Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 16:19 ` Linus Torvalds @ 2009-07-23 16:53 ` Carlos R. Mafra 2009-07-23 19:05 ` Linus Torvalds 0 siblings, 1 reply; 129+ messages in thread From: Carlos R. Mafra @ 2009-07-23 16:53 UTC (permalink / raw) To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List On Thu 23.Jul'09 at 9:19:21 -0700, Linus Torvalds wrote: > > > > and notice how that makes my pitiful harddisc look like Linus' SSD! And the > > result is the same. > > The result is the same, yes, but it doesn't do error checking. Oh, I see. > So we _could_ make 'git branch' not actually read and verify the commits. > It doesn't strictly _need_ to, unless you use 'git branch -v' or > something. That would speed it up further, but the verification is nice, > and as long as performance isn't _horrible_ I think we're better off doing > it. Right, but I would definitely like having some option like --dont-check to 'git branch', and I think I would use it as default (unless experience tells that errors happen often). > After all, you'll see the problem only once. True, but paradoxically that is also the reason why I notice it and makes it feel bad. Everytime I did the first 'git branch' those 5 seconds really hurt, because I wondered why it couldn't be done in 0s like subsequent commands. But sure, this was definitely not a pressing issue and your patch made it even less. I am happy that it takes 1s now, and I really appreciated your patch! Thanks, Carlos ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 16:53 ` Carlos R. Mafra @ 2009-07-23 19:05 ` Linus Torvalds 2009-07-23 19:13 ` Linus Torvalds 0 siblings, 1 reply; 129+ messages in thread From: Linus Torvalds @ 2009-07-23 19:05 UTC (permalink / raw) To: Carlos R. Mafra; +Cc: Junio C Hamano, Git Mailing List On Thu, 23 Jul 2009, Carlos R. Mafra wrote: > > Everytime I did the first 'git branch' those 5 seconds really hurt, because > I wondered why it couldn't be done in 0s like subsequent commands. > > But sure, this was definitely not a pressing issue and your patch made > it even less. I am happy that it takes 1s now, and I really appreciated > your patch! You could try something like this (on _top_ of the previous patch). Not very exhaustively tested, but it's pretty simple. It will still do _some_ object lookups. In particular, it will do the HEAD lookup in 'print_ref_list()', even if it's not strictly necessary. But it should cut down the noise further. Linus --- builtin-branch.c | 24 ++++++++++++++---------- 1 files changed, 14 insertions(+), 10 deletions(-) diff --git a/builtin-branch.c b/builtin-branch.c index 54a89ff..82c2cf0 100644 --- a/builtin-branch.c +++ b/builtin-branch.c @@ -191,7 +191,7 @@ struct ref_item { struct ref_list { struct rev_info revs; - int index, alloc, maxwidth; + int index, alloc, maxwidth, verbose; struct ref_item *list; struct commit_list *with_commit; int kinds; @@ -244,17 +244,20 @@ static int append_ref(const char *refname, const unsigned char *sha1, int flags, if ((kind & ref_list->kinds) == 0) return 0; - commit = lookup_commit_reference_gently(sha1, 1); - if (!commit) - return error("branch '%s' does not point at a commit", refname); + commit = NULL; + if (ref_list->verbose || ref_list->with_commit || merge_filter != NO_FILTER) { + commit = lookup_commit_reference_gently(sha1, 1); + if (!commit) + return error("branch '%s' does not point at a commit", refname); - /* Filter with with_commit if specified */ - if (!is_descendant_of(commit, ref_list->with_commit)) - return 0; + /* Filter with with_commit if specified */ + if (!is_descendant_of(commit, ref_list->with_commit)) + return 0; - if (merge_filter != NO_FILTER) - add_pending_object(&ref_list->revs, - (struct object *)commit, refname); + if (merge_filter != NO_FILTER) + add_pending_object(&ref_list->revs, + (struct object *)commit, refname); + } /* Resize buffer */ if (ref_list->index >= ref_list->alloc) { @@ -423,6 +426,7 @@ static void print_ref_list(int kinds, int detached, int verbose, int abbrev, str memset(&ref_list, 0, sizeof(ref_list)); ref_list.kinds = kinds; + ref_list.verbose = verbose; ref_list.with_commit = with_commit; if (merge_filter != NO_FILTER) init_revisions(&ref_list.revs, NULL); ^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 19:05 ` Linus Torvalds @ 2009-07-23 19:13 ` Linus Torvalds 2009-07-23 19:55 ` Carlos R. Mafra 0 siblings, 1 reply; 129+ messages in thread From: Linus Torvalds @ 2009-07-23 19:13 UTC (permalink / raw) To: Carlos R. Mafra; +Cc: Junio C Hamano, Git Mailing List On Thu, 23 Jul 2009, Linus Torvalds wrote: > > You could try something like this (on _top_ of the previous patch). > > Not very exhaustively tested, but it's pretty simple. > > It will still do _some_ object lookups. In particular, it will do the HEAD > lookup in 'print_ref_list()', even if it's not strictly necessary. But it > should cut down the noise further. And this (on top of them all) will basically avoid even that one. In fact, I think this is a cleanup. I think I'll resubmit the whole series with proper commit messages etc. Linus --- builtin-branch.c | 38 +++++++++++++++++++++++--------------- 1 files changed, 23 insertions(+), 15 deletions(-) diff --git a/builtin-branch.c b/builtin-branch.c index 82c2cf0..1a03d5f 100644 --- a/builtin-branch.c +++ b/builtin-branch.c @@ -191,7 +191,7 @@ struct ref_item { struct ref_list { struct rev_info revs; - int index, alloc, maxwidth, verbose; + int index, alloc, maxwidth, verbose, abbrev; struct ref_item *list; struct commit_list *with_commit; int kinds; @@ -418,15 +418,34 @@ static int calc_maxwidth(struct ref_list *refs) return w; } + +static void show_detached(struct ref_list *ref_list) +{ + struct commit *head_commit = lookup_commit_reference_gently(head_sha1, 1); + + if (head_commit && is_descendant_of(head_commit, ref_list->with_commit)) { + struct ref_item item; + item.name = xstrdup("(no branch)"); + item.len = strlen(item.name); + item.kind = REF_LOCAL_BRANCH; + item.dest = NULL; + item.commit = head_commit; + if (item.len > ref_list->maxwidth) + ref_list->maxwidth = item.len; + print_ref_item(&item, ref_list->maxwidth, ref_list->verbose, ref_list->abbrev, 1, ""); + free(item.name); + } +} + static void print_ref_list(int kinds, int detached, int verbose, int abbrev, struct commit_list *with_commit) { int i; struct ref_list ref_list; - struct commit *head_commit = lookup_commit_reference_gently(head_sha1, 1); memset(&ref_list, 0, sizeof(ref_list)); ref_list.kinds = kinds; ref_list.verbose = verbose; + ref_list.abbrev = abbrev; ref_list.with_commit = with_commit; if (merge_filter != NO_FILTER) init_revisions(&ref_list.revs, NULL); @@ -446,19 +465,8 @@ static void print_ref_list(int kinds, int detached, int verbose, int abbrev, str qsort(ref_list.list, ref_list.index, sizeof(struct ref_item), ref_cmp); detached = (detached && (kinds & REF_LOCAL_BRANCH)); - if (detached && head_commit && - is_descendant_of(head_commit, with_commit)) { - struct ref_item item; - item.name = xstrdup("(no branch)"); - item.len = strlen(item.name); - item.kind = REF_LOCAL_BRANCH; - item.dest = NULL; - item.commit = head_commit; - if (item.len > ref_list.maxwidth) - ref_list.maxwidth = item.len; - print_ref_item(&item, ref_list.maxwidth, verbose, abbrev, 1, ""); - free(item.name); - } + if (detached) + show_detached(&ref_list); for (i = 0; i < ref_list.index; i++) { int current = !detached && ^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 19:13 ` Linus Torvalds @ 2009-07-23 19:55 ` Carlos R. Mafra 2009-07-24 20:36 ` Linus Torvalds 0 siblings, 1 reply; 129+ messages in thread From: Carlos R. Mafra @ 2009-07-23 19:55 UTC (permalink / raw) To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List On Thu 23.Jul'09 at 12:13:41 -0700, Linus Torvalds wrote: > > It will still do _some_ object lookups. In particular, it will do the HEAD > > lookup in 'print_ref_list()', even if it's not strictly necessary. But it > > should cut down the noise further. > > And this (on top of them all) will basically avoid even that one. Ok, I applied (both) on top of the first one. After 7 tests I got these, time: 0.61 +/- 0.08 GIT_DEBUG_LOOKUP=1 git branch |wc -l 9 which are in fact only the branches list. Compared to yesterday, that is a huge improvement (0.6s vs 5.7s) and (9 vs 2200+). At least for me 0.6s is "instantaneous", so the issue is really gone. Thanks a lot to everyone! ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 19:55 ` Carlos R. Mafra @ 2009-07-24 20:36 ` Linus Torvalds 2009-07-24 20:47 ` Linus Torvalds 0 siblings, 1 reply; 129+ messages in thread From: Linus Torvalds @ 2009-07-24 20:36 UTC (permalink / raw) To: Carlos R. Mafra; +Cc: Junio C Hamano, Git Mailing List On Thu, 23 Jul 2009, Carlos R. Mafra wrote: > > After 7 tests I got these, > > time: > > 0.61 +/- 0.08 Btw, I think 0.61s is still too much. Can you send me the output of 'strace -Ttt' on your machine? It's entirely possible that it's all the actual binary (and shared library) loading, of course. You do have a slow harddisk. But it takes 0.035s for me, and I'm wondering if there is something else than just CPU speed and IO speed accounting for the 20x performance difference. (That said, maybe 20x is right - my SSD latency almost certainly is 20x better). Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-24 20:36 ` Linus Torvalds @ 2009-07-24 20:47 ` Linus Torvalds 2009-07-24 21:21 ` Linus Torvalds 0 siblings, 1 reply; 129+ messages in thread From: Linus Torvalds @ 2009-07-24 20:47 UTC (permalink / raw) To: Carlos R. Mafra; +Cc: Junio C Hamano, Git Mailing List On Fri, 24 Jul 2009, Linus Torvalds wrote: > > Btw, I think 0.61s is still too much. Can you send me the output of > 'strace -Ttt' on your machine? Never mind. I'm seeing even worse behavior on a laptop I just dug up (another 4200 rpm harddisk). I'll dig some more. Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-24 20:47 ` Linus Torvalds @ 2009-07-24 21:21 ` Linus Torvalds 2009-07-24 22:13 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 129+ messages in thread From: Linus Torvalds @ 2009-07-24 21:21 UTC (permalink / raw) To: Carlos R. Mafra; +Cc: Junio C Hamano, Git Mailing List On Fri, 24 Jul 2009, Linus Torvalds wrote: > > Never mind. I'm seeing even worse behavior on a laptop I just dug up > (another 4200 rpm harddisk). > > I'll dig some more. Yeah, it seems to be the loading overhead. I'm seeing a 'time git branch' take 1.2s in the cold-cache case, in a directory that isn't even a git directory. And 80% of it comes before we even get to 'main()'. Shared library loading, SELinux crud etc. A lot of it seems to be 'libfreebl3' and 'libselinux', which is some crazy sh*t. It seems to be all from 'curl' support. That seems _really_ sad. Lookie here: [torvalds@nehalem git]$ ldd git linux-vdso.so.1 => (0x00007fff61da7000) libcurl.so.4 => /usr/lib64/libcurl.so.4 (0x00007f2f1a498000) libz.so.1 => /lib64/libz.so.1 (0x0000003cdb800000) libcrypto.so.8 => /usr/lib64/libcrypto.so.8 (0x0000003ba7a00000) libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003cdb400000) libc.so.6 => /lib64/libc.so.6 (0x0000003cda800000) libidn.so.11 => /lib64/libidn.so.11 (0x0000003ceaa00000) libssh2.so.1 => /usr/lib64/libssh2.so.1 (0x0000003ba8e00000) libldap-2.4.so.2 => /usr/lib64/libldap-2.4.so.2 (0x00007f2f1a250000) librt.so.1 => /lib64/librt.so.1 (0x0000003cdbc00000) libgssapi_krb5.so.2 => /usr/lib64/libgssapi_krb5.so.2 (0x0000003ce6e00000) libkrb5.so.3 => /usr/lib64/libkrb5.so.3 (0x0000003ce7e00000) libk5crypto.so.3 => /usr/lib64/libk5crypto.so.3 (0x0000003ce7200000) libcom_err.so.2 => /lib64/libcom_err.so.2 (0x0000003ce6a00000) libssl3.so => /lib64/libssl3.so (0x0000003490200000) libsmime3.so => /lib64/libsmime3.so (0x000000348fe00000) libnss3.so => /lib64/libnss3.so (0x000000348f600000) libplds4.so => /lib64/libplds4.so (0x0000003cbc800000) libplc4.so => /lib64/libplc4.so (0x0000003cbdc00000) libnspr4.so => /lib64/libnspr4.so (0x0000003cbd800000) libdl.so.2 => /lib64/libdl.so.2 (0x0000003cdb000000) /lib64/ld-linux-x86-64.so.2 (0x0000003cda400000) libssl.so.8 => /usr/lib64/libssl.so.8 (0x0000003ba7e00000) liblber-2.4.so.2 => /usr/lib64/liblber-2.4.so.2 (0x0000003ceee00000) libresolv.so.2 => /lib64/libresolv.so.2 (0x0000003ce5600000) libsasl2.so.2 => /usr/lib64/libsasl2.so.2 (0x00007f2f1a030000) libkrb5support.so.0 => /usr/lib64/libkrb5support.so.0 (0x0000003ce7a00000) libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x0000003ce7600000) libnssutil3.so => /lib64/libnssutil3.so (0x000000348fa00000) libcrypt.so.1 => /lib64/libcrypt.so.1 (0x00007f2f19df8000) libselinux.so.1 => /lib64/libselinux.so.1 (0x0000003cdc400000) libfreebl3.so => /lib64/libfreebl3.so (0x00007f2f19b99000) [torvalds@nehalem git]$ make -j16 NO_CURL=1 [torvalds@nehalem git]$ ldd git linux-vdso.so.1 => (0x00007fff2f960000) libz.so.1 => /lib64/libz.so.1 (0x0000003cdb800000) libcrypto.so.8 => /usr/lib64/libcrypto.so.8 (0x0000003ba7a00000) libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003cdb400000) libc.so.6 => /lib64/libc.so.6 (0x0000003cda800000) libdl.so.2 => /lib64/libdl.so.2 (0x0000003cdb000000) /lib64/ld-linux-x86-64.so.2 (0x0000003cda400000) What a huge difference! And the NO_CURL version really does load a lot faster in cold-cache. 
We're not talking small differences: - compiled with NO_CURL, five runs of "echo 3 > /proc/sys/vm/drop_caches" followed by "time git branch": real 0m0.654s real 0m0.562s real 0m0.519s real 0m0.534s real 0m0.734s Total number of system calls: 194 - compiled with curl, same thing: real 0m1.503s real 0m1.455s real 0m1.267s real 0m1.819s real 0m0.985s Total number of system calls: 407! ie we're talking a _huge_ hit in startup times for that curl support. That's really really sad - especially considering how all the curl support is for very random occasional stuff. I never use it myself, for example, since I don't use http at all. And even for people who do, they only need it for non-local operations. I wonder if there is some way to only load the crazy curl stuff when we actually want open a http: connection. Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
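One way to read the "only load the curl stuff when an http connection is actually wanted" idea fairly literally is to stop linking against libcurl and dlopen() it on first use instead. The sketch below is just that - a hedged illustration, not the route the patches in this thread take (they split the curl users out into separate executables): the library soname, the hand-rolled function-pointer typedefs, the option constant and the http_fetch() wrapper are all assumptions made to keep the example self-contained, and it needs -ldl to link.

#include <dlfcn.h>
#include <stdio.h>

/* Minimal hand-rolled views of the curl entry points used below.  The
 * void * handle and int return types stand in for CURL * / CURLcode. */
typedef void *(*curl_init_fn)(void);
typedef int (*curl_setopt_fn)(void *, int, ...);
typedef int (*curl_perform_fn)(void *);
typedef void (*curl_cleanup_fn)(void *);

#define MY_CURLOPT_URL 10002   /* assumed to match CURLOPT_URL in curl/curl.h */

static curl_init_fn p_init;
static curl_setopt_fn p_setopt;
static curl_perform_fn p_perform;
static curl_cleanup_fn p_cleanup;

/* Map libcurl (and, transitively, its pile of dependencies) only when
 * an http URL actually shows up; local-only commands never pay for it. */
static int load_curl(void)
{
        static void *handle;

        if (handle)
                return 0;
        handle = dlopen("libcurl.so.4", RTLD_NOW | RTLD_LOCAL);
        if (!handle) {
                fprintf(stderr, "cannot load libcurl: %s\n", dlerror());
                return -1;
        }
        p_init    = (curl_init_fn)dlsym(handle, "curl_easy_init");
        p_setopt  = (curl_setopt_fn)dlsym(handle, "curl_easy_setopt");
        p_perform = (curl_perform_fn)dlsym(handle, "curl_easy_perform");
        p_cleanup = (curl_cleanup_fn)dlsym(handle, "curl_easy_cleanup");
        return (p_init && p_setopt && p_perform && p_cleanup) ? 0 : -1;
}

static int http_fetch(const char *url)
{
        void *h;
        int ret;

        if (load_curl() < 0)
                return -1;
        h = p_init();
        if (!h)
                return -1;
        p_setopt(h, MY_CURLOPT_URL, url);
        ret = p_perform(h);
        p_cleanup(h);
        return ret;
}

int main(int argc, char **argv)
{
        if (argc > 1)
                return http_fetch(argv[1]) ? 1 : 0;
        puts("no URL given; libcurl was never loaded");
        return 0;
}

Whether this beats simply splitting the http walker into its own executable is debatable: the dlopen route keeps one binary, but it smuggles the runtime dependency in behind the linker's back, and the startup cost just moves to the first http operation.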
* Re: Performance issue of 'git branch' 2009-07-24 21:21 ` Linus Torvalds @ 2009-07-24 22:13 ` Linus Torvalds 2009-07-24 22:18 ` david 2009-08-07 4:21 ` Jeff King 2009-07-24 22:54 ` Theodore Tso 2009-07-24 23:46 ` Carlos R. Mafra 2 siblings, 2 replies; 129+ messages in thread From: Linus Torvalds @ 2009-07-24 22:13 UTC (permalink / raw) To: Junio C Hamano, Git Mailing List Cc: Carlos R. Mafra, Daniel Barkalow, Johannes Schindelin On Fri, 24 Jul 2009, Linus Torvalds wrote: > > ie we're talking a _huge_ hit in startup times for that curl support. > That's really really sad - especially considering how all the curl support > is for very random occasional stuff. I never use it myself, for example, > since I don't use http at all. And even for people who do, they only need > it for non-local operations. > > I wonder if there is some way to only load the crazy curl stuff when we > actually want open a http: connection. Here's the simple step#1: make 'git-http-fetch' be an external program rather than a built-in. Sadly, I have no idea hot to turn the transport.c code into an external walker sanely (turn the ref/object walkers into an exec of an external program). So we still end up linking with curl. But maybe somebody (Daniel? Dscho?) who knows the transport code could try to make it an external process? The performance angle of http fetching is non-existent, we really should try very hard to make the curl-dependent parts be in a binary of their own. Linus --- >From 3cfc50d497266dc73a414ed1460b36b712ad10de Mon Sep 17 00:00:00 2001 From: Linus Torvalds <torvalds@linux-foundation.org> Date: Fri, 24 Jul 2009 14:54:55 -0700 Subject: [PATCH] git-http-fetch: not a builtin We should really try to avoid having a dependency on the curl libraries for the core 'git' executable. It adds huge overheads, for no advantage. This splits up git-http-fetch so that it isn't built-in. We still do end up linking with curl for the git binary due to the transport.c http walker, but that's at least partially an independent issue. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- Makefile | 8 +++++++- git.c | 3 --- builtin-http-fetch.c => http-fetch.c | 5 ++++- 3 files changed, 11 insertions(+), 5 deletions(-) rename builtin-http-fetch.c => http-fetch.c (95%) diff --git a/Makefile b/Makefile index bde27ed..8cbd863 100644 --- a/Makefile +++ b/Makefile @@ -978,9 +978,12 @@ else else CURL_LIBCURL = -lcurl endif - BUILTIN_OBJS += builtin-http-fetch.o + PROGRAMS += git-http-fetch$X + + # FIXME! 
Sadly 'transport.c' still needs these for the builtin case EXTLIBS += $(CURL_LIBCURL) LIB_OBJS += http.o http-walker.o + curl_check := $(shell (echo 070908; curl-config --vernum) | sort -r | sed -ne 2p) ifeq "$(curl_check)" "070908" ifndef NO_EXPAT @@ -1485,6 +1488,9 @@ git-imap-send$X: imap-send.o $(GITLIBS) http.o http-walker.o http-push.o transport.o: http.h +git-http-fetch$X: revision.o http.o http-push.o $(GITLIBS) + $(QUIET_LINK)$(CC) $(ALL_CFLAGS) -o $@ $(ALL_LDFLAGS) $(filter %.o,$^) \ + $(LIBS) $(CURL_LIBCURL) $(EXPAT_LIBEXPAT) git-http-push$X: revision.o http.o http-push.o $(GITLIBS) $(QUIET_LINK)$(CC) $(ALL_CFLAGS) -o $@ $(ALL_LDFLAGS) $(filter %.o,$^) \ $(LIBS) $(CURL_LIBCURL) $(EXPAT_LIBEXPAT) diff --git a/git.c b/git.c index 807d875..c1e8f05 100644 --- a/git.c +++ b/git.c @@ -309,9 +309,6 @@ static void handle_internal_command(int argc, const char **argv) { "get-tar-commit-id", cmd_get_tar_commit_id }, { "grep", cmd_grep, RUN_SETUP | USE_PAGER }, { "help", cmd_help }, -#ifndef NO_CURL - { "http-fetch", cmd_http_fetch, RUN_SETUP }, -#endif { "init", cmd_init_db }, { "init-db", cmd_init_db }, { "log", cmd_log, RUN_SETUP | USE_PAGER }, diff --git a/builtin-http-fetch.c b/http-fetch.c similarity index 95% rename from builtin-http-fetch.c rename to http-fetch.c index f3e63d7..e8f44ba 100644 --- a/builtin-http-fetch.c +++ b/http-fetch.c @@ -1,8 +1,9 @@ #include "cache.h" #include "walker.h" -int cmd_http_fetch(int argc, const char **argv, const char *prefix) +int main(int argc, const char **argv) { + const char *prefix; struct walker *walker; int commits_on_stdin = 0; int commits; @@ -18,6 +19,8 @@ int cmd_http_fetch(int argc, const char **argv, const char *prefix) int get_verbosely = 0; int get_recover = 0; + prefix = setup_git_directory(); + git_config(git_default_config, NULL); while (arg < argc && argv[arg][0] == '-') { -- 1.6.4.rc1.5.gb84f ^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-24 22:13 ` Linus Torvalds @ 2009-07-24 22:18 ` david 2009-07-24 22:42 ` Linus Torvalds 2009-08-07 4:21 ` Jeff King 1 sibling, 1 reply; 129+ messages in thread From: david @ 2009-07-24 22:18 UTC (permalink / raw) To: Linus Torvalds Cc: Junio C Hamano, Git Mailing List, Carlos R. Mafra, Daniel Barkalow, Johannes Schindelin On Fri, 24 Jul 2009, Linus Torvalds wrote: > On Fri, 24 Jul 2009, Linus Torvalds wrote: >> >> ie we're talking a _huge_ hit in startup times for that curl support. >> That's really really sad - especially considering how all the curl support >> is for very random occasional stuff. I never use it myself, for example, >> since I don't use http at all. And even for people who do, they only need >> it for non-local operations. >> >> I wonder if there is some way to only load the crazy curl stuff when we >> actually want open a http: connection. > > Here's the simple step#1: make 'git-http-fetch' be an external program > rather than a built-in. > > Sadly, I have no idea hot to turn the transport.c code into an external > walker sanely (turn the ref/object walkers into an exec of an external > program). So we still end up linking with curl. But maybe somebody > (Daniel? Dscho?) who knows the transport code could try to make it an > external process? > > The performance angle of http fetching is non-existent, we really should > try very hard to make the curl-dependent parts be in a binary of their > own. what does the performance look like if you just do a static compile instead? David Lang > Linus > > --- >> From 3cfc50d497266dc73a414ed1460b36b712ad10de Mon Sep 17 00:00:00 2001 > From: Linus Torvalds <torvalds@linux-foundation.org> > Date: Fri, 24 Jul 2009 14:54:55 -0700 > Subject: [PATCH] git-http-fetch: not a builtin > > We should really try to avoid having a dependency on the curl libraries > for the core 'git' executable. It adds huge overheads, for no advantage. > > This splits up git-http-fetch so that it isn't built-in. We still do > end up linking with curl for the git binary due to the transport.c http > walker, but that's at least partially an independent issue. > > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> > --- > Makefile | 8 +++++++- > git.c | 3 --- > builtin-http-fetch.c => http-fetch.c | 5 ++++- > 3 files changed, 11 insertions(+), 5 deletions(-) > rename builtin-http-fetch.c => http-fetch.c (95%) > > diff --git a/Makefile b/Makefile > index bde27ed..8cbd863 100644 > --- a/Makefile > +++ b/Makefile > @@ -978,9 +978,12 @@ else > else > CURL_LIBCURL = -lcurl > endif > - BUILTIN_OBJS += builtin-http-fetch.o > + PROGRAMS += git-http-fetch$X > + > + # FIXME! 
Sadly 'transport.c' still needs these for the builtin case > EXTLIBS += $(CURL_LIBCURL) > LIB_OBJS += http.o http-walker.o > + > curl_check := $(shell (echo 070908; curl-config --vernum) | sort -r | sed -ne 2p) > ifeq "$(curl_check)" "070908" > ifndef NO_EXPAT > @@ -1485,6 +1488,9 @@ git-imap-send$X: imap-send.o $(GITLIBS) > > http.o http-walker.o http-push.o transport.o: http.h > > +git-http-fetch$X: revision.o http.o http-push.o $(GITLIBS) > + $(QUIET_LINK)$(CC) $(ALL_CFLAGS) -o $@ $(ALL_LDFLAGS) $(filter %.o,$^) \ > + $(LIBS) $(CURL_LIBCURL) $(EXPAT_LIBEXPAT) > git-http-push$X: revision.o http.o http-push.o $(GITLIBS) > $(QUIET_LINK)$(CC) $(ALL_CFLAGS) -o $@ $(ALL_LDFLAGS) $(filter %.o,$^) \ > $(LIBS) $(CURL_LIBCURL) $(EXPAT_LIBEXPAT) > diff --git a/git.c b/git.c > index 807d875..c1e8f05 100644 > --- a/git.c > +++ b/git.c > @@ -309,9 +309,6 @@ static void handle_internal_command(int argc, const char **argv) > { "get-tar-commit-id", cmd_get_tar_commit_id }, > { "grep", cmd_grep, RUN_SETUP | USE_PAGER }, > { "help", cmd_help }, > -#ifndef NO_CURL > - { "http-fetch", cmd_http_fetch, RUN_SETUP }, > -#endif > { "init", cmd_init_db }, > { "init-db", cmd_init_db }, > { "log", cmd_log, RUN_SETUP | USE_PAGER }, > diff --git a/builtin-http-fetch.c b/http-fetch.c > similarity index 95% > rename from builtin-http-fetch.c > rename to http-fetch.c > index f3e63d7..e8f44ba 100644 > --- a/builtin-http-fetch.c > +++ b/http-fetch.c > @@ -1,8 +1,9 @@ > #include "cache.h" > #include "walker.h" > > -int cmd_http_fetch(int argc, const char **argv, const char *prefix) > +int main(int argc, const char **argv) > { > + const char *prefix; > struct walker *walker; > int commits_on_stdin = 0; > int commits; > @@ -18,6 +19,8 @@ int cmd_http_fetch(int argc, const char **argv, const char *prefix) > int get_verbosely = 0; > int get_recover = 0; > > + prefix = setup_git_directory(); > + > git_config(git_default_config, NULL); > > while (arg < argc && argv[arg][0] == '-') { > ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-24 22:18 ` david @ 2009-07-24 22:42 ` Linus Torvalds 2009-07-24 22:46 ` david 0 siblings, 1 reply; 129+ messages in thread From: Linus Torvalds @ 2009-07-24 22:42 UTC (permalink / raw) To: david Cc: Junio C Hamano, Git Mailing List, Carlos R. Mafra, Daniel Barkalow, Johannes Schindelin On Fri, 24 Jul 2009, david@lang.hm wrote: > > what does the performance look like if you just do a static compile instead? I don't even know - I don't have a static version of curl. I could install one, of course, but since I don't think that's the solution anyway, I'm not going to bother. The real solution really is to not have curl support in the main binary. One option might be to make _all_ the transport code be outside of the core binary, of course. That's a fairly simple but somewhat sad solution (ie make all of push/pull/fetch/clone/ls-remote/etc be external binaries) Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-24 22:42 ` Linus Torvalds @ 2009-07-24 22:46 ` david 2009-07-25 2:39 ` Linus Torvalds 0 siblings, 1 reply; 129+ messages in thread From: david @ 2009-07-24 22:46 UTC (permalink / raw) To: Linus Torvalds Cc: Junio C Hamano, Git Mailing List, Carlos R. Mafra, Daniel Barkalow, Johannes Schindelin On Fri, 24 Jul 2009, Linus Torvalds wrote: > On Fri, 24 Jul 2009, david@lang.hm wrote: >> >> what does the performance look like if you just do a static compile instead? > > I don't even know - I don't have a static version of curl. I could install > one, of course, but since I don't think that's the solution anyway, I'm > not going to bother. I wasn't thinking a static version of curl, I was thinking a static version of the git binaries. See how fast things could be if no startup linking was necessary. David Lang > The real solution really is to not have curl support in the main binary. > > One option might be to make _all_ the transport code be outside of the > core binary, of course. That's a fairly simple but somewhat sad solution > (ie make all of push/pull/fetch/clone/ls-remote/etc be external binaries) > > Linus > ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-24 22:46 ` david @ 2009-07-25 2:39 ` Linus Torvalds 2009-07-25 2:53 ` Daniel Barkalow 0 siblings, 1 reply; 129+ messages in thread From: Linus Torvalds @ 2009-07-25 2:39 UTC (permalink / raw) To: david Cc: Junio C Hamano, Git Mailing List, Carlos R. Mafra, Daniel Barkalow, Johannes Schindelin On Fri, 24 Jul 2009, david@lang.hm wrote: > On Fri, 24 Jul 2009, Linus Torvalds wrote: > > > On Fri, 24 Jul 2009, david@lang.hm wrote: > > > > > > what does the performance look like if you just do a static compile > > > instead? > > > > I don't even know - I don't have a static version of curl. I could install > > one, of course, but since I don't think that's the solution anyway, I'm > > not going to bother. > > I wasn't thinking a static version of curl, I was thinking a static version of > the git binaries. see how fast things could be if no startup linking was > nessasary. Well, that's what I meant. If I add '-static' to the link flags, I get /usr/bin/ld: cannot find -lcurl collect2: ld returned 1 exit status because I simply don't have a static library version of curl (and if I do NO_CURL, I fail the link due to not having a static version of zlib). That's what I meant by "I could install a static version of curl" - I could install the debug libraries, but it just isn't a normal thing to do on any modern distribution. The right thing to do really would be to not have -lcurl for the main git binary at all. Preferably done by having http walking handled by an external process (the way we already do rsync), but it's probably easier to just make all the clone/fetch/ls-remote things be a separate binary. Of course, I'd personally solve the problem with NO_CURL=1, but that's probably not acceptable in general. Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-25 2:39 ` Linus Torvalds @ 2009-07-25 2:53 ` Daniel Barkalow 0 siblings, 0 replies; 129+ messages in thread From: Daniel Barkalow @ 2009-07-25 2:53 UTC (permalink / raw) To: Linus Torvalds Cc: david, Junio C Hamano, Git Mailing List, Carlos R. Mafra, Johannes Schindelin On Fri, 24 Jul 2009, Linus Torvalds wrote: > On Fri, 24 Jul 2009, david@lang.hm wrote: > > > On Fri, 24 Jul 2009, Linus Torvalds wrote: > > > > > On Fri, 24 Jul 2009, david@lang.hm wrote: > > > > > > > > what does the performance look like if you just do a static compile > > > > instead? > > > > > > I don't even know - I don't have a static version of curl. I could install > > > one, of course, but since I don't think that's the solution anyway, I'm > > > not going to bother. > > > > I wasn't thinking a static version of curl, I was thinking a static version of > > the git binaries. see how fast things could be if no startup linking was > > nessasary. > > Well, that's what I meant. If I add '-static' to the link flags, I get > > /usr/bin/ld: cannot find -lcurl > collect2: ld returned 1 exit status > > because I simply don't have a static library version of curl (and if I do > NO_CURL, I fail the link due to not having a static version of zlib). > > That's what I meant by "I could install a static version of curl" - I > could install the debug libraries, but it just isn't a normal thing to do > on any modern distribution. The right thing to do really would be to not > have -lcurl for the main git binary at all. > > Preferably done by having http walking handled by an external process (the > way we already do rsync), but it's probably easier to just make all the > clone/fetch/ls-remote things be a separate binary. I think it's actually easy enough to have a separate binary to handle the http walking, particularly since I've got code lying around to handle importing from a foreign VCS with a separate binary that I can just remove some of the features from. -Daniel *This .sig left intentionally blank* ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-24 22:13 ` Linus Torvalds 2009-07-24 22:18 ` david @ 2009-08-07 4:21 ` Jeff King 1 sibling, 0 replies; 129+ messages in thread From: Jeff King @ 2009-08-07 4:21 UTC (permalink / raw) To: Linus Torvalds Cc: Junio C Hamano, Git Mailing List, Carlos R. Mafra, Daniel Barkalow, Johannes Schindelin On Fri, Jul 24, 2009 at 03:13:07PM -0700, Linus Torvalds wrote: > Subject: [PATCH] git-http-fetch: not a builtin > > We should really try to avoid having a dependency on the curl libraries > for the core 'git' executable. It adds huge overheads, for no advantage. > > This splits up git-http-fetch so that it isn't built-in. We still do > end up linking with curl for the git binary due to the transport.c http > walker, but that's at least partially an independent issue. > > [...] > > +git-http-fetch$X: revision.o http.o http-push.o $(GITLIBS) > + $(QUIET_LINK)$(CC) $(ALL_CFLAGS) -o $@ $(ALL_LDFLAGS) $(filter %.o,$^) \ > + $(LIBS) $(CURL_LIBCURL) $(EXPAT_LIBEXPAT) Err, this seems to horribly break git-http-fetch (see if you can spot the logic error in dependencies). Patch is below. Nobody noticed, I expect, because nothing in git _uses_ http-fetch anymore, now that git-clone is no longer a shell script. I only noticed because it tried to build http-push on one of my NO_EXPAT machines. It might be an interesting exercise to dust off the old shell scripts once in a while and see if they still pass their original tests while running on top of a more modern git. It would test that we haven't broken the plumbing interfaces. -- >8 -- Subject: [PATCH] Makefile: build http-fetch against http-fetch.o As opposed to http-push.o. We can also drop EXPAT_LIBEXPAT, since fetch does not need it. This appears to be a bad cut-and-paste in commit 1088261f. Signed-off-by: Jeff King <peff@peff.net> --- Makefile | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/Makefile b/Makefile index 97d904b..d6362d3 100644 --- a/Makefile +++ b/Makefile @@ -1502,9 +1502,9 @@ http.o http-walker.o http-push.o: http.h http.o http-walker.o: $(LIB_H) -git-http-fetch$X: revision.o http.o http-push.o $(GITLIBS) +git-http-fetch$X: revision.o http.o http-fetch.o http-walker.o $(GITLIBS) $(QUIET_LINK)$(CC) $(ALL_CFLAGS) -o $@ $(ALL_LDFLAGS) $(filter %.o,$^) \ - $(LIBS) $(CURL_LIBCURL) $(EXPAT_LIBEXPAT) + $(LIBS) $(CURL_LIBCURL) git-http-push$X: revision.o http.o http-push.o $(GITLIBS) $(QUIET_LINK)$(CC) $(ALL_CFLAGS) -o $@ $(ALL_LDFLAGS) $(filter %.o,$^) \ $(LIBS) $(CURL_LIBCURL) $(EXPAT_LIBEXPAT) -- 1.6.4.117.g6056d.dirty ^ permalink raw reply related [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-24 21:21 ` Linus Torvalds 2009-07-24 22:13 ` Linus Torvalds @ 2009-07-24 22:54 ` Theodore Tso 2009-07-24 22:59 ` Shawn O. Pearce 2009-07-24 23:46 ` Carlos R. Mafra 2 siblings, 1 reply; 129+ messages in thread From: Theodore Tso @ 2009-07-24 22:54 UTC (permalink / raw) To: Linus Torvalds; +Cc: Carlos R. Mafra, Junio C Hamano, Git Mailing List On Fri, Jul 24, 2009 at 02:21:20PM -0700, Linus Torvalds wrote: > > I wonder if there is some way to only load the crazy curl stuff when we > actually want open a http: connection. Well, we could use dlopen(), but I'm not sure that qualifies as a _sane_ solution --- especially given that there are approximately 15 interfaces used by git, that we'd have to resolve using dlsym(). - Ted ^ permalink raw reply [flat|nested] 129+ messages in thread
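To make the dlopen() route concrete, here is a rough sketch of what lazily loading libcurl could look like. The two entry points shown stand in for the roughly fifteen interfaces Ted mentions, and the soname, symbol choice and error handling are illustrative assumptions, not anything git actually does:

#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative only: every libcurl entry point used would need its own
 * function-pointer variable and its own dlsym() lookup like the two below. */
static void *curl_lib;
static void *(*dyn_curl_easy_init)(void);
static void (*dyn_curl_easy_cleanup)(void *);

static void load_curl(void)
{
	if (curl_lib)
		return;
	curl_lib = dlopen("libcurl.so.4", RTLD_NOW);
	if (!curl_lib) {
		fprintf(stderr, "cannot load libcurl: %s\n", dlerror());
		exit(1);
	}
	dyn_curl_easy_init = dlsym(curl_lib, "curl_easy_init");
	dyn_curl_easy_cleanup = dlsym(curl_lib, "curl_easy_cleanup");
	if (!dyn_curl_easy_init || !dyn_curl_easy_cleanup) {
		fprintf(stderr, "libcurl is missing a symbol: %s\n", dlerror());
		exit(1);
	}
}

int main(void)
{
	load_curl();	/* only the http code path would ever pay this cost */
	void *handle = dyn_curl_easy_init();
	dyn_curl_easy_cleanup(handle);
	return 0;
}

(Build with -ldl.) The objection is not the dlopen() call itself but the boilerplate: multiply the pair of lookups above by every curl function the http code touches, keep the declarations in sync with curl's headers by hand, and "sane" is no longer the word for it.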
* Re: Performance issue of 'git branch' 2009-07-24 22:54 ` Theodore Tso @ 2009-07-24 22:59 ` Shawn O. Pearce 2009-07-24 23:28 ` Junio C Hamano 2009-07-26 17:07 ` Avi Kivity 0 siblings, 2 replies; 129+ messages in thread From: Shawn O. Pearce @ 2009-07-24 22:59 UTC (permalink / raw) To: Theodore Tso Cc: Linus Torvalds, Carlos R. Mafra, Junio C Hamano, Git Mailing List Theodore Tso <tytso@mit.edu> wrote: > On Fri, Jul 24, 2009 at 02:21:20PM -0700, Linus Torvalds wrote: > > > > I wonder if there is some way to only load the crazy curl stuff when we > > actually want open a http: connection. > > Well, we could use dlopen(), but I'm not sure that qualifies as a > _sane_ solution --- especially given that there are approximately 15 > interfaces used by git, that we'd have to resolve using dlsym(). Yea, that's not sane. Probably the better approach is to have git fetch and git push be a different binary from main git, so we only pay the libcurl loading overheads when we hit transport. -- Shawn. ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-24 22:59 ` Shawn O. Pearce @ 2009-07-24 23:28 ` Junio C Hamano 2009-07-26 17:07 ` Avi Kivity 1 sibling, 0 replies; 129+ messages in thread From: Junio C Hamano @ 2009-07-24 23:28 UTC (permalink / raw) To: Shawn O. Pearce Cc: Theodore Tso, Linus Torvalds, Carlos R. Mafra, Junio C Hamano, Git Mailing List "Shawn O. Pearce" <spearce@spearce.org> writes: > Theodore Tso <tytso@mit.edu> wrote: >> On Fri, Jul 24, 2009 at 02:21:20PM -0700, Linus Torvalds wrote: >> > >> > I wonder if there is some way to only load the crazy curl stuff when we >> > actually want open a http: connection. >> >> Well, we could use dlopen(), but I'm not sure that qualifies as a >> _sane_ solution --- especially given that there are approximately 15 >> interfaces used by git, that we'd have to resolve using dlsym(). > > Yea, that's not sane. > > Probably the better approach is to have git fetch and git push be a > different binary from main git, so we only pay the libcurl loading > overheads when we hit transport. Even though that still will hurt people who do not use http, I think it would be a right approach (in the sense that it should not be too painful and with a reasonable gain for local-only operations). ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-24 22:59 ` Shawn O. Pearce 2009-07-24 23:28 ` Junio C Hamano @ 2009-07-26 17:07 ` Avi Kivity 2009-07-26 17:16 ` Johannes Schindelin 1 sibling, 1 reply; 129+ messages in thread From: Avi Kivity @ 2009-07-26 17:07 UTC (permalink / raw) To: Shawn O. Pearce Cc: Theodore Tso, Linus Torvalds, Carlos R. Mafra, Junio C Hamano, Git Mailing List On 07/25/2009 01:59 AM, Shawn O. Pearce wrote: > Theodore Tso<tytso@mit.edu> wrote: > >> On Fri, Jul 24, 2009 at 02:21:20PM -0700, Linus Torvalds wrote: >> >>> I wonder if there is some way to only load the crazy curl stuff when we >>> actually want open a http: connection. >>> >> Well, we could use dlopen(), but I'm not sure that qualifies as a >> _sane_ solution --- especially given that there are approximately 15 >> interfaces used by git, that we'd have to resolve using dlsym(). >> > > Yea, that's not sane. > > Probably the better approach is to have git fetch and git push be a > different binary from main git, so we only pay the libcurl loading > overheads when we hit transport. > Or make the transports shared libraries, and use dlopen() to open the transport and dlsym() to resolve the struct transport object exported by the library. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 129+ messages in thread
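A minimal sketch of the plugin approach Avi describes; the vtable layout, exported symbol name and plugin soname below are invented for illustration and are not git's real transport API:

#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical plugin interface: each transport .so exports one object,
 * so the loader needs a single dlsym() instead of one per function. */
struct transport_plugin {
	const char *name;
	int (*fetch)(const char *url);
	int (*push)(const char *url);
};

static const struct transport_plugin *load_transport(const char *soname)
{
	void *lib = dlopen(soname, RTLD_NOW);
	if (!lib) {
		fprintf(stderr, "dlopen: %s\n", dlerror());
		return NULL;
	}
	return dlsym(lib, "git_transport_plugin");
}

int main(void)
{
	const struct transport_plugin *t =
		load_transport("git-transport-http.so");
	if (!t || !t->fetch) {
		fprintf(stderr, "http transport not available\n");
		return 1;
	}
	return t->fetch("http://example.com/repo.git");
}

The attraction is that the main git binary never links against curl at all; the cost is exactly what the next message raises - the build now has to produce and install loadable modules on every supported platform.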
* Re: Performance issue of 'git branch' 2009-07-26 17:07 ` Avi Kivity @ 2009-07-26 17:16 ` Johannes Schindelin 0 siblings, 0 replies; 129+ messages in thread From: Johannes Schindelin @ 2009-07-26 17:16 UTC (permalink / raw) To: Avi Kivity Cc: Shawn O. Pearce, Theodore Tso, Linus Torvalds, Carlos R. Mafra, Junio C Hamano, Git Mailing List Hi, On Sun, 26 Jul 2009, Avi Kivity wrote: > On 07/25/2009 01:59 AM, Shawn O. Pearce wrote: > > Theodore Tso<tytso@mit.edu> wrote: > > > > > On Fri, Jul 24, 2009 at 02:21:20PM -0700, Linus Torvalds wrote: > > > > > > > I wonder if there is some way to only load the crazy curl stuff when we > > > > actually want open a http: connection. > > > > > > > Well, we could use dlopen(), but I'm not sure that qualifies as a > > > _sane_ solution --- especially given that there are approximately 15 > > > interfaces used by git, that we'd have to resolve using dlsym(). > > > > > > > Yea, that's not sane. > > > > Probably the better approach is to have git fetch and git push be a > > different binary from main git, so we only pay the libcurl loading > > overheads when we hit transport. > > > > Or make the transports shared libraries, and use dlopen() to open the > transport and dlsym() to resolve the struct transport object exported by the > library. ... and introduce all kinds of braindamage to the Makefile so we can properly compile .dll files on Windows? Umm, thanks, but no. Ciao, Dscho ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-24 21:21 ` Linus Torvalds 2009-07-24 22:13 ` Linus Torvalds 2009-07-24 22:54 ` Theodore Tso @ 2009-07-24 23:46 ` Carlos R. Mafra 2009-07-25 0:41 ` Carlos R. Mafra 2 siblings, 1 reply; 129+ messages in thread From: Carlos R. Mafra @ 2009-07-24 23:46 UTC (permalink / raw) To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List Sorry for the delay and missing the "strace -ttT" request, but today was a "Physics" day and it took me longer to notice your email. On Fri 24.Jul'09 at 14:21:20 -0700, Linus Torvalds wrote: > > What a huge difference! > > And the NO_CURL version really does load a lot faster in cold-cache. We're > not talking small differences: With NO_CURL=1 the strace log contained 242 lines (vs 404), but the time difference was not as great as you got. But it was better: 0.55 +- 0.06 (for 8 runs) So I repeated the tests with curl enabled and this time I got: 0.77 +- 0.03 (for 6 runs) (yesterday I got 0.61 +- 0.08, so there is a lot of noise) So it is better, but not by the same factor as you saw. But I may have an explanation for this. After I clear the cache I wait a few seconds to stabilize, and I do the 'time git branch' test when I see that there is no activity on the disk by looking at the 'btrace' output in another xterm. I noticed that after dropping the cache and before I do the test there is a lot of activity from something called 'preload', with lines which look like these: 8,0 0 42881 495.067655112 17777 Q R 51244367 + 552 [preload] 8,0 0 42882 495.067659931 17777 G R 51244367 + 552 [preload] 8,0 0 42883 495.067664401 17777 I R 51244367 + 552 [preload] I hadn't noticed this before, and now I checked: "preload is an adaptive readahead daemon that prefetches files mapped by applications from the disk to reduce application startup time." So I guess that my tests here for your NO_CURL=1 idea are inconclusive, as I am not sure what preload is prefetching. ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-24 23:46 ` Carlos R. Mafra @ 2009-07-25 0:41 ` Carlos R. Mafra 2009-07-25 18:04 ` Linus Torvalds 0 siblings, 1 reply; 129+ messages in thread From: Carlos R. Mafra @ 2009-07-25 0:41 UTC (permalink / raw) To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List On Sat 25.Jul'09 at 1:46:48 +0200, Carlos R. Mafra wrote: > > So I guess that my tests here for your NO_CURL=1 idea is inconclusive, > as I am not sure what preload is prefetching. Ok, so I killed /usr/sbin/preload and did the tests again. The results were much more stable, with average 0.40 vs 0.79 (NO_CURL=1 being faster). The pagefaults were pretty stable too, (40major+654minor vs 12major+401minor). I will use NO_CURL=1 from now on! ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-25 0:41 ` Carlos R. Mafra @ 2009-07-25 18:04 ` Linus Torvalds 2009-07-25 18:57 ` Timo Hirvonen 0 siblings, 1 reply; 129+ messages in thread From: Linus Torvalds @ 2009-07-25 18:04 UTC (permalink / raw) To: Carlos R. Mafra; +Cc: Junio C Hamano, Git Mailing List On Sat, 25 Jul 2009, Carlos R. Mafra wrote: > > Ok, so I killed /usr/sbin/preload and did the tests again. The > results were much more stable, with average 0.40 vs 0.79 > (NO_CURL=1 being faster). The pagefaults were pretty stable too, > (40major+654minor vs 12major+401minor). > > I will use NO_CURL=1 from now on! I actually find it interesting that this whole NO_CURL issue is actually a lot more noticeable for me in the hot-cache case than all the other 'git branch' issues were. I went back to a version a few days ago (before all the optimizations), and on my machine with a hot cache I get (for my kernel repo - I don't use branches there, but I have an old 'akpm' branch for taking a emailed patch series from Andrew): [torvalds@nehalem linux]$ time ~/git/git branch akpm * master real 0m0.005s user 0m0.004s sys 0m0.000s so it's five milliseconds. Big deal, fast enough, right? Ok, so fast-forward to today, with the optimizations to builtin-branch.c: [torvalds@nehalem linux]$ time ~/git/git branch akpm * master real 0m0.004s user 0m0.000s sys 0m0.004s Woot! I shaved a millisecond off it by avoiding all those page faults and object lookups. Good, but hey, all that unnecessary lookup was just a 25% cost. So let's build it with NO_CURL: [torvalds@nehalem linux]$ time ~/git/git branch akpm * master real 0m0.002s user 0m0.000s sys 0m0.000s Heh. The whole NO_CURL=1 thing is actually a _bigger_ optimization than anything else I did to git-branch. Cost of curl: 100%. The difference in number of system calls and page faults is really quite staggering. System calls: 397->184, page faults: 619->293. Just from not doing that curl loading. No wonder performance actually doubles. Now, I admit that 5ms vs 2ms probably doesn't really matter much, but dang, performance was a primary goal in git, so I'm a bit upset at how bad curl screwed us. Plus those things do add up when scripting things, and those 300+ page faults are basically true for _all_ git programs. So it's not just 'git branch': doing 'git show' shows the exact same thing: 6ms -> 4ms, 448->235 system calls, and 1549->1176 page faults. So curl really must die. It may not matter for the expensive operations, but a lot of scripting is about running all those "cheap" things that just add up over time. Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-25 18:04 ` Linus Torvalds @ 2009-07-25 18:57 ` Timo Hirvonen 2009-07-25 19:06 ` Reece Dunn ` (2 more replies) 0 siblings, 3 replies; 129+ messages in thread From: Timo Hirvonen @ 2009-07-25 18:57 UTC (permalink / raw) To: git; +Cc: Carlos R. Mafra, Junio C Hamano, Git Mailing List Linus Torvalds <torvalds@linux-foundation.org> wrote: > So curl really must die. It may not matter for the expensive operations, > but a lot of scripting is about running all those "cheap" things that just > add up over time. SELinux is the problem, not curl. On my Arch Linux machine: $ ldd bin/git linux-vdso.so.1 => (0x00007fff42306000) libcurl.so.4 => /usr/lib/libcurl.so.4 (0x00007f8714532000) libz.so.1 => /usr/lib/libz.so.1 (0x00007f871431d000) libcrypto.so.0.9.8 => /usr/lib/libcrypto.so.0.9.8 (0x00007f8713f8f000) libpthread.so.0 => /lib/libpthread.so.0 (0x00007f8713d74000) libc.so.6 => /lib/libc.so.6 (0x00007f8713a21000) librt.so.1 => /lib/librt.so.1 (0x00007f8713819000) libssl.so.0.9.8 => /usr/lib/libssl.so.0.9.8 (0x00007f87135ca000) libdl.so.2 => /lib/libdl.so.2 (0x00007f87133c6000) /lib/ld-linux-x86-64.so.2 (0x00007f8714778000) Your: [torvalds@nehalem git]$ ldd git linux-vdso.so.1 => (0x00007fff61da7000) libcurl.so.4 => /usr/lib64/libcurl.so.4 (0x00007f2f1a498000) libz.so.1 => /lib64/libz.so.1 (0x0000003cdb800000) libcrypto.so.8 => /usr/lib64/libcrypto.so.8 (0x0000003ba7a00000) libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003cdb400000) libc.so.6 => /lib64/libc.so.6 (0x0000003cda800000) libidn.so.11 => /lib64/libidn.so.11 (0x0000003ceaa00000) libssh2.so.1 => /usr/lib64/libssh2.so.1 (0x0000003ba8e00000) libldap-2.4.so.2 => /usr/lib64/libldap-2.4.so.2 (0x00007f2f1a250000) librt.so.1 => /lib64/librt.so.1 (0x0000003cdbc00000) libgssapi_krb5.so.2 => /usr/lib64/libgssapi_krb5.so.2 (0x0000003ce6e00000) libkrb5.so.3 => /usr/lib64/libkrb5.so.3 (0x0000003ce7e00000) libk5crypto.so.3 => /usr/lib64/libk5crypto.so.3 (0x0000003ce7200000) libcom_err.so.2 => /lib64/libcom_err.so.2 (0x0000003ce6a00000) libssl3.so => /lib64/libssl3.so (0x0000003490200000) libsmime3.so => /lib64/libsmime3.so (0x000000348fe00000) libnss3.so => /lib64/libnss3.so (0x000000348f600000) libplds4.so => /lib64/libplds4.so (0x0000003cbc800000) libplc4.so => /lib64/libplc4.so (0x0000003cbdc00000) libnspr4.so => /lib64/libnspr4.so (0x0000003cbd800000) libdl.so.2 => /lib64/libdl.so.2 (0x0000003cdb000000) /lib64/ld-linux-x86-64.so.2 (0x0000003cda400000) libssl.so.8 => /usr/lib64/libssl.so.8 (0x0000003ba7e00000) liblber-2.4.so.2 => /usr/lib64/liblber-2.4.so.2 (0x0000003ceee00000) libresolv.so.2 => /lib64/libresolv.so.2 (0x0000003ce5600000) libsasl2.so.2 => /usr/lib64/libsasl2.so.2 (0x00007f2f1a030000) libkrb5support.so.0 => /usr/lib64/libkrb5support.so.0 (0x0000003ce7a00000) libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x0000003ce7600000) libnssutil3.so => /lib64/libnssutil3.so (0x000000348fa00000) libcrypt.so.1 => /lib64/libcrypt.so.1 (0x00007f2f19df8000) libselinux.so.1 => /lib64/libselinux.so.1 (0x0000003cdc400000) libfreebl3.so => /lib64/libfreebl3.so (0x00007f2f19b99000) ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-25 18:57 ` Timo Hirvonen @ 2009-07-25 19:06 ` Reece Dunn 2009-07-25 20:31 ` Mike Hommey 2009-07-25 21:04 ` Carlos R. Mafra 2 siblings, 0 replies; 129+ messages in thread From: Reece Dunn @ 2009-07-25 19:06 UTC (permalink / raw) To: Timo Hirvonen; +Cc: git, Carlos R. Mafra, Junio C Hamano 2009/7/25 Timo Hirvonen <tihirvon@gmail.com>: > Linus Torvalds <torvalds@linux-foundation.org> wrote: > >> So curl really must die. It may not matter for the expensive operations, >> but a lot of scripting is about running all those "cheap" things that just >> add up over time. > > SELinux is the problem, not curl. > > On my Arch Linux machine: > > $ ldd bin/git > linux-vdso.so.1 => (0x00007fff42306000) > libcurl.so.4 => /usr/lib/libcurl.so.4 (0x00007f8714532000) > libz.so.1 => /usr/lib/libz.so.1 (0x00007f871431d000) > libcrypto.so.0.9.8 => /usr/lib/libcrypto.so.0.9.8 (0x00007f8713f8f000) > libpthread.so.0 => /lib/libpthread.so.0 (0x00007f8713d74000) > libc.so.6 => /lib/libc.so.6 (0x00007f8713a21000) > librt.so.1 => /lib/librt.so.1 (0x00007f8713819000) > libssl.so.0.9.8 => /usr/lib/libssl.so.0.9.8 (0x00007f87135ca000) > libdl.so.2 => /lib/libdl.so.2 (0x00007f87133c6000) > /lib/ld-linux-x86-64.so.2 (0x00007f8714778000) It will depend on the dependencies of curl that are applied. BLFS (http://www.linuxfromscratch.org/blfs/view/stable/basicnet/curl.html) list the following dependencies: pkg-config-0.22 OpenSSL-0.9.8g or GnuTLS-1.6.3 OpenLDAP-2.3.39 libidn-0.6.14 MIT Kerberos V5-1.6 or Heimdal-1.1 krb4 SPNEGO c-ares and the dependencies of those packages and so forth. On Ubuntu 9.04, I get: $ ldd /usr/bin/git linux-gate.so.1 => (0xb80ae000) libcurl-gnutls.so.4 => /usr/lib/libcurl-gnutls.so.4 (0xb805b000) libz.so.1 => /lib/libz.so.1 (0xb8045000) libpthread.so.0 => /lib/tls/i686/cmov/libpthread.so.0 (0xb802b000) libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb7ec8000) libidn.so.11 => /usr/lib/libidn.so.11 (0xb7e95000) liblber-2.4.so.2 => /usr/lib/liblber-2.4.so.2 (0xb7e87000) libldap_r-2.4.so.2 => /usr/lib/libldap_r-2.4.so.2 (0xb7e43000) librt.so.1 => /lib/tls/i686/cmov/librt.so.1 (0xb7e39000) libgssapi_krb5.so.2 => /usr/lib/libgssapi_krb5.so.2 (0xb7e0e000) libgnutls.so.26 => /usr/lib/libgnutls.so.26 (0xb7d71000) libtasn1.so.3 => /usr/lib/libtasn1.so.3 (0xb7d5f000) libgcrypt.so.11 => /lib/libgcrypt.so.11 (0xb7cf6000) /lib/ld-linux.so.2 (0xb80af000) libresolv.so.2 => /lib/tls/i686/cmov/libresolv.so.2 (0xb7ce0000) libsasl2.so.2 => /usr/lib/libsasl2.so.2 (0xb7cc7000) libdl.so.2 => /lib/tls/i686/cmov/libdl.so.2 (0xb7cc3000) libkrb5.so.3 => /usr/lib/libkrb5.so.3 (0xb7c31000) libk5crypto.so.3 => /usr/lib/libk5crypto.so.3 (0xb7c0d000) libcom_err.so.2 => /lib/libcom_err.so.2 (0xb7c09000) libkrb5support.so.0 => /usr/lib/libkrb5support.so.0 (0xb7bff000) libkeyutils.so.1 => /lib/libkeyutils.so.1 (0xb7bfb000) libgpg-error.so.0 => /lib/libgpg-error.so.0 (0xb7bf7000) - Reece ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-25 18:57 ` Timo Hirvonen 2009-07-25 19:06 ` Reece Dunn @ 2009-07-25 20:31 ` Mike Hommey 2009-07-25 21:02 ` Linus Torvalds 2009-07-25 21:04 ` Carlos R. Mafra 2 siblings, 1 reply; 129+ messages in thread From: Mike Hommey @ 2009-07-25 20:31 UTC (permalink / raw) To: Timo Hirvonen; +Cc: git, Carlos R. Mafra, Junio C Hamano On Sat, Jul 25, 2009 at 09:57:39PM +0300, Timo Hirvonen wrote: > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > So curl really must die. It may not matter for the expensive operations, > > but a lot of scripting is about running all those "cheap" things that just > > add up over time. > > SELinux is the problem, not curl. I think it's NSS, the problem, not SELinux. Linus's libcurl is built against NSS, which is the default on Fedora. Mike ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-25 20:31 ` Mike Hommey @ 2009-07-25 21:02 ` Linus Torvalds 2009-07-25 21:13 ` Linus Torvalds 2009-07-26 7:54 ` Mike Hommey 0 siblings, 2 replies; 129+ messages in thread From: Linus Torvalds @ 2009-07-25 21:02 UTC (permalink / raw) To: Mike Hommey; +Cc: Timo Hirvonen, git, Carlos R. Mafra, Junio C Hamano On Sat, 25 Jul 2009, Mike Hommey wrote: > On Sat, Jul 25, 2009 at 09:57:39PM +0300, Timo Hirvonen wrote: > > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > > So curl really must die. It may not matter for the expensive operations, > > > but a lot of scripting is about running all those "cheap" things that just > > > add up over time. > > > > SELinux is the problem, not curl. > > I think it's NSS, the problem, not SELinux. Linus's libcurl is built > against NSS, which is the default on Fedora. Well, it kind of doesn't matter. The fact is, libcurl is a bloated monster, and adds zero to 99% of what git people do. The fact that apparently sometimes it's less bloated than other times doesn't really change anything fundamental, does it? Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-25 21:02 ` Linus Torvalds @ 2009-07-25 21:13 ` Linus Torvalds 2009-07-25 23:23 ` Johannes Schindelin 2009-07-26 7:54 ` Mike Hommey 1 sibling, 1 reply; 129+ messages in thread From: Linus Torvalds @ 2009-07-25 21:13 UTC (permalink / raw) To: Mike Hommey; +Cc: Timo Hirvonen, git, Carlos R. Mafra, Junio C Hamano On Sat, 25 Jul 2009, Linus Torvalds wrote: > > The fact that apparently sometimes it's less bloated than other times > doesn't really change anything fundamental, does it? Btw, does anybody know how/why libdl seems to get linked in too? We're not doing -ldl, and I'm not seeing any need for it, but it's definitely there on fedora, at least. It seems to come from libcrypto. I can get rid of it with NO_OPENSSL, and that cuts down on the number of system calls in my startup by 16 (getting rid of both libcrypto and libdl). I wonder if there is some way to get the optimized openssl sha1 routines _without_ that silly ldl thing. Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-25 21:13 ` Linus Torvalds @ 2009-07-25 23:23 ` Johannes Schindelin 2009-07-26 4:49 ` Linus Torvalds 0 siblings, 1 reply; 129+ messages in thread From: Johannes Schindelin @ 2009-07-25 23:23 UTC (permalink / raw) To: Linus Torvalds Cc: Mike Hommey, Timo Hirvonen, git, Carlos R. Mafra, Junio C Hamano Hi, On Sat, 25 Jul 2009, Linus Torvalds wrote: > On Sat, 25 Jul 2009, Linus Torvalds wrote: > > > > The fact that apparently sometimes it's less bloated than other times > > doesn't really change anything fundamental, does it? > > Btw, does anybody know how/why libdl seems to get linked in too? > > We're not doing -ldl, and I'm not seeing any need for it, but it's > definitely there on fedora, at least. > > It seems to come from libcrypto. I can get rid of it with NO_OPENSSL, and > that cuts down on the number of system calls in my startup by 16 (getting > rid of both libcrypto and libdl). I wonder if there is some way to get the > optimized openssl sha1 routines _without_ that silly ldl thing. OpenSSL allows for so-called engines implementing certain algorithms. These engines are dynamic libraries, loaded via dlopen(). Ciao, Dscho ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-25 23:23 ` Johannes Schindelin @ 2009-07-26 4:49 ` Linus Torvalds 2009-07-26 16:29 ` Theodore Tso 0 siblings, 1 reply; 129+ messages in thread From: Linus Torvalds @ 2009-07-26 4:49 UTC (permalink / raw) To: Johannes Schindelin Cc: Mike Hommey, Timo Hirvonen, git, Carlos R. Mafra, Junio C Hamano On Sun, 26 Jul 2009, Johannes Schindelin wrote: > > > > It seems to come from libcrypto. I can get rid of it with NO_OPENSSL, and > > that cuts down on the number of system calls in my startup by 16 (getting > > rid of both libcrypto and libdl). I wonder if there is some way to get the > > optimized openssl sha1 routines _without_ that silly ldl thing. > > OpenSSL allows for so-called engines implementing certain algorithms. > These engines are dynamic libraries, loaded via dlopen(). Ah. Ok, that explains it. It's a bit sad, since the _only_ thing we load all of libcrypto for is the (fairly trivial) SHA1 code. But at the same time, last time I benchmarked the different SHA1 libraries, the openssl one was the fastest. I think it has tuned assembly language for most architectures. Our regular mozilla-based C code is perfectly fine, but it doesn't hold a candle to assembler tuning. Oh well. Linus ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-26 4:49 ` Linus Torvalds @ 2009-07-26 16:29 ` Theodore Tso 0 siblings, 0 replies; 129+ messages in thread From: Theodore Tso @ 2009-07-26 16:29 UTC (permalink / raw) To: Linus Torvalds Cc: Johannes Schindelin, Mike Hommey, Timo Hirvonen, git, Carlos R. Mafra, Junio C Hamano On Sat, Jul 25, 2009 at 09:49:41PM -0700, Linus Torvalds wrote: > > But at the same time, last time I benchmarked the different SHA1 > libraries, the openssl one was the fastest. I think it has tuned assembly > language for most architectures. Our regular mozilla-based C code is > perfectly fine, but it doesn't hold a candle to assembler tuning. So maybe git should import the SHA1 code into its own source base? It's not like the SHA1 code changes often, or is likely to have security issues (at least, not buffer overruns; if SHA1 gets thoroughly broken we might have to change algorithms, but that's a different kettle of fish :-). - Ted ^ permalink raw reply [flat|nested] 129+ messages in thread
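For a sense of scale, the part of libcrypto that git actually exercises is essentially the three-call SHA-1 interface below (a standalone sketch; whether git's wrappers map onto exactly these names is an assumption, but the surface area is about this small either way):

#include <stdio.h>
#include <openssl/sha.h>

int main(void)
{
	static const char msg[] = "hello, git";
	unsigned char digest[SHA_DIGEST_LENGTH];
	SHA_CTX ctx;
	int i;

	/* init/update/final is all git needs from the library */
	SHA1_Init(&ctx);
	SHA1_Update(&ctx, msg, sizeof(msg) - 1);
	SHA1_Final(digest, &ctx);

	for (i = 0; i < SHA_DIGEST_LENGTH; i++)
		printf("%02x", digest[i]);
	putchar('\n');
	return 0;
}

Compile that with -lcrypto and the whole ldd listing shown earlier in the thread comes along for the ride; carry a tuned SHA1 implementation in git's own tree instead and the dependency simply disappears.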
* Re: Performance issue of 'git branch' 2009-07-25 21:02 ` Linus Torvalds 2009-07-25 21:13 ` Linus Torvalds @ 2009-07-26 7:54 ` Mike Hommey 2009-07-26 10:16 ` Johannes Schindelin 1 sibling, 1 reply; 129+ messages in thread From: Mike Hommey @ 2009-07-26 7:54 UTC (permalink / raw) To: Linus Torvalds; +Cc: Timo Hirvonen, git, Carlos R. Mafra, Junio C Hamano On Sat, Jul 25, 2009 at 02:02:19PM -0700, Linus Torvalds wrote: > > > On Sat, 25 Jul 2009, Mike Hommey wrote: > > > On Sat, Jul 25, 2009 at 09:57:39PM +0300, Timo Hirvonen wrote: > > > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > > > > So curl really must die. It may not matter for the expensive operations, > > > > but a lot of scripting is about running all those "cheap" things that just > > > > add up over time. > > > > > > SELinux is the problem, not curl. > > > > I think it's NSS, the problem, not SELinux. Linus's libcurl is built > > against NSS, which is the default on Fedora. > > Well, it kind of doesn't matter. The fact is, libcurl is a bloated > monster, and adds zero to 99% of what git people do. Especially consideting the http transport fails to be useful in various scenarios. Mike ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-26 7:54 ` Mike Hommey @ 2009-07-26 10:16 ` Johannes Schindelin 2009-07-26 10:23 ` demerphq 0 siblings, 1 reply; 129+ messages in thread From: Johannes Schindelin @ 2009-07-26 10:16 UTC (permalink / raw) To: Mike Hommey Cc: Linus Torvalds, Timo Hirvonen, git, Carlos R. Mafra, Junio C Hamano Hi, On Sun, 26 Jul 2009, Mike Hommey wrote: > On Sat, Jul 25, 2009 at 02:02:19PM -0700, Linus Torvalds wrote: > > > > > > On Sat, 25 Jul 2009, Mike Hommey wrote: > > > > > On Sat, Jul 25, 2009 at 09:57:39PM +0300, Timo Hirvonen wrote: > > > > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > > > > > > So curl really must die. It may not matter for the expensive operations, > > > > > but a lot of scripting is about running all those "cheap" things that just > > > > > add up over time. > > > > > > > > SELinux is the problem, not curl. > > > > > > I think it's NSS, the problem, not SELinux. Linus's libcurl is built > > > against NSS, which is the default on Fedora. > > > > Well, it kind of doesn't matter. The fact is, libcurl is a bloated > > monster, and adds zero to 99% of what git people do. > > Especially consideting the http transport fails to be useful in various > scenarios. I beg your pardon? Maybe "s/useful/desirable/"? In many scenarios, http transport is the _last resort_ against overzealous administrators. The fact that you might be lucky enough not to need that resort is a blessing, and does not give you the right to ridicule those who are unfortunate enough not to share your good luck. Ciao, Dscho ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-26 10:16 ` Johannes Schindelin @ 2009-07-26 10:23 ` demerphq 2009-07-26 10:27 ` demerphq 0 siblings, 1 reply; 129+ messages in thread From: demerphq @ 2009-07-26 10:23 UTC (permalink / raw) To: Johannes Schindelin Cc: Mike Hommey, Linus Torvalds, Timo Hirvonen, git, Carlos R. Mafra, Junio C Hamano 2009/7/26 Johannes Schindelin <Johannes.Schindelin@gmx.de>: > Hi, > > On Sun, 26 Jul 2009, Mike Hommey wrote: > >> On Sat, Jul 25, 2009 at 02:02:19PM -0700, Linus Torvalds wrote: >> > >> > >> > On Sat, 25 Jul 2009, Mike Hommey wrote: >> > >> > > On Sat, Jul 25, 2009 at 09:57:39PM +0300, Timo Hirvonen wrote: >> > > > Linus Torvalds <torvalds@linux-foundation.org> wrote: >> > > > >> > > > > So curl really must die. It may not matter for the expensive operations, >> > > > > but a lot of scripting is about running all those "cheap" things that just >> > > > > add up over time. >> > > > >> > > > SELinux is the problem, not curl. >> > > >> > > I think it's NSS, the problem, not SELinux. Linus's libcurl is built >> > > against NSS, which is the default on Fedora. >> > >> > Well, it kind of doesn't matter. The fact is, libcurl is a bloated >> > monster, and adds zero to 99% of what git people do. >> >> Especially consideting the http transport fails to be useful in various >> scenarios. > > I beg your pardon? Maybe "s/useful/desirable/"? > > In many scenarios, http transport is the _last resort_ against overzealous > administrators. The fact that you might be lucky enough not to need that > resort is a blessing, and does not give you the right to ridicule those > who are unfortunate enough not to share your good luck. I think he meant that it is buggy and does not work correctly in various scenarios. Eg: Last I checked it couldn't handle repos where the main branch wasn''t called master, and I've seen other messages that make me think it doesn't work correctly on edge cases. cheers, Yves -- perl -Mre=debug -e "/just|another|perl|hacker/" ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-26 10:23 ` demerphq @ 2009-07-26 10:27 ` demerphq 0 siblings, 0 replies; 129+ messages in thread From: demerphq @ 2009-07-26 10:27 UTC (permalink / raw) To: Johannes Schindelin Cc: Mike Hommey, Linus Torvalds, Timo Hirvonen, git, Carlos R. Mafra, Junio C Hamano 2009/7/26 demerphq <demerphq@gmail.com>: > 2009/7/26 Johannes Schindelin <Johannes.Schindelin@gmx.de>: >> Hi, >> >> On Sun, 26 Jul 2009, Mike Hommey wrote: >> >>> On Sat, Jul 25, 2009 at 02:02:19PM -0700, Linus Torvalds wrote: >>> > >>> > >>> > On Sat, 25 Jul 2009, Mike Hommey wrote: >>> > >>> > > On Sat, Jul 25, 2009 at 09:57:39PM +0300, Timo Hirvonen wrote: >>> > > > Linus Torvalds <torvalds@linux-foundation.org> wrote: >>> > > > >>> > > > > So curl really must die. It may not matter for the expensive operations, >>> > > > > but a lot of scripting is about running all those "cheap" things that just >>> > > > > add up over time. >>> > > > >>> > > > SELinux is the problem, not curl. >>> > > >>> > > I think it's NSS, the problem, not SELinux. Linus's libcurl is built >>> > > against NSS, which is the default on Fedora. >>> > >>> > Well, it kind of doesn't matter. The fact is, libcurl is a bloated >>> > monster, and adds zero to 99% of what git people do. >>> >>> Especially consideting the http transport fails to be useful in various >>> scenarios. >> >> I beg your pardon? Maybe "s/useful/desirable/"? >> >> In many scenarios, http transport is the _last resort_ against overzealous >> administrators. The fact that you might be lucky enough not to need that >> resort is a blessing, and does not give you the right to ridicule those >> who are unfortunate enough not to share your good luck. > > I think he meant that it is buggy and does not work correctly in > various scenarios. > > Eg: Last I checked it couldn't handle repos where the main branch > wasn''t called master, and I've seen other messages that make me think > it doesn't work correctly on edge cases. Er, I meant that to go to Johannes directly, not to spam the list or the cc's with my hazy recollection, and I should have added: "but perhaps im confusing http and rsync". Sorry for the noise. Yves -- perl -Mre=debug -e "/just|another|perl|hacker/" ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-25 18:57 ` Timo Hirvonen 2009-07-25 19:06 ` Reece Dunn 2009-07-25 20:31 ` Mike Hommey @ 2009-07-25 21:04 ` Carlos R. Mafra 2 siblings, 0 replies; 129+ messages in thread From: Carlos R. Mafra @ 2009-07-25 21:04 UTC (permalink / raw) To: Timo Hirvonen; +Cc: git, Junio C Hamano On Sat 25.Jul'09 at 21:57:39 +0300, Timo Hirvonen wrote: > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > So curl really must die. It may not matter for the expensive operations, > > but a lot of scripting is about running all those "cheap" things that just > > add up over time. > > SELinux is the problem, not curl. I don't have SELinux, and without curl it takes ~50% less time (on top of Linus' previous optimizations!). The time to open() all the libs really sums up to a considerable fraction (when the total time is low, not when compared to the huge 6 secs of before) Without curl: [mafra@Pilar:linux-2.6]$ grep open strace-nocurl.log |grep lib \ > | awk -F "<" '{print $2}' | sed s/\>// | awk '{s += $1} END {print s}' 0.070104 With curl: [mafra@Pilar:linux-2.6]$ grep open strace-curl.log |grep lib \ > | awk -F "<" '{print $2}' | sed s/\>// | awk '{s += $1} END {print s}' 0.249764 PS: It is interesting that in my laptop the time required to open libcurl alone is 20x the total time of 'git branch' for Linus' in his supercomputer: open("/usr/lib64/libcurl.so.4", O_RDONLY) = 3 <0.066239> ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 2:23 ` Linus Torvalds ` (2 preceding siblings ...) 2009-07-23 4:40 ` Junio C Hamano @ 2009-07-23 16:48 ` Anders Kaseorg 2009-07-23 19:03 ` Carlos R. Mafra 3 siblings, 1 reply; 129+ messages in thread From: Anders Kaseorg @ 2009-07-23 16:48 UTC (permalink / raw) To: Linus Torvalds; +Cc: Carlos R. Mafra, Git Mailing List, Junio C Hamano On Wed, 22 Jul 2009, Linus Torvalds wrote: > It uses the "raw" version of 'for_each_ref()' (which doesn't verify that > the ref is valid), and then does the "type verification" before it starts > doing any gentle commit lookup. I submitted essentially the same patch in May: http://article.gmane.org/gmane.comp.version-control.git/120097 with the additional optimization that we don’t need to lookup commits at all unless we’re using -v, --merged, --no-merged, or --contains. In my tests, it makes `git branch` 5 times faster on an uncached linux-2.6 repository. Anders ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 16:48 ` Anders Kaseorg @ 2009-07-23 19:03 ` Carlos R. Mafra 0 siblings, 0 replies; 129+ messages in thread From: Carlos R. Mafra @ 2009-07-23 19:03 UTC (permalink / raw) To: Anders Kaseorg; +Cc: Linus Torvalds, Git Mailing List, Junio C Hamano On Thu 23.Jul'09 at 12:48:20 -0400, Anders Kaseorg wrote: > > I submitted essentially the same patch in May: > http://article.gmane.org/gmane.comp.version-control.git/120097 > with the additional optimization that we don't need to lookup commits at > all unless we're using -v, --merged, --no-merged, or --contains. In my > tests, it makes `git branch` 5 times faster on an uncached linux-2.6 > repository. I also tested your patch even if you said that it was "essentially the same". But after repeating the tests 6 times for both your and Linus' patch (taking care to let the system rest a bit after clearing the cache), your patch is faster, 0.62 +/- 0.24 (Anders) 1.35 +/- 0.23 (Linus) And this is the raw data for your patch, 0.00user 0.01system 0:00.54elapsed 2%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (7major+727minor)pagefaults 0swaps 0.00user 0.00system 0:00.18elapsed 5%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (1major+733minor)pagefaults 0swaps 0.00user 0.00system 0:00.66elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (9major+723minor)pagefaults 0swaps 0.00user 0.01system 0:00.74elapsed 2%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (14major+720minor)pagefaults 0swaps 0.00user 0.00system 0:00.80elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (16major+718minor)pagefaults 0swaps 0.00user 0.00system 0:00.83elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (16major+718minor)pagefaults 0swaps and for Linus' 0.00user 0.01system 0:01.56elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (43major+755minor)pagefaults 0swaps 0.00user 0.01system 0:01.09elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (24major+775minor)pagefaults 0swaps 0.00user 0.01system 0:01.33elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (32major+767minor)pagefaults 0swaps 0.00user 0.00system 0:01.53elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (39major+760minor)pagefaults 0swaps 0.00user 0.01system 0:01.06elapsed 2%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (24major+775minor)pagefaults 0swaps 0.00user 0.00system 0:01.54elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (39major+760minor)pagefaults 0swaps ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-22 23:59 Performance issue of 'git branch' Carlos R. Mafra 2009-07-23 0:21 ` Linus Torvalds @ 2009-07-23 0:23 ` SZEDER Gábor 2009-07-23 2:25 ` Carlos R. Mafra 1 sibling, 1 reply; 129+ messages in thread From: SZEDER Gábor @ 2009-07-23 0:23 UTC (permalink / raw) To: Carlos R. Mafra; +Cc: git Hi, On Thu, Jul 23, 2009 at 01:59:14AM +0200, Carlos R. Mafra wrote: > I don't know why .git/refs/heads/stern does not exist and why it takes > so long with it. That branch is functional ('git checkout stern' succeeds), > as well as all the others. But strangely .git/refs/heads/ contains only > > [mafra@Pilar:linux-2.6]$ ls .git/refs/heads/ > dev-private master sparse > > which, apart from "master", are the last branches that I created. > > I occasionally run 'git gc --aggressive --prune" to optimize the repo, > but other than that I don't do anything fancy, just 'pull' almost > every day and 'bisect' (which is becoming a rare event now :-) > > So I would like to ask what should I do to recover the missing files > in .git/refs/heads/ (which apparently is the cause for my issue) and > how I can avoid losing them in the first place. have a look at .git/packed-refs and 'git pack-refs'. Best, Gábor ^ permalink raw reply [flat|nested] 129+ messages in thread
* Re: Performance issue of 'git branch' 2009-07-23 0:23 ` SZEDER Gábor @ 2009-07-23 2:25 ` Carlos R. Mafra 0 siblings, 0 replies; 129+ messages in thread From: Carlos R. Mafra @ 2009-07-23 2:25 UTC (permalink / raw) To: SZEDER Gábor; +Cc: git Hi, On Wed 22.Jul'09 at 19:23:23 -0500, SZEDER Gábor wrote: > > So I would like to ask what should I do to recover the missing files > > in .git/refs/heads/ (which apparently is the cause for my issue) and > > how I can avoid losing them in the first place. > > have a look at .git/packed-refs and 'git pack-refs'. Yes, now I learned that the files were not really missing as in "there is something wrong". I will also start to use 'git pack-refs --prune' from time to time now, in addition to 'git gc --prune' and 'git repack -d -a'. But the takes-too-long 'git branch' issue is apparently caused by something else. Thanks Gábor, Carlos ^ permalink raw reply [flat|nested] 129+ messages in thread