* [PATCH 0/7] block-sha1: improved SHA1 hashing @ 2009-08-06 15:13 Linus Torvalds 2009-08-06 15:15 ` [PATCH 1/7] block-sha1: add new optimized C 'block-sha1' routines Linus Torvalds 2009-08-06 17:22 ` [PATCH 0/7] block-sha1: improved SHA1 hashing Artur Skawina 0 siblings, 2 replies; 37+ messages in thread From: Linus Torvalds @ 2009-08-06 15:13 UTC (permalink / raw) To: Git Mailing List, Junio C Hamano The bulk of this is the patches I sent out yesterday, but there's a few added tweaks from today there, and it's now one nice series instead of a patch followed by a "Haa, I improved on it" etc, so you can get the whole thing without actually having to hunt for the pieces. It's a series of 7 patches: block-sha1: add new optimized C 'block-sha1' routines block-sha1: try to use rol/ror appropriately block-sha1: make the 'ntohl()' part of the first SHA1 loop block-sha1: re-use the temporary array as we calculate the SHA1 block-sha1: macroize the rounds a bit further block-sha1: Use '(B&C)+(D&(B^C))' instead of '(B&C)|(D&(B|C))' in round 3 block-sha1: get rid of redundant 'lenW' context where the thing is loosely based on the Mozilla SHA1 routines but by the end doesn't really resemble them all that much. The Mozilla ones suck donkey d*ck in so many ways - unnecessary copies, idiotic byte-at-a-time build-up of the hash input etc. The end result is pretty much equivalent in performance to the OpenSSL SHA1 code for me on x86-64. Getting rid of OpenSSL gets rid of another couple of shared library loadings, and as a result my "make -j64 test" is now a couple of seconds faster with this than with OpenSSL in my rather unscientific tests. The code isn't very big: Makefile | 9 +++ block-sha1/sha1.c | 158 +++++++++++++++++++++++++++++++++++++++++++++++++++++ block-sha1/sha1.h | 20 +++++++ 3 files changed, 187 insertions(+), 0 deletions(-) create mode 100644 block-sha1/sha1.c create mode 100644 block-sha1/sha1.h and passes all my tests. It's really hard to not do the SHA1 right and still pass any tests at all, so it should all be good. Enable them with BLK_SHA1=1. Linus ^ permalink raw reply [flat|nested] 37+ messages in thread
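For reference, BLK_SHA1 is a plain make knob once the series is applied, so enabling the new routines is a one-liner; putting the same assignment in config.mak should work too, though that spelling is an assumption about the usual git build setup rather than something stated in this thread:

    make BLK_SHA1=1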
* [PATCH 1/7] block-sha1: add new optimized C 'block-sha1' routines 2009-08-06 15:13 [PATCH 0/7] block-sha1: improved SHA1 hashing Linus Torvalds @ 2009-08-06 15:15 ` Linus Torvalds 2009-08-06 15:16 ` [PATCH 2/7] block-sha1: try to use rol/ror appropriately Linus Torvalds 2009-08-06 17:22 ` [PATCH 0/7] block-sha1: improved SHA1 hashing Artur Skawina 1 sibling, 1 reply; 37+ messages in thread From: Linus Torvalds @ 2009-08-06 15:15 UTC (permalink / raw) To: Git Mailing List, Junio C Hamano From: Linus Torvalds <torvalds@linux-foundation.org> Date: Wed, 5 Aug 2009 14:49:32 -0700 Subject: [PATCH 1/7] block-sha1: add new optimized C 'block-sha1' routines Based on the mozilla SHA1 routine, but doing the input data accesses a word at a time and with 'htonl()' instead of loading bytes and shifting. It requires an architecture that is ok with unaligned 32-bit loads and a fast htonl(). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- Side note: we can get rid of the "arch must support unaligned loads" requirement later on by telling the compiler which ones _can_ be unaligned. So the limitations aren't all that fundamental, really. Makefile | 9 +++ block-sha1/sha1.c | 145 +++++++++++++++++++++++++++++++++++++++++++++++++++++ block-sha1/sha1.h | 21 ++++++++ 3 files changed, 175 insertions(+), 0 deletions(-) diff --git a/Makefile b/Makefile index d7669b1..f12024c 100644 --- a/Makefile +++ b/Makefile @@ -84,6 +84,10 @@ all:: # specify your own (or DarwinPort's) include directories and # library directories by defining CFLAGS and LDFLAGS appropriately. # +# Define BLK_SHA1 environment variable if you want the C version +# of the SHA1 that assumes you can do unaligned 32-bit loads and +# have a fast htonl() function. +# # Define PPC_SHA1 environment variable when running make to make use of # a bundled SHA1 routine optimized for PowerPC. # @@ -1166,6 +1170,10 @@ ifdef NO_DEFLATE_BOUND BASIC_CFLAGS += -DNO_DEFLATE_BOUND endif +ifdef BLK_SHA1 + SHA1_HEADER = "block-sha1/sha1.h" + LIB_OBJS += block-sha1/sha1.o +else ifdef PPC_SHA1 SHA1_HEADER = "ppc/sha1.h" LIB_OBJS += ppc/sha1.o ppc/sha1ppc.o @@ -1183,6 +1191,7 @@ else endif endif endif +endif ifdef NO_PERL_MAKEMAKER export NO_PERL_MAKEMAKER endif diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c new file mode 100644 index 0000000..eef32f7 --- /dev/null +++ b/block-sha1/sha1.c @@ -0,0 +1,145 @@ +/* + * Based on the Mozilla SHA1 (see mozilla-sha1/sha1.c), + * optimized to do word accesses rather than byte accesses, + * and to avoid unnecessary copies into the context array. 
+ */ + +#include <string.h> +#include <arpa/inet.h> + +#include "sha1.h" + +/* Hash one 64-byte block of data */ +static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data); + +void blk_SHA1_Init(blk_SHA_CTX *ctx) +{ + ctx->lenW = 0; + ctx->size = 0; + + /* Initialize H with the magic constants (see FIPS180 for constants) + */ + ctx->H[0] = 0x67452301; + ctx->H[1] = 0xefcdab89; + ctx->H[2] = 0x98badcfe; + ctx->H[3] = 0x10325476; + ctx->H[4] = 0xc3d2e1f0; +} + + +void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len) +{ + int lenW = ctx->lenW; + + ctx->size += (unsigned long long) len << 3; + + /* Read the data into W and process blocks as they get full + */ + if (lenW) { + int left = 64 - lenW; + if (len < left) + left = len; + memcpy(lenW + (char *)ctx->W, data, left); + lenW = (lenW + left) & 63; + len -= left; + data += left; + ctx->lenW = lenW; + if (lenW) + return; + blk_SHA1Block(ctx, ctx->W); + } + while (len >= 64) { + blk_SHA1Block(ctx, data); + data += 64; + len -= 64; + } + if (len) { + memcpy(ctx->W, data, len); + ctx->lenW = len; + } +} + + +void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx) +{ + static const unsigned char pad[64] = { 0x80 }; + unsigned int padlen[2]; + int i; + + /* Pad with a binary 1 (ie 0x80), then zeroes, then length + */ + padlen[0] = htonl(ctx->size >> 32); + padlen[1] = htonl(ctx->size); + + blk_SHA1_Update(ctx, pad, 1+ (63 & (55 - ctx->lenW))); + blk_SHA1_Update(ctx, padlen, 8); + + /* Output hash + */ + for (i = 0; i < 5; i++) + ((unsigned int *)hashout)[i] = htonl(ctx->H[i]); +} + +#define SHA_ROT(X,n) (((X) << (n)) | ((X) >> (32-(n)))) + +static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) +{ + int t; + unsigned int A,B,C,D,E,TEMP; + unsigned int W[80]; + + for (t = 0; t < 16; t++) + W[t] = htonl(data[t]); + + /* Unroll it? 
*/ + for (t = 16; t <= 79; t++) + W[t] = SHA_ROT(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); + + A = ctx->H[0]; + B = ctx->H[1]; + C = ctx->H[2]; + D = ctx->H[3]; + E = ctx->H[4]; + +#define T_0_19(t) \ + TEMP = SHA_ROT(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \ + E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP; + + T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4); + T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9); + T_0_19(10); T_0_19(11); T_0_19(12); T_0_19(13); T_0_19(14); + T_0_19(15); T_0_19(16); T_0_19(17); T_0_19(18); T_0_19(19); + +#define T_20_39(t) \ + TEMP = SHA_ROT(A,5) + (B^C^D) + E + W[t] + 0x6ed9eba1; \ + E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP; + + T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24); + T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29); + T_20_39(30); T_20_39(31); T_20_39(32); T_20_39(33); T_20_39(34); + T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39); + +#define T_40_59(t) \ + TEMP = SHA_ROT(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \ + E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP; + + T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44); + T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49); + T_40_59(50); T_40_59(51); T_40_59(52); T_40_59(53); T_40_59(54); + T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59); + +#define T_60_79(t) \ + TEMP = SHA_ROT(A,5) + (B^C^D) + E + W[t] + 0xca62c1d6; \ + E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP; + + T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64); + T_60_79(65); T_60_79(66); T_60_79(67); T_60_79(68); T_60_79(69); + T_60_79(70); T_60_79(71); T_60_79(72); T_60_79(73); T_60_79(74); + T_60_79(75); T_60_79(76); T_60_79(77); T_60_79(78); T_60_79(79); + + ctx->H[0] += A; + ctx->H[1] += B; + ctx->H[2] += C; + ctx->H[3] += D; + ctx->H[4] += E; +} diff --git a/block-sha1/sha1.h b/block-sha1/sha1.h new file mode 100644 index 0000000..7be2d93 --- /dev/null +++ b/block-sha1/sha1.h @@ -0,0 +1,21 @@ +/* + * Based on the Mozilla SHA1 (see mozilla-sha1/sha1.h), + * optimized to do word accesses rather than byte accesses, + * and to avoid unnecessary copies into the context array. + */ + +typedef struct { + unsigned int H[5]; + unsigned int W[16]; + int lenW; + unsigned long long size; +} blk_SHA_CTX; + +void blk_SHA1_Init(blk_SHA_CTX *ctx); +void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *dataIn, unsigned long len); +void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx); + +#define git_SHA_CTX blk_SHA_CTX +#define git_SHA1_Init blk_SHA1_Init +#define git_SHA1_Update blk_SHA1_Update +#define git_SHA1_Final blk_SHA1_Final -- 1.6.4.31.g154b2.dirty ^ permalink raw reply related [flat|nested] 37+ messages in thread
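A side-by-side gloss on the "word at a time with 'htonl()'" point above may help. This sketch is illustrative only -- the helper names are invented, not from the patch -- and the word-wise version assumes, as the commit message says, an architecture that tolerates unaligned 32-bit loads:

    #include <arpa/inet.h>  /* htonl() */

    /* Mozilla-style: build each big-endian word a byte at a time
     * (four loads, three shifts, three ORs per word) */
    static unsigned int get_be32_bytewise(const unsigned char *p)
    {
        return ((unsigned int)p[0] << 24) | (p[1] << 16) | (p[2] << 8) | p[3];
    }

    /* block-sha1 style: one 32-bit load plus a byte swap; htonl()
     * is a no-op on big-endian hosts and a single bswap on x86 */
    static unsigned int get_be32_wordwise(const unsigned char *p)
    {
        return htonl(*(const unsigned int *)p);
    }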
* [PATCH 2/7] block-sha1: try to use rol/ror appropriately 2009-08-06 15:15 ` [PATCH 1/7] block-sha1: add new optimized C 'block-sha1' routines Linus Torvalds @ 2009-08-06 15:16 ` Linus Torvalds 2009-08-06 15:18 ` [PATCH 3/7] block-sha1: make the 'ntohl()' part of the first SHA1 loop Linus Torvalds 2009-08-06 18:25 ` [PATCH 2/7] block-sha1: try to use rol/ror appropriately Bert Wesarg 0 siblings, 2 replies; 37+ messages in thread From: Linus Torvalds @ 2009-08-06 15:16 UTC (permalink / raw) To: Git Mailing List, Junio C Hamano From: Linus Torvalds <torvalds@linux-foundation.org> Date: Wed, 5 Aug 2009 19:42:15 -0700 Subject: [PATCH 2/7] block-sha1: try to use rol/ror appropriately Use the one with the smaller constant. It _can_ generate slightly smaller code (a constant of 1 is special), but perhaps more importantly it's possibly faster on any uarch that does a rotate with a loop. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- Ok, this is the hackiest of the bunch. I considered dropping it. But I do suspect we want something like this, even if we might end up massaging the asm for different compilers etc. block-sha1/sha1.c | 32 ++++++++++++++++++++++---------- 1 files changed, 22 insertions(+), 10 deletions(-) diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c index eef32f7..a45a3de 100644 --- a/block-sha1/sha1.c +++ b/block-sha1/sha1.c @@ -80,7 +80,19 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx) ((unsigned int *)hashout)[i] = htonl(ctx->H[i]); } -#define SHA_ROT(X,n) (((X) << (n)) | ((X) >> (32-(n)))) +#if defined(__i386__) || defined(__x86_64__) + +#define SHA_ASM(op, x, n) ({ unsigned int __res; asm(op " %1,%0":"=r" (__res):"i" (n), "0" (x)); __res; }) +#define SHA_ROL(x,n) SHA_ASM("rol", x, n) +#define SHA_ROR(x,n) SHA_ASM("ror", x, n) + +#else + +#define SHA_ROT(X,n) (((X) << (l)) | ((X) >> (r))) +#define SHA_ROL(X,n) SHA_ROT(X,n,32-(n)) +#define SHA_ROR(X,n) SHA_ROT(X,32-(n),n) + +#endif static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) { @@ -93,7 +105,7 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) /* Unroll it? 
*/ for (t = 16; t <= 79; t++) - W[t] = SHA_ROT(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); + W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); A = ctx->H[0]; B = ctx->H[1]; @@ -102,8 +114,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) E = ctx->H[4]; #define T_0_19(t) \ - TEMP = SHA_ROT(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \ - E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP; + TEMP = SHA_ROL(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \ + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4); T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9); @@ -111,8 +123,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) T_0_19(15); T_0_19(16); T_0_19(17); T_0_19(18); T_0_19(19); #define T_20_39(t) \ - TEMP = SHA_ROT(A,5) + (B^C^D) + E + W[t] + 0x6ed9eba1; \ - E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP; + TEMP = SHA_ROL(A,5) + (B^C^D) + E + W[t] + 0x6ed9eba1; \ + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24); T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29); @@ -120,8 +132,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39); #define T_40_59(t) \ - TEMP = SHA_ROT(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \ - E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP; + TEMP = SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \ + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44); T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49); @@ -129,8 +141,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59); #define T_60_79(t) \ - TEMP = SHA_ROT(A,5) + (B^C^D) + E + W[t] + 0xca62c1d6; \ - E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP; + TEMP = SHA_ROL(A,5) + (B^C^D) + E + W[t] + 0xca62c1d6; \ + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64); T_60_79(65); T_60_79(66); T_60_79(67); T_60_79(68); T_60_79(69); -- 1.6.4.31.g154b2.dirty ^ permalink raw reply related [flat|nested] 37+ messages in thread
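For what it's worth, the generic fallback above is the classic two-shift rotate idiom, which gcc of this vintage usually recognizes and collapses into a single rotate instruction anyway; the inline asm only forces the choice between rol and ror. As a sketch, here is the three-argument SHA_ROT this hunk evidently intends (Bert Wesarg points out the mismatched parameter list later in the thread):

    /* rotate via two shifts and an OR; callers keep n in 1..31 */
    #define SHA_ROT(X,l,r)  (((X) << (l)) | ((X) >> (r)))
    #define SHA_ROL(X,n)    SHA_ROT(X,n,32-(n))
    #define SHA_ROR(X,n)    SHA_ROT(X,32-(n),n)

Note that SHA_ROR(B, 2) in the rounds is exactly the old SHA_ROT(B, 30) spelled with the smaller constant, which is the point of the patch.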
* [PATCH 3/7] block-sha1: make the 'ntohl()' part of the first SHA1 loop 2009-08-06 15:16 ` [PATCH 2/7] block-sha1: try to use rol/ror appropriately Linus Torvalds @ 2009-08-06 15:18 ` Linus Torvalds 2009-08-06 15:20 ` [PATCH 4/7] block-sha1: re-use the temporary array as we calculate the SHA1 Linus Torvalds 2009-08-06 18:25 ` [PATCH 2/7] block-sha1: try to use rol/ror appropriately Bert Wesarg 1 sibling, 1 reply; 37+ messages in thread From: Linus Torvalds @ 2009-08-06 15:18 UTC (permalink / raw) To: Git Mailing List, Junio C Hamano From: Linus Torvalds <torvalds@linux-foundation.org> Date: Wed, 5 Aug 2009 20:28:07 -0700 Subject: [PATCH 3/7] block-sha1: make the 'ntohl()' part of the first SHA1 loop This helps a teeny bit. But what I -really- want to do is to avoid the whole 80-array loop, and do the xor updates as I go along.. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- This is a pretty trivial one, but it was the first stage to getting rid of the annoying 80-word array that not only wastes precious L1 cache, but that loop to initialize it was a noticeable cost. block-sha1/sha1.c | 28 ++++++++++++++++------------ 1 files changed, 16 insertions(+), 12 deletions(-) diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c index a45a3de..39a5bbb 100644 --- a/block-sha1/sha1.c +++ b/block-sha1/sha1.c @@ -100,27 +100,31 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) unsigned int A,B,C,D,E,TEMP; unsigned int W[80]; - for (t = 0; t < 16; t++) - W[t] = htonl(data[t]); - - /* Unroll it? */ - for (t = 16; t <= 79; t++) - W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); - A = ctx->H[0]; B = ctx->H[1]; C = ctx->H[2]; D = ctx->H[3]; E = ctx->H[4]; -#define T_0_19(t) \ +#define T_0_15(t) \ + TEMP = htonl(data[t]); W[t] = TEMP; \ + TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \ + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \ + + T_0_15( 0); T_0_15( 1); T_0_15( 2); T_0_15( 3); T_0_15( 4); + T_0_15( 5); T_0_15( 6); T_0_15( 7); T_0_15( 8); T_0_15( 9); + T_0_15(10); T_0_15(11); T_0_15(12); T_0_15(13); T_0_15(14); + T_0_15(15); + + /* Unroll it? */ + for (t = 16; t <= 79; t++) + W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); + +#define T_16_19(t) \ TEMP = SHA_ROL(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \ E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; - T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4); - T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9); - T_0_19(10); T_0_19(11); T_0_19(12); T_0_19(13); T_0_19(14); - T_0_19(15); T_0_19(16); T_0_19(17); T_0_19(18); T_0_19(19); + T_16_19(16); T_16_19(17); T_16_19(18); T_16_19(19); #define T_20_39(t) \ TEMP = SHA_ROL(A,5) + (B^C^D) + E + W[t] + 0x6ed9eba1; \ -- 1.6.4.31.g154b2.dirty ^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH 4/7] block-sha1: re-use the temporary array as we calculate the SHA1 2009-08-06 15:18 ` [PATCH 3/7] block-sha1: make the 'ntohl()' part of the first SHA1 loop Linus Torvalds @ 2009-08-06 15:20 ` Linus Torvalds 2009-08-06 15:22 ` [PATCH 5/7] block-sha1: macroize the rounds a bit further Linus Torvalds 0 siblings, 1 reply; 37+ messages in thread From: Linus Torvalds @ 2009-08-06 15:20 UTC (permalink / raw) To: Git Mailing List, Junio C Hamano From: Linus Torvalds <torvalds@linux-foundation.org> Date: Wed, 5 Aug 2009 20:49:41 -0700 Subject: [PATCH 4/7] block-sha1: re-use the temporary array as we calculate the SHA1 The mozilla-SHA1 code did this 80-word array for the 80 iterations. But the SHA1 state is really just 512 bits, and you can actually keep it in a kind of "circular queue" of just 16 words instead. This requires us to do the xor updates as we go along (rather than as a pre-phase), but that's really what we want to do anyway. This gets me really close to the OpenSSL performance on my Nehalem. Look ma, all C code (ok, there's the rol/ror hack, but that one doesn't strictly even matter on my Nehalem, it's just a local optimization). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- This gets rid of the annoying pre-initialization and its need for that extra space. We can dynamically iterate over a 512-bit circular buffer instead. block-sha1/sha1.c | 28 ++++++++++++++++------------ 1 files changed, 16 insertions(+), 12 deletions(-) diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c index 39a5bbb..80193d4 100644 --- a/block-sha1/sha1.c +++ b/block-sha1/sha1.c @@ -96,9 +96,8 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx) static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) { - int t; unsigned int A,B,C,D,E,TEMP; - unsigned int W[80]; + unsigned int array[16]; A = ctx->H[0]; B = ctx->H[1]; @@ -107,8 +106,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) E = ctx->H[4]; #define T_0_15(t) \ - TEMP = htonl(data[t]); W[t] = TEMP; \ - TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \ + TEMP = htonl(data[t]); array[t] = TEMP; \ + TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \ E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \ T_0_15( 0); T_0_15( 1); T_0_15( 2); T_0_15( 3); T_0_15( 4); @@ -116,18 +115,21 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) T_0_15(10); T_0_15(11); T_0_15(12); T_0_15(13); T_0_15(14); T_0_15(15); - /* Unroll it? 
*/ - for (t = 16; t <= 79; t++) - W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); +/* This "rolls" over the 512-bit array */ +#define W(x) (array[(x)&15]) +#define SHA_XOR(t) \ + TEMP = SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1); W(t) = TEMP; #define T_16_19(t) \ - TEMP = SHA_ROL(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \ - E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; + SHA_XOR(t); \ + TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \ + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \ T_16_19(16); T_16_19(17); T_16_19(18); T_16_19(19); #define T_20_39(t) \ - TEMP = SHA_ROL(A,5) + (B^C^D) + E + W[t] + 0x6ed9eba1; \ + SHA_XOR(t); \ + TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0x6ed9eba1; \ E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24); @@ -136,7 +138,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39); #define T_40_59(t) \ - TEMP = SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \ + SHA_XOR(t); \ + TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + 0x8f1bbcdc; \ E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44); @@ -145,7 +148,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59); #define T_60_79(t) \ - TEMP = SHA_ROL(A,5) + (B^C^D) + E + W[t] + 0xca62c1d6; \ + SHA_XOR(t); \ + TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0xca62c1d6; \ E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64); -- 1.6.4.31.g154b2.dirty ^ permalink raw reply related [flat|nested] 37+ messages in thread
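Why 16 words are enough deserves spelling out: the schedule only ever reads W[t-3], W[t-8], W[t-14] and W[t-16], all within the last sixteen values, and slot t&15 is exactly the one holding the expiring W[t-16], so it can be overwritten in place. Since (t-3)&15 == (t+13)&15 (and likewise for the other taps), the patch writes them as t+13, t+8, t+2, t. A throwaway check -- not part of the series -- that the rolling window reproduces the flat 80-word schedule:

    #include <assert.h>

    #define SHA_ROL(X,n)  (((X) << (n)) | ((X) >> (32-(n))))
    #define W(x)          (array[(x)&15])

    int main(void)
    {
        unsigned int flat[80], array[16];
        int t;

        /* arbitrary input block */
        for (t = 0; t < 16; t++)
            W(t) = flat[t] = 0x9e3779b9 * (t + 1);

        for (t = 16; t < 80; t++) {
            flat[t] = SHA_ROL(flat[t-3] ^ flat[t-8] ^ flat[t-14] ^ flat[t-16], 1);
            W(t) = SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1);
            assert(W(t) == flat[t]);
        }
        return 0;
    }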
* [PATCH 5/7] block-sha1: macroize the rounds a bit further 2009-08-06 15:20 ` [PATCH 4/7] block-sha1: re-use the temporary array as we calculate the SHA1 Linus Torvalds @ 2009-08-06 15:22 ` Linus Torvalds 2009-08-06 15:24 ` [PATCH 6/7] block-sha1: Use '(B&C)+(D&(B^C))' instead of '(B&C)|(D&(B|C))' in round 3 Linus Torvalds 0 siblings, 1 reply; 37+ messages in thread From: Linus Torvalds @ 2009-08-06 15:22 UTC (permalink / raw) To: Git Mailing List, Junio C Hamano From: Linus Torvalds <torvalds@linux-foundation.org> Date: Thu, 6 Aug 2009 07:20:54 -0700 Subject: [PATCH 5/7] block-sha1: macroize the rounds a bit further Avoid repeating the shared parts of the different rounds by adding a macro layer or two. It was already more cpp than C. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- This makes things denser, and puts all the core rules in one place. That first hunk really contains just about all of the important parts of SHA1. The rest is just fluff and the necessary expansions etc. block-sha1/sha1.c | 56 ++++++++++++++++++++++++---------------------------- 1 files changed, 26 insertions(+), 30 deletions(-) diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c index 80193d4..4837d58 100644 --- a/block-sha1/sha1.c +++ b/block-sha1/sha1.c @@ -94,6 +94,27 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx) #endif +/* This "rolls" over the 512-bit array */ +#define W(x) (array[(x)&15]) + +/* + * Where do we get the source from? The first 16 iterations get it from + * the input data, the next mix it from the 512-bit array. + */ +#define SHA_SRC(t) htonl(data[t]) +#define SHA_MIX(t) SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1) + +#define SHA_ROUND(t, input, fn, constant) \ + TEMP = input(t); W(t) = TEMP; \ + TEMP += SHA_ROL(A,5) + (fn) + E + (constant); \ + E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP + +#define T_0_15(t) SHA_ROUND(t, SHA_SRC, (((C^D)&B)^D) , 0x5a827999 ) +#define T_16_19(t) SHA_ROUND(t, SHA_MIX, (((C^D)&B)^D) , 0x5a827999 ) +#define T_20_39(t) SHA_ROUND(t, SHA_MIX, (B^C^D) , 0x6ed9eba1 ) +#define T_40_59(t) SHA_ROUND(t, SHA_MIX, ((B&C)|(D&(B|C))) , 0x8f1bbcdc ) +#define T_60_79(t) SHA_ROUND(t, SHA_MIX, (B^C^D) , 0xca62c1d6 ) + static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) { unsigned int A,B,C,D,E,TEMP; @@ -105,53 +126,28 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) D = ctx->H[3]; E = ctx->H[4]; -#define T_0_15(t) \ - TEMP = htonl(data[t]); array[t] = TEMP; \ - TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \ - E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \ - + /* Round 1 - iterations 0-16 take their input from 'data' */ T_0_15( 0); T_0_15( 1); T_0_15( 2); T_0_15( 3); T_0_15( 4); T_0_15( 5); T_0_15( 6); T_0_15( 7); T_0_15( 8); T_0_15( 9); T_0_15(10); T_0_15(11); T_0_15(12); T_0_15(13); T_0_15(14); T_0_15(15); -/* This "rolls" over the 512-bit array */ -#define W(x) (array[(x)&15]) -#define SHA_XOR(t) \ - TEMP = SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1); W(t) = TEMP; - -#define T_16_19(t) \ - SHA_XOR(t); \ - TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \ - E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \ - + /* Round 1 - tail. 
Input from 512-bit mixing array */ T_16_19(16); T_16_19(17); T_16_19(18); T_16_19(19); -#define T_20_39(t) \ - SHA_XOR(t); \ - TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0x6ed9eba1; \ - E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; - + /* Round 2 */ T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24); T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29); T_20_39(30); T_20_39(31); T_20_39(32); T_20_39(33); T_20_39(34); T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39); -#define T_40_59(t) \ - SHA_XOR(t); \ - TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + 0x8f1bbcdc; \ - E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; - + /* Round 3 */ T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44); T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49); T_40_59(50); T_40_59(51); T_40_59(52); T_40_59(53); T_40_59(54); T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59); -#define T_60_79(t) \ - SHA_XOR(t); \ - TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0xca62c1d6; \ - E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; - + /* Round 4 */ T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64); T_60_79(65); T_60_79(66); T_60_79(67); T_60_79(68); T_60_79(69); T_60_79(70); T_60_79(71); T_60_79(72); T_60_79(73); T_60_79(74); -- 1.6.4.31.g154b2.dirty ^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH 6/7] block-sha1: Use '(B&C)+(D&(B^C))' instead of '(B&C)|(D&(B|C))' in round 3 2009-08-06 15:22 ` [PATCH 5/7] block-sha1: macroize the rounds a bit further Linus Torvalds @ 2009-08-06 15:24 ` Linus Torvalds 2009-08-06 15:25 ` [PATCH 7/7] block-sha1: get rid of redundant 'lenW' context Linus Torvalds 0 siblings, 1 reply; 37+ messages in thread From: Linus Torvalds @ 2009-08-06 15:24 UTC (permalink / raw) To: Git Mailing List, Junio C Hamano From: Linus Torvalds <torvalds@linux-foundation.org> Date: Thu, 6 Aug 2009 07:27:57 -0700 Subject: [PATCH 6/7] block-sha1: Use '(B&C)+(D&(B^C))' instead of '(B&C)|(D&(B|C))' in round 3 It's an equivalent expression, but the '+' gives us some freedom in instruction selection (for example, we can use 'lea' rather than 'add'), and associates with the other additions around it to give some minor scheduling freedom. Suggested-by: linux@horizon.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- A small micro-optimization. It may not matter hugely, but it's definitely the right thing to do. And with the new macroized thing, you can see clearly what part of the SHA1 rounds it affects, and how. block-sha1/sha1.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c index 4837d58..9a060a6 100644 --- a/block-sha1/sha1.c +++ b/block-sha1/sha1.c @@ -112,7 +112,7 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx) #define T_0_15(t) SHA_ROUND(t, SHA_SRC, (((C^D)&B)^D) , 0x5a827999 ) #define T_16_19(t) SHA_ROUND(t, SHA_MIX, (((C^D)&B)^D) , 0x5a827999 ) #define T_20_39(t) SHA_ROUND(t, SHA_MIX, (B^C^D) , 0x6ed9eba1 ) -#define T_40_59(t) SHA_ROUND(t, SHA_MIX, ((B&C)|(D&(B|C))) , 0x8f1bbcdc ) +#define T_40_59(t) SHA_ROUND(t, SHA_MIX, ((B&C)+(D&(B^C))) , 0x8f1bbcdc ) #define T_60_79(t) SHA_ROUND(t, SHA_MIX, (B^C^D) , 0xca62c1d6 ) static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) -- 1.6.4.31.g154b2.dirty ^ permalink raw reply related [flat|nested] 37+ messages in thread
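Both round-3 forms compute the bitwise majority of B, C and D, and the '+' can't carry because the two terms are bitwise disjoint: B&C is set only where B and C agree (both 1), D&(B^C) only where they differ. Since each bit position is independent, checking the eight 1-bit combinations covers all 32-bit inputs; a throwaway verifier, for illustration only:

    #include <assert.h>

    int main(void)
    {
        unsigned int b, c, d;

        for (b = 0; b < 2; b++)
            for (c = 0; c < 2; c++)
                for (d = 0; d < 2; d++)
                    assert(((b&c)|(d&(b|c))) == ((b&c)+(d&(b^c))));
        return 0;
    }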
* [PATCH 7/7] block-sha1: get rid of redundant 'lenW' context 2009-08-06 15:24 ` [PATCH 6/7] block-sha1: Use '(B&C)+(D&(B^C))' instead of '(B&C)|(D&(B|C))' in round 3 Linus Torvalds @ 2009-08-06 15:25 ` Linus Torvalds 0 siblings, 0 replies; 37+ messages in thread From: Linus Torvalds @ 2009-08-06 15:25 UTC (permalink / raw) To: Git Mailing List, Junio C Hamano From: Linus Torvalds <torvalds@linux-foundation.org> Date: Thu, 6 Aug 2009 07:45:46 -0700 Subject: [PATCH 7/7] block-sha1: get rid of redundant 'lenW' context .. and simplify the ctx->size logic. We now count the size in bytes, which means that 'lenW' was always just the low 6 bits of the total size, so we don't carry it around separately any more. And we do the 'size in bits' shift at the end. Suggested by Nicolas Pitre and linux@horizon.com. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- Some final cleanup. Based on two separate discussions yesterday. Trivial and doesn't make any difference that I can tell, but definitely the right thing to do. block-sha1/sha1.c | 17 +++++++---------- block-sha1/sha1.h | 1 - 2 files changed, 7 insertions(+), 11 deletions(-) diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c index 9a060a6..78dcb0c 100644 --- a/block-sha1/sha1.c +++ b/block-sha1/sha1.c @@ -14,7 +14,6 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data); void blk_SHA1_Init(blk_SHA_CTX *ctx) { - ctx->lenW = 0; ctx->size = 0; /* Initialize H with the magic constants (see FIPS180 for constants) @@ -29,9 +28,9 @@ void blk_SHA1_Init(blk_SHA_CTX *ctx) void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len) { - int lenW = ctx->lenW; + int lenW = ctx->size & 63; - ctx->size += (unsigned long long) len << 3; + ctx->size += len; /* Read the data into W and process blocks as they get full */ @@ -43,7 +42,6 @@ void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len) lenW = (lenW + left) & 63; len -= left; data += left; - ctx->lenW = lenW; if (lenW) return; blk_SHA1Block(ctx, ctx->W); @@ -53,10 +51,8 @@ void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len) data += 64; len -= 64; } - if (len) { + if (len) memcpy(ctx->W, data, len); - ctx->lenW = len; - } } @@ -68,10 +64,11 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx) /* Pad with a binary 1 (ie 0x80), then zeroes, then length */ - padlen[0] = htonl(ctx->size >> 32); - padlen[1] = htonl(ctx->size); + padlen[0] = htonl(ctx->size >> 29); + padlen[1] = htonl(ctx->size << 3); - blk_SHA1_Update(ctx, pad, 1+ (63 & (55 - ctx->lenW))); + i = ctx->size & 63; + blk_SHA1_Update(ctx, pad, 1+ (63 & (55 - i))); blk_SHA1_Update(ctx, padlen, 8); /* Output hash diff --git a/block-sha1/sha1.h b/block-sha1/sha1.h index 7be2d93..c1ae74d 100644 --- a/block-sha1/sha1.h +++ b/block-sha1/sha1.h @@ -7,7 +7,6 @@ typedef struct { unsigned int H[5]; unsigned int W[16]; - int lenW; unsigned long long size; } blk_SHA_CTX; -- 1.6.4.31.g154b2.dirty ^ permalink raw reply related [flat|nested] 37+ messages in thread
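A gloss on the new size bookkeeping: ctx->size now counts bytes, so the trailing big-endian bit count is assembled at the very end as the pair (size >> 29, size << 3) -- that is, (size*8) >> 32 and the low 32 bits of size*8 -- instead of carrying a bit count all along. The pad expression 1 + (63 & (55 - (size & 63))) always lands the total at 56 mod 64, leaving exactly the 8 bytes the length needs to close the final block. A quick sanity check of that expression (illustrative only):

    #include <assert.h>

    int main(void)
    {
        unsigned long long size;

        for (size = 0; size < 256; size++) {
            unsigned long long pad = 1 + (63 & (55 - (size & 63)));
            assert(pad >= 1 && pad <= 64);
            assert((size + pad) % 64 == 56);
        }
        return 0;
    }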
* Re: [PATCH 2/7] block-sha1: try to use rol/ror appropriately 2009-08-06 15:16 ` [PATCH 2/7] block-sha1: try to use rol/ror appropriately Linus Torvalds 2009-08-06 15:18 ` [PATCH 3/7] block-sha1: make the 'ntohl()' part of the first SHA1 loop Linus Torvalds @ 2009-08-06 18:25 ` Bert Wesarg 1 sibling, 0 replies; 37+ messages in thread From: Bert Wesarg @ 2009-08-06 18:25 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano Hi, On Thu, Aug 6, 2009 at 17:16, Linus Torvalds <torvalds@linux-foundation.org> wrote: > diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c > index eef32f7..a45a3de 100644 > --- a/block-sha1/sha1.c > +++ b/block-sha1/sha1.c > @@ -80,7 +80,19 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx) > ((unsigned int *)hashout)[i] = htonl(ctx->H[i]); > } > > -#define SHA_ROT(X,n) (((X) << (n)) | ((X) >> (32-(n)))) > +#if defined(__i386__) || defined(__x86_64__) > + > +#define SHA_ASM(op, x, n) ({ unsigned int __res; asm(op " %1,%0":"=r" (__res):"i" (n), "0" (x)); __res; }) > +#define SHA_ROL(x,n) SHA_ASM("rol", x, n) > +#define SHA_ROR(x,n) SHA_ASM("ror", x, n) > + > +#else > + > +#define SHA_ROT(X,n) (((X) << (l)) | ((X) >> (r))) I suspect this should be: #define SHA_ROT(X,l,r) (((X) << (l)) | ((X) >> (r))) > +#define SHA_ROL(X,n) SHA_ROT(X,n,32-(n)) > +#define SHA_ROR(X,n) SHA_ROT(X,32-(n),n) > + > +#endif > > static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) > { Bert ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing 2009-08-06 15:13 [PATCH 0/7] block-sha1: improved SHA1 hashing Linus Torvalds 2009-08-06 15:15 ` [PATCH 1/7] block-sha1: add new optimized C 'block-sha1' routines Linus Torvalds @ 2009-08-06 17:22 ` Artur Skawina 2009-08-06 18:09 ` Linus Torvalds 1 sibling, 1 reply; 37+ messages in thread From: Artur Skawina @ 2009-08-06 17:22 UTC (permalink / raw) To: Git Mailing List Linus Torvalds wrote: > where the thing is loosely based on the Mozilla SHA1 routines but by the > end doesn't really resemble them all that much. The Mozilla ones suck > donkey d*ck in so many ways - unnecessary copies, idiotic byte-at-a-time > build-up of the hash input etc. > > The end result is pretty much equivalent in performance to the OpenSSL > SHA1 code for me on x86-64. Getting rid of OpenSSL gets rid of another For those curious just how close the C version is to the various asm and C implementations, the q&d microbenchmark is at http://www.src.multimo.pl/YDpqIo7Li27O0L0h/sha1bench.tar.gz In short: 88% of openssl speed on P3, 42% on P4, 66% on Atom. artur ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing 2009-08-06 17:22 ` [PATCH 0/7] block-sha1: improved SHA1 hashing Artur Skawina @ 2009-08-06 18:09 ` Linus Torvalds 2009-08-06 19:10 ` Artur Skawina 0 siblings, 1 reply; 37+ messages in thread From: Linus Torvalds @ 2009-08-06 18:09 UTC (permalink / raw) To: Artur Skawina; +Cc: Git Mailing List On Thu, 6 Aug 2009, Artur Skawina wrote: > > For those curious just how close the C version is to the various > asm and C implementations, the q&d microbenchmark is at > http://www.src.multimo.pl/YDpqIo7Li27O0L0h/sha1bench.tar.gz Hmm. That thing doesn't work at all on x86-64. Even apart from the asm sources, your timing thing does some really odd things (why do you do that odd "iret" in GETCYCLES and GETTIME?). You're better off using lfence/mfence/cpuid, and I think you could make it work on 64-bit that way too. I just hacked it away for testing. > In short: 88% of openssl speed on P3, 42% on P4, 66% on Atom. I'll use this to see if I can improve the 32-bit case. On Nehalem, with your benchmark, I get: # TIME[s] SPEED[MB/s] rfc3174 5.122 119.2 # New hash result: d829b9e028e64840094ab6702f9acdf11bec3937 rfc3174 5.153 118.5 linus 2.092 291.8 linusas 2.056 296.8 linusas2 1.909 319.8 mozilla 5.139 118.8 mozillaas 5.775 105.7 openssl 1.627 375.1 spelvin 1.678 363.7 spelvina 1.603 380.8 nettle 1.592 383.4 And with the hacked version to get some 64-bit numbers: # TIME[s] SPEED[MB/s] rfc3174 3.992 152.9 # New hash result: b78fd74c0033a4dfe0ededccb85ab00cb56880ab rfc3174 3.991 152.9 linus 1.54 396.3 linusas 1.533 398.1 linusas2 1.603 380.9 mozilla 4.352 140.3 mozillaas 4.227 144.4 so as you can see, your improvements in 32-bit mode are actually de-provements in 64-bit mode (ok, your first one seems to be a tiny improvement, but I think it's in the noise). But you're right, I need to try to improve the 32-bit case. Linus ^ permalink raw reply [flat|nested] 37+ messages in thread
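On the timing-harness point: the reason for cpuid (or lfence on newer cores) is to keep earlier instructions from executing past the timestamp read. A minimal serialized-rdtsc sketch for gcc, assuming x86-64 -- a 32-bit PIC build would additionally have to preserve %ebx around cpuid:

    static inline unsigned long long rdtsc_serialized(void)
    {
        unsigned int lo, hi;

        /* cpuid is a serializing instruction; rdtsc then reads
         * the timestamp counter into edx:eax */
        asm volatile("cpuid\n\t"
                     "rdtsc"
                     : "=a" (lo), "=d" (hi)
                     : "a" (0)
                     : "%rbx", "%rcx");
        return ((unsigned long long)hi << 32) | lo;
    }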
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing 2009-08-06 18:09 ` Linus Torvalds @ 2009-08-06 19:10 ` Artur Skawina 2009-08-06 19:41 ` Linus Torvalds 0 siblings, 1 reply; 37+ messages in thread From: Artur Skawina @ 2009-08-06 19:10 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List Linus Torvalds wrote: > > On Thu, 6 Aug 2009, Artur Skawina wrote: >> For those curious just how close the C version is to the various >> asm and C implementations, the q&d microbenchmark is at >> http://www.src.multimo.pl/YDpqIo7Li27O0L0h/sha1bench.tar.gz > > Hmm. That thing doesn't work at all on x86-64. Even apart from the asm > sources, your timing thing does some really odd things (why do you do that > odd "iret" in GETCYCLES and GETTIME?). You're better off using > lfence/mfence/cpuid, and I think you could make it work on 64-bit that > way too. yes, it's 32-bit only, i should have mentioned that. The timing code was written more than a decade ago, it really works on p2, haven't updated it, it's all just c&p'ed ever since. All of it can be safely disabled; on p2 you could account for every cycle, nowadays gettimeofday is more than enough. > I just hacked it away for testing. > >> In short: 88% of openssl speed on P3, 42% on P4, 66% on Atom. > > I'll use this to see if I can improve the 32-bit case. > > On Nehalem, with your benchmark, I get: > > # TIME[s] SPEED[MB/s] > rfc3174 5.122 119.2 > # New hash result: d829b9e028e64840094ab6702f9acdf11bec3937 > rfc3174 5.153 118.5 > linus 2.092 291.8 > linusas 2.056 296.8 > linusas2 1.909 319.8 > mozilla 5.139 118.8 > mozillaas 5.775 105.7 > openssl 1.627 375.1 > spelvin 1.678 363.7 > spelvina 1.603 380.8 > nettle 1.592 383.4 > > And with the hacked version to get some 64-bit numbers: > > # TIME[s] SPEED[MB/s] > rfc3174 3.992 152.9 > # New hash result: b78fd74c0033a4dfe0ededccb85ab00cb56880ab > rfc3174 3.991 152.9 > linus 1.54 396.3 > linusas 1.533 398.1 > linusas2 1.603 380.9 > mozilla 4.352 140.3 > mozillaas 4.227 144.4 > > so as you can see, your improvements in 32-bit mode are actually > de-provements in 64-bit mode (ok, your first one seems to be a tiny > improvement, but I think it's in the noise). Actually i didn't keep anything that wasn't a win, one reason why linusas2 stayed was that it really surprised me, i'd have expected gcc to do a lot worse w/ the many temporaries and the compiler came up w/ a 70% gain; gcc really must have improved when i wasn't looking. > But you're right, I need to try to improve the 32-bit case. I never said anything like that. :) there probably isn't all that much that can be done. I tried a few things, but never saw any improvement above measurement noise (a few percent). Would have thought that overlapping the iterations a bit would be a gain, but that didn't do much (-20%..0), maybe on 64 bit, with more registers... Oh, i noticed that '-mtune' makes quite a difference, it can change the relative performance of the functions significantly, in unobvious ways; depending on which cpu gcc tunes for (build config or -mtune); some implementations slow down, others become a bit faster. artur ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing 2009-08-06 19:10 ` Artur Skawina @ 2009-08-06 19:41 ` Linus Torvalds 2009-08-06 20:08 ` Artur Skawina 0 siblings, 1 reply; 37+ messages in thread From: Linus Torvalds @ 2009-08-06 19:41 UTC (permalink / raw) To: Artur Skawina; +Cc: Git Mailing List On Thu, 6 Aug 2009, Artur Skawina wrote: > > Oh, i noticed that '-mtune' makes quite a difference, it can change > the relative performance of the functions significantly, in unobvious > ways; depending on which cpu gcc tunes for (build config or -mtune); > some implementations slow down, others become a bit faster. That probably is mainly true for P4, although it's quite possible that it has an effect for just what the register allocator does, and then for spilling. And it looks like _all_ the tweakability is in the spilling. Nothing else matters. How does this patch work for you? It avoids doing that C-level register rotation, and instead rotates the register names with the preprocessor. I realize it's ugly as hell, but it does make it easier for gcc to see what's going on. The patch is against my git patches, but I think it should apply pretty much as-is to your sha1bench sources too. Does it make any difference for you? Linus --- block-sha1/sha1.c | 117 ++++++++++++++++++++++++++++++++++++++++------------ 1 files changed, 90 insertions(+), 27 deletions(-) diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c index 78dcb0c..ac47162 100644 --- a/block-sha1/sha1.c +++ b/block-sha1/sha1.c @@ -101,20 +101,20 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx) #define SHA_SRC(t) htonl(data[t]) #define SHA_MIX(t) SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1) -#define SHA_ROUND(t, input, fn, constant) \ - TEMP = input(t); W(t) = TEMP; \ - TEMP += SHA_ROL(A,5) + (fn) + E + (constant); \ - E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP +#define SHA_ROUND(t, input, fn, constant, A, B, C, D, E) do { \ + unsigned int TEMP = input(t); W(t) = TEMP; \ + TEMP += E + SHA_ROL(A,5) + (fn) + (constant); \ + B = SHA_ROR(B, 2); E = TEMP; } while (0) -#define T_0_15(t) SHA_ROUND(t, SHA_SRC, (((C^D)&B)^D) , 0x5a827999 ) -#define T_16_19(t) SHA_ROUND(t, SHA_MIX, (((C^D)&B)^D) , 0x5a827999 ) -#define T_20_39(t) SHA_ROUND(t, SHA_MIX, (B^C^D) , 0x6ed9eba1 ) -#define T_40_59(t) SHA_ROUND(t, SHA_MIX, ((B&C)+(D&(B^C))) , 0x8f1bbcdc ) -#define T_60_79(t) SHA_ROUND(t, SHA_MIX, (B^C^D) , 0xca62c1d6 ) +#define T_0_15(t, A, B, C, D, E) SHA_ROUND(t, SHA_SRC, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E ) +#define T_16_19(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E ) +#define T_20_39(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (B^C^D) , 0x6ed9eba1, A, B, C, D, E ) +#define T_40_59(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, ((B&C)+(D&(B^C))) , 0x8f1bbcdc, A, B, C, D, E ) +#define T_60_79(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (B^C^D) , 0xca62c1d6, A, B, C, D, E ) static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) { - unsigned int A,B,C,D,E,TEMP; + unsigned int A,B,C,D,E; unsigned int array[16]; A = ctx->H[0]; @@ -124,31 +124,94 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data) E = ctx->H[4]; /* Round 1 - iterations 0-16 take their input from 'data' */ - T_0_15( 0); T_0_15( 1); T_0_15( 2); T_0_15( 3); T_0_15( 4); - T_0_15( 5); T_0_15( 6); T_0_15( 7); T_0_15( 8); T_0_15( 9); - T_0_15(10); T_0_15(11); T_0_15(12); T_0_15(13); T_0_15(14); - T_0_15(15); + T_0_15( 0, A, B, C, D, E); + T_0_15( 1, E, A, B, C, D); + T_0_15( 2, D, E, A, B, C); + T_0_15( 3, 
C, D, E, A, B); + T_0_15( 4, B, C, D, E, A); + T_0_15( 5, A, B, C, D, E); + T_0_15( 6, E, A, B, C, D); + T_0_15( 7, D, E, A, B, C); + T_0_15( 8, C, D, E, A, B); + T_0_15( 9, B, C, D, E, A); + T_0_15(10, A, B, C, D, E); + T_0_15(11, E, A, B, C, D); + T_0_15(12, D, E, A, B, C); + T_0_15(13, C, D, E, A, B); + T_0_15(14, B, C, D, E, A); + T_0_15(15, A, B, C, D, E); /* Round 1 - tail. Input from 512-bit mixing array */ - T_16_19(16); T_16_19(17); T_16_19(18); T_16_19(19); + T_16_19(16, E, A, B, C, D); + T_16_19(17, D, E, A, B, C); + T_16_19(18, C, D, E, A, B); + T_16_19(19, B, C, D, E, A); /* Round 2 */ - T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24); - T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29); - T_20_39(30); T_20_39(31); T_20_39(32); T_20_39(33); T_20_39(34); - T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39); + T_20_39(20, A, B, C, D, E); + T_20_39(21, E, A, B, C, D); + T_20_39(22, D, E, A, B, C); + T_20_39(23, C, D, E, A, B); + T_20_39(24, B, C, D, E, A); + T_20_39(25, A, B, C, D, E); + T_20_39(26, E, A, B, C, D); + T_20_39(27, D, E, A, B, C); + T_20_39(28, C, D, E, A, B); + T_20_39(29, B, C, D, E, A); + T_20_39(30, A, B, C, D, E); + T_20_39(31, E, A, B, C, D); + T_20_39(32, D, E, A, B, C); + T_20_39(33, C, D, E, A, B); + T_20_39(34, B, C, D, E, A); + T_20_39(35, A, B, C, D, E); + T_20_39(36, E, A, B, C, D); + T_20_39(37, D, E, A, B, C); + T_20_39(38, C, D, E, A, B); + T_20_39(39, B, C, D, E, A); /* Round 3 */ - T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44); - T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49); - T_40_59(50); T_40_59(51); T_40_59(52); T_40_59(53); T_40_59(54); - T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59); + T_40_59(40, A, B, C, D, E); + T_40_59(41, E, A, B, C, D); + T_40_59(42, D, E, A, B, C); + T_40_59(43, C, D, E, A, B); + T_40_59(44, B, C, D, E, A); + T_40_59(45, A, B, C, D, E); + T_40_59(46, E, A, B, C, D); + T_40_59(47, D, E, A, B, C); + T_40_59(48, C, D, E, A, B); + T_40_59(49, B, C, D, E, A); + T_40_59(50, A, B, C, D, E); + T_40_59(51, E, A, B, C, D); + T_40_59(52, D, E, A, B, C); + T_40_59(53, C, D, E, A, B); + T_40_59(54, B, C, D, E, A); + T_40_59(55, A, B, C, D, E); + T_40_59(56, E, A, B, C, D); + T_40_59(57, D, E, A, B, C); + T_40_59(58, C, D, E, A, B); + T_40_59(59, B, C, D, E, A); /* Round 4 */ - T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64); - T_60_79(65); T_60_79(66); T_60_79(67); T_60_79(68); T_60_79(69); - T_60_79(70); T_60_79(71); T_60_79(72); T_60_79(73); T_60_79(74); - T_60_79(75); T_60_79(76); T_60_79(77); T_60_79(78); T_60_79(79); + T_60_79(60, A, B, C, D, E); + T_60_79(61, E, A, B, C, D); + T_60_79(62, D, E, A, B, C); + T_60_79(63, C, D, E, A, B); + T_60_79(64, B, C, D, E, A); + T_60_79(65, A, B, C, D, E); + T_60_79(66, E, A, B, C, D); + T_60_79(67, D, E, A, B, C); + T_60_79(68, C, D, E, A, B); + T_60_79(69, B, C, D, E, A); + T_60_79(70, A, B, C, D, E); + T_60_79(71, E, A, B, C, D); + T_60_79(72, D, E, A, B, C); + T_60_79(73, C, D, E, A, B); + T_60_79(74, B, C, D, E, A); + T_60_79(75, A, B, C, D, E); + T_60_79(76, E, A, B, C, D); + T_60_79(77, D, E, A, B, C); + T_60_79(78, C, D, E, A, B); + T_60_79(79, B, C, D, E, A); ctx->H[0] += A; ctx->H[1] += B; ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing 2009-08-06 19:41 ` Linus Torvalds @ 2009-08-06 20:08 ` Artur Skawina 2009-08-06 20:53 ` Linus Torvalds 0 siblings, 1 reply; 37+ messages in thread From: Artur Skawina @ 2009-08-06 20:08 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List Linus Torvalds wrote: > > On Thu, 6 Aug 2009, Artur Skawina wrote: >> Oh, i noticed that '-mtune' makes quite a difference, it can change >> the relative performance of the functions significantly, in unobvious >> ways; depending on which cpu gcc tunes for (build config or -mtune); >> some implementations slow down, others become a bit faster. > > That probably is mainly true for P4, although it's quite possible that it > has an effect for just what the register allocator does, and then for > spilling. > > And it looks like _all_ the tweakability is in the spilling. Nothing else > matters. > > How does this patch work for you? It avoids doing that C-level register > rotation, and instead rotates the register names with the preprocessor. > > I realize it's ugly as hell, but it does make it easier for gcc to see > what's going on. > > The patch is against my git patches, but I think it should apply pretty > much as-is to your sha1bench sources too. Does it make any difference for > you? it's a bit slower (P4): before: linus 0.6288 97.06 after: linus 0.6604 92.42 i was trying similar things, like the example below, too, but it wasn't a win on 32 bit... artur [the iteration below is functionally correct, but scheduling is most likely fubared as it wasn't a win and i was checking how much of a difference it made on P4 -- ~-20..~0%, but never faster (relative to linusas2; it _is_ faster than 'linus'). Dropped this version when merging your new preprocessor macros.] @@ -125,6 +127,8 @@ #define W(x) (array[(x)&15]) #define SHA_XOR(t) \ TEMP = SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1); W(t) = TEMP; +#define SHA_XOR2(t) \ + SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1) #define T_16_19(t) \ { unsigned TEMP;\ @@ -139,10 +143,27 @@ #endif #define T_20_39(t) \ - { unsigned TEMP;\ - SHA_XOR(t); \ - TEMP += (B^C^D) + E + 0x6ed9eba1; \ - E = D; D = C; C = SHA_ROR(B, 2); B = A; TEMP += SHA_ROL(A,5); A = TEMP; } + if (t%2==0) {\ + unsigned TEMP;\ + unsigned TEMP2;\ + \ + TEMP = SHA_XOR2(t); \ + TEMP2 = SHA_XOR2(t+1); \ + W(t) = TEMP;\ + W(t+1) = TEMP2;\ + TEMP += E + 0x6ed9eba1; \ + E = C;\ + TEMP += (B^E^D); \ + TEMP2 += D + 0x6ed9eba1; \ + D = SHA_ROR(B, 2);\ + B = SHA_ROL(A, 5);\ + B += TEMP;\ + C = SHA_ROR(A, 2);\ + A ^= E; \ + A ^= D; \ + A += TEMP2;\ + A += SHA_ROL(B, 5);\ + } #if UNROLL T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24); ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing 2009-08-06 20:08 ` Artur Skawina @ 2009-08-06 20:53 ` Linus Torvalds 2009-08-06 21:24 ` Linus Torvalds 2009-08-06 21:39 ` Artur Skawina 0 siblings, 2 replies; 37+ messages in thread From: Linus Torvalds @ 2009-08-06 20:53 UTC (permalink / raw) To: Artur Skawina; +Cc: Git Mailing List On Thu, 6 Aug 2009, Artur Skawina wrote: > > it's a bit slower (P4): > > before: linus 0.6288 97.06 > after: linus 0.6604 92.42 Hmm. Ok, I just tested with your harness, and I get # TIME[s] SPEED[MB/s] rfc3174 5.1 119.7 rfc3174 5.097 119.7 linus 1.836 332.5 linusas 2.006 304.3 linusas2 1.879 324.9 mozilla 5.562 109.7 mozillaas 5.913 103.2 openssl 1.613 378.5 spelvin 1.698 359.5 spelvina 1.602 381 nettle 1.594 382.9 with it, so it is faster for me. So your slowdown seems to be yet another P4 thing. Dang crazy micro-architecture. Of course, it might be a compiler version difference too. I'm using gcc-4.4.0. With the cpp variable renaming, the compiler really has less to be smart about, but spill decisions will still matter a lot. (My old 32-bit numbers were linus 2.092 291.8 so it's a clear improvement on my machine and with my compiler). It also seems to improve the 64-bit numbers a small bit, I'm getting # TIME[s] SPEED[MB/s] rfc3174 3.98 153.3 rfc3174 3.972 153.7 linus 1.514 403.1 linusas 1.555 392.6 linusas2 1.599 381.7 mozilla 4.34 140.6 mozillaas 4.223 144.5 with my 64-bit compile, so on a Nehalem it's the best one of the C ones by a noticeable margin. (My original 64-bit numbers were linus 1.54 396.3 and while the numbers seem to fluctuate a bit, the fluctuation is roughly in the 1% range, so that improvement seems to be statistically significant.) Oh, I did make a small change, but I doubt it matters. Instead of doing TEMP += E + SHA_ROL(A,5) + (fn) + (constant); \ B = SHA_ROR(B, 2); E = TEMP; } while (0) I now do E += TEMP + SHA_ROL(A,5) + (fn) + (constant); \ B = SHA_ROR(B, 2); } while (0) which is a bit more logical (the old TEMP usage was just due to a fairly mindless conversion). That _might_ have lower register pressure if the compiler is silly enough to not notice that it can do it. Maybe that matters. Linus ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing 2009-08-06 20:53 ` Linus Torvalds @ 2009-08-06 21:24 ` Linus Torvalds 0 siblings, 0 replies; 37+ messages in thread From: Linus Torvalds @ 2009-08-06 21:24 UTC (permalink / raw) To: Artur Skawina; +Cc: Git Mailing List On Thu, 6 Aug 2009, Linus Torvalds wrote: > > Hmm. Ok, I just tested with your harness, and I get > > # TIME[s] SPEED[MB/s] > rfc3174 5.1 119.7 > rfc3174 5.097 119.7 > linus 1.836 332.5 > linusas 2.006 304.3 > linusas2 1.879 324.9 > mozilla 5.562 109.7 > mozillaas 5.913 103.2 > openssl 1.613 378.5 > spelvin 1.698 359.5 > spelvina 1.602 381 > nettle 1.594 382.9 On atom, I get things like: # TIME[s] SPEED[MB/s] rfc3174 2.186 27.92 rfc3174 2.186 27.92 linus 0.9492 64.3 linusas 0.9656 63.21 linusas2 1.012 60.29 mozilla 2.492 24.49 mozillaas 2.5 24.41 openssl 0.6411 95.2 spelvin 0.6052 100.8 spelvina 0.6655 91.71 nettle 0.7149 85.37 but quite frankly, those timings aren't stable enough to say anything. Another few runs got me: # TIME[s] SPEED[MB/s] rfc3174 2.207 27.65 rfc3174 2.21 27.62 linus 1.022 59.74 linusas 1.058 57.7 linusas2 1.008 60.58 mozilla 2.485 24.56 mozillaas 2.522 24.2 openssl 0.6421 95.06 spelvin 0.5989 101.9 spelvina 0.6638 91.94 nettle 0.7132 85.58 # TIME[s] SPEED[MB/s] rfc3174 2.224 27.44 rfc3174 2.205 27.68 linus 0.9727 62.75 linusas 0.9766 62.5 linusas2 1.026 59.5 mozilla 2.52 24.22 mozillaas 2.547 23.96 openssl 0.6459 94.5 spelvin 0.6074 100.5 spelvina 0.6751 90.41 nettle 0.7254 84.14 so whatever differences there are between linus*, they seem to be in the noise, and the hand-scheduled asm beats all the C versions senseless. I'd like to get closer to the hand-tuned ones, but I don't see anything to do any more. It's all about gcc register choice and avoiding spilling. So compiler flags changing small details can have _huge_ differences in performance. Here's the Atom numbers with gcc given the "-Os" flag (just because I wanted to try): linus 1.072 56.94 linusas 0.9573 63.76 linusas2 0.9906 61.61 Why did 'linus' numbers go down? No idea. With -O3, it's the other way around: linus 0.9537 64 linusas 0.9566 63.8 linusas2 1.013 60.26 but again, there's variation enough that I'd probably need to run ten runs just to see how much is noise. But the "linusas2 sucks with -O3" is clear, as is the "linus sucks with -Os" thing. Very odd, and very random. Linus ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing 2009-08-06 20:53 ` Linus Torvalds 2009-08-06 21:24 ` Linus Torvalds @ 2009-08-06 21:39 ` Artur Skawina 2009-08-06 21:52 ` Artur Skawina 1 sibling, 1 reply; 37+ messages in thread From: Artur Skawina @ 2009-08-06 21:39 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List Linus Torvalds wrote: > > with it, so it is faster for me. So your slowdown seems to be yet another > P4 thing. Dang crazy micro-architecture. > > Of course, it might be a compiler version difference too. I'm using > gcc-4.4.0. gcc version 4.4.1 20090603 (prerelease) > Oh, I did make a small change, but I doubt it matters. Instead of doing > > TEMP += E + SHA_ROL(A,5) + (fn) + (constant); \ > B = SHA_ROR(B, 2); E = TEMP; } while (0) > > I now do > > E += TEMP + SHA_ROL(A,5) + (fn) + (constant); \ > B = SHA_ROR(B, 2); } while (0) > > which is a bit more logical (the old TEMP usage was just due to a fairly > mindless conversion). That _might_ have lower register pressure if the > compiler is silly enough to not notice that it can do it. Maybe that > matters. before: linus 0.6622 92.17 after: linus 0.6631 92.05 after: linus 0.6601 92.46 after: linus 0.6624 92.14 IOW, no difference, just noise. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing 2009-08-06 21:39 ` Artur Skawina @ 2009-08-06 21:52 ` Artur Skawina 2009-08-06 22:27 ` Linus Torvalds 0 siblings, 1 reply; 37+ messages in thread From: Artur Skawina @ 2009-08-06 21:52 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List Artur Skawina wrote: > Linus Torvalds wrote: >> Oh, I did make a small change, but I doubt it matters. Instead of doing >> >> TEMP += E + SHA_ROL(A,5) + (fn) + (constant); \ >> B = SHA_ROR(B, 2); E = TEMP; } while (0) >> >> I now do >> >> E += TEMP + SHA_ROL(A,5) + (fn) + (constant); \ >> B = SHA_ROR(B, 2); } while (0) >> >> which is a bit more logical (the old TEMP usage was just due to a fairly >> mindless conversion). That _might_ have lower register pressure if the >> compiler is silly enough to not notice that it can do it. Maybe that >> matters. > > before: linus 0.6622 92.17 > after: linus 0.6631 92.05 > after: linus 0.6601 92.46 > after: linus 0.6624 92.14 > > IOW, no difference, just noise. Just to check, i did this: diff -urNp sha1bench-linus.org/block-sha1/sha1.c sha1bench-linus/block-sha1/sha1.c --- sha1bench-linus.org/block-sha1/sha1.c 2009-08-06 23:26:15.607321815 +0200 +++ sha1bench-linus/block-sha1/sha1.c 2009-08-06 23:41:36.858325807 +0200 @@ -103,8 +103,8 @@ void blk_SHA1_Final(unsigned char hashou #define SHA_ROUND(t, input, fn, constant, A, B, C, D, E) do { \ unsigned int TEMP = input(t); W(t) = TEMP; \ - E += TEMP + SHA_ROL(A,5) + (fn) + (constant); \ - B = SHA_ROR(B, 2); } while (0) + E += TEMP + (fn) + (constant); \ + B = SHA_ROR(B, 2); E += SHA_ROL(A,5); } while (0) #define T_0_15(t, A, B, C, D, E) SHA_ROUND(t, SHA_SRC, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E ) #define T_16_19(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E ) and the result was: # TIME[s] SPEED[MB/s] rfc3174 1.47 41.51 rfc3174 1.474 41.42 linus 0.3564 171.2 linusas 0.5736 106.4 linusas2 0.3568 171.1 mozilla 1.17 52.19 mozillaas 1.382 44.17 openssl 0.2636 231.5 spelvin 0.2662 229.2 spelvina 0.2515 242.7 nettle 0.4386 139.2 Hmm. Does this make any difference for you? For me it's the best one so far (the linusas2 number clearly shows that for me the register renaming does nothing; other than that the functions should be very similar) artur ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing 2009-08-06 21:52 ` Artur Skawina @ 2009-08-06 22:27 ` Linus Torvalds 2009-08-06 22:33 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 37+ messages in thread From: Linus Torvalds @ 2009-08-06 22:27 UTC (permalink / raw) To: Artur Skawina; +Cc: Git Mailing List On Thu, 6 Aug 2009, Artur Skawina wrote: > > Does this make any difference for you? For me it's the best one so far > (the linusas2 number clearly shows that for me the register renaming does > nothing; other than that the functions should be very similar) Nope. If anything, it's a bit slower, but it might be in the noise. I generally got 330MB/s with my "cpp renaming" on Nehalem (32-bit - the 64-bit numbers are ~400MB/s), but with this I got 325MB/s twice in a row, which matches the linusas2 numbers pretty exactly. But it seems to make a big difference for you. Btw, _what_ P4 do you have (Northwood or Prescott)? The Intel optimization manuals very much talk about avoiding rotates. And they mention "with a CPUID signature corresponding to family 15 and model encoding of 0, 1, or 2" specifically as being longer latency. That's basically pre-prescott P4, I think. Anyway, on P4 I think you have two double-speed integer issue ports (ie max four ops per cycle), but only one of them takes a rotate, and only in the first half of the cycle (ie just one shift per cycle). And afaik, that is actually the _improved_ state in Prescott. The older P4's didn't have a full shifter unit at all, iirc: shifts were "complex instructions" in Northwood and weren't even single-clock. In Core 2, I think there's still just one shifter unit, but at least it's as fast as all the other units. So P4 really does stand out as sucking as far as shifts are concerned, and if you have an older P4, it will be even worse. Linus ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 22:27     ` Linus Torvalds
@ 2009-08-06 22:33       ` Linus Torvalds
  2009-08-06 23:19         ` Artur Skawina
  2009-08-06 22:55       ` Artur Skawina
  2009-08-07  2:23       ` Linus Torvalds
  2 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2009-08-06 22:33 UTC (permalink / raw)
To: Artur Skawina; +Cc: Git Mailing List

On Thu, 6 Aug 2009, Linus Torvalds wrote:
>
> Anyway, on P4 I think you have two double-speed integer issue ports (ie
> max four ops per cycle), but only one of them takes a rotate, and only in
> the first half of the cycle (ie just one shift per cycle).
>
> And afaik, that is actually the _improved_ state in Prescott. The older
> P4's didn't have a full shifter unit at all, iirc: shifts were "complex
> instructions" in Northwood and weren't even single-clock.

Yeah, verified. Google for

	northwood "barrel shifter"

and you'll find a lot of it.

Basically, older P4's will, I think, shift one bit at a time. So while even
Prescott is relatively weak in the shifter department, pre-Prescott
(Willamette and Northwood) are _really_ weak. If your P4 is one of those,
you really shouldn't use it to decide on optimizations.

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 22:33       ` Linus Torvalds
@ 2009-08-06 23:19         ` Artur Skawina
  2009-08-06 23:42           ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Artur Skawina @ 2009-08-06 23:19 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds wrote:
>
> Yeah, verified. Google for
>
>	northwood "barrel shifter"
>
> and you'll find a lot of it.
>
> Basically, older P4's will, I think, shift one bit at a time. So while even
> Prescott is relatively weak in the shifter department, pre-Prescott
> (Willamette and Northwood) are _really_ weak. If your P4 is one of those,
> you really shouldn't use it to decide on optimizations.

Actually that's even more of a reason to make sure the code doesn't suck :)
The difference on less perverse cpus will usually be small, but on P4 it
can be huge.

A few years back I found my old IP checksum microbenchmark, and when I ran
it on a P4 (Prescott iirc) I didn't believe my eyes. The straightforward
32-bit C implementation was running circles around the in-kernel one...
And a few tweaks to the assembler version got me another ~100% speedup.[1]
After that the P4 became the very first cpu to test any code on... :)

artur

[1] just reran the benchmark on this P4; true on Northwood too:

IACCK 0.9.30  Artur Skawina <...>
[ exec time; lower is better ]             [speed ]  [ time ]  [ok?]
TIME-N+S TIME32 TIME33 TIME1480  MBYTES/S  TIMEXXXX  CSUM  FUNCTION   ( rdtsc_overhead=0 null=0 )
   17901    510    557     3010    393.36     59772  56dd  csum_partial_cdumb16
    3019    154    156      431   2747.10     43106  56dd  csum_partial_c32
    2413    170    177      328   3609.76     37501  56dd  csum_partial_c32l
    2437    170    170      328   3609.76     37488  56dd  csum_partial_c32i
    5078    205    254      767   1543.68     48117  56dd  csum_partial_std
    5612    299    291      851   1391.30     53673  56dd  csum_partial_686
    1584     99    127      227   5215.86     14495  56dd  csum_partial_586f
    1738    107    121      229   5170.31     14785  56dd  csum_partial_586fs
    4893    175    171      759   1559.95     52347  56dd  csum_partial_copy_generic_std
    4949    151    189      756   1566.14     67847  56dd  csum_partial_copy_generic_686
    2072    110    134      302   3920.53     39061  56dd  csum_partial_copy_generic_p4as1

^ permalink raw reply	[flat|nested] 37+ messages in thread
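For readers who haven't seen one, a "straightforward 32-bit C
implementation" of the IP checksum means the RFC 1071 style sketched below.
This is a generic illustration under that assumption, not the actual
csum_partial_c32 from the benchmark above:

	#include <stddef.h>
	#include <stdint.h>

	/* Generic RFC 1071-style Internet checksum: sum 16-bit words into
	 * a 32-bit accumulator and fold the carries once at the end.  The
	 * inner loop has no carry-flag (adc) dependency, which is exactly
	 * the kind of "simple instructions only" code the P4 likes. */
	static uint16_t ip_checksum(const void *data, size_t len)
	{
		const uint16_t *p = data;
		uint32_t sum = 0;

		while (len > 1) {
			sum += *p++;
			len -= 2;
		}
		if (len)				/* trailing odd byte */
			sum += *(const uint8_t *)p;

		sum = (sum & 0xffff) + (sum >> 16);	/* fold carries */
		sum = (sum & 0xffff) + (sum >> 16);	/* fold the fold's carry */
		return (uint16_t)~sum;
	}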
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 23:19         ` Artur Skawina
@ 2009-08-06 23:42           ` Linus Torvalds
  0 siblings, 0 replies; 37+ messages in thread
From: Linus Torvalds @ 2009-08-06 23:42 UTC (permalink / raw)
To: Artur Skawina; +Cc: Git Mailing List

On Fri, 7 Aug 2009, Artur Skawina wrote:
>
> Actually that's even more of a reason to make sure the code doesn't suck :)
> The difference on less perverse cpus will usually be small, but on P4 it
> can be huge.

No. First off, the things you have to do on P4 are just insane. See the
email I just sent out asking you to test whether two 1-bit rotates might
be faster than one 2-bit rotate. So optimizing for P4 is often the wrong
thing.

Secondly, P4's are going away. You may have one, but they are getting
rare. So optimizing for them is a losing proposition in the long run.

> A few years back I found my old IP checksum microbenchmark, and when I ran
> it on a P4 (Prescott iirc) I didn't believe my eyes. The straightforward
> 32-bit C implementation was running circles around the in-kernel one...
> And a few tweaks to the assembler version got me another ~100% speedup.[1]

Yeah, not very surprising. The P4 is very good at the simplest possible
kind of code that does _nothing_ fancy.

But then it completely chokes on some code. I mean _totally_. It slows
down by a huge amount if there is anything but the most trivial kinds of
instructions. And by "trivial", I mean _really_ trivial. Shifts (as in
SHA1), but iirc also things like "adc" (add with carry) etc.

So it's not hard to write code that works well on other uarchs, and then
totally blows up on P4. I think it doesn't rename the flags at all, so any
flag dependency (carry being the most common one) will stall things
horribly.

There's also a very subtle store forwarding failure thing (and a lot of
other events) that causes a nasty micro-architectural replay trap, and
again you go from "running like a bat out of hell" to "slower than an i486
at a tenth the frequency".

Really. It's disgusting. Perfectly fine code can run really slowly on the
P4 just because it hits some random internal micro-architectural flaw. And
there's a _lot_ of those "glass jaw" issues.

The best way to avoid them is to use _only_ simple ALU instructions (add,
sub, and/or/not), and to be _very_ careful with loads and stores.

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread
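The store-forwarding failure mode is easiest to see as a pattern: a
long-latency result feeds a store, and a dependent load from the same
address issues before the data is ready. A schematic in C, under the
assumption that the compiler actually keeps the store and the reload (in
real code this shows up through register spills rather than a global):

	/* Schematic only: a rotate result feeds a store, and a load from
	 * the same location follows a few instructions later.  On Netburst
	 * the speculatively issued load can hit the store buffer before
	 * the data is ready and force a pipeline replay costing tens of
	 * cycles. */
	unsigned int slot;

	unsigned int replay_prone(unsigned int b, unsigned int x)
	{
		unsigned int t = (b >> 2) | (b << 30);	/* ror $2: long latency on P4 */
		slot = t;		/* store of a not-yet-ready value */
		x ^= x >> 7;		/* a few unrelated instructions */
		return x ^ slot;	/* load from the just-stored address */
	}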
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 22:27     ` Linus Torvalds
  2009-08-06 22:33       ` Linus Torvalds
@ 2009-08-06 22:55       ` Artur Skawina
  2009-08-06 23:04         ` Linus Torvalds
  2009-08-07  2:23       ` Linus Torvalds
  2 siblings, 1 reply; 37+ messages in thread
From: Artur Skawina @ 2009-08-06 22:55 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds wrote:
>
> On Thu, 6 Aug 2009, Artur Skawina wrote:
>> Does this make any difference for you? For me it's the best one so far
>> (the linusas2 number clearly shows that for me the register renaming does
>> nothing; other than that the functions should be very similar)
>
> Nope. If anything, it's a bit slower, but it might be in the noise. I
> generally got 330MB/s with my "cpp renaming" on Nehalem (32-bit - the
> 64-bit numbers are ~400MB/s), but with this I got 325MB/s twice in a row,
> which matches the linusas2 numbers pretty much exactly.
>
> But it seems to make a big difference for you.

It seems to do well on P2 and P4 here; if it works for Core 2, this could
be a good generic candidate. It only does 62% on an Atom, but the best C
version so far exceeds it only by ~2%.

> Btw, _what_ P4 do you have (Northwood or Prescott)?

Northwood.

> The Intel optimization manuals very much talk about avoiding rotates. And
> they mention "with a CPUID signature corresponding to family 15 and model
> encoding of 0, 1, or 2" specifically as being longer latency. That's
> basically the pre-Prescott P4, I think.

cpu family	: 15
model		: 2
model name	: Intel(R) Pentium(R) 4 CPU 2.80GHz
stepping	: 5

> Anyway, on P4 I think you have two double-speed integer issue ports (ie
> max four ops per cycle), but only one of them takes a rotate, and only in
> the first half of the cycle (ie just one shift per cycle).
>
> And afaik, that is actually the _improved_ state in Prescott. The older
> P4's didn't have a full shifter unit at all, iirc: shifts were "complex
> instructions" in Northwood and weren't even single-clock.
>
> In Core 2, I think there's still just one shifter unit, but at least it's
> as fast as all the other units. So P4 really does stand out as sucking as
> far as shifts are concerned, and if you have an older P4, it will be even
> worse.

Hmm, I might be able to try it on some old Willamette, but my Prescott's
mobo died, so I can't verify that right now. I'll upload an updated
sha1bench, maybe somebody else feels like checking...

artur

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 22:55       ` Artur Skawina
@ 2009-08-06 23:04         ` Linus Torvalds
  2009-08-06 23:25           ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2009-08-06 23:04 UTC (permalink / raw)
To: Artur Skawina; +Cc: Git Mailing List

On Fri, 7 Aug 2009, Artur Skawina wrote:
>
> Hmm, I might be able to try it on some old Willamette, but my Prescott's
> mobo died, so I can't verify that right now.

I think Willamette and Northwood are basically the same wrt shifters (and
pretty much everything else too, for that matter). I think Northwood is a
shrink, and had an increased cache size (and higher frequencies). But I
think core-wise, they're very similar.

It was Prescott that changed a lot (mostly for the worse - the shifter was
one of the few upsides of Prescott, although increased frequency often
made up for the downsides).

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 23:04         ` Linus Torvalds
@ 2009-08-06 23:25           ` Linus Torvalds
  2009-08-07  0:13             ` Linus Torvalds
  2009-08-07  0:53             ` Artur Skawina
  0 siblings, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2009-08-06 23:25 UTC (permalink / raw)
To: Artur Skawina; +Cc: Git Mailing List

On Thu, 6 Aug 2009, Linus Torvalds wrote:
>
> It was Prescott that changed a lot (mostly for the worse - the shifter was
> one of the few upsides of Prescott, although increased frequency often
> made up for the downsides).

Anyway, since you have a Northwood, I bet that the #1 issue for you is to
spread out the shift instructions in a way that simply doesn't matter
anywhere else.

In Netburst, if I remember the details correctly, a "complex instruction"
will basically get its trace-cache entries from the microcode ROMs. I'm
not sure how it interacts with the TC entries around it, but it's entirely
possible that it basically disables any instruction scheduling (the
microcode traces are presumably "pre-scheduled"), so you'd basically see
stalls where there's little out-of-order execution.

That then explains why you see huge differences from what is basically
trivial scheduling decisions, and why some random placement of a shift
makes a big difference.

Just out of curiosity, does anything change if you change the

	B = SHA_ROR(B,2)

into a

	B = SHA_ROR(SHA_ROR(B,1),1)

instead? It's very possible that it becomes _much_ worse, but I guess it's
also possible in theory that a single-bit rotate ends up being a simple
instruction and that doing two single-bit ROR's is actually faster than
one 2-bit ROR (assuming the second one is microcoded and the first one
isn't).

In particular, I'm thinking about the warning in the Intel optimization
manual:

	The rotate by immediate and rotate by register instructions are
	more expensive than a shift. The rotate by 1 instruction has the
	same latency as a shift.

so it's very possible that "rotate by 1" is much better than other
rotates.

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread
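As a macro the experiment is a one-liner; one caveat worth noting (an
addition here, not from the message): the compiler is free to fold two
1-bit rotates back into a single 2-bit rotate, so the generated assembly
needs checking before the measurement means anything:

	/* Hypothetical drop-in for the B rotation inside sha1.c: two
	 * 1-bit rotates in place of one 2-bit rotate.  Verify in the
	 * disassembly that gcc really emits two `ror $1` instructions
	 * rather than merging them back into `ror $2`. */
	#define SHA_ROR_SPLIT(X)	SHA_ROR(SHA_ROR(X, 1), 1)

	/* ... and in SHA_ROUND:  B = SHA_ROR_SPLIT(B);  */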
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 23:25           ` Linus Torvalds
@ 2009-08-07  0:13             ` Linus Torvalds
  2009-08-07  1:30               ` Artur Skawina
  1 sibling, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2009-08-07 0:13 UTC (permalink / raw)
To: Artur Skawina; +Cc: Git Mailing List

On Thu, 6 Aug 2009, Linus Torvalds wrote:
>
> In particular, I'm thinking about the warning in the Intel optimization
> manual:
>
>	The rotate by immediate and rotate by register instructions are
>	more expensive than a shift. The rotate by 1 instruction has the
>	same latency as a shift.
>
> so it's very possible that "rotate by 1" is much better than other
> rotates.

Hmm. Probably not. Googling more seems to indicate that rotates and shifts
have a fixed 4-cycle latency on Northwood. I'm not seeing anything that
indicates that a single-bit rotate/shift would be any faster.

(And remember, if 4 cycles doesn't sound so bad: that's enough of a
latency to do _16_ "simple" ALU ops, since they can be double-pumped in
the two regular ALU's).

I think long-running ALU ops that feed into a store (spill) also happen to
be the thing that makes the dreaded store-buffer replay trap nasties
happen more (load vs store scheduled badly, and then you end up spending
tens of cycles just replaying).

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-07  0:13             ` Linus Torvalds
@ 2009-08-07  1:30               ` Artur Skawina
  2009-08-07  1:55                 ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Artur Skawina @ 2009-08-07 1:30 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds wrote:
>
> On Thu, 6 Aug 2009, Linus Torvalds wrote:
>> In particular, I'm thinking about the warning in the Intel optimization
>> manual:
>>
>>	The rotate by immediate and rotate by register instructions are
>>	more expensive than a shift. The rotate by 1 instruction has the
>>	same latency as a shift.
>>
>> so it's very possible that "rotate by 1" is much better than other
>> rotates.
>
> Hmm. Probably not. Googling more seems to indicate that rotates and shifts
> have a fixed 4-cycle latency on Northwood. I'm not seeing anything that
> indicates that a single-bit rotate/shift would be any faster.
>
> (And remember, if 4 cycles doesn't sound so bad: that's enough of a
> latency to do _16_ "simple" ALU ops, since they can be double-pumped in
> the two regular ALU's).

Looking at the generated code, there is a lot of ro[rl] movement, so it's
likely that contributes to the problem. I also see 44 extra lea
instructions, 44 fewer adds, and changes like:

[...]
 	mov    XX(%eRX),%eRX
 	xor    XX(%eRX),%eRX
-	and    %eRX,%eRX
+	and    XX(%eRX),%eRX
 	xor    XX(%eRX),%eRX
-	add    %eRX,%eRX
-	ror    $0x2,%eRX
-	mov    %eRX,XX(%eRX)
+	lea    (%eRX,%eRX,1),%eRX
 	mov    XX(%eRX),%eRX
 	bswap  %eRX
 	mov    %eRX,XX(%eRX)
 	mov    %eRX,%eRX
+	ror    $0x2,%eRX
+	mov    %eRX,XX(%eRX)
+	mov    %eRX,%eRX
 	rol    $0x5,%eRX
 	mov    XX(%eRX),%eRX
-	mov    XX(%eRX),%eRX
[...]

which could mean that gcc did a better job of register allocation (where
"better job" might be just luck).

artur

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-07  1:30               ` Artur Skawina
@ 2009-08-07  1:55                 ` Linus Torvalds
  0 siblings, 0 replies; 37+ messages in thread
From: Linus Torvalds @ 2009-08-07 1:55 UTC (permalink / raw)
To: Artur Skawina; +Cc: Git Mailing List

On Fri, 7 Aug 2009, Artur Skawina wrote:
>
> I also see 44 extra lea instructions, 44 fewer adds

Add and lea (as long as the lea shift is 1) should be the same on a P4
(they are not the same on some other microarchitectures, and lea can have
address generation stalls etc).

Lea, of course, gives the potential for register movement at the same time
(three-address op), and that's likely the reason for lea-vs-adds.

> and changes like:
> [...]
> 	mov    XX(%eRX),%eRX
> 	xor    XX(%eRX),%eRX
> -	and    %eRX,%eRX
> +	and    XX(%eRX),%eRX

Yeah, different spill patterns. That's the biggest issue, I think.

In particular, on P4, with unlucky spills, you may end up with things like

	ror $2,reg
	mov reg,x(%esp)
	.. a few instructions ..
	xor x(%esp), reg

and the above is exactly when one of the worst P4 problems hits: a store,
followed a few cycles later by a load from the same address (and "a few
cycles later" can be quite a few instructions if they are the nice ones).

What can happen is that if the store data isn't ready yet (because it
comes from a long-latency op like a shift or a multiply), then you hit a
store buffer replay thing. The P4 (with its long pipeline) basically
starts the load speculatively, and if anything bad happens for the load
(L1 cache miss, TLB miss, store buffer fault, you name it), it will cause
a replay of the whole pipeline. Which can take tens of cycles.

[ That said, it's been a long time since I did a lot of P4 worrying. So I
  may mis-remember the details. But that whole store buffer forwarding had
  some really nasty replay issues ]

> which could mean that gcc did a better job of register allocation
> (where "better job" might be just luck).

I suspect that's the biggest issue. Just _happening_ to get the spills so
that they don't hurt. And with unlucky scheduling, you might hit some of
the P4 replay issues every single time.

There are some P4 optimizations that are simple:

 - avoid complex instructions
 - don't blow the trace cache
 - predictable branches

but the replay faults can really get you.

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 23:25           ` Linus Torvalds
  2009-08-07  0:13             ` Linus Torvalds
@ 2009-08-07  0:53             ` Artur Skawina
  1 sibling, 0 replies; 37+ messages in thread
From: Artur Skawina @ 2009-08-07 0:53 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds wrote:
>
> Just out of curiosity, does anything change if you change the
>
>	B = SHA_ROR(B,2)
>
> into a
>
>	B = SHA_ROR(SHA_ROR(B,1),1)
>
> instead? It's very possible that it becomes _much_ worse, but I guess it's

I did try that yesterday; it didn't help. Will recheck now... yep:

before: linus             0.3554    171.7
after:  linus             0.407     150

still true for the current version.

> So optimizing for P4 is often the wrong thing.
>
> Secondly, P4's are going away. You may have one, but they are getting
> rare. So optimizing for them is a losing proposition in the long run.

Sure, no argument; it's just that avoiding the P4 pitfalls is usually not
that hard, and the impact on other, non-Netburst, archs is low. There are
a lot of P4s out there and they're not going away soon. (I'm still keeping
most of my git trees on a P3...)

For generic C code such as this, the difference for your i7 was -2%, and
+70% for my P4; all the other (but one, I think) optimizations which
worked on P4 also applied to 32-bit i7. As I happen to have a P4 I can
just as well test the code on it; many improvements will likely apply to
other cpus too. That's all, I doubt anybody seriously considered
"optimizing for P4"; there is a reason Intel discontinued them :)

The Atom is a more important target, but only the asm versions did well
there so far.

artur

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 22:27     ` Linus Torvalds
  2009-08-06 22:33       ` Linus Torvalds
  2009-08-06 22:55       ` Artur Skawina
@ 2009-08-07  2:23       ` Linus Torvalds
  2009-08-07  4:16         ` Artur Skawina
       [not found]         ` <alpine.LFD.2.01.0908071614310.3288@localhost.localdomain>
  2 siblings, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2009-08-07 2:23 UTC (permalink / raw)
To: Artur Skawina; +Cc: Git Mailing List

On Thu, 6 Aug 2009, Linus Torvalds wrote:
>
> On Thu, 6 Aug 2009, Artur Skawina wrote:
> >
> > Does this make any difference for you? For me it's the best one so far
> > (the linusas2 number clearly shows that for me the register renaming does
> > nothing; other than that the functions should be very similar)
>
> Nope. If anything, it's a bit slower, but it might be in the noise. I
> generally got 330MB/s with my "cpp renaming" on Nehalem (32-bit - the
> 64-bit numbers are ~400MB/s), but with this I got 325MB/s twice in a row,
> which matches the linusas2 numbers pretty much exactly.

I actually found a P4 I have access to, except that one is a Prescott. And
I can't run it in 32-bit mode, because I only have a regular user login,
and it only has the 64-bit development environment.

But I can do the hacked-for-64bit sha1bench runs, and I tested your patch.
It's horrible.

Here's the plain "linus" baseline (ie the "Do register rotation in cpp"
thing, with the fixed "E += TEMP .." thing):

#                 TIME[s]  SPEED[MB/s]
rfc3174           1.648      37.03
rfc3174           1.677      36.4
linus             0.4018    151.9
linusas           0.4439    137.5
linusas2          0.4381    139.3
mozilla           0.9587     63.66
mozillaas         0.9434     64.7

and here it is with your patch:

#                 TIME[s]  SPEED[MB/s]
rfc3174           1.667      36.61
rfc3174           1.644      37.12
linus             0.4653    131.2
linusas           0.4412    138.3
linusas2          0.4388    139.1
mozilla           0.9466     64.48
mozillaas         0.9449     64.59

(ok, so the numbers aren't horribly stable, but the "plain linus" thing
consistently outperforms here - and underperforms with your patch).

However, note that since this is the 64-bit thing, there likely aren't any
spill issues, but it's simply an issue of "just how did the array[]
accesses get scheduled" etc. And since this is a Prescott (or rather
"Xeon") P4, the shifter isn't quite as horrible as yours is. _And_ this is
a different gcc version (4.0.3). So the numbers aren't really all that
comparable.

It's more an example of "optimizing for P4 is futile, because you're just
playing with total randomness". That's like a 20MB/s difference, just from
moving a few ALU ops around a bit. And it's entirely possible that if I
had gcc-4.4 on that machine, your patch would magically do the right thing
;)

Sadly, that machine is just an ssh gateway, so there are no real
development tools on it at all - no way to get good profiles etc. So I
can't really say exactly what the problem pattern is :(

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-07  2:23       ` Linus Torvalds
@ 2009-08-07  4:16         ` Artur Skawina
  0 siblings, 0 replies; 37+ messages in thread
From: Artur Skawina @ 2009-08-07 4:16 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds wrote:
>
> Here's the plain "linus" baseline (ie the "Do register rotation in cpp"
> thing, with the fixed "E += TEMP .." thing):
> linus             0.4018    151.9
> and here it is with your patch:
> linus             0.4653    131.2
> (ok, so the numbers aren't horribly stable, but the "plain linus" thing
> consistently outperforms here - and underperforms with your patch).

Well, I'd be surprised if one C version would always be the winner on
every single cpu; that 13% loss[1] I think would be an acceptable
compromise, if the goal is to have one implementation that does reasonably
well on all cpus. That's why I asked how the change did on Nehalem; if
it's a measurable loss on anything modern (core2+), then of course the P4s
must suffer; and one could always blame the compiler ;)

It's not like the difference in sha1 overhead will be noticeable in normal
git use.

artur

[1] I suspect the old gcc is a factor (4.0.4 does <100MB/s here).

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
       [not found]             ` <4A7CC380.3070008@gmail.com>
@ 2009-08-08  4:16               ` Linus Torvalds
  2009-08-08  5:34                 ` Artur Skawina
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2009-08-08 4:16 UTC (permalink / raw)
To: Artur Skawina; +Cc: Git Mailing List

On Sat, 8 Aug 2009, Artur Skawina wrote:
>
> I was seeing such large variations depending on the -mtune flags that
> I gave up and now do just -march=i686; that's what I would expect
> for generic x86 binaries.

I think I have found a way to avoid the gcc craziness.

Lookie here:

#                 TIME[s]  SPEED[MB/s]
rfc3174           5.094      119.8
rfc3174           5.098      119.7
linus             1.462      417.5
linusas           2.008      304
linusas2          1.878      325
mozilla           5.566      109.6
mozillaas         5.866      104.1
openssl           1.609      379.3
spelvin           1.675      364.5
spelvina          1.601      381.3
nettle            1.591      383.6

Notice? I outperform all the hand-tuned asm on 32-bit too. By quite a
margin, in fact.

Now, I didn't try a P4, and it's possible that it won't do that there, but
the 32-bit code generation sure looks impressive on my Nehalem box. The
magic? I force the stores to the 512-bit hash bucket to be done in order.
That seems to help a lot. The diff is trivial (on top of the "rename
registers with cpp" patch), as appended.

And it does seem to fix the P4 issues too, although I can obviously (once
again) only test Prescott, and only in 64-bit mode:

#                 TIME[s]  SPEED[MB/s]
rfc3174           1.662      36.73
rfc3174           1.64       37.22
linus             0.2523    241.9
linusas           0.4367    139.8
linusas2          0.4487    136
mozilla           0.9704     62.9
mozillaas         0.9399     64.94

That's some really impressive improvement. All from just saying "do the
stores in the order I told you to, dammit!" to the compiler.

		Linus

---
 block-sha1/sha1.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index 19dc41d..f70e1ba 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -93,6 +93,7 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
 
 /* This "rolls" over the 512-bit array */
 #define W(x) (array[(x)&15])
+#define setW(x, val) (*(volatile unsigned int *)&W(x) = (val))
 
 /*
  * Where do we get the source from? The first 16 iterations get it from
@@ -102,7 +103,7 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
 #define SHA_MIX(t) SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1)
 
 #define SHA_ROUND(t, input, fn, constant, A, B, C, D, E) do { \
-	unsigned int TEMP = input(t); W(t) = TEMP; \
+	unsigned int TEMP = input(t); setW(t, TEMP); \
 	E += TEMP + SHA_ROL(A,5) + (fn) + (constant); \
 	B = SHA_ROR(B, 2); } while (0)

^ permalink raw reply related	[flat|nested] 37+ messages in thread
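Why the one-liner works: volatile accesses may not be reordered with
respect to one another, so the sixteen W[] stores have to be emitted in
program order, which evidently keeps gcc from shuffling the spill code
around them; plain reads through W(x) are untouched. A tiny standalone
illustration of the same idea (a hypothetical example, not from the
patch):

	/* Self-contained sketch: with the volatile cast, gcc must emit
	 * these three stores in source order; compile with -O2 and inspect
	 * the assembly to confirm.  Without the cast it is free to reorder
	 * or rearrange them along with the surrounding code. */
	static unsigned int array[16];

	#define setW(x, val) (*(volatile unsigned int *)&array[(x) & 15] = (val))

	void fill(void)
	{
		setW(2, 30);
		setW(0, 10);
		setW(1, 20);
	}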
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-08  4:16               ` Linus Torvalds
@ 2009-08-08  5:34                 ` Artur Skawina
  2009-08-08 17:10                   ` Linus Torvalds
  2009-08-08 22:58                   ` Artur Skawina
  0 siblings, 2 replies; 37+ messages in thread
From: Artur Skawina @ 2009-08-08 5:34 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds wrote:
>
> I think I have found a way to avoid the gcc craziness.
>
> Lookie here:
>
> #                 TIME[s]  SPEED[MB/s]
> rfc3174           5.094      119.8
> rfc3174           5.098      119.7
> linus             1.462      417.5
> linusas           2.008      304
> linusas2          1.878      325
> mozilla           5.566      109.6
> mozillaas         5.866      104.1
> openssl           1.609      379.3
> spelvin           1.675      364.5
> spelvina          1.601      381.3
> nettle            1.591      383.6
>
> Notice? I outperform all the hand-tuned asm on 32-bit too. By quite a
> margin, in fact.
>
> Now, I didn't try a P4, and it's possible that it won't do that there, but
> the 32-bit code generation sure looks impressive on my Nehalem box. The
> magic? I force the stores to the 512-bit hash bucket to be done in order.
> That seems to help a lot.

I named it 'linusv':

P4/i686:
#                 TIME[s]  SPEED[MB/s]
rfc3174           1.456      41.92
rfc3174           1.445      42.22
linus             0.5865    104.1
linusph           0.5643    108.2
linusv            0.3697    165.1
linusvph          0.3618    168.7
linusp4           0.4312    141.5
linusas           0.4091    149.2
linusas2          0.4364    139.9
mozilla           1.102      55.37
mozillaas         1.297      47.07
openssl           0.261     233.9
opensslb          0.2395    254.9
spelvin           0.2653    230
nettle            0.438     139.4

and when tuning for prescott:

linus             0.6544     93.27
linusph           0.6523     93.57
linusv            0.3439    177.5
linusvph          0.3547    172.1
linusp4           0.3585    170.3

so it isn't as fast as the openssl asm ones, but it does win in the C
category.

> I outperform all the hand-tuned asm on 32-bit too. By quite a
> margin, in fact.

I've inlined the byteswapping in 'opensslb', maybe that one will
do a bit better.

http://www.src.multimo.pl/YDpqIo7Li27O0L0h/sha1bench.tar.gz

artur

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-08  5:34                 ` Artur Skawina
@ 2009-08-08 17:10                   ` Linus Torvalds
  2009-08-08 18:12                     ` Artur Skawina
  1 sibling, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2009-08-08 17:10 UTC (permalink / raw)
To: Artur Skawina; +Cc: Git Mailing List

On Sat, 8 Aug 2009, Artur Skawina wrote:
>
> I've inlined the byteswapping in 'opensslb', maybe that one will
> do a bit better.
>
> http://www.src.multimo.pl/YDpqIo7Li27O0L0h/sha1bench.tar.gz

Hmm. Testing on my atom, the inlined bswap is worse, but the asm versions
are generally superior to any C one:

#                 TIME[s]  SPEED[MB/s]
rfc3174           2.194      27.82
rfc3174           2.19       27.87
linus             0.947      64.45
linusph           0.9381     65.06
linusv            0.8943     68.25
linusvph          0.8803     69.34
linusasm          0.9349     65.29
linusp4           1.006      60.66
linusas           1.062      57.48
linusas2          1.009      60.5
mozilla           2.264      26.96
mozillaas         2.197      27.78
openssl           0.648      94.19
opensslb          0.7419     82.27
spelvin           0.636      95.96
spelvina          0.6671     91.49
nettle            0.717      85.12
nettle-ror        0.7137     85.52
nettle-p4sch      0.7158     85.27

Interestingly, -mtune=prescott does well for that 'linusv' version on
atom too, and gets it up to

linusv            0.8365     72.96

and it's the only one that improves. Odd interactions.

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-08 17:10                   ` Linus Torvalds
@ 2009-08-08 18:12                     ` Artur Skawina
  0 siblings, 0 replies; 37+ messages in thread
From: Artur Skawina @ 2009-08-08 18:12 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds wrote:
>
> On Sat, 8 Aug 2009, Artur Skawina wrote:
>> I've inlined the byteswapping in 'opensslb', maybe that one will
>> do a bit better.
>
> Hmm. Testing on my atom, the inlined bswap is worse, but the asm versions
> are generally superior to any C one:

It loses on Atom, but is the best one on both P3 and P4 here. Based on
your other numbers I was expecting it to win on 32-bit Nehalem too.
gcc doing a better job of scheduling w/ 'linusv' wouldn't surprise me,
though (since there are no spills, the data reads are about the only
other thing that could make a difference. And, yes, they show up in the
profiles; if x86 only had one more register...)

artur

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-08  5:34                 ` Artur Skawina
  2009-08-08 17:10                   ` Linus Torvalds
@ 2009-08-08 22:58                   ` Artur Skawina
  2009-08-08 23:36                     ` Artur Skawina
  1 sibling, 1 reply; 37+ messages in thread
From: Artur Skawina @ 2009-08-08 22:58 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List

Artur Skawina wrote:
> Linus Torvalds wrote:
>> magic? I force the stores to the 512-bit hash bucket to be done in order.
>> That seems to help a lot.
>
> I named it 'linusv':
> linusv            0.3697    165.1

I was not going to spend even more time on the C version, but after
looking at what gcc does to it, I tried this:

diff --git a/block-sha1/sha1vol.c b/block-sha1/sha1vol.c
--- a/block-sha1/sha1vol.c
+++ b/block-sha1/sha1vol.c
@@ -93,7 +93,7 @@ void blk_SHA1_Finalv(unsigned char hashout[20], blk_SHA_CTX *ctx)
 
 /* This "rolls" over the 512-bit array */
 #define W(x) (array[(x)&15])
-#define setW(x, val) (*(volatile unsigned int *)&W(x) = (val))
+#define setW(x, val) W(x) = (val); __asm__ volatile ("": "+m" (W(x)))
 
 /*
  * Where do we get the source from? The first 16 iterations get it from

and got a nice improvement:

rfc3174           1.436      42.49
linus             0.5843    104.5
linusph           0.5639    108.2
linusv            0.3098    197
linusvph          0.3082    198.1
linusasm          0.5849    104.3
linusp4           0.433     141
linusas           0.4077    149.7
linusas2          0.436     140
mozilla           1.099      55.54
mozillaas         1.295      47.11
openssl           0.2632    231.9
opensslb          0.2395    254.8
spelvin           0.2687    227.2
spelvina          0.2526    241.7
nettle            0.4378    139.4
nettle-ror        0.4379    139.4
nettle-p4sch      0.4231    144.2

The Atom numbers didn't change much.

artur

^ permalink raw reply	[flat|nested] 37+ messages in thread
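One plausible reading of why the asm form beats the volatile cast (an
assumption about gcc's semantics, not something established in the
thread): the volatile store is ordered against every other volatile
access, while the empty asm with a "+m" operand only forces the value out
to that one memory location at that point, leaving the store itself an
ordinary store the optimizer can otherwise treat normally. The two
barriers side by side, as a sketch:

	/* setW_volatile makes the access itself volatile; setW_barrier
	 * keeps a plain store, but the "+m" operand tells gcc the asm
	 * reads and writes W(x), so the store can be neither deleted nor
	 * sunk past this point. */
	static unsigned int array[16];
	#define W(x) (array[(x)&15])

	#define setW_volatile(x, val) \
		(*(volatile unsigned int *)&W(x) = (val))

	#define setW_barrier(x, val) \
		do { W(x) = (val); __asm__ volatile("" : "+m" (W(x))); } while (0)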
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-08 22:58                   ` Artur Skawina
@ 2009-08-08 23:36                     ` Artur Skawina
  0 siblings, 0 replies; 37+ messages in thread
From: Artur Skawina @ 2009-08-08 23:36 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List

Artur Skawina wrote:
> Artur Skawina wrote:
> -#define setW(x, val) (*(volatile unsigned int *)&W(x) = (val))
> +#define setW(x, val) W(x) = (val); __asm__ volatile ("": "+m" (W(x)))

and w/ this on top:

diff --git a/block-sha1/sha1vol.c b/block-sha1/sha1vol.c
--- a/block-sha1/sha1vol.c
+++ b/block-sha1/sha1vol.c
@@ -103,9 +103,9 @@ void blk_SHA1_Finalv(unsigned char hashout[20], blk_SHA_CTX *ctx)
 #define SHA_MIX(t) SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1)
 
 #define SHA_ROUND(t, input, fn, constant, A, B, C, D, E) do { \
-	unsigned int TEMP = input(t); setW(t, TEMP); \
-	E += TEMP + SHA_ROL(A,5) + (fn) + (constant); \
-	B = SHA_ROR(B, 2); } while (0)
+	unsigned int TEMP = SHA_ROL(A,5); E += (fn); \
+	E += (constant) + TEMP; TEMP = input(t); setW(t, TEMP); \
+	B = SHA_ROR(B, 2); E += TEMP; } while (0)
 
 #define T_0_15(t, A, B, C, D, E)  SHA_ROUND(t, SHA_SRC, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )
 #define T_16_19(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )

I see an improvement on Atom and reach ~200MB/s on P4 (i686). When
compiled w/ '-mtune=prescott':

rfc3174           1.459      41.84
linus             0.6574     92.85
linusph           0.6613     92.29
linusv            0.2682    227.6
linusvph          0.2681    227.7
linusasm          0.5868    104
linusp4           0.3586    170.2
linusas           0.3795    160.8
linusas2          0.3583    170.3
mozilla           1.171      52.11
mozillaas         1.381      44.2
openssl           0.2623    232.7
opensslb          0.2404    253.9
spelvin           0.2659    229.6
spelvina          0.2492    244.9
nettle            0.4362    139.9
nettle-ror        0.436     140
nettle-p4sch      0.4204    145.2

It's now just 2% slower than the openssl assembler version.

artur

^ permalink raw reply	[flat|nested] 37+ messages in thread