* Re: Broken PPC sha1.. (Re: Figured out how to get Mozilla into git)
From: linux @ 2006-06-19 8:41 UTC
To: git, torvalds; +Cc: linux

By the way, if anyone's still interested, I tried to produce a
better-scheduled PowerPC sha1 core last year. Unfortunately, I don't have
access to a PowerPC machine to test on, so debugging is a little painful.

The latest version is appended, if anyone with a PowerPC machine wants
to try it out. It should drop in for ppc/sha1ppc.S. Hopefully the comments
explain the general idea.

Notes I should add to the comments:
- When reading the assembly code, note that PPC assembly permits a bare
  number (no %r prefix) as a register number, and further that number
  can be an expression! It makes the register renaming nice and simple.
- Also for folks unfamiliar, it's dest,src,src operand order.
- For a reminder, the PowerPC calling convention is:
  %r0 - Temp. Always reads as zero in some contexts.
  %r1 - stack pointer
  %r2 - Confusing. Different documents say different things.
  %r3..%r10 - Incoming arguments. Volatile across function calls.
  %r11..%r12 - Have some special uses not relevant here. Volatile.
  %r13..%r31 - Callee-save registers.
  %lr, %ctr - Volatile. %lr holds return address on input.
  And the registers are used in this function as follows:
  %r0 - Temp.
  %r1 - Stack pointer. Used only for register saving.
  %r2 - Not used.
  %r3 - Points to hash accumulator A..E in memory.
  %r4 - Points to data being hashed.
  %r5 - Incoming loop count. Holds round constant K in body of loop.
  %r6..%r10 - Working copies of A..E
  %r11..%r26 - The W[] array of 16 input words being hashed
  %r27..%r31 - Start-of-round copies of A..E.
  %ctr - Holds loop count, copied from incoming %r5
  %lr - Holds return address. Not modified.
- While I try to use the load/store multiple instructions where appropriate,
  they have a severe penalty for unaligned operands (they're microcoded
  optimistically, so do a full failing aligned load before being re-issued
  as a slow-but-safe unaligned sequence), and thanks to git's object type
  prefix, the source data is generally unaligned, so they're deliberately
  NOT used to load the 16 words of data hashed each iteration.

/*
 * SHA-1 implementation for PowerPC.
 *
 * Copyright (C) 2005 Paul Mackerras <paulus@samba.org>
 */

/*
 * We roll the registers for A, B, C, D, E around on each
 * iteration; D on iteration t is E on iteration t+1, and so on.
 * We use registers 6 - 10 for this.  (Registers 27 - 31 hold
 * the previous values.)
 */
#define RA(t)   (((t)+4)%5+6)
#define RB(t)   (((t)+3)%5+6)
#define RC(t)   (((t)+2)%5+6)
#define RD(t)   (((t)+1)%5+6)
#define RE(t)   (((t)+0)%5+6)

/* We use registers 11 - 26 for the W values */
#define W(t)    ((t)%16+11)

/* Register 5 is used for the constant k */

/*
 * There are three F functions, used in four groups of 20:
 * - 20 rounds of f0(b,c,d) = "bit wise b ? c : d" = (~b & d) + (b & c)
 * - 20 rounds of f1(b,c,d) = b^c^d = (b^d)^c
 * - 20 rounds of f2(b,c,d) = majority(b,c,d) = (b&d) + ((b^d)&c)
 * - 20 more rounds of f1(b,c,d)
 *
 * These are all scheduled for near-optimal performance on a G4.
 * The G4 is a 3-issue out-of-order machine with 3 ALUs, but it can only
 * *consider* starting the oldest 3 instructions per cycle.  So to get
 * maximum performance out of it, you have to treat it as an in-order
 * machine.  Which means interleaving the computation round t with the
 * computation of W[t+4].
 *
 * The first 16 rounds use W values loaded directly from memory, while the
 * remaining 64 use values computed from those first 16.  We preload
 * 4 values before starting, so there are three kinds of rounds:
 * - The first 12 (all f0) also load the W values from memory.
 * - The next 64 compute W(i+4) in parallel. 8*f0, 20*f1, 20*f2, 16*f1.
 * - The last 4 (all f1) do not do anything with W.
 *
 * Therefore, we have 5 different round functions:
 * STEPD0_LOAD(t,s) - Perform round t and load W(s).  s < 16
 * STEPD0_UPDATE(t,s) - Perform round t and compute W(s).  s >= 16.
 * STEPD1_UPDATE(t,s)
 * STEPD2_UPDATE(t,s)
 * STEPD1(t) - Perform round t with no load or update.
 *
 * The G5 is more fully out-of-order, and can find the parallelism
 * by itself.  The big limit is that it has a 2-cycle ALU latency, so
 * even though it's 2-way, the code has to be scheduled as if it's
 * 4-way, which can be a limit.  To help it, we try to schedule the
 * read of RA(t) as late as possible so it doesn't stall waiting for
 * the previous round's RE(t-1), and we try to rotate RB(t) as early
 * as possible while reading RC(t) (= RB(t-1)) as late as possible.
 */

/* the initial loads. */
#define LOADW(s) \
        lwz     W(s),(s)*4(%r4)

/*
 * This is actually 10 instructions, which is an awkward fit,
 * and uses W(s) as a temporary before loading it.
 */
#define STEPD0_LOAD(t,s) \
add RE(t),RE(t),W(t); andc %r0,RD(t),RB(t); /* spare slot */ \
add RE(t),RE(t),%r0; and W(s),RC(t),RB(t); rotlwi %r0,RA(t),5; \
add RE(t),RE(t),W(s); add %r0,%r0,%r5; rotlwi RB(t),RB(t),30; \
add RE(t),RE(t),%r0; lwz W(s),(s)*4(%r4);

/*
 * This is 13 instructions.  It can execute starting with 2 out of 3
 * possible moduli, so it does 2 rounds in 9 cycles, 4.5 cycles/round.
 */
#define STEPD0_UPDATE(t,s) \
add RE(t),RE(t),W(t); andc %r0,RD(t),RB(t); xor W(s),W((s)-16),W((s)-3); \
add RE(t),RE(t),%r0; and %r0,RC(t),RB(t); xor W(s),W(s),W((s)-8); \
add RE(t),RE(t),%r0; rotlwi %r0,RA(t),5; xor W(s),W(s),W((s)-14); \
add RE(t),RE(t),%r5; rotlwi RB(t),RB(t),30; rotlwi W(s),W(s),1; \
add RE(t),RE(t),%r0;

/* Nicely optimal.  Conveniently, also the most common. */
#define STEPD1_UPDATE(t,s) \
add RE(t),RE(t),W(t); xor %r0,RD(t),RB(t); xor W(s),W((s)-16),W((s)-3); \
add RE(t),RE(t),%r5; xor %r0,%r0,RC(t); xor W(s),W(s),W((s)-8); \
add RE(t),RE(t),%r0; rotlwi %r0,RA(t),5; xor W(s),W(s),W((s)-14); \
add RE(t),RE(t),%r0; rotlwi RB(t),RB(t),30; rotlwi W(s),W(s),1;

/*
 * The naked version, no UPDATE, for the last 4 rounds.  3 cycles per.
 * We could use W(s) as a temp register, but we don't need it.
 */
#define STEPD1(t) \
/* spare slot */ add RE(t),RE(t),W(t); xor %r0,RD(t),RB(t); \
rotlwi RB(t),RB(t),30; add RE(t),RE(t),%r5; xor %r0,%r0,RC(t); \
add RE(t),RE(t),%r0; rotlwi %r0,RA(t),5; /* idle */ \
add RE(t),RE(t),%r0;

/* 5 cycles per */
#define STEPD2_UPDATE(t,s) \
add RE(t),RE(t),W(t); and %r0,RD(t),RB(t); xor W(s),W((s)-16),W((s)-3); \
add RE(t),RE(t),%r0; xor %r0,RD(t),RB(t); xor W(s),W(s),W((s)-8); \
add RE(t),RE(t),%r5; and %r0,%r0,RC(t); xor W(s),W(s),W((s)-14); \
add RE(t),RE(t),%r0; rotlwi %r0,RA(t),5; rotlwi W(s),W(s),1; \
add RE(t),RE(t),%r0; rotlwi RB(t),RB(t),30;

#define STEP0_LOAD4(t,s) \
        STEPD0_LOAD(t,s); \
        STEPD0_LOAD((t)+1,(s)+1); \
        STEPD0_LOAD((t)+2,(s)+2); \
        STEPD0_LOAD((t)+3,(s)+3);

#define STEPUP4(fn, t, s) \
        STEP##fn##_UPDATE(t,s); \
        STEP##fn##_UPDATE((t)+1,(s)+1); \
        STEP##fn##_UPDATE((t)+2,(s)+2); \
        STEP##fn##_UPDATE((t)+3,(s)+3);

#define STEPUP20(fn, t, s) \
        STEPUP4(fn, t, s); \
        STEPUP4(fn, (t)+4, (s)+4); \
        STEPUP4(fn, (t)+8, (s)+8); \
        STEPUP4(fn, (t)+12, (s)+12); \
        STEPUP4(fn, (t)+16, (s)+16)

        .globl  sha1_core
sha1_core:
        stwu    %r1,-80(%r1)
        stmw    %r13,4(%r1)

        /* Load up A - E */
        lmw     %r27,0(%r3)

        mtctr   %r5

1:
        lis     %r5,0x5a82      /* K0-19 */
        mr      RA(0),%r27
        LOADW(0)
        mr      RB(0),%r28
        LOADW(1)
        mr      RC(0),%r29
        LOADW(2)
        ori     %r5,%r5,0x7999
        mr      RD(0),%r30
        LOADW(3)
        mr      RE(0),%r31

        STEP0_LOAD4(0, 4)
        STEP0_LOAD4(4, 8)
        STEP0_LOAD4(8, 12)
        STEPUP4(D0, 12, 16)
        STEPUP4(D0, 16, 20)

        lis     %r5,0x6ed9      /* K20-39 */
        ori     %r5,%r5,0xeba1
        STEPUP20(D1, 20, 24)

        lis     %r5,0x8f1b      /* K40-59 */
        ori     %r5,%r5,0xbcdc
        STEPUP20(D2, 40, 44)

        lis     %r5,0xca62      /* K60-79 */
        ori     %r5,%r5,0xc1d6
        STEPUP4(D1, 60, 64)
        STEPUP4(D1, 64, 68)
        STEPUP4(D1, 68, 72)
        STEPUP4(D1, 72, 76)
        STEPD1(76)
        STEPD1(77)
        STEPD1(78)
        STEPD1(79)

        /* Add results to original values */
        add     %r31,%r31,RE(0)
        add     %r30,%r30,RD(0)
        add     %r29,%r29,RC(0)
        add     %r28,%r28,RB(0)
        add     %r27,%r27,RA(0)

        addi    %r4,%r4,64

        bdnz    1b

        /* Save final hash, restore registers, and return */
        stmw    %r27,0(%r3)
        lmw     %r13,4(%r1)
        addi    %r1,%r1,80
        blr
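
The register renaming used by the macros in this listing can be checked outside the assembler. What follows is a minimal Python sketch, not part of the posted code: the function names simply mirror the RA..RE and W macros, and the helper roles() is hypothetical. It shows that the register written as E in round t is the one read as A in round t+1, and that the roles are back where they started after 80 rounds, which is why the final add-back can use RA(0)..RE(0).

```python
# Python mirror of the cpp macros; each returns a PowerPC register number.
def RA(t): return (t + 4) % 5 + 6
def RB(t): return (t + 3) % 5 + 6
def RC(t): return (t + 2) % 5 + 6
def RD(t): return (t + 1) % 5 + 6
def RE(t): return (t + 0) % 5 + 6


def W(t):
    # 16-entry circular buffer of message words held in %r11..%r26.
    return t % 16 + 11


def roles(t):
    # Hypothetical helper: which register holds each SHA-1 variable in round t.
    return {"A": RA(t), "B": RB(t), "C": RC(t), "D": RD(t), "E": RE(t)}


if __name__ == "__main__":
    for t in (0, 1, 2, 80):
        print(t, roles(t))
```

Because the macros expand to plain integer expressions, the assembler does the same arithmetic at build time; no register moves are needed between rounds.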
* Re: Broken PPC sha1.. (Re: Figured out how to get Mozilla into git)
From: Johannes Schindelin @ 2006-06-19 8:50 UTC
To: linux; +Cc: git, torvalds

Hi,

On Mon, 19 Jun 2006, linux@horizon.com wrote:

> By the way, if anyone's still interested, I tried to produce a
> better-scheduled PowerPC sha1 core last year. Unfortunately, I don't have
> access to a PowerPC machine to test on, so debugging is a little painful.

If you have access to SourceForge's compile farm, they have a PPC-G5 there.

Ciao,
Dscho
* Fixed PPC SHA1
From: linux @ 2006-06-23 0:09 UTC
To: git, linuxppc-dev; +Cc: linux

Okay, here's a tested and working version (the earlier version worked, too)
of a better-scheduled PPC SHA1. This is about 15% faster than the current
sha1ppc.S on a G4, and 5% faster on a G5 when hashing 10 million bytes,
unaligned. (The G5 ratio seems to get better as the sizes fall.)

It's also somewhat smaller, due to using load-multiple instructions.

I have a variant that uses load string (lswi) to load the values to be
hashed that is a few percent faster on a G5, but a few percent slower on
a G4. It's also 52 bytes smaller (out of 4000).

Does anyone have any feeling for the ratio of G4 and G5 machines out there?
I presume for that small a percentage, run-time processor detection isn't
worth it. (Cc: to linuxppc-dev for the experts on this question.)

I've tried using lmw to load the data, and it's faster if the data is
aligned, but it absolutely dies if the data is not. And due to git's
variable-sized hash prefix, most of git's hashes are of unaligned data.

(No copyright is claimed on the changes to Paul Mackerras' work below.
Enjoy.)

/*
 * SHA-1 implementation for PowerPC.
 *
 * Copyright (C) 2005 Paul Mackerras <paulus@samba.org>
 */

/*
 * PowerPC calling convention:
 * %r0 - volatile temp
 * %r1 - stack pointer.
 * %r2 - reserved
 * %r3-%r12 - Incoming arguments & return values; volatile.
 * %r13-%r31 - Callee-save registers
 * %lr - Return address, volatile
 * %ctr - volatile
 *
 * Register usage in this routine:
 * %r0 - temp
 * %r3 - argument (pointer to 5 words of SHA state)
 * %r4 - argument (pointer to data to hash)
 * %r5 - Constant K in SHA round (initially number of blocks to hash)
 * %r6-%r10 - Working copies of SHA variables A..E (actually E..A order)
 * %r11-%r26 - Data being hashed W[].
 * %r27-%r31 - Previous copies of A..E, for final add back.
 * %ctr - loop count
 */

/*
 * We roll the registers for A, B, C, D, E around on each
 * iteration; D on iteration t is E on iteration t+1, and so on.
 * We use registers 6 - 10 for this.  (Registers 27 - 31 hold
 * the previous values.)
 */
#define RA(t)   (((t)+4)%5+6)
#define RB(t)   (((t)+3)%5+6)
#define RC(t)   (((t)+2)%5+6)
#define RD(t)   (((t)+1)%5+6)
#define RE(t)   (((t)+0)%5+6)

/* We use registers 11 - 26 for the W values */
#define W(t)    ((t)%16+11)

/* Register 5 is used for the constant k */

/*
 * The basic SHA-1 round function is:
 * E += ROTL(A,5) + F(B,C,D) + W[i] + K;  B = ROTL(B,30)
 * Then the variables are renamed: (A,B,C,D,E) = (E,A,B,C,D).
 *
 * Every 20 rounds, the function F() and the constant K change:
 * - 20 rounds of f0(b,c,d) = "bit wise b ? c : d" = (~b & d) + (b & c)
 * - 20 rounds of f1(b,c,d) = b^c^d = (b^d)^c
 * - 20 rounds of f2(b,c,d) = majority(b,c,d) = (b&d) + ((b^d)&c)
 * - 20 more rounds of f1(b,c,d)
 *
 * These are all scheduled for near-optimal performance on a G4.
 * The G4 is a 3-issue out-of-order machine with 3 ALUs, but it can only
 * *consider* starting the oldest 3 instructions per cycle.  So to get
 * maximum performance out of it, you have to treat it as an in-order
 * machine.  Which means interleaving the computation round t with the
 * computation of W[t+4].
 *
 * The first 16 rounds use W values loaded directly from memory, while the
 * remaining 64 use values computed from those first 16.  We preload
 * 4 values before starting, so there are three kinds of rounds:
 * - The first 12 (all f0) also load the W values from memory.
 * - The next 64 compute W(i+4) in parallel. 8*f0, 20*f1, 20*f2, 16*f1.
 * - The last 4 (all f1) do not do anything with W.
 *
 * Therefore, we have 5 different round functions:
 * STEPD0_LOAD(t,s) - Perform round t and load W(s).  s < 16
 * STEPD0_UPDATE(t,s) - Perform round t and compute W(s).  s >= 16.
 * STEPD1_UPDATE(t,s)
 * STEPD2_UPDATE(t,s)
 * STEPD1(t) - Perform round t with no load or update.
 *
 * The G5 is more fully out-of-order, and can find the parallelism
 * by itself.  The big limit is that it has a 2-cycle ALU latency, so
 * even though it's 2-way, the code has to be scheduled as if it's
 * 4-way, which can be a limit.  To help it, we try to schedule the
 * read of RA(t) as late as possible so it doesn't stall waiting for
 * the previous round's RE(t-1), and we try to rotate RB(t) as early
 * as possible while reading RC(t) (= RB(t-1)) as late as possible.
 */

/* the initial loads. */
#define LOADW(s) \
        lwz     W(s),(s)*4(%r4)

/*
 * Perform a step with F0, and load W(s).  Uses W(s) as a temporary
 * before loading it.
 * This is actually 10 instructions, which is an awkward fit.
 * It can execute grouped as listed, or delayed one instruction.
 * (If delayed two instructions, there is a stall before the start of the
 * second line.)  Thus, two iterations take 7 cycles, 3.5 cycles per round.
 */
#define STEPD0_LOAD(t,s) \
add RE(t),RE(t),W(t); andc %r0,RD(t),RB(t); and W(s),RC(t),RB(t); \
add RE(t),RE(t),%r0; rotlwi %r0,RA(t),5; rotlwi RB(t),RB(t),30; \
add RE(t),RE(t),W(s); add %r0,%r0,%r5; lwz W(s),(s)*4(%r4); \
add RE(t),RE(t),%r0

/*
 * This is likewise awkward, 13 instructions.  However, it can also
 * execute starting with 2 out of 3 possible moduli, so it does 2 rounds
 * in 9 cycles, 4.5 cycles/round.
 */
#define STEPD0_UPDATE(t,s,loadk...) \
add RE(t),RE(t),W(t); andc %r0,RD(t),RB(t); xor W(s),W((s)-16),W((s)-3); \
add RE(t),RE(t),%r0; and %r0,RC(t),RB(t); xor W(s),W(s),W((s)-8); \
add RE(t),RE(t),%r0; rotlwi %r0,RA(t),5; xor W(s),W(s),W((s)-14); \
add RE(t),RE(t),%r5; loadk; rotlwi RB(t),RB(t),30; rotlwi W(s),W(s),1; \
add RE(t),RE(t),%r0

/* Nicely optimal.  Conveniently, also the most common. */
#define STEPD1_UPDATE(t,s,loadk...) \
add RE(t),RE(t),W(t); xor %r0,RD(t),RB(t); xor W(s),W((s)-16),W((s)-3); \
add RE(t),RE(t),%r5; loadk; xor %r0,%r0,RC(t); xor W(s),W(s),W((s)-8); \
add RE(t),RE(t),%r0; rotlwi %r0,RA(t),5; xor W(s),W(s),W((s)-14); \
add RE(t),RE(t),%r0; rotlwi RB(t),RB(t),30; rotlwi W(s),W(s),1

/*
 * The naked version, no UPDATE, for the last 4 rounds.  3 cycles per.
 * We could use W(s) as a temp register, but we don't need it.
 */
#define STEPD1(t) \
add RE(t),RE(t),W(t); xor %r0,RD(t),RB(t); \
rotlwi RB(t),RB(t),30; add RE(t),RE(t),%r5; xor %r0,%r0,RC(t); \
add RE(t),RE(t),%r0; rotlwi %r0,RA(t),5; /* spare slot */ \
add RE(t),RE(t),%r0

/*
 * 14 instructions, 5 cycles per.  The majority function is a bit
 * awkward to compute.  This can execute with a 1-instruction delay,
 * but it causes a 2-instruction delay, which triggers a stall.
 */
#define STEPD2_UPDATE(t,s,loadk...) \
add RE(t),RE(t),W(t); and %r0,RD(t),RB(t); xor W(s),W((s)-16),W((s)-3); \
add RE(t),RE(t),%r0; xor %r0,RD(t),RB(t); xor W(s),W(s),W((s)-8); \
add RE(t),RE(t),%r5; loadk; and %r0,%r0,RC(t); xor W(s),W(s),W((s)-14); \
add RE(t),RE(t),%r0; rotlwi %r0,RA(t),5; rotlwi W(s),W(s),1; \
add RE(t),RE(t),%r0; rotlwi RB(t),RB(t),30

#define STEP0_LOAD4(t,s) \
        STEPD0_LOAD(t,s); \
        STEPD0_LOAD((t)+1,(s)+1); \
        STEPD0_LOAD((t)+2,(s)+2); \
        STEPD0_LOAD((t)+3,(s)+3)

#define STEPUP4(fn, t, s, loadk...) \
        STEP##fn##_UPDATE(t,s,); \
        STEP##fn##_UPDATE((t)+1,(s)+1,); \
        STEP##fn##_UPDATE((t)+2,(s)+2,); \
        STEP##fn##_UPDATE((t)+3,(s)+3,loadk)

#define STEPUP20(fn, t, s, loadk...) \
        STEPUP4(fn, t, s,); \
        STEPUP4(fn, (t)+4, (s)+4,); \
        STEPUP4(fn, (t)+8, (s)+8,); \
        STEPUP4(fn, (t)+12, (s)+12,); \
        STEPUP4(fn, (t)+16, (s)+16, loadk)

        .globl  sha1_core
sha1_core:
        stwu    %r1,-80(%r1)
        stmw    %r13,4(%r1)

        /* Load up A - E */
        lmw     %r27,0(%r3)

        mtctr   %r5

1:
        LOADW(0)
        lis     %r5,0x5a82
        mr      RE(0),%r31
        LOADW(1)
        mr      RD(0),%r30
        mr      RC(0),%r29
        LOADW(2)
        ori     %r5,%r5,0x7999  /* K0-19 */
        mr      RB(0),%r28
        LOADW(3)
        mr      RA(0),%r27

        STEP0_LOAD4(0, 4)
        STEP0_LOAD4(4, 8)
        STEP0_LOAD4(8, 12)
        STEPUP4(D0, 12, 16,)
        STEPUP4(D0, 16, 20, lis %r5,0x6ed9)

        ori     %r5,%r5,0xeba1  /* K20-39 */
        STEPUP20(D1, 20, 24, lis %r5,0x8f1b)

        ori     %r5,%r5,0xbcdc  /* K40-59 */
        STEPUP20(D2, 40, 44, lis %r5,0xca62)

        ori     %r5,%r5,0xc1d6  /* K60-79 */
        STEPUP4(D1, 60, 64,)
        STEPUP4(D1, 64, 68,)
        STEPUP4(D1, 68, 72,)
        STEPUP4(D1, 72, 76,)
        addi    %r4,%r4,64
        STEPD1(76)
        STEPD1(77)
        STEPD1(78)
        STEPD1(79)

        /* Add results to original values */
        add     %r31,%r31,RE(0)
        add     %r30,%r30,RD(0)
        add     %r29,%r29,RC(0)
        add     %r28,%r28,RB(0)
        add     %r27,%r27,RA(0)

        bdnz    1b

        /* Save final hash, restore registers, and return */
        stmw    %r27,0(%r3)
        lmw     %r13,4(%r1)
        addi    %r1,%r1,80
        blr
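
A short illustration of why the data reaching sha1_core is usually unaligned. git hashes an object as a small header (type, a space, the decimal size, a NUL byte) followed by the contents, so the header length varies with the size. The Python sketch below is not from the thread. It assumes, and this is only an assumption, that the SHA-1 wrapper buffers the header in a 64-byte block and then passes the remaining full blocks to sha1_core directly from the caller's buffer; the names blob_header, git_blob_id and direct_block_misalignment are made up for the example.

```python
import hashlib


def blob_header(size: int) -> bytes:
    # git's object header for a blob: "blob", space, decimal size, NUL.
    return b"blob %d\0" % size


def git_blob_id(data: bytes) -> str:
    # The object id is the SHA-1 of header + contents.
    return hashlib.sha1(blob_header(len(data)) + data).hexdigest()


def direct_block_misalignment(size: int, base_alignment: int = 0) -> int:
    # ASSUMPTION: the header is buffered, and the first
    # 64 - len(header) content bytes are copied in to complete that block.
    # Direct hashing then starts that many bytes into the caller's buffer.
    # Returns that starting address modulo 4 (0 means word-aligned).
    skipped = 64 - len(blob_header(size))
    return (base_alignment + skipped) % 4


if __name__ == "__main__":
    print(git_blob_id(b"hello world\n"))
    print(len(blob_header(10_000_000)), direct_block_misalignment(10_000_000))
```

Under that assumption a 10-million-byte blob has a 14-byte header, so the directly hashed blocks start 50 bytes into a word-aligned buffer, that is, at an address that is 2 modulo 4.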
* Re: Fixed PPC SHA1
From: linux @ 2006-06-23 0:54 UTC
To: git, linuxppc-dev; +Cc: linux

Here's the lswi-based version that's slightly faster on a G5, but slightly
slower on a G4.

/*
 * SHA-1 implementation for PowerPC.
 *
 * Copyright (C) 2005 Paul Mackerras <paulus@samba.org>
 */

/*
 * PowerPC calling convention:
 * %r0 - volatile temp
 * %r1 - stack pointer.
 * %r2 - reserved
 * %r3-%r12 - Incoming arguments & return values; volatile.
 * %r13-%r31 - Callee-save registers
 * %lr - Return address, volatile
 * %ctr - volatile
 *
 * Register usage in this routine:
 * %r0 - temp
 * %r3 - argument (pointer to 5 words of SHA state)
 * %r4 - argument (pointer to data to hash)
 * %r5-%r20 - Data being hashed W[].  (%r5 is initially count of blocks)
 * %r21 - Constant K in SHA round
 * %r22-%r26 - Working copies of SHA variables A..E (actually E..A order)
 * %r27-%r31 - Previous copies of A..E, for final add back.
 * %ctr - loop count (copied from %r5 argument)
 *
 * It's also worth mentioning that PPC assembly accepts a bare
 * number as a register specifier; the "%r" prefix is actually optional.
 * And that number can be an expression!  That simplifies the
 * loop unrolling significantly.
 */

/*
 * We roll the registers for A, B, C, D, E around on each
 * iteration; D on iteration t is E on iteration t+1, and so on.
 * We use registers 22 - 26 for this.  (Registers 27 - 31 hold
 * the previous values.)
 */
#define RA(t)   (((t)+4)%5+22)
#define RB(t)   (((t)+3)%5+22)
#define RC(t)   (((t)+2)%5+22)
#define RD(t)   (((t)+1)%5+22)
#define RE(t)   (((t)+0)%5+22)

/* Register 21 is used for the constant k */

/* We use registers 5 - 20 for the W values */
#define W(t)    ((t)%16+5)

/*
 * The basic SHA-1 round function is:
 * E += ROTL(A,5) + F(B,C,D) + W[i] + K;  B = ROTL(B,30)
 * Then the variables are renamed: (A,B,C,D,E) = (E,A,B,C,D).
 *
 * Every 20 rounds, the function F() and the constant K change:
 * - 20 rounds of f0(b,c,d) = "bit wise b ? c : d" = (~b & d) + (b & c)
 * - 20 rounds of f1(b,c,d) = b^c^d = (b^d)^c
 * - 20 rounds of f2(b,c,d) = majority(b,c,d) = (b&d) + ((b^d)&c)
 * - 20 more rounds of f1(b,c,d)
 *
 * These are all scheduled for near-optimal performance on a G4.
 * The G4 is a 3-issue out-of-order machine with 3 ALUs, but it can only
 * *consider* starting the oldest 3 instructions per cycle.  So to get
 * maximum performance out of it, you have to treat it as an in-order
 * machine.  Which means interleaving the computation round t with the
 * computation of W[t+4].
 *
 * The first 16 rounds use W values loaded directly from memory, while the
 * remaining 64 use values computed from those first 16.  We preload
 * 4 values before starting, so there are three kinds of rounds:
 * - The first 12 (all f0) also load the W values from memory.
 * - The next 64 compute W(i+4) in parallel. 8*f0, 20*f1, 20*f2, 16*f1.
 * - The last 4 (all f1) do not do anything with W.
 *
 * Therefore, we have 5 different round functions:
 * STEPD0(t,s) - Perform round t
 * STEPD0_UPDATE(t,s) - Perform round t and compute W(s).  s >= 16.
 * STEPD1_UPDATE(t,s)
 * STEPD2_UPDATE(t,s)
 * STEPD1(t) - Perform round t with no load or update.
 *
 * There's also provision for inserting an instruction to start loading
 * the new K value after it's last used in the given step.
 *
 * The G5 is more fully out-of-order, and can find the parallelism
 * by itself.  The big limit is that it has a 2-cycle ALU latency, so
 * even though it's 2-way, the code has to be scheduled as if it's
 * 4-way, which can be a limit.  To help it, we try to schedule the
 * read of RA(t) as late as possible so it doesn't stall waiting for
 * the previous round's RE(t-1), and we try to rotate RB(t) as early
 * as possible while reading RC(t) (= RB(t-1)) as late as possible.
 */

/*
 * Okay, we need a naked version of STEPD0.  It's 9 instructions.
 * Can that be done in 3 cycles, WITHOUT using W(s) as a temp?
 * NO.  So we need W(s) as a temp.  That can be arranged with some
 * clever scheduling.
 */
#define STEPD0(t,s) \
/* spare slot */ add RE(t),RE(t),W(t); andc %r0,RD(t),RB(t); \
add RE(t),RE(t),%r0; and W(s),RC(t),RB(t); rotlwi %r0,RA(t),5; \
add RE(t),RE(t),W(s); add %r0,%r0,%r21; rotlwi RB(t),RB(t),30; \
add RE(t),RE(t),%r0;

/*
 * This can execute starting with 2 out of 3 possible moduli, so it
 * does 2 rounds in 9 cycles, 4.5 cycles/round.
 */
#define STEPD0_UPDATE(t,s,loadk...) \
add RE(t),RE(t),W(t); andc %r0,RD(t),RB(t); xor W(s),W((s)-16),W((s)-3); \
add RE(t),RE(t),%r0; and %r0,RC(t),RB(t); xor W(s),W(s),W((s)-8); \
add RE(t),RE(t),%r0; rotlwi %r0,RA(t),5; xor W(s),W(s),W((s)-14); \
add RE(t),RE(t),%r21; loadk; rotlwi RB(t),RB(t),30; rotlwi W(s),W(s),1; \
add RE(t),RE(t),%r0

/* Nicely optimal.  Conveniently, also the most common. */
#define STEPD1_UPDATE(t,s,loadk...) \
add RE(t),RE(t),W(t); xor %r0,RD(t),RB(t); xor W(s),W((s)-16),W((s)-3); \
add RE(t),RE(t),%r21; loadk; xor %r0,%r0,RC(t); xor W(s),W(s),W((s)-8); \
add RE(t),RE(t),%r0; rotlwi %r0,RA(t),5; xor W(s),W(s),W((s)-14); \
add RE(t),RE(t),%r0; rotlwi RB(t),RB(t),30; rotlwi W(s),W(s),1

/*
 * The naked version, no UPDATE, for the last 4 rounds.  3 cycles per.
 * We could use W(s) as a temp register, but we don't need it.
 */
#define STEPD1(t) \
add RE(t),RE(t),W(t); xor %r0,RD(t),RB(t); \
add RE(t),RE(t),%r21; xor %r0,%r0,RC(t); rotlwi RB(t),RB(t),30; \
add RE(t),RE(t),%r0; rotlwi %r0,RA(t),5; /* spare slot */ \
add RE(t),RE(t),%r0

/* 5 cycles per */
#define STEPD2_UPDATE(t,s,loadk...) \
add RE(t),RE(t),W(t); and %r0,RD(t),RB(t); xor W(s),W((s)-16),W((s)-3); \
add RE(t),RE(t),%r0; xor %r0,RD(t),RB(t); xor W(s),W(s),W((s)-8); \
add RE(t),RE(t),%r21; loadk; and %r0,%r0,RC(t); xor W(s),W(s),W((s)-14); \
add RE(t),RE(t),%r0; rotlwi %r0,RA(t),5; rotlwi W(s),W(s),1; \
add RE(t),RE(t),%r0; rotlwi RB(t),RB(t),30

#define STEP0_LOAD4(t,s) \
        STEPD0_LOAD(t,s); \
        STEPD0_LOAD((t)+1,(s)+1); \
        STEPD0_LOAD((t)+2,(s)+2); \
        STEPD0_LOAD((t)+3,(s)+3);

#define STEP0_4(t,s) \
        STEPD0(t,s); \
        STEPD0((t)+1,(s)+1); \
        STEPD0((t)+2,(s)+2); \
        STEPD0((t)+3,(s)+3);

#define STEP1_4(t) \
        STEPD1(t); \
        STEPD1((t)+1); \
        STEPD1((t)+2); \
        STEPD1((t)+3);

#define STEPUP4(fn, t, s, loadk...) \
        STEP##fn##_UPDATE(t,s,); \
        STEP##fn##_UPDATE((t)+1,(s)+1,); \
        STEP##fn##_UPDATE((t)+2,(s)+2,); \
        STEP##fn##_UPDATE((t)+3,(s)+3,loadk)

#define STEPUP20(fn, t, s, loadk...) \
        STEPUP4(fn, t, s,); \
        STEPUP4(fn, (t)+4, (s)+4,); \
        STEPUP4(fn, (t)+8, (s)+8,); \
        STEPUP4(fn, (t)+12, (s)+12,); \
        STEPUP4(fn, (t)+16, (s)+16, loadk)

        .globl  sha1_core
sha1_core:
        stwu    %r1,-80(%r1)
        stmw    %r13,4(%r1)

        /* Load up A - E */
        lmw     %r27,0(%r3)

        mtctr   %r5

1:
        lswi    W(0),%r4,32
        lis     %r21,0x5a82
        addi    %r4,%r4,32
        mr      RE(0),%r31
        mr      RD(0),%r30
        mr      RC(0),%r29
        ori     %r21,%r21,0x7999        /* K0-19 */
        mr      RB(0),%r28
        mr      RA(0),%r27

        STEP0_4(0, 8)
        STEP0_4(4, 12)
        lswi    W(8),%r4,32
        STEPUP4(D0, 8, 16,)
        STEPUP4(D0, 12, 20,)
        STEPUP4(D0, 16, 24, lis %r21,0x6ed9)

        ori     %r21,%r21,0xeba1        /* K20-39 */
        STEPUP20(D1, 20, 28, lis %r21,0x8f1b)

        ori     %r21,%r21,0xbcdc        /* K40-59 */
        STEPUP20(D2, 40, 48, lis %r21,0xca62)

        ori     %r21,%r21,0xc1d6        /* K60-79 */
        STEPUP4(D1, 60, 68,)
        STEPUP4(D1, 64, 72,)
        STEPUP4(D1, 68, 76,)
        addi    %r4,%r4,32
        STEP1_4(72);
        STEP1_4(76);

        /* Add results to original values */
        add     %r31,%r31,RE(0)
        add     %r30,%r30,RD(0)
        add     %r29,%r29,RC(0)
        add     %r28,%r28,RB(0)
        add     %r27,%r27,RA(0)

        bdnz    1b

        /* Save final hash, restore registers, and return */
        stmw    %r27,0(%r3)
        lmw     %r13,4(%r1)
        addi    %r1,%r1,80
        blr
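
For anyone wanting to test an assembly core like this without reading the scheduling, a plain reference for the same round structure may help. The following is a minimal Python sketch, not the posted code: it uses the f0/f1/f2 forms from the comments and a 16-word circular W buffer, but computes each W value just before the round that needs it rather than several rounds ahead. The names sha1_block, sha1 and matches_hashlib are illustrative, and hashlib is used only as an independent check.

```python
import hashlib
import struct

MASK = 0xFFFFFFFF


def rotl(x, n):
    # 32-bit rotate left, the counterpart of rotlwi.
    return ((x << n) | (x >> (32 - n))) & MASK


# The three F functions in the add-based forms given in the comments.
def f0(b, c, d): return ((~b & d) + (b & c)) & MASK      # b ? c : d
def f1(b, c, d): return b ^ c ^ d                        # parity
def f2(b, c, d): return ((b & d) + ((b ^ d) & c)) & MASK  # majority

# (F, K) for rounds 0-19, 20-39, 40-59, 60-79.
ROUNDS = [(f0, 0x5A827999), (f1, 0x6ED9EBA1), (f2, 0x8F1BBCDC), (f1, 0xCA62C1D6)]


def sha1_block(state, block):
    w = list(struct.unpack(">16I", block))  # 16-word circular buffer
    a, b, c, d, e = state
    for t in range(80):
        if t >= 16:
            # Same recurrence as the *_UPDATE macros: s-16, s-3, s-8, s-14.
            w[t % 16] = rotl(w[(t - 16) % 16] ^ w[(t - 3) % 16]
                             ^ w[(t - 8) % 16] ^ w[(t - 14) % 16], 1)
        f, k = ROUNDS[t // 20]
        e = (e + rotl(a, 5) + f(b, c, d) + w[t % 16] + k) & MASK
        b = rotl(b, 30)
        a, b, c, d, e = e, a, b, c, d  # the renaming step
    return tuple((x + y) & MASK for x, y in zip(state, (a, b, c, d, e)))


def sha1(data: bytes) -> str:
    state = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)
    padded = (data + b"\x80" + b"\x00" * ((55 - len(data)) % 64)
              + struct.pack(">Q", 8 * len(data)))
    for i in range(0, len(padded), 64):
        state = sha1_block(state, padded[i:i + 64])
    return struct.pack(">5I", *state).hex()


def matches_hashlib(data: bytes) -> bool:
    return sha1(data) == hashlib.sha1(data).hexdigest()
```

The additions in f0 and f2 are safe because the two terms never have a bit set in the same position, which is what lets the assembly fold them into the running sum for E one at a time.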
* Re: Fixed PPC SHA1
From: Benjamin Herrenschmidt @ 2006-06-27 22:50 UTC
To: linux; +Cc: git, linuxppc-dev

On Thu, 2006-06-22 at 20:54 -0400, linux@horizon.com wrote:
> Here's the lswi-based version that's slightly faster on a G5, but slightly
> slower on a G4.

I wouldn't bother with 2 versions... use the non-string version (string
operations will cause performance problems on other processors).

Ben.
* Figured out how to get Mozilla into git
From: Jon Smirl @ 2006-06-09 2:17 UTC
To: git

I was able to import Mozilla into SVN without problem; it just
occurred to me to then import the SVN repository into git. The import
has been running a few hours now and it is up to the year 2000 (starts
in 1998). Since I haven't hit any errors yet it will probably finish
ok. I should have the results in the morning. I wonder how long it
will take to start gitk on a 10GB repository.

Once I get this monster into git, are there tools that will let me
keep it in sync with Mozilla CVS?

SVN renamed numeric branches to this form, unlabeled-3.7.24, so that
may be a problem.

Any advice on how to pack this to make it run faster?

--
Jon Smirl
jonsmirl@gmail.com
* Re: Figured out how to get Mozilla into git
From: Martin Langhoff @ 2006-06-09 3:06 UTC
To: Jon Smirl; +Cc: git

Jon,

oh, I went back to a cvsimport that I started a couple days ago.
Completed with no problems...

Last commit:
commit 5ecb56b9c4566618fad602a8da656477e4c6447a
Author: wtchang%redhat.com <wtchang%redhat.com>
Date:   Fri Jun 2 17:20:37 2006 +0000

    Import NSPR 4.6.2 and NSS 3.11.1

mozilla.git$ du -sh .git/
2.0G    .git/

It took

43492.19user 53504.77system 40:23:49elapsed 66%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (77334major+3122469478minor)pagefaults 0swaps

> I should have the results in the morning. I wonder how long it will
> take to start gitk on a 10GB repository.

Hopefully not that big :) -- anyway, just do gitk --max-count=1000

> Once I get this monster into git, are there tools that will let me
> keep it in sync with Mozilla CVS?

If you use git-cvsimport, you can safely re-run it on a cronjob to
keep it in sync. Not too sure about the cvs2svn => git-svnimport,
though git-svnimport does support incremental imports.

> SVN renamed numeric branches to this form, unlabeled-3.7.24, so that
> may be a problem.

Ouch,

> Any advice on how to pack this to make it run faster?

git-repack -a -d but it OOMs on my 2GB+2GBswap machine :(

martin
* Re: Figured out how to get Mozilla into git
From: Martin Langhoff @ 2006-06-10 1:14 UTC
To: Jon Smirl; +Cc: git

On 6/9/06, Martin Langhoff <martin.langhoff@gmail.com> wrote:
> mozilla.git$ du -sh .git/
> 2.0G .git/

Ok -- pushed the repository out to our mirror box. Try:

git-clone http://mirrors.catalyst.net.nz/pub/mozilla.git/

Now, good news. No, _very_ good news. As I was rsync'ing this out, and
looking at the repo, suddenly something was odd. Apparently after a
git-repack -a -d OOM'd on me, and I had posted this message, I re-ran it.

[As it happens I have been running several imports of gentoo and moz
lately on the box. It is entirely possible that cvsps or a stray
git-cvsimport was sitting on a whole lot of ram at the time]

Now I don't know how much memory or time this took, but it clearly
completed ok. And, it's now a single pack, weighing a grand total of
617MB.

So my comments about OOM'ing were wrong, apparently. Hey, if the whole
history is actually only 617MB, then initial checkouts are back to
something reasonable, I'd say.

cheers,

martin
* Re: Figured out how to get Mozilla into git
From: Linus Torvalds @ 2006-06-10 1:33 UTC
To: Martin Langhoff; +Cc: Jon Smirl, git

On Sat, 10 Jun 2006, Martin Langhoff wrote:
>
> Now I don't know how much memory or time this took, but it clearly
> completed ok. And, it's now a single pack, weighing a grand total of
> 617MB

Ok, that's more than reasonable. That should be fairly easily mapped on a
32-bit architecture without any huge problems, even with some VM
fragmentation going on. It might be borderline (and you definitely want a
3:1 VM user:kernel split), but considering that the original CVS archive
was apparently 3GB, having a single 617M pack-file is still pretty damn
good. That's like 20% of the original, with all the obvious distribution
advantages.

Clearly this whole thing _does_ show that we could improve the process of
importing things from CVS a whole lot, and I assume your 617MB pack
doesn't have the nice name/email translations so it needs to be fixed up,
but it sounds like on the whole the core git design came through with
shining colors, even if we may want to polish things up a bit ;)

I'm downloading the thing right now.

Linus
* Re: Figured out how to get Mozilla into git
From: Nicolas Pitre @ 2006-06-11 22:00 UTC
To: Linus Torvalds; +Cc: Martin Langhoff, Jon Smirl, git

On Fri, 9 Jun 2006, Linus Torvalds wrote:

>
>
> On Sat, 10 Jun 2006, Martin Langhoff wrote:
> >
> > Now I don't know how much memory or time this took, but it clearly
> > completed ok. And, it's now a single pack, weighing a grand total of
> > 617MB
>
> Ok, that's more than reasonable. That should be fairly easily mapped on a
> 32-bit architecture without any huge problems, even with some VM
> fragmentation going on. It might be borderline (and you definitely want a
> 3:1 VM user:kernel split), but considering that the original CVS archive
> was apparently 3GB, having a single 617M pack-file is still pretty damn
> good. That's like 20% of the original, with all the obvious distribution
> advantages.

I played a bit with git-repack on that repo. The git-pack-objects
memory usage grew to around 760MB (git-rev-list was less than that). So
LRU of partial pack mappings might bring that down significantly.

Then I used git-repack -a -f --window=20 --depth=20 which produced a
nice 468MB pack file along with the invariant 45MB index file for a
grand total of 535MB for the whole repo (the .git/refs/ directory alone
still occupies 17MB on disk).

So it is probably worth having deeper delta chains for large historic
repositories, as the deep revisions are unlikely to be referenced that
often while the saving is quite significant.

Nicolas
* Re: Figured out how to get Mozilla into git
  2006-06-11 22:00 ` Nicolas Pitre
@ 2006-06-18 19:26 ` Linus Torvalds
  2006-06-18 21:40 ` Martin Langhoff
  0 siblings, 1 reply; 6+ messages in thread
From: Linus Torvalds @ 2006-06-18 19:26 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Martin Langhoff, Jon Smirl, git

On Sun, 11 Jun 2006, Nicolas Pitre wrote:
>
> Then I used git-repack -a -f --window=20 --depth=20 which produced a
> nice 468MB pack file along with the invariant 45MB index file for a
> grand total of 535MB for the whole repo (the .git/refs/ directory alone
> still occupies 17MB on disk).

Btw, can others with that mozilla repo confirm that a mozilla repository
that has been repacked seems to be entirely fine, but git-fsck-objects
(with "--full", of course) will report

	error: Packfile .git/objects/pack/pack-06389c21fc3c4312cbc9a4ddde087c907c1a840b.pack SHA1 mismatch with itself

for me (the fsck then completes with no other errors what-so-ever, so the
contents are actually fine).

Or is it just me?

		Linus

^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Figured out how to get Mozilla into git
  2006-06-18 19:26 ` Linus Torvalds
@ 2006-06-18 21:40 ` Martin Langhoff
  2006-06-18 22:36 ` Linus Torvalds
  0 siblings, 1 reply; 6+ messages in thread
From: Martin Langhoff @ 2006-06-18 21:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, Jon Smirl, git

On 6/19/06, Linus Torvalds <torvalds@osdl.org> wrote:
> Or is it just me?

No problems here with my latest import run. fsck-objects --full comes
clean, takes 14m:

/usr/bin/time git-fsck-objects --full
737.22user 38.79system 14:09.40elapsed 91%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (20807major+19483471minor)pagefaults 0swaps

BTW, that import (with the latest code Junio has) took 37 hours even with
the aggressive repack -a -d. I want to bench it dropping the -a from
the recurring repack, and doing a final repack -a -d.

cheers,

martin

^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Figured out how to get Mozilla into git
  2006-06-18 21:40 ` Martin Langhoff
@ 2006-06-18 22:36 ` Linus Torvalds
  2006-06-18 22:51 ` Broken PPC sha1.. (Re: Figured out how to get Mozilla into git) Linus Torvalds
  0 siblings, 1 reply; 6+ messages in thread
From: Linus Torvalds @ 2006-06-18 22:36 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Nicolas Pitre, Jon Smirl, git

On Mon, 19 Jun 2006, Martin Langhoff wrote:
>
> No problems here with my latest import run. fsck-objects --full comes
> clean, takes 14m:
>
> /usr/bin/time git-fsck-objects --full
> 737.22user 38.79system 14:09.40elapsed 91%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (20807major+19483471minor)pagefaults 0swaps

It takes much less than that for me:

	408.40user 32.56system 7:22.07elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
	0inputs+0outputs (145major+13455672minor)pagefaults 0swaps

and in particular note the much lower minor pagefaults number (which is a
very good approximation of total RSS). Mine is with all the memory
optimizations in place, but I didn't see _that_ big of a difference, so
there's something else in addition.

However, the fact that I get "SHA1 mismatch with itself" is strange. The
re-pack will always re-generate the SHA1, so I worry that this is perhaps
some PPC-specific bug in SHA1 handling (and it's entirely possible that
it's triggered by doing a SHA1 over a 500+MB area).

The fact that you don't see it is indicative that it's somehow specific to
my setup.

> BTW, that import (with the latest code Junio has) took 37 hours even with
> the aggressive repack -a -d. I want to bench it dropping the -a from
> the recurring repack, and doing a final repack -a -d.

Yeah, that's probably the right thing to do. The "-a" is ok with tons of
memory, and I'm trying to make it ok with _less_ memory, but it's probably
just not worth it.

		Linus

^ permalink raw reply [flat|nested] 6+ messages in thread
* Broken PPC sha1.. (Re: Figured out how to get Mozilla into git)
  2006-06-18 22:36 ` Linus Torvalds
@ 2006-06-18 22:51 ` Linus Torvalds
  0 siblings, 0 replies; 6+ messages in thread
From: Linus Torvalds @ 2006-06-18 22:51 UTC (permalink / raw)
  To: Martin Langhoff, Paul Mackerras; +Cc: Nicolas Pitre, Jon Smirl, git

On Sun, 18 Jun 2006, Linus Torvalds wrote:
>
> On Mon, 19 Jun 2006, Martin Langhoff wrote:
> >
> > No problems here with my latest import run. fsck-objects --full comes
> > clean, takes 14m:
> >
> > /usr/bin/time git-fsck-objects --full
> > 737.22user 38.79system 14:09.40elapsed 91%CPU (0avgtext+0avgdata 0maxresident)k
> > 0inputs+0outputs (20807major+19483471minor)pagefaults 0swaps
>
> It takes much less than that for me:
>
> 408.40user 32.56system 7:22.07elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (145major+13455672minor)pagefaults 0swaps

Ok, re-building the thing with MOZILLA_SHA1=1 rather than my default
PPC_SHA1=1 fixes the problem. I no longer get that "SHA1 mismatch with
itself" on the pack-file.

Sadly, it also takes a _lot_ longer to fsck.

Paul - I think the ppc SHA1_Update() overflows in 32 bits, when the length
of the memory area to be checksummed is huge. In particular, the pack-file
is 535MB in size, and the way we check the SHA1 checksum is by just
mapping it all, doing a single SHA1_Update() over the whole pack-file, and
comparing the end result with the internal SHA1 at the end of the
pack-file.

The PPC SHA1_Update() function starts off with:

	int SHA1_Update(SHA_CTX *c, const void *ptr, unsigned long n)
	{
		...
		c->len += n << 3;

which will obviously overflow if "n" is bigger than 29 bits, ie 512MB.

So doing the length in bits (or whatever that "<<3" is there for) doesn't
seem to be such a great idea.

I guess we could make the caller just always chunk it up, but wouldn't it
be nice to fix the PPC SHA1 implementation instead?

That said, the _only_ thing this will ever trigger on in practice is
exactly this one case: a large packfile whose checksum was _correctly_
generated - because pack-file generation does it in IO chunks using the
csum-file interfaces - but that will be incorrectly checked because we
check it all at once.

So as bugs go, it's a fairly benign one.

		Linus

^ permalink raw reply [flat|nested] 6+ messages in thread
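The overflow described in the message above can be sketched in isolation. This is only an illustration, not the actual ppc/sha1.c code: it assumes a 32-bit "unsigned long" (modelled here as uint32_t) and a 64-bit length field in the context, neither of which the message states, and the names add_len_buggy, add_len_fixed and add_len_chunked are hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical model of "c->len += n << 3" when n is a 32-bit value:
 * the shift happens in 32 bits, so the top three bits of n are lost
 * before the result is added to the (assumed 64-bit) bit counter.
 */
static uint64_t add_len_buggy(uint64_t len, uint32_t n)
{
    return len + (uint32_t)(n << 3);
}

/* One possible fix: widen n to 64 bits before shifting. */
static uint64_t add_len_fixed(uint64_t len, uint32_t n)
{
    return len + ((uint64_t)n << 3);
}

/*
 * Caller-side workaround: feed the data in pieces smaller than 2^29
 * bytes, so that even the buggy shift never overflows. This mirrors
 * the observation that chunked pack generation gets the right answer.
 */
static uint64_t add_len_chunked(uint64_t len, uint64_t total, uint32_t chunk)
{
    if (chunk == 0)
        return len; /* nothing sensible to do with a zero chunk size */

    while (total > 0) {
        uint32_t piece = total < chunk ? (uint32_t)total : chunk;

        len = add_len_buggy(len, piece);
        total -= piece;
    }
    return len;
}
```

Taking the 535MB pack-file as 535 * 2^20 bytes, the correct bit count is 4487905280, while the 32-bit shift yields 192937984 (the true value modulo 2^32). Feeding the same number of bytes in 8192-byte pieces gives the correct total even through the buggy path, which is consistent with the checksum being generated correctly but verified incorrectly.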