* crc32() optimization [not found] ` <24987.1036797874@passion.cambridge.redhat.com> @ 2002-11-10 15:28 ` Joakim Tjernlund 2002-11-10 18:43 ` Marc Singer 0 siblings, 1 reply; 17+ messages in thread From: Joakim Tjernlund @ 2002-11-10 15:28 UTC (permalink / raw) To: David Woodhouse; +Cc: jffs-dev, linux-mtd Hi David This patch improves my scan time with 22%( from 2.39 to 1.86 seconds). Maybe you want to include it in the 2.4 branch. I will put this in my next backport of the crc32 stuff from 2.5. Jocke Index: fs/jffs2/crc32.h =================================================================== RCS file: /home/cvs/mtd/fs/jffs2/crc32.h,v retrieving revision 1.3 diff -u -b -r1.3 crc32.h --- fs/jffs2/crc32.h 26 Feb 2001 14:44:37 -0000 1.3 +++ fs/jffs2/crc32.h 10 Nov 2002 15:25:11 -0000 @@ -13,7 +13,16 @@ crc32(__u32 val, const void *ss, int len) { const unsigned char *s = ss; - while (--len >= 0) + while (len >= 6){ + val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8); + val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8); + val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8); + val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8); + val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8); + val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8); + len -= 6; + } + while (len--) val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8); return val; } ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: crc32() optimization 2002-11-10 15:28 ` crc32() optimization Joakim Tjernlund @ 2002-11-10 18:43 ` Marc Singer 2002-11-10 19:25 ` Wolfgang Denk 2002-11-10 20:04 ` Joakim Tjernlund 0 siblings, 2 replies; 17+ messages in thread From: Marc Singer @ 2002-11-10 18:43 UTC (permalink / raw) To: Joakim Tjernlund; +Cc: linux-mtd As it should. I wonder if you'd do better changing the loop slightly. Check for len == 0 and do a short-circuit return. Then do this for (++len; len & 0x7; len >>= 3) { ONCE(); // repeat eight times ... len >>= 3; } while (--len > 0) ONCE(); This is the implementation I've written for another project which we've found to be relatively optimal. Note that len *must* be an int even though contemporary convention is to use the size_t type. On Sun, Nov 10, 2002 at 04:28:00PM +0100, Joakim Tjernlund wrote: > Hi David > > This patch improves my scan time with 22%( from 2.39 to 1.86 seconds). > Maybe you want to include it in the 2.4 branch. > > I will put this in my next backport of the crc32 stuff from 2.5. > > Jocke > > Index: fs/jffs2/crc32.h > =================================================================== > RCS file: /home/cvs/mtd/fs/jffs2/crc32.h,v > retrieving revision 1.3 > diff -u -b -r1.3 crc32.h > --- fs/jffs2/crc32.h 26 Feb 2001 14:44:37 -0000 1.3 > +++ fs/jffs2/crc32.h 10 Nov 2002 15:25:11 -0000 > @@ -13,7 +13,16 @@ > crc32(__u32 val, const void *ss, int len) > { > const unsigned char *s = ss; > - while (--len >= 0) > + while (len >= 6){ > + val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8); > + val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8); > + val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8); > + val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8); > + val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8); > + val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8); > + len -= 6; > + } > + while (len--) > val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8); > return val; > } > > > ______________________________________________________ > Linux MTD discussion mailing list > http://lists.infradead.org/mailman/listinfo/linux-mtd/ ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: crc32() optimization 2002-11-10 18:43 ` Marc Singer @ 2002-11-10 19:25 ` Wolfgang Denk 2002-11-10 20:05 ` Joakim Tjernlund 2002-11-11 0:50 ` Marc Singer 2002-11-10 20:04 ` Joakim Tjernlund 1 sibling, 2 replies; 17+ messages in thread From: Wolfgang Denk @ 2002-11-10 19:25 UTC (permalink / raw) To: Marc Singer; +Cc: Joakim Tjernlund, linux-mtd In message <20021110184321.GB16087@buici.com> you wrote: > As it should. I wonder if you'd do better changing the loop slightly. > > Check for len == 0 and do a short-circuit return. Then do this > > for (++len; len & 0x7; len >>= 3) { > ONCE(); // repeat eight times > ... > len >>= 3; > } > while (--len > 0) > ONCE(); > > This is the implementation I've written for another project which Seems broken to me, since you "len >>= 3" twice. Also, Duff's Device comes to mind. Best regards, Wolfgang Denk -- Software Engineering: Embedded and Realtime Systems, Embedded Linux Phone: (+49)-8142-4596-87 Fax: (+49)-8142-4596-88 Email: wd@denx.de See us @ electronica 2002 in Munich, Nov 12-15, Hall A3, Booth A3.325 ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: crc32() optimization 2002-11-10 19:25 ` Wolfgang Denk @ 2002-11-10 20:05 ` Joakim Tjernlund 2002-11-10 21:00 ` Wolfgang Denk 2002-11-11 0:50 ` Marc Singer 1 sibling, 1 reply; 17+ messages in thread From: Joakim Tjernlund @ 2002-11-10 20:05 UTC (permalink / raw) To: Marc Singer, Wolfgang Denk; +Cc: linux-mtd Hi Wolfgang What's "Duff's Device"? Jocke > > Seems broken to me, since you "len >>= 3" twice. > > Also, Duff's Device comes to mind. > > Best regards, > > Wolfgang Denk > > -- > Software Engineering: Embedded and Realtime Systems, Embedded Linux > Phone: (+49)-8142-4596-87 Fax: (+49)-8142-4596-88 Email: wd@denx.de > See us @ electronica 2002 in Munich, Nov 12-15, Hall A3, Booth A3.325 ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: crc32() optimization 2002-11-10 20:05 ` Joakim Tjernlund @ 2002-11-10 21:00 ` Wolfgang Denk 2002-11-10 21:22 ` Joakim Tjernlund 2002-11-11 1:31 ` Marc Singer 0 siblings, 2 replies; 17+ messages in thread From: Wolfgang Denk @ 2002-11-10 21:00 UTC (permalink / raw) To: Joakim Tjernlund; +Cc: Marc Singer, linux-mtd In message <006901c288f4$79e2ce20$0200a8c0@telia.com> you wrote: > > What's "Duff's Device"? It's a tricky way to implement general loop unrolling directly in C. Applied to your problem, code that looks like this (instead of 8 any other loop count may be used, but you need to adjust the "case" statements then): register int n = (len + (8-1)) / 8; switch (len % 8) { case 0: do { val = crc32_table ... ; case 7: val = crc32_table ... ; case 6: val = crc32_table ... ; case 5: val = crc32_table ... ; case 4: val = crc32_table ... ; case 3: val = crc32_table ... ; case 2: val = crc32_table ... ; case 1: val = crc32_table ... ; } while (--n > 0); } BTW: this is strictly legal ANSI C! For an explanation see http://www.lysator.liu.se/c/duffs-device.html Best regards, Wolfgang Denk -- Software Engineering: Embedded and Realtime Systems, Embedded Linux Phone: (+49)-8142-4596-87 Fax: (+49)-8142-4596-88 Email: wd@denx.de See us @ electronica 2002 in Munich, Nov 12-15, Hall A3, Booth A3.325 ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: crc32() optimization 2002-11-10 21:00 ` Wolfgang Denk @ 2002-11-10 21:22 ` Joakim Tjernlund 2002-11-10 22:35 ` Joakim Tjernlund 2002-11-11 1:31 ` Marc Singer 1 sibling, 1 reply; 17+ messages in thread From: Joakim Tjernlund @ 2002-11-10 21:22 UTC (permalink / raw) To: Wolfgang Denk; +Cc: Marc Singer, linux-mtd Cool! I will give it a try(tomorrow afternoon) and see what happens. Jocke > > > > What's "Duff's Device"? > > It's a tricky way to implement general loop unrolling directly in C. > Applied to your problem, code that looks like this (instead of 8 any > other loop count may be used, but you need to adjust the "case" > statements then): > > register int n = (len + (8-1)) / 8; > > switch (len % 8) { > case 0: do { val = crc32_table ... ; > case 7: val = crc32_table ... ; > case 6: val = crc32_table ... ; > case 5: val = crc32_table ... ; > case 4: val = crc32_table ... ; > case 3: val = crc32_table ... ; > case 2: val = crc32_table ... ; > case 1: val = crc32_table ... ; > } while (--n > 0); > } > > BTW: this is strictly legal ANSI C! > > For an explanation see http://www.lysator.liu.se/c/duffs-device.html > > Best regards, > > Wolfgang Denk > > -- > Software Engineering: Embedded and Realtime Systems, Embedded Linux > Phone: (+49)-8142-4596-87 Fax: (+49)-8142-4596-88 Email: wd@denx.de > See us @ electronica 2002 in Munich, Nov 12-15, Hall A3, Booth A3.325 ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: crc32() optimization 2002-11-10 21:22 ` Joakim Tjernlund @ 2002-11-10 22:35 ` Joakim Tjernlund 2002-11-10 22:41 ` Wolfgang Denk 0 siblings, 1 reply; 17+ messages in thread From: Joakim Tjernlund @ 2002-11-10 22:35 UTC (permalink / raw) To: Wolfgang Denk; +Cc: Marc Singer, linux-mtd I could not wait until tomorrow, so I did it now instead. The result was worse. The best I got was 7% improvement. I tried 16, 8, 6 and 4 as unrolling steps. Jocke > Cool! I will give it a try(tomorrow afternoon) and see what happens. > > Jocke > > > > > > What's "Duff's Device"? > > > > It's a tricky way to implement general loop unrolling directly in C. > > Applied to your problem, code that looks like this (instead of 8 any > > other loop count may be used, but you need to adjust the "case" > > statements then): > > > > register int n = (len + (8-1)) / 8; > > > > switch (len % 8) { > > case 0: do { val = crc32_table ... ; > > case 7: val = crc32_table ... ; > > case 6: val = crc32_table ... ; > > case 5: val = crc32_table ... ; > > case 4: val = crc32_table ... ; > > case 3: val = crc32_table ... ; > > case 2: val = crc32_table ... ; > > case 1: val = crc32_table ... ; > > } while (--n > 0); > > } > > > > BTW: this is strictly legal ANSI C! > > > > For an explanation see http://www.lysator.liu.se/c/duffs-device.html > > > > Best regards, > > > > Wolfgang Denk > > > > -- > > Software Engineering: Embedded and Realtime Systems, Embedded Linux > > Phone: (+49)-8142-4596-87 Fax: (+49)-8142-4596-88 Email: wd@denx.de > > See us @ electronica 2002 in Munich, Nov 12-15, Hall A3, Booth A3.325 > > ______________________________________________________ > Linux MTD discussion mailing list > http://lists.infradead.org/mailman/listinfo/linux-mtd/ ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: crc32() optimization 2002-11-10 22:35 ` Joakim Tjernlund @ 2002-11-10 22:41 ` Wolfgang Denk 2002-11-10 23:00 ` Joakim Tjernlund 0 siblings, 1 reply; 17+ messages in thread From: Wolfgang Denk @ 2002-11-10 22:41 UTC (permalink / raw) To: Joakim Tjernlund; +Cc: Marc Singer, linux-mtd In message <001301c28909$743f1f40$0200a8c0@telia.com> you wrote: > I could not wait until tomorrow, so I did it now instead. > The result was worse. The best I got was 7% improvement. > I tried 16, 8, 6 and 4 as unrolling steps. Makes no sense to me. Should be at least as efficient as your original code (marginally better). Best regards, Wolfgang Denk -- Software Engineering: Embedded and Realtime Systems, Embedded Linux Phone: (+49)-8142-4596-87 Fax: (+49)-8142-4596-88 Email: wd@denx.de See us @ electronica 2002 in Munich, Nov 12-15, Hall A3, Booth A3.325 ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: crc32() optimization 2002-11-10 22:41 ` Wolfgang Denk @ 2002-11-10 23:00 ` Joakim Tjernlund 2002-11-10 23:56 ` Eric W. Biederman 0 siblings, 1 reply; 17+ messages in thread From: Joakim Tjernlund @ 2002-11-10 23:00 UTC (permalink / raw) To: Wolfgang Denk; +Cc: Marc Singer, linux-mtd > In message <001301c28909$743f1f40$0200a8c0@telia.com> you wrote: > > I could not wait until tomorrow, so I did it now instead. > > The result was worse. The best I got was 7% improvement. > > I tried 16, 8, 6 and 4 as unrolling steps. > > Makes no sense to me. Should be at least as efficient as your > original code (marginally better). I don't understand this either. Anyone? Jocke ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: crc32() optimization 2002-11-10 23:00 ` Joakim Tjernlund @ 2002-11-10 23:56 ` Eric W. Biederman 0 siblings, 0 replies; 17+ messages in thread From: Eric W. Biederman @ 2002-11-10 23:56 UTC (permalink / raw) To: Joakim Tjernlund; +Cc: Wolfgang Denk, Marc Singer, linux-mtd "Joakim Tjernlund" <Joakim.Tjernlund@lumentis.se> writes: > > In message <001301c28909$743f1f40$0200a8c0@telia.com> you wrote: > > > I could not wait until tomorrow, so I did it now instead. > > > The result was worse. The best I got was 7% improvement. > > > I tried 16, 8, 6 and 4 as unrolling steps. > > > > Makes no sense to me. Should be at least as efficient as your > > original code (marginally better). > > I don't understand this either. > Anyone? You might try it with 6. But a lot depends on what gcc can do with it and gcc may not be like all of those potential entry points.. Running gcc -S and checking to see the difference in the generated assembly might be instructive. Eric ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: crc32() optimization 2002-11-10 21:00 ` Wolfgang Denk 2002-11-10 21:22 ` Joakim Tjernlund @ 2002-11-11 1:31 ` Marc Singer 2002-11-11 1:37 ` Wolfgang Denk 1 sibling, 1 reply; 17+ messages in thread From: Marc Singer @ 2002-11-11 1:31 UTC (permalink / raw) To: Wolfgang Denk; +Cc: Joakim Tjernlund, linux-mtd On Sun, Nov 10, 2002 at 10:00:03PM +0100, Wolfgang Denk wrote: > In message <006901c288f4$79e2ce20$0200a8c0@telia.com> you wrote: > > > > What's "Duff's Device"? > > It's a tricky way to implement general loop unrolling directly in C. > Applied to your problem, code that looks like this (instead of 8 any > other loop count may be used, but you need to adjust the "case" > statements then): > > register int n = (len + (8-1)) / 8; > > switch (len % 8) { > case 0: do { val = crc32_table ... ; > case 7: val = crc32_table ... ; > case 6: val = crc32_table ... ; > case 5: val = crc32_table ... ; > case 4: val = crc32_table ... ; > case 3: val = crc32_table ... ; > case 2: val = crc32_table ... ; > case 1: val = crc32_table ... ; > } while (--n > 0); > } This doesn't look right to me. You are decrementing n but using the modulus of len in the switch. The len modulus is correct when n == 1, but not when n > 1. The idea makes sense, but the implementation appears to be missing a detail. As for performance problems, I believe that the trouble is evident from the assembler output. The reason that the unrolled loop is more efficient than the simple loop is mainly because you don't jump as often. We all know that jumps tend to perturb the instruction fetch queue and cache. Here is an example. I know it is wrong, but it shows how the compiler implements the switch. I've marked the problem jump. It is executed for every iteration of the loop so it tends to negate other changes made to the algorithm. So, the version I posted may not be significantly better that the original one that unrolled 6 times. It is just that unrolling to a power of two sometimes makes the math simpler and sometimes improves the performance. The present crop of CPUs performs all simple ALU functions in a single cycle, so there is little reason to worry about how many loops to unroll. BTW, I checked my code and found I made several errors. The increment step in for() is a decrement by 8, not a shift. Cheers. -------------------------------------------------- #define ONCE do { ++i; } while (0); int foo () { int i = 0; int c; for (c = 20 ; c > 0; c -= 8) { switch (c % 8) { case 0: ONCE; case 7: ONCE; case 6: ONCE; case 5: ONCE; case 4: ONCE; case 3: ONCE; case 2: ONCE; case 1: ONCE; } } return i; } int main () { foo (); } -------------------------------------------------- .file "unroll.c" .text .align 2 .p2align 2,,3 .globl foo .type foo,@function foo: pushl %ebp movl %esp, %ebp xorl %eax, %eax pushl %ebx movl $20, %ecx .p2align 2,,3 .L26: testl %ecx, %ecx movl %ecx, %edx js .L29 .L25: andl $-8, %edx movl %ecx, %ebx subl %edx, %ebx cmpl $7, %ebx ja .L4 ;;; **** This is the problem jump. jmp *.L23(,%ebx,4) ;;; **** This is the problem jump. .section .rodata .align 4 .align 4 .L23: .long .L7 .long .L21 .long .L19 .long .L17 .long .L15 .long .L13 .long .L11 .long .L9 .text .L7: incl %eax .L9: incl %eax .L11: incl %eax .L13: incl %eax .L15: incl %eax .L17: incl %eax .L19: incl %eax .L21: incl %eax .L4: subl $8, %ecx testl %ecx, %ecx jg .L26 popl %ebx leave ret .p2align 2,,3 .L29: leal 7(%ecx), %edx jmp .L25 .Lfe1: .size foo,.Lfe1-foo .align 2 .p2align 2,,3 .globl main .type main,@function main: pushl %ebp movl %esp, %ebp subl $8, %esp andl $-16, %esp call foo leave ret .Lfe2: .size main,.Lfe2-main .ident "GCC: (GNU) 3.2.1 20020830 (Debian prerelease)" -------------------------------------------------- ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: crc32() optimization 2002-11-11 1:31 ` Marc Singer @ 2002-11-11 1:37 ` Wolfgang Denk 2002-11-11 4:42 ` Marc Singer 0 siblings, 1 reply; 17+ messages in thread From: Wolfgang Denk @ 2002-11-11 1:37 UTC (permalink / raw) To: Marc Singer; +Cc: Joakim Tjernlund, linux-mtd In message <20021111013114.GB27214@buici.com> you wrote: > > > > What's "Duff's Device"? > > > > It's a tricky way to implement general loop unrolling directly in C. > > Applied to your problem, code that looks like this (instead of 8 any > > other loop count may be used, but you need to adjust the "case" > > statements then): > > > > register int n = (len + (8-1)) / 8; > > > > switch (len % 8) { > > case 0: do { val = crc32_table ... ; > > case 7: val = crc32_table ... ; > > case 6: val = crc32_table ... ; > > case 5: val = crc32_table ... ; > > case 4: val = crc32_table ... ; > > case 3: val = crc32_table ... ; > > case 2: val = crc32_table ... ; > > case 1: val = crc32_table ... ; > > } while (--n > 0); > > } > > This doesn't look right to me. You are decrementing n but using the > modulus of len in the switch. The len modulus is correct when n == 1, > but not when n > 1. The idea makes sense, but the implementation > appears to be missing a detail. You don't understand. The switch is only needed for the first, partial loop where we want less than N statements; then we're nunning the remaining fully unrolled loos in the do{}while loop. > As for performance problems, I believe that the trouble is evident > from the assembler output. The reason that the unrolled loop is more > efficient than the simple loop is mainly because you don't jump as > often. We all know that jumps tend to perturb the instruction fetch > queue and cache. Did you enable optimization? > Here is an example. I know it is wrong, but it shows how the compiler > implements the switch. I've marked the problem jump. It is executed Irrelevant here. > So, the version I posted may not be significantly better that the > original one that unrolled 6 times. It is just that unrolling to a You save the extra while{} loop. Best regards, Wolfgang Denk -- Software Engineering: Embedded and Realtime Systems, Embedded Linux Phone: (+49)-8142-4596-87 Fax: (+49)-8142-4596-88 Email: wd@denx.de See us @ electronica 2002 in Munich, Nov 12-15, Hall A3, Booth A3.325 ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: crc32() optimization 2002-11-11 1:37 ` Wolfgang Denk @ 2002-11-11 4:42 ` Marc Singer 2002-11-25 15:55 ` Herman Oosthuysen 0 siblings, 1 reply; 17+ messages in thread From: Marc Singer @ 2002-11-11 4:42 UTC (permalink / raw) To: Wolfgang Denk; +Cc: Joakim Tjernlund, linux-mtd On Mon, Nov 11, 2002 at 02:37:33AM +0100, Wolfgang Denk wrote: > In message <20021111013114.GB27214@buici.com> you wrote: > > > > > > What's "Duff's Device"? > > > > > > It's a tricky way to implement general loop unrolling directly in C. > > > Applied to your problem, code that looks like this (instead of 8 any > > > other loop count may be used, but you need to adjust the "case" > > > statements then): > > > > > > register int n = (len + (8-1)) / 8; > > > > > > switch (len % 8) { > > > case 0: do { val = crc32_table ... ; > > > case 7: val = crc32_table ... ; > > > case 6: val = crc32_table ... ; > > > case 5: val = crc32_table ... ; > > > case 4: val = crc32_table ... ; > > > case 3: val = crc32_table ... ; > > > case 2: val = crc32_table ... ; > > > case 1: val = crc32_table ... ; > > > } while (--n > 0); > > > } > > > > This doesn't look right to me. You are decrementing n but using the > > modulus of len in the switch. The len modulus is correct when n == 1, > > but not when n > 1. The idea makes sense, but the implementation > > appears to be missing a detail. > > You don't understand. The switch is only needed for the first, > partial loop where we want less than N statements; then we're nunning > the remaining fully unrolled loos in the do{}while loop. I see. I misread the code. I cannot see why this would not be better than the original poster's version. I'll test it on my code to see if there is an improvement. > > As for performance problems, I believe that the trouble is evident > > from the assembler output. The reason that the unrolled loop is more > > efficient than the simple loop is mainly because you don't jump as > > often. We all know that jumps tend to perturb the instruction fetch > > queue and cache. > > Did you enable optimization? Indeed. But it doesn't matter since it executes the switch jump only one time. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: crc32() optimization 2002-11-11 4:42 ` Marc Singer @ 2002-11-25 15:55 ` Herman Oosthuysen 2002-11-25 16:12 ` Joakim Tjernlund 0 siblings, 1 reply; 17+ messages in thread From: Herman Oosthuysen @ 2002-11-25 15:55 UTC (permalink / raw) To: linux-mtd Is there not a look-up table based CRC32 elsewhere in the kernel already? Multiple CRC32 algorithms seem to me to be a terrible waste. -- ------------------------------------------------------------------------ Herman Oosthuysen B.Eng.(E), Member of IEEE Wireless Networks Inc. http://www.WirelessNetworksInc.com E-mail: Herman@WirelessNetworksInc.com Phone: 1.403.569-5687, Fax: 1.403.235-3965 ------------------------------------------------------------------------ Marc Singer wrote: > On Mon, Nov 11, 2002 at 02:37:33AM +0100, Wolfgang Denk wrote: > >>In message <20021111013114.GB27214@buici.com> you wrote: >> >>>>>What's "Duff's Device"? >>>> >>>>It's a tricky way to implement general loop unrolling directly in C. >>>>Applied to your problem, code that looks like this (instead of 8 any >>>>other loop count may be used, but you need to adjust the "case" >>>>statements then): >>>> >>>> register int n = (len + (8-1)) / 8; >>>> >>>> switch (len % 8) { >>>> case 0: do { val = crc32_table ... ; >>>> case 7: val = crc32_table ... ; >>>> case 6: val = crc32_table ... ; >>>> case 5: val = crc32_table ... ; >>>> case 4: val = crc32_table ... ; >>>> case 3: val = crc32_table ... ; >>>> case 2: val = crc32_table ... ; >>>> case 1: val = crc32_table ... ; >>>> } while (--n > 0); >>>> } >>> >>>This doesn't look right to me. You are decrementing n but using the >>>modulus of len in the switch. The len modulus is correct when n == 1, >>>but not when n > 1. The idea makes sense, but the implementation >>>appears to be missing a detail. >> >>You don't understand. The switch is only needed for the first, >>partial loop where we want less than N statements; then we're nunning >>the remaining fully unrolled loos in the do{}while loop. > > > I see. I misread the code. I cannot see why this would not be better > than the original poster's version. I'll test it on my code to see if > there is an improvement. > > > >>>As for performance problems, I believe that the trouble is evident >>>from the assembler output. The reason that the unrolled loop is more >>>efficient than the simple loop is mainly because you don't jump as >>>often. We all know that jumps tend to perturb the instruction fetch >>>queue and cache. >> >>Did you enable optimization? > > > Indeed. But it doesn't matter since it executes the switch jump only > one time. > > > ______________________________________________________ > Linux MTD discussion mailing list > http://lists.infradead.org/mailman/listinfo/linux-mtd/ > -- ------------------------------------------------------------------------ Herman Oosthuysen B.Eng.(E), Member of IEEE Wireless Networks Inc. http://www.WirelessNetworksInc.com E-mail: Herman@WirelessNetworksInc.com Phone: 1.403.569-5687, Fax: 1.403.235-3965 ------------------------------------------------------------------------ ^ permalink raw reply [flat|nested] 17+ messages in thread
* RE: crc32() optimization 2002-11-25 15:55 ` Herman Oosthuysen @ 2002-11-25 16:12 ` Joakim Tjernlund 0 siblings, 0 replies; 17+ messages in thread From: Joakim Tjernlund @ 2002-11-25 16:12 UTC (permalink / raw) To: Herman Oosthuysen, linux-mtd Yes, there are. In 2.5 these has been replace with a common lib(see lib/crc32.c). I have a back port to 2.4, but first we try get the optimized version into 2.5 and then do a new back port to 2.4 and then try to get that into 2.4. The final 2.5 patch should appear sometime today on lkm. Jocke > > Is there not a look-up table based CRC32 elsewhere in the kernel already? > > Multiple CRC32 algorithms seem to me to be a terrible waste. > -- > > ------------------------------------------------------------------------ > Herman Oosthuysen > B.Eng.(E), Member of IEEE > Wireless Networks Inc. > http://www.WirelessNetworksInc.com > E-mail: Herman@WirelessNetworksInc.com > Phone: 1.403.569-5687, Fax: 1.403.235-3965 > ------------------------------------------------------------------------ > > > Marc Singer wrote: > > On Mon, Nov 11, 2002 at 02:37:33AM +0100, Wolfgang Denk wrote: > > > >>In message <20021111013114.GB27214@buici.com> you wrote: > >> > >>>>>What's "Duff's Device"? > >>>> > >>>>It's a tricky way to implement general loop unrolling directly in C. > >>>>Applied to your problem, code that looks like this (instead of 8 any > >>>>other loop count may be used, but you need to adjust the "case" > >>>>statements then): > >>>> > >>>> register int n = (len + (8-1)) / 8; > >>>> > >>>> switch (len % 8) { > >>>> case 0: do { val = crc32_table ... ; > >>>> case 7: val = crc32_table ... ; > >>>> case 6: val = crc32_table ... ; > >>>> case 5: val = crc32_table ... ; > >>>> case 4: val = crc32_table ... ; > >>>> case 3: val = crc32_table ... ; > >>>> case 2: val = crc32_table ... ; > >>>> case 1: val = crc32_table ... ; > >>>> } while (--n > 0); > >>>> } > >>> > >>>This doesn't look right to me. You are decrementing n but using the > >>>modulus of len in the switch. The len modulus is correct when n == 1, > >>>but not when n > 1. The idea makes sense, but the implementation > >>>appears to be missing a detail. > >> > >>You don't understand. The switch is only needed for the first, > >>partial loop where we want less than N statements; then we're nunning > >>the remaining fully unrolled loos in the do{}while loop. > > > > > > I see. I misread the code. I cannot see why this would not be better > > than the original poster's version. I'll test it on my code to see if > > there is an improvement. > > > > > > > >>>As for performance problems, I believe that the trouble is evident > >>>from the assembler output. The reason that the unrolled loop is more > >>>efficient than the simple loop is mainly because you don't jump as > >>>often. We all know that jumps tend to perturb the instruction fetch > >>>queue and cache. > >> > >>Did you enable optimization? > > > > > > Indeed. But it doesn't matter since it executes the switch jump only > > one time. > > > > > > ______________________________________________________ > > Linux MTD discussion mailing list > > http://lists.infradead.org/mailman/listinfo/linux-mtd/ > > > > -- > > ------------------------------------------------------------------------ > Herman Oosthuysen > B.Eng.(E), Member of IEEE > Wireless Networks Inc. > http://www.WirelessNetworksInc.com > E-mail: Herman@WirelessNetworksInc.com > Phone: 1.403.569-5687, Fax: 1.403.235-3965 > ------------------------------------------------------------------------ > > > > ______________________________________________________ > Linux MTD discussion mailing list > http://lists.infradead.org/mailman/listinfo/linux-mtd/ > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: crc32() optimization 2002-11-10 19:25 ` Wolfgang Denk 2002-11-10 20:05 ` Joakim Tjernlund @ 2002-11-11 0:50 ` Marc Singer 1 sibling, 0 replies; 17+ messages in thread From: Marc Singer @ 2002-11-11 0:50 UTC (permalink / raw) To: Wolfgang Denk; +Cc: Joakim Tjernlund, linux-mtd On Sun, Nov 10, 2002 at 08:25:38PM +0100, Wolfgang Denk wrote: > In message <20021110184321.GB16087@buici.com> you wrote: > > As it should. I wonder if you'd do better changing the loop slightly. > > > > Check for len == 0 and do a short-circuit return. Then do this > > > > for (++len; len & 0x7; len >>= 3) { > > ONCE(); // repeat eight times > > ... > > len >>= 3; > > } > > while (--len > 0) > > ONCE(); > > > > This is the implementation I've written for another project which > > Seems broken to me, since you "len >>= 3" twice. That's what I get for writing it from memory. > Also, Duff's Device comes to mind. What would that be? > > Best regards, > > Wolfgang Denk > > -- > Software Engineering: Embedded and Realtime Systems, Embedded Linux > Phone: (+49)-8142-4596-87 Fax: (+49)-8142-4596-88 Email: wd@denx.de > See us @ electronica 2002 in Munich, Nov 12-15, Hall A3, Booth A3.325 ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: crc32() optimization 2002-11-10 18:43 ` Marc Singer 2002-11-10 19:25 ` Wolfgang Denk @ 2002-11-10 20:04 ` Joakim Tjernlund 1 sibling, 0 replies; 17+ messages in thread From: Joakim Tjernlund @ 2002-11-10 20:04 UTC (permalink / raw) To: Marc Singer; +Cc: linux-mtd hmm , maybe. I tried 16, 8 & 4 also, but 6 was a little faster for me. What would be great if someone that understands CRC better than me could take a look at Algorithm 4 at http://www.cl.cam.ac.uk/Research/SRG/bluebook/21/crc/node6.html#SECTION00060000000000000000 and apply that on linux CRC32 code. I tried but failed to get it correct. Jocke > As it should. I wonder if you'd do better changing the loop slightly. > > Check for len == 0 and do a short-circuit return. Then do this > > for (++len; len & 0x7; len >>= 3) { > ONCE(); // repeat eight times > ... > len >>= 3; > } > while (--len > 0) > ONCE(); > > This is the implementation I've written for another project which > we've found to be relatively optimal. Note that len *must* be an int > even though contemporary convention is to use the size_t type. > > > On Sun, Nov 10, 2002 at 04:28:00PM +0100, Joakim Tjernlund wrote: > > Hi David > > > > This patch improves my scan time with 22%( from 2.39 to 1.86 seconds). > > Maybe you want to include it in the 2.4 branch. > > > > I will put this in my next backport of the crc32 stuff from 2.5. > > > > Jocke > > > > Index: fs/jffs2/crc32.h > > =================================================================== > > RCS file: /home/cvs/mtd/fs/jffs2/crc32.h,v > > retrieving revision 1.3 > > diff -u -b -r1.3 crc32.h > > --- fs/jffs2/crc32.h 26 Feb 2001 14:44:37 -0000 1.3 > > +++ fs/jffs2/crc32.h 10 Nov 2002 15:25:11 -0000 > > @@ -13,7 +13,16 @@ > > crc32(__u32 val, const void *ss, int len) > > { > > const unsigned char *s = ss; > > - while (--len >= 0) > > + while (len >= 6){ > > + val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8); > > + val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8); > > + val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8); > > + val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8); > > + val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8); > > + val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8); > > + len -= 6; > > + } > > + while (len--) > > val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8); > > return val; > > } > > > > > > ______________________________________________________ > > Linux MTD discussion mailing list > > http://lists.infradead.org/mailman/listinfo/linux-mtd/ > > ______________________________________________________ > Linux MTD discussion mailing list > http://lists.infradead.org/mailman/listinfo/linux-mtd/ ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2002-11-25 15:41 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <F122OgRGkQ6mBySsxVY00000854@hotmail.com>
[not found] ` <24987.1036797874@passion.cambridge.redhat.com>
2002-11-10 15:28 ` crc32() optimization Joakim Tjernlund
2002-11-10 18:43 ` Marc Singer
2002-11-10 19:25 ` Wolfgang Denk
2002-11-10 20:05 ` Joakim Tjernlund
2002-11-10 21:00 ` Wolfgang Denk
2002-11-10 21:22 ` Joakim Tjernlund
2002-11-10 22:35 ` Joakim Tjernlund
2002-11-10 22:41 ` Wolfgang Denk
2002-11-10 23:00 ` Joakim Tjernlund
2002-11-10 23:56 ` Eric W. Biederman
2002-11-11 1:31 ` Marc Singer
2002-11-11 1:37 ` Wolfgang Denk
2002-11-11 4:42 ` Marc Singer
2002-11-25 15:55 ` Herman Oosthuysen
2002-11-25 16:12 ` Joakim Tjernlund
2002-11-11 0:50 ` Marc Singer
2002-11-10 20:04 ` Joakim Tjernlund
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox