public inbox for linux-mtd@lists.infradead.org
 help / color / mirror / Atom feed
* crc32() optimization
       [not found] ` <24987.1036797874@passion.cambridge.redhat.com>
@ 2002-11-10 15:28   ` Joakim Tjernlund
  2002-11-10 18:43     ` Marc Singer
  0 siblings, 1 reply; 17+ messages in thread
From: Joakim Tjernlund @ 2002-11-10 15:28 UTC (permalink / raw)
  To: David Woodhouse; +Cc: jffs-dev, linux-mtd

Hi David

This patch improves my scan time with 22%( from 2.39 to 1.86 seconds).
Maybe you want to include it in the 2.4 branch.

I will put this in my next backport of the crc32 stuff from 2.5.

          Jocke

Index: fs/jffs2/crc32.h
===================================================================
RCS file: /home/cvs/mtd/fs/jffs2/crc32.h,v
retrieving revision 1.3
diff -u -b -r1.3 crc32.h
--- fs/jffs2/crc32.h    26 Feb 2001 14:44:37 -0000      1.3
+++ fs/jffs2/crc32.h    10 Nov 2002 15:25:11 -0000
@@ -13,7 +13,16 @@
 crc32(__u32 val, const void *ss, int len)
 {
        const unsigned char *s = ss;
-        while (--len >= 0)
+        while (len >= 6){
+         val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8);
+         val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8);
+         val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8);
+         val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8);
+         val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8);
+         val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8);
+         len -= 6;
+        }
+        while (len--)
                 val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8);
         return val;
 }

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: crc32() optimization
  2002-11-10 15:28   ` crc32() optimization Joakim Tjernlund
@ 2002-11-10 18:43     ` Marc Singer
  2002-11-10 19:25       ` Wolfgang Denk
  2002-11-10 20:04       ` Joakim Tjernlund
  0 siblings, 2 replies; 17+ messages in thread
From: Marc Singer @ 2002-11-10 18:43 UTC (permalink / raw)
  To: Joakim Tjernlund; +Cc: linux-mtd

As it should.  I wonder if you'd do better changing the loop slightly.

Check for len == 0 and do a short-circuit return.  Then do this

  for (++len; len & 0x7; len >>= 3) {
     ONCE(); // repeat eight times
     ...
     len >>= 3;
  }
  while (--len > 0)
     ONCE();

This is the implementation I've written for another project which
we've found to be relatively optimal.  Note that len *must* be an int
even though contemporary convention is to use the size_t type.


On Sun, Nov 10, 2002 at 04:28:00PM +0100, Joakim Tjernlund wrote:
> Hi David
> 
> This patch improves my scan time with 22%( from 2.39 to 1.86 seconds).
> Maybe you want to include it in the 2.4 branch.
> 
> I will put this in my next backport of the crc32 stuff from 2.5.
> 
>           Jocke
> 
> Index: fs/jffs2/crc32.h
> ===================================================================
> RCS file: /home/cvs/mtd/fs/jffs2/crc32.h,v
> retrieving revision 1.3
> diff -u -b -r1.3 crc32.h
> --- fs/jffs2/crc32.h    26 Feb 2001 14:44:37 -0000      1.3
> +++ fs/jffs2/crc32.h    10 Nov 2002 15:25:11 -0000
> @@ -13,7 +13,16 @@
>  crc32(__u32 val, const void *ss, int len)
>  {
>         const unsigned char *s = ss;
> -        while (--len >= 0)
> +        while (len >= 6){
> +         val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8);
> +         val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8);
> +         val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8);
> +         val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8);
> +         val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8);
> +         val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8);
> +         len -= 6;
> +        }
> +        while (len--)
>                  val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8);
>          return val;
>  }
> 
> 
> ______________________________________________________
> Linux MTD discussion mailing list
> http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: crc32() optimization
  2002-11-10 18:43     ` Marc Singer
@ 2002-11-10 19:25       ` Wolfgang Denk
  2002-11-10 20:05         ` Joakim Tjernlund
  2002-11-11  0:50         ` Marc Singer
  2002-11-10 20:04       ` Joakim Tjernlund
  1 sibling, 2 replies; 17+ messages in thread
From: Wolfgang Denk @ 2002-11-10 19:25 UTC (permalink / raw)
  To: Marc Singer; +Cc: Joakim Tjernlund, linux-mtd

In message <20021110184321.GB16087@buici.com> you wrote:
> As it should.  I wonder if you'd do better changing the loop slightly.
> 
> Check for len == 0 and do a short-circuit return.  Then do this
> 
>   for (++len; len & 0x7; len >>= 3) {
>      ONCE(); // repeat eight times
>      ...
>      len >>= 3;
>   }
>   while (--len > 0)
>      ONCE();
> 
> This is the implementation I've written for another project which

Seems broken to me, since you "len >>= 3" twice.

Also, Duff's Device comes to mind.

Best regards,

Wolfgang Denk

-- 
Software Engineering:  Embedded and Realtime Systems,  Embedded Linux
Phone: (+49)-8142-4596-87  Fax: (+49)-8142-4596-88  Email: wd@denx.de
See us @ electronica 2002 in Munich, Nov 12-15, Hall A3, Booth A3.325 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: crc32() optimization
  2002-11-10 18:43     ` Marc Singer
  2002-11-10 19:25       ` Wolfgang Denk
@ 2002-11-10 20:04       ` Joakim Tjernlund
  1 sibling, 0 replies; 17+ messages in thread
From: Joakim Tjernlund @ 2002-11-10 20:04 UTC (permalink / raw)
  To: Marc Singer; +Cc: linux-mtd

hmm , maybe. I tried 16, 8 & 4 also, but 6 was a little faster for me.
What would be great if someone that understands CRC better than me could
take a look at Algorithm 4 at http://www.cl.cam.ac.uk/Research/SRG/bluebook/21/crc/node6.html#SECTION00060000000000000000
and apply that on linux CRC32 code.  I tried but failed to get it correct.

        Jocke

> As it should.  I wonder if you'd do better changing the loop slightly.
> 
> Check for len == 0 and do a short-circuit return.  Then do this
> 
>   for (++len; len & 0x7; len >>= 3) {
>      ONCE(); // repeat eight times
>      ...
>      len >>= 3;
>   }
>   while (--len > 0)
>      ONCE();
> 
> This is the implementation I've written for another project which
> we've found to be relatively optimal.  Note that len *must* be an int
> even though contemporary convention is to use the size_t type.
> 
> 
> On Sun, Nov 10, 2002 at 04:28:00PM +0100, Joakim Tjernlund wrote:
> > Hi David
> > 
> > This patch improves my scan time with 22%( from 2.39 to 1.86 seconds).
> > Maybe you want to include it in the 2.4 branch.
> > 
> > I will put this in my next backport of the crc32 stuff from 2.5.
> > 
> >           Jocke
> > 
> > Index: fs/jffs2/crc32.h
> > ===================================================================
> > RCS file: /home/cvs/mtd/fs/jffs2/crc32.h,v
> > retrieving revision 1.3
> > diff -u -b -r1.3 crc32.h
> > --- fs/jffs2/crc32.h    26 Feb 2001 14:44:37 -0000      1.3
> > +++ fs/jffs2/crc32.h    10 Nov 2002 15:25:11 -0000
> > @@ -13,7 +13,16 @@
> >  crc32(__u32 val, const void *ss, int len)
> >  {
> >         const unsigned char *s = ss;
> > -        while (--len >= 0)
> > +        while (len >= 6){
> > +         val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8);
> > +         val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8);
> > +         val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8);
> > +         val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8);
> > +         val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8);
> > +         val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8);
> > +         len -= 6;
> > +        }
> > +        while (len--)
> >                  val = crc32_table[(val ^ *s++) & 0xff] ^ (val >> 8);
> >          return val;
> >  }
> > 
> > 
> > ______________________________________________________
> > Linux MTD discussion mailing list
> > http://lists.infradead.org/mailman/listinfo/linux-mtd/
> 
> ______________________________________________________
> Linux MTD discussion mailing list
> http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: crc32() optimization
  2002-11-10 19:25       ` Wolfgang Denk
@ 2002-11-10 20:05         ` Joakim Tjernlund
  2002-11-10 21:00           ` Wolfgang Denk
  2002-11-11  0:50         ` Marc Singer
  1 sibling, 1 reply; 17+ messages in thread
From: Joakim Tjernlund @ 2002-11-10 20:05 UTC (permalink / raw)
  To: Marc Singer, Wolfgang Denk; +Cc: linux-mtd

Hi Wolfgang 

What's "Duff's Device"?

        Jocke
> 
> Seems broken to me, since you "len >>= 3" twice.
> 
> Also, Duff's Device comes to mind.
> 
> Best regards,
> 
> Wolfgang Denk
> 
> -- 
> Software Engineering:  Embedded and Realtime Systems,  Embedded Linux
> Phone: (+49)-8142-4596-87  Fax: (+49)-8142-4596-88  Email: wd@denx.de
> See us @ electronica 2002 in Munich, Nov 12-15, Hall A3, Booth A3.325

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: crc32() optimization
  2002-11-10 20:05         ` Joakim Tjernlund
@ 2002-11-10 21:00           ` Wolfgang Denk
  2002-11-10 21:22             ` Joakim Tjernlund
  2002-11-11  1:31             ` Marc Singer
  0 siblings, 2 replies; 17+ messages in thread
From: Wolfgang Denk @ 2002-11-10 21:00 UTC (permalink / raw)
  To: Joakim Tjernlund; +Cc: Marc Singer, linux-mtd

In message <006901c288f4$79e2ce20$0200a8c0@telia.com> you wrote:
> 
> What's "Duff's Device"?

It's a tricky way to implement general loop unrolling directly in  C.
Applied  to your problem, code that looks like this (instead of 8 any
other loop count may be used, but  you  need  to  adjust  the  "case"
statements then):

	register int n = (len + (8-1)) / 8;

	switch (len % 8) {
	case 0: do {	val = crc32_table ... ;
	case 7:		val = crc32_table ... ;
	case 6:		val = crc32_table ... ;
	case 5:		val = crc32_table ... ;
	case 4:		val = crc32_table ... ;
	case 3:		val = crc32_table ... ;
	case 2:		val = crc32_table ... ;
	case 1:		val = crc32_table ... ;
		} while (--n > 0);
	}

BTW: this is strictly legal ANSI C!

For an explanation see http://www.lysator.liu.se/c/duffs-device.html

Best regards,

Wolfgang Denk

-- 
Software Engineering:  Embedded and Realtime Systems,  Embedded Linux
Phone: (+49)-8142-4596-87  Fax: (+49)-8142-4596-88  Email: wd@denx.de
See us @ electronica 2002 in Munich, Nov 12-15, Hall A3, Booth A3.325 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: crc32() optimization
  2002-11-10 21:00           ` Wolfgang Denk
@ 2002-11-10 21:22             ` Joakim Tjernlund
  2002-11-10 22:35               ` Joakim Tjernlund
  2002-11-11  1:31             ` Marc Singer
  1 sibling, 1 reply; 17+ messages in thread
From: Joakim Tjernlund @ 2002-11-10 21:22 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Marc Singer, linux-mtd

Cool! I will give it a try(tomorrow afternoon) and see what happens.

         Jocke
> > 
> > What's "Duff's Device"?
> 
> It's a tricky way to implement general loop unrolling directly in  C.
> Applied  to your problem, code that looks like this (instead of 8 any
> other loop count may be used, but  you  need  to  adjust  the  "case"
> statements then):
> 
> register int n = (len + (8-1)) / 8;
> 
> switch (len % 8) {
> case 0: do { val = crc32_table ... ;
> case 7: val = crc32_table ... ;
> case 6: val = crc32_table ... ;
> case 5: val = crc32_table ... ;
> case 4: val = crc32_table ... ;
> case 3: val = crc32_table ... ;
> case 2: val = crc32_table ... ;
> case 1: val = crc32_table ... ;
> } while (--n > 0);
> }
> 
> BTW: this is strictly legal ANSI C!
> 
> For an explanation see http://www.lysator.liu.se/c/duffs-device.html
> 
> Best regards,
> 
> Wolfgang Denk
> 
> -- 
> Software Engineering:  Embedded and Realtime Systems,  Embedded Linux
> Phone: (+49)-8142-4596-87  Fax: (+49)-8142-4596-88  Email: wd@denx.de
> See us @ electronica 2002 in Munich, Nov 12-15, Hall A3, Booth A3.325

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: crc32() optimization
  2002-11-10 21:22             ` Joakim Tjernlund
@ 2002-11-10 22:35               ` Joakim Tjernlund
  2002-11-10 22:41                 ` Wolfgang Denk
  0 siblings, 1 reply; 17+ messages in thread
From: Joakim Tjernlund @ 2002-11-10 22:35 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Marc Singer, linux-mtd

I could not wait until tomorrow, so I did it now instead.
The result was worse. The best I got was 7% improvement.
I tried 16, 8, 6 and 4 as unrolling steps.

           Jocke

> Cool! I will give it a try(tomorrow afternoon) and see what happens.
> 
>          Jocke
> > > 
> > > What's "Duff's Device"?
> > 
> > It's a tricky way to implement general loop unrolling directly in  C.
> > Applied  to your problem, code that looks like this (instead of 8 any
> > other loop count may be used, but  you  need  to  adjust  the  "case"
> > statements then):
> > 
> > register int n = (len + (8-1)) / 8;
> > 
> > switch (len % 8) {
> > case 0: do { val = crc32_table ... ;
> > case 7: val = crc32_table ... ;
> > case 6: val = crc32_table ... ;
> > case 5: val = crc32_table ... ;
> > case 4: val = crc32_table ... ;
> > case 3: val = crc32_table ... ;
> > case 2: val = crc32_table ... ;
> > case 1: val = crc32_table ... ;
> > } while (--n > 0);
> > }
> > 
> > BTW: this is strictly legal ANSI C!
> > 
> > For an explanation see http://www.lysator.liu.se/c/duffs-device.html
> > 
> > Best regards,
> > 
> > Wolfgang Denk
> > 
> > -- 
> > Software Engineering:  Embedded and Realtime Systems,  Embedded Linux
> > Phone: (+49)-8142-4596-87  Fax: (+49)-8142-4596-88  Email: wd@denx.de
> > See us @ electronica 2002 in Munich, Nov 12-15, Hall A3, Booth A3.325
> 
> ______________________________________________________
> Linux MTD discussion mailing list
> http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: crc32() optimization
  2002-11-10 22:35               ` Joakim Tjernlund
@ 2002-11-10 22:41                 ` Wolfgang Denk
  2002-11-10 23:00                   ` Joakim Tjernlund
  0 siblings, 1 reply; 17+ messages in thread
From: Wolfgang Denk @ 2002-11-10 22:41 UTC (permalink / raw)
  To: Joakim Tjernlund; +Cc: Marc Singer, linux-mtd

In message <001301c28909$743f1f40$0200a8c0@telia.com> you wrote:
> I could not wait until tomorrow, so I did it now instead.
> The result was worse. The best I got was 7% improvement.
> I tried 16, 8, 6 and 4 as unrolling steps.

Makes no sense to me.  Should  be  at  least  as  efficient  as  your
original code (marginally better).

Best regards,

Wolfgang Denk

-- 
Software Engineering:  Embedded and Realtime Systems,  Embedded Linux
Phone: (+49)-8142-4596-87  Fax: (+49)-8142-4596-88  Email: wd@denx.de
See us @ electronica 2002 in Munich, Nov 12-15, Hall A3, Booth A3.325 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: crc32() optimization
  2002-11-10 22:41                 ` Wolfgang Denk
@ 2002-11-10 23:00                   ` Joakim Tjernlund
  2002-11-10 23:56                     ` Eric W. Biederman
  0 siblings, 1 reply; 17+ messages in thread
From: Joakim Tjernlund @ 2002-11-10 23:00 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Marc Singer, linux-mtd

> In message <001301c28909$743f1f40$0200a8c0@telia.com> you wrote:
> > I could not wait until tomorrow, so I did it now instead.
> > The result was worse. The best I got was 7% improvement.
> > I tried 16, 8, 6 and 4 as unrolling steps.
> 
> Makes no sense to me.  Should  be  at  least  as  efficient  as  your
> original code (marginally better).

I don't understand this either.
Anyone?

         Jocke

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: crc32() optimization
  2002-11-10 23:00                   ` Joakim Tjernlund
@ 2002-11-10 23:56                     ` Eric W. Biederman
  0 siblings, 0 replies; 17+ messages in thread
From: Eric W. Biederman @ 2002-11-10 23:56 UTC (permalink / raw)
  To: Joakim Tjernlund; +Cc: Wolfgang Denk, Marc Singer, linux-mtd

"Joakim Tjernlund" <Joakim.Tjernlund@lumentis.se> writes:

> > In message <001301c28909$743f1f40$0200a8c0@telia.com> you wrote:
> > > I could not wait until tomorrow, so I did it now instead.
> > > The result was worse. The best I got was 7% improvement.
> > > I tried 16, 8, 6 and 4 as unrolling steps.
> > 
> > Makes no sense to me.  Should  be  at  least  as  efficient  as  your
> > original code (marginally better).
> 
> I don't understand this either.
> Anyone?

You might try it with 6.  But a lot depends on what gcc can do with
it and gcc may not be like all of those potential entry points.. 

Running gcc -S and checking to see the difference in the generated
assembly might be instructive.

Eric

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: crc32() optimization
  2002-11-10 19:25       ` Wolfgang Denk
  2002-11-10 20:05         ` Joakim Tjernlund
@ 2002-11-11  0:50         ` Marc Singer
  1 sibling, 0 replies; 17+ messages in thread
From: Marc Singer @ 2002-11-11  0:50 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Joakim Tjernlund, linux-mtd

On Sun, Nov 10, 2002 at 08:25:38PM +0100, Wolfgang Denk wrote:
> In message <20021110184321.GB16087@buici.com> you wrote:
> > As it should.  I wonder if you'd do better changing the loop slightly.
> > 
> > Check for len == 0 and do a short-circuit return.  Then do this
> > 
> >   for (++len; len & 0x7; len >>= 3) {
> >      ONCE(); // repeat eight times
> >      ...
> >      len >>= 3;
> >   }
> >   while (--len > 0)
> >      ONCE();
> > 
> > This is the implementation I've written for another project which
> 
> Seems broken to me, since you "len >>= 3" twice.

That's what I get for writing it from memory.

> Also, Duff's Device comes to mind.

What would that be?


> 
> Best regards,
> 
> Wolfgang Denk
> 
> -- 
> Software Engineering:  Embedded and Realtime Systems,  Embedded Linux
> Phone: (+49)-8142-4596-87  Fax: (+49)-8142-4596-88  Email: wd@denx.de
> See us @ electronica 2002 in Munich, Nov 12-15, Hall A3, Booth A3.325 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: crc32() optimization
  2002-11-10 21:00           ` Wolfgang Denk
  2002-11-10 21:22             ` Joakim Tjernlund
@ 2002-11-11  1:31             ` Marc Singer
  2002-11-11  1:37               ` Wolfgang Denk
  1 sibling, 1 reply; 17+ messages in thread
From: Marc Singer @ 2002-11-11  1:31 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Joakim Tjernlund, linux-mtd

On Sun, Nov 10, 2002 at 10:00:03PM +0100, Wolfgang Denk wrote:
> In message <006901c288f4$79e2ce20$0200a8c0@telia.com> you wrote:
> > 
> > What's "Duff's Device"?
> 
> It's a tricky way to implement general loop unrolling directly in  C.
> Applied  to your problem, code that looks like this (instead of 8 any
> other loop count may be used, but  you  need  to  adjust  the  "case"
> statements then):
> 
> 	register int n = (len + (8-1)) / 8;
> 
> 	switch (len % 8) {
> 	case 0: do {	val = crc32_table ... ;
> 	case 7:		val = crc32_table ... ;
> 	case 6:		val = crc32_table ... ;
> 	case 5:		val = crc32_table ... ;
> 	case 4:		val = crc32_table ... ;
> 	case 3:		val = crc32_table ... ;
> 	case 2:		val = crc32_table ... ;
> 	case 1:		val = crc32_table ... ;
> 		} while (--n > 0);
> 	}

This doesn't look right to me.  You are decrementing n but using the
modulus of len in the switch.  The len modulus is correct when n == 1,
but not when n > 1.  The idea makes sense, but the implementation
appears to be missing a detail.

As for performance problems, I believe that the trouble is evident
from the assembler output.  The reason that the unrolled loop is more
efficient than the simple loop is mainly because you don't jump as
often.  We all know that jumps tend to perturb the instruction fetch
queue and cache.

Here is an example.  I know it is wrong, but it shows how the compiler
implements the switch.  I've marked the problem jump.  It is executed
for every iteration of the loop so it tends to negate other changes
made to the algorithm.

So, the version I posted may not be significantly better that the
original one that unrolled 6 times.  It is just that unrolling to a
power of two sometimes makes the math simpler and sometimes improves
the performance.  The present crop of CPUs performs all simple ALU
functions in a single cycle, so there is little reason to worry about
how many loops to unroll.

BTW, I checked my code and found I made several errors.  The increment
step in for() is a decrement by 8, not a shift.

Cheers.

 --------------------------------------------------
#define ONCE do { ++i; } while (0);

int foo ()
{
  int i = 0;
  int c;

  for (c = 20 ; c > 0; c -= 8) {
    switch (c % 8) {
    case 0: ONCE;
    case 7: ONCE;
    case 6: ONCE;
    case 5: ONCE;
    case 4: ONCE;
    case 3: ONCE;
    case 2: ONCE;
    case 1: ONCE;
    }      
  }
  return i;
}

int main ()
{
  foo ();
}

--------------------------------------------------
	.file	"unroll.c"
	.text
	.align 2
	.p2align 2,,3
.globl foo
	.type	foo,@function
foo:
	pushl	%ebp
	movl	%esp, %ebp
	xorl	%eax, %eax
	pushl	%ebx
	movl	$20, %ecx
	.p2align 2,,3
.L26:
	testl	%ecx, %ecx
	movl	%ecx, %edx
	js	.L29
.L25:
	andl	$-8, %edx
	movl	%ecx, %ebx
	subl	%edx, %ebx
	cmpl	$7, %ebx
	ja	.L4
;;; **** This is the problem jump.
	jmp	*.L23(,%ebx,4)
;;; **** This is the problem jump.
	.section	.rodata
	.align 4
	.align 4
.L23:
	.long	.L7
	.long	.L21
	.long	.L19
	.long	.L17
	.long	.L15
	.long	.L13
	.long	.L11
	.long	.L9
	.text
.L7:
	incl	%eax
.L9:
	incl	%eax
.L11:
	incl	%eax
.L13:
	incl	%eax
.L15:
	incl	%eax
.L17:
	incl	%eax
.L19:
	incl	%eax
.L21:
	incl	%eax
.L4:
	subl	$8, %ecx
	testl	%ecx, %ecx
	jg	.L26
	popl	%ebx
	leave
	ret
	.p2align 2,,3
.L29:
	leal	7(%ecx), %edx
	jmp	.L25
.Lfe1:
	.size	foo,.Lfe1-foo
	.align 2
	.p2align 2,,3
.globl main
	.type	main,@function
main:
	pushl	%ebp
	movl	%esp, %ebp
	subl	$8, %esp
	andl	$-16, %esp
	call	foo
	leave
	ret
.Lfe2:
	.size	main,.Lfe2-main
	.ident	"GCC: (GNU) 3.2.1 20020830 (Debian prerelease)"

--------------------------------------------------

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: crc32() optimization
  2002-11-11  1:31             ` Marc Singer
@ 2002-11-11  1:37               ` Wolfgang Denk
  2002-11-11  4:42                 ` Marc Singer
  0 siblings, 1 reply; 17+ messages in thread
From: Wolfgang Denk @ 2002-11-11  1:37 UTC (permalink / raw)
  To: Marc Singer; +Cc: Joakim Tjernlund, linux-mtd

In message <20021111013114.GB27214@buici.com> you wrote:
>
> > > What's "Duff's Device"?
> > 
> > It's a tricky way to implement general loop unrolling directly in  C.
> > Applied  to your problem, code that looks like this (instead of 8 any
> > other loop count may be used, but  you  need  to  adjust  the  "case"
> > statements then):
> > 
> > 	register int n = (len + (8-1)) / 8;
> > 
> > 	switch (len % 8) {
> > 	case 0: do {	val = crc32_table ... ;
> > 	case 7:		val = crc32_table ... ;
> > 	case 6:		val = crc32_table ... ;
> > 	case 5:		val = crc32_table ... ;
> > 	case 4:		val = crc32_table ... ;
> > 	case 3:		val = crc32_table ... ;
> > 	case 2:		val = crc32_table ... ;
> > 	case 1:		val = crc32_table ... ;
> > 		} while (--n > 0);
> > 	}
> 
> This doesn't look right to me.  You are decrementing n but using the
> modulus of len in the switch.  The len modulus is correct when n == 1,
> but not when n > 1.  The idea makes sense, but the implementation
> appears to be missing a detail.

You don't understand. The  switch  is  only  needed  for  the  first,
partial loop where we want less than N statements; then we're nunning
the remaining fully unrolled loos in the do{}while loop.

> As for performance problems, I believe that the trouble is evident
> from the assembler output.  The reason that the unrolled loop is more
> efficient than the simple loop is mainly because you don't jump as
> often.  We all know that jumps tend to perturb the instruction fetch
> queue and cache.

Did you enable optimization?

> Here is an example.  I know it is wrong, but it shows how the compiler
> implements the switch.  I've marked the problem jump.  It is executed

Irrelevant here.

> So, the version I posted may not be significantly better that the
> original one that unrolled 6 times.  It is just that unrolling to a

You save the extra while{} loop.


Best regards,

Wolfgang Denk

-- 
Software Engineering:  Embedded and Realtime Systems,  Embedded Linux
Phone: (+49)-8142-4596-87  Fax: (+49)-8142-4596-88  Email: wd@denx.de
See us @ electronica 2002 in Munich, Nov 12-15, Hall A3, Booth A3.325 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: crc32() optimization
  2002-11-11  1:37               ` Wolfgang Denk
@ 2002-11-11  4:42                 ` Marc Singer
  2002-11-25 15:55                   ` Herman Oosthuysen
  0 siblings, 1 reply; 17+ messages in thread
From: Marc Singer @ 2002-11-11  4:42 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Joakim Tjernlund, linux-mtd

On Mon, Nov 11, 2002 at 02:37:33AM +0100, Wolfgang Denk wrote:
> In message <20021111013114.GB27214@buici.com> you wrote:
> >
> > > > What's "Duff's Device"?
> > > 
> > > It's a tricky way to implement general loop unrolling directly in  C.
> > > Applied  to your problem, code that looks like this (instead of 8 any
> > > other loop count may be used, but  you  need  to  adjust  the  "case"
> > > statements then):
> > > 
> > > 	register int n = (len + (8-1)) / 8;
> > > 
> > > 	switch (len % 8) {
> > > 	case 0: do {	val = crc32_table ... ;
> > > 	case 7:		val = crc32_table ... ;
> > > 	case 6:		val = crc32_table ... ;
> > > 	case 5:		val = crc32_table ... ;
> > > 	case 4:		val = crc32_table ... ;
> > > 	case 3:		val = crc32_table ... ;
> > > 	case 2:		val = crc32_table ... ;
> > > 	case 1:		val = crc32_table ... ;
> > > 		} while (--n > 0);
> > > 	}
> > 
> > This doesn't look right to me.  You are decrementing n but using the
> > modulus of len in the switch.  The len modulus is correct when n == 1,
> > but not when n > 1.  The idea makes sense, but the implementation
> > appears to be missing a detail.
> 
> You don't understand. The  switch  is  only  needed  for  the  first,
> partial loop where we want less than N statements; then we're nunning
> the remaining fully unrolled loos in the do{}while loop.

I see.  I misread the code.  I cannot see why this would not be better
than the original poster's version.  I'll test it on my code to see if
there is an improvement. 


> > As for performance problems, I believe that the trouble is evident
> > from the assembler output.  The reason that the unrolled loop is more
> > efficient than the simple loop is mainly because you don't jump as
> > often.  We all know that jumps tend to perturb the instruction fetch
> > queue and cache.
> 
> Did you enable optimization?

Indeed.  But it doesn't matter since it executes the switch jump only
one time.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: crc32() optimization
  2002-11-11  4:42                 ` Marc Singer
@ 2002-11-25 15:55                   ` Herman Oosthuysen
  2002-11-25 16:12                     ` Joakim Tjernlund
  0 siblings, 1 reply; 17+ messages in thread
From: Herman Oosthuysen @ 2002-11-25 15:55 UTC (permalink / raw)
  To: linux-mtd

Is there not a look-up table based CRC32 elsewhere in the kernel already?

Multiple CRC32 algorithms seem to me to be a terrible waste.
-- 

------------------------------------------------------------------------
Herman Oosthuysen
B.Eng.(E), Member of IEEE
Wireless Networks Inc.
http://www.WirelessNetworksInc.com
E-mail: Herman@WirelessNetworksInc.com
Phone: 1.403.569-5687, Fax: 1.403.235-3965
------------------------------------------------------------------------


Marc Singer wrote:
> On Mon, Nov 11, 2002 at 02:37:33AM +0100, Wolfgang Denk wrote:
> 
>>In message <20021111013114.GB27214@buici.com> you wrote:
>>
>>>>>What's "Duff's Device"?
>>>>
>>>>It's a tricky way to implement general loop unrolling directly in  C.
>>>>Applied  to your problem, code that looks like this (instead of 8 any
>>>>other loop count may be used, but  you  need  to  adjust  the  "case"
>>>>statements then):
>>>>
>>>>	register int n = (len + (8-1)) / 8;
>>>>
>>>>	switch (len % 8) {
>>>>	case 0: do {	val = crc32_table ... ;
>>>>	case 7:		val = crc32_table ... ;
>>>>	case 6:		val = crc32_table ... ;
>>>>	case 5:		val = crc32_table ... ;
>>>>	case 4:		val = crc32_table ... ;
>>>>	case 3:		val = crc32_table ... ;
>>>>	case 2:		val = crc32_table ... ;
>>>>	case 1:		val = crc32_table ... ;
>>>>		} while (--n > 0);
>>>>	}
>>>
>>>This doesn't look right to me.  You are decrementing n but using the
>>>modulus of len in the switch.  The len modulus is correct when n == 1,
>>>but not when n > 1.  The idea makes sense, but the implementation
>>>appears to be missing a detail.
>>
>>You don't understand. The  switch  is  only  needed  for  the  first,
>>partial loop where we want less than N statements; then we're nunning
>>the remaining fully unrolled loos in the do{}while loop.
> 
> 
> I see.  I misread the code.  I cannot see why this would not be better
> than the original poster's version.  I'll test it on my code to see if
> there is an improvement. 
> 
> 
> 
>>>As for performance problems, I believe that the trouble is evident
>>>from the assembler output.  The reason that the unrolled loop is more
>>>efficient than the simple loop is mainly because you don't jump as
>>>often.  We all know that jumps tend to perturb the instruction fetch
>>>queue and cache.
>>
>>Did you enable optimization?
> 
> 
> Indeed.  But it doesn't matter since it executes the switch jump only
> one time.
> 
> 
> ______________________________________________________
> Linux MTD discussion mailing list
> http://lists.infradead.org/mailman/listinfo/linux-mtd/
> 

-- 

------------------------------------------------------------------------
Herman Oosthuysen
B.Eng.(E), Member of IEEE
Wireless Networks Inc.
http://www.WirelessNetworksInc.com
E-mail: Herman@WirelessNetworksInc.com
Phone: 1.403.569-5687, Fax: 1.403.235-3965
------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: crc32() optimization
  2002-11-25 15:55                   ` Herman Oosthuysen
@ 2002-11-25 16:12                     ` Joakim Tjernlund
  0 siblings, 0 replies; 17+ messages in thread
From: Joakim Tjernlund @ 2002-11-25 16:12 UTC (permalink / raw)
  To: Herman Oosthuysen, linux-mtd

Yes, there are. In 2.5 these has been replace with a common lib(see lib/crc32.c).
I have a back port to 2.4, but first we try get the optimized version into 2.5 and
then do a new back port to 2.4 and then try to get that into 2.4.

The final 2.5 patch should appear sometime today on lkm.

  Jocke

> 
> Is there not a look-up table based CRC32 elsewhere in the kernel already?
> 
> Multiple CRC32 algorithms seem to me to be a terrible waste.
> -- 
> 
> ------------------------------------------------------------------------
> Herman Oosthuysen
> B.Eng.(E), Member of IEEE
> Wireless Networks Inc.
> http://www.WirelessNetworksInc.com
> E-mail: Herman@WirelessNetworksInc.com
> Phone: 1.403.569-5687, Fax: 1.403.235-3965
> ------------------------------------------------------------------------
> 
> 
> Marc Singer wrote:
> > On Mon, Nov 11, 2002 at 02:37:33AM +0100, Wolfgang Denk wrote:
> > 
> >>In message <20021111013114.GB27214@buici.com> you wrote:
> >>
> >>>>>What's "Duff's Device"?
> >>>>
> >>>>It's a tricky way to implement general loop unrolling directly in  C.
> >>>>Applied  to your problem, code that looks like this (instead of 8 any
> >>>>other loop count may be used, but  you  need  to  adjust  the  "case"
> >>>>statements then):
> >>>>
> >>>>	register int n = (len + (8-1)) / 8;
> >>>>
> >>>>	switch (len % 8) {
> >>>>	case 0: do {	val = crc32_table ... ;
> >>>>	case 7:		val = crc32_table ... ;
> >>>>	case 6:		val = crc32_table ... ;
> >>>>	case 5:		val = crc32_table ... ;
> >>>>	case 4:		val = crc32_table ... ;
> >>>>	case 3:		val = crc32_table ... ;
> >>>>	case 2:		val = crc32_table ... ;
> >>>>	case 1:		val = crc32_table ... ;
> >>>>		} while (--n > 0);
> >>>>	}
> >>>
> >>>This doesn't look right to me.  You are decrementing n but using the
> >>>modulus of len in the switch.  The len modulus is correct when n == 1,
> >>>but not when n > 1.  The idea makes sense, but the implementation
> >>>appears to be missing a detail.
> >>
> >>You don't understand. The  switch  is  only  needed  for  the  first,
> >>partial loop where we want less than N statements; then we're nunning
> >>the remaining fully unrolled loos in the do{}while loop.
> > 
> > 
> > I see.  I misread the code.  I cannot see why this would not be better
> > than the original poster's version.  I'll test it on my code to see if
> > there is an improvement. 
> > 
> > 
> > 
> >>>As for performance problems, I believe that the trouble is evident
> >>>from the assembler output.  The reason that the unrolled loop is more
> >>>efficient than the simple loop is mainly because you don't jump as
> >>>often.  We all know that jumps tend to perturb the instruction fetch
> >>>queue and cache.
> >>
> >>Did you enable optimization?
> > 
> > 
> > Indeed.  But it doesn't matter since it executes the switch jump only
> > one time.
> > 
> > 
> > ______________________________________________________
> > Linux MTD discussion mailing list
> > http://lists.infradead.org/mailman/listinfo/linux-mtd/
> > 
> 
> -- 
> 
> ------------------------------------------------------------------------
> Herman Oosthuysen
> B.Eng.(E), Member of IEEE
> Wireless Networks Inc.
> http://www.WirelessNetworksInc.com
> E-mail: Herman@WirelessNetworksInc.com
> Phone: 1.403.569-5687, Fax: 1.403.235-3965
> ------------------------------------------------------------------------
> 
> 
> 
> ______________________________________________________
> Linux MTD discussion mailing list
> http://lists.infradead.org/mailman/listinfo/linux-mtd/
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2002-11-25 15:41 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <F122OgRGkQ6mBySsxVY00000854@hotmail.com>
     [not found] ` <24987.1036797874@passion.cambridge.redhat.com>
2002-11-10 15:28   ` crc32() optimization Joakim Tjernlund
2002-11-10 18:43     ` Marc Singer
2002-11-10 19:25       ` Wolfgang Denk
2002-11-10 20:05         ` Joakim Tjernlund
2002-11-10 21:00           ` Wolfgang Denk
2002-11-10 21:22             ` Joakim Tjernlund
2002-11-10 22:35               ` Joakim Tjernlund
2002-11-10 22:41                 ` Wolfgang Denk
2002-11-10 23:00                   ` Joakim Tjernlund
2002-11-10 23:56                     ` Eric W. Biederman
2002-11-11  1:31             ` Marc Singer
2002-11-11  1:37               ` Wolfgang Denk
2002-11-11  4:42                 ` Marc Singer
2002-11-25 15:55                   ` Herman Oosthuysen
2002-11-25 16:12                     ` Joakim Tjernlund
2002-11-11  0:50         ` Marc Singer
2002-11-10 20:04       ` Joakim Tjernlund

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox