From: Marc Singer <elf@buici.com>
To: Wolfgang Denk <wd@denx.de>
Cc: Joakim Tjernlund <Joakim.Tjernlund@lumentis.se>,
linux-mtd@lists.infradead.org
Subject: Re: crc32() optimization
Date: Sun, 10 Nov 2002 17:31:14 -0800 [thread overview]
Message-ID: <20021111013114.GB27214@buici.com> (raw)
In-Reply-To: <20021110210008.7CFF210162@denx.denx.de>
On Sun, Nov 10, 2002 at 10:00:03PM +0100, Wolfgang Denk wrote:
> In message <006901c288f4$79e2ce20$0200a8c0@telia.com> you wrote:
> >
> > What's "Duff's Device"?
>
> It's a tricky way to implement general loop unrolling directly in C.
> Applied to your problem, code that looks like this (instead of 8 any
> other loop count may be used, but you need to adjust the "case"
> statements then):
>
> register int n = (len + (8-1)) / 8;
>
> switch (len % 8) {
> case 0: do { val = crc32_table ... ;
> case 7: val = crc32_table ... ;
> case 6: val = crc32_table ... ;
> case 5: val = crc32_table ... ;
> case 4: val = crc32_table ... ;
> case 3: val = crc32_table ... ;
> case 2: val = crc32_table ... ;
> case 1: val = crc32_table ... ;
> } while (--n > 0);
> }
This doesn't look right to me. You are decrementing n but using the
modulus of len in the switch. The len modulus is correct when n == 1,
but not when n > 1. The idea makes sense, but the implementation
appears to be missing a detail.
As for performance problems, I believe that the trouble is evident
from the assembler output. The reason that the unrolled loop is more
efficient than the simple loop is mainly because you don't jump as
often. We all know that jumps tend to perturb the instruction fetch
queue and cache.
Here is an example. I know it is wrong, but it shows how the compiler
implements the switch. I've marked the problem jump. It is executed
for every iteration of the loop so it tends to negate other changes
made to the algorithm.
So, the version I posted may not be significantly better that the
original one that unrolled 6 times. It is just that unrolling to a
power of two sometimes makes the math simpler and sometimes improves
the performance. The present crop of CPUs performs all simple ALU
functions in a single cycle, so there is little reason to worry about
how many loops to unroll.
BTW, I checked my code and found I made several errors. The increment
step in for() is a decrement by 8, not a shift.
Cheers.
--------------------------------------------------
#define ONCE do { ++i; } while (0);
int foo ()
{
int i = 0;
int c;
for (c = 20 ; c > 0; c -= 8) {
switch (c % 8) {
case 0: ONCE;
case 7: ONCE;
case 6: ONCE;
case 5: ONCE;
case 4: ONCE;
case 3: ONCE;
case 2: ONCE;
case 1: ONCE;
}
}
return i;
}
int main ()
{
foo ();
}
--------------------------------------------------
.file "unroll.c"
.text
.align 2
.p2align 2,,3
.globl foo
.type foo,@function
foo:
pushl %ebp
movl %esp, %ebp
xorl %eax, %eax
pushl %ebx
movl $20, %ecx
.p2align 2,,3
.L26:
testl %ecx, %ecx
movl %ecx, %edx
js .L29
.L25:
andl $-8, %edx
movl %ecx, %ebx
subl %edx, %ebx
cmpl $7, %ebx
ja .L4
;;; **** This is the problem jump.
jmp *.L23(,%ebx,4)
;;; **** This is the problem jump.
.section .rodata
.align 4
.align 4
.L23:
.long .L7
.long .L21
.long .L19
.long .L17
.long .L15
.long .L13
.long .L11
.long .L9
.text
.L7:
incl %eax
.L9:
incl %eax
.L11:
incl %eax
.L13:
incl %eax
.L15:
incl %eax
.L17:
incl %eax
.L19:
incl %eax
.L21:
incl %eax
.L4:
subl $8, %ecx
testl %ecx, %ecx
jg .L26
popl %ebx
leave
ret
.p2align 2,,3
.L29:
leal 7(%ecx), %edx
jmp .L25
.Lfe1:
.size foo,.Lfe1-foo
.align 2
.p2align 2,,3
.globl main
.type main,@function
main:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
andl $-16, %esp
call foo
leave
ret
.Lfe2:
.size main,.Lfe2-main
.ident "GCC: (GNU) 3.2.1 20020830 (Debian prerelease)"
--------------------------------------------------
next prev parent reply other threads:[~2002-11-11 1:00 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <F122OgRGkQ6mBySsxVY00000854@hotmail.com>
[not found] ` <24987.1036797874@passion.cambridge.redhat.com>
2002-11-10 15:28 ` crc32() optimization Joakim Tjernlund
2002-11-10 18:43 ` Marc Singer
2002-11-10 19:25 ` Wolfgang Denk
2002-11-10 20:05 ` Joakim Tjernlund
2002-11-10 21:00 ` Wolfgang Denk
2002-11-10 21:22 ` Joakim Tjernlund
2002-11-10 22:35 ` Joakim Tjernlund
2002-11-10 22:41 ` Wolfgang Denk
2002-11-10 23:00 ` Joakim Tjernlund
2002-11-10 23:56 ` Eric W. Biederman
2002-11-11 1:31 ` Marc Singer [this message]
2002-11-11 1:37 ` Wolfgang Denk
2002-11-11 4:42 ` Marc Singer
2002-11-25 15:55 ` Herman Oosthuysen
2002-11-25 16:12 ` Joakim Tjernlund
2002-11-11 0:50 ` Marc Singer
2002-11-10 20:04 ` Joakim Tjernlund
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20021111013114.GB27214@buici.com \
--to=elf@buici.com \
--cc=Joakim.Tjernlund@lumentis.se \
--cc=linux-mtd@lists.infradead.org \
--cc=wd@denx.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox