From: "Joakim Tjernlund" <Joakim.Tjernlund@lumentis.se>
To: "Tim Seufert" <tas@mindspring.com>
Cc: <linuxppc-dev@lists.linuxppc.org>
Subject: Re: csum_partial() and csum_partial_copy_generic() in badly optimized?
Date: Sun, 17 Nov 2002 16:17:41 +0100
Message-ID: <002801c28e4c$79219d60$0200a8c0@telia.com>
In-Reply-To: A09E986A-F9F1-11D6-AEE9-003065F22EAA@mindspring.com
> On Saturday, November 16, 2002, at 02:16 AM, Joakim Tjernlund wrote:
>
> >> The comment is probably correct. The reason the instruction has
> >> (effectively) zero overhead is that most PowerPCs have a feature which
> >> "folds" predicted-taken branches out of the instruction stream before
> >> they are dispatched. This effectively makes the branch cost 0 cycles,
> >> as it does not occupy integer execution resources as it would on other
> >> possible microarchitectures.
> >>
> > hmm, I am on a mpc860 and I get big performance improvements if I apply
> > unrolling. Consider the standard CRC32 function:
> >   while (len--) {
> >     result = (result << 8 | *data++) ^ crctab[result >> 24];
> >   }
> > If I apply manual unrolling or compile with -funroll-loops I get a
> > performance increase of more than 20%. Is this a special case or is
> > the mpc860 doing a bad job?
>
> Don't forget about gcc.
>
> In the code you originally were talking about, the PPC CTR register and
> bdnz instruction were used to implement the loop counter. bdnz puts
> all the loop overhead (counter decrement, test, and branch) into one
> instruction. Since that instruction is a branch, it can be folded, and
> thus have 0 overhead. CTR and the instructions which operate on it
> (such as bdnz) were put into the PPC architecture mainly as an
> optimization opportunity for loops where the loop variable is not used
> inside the loop body.
Do you mean the loop variable is not USED, or not MODIFIED?
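To make the distinction concrete, here is a small C sketch of the two cases. The function names are hypothetical and chosen only for illustration. In the first loop the counter is a pure trip count that the body never touches; in the second the counter is read in the body (as an index) but never modified there. Whether a given compiler emits bdnz for either form is up to the compiler, so this illustrates the question rather than answering it.

```c
#include <stddef.h>
#include <stdint.h>

/* Counter n is only a trip count: the loop body neither reads nor writes it. */
uint32_t sum_trip_count(const uint8_t *p, size_t n)
{
	uint32_t s = 0;

	while (n--)
		s += *p++;
	return s;
}

/* Counter i is read in the body (as an array index) but not modified there. */
uint32_t sum_indexed(const uint8_t *p, size_t n)
{
	uint32_t s = 0;

	for (size_t i = 0; i < n; i++)
		s += p[i];
	return s;
}
```

Both functions compute the same sum; only the role of the loop counter differs.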
>
> There is no guarantee that gcc will always use CTR, even for such
> obvious candidates as the crc32 loop. gcc is simply not that great at
> PPC optimization, especially at low optimization levels. I've just
> been playing with gcc 2.95.4 on YDL 2.3 and Apple's gcc 3.1 on OSX
> 10.2.2 (these versions are merely what I happen to have installed on
> machines that are handy). Here's a summary of when gcc will compile
> that crc32 loop with use of CTR and bdnz (note that -O3 or above
> automatically turn on -funroll-loops, so I saw no point in testing
> those levels):
>
> gcc       -O1    -O2    -O1 -funroll-loops    -O2 -funroll-loops
> 2.95.4    no     no     no                    no
> 3.1       no     yes    yes                   yes
Hmm, it looks like I should upgrade gcc to 3.1 or possibly 3.2. However,
I think gcc >= 3.0 changed the C++ ABI, which is bad for me.
Is 2.95.x still maintained? Maybe this optimization could be added
to that branch.
>
> If gcc isn't generating a CTR loop to start out with, the crc32 code
> will benefit more from unrolling than it should.
>
> I did a bit of crude performance testing, and with gcc 3.1 there is no
> difference in the cache-hot performance of the crc32 loop when
> switching between -O2 and -O2 -funroll-loops.
>
> Now, that _was_ on a 7455. You would see some difference on an 860,
> because gcc 3.1 did find something else to optimize. Here's the loop
> body for -O2:
>
> L48:
>         ; basic block 2
>         lbz     r9,0(r3)          ; * data
>         rlwinm  r0,r30,10,22,29   ; result
>         lwzx    r11,r10,r0        ; crctab
>         addi    r3,r3,1           ; data, data
>         slwi    r0,r30,8          ; result
>         or      r0,r0,r9
>         xor     r30,r0,r11        ; result
>         bdnz    L48
>
> With -O2 -funroll-loops, gcc copies this loop body four times, and
> transforms the addi (increments 'data' ptr by 1) and lbz (loads *data)
> instructions. addi is hoisted from the loop body and instead the
> pointer is incremented by 4 at the end of the unrolled loop. Each lbz
> copy has the appropriate offset (0, 1, 2, 3) to simulate the original
> pointer incrementing. The net effect is that for every four iterations
> of the original loop we execute three fewer addi instructions.
I see.
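Written by hand in C, the transformation described above would look roughly like the sketch below. The function names, the table setup, and the polynomial 0x04C11DB7 are assumptions made for illustration; they come neither from the kernel nor from gcc's output. The unrolled version reads data[0] through data[3] and bumps the pointer once per four bytes, which mirrors the lbz offsets and the single addi.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table: the thread does not show how crctab is built.
 * It is filled here for the MSB-first polynomial 0x04C11DB7 (an assumption). */
uint32_t crctab[256];

void crc_table_init(void)
{
	for (uint32_t i = 0; i < 256; i++) {
		uint32_t c = i << 24;

		for (int bit = 0; bit < 8; bit++)
			c = (c & 0x80000000u) ? (c << 1) ^ 0x04C11DB7u : (c << 1);
		crctab[i] = c;
	}
}

/* The loop from the thread: one pointer increment per byte. */
uint32_t crc_rolled(uint32_t result, const unsigned char *data, size_t len)
{
	assert(data != NULL || len == 0);
	while (len--)
		result = (result << 8 | *data++) ^ crctab[result >> 24];
	return result;
}

/* Unrolled by four: fixed offsets 0..3, one pointer increment per block. */
uint32_t crc_unrolled4(uint32_t result, const unsigned char *data, size_t len)
{
	size_t blocks = len / 4;
	size_t tail = len % 4;

	assert(data != NULL || len == 0);
	while (blocks--) {
		result = (result << 8 | data[0]) ^ crctab[result >> 24];
		result = (result << 8 | data[1]) ^ crctab[result >> 24];
		result = (result << 8 | data[2]) ^ crctab[result >> 24];
		result = (result << 8 | data[3]) ^ crctab[result >> 24];
		data += 4;	/* replaces four separate increments */
	}
	/* Leftover bytes when len is not a multiple of 4. */
	while (tail--)
		result = (result << 8 | *data++) ^ crctab[result >> 24];
	return result;
}
```

Call crc_table_init() once before using either function. Both return the same value for the same input; the only difference is how often the data pointer is updated.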
> The 7455 has a lot of integer execution units and can probably do the
> extra adds for free, but the 860 is basically a 601, and I think the
> 601 had just one IU, so the adds will not be free.
probably.
>
> But if you go back to that original loop in csum_partial() etc., I
> don't see any opportunity to perform similar optimizations. So I doubt
> very much that unrolling that loop would have any benefit even on the
> 860.
You are probably right. Thanks for your effort to clear this up for me.
Jocke
** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/