* csum_partial() and csum_partial_copy_generic() badly optimized?
From: Joakim Tjernlund @ 2002-11-15 23:01 UTC
To: linuxppc-dev
Hi
Looking over the different checksum routines, I came across csum_partial() and
csum_partial_copy_generic(), which live in arch/ppc/lib/checksum.S.
This comment in csum_partial:
/* the bdnz has zero overhead, so it should */
/* be unnecessary to unroll this loop */
got me wondering (code included at the end). An instruction cannot have zero
cost/overhead; this instruction must be eating cycles. I think this function
needs unrolling, but I am pretty useless at assembler, so I need help.
Can any PPC/assembler guy comment on this and, if needed, do the
unrolling? I think an unroll factor of 6 or 8 will be enough.
The same goes for csum_partial_copy_generic().
These functions are used to checksum every IP/TCP/UDP packet, so it
would be a good thing if they were properly optimized.
It would be really nice if there were more comments (and named jump labels;
bare numbers are very uninformative). It's hard enough to understand as is.
Jocke
/*
 * computes the checksum of a memory block at buff, length len,
 * and adds in "sum" (32-bit)
 *
 * csum_partial(buff, len, sum)
 */
_GLOBAL(csum_partial)
        addic   r0,r5,0
        subi    r3,r3,4
        srwi.   r6,r4,2
        beq     3f              /* if we're doing < 4 bytes */
        andi.   r5,r3,2         /* Align buffer to longword boundary */
        beq+    1f
        lhz     r5,4(r3)        /* do 2 bytes to get aligned */
        addi    r3,r3,2
        subi    r4,r4,2
        addc    r0,r0,r5
        srwi.   r6,r4,2         /* # words to do */
        beq     3f
1:      mtctr   r6
2:      lwzu    r5,4(r3)        /* the bdnz has zero overhead, so it should */
        adde    r0,r0,r5        /* be unnecessary to unroll this loop */
        bdnz    2b
        andi.   r4,r4,3
3:      cmpi    0,r4,2
        blt+    4f
        lhz     r5,4(r3)
        addi    r3,r3,2
        subi    r4,r4,2
        adde    r0,r0,r5
4:      cmpi    0,r4,1
        bne+    5f
        lbz     r5,4(r3)
        slwi    r5,r5,8         /* Upper byte of word */
        adde    r0,r0,r5
5:      addze   r3,r0           /* add in final carry */
        blr
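For reference, here is a rough C model of the arithmetic above (a sketch
only: made-up names, big-endian byte order assumed, and not bit-exact with
the asm in where intermediate carries land, though the results agree once
folded to 16 bits):

    #include <stddef.h>
    #include <stdint.h>

    /* 32-bit ones' complement partial sum over p, folding "sum" in */
    static uint32_t csum_partial_model(const uint8_t *p, size_t len,
                                       uint32_t sum)
    {
            uint64_t acc = sum;

            while (len >= 4) {      /* main word loop (lwzu/adde above) */
                    acc += ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
                           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
                    p += 4;
                    len -= 4;
            }
            if (len >= 2) {         /* trailing halfword (the lhz path) */
                    acc += ((uint32_t)p[0] << 8) | (uint32_t)p[1];
                    p += 2;
                    len -= 2;
            }
            if (len)                /* trailing byte, upper byte of a halfword */
                    acc += (uint32_t)p[0] << 8;

            while (acc >> 32)       /* fold carries back in (the addze) */
                    acc = (acc & 0xffffffffu) + (acc >> 32);
            return (uint32_t)acc;
    }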
* Re: csum_partial() and csum_partial_copy_generic() badly optimized?
From: Tim Seufert @ 2002-11-16 2:39 UTC
To: Joakim Tjernlund; +Cc: linuxppc-dev
On Friday, November 15, 2002, at 03:01 PM, Joakim Tjernlund wrote:
> This comment in csum_partial:
> /* the bdnz has zero overhead, so it should */
> /* be unnecessary to unroll this loop */
>
> got me wondering (code included at the end). An instruction cannot have
> zero cost/overhead; this instruction must be eating cycles. I think this
> function needs unrolling, but I am pretty useless at assembler, so I
> need help.
>
> Can any PPC/assembler guy comment on this and, if needed, do the
> unrolling? I think an unroll factor of 6 or 8 will be enough.
The comment is probably correct. The reason the instruction has
(effectively) zero overhead is that most PowerPCs have a feature which
"folds" predicted-taken branches out of the instruction stream before
they are dispatched. This effectively makes the branch cost 0 cycles,
as it does not occupy integer execution resources as it would on other
possible microarchitectures.
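As an illustration (a sketch, not kernel code), here is the same word-sum
loop with a GPR counter versus CTR; with branch folding, the loop control
in the second version is essentially free:

            /* GPR counter: the decrement and compare occupy the integer
               unit every pass (addic. could not be used here anyway,
               since it would clobber the carry that adde depends on) */
    1:      lwzu    r5,4(r3)
            adde    r0,r0,r5
            addi    r6,r6,-1
            cmpwi   r6,0
            bne     1b

            /* CTR counter: decrement/test/branch collapse into bdnz,
               which most PowerPCs can fold out of the dispatch stream */
            mtctr   r6
    2:      lwzu    r5,4(r3)
            adde    r0,r0,r5
            bdnz    2b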
With current hardware trends loop unrolling can often be an
anti-optimization. Even without loop overhead reduction features like
branch folding, it may be a net penalty just because you are chewing up
more I-cache and causing more memory traffic to fill it. Consider the
costs:
Reading a cache line (8 instructions, 4-beat burst assuming 4-1-1-1
cycle timing, which is optimistic) from 133 MHz SDRAM: 52.5 ns
1 processor core cycle at 1 GHz: 1 ns
So every time you do something that causes a cache line miss, you could
have executed 50+ instructions instead. This only gets worse when you
consider more realistic memory timing (I don't know offhand whether you
can really get 4-1-1-1 burst timing with PC133 under any circumstances,
and besides it's going to be much worse than 4 cycles for the initial
beat if you don't get a page hit).
That's not to say that unrolling is useless these days, just that the
disparity between memory and processor core speed means that you have
to be careful in deciding when to apply it and to what extent.
* Re: csum_partial() and csum_partial_copy_generic() badly optimized?
From: Joakim Tjernlund @ 2002-11-16 10:16 UTC
To: Tim Seufert; +Cc: linuxppc-dev
Hi Tim
Thanks for your answer. See inline below
> On Friday, November 15, 2002, at 03:01 PM, Joakim Tjernlund wrote:
>
> > This comment in csum_partial:
> > /* the bdnz has zero overhead, so it should */
> > /* be unnecessary to unroll this loop */
> >
> > got me wondering (code included at the end). An instruction cannot
> > have zero cost/overhead; this instruction must be eating cycles. I
> > think this function needs unrolling, but I am pretty useless at
> > assembler, so I need help.
> >
> > Can any PPC/assembler guy comment on this and, if needed, do the
> > unrolling? I think an unroll factor of 6 or 8 will be enough.
>
> The comment is probably correct. The reason the instruction has
> (effectively) zero overhead is that most PowerPCs have a feature which
> "folds" predicted-taken branches out of the instruction stream before
> they are dispatched. This effectively makes the branch cost 0 cycles,
> as it does not occupy integer execution resources as it would on other
> possible microarchitectures.
>
hmm, I am on an mpc860 and I get big performance improvements if I apply
unrolling. Consider the standard CRC32 function:
while(len--) {
result = (result << 8 | *data++) ^ crctab[result >> 24];
}
If I apply manual unrolling or compile with -funroll-loops, I get a
more than 20% performance increase. Is this a special case, or is
the mpc860 doing a bad job?
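(For completeness, a self-contained sketch of the table-driven, MSB-first
CRC-32 that loop assumes; only the inner loop is from the code above, and
the polynomial, table setup and init/final conventions should be treated
as assumptions:)

    #include <stddef.h>
    #include <stdint.h>

    #define CRC32_POLY 0x04c11db7u          /* assumed CRC-32 polynomial */

    static uint32_t crctab[256];

    static void crctab_init(void)           /* usual MSB-first table */
    {
            uint32_t i, r;
            int j;

            for (i = 0; i < 256; i++) {
                    r = i << 24;
                    for (j = 0; j < 8; j++)
                            r = (r & 0x80000000u) ? (r << 1) ^ CRC32_POLY
                                                  : r << 1;
                    crctab[i] = r;
            }
    }

    static uint32_t crc32_be(uint32_t result, const unsigned char *data,
                             size_t len)
    {
            while (len--)                   /* the loop from above */
                    result = (result << 8 | *data++) ^ crctab[result >> 24];
            return result;
    }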
> With current hardware trends loop unrolling can often be an
> anti-optimization. Even without loop overhead reduction features like
> branch folding, it may be a net penalty just because you are chewing up
> more I-cache and causing more memory traffic to fill it. Consider the
> costs:
>
> Reading a cache line (8 instructions, 4-beat burst assuming 4-1-1-1
> cycle timing, which is optimistic) from 133 MHz SDRAM: 52.5 ns
>
> 1 processor core cycle at 1 GHz: 1 ns
>
> So every time you do something that causes a cache line miss, you could
> have executed 50+ instructions instead. This only gets worse when you
> consider more realistic memory timing (I don't know offhand whether you
> can really get 4-1-1-1 burst timing with PC133 under any circumstances,
> and besides it's going to be much worse than 4 cycles for the initial
> beat if you don't get a page hit).
For a big loop (many iterations) this cannot be a problem, right?
csum_partial() often has more than 1000 bytes to checksum.
>
> That's not to say that unrolling is useless these days, just that the
> disparity between memory and processor core speed means that you have
> to be careful in deciding when to apply it and to what extent.
It would seem that loop unrolling works fine on the 8xx; would
you mind doing an unrolling of that function for me to test?
It is only the 8xx that needs this; just add an #ifdef CONFIG_8xx.
Jocke
* Re: csum_partial() and csum_partial_copy_generic() badly optimized?
From: Tim Seufert @ 2002-11-17 5:58 UTC
To: Joakim Tjernlund; +Cc: linuxppc-dev
On Saturday, November 16, 2002, at 02:16 AM, Joakim Tjernlund wrote:
>> The comment is probably correct. The reason the instruction has
>> (effectively) zero overhead is that most PowerPCs have a feature which
>> "folds" predicted-taken branches out of the instruction stream before
>> they are dispatched. This effectively makes the branch cost 0 cycles,
>> as it does not occupy integer execution resources as it would on other
>> possible microarchitectures.
>>
> hmm, I am on an mpc860 and I get big performance improvements if I apply
> unrolling. Consider the standard CRC32 function:
> while(len--) {
> result = (result << 8 | *data++) ^ crctab[result >> 24];
> }
> If I apply manual unrolling or compile with -funroll-loops, I get a
> more than 20% performance increase. Is this a special case, or is
> the mpc860 doing a bad job?
Don't forget about gcc.
In the code you originally were talking about, the PPC CTR register and
bdnz instruction were used to implement the loop counter. bdnz puts
all the loop overhead (counter decrement, test, and branch) into one
instruction. Since that instruction is a branch, it can be folded, and
thus have 0 overhead. CTR and the instructions which operate on it
(such as bdnz) were put into the PPC architecture mainly as an
optimization opportunity for loops where the loop variable is not used
inside the loop body.
There is no guarantee that gcc will always use CTR, even for such
obvious candidates as the crc32 loop. gcc is simply not that great at
PPC optimization, especially at low optimization levels. I've just
been playing with gcc 2.95.4 on YDL 2.3 and Apple's gcc 3.1 on OSX
10.2.2 (these versions are merely what I happen to have installed on
machines that are handy). Here's a summary of when gcc will compile
that crc32 loop with use of CTR and bdnz (note that -O3 or above
automatically turn on -funroll-loops, so I saw no point in testing
those levels):
                 -O1     -O2     -O1 -funroll-loops     -O2 -funroll-loops
2.95.4           no      no      no                     no
3.1              no      yes     yes                    yes
If gcc isn't generating a CTR loop to start out with, the crc32 code
will benefit more from unrolling than it should.
I did a bit of crude performance testing, and with gcc 3.1 there is no
difference in the cache-hot performance of the crc32 loop when
switching between -O2 and -O2 -funroll-loops.
Now, that _was_ on a 7455. You would see some difference on a 860,
because gcc 3.1 did find something else to optimize. Here's the loop
body for -O2:
L48:
        ; basic block 2
        lbz     r9,0(r3)                ; *data
        rlwinm  r0,r30,10,22,29         ; result
        lwzx    r11,r10,r0              ; crctab
        addi    r3,r3,1                 ; data, data
        slwi    r0,r30,8                ; result
        or      r0,r0,r9
        xor     r30,r0,r11              ; result
        bdnz    L48
With -O2 -funroll-loops, gcc copies this loop body four times, and
transforms the addi (increments 'data' ptr by 1) and lbz (loads *data)
instructions. addi is hoisted from the loop body and instead the
pointer is incremented by 4 at the end of the unrolled loop. Each lbz
copy has the appropriate offset (0, 1, 2, 3) to simulate the original
pointer incrementing. The net effect is that for every four iterations
of the original loop we execute three fewer addi instructions.
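Roughly like this (my sketch of the shape, not gcc's actual output;
register choices follow the listing above, and CTR now counts groups of
four):

    L48u:
            lbz     r9,0(r3)        ; *data
            rlwinm  r0,r30,10,22,29
            lwzx    r11,r10,r0
            slwi    r0,r30,8
            or      r0,r0,r9
            xor     r30,r0,r11

            lbz     r9,1(r3)        ; *(data+1): offset replaces an addi
            rlwinm  r0,r30,10,22,29
            lwzx    r11,r10,r0
            slwi    r0,r30,8
            or      r0,r0,r9
            xor     r30,r0,r11

            ; ... two more copies with offsets 2 and 3 ...

            addi    r3,r3,4         ; one pointer bump per four bytes
            bdnz    L48u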
The 7455 has a lot of integer execution units and can probably do the
extra adds for free, but the 860 is basically a 601, and I think the
601 had just one IU, so the adds will not be free.
But if you go back to that original loop in csum_partial() etc., I
don't see any opportunity to perform similar optimizations. So I doubt
very much that unrolling that loop would have any benefit even on the
860.
* Re: csum_partial() and csum_partial_copy_generic() badly optimized?
From: Joakim Tjernlund @ 2002-11-17 15:17 UTC
To: Tim Seufert; +Cc: linuxppc-dev
> On Saturday, November 16, 2002, at 02:16 AM, Joakim Tjernlund wrote:
>
> >> The comment is probably correct. The reason the instruction has
> >> (effectively) zero overhead is that most PowerPCs have a feature which
> >> "folds" predicted-taken branches out of the instruction stream before
> >> they are dispatched. This effectively makes the branch cost 0 cycles,
> >> as it does not occupy integer execution resources as it would on other
> >> possible microarchitectures.
> >>
> > hmm, I am on an mpc860 and I get big performance improvements if I apply
> > unrolling. Consider the standard CRC32 function:
> > while(len--) {
> > result = (result << 8 | *data++) ^ crctab[result >> 24];
> > }
> > If I apply manual unrolling or compile with -funroll-loops, I get a
> > more than 20% performance increase. Is this a special case, or is
> > the mpc860 doing a bad job?
>
> Don't forget about gcc.
>
> In the code you originally were talking about, the PPC CTR register and
> bdnz instruction were used to implement the loop counter. bdnz puts
> all the loop overhead (counter decrement, test, and branch) into one
> instruction. Since that instruction is a branch, it can be folded, and
> thus have 0 overhead. CTR and the instructions which operate on it
> (such as bdnz) were put into the PPC architecture mainly as an
> optimization opportunity for loops where the loop variable is not used
> inside the loop body.
loop variable not USED or loop variable not MODIFIED?
>
> There is no guarantee that gcc will always use CTR, even for such
> obvious candidates as the crc32 loop. gcc is simply not that great at
> PPC optimization, especially at low optimization levels. I've just
> been playing with gcc 2.95.4 on YDL 2.3 and Apple's gcc 3.1 on OSX
> 10.2.2 (these versions are merely what I happen to have installed on
> machines that are handy). Here's a summary of when gcc will compile
> that crc32 loop with use of CTR and bdnz (note that -O3 or above
> automatically turn on -funroll-loops, so I saw no point in testing
> those levels):
>
>                  -O1     -O2     -O1 -funroll-loops     -O2 -funroll-loops
> 2.95.4           no      no      no                     no
> 3.1              no      yes     yes                    yes
hmm, looks like I should upgrade gcc to 3.1 or possibly 3.2. However
I think that gcc >=3.0 has changed the ABI for C++, which is bad for me.
Is 2.95.x still maintained? Maybe this optimization could be added
to that branch.
>
> If gcc isn't generating a CTR loop to start out with, the crc32 code
> will benefit more from unrolling than it should.
>
> I did a bit of crude performance testing, and with gcc 3.1 there is no
> difference in the cache-hot performance of the crc32 loop when
> switching between -O2 and -O2 -funroll-loops.
>
> Now, that _was_ on a 7455. You would see some difference on a 860,
> because gcc 3.1 did find something else to optimize. Here's the loop
> body for -O2:
>
> L48:
>         ; basic block 2
>         lbz     r9,0(r3)                ; *data
>         rlwinm  r0,r30,10,22,29         ; result
>         lwzx    r11,r10,r0              ; crctab
>         addi    r3,r3,1                 ; data, data
>         slwi    r0,r30,8                ; result
>         or      r0,r0,r9
>         xor     r30,r0,r11              ; result
>         bdnz    L48
>
> With -O2 -funroll-loops, gcc copies this loop body four times, and
> transforms the addi (increments 'data' ptr by 1) and lbz (loads *data)
> instructions. addi is hoisted from the loop body and instead the
> pointer is incremented by 4 at the end of the unrolled loop. Each lbz
> copy has the appropriate offset (0, 1, 2, 3) to simulate the original
> pointer incrementing. The net effect is that for every four iterations
> of the original loop we execute three fewer addi instructions.
I see.
> The 7455 has a lot of integer execution units and can probably do the
> extra adds for free, but the 860 is basically a 601, and I think the
> 601 had just one IU, so the adds will not be free.
probably.
>
> But if you go back to that original loop in csum_partial() etc., I
> don't see any opportunity to perform similar optimizations. So I doubt
> very much that unrolling that loop would have any benefit even on the
> 860.
You are probably right. Thanks for your effort to clear this up for me.
Jocke
* Re: csum_partial() and csum_partial_copy_generic() badly optimized?
From: Tim Seufert @ 2002-11-17 22:00 UTC
To: Joakim Tjernlund; +Cc: linuxppc-dev
On Sunday, November 17, 2002, at 07:17 AM, Joakim Tjernlund wrote:
>> CTR and the instructions which operate on it
>> (such as bdnz) were put into the PPC architecture mainly as an
>> optimization opportunity for loops where the loop variable is not used
>> inside the loop body.
>
> loop variable not USED or loop variable not MODIFIED?
Not used. CTR cannot be specified as the source or destination of most
instructions. In order to access its contents you have to use special
instructions that move between it and a normal general purpose register.
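For instance (just to illustrate; these two moves are essentially the
only way in and out of CTR):

            mtctr   r5              /* CTR <- r5, e.g. before a bdnz loop */
            mfctr   r5              /* r5 <- CTR, to read the count back  */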
>> Here's a summary of when gcc will compile
>> that crc32 loop with use of CTR and bdnz (note that -O3 or above
>> automatically turn on -funroll-loops, so I saw no point in testing
>> those levels):
>>
>>                  -O1     -O2     -O1 -funroll-loops     -O2 -funroll-loops
>> 2.95.4           no      no      no                     no
>> 3.1              no      yes     yes                    yes
>
> hmm, looks like I should upgrade gcc to 3.1 or possibly 3.2. However
> I think that gcc >=3.0 has changed the ABI for C++, which is bad for
> me.
Sooner or later you're going to want to, though. :)
> Is 2.95.x still maintained? Maybe this optimization could be added
> to that branch.
I don't know. It probably is to some extent, just because there are
plenty of people in the same boat with you on the C++ ABI changes.
* Re: csum_partial() and csum_partial_copy_generic() badly optimized?
From: Joakim Tjernlund @ 2002-11-17 23:32 UTC
To: Tim Seufert; +Cc: linuxppc-dev
> On Sunday, November 17, 2002, at 07:17 AM, Joakim Tjernlund wrote:
>
> >> CTR and the instructions which operate on it
> >> (such as bdnz) were put into the PPC architecture mainly as an
> >> optimization opportunity for loops where the loop variable is not used
> >> inside the loop body.
> >
> > loop variable not USED or loop variable not MODIFIED?
>
> Not used. CTR cannot be specified as the source or destination of most
> instructions. In order to access its contents you have to use special
> instructions that move between it and a normal general purpose register.
OK, so how about if I modify the crc32 loop:
unsigned char * end = data +len;
while(data < end) {
result = (result << 8 | *data++) ^ crctab[result >> 24];
}
will that be possible to optimize with something similar to bdnz also?
>
> >> Here's a summary of when gcc will compile
> >> that crc32 loop with use of CTR and bdnz (note that -O3 or above
> >> automatically turn on -funroll-loops, so I saw no point in testing
> >> those levels):
> >>
> >>                  -O1     -O2     -O1 -funroll-loops     -O2 -funroll-loops
> >> 2.95.4           no      no      no                     no
> >> 3.1              no      yes     yes                    yes
> >
> > hmm, looks like I should upgrade gcc to 3.1 or possibly 3.2. However
> > I think that gcc >=3.0 has changed the ABI for C++, which is bad for
> > me.
>
> Sooner or later you're going to want to, though. :)
Yes, but upgrading our customers will be a pain :-(
Jocke
* Re: csum_partial() and csum_partial_copy_generic() badly optimized?
From: Tim Seufert @ 2002-11-18 1:27 UTC
To: Joakim Tjernlund; +Cc: linuxppc-dev
On Sunday, November 17, 2002, at 03:32 PM, Joakim Tjernlund wrote:
>> Not used. CTR cannot be specified as the source or destination of
>> most
>> instructions. In order to access its contents you have to use special
>> instructions that move between it and a normal general purpose
>> register.
>
> OK, so how about if I modify the crc32 loop:
>
> unsigned char * end = data +len;
> while(data < end) {
> result = (result << 8 | *data++) ^ crctab[result >> 24];
> }
>
> will that be possible to optimize with something similar to bdnz
> also?
Nope. If the pointer was put into the CTR register, it would have to
be copied back out with an extra instruction in order to be
dereferenced. Also, I'm fairly sure there is no way to compare CTR to
anything but zero, so loops really need to be of the form "while
(counter--)" to use it.
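In other words (a sketch reusing the crc32 example; whether a given gcc
actually uses CTR for either shape is version-dependent, as discussed
earlier in the thread):

    #include <stddef.h>
    #include <stdint.h>

    extern uint32_t crctab[256];

    /* Pointer-compare form: the induction variable is the pointer itself,
       which the body must dereference, so it cannot be hidden in CTR. */
    uint32_t crc32_ptr(uint32_t result, const unsigned char *data,
                       size_t len)
    {
            const unsigned char *end = data + len;

            while (data < end)
                    result = (result << 8 | *data++) ^ crctab[result >> 24];
            return result;
    }

    /* Count-down form: the counter is only decremented and tested against
       zero, which is exactly what mtctr/bdnz implement. */
    uint32_t crc32_ctr(uint32_t result, const unsigned char *data,
                       size_t len)
    {
            while (len--)
                    result = (result << 8 | *data++) ^ crctab[result >> 24];
            return result;
    }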
* Re: csum_partial() and csum_partial_copy_generic() badly optimized?
From: Gabriel Paubert @ 2002-11-18 4:12 UTC
To: Joakim Tjernlund; +Cc: Tim Seufert, linuxppc-dev
Joakim Tjernlund wrote:
>>On Sunday, November 17, 2002, at 07:17 AM, Joakim Tjernlund wrote:
>>
>>
>>>>CTR and the instructions which operate on it
>>>>(such as bdnz) were put into the PPC architecture mainly as an
>>>>optimization opportunity for loops where the loop variable is not used
>>>>inside the loop body.
>>>
>>>loop variable not USED or loop variable not MODIFIED?
>>
>>Not used. CTR cannot be specified as the source or destination of most
>>instructions. In order to access its contents you have to use special
>>instructions that move between it and a normal general purpose register.
>
>
> OK, so how about if I modify the crc32 loop:
>
> unsigned char * end = data +len;
> while(data < end) {
> result = (result << 8 | *data++) ^ crctab[result >> 24];
> }
>
> will that be possible to optimize with something similar to bdnz also?
I don't know if even bleeding-edge gcc can do it. Basically, you can always
use bdnz as soon as you can compute the iteration count before entering
the loop. The problem is that equivalent source code constructs do not
always result in exactly equivalent internal representations in GCC. The
transforms which are attempted depend on the exact version of GCC and
the optimization level.
In the example code you give, the variable 'end' is absolutely useless and
forces the compiler to do more simplifications (essentially eliminating
end and using end-data as a loop index if it wants to use bdnz). Making
life more complex for the compiler is never a good idea...
I'd rather write it as:
int i;
for (i = 0; i < len; i++) {
        result = ...data++...;
}
when i is not modified in the loop. I'm almost sure that recent gcc will
end up using a bdnz instruction in this simple case.
This said, it is probably very hard to optimize this loop since
the load from crctab and the dependencies between iterations
introduce quite a few delays.
0:      lbzu    scratch,*data,1
        rlwinm  tmp,result,10,0x3fc
        slwi    result,result,8
        lwzx    tmp,crctab,tmp
        or      result,result,scratch
        xor     result,result,tmp
        bdnz    0b
is probably the best you can get. The worst path which limits iteration
rate is rlwinm+lwz+xor, which will be 4 or 5 clock cycles typically.
I'm not sure that gcc will use an lbzu for reading the byte array. It may
help to explicitly decrement data before the loop and then use *++data
which better matches the operation of lbzu (I know that post-increment
being worse than pre-increment was true for some versions of gcc, but I
don't know exactly which).
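Something like this, that is (a sketch; note that forming data-1 before
the loop is strictly outside what the C standard guarantees for pointers,
it is shown only because it mirrors the lbzu addressing):

    #include <stddef.h>
    #include <stdint.h>

    extern uint32_t crctab[256];

    uint32_t crc32_preinc(uint32_t result, const unsigned char *data,
                          size_t len)
    {
            data--;                 /* bias once so *++data maps to lbzu */
            while (len--)
                    result = (result << 8 | *++data) ^ crctab[result >> 24];
            return result;
    }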
Finally a truly clever compiler which knows its PPC assembly should be
able to notice that one instruction can be saved since (result<<8|*data)
can be replaced by a bit field insert:
0:      lbzu    scratch,*data,1
        rlwinm  tmp,result,10,0x3fc
        lwzx    tmp,crctab,tmp
        rlwimi  scratch,result,8,0xffffff00
        xor     result,scratch,tmp
        bdnz    0b
but this would not help the critical path of the loop.
Gabriel.
* RE: csum_partial() and csum_partial_copy_generic() badly optimized?
From: Joakim Tjernlund @ 2002-11-18 13:49 UTC
To: Gabriel Paubert; +Cc: Tim Seufert, linuxppc-dev
> > OK, so how about if I modify the crc32 loop:
> >
> > unsigned char * end = data +len;
> > while(data < end) {
> > result = (result << 8 | *data++) ^ crctab[result >> 24];
> > }
> >
> > will that be possible to optimize with something similar to bdnz also?
[SNIP]
> Gabriel.
Ok, thanks for the lesson. I decided to have a closer look at arch/ppc/kernel/misc.S to
see how it uses the bdnz instruction. I think I may have found a bug:
/*
 * Like above, but invalidate the D-cache. This is used by the 8xx
 * to invalidate the cache so the PPC core doesn't get stale data
 * from the CPM (no cache snooping here :-).
 *
 * invalidate_dcache_range(unsigned long start, unsigned long stop)
 */
_GLOBAL(invalidate_dcache_range)
        li      r5,L1_CACHE_LINE_SIZE-1
        andc    r3,r3,r5
        subf    r4,r3,r4
        add     r4,r4,r5
        srwi.   r4,r4,LG_L1_CACHE_LINE_SIZE
        beqlr
        mtctr   r4
1:      dcbi    0,r3
        addi    r3,r3,L1_CACHE_LINE_SIZE
        bdnz    1b
        sync                    /* wait for dcbi's to get to ram */
        blr
Suppose you do an invalidate_dcache_range(0,16): then 2 cache lines should be
invalidated on an mpc8xx, since the range 0 to 16 is 17 bytes and a cache line
is 16 bytes. If I understand this assembly, mtctr r4 will load the CTR with 1,
and that will only execute the dcbi 0,r3 once. Am I making sense here?
I think the function should look something like this:
_GLOBAL(invalidate_dcache_range)
        subf    r4,r3,r4
        beqlr
        srwi.   r4,r4,LG_L1_CACHE_LINE_SIZE
        addi    r4,r4,1
        mtctr   r4
1:      dcbi    0,r3
        addi    r3,r3,L1_CACHE_LINE_SIZE
        bdnz    1b
        sync                    /* wait for dcbi's to get to ram */
        blr
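To make the question concrete, here is a C model of the iteration count
the original asm computes (LINE and the function name are made up):

    #define LINE 16UL   /* assumed: L1_CACHE_LINE_SIZE on the 8xx */

    unsigned long dcbi_iterations(unsigned long start, unsigned long stop)
    {
            unsigned long aligned = start & ~(LINE - 1);

            return (stop - aligned + LINE - 1) / LINE;  /* same math as the asm */
    }

With this, dcbi_iterations(0, 16) == 1, so whether that is a bug depends
entirely on whether 'stop' means the last byte or one past it, which is
really my question.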
* Re: csum_partial() and csum_partial_copy_generic() badly optimized?
From: Gabriel Paubert @ 2002-11-18 18:05 UTC
To: joakim.tjernlund; +Cc: Tim Seufert, linuxppc-dev
Joakim Tjernlund wrote:
> Ok, thanks for the lesson. I decided to have a closer look at arch/ppc/kernel/misc.S to
> see how it uses the bdnz instruction. I think I may have found a bug:
>
> /*
>  * Like above, but invalidate the D-cache. This is used by the 8xx
>  * to invalidate the cache so the PPC core doesn't get stale data
>  * from the CPM (no cache snooping here :-).
>  *
>  * invalidate_dcache_range(unsigned long start, unsigned long stop)
>  */
> _GLOBAL(invalidate_dcache_range)
>         li      r5,L1_CACHE_LINE_SIZE-1
>         andc    r3,r3,r5
>         subf    r4,r3,r4
>         add     r4,r4,r5
>         srwi.   r4,r4,LG_L1_CACHE_LINE_SIZE
>         beqlr
>         mtctr   r4
>
> 1:      dcbi    0,r3
>         addi    r3,r3,L1_CACHE_LINE_SIZE
>         bdnz    1b
>         sync                    /* wait for dcbi's to get to ram */
>         blr
>
> Suppose you do an invalidate_dcache_range(0,16): then 2 cache lines should be
> invalidated on an mpc8xx, since the range 0 to 16 is 17 bytes and a cache line
> is 16 bytes.
I don't know this code; whether it is correct or not depends on what you
pass in r4. If it is invalidate_dcache_range(start, start+len), the code
is correct, since start+len is one byte beyond the buffer. If it is
invalidate_dcache_range(first, last), then it is buggy. The former
definition of the parameters is more frequent in practice.
This said, the first instruction can be removed:
_GLOBAL(invalidate_dcache_range)
        rlwinm  r3,r3,0,~(L1_CACHE_LINE_SIZE-1)
        subf    r4,r3,r4
        add     r4,r4,L1_CACHE_LINE_SIZE-1
should work.
> If I understand this assembly, mtctr r4 will load the CTR with 1, and that
> will only execute the dcbi 0,r3 once. Am I making sense here?
Yes, but I believe that the parameters are defined that way. There is
a reason why C wants pointers to the element following the end
of an array to be valid.
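An illustrative caller under that convention (buffer name and size are
made up):

    extern void invalidate_dcache_range(unsigned long start,
                                        unsigned long stop);

    unsigned char rxbuf[256];

    void rxbuf_invalidate(void)
    {
            /* 'stop' is one past the last byte, exactly like a C
               one-past-the-end pointer */
            invalidate_dcache_range((unsigned long)rxbuf,
                                    (unsigned long)rxbuf + sizeof(rxbuf));
    }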
[SNIP]
Regards,
Gabriel.
* RE: csum_partial() and csum_partial_copy_generic() badly optimized?
From: Joakim Tjernlund @ 2002-11-18 18:43 UTC
To: Gabriel Paubert; +Cc: Tim Seufert, linuxppc-dev
>
> Joakim Tjernlund wrote:
>
> > Ok, thanks for the lesson. I decided to have a closer look at arch/ppc/kernel/misc.S to
> > see how it uses the bdnz instruction. I think I may have found a bug:
> >
> > /*
> >  * Like above, but invalidate the D-cache. This is used by the 8xx
> >  * to invalidate the cache so the PPC core doesn't get stale data
> >  * from the CPM (no cache snooping here :-).
> >  *
> >  * invalidate_dcache_range(unsigned long start, unsigned long stop)
> >  */
> > _GLOBAL(invalidate_dcache_range)
> >         li      r5,L1_CACHE_LINE_SIZE-1
> >         andc    r3,r3,r5
> >         subf    r4,r3,r4
> >         add     r4,r4,r5
> >         srwi.   r4,r4,LG_L1_CACHE_LINE_SIZE
> >         beqlr
> >         mtctr   r4
> >
> > 1:      dcbi    0,r3
> >         addi    r3,r3,L1_CACHE_LINE_SIZE
> >         bdnz    1b
> >         sync                    /* wait for dcbi's to get to ram */
> >         blr
> >
> > Suppose you do an invalidate_dcache_range(0,16): then 2 cache lines should be
> > invalidated on an mpc8xx, since the range 0 to 16 is 17 bytes and a cache line
> > is 16 bytes.
>
> I don't know this code; whether it is correct or not depends on what you
> pass in r4. If it is invalidate_dcache_range(start, start+len), the code
> is correct, since start+len is one byte beyond the buffer. If it is
> invalidate_dcache_range(first, last), then it is buggy. The former
> definition of the parameters is more frequent in practice.
hmm, I think you are right. The practice seems to be invalidate_dcache_range(start, start+len).
But then the interface should be invalidate_dcache_range(start, len) instead, IMHO.
I guess it's too late to change it now.
>
> This said, the first instruction can be removed:
> _GLOBAL(invalidate_dcache_range)
>         rlwinm  r3,r3,0,~(L1_CACHE_LINE_SIZE-1)
>         subf    r4,r3,r4
>         add     r4,r4,L1_CACHE_LINE_SIZE-1
>
> should work.
I guess so.
Regards
Jocke
* Re: csum_partial() and csum_partial_copy_generic() badly optimized?
From: Gabriel Paubert @ 2002-11-19 1:24 UTC
To: joakim.tjernlund; +Cc: Tim Seufert, linuxppc-dev
Joakim Tjernlund wrote:
> hmm, I think you are right. The practice seems to be invalidate_dcache_range(start, start+len).
> But then the interface should be invalidate_dcache_range(start, len) instead, IMHO.
> I guess it's too late to change it now.
Probably. I don't know why it is this way, since it would in most cases be
better to do the start+len in the subroutine, almost always saving one
instruction in the caller.
>
>
>>This said, the first instruction can be removed:
>> _GLOBAL(invalidate_dcache_range)
>>         rlwinm  r3,r3,0,~(L1_CACHE_LINE_SIZE-1)
>>         subf    r4,r3,r4
>>         add     r4,r4,L1_CACHE_LINE_SIZE-1
Of course, this instruction should be an immediate form: addi, not add!
Regards,
Gabriel.
* Re: csum_partial() and csum_partial_copy_generic() badly optimized?
From: Paul Mackerras @ 2002-11-19 3:31 UTC
To: Gabriel Paubert; +Cc: joakim.tjernlund, Tim Seufert, linuxppc-dev
Gabriel Paubert writes:
> I don't know this code, whether it is correct or not depends on what you
> pass in r4. If it is invalidate_dcache_range(start, start+len), the code
> is correct since start+len is one byte beyond the buffer. If it is
> invalidate_dcache_range(first, last), then it is buggy. The former
> definition of parameters is more frequent in practice.
That is correct, the `stop' parameter is the address of the first byte
after the end of the range that you want invalidated.
> This said, the first instruction can be removed:
> _GLOBAL(invalidate_dcache_range)
>         rlwinm  r3,r3,0,~(L1_CACHE_LINE_SIZE-1)
Huh? I'm guessing you really mean to say:
        rlwinm  r3,r3,0,0,31-LG_L1_CACHE_LINE_SIZE
Regards,
Paul.
* Re: csum_partial() and csum_partial_copy_generic() badly optimized?
From: Gabriel Paubert @ 2002-11-19 5:35 UTC
To: Paul Mackerras; +Cc: joakim.tjernlund, Tim Seufert, linuxppc-dev
Paul Mackerras wrote:
> Gabriel Paubert writes:
>
>
>>I don't know this code, whether it is correct or not depends on what you
>>pass in r4. If it is invalidate_dcache_range(start, start+len), the code
>>is correct since start+len is one byte beyond the buffer. If it is
>>invalidate_dcache_range(first, last), then it is buggy. The former
>>definition of parameters is more frequent in practice.
>
>
> That is correct, the `stop' parameter is the address of the first byte
> after the end of the range that you want invalidated.
>
>
>>This said, the first instruction can be removed:
>> _GLOBAL(invalidate_dcache_range)
>>         rlwinm  r3,r3,0,~(L1_CACHE_LINE_SIZE-1)
>
>
> Huh? I'm guessing you really mean to say:
>
>         rlwinm  r3,r3,0,0,31-LG_L1_CACHE_LINE_SIZE
No; for a long time now, gas has accepted a special syntax for rlwinm and
friends with only four parameters: instead of two bit numbers giving the
first and last bit of the mask, you simply write the mask value. Very handy
when writing assembly code.
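For example (assuming 16-byte 8xx cache lines, so the mask is
~(16-1) = 0xfffffff0), these two spellings assemble to the same
instruction:

        rlwinm  r3,r3,0,0,27            /* begin/end bit numbers: bits 0..27 */
        rlwinm  r3,r3,0,0xfffffff0      /* the same mask written directly */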
Regards,
Gabriel.