[v5] MIPS: lib: csum_partial: more instruction paral

All of lore.kernel.org
 help / color / mirror / Atom feed

* [v5] MIPS: lib: csum_partial: more instruction paral
@ 2015-03-26 17:07 cee1
  2015-03-30 20:10 ` Ralf Baechle
  2015-03-31  8:34 ` cee1
  0 siblings, 2 replies; 8+ messages in thread
From: cee1 @ 2015-03-26 17:07 UTC (permalink / raw)
  To: linux-mips; +Cc: ralf, Chen Jie

From: Chen Jie <chenj@lemote.com>

Computing sum introduces true data dependency. This patch removes some
true data depdendencies, hence increases instruction level parallelism.

This patch brings at most 50% csum performance gain on Loongson 3a
processor in our test.

One example about how this patch works is in CSUM_BIGCHUNK1:
// ** original **    vs    ** patch applied **
    ADDC(sum, t0)           ADDC(t0, t1)
    ADDC(sum, t1)           ADDC(t2, t3)
    ADDC(sum, t2)           ADDC(sum, t0)
    ADDC(sum, t3)           ADDC(sum, t2)

In the original implementation, each ADDC(sum, ...) depends on the sum
value updated by previous ADDC(as source operand).

With this patch applied, the first two ADDC operations are independent,
hence can be executed simultaneously if possible.

Another example is in the "copy and sum calculating chunk":
// ** original **    vs    ** patch applied **
    STORE(t0, UNIT(0) ...   STORE(t0, UNIT(0) ...
    ADDC(sum, t0)           ADDC(t0, t1)
    STORE(t1, UNIT(1) ...   STORE(t1, UNIT(1) ...
    ADDC(sum, t1)           ADDC(sum, t0)
    STORE(t2, UNIT(2) ...   STORE(t2, UNIT(2) ...
    ADDC(sum, t2)           ADDC(t2, t3)
    STORE(t3, UNIT(3) ...   STORE(t3, UNIT(3) ...
    ADDC(sum, t3)           ADDC(sum, t2)

With this patch applied, ADDC and the **next next** ADDC are independent.

Signed-off-by: chenj <chenj@lemote.com>
---
1. The result can be found at
http://dev.lemote.com/files/upload/software/csum-opti/csum-opti-benchmark.html
And is generated by a userspace test program:
http://dev.lemote.com/files/upload/software/csum-opti/csum-test.tar.gz

[v2: amend commit message]
[v3: further amend commit message]
[v4: amend commit message & sign-off my patch]
[v5: amend commit message]

 arch/mips/lib/csum_partial.S | 38 +++++++++++++++++++-------------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/mips/lib/csum_partial.S b/arch/mips/lib/csum_partial.S
index 4c721e2..ed88647 100644
--- a/arch/mips/lib/csum_partial.S
+++ b/arch/mips/lib/csum_partial.S
@@ -76,10 +76,10 @@
 	LOAD	_t1, (offset + UNIT(1))(src);			\
 	LOAD	_t2, (offset + UNIT(2))(src);			\
 	LOAD	_t3, (offset + UNIT(3))(src);			\
+	ADDC(_t0, _t1);						\
+	ADDC(_t2, _t3);						\
 	ADDC(sum, _t0);						\
-	ADDC(sum, _t1);						\
-	ADDC(sum, _t2);						\
-	ADDC(sum, _t3)
+	ADDC(sum, _t2)
 
 #ifdef USE_DOUBLE
 #define CSUM_BIGCHUNK(src, offset, sum, _t0, _t1, _t2, _t3)	\
@@ -504,21 +504,21 @@ LEAF(csum_partial)
 	SUB	len, len, 8*NBYTES
 	ADD	src, src, 8*NBYTES
 	STORE(t0, UNIT(0)(dst),	.Ls_exc\@)
-	ADDC(sum, t0)
+	ADDC(t0, t1)
 	STORE(t1, UNIT(1)(dst),	.Ls_exc\@)
-	ADDC(sum, t1)
+	ADDC(sum, t0)
 	STORE(t2, UNIT(2)(dst),	.Ls_exc\@)
-	ADDC(sum, t2)
+	ADDC(t2, t3)
 	STORE(t3, UNIT(3)(dst),	.Ls_exc\@)
-	ADDC(sum, t3)
+	ADDC(sum, t2)
 	STORE(t4, UNIT(4)(dst),	.Ls_exc\@)
-	ADDC(sum, t4)
+	ADDC(t4, t5)
 	STORE(t5, UNIT(5)(dst),	.Ls_exc\@)
-	ADDC(sum, t5)
+	ADDC(sum, t4)
 	STORE(t6, UNIT(6)(dst),	.Ls_exc\@)
-	ADDC(sum, t6)
+	ADDC(t6, t7)
 	STORE(t7, UNIT(7)(dst),	.Ls_exc\@)
-	ADDC(sum, t7)
+	ADDC(sum, t6)
 	.set	reorder				/* DADDI_WAR */
 	ADD	dst, dst, 8*NBYTES
 	bgez	len, 1b
@@ -544,13 +544,13 @@ LEAF(csum_partial)
 	SUB	len, len, 4*NBYTES
 	ADD	src, src, 4*NBYTES
 	STORE(t0, UNIT(0)(dst),	.Ls_exc\@)
-	ADDC(sum, t0)
+	ADDC(t0, t1)
 	STORE(t1, UNIT(1)(dst),	.Ls_exc\@)
-	ADDC(sum, t1)
+	ADDC(sum, t0)
 	STORE(t2, UNIT(2)(dst),	.Ls_exc\@)
-	ADDC(sum, t2)
+	ADDC(t2, t3)
 	STORE(t3, UNIT(3)(dst),	.Ls_exc\@)
-	ADDC(sum, t3)
+	ADDC(sum, t2)
 	.set	reorder				/* DADDI_WAR */
 	ADD	dst, dst, 4*NBYTES
 	beqz	len, .Ldone\@
@@ -649,13 +649,13 @@ LEAF(csum_partial)
 	nop				# improves slotting
 #endif
 	STORE(t0, UNIT(0)(dst),	.Ls_exc\@)
-	ADDC(sum, t0)
+	ADDC(t0, t1)
 	STORE(t1, UNIT(1)(dst),	.Ls_exc\@)
-	ADDC(sum, t1)
+	ADDC(sum, t0)
 	STORE(t2, UNIT(2)(dst),	.Ls_exc\@)
-	ADDC(sum, t2)
+	ADDC(t2, t3)
 	STORE(t3, UNIT(3)(dst),	.Ls_exc\@)
-	ADDC(sum, t3)
+	ADDC(sum, t2)
 	.set	reorder				/* DADDI_WAR */
 	ADD	dst, dst, 4*NBYTES
 	bne	len, rem, 1b
-- 
1.9.5 (Apple Git-50.3)

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [v5] MIPS: lib: csum_partial: more instruction paral
  2015-03-26 17:07 [v5] MIPS: lib: csum_partial: more instruction paral cee1
@ 2015-03-30 20:10 ` Ralf Baechle
  2015-03-31  8:34   ` cee1
  2015-03-31  8:34 ` cee1
  1 sibling, 1 reply; 8+ messages in thread
From: Ralf Baechle @ 2015-03-30 20:10 UTC (permalink / raw)
  To: cee1; +Cc: linux-mips, Chen Jie

On Fri, Mar 27, 2015 at 01:07:24AM +0800, cee1 wrote:

> From: Chen Jie <chenj@lemote.com>
> 
> Computing sum introduces true data dependency. This patch removes some
> true data depdendencies, hence increases instruction level parallelism.
> 
> This patch brings at most 50% csum performance gain on Loongson 3a
> processor in our test.
> 
> One example about how this patch works is in CSUM_BIGCHUNK1:
> // ** original **    vs    ** patch applied **
>     ADDC(sum, t0)           ADDC(t0, t1)
>     ADDC(sum, t1)           ADDC(t2, t3)
>     ADDC(sum, t2)           ADDC(sum, t0)
>     ADDC(sum, t3)           ADDC(sum, t2)
> 
> In the original implementation, each ADDC(sum, ...) depends on the sum
> value updated by previous ADDC(as source operand).
> 
> With this patch applied, the first two ADDC operations are independent,
> hence can be executed simultaneously if possible.
> 
> Another example is in the "copy and sum calculating chunk":
> // ** original **    vs    ** patch applied **
>     STORE(t0, UNIT(0) ...   STORE(t0, UNIT(0) ...
>     ADDC(sum, t0)           ADDC(t0, t1)
>     STORE(t1, UNIT(1) ...   STORE(t1, UNIT(1) ...
>     ADDC(sum, t1)           ADDC(sum, t0)
>     STORE(t2, UNIT(2) ...   STORE(t2, UNIT(2) ...
>     ADDC(sum, t2)           ADDC(t2, t3)
>     STORE(t3, UNIT(3) ...   STORE(t3, UNIT(3) ...
>     ADDC(sum, t3)           ADDC(sum, t2)
> 
> With this patch applied, ADDC and the **next next** ADDC are independent.

This is interesting because even CPUs as old as the R2000 have a pipeline
bypass which allows an instruction to use a result written to a register
by an immediately preceeeding instruction.

Can you explain why this patch is so beneficial for Loongson 3A?

Thanks,

  Ralf

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [v5] MIPS: lib: csum_partial: more instruction paral
  2015-03-30 20:10 ` Ralf Baechle
@ 2015-03-31  8:34   ` cee1
  2015-04-02 12:59     ` Maciej W. Rozycki
  0 siblings, 1 reply; 8+ messages in thread
From: cee1 @ 2015-03-31  8:34 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: Linux MIPS Mailing List, Chen Jie

2015-03-31 4:10 GMT+08:00 Ralf Baechle <ralf@linux-mips.org>:
>> One example about how this patch works is in CSUM_BIGCHUNK1:
>> // ** original **    vs    ** patch applied **
>>     ADDC(sum, t0)           ADDC(t0, t1)
>>     ADDC(sum, t1)           ADDC(t2, t3)
>>     ADDC(sum, t2)           ADDC(sum, t0)
>>     ADDC(sum, t3)           ADDC(sum, t2)
>>
>> With this patch applied, ADDC and the **next next** ADDC are independent.
>
> This is interesting because even CPUs as old as the R2000 have a pipeline
> bypass which allows an instruction to use a result written to a register
> by an immediately preceeeding instruction.

But if removes some dependency(as the patch did), instruction A and
instruction B can be issued at the same cycle[1], instead of B waiting
for the result from A   (a pipeline bypass reduces the wait time, but
not eliminates it, right?)

>
> Can you explain why this patch is so beneficial for Loongson 3A?

I have written a simply test[2] to measure the performance gain on
Loongson 3A, the result[3] shows at most 50% performance gain.

IMHO, the patch not only benefits Loongson 3A, but would also benefit
other MIPS CPU(s).


--
1. If the hardware supports this, e.g. at least two ALU units for ALU
operations, and is an out of order execution pipeline, etc
2. http://dev.lemote.com/files/upload/software/csum-opti/csum-test.tar.gz
3. http://dev.lemote.com/files/upload/software/csum-opti/csum-opti-benchmark.html



Regards,

- cee1

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [v5] MIPS: lib: csum_partial: more instruction paral
  2015-03-26 17:07 [v5] MIPS: lib: csum_partial: more instruction paral cee1
  2015-03-30 20:10 ` Ralf Baechle
@ 2015-03-31  8:34 ` cee1
  1 sibling, 0 replies; 8+ messages in thread
From: cee1 @ 2015-03-31  8:34 UTC (permalink / raw)
  To: Linux MIPS Mailing List
  Cc: Ralf Baechle, 陈华才, 吴章金

The output result isn't influenced by this patch, as shown by
following mathematical proof.

For convenience, let's mark '2^64' as '@', mark 'ADDC(A, B)' as 'A ⊕ B', then:
A ⊕ B = (A + B >= @) ? A + B - @ + 1 : A + B

What this patch did is transforming "sum ⊕ A ⊕ B ⊕ C ⊕ D" to "sum ⊕ (A
⊕ B) ⊕ (C ⊕ D)",
so what we are trying to prove is:

sum ⊕ A ⊕ B ⊕ C ⊕ D = sum ⊕ (A ⊕ B)

Let's first prove "sum ⊕ A ⊕ B" = "sum ⊕ (A ⊕ B)". There are 2 add ops
within 2 '⊕' here. According whether these add ops overflow, there are
4 cases:
Case 1: Only the first add op overflows.

Case 2: Only the second add op overflows.

Case 3: Both add ops overflow.

Case 4: Neither add ops overflow.

## In Case 1, we have
(1) sum + A >= @
    => sum ⊕ A = sum + A - @ + 1

(2) sum + A - @ + 1 + B < @
    => sum ⊕ A ⊕ B = sum + A + B - @ + 1

If A + B >= @:
    => A ⊕ B = A + B - @ + 1
    => sum + (A ⊕ B) = sum + A + B - @ + 1 < @   (see (2))
    => sum ⊕ (A ⊕ B) = sum + A + B - @ + 1 = sum ⊕ A ⊕ B

If A + B < @:
    => A ⊕ B = A + B
(3) => sum + (A ⊕ B) = sum + A + B

    sum + A + B >= @ + B >= @   (adds B to both sides of (1))
    => sum + (A ⊕ B) >= @       (see (3))
    => sum ⊕ (A ⊕ B) = sum + A + B - @ + 1 = sum ⊕ A ⊕ B

## In Case 2, we have
(1) sum + A < @
    => sum ⊕ A = sum + A

(2) sum + A + B >= @
    => sum ⊕ A ⊕ B = sum + A + B - @ + 1

If A + B >= @:
    => A ⊕ B = A + B - @ + 1
(3) => sum + (A ⊕ B) = sum + A + B - @ + 1

    sum + A + B - @ + 1 < @ + B - @ + 1   (adds 'B - @ + 1' to both
sides of (1))
    => sum + A + B - @ + 1 < B + 1 <= @
    => sum + A + B - @ + 1 < @
    => sum + (A ⊕ B) < @                  (see (3))
    => sum ⊕ (A ⊕ B) = sum + A + B - @ + 1 = sum ⊕ A ⊕ B

If A + B < @:
    => A ⊕ B = A + B
    => sum + (A ⊕ B) = sum + A + B >= @   (see (2))
    => sum ⊕ (A ⊕ B) = sum + A + B - @ + 1 = sum ⊕ A ⊕ B

## In Case 3, we have
(1) sum + A >= @
    => sum ⊕ A = sum + A - @ + 1

(2) sum + A - @ + 1 + B >= @
    => sum ⊕ A ⊕ B = sum + A + B - 2@ + 2

(3) A + B >= 2@ - 1 - sum   (transformed from (2))

    1 + sum <= @            ( sum is in range [0, @) )
    => @ + 1 + sum <= 2@    ( adds @ to both sides)
    => @ <= 2@ - 1 - sum <= A + B   (combining with (3))
    => A + B >= @                   (cleaning up)
    => A ⊕ B = A + B - @ + 1
    => sum + (A ⊕ B) = sum + A + B - @ + 1 >= @   (see (2))
    => sum ⊕ (A ⊕ B) = sum + A + B - 2@ + 2 = sum ⊕ A ⊕ B

## In Case 4, we have
(1) sum + A < @
    => sum ⊕ A = sum + A

(2) sum + A + B < @
    => sum ⊕ A ⊕ B = sum + A + B

    A + B < @ - sum    (subtracts 'sum' from both sides of (2))
    => A + B < @
    => A ⊕ B = A + B
    => sum + (A ⊕ B) = sum + A + B < @   (see (2))
    => sum ⊕ (A ⊕ B) = sum + A + B = sum ⊕ A ⊕ B

Conclusion: sum ⊕ A ⊕ B = sum ⊕ (A ⊕ B)


Let U = sum ⊕ A ⊕ B (or sum ⊕ (A ⊕ B)), then

sum ⊕ A ⊕ B ⊕ C ⊕ D = U ⊕ C ⊕ D = U ⊕ (C ⊕ D)   (use the conclusion)

so we have:

sum ⊕ A ⊕ B ⊕ C ⊕ D = sum ⊕ (A ⊕ B) ⊕ (C ⊕ D)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [v5] MIPS: lib: csum_partial: more instruction paral
  2015-03-31  8:34   ` cee1
@ 2015-04-02 12:59     ` Maciej W. Rozycki
  2015-04-06 13:03       ` cee1
  0 siblings, 1 reply; 8+ messages in thread
From: Maciej W. Rozycki @ 2015-04-02 12:59 UTC (permalink / raw)
  To: cee1; +Cc: Ralf Baechle, Linux MIPS Mailing List, Chen Jie

On Tue, 31 Mar 2015, cee1 wrote:

> >> One example about how this patch works is in CSUM_BIGCHUNK1:
> >> // ** original **    vs    ** patch applied **
> >>     ADDC(sum, t0)           ADDC(t0, t1)
> >>     ADDC(sum, t1)           ADDC(t2, t3)
> >>     ADDC(sum, t2)           ADDC(sum, t0)
> >>     ADDC(sum, t3)           ADDC(sum, t2)
> >>
> >> With this patch applied, ADDC and the **next next** ADDC are independent.
> >
> > This is interesting because even CPUs as old as the R2000 have a pipeline
> > bypass which allows an instruction to use a result written to a register
> > by an immediately preceeeding instruction.
> 
> But if removes some dependency(as the patch did), instruction A and
> instruction B can be issued at the same cycle[1], instead of B waiting
> for the result from A   (a pipeline bypass reduces the wait time, but
> not eliminates it, right?)

 Hmm, that sounds to me remarkably like the scenario with Intel's original 
Pentium processor that had a dual issue pipeline with U and V execution 
pipes, both of which accepted ALU operations, and then each had some 
further constraints as to other instructions, some of which had to go to a 
specific pipe of the two (and were still parallelised if the other 
instruction was acceptable for the other pipe).

 To get good performance out of that design you had to interleave ALU 
operations so that there was no data dependency between two consecutive 
instructions, in which case two instructions could have been issued and 
retired at a time, in parallel.  The further constraints the U and V pipes 
had with other instructions made instruction scheduling quite an 
interesting challenge for the compiler or handcoded assembly.

 With the more complex pipeline design the Pentium's successor Pentium Pro 
had there was no longer such an issue, I reckon there were several 
mechanisms involved including register renaming and speculative execution 
of more than just two instructions ahead that eliminated the need of such 
constrained instruction scheduling although I don't remember offhand how 
all this worked.

> > Can you explain why this patch is so beneficial for Loongson 3A?
> 
> I have written a simply test[2] to measure the performance gain on
> Loongson 3A, the result[3] shows at most 50% performance gain.
> 
> IMHO, the patch not only benefits Loongson 3A, but would also benefit
> other MIPS CPU(s).

 I'm not sure if any such other superscalar MIPS pipeline implementation 
exists, but if written correctly then at worst it won't hurt anyone else, 
so just make sure your change does not regress scalar MIPS pipelines.  I 
hope you have a way to verify it.

  Maciej

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [v5] MIPS: lib: csum_partial: more instruction paral
  2015-04-02 12:59     ` Maciej W. Rozycki
@ 2015-04-06 13:03       ` cee1
  2015-04-06 13:52         ` Maciej W. Rozycki
  0 siblings, 1 reply; 8+ messages in thread
From: cee1 @ 2015-04-06 13:03 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: Ralf Baechle, Linux MIPS Mailing List, Chen Jie

2015-04-02 20:59 GMT+08:00 Maciej W. Rozycki <macro@linux-mips.org>:
>  I'm not sure if any such other superscalar MIPS pipeline implementation
> exists, but if written correctly then at worst it won't hurt anyone else,
> so just make sure your change does not regress scalar MIPS pipelines.  I
> hope you have a way to verify it.
>
>   Maciej

It seems the P-Class of Warrior generation of MIPS CPU has a
superscalar MIPS pipeline(http://imgtec.com/mips/warrior/pclass.asp).

Anyway, I've written a small benchmark at
http://dev.lemote.com/files/upload/software/csum-opti/csum-test.tar.gz

Someone can download and modify it:
1. config.h
2. change NR_ITERATION to a proper value in bench.c

And run 'make run_bench', which will: 1) run a benchmark; 2) generate
the result of `before-opti vs after-opti'; 3) invoke a python
script(parse.py) to preprocess the result.

Then paste the content of benchresult.txt to Calc / Excel / Numeric,
etc and analyze it.

-- 
Regards,

- cee1

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [v5] MIPS: lib: csum_partial: more instruction paral
  2015-04-06 13:03       ` cee1
@ 2015-04-06 13:52         ` Maciej W. Rozycki
  2015-04-06 15:30           ` cee1
  0 siblings, 1 reply; 8+ messages in thread
From: Maciej W. Rozycki @ 2015-04-06 13:52 UTC (permalink / raw)
  To: cee1; +Cc: Ralf Baechle, Linux MIPS Mailing List, Chen Jie

On Mon, 6 Apr 2015, cee1 wrote:

> >  I'm not sure if any such other superscalar MIPS pipeline implementation
> > exists, but if written correctly then at worst it won't hurt anyone else,
> > so just make sure your change does not regress scalar MIPS pipelines.  I
> > hope you have a way to verify it.
> 
> It seems the P-Class of Warrior generation of MIPS CPU has a
> superscalar MIPS pipeline(http://imgtec.com/mips/warrior/pclass.asp).

 There have been many superscalar MIPS implementations, however I don't
know offhand if any other have the restrictions like yours.

  Maciej

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [v5] MIPS: lib: csum_partial: more instruction paral
  2015-04-06 13:52         ` Maciej W. Rozycki
@ 2015-04-06 15:30           ` cee1
  0 siblings, 0 replies; 8+ messages in thread
From: cee1 @ 2015-04-06 15:30 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: Ralf Baechle, Linux MIPS Mailing List, Chen Jie

2015-04-06 21:52 GMT+08:00 Maciej W. Rozycki <macro@linux-mips.org>:
> On Mon, 6 Apr 2015, cee1 wrote:
>
>> >  I'm not sure if any such other superscalar MIPS pipeline implementation
>> > exists, but if written correctly then at worst it won't hurt anyone else,
>> > so just make sure your change does not regress scalar MIPS pipelines.  I
>> > hope you have a way to verify it.
>>
>> It seems the P-Class of Warrior generation of MIPS CPU has a
>> superscalar MIPS pipeline(http://imgtec.com/mips/warrior/pclass.asp).
>
>  There have been many superscalar MIPS implementations, however I don't
> know offhand if any other have the restrictions like yours.

Hi, I guess I may not make myself clear :)

The example is only showing how this patch removes true data
dependency, not implies any restrictions.

E.g.
ADDC(sum, t0)
ADDC(sum, t1)
ADDC(sum, t2)
ADDC(sum, t3)

which are actually following instructions:
(1) daddu     sum, t0;
(2) sltu         v1, sum, t0;
(3) daddu     sum, v1;

(4) daddu     sum, t1;
(5) sltu         v1, sum, t1;
(6) daddu     sum, v1;

(7) daddu     sum, t2;
(8) sltu         v1, sum, t2;
(9) daddu     sum, v1;

(10) daddu     sum, t3;
(11) sltu         v1, sum, t3;
(12) daddu     sum, v1;

Here, each instruction depends on the result of its previous
instruction, this is tough for any superscalar pipelines.


With the patch applied, it becomes:
ADDC(t0, t1)
ADDC(t2, t3)
ADDC(sum, t0)
ADDC(sum, t2)

which are actually following instructions:
(1) daddu     t0, t1;
(2) sltu         v1, t0, t1;
(3) daddu     t0, v1;

(4) daddu     t2, t3;
(5) sltu         v1, t2, t3;
(6) daddu     t2, v1;

(7) daddu     sum, t0;
(8) sltu         v1, sum, t0;
(9) daddu     sum, v1;

(10) daddu     sum, t2;
(11) sltu         v1, sum, t2;
(12) daddu     sum, v1;

Here, e.g. at least (1) and (4) can be issued at the same cycle, as
long as CPU has enough execution units and a large enough
RS(Reservation Station), fetching instructions quick enough, etc.

What I want to say is, this patch removes some ** true data dependency
**, hence should improve the performance on (most?) superscalar
pipeline implementations.



-- 
Regards,

- cee1

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2015-04-06 15:30 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-03-26 17:07 [v5] MIPS: lib: csum_partial: more instruction paral cee1
2015-03-30 20:10 ` Ralf Baechle
2015-03-31  8:34   ` cee1
2015-04-02 12:59     ` Maciej W. Rozycki
2015-04-06 13:03       ` cee1
2015-04-06 13:52         ` Maciej W. Rozycki
2015-04-06 15:30           ` cee1
2015-03-31  8:34 ` cee1

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.