* [v5] MIPS: lib: csum_partial: more instruction paral
@ 2015-03-26 17:07 cee1
2015-03-30 20:10 ` Ralf Baechle
2015-03-31 8:34 ` cee1
0 siblings, 2 replies; 8+ messages in thread
From: cee1 @ 2015-03-26 17:07 UTC (permalink / raw)
To: linux-mips; +Cc: ralf, Chen Jie
From: Chen Jie <chenj@lemote.com>
Computing sum introduces true data dependency. This patch removes some
true data depdendencies, hence increases instruction level parallelism.
This patch brings at most 50% csum performance gain on Loongson 3a
processor in our test.
One example about how this patch works is in CSUM_BIGCHUNK1:
// ** original ** vs ** patch applied **
ADDC(sum, t0) ADDC(t0, t1)
ADDC(sum, t1) ADDC(t2, t3)
ADDC(sum, t2) ADDC(sum, t0)
ADDC(sum, t3) ADDC(sum, t2)
In the original implementation, each ADDC(sum, ...) depends on the sum
value updated by previous ADDC(as source operand).
With this patch applied, the first two ADDC operations are independent,
hence can be executed simultaneously if possible.
Another example is in the "copy and sum calculating chunk":
// ** original ** vs ** patch applied **
STORE(t0, UNIT(0) ... STORE(t0, UNIT(0) ...
ADDC(sum, t0) ADDC(t0, t1)
STORE(t1, UNIT(1) ... STORE(t1, UNIT(1) ...
ADDC(sum, t1) ADDC(sum, t0)
STORE(t2, UNIT(2) ... STORE(t2, UNIT(2) ...
ADDC(sum, t2) ADDC(t2, t3)
STORE(t3, UNIT(3) ... STORE(t3, UNIT(3) ...
ADDC(sum, t3) ADDC(sum, t2)
With this patch applied, ADDC and the **next next** ADDC are independent.
Signed-off-by: chenj <chenj@lemote.com>
---
1. The result can be found at
http://dev.lemote.com/files/upload/software/csum-opti/csum-opti-benchmark.html
And is generated by a userspace test program:
http://dev.lemote.com/files/upload/software/csum-opti/csum-test.tar.gz
[v2: amend commit message]
[v3: further amend commit message]
[v4: amend commit message & sign-off my patch]
[v5: amend commit message]
arch/mips/lib/csum_partial.S | 38 +++++++++++++++++++-------------------
1 file changed, 19 insertions(+), 19 deletions(-)
diff --git a/arch/mips/lib/csum_partial.S b/arch/mips/lib/csum_partial.S
index 4c721e2..ed88647 100644
--- a/arch/mips/lib/csum_partial.S
+++ b/arch/mips/lib/csum_partial.S
@@ -76,10 +76,10 @@
LOAD _t1, (offset + UNIT(1))(src); \
LOAD _t2, (offset + UNIT(2))(src); \
LOAD _t3, (offset + UNIT(3))(src); \
+ ADDC(_t0, _t1); \
+ ADDC(_t2, _t3); \
ADDC(sum, _t0); \
- ADDC(sum, _t1); \
- ADDC(sum, _t2); \
- ADDC(sum, _t3)
+ ADDC(sum, _t2)
#ifdef USE_DOUBLE
#define CSUM_BIGCHUNK(src, offset, sum, _t0, _t1, _t2, _t3) \
@@ -504,21 +504,21 @@ LEAF(csum_partial)
SUB len, len, 8*NBYTES
ADD src, src, 8*NBYTES
STORE(t0, UNIT(0)(dst), .Ls_exc\@)
- ADDC(sum, t0)
+ ADDC(t0, t1)
STORE(t1, UNIT(1)(dst), .Ls_exc\@)
- ADDC(sum, t1)
+ ADDC(sum, t0)
STORE(t2, UNIT(2)(dst), .Ls_exc\@)
- ADDC(sum, t2)
+ ADDC(t2, t3)
STORE(t3, UNIT(3)(dst), .Ls_exc\@)
- ADDC(sum, t3)
+ ADDC(sum, t2)
STORE(t4, UNIT(4)(dst), .Ls_exc\@)
- ADDC(sum, t4)
+ ADDC(t4, t5)
STORE(t5, UNIT(5)(dst), .Ls_exc\@)
- ADDC(sum, t5)
+ ADDC(sum, t4)
STORE(t6, UNIT(6)(dst), .Ls_exc\@)
- ADDC(sum, t6)
+ ADDC(t6, t7)
STORE(t7, UNIT(7)(dst), .Ls_exc\@)
- ADDC(sum, t7)
+ ADDC(sum, t6)
.set reorder /* DADDI_WAR */
ADD dst, dst, 8*NBYTES
bgez len, 1b
@@ -544,13 +544,13 @@ LEAF(csum_partial)
SUB len, len, 4*NBYTES
ADD src, src, 4*NBYTES
STORE(t0, UNIT(0)(dst), .Ls_exc\@)
- ADDC(sum, t0)
+ ADDC(t0, t1)
STORE(t1, UNIT(1)(dst), .Ls_exc\@)
- ADDC(sum, t1)
+ ADDC(sum, t0)
STORE(t2, UNIT(2)(dst), .Ls_exc\@)
- ADDC(sum, t2)
+ ADDC(t2, t3)
STORE(t3, UNIT(3)(dst), .Ls_exc\@)
- ADDC(sum, t3)
+ ADDC(sum, t2)
.set reorder /* DADDI_WAR */
ADD dst, dst, 4*NBYTES
beqz len, .Ldone\@
@@ -649,13 +649,13 @@ LEAF(csum_partial)
nop # improves slotting
#endif
STORE(t0, UNIT(0)(dst), .Ls_exc\@)
- ADDC(sum, t0)
+ ADDC(t0, t1)
STORE(t1, UNIT(1)(dst), .Ls_exc\@)
- ADDC(sum, t1)
+ ADDC(sum, t0)
STORE(t2, UNIT(2)(dst), .Ls_exc\@)
- ADDC(sum, t2)
+ ADDC(t2, t3)
STORE(t3, UNIT(3)(dst), .Ls_exc\@)
- ADDC(sum, t3)
+ ADDC(sum, t2)
.set reorder /* DADDI_WAR */
ADD dst, dst, 4*NBYTES
bne len, rem, 1b
--
1.9.5 (Apple Git-50.3)
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [v5] MIPS: lib: csum_partial: more instruction paral
2015-03-26 17:07 [v5] MIPS: lib: csum_partial: more instruction paral cee1
@ 2015-03-30 20:10 ` Ralf Baechle
2015-03-31 8:34 ` cee1
2015-03-31 8:34 ` cee1
1 sibling, 1 reply; 8+ messages in thread
From: Ralf Baechle @ 2015-03-30 20:10 UTC (permalink / raw)
To: cee1; +Cc: linux-mips, Chen Jie
On Fri, Mar 27, 2015 at 01:07:24AM +0800, cee1 wrote:
> From: Chen Jie <chenj@lemote.com>
>
> Computing sum introduces true data dependency. This patch removes some
> true data depdendencies, hence increases instruction level parallelism.
>
> This patch brings at most 50% csum performance gain on Loongson 3a
> processor in our test.
>
> One example about how this patch works is in CSUM_BIGCHUNK1:
> // ** original ** vs ** patch applied **
> ADDC(sum, t0) ADDC(t0, t1)
> ADDC(sum, t1) ADDC(t2, t3)
> ADDC(sum, t2) ADDC(sum, t0)
> ADDC(sum, t3) ADDC(sum, t2)
>
> In the original implementation, each ADDC(sum, ...) depends on the sum
> value updated by previous ADDC(as source operand).
>
> With this patch applied, the first two ADDC operations are independent,
> hence can be executed simultaneously if possible.
>
> Another example is in the "copy and sum calculating chunk":
> // ** original ** vs ** patch applied **
> STORE(t0, UNIT(0) ... STORE(t0, UNIT(0) ...
> ADDC(sum, t0) ADDC(t0, t1)
> STORE(t1, UNIT(1) ... STORE(t1, UNIT(1) ...
> ADDC(sum, t1) ADDC(sum, t0)
> STORE(t2, UNIT(2) ... STORE(t2, UNIT(2) ...
> ADDC(sum, t2) ADDC(t2, t3)
> STORE(t3, UNIT(3) ... STORE(t3, UNIT(3) ...
> ADDC(sum, t3) ADDC(sum, t2)
>
> With this patch applied, ADDC and the **next next** ADDC are independent.
This is interesting because even CPUs as old as the R2000 have a pipeline
bypass which allows an instruction to use a result written to a register
by an immediately preceeeding instruction.
Can you explain why this patch is so beneficial for Loongson 3A?
Thanks,
Ralf
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [v5] MIPS: lib: csum_partial: more instruction paral
2015-03-30 20:10 ` Ralf Baechle
@ 2015-03-31 8:34 ` cee1
2015-04-02 12:59 ` Maciej W. Rozycki
0 siblings, 1 reply; 8+ messages in thread
From: cee1 @ 2015-03-31 8:34 UTC (permalink / raw)
To: Ralf Baechle; +Cc: Linux MIPS Mailing List, Chen Jie
2015-03-31 4:10 GMT+08:00 Ralf Baechle <ralf@linux-mips.org>:
>> One example about how this patch works is in CSUM_BIGCHUNK1:
>> // ** original ** vs ** patch applied **
>> ADDC(sum, t0) ADDC(t0, t1)
>> ADDC(sum, t1) ADDC(t2, t3)
>> ADDC(sum, t2) ADDC(sum, t0)
>> ADDC(sum, t3) ADDC(sum, t2)
>>
>> With this patch applied, ADDC and the **next next** ADDC are independent.
>
> This is interesting because even CPUs as old as the R2000 have a pipeline
> bypass which allows an instruction to use a result written to a register
> by an immediately preceeeding instruction.
But if removes some dependency(as the patch did), instruction A and
instruction B can be issued at the same cycle[1], instead of B waiting
for the result from A (a pipeline bypass reduces the wait time, but
not eliminates it, right?)
>
> Can you explain why this patch is so beneficial for Loongson 3A?
I have written a simply test[2] to measure the performance gain on
Loongson 3A, the result[3] shows at most 50% performance gain.
IMHO, the patch not only benefits Loongson 3A, but would also benefit
other MIPS CPU(s).
--
1. If the hardware supports this, e.g. at least two ALU units for ALU
operations, and is an out of order execution pipeline, etc
2. http://dev.lemote.com/files/upload/software/csum-opti/csum-test.tar.gz
3. http://dev.lemote.com/files/upload/software/csum-opti/csum-opti-benchmark.html
Regards,
- cee1
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [v5] MIPS: lib: csum_partial: more instruction paral
2015-03-26 17:07 [v5] MIPS: lib: csum_partial: more instruction paral cee1
2015-03-30 20:10 ` Ralf Baechle
@ 2015-03-31 8:34 ` cee1
1 sibling, 0 replies; 8+ messages in thread
From: cee1 @ 2015-03-31 8:34 UTC (permalink / raw)
To: Linux MIPS Mailing List
Cc: Ralf Baechle, 陈华才, 吴章金
The output result isn't influenced by this patch, as shown by
following mathematical proof.
For convenience, let's mark '2^64' as '@', mark 'ADDC(A, B)' as 'A ⊕ B', then:
A ⊕ B = (A + B >= @) ? A + B - @ + 1 : A + B
What this patch did is transforming "sum ⊕ A ⊕ B ⊕ C ⊕ D" to "sum ⊕ (A
⊕ B) ⊕ (C ⊕ D)",
so what we are trying to prove is:
sum ⊕ A ⊕ B ⊕ C ⊕ D = sum ⊕ (A ⊕ B)
Let's first prove "sum ⊕ A ⊕ B" = "sum ⊕ (A ⊕ B)". There are 2 add ops
within 2 '⊕' here. According whether these add ops overflow, there are
4 cases:
Case 1: Only the first add op overflows.
Case 2: Only the second add op overflows.
Case 3: Both add ops overflow.
Case 4: Neither add ops overflow.
## In Case 1, we have
(1) sum + A >= @
=> sum ⊕ A = sum + A - @ + 1
(2) sum + A - @ + 1 + B < @
=> sum ⊕ A ⊕ B = sum + A + B - @ + 1
If A + B >= @:
=> A ⊕ B = A + B - @ + 1
=> sum + (A ⊕ B) = sum + A + B - @ + 1 < @ (see (2))
=> sum ⊕ (A ⊕ B) = sum + A + B - @ + 1 = sum ⊕ A ⊕ B
If A + B < @:
=> A ⊕ B = A + B
(3) => sum + (A ⊕ B) = sum + A + B
sum + A + B >= @ + B >= @ (adds B to both sides of (1))
=> sum + (A ⊕ B) >= @ (see (3))
=> sum ⊕ (A ⊕ B) = sum + A + B - @ + 1 = sum ⊕ A ⊕ B
## In Case 2, we have
(1) sum + A < @
=> sum ⊕ A = sum + A
(2) sum + A + B >= @
=> sum ⊕ A ⊕ B = sum + A + B - @ + 1
If A + B >= @:
=> A ⊕ B = A + B - @ + 1
(3) => sum + (A ⊕ B) = sum + A + B - @ + 1
sum + A + B - @ + 1 < @ + B - @ + 1 (adds 'B - @ + 1' to both
sides of (1))
=> sum + A + B - @ + 1 < B + 1 <= @
=> sum + A + B - @ + 1 < @
=> sum + (A ⊕ B) < @ (see (3))
=> sum ⊕ (A ⊕ B) = sum + A + B - @ + 1 = sum ⊕ A ⊕ B
If A + B < @:
=> A ⊕ B = A + B
=> sum + (A ⊕ B) = sum + A + B >= @ (see (2))
=> sum ⊕ (A ⊕ B) = sum + A + B - @ + 1 = sum ⊕ A ⊕ B
## In Case 3, we have
(1) sum + A >= @
=> sum ⊕ A = sum + A - @ + 1
(2) sum + A - @ + 1 + B >= @
=> sum ⊕ A ⊕ B = sum + A + B - 2@ + 2
(3) A + B >= 2@ - 1 - sum (transformed from (2))
1 + sum <= @ ( sum is in range [0, @) )
=> @ + 1 + sum <= 2@ ( adds @ to both sides)
=> @ <= 2@ - 1 - sum <= A + B (combining with (3))
=> A + B >= @ (cleaning up)
=> A ⊕ B = A + B - @ + 1
=> sum + (A ⊕ B) = sum + A + B - @ + 1 >= @ (see (2))
=> sum ⊕ (A ⊕ B) = sum + A + B - 2@ + 2 = sum ⊕ A ⊕ B
## In Case 4, we have
(1) sum + A < @
=> sum ⊕ A = sum + A
(2) sum + A + B < @
=> sum ⊕ A ⊕ B = sum + A + B
A + B < @ - sum (subtracts 'sum' from both sides of (2))
=> A + B < @
=> A ⊕ B = A + B
=> sum + (A ⊕ B) = sum + A + B < @ (see (2))
=> sum ⊕ (A ⊕ B) = sum + A + B = sum ⊕ A ⊕ B
Conclusion: sum ⊕ A ⊕ B = sum ⊕ (A ⊕ B)
Let U = sum ⊕ A ⊕ B (or sum ⊕ (A ⊕ B)), then
sum ⊕ A ⊕ B ⊕ C ⊕ D = U ⊕ C ⊕ D = U ⊕ (C ⊕ D) (use the conclusion)
so we have:
sum ⊕ A ⊕ B ⊕ C ⊕ D = sum ⊕ (A ⊕ B) ⊕ (C ⊕ D)
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [v5] MIPS: lib: csum_partial: more instruction paral
2015-03-31 8:34 ` cee1
@ 2015-04-02 12:59 ` Maciej W. Rozycki
2015-04-06 13:03 ` cee1
0 siblings, 1 reply; 8+ messages in thread
From: Maciej W. Rozycki @ 2015-04-02 12:59 UTC (permalink / raw)
To: cee1; +Cc: Ralf Baechle, Linux MIPS Mailing List, Chen Jie
On Tue, 31 Mar 2015, cee1 wrote:
> >> One example about how this patch works is in CSUM_BIGCHUNK1:
> >> // ** original ** vs ** patch applied **
> >> ADDC(sum, t0) ADDC(t0, t1)
> >> ADDC(sum, t1) ADDC(t2, t3)
> >> ADDC(sum, t2) ADDC(sum, t0)
> >> ADDC(sum, t3) ADDC(sum, t2)
> >>
> >> With this patch applied, ADDC and the **next next** ADDC are independent.
> >
> > This is interesting because even CPUs as old as the R2000 have a pipeline
> > bypass which allows an instruction to use a result written to a register
> > by an immediately preceeeding instruction.
>
> But if removes some dependency(as the patch did), instruction A and
> instruction B can be issued at the same cycle[1], instead of B waiting
> for the result from A (a pipeline bypass reduces the wait time, but
> not eliminates it, right?)
Hmm, that sounds to me remarkably like the scenario with Intel's original
Pentium processor that had a dual issue pipeline with U and V execution
pipes, both of which accepted ALU operations, and then each had some
further constraints as to other instructions, some of which had to go to a
specific pipe of the two (and were still parallelised if the other
instruction was acceptable for the other pipe).
To get good performance out of that design you had to interleave ALU
operations so that there was no data dependency between two consecutive
instructions, in which case two instructions could have been issued and
retired at a time, in parallel. The further constraints the U and V pipes
had with other instructions made instruction scheduling quite an
interesting challenge for the compiler or handcoded assembly.
With the more complex pipeline design the Pentium's successor Pentium Pro
had there was no longer such an issue, I reckon there were several
mechanisms involved including register renaming and speculative execution
of more than just two instructions ahead that eliminated the need of such
constrained instruction scheduling although I don't remember offhand how
all this worked.
> > Can you explain why this patch is so beneficial for Loongson 3A?
>
> I have written a simply test[2] to measure the performance gain on
> Loongson 3A, the result[3] shows at most 50% performance gain.
>
> IMHO, the patch not only benefits Loongson 3A, but would also benefit
> other MIPS CPU(s).
I'm not sure if any such other superscalar MIPS pipeline implementation
exists, but if written correctly then at worst it won't hurt anyone else,
so just make sure your change does not regress scalar MIPS pipelines. I
hope you have a way to verify it.
Maciej
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [v5] MIPS: lib: csum_partial: more instruction paral
2015-04-02 12:59 ` Maciej W. Rozycki
@ 2015-04-06 13:03 ` cee1
2015-04-06 13:52 ` Maciej W. Rozycki
0 siblings, 1 reply; 8+ messages in thread
From: cee1 @ 2015-04-06 13:03 UTC (permalink / raw)
To: Maciej W. Rozycki; +Cc: Ralf Baechle, Linux MIPS Mailing List, Chen Jie
2015-04-02 20:59 GMT+08:00 Maciej W. Rozycki <macro@linux-mips.org>:
> I'm not sure if any such other superscalar MIPS pipeline implementation
> exists, but if written correctly then at worst it won't hurt anyone else,
> so just make sure your change does not regress scalar MIPS pipelines. I
> hope you have a way to verify it.
>
> Maciej
It seems the P-Class of Warrior generation of MIPS CPU has a
superscalar MIPS pipeline(http://imgtec.com/mips/warrior/pclass.asp).
Anyway, I've written a small benchmark at
http://dev.lemote.com/files/upload/software/csum-opti/csum-test.tar.gz
Someone can download and modify it:
1. config.h
2. change NR_ITERATION to a proper value in bench.c
And run 'make run_bench', which will: 1) run a benchmark; 2) generate
the result of `before-opti vs after-opti'; 3) invoke a python
script(parse.py) to preprocess the result.
Then paste the content of benchresult.txt to Calc / Excel / Numeric,
etc and analyze it.
--
Regards,
- cee1
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [v5] MIPS: lib: csum_partial: more instruction paral
2015-04-06 13:03 ` cee1
@ 2015-04-06 13:52 ` Maciej W. Rozycki
2015-04-06 15:30 ` cee1
0 siblings, 1 reply; 8+ messages in thread
From: Maciej W. Rozycki @ 2015-04-06 13:52 UTC (permalink / raw)
To: cee1; +Cc: Ralf Baechle, Linux MIPS Mailing List, Chen Jie
On Mon, 6 Apr 2015, cee1 wrote:
> > I'm not sure if any such other superscalar MIPS pipeline implementation
> > exists, but if written correctly then at worst it won't hurt anyone else,
> > so just make sure your change does not regress scalar MIPS pipelines. I
> > hope you have a way to verify it.
>
> It seems the P-Class of Warrior generation of MIPS CPU has a
> superscalar MIPS pipeline(http://imgtec.com/mips/warrior/pclass.asp).
There have been many superscalar MIPS implementations, however I don't
know offhand if any other have the restrictions like yours.
Maciej
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [v5] MIPS: lib: csum_partial: more instruction paral
2015-04-06 13:52 ` Maciej W. Rozycki
@ 2015-04-06 15:30 ` cee1
0 siblings, 0 replies; 8+ messages in thread
From: cee1 @ 2015-04-06 15:30 UTC (permalink / raw)
To: Maciej W. Rozycki; +Cc: Ralf Baechle, Linux MIPS Mailing List, Chen Jie
2015-04-06 21:52 GMT+08:00 Maciej W. Rozycki <macro@linux-mips.org>:
> On Mon, 6 Apr 2015, cee1 wrote:
>
>> > I'm not sure if any such other superscalar MIPS pipeline implementation
>> > exists, but if written correctly then at worst it won't hurt anyone else,
>> > so just make sure your change does not regress scalar MIPS pipelines. I
>> > hope you have a way to verify it.
>>
>> It seems the P-Class of Warrior generation of MIPS CPU has a
>> superscalar MIPS pipeline(http://imgtec.com/mips/warrior/pclass.asp).
>
> There have been many superscalar MIPS implementations, however I don't
> know offhand if any other have the restrictions like yours.
Hi, I guess I may not make myself clear :)
The example is only showing how this patch removes true data
dependency, not implies any restrictions.
E.g.
ADDC(sum, t0)
ADDC(sum, t1)
ADDC(sum, t2)
ADDC(sum, t3)
which are actually following instructions:
(1) daddu sum, t0;
(2) sltu v1, sum, t0;
(3) daddu sum, v1;
(4) daddu sum, t1;
(5) sltu v1, sum, t1;
(6) daddu sum, v1;
(7) daddu sum, t2;
(8) sltu v1, sum, t2;
(9) daddu sum, v1;
(10) daddu sum, t3;
(11) sltu v1, sum, t3;
(12) daddu sum, v1;
Here, each instruction depends on the result of its previous
instruction, this is tough for any superscalar pipelines.
With the patch applied, it becomes:
ADDC(t0, t1)
ADDC(t2, t3)
ADDC(sum, t0)
ADDC(sum, t2)
which are actually following instructions:
(1) daddu t0, t1;
(2) sltu v1, t0, t1;
(3) daddu t0, v1;
(4) daddu t2, t3;
(5) sltu v1, t2, t3;
(6) daddu t2, v1;
(7) daddu sum, t0;
(8) sltu v1, sum, t0;
(9) daddu sum, v1;
(10) daddu sum, t2;
(11) sltu v1, sum, t2;
(12) daddu sum, v1;
Here, e.g. at least (1) and (4) can be issued at the same cycle, as
long as CPU has enough execution units and a large enough
RS(Reservation Station), fetching instructions quick enough, etc.
What I want to say is, this patch removes some ** true data dependency
**, hence should improve the performance on (most?) superscalar
pipeline implementations.
--
Regards,
- cee1
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2015-04-06 15:30 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-03-26 17:07 [v5] MIPS: lib: csum_partial: more instruction paral cee1
2015-03-30 20:10 ` Ralf Baechle
2015-03-31 8:34 ` cee1
2015-04-02 12:59 ` Maciej W. Rozycki
2015-04-06 13:03 ` cee1
2015-04-06 13:52 ` Maciej W. Rozycki
2015-04-06 15:30 ` cee1
2015-03-31 8:34 ` cee1
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.