* [PATCH, v3] MIPS: lib: csum_partial: more instruction paral
  [not found] <1400587638-17791-1-git-send-email-chenj@lemote.com>
From: chenj @ 2014-05-20 12:09 UTC
To: james.hogan; +Cc: linux-mips, chenhc, markos.chandras, chenj

Computing sum introduces true data dependency. This patch removes some
true data dependencies, hence instruction level parallelism is
improved.

This patch brings at most 50% csum performance gain on Loongson 3a
processor in our test.

One example about how this patch works is in CSUM_BIGCHUNK1:
// ** original **       vs      ** patch applied **
    ADDC(sum, t0)               ADDC(t0, t1)
    ADDC(sum, t1)               ADDC(t2, t3)
    ADDC(sum, t2)               ADDC(sum, t0)
    ADDC(sum, t3)               ADDC(sum, t2)

In the original implementation, each ADDC(sum, ...) references the sum
value updated by previous ADDC.

With patch applied, the first two ADDC operations are independent,
hence can be executed simultaneously if possible.

Another example is in the "copy and sum calculating" chunk:
// ** original **       vs      ** patch applied **
    STORE(t0, UNIT(0)...        STORE(t0, UNIT(0)...
    ADDC(sum, t0)               ADDC(t0, t1)
    STORE(t1, UNIT(1)...        STORE(t1, UNIT(1)...
    ADDC(sum, t1)               ADDC(sum, t0)
    STORE(t2, UNIT(2)...        STORE(t2, UNIT(2)...
    ADDC(sum, t2)               ADDC(t2, t3)
    STORE(t3, UNIT(3)...        STORE(t3, UNIT(3)...
    ADDC(sum, t3)               ADDC(sum, t2)

With patch applied, the second and third ADDC are independent.
---
1. The result can be found at
http://dev.lemote.com/files/upload/software/csum-opti/csum-opti-benchmark.html
And is generated by a userspace test program:
http://dev.lemote.com/files/upload/software/csum-opti/csum-test.tar.gz

[v2: amend commit message]
[v3: further amend commit message]

 arch/mips/lib/csum_partial.S | 38 +++++++++++++++++++-------------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/mips/lib/csum_partial.S b/arch/mips/lib/csum_partial.S
index 9901237..6cea101 100644
--- a/arch/mips/lib/csum_partial.S
+++ b/arch/mips/lib/csum_partial.S
@@ -76,10 +76,10 @@
 	LOAD _t1, (offset + UNIT(1))(src); \
 	LOAD _t2, (offset + UNIT(2))(src); \
 	LOAD _t3, (offset + UNIT(3))(src); \
+	ADDC(_t0, _t1); \
+	ADDC(_t2, _t3); \
 	ADDC(sum, _t0); \
-	ADDC(sum, _t1); \
-	ADDC(sum, _t2); \
-	ADDC(sum, _t3)
+	ADDC(sum, _t2)
 
 #ifdef USE_DOUBLE
 #define CSUM_BIGCHUNK(src, offset, sum, _t0, _t1, _t2, _t3) \
@@ -501,21 +501,21 @@ LEAF(csum_partial)
 	SUB len, len, 8*NBYTES
 	ADD src, src, 8*NBYTES
 	STORE(t0, UNIT(0)(dst), .Ls_exc\@)
-	ADDC(sum, t0)
+	ADDC(t0, t1)
 	STORE(t1, UNIT(1)(dst), .Ls_exc\@)
-	ADDC(sum, t1)
+	ADDC(sum, t0)
 	STORE(t2, UNIT(2)(dst), .Ls_exc\@)
-	ADDC(sum, t2)
+	ADDC(t2, t3)
 	STORE(t3, UNIT(3)(dst), .Ls_exc\@)
-	ADDC(sum, t3)
+	ADDC(sum, t2)
 	STORE(t4, UNIT(4)(dst), .Ls_exc\@)
-	ADDC(sum, t4)
+	ADDC(t4, t5)
 	STORE(t5, UNIT(5)(dst), .Ls_exc\@)
-	ADDC(sum, t5)
+	ADDC(sum, t4)
 	STORE(t6, UNIT(6)(dst), .Ls_exc\@)
-	ADDC(sum, t6)
+	ADDC(t6, t7)
 	STORE(t7, UNIT(7)(dst), .Ls_exc\@)
-	ADDC(sum, t7)
+	ADDC(sum, t6)
 	.set reorder /* DADDI_WAR */
 	ADD dst, dst, 8*NBYTES
 	bgez len, 1b
@@ -541,13 +541,13 @@ LEAF(csum_partial)
 	SUB len, len, 4*NBYTES
 	ADD src, src, 4*NBYTES
 	STORE(t0, UNIT(0)(dst), .Ls_exc\@)
-	ADDC(sum, t0)
+	ADDC(t0, t1)
 	STORE(t1, UNIT(1)(dst), .Ls_exc\@)
-	ADDC(sum, t1)
+	ADDC(sum, t0)
 	STORE(t2, UNIT(2)(dst), .Ls_exc\@)
-	ADDC(sum, t2)
+	ADDC(t2, t3)
 	STORE(t3, UNIT(3)(dst), .Ls_exc\@)
-	ADDC(sum, t3)
+	ADDC(sum, t2)
 	.set reorder /* DADDI_WAR */
 	ADD dst, dst, 4*NBYTES
 	beqz len, .Ldone\@
@@ -646,13 +646,13 @@ LEAF(csum_partial)
 	nop # improves slotting
 #endif
 	STORE(t0, UNIT(0)(dst), .Ls_exc\@)
-	ADDC(sum, t0)
+	ADDC(t0, t1)
 	STORE(t1, UNIT(1)(dst), .Ls_exc\@)
-	ADDC(sum, t1)
+	ADDC(sum, t0)
 	STORE(t2, UNIT(2)(dst), .Ls_exc\@)
-	ADDC(sum, t2)
+	ADDC(t2, t3)
 	STORE(t3, UNIT(3)(dst), .Ls_exc\@)
-	ADDC(sum, t3)
+	ADDC(sum, t2)
 	.set reorder /* DADDI_WAR */
 	ADD dst, dst, 4*NBYTES
 	bne len, rem, 1b
--
1.9.0
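
The reordering in the patch relies on the add-with-carry step giving the same result regardless of how the operands are grouped. The following is a minimal C sketch of that property, not the kernel code itself. It assumes a 32-bit accumulator and models ADDC as an add followed by folding the carry back in; the names addc, csum_serial and csum_paired are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/*
 * Model of an ADDC-style step (assumption: add, then fold the
 * carry-out back into the low bit, i.e. ones'-complement addition).
 */
static uint32_t addc(uint32_t sum, uint32_t v)
{
    sum += v;
    return sum + (sum < v);   /* sum < v exactly when the add wrapped */
}

/* Original order: every step depends on the previous value of sum. */
static uint32_t csum_serial(uint32_t sum, const uint32_t t[4])
{
    sum = addc(sum, t[0]);
    sum = addc(sum, t[1]);
    sum = addc(sum, t[2]);
    sum = addc(sum, t[3]);
    return sum;
}

/*
 * Patched order: the two pair sums do not depend on each other or
 * on sum, so a superscalar core may issue them in parallel.
 */
static uint32_t csum_paired(uint32_t sum, const uint32_t t[4])
{
    uint32_t a = addc(t[0], t[1]);   /* independent of b and sum */
    uint32_t b = addc(t[2], t[3]);   /* independent of a and sum */
    sum = addc(sum, a);
    return addc(sum, b);
}
```

Both functions perform four add-with-carry steps, but the longest dependency chain through sum drops from four steps to two (three if the pair sum feeding the first accumulate is counted). Whether that translates into a speedup depends on the issue width of the core.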
* Re: [PATCH, v3] MIPS: lib: csum_partial: more instruction paral
From: Markos Chandras @ 2014-05-20 12:05 UTC
To: chenj, james.hogan; +Cc: linux-mips, chenhc

On 05/20/2014 01:09 PM, chenj wrote:
> Computing sum introduces true data dependency. This patch removes some
> true data dependencies, hence instruction level parallelism is
> improved.
>
> This patch brings at most 50% csum performance gain on Loongson 3a
> processor in our test.
>
> One example about how this patch works is in CSUM_BIGCHUNK1:
> // ** original **       vs      ** patch applied **
>     ADDC(sum, t0)               ADDC(t0, t1)
>     ADDC(sum, t1)               ADDC(t2, t3)
>     ADDC(sum, t2)               ADDC(sum, t0)
>     ADDC(sum, t3)               ADDC(sum, t2)
>
> In the original implementation, each ADDC(sum, ...) references the sum
> value updated by previous ADDC.
>
> With patch applied, the first two ADDC operations are independent,
> hence can be executed simultaneously if possible.
>
> Another example is in the "copy and sum calculating" chunk:
> // ** original **       vs      ** patch applied **
>     STORE(t0, UNIT(0)...        STORE(t0, UNIT(0)...
>     ADDC(sum, t0)               ADDC(t0, t1)
>     STORE(t1, UNIT(1)...        STORE(t1, UNIT(1)...
>     ADDC(sum, t1)               ADDC(sum, t0)
>     STORE(t2, UNIT(2)...        STORE(t2, UNIT(2)...
>     ADDC(sum, t2)               ADDC(t2, t3)
>     STORE(t3, UNIT(3)...        STORE(t3, UNIT(3)...
>     ADDC(sum, t3)               ADDC(sum, t2)
>
> With patch applied, the second and third ADDC are independent.

Hi chenj,

You forgot to sign off your patch.

--
markos
* [PATCH] MIPS: lib: csum_partial: more instruction paral
From: chenj @ 2014-05-20 12:33 UTC
To: markos.chandras; +Cc: linux-mips, chenhc, james.hogan, chenj

Computing sum introduces true data dependency. This patch removes some
true data dependencies, hence instruction level parallelism is
improved.

This patch brings at most 50% csum performance gain on Loongson 3a
processor in our test.

One example about how this patch works is in CSUM_BIGCHUNK1:
// ** original **       vs      ** patch applied **
    ADDC(sum, t0)               ADDC(t0, t1)
    ADDC(sum, t1)               ADDC(t2, t3)
    ADDC(sum, t2)               ADDC(sum, t0)
    ADDC(sum, t3)               ADDC(sum, t2)

In the original implementation, each ADDC(sum, ...) references the sum
value updated by previous ADDC.

With patch applied, the first two ADDC operations are independent,
hence can be executed simultaneously if possible.

Another example is in the "copy and sum calculating" chunk:
// ** original **       vs      ** patch applied **
    STORE(t0, UNIT(0) ...       STORE(t0, UNIT(0) ...
    ADDC(sum, t0)               ADDC(t0, t1)
    STORE(t1, UNIT(1) ...       STORE(t1, UNIT(1) ...
    ADDC(sum, t1)               ADDC(sum, t0)
    STORE(t2, UNIT(2) ...       STORE(t2, UNIT(2) ...
    ADDC(sum, t2)               ADDC(t2, t3)
    STORE(t3, UNIT(3) ...       STORE(t3, UNIT(3) ...
    ADDC(sum, t3)               ADDC(sum, t2)

With patch applied, the first and third ADDC are independent.

Signed-off-by: chenj <chenj@lemote.com>
---
1. The result can be found at
http://dev.lemote.com/files/upload/software/csum-opti/csum-opti-benchmark.html
And is generated by a userspace test program:
http://dev.lemote.com/files/upload/software/csum-opti/csum-test.tar.gz

[v2: amend commit message]
[v3: further amend commit message]
[v4: amend commit message & sign-off my patch]

 arch/mips/lib/csum_partial.S | 38 +++++++++++++++++++-------------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/mips/lib/csum_partial.S b/arch/mips/lib/csum_partial.S
index 9901237..6cea101 100644
--- a/arch/mips/lib/csum_partial.S
+++ b/arch/mips/lib/csum_partial.S
@@ -76,10 +76,10 @@
 	LOAD _t1, (offset + UNIT(1))(src); \
 	LOAD _t2, (offset + UNIT(2))(src); \
 	LOAD _t3, (offset + UNIT(3))(src); \
+	ADDC(_t0, _t1); \
+	ADDC(_t2, _t3); \
 	ADDC(sum, _t0); \
-	ADDC(sum, _t1); \
-	ADDC(sum, _t2); \
-	ADDC(sum, _t3)
+	ADDC(sum, _t2)
 
 #ifdef USE_DOUBLE
 #define CSUM_BIGCHUNK(src, offset, sum, _t0, _t1, _t2, _t3) \
@@ -501,21 +501,21 @@ LEAF(csum_partial)
 	SUB len, len, 8*NBYTES
 	ADD src, src, 8*NBYTES
 	STORE(t0, UNIT(0)(dst), .Ls_exc\@)
-	ADDC(sum, t0)
+	ADDC(t0, t1)
 	STORE(t1, UNIT(1)(dst), .Ls_exc\@)
-	ADDC(sum, t1)
+	ADDC(sum, t0)
 	STORE(t2, UNIT(2)(dst), .Ls_exc\@)
-	ADDC(sum, t2)
+	ADDC(t2, t3)
 	STORE(t3, UNIT(3)(dst), .Ls_exc\@)
-	ADDC(sum, t3)
+	ADDC(sum, t2)
 	STORE(t4, UNIT(4)(dst), .Ls_exc\@)
-	ADDC(sum, t4)
+	ADDC(t4, t5)
 	STORE(t5, UNIT(5)(dst), .Ls_exc\@)
-	ADDC(sum, t5)
+	ADDC(sum, t4)
 	STORE(t6, UNIT(6)(dst), .Ls_exc\@)
-	ADDC(sum, t6)
+	ADDC(t6, t7)
 	STORE(t7, UNIT(7)(dst), .Ls_exc\@)
-	ADDC(sum, t7)
+	ADDC(sum, t6)
 	.set reorder /* DADDI_WAR */
 	ADD dst, dst, 8*NBYTES
 	bgez len, 1b
@@ -541,13 +541,13 @@ LEAF(csum_partial)
 	SUB len, len, 4*NBYTES
 	ADD src, src, 4*NBYTES
 	STORE(t0, UNIT(0)(dst), .Ls_exc\@)
-	ADDC(sum, t0)
+	ADDC(t0, t1)
 	STORE(t1, UNIT(1)(dst), .Ls_exc\@)
-	ADDC(sum, t1)
+	ADDC(sum, t0)
 	STORE(t2, UNIT(2)(dst), .Ls_exc\@)
-	ADDC(sum, t2)
+	ADDC(t2, t3)
 	STORE(t3, UNIT(3)(dst), .Ls_exc\@)
-	ADDC(sum, t3)
+	ADDC(sum, t2)
 	.set reorder /* DADDI_WAR */
 	ADD dst, dst, 4*NBYTES
 	beqz len, .Ldone\@
@@ -646,13 +646,13 @@ LEAF(csum_partial)
 	nop # improves slotting
 #endif
 	STORE(t0, UNIT(0)(dst), .Ls_exc\@)
-	ADDC(sum, t0)
+	ADDC(t0, t1)
 	STORE(t1, UNIT(1)(dst), .Ls_exc\@)
-	ADDC(sum, t1)
+	ADDC(sum, t0)
 	STORE(t2, UNIT(2)(dst), .Ls_exc\@)
-	ADDC(sum, t2)
+	ADDC(t2, t3)
 	STORE(t3, UNIT(3)(dst), .Ls_exc\@)
-	ADDC(sum, t3)
+	ADDC(sum, t2)
 	.set reorder /* DADDI_WAR */
 	ADD dst, dst, 4*NBYTES
 	bne len, rem, 1b
--
1.9.0
Thread overview: 4+ messages
[not found] <1400587638-17791-1-git-send-email-chenj@lemote.com>
2014-05-20 12:09 ` [PATCH, v3] MIPS: lib: csum_partial: more instruction paral chenj
2014-05-20 12:05 ` Markos Chandras
2014-05-20 12:33 ` [PATCH] " chenj