* [parisc-linux] DIFF use 6-regs in copy_user_page_asm
@ 2005-01-03 6:19 Grant Grundler
2005-01-04 6:13 ` Randolph Chung
0 siblings, 1 reply; 12+ messages in thread
From: Grant Grundler @ 2005-01-03 6:19 UTC (permalink / raw)
To: parisc-linux
This patch adds one more cycle between the load and store of a
given register by using three pairs of registers instead of two.
I had previously quoted one of the PA-8xxx papers that indicated
L1 cache was 2 cycles latency.
With this diff, the unrolled part of the loop now meets that.
The prolog and epilogue obviously cannot.
If anyone can show me a workload that improves with this diff,
I'll apply it. Otherwise it's just an academic exercise.
BTW, I don't really trust build-tools/cpup.c unless someone
can convince me it's really running in wide mode and not getting
lots of page faults/page zeroing to interfere with the test.
Maybe need to iterate over a smaller buffer (e.g. 64MB) several times
and ignore the first iteration. Maybe also record cr16 values between
calls to find a minimum and median *after* all the copying
is done.
thanks,
grant
ps. The "alignment doesn't matter" comment is too short. It really
means the alignment doesn't matter for the rest of the loop.
i.e., I don't need to add nops to separate the pairs of "std" insns.
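The register rotation described above can be sketched in portable C. This is only an illustration of the load/store ordering in the patch, with three pairs of temporaries standing in for %r19/%r20, %r21/%r22 and %r23/%r24. The names `copy_page_6regs_sketch` and `SKETCH_PAGE_SIZE` are hypothetical, a 4 KB page is assumed, and a C compiler is free to reschedule these statements, so this says nothing about timing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096   /* assumption: 4 KB pages */

/*
 * Copy one page, 128 bytes (16 doublewords) per iteration.
 * a, b, c play the roles of the three register pairs; each pair is
 * stored two load bundles after it was loaded, as in the unrolled
 * part of the patched loop.
 */
static void copy_page_6regs_sketch(uint64_t *dst, const uint64_t *src)
{
    uint64_t a0, a1, b0, b1, c0, c1;
    size_t i;

    assert(dst != NULL && src != NULL);

    for (i = 0; i < SKETCH_PAGE_SIZE / 128; i++) {
        a0 = src[0];   a1 = src[1];     /* bundle 1 */
        b0 = src[2];   b1 = src[3];     /* bundle 2 */

        c0 = src[4];   c1 = src[5];     /* bundle 3 */
        dst[0] = a0;   dst[1] = a1;

        a0 = src[6];   a1 = src[7];     /* bundle 4 */
        dst[2] = b0;   dst[3] = b1;

        b0 = src[8];   b1 = src[9];     /* bundle 5 */
        dst[4] = c0;   dst[5] = c1;

        c0 = src[10];  c1 = src[11];    /* bundle 6 */
        dst[6] = a0;   dst[7] = a1;

        a0 = src[12];  a1 = src[13];    /* bundle 7 */
        dst[8] = b0;   dst[9] = b1;

        b0 = src[14];  b1 = src[15];    /* bundle 8 */
        dst[10] = c0;  dst[11] = c1;

        /* epilogue of the iteration: drain the remaining two pairs */
        dst[12] = a0;  dst[13] = a1;
        dst[14] = b0;  dst[15] = b1;

        src += 16;
        dst += 16;
    }
}
```

Unlike the assembly, this sketch reloads the first pair at the top of each iteration instead of carrying it over from the branch delay slot.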
Index: arch/parisc/kernel/pacache.S
===================================================================
RCS file: /var/cvs/linux-2.6/arch/parisc/kernel/pacache.S,v
retrieving revision 1.14
diff -u -p -r1.14 pacache.S
--- arch/parisc/kernel/pacache.S 30 Dec 2004 08:07:48 -0000 1.14
+++ arch/parisc/kernel/pacache.S 3 Jan 2005 05:59:19 -0000
@@ -306,51 +306,52 @@ copy_user_page_asm:
ldd 0(%r25), %r19 /* bundle 1 */
ldi 32, %r1 /* PAGE_SIZE/128 == 32 */
-
1: ldd 8(%r25), %r20
ldw 256(%r25), %r0 /* prefetch 4 cacheline ahead */
ldd 16(%r25), %r21 /* bundle 2 */
ldd 24(%r25), %r22
+ nop /* preserve alignment of quads */
+ nop /* preserve alignment of quads */
+
+ ldd 32(%r25), %r23 /* bundle 3 */
+ ldd 40(%r25), %r24
std %r19, 0(%r26)
std %r20, 8(%r26)
- ldd 32(%r25), %r19 /* bundle 3 */
- ldd 40(%r25), %r20
+ ldd 48(%r25), %r19 /* bundle 4 */
+ ldd 56(%r25), %r20
std %r21, 16(%r26)
std %r22, 24(%r26)
- ldd 48(%r25), %r21 /* bundle 4 */
- ldd 56(%r25), %r22
- std %r19, 32(%r26)
- std %r20, 40(%r26)
-
- ldd 64(%r25), %r19 /* bundle 5 */
- ldd 72(%r25), %r20
- std %r21, 48(%r26)
- std %r22, 56(%r26)
-
- ldd 80(%r25), %r21 /* bundle 6 */
- ldd 88(%r25), %r22
- std %r19, 64(%r26)
- std %r20, 72(%r26)
+ ldd 64(%r25), %r21 /* bundle 5 */
+ ldd 72(%r25), %r22
+ std %r23, 32(%r26)
+ std %r24, 40(%r26)
+
+ ldd 80(%r25), %r23 /* bundle 6 */
+ ldd 88(%r25), %r24
+ std %r19, 48(%r26)
+ std %r20, 56(%r26)
ldd 96(%r25), %r19 /* bundle 7 */
ldd 104(%r25), %r20
- std %r21, 80(%r26)
- std %r22, 88(%r26)
+ std %r21, 64(%r26)
+ std %r22, 72(%r26)
ldd 112(%r25), %r21 /* bundle 8 */
ldd 120(%r25), %r22
+ std %r23, 80(%r26)
+ std %r24, 88(%r26)
+
+ ldo 128(%r25), %r25 /* alignment doesn't matter */
std %r19, 96(%r26)
std %r20, 104(%r26)
* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
2005-01-03 6:19 [parisc-linux] DIFF use 6-regs in copy_user_page_asm Grant Grundler
@ 2005-01-04 6:13 ` Randolph Chung
2005-01-04 8:23 ` Ryan Bradetich
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Randolph Chung @ 2005-01-04 6:13 UTC (permalink / raw)
To: Grant Grundler; +Cc: parisc-linux

> This patch adds one more cycle between the load and store of a
> given register by using three pairs of registers instead of two.
> I had previously quoted one of the PA-8xxx papers that indicated
> L1 cache was 2 cycles latency.
> With this diff, the unrolled part of the loop now meets that.
> The prolog and epilogue obviously cannot.
>
> If anyone can show me a workload that improves with this diff,
> I'll apply it. Otherwise it's just an academic exercise.

i'd like to see numbers too, but i doubt you will see any. it appears
that at least newer PA cpus do a sufficient amount of internal
instruction reordering that you don't see a difference as long as there
are enough pending instructions to keep the pipeline busy.

randolph
--
Randolph Chung
Debian GNU/Linux Developer, hppa/ia64 ports
http://www.tausq.org/
_______________________________________________
parisc-linux mailing list
parisc-linux@lists.parisc-linux.org
http://lists.parisc-linux.org/mailman/listinfo/parisc-linux
* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
2005-01-04 6:13 ` Randolph Chung
@ 2005-01-04 8:23 ` Ryan Bradetich
2005-01-04 8:29 ` Randolph Chung
2005-01-04 14:51 ` Michael S. Zick
2005-01-04 23:39 ` Grant Grundler
2 siblings, 1 reply; 12+ messages in thread
From: Ryan Bradetich @ 2005-01-04 8:23 UTC (permalink / raw)
To: Randolph Chung; +Cc: parisc-linux

Randolph,

Is this something that would benefit older processors? Say a PCX-T
processor? Is this something worth testing? I'm pretty sure I
have a PCX-T processor around I could boot up and test.

- Ryan

On Mon, 2005-01-03 at 22:13 -0800, Randolph Chung wrote:
> > This patch adds one more cycle between the load and store of a
> > given register by using three pairs of registers instead of two.
> > I had previously quoted one of the PA-8xxx papers that indicated
> > L1 cache was 2 cycles latency.
> > With this diff, the unrolled part of the loop now meets that.
> > The prolog and epilogue obviously cannot.
> >
> > If anyone can show me a workload that improves with this diff,
> > I'll apply it. Otherwise it's just an academic exercise.
>
> i'd like to see numbers too, but i doubt you will see any. it appears
> that at least newer PA cpus do a sufficient amount of internal
> instruction reordering that you don't see a difference as long as there
> are enough pending instructions to keep the pipeline busy.
>
> randolph
--
Ryan Bradetich <rbradetich@uswest.net>
* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
2005-01-04 8:23 ` Ryan Bradetich
@ 2005-01-04 8:29 ` Randolph Chung
2005-01-04 13:12 ` Joel Soete
0 siblings, 1 reply; 12+ messages in thread
From: Randolph Chung @ 2005-01-04 8:29 UTC (permalink / raw)
To: Ryan Bradetich; +Cc: parisc-linux

> Is this something that would benefit older processors? Say a PCX-T
> processor? Is this something worth testing? I'm pretty sure I
> have a PCX-T processor around I could boot up and test.

i have no idea, but we can test it and see. i've read some pdfs which
suggest that all pa20 processors can do this reordering, but i don't
know about pa11 processors. in any case, empirical results are always
better than speculation :)

randolph
* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
2005-01-04 8:29 ` Randolph Chung
@ 2005-01-04 13:12 ` Joel Soete
0 siblings, 0 replies; 12+ messages in thread
From: Joel Soete @ 2005-01-04 13:12 UTC (permalink / raw)
To: Randolph Chung, Ryan Bradetich, Grant Grundler; +Cc: parisc-linux
[-- Attachment #1: Type: text/plain, Size: 2842 bytes --]

> -- Original Message --
> Date: Tue, 4 Jan 2005 00:29:40 -0800
> From: Randolph Chung <randolph@tausq.org>
> To: Ryan Bradetich <rbradetich@uswest.net>
> Cc: parisc-linux@lists.parisc-linux.org
> Reply-To: Randolph Chung <randolph@tausq.org>
> Subject: Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
>
>
> > Is this something that would benefit older processors? Say a PCX-T
> > processor? Is this something worth testing? I'm pretty sure I
> > have a PCX-T processor around I could boot up and test.
>
> i have no idea, but we can test it and see. i've read some pdfs which
> suggest that all pa20 processors can do this reordering, but i don't
> know about pa11 processors. in any case, empirical results are always
> better than speculation :)
>
This is only foreseen for pa2.0 ;-)

That said, I also tried re-ordering (with only 4 regs) on my c110, but the
test case (previous cpup0.c) showed neither improvement nor degradation!

Grant, I also rewrote the test case (attached here as cpup1.c) to reproduce
your proposal and compare it with the code currently in the kernel. Here are
some results:

patst005:/Develop/jso/Var/Comp# time ./cpup1 ; time ./cpup2
real 0m3.964s  user 0m0.209s  sys 0m3.753s
real 0m3.976s  user 0m0.190s  sys 0m3.781s

patst005:/Develop/jso/Var/Comp# time ./cpup2 ; time ./cpup1
real 0m3.946s  user 0m0.218s  sys 0m3.725s
real 0m3.961s  user 0m0.196s  sys 0m3.762s

patst005:/Develop/jso/Var/Comp# time ./cpup0 ; time ./cpup2 ; time ./cpup1
real 0m4.046s  user 0m0.354s  sys 0m3.691s
real 0m3.940s  user 0m0.225s  sys 0m3.712s
real 0m3.946s  user 0m0.208s  sys 0m3.734s

patst005:/Develop/jso/Var/Comp# time ./cpup0 ; time ./cpup1 ; time ./cpup2
real 0m4.068s  user 0m0.342s  sys 0m3.724s
real 0m3.948s  user 0m0.194s  sys 0m3.752s
real 0m3.936s  user 0m0.193s  sys 0m3.740s

patst005:/Develop/jso/Var/Comp# time ./cpup1 ; time ./cpup0 ; time ./cpup2
real 0m3.928s  user 0m0.202s  sys 0m3.725s
real 0m4.067s  user 0m0.329s  sys 0m3.731s
real 0m3.946s  user 0m0.224s  sys 0m3.718s

patst005:/Develop/jso/Var/Comp# time ./cpup2 ; time ./cpup0 ; time ./cpup1
real 0m3.942s  user 0m0.213s  sys 0m3.727s
real 0m4.086s  user 0m0.333s  sys 0m3.749s
real 0m3.956s  user 0m0.208s  sys 0m3.745s

Unfortunately (as in the previous test), I was not able to show an
actual benefit :-(

(please note that I had to reduce BUFFSIZE because my b2k has only
256MB of RAM ;-)

hth,
Joel

[-- Attachment #2: cpup1.c --]
[-- Type: text/x-csrc, Size: 4086 bytes --]

/*
** CoPy User Page asm tester
**
** gcc -O2 -o cpup0 cpup.c                      vanilla 32-bit loop
**     -march=2.0 -DLP64 -o cpup1               64-bit, 4ld + 4st sequences
**     -march=2.0 -DLP64 -DV1 -o cpup2          64-bit, 4regs, 2ld/2st bundles
**     -march=2.0 -DLP64 -DUSE6REGS -o cpup3    64-bit, 6 regs, 2ld/2st bundles
*/
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <asm/page.h>

void __copy_user_page_asm(void *to, void *from)
{
	register unsigned long __to __asm__ ("r26") = (unsigned long)to;
	register unsigned long __from __asm__ ("r25") = (unsigned long)from;

#ifdef LP64
	asm volatile ("ldd 0(%0), %%r19\n"
		" ldi 32, %%r1\n"
		"1: ldd 8(%0), %%r20\n"
		" ldw 256(%0), %%r0\n"
		" ldd 16(%0), %%r21\n"
		" ldd 24(%0), %%r22\n"
#ifdef USE6REGS
		" nop\n"
		" nop\n"
		" ldd 32(%0), %%r23\n"
		" ldd 40(%0), %%r24\n"
		" std %%r19, 0(%1)\n"
		" std %%r20, 8(%1)\n"
		" ldd 48(%0), %%r19\n"
		" ldd 56(%0), %%r20\n"
		" std %%r21, 16(%1)\n"
		" std %%r22, 24(%1)\n"
		" ldd 64(%0), %%r21\n"
		" ldd 72(%0), %%r22\n"
		" std %%r23, 32(%1)\n"
		" std %%r24, 40(%1)\n"
		" ldd 80(%0), %%r23\n"
		" ldd 88(%0), %%r24\n"
		" std %%r19, 48(%1)\n"
		" std %%r20, 56(%1)\n"
		" ldd 96(%0), %%r19\n"
		" ldd 104(%0), %%r20\n"
		" std %%r21, 64(%1)\n"
		" std %%r22, 72(%1)\n"
		" ldd 112(%0), %%r21\n"
		" ldd 120(%0), %%r22\n"
		" std %%r23, 80(%%r26)\n"
		" std %%r24, 88(%%r26)\n"
		" ldo 128(%0), %0\n"
		" std %%r19, 96(%1)\n"
		" std %%r20, 104(%1)\n"
		" std %%r21, 112(%1)\n"
		" std %%r22, 120(%1)\n"
		" ldo 128(%1), %1\n"
		" addib,> -1, %%r1, 1b\n"
		" ldd 0(%0), %%r19"
#else /* !USE6REGS */
		" std %%r19, 0(%1)\n"
		" std %%r20, 8(%1)\n"
		" ldd 32(%0), %%r19\n"
		" ldd 40(%0), %%r20\n"
		" std %%r21, 16(%1)\n"
		" std %%r22, 24(%1)\n"
		" ldd 48(%0), %%r21\n"
		" ldd 56(%0), %%r22\n"
		" std %%r19, 32(%1)\n"
		" std %%r20, 40(%1)\n"
		" ldd 64(%0), %%r19\n"
		" ldd 72(%0), %%r20\n"
		" std %%r21, 48(%1)\n"
		" std %%r22, 56(%1)\n"
		" ldd 80(%0), %%r21\n"
		" ldd 88(%0), %%r22\n"
		" std %%r19, 64(%1)\n"
		" std %%r20, 72(%1)\n"
		" ldd 96(%0), %%r19\n"
		" ldd 104(%0), %%r20\n"
		" std %%r21, 80(%1)\n"
		" std %%r22, 88(%1)\n"
		" ldd 112(%0), %%r21\n"
		" ldd 120(%0), %%r22\n"
		" std %%r19, 96(%1)\n"
		" std %%r20, 104(%1)\n"
		" ldo 128(%0), %0\n"
		" std %%r21, 112(%1)\n"
		" std %%r22, 120(%1)\n"
		" ldo 128(%1), %1\n"
		" addib,> -1, %%r1, 1b\n"
		" ldd 0(%0), %%r19"
#endif /* USE6REGS */
#else /* !LP64 */
	asm volatile ("ldi 64, %%r1\n"
		"1: ldw 0(%0), %%r19\n"
		" ldw 4(%0), %%r20\n"
		" ldw 8(%0), %%r21\n"
		" ldw 12(%0), %%r22\n"
		" stw %%r19, 0(%1)\n"
		" stw %%r20, 4(%1)\n"
		" stw %%r21, 8(%1)\n"
		" stw %%r22, 12(%1)\n"
		" ldw 16(%0), %%r19\n"
		" ldw 20(%0), %%r20\n"
		" ldw 24(%0), %%r21\n"
		" ldw 28(%0), %%r22\n"
		" stw %%r19, 16(%1)\n"
		" stw %%r20, 20(%1)\n"
		" stw %%r21, 24(%1)\n"
		" stw %%r22, 28(%1)\n"
		" ldw 32(%0), %%r19\n"
		" ldw 36(%0), %%r20\n"
		" ldw 40(%0), %%r21\n"
		" ldw 44(%0), %%r22\n"
		" stw %%r19, 32(%1)\n"
		" stw %%r20, 36(%1)\n"
		" stw %%r21, 40(%1)\n"
		" stw %%r22, 44(%1)\n"
		" ldw 48(%0), %%r19\n"
		" ldw 52(%0), %%r20\n"
		" ldw 56(%0), %%r21\n"
		" ldw 60(%0), %%r22\n"
		" stw %%r19, 48(%1)\n"
		" stw %%r20, 52(%1)\n"
		" stw %%r21, 56(%1)\n"
		" stw %%r22, 60(%1)\n"
		" ldo 64(%1), %1\n"
		" addib,> -1, %%r1, 1b\n"
		" ldo 64(%0), %0"
#endif /* LP64 */
		:
		: "r"(__from), "r"(__to)
	);
}

#define BUFFSIZE (1024*1024*64)
#define PPB (BUFFSIZE/PAGE_SIZE) /* Pages Per Buff */

int main(int argc, char * * argv, char * * env)
{
	char *MemSrc, *MemDst;
	unsigned long j;

	MemSrc = malloc(BUFFSIZE);
	MemDst = malloc(BUFFSIZE);
	if (MemSrc == NULL || MemDst == NULL)
		return 1;

	/* initialize first page of MemSrc */
	for (j = 0; j < (PAGE_SIZE/sizeof(unsigned long)) ; j++) {
		((unsigned long *) MemSrc)[j]=j;
	}

	/* clone first page to remaining pages - page at a time */
	for (j = 1; j < PPB ; j++) {
		__copy_user_page_asm( MemSrc + (j*PAGE_SIZE), MemSrc);
	}

	/* Clone Src to Dest - page at a time */
	for (j = 0; j < PPB ; j++) {
		__copy_user_page_asm( MemDst + (j*PAGE_SIZE), MemSrc + (j*PAGE_SIZE));
	}

	free(MemSrc);
	free(MemDst);
	return 0;
}
* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
2005-01-04 6:13 ` Randolph Chung
2005-01-04 8:23 ` Ryan Bradetich
@ 2005-01-04 14:51 ` Michael S. Zick
2005-01-04 16:02 ` Grant Grundler
2005-01-04 23:39 ` Grant Grundler
2 siblings, 1 reply; 12+ messages in thread
From: Michael S. Zick @ 2005-01-04 14:51 UTC (permalink / raw)
To: parisc-linux

On Tue January 4 2005 00:13, Randolph Chung wrote:
> > This patch adds one more cycle between the load and store of a
> > given register by using three pairs of registers instead of two.
> > I had previously quoted one of the PA-8xxx papers that indicated
> > L1 cache was 2 cycles latency.
> > With this diff, the unrolled part of the loop now meets that.
> > The prolog and epilogue obviously cannot.
> >
> > If anyone can show me a workload that improves with this diff,
> > I'll apply it. Otherwise it's just an academic exercise.
>
> i'd like to see numbers too, but i doubt you will see any. it appears
> that at least newer PA cpus do a sufficient amount of internal
> instruction reordering that you don't see a difference as long as there
> are enough pending instructions to keep the pipeline busy.
>
One other possibility to keep in mind when testing: this is an
io sequence, no heavy register-register computations.

If the 4-regs + internal reordering has already saturated the
cpu-external buses...
Then even if the cpu-core can execute the 6-regs + internal
reordering more effectively, you will never see it outside
of the cpu.

You might have to borrow a bus analyzer from the hardware
lab to see if this is affecting your tests.

Mike
* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
2005-01-04 14:51 ` Michael S. Zick
@ 2005-01-04 16:02 ` Grant Grundler
[not found] ` <200501041142.44400.mszick@wolfbutter.com>
0 siblings, 1 reply; 12+ messages in thread
From: Grant Grundler @ 2005-01-04 16:02 UTC (permalink / raw)
To: Michael S. Zick; +Cc: parisc-linux

On Tue, Jan 04, 2005 at 08:51:19AM -0600, Michael S. Zick wrote:
> One other possibility to keep in mind when testing: this is an
> io sequence, no heavy register-register computations.

Not entirely. Sure, to load data into cache it's "io".
But the bulk of the loop is to move data from one cacheline to another.

> You might have to borrow a bus analyzer from the hardware
> lab to see if this is affecting your tests.

I don't. If 6-regs works better then I use it.
If CPU performance counter support worked, we could figure out
where the bottlenecks were for both cases.

grant
[parent not found: <200501041142.44400.mszick@wolfbutter.com>]
* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
[not found] ` <200501041142.44400.mszick@wolfbutter.com>
@ 2005-01-04 20:09 ` Grant Grundler
0 siblings, 0 replies; 12+ messages in thread
From: Grant Grundler @ 2005-01-04 20:09 UTC (permalink / raw)
To: Michael S. Zick; +Cc: parisc-linux

On Tue, Jan 04, 2005 at 11:42:44AM -0600, Michael S. Zick wrote:
> > I don't. If 6-regs works better then I use it.
> Agreed,
> If you can find a difference now.

I can, using CR16. That's what I was proposing before.

> I was speaking of the other case:
> If they appear to work the same now.

Yes, but I don't need an analyzer to guess at what might
be causing the bottleneck. The "Linux Way" is to keep trying
different variants until we find a better one (or get fed up).
I know using an analyzer is more precise _once_ it's set up.

Joel,
I've hacked your cpup1.c and committed it to build-tools.
Please send me diffs in the future. You would have noticed
that you reference %r26 directly in two of the asm statements.

The new version implements most of what I was proposing:
o use CR16 to measure copy_user_page_asm()
o run multiple iterations to avoid page faults/TLB activity
o drops -DV1 code (4ld/4st in 64-bit case)
o implements -DUSE6REGS
o uses 64MB src/dest buffer

grundler <536>gcc -O2 -o cpup0 cpup.c
grundler <537>gcc -march=2.0 -DLP64 -o cpup2 cpup.c
grundler <538>gcc -march=2.0 -DLP64 -DDUSE6REGS -o cpup3 cpup.c
grundler <539>./cpup0
First Loop : min 14393 avg 17156 median 16219
Later Loops : min 9696 avg 10819 median 10432
grundler <540>./cpup2
First Loop : min 11381 avg 14120 median 13168
Later Loops : min 5844 avg 7695 median 7595
grundler <541>./cpup3
First Loop : min 11441 avg 14102 median 13167
Later Loops : min 5898 avg 7702 median 7594

This might be useful for measuring cost of TLB insertion too.
Please verify the code is generating the stats properly
before taking the above numbers as The Truth.
(650 MHz A500 running SMP 2.6.10-rc3-pa6)

I also noticed that even this gets different results on
the first vs successive invocations:

grundler <545>./cpup3
First Loop : min 11277 avg 17749 median 13143
Later Loops : min 5806 avg 8156 median 7589
grundler <546>./cpup3
First Loop : min 11217 avg 14250 median 13154
Later Loops : min 5904 avg 7726 median 7604
grundler <547>./cpup3
First Loop : min 11528 avg 14147 median 13162
Later Loops : min 5877 avg 7722 median 7600
grundler <548>./cpup3
First Loop : min 11548 avg 14202 median 13177
Later Loops : min 5866 avg 7727 median 7600
grundler <549>./cpup3
First Loop : min 11577 avg 14150 median 13173
Later Loops : min 5877 avg 7729 median 7607

Ignoring the first invocation, the results are quite precise: +- 4/7725

Adding another "ldw 192(%0), %%r0" to the bottom of the loop
reduced that a bit more. Before, we prefetched only one of
the two cachelines processed in the loop.
The 5th run output was:

grundler <561>./cpup3
First Loop : min 9831 avg 12950 median 12000
Later Loops : min 5790 avg 7529 median 7375

hth,
grant
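The measurement approach listed in the message above (time each page copy, keep the first pass separate from later passes, then report min/avg/median) can be sketched as follows. This is not the cpup.c that was committed to build-tools; it is a minimal sketch under stated assumptions. The tick source is passed in as a function pointer because reading CR16 is PA-RISC specific, and `counting_ticks`, `memcpy_page`, `time_pass`, `summarize` and `SKETCH_PAGE_SIZE` are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096   /* assumption: 4 KB pages */

struct sample_stats { uint64_t min, avg, median; };

typedef uint64_t (*tick_fn)(void);
typedef void (*copy_fn)(void *to, const void *from);

/* Hypothetical stand-in for a cycle counter such as CR16:
 * it simply advances by 10 "ticks" on every read. */
static uint64_t counting_ticks(void)
{
    static uint64_t now;
    now += 10;
    return now;
}

/* Hypothetical stand-in for the page copy routine under test. */
static void memcpy_page(void *to, const void *from)
{
    memcpy(to, from, SKETCH_PAGE_SIZE);
}

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sorts samples in place, then reports min, mean and (upper) median. */
static struct sample_stats summarize(uint64_t *samples, size_t n)
{
    struct sample_stats s = {0, 0, 0};
    uint64_t sum = 0;
    size_t i;

    if (n == 0)
        return s;
    qsort(samples, n, sizeof(samples[0]), cmp_u64);
    for (i = 0; i < n; i++)
        sum += samples[i];
    s.min = samples[0];
    s.avg = sum / n;
    s.median = samples[n / 2];
    return s;
}

/* Time one pass over the buffer, one sample per page. Statistics are
 * computed only afterwards, so they do not disturb the timed region. */
static void time_pass(copy_fn copy, tick_fn ticks, char *dst,
                      const char *src, size_t pages, uint64_t *out)
{
    size_t p;

    assert(copy && ticks && dst && src && out);
    for (p = 0; p < pages; p++) {
        uint64_t t0 = ticks();
        copy(dst + p * SKETCH_PAGE_SIZE, src + p * SKETCH_PAGE_SIZE);
        out[p] = ticks() - t0;
    }
}
```

A caller would run `time_pass` once and summarize that pass on its own (the "First Loop" line, which absorbs page faults and TLB misses), then run it several more times into a second sample array and summarize that separately (the "Later Loops" line).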
* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
2005-01-04 6:13 ` Randolph Chung
2005-01-04 8:23 ` Ryan Bradetich
2005-01-04 14:51 ` Michael S. Zick
@ 2005-01-04 23:39 ` Grant Grundler
2005-01-05 0:00 ` John David Anglin
2 siblings, 1 reply; 12+ messages in thread
From: Grant Grundler @ 2005-01-04 23:39 UTC (permalink / raw)
To: Randolph Chung; +Cc: parisc-linux

On Mon, Jan 03, 2005 at 10:13:42PM -0800, Randolph Chung wrote:
> i'd like to see numbers too, but i doubt you will see any.

I hope the new cpup.c will help us provide precise values.

[ The following is more intended for folks like Joel than Randolph.
I'm pretty sure Randolph understands how CPUs work. ]

> it appears
> that at least newer PA cpus do a sufficient amount of internal
> instruction reordering that you don't see a difference as long as there
> are enough pending instructions to keep the pipeline busy.

The pipeline on PA-8x00 can load 4 instructions at a time.
How those 4 instructions get executed depends on interlocks (e.g.
register is still in use by previous insn) and if the CPU/mem units are
available. AFAIK, PA8x00 processors support 2 loads/cycle,
2 stores/cycle, 2 shift+merge ops/cycle, 2 FP Div/cycle, etc.
"Keeping the pipeline busy" is just as much a function of instruction
scheduling by programmer/compiler as re-ordering by the CPU.
One really needs to keep track of which resources are available.

Anytime a resource is not available, the pipeline "stalls".
ia64 calls this "bubbles" and the best description I've found
of "bubbles" is in the MySQL perf paper by Philippe Bonnet:
http://www.gelato.org/resources/bookspapers.php
(or http://www.gelato.org/pdf/mysql_itanium2_perf.pdf)

While parisc might be more primitive in how it deals with "stalls"
than ia64, I expect most of the same principles apply to both when
writing code.

grant
* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
2005-01-04 23:39 ` Grant Grundler
@ 2005-01-05 0:00 ` John David Anglin
2005-01-05 22:01 ` Michael S. Zick
2005-01-06 22:55 ` Grant Grundler
0 siblings, 2 replies; 12+ messages in thread
From: John David Anglin @ 2005-01-05 0:00 UTC (permalink / raw)
To: Grant Grundler; +Cc: parisc-linux

> The pipeline on PA-8x00 can load 4 instructions at a time.
> How those 4 instructions get executed depends on interlocks (e.g.
> register is still in use by previous insn) and if the CPU/mem units are
> available. AFAIK, PA8x00 processors support 2 loads/cycle,
> 2 stores/cycle, 2 shift+merge ops/cycle, 2 FP Div/cycle, etc.
> "Keeping the pipeline busy" is just as much a function of instruction
> scheduling by programmer/compiler as re-ordering by the CPU.
> One really needs to keep track of which resources are available.

This is what the GCC machine definition says:

;; The PA8000 has a large (56) entry reorder buffer that is split between
;; memory and non-memory operations.
;;
;; The PA8000 can issue two memory and two non-memory operations per cycle to
;; the function units, with the exception of branches and multi-output
;; instructions. The PA8000 can retire two non-memory operations per cycle
;; and two memory operations per cycle, only one of which may be a store.
;;
;; Given the large reorder buffer, the processor can hide most latencies.
;; According to HP, they've got the best results by scheduling for retirement
;; bandwidth with limited latency scheduling for floating point operations.
;; Latency for integer operations and memory references is ignored.
;;
;;
;; We claim floating point operations have a 2 cycle latency and are
;; fully pipelined, except for div and sqrt which are not pipelined and
;; take from 17 to 31 cycles to complete.
;;
;; It's worth noting that there is no way to saturate all the functional
;; units on the PA8000 as there is not enough issue bandwidth.

Comments?

Dave
--
J. David Anglin dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada (613) 990-0752 (FAX: 952-6602)
* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
2005-01-05 0:00 ` John David Anglin
@ 2005-01-05 22:01 ` Michael S. Zick
2005-01-06 22:55 ` Grant Grundler
1 sibling, 0 replies; 12+ messages in thread
From: Michael S. Zick @ 2005-01-05 22:01 UTC (permalink / raw)
To: parisc-linux

On Tue January 4 2005 18:00, John David Anglin wrote:
>
> This is what the GCC machine definition says:
>
> ;;
> ;; It's worth noting that there is no way to saturate all the functional
> ;; units on the PA8000 as there is not enough issue bandwidth.
>
> Comments?
>
Thanks,
That is what I tried to say but couldn't find the reference.

Mike
* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
2005-01-05 0:00 ` John David Anglin
2005-01-05 22:01 ` Michael S. Zick
@ 2005-01-06 22:55 ` Grant Grundler
1 sibling, 0 replies; 12+ messages in thread
From: Grant Grundler @ 2005-01-06 22:55 UTC (permalink / raw)
To: John David Anglin; +Cc: parisc-linux

On Tue, Jan 04, 2005 at 07:00:43PM -0500, John David Anglin wrote:
> Grant Grundler wrote:
> > The pipeline on PA-8x00 can load 4 instructions at a time.

My statement happens to be correct but is misleading.
And I misunderstood it too.
The system does "fetch" 4 insn at a time but can't execute/retire
4 mem ops at a time. PCX-U can only handle two memory ops
per cycle as (correctly) described by the GCC machine definition below.

> This is what the GCC machine definition says:

Thanks for digging this up.
I definitely was confused about some of the details.

> ;; The PA8000 has a large (56) entry reorder buffer that is split between
> ;; memory and non-memory operations.
> ;;
> ;; The PA8000 can issue two memory and two non-memory operations per cycle to
> ;; the function units, with the exception of branches and multi-output
> ;; instructions. The PA8000 can retire two non-memory operations per cycle
> ;; and two memory operations per cycle, only one of which may be a store.

Yes, this is correct. I'm told this is true for all PA-8x00 CPUs.
I was confused. The "load-store unit" describes one unit that can only
do one load or one store.

Each instruction queue is also divided into even and odd slots.
The functional units are assigned to either an even or odd slot.
I.e., two loads in adjacent slots will use both load-store units.
Two loads in odd slots will serialize. That shouldn't be an issue
for copy_user_page_asm but it would be good if gcc is aware of it.

Further, the PCX-U cache accesses are serialized when they are to the
same cacheline. The copy_user_page_asm loop should be restructured to
interleave accesses to the two 64 byte cachelines handled in each
iteration of the loop. This won't matter for stores at the "tail end"
of the loop but will help with the loads at the front of the loop.

> ;; Given the large reorder buffer, the processor can hide most latencies.

Yes. I'm told the re-order buffers (aka "memory ops queue") should be
sufficient to hide scheduling issues. So we don't have to sweat
too many details if we get it "close enough".

> ;; According to HP, they've got the best results by scheduling for retirement
> ;; bandwidth with limited latency scheduling for floating point operations.
> ;; Latency for integer operations and memory references is ignored.

Along this line, according to the specs, PCX-U has a "best case load
latency of three cycles". This is because of "one cycle for address
calculation and 2 cycles for off chip cache access". Later CPUs
are 2 cycles. Restructuring the USE6REGS code might help PCX-U.

> ;; It's worth noting that there is no way to saturate all the functional
> ;; units on the PA8000 as there is not enough issue bandwidth.

Agreed. But we do saturate the load-store units in the
copy_user_page_asm code. At least for short bursts.

hth,
grant
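One possible reading of the cacheline interleaving suggestion in the message above is sketched below in portable C. It is an assumption about what "interleave" means here, not code from the thread: adjacent loads alternate between the two 64-byte lines of each 128-byte chunk, so that back-to-back accesses do not target the same line. The names `copy_page_interleaved_sketch`, `SKETCH_PAGE_SIZE` and `WORDS_PER_LINE` are hypothetical, and a C compiler may reorder the accesses, so only the intended ordering is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096   /* assumption: 4 KB pages */
#define WORDS_PER_LINE   8      /* 64-byte cacheline / 8-byte doublewords */

/*
 * Copy one page, two 64-byte cachelines per outer iteration.
 * Each pair of adjacent loads takes one doubleword from line 0 and
 * one from line 1, instead of walking a single line front to back.
 */
static void copy_page_interleaved_sketch(uint64_t *dst, const uint64_t *src)
{
    size_t i, w;

    assert(dst != NULL && src != NULL);

    for (i = 0; i < SKETCH_PAGE_SIZE / 128; i++) {
        const uint64_t *s0 = src;                   /* first source line  */
        const uint64_t *s1 = src + WORDS_PER_LINE;  /* second source line */
        uint64_t *d0 = dst;
        uint64_t *d1 = dst + WORDS_PER_LINE;

        for (w = 0; w < WORDS_PER_LINE; w += 2) {
            uint64_t x0 = s0[w];        /* line 0 */
            uint64_t y0 = s1[w];        /* line 1 */
            uint64_t x1 = s0[w + 1];    /* line 0 */
            uint64_t y1 = s1[w + 1];    /* line 1 */

            d0[w] = x0;
            d1[w] = y0;
            d0[w + 1] = x1;
            d1[w + 1] = y1;
        }

        src += 2 * WORDS_PER_LINE;
        dst += 2 * WORDS_PER_LINE;
    }
}
```

Whether this ordering helps on PCX-U would still need to be measured, for example with the CR16-based test program discussed earlier in the thread.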
Thread overview: 12+ messages
2005-01-03 6:19 [parisc-linux] DIFF use 6-regs in copy_user_page_asm Grant Grundler
2005-01-04 6:13 ` Randolph Chung
2005-01-04 8:23 ` Ryan Bradetich
2005-01-04 8:29 ` Randolph Chung
2005-01-04 13:12 ` Joel Soete
2005-01-04 14:51 ` Michael S. Zick
2005-01-04 16:02 ` Grant Grundler
[not found] ` <200501041142.44400.mszick@wolfbutter.com>
2005-01-04 20:09 ` Grant Grundler
2005-01-04 23:39 ` Grant Grundler
2005-01-05 0:00 ` John David Anglin
2005-01-05 22:01 ` Michael S. Zick
2005-01-06 22:55 ` Grant Grundler