* [parisc-linux] DIFF use 6-regs in copy_user_page_asm
@ 2005-01-03 6:19 Grant Grundler
2005-01-04 6:13 ` Randolph Chung
0 siblings, 1 reply; 12+ messages in thread
From: Grant Grundler @ 2005-01-03 6:19 UTC (permalink / raw)
To: parisc-linux
This patch adds one more cycle between the load and store of a
given register by using three pairs of registers instead of two.
I had previously quoted one of the PA-8xxx papers that indicated
L1 cache was 2 cycles latency.
With this diff, the unrolled part of the loop now meets that.
The prolog and epilogue obviously cannot.
If anyone can show me a workload that improves with this diff,
I'll apply it. Otherwise it's just an academic exercise.
BTW, I don't really trust build-tools/cpup.c unless someone
can convince me it's really running in wide mode and not getting
lots of page faults/page zeroing to interfere with the test.
Maybe we need to iterate over a smaller buffer (e.g. 64MB) several times
and ignore the first iteration. Maybe also record cr16 values between
calls and find the minimum and median *after* all the copying
is done.
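The "record first, summarize afterwards" idea above can be sketched as follows. This is only an illustration: the names `tick_stats` and `summarize_ticks` are hypothetical and not taken from build-tools/cpup.c. Samples are assumed to be stored in a plain array during the timed loop, so that sorting and averaging never disturb the measurement itself.

```c
#include <assert.h>
#include <stdlib.h>

/* Summary of per-call timings; all values are in timer ticks. */
struct tick_stats {
    unsigned long min;
    unsigned long avg;
    unsigned long median;
};

static int cmp_ulong(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;

    return (x > y) - (x < y);
}

/*
 * Summarize n recorded samples. The array is sorted in place, so this
 * should only be called after the timed loop has finished.
 */
static struct tick_stats summarize_ticks(unsigned long *samples, size_t n)
{
    struct tick_stats s;
    unsigned long long sum = 0;
    size_t i;

    assert(n > 0);
    qsort(samples, n, sizeof(samples[0]), cmp_ulong);
    for (i = 0; i < n; i++)
        sum += samples[i];

    s.min = samples[0];
    s.avg = (unsigned long)(sum / n);
    s.median = samples[n / 2];  /* upper median when n is even */
    return s;
}
```

Keeping the sort out of the timed region matters because qsort() itself touches cache and TLB entries that the copy loop would otherwise own.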
thanks,
grant
ps. The "alignment doesn't matter" comment is too short. It really
means the alignment doesn't matter for the rest of the loop.
i.e. I don't need to add nops to separate the pairs of "std" insns.
Index: arch/parisc/kernel/pacache.S
===================================================================
RCS file: /var/cvs/linux-2.6/arch/parisc/kernel/pacache.S,v
retrieving revision 1.14
diff -u -p -r1.14 pacache.S
--- arch/parisc/kernel/pacache.S 30 Dec 2004 08:07:48 -0000 1.14
+++ arch/parisc/kernel/pacache.S 3 Jan 2005 05:59:19 -0000
@@ -306,51 +306,52 @@ copy_user_page_asm:
ldd 0(%r25), %r19 /* bundle 1 */
ldi 32, %r1 /* PAGE_SIZE/128 == 32 */
-
1: ldd 8(%r25), %r20
ldw 256(%r25), %r0 /* prefetch 4 cacheline ahead */
ldd 16(%r25), %r21 /* bundle 2 */
ldd 24(%r25), %r22
+ nop /* preserve alignment of quads */
+ nop /* preserve alignment of quads */
+
+ ldd 32(%r25), %r23 /* bundle 3 */
+ ldd 40(%r25), %r24
std %r19, 0(%r26)
std %r20, 8(%r26)
- ldd 32(%r25), %r19 /* bundle 3 */
- ldd 40(%r25), %r20
+ ldd 48(%r25), %r19 /* bundle 4 */
+ ldd 56(%r25), %r20
std %r21, 16(%r26)
std %r22, 24(%r26)
- ldd 48(%r25), %r21 /* bundle 4 */
- ldd 56(%r25), %r22
- std %r19, 32(%r26)
- std %r20, 40(%r26)
-
- ldd 64(%r25), %r19 /* bundle 5 */
- ldd 72(%r25), %r20
- std %r21, 48(%r26)
- std %r22, 56(%r26)
-
- ldd 80(%r25), %r21 /* bundle 6 */
- ldd 88(%r25), %r22
- std %r19, 64(%r26)
- std %r20, 72(%r26)
+ ldd 64(%r25), %r21 /* bundle 5 */
+ ldd 72(%r25), %r22
+ std %r23, 32(%r26)
+ std %r24, 40(%r26)
+
+ ldd 80(%r25), %r23 /* bundle 6 */
+ ldd 88(%r25), %r24
+ std %r19, 48(%r26)
+ std %r20, 56(%r26)
ldd 96(%r25), %r19 /* bundle 7 */
ldd 104(%r25), %r20
- std %r21, 80(%r26)
- std %r22, 88(%r26)
+ std %r21, 64(%r26)
+ std %r22, 72(%r26)
ldd 112(%r25), %r21 /* bundle 8 */
ldd 120(%r25), %r22
+ std %r23, 80(%r26)
+ std %r24, 88(%r26)
+
+ ldo 128(%r25), %r25 /* alignment doesn't matter */
std %r19, 96(%r26)
std %r20, 104(%r26)
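For readers who do not read PA-RISC assembly, the load/store order of the patched loop can be modelled in C. This is a sketch only: `copy_page_6regs_model` and `MODEL_PAGE_SIZE` are hypothetical names, the 4096-byte page size is inferred from "ldi 32" times 128 bytes per iteration, and a C compiler is free to reschedule these statements, so the model shows the intended ordering rather than the timing. Pair a stands for %r19/%r20, b for %r21/%r22, and c for %r23/%r24.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u   /* assumed: 32 iterations x 128 bytes */

/*
 * Copy one page in 128-byte chunks using three rotating register pairs.
 * In the steady state a pair is stored only after two further pairs
 * have been loaded; the last stores of each chunk cannot keep that gap.
 */
static void copy_page_6regs_model(uint64_t *to, const uint64_t *from)
{
    size_t chunk;

    for (chunk = 0; chunk < MODEL_PAGE_SIZE / 128; chunk++) {
        uint64_t a0, a1, b0, b1, c0, c1;

        a0 = from[0];   a1 = from[1];    /* bundle 1 */
        b0 = from[2];   b1 = from[3];    /* bundle 2 */
        c0 = from[4];   c1 = from[5];    /* bundle 3 */
        to[0] = a0;     to[1] = a1;
        a0 = from[6];   a1 = from[7];    /* bundle 4 */
        to[2] = b0;     to[3] = b1;
        b0 = from[8];   b1 = from[9];    /* bundle 5 */
        to[4] = c0;     to[5] = c1;
        c0 = from[10];  c1 = from[11];   /* bundle 6 */
        to[6] = a0;     to[7] = a1;
        a0 = from[12];  a1 = from[13];   /* bundle 7 */
        to[8] = b0;     to[9] = b1;
        b0 = from[14];  b1 = from[15];   /* bundle 8 */
        to[10] = c0;    to[11] = c1;
        to[12] = a0;    to[13] = a1;
        to[14] = b0;    to[15] = b1;

        from += 16;
        to += 16;
    }
}
```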
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
2005-01-03 6:19 [parisc-linux] DIFF use 6-regs in copy_user_page_asm Grant Grundler
@ 2005-01-04 6:13 ` Randolph Chung
2005-01-04 8:23 ` Ryan Bradetich
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Randolph Chung @ 2005-01-04 6:13 UTC (permalink / raw)
To: Grant Grundler; +Cc: parisc-linux
> This patch adds one more cycle between the load and store of a
> given register by using three pairs of registers instead of two.
> I had previously quoted one of the PA-8xxx papers that indicated
> L1 cache was 2 cycles latency.
> With this diff, the unrolled part of the loop now meets that.
> The prolog and epilogue obviously cannot.
>
> If anyone can show me a workload that improves with this diff,
> I'll apply it. Otherwise it's just an academic exercise.
i'd like to see numbers too, but i doubt you will see any. it appears
that at least newer PA cpus do a sufficient amount of internal
instruction reordering that you don't see a difference as long as there
are enough pending instructions to keep the pipeline busy.
randolph
--
Randolph Chung
Debian GNU/Linux Developer, hppa/ia64 ports
http://www.tausq.org/
_______________________________________________
parisc-linux mailing list
parisc-linux@lists.parisc-linux.org
http://lists.parisc-linux.org/mailman/listinfo/parisc-linux
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
2005-01-04 6:13 ` Randolph Chung
@ 2005-01-04 8:23 ` Ryan Bradetich
2005-01-04 8:29 ` Randolph Chung
2005-01-04 14:51 ` Michael S. Zick
2005-01-04 23:39 ` Grant Grundler
2 siblings, 1 reply; 12+ messages in thread
From: Ryan Bradetich @ 2005-01-04 8:23 UTC (permalink / raw)
To: Randolph Chung; +Cc: parisc-linux
Randolph,
Is this something that would benefit older processors? Say a PCX-T
processor? Is this something worth testing? I'm pretty sure I
have a PCX-T processor around I could boot up and test.
- Ryan
On Mon, 2005-01-03 at 22:13 -0800, Randolph Chung wrote:
> > This patch adds one more cycle between the load and store of a
> > given register by using three pairs of registers instead of two.
> > I had previously quoted one of the PA-8xxx papers that indicated
> > L1 cache was 2 cycles latency.
> > With this diff, the unrolled part of the loop now meets that.
> > The prolog and epilogue obviously cannot.
> >
> > If anyone can show me a workload that improves with this diff,
> > I'll apply it. Otherwise it's just an academic exercise.
>
> i'd like to see numbers too, but i doubt you will see any. it appears
> that at least newer PA cpus do a sufficient amount of internal
> instruction reordering that you don't see a difference as long as there
> are enough pending instructions to keep the pipeline busy.
>
> randolph
--
Ryan Bradetich <rbradetich@uswest.net>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
2005-01-04 8:23 ` Ryan Bradetich
@ 2005-01-04 8:29 ` Randolph Chung
2005-01-04 13:12 ` Joel Soete
0 siblings, 1 reply; 12+ messages in thread
From: Randolph Chung @ 2005-01-04 8:29 UTC (permalink / raw)
To: Ryan Bradetich; +Cc: parisc-linux
> Is this something that would benefit older processors? Say a PCX-T
> processor? Is this something worth testing? I'm pretty sure I
> have a PCX-T processor around I could boot up and test.
i have no idea, but we can test it and see. i've read some pdfs which
suggest that all pa20 processors can do this reordering, but i don't
know about pa11 processors. in any case, empirical results are always
better than speculation :)
randolph
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
2005-01-04 8:29 ` Randolph Chung
@ 2005-01-04 13:12 ` Joel Soete
0 siblings, 0 replies; 12+ messages in thread
From: Joel Soete @ 2005-01-04 13:12 UTC (permalink / raw)
To: Randolph Chung, Ryan Bradetich, Grant Grundler; +Cc: parisc-linux
[-- Attachment #1: Type: text/plain, Size: 2842 bytes --]
> -- Original Message --
> Date: Tue, 4 Jan 2005 00:29:40 -0800
> From: Randolph Chung <randolph@tausq.org>
> To: Ryan Bradetich <rbradetich@uswest.net>
> Cc: parisc-linux@lists.parisc-linux.org
> Reply-To: Randolph Chung <randolph@tausq.org>
> Subject: Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
>
>
> > Is this something that would benefit older processors? Say a PCX-T
> > processor? Is this something worth testing? I'm pretty sure I
> > have a PCX-T processor around I could boot up and test.
>
> i have no idea, but we can test it and see. i've read some pdfs which
> suggest that all pa20 processors can do this reordering, but i don't
> know about pa11 processors. in any case, empirical results are always
> better than speculation :)
>
This is only foreseen for pa2.0 ;-)
That said, I also tried re-ordering (with only 4 regs) on my c110, but the test
case (previous cpup0.c) showed neither improvement nor degradation!
Grant, I also rewrote the test case (attached here as cpup1.c) to reproduce your
proposal and compare it with what is currently in the kernel; here are some results:
patst005:/Develop/jso/Var/Comp# time ./cpup1 ; time ./cpup2
real 0m3.964s
user 0m0.209s
sys 0m3.753s
real 0m3.976s
user 0m0.190s
sys 0m3.781s
patst005:/Develop/jso/Var/Comp# time ./cpup2 ; time ./cpup1
real 0m3.946s
user 0m0.218s
sys 0m3.725s
real 0m3.961s
user 0m0.196s
sys 0m3.762s
patst005:/Develop/jso/Var/Comp# time ./cpup0 ; time ./cpup2 ; time ./cpup1
real 0m4.046s
user 0m0.354s
sys 0m3.691s
real 0m3.940s
user 0m0.225s
sys 0m3.712s
real 0m3.946s
user 0m0.208s
sys 0m3.734s
patst005:/Develop/jso/Var/Comp# time ./cpup0 ; time ./cpup1 ; time ./cpup2
real 0m4.068s
user 0m0.342s
sys 0m3.724s
real 0m3.948s
user 0m0.194s
sys 0m3.752s
real 0m3.936s
user 0m0.193s
sys 0m3.740s
patst005:/Develop/jso/Var/Comp# time ./cpup1 ; time ./cpup0 ; time ./cpup2
real 0m3.928s
user 0m0.202s
sys 0m3.725s
real 0m4.067s
user 0m0.329s
sys 0m3.731s
real 0m3.946s
user 0m0.224s
sys 0m3.718s
patst005:/Develop/jso/Var/Comp# time ./cpup2 ; time ./cpup0 ; time ./cpup1
real 0m3.942s
user 0m0.213s
sys 0m3.727s
real 0m4.086s
user 0m0.333s
sys 0m3.749s
real 0m3.956s
user 0m0.208s
sys 0m3.745s
Unfortunately (as in the previous test), I was not able to show an
actual benefit :-(
(please note that I had to reduce BUFFSIZE because my b2k has only
256MB of RAM ;-)
hth,
Joel
[-- Attachment #2: cpup1.c --]
[-- Type: text/x-csrc, Size: 4086 bytes --]
/*
** CoPy User Page asm tester
**
** gcc -O2 -o cpup0 cpup.c vanilla 32-bit loop
** -march=2.0 -DLP64 -o cpup1 64-bit, 4ld + 4st sequences
** -march=2.0 -DLP64 -DV1 -o cpup2 64-bit, 4regs, 2ld/2st bundles
** -march=2.0 -DLP64 -DUSE6REGS -o cpup3 64-bit, 6 regs, 2ld/2st bundles
*/
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <asm/page.h>
void __copy_user_page_asm(void *to, void *from)
{
register unsigned long __to __asm__ ("r26") = (unsigned long)to;
register unsigned long __from __asm__ ("r25") = (unsigned long)from;
#ifdef LP64
asm volatile ("ldd 0(%0), %%r19\n"
" ldi 32, %%r1\n"
"1: ldd 8(%0), %%r20\n"
" ldw 256(%0), %%r0\n"
" ldd 16(%0), %%r21\n"
" ldd 24(%0), %%r22\n"
#ifdef USE6REGS
" nop\n"
" nop\n"
" ldd 32(%0), %%r23\n"
" ldd 40(%0), %%r24\n"
" std %%r19, 0(%1)\n"
" std %%r20, 8(%1)\n"
" ldd 48(%0), %%r19\n"
" ldd 56(%0), %%r20\n"
" std %%r21, 16(%1)\n"
" std %%r22, 24(%1)\n"
" ldd 64(%0), %%r21\n"
" ldd 72(%0), %%r22\n"
" std %%r23, 32(%1)\n"
" std %%r24, 40(%1)\n"
" ldd 80(%0), %%r23\n"
" ldd 88(%0), %%r24\n"
" std %%r19, 48(%1)\n"
" std %%r20, 56(%1)\n"
" ldd 96(%0), %%r19\n"
" ldd 104(%0), %%r20\n"
" std %%r21, 64(%1)\n"
" std %%r22, 72(%1)\n"
" ldd 112(%0), %%r21\n"
" ldd 120(%0), %%r22\n"
" std %%r23, 80(%%r26)\n"
" std %%r24, 88(%%r26)\n"
" ldo 128(%0), %0\n"
" std %%r19, 96(%1)\n"
" std %%r20, 104(%1)\n"
" std %%r21, 112(%1)\n"
" std %%r22, 120(%1)\n"
" ldo 128(%1), %1\n"
" addib,> -1, %%r1, 1b\n"
" ldd 0(%0), %%r19"
#else /* !USE6REGS */
" std %%r19, 0(%1)\n"
" std %%r20, 8(%1)\n"
" ldd 32(%0), %%r19\n"
" ldd 40(%0), %%r20\n"
" std %%r21, 16(%1)\n"
" std %%r22, 24(%1)\n"
" ldd 48(%0), %%r21\n"
" ldd 56(%0), %%r22\n"
" std %%r19, 32(%1)\n"
" std %%r20, 40(%1)\n"
" ldd 64(%0), %%r19\n"
" ldd 72(%0), %%r20\n"
" std %%r21, 48(%1)\n"
" std %%r22, 56(%1)\n"
" ldd 80(%0), %%r21\n"
" ldd 88(%0), %%r22\n"
" std %%r19, 64(%1)\n"
" std %%r20, 72(%1)\n"
" ldd 96(%0), %%r19\n"
" ldd 104(%0), %%r20\n"
" std %%r21, 80(%1)\n"
" std %%r22, 88(%1)\n"
" ldd 112(%0), %%r21\n"
" ldd 120(%0), %%r22\n"
" std %%r19, 96(%1)\n"
" std %%r20, 104(%1)\n"
" ldo 128(%0), %0\n"
" std %%r21, 112(%1)\n"
" std %%r22, 120(%1)\n"
" ldo 128(%1), %1\n"
" addib,> -1, %%r1, 1b\n"
" ldd 0(%0), %%r19"
#endif /* USE6REGS */
#else /* !LP64 */
asm volatile ("ldi 64, %%r1\n"
"1: ldw 0(%0), %%r19\n"
" ldw 4(%0), %%r20\n"
" ldw 8(%0), %%r21\n"
" ldw 12(%0), %%r22\n"
" stw %%r19, 0(%1)\n"
" stw %%r20, 4(%1)\n"
" stw %%r21, 8(%1)\n"
" stw %%r22, 12(%1)\n"
" ldw 16(%0), %%r19\n"
" ldw 20(%0), %%r20\n"
" ldw 24(%0), %%r21\n"
" ldw 28(%0), %%r22\n"
" stw %%r19, 16(%1)\n"
" stw %%r20, 20(%1)\n"
" stw %%r21, 24(%1)\n"
" stw %%r22, 28(%1)\n"
" ldw 32(%0), %%r19\n"
" ldw 36(%0), %%r20\n"
" ldw 40(%0), %%r21\n"
" ldw 44(%0), %%r22\n"
" stw %%r19, 32(%1)\n"
" stw %%r20, 36(%1)\n"
" stw %%r21, 40(%1)\n"
" stw %%r22, 44(%1)\n"
" ldw 48(%0), %%r19\n"
" ldw 52(%0), %%r20\n"
" ldw 56(%0), %%r21\n"
" ldw 60(%0), %%r22\n"
" stw %%r19, 48(%1)\n"
" stw %%r20, 52(%1)\n"
" stw %%r21, 56(%1)\n"
" stw %%r22, 60(%1)\n"
" ldo 64(%1), %1\n"
" addib,> -1, %%r1, 1b\n"
" ldo 64(%0), %0"
#endif /* LP64 */
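: "+r"(__from), "+r"(__to)	/* both pointers are advanced by the asm */
:
: "r1", "r19", "r20", "r21", "r22", "r23", "r24", "memory" );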
}
#define BUFFSIZE (1024*1024*64)
#define PPB (BUFFSIZE/PAGE_SIZE) /* Pages Per Buff */
int main(int argc, char * * argv, char * * env)
{
char *MemSrc, *MemDst;
unsigned long j;
MemSrc = malloc(BUFFSIZE);
MemDst = malloc(BUFFSIZE);
if (MemSrc == NULL || MemDst == NULL)
return 1;
/* initialize first page of MemSrc */
for (j = 0; j < (PAGE_SIZE/sizeof(unsigned long)) ; j++) {
((unsigned long *) MemSrc)[j]=j;
}
/* clone first page to remaining pages - page at a time */
for (j = 1; j < PPB ; j++) {
__copy_user_page_asm( MemSrc + (j*PAGE_SIZE), MemSrc);
}
/* Clone Src to Dest - page at a time */
for (j = 0; j < PPB ; j++) {
__copy_user_page_asm( MemDst + (j*PAGE_SIZE),
MemSrc + (j*PAGE_SIZE));
}
free(MemSrc);
free(MemDst);
return 0;
}
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
2005-01-04 6:13 ` Randolph Chung
2005-01-04 8:23 ` Ryan Bradetich
@ 2005-01-04 14:51 ` Michael S. Zick
2005-01-04 16:02 ` Grant Grundler
2005-01-04 23:39 ` Grant Grundler
2 siblings, 1 reply; 12+ messages in thread
From: Michael S. Zick @ 2005-01-04 14:51 UTC (permalink / raw)
To: parisc-linux
On Tue January 4 2005 00:13, Randolph Chung wrote:
> > This patch adds one more cycle between the load and store of a
> > given register by using three pairs of registers instead of two.
> > I had previously quoted one of the PA-8xxx papers that indicated
> > L1 cache was 2 cycles latency.
> > With this diff, the unrolled part of the loop now meets that.
> > The prolog and epilogue obviously cannot.
> >
> > If anyone can show me a workload that improves with this diff,
> > I'll apply it. Otherwise it's just an academic exercise.
>
> i'd like to see numbers too, but i doubt you will see any. it appears
> that at least newer PA cpus do a sufficient amount of internal
> instruction reordering that you don't see a difference as long as there
> are enough pending instructions to keep the pipeline busy.
>
One other possibility to keep in mind when testing: this is an
io sequence, no heavy register-register computations.
If the 4-regs + internal reordering has already saturated the
cpu-external busses...
Then even if the cpu-core can execute the 6-regs + internal
reordering more effectively, you will never see it outside
of the cpu.
You might have to borrow a bus analyzer from the hardware
lab to see if this is affecting your tests.
Mike
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
2005-01-04 14:51 ` Michael S. Zick
@ 2005-01-04 16:02 ` Grant Grundler
[not found] ` <200501041142.44400.mszick@wolfbutter.com>
0 siblings, 1 reply; 12+ messages in thread
From: Grant Grundler @ 2005-01-04 16:02 UTC (permalink / raw)
To: Michael S. Zick; +Cc: parisc-linux
On Tue, Jan 04, 2005 at 08:51:19AM -0600, Michael S. Zick wrote:
> One other possibility to keep in mind when testing: this is an
> io sequence, no heavy register-register computations.
Not entirely. Sure, to load data into cache it's "io".
But the bulk of the loop is to move data from one cacheline
to another.
> You might have to borrow a bus analyzer from the hardware
> lab to see if this is affecting your tests.
I don't. If 6-regs works better then I use it.
If CPU performance counter support worked, we could figure
out where the bottlenecks were for both cases.
grant
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
[not found] ` <200501041142.44400.mszick@wolfbutter.com>
@ 2005-01-04 20:09 ` Grant Grundler
0 siblings, 0 replies; 12+ messages in thread
From: Grant Grundler @ 2005-01-04 20:09 UTC (permalink / raw)
To: Michael S. Zick; +Cc: parisc-linux
On Tue, Jan 04, 2005 at 11:42:44AM -0600, Michael S. Zick wrote:
> > I don't. If 6-regs works better then I use it.
> Agreed,
> If you can find a difference now.
I can using CR16. That's what I was proposing before.
> I was speaking of the other case:
> If they appear to work the same now.
Yes, but I don't need an analyzer to guess at what might be causing
the bottleneck. The "Linux Way" is to keep trying different variants
until we find a better one (or get fed up). I know using an analyzer
is more precise _once_ it's setup.
Joel,
I've hacked your cpup1.c and committed it to build-tools.
Please send me diffs in the future.
You would have noticed that you reference %r26 directly in two
of the asm statements.
The new version implements most of what I was proposing:
o use CR16 to measure copy_user_page_asm()
o run multiple iterations to avoid page faults/TLB activity
o drops -DV1 code (4ld/4st in 64-bit case)
o implements -DUSE6REGS
o uses 64MB src/dest buffer
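A minimal sketch of such a measurement loop follows. It is not the committed cpup.c: `read_ticks`, `time_copy`, `copy_page_memcpy` and `PAGE_BYTES` are hypothetical names, memcpy() stands in for copy_user_page_asm(), and the clock_gettime() fallback is an assumption added only so the sketch builds on machines without CR16. On PA-RISC the "mfctl 16" form mirrors the kernel's mfctl() macro.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

#define PAGE_BYTES 4096u   /* assumed page size */

/* Read a free-running counter: CR16 on PA-RISC, nanoseconds elsewhere. */
static unsigned long read_ticks(void)
{
#if defined(__hppa__)
    unsigned long t;

    __asm__ __volatile__("mfctl 16, %0" : "=r"(t));
    return t;
#else
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (unsigned long)ts.tv_sec * 1000000000ul
         + (unsigned long)ts.tv_nsec;
#endif
}

typedef void (*copy_page_fn)(void *to, const void *from);

/* Stand-in for copy_user_page_asm so the sketch builds anywhere. */
static void copy_page_memcpy(void *to, const void *from)
{
    memcpy(to, from, PAGE_BYTES);
}

/*
 * Copy `pages` pages `passes` times, timing every call. The fastest call
 * of pass 0 is reported in *first_min and the fastest call of all later
 * passes in *later_min, so page-fault and TLB-miss cost paid during the
 * first pass is kept apart from the steady-state figure.
 */
static void time_copy(copy_page_fn copy, char *dst, const char *src,
                      size_t pages, unsigned passes,
                      unsigned long *first_min, unsigned long *later_min)
{
    unsigned pass;
    size_t p;

    assert(pages > 0 && passes > 1);
    *first_min = *later_min = (unsigned long)-1;
    for (pass = 0; pass < passes; pass++) {
        unsigned long *best = (pass == 0) ? first_min : later_min;

        for (p = 0; p < pages; p++) {
            unsigned long t0, dt;

            t0 = read_ticks();
            copy(dst + p * PAGE_BYTES, src + p * PAGE_BYTES);
            dt = read_ticks() - t0;
            if (dt < *best)
                *best = dt;
        }
    }
}
```

Unsigned subtraction keeps the delta correct even if the counter wraps once between the two reads.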
grundler <536>gcc -O2 -o cpup0 cpup.c
grundler <537>gcc -march=2.0 -DLP64 -o cpup2 cpup.c
grundler <538>gcc -march=2.0 -DLP64 -DDUSE6REGS -o cpup3 cpup.c
grundler <539>./cpup0
First Loop : min 14393 avg 17156 median 16219
Later Loops : min 9696 avg 10819 median 10432
grundler <540>./cpup2
First Loop : min 11381 avg 14120 median 13168
Later Loops : min 5844 avg 7695 median 7595
grundler <541>./cpup3
First Loop : min 11441 avg 14102 median 13167
Later Loops : min 5898 avg 7702 median 7594
This might be useful for measuring cost of TLB insertion too.
Please verify the code is generating the stats properly before
taking the above numbers as The Truth.
(650 MHz A500 running SMP 2.6.10-rc3-pa6)
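To put the tick counts in more familiar units, the conversion can be sketched as below. Two assumptions are made that the thread does not confirm: each reported figure is the CR16 tick count for one 4096-byte page copy, and CR16 ticks at the CPU clock rate. The helper names are hypothetical.

```c
#include <assert.h>

/* Bytes copied per timer tick for one 4096-byte page (assumed size). */
static double bytes_per_tick(double ticks_per_page)
{
    assert(ticks_per_page > 0.0);
    return 4096.0 / ticks_per_page;
}

/*
 * MB/s (decimal megabytes), assuming the timer ticks at clock_mhz.
 * bytes/tick * (clock_mhz * 1e6 ticks/s) / (1e6 bytes/MB)
 *   = bytes/tick * clock_mhz
 */
static double mbytes_per_sec(double ticks_per_page, double clock_mhz)
{
    return bytes_per_tick(ticks_per_page) * clock_mhz;
}
```

Under those assumptions the 64-bit median of 7595 ticks works out to roughly 0.54 bytes per tick, or about 350 MB/s at 650 MHz, and the 32-bit median of 10432 ticks to about 255 MB/s.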
I also noticed that even this gets different results on the first
vs successive invocations:
grundler <545>./cpup3
First Loop : min 11277 avg 17749 median 13143
Later Loops : min 5806 avg 8156 median 7589
grundler <546>./cpup3
First Loop : min 11217 avg 14250 median 13154
Later Loops : min 5904 avg 7726 median 7604
grundler <547>./cpup3
First Loop : min 11528 avg 14147 median 13162
Later Loops : min 5877 avg 7722 median 7600
grundler <548>./cpup3
First Loop : min 11548 avg 14202 median 13177
Later Loops : min 5866 avg 7727 median 7600
grundler <549>./cpup3
First Loop : min 11577 avg 14150 median 13173
Later Loops : min 5877 avg 7729 median 7607
Ignoring the first invocation, the results are quite precise: +- 4/7725
Adding another "ldw 192(%0), %%r0" to the bottom of the loop
reduced that a bit more. Previously we only prefetched one of the
two cachelines processed in the loop.
The 5th run output was:
grundler <561>./cpup3
First Loop : min 9831 avg 12950 median 12000
Later Loops : min 5790 avg 7529 median 7375
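The two-prefetch idea can be illustrated in portable C. This is a sketch, not the asm that was measured: `copy_page_prefetch2` is a hypothetical name, `__builtin_prefetch` is the GCC/Clang builtin used here in place of the "ldw N(%0), %%r0" idiom, and 64-byte cachelines and a 4096-byte page are assumed. Unlike the asm, the sketch does not prefetch past the end of the source page, to stay within what C pointer arithmetic allows.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy one 4096-byte page in 128-byte chunks and prefetch both 64-byte
 * source lines that a later iteration will read.
 */
static void copy_page_prefetch2(void *to, const void *from)
{
    const char *src = from;
    char *dst = to;
    size_t off;

    for (off = 0; off < 4096; off += 128) {
        if (off + 256 < 4096)
            __builtin_prefetch(src + off + 256);  /* line prefetched before */
        if (off + 192 < 4096)
            __builtin_prefetch(src + off + 192);  /* the additional line */
        memcpy(dst + off, src + off, 128);
    }
}
```

With one prefetch at +256 per 128-byte iteration only every other line is requested early; adding +192 covers the lines in between.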
hth,
grant
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
2005-01-04 6:13 ` Randolph Chung
2005-01-04 8:23 ` Ryan Bradetich
2005-01-04 14:51 ` Michael S. Zick
@ 2005-01-04 23:39 ` Grant Grundler
2005-01-05 0:00 ` John David Anglin
2 siblings, 1 reply; 12+ messages in thread
From: Grant Grundler @ 2005-01-04 23:39 UTC (permalink / raw)
To: Randolph Chung; +Cc: parisc-linux
On Mon, Jan 03, 2005 at 10:13:42PM -0800, Randolph Chung wrote:
> i'd like to see numbers too, but i doubt you will see any.
I hope the new cpup.c will help us provide precise values.
[ The following is more intended for folks like Joel than Randolph.
I'm pretty sure Randolph understands how CPUs work. ]
> it appears
> that at least newer PA cpus do a sufficient amount of internal
> instruction reordering that you don't see a difference as long as there
> are enough pending instructions to keep the pipeline busy.
The pipeline on PA-8x00 can load 4 instructions at a time.
How those 4 instructions get executed depends on interlocks (e.g.
register is still in use by previous insn) and if the CPU/mem units are
available. AFAIK, PA8x00 processors support 2 loads/cycle,
2 stores/cycle, 2 shift+merge ops/cycle, 2 FP Div/cycle, etc.
"Keeping the pipeline busy" is just as much a function of instruction
scheduling by programmer/compiler as re-ordering by the CPU.
One really needs to keep track of which resources are available.
Anytime a resource is not available, the pipeline "stalls".
ia64 calls this "bubbles" and the best description I've found
of "bubbles" is in the MySQL perf paper by Philippe Bonnet:
http://www.gelato.org/resources/bookspapers.php
(or http://www.gelato.org/pdf/mysql_itanium2_perf.pdf)
While parisc might be more primitive in how it deals with "stalls"
than ia64, I expect most of the same principles apply to both when
writing code.
grant
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
2005-01-04 23:39 ` Grant Grundler
@ 2005-01-05 0:00 ` John David Anglin
2005-01-05 22:01 ` Michael S. Zick
2005-01-06 22:55 ` Grant Grundler
0 siblings, 2 replies; 12+ messages in thread
From: John David Anglin @ 2005-01-05 0:00 UTC (permalink / raw)
To: Grant Grundler; +Cc: parisc-linux
> The pipeline on PA-8x00 can load 4 instructions at a time.
> How those 4 instructions get executed depends on interlocks (e.g.
> register is still in use by previous insn) and if the CPU/mem units are
> available. AFAIK, PA8x00 processors support 2 loads/cycle,
> 2 stores/cycle, 2 shift+merge ops/cycle, 2 FP Div/cycle, etc.
> "Keeping the pipeline busy" is just as much a function of instruction
> scheduling by programmer/compiler as re-ordering by the CPU.
> One really needs to keep track of which resources are available.
This is what the GCC machine definition says:
;; The PA8000 has a large (56) entry reorder buffer that is split between
;; memory and non-memory operations.
;;
;; The PA8000 can issue two memory and two non-memory operations per cycle to
;; the function units, with the exception of branches and multi-output
;; instructions. The PA8000 can retire two non-memory operations per cycle
;; and two memory operations per cycle, only one of which may be a store.
;;
;; Given the large reorder buffer, the processor can hide most latencies.
;; According to HP, they've got the best results by scheduling for retirement
;; bandwidth with limited latency scheduling for floating point operations.
;; Latency for integer operations and memory references is ignored.
;;
;;
;; We claim floating point operations have a 2 cycle latency and are
;; fully pipelined, except for div and sqrt which are not pipelined and
;; take from 17 to 31 cycles to complete.
;;
;; It's worth noting that there is no way to saturate all the functional
;; units on the PA8000 as there is not enough issue bandwidth.
Comments?
Dave
--
J. David Anglin dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada (613) 990-0752 (FAX: 952-6602)
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
2005-01-05 0:00 ` John David Anglin
@ 2005-01-05 22:01 ` Michael S. Zick
2005-01-06 22:55 ` Grant Grundler
1 sibling, 0 replies; 12+ messages in thread
From: Michael S. Zick @ 2005-01-05 22:01 UTC (permalink / raw)
To: parisc-linux
On Tue January 4 2005 18:00, John David Anglin wrote:
>
> This is what the GCC machine definition says:
>
> ;;
> ;; It's worth noting that there is no way to saturate all the functional
> ;; units on the PA8000 as there is not enough issue bandwidth.
>
> Comments?
>
Thanks,
That is what I tried to say but couldn't find the reference.
Mike
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
2005-01-05 0:00 ` John David Anglin
2005-01-05 22:01 ` Michael S. Zick
@ 2005-01-06 22:55 ` Grant Grundler
1 sibling, 0 replies; 12+ messages in thread
From: Grant Grundler @ 2005-01-06 22:55 UTC (permalink / raw)
To: John David Anglin; +Cc: parisc-linux
On Tue, Jan 04, 2005 at 07:00:43PM -0500, John David Anglin wrote:
> Grant Grundler wrote:
> > The pipeline on PA-8x00 can load 4 instructions at a time.
My statement happens to be correct but is misleading.
And I misunderstood it too.
The system does "fetch" 4 insn at a time but can't execute/retire
4 mem ops at a time. PCX-U can only handle two memory ops per cycle
as (correctly) described by the GCC machine definition below.
> This is what the GCC machine definition says:
Thanks for digging this up. I definitely was confused about
some of the details.
> ;; The PA8000 has a large (56) entry reorder buffer that is split between
> ;; memory and non-memory operations.
> ;;
> ;; The PA8000 can issue two memory and two non-memory operations per cycle to
> ;; the function units, with the exception of branches and multi-output
> ;; instructions. The PA8000 can retire two non-memory operations per cycle
> ;; and two memory operations per cycle, only one of which may be a store.
Yes, this is correct. I'm told this is true for all PA-8x00 CPUs.
I was confused. The "load-store unit" describes one unit that can
only do one load or one store.
Each instruction queue is also divided into even and odd slots.
The functional units are assigned to either an even or odd slot.
I.e. two loads in adjacent slots will use both load-store units, while
two loads in odd slots will serialize. That shouldn't be an issue
for copy_user_page_asm but it would be good if gcc is aware of it.
Further, PCX-U cache accesses are serialized when they go to the same cacheline.
The copy_user_page_asm loop should be restructured to interleave accesses
to the two 64 byte cachelines handled in each iteration of the loop.
This won't matter for stores at the "tail end" of the loop but will
help with the loads at the front of the loop.
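One possible shape of that restructuring, modelled in C for a single 128-byte chunk, is shown below. It is only a sketch of the access order: `copy_128_interleaved` is a hypothetical name, 64-byte cachelines are assumed, and a C compiler may reorder the statements, so the real change would have to be made in the asm itself.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Copy 128 bytes, alternating loads between the two 64-byte source
 * lines instead of draining the first line before touching the second.
 */
static void copy_128_interleaved(uint64_t *to, const uint64_t *from)
{
    const uint64_t *line_a = from;      /* bytes 0..63 */
    const uint64_t *line_b = from + 8;  /* bytes 64..127 */
    uint64_t a[8], b[8];
    int i;

    for (i = 0; i < 8; i++) {
        a[i] = line_a[i];   /* load from first cacheline */
        b[i] = line_b[i];   /* next load goes to the other cacheline */
    }
    for (i = 0; i < 8; i++) {
        to[i] = a[i];
        to[8 + i] = b[i];
    }
}
```

The model uses sixteen temporaries for clarity; an asm version limited to six registers would need to interleave stores as well.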
> ;; Given the large reorder buffer, the processor can hide most latencies.
Yes. I'm told the re-order buffers (aka "memory ops queue") should be
sufficient to hide scheduling issues. So we don't have to sweat
too many details if we get it "close enough".
> ;; According to HP, they've got the best results by scheduling for retirement
> ;; bandwidth with limited latency scheduling for floating point operations.
> ;; Latency for integer operations and memory references is ignored.
Along this line, according to the specs, PCX-U has a "best case load
latency of three cycles".
This is because of "one cycle for address calculation and 2 cycles for
off chip cache access". Later CPUs are 2 cycles.
Restructuring the USE6REGS code might help PCX-U.
> ;; It's worth noting that there is no way to saturate all the functional
> ;; units on the PA8000 as there is not enough issue bandwidth.
Agreed.
But we do saturate the load-store units in the copy_user_page_asm code.
At least for short bursts.
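The retirement limits quoted from the GCC machine description give a simple lower bound on cycles per page, sketched below. `retire_bound_cycles` is a hypothetical helper; the bound ignores cache misses, the ldo/addib instructions and any issue restrictions, so it says nothing about what the loop actually achieves.

```c
#include <assert.h>

/*
 * Lower bound on cycles from retirement limits alone: at most
 * mem_per_cycle memory ops retire per cycle, of which at most
 * stores_per_cycle may be stores.
 */
static unsigned long retire_bound_cycles(unsigned long loads,
                                         unsigned long stores,
                                         unsigned long mem_per_cycle,
                                         unsigned long stores_per_cycle)
{
    unsigned long mem_ops = loads + stores;
    unsigned long by_mem, by_store;

    assert(mem_per_cycle > 0 && stores_per_cycle > 0);
    by_mem = (mem_ops + mem_per_cycle - 1) / mem_per_cycle;      /* round up */
    by_store = (stores + stores_per_cycle - 1) / stores_per_cycle;
    return by_mem > by_store ? by_mem : by_store;
}
```

For the 64-bit loop, 32 iterations of 16 ldd plus one prefetch ldw give 544 loads, and 32 iterations of 16 std give 512 stores, so retire_bound_cycles(544, 512, 2, 1) yields 528 cycles per 4096-byte page.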
hth,
grant
^ permalink raw reply [flat|nested] 12+ messages in thread
Thread overview: 12+ messages
2005-01-03 6:19 [parisc-linux] DIFF use 6-regs in copy_user_page_asm Grant Grundler
2005-01-04 6:13 ` Randolph Chung
2005-01-04 8:23 ` Ryan Bradetich
2005-01-04 8:29 ` Randolph Chung
2005-01-04 13:12 ` Joel Soete
2005-01-04 14:51 ` Michael S. Zick
2005-01-04 16:02 ` Grant Grundler
[not found] ` <200501041142.44400.mszick@wolfbutter.com>
2005-01-04 20:09 ` Grant Grundler
2005-01-04 23:39 ` Grant Grundler
2005-01-05 0:00 ` John David Anglin
2005-01-05 22:01 ` Michael S. Zick
2005-01-06 22:55 ` Grant Grundler