* [parisc-linux] DIFF use 6-regs in copy_user_page_asm
@ 2005-01-03  6:19 Grant Grundler
  2005-01-04  6:13 ` Randolph Chung
  0 siblings, 1 reply; 12+ messages in thread
From: Grant Grundler @ 2005-01-03  6:19 UTC (permalink / raw)
  To: parisc-linux


This patch adds one more cycle between the load and store of a
given register by using three pairs of registers instead of two.
I had previously quoted one of the PA-8xxx papers that indicated
L1 cache was 2 cycles latency.
With this diff, the unrolled part of the loop now meets that.
The prolog and epilogue obviously cannot.
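
For readers who do not read PA-RISC assembly, the load/store ordering of the patched loop body can be sketched in portable C. This is only an illustration under stated assumptions: the function name `copy_block_6regs` is made up, the six temporaries stand in for %r19 through %r24, and a C compiler is free to reorder the statements, so it shows the intended order rather than reproducing the asm.

```c
#include <stdint.h>

/* Hypothetical helper: copy one 128-byte block (16 doublewords) using three
 * pairs of temporaries, in the same order as the patched loop body.
 * src and dst must not overlap. */
void copy_block_6regs(uint64_t *dst, const uint64_t *src)
{
	uint64_t a0, a1, b0, b1, c0, c1;	/* r19/r20, r21/r22, r23/r24 */

	a0 = src[0];  a1 = src[1];		/* bundle 1 */
	b0 = src[2];  b1 = src[3];		/* bundle 2 */
	c0 = src[4];  c1 = src[5];		/* bundle 3 */
	dst[0] = a0;  dst[1] = a1;
	a0 = src[6];  a1 = src[7];		/* bundle 4 */
	dst[2] = b0;  dst[3] = b1;
	b0 = src[8];  b1 = src[9];		/* bundle 5 */
	dst[4] = c0;  dst[5] = c1;
	c0 = src[10]; c1 = src[11];		/* bundle 6 */
	dst[6] = a0;  dst[7] = a1;
	a0 = src[12]; a1 = src[13];		/* bundle 7 */
	dst[8] = b0;  dst[9] = b1;
	b0 = src[14]; b1 = src[15];		/* bundle 8 */
	dst[10] = c0; dst[11] = c1;
	dst[12] = a0; dst[13] = a1;		/* drain the last two pairs */
	dst[14] = b0; dst[15] = b1;
}
```

Each pair is stored two bundles after it was loaded, where the two-pair version stores it one bundle later.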

If anyone can show me a workload that improves with this diff,
I'll apply it. Otherwise it's just an academic exercise.

BTW, I don't really trust build-tools/cpup.c unless someone
can convince me it's really running in wide mode and not getting
lots of page faults/page zeroing to interfere with the test.
Maybe we need to iterate over a smaller buffer (e.g. 64MB) several times
and ignore the first iteration. Maybe also record cr16 values between
calls and find the minimum and median *after* all the copying
is done.
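
A minimal sketch of the "compute the statistics after all the copying is done" step, assuming the per-call tick deltas have already been stored in an array. The names `tick_stats` and `summarize_ticks` are hypothetical and not taken from build-tools/cpup.c.

```c
#include <stddef.h>
#include <stdlib.h>

struct tick_stats { unsigned long min, avg, median; };

/* qsort comparator for unsigned long, written to avoid overflow */
static int cmp_ulong(const void *a, const void *b)
{
	unsigned long x = *(const unsigned long *)a;
	unsigned long y = *(const unsigned long *)b;
	return (x > y) - (x < y);
}

/* Sorts samples[] in place and returns min, mean and median; n must be > 0. */
struct tick_stats summarize_ticks(unsigned long *samples, size_t n)
{
	struct tick_stats s;
	unsigned long long sum = 0;
	size_t i;

	qsort(samples, n, sizeof(samples[0]), cmp_ulong);
	for (i = 0; i < n; i++)
		sum += samples[i];
	s.min = samples[0];
	s.avg = (unsigned long)(sum / n);
	s.median = samples[n / 2];	/* upper median when n is even */
	return s;
}
```

Sorting only after the timed loop keeps the bookkeeping out of the measurement itself.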

thanks,
grant

ps. The "alignment doesn't matter" comment is too short. It really
    means the alignment doesn't matter for the rest of the loop.
    ie I don't need to add nops to separate the pairs of "std" insns.

Index: arch/parisc/kernel/pacache.S
===================================================================
RCS file: /var/cvs/linux-2.6/arch/parisc/kernel/pacache.S,v
retrieving revision 1.14
diff -u -p -r1.14 pacache.S
--- arch/parisc/kernel/pacache.S	30 Dec 2004 08:07:48 -0000	1.14
+++ arch/parisc/kernel/pacache.S	3 Jan 2005 05:59:19 -0000
@@ -306,51 +306,52 @@ copy_user_page_asm:
 
 	ldd		0(%r25), %r19		/* bundle 1 */
 	ldi		32, %r1                 /* PAGE_SIZE/128 == 32 */
-
 1:	ldd		8(%r25), %r20
 	ldw		256(%r25), %r0		/* prefetch 4 cacheline ahead */
 
 	ldd		16(%r25), %r21		/* bundle 2 */
 	ldd		24(%r25), %r22
+	nop		/* preserve alignment of quads */
+	nop		/* preserve alignment of quads */
+
+	ldd		32(%r25), %r23		/* bundle 3 */
+	ldd		40(%r25), %r24
 	std		%r19, 0(%r26)
 	std		%r20, 8(%r26)
 
-	ldd		32(%r25), %r19		/* bundle 3 */
-	ldd		40(%r25), %r20
+	ldd		48(%r25), %r19		/* bundle 4 */
+	ldd		56(%r25), %r20
 	std		%r21, 16(%r26)
 	std		%r22, 24(%r26)
 
-	ldd		48(%r25), %r21		/* bundle 4 */
-	ldd		56(%r25), %r22
-	std		%r19, 32(%r26)
-	std		%r20, 40(%r26)
-
-	ldd		64(%r25), %r19		/* bundle 5 */
-	ldd		72(%r25), %r20
-	std		%r21, 48(%r26)
-	std		%r22, 56(%r26)
-
-	ldd		80(%r25), %r21		/* bundle 6 */
-	ldd		88(%r25), %r22
-	std		%r19, 64(%r26)
-	std		%r20, 72(%r26)
+	ldd		64(%r25), %r21		/* bundle 5 */
+	ldd		72(%r25), %r22
+	std		%r23, 32(%r26)
+	std		%r24, 40(%r26)
+
+	ldd		80(%r25), %r23		/* bundle 6 */
+	ldd		88(%r25), %r24
+	std		%r19, 48(%r26)
+	std		%r20, 56(%r26)
 
 	ldd		 96(%r25), %r19		/* bundle 7 */
 	ldd		104(%r25), %r20
-	std		%r21, 80(%r26)
-	std		%r22, 88(%r26)
+	std		%r21, 64(%r26)
+	std		%r22, 72(%r26)
 
 	ldd		112(%r25), %r21		/* bundle 8 */
 	ldd		120(%r25), %r22
+	std		%r23, 80(%r26)
+	std		%r24, 88(%r26)
+
+	ldo		128(%r25), %r25		/* alignment doesn't matter */
 	std		%r19, 96(%r26)
 	std		%r20, 104(%r26)


* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
  2005-01-03  6:19 [parisc-linux] DIFF use 6-regs in copy_user_page_asm Grant Grundler
@ 2005-01-04  6:13 ` Randolph Chung
  2005-01-04  8:23   ` Ryan Bradetich
                     ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Randolph Chung @ 2005-01-04  6:13 UTC (permalink / raw)
  To: Grant Grundler; +Cc: parisc-linux

> This patch adds one more cycle between the load and store of a
> given register by using three pairs of registers instead of two.
> I had previously quoted one of the PA-8xxx papers that indicated
> L1 cache was 2 cycles latency.
> With this diff, the unrolled part of the loop now meets that.
> The prolog and epilogue obviously cannot.
> 
> If anyone can show me a workload that improves with this diff,
> I'll apply it. Otherwise it's just an academic exercise.

i'd like to see numbers too, but i doubt you will see any. it appears
that at least newer PA cpus do a sufficient amount of internal
instruction reordering that you don't see a difference as long as there
are enough pending instructions to keep the pipeline busy.

randolph
-- 
Randolph Chung
Debian GNU/Linux Developer, hppa/ia64 ports
http://www.tausq.org/
_______________________________________________
parisc-linux mailing list
parisc-linux@lists.parisc-linux.org
http://lists.parisc-linux.org/mailman/listinfo/parisc-linux


* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
  2005-01-04  6:13 ` Randolph Chung
@ 2005-01-04  8:23   ` Ryan Bradetich
  2005-01-04  8:29     ` Randolph Chung
  2005-01-04 14:51   ` Michael S. Zick
  2005-01-04 23:39   ` Grant Grundler
  2 siblings, 1 reply; 12+ messages in thread
From: Ryan Bradetich @ 2005-01-04  8:23 UTC (permalink / raw)
  To: Randolph Chung; +Cc: parisc-linux

Randolph,

Is this something that would benefit older processors? Say a PCX-T
processor?  Is this something worth testing?  I'm pretty sure I 
have a PCX-T processor around I could boot up and test.

- Ryan

On Mon, 2005-01-03 at 22:13 -0800, Randolph Chung wrote:
> > This patch adds one more cycle between the load and store of a
> > given register by using three pairs of registers instead of two.
> > I had previously quoted one of the PA-8xxx papers that indicated
> > L1 cache was 2 cycles latency.
> > With this diff, the unrolled part of the loop now meets that.
> > The prolog and epilogue obviously cannot.
> > 
> > If anyone can show me a workload that improves with this diff,
> > I'll apply it. Otherwise it's just an academic exercise.
> 
> i'd like to see numbers too, but i doubt you will see any. it appears
> that at least newer PA cpus do a sufficient amount of internal
> instruction reordering that you don't see a difference as long as there
> are enough pending instructions to keep the pipeline busy.
> 
> randolph
-- 
Ryan Bradetich <rbradetich@uswest.net>


* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
  2005-01-04  8:23   ` Ryan Bradetich
@ 2005-01-04  8:29     ` Randolph Chung
  2005-01-04 13:12       ` Joel Soete
  0 siblings, 1 reply; 12+ messages in thread
From: Randolph Chung @ 2005-01-04  8:29 UTC (permalink / raw)
  To: Ryan Bradetich; +Cc: parisc-linux

> Is this something that would benefit older processors? Say a PCX-T
> processor?  Is this something worth testing?  I'm pretty sure I 
> have a PCX-T processor around I could boot up and test.

i have no idea, but we can test it and see. i've read some pdfs which
suggest that all pa20 processors can do this reordering, but i don't
know about pa11 processors. in any case, empirical results are always
better than speculation :)

randolph

* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
  2005-01-04  8:29     ` Randolph Chung
@ 2005-01-04 13:12       ` Joel Soete
  0 siblings, 0 replies; 12+ messages in thread
From: Joel Soete @ 2005-01-04 13:12 UTC (permalink / raw)
  To: Randolph Chung, Ryan Bradetich, Grant Grundler; +Cc: parisc-linux

[-- Attachment #1: Type: text/plain, Size: 2842 bytes --]


> -- Original Message --
> Date: Tue, 4 Jan 2005 00:29:40 -0800
> From: Randolph Chung <randolph@tausq.org>
> To: Ryan Bradetich <rbradetich@uswest.net>
> Cc: parisc-linux@lists.parisc-linux.org
> Reply-To: Randolph Chung <randolph@tausq.org>
> Subject: Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
>
>
> > Is this something that would benefit older processors? Say a PCX-T
> > processor?  Is this something worth testing?  I'm pretty sure I
> > have a PCX-T processor around I could boot up and test.
>
> i have no idea, but we can test it and see. i've read some pdfs which
> suggest that all pa20 processors can do this reordering, but i don't
> know about pa11 processors. in any case, empirical results are always
> better than speculation :)
>
This is only intended for pa2.0 ;-)

That said, I also tried re-ordering (with only 4 regs) on my c110, but the
test case (the previous cpup0.c) showed neither improvement nor degradation!

Grant, I also rewrote the test case (attached here as cpup1.c) to reproduce
your proposal and compare it with what is currently in the kernel. Here are
some results:
patst005:/Develop/jso/Var/Comp# time ./cpup1 ; time ./cpup2

real    0m3.964s
user    0m0.209s
sys     0m3.753s

real    0m3.976s
user    0m0.190s
sys     0m3.781s
patst005:/Develop/jso/Var/Comp# time ./cpup2 ; time ./cpup1

real    0m3.946s
user    0m0.218s
sys     0m3.725s

real    0m3.961s
user    0m0.196s
sys     0m3.762s
patst005:/Develop/jso/Var/Comp# time ./cpup0 ; time ./cpup2 ; time ./cpup1

real    0m4.046s
user    0m0.354s
sys     0m3.691s

real    0m3.940s
user    0m0.225s
sys     0m3.712s

real    0m3.946s
user    0m0.208s
sys     0m3.734s
patst005:/Develop/jso/Var/Comp# time ./cpup0 ; time ./cpup1 ; time ./cpup2

real    0m4.068s
user    0m0.342s
sys     0m3.724s

real    0m3.948s
user    0m0.194s
sys     0m3.752s

real    0m3.936s
user    0m0.193s
sys     0m3.740s
patst005:/Develop/jso/Var/Comp# time ./cpup1 ; time ./cpup0 ; time ./cpup2

real    0m3.928s
user    0m0.202s
sys     0m3.725s

real    0m4.067s
user    0m0.329s
sys     0m3.731s

real    0m3.946s
user    0m0.224s
sys     0m3.718s
patst005:/Develop/jso/Var/Comp# time ./cpup2 ; time ./cpup0 ; time ./cpup1

real    0m3.942s
user    0m0.213s
sys     0m3.727s

real    0m4.086s
user    0m0.333s
sys     0m3.749s

real    0m3.956s
user    0m0.208s
sys     0m3.745s

Unfortunately (as in the previous test), I could not demonstrate an
actual benefit :-(

(Please note that I had to reduce BUFFSIZE because my b2k has only
256MB of RAM ;-)

hth,
    Joel

[-- Attachment #2: cpup1.c --]
[-- Type: text/x-csrc, Size: 4086 bytes --]


/*
** CoPy User Page asm tester
**
** gcc -O2 -o cpup0 cpup.c		vanilla 32-bit loop
** 	-march=2.0 -DLP64 -o cpup1	64-bit, 4ld + 4st sequences
**	-march=2.0 -DLP64 -DV1 -o cpup2	64-bit, 4regs, 2ld/2st bundles
** 	-march=2.0 -DLP64 -DUSE6REGS -o cpup3 64-bit, 6 regs, 2ld/2st bundles
*/
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <asm/page.h>

void __copy_user_page_asm(void *to, void *from)
{
	register unsigned long __to __asm__ ("r26") =  (unsigned long)to;
	register unsigned long __from __asm__ ("r25") =  (unsigned long)from;

#ifdef LP64

asm volatile ("ldd		0(%0), %%r19\n"
"	ldi		32, %%r1\n"
"1:	ldd		8(%0), %%r20\n"
"	ldw		256(%0), %%r0\n"
"	ldd		16(%0), %%r21\n"
"	ldd		24(%0), %%r22\n"
#ifdef USE6REGS
"	nop\n"
"	nop\n"
"	ldd		32(%0), %%r23\n"
"	ldd		40(%0), %%r24\n"
"	std		%%r19, 0(%1)\n"
"	std		%%r20, 8(%1)\n"
"	ldd		48(%0), %%r19\n"
"	ldd		56(%0), %%r20\n"
"	std		%%r21, 16(%1)\n"
"	std		%%r22, 24(%1)\n"
"	ldd		64(%0), %%r21\n"
"	ldd		72(%0), %%r22\n"
"	std		%%r23, 32(%1)\n"
"	std		%%r24, 40(%1)\n"
"	ldd		80(%0), %%r23\n"
"	ldd		88(%0), %%r24\n"
"	std		%%r19, 48(%1)\n"
"	std		%%r20, 56(%1)\n"
"	ldd		 96(%0), %%r19\n"
"	ldd		104(%0), %%r20\n"
"	std		%%r21, 64(%1)\n"
"	std		%%r22, 72(%1)\n"
"	ldd		112(%0), %%r21\n"
"	ldd		120(%0), %%r22\n"
"	std		%%r23, 80(%%r26)\n"
"	std		%%r24, 88(%%r26)\n"
"	ldo		128(%0), %0\n"
"	std		%%r19,  96(%1)\n"
"	std		%%r20, 104(%1)\n"
"	std		%%r21, 112(%1)\n"
"	std		%%r22, 120(%1)\n"
"	ldo		128(%1), %1\n"
"	addib,>		-1, %%r1, 1b\n"
"	ldd		0(%0), %%r19"
#else	/* !USE6REGS */ 
"	std		%%r19, 0(%1)\n"
"	std		%%r20, 8(%1)\n"
"	ldd		32(%0), %%r19\n"
"	ldd		40(%0), %%r20\n"
"	std		%%r21, 16(%1)\n"
"	std		%%r22, 24(%1)\n"
"	ldd		48(%0), %%r21\n"
"	ldd		56(%0), %%r22\n"
"	std		%%r19, 32(%1)\n"
"	std		%%r20, 40(%1)\n"
"	ldd		64(%0), %%r19\n"
"	ldd		72(%0), %%r20\n"
"	std		%%r21, 48(%1)\n"
"	std		%%r22, 56(%1)\n"
"	ldd		80(%0), %%r21\n"
"	ldd		88(%0), %%r22\n"
"	std		%%r19, 64(%1)\n"
"	std		%%r20, 72(%1)\n"
"	ldd		 96(%0), %%r19\n"
"	ldd		104(%0), %%r20\n"
"	std		%%r21, 80(%1)\n"
"	std		%%r22, 88(%1)\n"
"	ldd		112(%0), %%r21\n"
"	ldd		120(%0), %%r22\n"
"	std		%%r19, 96(%1)\n"
"	std		%%r20, 104(%1)\n"
"	ldo		128(%0), %0\n"
"	std		%%r21, 112(%1)\n"
"	std		%%r22, 120(%1)\n"
"	ldo		128(%1), %1\n"
"	addib,>		-1, %%r1, 1b\n"
"	ldd		0(%0), %%r19"
#endif	/* USE6REGS */

#else	/* !LP64 */

asm volatile ("ldi		64, %%r1\n"
"1:	ldw		0(%0), %%r19\n"
"	ldw		4(%0), %%r20\n"
"	ldw		8(%0), %%r21\n"
"	ldw		12(%0), %%r22\n"
"	stw		%%r19, 0(%1)\n"
"	stw		%%r20, 4(%1)\n"
"	stw		%%r21, 8(%1)\n"
"	stw		%%r22, 12(%1)\n"
"	ldw		16(%0), %%r19\n"
"	ldw		20(%0), %%r20\n"
"	ldw		24(%0), %%r21\n"
"	ldw		28(%0), %%r22\n"
"	stw		%%r19, 16(%1)\n"
"	stw		%%r20, 20(%1)\n"
"	stw		%%r21, 24(%1)\n"
"	stw		%%r22, 28(%1)\n"
"	ldw		32(%0), %%r19\n"
"	ldw		36(%0), %%r20\n"
"	ldw		40(%0), %%r21\n"
"	ldw		44(%0), %%r22\n"
"	stw		%%r19, 32(%1)\n"
"	stw		%%r20, 36(%1)\n"
"	stw		%%r21, 40(%1)\n"
"	stw		%%r22, 44(%1)\n"
"	ldw		48(%0), %%r19\n"
"	ldw		52(%0), %%r20\n"
"	ldw		56(%0), %%r21\n"
"	ldw		60(%0), %%r22\n"
"	stw		%%r19, 48(%1)\n"
"	stw		%%r20, 52(%1)\n"
"	stw		%%r21, 56(%1)\n"
"	stw		%%r22, 60(%1)\n"
"	ldo		64(%1), %1\n"
"	addib,>		-1, %%r1, 1b\n"
"	ldo		64(%0), %0"
#endif	/* LP64 */
	: "+r"(__from), "+r"(__to)	/* the asm advances both pointers */
	:
	: "r1", "r19", "r20", "r21", "r22", "r23", "r24", "memory");
}

#define BUFFSIZE	(1024*1024*64)
#define PPB		(BUFFSIZE/PAGE_SIZE)	/* Pages Per Buff */

int main(int argc, char * * argv, char * * env)
{

	char *MemSrc, *MemDst;
	unsigned long j;

	MemSrc = malloc(BUFFSIZE);
	MemDst = malloc(BUFFSIZE);

	if (MemSrc == NULL || MemDst == NULL)
		return 1;

	/* initialize first page of MemSrc */
	for (j = 0; j < (PAGE_SIZE/sizeof(unsigned long)) ; j++) {
		((unsigned long *) MemSrc)[j]=j;
	}

	/* clone first page to remaining pages - page at a time */
	for (j = 1; j < PPB ; j++) {
		__copy_user_page_asm( MemSrc + (j*PAGE_SIZE), MemSrc);
	}

	/* Clone Src to Dest - page at a time */
	for (j = 0; j < PPB ; j++) {
		__copy_user_page_asm( MemDst + (j*PAGE_SIZE),
				      MemSrc + (j*PAGE_SIZE));
	}

	free(MemSrc);
	free(MemDst);
	return 0;
}



* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
  2005-01-04  6:13 ` Randolph Chung
  2005-01-04  8:23   ` Ryan Bradetich
@ 2005-01-04 14:51   ` Michael S. Zick
  2005-01-04 16:02     ` Grant Grundler
  2005-01-04 23:39   ` Grant Grundler
  2 siblings, 1 reply; 12+ messages in thread
From: Michael S. Zick @ 2005-01-04 14:51 UTC (permalink / raw)
  To: parisc-linux

On Tue January 4 2005 00:13, Randolph Chung wrote:
> > This patch adds one more cycle between the load and store of a
> > given register by using three pairs of registers instead of two.
> > I had previously quoted one of the PA-8xxx papers that indicated
> > L1 cache was 2 cycles latency.
> > With this diff, the unrolled part of the loop now meets that.
> > The prolog and epilogue obviously cannot.
> > 
> > If anyone can show me a workload that improves with this diff,
> > I'll apply it. Otherwise it's just an academic exercise.
> 
> i'd like to see numbers too, but i doubt you will see any. it appears
> that at least newer PA cpus do a sufficient amount of internal
> instruction reordering that you don't see a difference as long as there
> are enough pending instructions to keep the pipeline busy.
> 
One other possibility to keep in mind when testing: this is an
io sequence, no heavy register-register computations.

If the 4-regs + internal reordering has already saturated the
cpu-external busses...

Then even if the cpu-core can execute the 6-regs + internal
reordering more effectively, you will never see it outside
of the cpu.

You might have to borrow a bus analyzer from the hardware
lab to see if this is affecting your tests.

Mike

* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
  2005-01-04 14:51   ` Michael S. Zick
@ 2005-01-04 16:02     ` Grant Grundler
       [not found]       ` <200501041142.44400.mszick@wolfbutter.com>
  0 siblings, 1 reply; 12+ messages in thread
From: Grant Grundler @ 2005-01-04 16:02 UTC (permalink / raw)
  To: Michael S. Zick; +Cc: parisc-linux

On Tue, Jan 04, 2005 at 08:51:19AM -0600, Michael S. Zick wrote:
> One other possibility to keep in mind when testing: this is an
> io sequence, no heavy register-register computations.

Not entirely. Sure, to load data into cache it's "io".
But the bulk of the loop is to move data from one cacheline
to another.

> You might have to borrow a bus analyzer from the hardware
> lab to see if this is affecting your tests.

I don't. If 6-regs works better then I use it.
If CPU performance counter support worked, we could figure
out where the bottlenecks were for both cases.

grant

* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
       [not found]       ` <200501041142.44400.mszick@wolfbutter.com>
@ 2005-01-04 20:09         ` Grant Grundler
  0 siblings, 0 replies; 12+ messages in thread
From: Grant Grundler @ 2005-01-04 20:09 UTC (permalink / raw)
  To: Michael S. Zick; +Cc: parisc-linux

On Tue, Jan 04, 2005 at 11:42:44AM -0600, Michael S. Zick wrote:
> > I don't. If 6-regs works better then I use it.
> Agreed,
> If you can find a difference now.

I can, using CR16. That's what I was proposing before.

> I was speaking of the other case:
> If they appear to work the same now.

Yes, but I don't need an analyzer to guess at what might be causing
the bottleneck. The "Linux Way" is to keep trying different variants
until we find a better one (or get fed up). I know using an analyzer
is more precise _once_ it's set up.

Joel,
I've hacked your cpup1.c and committed it to build-tools.
Please send me diffs in the future.
You would have noticed that you reference %r26 directly in two
of the asm statements.

The new version implements most of what I was proposing:
o use CR16 to measure copy_user_page_asm()
o run multiple iterations to avoid page faults/TLB activity

o drops -DV1 code (4ld/4st in 64-bit case)
o implements -DUSE6REGS
o uses 64MB src/dest buffer
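
The measurement approach in the list above can be sketched roughly as follows. This is not the committed cpup.c; it is a minimal illustration under stated assumptions: `read_ticks` and `time_page_copies` are hypothetical names, `memcpy` stands in for the asm copy routine, the page size is assumed to be 4096 bytes, and on non-PA-RISC hosts `clock()` replaces CR16 so the code still compiles.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

#define PAGE_BYTES 4096UL	/* assumed page size */

/* Hypothetical tick source: CR16 (interval timer) on PA-RISC,
 * clock() ticks elsewhere so the sketch builds on other hosts. */
static unsigned long long read_ticks(void)
{
#if defined(__hppa__)
	unsigned long t;
	__asm__ __volatile__("mfctl %%cr16, %0" : "=r"(t));
	return t;
#else
	return (unsigned long long)clock();
#endif
}

/* Copy every page of src to dst `passes` times and store one tick delta per
 * page copy in samples[] (pass-major order). Returns the number of samples
 * written; the first `pages` entries belong to the first pass. */
size_t time_page_copies(char *dst, const char *src, size_t pages,
			size_t passes, unsigned long long *samples)
{
	size_t p, j, n = 0;

	for (p = 0; p < passes; p++) {
		for (j = 0; j < pages; j++) {
			unsigned long long t0 = read_ticks();

			memcpy(dst + j * PAGE_BYTES, src + j * PAGE_BYTES,
			       PAGE_BYTES);
			samples[n++] = read_ticks() - t0;
		}
	}
	return n;
}
```

Keeping the first pass separate from the later ones is what allows page-fault and TLB-miss costs to be reported apart from the steady-state copy cost.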

grundler <536>gcc -O2 -o cpup0 cpup.c
grundler <537>gcc -march=2.0 -DLP64 -o cpup2 cpup.c
grundler <538>gcc -march=2.0 -DLP64 -DDUSE6REGS -o cpup3 cpup.c
grundler <539>./cpup0
          First Loop : min  14393  avg  17156  median  16219
         Later Loops : min   9696  avg  10819  median  10432
grundler <540>./cpup2
          First Loop : min  11381  avg  14120  median  13168
         Later Loops : min   5844  avg   7695  median   7595
grundler <541>./cpup3
          First Loop : min  11441  avg  14102  median  13167
         Later Loops : min   5898  avg   7702  median   7594

This might be useful for measuring cost of TLB insertion too.

Please verify the code is generating the stats properly before
taking the above numbers as The Truth.
(650 MHz A500 running SMP 2.6.10-rc3-pa6)

I also noticed that even this gets different results on the first
vs successive invocations:
grundler <545>./cpup3
          First Loop : min  11277  avg  17749  median  13143
         Later Loops : min   5806  avg   8156  median   7589
grundler <546>./cpup3
          First Loop : min  11217  avg  14250  median  13154
         Later Loops : min   5904  avg   7726  median   7604
grundler <547>./cpup3
          First Loop : min  11528  avg  14147  median  13162
         Later Loops : min   5877  avg   7722  median   7600
grundler <548>./cpup3
          First Loop : min  11548  avg  14202  median  13177
         Later Loops : min   5866  avg   7727  median   7600
grundler <549>./cpup3
          First Loop : min  11577  avg  14150  median  13173
         Later Loops : min   5877  avg   7729  median   7607

Ignoring the first invocation, the results are quite precise: +- 4/7725
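
As a worked check of that figure, the spread can be computed from the "Later Loops" averages of the second through fifth runs above (7726, 7722, 7727, 7729). The helper name `spread` is hypothetical.

```c
#include <stddef.h>

/* Midpoint and half-width of the range covered by v[0..n-1]; n must be > 0. */
void spread(const unsigned long *v, size_t n, double *mid, double *half)
{
	unsigned long lo = v[0], hi = v[0];
	size_t i;

	for (i = 1; i < n; i++) {
		if (v[i] < lo)
			lo = v[i];
		if (v[i] > hi)
			hi = v[i];
	}
	*mid = (lo + hi) / 2.0;
	*half = (hi - lo) / 2.0;
}
```

For those four values the midpoint is 7725.5 and the half-width is 3.5, which matches the quoted +- 4/7725 and is well under 0.1% of the measured value.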

Adding another "ldw 192(%0), %%r0" to the bottom of the loop
reduced that a bit more. Before, we only prefetched one of the
two cachelines processed in the loop.
The 5th run output was:
grundler <561>./cpup3 
          First Loop : min   9831  avg  12950  median  12000
         Later Loops : min   5790  avg   7529  median   7375
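
The idea of prefetching both cachelines of an upcoming block can be sketched in C with the GCC/Clang builtin `__builtin_prefetch`. This is an illustration only: the function name and the constants are assumptions, `memcpy` stands in for the unrolled ldd/std sequence, and the exact address the extra ldw touches depends on where it sits relative to the `ldo` that advances the source pointer, which the mail does not spell out.

```c
#include <stddef.h>
#include <string.h>

#define LINE   64u	/* cacheline size assumed in the thread */
#define BLOCK 128u	/* bytes copied per loop iteration */
#define AHEAD 256u	/* prefetch distance used by the existing ldw */

/* Copy len bytes in 128-byte blocks, prefetching both 64-byte lines of the
 * block that lies AHEAD bytes in front of the current one. len is assumed
 * to be a multiple of BLOCK; any remainder is left uncopied. */
void copy_with_prefetch(unsigned char *dst, const unsigned char *src,
			size_t len)
{
	size_t off;

	for (off = 0; off + BLOCK <= len; off += BLOCK) {
		/* Only prefetch addresses that are still inside the buffer. */
		if (off + AHEAD + BLOCK <= len) {
			__builtin_prefetch(src + off + AHEAD);
			__builtin_prefetch(src + off + AHEAD + LINE);
		}
		memcpy(dst + off, src + off, BLOCK);
	}
}
```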

hth,
grant

* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
  2005-01-04  6:13 ` Randolph Chung
  2005-01-04  8:23   ` Ryan Bradetich
  2005-01-04 14:51   ` Michael S. Zick
@ 2005-01-04 23:39   ` Grant Grundler
  2005-01-05  0:00     ` John David Anglin
  2 siblings, 1 reply; 12+ messages in thread
From: Grant Grundler @ 2005-01-04 23:39 UTC (permalink / raw)
  To: Randolph Chung; +Cc: parisc-linux

On Mon, Jan 03, 2005 at 10:13:42PM -0800, Randolph Chung wrote:
> i'd like to see numbers too, but i doubt you will see any.

I hope the new cpup.c will help us provide precise values.


[ The following is more intended for folks like Joel than Randolph.
 I'm pretty sure Randolph understands how CPUs work. ]

>  it appears
> that at least newer PA cpus do a sufficient amount of internal
> instruction reordering that you don't see a difference as long as there
> are enough pending instructions to keep the pipeline busy.

The pipeline on PA-8x00 can load 4 instructions at a time.
How those 4 instructions get executed depends on interlocks (e.g.
register is still in use by previous insn) and if the CPU/mem units are
available.  AFAIK, PA8x00 processors support 2 loads/cycle,
2 stores/cycle, 2 shift+merge ops/cycle, 2 FP Div/cycle, etc. 
"Keeping the pipeline busy" is just as much a function of instruction
scheduling by programmer/compiler as re-ordering by the CPU.
One really needs to keep track of which resources are available.

Anytime a resource is not available, the pipeline "stalls". 
ia64 calls this "bubbles" and the best description I've found
of "bubbles" is in the MySQL perf paper by Philippe Bonnet:
	http://www.gelato.org/resources/bookspapers.php

(or http://www.gelato.org/pdf/mysql_itanium2_perf.pdf)

While parisc might be more primitive in how it deals with "stalls"
than ia64, I expect most of the same principles apply to both when
writing code.

grant

* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
  2005-01-04 23:39   ` Grant Grundler
@ 2005-01-05  0:00     ` John David Anglin
  2005-01-05 22:01       ` Michael S. Zick
  2005-01-06 22:55       ` Grant Grundler
  0 siblings, 2 replies; 12+ messages in thread
From: John David Anglin @ 2005-01-05  0:00 UTC (permalink / raw)
  To: Grant Grundler; +Cc: parisc-linux

> The pipeline on PA-8x00 can load 4 instructions at a time.
> How those 4 instructions get executed depends on interlocks (e.g.
> register is still in use by previous insn) and if the CPU/mem units are
> available.  AFAIK, PA8x00 processors support 2 loads/cycle,
> 2 stores/cycle, 2 shift+merge ops/cycle, 2 FP Div/cycle, etc. 
> "Keeping the pipeline busy" is just as much a function of instruction
> scheduling by programmer/compiler as re-ordering by the CPU.
> One really needs to keep track of which resources are available.

This is what the GCC machine definition says:

;; The PA8000 has a large (56) entry reorder buffer that is split between
;; memory and non-memory operations.
;;
;; The PA8000 can issue two memory and two non-memory operations per cycle to
;; the function units, with the exception of branches and multi-output
;; instructions.  The PA8000 can retire two non-memory operations per cycle
;; and two memory operations per cycle, only one of which may be a store.
;;
;; Given the large reorder buffer, the processor can hide most latencies.
;; According to HP, they've got the best results by scheduling for retirement
;; bandwidth with limited latency scheduling for floating point operations.
;; Latency for integer operations and memory references is ignored.
;;
;;
;; We claim floating point operations have a 2 cycle latency and are
;; fully pipelined, except for div and sqrt which are not pipelined and
;; take from 17 to 31 cycles to complete.
;;
;; It's worth noting that there is no way to saturate all the functional
;; units on the PA8000 as there is not enough issue bandwidth.

Comments?

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)

* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
  2005-01-05  0:00     ` John David Anglin
@ 2005-01-05 22:01       ` Michael S. Zick
  2005-01-06 22:55       ` Grant Grundler
  1 sibling, 0 replies; 12+ messages in thread
From: Michael S. Zick @ 2005-01-05 22:01 UTC (permalink / raw)
  To: parisc-linux

On Tue January 4 2005 18:00, John David Anglin wrote:

> 
> This is what the GCC machine definition says:
> 

> ;;
> ;; It's worth noting that there is no way to saturate all the functional
> ;; units on the PA8000 as there is not enough issue bandwidth.
> 
> Comments?
>
Thanks,
That is what I tried to say but couldn't find the reference.

Mike

* Re: [parisc-linux] DIFF use 6-regs in copy_user_page_asm
  2005-01-05  0:00     ` John David Anglin
  2005-01-05 22:01       ` Michael S. Zick
@ 2005-01-06 22:55       ` Grant Grundler
  1 sibling, 0 replies; 12+ messages in thread
From: Grant Grundler @ 2005-01-06 22:55 UTC (permalink / raw)
  To: John David Anglin; +Cc: parisc-linux

On Tue, Jan 04, 2005 at 07:00:43PM -0500, John David Anglin wrote:
> Grant Grundler wrote:
> > The pipeline on PA-8x00 can load 4 instructions at a time.

My statement happens to be correct but is misleading.
And I misunderstood it too.

The system does "fetch" 4 insn at a time but can't execute/retire
4 mem ops at a time.  PCX-U can only handle two memory ops per cycle
as (correctly) described by the GCC machine definition below.


> This is what the GCC machine definition says:

Thanks for digging this up. I definitely was confused about
some of the details.

> ;; The PA8000 has a large (56) entry reorder buffer that is split between
> ;; memory and non-memory operations.
> ;;
> ;; The PA8000 can issue two memory and two non-memory operations per cycle to
> ;; the function units, with the exception of branches and multi-output
> ;; instructions.  The PA8000 can retire two non-memory operations per cycle
> ;; and two memory operations per cycle, only one of which may be a store.

Yes, this is correct.  I'm told this is true for all PA-8x00 CPUs.
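
The retirement limits quoted above give a simple lower bound on cycles for a stream of memory operations. The sketch below is a back-of-the-envelope calculation, not a model of the CPU; the function name is hypothetical, and the bound ignores cache misses, branches and issue restrictions.

```c
/* Lower bound on cycles needed to retire `loads` + `stores` memory ops when
 * at most two memory ops retire per cycle and at most one may be a store. */
unsigned long min_retire_cycles(unsigned long loads, unsigned long stores)
{
	unsigned long total = loads + stores;
	unsigned long by_width = (total + 1) / 2;	/* two memory ops per cycle */

	/* The one-store-per-cycle limit dominates when stores outnumber loads. */
	return by_width > stores ? by_width : stores;
}
```

One iteration of the 64-bit loop has 16 ldd, 1 prefetch ldw and 16 std, so `min_retire_cycles(17, 16)` gives 17 cycles per 128 bytes under these assumptions.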

I was confused. The "load-store unit" describes one unit that can
only do one load or one store.

Each instruction queue is also divided into even and odd slots.
The functional units are assigned to either an even or odd slot.
I.e., two loads in adjacent slots will use both load-store units;
two loads in odd slots will serialize. That shouldn't be an issue
for copy_user_page_asm but it would be good if gcc is aware of it.

Further, the PCX-U cache accesses are serialized when they go to the same cacheline.
The copy_user_page_asm loop should be restructured to interleave accesses
to the two 64 byte cachelines handled in each iteration of the loop.
This won't matter for stores at the "tail end" of the loop but will
help with the loads at the front of the loop.
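
One possible reading of "interleave" is to alternate the doubleword loads between the two 64-byte lines of each 128-byte block. The sketch below only generates such an offset order; the function name is hypothetical, and whether this exact order is the one that helps on PCX-U is an assumption that would need measuring.

```c
#include <stddef.h>

/* Fill order[0..15] with byte offsets for the 16 doubleword loads of one
 * 128-byte block, alternating between the two 64-byte cachelines:
 * 0, 64, 8, 72, 16, 80, ... */
void interleaved_offsets(unsigned order[16])
{
	size_t i;

	for (i = 0; i < 16; i++)
		order[i] = (unsigned)((i % 2) * 64 + (i / 2) * 8);
}
```

With this order, no two consecutive loads touch the same cacheline.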

> ;; Given the large reorder buffer, the processor can hide most latencies.

Yes. I'm told the re-order buffers (aka "memory ops queue") should be
sufficient to hide scheduling issues. So we don't have to sweat
too many details if we get it "close enough".

> ;; According to HP, they've got the best results by scheduling for retirement
> ;; bandwidth with limited latency scheduling for floating point operations.
> ;; Latency for integer operations and memory references is ignored.

Along this line, according to the specs, PCX-U has a "best case load
latency of three cycles".
This is because of "one cycle for address calculation and 2 cycles for
off chip cache access". Later CPUs are 2 cycles.
Restructuring the USE6REGs code might help PCX-U.

> ;; It's worth noting that there is no way to saturate all the functional
> ;; units on the PA8000 as there is not enough issue bandwidth.

Agreed.
But we do saturate the load-store units in the copy_user_page_asm code.
At least for short bursts.

hth,
grant