* Re: copy_user_page_asm suggested 64bit improvment [Was: [parisc-linux] clear user page test]
       [not found] <418A80E8000124B5@mail-6-bnl.tiscali.it>
@ 2004-12-27  7:36 ` Grant Grundler
  2004-12-27 10:40   ` Joel Soete
                     ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Grant Grundler @ 2004-12-27  7:36 UTC (permalink / raw)
  To: Joel Soete; +Cc: parisc-linux

On Tue, Dec 21, 2004 at 02:37:47PM +0100, Joel Soete wrote:
> Hello all,

Joel,
I trim your postings to only include the parts I need to respond to.
Could you please do the same?

I hate having to scroll down pages of stuff to get to your comment.
That's probably why no one else responded.


> As promised, here is a cleaner (?)  patch:
> --- arch/parisc/kernel/pacache.S.Orig	2004-12-20 08:28:23.000000000 +0100
> +++ arch/parisc/kernel/pacache.S	2004-12-20 14:49:35.000000000 +0100
> @@ -295,7 +295,52 @@
>  	.callinfo NO_CALLS
>  	.entry
> 
> -	ldi		64, %r1
> +	pdtlb		0(%r25)
> +	pdtlb		0(%r26)

Sorry - I missed why the pdtlb needs to be added.
Could you explain?

Won't the pdtlb guarantee at least one trap per page copied?
I would hope we guarantee the D-TLB is "clean" when calling this function.

> +#ifdef __LP64__
> +
> +	ldi		32, %r1			/* PAGE_SIZE/128 == 32 */
> +
> +1:	ldd		0(%r25), %r19
> +	ldd		8(%r25), %r20
> +	ldd		16(%r25), %r21
> +	ldd		24(%r25), %r22
> +	std		%r19, 0(%r26)
> +	std		%r20, 8(%r26)
> +	std		%r21, 16(%r26)
> +	std		%r22, 24(%r26)

This looks good.

PA2.0 can retire 2 loads and 2 stores per cycle IFF there are no dependencies.

That means we want something like this:

+1:	ldd		0(%r25), %r19
+	ldd		8(%r25), %r20
+	ldd		16(%r25), %r21
+	ldd		24(%r25), %r22
+	std		%r19, 0(%r26)
+	std		%r20, 8(%r26)
+	ldd		32(%r25), %r19
+	ldd		40(%r25), %r20
+	std		%r21, 16(%r26)
+	std		%r22, 24(%r26)
+	ldd		48(%r25), %r21
+	ldd		56(%r25), %r22
+	std		%r19, 32(%r26)
+	std		%r20, 40(%r26)
...
+	ldd		112(%r25), %r21
+	ldd		120(%r25), %r22
+	std		%r19, 96(%r26)
+	std		%r20, 104(%r26)
+	ldo		128(%r25), %r25
+	std		%r21, 112(%r26)
+	std		%r22, 120(%r26)
+	ADDIB>		-1, %r1, 1b
+	ldo		128(%r26), %r26
...

[ Note that I've moved the "ldo" around as well!]

More distance between the "ldd %rX" and the corresponding
"std %rX" is generally a good thing.
This routine could use more registers in the loop to get more "distance".

It costs us 1 cycle to save two registers on the stack.
Once the data is in L1-Cache, IFF the CPU needs more than one cycle
to retire successive loads, we gain several cycles assuming additional
register pairs are used multiple times per loop.
Anyone know how many cycles ldd from L1 takes?

I expect gcc encodes those times so it can schedule stuff optimally.
But I've forgotten where to find the PA2.0 scheduling magic.
It might be worth just letting gcc unroll the loop for us since
SR0 (kernel) is implied in all the ldd/std instructions.
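
For illustration only, the ordering described above (issue the next pair
of loads before storing the previous pair) could be sketched in C roughly
as follows. The names copy_page_sketch and SKETCH_PAGE_SIZE are made up
for this example, 4 KiB pages are assumed, and the compiler remains free
to reschedule the statements, so this is a sketch of the idea and not a
replacement for the hand-written loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, matching "PAGE_SIZE/128 == 32" in the patch. */
#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_DWORDS    (SKETCH_PAGE_SIZE / sizeof(uint64_t))

/*
 * Hypothetical helper, not kernel code: copy one page of doublewords,
 * issuing the next pair of loads before storing the previous pair.
 */
static void copy_page_sketch(uint64_t *restrict to, const uint64_t *restrict from)
{
	uint64_t a, b, c, d;
	size_t i;

	assert(to != NULL && from != NULL);

	a = from[0];			/* prolog: first two loads */
	b = from[1];

	for (i = 0; i + 4 < SKETCH_DWORDS; i += 4) {
		c = from[i + 2];	/* load the next pair ... */
		d = from[i + 3];
		to[i]     = a;		/* ... then store the previous one */
		to[i + 1] = b;
		a = from[i + 4];
		b = from[i + 5];
		to[i + 2] = c;
		to[i + 3] = d;
	}

	c = from[i + 2];		/* epilog: last four doublewords */
	d = from[i + 3];
	to[i]     = a;
	to[i + 1] = b;
	to[i + 2] = c;
	to[i + 3] = d;
}
```

Comparing the generated assembly against the hand-scheduled version
would show whether gcc keeps the loads and stores paired this way.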


> -	extrd,u		%r26,56,32, %r26		/* convert phys addr to tlb insert format */
> -	extrd,u		%r23,56,32, %r23		/* convert phys addr to tlb insert format */
> -	depd		%r24,63,22, %r28		/* Form aliased virtual address 'to' */
> +	extrd,u		%r26,56,32, %r26	/* convert phys addr to tlb insert format */
> +	extrd,u		%r23,56,32, %r23	/* convert phys addr to tlb insert format */
> +	depd		%r24,63,22, %r28	/* Form aliased virtual address 'to' */

Please post whitespace changes as separate patches.


> the loop used:
> export i=0 ; while [ $i -le 10 ] ; do make clean ; make oldconfig ; readprofile

3 to 5 iterations are sufficient for me (since they take so long).

> -r ; time make vmlinux ; readprofile >> /var/logs/prof.doc; i=$((i+1)) ;
> done 2>&1 | tee /var/logs/k-loop1
> 
> * with original 2.6.10-rc3-pa8 running kernel
> # grep "^user" k-loop1

Please use "^sys" or "^real".
"user" time is the only number that should NOT change with this patch.

> # grep copy_user_page_asm prof.doc
>   3254 copy_user_page_asm                        20.3375
>   3273 copy_user_page_asm                        20.4563
...

> * with 2.6.10-rc3-pa8 + patch and without "pdtlb		0(%r2[56])"
...
> # grep copy_user_page_asm prof.doc
>   1818 copy_user_page_asm                        11.3625
>   1763 copy_user_page_asm                        11.0188
>   1785 copy_user_page_asm                        11.1562
...

This is clearly goodness.

> * with 2.6.10-rc3-pa8 + full patch
...
> # grep copy_user_page_asm prof.doc
>   1894 copy_user_page_asm                        11.8375
>   1972 copy_user_page_asm                        12.3250
>   1975 copy_user_page_asm                        12.3438
>   1880 copy_user_page_asm                        11.7500
>   1923 copy_user_page_asm                        12.0188

I expect extra traps and/or time spent ordering the TLB operations.
pdtlb is costing about 8% performance in this routine.
I definitely want a clear explanation before adding this.
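
The "about 8%" figure can be reproduced from the readprofile tick counts
quoted above. A minimal sketch, using only the samples visible in this
mail (the elided "..." lines are not included, so the exact percentages
are approximate); the function and array names are made up for the
example:

```c
#include <assert.h>
#include <stddef.h>

/* Tick counts for copy_user_page_asm as quoted in this mail. */
static const unsigned ticks_original[]   = { 3254, 3273 };
static const unsigned ticks_no_pdtlb[]   = { 1818, 1763, 1785 };
static const unsigned ticks_full_patch[] = { 1894, 1972, 1975, 1880, 1923 };

/* Arithmetic mean of n tick samples. */
static double mean_ticks(const unsigned *samples, size_t n)
{
	unsigned long sum = 0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		sum += samples[i];
	return (double)sum / (double)n;
}

/* Relative change from 'before' to 'after', in percent. */
static double percent_change(double before, double after)
{
	return 100.0 * (after - before) / before;
}
```

With these samples, percent_change(mean_ticks(ticks_no_pdtlb, 3),
mean_ticks(ticks_full_patch, 5)) comes out a little under 8, and the
same calculation from ticks_original to ticks_no_pdtlb gives a drop of
roughly 45%.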

> So the main interest is to reduce the number of clock ticks :-)

Yes. :^)


thanks,
grant
_______________________________________________
parisc-linux mailing list
parisc-linux@lists.parisc-linux.org
http://lists.parisc-linux.org/mailman/listinfo/parisc-linux

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: copy_user_page_asm suggested 64bit improvment [Was: [parisc-linux] clear user page test]
  2004-12-27  7:36 ` copy_user_page_asm suggested 64bit improvment [Was: [parisc-linux] clear user page test] Grant Grundler
@ 2004-12-27 10:40   ` Joel Soete
  2004-12-27 15:08     ` James Bottomley
                       ` (2 more replies)
  2004-12-28 16:25   ` [parisc-linux] Re: copy_user_page_asm suggested 64bit improvment (Test case) Joel Soete
  2004-12-30  8:10   ` copy_user_page_asm suggested 64bit improvment [Was: [parisc-linux] clear user page test] Grant Grundler
  2 siblings, 3 replies; 17+ messages in thread
From: Joel Soete @ 2004-12-27 10:40 UTC (permalink / raw)
  To: Grant Grundler; +Cc: parisc-linux



Grant Grundler wrote:
> On Tue, Dec 21, 2004 at 02:37:47PM +0100, Joel Soete wrote:
> 
>>Hello all,
> 
> 
> Joel,
> I trim your postings to only include the parts I need to respond to.
> Could you please do the same?
> 
Apologies, I just wanted to be as detailed as possible for the others who didn't follow our previous mail exchange :-(

> I hate having to scroll down pages of stuff to get to your comment.
> That's probably why no one else responded.
> 
I understand that this makes things too noisy.

> 
> 
>>As promised, here is a cleaner (?)  patch:
>>--- arch/parisc/kernel/pacache.S.Orig	2004-12-20 08:28:23.000000000 +0100
>>+++ arch/parisc/kernel/pacache.S	2004-12-20 14:49:35.000000000 +0100
>>@@ -295,7 +295,52 @@
>> 	.callinfo NO_CALLS
>> 	.entry
>>
>>-	ldi		64, %r1
>>+	pdtlb		0(%r25)
>>+	pdtlb		0(%r26)
> 
> 
> Sorry - I missed why the pdtlb needs to be added.
> Could you explain?

Sorry no, that was a question of mine:
the previous implementation of copy_user_page_asm() (between #if 0 ... #endif below in the code) started with:
[...]
         /* Purge any old translations */

         pdtlb           0(%r28)
         pdtlb           0(%r29)

         ldi             64, %r1
[...]

and we do the same in __clear_user_page_asm()
[...]
         /* Purge any old translation */

         pdtlb           0(%r28)

[...]
> 
> Won't the pdtlb guarantee at least one trap per page copied?
> I would hope we guarantee the D-TLB is "clean" when calling this function.
> 
That is probably why it was removed, but I didn't find any explanation (which is understandable: it's nearly impossible to explain every 
implementation detail ;-)

> 
>>+#ifdef __LP64__
>>+
>>+	ldi		32, %r1			/* PAGE_SIZE/128 == 32 */
>>+
>>+1:	ldd		0(%r25), %r19
>>+	ldd		8(%r25), %r20
>>+	ldd		16(%r25), %r21
>>+	ldd		24(%r25), %r22
>>+	std		%r19, 0(%r26)
>>+	std		%r20, 8(%r26)
[...]
> 
> This looks good.
> 
> PA2.0 can retire 2 loads and 2 stores per cycle IFF there are no dependencies.
> 
> That means we want something like this:
> 
> +1:	ldd		0(%r25), %r19
> +	ldd		8(%r25), %r20
> +	ldd		16(%r25), %r21
> +	ldd		24(%r25), %r22
> +	std		%r19, 0(%r26)
> +	std		%r20, 8(%r26)
> +	ldd		32(%r25), %r19
> +	ldd		40(%r25), %r20
[...]
> +	ldo		128(%r25), %r25
> +	std		%r21, 112(%r26)
> +	std		%r22, 120(%r26)
> +	ADDIB>		-1, %r1, 1b
> +	ldo		128(%r26), %r26
> ...
> 
> [ Note that I've moved the "ldo" around as well!]
> 
> More distance between the "ldd %rX" and the corresponding
> "std %rX" is generally a good thing.
> This routine could use more registers in the loop to get more "distance".
OK, that was another possibility: I believe we can use r23 and r24, given that:
     r23-r26: these are arg3-arg0, i.e. you can use them if you
         don't care about the values that were passed in anymore.

but none of r3-r18, because:
r3-r18,r27,r30 need to be saved and restored. r3-r18 are just
     general purpose registers. [...]

> 
> It costs us 1 cycle to save two registers on the stack.
> Once the data is in L1-Cache, IFF the CPU needs more than one cycle
> to retire successive loads, we gain several cycles assuming additional
> register pairs are used multiple times per loop.
Well that (cache management) is still far beyond my skill :-(

[...]
>>-	extrd,u		%r26,56,32, %r26		/* convert phys addr to tlb insert format */
...
>>+	extrd,u		%r26,56,32, %r26	/* convert phys addr to tlb insert format */
> 
> Please post whitespace changes as separate patches.
> 
oops my bad (apologies)
> 
[...]
>>* with original 2.6.10-rc3-pa8 running kernel
>># grep "^user" k-loop1
> 
> Please use "^sys" or "^real".
> "user" time is the only number that should NOT change with this patch.
> 
I will try to recover that information.
> 
[...]
> 
>>So the main interest is to reduce the number of clock ticks :-)
> 
> 
> Yes. :^)
> 
Thanks for your patience and relevant remarks, I will come back with more material soon ;-)

Joel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: copy_user_page_asm suggested 64bit improvment [Was: [parisc-linux] clear user page test]
  2004-12-27 10:40   ` Joel Soete
@ 2004-12-27 15:08     ` James Bottomley
  2004-12-31 20:26       ` Michael S. Zick
  2004-12-27 17:34     ` Joel Soete
  2004-12-27 18:32     ` Joel Soete
  2 siblings, 1 reply; 17+ messages in thread
From: James Bottomley @ 2004-12-27 15:08 UTC (permalink / raw)
  To: Joel Soete; +Cc: PARISC list

On Mon, 2004-12-27 at 10:40 +0000, Joel Soete wrote:
> That is probably why it was removed, but I didn't find any explanation (which is understandable: it's nearly impossible to explain every 
> implementation detail ;-)

I haven't time to look through the patch, but I can explain what the
pdtlb's are about in pacache.S.

Both copy_user_page_asm and __clear_user_page_asm use something called
the tmpalias mapping.  This is an 8MB reserved area that's used to prime
the user space cache.  What you do is to set up a temporary mapping for
the target of the copy which is congruent to the user space address
somewhere in the tmpalias region.  Then when you do the copy, the user
alias is automatically up to date as well (because the cache sees the
collision by virtue of its congruence properties).

It's a nice idea, but we've never been able to make it work in practice,
because the user page we're copying can be an executable page, and this
scheme only makes the d-cache correct.  If we had a way of telling
whether it's a data page or an instruction page, we could make it work.
That's why the mechanism is #if 0'd out.

On the other hand, we can use it for clear_user_page, because no-one
ever wants to clear an executable page before returning it to the user.
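
As a rough illustration of what "congruent" means here, the sketch below
forms an alias address by keeping the low-order bits of the user address
and placing them on top of a fixed base. The 22-bit width is taken from
the "depd %r24,63,22, %r28" line quoted earlier in this thread; the base
value, the 4 KiB page size and all names are assumptions made up for the
example and are not the kernel's actual definitions.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12			/* assumption: 4 KiB pages */
#define SKETCH_ALIAS_BITS 22			/* width used by the quoted depd */
#define SKETCH_ALIAS_BASE UINT64_C(0xf0000000)	/* hypothetical, 4 MiB aligned */

/* Low-order bits (above the page offset) shared by congruent addresses. */
static uint64_t congruence_bits(uint64_t vaddr)
{
	const uint64_t alias_mask = (UINT64_C(1) << SKETCH_ALIAS_BITS) - 1;
	const uint64_t page_mask  = (UINT64_C(1) << SKETCH_PAGE_SHIFT) - 1;

	return vaddr & alias_mask & ~page_mask;
}

/* Page-aligned alias of user_vaddr inside the hypothetical alias region. */
static uint64_t tmpalias_addr_sketch(uint64_t user_vaddr)
{
	const uint64_t alias_mask = (UINT64_C(1) << SKETCH_ALIAS_BITS) - 1;

	assert((SKETCH_ALIAS_BASE & alias_mask) == 0);
	return SKETCH_ALIAS_BASE | congruence_bits(user_vaddr);
}
```

Under these assumptions the alias and the user address agree in every
bit that congruence_bits() keeps, which is the property the temporary
mapping relies on.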

James



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: copy_user_page_asm suggested 64bit improvment [Was: [parisc-linux] clear user page test]
  2004-12-27 10:40   ` Joel Soete
  2004-12-27 15:08     ` James Bottomley
@ 2004-12-27 17:34     ` Joel Soete
  2004-12-27 18:32     ` Joel Soete
  2 siblings, 0 replies; 17+ messages in thread
From: Joel Soete @ 2004-12-27 17:34 UTC (permalink / raw)
  To: Joel Soete; +Cc: parisc-linux



Joel Soete wrote:
> 
> 
> Grant Grundler wrote:
> 
>> On Tue, Dec 21, 2004 at 02:37:47PM +0100, Joel Soete wrote:
>>
[...]
>>> * with original 2.6.10-rc3-pa8 running kernel
>>> # grep "^user" k-loop1
>>
>>
>> Please use "^sys" or "^real".
>> "user" time is the only number that should NOT change with this patch.
>>
> I will try to recover that information.
> 
Those results were:
k-loop1 (i.e. cvs 2.6.10-rc3-pa8)
real	23m7.594s
user	18m47.768s
sys	4m2.585s

real	22m53.506s
user	18m47.400s
sys	4m0.321s

real	22m54.599s
user	18m47.492s
sys	4m0.226s

real	22m53.410s
user	18m48.205s
sys	3m59.351s

k-loop2 (i.e. cvs 2.6.10-rc3-pa8 + patch without pdtlb)
real	23m4.170s
user	18m47.511s
sys	4m0.654s

real	22m59.651s
user	18m51.133s
sys	3m58.969s

real	23m0.391s
user	18m50.908s
sys	3m59.588s

real	22m59.401s
user	18m51.090s
sys	3m59.673s

k-loop3 (i.e. cvs 2.6.10-rc3-pa8 + full patch)
real	23m28.521s
user	18m53.815s
sys	3m57.967s

real	23m32.696s
user	18m54.045s
sys	3m58.598s

real	23m28.981s
user	18m54.774s
sys	3m58.128s

real	23m30.631s
user	18m54.405s
sys	3m58.974s
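
To compare the three runs more easily, the "sys" lines above can be
averaged per kernel. A small sketch that parses the time(1) style
"XmY.YYYs" strings and computes the mean; the helper names are made up
for the example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* "sys" times copied from the three result blocks above. */
static const char *sys_loop1[] = { "4m2.585s", "4m0.321s", "4m0.226s", "3m59.351s" };
static const char *sys_loop2[] = { "4m0.654s", "3m58.969s", "3m59.588s", "3m59.673s" };
static const char *sys_loop3[] = { "3m57.967s", "3m58.598s", "3m58.128s", "3m58.974s" };

/* Convert a "<minutes>m<seconds>s" string to seconds. */
static double time_to_seconds(const char *s)
{
	unsigned minutes = 0;
	double seconds = 0.0;
	int matched = sscanf(s, "%um%lfs", &minutes, &seconds);

	assert(matched == 2);
	(void)matched;
	return 60.0 * (double)minutes + seconds;
}

/* Mean of n time strings, in seconds. */
static double mean_seconds(const char *const *times, size_t n)
{
	double sum = 0.0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		sum += time_to_seconds(times[i]);
	return sum / (double)n;
}
```

For these four samples each, the means come out at roughly 240.6 s,
239.7 s and 238.4 s, i.e. the sys time differs by less than 1% between
the three kernels.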


hth,
	Joel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: copy_user_page_asm suggested 64bit improvment [Was: [parisc-linux] clear user page test]
  2004-12-27 10:40   ` Joel Soete
  2004-12-27 15:08     ` James Bottomley
  2004-12-27 17:34     ` Joel Soete
@ 2004-12-27 18:32     ` Joel Soete
  2 siblings, 0 replies; 17+ messages in thread
From: Joel Soete @ 2004-12-27 18:32 UTC (permalink / raw)
  To: Joel Soete; +Cc: parisc-linux



Joel Soete wrote:
> 
> 
> Grant Grundler wrote:
> 
>> On Tue, Dec 21, 2004 at 02:37:47PM +0100, Joel Soete wrote:
>>
>>> Hello all,
>>
[...]
>> This routine could use more registers in the loop to get more "distance".
> 
> Ok that was another possibility: I trust that we can use r23, r24 as far 
> as:
>     r23-r26: these are arg3-arg0, i.e. you can use them if you
>         don't care about the values that were passed in anymore.
> 
Here is a first draft, just to be sure I understand correctly:
#ifdef __LP64__

         ldi             32, %r1                 /* PAGE_SIZE/128 == 32 */

1:      ldd             0(%r25), %r19
         ldd             8(%r25), %r20
         ldd             16(%r25), %r21
         ldd             24(%r25), %r22
         ldd             32(%r25), %r23
         ldd             40(%r25), %r24
         std             %r19, 0(%r26)
         std             %r20, 8(%r26)
         std             %r21, 16(%r26)
         std             %r22, 24(%r26)
         std             %r23, 32(%r26)
         std             %r24, 40(%r26)
         ldd             48(%r25), %r19
         ldd             56(%r25), %r20
         ldd             64(%r25), %r21
         ldd             72(%r25), %r22
         ldd             80(%r25), %r23
         ldd             88(%r25), %r24
         std             %r19, 48(%r26)
         std             %r20, 56(%r26)
         std             %r21, 64(%r26)
         std             %r22, 72(%r26)
         std             %r23, 80(%r26)
         std             %r24, 88(%r26)
         ldd             96(%r25), %r19
         ldd             104(%r25), %r20
         ldd             112(%r25), %r21
         ldd             120(%r25), %r22
         std             %r19, 96(%r26)
         std             %r20, 104(%r26)
         std             %r21, 112(%r26)
         std             %r22, 120(%r26)
         ldo             128(%r26), %r26
         ADDIB>          -1, %r1, 1b
         ldo             128(%r25), %r25

#else   /* !__LP64__ */

Do I just have to rearrange this to put more distance between each ldd/std pair?

What do you think?

Joel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [parisc-linux] Re: copy_user_page_asm suggested 64bit improvment (Test case)
  2004-12-27  7:36 ` copy_user_page_asm suggested 64bit improvment [Was: [parisc-linux] clear user page test] Grant Grundler
  2004-12-27 10:40   ` Joel Soete
@ 2004-12-28 16:25   ` Joel Soete
  2004-12-29  5:46     ` Grant Grundler
  2004-12-30  8:10   ` copy_user_page_asm suggested 64bit improvment [Was: [parisc-linux] clear user page test] Grant Grundler
  2 siblings, 1 reply; 17+ messages in thread
From: Joel Soete @ 2004-12-28 16:25 UTC (permalink / raw)
  To: Grant Grundler; +Cc: parisc-linux

[-- Attachment #1: Type: text/plain, Size: 2267 bytes --]

A test case may help to show the improvement better:

gcc -O2 -o cpup0 cpup0.c
gcc -march=2.0 -O2 -DLP64 -o cpup1 cpup0.c
gcc -march=2.0 -O2 -DLP64 -DV1 -o cpup2 cpup0.c
gcc -march=2.0 -O2 -DLP64 -DV2 -o cpup3 cpup0.c

Linux patst006 2.6.10-rc3-pa4-n4kmp #3 SMP Fri Dec 10 13:45:46 CET 2004 parisc64 GNU/Linux

# time ./cpup0 ; time ./cpup1; time ./cpup2 ; time ./cpup3

real	0m2.294s
user	0m0.226s
sys	0m2.068s

real	0m2.213s
user	0m0.140s
sys	0m2.074s

real	0m2.217s
user	0m0.108s
sys	0m2.110s

real	0m2.208s
user	0m0.108s
sys	0m2.100s
# time ./cpup0 ; time ./cpup1; time ./cpup2 ; time ./cpup3

real	0m2.316s
user	0m0.197s
sys	0m2.119s

real	0m2.217s
user	0m0.117s
sys	0m2.101s

real	0m2.203s
user	0m0.119s
sys	0m2.084s

real	0m2.205s
user	0m0.126s
sys	0m2.079s
# time ./cpup0 ; time ./cpup1; time ./cpup2 ; time ./cpup3

real	0m2.316s
user	0m0.194s
sys	0m2.122s

real	0m2.211s
user	0m0.126s
sys	0m2.086s

real	0m2.208s
user	0m0.106s
sys	0m2.102s

real	0m2.217s
user	0m0.113s
sys	0m2.105s
# time ./cpup0 ; time ./cpup1; time ./cpup2 ; time ./cpup3

real	0m2.311s
user	0m0.219s
sys	0m2.093s

real	0m2.222s
user	0m0.141s
sys	0m2.082s

real	0m2.207s
user	0m0.115s
sys	0m2.093s

real	0m2.208s
user	0m0.117s
sys	0m2.091s
# time ./cpup0 ; time ./cpup1; time ./cpup2 ; time ./cpup3

real	0m2.310s
user	0m0.205s
sys	0m2.105s

real	0m2.213s
user	0m0.104s
sys	0m2.109s

real	0m2.207s
user	0m0.115s
sys	0m2.092s

real	0m2.205s
user	0m0.108s
sys	0m2.096s

I would also like to know whether the order of execution matters:

# time ./cpup0 ; time ./cpup1; time ./cpup3 ; time ./cpup2

real	0m2.294s
user	0m0.196s
sys	0m2.100s

real	0m2.221s
user	0m0.111s
sys	0m2.111s

real	0m2.226s
user	0m0.097s
sys	0m2.130s

real	0m2.208s
user	0m0.107s
sys	0m2.101s
# time ./cpup0 ; time ./cpup3; time ./cpup2 ; time ./cpup1

real	0m2.302s
user	0m0.200s
sys	0m2.102s

real	0m2.206s
user	0m0.110s
sys	0m2.097s

real	0m2.213s
user	0m0.108s
sys	0m2.106s

real	0m2.214s
user	0m0.123s
sys	0m2.092s
# time ./cpup3 ; time ./cpup2; time ./cpup1 ; time ./cpup0

real	0m2.209s
user	0m0.104s
sys	0m2.105s

real	0m2.221s
user	0m0.115s
sys	0m2.106s

real	0m2.227s
user	0m0.111s
sys	0m2.116s

real	0m2.296s
user	0m0.212s
sys	0m2.085s

Maybe there is more improvement to be had from using more registers (i.e. V2 and cpup3)?

Joel

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: cpup0.c --]
[-- Type: text/x-csrc; name="cpup0.c", Size: 8594 bytes --]


#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <asm/page.h>

void __copy_user_page_asm(void *to, void *from)
{
	register unsigned long __to __asm__ ("r26") =  (unsigned long)to;
	register unsigned long __from __asm__ ("r25") =  (unsigned long)from;

#ifdef LP64

asm volatile ("ldi		32, %%r1\n"	/* PAGE_SIZE/128 == 32 */
#if V2
"1:	ldd		0(%0), %%r19\n"
"	ldd		8(%0), %%r20\n"
"	ldd		16(%0), %%r21\n"
"	ldd		24(%0), %%r22\n"
"	std		%%r19, 0(%1)\n"
"	std		%%r20, 8(%1)\n"
"	ldd		32(%0), %%r23\n"
"	ldd		40(%0), %%r24\n"
"	std		%%r21, 16(%1)\n"
"	std		%%r22, 24(%1)\n"
"	ldd		48(%0), %%r19\n"
"	ldd		56(%0), %%r20\n"
"	std		%%r23, 32(%1)\n"
"	std		%%r24, 40(%1)\n"
"	ldd		64(%0), %%r21\n"
"	ldd		72(%0), %%r22\n"
"	std		%%r19, 48(%1)\n"
"	std		%%r20, 56(%1)\n"
"	ldd		80(%0), %%r23\n"
"	ldd		88(%0), %%r24\n"
"	std		%%r21, 64(%1)\n"
"	std		%%r22, 72(%1)\n"
"	ldd		96(%0), %%r19\n"
"	ldd		104(%0), %%r20\n"
"	std		%%r23, 80(%1)\n"
"	std		%%r24, 88(%1)\n"
"	ldd		112(%0), %%r21\n"
"	ldd		120(%0), %%r22\n"
"	std		%%r19, 96(%1)\n"
"	std		%%r20, 104(%1)\n"
"	ldo		128(%0), %0\n"
"	std		%%r21, 112(%1)\n"
"	std		%%r22, 120(%1)\n"
"	addib,>		-1, %%r1, 1b\n"
"	ldo		128(%1), %1"
#else	/* !V2 */ 
"1:	ldd		0(%0), %%r19\n"
"	ldd		8(%0), %%r20\n"
"	ldd		16(%0), %%r21\n"
"	ldd		24(%0), %%r22\n"
"	std		%%r19, 0(%1)\n"
"	std		%%r20, 8(%1)\n"
#ifndef V1
"	std		%%r21, 16(%1)\n"
"	std		%%r22, 24(%1)\n"
"	ldd		32(%0), %%r19\n"
"	ldd		40(%0), %%r20\n"
"	ldd		48(%0), %%r21\n"
"	ldd		56(%0), %%r22\n"
"	std		%%r19, 32(%1)\n"
"	std		%%r20, 40(%1)\n"
"	std		%%r21, 48(%1)\n"
"	std		%%r22, 56(%1)\n"
"	ldd		64(%0), %%r19\n"
"	ldd		72(%0), %%r20\n"
"	ldd		80(%0), %%r21\n"
"	ldd		88(%0), %%r22\n"
"	std		%%r19, 64(%1)\n"
"	std		%%r20, 72(%1)\n"
"	std		%%r21, 80(%1)\n"
"	std		%%r22, 88(%1)\n"
"	ldd		96(%0), %%r19\n"
"	ldd		104(%0), %%r20\n"
"	ldd		112(%0), %%r21\n"
"	ldd		120(%0), %%r22\n"
"	std		%%r19, 96(%1)\n"
"	std		%%r20, 104(%1)\n"
"	std		%%r21, 112(%1)\n"
"	std		%%r22, 120(%1)\n"
"	ldo		128(%1), %1\n"
"	addib,>		-1, %%r1, 1b\n"
"	ldo		128(%0), %0"
#else	/* V1 */
"	ldd		32(%0), %%r19\n"
"	ldd		40(%0), %%r20\n"
"	std		%%r21, 16(%1)\n"
"	std		%%r22, 24(%1)\n"
"	ldd		48(%0), %%r21\n"
"	ldd		56(%0), %%r22\n"
"	std		%%r19, 32(%1)\n"
"	std		%%r20, 40(%1)\n"
"	ldd		64(%0), %%r19\n"
"	ldd		72(%0), %%r20\n"
"	std		%%r21, 48(%1)\n"
"	std		%%r22, 56(%1)\n"
"	ldd		80(%0), %%r21\n"
"	ldd		88(%0), %%r22\n"
"	std		%%r19, 64(%1)\n"
"	std		%%r20, 72(%1)\n"
"	ldd		96(%0), %%r19\n"
"	ldd		104(%0), %%r20\n"
"	std		%%r21, 80(%1)\n"
"	std		%%r22, 88(%1)\n"
"	ldd		112(%0), %%r21\n"
"	ldd		120(%0), %%r22\n"
"	std		%%r19, 96(%1)\n"
"	std		%%r20, 104(%1)\n"
"	ldo		128(%0), %0\n"
"	std		%%r21, 112(%1)\n"
"	std		%%r22, 120(%1)\n"
"	addib,>		-1, %%r1, 1b\n"
"	ldo		128(%1), %1"
#endif	/* V1 */

#endif	/* V2 */

#else	/* !LP64 */

asm volatile ("ldi		64, %%r1\n"
"1:	ldw		0(%0), %%r19\n"
"	ldw		4(%0), %%r20\n"
"	ldw		8(%0), %%r21\n"
"	ldw		12(%0), %%r22\n"
"	stw		%%r19, 0(%1)\n"
"	stw		%%r20, 4(%1)\n"
"	stw		%%r21, 8(%1)\n"
"	stw		%%r22, 12(%1)\n"
"	ldw		16(%0), %%r19\n"
"	ldw		20(%0), %%r20\n"
"	ldw		24(%0), %%r21\n"
"	ldw		28(%0), %%r22\n"
"	stw		%%r19, 16(%1)\n"
"	stw		%%r20, 20(%1)\n"
"	stw		%%r21, 24(%1)\n"
"	stw		%%r22, 28(%1)\n"
"	ldw		32(%0), %%r19\n"
"	ldw		36(%0), %%r20\n"
"	ldw		40(%0), %%r21\n"
"	ldw		44(%0), %%r22\n"
"	stw		%%r19, 32(%1)\n"
"	stw		%%r20, 36(%1)\n"
"	stw		%%r21, 40(%1)\n"
"	stw		%%r22, 44(%1)\n"
"	ldw		48(%0), %%r19\n"
"	ldw		52(%0), %%r20\n"
"	ldw		56(%0), %%r21\n"
"	ldw		60(%0), %%r22\n"
"	stw		%%r19, 48(%1)\n"
"	stw		%%r20, 52(%1)\n"
"	stw		%%r21, 56(%1)\n"
"	stw		%%r22, 60(%1)\n"
"	ldo		64(%1), %1\n"
"	addib,>		-1, %%r1, 1b\n"
"	ldo		64(%0), %0"
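#endif	/* LP64 */
	/* Both pointers are advanced inside the loop, so they are in/out operands. */
	: "+r"(__from), "+r"(__to)
	:
	/* Scratch registers used by the loops above, plus the copied memory. */
	: "r1", "r19", "r20", "r21", "r22", "r23", "r24", "memory");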
}

/* 
#define	INIT	1
#define	DEBUG	1
 */

#define BUFFSIZE	(1024*1024*256)
#define PPB		(BUFFSIZE/PAGE_SIZE)	/* Pages Per Buff */


int main(int argc, char * * argv, char * * env)
{
	char MemSrc[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEF
GHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcd
efghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmn" ;

	char *MemDst;
	int i, j, k;


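	MemDst = malloc(BUFFSIZE);
	if (MemDst == NULL) {
		perror("malloc");
		return 1;
	}
	for (j = 0; j < PPB ; j++) {
		__copy_user_page_asm(MemDst+(j*PAGE_SIZE), MemSrc);
	}

	/* Terminate at the last valid index; MemDst[BUFFSIZE] is out of bounds. */
	MemDst[BUFFSIZE - 1] = '\0';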

#if DEBUG
/*
	printf("MemDst = %s\n", MemDst);
 */
	for (i=0; i<BUFFSIZE; i++) {
		printf("MemDst[%d] = %c\n", i, MemDst[i]);
	}
#endif
	return 0;
}



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [parisc-linux] Re: copy_user_page_asm suggested 64bit improvment (Test case)
  2004-12-28 16:25   ` [parisc-linux] Re: copy_user_page_asm suggested 64bit improvment (Test case) Joel Soete
@ 2004-12-29  5:46     ` Grant Grundler
  2004-12-29 11:36       ` Joel Soete
  0 siblings, 1 reply; 17+ messages in thread
From: Grant Grundler @ 2004-12-29  5:46 UTC (permalink / raw)
  To: Joel Soete; +Cc: parisc-linux

On Tue, Dec 28, 2004 at 04:25:45PM +0000, Joel Soete wrote:
> A test case may help to show the improvement better:
> 
> gcc -O2 -o cpup0 cpup0.c
> gcc -march=2.0 -O2 -DLP64 -o cpup1 cpup0.c
> gcc -march=2.0 -O2 -DLP64 -DV1 -o cpup2 cpup0.c
> gcc -march=2.0 -O2 -DLP64 -DV2 -o cpup3 cpup0.c

As usual, I've hacked the cpup.c test.
Don't compare my results below with the previous ones Joel posted.
I've committed my version of cpup.c to "build-tools" repository.

grundler <549>for j in 1 2 3 4 5 ; do echo -n $j " " ; for i in 0 1 2 3; do time ./cpup$i ; done 2>&1 | fgrep user | cut -f2 ; done

and globbed the output a bit so it looks like a table:
#   cpup0    cpup1    cpup2    cpup3
1  0m1.033s 0m0.616s 0m0.607s 0m0.616s
2  0m1.039s 0m0.651s 0m0.587s 0m0.589s
3  0m1.004s 0m0.605s 0m0.631s 0m0.613s
4  0m1.015s 0m0.615s 0m0.572s 0m0.592s
5  0m1.014s 0m0.619s 0m0.564s 0m0.607s

Differences between the 64-bit variants are not statistically significant.
Results are from 2.6.10-rc3-pa6 SMP 64-bit kernel on a500-65 w/8G RAM
that was running a compile in the background.
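
One way to sanity-check that statement against the table above is a
two-sample t statistic on the "user" columns. The sketch below is only
an illustration with made-up helper names: it works with the squared
statistic to avoid needing sqrt(), and it assumes a two-sided 5%
threshold of 2.306 (Student's t, 8 degrees of freedom), which is a rough
choice for five samples per column.

```c
#include <assert.h>
#include <stddef.h>

/* "user" times in seconds, copied from the table above. */
static const double cpup1_user[] = { 0.616, 0.651, 0.605, 0.615, 0.619 };
static const double cpup2_user[] = { 0.607, 0.587, 0.631, 0.572, 0.564 };
static const double cpup3_user[] = { 0.616, 0.589, 0.613, 0.592, 0.607 };

/* Assumed threshold: (two-sided 5% critical t for 8 degrees of freedom)^2 */
#define SKETCH_T_CRIT_SQUARED (2.306 * 2.306)

static double sketch_mean(const double *x, size_t n)
{
	double sum = 0.0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		sum += x[i];
	return sum / (double)n;
}

/* Unbiased sample variance (divides by n - 1). */
static double sketch_variance(const double *x, size_t n)
{
	double m = sketch_mean(x, n);
	double acc = 0.0;
	size_t i;

	assert(n > 1);
	for (i = 0; i < n; i++)
		acc += (x[i] - m) * (x[i] - m);
	return acc / (double)(n - 1);
}

/* Squared t statistic for two columns of equal length n. */
static double sketch_t_squared(const double *a, const double *b, size_t n)
{
	double diff = sketch_mean(a, n) - sketch_mean(b, n);
	double se2 = (sketch_variance(a, n) + sketch_variance(b, n)) / (double)n;

	return diff * diff / se2;
}
```

With the numbers above, all three pairwise values of sketch_t_squared()
stay below SKETCH_T_CRIT_SQUARED, which is consistent with treating the
64-bit variants as indistinguishable on this data.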

cpupX columns above are defined by the following:
/*
** gcc -O2 -o cpup0 cpup.c             vanilla 32-bit loop
**      -march=2.0 -DLP64 -o cpup1      64-bit, 4ld + 4st sequences
**      -march=2.0 -DLP64 -DV1 -o cpup2 64-bit, 4regs, 2ld/2st bundles
**      -march=2.0 -DLP64 -DUSE6REGS -o cpup3 64-bit, 6 regs, 2ld/2st bundles
*/

And I'm wondering how/if 64-bit user space test ever worked since we don't
officially support 64-bit user space.
Likely I'm copying trash around even though the pointers are probably intact.

cpup2 is what I'd like to commit for the kernel version.
I've appended the patch.

I was expecting cpup3 would be slightly faster but don't have data
to prove it. And I'm still worried that GR23/GR24 won't be saved
by the caller since the __copy_user_page_asm function prototype
only specifies two arguments.

thanks,
grant


Index: arch/parisc/kernel/pacache.S
===================================================================
RCS file: /var/cvs/linux-2.6/arch/parisc/kernel/pacache.S,v
retrieving revision 1.13
diff -u -p -r1.13 pacache.S
--- arch/parisc/kernel/pacache.S	19 Dec 2004 04:50:35 -0000	1.13
+++ arch/parisc/kernel/pacache.S	29 Dec 2004 05:37:46 -0000
@@ -295,17 +295,72 @@ copy_user_page_asm:
 	.callinfo NO_CALLS
 	.entry
 
-	ldi		64, %r1
+#ifdef __LP64__
+	/* PA8x00 CPUs can consume 2 loads and 2 stores per cycle.
+	 * Unroll the loop by hand and arrange insn appropriately.
+	 * GCC probably can do this just as well.
+	 *
+	 * Prefetching and using more regs to increase the "distance"
+	 * between ldd and corresponding std are possible optimizations.
+	 */
+
+	ldi		32, %r1                 /* PAGE_SIZE/128 == 32 */
+
+1:	ldd		0(%r25), %r19		/* prolog == 1 bundle */
+	ldd		8(%r25), %r20
+
+	ldd		16(%r25), %r21		/* bundle 2 */
+	ldd		24(%r25), %r22
+	std		%r19, 0(%r26)
+	std		%r20, 8(%r26)
+
+	ldd		32(%r25), %r19		/* bundle 3 */
+	ldd		40(%r25), %r20
+	std		%r21, 16(%r26)
+	std		%r22, 24(%r26)
+
+	ldd		48(%r25), %r21		/* bundle 4 */
+	ldd		56(%r25), %r22
+	std		%r19, 32(%r26)
+	std		%r20, 40(%r26)
+
+	ldd		64(%r25), %r19		/* bundle 5 */
+	ldd		72(%r25), %r20
+	std		%r21, 48(%r26)
+	std		%r22, 56(%r26)
+
+	ldd		80(%r25), %r21		/* bundle 6 */
+	ldd		88(%r25), %r22
+	std		%r19, 64(%r26)
+	std		%r20, 72(%r26)
+
+	ldd		 96(%r25), %r19		/* bundle 7 */
+	ldd		104(%r25), %r20
+	std		%r21, 80(%r26)
+	std		%r22, 88(%r26)
+
+	ldd		112(%r25), %r21		/* bundle 8 */
+	ldd		120(%r25), %r22
+	std		%r19, 96(%r26)
+	std		%r20, 104(%r26)
+
+	ldo		128(%r25), %r25		/* epilog == 2 bundles */
+	std		%r21, 112(%r26)
+	std		%r22, 120(%r26)
+
+	ADDIB>		-1, %r1, 1b
+	ldo		128(%r26), %r26
+
+#else
 
 	/*
 	 * This loop is optimized for PCXL/PCXL2 ldw/ldw and stw/stw
-	 * bundles (very restricted rules for bundling). It probably
-	 * does OK on PCXU and better, but we could do better with
-	 * ldd/std instructions. Note that until (if) we start saving
+	 * bundles (very restricted rules for bundling).
+	 * Note that until (if) we start saving
 	 * the full 64 bit register values on interrupt, we can't
 	 * use ldd/std on a 32 bit kernel.
 	 */
-
+	ldi		64, %r1		/* PAGE_SIZE/64 == 64 */
 
 1:
 	ldw		0(%r25), %r19
@@ -343,7 +398,7 @@ copy_user_page_asm:
 	ldo		64(%r26), %r26
 	ADDIB>		-1, %r1, 1b
 	ldo		64(%r25), %r25
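
For readers who find C easier to follow than PA-RISC assembly, the
ordering used by the 64-bit loop in the patch above can be sketched as
follows. This is only an illustration: copy_page_pipelined is a
hypothetical name, the 4 KB page size is taken from the patch comments,
and a C compiler is free to reschedule these loads and stores, so the
sketch documents the intended ordering rather than guaranteeing it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u		/* assumption: 4 KB pages, so PAGE_SIZE/128 == 32 */

/* Hypothetical C rendering of the 64-bit loop: each pass loads the next
 * pair of doublewords before storing the previous pair. */
void copy_page_pipelined(uint64_t *to, const uint64_t *from)
{
	size_t iter, i;

	assert(to != NULL && from != NULL);

	for (iter = 0; iter < PAGE_SIZE / 128; iter++) {
		uint64_t a = from[0], b = from[1];	/* prolog: loads only */

		for (i = 2; i < 16; i += 2) {
			uint64_t c = from[i], d = from[i + 1];	/* 2 loads  */
			to[i - 2] = a;				/* 2 stores */
			to[i - 1] = b;
			a = c;	/* the asm alternates r19/r20 and r21/r22 */
			b = d;	/* instead of moving values between regs  */
		}

		to[14] = a;			/* epilog: stores only */
		to[15] = b;
		from += 16;			/* advance both pointers 128 bytes */
		to += 16;
	}
}
```

Each outer pass moves 16 doublewords (128 bytes), so 32 passes cover one
page, matching the "ldi 32, %r1" loop count in the patch.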


* [parisc-linux] Re: copy_user_page_asm suggested 64bit improvment (Test case)
  2004-12-29  5:46     ` Grant Grundler
@ 2004-12-29 11:36       ` Joel Soete
  0 siblings, 0 replies; 17+ messages in thread
From: Joel Soete @ 2004-12-29 11:36 UTC (permalink / raw)
  To: Grant Grundler; +Cc: parisc-linux



Grant Grundler wrote:
> On Tue, Dec 28, 2004 at 04:25:45PM +0000, Joel Soete wrote:
[...]
> 
> As usual, I've hacked the cpup.c test.
Cool ;-)

> Don't compare my results below with the previous ones Joel posted.
> I've committed my version of cpup.c to "build-tools" repository.
>
Thanks

[...]
> 
> cpup2 is what I'd like to commit for the kernel version.
> I've appended the patch.
> 
Nice

> I was expecting cpup3 would be slightly faster but don't have data
> to prove it.
Do you know where we could find an L1 cache state diagram that might help here?

> And I'm still worried that GR23/GR24 won't be saved
> by the caller since the __copy_user_page_asm function prototype
> only specifies two arguments.
> 
> thanks,
> grant
> 
Thanks for your attention,
	Joel
_______________________________________________
parisc-linux mailing list
parisc-linux@lists.parisc-linux.org
http://lists.parisc-linux.org/mailman/listinfo/parisc-linux


* Re: copy_user_page_asm suggested 64bit improvment [Was: [parisc-linux] clear user page test]
  2004-12-27  7:36 ` copy_user_page_asm suggested 64bit improvment [Was: [parisc-linux] clear user page test] Grant Grundler
  2004-12-27 10:40   ` Joel Soete
  2004-12-28 16:25   ` [parisc-linux] Re: copy_user_page_asm suggested 64bit improvment (Test case) Joel Soete
@ 2004-12-30  8:10   ` Grant Grundler
  2004-12-30 17:04     ` [parisc-linux] Re: copy_user_page_asm suggested 64bit improvment [Was: [parisc-l John David Anglin
  2 siblings, 1 reply; 17+ messages in thread
From: Grant Grundler @ 2004-12-30  8:10 UTC (permalink / raw)
  To: Joel Soete; +Cc: parisc-linux

On Mon, Dec 27, 2004 at 12:36:54AM -0700, Grant Grundler wrote:
> Anyone know how many cycles ldd from L1 takes?

I found the answer for PCX-W CPU:
	The PCXW Data cache is a 4-way set associative 1 MB cache, split
	into two banks and interleaved on double word boundaries to allow
	two simultaneous uses of the cache. Each bank is further divided
	into independent tag and data ports, primarily to allow effective
	single cycle stores. The two tags hold identical information.
	Each port returns data in two cycles, but can start a new access
	every cycle.

I'll assume PA8[567]00 CPUs have similar if not identical behavior.
PA8800 may not and I'd be curious if anyone knows.
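
To make the consequence of that description concrete, here is a small
model of the two-bank cache. It is a sketch under assumptions that are
not in the quoted text: I assume the bank is selected by address bit 3
(one way to read "interleaved on double word boundaries"), and the
function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: two banks, selected by address bit 3 (doubleword interleave). */
unsigned dcache_bank(uint64_t addr)
{
	return (unsigned)((addr >> 3) & 1u);
}

/* Non-zero if the two accesses land in different banks, i.e. under this
 * model they could both start in the same cycle. */
int can_pair(uint64_t a, uint64_t b)
{
	return dcache_bank(a) != dcache_bank(b);
}

/* Idealized cycle count to stream n doubleword loads: two new accesses
 * per cycle (one per bank), two cycles until data returns. */
unsigned long stream_load_cycles(unsigned long n_dwords)
{
	const unsigned long per_cycle = 2;	/* one access per bank per cycle */
	const unsigned long latency = 2;	/* "returns data in two cycles" */

	assert(n_dwords > 0);
	return (n_dwords + per_cycle - 1) / per_cycle + (latency - 1);
}
```

Under this model "ldd 0(%r25)" and "ldd 8(%r25)" hit different banks and
can pair, while offsets 0 and 16 collide. A 4 KB page is 512 doublewords,
so the load side alone has an ideal lower bound of 257 cycles; real
hardware will be slower because of misses and store traffic.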

I've just committed a "simple" version that uses r19/20/21/22.
I've got another version that also uses r23/r24 but it didn't boot
and I didn't chase down why. It's possibly a HW bug with this particular
A500. I'll try it again.

Lamont tells me r23/24/28/29 are *caller* saved registers.
I.e. I could use r28/29 as well (or instead of r23/24).

thanks,
grant

* [parisc-linux] Re: copy_user_page_asm suggested 64bit improvment [Was: [parisc-l
  2004-12-30  8:10   ` copy_user_page_asm suggested 64bit improvment [Was: [parisc-linux] clear user page test] Grant Grundler
@ 2004-12-30 17:04     ` John David Anglin
  0 siblings, 0 replies; 17+ messages in thread
From: John David Anglin @ 2004-12-30 17:04 UTC (permalink / raw)
  To: Grant Grundler; +Cc: parisc-linux

> Lamont tells me r23/24/28/29 are *caller* saved registers.
> I.e. I could use r28/29 as well (or instead of r23/24).

Here is a summary of the uses for general call used registers.  These are
r1, r2 and r19 to r31.

Register			32-bit		64-bit
Arguments		       r23-r26 	       r19-r26
Argument Pointer		    NA		   r29
Static Chain			   r29		   r31
PIC Offset Table Pointer	   r19		   r27
Stack Pointer			   r30		   r30
Return Pointer			    r2		    r2
Millicode Return Pointer	   r31		    r2 (r31 for local millicode)
Pointer for $$dyncall		   r21		    NA

You can never use r30 and you can't use r2 without saving it.  Watch out
for the PIC register conventions.  Depending on circumstances, the rest
should be usable.
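
The table and the caveats above can be encoded as a small check. This is
a deliberately conservative sketch, not an ABI reference: the function
names are hypothetical, and it rules out the PIC register even though
kernel code is not position independent (the existing 32-bit copy loop
uses r19, for example). Argument registers still hold live arguments on
entry, which this check does not model.

```c
#include <assert.h>
#include <stdbool.h>

/* Call-used general registers per the summary: r1, r2 and r19 to r31. */
bool is_call_used(int reg)
{
	return reg == 1 || reg == 2 || (reg >= 19 && reg <= 31);
}

/* Conservative "may I clobber this in a leaf routine?" check.
 * abi64 selects the 64-bit column of the table. */
bool scratch_candidate(int reg, bool abi64)
{
	int pic_reg = abi64 ? 27 : 19;	/* PIC Offset Table Pointer row */

	assert(reg >= 0 && reg <= 31);

	if (!is_call_used(reg))
		return false;
	if (reg == 30)			/* stack pointer: never usable */
		return false;
	if (reg == 2)			/* return pointer: must be saved first */
		return false;
	if (reg == pic_reg)		/* stricter than kernel code needs */
		return false;
	return true;
}
```

With this rule r23, r24, r28 and r29 all pass for the 64-bit runtime,
which agrees with what Lamont said.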

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)

* Re: copy_user_page_asm suggested 64bit improvment [Was: [parisc-linux] clear user page test]
  2004-12-27 15:08     ` James Bottomley
@ 2004-12-31 20:26       ` Michael S. Zick
  2004-12-31 20:56         ` Grant Grundler
  2004-12-31 21:21         ` James Bottomley
  0 siblings, 2 replies; 17+ messages in thread
From: Michael S. Zick @ 2004-12-31 20:26 UTC (permalink / raw)
  To: parisc-linux

On Mon December 27 2004 09:08, James Bottomley wrote:
> On Mon, 2004-12-27 at 10:40 +0000, Joel Soete wrote:
> > That may be why it was removed, but I didn't find any explanation (which is understandable: it's nearly impossible to explain every
> > implementation detail ;-)
> 
> I haven't time to look through the patch, but I can explain what the
> pdtlb's are about in pacache.S.
> 
> Both copy_user_page_asm and __clear_user_page_asm use something called
> the tmpalias mapping.  This is an 8MB reserved area that's used to prime
> the user space cache.  What you do is to set up a temporary mapping for
> the target of the copy which is congruent to the user space address
> somewhere in the tmpalias region.  Then when you do the copy, the user
> alias is automatically up to date as well (because the cache sees the
> collision by virtue of its congruence properties).
> 
> It's a nice idea, but we've never been able to make it work in practice,
> because the user page we're copying can be an executable page, and this
> scheme only makes the d-cache correct.  If we had a way of telling
> whether it's a data page or an instruction page, we could make it work.
> That's why the mechanism is #if 0'd out.
> 
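
As an aside, the "congruent" mapping James describes can be sketched in a
few lines of C. The numbers here are assumptions for illustration only:
a 4 MB alias stride and a region base of 0x80000000 are not taken from
this thread, and tmpalias_for and congruent are hypothetical helpers, not
kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE     4096ull
#define ALIAS_STRIDE  (4ull << 20)	/* assumption: aliases repeat every 4 MB */
#define TMPALIAS_BASE 0x80000000ull	/* hypothetical, stride-aligned base */

/* Two page addresses are congruent when they agree in the page-number
 * bits below the alias stride, so they index the same cache lines. */
int congruent(uint64_t a, uint64_t b)
{
	return ((a ^ b) & (ALIAS_STRIDE - 1) & ~(PAGE_SIZE - 1)) == 0;
}

/* Pick the address inside the reserved region that is congruent to the
 * given user virtual address. */
uint64_t tmpalias_for(uint64_t user_vaddr)
{
	uint64_t colour = user_vaddr & (ALIAS_STRIDE - 1) & ~(PAGE_SIZE - 1);

	assert((TMPALIAS_BASE & (ALIAS_STRIDE - 1)) == 0);
	return TMPALIAS_BASE + colour;
}
```

Writing through the address returned by tmpalias_for() touches the same
cache lines the user mapping will use, which is what "the user alias is
automatically up to date" refers to.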
Group,
I have been following this thread with interest.  Let me share my observations.

Changes in the instruction sequence of this kernel code path makes a user
observable difference in execution timings.

<bold-statement attribute="General-OS-Design">
    This path should not be within the set of user observable execution times.
</bold-statement>

Conditions, general:
The copy of a "user page" :: presumed to mean "copy of a page assigned
to user space".  Possible refinement: "copy of a page assigned to a specific
user's space".

Page must contain zeros on return.

Contents of system caches must correspond to contents of page (zeros).

On entry, it is unknown if page is currently Data, Executable (Instruction),
Both, or Neither.
Having a means to determine the exact, prior, usages of a page on entry
to this path would be nice; but logic and design can overcome this lack.

HP, PA-RISC has only i-cache and d-cache hardware.  It does not have
s-cache hardware.

A page assigned to user space may be assigned to more than one,
specific, user's space.

A page assigned to user space may also be assigned to kernel space.

For a 'dual assigned' page (assigned to both user space and kernel space) 
the following must hold:

A)  (Kernel Instruction) and (User Instruction)::
        MUST NOT also be assigned: (User Data)
        MAY OPTIONALLY be assigned: (Kernel Data)

B)  (Kernel Data) and (User Data)::
        MUST NOT also be assigned: (Kernel Instruction)
        MAY OPTIONALLY be assigned: (User Instruction)

The above requirements are independent of the implementation of
such assignments.  
Memory management hardware that allows 'dual assignment' is rare.  
Memory management software that allows 'dual assignment' by 
constructing a 'page alias' is common.
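
Conditions (A) and (B) can be written down as a predicate over a set of
assignment flags. This is only a sketch of the two rules as stated; the
flag names and the function are invented for illustration and do not
correspond to any existing kernel interface.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical assignment flags for one physical page. */
enum {
	KI = 1,		/* Kernel Instruction */
	KD = 2,		/* Kernel Data */
	UI = 4,		/* User Instruction */
	UD = 8		/* User Data */
};

/* Returns true if the combination violates neither (A) nor (B). */
bool dual_assignment_ok(unsigned flags)
{
	assert((flags & ~(unsigned)(KI | KD | UI | UD)) == 0);

	/* (A): Kernel Instruction + User Instruction MUST NOT add User Data */
	if ((flags & KI) && (flags & UI) && (flags & UD))
		return false;

	/* (B): Kernel Data + User Data MUST NOT add Kernel Instruction */
	if ((flags & KD) && (flags & UD) && (flags & KI))
		return false;

	return true;
}
```

For example, KI|UI|KD is accepted (the optional case of A) and KD|UD|UI
is accepted (the optional case of B), while KI|UI|UD and KD|UD|KI are
rejected.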

Condition (A :: 'MUST NOT') protects kernel provided, common code, 
from user modification.

Condition (A :: 'MAY OPTIONALLY') allows the kernel to:
    1) Dynamically alter the code provided to user space in general.
    2) Dynamically alter the code provided to a specific user's space.  
        NOTE: Such operation would trigger a 'copy on write' code path.  
        NOTE: The (shared) source page of 'copy on write' is not modified.
        NOTE: The destination page of 'copy on write' comes from the free pool.

Condition (B :: 'MUST NOT') protects the kernel from user insertion or 
modification of kernel code.

Condition (B :: 'MAY OPTIONALLY') supports the provision of 'executable
stack' in user space in the absence of s-cache hardware.

For a system that supports the provision of 'user, executable stack' the
following must hold:

C) (User Instruction) and (User Data) and (User Stack)::
        MUST meet condition (B)
        MUST NOT be shared among users: thou shall not share your stack.

D) (User Data) and (User Heap)::
        MUST NOT also be assigned: (Kernel Instruction)
        MUST NOT also be assigned: (User Instruction)
        MAY OPTIONALLY share disjoint address sub-ranges of the overall
        address range '((User Instruction) and (User Data) and (User Stack))'
        ON EITHER CONDITION OF:
        1) Attributes of the disjoint address sub-ranges are also disjoint.
        2) Software design can guarantee behavior the same as sub-condition(1).

Condition (C :: 'MUST NOT') 'copy on write' code path is never used.

Condition (D :: 'MUST NOT') Differs from (Condition C) by non-compliance with
(Condition B).

Condition (D :: 'MAY OPTIONALLY') Guarantees the distinction between (Condition
C) and (Condition D) when (Condition D) address area is shared among users in
the absence of separate (Condition C) and (Condition D) address spaces.
NOTE: A (Condition D) area may trigger a 'copy on write' code path; A (Condition
C) area MUST NOT trigger a 'copy on write' code path.

<All-Other-Combinations>

1) A page received from (any) free pool is guaranteed to contain only zeros.
2) A page received from (any) free pool is guaranteed to not have any 'user
space' cache representations.

</All-Other-Combinations>

NOTE: Zeroing a page received from (any) free pool is not 'user observable'
for the simple reason that it never happens.

<Page-Return-To-Free-Pool>

Pages which are intended to be added to the free pool, are not directly returned
to the free pool. 
Instead they are returned to a kernel space, free pool management, daemon.  It
is this daemon that makes the <All-Other-Combinations> guarantee.

NOTE: Zeroing a page on return to (any) free pool is not 'user observable' only
the 'add to free pool incoming queue' is in the 'user observable' code path.

NOTE: Pages handled by this daemon may have both d-cache and i-cache
representations.  But the code which deals with this situation is not 'user 
observable' because the entire 'return to free pool' operation is not 'user
observable'.

</Page-Return-To-Free-Pool>

<Non-Free-Pool-Pages>

<Non-rhetorical Question="What user pages can be both Instruction and Data?" />

(Condition B - 'MAY OPTIONALLY') pages: 

Dual Assigned : (I.E: Transition from 'shared' to 'private')
In-Use portion is copied ('user observable') - Not-Used portion is not copied.
It can be guaranteed to already be zero since it hasn't been used.
The 'write' side of the copy instructions does any 'cache priming'.

(Condition C) pages:

NOTE: Never shared, therefore never copied.

NOTE: Extending the pages present for an executable stack does not
have 'user observable' zeroing since the new page source is the free pool.

NOTE: Trimming 'zombie' stack extensions under general memory pressure
(I.E: Free pool exhausted @ new page request pending) would generate 'user
observable' execution time while a page on the 'add to free pool incoming
queue' was cleared.
This corner case can be postponed by using 'preemptive trimming' implemented
in the free pool management daemon.

</Non-Free-Pool-Pages>

Q.E.D: Zeroing a page with the destination of user space assignment need not
be a 'user observable' execution time.

There should be additional gains made in 'copy-[to|from]-user' when these four
conditions are enforced.

Mike

* Re: copy_user_page_asm suggested 64bit improvment [Was: [parisc-linux] clear user page test]
  2004-12-31 20:26       ` Michael S. Zick
@ 2004-12-31 20:56         ` Grant Grundler
  2004-12-31 21:35           ` Michael S. Zick
  2004-12-31 21:21         ` James Bottomley
  1 sibling, 1 reply; 17+ messages in thread
From: Grant Grundler @ 2004-12-31 20:56 UTC (permalink / raw)
  To: Michael S. Zick; +Cc: parisc-linux

On Fri, Dec 31, 2004 at 02:26:13PM -0600, Michael S. Zick wrote:
>     This path should not be within the set of user observable execution times.
...
> NOTE: Zeroing a page on return to (any) free pool is not 'user observable' only
> the 'add to free pool incoming queue' is in the 'user observable' code path.
> 
> NOTE: Pages handled by this daemon may have both d-cache and i-cache
> representations.  But the code which deals with this situation is not 'user 
> observable' because the entire 'return to free pool' operation is not 'user
> observable'.
...
> Q.E.D: Zeroing a page with the destination of user space assignment need not
> be a 'user observable' execution time.

Mike,
The copy_user_page and zero_page functions *are* observable since they
affect metrics reported by "time" and readprofile. I don't care if they
are invoked in the application context or some other context.

Certainly, it would reduce startup latency to pre-zero the pages in
the kernel (daemon) and have them ready when apps want them.
But on a loaded system, I expect this will be slightly less efficient
and more complex since one doesn't know how many need to be pre-zero'd
or when to steal pre-zero'd pages for other uses (e.g. load in an
executable).

> There should be additional gains made in 'copy-[to|from]-user' when these four
> conditions are enforced.

I read the conditions and thought "neat".
I don't pretend to understand all of them or what they mean.
But instead of trying to explain them, could you send me a patch that works?
Maybe something that has a chance of going back upstream to linus?

thanks,
grant

> 
> Mike

* Re: copy_user_page_asm suggested 64bit improvment [Was: [parisc-linux] clear user page test]
  2004-12-31 20:26       ` Michael S. Zick
  2004-12-31 20:56         ` Grant Grundler
@ 2004-12-31 21:21         ` James Bottomley
  1 sibling, 0 replies; 17+ messages in thread
From: James Bottomley @ 2004-12-31 21:21 UTC (permalink / raw)
  To: Michael S. Zick; +Cc: PARISC list

On Fri, 2004-12-31 at 14:26 -0600, Michael S. Zick wrote:
> Page must contain zeros on return.
> 
> Contents of system caches must correspond to contents of page (zeros).

Actually, no, this is precisely what we don't do for performance
reasons.  If we just wanted to keep the caches and main memory in sync, we
wouldn't need to muck with the tmpalias space.

What clear_user_page_asm does is to prime the cache covering the page
with zeros, but return the page to user space with a dirty cache (i.e.
with the real memory not necessarily zero'd but with the cache in a
state to zero it on a flush).  The reason for using the tmpalias space
is so that the user's VIPT cache lines covering the page are congruent
and thus the same ones the kernel wrote the zeros to.

This means that if the user is simply going to fill the page again, we
stand a good chance of *not* having to write the zeros to main memory in
the first place (this saves us quite a bit of execution time because
writing to main memory is an expensive operation).
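
A toy model of a single write-back cache line shows the saving. It is a
sketch only: one line stands in for the whole page, the 64-byte line size
and all names are assumptions, and real hardware may evict lines at any
time, which this model ignores.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define LINE_SIZE 64	/* assumption: illustrative line size */

struct cache_line {
	unsigned char data[LINE_SIZE];
	bool dirty;
};

/* Write a value to every byte of the line; only the cache is touched. */
void line_fill(struct cache_line *l, unsigned char value)
{
	assert(l != NULL);
	memset(l->data, value, LINE_SIZE);
	l->dirty = true;
}

/* Write the line back if dirty; returns the number of memory writes. */
unsigned line_flush(struct cache_line *l, unsigned char *mem)
{
	assert(l != NULL && mem != NULL);
	if (!l->dirty)
		return 0;
	memcpy(mem, l->data, LINE_SIZE);
	l->dirty = false;
	return 1;
}

/* Zero, flush to memory, then let the user fill the page and flush. */
unsigned writes_with_flush_after_zero(void)
{
	struct cache_line l = { {0}, false };
	unsigned char mem[LINE_SIZE];
	unsigned writes = 0;

	line_fill(&l, 0);
	writes += line_flush(&l, mem);
	line_fill(&l, 0x5a);
	writes += line_flush(&l, mem);
	return writes;
}

/* Zero in cache only, hand over dirty, user fills, one eventual flush. */
unsigned writes_with_dirty_handoff(void)
{
	struct cache_line l = { {0}, false };
	unsigned char mem[LINE_SIZE];
	unsigned writes = 0;

	line_fill(&l, 0);
	line_fill(&l, 0x5a);
	writes += line_flush(&l, mem);
	return writes;
}
```

In this model the flush-after-zero sequence costs two memory writes per
line and the dirty hand-off costs one, which is the effect described
above for a user that simply fills the page again.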

James



* Re: copy_user_page_asm suggested 64bit improvment [Was: [parisc-linux] clear user page test]
  2004-12-31 20:56         ` Grant Grundler
@ 2004-12-31 21:35           ` Michael S. Zick
       [not found]             ` <20041231225447.GC23592@colo.lackof.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Michael S. Zick @ 2004-12-31 21:35 UTC (permalink / raw)
  To: parisc-linux

On Fri December 31 2004 14:56, Grant Grundler wrote:
> 
> I read the conditions and thought "neat".
> I don't pretend to understand all of them or what they mean.
> But instead of trying to explain them, could you send me a patch that works?
> Maybe something that has a chance of going back upstream to linus?
> 
I tried the 'patch that works' route with a similar suggestion for sched.c
Based on that experience...

I suspect that perhaps pictures (diagrams? flow charts? dependency
graphs?) might stand a better chance of conveying what I can't explain
in English.  I'll put that (drawing pictures) on my todo list.

Let me attempt an abstract in words:

The *nix philosophy is two part drivers.

The 'top part' can be viewed as a 'client' that makes requests on 
behalf of the hardware.
The 'bottom part' can be viewed as a 'host' that services 'client'
requests.

Nothing new there.

What I proposed was:
The memory page free pool be defined as a 'virtual device' with
a two part driver.

The 'top part' is executed by the 'client' (kernel).
The 'bottom part' is executed by the 'host' (kernel daemon).

The only thing different than usual here is that a real hardware
device is (in most cases) the 'client' and the kernel is the 'host'.

In this virtual free pool device, the kernel is the 'client' and the
daemon is the 'host' (which only happens to be part of the kernel).
Only the 'client' code is in the user's execution path.
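
A minimal user-space sketch of that split might look like this. Nothing
here is kernel code: the struct, the fixed-size arrays and the function
names are all hypothetical, and there is no locking, so it only shows
which side does the zeroing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096
#define POOL_MAX  8

struct free_pool {
	unsigned char *ready[POOL_MAX];		/* zeroed, ready to hand out */
	size_t n_ready;
	unsigned char *incoming[POOL_MAX];	/* returned, not yet zeroed */
	size_t n_incoming;
};

/* 'client' side: take a pre-zeroed page, or NULL if none is ready. */
unsigned char *pool_get(struct free_pool *p)
{
	assert(p != NULL);
	return p->n_ready ? p->ready[--p->n_ready] : NULL;
}

/* 'client' side: queue a page for the daemon; no zeroing happens here. */
void pool_put(struct free_pool *p, unsigned char *page)
{
	assert(p != NULL && page != NULL);
	assert(p->n_incoming < POOL_MAX);
	p->incoming[p->n_incoming++] = page;
}

/* 'host' side: zero queued pages and move them to the ready list.
 * Returns the number of pages processed. */
size_t pool_daemon_run(struct free_pool *p)
{
	size_t done = 0;

	assert(p != NULL);
	while (p->n_incoming > 0 && p->n_ready < POOL_MAX) {
		unsigned char *page = p->incoming[--p->n_incoming];

		memset(page, 0, PAGE_SIZE);
		p->ready[p->n_ready++] = page;
		done++;
	}
	return done;
}
```

When pool_get() returns NULL the caller has to fall back to zeroing a
page itself, which is the loaded-system case Grant raised earlier.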

Should be interesting to consider.

I wouldn't expect the idea to be adopted any quicker than my
description (and patch that works) that the scheduler should be
a virtual device with a two part driver.

Mike

* Re: copy_user_page_asm suggested 64bit improvment [Was: [parisc-linux] clear user page test]
       [not found]             ` <20041231225447.GC23592@colo.lackof.org>
@ 2004-12-31 23:56               ` Michael S. Zick
  2005-01-12 13:52               ` Michael S. Zick
  1 sibling, 0 replies; 17+ messages in thread
From: Michael S. Zick @ 2004-12-31 23:56 UTC (permalink / raw)
  To: parisc-linux

On Fri December 31 2004 16:54, Grant Grundler wrote:
> On Fri, Dec 31, 2004 at 03:35:28PM -0600, Michael S. Zick wrote:
> > I tried the 'patch that works' route with a similar suggestion for sched.c
> > Based on that experience...
> 
> ah good. You learned something. :^)
>
Sometimes.

> 
> > What I proposed was:
> > The memory page free pool be defined as a 'virtual device' with
> > a two part driver.
> ...
> > In this virtual free pool device, the kernel is the 'client' and the
> > daemon is the 'host' (which only happens to be part of the kernel).
> > Only the 'client' code is in the user's execution path.
> 
> This sounds neat and "clean". But things could get very ugly
> when one needs to "steal" zero'd pages for other uses.
> 
> > Should be interesting to consider.
> 
> Yes, Agreed.
>
Better the discussion first - code optionally later.

> 
> > I wouldn't expect the idea to be adopted any quicker than my
> > description (and patch that works) that the scheduler should be
> > a virtual device with a two part driver.
> 
> I don't know what happened to your scheduler idea specifically (or
> how it was presented), but making something a driver means
> giving up something else. Been there done that.
> 
Overly radical at the time of presentation compared with 
other 'work in progress'.

Managing the free page pool (only) as a virtual device would lead
to much-oh (scientific term ;) glue code.  Not much of an improvement
over current practice.
Managing over-all memory resources as a virtual device is the answer;
but that is hardly a 'patch'.

Even so, glue code would be required unless the resource of 
cpu-cycles was also managed as a virtual device.

Now the topic is definitely out of the 'patch' scope, providing both
virtual devices would need to be a kernel branch devoted to the
project.

That in turn would require that a whole lot of people 'get on board'
with the ideas behind the design change.

The only practical means to accomplish that brings us full circle
back to the observation above: "Discussion First".

Mike
(PS: None of this is academic, just a clean re-write of code
written in the past for proprietary operating systems.)

* Re: copy_user_page_asm suggested 64bit improvment [Was: [parisc-linux] clear user page test]
       [not found]             ` <20041231225447.GC23592@colo.lackof.org>
  2004-12-31 23:56               ` Michael S. Zick
@ 2005-01-12 13:52               ` Michael S. Zick
  2005-01-12 15:32                 ` Joel Soete
  1 sibling, 1 reply; 17+ messages in thread
From: Michael S. Zick @ 2005-01-12 13:52 UTC (permalink / raw)
  To: parisc-linux

On Fri December 31 2004 16:54, Grant Grundler wrote:
> On Fri, Dec 31, 2004 at 03:35:28PM -0600, Michael S. Zick wrote:
> > I tried the 'patch that works' route with a similar suggestion for sched.c
> > Based on that experience...
> 
> ah good. You learned something. :^)
> 
> > What I proposed was:
> > The memory page free pool be defined as a 'virtual device' with
> > a two part driver.
> ...
> > In this virtual free pool device, the kernel is the 'client' and the
> > daemon is the 'host' (which only happens to be part of the kernel).
> > Only the 'client' code is in the user's execution path.
> 
> This sounds neat and "clean". But things could get very ugly
> when one needs to "steal" zero'd pages for other uses.
> 
> > Should be interesting to consider.
> 
> Yes, Agreed.
> 
Design seems to be drifting in that general direction.

See: change log on 2.6.11-rc1

More details at:
<http://seclists.org/lists/linux-kernel/2005/Jan/0888.html>

Mike (with Joel's help).

* Re: copy_user_page_asm suggested 64bit improvment [Was: [parisc-linux] clear user page test]
  2005-01-12 13:52               ` Michael S. Zick
@ 2005-01-12 15:32                 ` Joel Soete
  0 siblings, 0 replies; 17+ messages in thread
From: Joel Soete @ 2005-01-12 15:32 UTC (permalink / raw)
  To: Michael S. Zick, parisc-linux

[...]
> > 
> Design seems to be drifting in that general direction.
> 
> See: change log on 2.6.11-rc1
> 
> More details at:
> <http://seclists.org/lists/linux-kernel/2005/Jan/0888.html>
> 
The last v4 release thread start here:
<http://seclists.org/lists/linux-kernel/2005/Jan/2931.html>
and also 
<http://www.gelato.unsw.edu.au/linux-ia64/0501/12468.html>

I tried applying those patches, but I must have missed which kernel they
were built against: a big hunk of patch [2/4] failed :-(
From a quick look, it is supposed to rename several functions in
mm/page_alloc.c, e.g. page_order() into page_zorder(), but I couldn't find
it, not even in vanilla 2.6.10?

Joel

