From: Grant Grundler <grundler@parisc-linux.org>
To: Joel Soete <soete.joel@tiscali.be>
Cc: parisc-linux <parisc-linux@lists.parisc-linux.org>
Subject: Re: copy_user_page_asm suggested 64bit improvment [Was: [parisc-linux]	clear user page test]
Date: Mon, 27 Dec 2004 00:36:54 -0700
Message-ID: <20041227073654.GI29492@colo.lackof.org>
In-Reply-To: <418A80E8000124B5@mail-6-bnl.tiscali.it>

On Tue, Dec 21, 2004 at 02:37:47PM +0100, Joel Soete wrote:
> Hello all,

Joel,
I trim your postings to only include the parts I need to respond to.
Could you please do the same?

I hate having to scroll down pages of stuff to get to your comment.
That's probably why no one else responded.


> As promised, here is a cleaner (?)  patch:
> --- arch/parisc/kernel/pacache.S.Orig	2004-12-20 08:28:23.000000000 +0100
> +++ arch/parisc/kernel/pacache.S	2004-12-20 14:49:35.000000000 +0100
> @@ -295,7 +295,52 @@
>  	.callinfo NO_CALLS
>  	.entry
> 
> -	ldi		64, %r1
> +	pdtlb		0(%r25)
> +	pdtlb		0(%r26)

Sorry - I missed why the pdtlb needs to be added.
Could you explain?

Won't the pdtlb guarantee at least one trap per page copied?
I would hope we guarantee the D-TLB is "clean" when calling this function.

> +#ifdef __LP64__
> +
> +	ldi		32, %r1			/* PAGE_SIZE/128 == 32 */
> +
> +1:	ldd		0(%r25), %r19
> +	ldd		8(%r25), %r20
> +	ldd		16(%r25), %r21
> +	ldd		24(%r25), %r22
> +	std		%r19, 0(%r26)
> +	std		%r20, 8(%r26)
> +	std		%r21, 16(%r26)
> +	std		%r22, 24(%r26)

This looks good.

PA2.0 can retire 2 loads and 2 stores per cycle IFF there are no
dependencies between them.

That means we want something like this:

+1:	ldd		0(%r25), %r19
+	ldd		8(%r25), %r20
+	ldd		16(%r25), %r21
+	ldd		24(%r25), %r22
+	std		%r19, 0(%r26)
+	std		%r20, 8(%r26)
+	ldd		32(%r25), %r19
+	ldd		40(%r25), %r20
+	std		%r21, 16(%r26)
+	std		%r22, 24(%r26)
+	ldd		48(%r25), %r21
+	ldd		56(%r25), %r22
+	std		%r19, 32(%r26)
+	std		%r20, 40(%r26)
...
+	ldd		112(%r25), %r21
+	ldd		120(%r25), %r22
+	std		%r19, 96(%r26)
+	std		%r20, 104(%r26)
+	ldo		128(%r25), %r25
+	std		%r21, 112(%r26)
+	std		%r22, 120(%r26)
+	ADDIB>		-1, %r1, 1b
+	ldo		128(%r26), %r26
...

[Note that I've moved the "ldo" around as well!]

More distance between the "ldd %rX" and the corresponding
"std %rX" is generally a good thing.
This routine could use more registers in the loop to get more "distance".
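
For readers who don't speak PA-RISC assembly, the interleaving above can be
sketched in C. This is only an illustration of the ordering, not the kernel
routine: copy_page_pipelined is a hypothetical name, PAGE_SIZE is assumed to
be 4096 (from the "PAGE_SIZE/128 == 32" comment in the patch), the two
buffers are assumed not to overlap, and a compiler is free to reschedule the
loads and stores anyway.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u   /* assumed: matches "PAGE_SIZE/128 == 32" */

/*
 * Hypothetical C analogue of the interleaved ldd/std loop.
 * Four temporaries play the role of %r19..%r22: each pair is
 * reloaded right after it has been stored, so a load and the store
 * that consumes it are separated by other, independent operations.
 */
static void copy_page_pipelined(uint64_t *to, const uint64_t *from)
{
    size_t blocks = PAGE_SIZE / 128;      /* 32 blocks of 128 bytes */

    assert(to != from);                   /* buffers must be distinct */

    while (blocks--) {
        /* Prime the pipeline: the first four doublewords. */
        uint64_t a = from[0], b = from[1], c = from[2], d = from[3];

        for (size_t i = 0; i < 12; i += 4) {
            to[i]     = a;                /* std %r19 */
            to[i + 1] = b;                /* std %r20 */
            a = from[i + 4];              /* ldd -> %r19 */
            b = from[i + 5];              /* ldd -> %r20 */
            to[i + 2] = c;                /* std %r21 */
            to[i + 3] = d;                /* std %r22 */
            c = from[i + 6];              /* ldd -> %r21 */
            d = from[i + 7];              /* ldd -> %r22 */
        }

        /* Drain: store the last four doublewords of the block. */
        to[12] = a;
        to[13] = b;
        to[14] = c;
        to[15] = d;

        from += 16;                       /* ldo 128(%r25), %r25 */
        to   += 16;                       /* ldo 128(%r26), %r26 */
    }
}
```

Using more temporaries (more registers) would widen the gap between each
load and its store in the same way.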

It costs us 1 cycle to save two registers on the stack.
Once the data is in L1-Cache, IFF the CPU needs more than one cycle
to retire successive loads, we gain several cycles assuming additional
register pairs are used multiple times per loop.
Anyone know how many cycles ldd from L1 takes?

I expect gcc encodes those times so it can schedule stuff optimally.
But I've forgotten where to find the PA2.0 scheduling magic.
It might be worth just letting gcc unroll the loop for us since
SR0 (kernel) is implied in all the ldd/std instructions.


> -	extrd,u		%r26,56,32, %r26		/* convert phys addr to tlb insert format */
> -	extrd,u		%r23,56,32, %r23		/* convert phys addr to tlb insert format */
> -	depd		%r24,63,22, %r28		/* Form aliased virtual address 'to' */
> +	extrd,u		%r26,56,32, %r26	/* convert phys addr to tlb insert format */
> +	extrd,u		%r23,56,32, %r23	/* convert phys addr to tlb insert format */
> +	depd		%r24,63,22, %r28	/* Form aliased virtual address 'to' */

Please post white space changes as separate patches.


> the loop used:
> export i=0 ; while [ $i -le 10 ] ; do make clean ; make oldconfig ; readprofile

3 to 5 iterations are sufficient for me (since they take so long).

> -r ; time make vmlinux ; readprofile >> /var/logs/prof.doc; i=$((i+1)) ;
> done 2>&1 | tee /var/logs/k-loop1
> 
> * with original 2.6.10-rc3-pa8 running kernel
> # grep "^user" k-loop1

Please use "^sys" or "^real".
"user" time is the only number that should NOT change with this patch.

> # grep copy_user_page_asm prof.doc
>   3254 copy_user_page_asm                        20.3375
>   3273 copy_user_page_asm                        20.4563
...

> * with 2.6.10-rc3-pa8 + patch and without "pdtlb		0(%r2[56])"
...
> # grep copy_user_page_asm prof.doc
>   1818 copy_user_page_asm                        11.3625
>   1763 copy_user_page_asm                        11.0188
>   1785 copy_user_page_asm                        11.1562
...

This is clearly goodness.

> * with 2.6.10-rc3-pa8 + full patch
...
> # grep copy_user_page_asm prof.doc
>   1894 copy_user_page_asm                        11.8375
>   1972 copy_user_page_asm                        12.3250
>   1975 copy_user_page_asm                        12.3438
>   1880 copy_user_page_asm                        11.7500
>   1923 copy_user_page_asm                        12.0188

I expect extra traps and/or time spent ordering the TLB operations.
pdtlb is costing about 8% performance in this routine.
I definitely want a clear explanation before adding this.
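
The "about 8%" figure can be checked from the readprofile tick counts quoted
above. The sketch below is an assumption-laden back-of-envelope calculation:
it uses only the samples visible in this mail (the runs elided with "..."
are left out), and the array and function names are made up for the example.
With those samples the full patch comes out roughly 7.8% slower than the
patch without pdtlb, and the patch without pdtlb uses roughly 45% fewer
ticks than the original kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Tick counts for copy_user_page_asm as quoted in this mail only;
 * samples hidden behind "..." are NOT included. */
static const unsigned ticks_original[]   = { 3254, 3273 };
static const unsigned ticks_no_pdtlb[]   = { 1818, 1763, 1785 };
static const unsigned ticks_full_patch[] = { 1894, 1972, 1975, 1880, 1923 };

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Arithmetic mean of n samples (n must be non-zero). */
static double mean_ticks(const unsigned *samples, size_t n)
{
    double sum = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return sum / (double)n;
}

/* Relative change from 'before' to 'after', in percent.
 * Positive means 'after' needs more ticks. */
static double percent_change(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

For example, percent_change(mean_ticks(ticks_no_pdtlb, COUNT(ticks_no_pdtlb)),
mean_ticks(ticks_full_patch, COUNT(ticks_full_patch))) gives the cost of the
two pdtlb instructions.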

> So the main interest is to reduce the number of clock ticks :-)

Yes. :^)


thanks,
grant
