linuxppc-dev.lists.ozlabs.org archive mirror
* [RFC 0/2] powerpc: copy_4K_page tweaked for Cell
@ 2008-08-14  6:17 Mark Nelson
  2008-08-22  4:32 ` [PATCH 0/2] powerpc: new copy_4K_page() Mark Nelson
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Mark Nelson @ 2008-08-14  6:17 UTC
  To: linuxppc-dev, cbe-oss-dev

Hi All,

What follows is an updated version of copy_4K_page that has been tuned
for the Cell processor. With this new routine, the system time
measured when compiling a 2.6.26 kernel with pseries_defconfig was
reduced by ~10s:

mainline (2.6.27-rc1-00632-g2e1e921):

real    17m8.727s
user    59m48.693s
sys     3m56.089s

real    17m9.350s
user    59m44.822s
sys     3m56.666s

new routine:

real    17m7.311s
user    59m51.339s
sys     3m47.043s

real    17m7.863s
user    59m49.028s
sys     3m46.608s

This same routine was also found to improve performance on 970 CPUs
(but by a much smaller amount):

mainline (2.6.27-rc1-00632-g2e1e921):

real    16m8.545s
user    14m38.134s
sys     1m55.156s

real    16m7.089s
user    14m37.974s
sys     1m55.010s

new routine:

real    16m11.641s
user    14m37.251s
sys     1m52.618s

real    16m6.139s
user    14m38.282s
sys     1m53.184s


I also tested on Power{3..6} and found that Power3, Power5 and
Power6 did better with this new routine when the dcbt and dcbz
weren't used (in which case they achieved performance comparable to
the existing kernel copy_4K_page routine). Power4, on the other hand,
performed slightly better with the dcbt and dcbz included (still
comparable to the current kernel copy_4K_page).

So in order to get the best performance across the board I created a
new CPU feature that governs whether the dcbt and dcbz are used
(and uncreatively named it CPU_FTR_CP_USE_DCBTZ). I added it to the
CPU features of Cell, Power4 and 970.

Unfortunately I don't have access to a PA6T, but judging by the
marketing material I could find, it looks like it has a strong enough
hardware prefetcher that it probably wouldn't benefit from the dcbt
and dcbz...

Okay, that's probably enough prattling along - you can all go and look
at the code now.

All comments appreciated

[I decided to post the whole copy routine rather than a diff between
it and the current one because I found the diff quite unreadable. I'll post
a real patchset after I've addressed any comments.]

Many thanks!


* [PATCH 0/2] powerpc: new copy_4K_page()
  2008-08-14  6:17 [RFC 0/2] powerpc: copy_4K_page tweaked for Cell Mark Nelson
@ 2008-08-22  4:32 ` Mark Nelson
  2008-08-22  4:36 ` [PATCH 1/2] powerpc: add new CPU feature: CPU_FTR_CP_USE_DCBTZ Mark Nelson
  2008-08-22  4:39 ` [PATCH 2/2] powerpc: new copy_4K_page() Mark Nelson
  2 siblings, 0 replies; 4+ messages in thread
From: Mark Nelson @ 2008-08-22  4:32 UTC
  To: linuxppc-dev; +Cc: cbe-oss-dev

The actual patches for the new copy_4K_page() follow this.

Note: I changed the order of the patches so that the new CPU feature
bit is introduced in the first patch and then the new copy_4K_page
is introduced in the second patch.


* [PATCH 1/2] powerpc: add new CPU feature: CPU_FTR_CP_USE_DCBTZ
  2008-08-14  6:17 [RFC 0/2] powerpc: copy_4K_page tweaked for Cell Mark Nelson
  2008-08-22  4:32 ` [PATCH 0/2] powerpc: new copy_4K_page() Mark Nelson
@ 2008-08-22  4:36 ` Mark Nelson
  2008-08-22  4:39 ` [PATCH 2/2] powerpc: new copy_4K_page() Mark Nelson
  2 siblings, 0 replies; 4+ messages in thread
From: Mark Nelson @ 2008-08-22  4:36 UTC
  To: linuxppc-dev; +Cc: cbe-oss-dev

Add a new CPU feature bit, CPU_FTR_CP_USE_DCBTZ, and set it for the
64-bit powerpc chips that benefit from having dcbt and dcbz
instructions used in their memory copy routines.

This will be used in a subsequent patch that updates copy_4K_page().
The new bit is added to Cell, PPC970 and Power4 because they show
better performance with the new copy_4K_page() when dcbt and dcbz
instructions are used.

Signed-off-by: Mark Nelson <markn@au1.ibm.com>
---
 arch/powerpc/include/asm/cputable.h |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

Index: upstream/arch/powerpc/include/asm/cputable.h
===================================================================
--- upstream.orig/arch/powerpc/include/asm/cputable.h
+++ upstream/arch/powerpc/include/asm/cputable.h
@@ -192,6 +192,7 @@ extern const char *powerpc_base_platform
 #define CPU_FTR_NO_SLBIE_B		LONG_ASM_CONST(0x0008000000000000)
 #define CPU_FTR_VSX			LONG_ASM_CONST(0x0010000000000000)
 #define CPU_FTR_SAO			LONG_ASM_CONST(0x0020000000000000)
+#define CPU_FTR_CP_USE_DCBTZ		LONG_ASM_CONST(0x0040000000000000)
 
 #ifndef __ASSEMBLY__
 
@@ -387,10 +388,11 @@ extern const char *powerpc_base_platform
 	    CPU_FTR_MMCRA | CPU_FTR_CTRL)
 #define CPU_FTRS_POWER4	(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
 	    CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
-	    CPU_FTR_MMCRA)
+	    CPU_FTR_MMCRA | CPU_FTR_CP_USE_DCBTZ)
 #define CPU_FTRS_PPC970	(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
 	    CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
-	    CPU_FTR_ALTIVEC_COMP | CPU_FTR_CAN_NAP | CPU_FTR_MMCRA)
+	    CPU_FTR_ALTIVEC_COMP | CPU_FTR_CAN_NAP | CPU_FTR_MMCRA | \
+	    CPU_FTR_CP_USE_DCBTZ)
 #define CPU_FTRS_POWER5	(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
 	    CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
 	    CPU_FTR_MMCRA | CPU_FTR_SMT | \
@@ -411,7 +413,8 @@ extern const char *powerpc_base_platform
 #define CPU_FTRS_CELL	(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
 	    CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
 	    CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \
-	    CPU_FTR_PAUSE_ZERO | CPU_FTR_CI_LARGE_PAGE | CPU_FTR_CELL_TB_BUG)
+	    CPU_FTR_PAUSE_ZERO | CPU_FTR_CI_LARGE_PAGE | \
+	    CPU_FTR_CELL_TB_BUG | CPU_FTR_CP_USE_DCBTZ)
 #define CPU_FTRS_PA6T (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
 	    CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | \
 	    CPU_FTR_ALTIVEC_COMP | CPU_FTR_CI_LARGE_PAGE | \


* [PATCH 2/2] powerpc: new copy_4K_page()
  2008-08-14  6:17 [RFC 0/2] powerpc: copy_4K_page tweaked for Cell Mark Nelson
  2008-08-22  4:32 ` [PATCH 0/2] powerpc: new copy_4K_page() Mark Nelson
  2008-08-22  4:36 ` [PATCH 1/2] powerpc: add new CPU feature: CPU_FTR_CP_USE_DCBTZ Mark Nelson
@ 2008-08-22  4:39 ` Mark Nelson
  2 siblings, 0 replies; 4+ messages in thread
From: Mark Nelson @ 2008-08-22  4:39 UTC
  To: linuxppc-dev; +Cc: cbe-oss-dev

This new copy_4K_page() function was originally tuned for the best
performance on the Cell processor, but after testing on more 64bit
powerpc chips it was found that with a small modification it either
matched the performance offered by the current mainline version or
bettered it by a small amount.

It was found that on a Cell-based QS22 blade the amount of system
time measured when compiling a 2.6.26 pseries_defconfig decreased
by 4%. Using the same test, a 4-way 970MP machine saw a decrease of
2% in system time. No noticeable change was seen on Power4, Power5
or Power6.

The 4096 byte page is copied in thirty-two 128 byte strides. An
initial setup loop executes dcbt instructions for the whole source
page and dcbz instructions for the whole destination page. To do
this, the cache line size is retrieved from ppc64_caches.

A new CPU feature bit, CPU_FTR_CP_USE_DCBTZ, (introduced in the
previous patch) controls this modification: on Power4, 970 and Cell
the feature bit is set, so the setup loop is executed; on all other
64-bit chips the setup loop is nop'ed out.

Signed-off-by: Mark Nelson <markn@au1.ibm.com>
---
 arch/powerpc/lib/copypage_64.S |  198 +++++++++++++++++++----------------------
 1 file changed, 93 insertions(+), 105 deletions(-)

Index: upstream/arch/powerpc/lib/copypage_64.S
===================================================================
--- upstream.orig/arch/powerpc/lib/copypage_64.S
+++ upstream/arch/powerpc/lib/copypage_64.S
@@ -1,5 +1,5 @@
 /*
- * Copyright (C) 2002 Paul Mackerras, IBM Corp.
+ * Copyright (C) 2008 Mark Nelson, IBM Corp.
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of the GNU General Public License
@@ -8,112 +8,100 @@
  */
 #include <asm/processor.h>
 #include <asm/ppc_asm.h>
+#include <asm/asm-offsets.h>
+
+        .section        ".toc","aw"
+PPC64_CACHES:
+        .tc             ppc64_caches[TC],ppc64_caches
+        .section        ".text"
+
 
 _GLOBAL(copy_4K_page)
-	std	r31,-8(1)
-	std	r30,-16(1)
-	std	r29,-24(1)
-	std	r28,-32(1)
-	std	r27,-40(1)
-	std	r26,-48(1)
-	std	r25,-56(1)
-	std	r24,-64(1)
-	std	r23,-72(1)
-	std	r22,-80(1)
-	std	r21,-88(1)
-	std	r20,-96(1)
-	li	r5,4096/32 - 1
+	li	r5,4096		/* 4K page size */
+BEGIN_FTR_SECTION
+	ld      r10,PPC64_CACHES@toc(r2)
+	lwz	r11,DCACHEL1LOGLINESIZE(r10)	/* log2 of cache line size */
+	lwz     r12,DCACHEL1LINESIZE(r10)	/* get cache line size */
+	li	r9,0
+	srd	r8,r5,r11
+
+	mtctr	r8
+setup:
+	dcbt	r9,r4
+	dcbz	r9,r3
+	add	r9,r9,r12
+	bdnz	setup
+END_FTR_SECTION_IFSET(CPU_FTR_CP_USE_DCBTZ)
 	addi	r3,r3,-8
-	li	r12,5
-0:	addi	r5,r5,-24
-	mtctr	r12
-	ld	r22,640(4)
-	ld	r21,512(4)
-	ld	r20,384(4)
-	ld	r11,256(4)
-	ld	r9,128(4)
-	ld	r7,0(4)
-	ld	r25,648(4)
-	ld	r24,520(4)
-	ld	r23,392(4)
-	ld	r10,264(4)
-	ld	r8,136(4)
-	ldu	r6,8(4)
-	cmpwi	r5,24
-1:	std	r22,648(3)
-	std	r21,520(3)
-	std	r20,392(3)
-	std	r11,264(3)
-	std	r9,136(3)
-	std	r7,8(3)
-	ld	r28,648(4)
-	ld	r27,520(4)
-	ld	r26,392(4)
-	ld	r31,264(4)
-	ld	r30,136(4)
-	ld	r29,8(4)
-	std	r25,656(3)
-	std	r24,528(3)
-	std	r23,400(3)
-	std	r10,272(3)
-	std	r8,144(3)
-	std	r6,16(3)
-	ld	r22,656(4)
-	ld	r21,528(4)
-	ld	r20,400(4)
-	ld	r11,272(4)
-	ld	r9,144(4)
-	ld	r7,16(4)
-	std	r28,664(3)
-	std	r27,536(3)
-	std	r26,408(3)
-	std	r31,280(3)
-	std	r30,152(3)
-	stdu	r29,24(3)
-	ld	r25,664(4)
-	ld	r24,536(4)
-	ld	r23,408(4)
-	ld	r10,280(4)
-	ld	r8,152(4)
-	ldu	r6,24(4)
+	srdi    r8,r5,7		/* page is copied in 128 byte strides */
+	addi	r8,r8,-1	/* one stride copied outside loop */
+
+	mtctr	r8
+
+	ld	r5,0(r4)
+	ld	r6,8(r4)
+	ld	r7,16(r4)
+	ldu	r8,24(r4)
+1:	std	r5,8(r3)
+	ld	r9,8(r4)
+	std	r6,16(r3)
+	ld	r10,16(r4)
+	std	r7,24(r3)
+	ld	r11,24(r4)
+	std	r8,32(r3)
+	ld	r12,32(r4)
+	std	r9,40(r3)
+	ld	r5,40(r4)
+	std	r10,48(r3)
+	ld	r6,48(r4)
+	std	r11,56(r3)
+	ld	r7,56(r4)
+	std	r12,64(r3)
+	ld	r8,64(r4)
+	std	r5,72(r3)
+	ld	r9,72(r4)
+	std	r6,80(r3)
+	ld	r10,80(r4)
+	std	r7,88(r3)
+	ld	r11,88(r4)
+	std	r8,96(r3)
+	ld	r12,96(r4)
+	std	r9,104(r3)
+	ld	r5,104(r4)
+	std	r10,112(r3)
+	ld	r6,112(r4)
+	std	r11,120(r3)
+	ld	r7,120(r4)
+	stdu	r12,128(r3)
+	ldu	r8,128(r4)
 	bdnz	1b
-	std	r22,648(3)
-	std	r21,520(3)
-	std	r20,392(3)
-	std	r11,264(3)
-	std	r9,136(3)
-	std	r7,8(3)
-	addi	r4,r4,640
-	addi	r3,r3,648
-	bge	0b
-	mtctr	r5
-	ld	r7,0(4)
-	ld	r8,8(4)
-	ldu	r9,16(4)
-3:	ld	r10,8(4)
-	std	r7,8(3)
-	ld	r7,16(4)
-	std	r8,16(3)
-	ld	r8,24(4)
-	std	r9,24(3)
-	ldu	r9,32(4)
-	stdu	r10,32(3)
-	bdnz	3b
-4:	ld	r10,8(4)
-	std	r7,8(3)
-	std	r8,16(3)
-	std	r9,24(3)
-	std	r10,32(3)
-9:	ld	r20,-96(1)
-	ld	r21,-88(1)
-	ld	r22,-80(1)
-	ld	r23,-72(1)
-	ld	r24,-64(1)
-	ld	r25,-56(1)
-	ld	r26,-48(1)
-	ld	r27,-40(1)
-	ld	r28,-32(1)
-	ld	r29,-24(1)
-	ld	r30,-16(1)
-	ld	r31,-8(1)
+
+	std	r5,8(r3)
+	ld	r9,8(r4)
+	std	r6,16(r3)
+	ld	r10,16(r4)
+	std	r7,24(r3)
+	ld	r11,24(r4)
+	std	r8,32(r3)
+	ld	r12,32(r4)
+	std	r9,40(r3)
+	ld	r5,40(r4)
+	std	r10,48(r3)
+	ld	r6,48(r4)
+	std	r11,56(r3)
+	ld	r7,56(r4)
+	std	r12,64(r3)
+	ld	r8,64(r4)
+	std	r5,72(r3)
+	ld	r9,72(r4)
+	std	r6,80(r3)
+	ld	r10,80(r4)
+	std	r7,88(r3)
+	ld	r11,88(r4)
+	std	r8,96(r3)
+	ld	r12,96(r4)
+	std	r9,104(r3)
+	std	r10,112(r3)
+	std	r11,120(r3)
+	std	r12,128(r3)
 	blr

