* [RFC 0/2] powerpc: copy_4K_page tweaked for Cell
From: Mark Nelson @ 2008-08-14  6:17 UTC
To: linuxppc-dev, cbe-oss-dev

Hi All,

What follows is an updated version of copy_4K_page that has been tuned
for the Cell processor. With this new routine it was found that the
system time measured when compiling a 2.6.26 pseries_defconfig was
reduced by ~10s:

mainline (2.6.27-rc1-00632-g2e1e921):

real	17m8.727s
user	59m48.693s
sys	3m56.089s

real	17m9.350s
user	59m44.822s
sys	3m56.666s

new routine:

real	17m7.311s
user	59m51.339s
sys	3m47.043s

real	17m7.863s
user	59m49.028s
sys	3m46.608s

This same routine was also found to improve performance on 970 CPUs
(but by a much smaller amount):

mainline (2.6.27-rc1-00632-g2e1e921):

real	16m8.545s
user	14m38.134s
sys	1m55.156s

real	16m7.089s
user	14m37.974s
sys	1m55.010s

new routine:

real	16m11.641s
user	14m37.251s
sys	1m52.618s

real	16m6.139s
user	14m38.282s
sys	1m53.184s

I also did testing on Power{3..6} and I found that Power3, Power5 and
Power6 did better with this new routine when the dcbt and dcbz
weren't used (in which case they achieved performance comparable to
the existing kernel copy_4K_page routine). Power4 on the other hand
performed slightly better with the dcbt and dcbz included (still
comparable to the current kernel copy_4K_page).

So in order to get the best performance across the board I created a
new CPU feature that will govern whether the dcbt and dcbz are used
(and un-creatively named it CPU_FTR_CP_USE_DCBTZ). I added it to the
CPU features of Cell, Power4 and 970.
Unfortunately I don't have access to a PA6T but judging by the
marketing material I could find, it looks like it has a strong enough
hardware prefetcher that it probably wouldn't benefit from the dcbt
and dcbz...

Okay, that's probably enough prattling along - you can all go and look
at the code now.

All comments appreciated.

[I decided to post the whole copy routine rather than a diff between
it and the current one because I found the diff quite unreadable. I'll post
a real patchset after I've addressed any comments.]

Many thanks!
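For readers who want to check the "~10s" figure, here is a minimal sketch of the arithmetic, assuming the two runs per kernel are simply averaged. The helper names (sketch_seconds, sketch_mean2, sketch_percent_drop) are made up for illustration and are not part of the patch set.

```c
/* Hypothetical helpers for re-deriving the sys-time savings quoted above. */

/* Convert a time(1)-style "XmY.YYYs" reading to seconds. */
static double sketch_seconds(int minutes, double seconds)
{
	return minutes * 60.0 + seconds;
}

/* Mean of the two benchmark runs reported for each kernel. */
static double sketch_mean2(double run_a, double run_b)
{
	return (run_a + run_b) / 2.0;
}

/* Percentage by which 'after' is lower than 'before'. */
static double sketch_percent_drop(double before, double after)
{
	return (before - after) / before * 100.0;
}

/* Cell: sys times copied from the mail above. */
static double sketch_cell_drop(void)
{
	double old_sys = sketch_mean2(sketch_seconds(3, 56.089),
				      sketch_seconds(3, 56.666));
	double new_sys = sketch_mean2(sketch_seconds(3, 47.043),
				      sketch_seconds(3, 46.608));

	return sketch_percent_drop(old_sys, new_sys);
}

/* 970: sys times copied from the mail above. */
static double sketch_970_drop(void)
{
	double old_sys = sketch_mean2(sketch_seconds(1, 55.156),
				      sketch_seconds(1, 55.010));
	double new_sys = sketch_mean2(sketch_seconds(1, 52.618),
				      sketch_seconds(1, 53.184));

	return sketch_percent_drop(old_sys, new_sys);
}
```

Calling sketch_cell_drop() and sketch_970_drop() from a small main() should give roughly 4% and 2% respectively, i.e. about 9.5s and 2.2s of system time. Averaging two runs is only a rough estimate; it says nothing about run-to-run variance.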
* [PATCH 0/2] powerpc: new copy_4K_page()
From: Mark Nelson @ 2008-08-22  4:32 UTC
To: linuxppc-dev
Cc: cbe-oss-dev

On Thu, 14 Aug 2008 04:17:32 pm Mark Nelson wrote:
> Hi All,
>
> What follows is an updated version of copy_4K_page that has been tuned
> for the Cell processor. With this new routine it was found that the
> system time measured when compiling a 2.6.26 pseries_defconfig was
> reduced by ~10s:
>
> mainline (2.6.27-rc1-00632-g2e1e921):
>
> real	17m8.727s
> user	59m48.693s
> sys	3m56.089s
>
> real	17m9.350s
> user	59m44.822s
> sys	3m56.666s
>
> new routine:
>
> real	17m7.311s
> user	59m51.339s
> sys	3m47.043s
>
> real	17m7.863s
> user	59m49.028s
> sys	3m46.608s
>
> This same routine was also found to improve performance on 970 CPUs
> (but by a much smaller amount):
>
> mainline (2.6.27-rc1-00632-g2e1e921):
>
> real	16m8.545s
> user	14m38.134s
> sys	1m55.156s
>
> real	16m7.089s
> user	14m37.974s
> sys	1m55.010s
>
> new routine:
>
> real	16m11.641s
> user	14m37.251s
> sys	1m52.618s
>
> real	16m6.139s
> user	14m38.282s
> sys	1m53.184s
>
>
> I also did testing on Power{3..6} and I found that Power3, Power5 and
> Power6 did better with this new routine when the dcbt and dcbz
> weren't used (in which case they achieved performance comparable to
> the existing kernel copy_4K_page routine). Power4 on the other hand
> performed slightly better with the dcbt and dcbz included (still
> comparable to the current kernel copy_4K_page).
>
> So in order to get the best performance across the board I created a
> new CPU feature that will govern whether the dcbt and dcbz are used
> (and un-creatively named it CPU_FTR_CP_USE_DCBTZ). I added it to the
> CPU features of Cell, Power4 and 970.
> Unfortunately I don't have access to a PA6T but judging by the
> marketing material I could find, it looks like it has a strong enough
> hardware prefetcher that it probably wouldn't benefit from the dcbt
> and dcbz...
>
> Okay, that's probably enough prattling along - you can all go and look
> at the code now.
>
> All comments appreciated
>
> [I decided to post the whole copy routine rather than a diff between
> it and the current one because I found the diff quite unreadable. I'll post
> a real patchset after I've addressed any comments.]
>
> Many thanks!
>

The actual patches for the new copy_4K_page() follow this.

Note: I changed the order of the patches so that the new CPU feature
bit is introduced in the first patch and then the new copy_4K_page
is introduced in the second patch.
* [PATCH 1/2] powerpc: add new CPU feature: CPU_FTR_CP_USE_DCBTZ
From: Mark Nelson @ 2008-08-22  4:36 UTC
To: linuxppc-dev
Cc: cbe-oss-dev

Add a new CPU feature bit, CPU_FTR_CP_USE_DCBTZ, to be added to the
64bit powerpc chips that benefit from having dcbt and dcbz
instructions used in their memory copy routines.

This will be used in a subsequent patch that updates copy_4K_page().
The new bit is added to Cell, PPC970 and Power4 because they show
better performance with the new copy_4K_page() when dcbt and dcbz
instructions are used.

Signed-off-by: Mark Nelson <markn@au1.ibm.com>
---
 arch/powerpc/include/asm/cputable.h |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

Index: upstream/arch/powerpc/include/asm/cputable.h
===================================================================
--- upstream.orig/arch/powerpc/include/asm/cputable.h
+++ upstream/arch/powerpc/include/asm/cputable.h
@@ -192,6 +192,7 @@ extern const char *powerpc_base_platform
 #define CPU_FTR_NO_SLBIE_B		LONG_ASM_CONST(0x0008000000000000)
 #define CPU_FTR_VSX			LONG_ASM_CONST(0x0010000000000000)
 #define CPU_FTR_SAO			LONG_ASM_CONST(0x0020000000000000)
+#define CPU_FTR_CP_USE_DCBTZ		LONG_ASM_CONST(0x0040000000000000)
 
 #ifndef __ASSEMBLY__
 
@@ -387,10 +388,11 @@ extern const char *powerpc_base_platform
 	    CPU_FTR_MMCRA | CPU_FTR_CTRL)
 #define CPU_FTRS_POWER4	(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
 	    CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
-	    CPU_FTR_MMCRA)
+	    CPU_FTR_MMCRA | CPU_FTR_CP_USE_DCBTZ)
 #define CPU_FTRS_PPC970	(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
 	    CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
-	    CPU_FTR_ALTIVEC_COMP | CPU_FTR_CAN_NAP | CPU_FTR_MMCRA)
+	    CPU_FTR_ALTIVEC_COMP | CPU_FTR_CAN_NAP | CPU_FTR_MMCRA | \
+	    CPU_FTR_CP_USE_DCBTZ)
 #define CPU_FTRS_POWER5	(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
 	    CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
 	    CPU_FTR_MMCRA | CPU_FTR_SMT | \
@@ -411,7 +413,8 @@ extern const char *powerpc_base_platform
 #define CPU_FTRS_CELL	(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
 	    CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
 	    CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \
-	    CPU_FTR_PAUSE_ZERO | CPU_FTR_CI_LARGE_PAGE | CPU_FTR_CELL_TB_BUG)
+	    CPU_FTR_PAUSE_ZERO | CPU_FTR_CI_LARGE_PAGE | \
+	    CPU_FTR_CELL_TB_BUG | CPU_FTR_CP_USE_DCBTZ)
 #define CPU_FTRS_PA6T (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
 	    CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | \
 	    CPU_FTR_ALTIVEC_COMP | CPU_FTR_CI_LARGE_PAGE | \
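To make the bit layout in the patch above concrete, here is a small userspace sketch of how a feature mask like this can be tested. The three constants are copied from the diff; the struct and helper names (sketch_cpu_spec, sketch_has_feature, sketch_is_single_bit) are hypothetical and are not the kernel's own interfaces. In the kernel the assembly routine does not test the bit at run time; per the patch description the feature section is nop'ed out on CPUs without the bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the cputable.h hunk above. */
#define CPU_FTR_VSX		UINT64_C(0x0010000000000000)
#define CPU_FTR_SAO		UINT64_C(0x0020000000000000)
#define CPU_FTR_CP_USE_DCBTZ	UINT64_C(0x0040000000000000)

/* Hypothetical stand-in for a per-CPU feature description. */
struct sketch_cpu_spec {
	const char *name;
	uint64_t features;
};

/* True if every bit in 'mask' is set in the CPU's feature word. */
static bool sketch_has_feature(const struct sketch_cpu_spec *cpu, uint64_t mask)
{
	return (cpu->features & mask) == mask;
}

/* True if 'value' has exactly one bit set, as a feature flag should. */
static bool sketch_is_single_bit(uint64_t value)
{
	return value != 0 && (value & (value - 1)) == 0;
}

/* Illustrative entries only: just the bits relevant to this patch. */
static const struct sketch_cpu_spec sketch_cell = {
	"Cell (sketch)", CPU_FTR_CP_USE_DCBTZ
};
static const struct sketch_cpu_spec sketch_power5 = {
	"Power5 (sketch)", 0
};
```

With these definitions, sketch_has_feature(&sketch_cell, CPU_FTR_CP_USE_DCBTZ) is true and the same check on sketch_power5 is false, mirroring which CPU_FTRS_* masks gain the bit in the patch.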
* [PATCH 2/2] powerpc: new copy_4K_page()
From: Mark Nelson @ 2008-08-22  4:39 UTC
To: linuxppc-dev
Cc: cbe-oss-dev

This new copy_4K_page() function was originally tuned for the best
performance on the Cell processor, but after testing on more 64bit
powerpc chips it was found that with a small modification it either
matched the performance offered by the current mainline version or
bettered it by a small amount.

It was found that on a Cell-based QS22 blade the amount of system
time measured when compiling a 2.6.26 pseries_defconfig decreased
by 4%. Using the same test, a 4-way 970MP machine saw a decrease of
2% in system time. No noticeable change was seen on Power4, Power5
or Power6.

The 4096 byte page is copied in thirty-two 128 byte strides. An
initial setup loop executes dcbt instructions for the whole source
page and dcbz instructions for the whole destination page. To do
this, the cache line size is retrieved from ppc64_caches.

A new CPU feature bit, CPU_FTR_CP_USE_DCBTZ (introduced in the
previous patch), is used to make the modification to this new copy
routine - on Power4, 970 and Cell the feature bit is set so the
setup loop is executed, but on all other 64bit chips the setup
loop is nop'ed out.

Signed-off-by: Mark Nelson <markn@au1.ibm.com>
---
 arch/powerpc/lib/copypage_64.S |  198 +++++++++++++++++++----------------------
 1 file changed, 93 insertions(+), 105 deletions(-)

Index: upstream/arch/powerpc/lib/copypage_64.S
===================================================================
--- upstream.orig/arch/powerpc/lib/copypage_64.S
+++ upstream/arch/powerpc/lib/copypage_64.S
@@ -1,5 +1,5 @@
 /*
- * Copyright (C) 2002 Paul Mackerras, IBM Corp.
+ * Copyright (C) 2008 Mark Nelson, IBM Corp.
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of the GNU General Public License
@@ -8,112 +8,100 @@
  */
 #include <asm/processor.h>
 #include <asm/ppc_asm.h>
+#include <asm/asm-offsets.h>
+
+	.section	".toc","aw"
+PPC64_CACHES:
+	.tc		ppc64_caches[TC],ppc64_caches
+	.section	".text"
+
 
 _GLOBAL(copy_4K_page)
-	std	r31,-8(1)
-	std	r30,-16(1)
-	std	r29,-24(1)
-	std	r28,-32(1)
-	std	r27,-40(1)
-	std	r26,-48(1)
-	std	r25,-56(1)
-	std	r24,-64(1)
-	std	r23,-72(1)
-	std	r22,-80(1)
-	std	r21,-88(1)
-	std	r20,-96(1)
-	li	r5,4096/32 - 1
+	li	r5,4096		/* 4K page size */
+BEGIN_FTR_SECTION
+	ld	r10,PPC64_CACHES@toc(r2)
+	lwz	r11,DCACHEL1LOGLINESIZE(r10)	/* log2 of cache line size */
+	lwz	r12,DCACHEL1LINESIZE(r10)	/* get cache line size */
+	li	r9,0
+	srd	r8,r5,r11
+
+	mtctr	r8
+setup:
+	dcbt	r9,r4
+	dcbz	r9,r3
+	add	r9,r9,r12
+	bdnz	setup
+END_FTR_SECTION_IFSET(CPU_FTR_CP_USE_DCBTZ)
 	addi	r3,r3,-8
-	li	r12,5
-0:	addi	r5,r5,-24
-	mtctr	r12
-	ld	r22,640(4)
-	ld	r21,512(4)
-	ld	r20,384(4)
-	ld	r11,256(4)
-	ld	r9,128(4)
-	ld	r7,0(4)
-	ld	r25,648(4)
-	ld	r24,520(4)
-	ld	r23,392(4)
-	ld	r10,264(4)
-	ld	r8,136(4)
-	ldu	r6,8(4)
-	cmpwi	r5,24
-1:	std	r22,648(3)
-	std	r21,520(3)
-	std	r20,392(3)
-	std	r11,264(3)
-	std	r9,136(3)
-	std	r7,8(3)
-	ld	r28,648(4)
-	ld	r27,520(4)
-	ld	r26,392(4)
-	ld	r31,264(4)
-	ld	r30,136(4)
-	ld	r29,8(4)
-	std	r25,656(3)
-	std	r24,528(3)
-	std	r23,400(3)
-	std	r10,272(3)
-	std	r8,144(3)
-	std	r6,16(3)
-	ld	r22,656(4)
-	ld	r21,528(4)
-	ld	r20,400(4)
-	ld	r11,272(4)
-	ld	r9,144(4)
-	ld	r7,16(4)
-	std	r28,664(3)
-	std	r27,536(3)
-	std	r26,408(3)
-	std	r31,280(3)
-	std	r30,152(3)
-	stdu	r29,24(3)
-	ld	r25,664(4)
-	ld	r24,536(4)
-	ld	r23,408(4)
-	ld	r10,280(4)
-	ld	r8,152(4)
-	ldu	r6,24(4)
+	srdi    r8,r5,7		/* page is copied in 128 byte strides */
+	addi	r8,r8,-1	/* one stride copied outside loop */
+
+	mtctr	r8
+
+	ld	r5,0(r4)
+	ld	r6,8(r4)
+	ld	r7,16(r4)
+	ldu	r8,24(r4)
+1:	std	r5,8(r3)
+	ld	r9,8(r4)
+	std	r6,16(r3)
+	ld	r10,16(r4)
+	std	r7,24(r3)
+	ld	r11,24(r4)
+	std	r8,32(r3)
+	ld	r12,32(r4)
+	std	r9,40(r3)
+	ld	r5,40(r4)
+	std	r10,48(r3)
+	ld	r6,48(r4)
+	std	r11,56(r3)
+	ld	r7,56(r4)
+	std	r12,64(r3)
+	ld	r8,64(r4)
+	std	r5,72(r3)
+	ld	r9,72(r4)
+	std	r6,80(r3)
+	ld	r10,80(r4)
+	std	r7,88(r3)
+	ld	r11,88(r4)
+	std	r8,96(r3)
+	ld	r12,96(r4)
+	std	r9,104(r3)
+	ld	r5,104(r4)
+	std	r10,112(r3)
+	ld	r6,112(r4)
+	std	r11,120(r3)
+	ld	r7,120(r4)
+	stdu	r12,128(r3)
+	ldu	r8,128(r4)
 	bdnz	1b
-	std	r22,648(3)
-	std	r21,520(3)
-	std	r20,392(3)
-	std	r11,264(3)
-	std	r9,136(3)
-	std	r7,8(3)
-	addi	r4,r4,640
-	addi	r3,r3,648
-	bge	0b
-	mtctr	r5
-	ld	r7,0(4)
-	ld	r8,8(4)
-	ldu	r9,16(4)
-3:	ld	r10,8(4)
-	std	r7,8(3)
-	ld	r7,16(4)
-	std	r8,16(3)
-	ld	r8,24(4)
-	std	r9,24(3)
-	ldu	r9,32(4)
-	stdu	r10,32(3)
-	bdnz	3b
-4:	ld	r10,8(4)
-	std	r7,8(3)
-	std	r8,16(3)
-	std	r9,24(3)
-	std	r10,32(3)
-9:	ld	r20,-96(1)
-	ld	r21,-88(1)
-	ld	r22,-80(1)
-	ld	r23,-72(1)
-	ld	r24,-64(1)
-	ld	r25,-56(1)
-	ld	r26,-48(1)
-	ld	r27,-40(1)
-	ld	r28,-32(1)
-	ld	r29,-24(1)
-	ld	r30,-16(1)
-	ld	r31,-8(1)
+
+	std	r5,8(r3)
+	ld	r9,8(r4)
+	std	r6,16(r3)
+	ld	r10,16(r4)
+	std	r7,24(r3)
+	ld	r11,24(r4)
+	std	r8,32(r3)
+	ld	r12,32(r4)
+	std	r9,40(r3)
+	ld	r5,40(r4)
+	std	r10,48(r3)
+	ld	r6,48(r4)
+	std	r11,56(r3)
+	ld	r7,56(r4)
+	std	r12,64(r3)
+	ld	r8,64(r4)
+	std	r5,72(r3)
+	ld	r9,72(r4)
+	std	r6,80(r3)
+	ld	r10,80(r4)
+	std	r7,88(r3)
+	ld	r11,88(r4)
+	std	r8,96(r3)
+	ld	r12,96(r4)
+	std	r9,104(r3)
+	std	r10,112(r3)
+	std	r11,120(r3)
+	std	r12,128(r3)
 	blr
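As a reading aid for the assembly above, here is a rough portable C sketch of the routine's overall shape: an optional per-cache-line setup pass followed by thirty-two 128 byte strides. It is not a translation of the patch. The names (sketch_copy_4k, SKETCH_PAGE_SIZE, SKETCH_STRIDE) are hypothetical, memset() merely stands in for dcbz, dcbt has no portable equivalent and is left as a comment, and the interleaving of loads and stores across strides in the real loop is not modelled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE	4096u	/* matches "li r5,4096" */
#define SKETCH_STRIDE		128u	/* matches "srdi r8,r5,7" */

/*
 * Copy one 4K page. 'line_size' plays the role of the L1 data cache
 * line size read from ppc64_caches; 'use_dcbtz' plays the role of
 * CPU_FTR_CP_USE_DCBTZ.
 */
static void sketch_copy_4k(void *dst, const void *src,
			   size_t line_size, bool use_dcbtz)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t off;

	/* Setup pass: one step per cache line across the whole page. */
	if (use_dcbtz && line_size != 0 &&
	    SKETCH_PAGE_SIZE % line_size == 0) {
		for (off = 0; off < SKETCH_PAGE_SIZE; off += line_size) {
			/* dcbt would hint the source line here. */
			memset(d + off, 0, line_size);	/* stand-in for dcbz */
		}
	}

	/* Main copy: 4096 / 128 = 32 strides of sixteen 8-byte words. */
	for (off = 0; off < SKETCH_PAGE_SIZE; off += SKETCH_STRIDE) {
		uint64_t words[SKETCH_STRIDE / sizeof(uint64_t)];

		memcpy(words, s + off, sizeof(words));	/* the ld group */
		memcpy(d + off, words, sizeof(words));	/* the std group */
	}
}
```

Either setting of use_dcbtz leaves the destination identical to the source; the flag only changes whether the destination lines are established up front, which is the part the feature bit turns on or off in the patch.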