* [RFC 2/2] powerpc: copy_4K_page tweaked for Cell - add CPU feature @ 2008-08-14 6:18 Mark Nelson 2008-08-14 10:51 ` Michael Ellerman 0 siblings, 1 reply; 5+ messages in thread From: Mark Nelson @ 2008-08-14 6:18 UTC (permalink / raw) To: linuxppc-dev, cbe-oss-dev Add a new CPU feature, CPU_FTR_CP_USE_DCBTZ, to be added to the CPUs that benefit from having dcbt and dcbz instructions used in copy_4K_page(). So far Cell, PPC970 and Power4 benefit. This way all the other 64bit powerpc chips will have the whole prefetching loop nop'ed out. Signed-off-by: Mark Nelson <markn@au1.ibm.com> --- arch/powerpc/include/asm/cputable.h | 9 ++++++--- arch/powerpc/lib/copypage_64.S | 3 ++- 2 files changed, 8 insertions(+), 4 deletions(-) Index: upstream/arch/powerpc/include/asm/cputable.h =================================================================== --- upstream.orig/arch/powerpc/include/asm/cputable.h +++ upstream/arch/powerpc/include/asm/cputable.h @@ -192,6 +192,7 @@ extern const char *powerpc_base_platform #define CPU_FTR_NO_SLBIE_B LONG_ASM_CONST(0x0008000000000000) #define CPU_FTR_VSX LONG_ASM_CONST(0x0010000000000000) #define CPU_FTR_SAO LONG_ASM_CONST(0x0020000000000000) +#define CPU_FTR_CP_USE_DCBTZ LONG_ASM_CONST(0x0040000000000000) #ifndef __ASSEMBLY__ @@ -387,10 +388,11 @@ extern const char *powerpc_base_platform CPU_FTR_MMCRA | CPU_FTR_CTRL) #define CPU_FTRS_POWER4 (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \ CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \ - CPU_FTR_MMCRA) + CPU_FTR_MMCRA | CPU_FTR_CP_USE_DCBTZ) #define CPU_FTRS_PPC970 (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \ CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \ - CPU_FTR_ALTIVEC_COMP | CPU_FTR_CAN_NAP | CPU_FTR_MMCRA) + CPU_FTR_ALTIVEC_COMP | CPU_FTR_CAN_NAP | CPU_FTR_MMCRA | \ + CPU_FTR_CP_USE_DCBTZ) #define CPU_FTRS_POWER5 (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \ CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \ CPU_FTR_MMCRA | CPU_FTR_SMT | \ @@ -411,7 +413,8 @@ extern const char *powerpc_base_platform #define CPU_FTRS_CELL (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \ CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \ CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \ - CPU_FTR_PAUSE_ZERO | CPU_FTR_CI_LARGE_PAGE | CPU_FTR_CELL_TB_BUG) + CPU_FTR_PAUSE_ZERO | CPU_FTR_CI_LARGE_PAGE | \ + CPU_FTR_CELL_TB_BUG | CPU_FTR_CP_USE_DCBTZ) #define CPU_FTRS_PA6T (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \ CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | \ CPU_FTR_ALTIVEC_COMP | CPU_FTR_CI_LARGE_PAGE | \ Index: upstream/arch/powerpc/lib/copypage_64.S =================================================================== --- upstream.orig/arch/powerpc/lib/copypage_64.S +++ upstream/arch/powerpc/lib/copypage_64.S @@ -18,6 +18,7 @@ PPC64_CACHES: _GLOBAL(copy_4K_page) li r5,4096 /* 4K page size */ +BEGIN_FTR_SECTION ld r10,PPC64_CACHES@toc(r2) lwz r11,DCACHEL1LOGLINESIZE(r10) /* log2 of cache line size */ lwz r12,DCACHEL1LINESIZE(r10) /* Get cache line size */ @@ -30,7 +31,7 @@ setup: dcbz r9,r3 add r9,r9,r12 bdnz setup - +END_FTR_SECTION_IFSET(CPU_FTR_CP_USE_DCBTZ) addi r3,r3,-8 srdi r8,r5,7 /* page is copied in 128 byte strides */ addi r8,r8,-1 /* one stride copied outside loop */ ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC 2/2] powerpc: copy_4K_page tweaked for Cell - add CPU feature 2008-08-14 6:18 [RFC 2/2] powerpc: copy_4K_page tweaked for Cell - add CPU feature Mark Nelson @ 2008-08-14 10:51 ` Michael Ellerman 2008-08-14 11:48 ` Mark Nelson 0 siblings, 1 reply; 5+ messages in thread From: Michael Ellerman @ 2008-08-14 10:51 UTC (permalink / raw) To: Mark Nelson; +Cc: linuxppc-dev, cbe-oss-dev [-- Attachment #1: Type: text/plain, Size: 2279 bytes --] On Thu, 2008-08-14 at 16:18 +1000, Mark Nelson wrote: > Add a new CPU feature, CPU_FTR_CP_USE_DCBTZ, to be added to the CPUs that benefit > from having dcbt and dcbz instructions used in copy_4K_page(). So far Cell, PPC970 > and Power4 benefit. > > This way all the other 64bit powerpc chips will have the whole prefetching loop > nop'ed out. > Index: upstream/arch/powerpc/lib/copypage_64.S > =================================================================== > --- upstream.orig/arch/powerpc/lib/copypage_64.S > +++ upstream/arch/powerpc/lib/copypage_64.S > @@ -18,6 +18,7 @@ PPC64_CACHES: > > _GLOBAL(copy_4K_page) > li r5,4096 /* 4K page size */ > +BEGIN_FTR_SECTION > ld r10,PPC64_CACHES@toc(r2) > lwz r11,DCACHEL1LOGLINESIZE(r10) /* log2 of cache line size */ > lwz r12,DCACHEL1LINESIZE(r10) /* Get cache line size */ > @@ -30,7 +31,7 @@ setup: > dcbz r9,r3 > add r9,r9,r12 > bdnz setup > - > +END_FTR_SECTION_IFSET(CPU_FTR_CP_USE_DCBTZ) > addi r3,r3,-8 > srdi r8,r5,7 /* page is copied in 128 byte strides */ > addi r8,r8,-1 /* one stride copied outside loop */ Instead of nop'ing it out, we could use an alternative feature section to either run it or jump over it. It would look something like: _GLOBAL(copy_4K_page) BEGIN_FTR_SECTION li r5,4096 /* 4K page size */ ld r10,PPC64_CACHES@toc(r2) lwz r11,DCACHEL1LOGLINESIZE(r10) /* log2 of cache line size */ lwz r12,DCACHEL1LINESIZE(r10) /* Get cache line size */ li r9,0 srd r8,r5,r11 mtctr r8 setup: dcbt r9,r4 dcbz r9,r3 add r9,r9,r12 bdnz setup FTR_SECTION_ELSE b 1f ALT_FTR_SECTION_END_IFSET(CPU_FTR_CP_USE_DCBTZ) 1: addi r3,r3,-8 So in the no-dcbtz case you'd get a branch instead of 11 nops. Of course you'd need to benchmark it to see if skipping the nops is better than executing them ;P cheers -- Michael Ellerman OzLabs, IBM Australia Development Lab wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC 2/2] powerpc: copy_4K_page tweaked for Cell - add CPU feature 2008-08-14 10:51 ` Michael Ellerman @ 2008-08-14 11:48 ` Mark Nelson 2008-08-14 12:10 ` Michael Ellerman 0 siblings, 1 reply; 5+ messages in thread From: Mark Nelson @ 2008-08-14 11:48 UTC (permalink / raw) To: michael; +Cc: linuxppc-dev, cbe-oss-dev Hi Michael, On Thu, 14 Aug 2008 08:51:35 pm Michael Ellerman wrote: > On Thu, 2008-08-14 at 16:18 +1000, Mark Nelson wrote: > > Add a new CPU feature, CPU_FTR_CP_USE_DCBTZ, to be added to the CPUs that benefit > > from having dcbt and dcbz instructions used in copy_4K_page(). So far Cell, PPC970 > > and Power4 benefit. > > > > This way all the other 64bit powerpc chips will have the whole prefetching loop > > nop'ed out. > > > Index: upstream/arch/powerpc/lib/copypage_64.S > > =================================================================== > > --- upstream.orig/arch/powerpc/lib/copypage_64.S > > +++ upstream/arch/powerpc/lib/copypage_64.S > > @@ -18,6 +18,7 @@ PPC64_CACHES: > > > > _GLOBAL(copy_4K_page) > > li r5,4096 /* 4K page size */ > > +BEGIN_FTR_SECTION > > ld r10,PPC64_CACHES@toc(r2) > > lwz r11,DCACHEL1LOGLINESIZE(r10) /* log2 of cache line size */ > > lwz r12,DCACHEL1LINESIZE(r10) /* Get cache line size */ > > @@ -30,7 +31,7 @@ setup: > > dcbz r9,r3 > > add r9,r9,r12 > > bdnz setup > > - > > +END_FTR_SECTION_IFSET(CPU_FTR_CP_USE_DCBTZ) > > addi r3,r3,-8 > > srdi r8,r5,7 /* page is copied in 128 byte strides */ > > addi r8,r8,-1 /* one stride copied outside loop */ > > Instead of nop'ing it out, we could use an alternative feature section > to either run it or jump over it. It would look something like: > > > _GLOBAL(copy_4K_page) > BEGIN_FTR_SECTION > li r5,4096 /* 4K page size */ > ld r10,PPC64_CACHES@toc(r2) > lwz r11,DCACHEL1LOGLINESIZE(r10) /* log2 of cache line size */ > lwz r12,DCACHEL1LINESIZE(r10) /* Get cache line size */ > li r9,0 > srd r8,r5,r11 > > mtctr r8 > setup: > dcbt r9,r4 > dcbz r9,r3 > add r9,r9,r12 > bdnz setup > FTR_SECTION_ELSE > b 1f > ALT_FTR_SECTION_END_IFSET(CPU_FTR_CP_USE_DCBTZ) > 1: > addi r3,r3,-8 > > So in the no-dcbtz case you'd get a branch instead of 11 nops. > > Of course you'd need to benchmark it to see if skipping the nops is > better than executing them ;P Thanks for looking through this. That does look a lot better. In the first version there wasn't quite as much to nop out (the cache line size was hardcoded to 128 bytes) so I wasn't so worried but I'll definitely try this with an alternative section like you describe. The jump probably will turn out to be better because I'd imagine that the same chips that don't need the dcbt and dcbz because they've got beefy enough hardware prefetchers also won't be disturbed by the jump (but benchmarks tomorrow will confirm; or prove me wrong :) ) Thanks! Mark ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC 2/2] powerpc: copy_4K_page tweaked for Cell - add CPU feature 2008-08-14 11:48 ` Mark Nelson @ 2008-08-14 12:10 ` Michael Ellerman 2008-08-15 6:33 ` Mark Nelson 0 siblings, 1 reply; 5+ messages in thread From: Michael Ellerman @ 2008-08-14 12:10 UTC (permalink / raw) To: Mark Nelson; +Cc: linuxppc-dev, cbe-oss-dev [-- Attachment #1: Type: text/plain, Size: 3284 bytes --] On Thu, 2008-08-14 at 21:48 +1000, Mark Nelson wrote: > Hi Michael, > > On Thu, 14 Aug 2008 08:51:35 pm Michael Ellerman wrote: > > On Thu, 2008-08-14 at 16:18 +1000, Mark Nelson wrote: > > > Add a new CPU feature, CPU_FTR_CP_USE_DCBTZ, to be added to the CPUs that benefit > > > from having dcbt and dcbz instructions used in copy_4K_page(). So far Cell, PPC970 > > > and Power4 benefit. > > > > > > This way all the other 64bit powerpc chips will have the whole prefetching loop > > > nop'ed out. > > > > > Index: upstream/arch/powerpc/lib/copypage_64.S > > > =================================================================== > > > --- upstream.orig/arch/powerpc/lib/copypage_64.S > > > +++ upstream/arch/powerpc/lib/copypage_64.S > > > @@ -18,6 +18,7 @@ PPC64_CACHES: > > > > > > _GLOBAL(copy_4K_page) > > > li r5,4096 /* 4K page size */ > > > +BEGIN_FTR_SECTION > > > ld r10,PPC64_CACHES@toc(r2) > > > lwz r11,DCACHEL1LOGLINESIZE(r10) /* log2 of cache line size */ > > > lwz r12,DCACHEL1LINESIZE(r10) /* Get cache line size */ > > > @@ -30,7 +31,7 @@ setup: > > > dcbz r9,r3 > > > add r9,r9,r12 > > > bdnz setup > > > - > > > +END_FTR_SECTION_IFSET(CPU_FTR_CP_USE_DCBTZ) > > > addi r3,r3,-8 > > > srdi r8,r5,7 /* page is copied in 128 byte strides */ > > > addi r8,r8,-1 /* one stride copied outside loop */ > > > > Instead of nop'ing it out, we could use an alternative feature section > > to either run it or jump over it. It would look something like: > > > > > > _GLOBAL(copy_4K_page) > > BEGIN_FTR_SECTION > > li r5,4096 /* 4K page size */ > > ld r10,PPC64_CACHES@toc(r2) > > lwz r11,DCACHEL1LOGLINESIZE(r10) /* log2 of cache line size */ > > lwz r12,DCACHEL1LINESIZE(r10) /* Get cache line size */ > > li r9,0 > > srd r8,r5,r11 > > > > mtctr r8 > > setup: > > dcbt r9,r4 > > dcbz r9,r3 > > add r9,r9,r12 > > bdnz setup > > FTR_SECTION_ELSE > > b 1f > > ALT_FTR_SECTION_END_IFSET(CPU_FTR_CP_USE_DCBTZ) > > 1: > > addi r3,r3,-8 > > > > So in the no-dcbtz case you'd get a branch instead of 11 nops. > > > > Of course you'd need to benchmark it to see if skipping the nops is > > better than executing them ;P > > Thanks for looking through this. > > That does look a lot better. In the first version there wasn't quite > as much to nop out (the cache line size was hardcoded to 128 > bytes) so I wasn't so worried but I'll definitely try this with an > alternative section like you describe. > > The jump probably will turn out to be better because I'd imagine > that the same chips that don't need the dcbt and dcbz because > they've got beefy enough hardware prefetchers also won't be > disturbed by the jump (but benchmarks tomorrow will confirm; > or prove me wrong :) ) Yeah, that would make sense. But you never know :) cheers -- Michael Ellerman OzLabs, IBM Australia Development Lab wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC 2/2] powerpc: copy_4K_page tweaked for Cell - add CPU feature 2008-08-14 12:10 ` Michael Ellerman @ 2008-08-15 6:33 ` Mark Nelson 0 siblings, 0 replies; 5+ messages in thread From: Mark Nelson @ 2008-08-15 6:33 UTC (permalink / raw) To: michael; +Cc: linuxppc-dev, cbe-oss-dev On Thu, 14 Aug 2008 10:10:48 pm Michael Ellerman wrote: > On Thu, 2008-08-14 at 21:48 +1000, Mark Nelson wrote: > > Hi Michael, > > > > On Thu, 14 Aug 2008 08:51:35 pm Michael Ellerman wrote: > > > On Thu, 2008-08-14 at 16:18 +1000, Mark Nelson wrote: > > > > Add a new CPU feature, CPU_FTR_CP_USE_DCBTZ, to be added to the CPUs that benefit > > > > from having dcbt and dcbz instructions used in copy_4K_page(). So far Cell, PPC970 > > > > and Power4 benefit. > > > > > > > > This way all the other 64bit powerpc chips will have the whole prefetching loop > > > > nop'ed out. > > > > > > > Index: upstream/arch/powerpc/lib/copypage_64.S > > > > =================================================================== > > > > --- upstream.orig/arch/powerpc/lib/copypage_64.S > > > > +++ upstream/arch/powerpc/lib/copypage_64.S > > > > @@ -18,6 +18,7 @@ PPC64_CACHES: > > > > > > > > _GLOBAL(copy_4K_page) > > > > li r5,4096 /* 4K page size */ > > > > +BEGIN_FTR_SECTION > > > > ld r10,PPC64_CACHES@toc(r2) > > > > lwz r11,DCACHEL1LOGLINESIZE(r10) /* log2 of cache line size */ > > > > lwz r12,DCACHEL1LINESIZE(r10) /* Get cache line size */ > > > > @@ -30,7 +31,7 @@ setup: > > > > dcbz r9,r3 > > > > add r9,r9,r12 > > > > bdnz setup > > > > - > > > > +END_FTR_SECTION_IFSET(CPU_FTR_CP_USE_DCBTZ) > > > > addi r3,r3,-8 > > > > srdi r8,r5,7 /* page is copied in 128 byte strides */ > > > > addi r8,r8,-1 /* one stride copied outside loop */ > > > > > > Instead of nop'ing it out, we could use an alternative feature section > > > to either run it or jump over it. It would look something like: > > > > > > > > > _GLOBAL(copy_4K_page) > > > BEGIN_FTR_SECTION > > > li r5,4096 /* 4K page size */ > > > ld r10,PPC64_CACHES@toc(r2) > > > lwz r11,DCACHEL1LOGLINESIZE(r10) /* log2 of cache line size */ > > > lwz r12,DCACHEL1LINESIZE(r10) /* Get cache line size */ > > > li r9,0 > > > srd r8,r5,r11 > > > > > > mtctr r8 > > > setup: > > > dcbt r9,r4 > > > dcbz r9,r3 > > > add r9,r9,r12 > > > bdnz setup > > > FTR_SECTION_ELSE > > > b 1f > > > ALT_FTR_SECTION_END_IFSET(CPU_FTR_CP_USE_DCBTZ) > > > 1: > > > addi r3,r3,-8 > > > > > > So in the no-dcbtz case you'd get a branch instead of 11 nops. > > > > > > Of course you'd need to benchmark it to see if skipping the nops is > > > better than executing them ;P > > > > Thanks for looking through this. > > > > That does look a lot better. In the first version there wasn't quite > > as much to nop out (the cache line size was hardcoded to 128 > > bytes) so I wasn't so worried but I'll definitely try this with an > > alternative section like you describe. > > > > The jump probably will turn out to be better because I'd imagine > > that the same chips that don't need the dcbt and dcbz because > > they've got beefy enough hardware prefetchers also won't be > > disturbed by the jump (but benchmarks tomorrow will confirm; > > or prove me wrong :) ) > > Yeah, that would make sense. But you never know :) It turns out that on Power6 using the alternative section doesn't have any noticeable effect on performance. On Power5 though it actually made the compile test a tiny bit slower: with alternative feature section: real 5m7.549s user 17m24.256s sys 1m0.621s real 5m7.829s user 17m24.879s sys 1m0.465s original implementation: real 5m6.468s user 17m22.891s sys 0m59.765s real 5m6.965s user 17m23.160s sys 0m59.844s So I guess I'll leave it the way it is... Thanks! Mark ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2008-08-15 6:33 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-08-14 6:18 [RFC 2/2] powerpc: copy_4K_page tweaked for Cell - add CPU feature Mark Nelson 2008-08-14 10:51 ` Michael Ellerman 2008-08-14 11:48 ` Mark Nelson 2008-08-14 12:10 ` Michael Ellerman 2008-08-15 6:33 ` Mark Nelson
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).