* [RFC 2/2] powerpc: copy_4K_page tweaked for Cell - add CPU feature
@ 2008-08-14 6:18 Mark Nelson
2008-08-14 10:51 ` Michael Ellerman
0 siblings, 1 reply; 5+ messages in thread
From: Mark Nelson @ 2008-08-14 6:18 UTC (permalink / raw)
To: linuxppc-dev, cbe-oss-dev
Add a new CPU feature, CPU_FTR_CP_USE_DCBTZ, to be added to the CPUs that benefit
from having dcbt and dcbz instructions used in copy_4K_page(). So far Cell, PPC970
and Power4 benefit.
This way all the other 64bit powerpc chips will have the whole prefetching loop
nop'ed out.
Signed-off-by: Mark Nelson <markn@au1.ibm.com>
---
arch/powerpc/include/asm/cputable.h | 9 ++++++---
arch/powerpc/lib/copypage_64.S | 3 ++-
2 files changed, 8 insertions(+), 4 deletions(-)
Index: upstream/arch/powerpc/include/asm/cputable.h
===================================================================
--- upstream.orig/arch/powerpc/include/asm/cputable.h
+++ upstream/arch/powerpc/include/asm/cputable.h
@@ -192,6 +192,7 @@ extern const char *powerpc_base_platform
#define CPU_FTR_NO_SLBIE_B LONG_ASM_CONST(0x0008000000000000)
#define CPU_FTR_VSX LONG_ASM_CONST(0x0010000000000000)
#define CPU_FTR_SAO LONG_ASM_CONST(0x0020000000000000)
+#define CPU_FTR_CP_USE_DCBTZ LONG_ASM_CONST(0x0040000000000000)
#ifndef __ASSEMBLY__
@@ -387,10 +388,11 @@ extern const char *powerpc_base_platform
CPU_FTR_MMCRA | CPU_FTR_CTRL)
#define CPU_FTRS_POWER4 (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
- CPU_FTR_MMCRA)
+ CPU_FTR_MMCRA | CPU_FTR_CP_USE_DCBTZ)
#define CPU_FTRS_PPC970 (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
- CPU_FTR_ALTIVEC_COMP | CPU_FTR_CAN_NAP | CPU_FTR_MMCRA)
+ CPU_FTR_ALTIVEC_COMP | CPU_FTR_CAN_NAP | CPU_FTR_MMCRA | \
+ CPU_FTR_CP_USE_DCBTZ)
#define CPU_FTRS_POWER5 (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
CPU_FTR_MMCRA | CPU_FTR_SMT | \
@@ -411,7 +413,8 @@ extern const char *powerpc_base_platform
#define CPU_FTRS_CELL (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \
- CPU_FTR_PAUSE_ZERO | CPU_FTR_CI_LARGE_PAGE | CPU_FTR_CELL_TB_BUG)
+ CPU_FTR_PAUSE_ZERO | CPU_FTR_CI_LARGE_PAGE | \
+ CPU_FTR_CELL_TB_BUG | CPU_FTR_CP_USE_DCBTZ)
#define CPU_FTRS_PA6T (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | \
CPU_FTR_ALTIVEC_COMP | CPU_FTR_CI_LARGE_PAGE | \
Index: upstream/arch/powerpc/lib/copypage_64.S
===================================================================
--- upstream.orig/arch/powerpc/lib/copypage_64.S
+++ upstream/arch/powerpc/lib/copypage_64.S
@@ -18,6 +18,7 @@ PPC64_CACHES:
_GLOBAL(copy_4K_page)
li r5,4096 /* 4K page size */
+BEGIN_FTR_SECTION
ld r10,PPC64_CACHES@toc(r2)
lwz r11,DCACHEL1LOGLINESIZE(r10) /* log2 of cache line size */
lwz r12,DCACHEL1LINESIZE(r10) /* Get cache line size */
@@ -30,7 +31,7 @@ setup:
dcbz r9,r3
add r9,r9,r12
bdnz setup
-
+END_FTR_SECTION_IFSET(CPU_FTR_CP_USE_DCBTZ)
addi r3,r3,-8
srdi r8,r5,7 /* page is copied in 128 byte strides */
addi r8,r8,-1 /* one stride copied outside loop */
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC 2/2] powerpc: copy_4K_page tweaked for Cell - add CPU feature
2008-08-14 6:18 [RFC 2/2] powerpc: copy_4K_page tweaked for Cell - add CPU feature Mark Nelson
@ 2008-08-14 10:51 ` Michael Ellerman
2008-08-14 11:48 ` Mark Nelson
0 siblings, 1 reply; 5+ messages in thread
From: Michael Ellerman @ 2008-08-14 10:51 UTC (permalink / raw)
To: Mark Nelson; +Cc: linuxppc-dev, cbe-oss-dev
[-- Attachment #1: Type: text/plain, Size: 2279 bytes --]
On Thu, 2008-08-14 at 16:18 +1000, Mark Nelson wrote:
> Add a new CPU feature, CPU_FTR_CP_USE_DCBTZ, to be added to the CPUs that benefit
> from having dcbt and dcbz instructions used in copy_4K_page(). So far Cell, PPC970
> and Power4 benefit.
>
> This way all the other 64bit powerpc chips will have the whole prefetching loop
> nop'ed out.
> Index: upstream/arch/powerpc/lib/copypage_64.S
> ===================================================================
> --- upstream.orig/arch/powerpc/lib/copypage_64.S
> +++ upstream/arch/powerpc/lib/copypage_64.S
> @@ -18,6 +18,7 @@ PPC64_CACHES:
>
> _GLOBAL(copy_4K_page)
> li r5,4096 /* 4K page size */
> +BEGIN_FTR_SECTION
> ld r10,PPC64_CACHES@toc(r2)
> lwz r11,DCACHEL1LOGLINESIZE(r10) /* log2 of cache line size */
> lwz r12,DCACHEL1LINESIZE(r10) /* Get cache line size */
> @@ -30,7 +31,7 @@ setup:
> dcbz r9,r3
> add r9,r9,r12
> bdnz setup
> -
> +END_FTR_SECTION_IFSET(CPU_FTR_CP_USE_DCBTZ)
> addi r3,r3,-8
> srdi r8,r5,7 /* page is copied in 128 byte strides */
> addi r8,r8,-1 /* one stride copied outside loop */
Instead of nop'ing it out, we could use an alternative feature section
to either run it or jump over it. It would look something like:
_GLOBAL(copy_4K_page)
BEGIN_FTR_SECTION
li r5,4096 /* 4K page size */
ld r10,PPC64_CACHES@toc(r2)
lwz r11,DCACHEL1LOGLINESIZE(r10) /* log2 of cache line size */
lwz r12,DCACHEL1LINESIZE(r10) /* Get cache line size */
li r9,0
srd r8,r5,r11
mtctr r8
setup:
dcbt r9,r4
dcbz r9,r3
add r9,r9,r12
bdnz setup
FTR_SECTION_ELSE
b 1f
ALT_FTR_SECTION_END_IFSET(CPU_FTR_CP_USE_DCBTZ)
1:
addi r3,r3,-8
So in the no-dcbtz case you'd get a branch instead of 11 nops.
Of course you'd need to benchmark it to see if skipping the nops is
better than executing them ;P
cheers
--
Michael Ellerman
OzLabs, IBM Australia Development Lab
wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)
We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC 2/2] powerpc: copy_4K_page tweaked for Cell - add CPU feature
2008-08-14 10:51 ` Michael Ellerman
@ 2008-08-14 11:48 ` Mark Nelson
2008-08-14 12:10 ` Michael Ellerman
0 siblings, 1 reply; 5+ messages in thread
From: Mark Nelson @ 2008-08-14 11:48 UTC (permalink / raw)
To: michael; +Cc: linuxppc-dev, cbe-oss-dev
Hi Michael,
On Thu, 14 Aug 2008 08:51:35 pm Michael Ellerman wrote:
> On Thu, 2008-08-14 at 16:18 +1000, Mark Nelson wrote:
> > Add a new CPU feature, CPU_FTR_CP_USE_DCBTZ, to be added to the CPUs that benefit
> > from having dcbt and dcbz instructions used in copy_4K_page(). So far Cell, PPC970
> > and Power4 benefit.
> >
> > This way all the other 64bit powerpc chips will have the whole prefetching loop
> > nop'ed out.
>
> > Index: upstream/arch/powerpc/lib/copypage_64.S
> > ===================================================================
> > --- upstream.orig/arch/powerpc/lib/copypage_64.S
> > +++ upstream/arch/powerpc/lib/copypage_64.S
> > @@ -18,6 +18,7 @@ PPC64_CACHES:
> >
> > _GLOBAL(copy_4K_page)
> > li r5,4096 /* 4K page size */
> > +BEGIN_FTR_SECTION
> > ld r10,PPC64_CACHES@toc(r2)
> > lwz r11,DCACHEL1LOGLINESIZE(r10) /* log2 of cache line size */
> > lwz r12,DCACHEL1LINESIZE(r10) /* Get cache line size */
> > @@ -30,7 +31,7 @@ setup:
> > dcbz r9,r3
> > add r9,r9,r12
> > bdnz setup
> > -
> > +END_FTR_SECTION_IFSET(CPU_FTR_CP_USE_DCBTZ)
> > addi r3,r3,-8
> > srdi r8,r5,7 /* page is copied in 128 byte strides */
> > addi r8,r8,-1 /* one stride copied outside loop */
>
> Instead of nop'ing it out, we could use an alternative feature section
> to either run it or jump over it. It would look something like:
>
>
> _GLOBAL(copy_4K_page)
> BEGIN_FTR_SECTION
> li r5,4096 /* 4K page size */
> ld r10,PPC64_CACHES@toc(r2)
> lwz r11,DCACHEL1LOGLINESIZE(r10) /* log2 of cache line size */
> lwz r12,DCACHEL1LINESIZE(r10) /* Get cache line size */
> li r9,0
> srd r8,r5,r11
>
> mtctr r8
> setup:
> dcbt r9,r4
> dcbz r9,r3
> add r9,r9,r12
> bdnz setup
> FTR_SECTION_ELSE
> b 1f
> ALT_FTR_SECTION_END_IFSET(CPU_FTR_CP_USE_DCBTZ)
> 1:
> addi r3,r3,-8
>
> So in the no-dcbtz case you'd get a branch instead of 11 nops.
>
> Of course you'd need to benchmark it to see if skipping the nops is
> better than executing them ;P
Thanks for looking through this.
That does look a lot better. In the first version there wasn't quite
as much to nop out (the cache line size was hardcoded to 128
bytes) so I wasn't so worried but I'll definitely try this with an
alternative section like you describe.
The jump probably will turn out to be better because I'd imagine
that the same chips that don't need the dcbt and dcbz because
they've got beefy enough hardware prefetchers also won't be
disturbed by the jump (but benchmarks tomorrow will confirm;
or prove me wrong :) )
Thanks!
Mark
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC 2/2] powerpc: copy_4K_page tweaked for Cell - add CPU feature
2008-08-14 11:48 ` Mark Nelson
@ 2008-08-14 12:10 ` Michael Ellerman
2008-08-15 6:33 ` Mark Nelson
0 siblings, 1 reply; 5+ messages in thread
From: Michael Ellerman @ 2008-08-14 12:10 UTC (permalink / raw)
To: Mark Nelson; +Cc: linuxppc-dev, cbe-oss-dev
[-- Attachment #1: Type: text/plain, Size: 3284 bytes --]
On Thu, 2008-08-14 at 21:48 +1000, Mark Nelson wrote:
> Hi Michael,
>
> On Thu, 14 Aug 2008 08:51:35 pm Michael Ellerman wrote:
> > On Thu, 2008-08-14 at 16:18 +1000, Mark Nelson wrote:
> > > Add a new CPU feature, CPU_FTR_CP_USE_DCBTZ, to be added to the CPUs that benefit
> > > from having dcbt and dcbz instructions used in copy_4K_page(). So far Cell, PPC970
> > > and Power4 benefit.
> > >
> > > This way all the other 64bit powerpc chips will have the whole prefetching loop
> > > nop'ed out.
> >
> > > Index: upstream/arch/powerpc/lib/copypage_64.S
> > > ===================================================================
> > > --- upstream.orig/arch/powerpc/lib/copypage_64.S
> > > +++ upstream/arch/powerpc/lib/copypage_64.S
> > > @@ -18,6 +18,7 @@ PPC64_CACHES:
> > >
> > > _GLOBAL(copy_4K_page)
> > > li r5,4096 /* 4K page size */
> > > +BEGIN_FTR_SECTION
> > > ld r10,PPC64_CACHES@toc(r2)
> > > lwz r11,DCACHEL1LOGLINESIZE(r10) /* log2 of cache line size */
> > > lwz r12,DCACHEL1LINESIZE(r10) /* Get cache line size */
> > > @@ -30,7 +31,7 @@ setup:
> > > dcbz r9,r3
> > > add r9,r9,r12
> > > bdnz setup
> > > -
> > > +END_FTR_SECTION_IFSET(CPU_FTR_CP_USE_DCBTZ)
> > > addi r3,r3,-8
> > > srdi r8,r5,7 /* page is copied in 128 byte strides */
> > > addi r8,r8,-1 /* one stride copied outside loop */
> >
> > Instead of nop'ing it out, we could use an alternative feature section
> > to either run it or jump over it. It would look something like:
> >
> >
> > _GLOBAL(copy_4K_page)
> > BEGIN_FTR_SECTION
> > li r5,4096 /* 4K page size */
> > ld r10,PPC64_CACHES@toc(r2)
> > lwz r11,DCACHEL1LOGLINESIZE(r10) /* log2 of cache line size */
> > lwz r12,DCACHEL1LINESIZE(r10) /* Get cache line size */
> > li r9,0
> > srd r8,r5,r11
> >
> > mtctr r8
> > setup:
> > dcbt r9,r4
> > dcbz r9,r3
> > add r9,r9,r12
> > bdnz setup
> > FTR_SECTION_ELSE
> > b 1f
> > ALT_FTR_SECTION_END_IFSET(CPU_FTR_CP_USE_DCBTZ)
> > 1:
> > addi r3,r3,-8
> >
> > So in the no-dcbtz case you'd get a branch instead of 11 nops.
> >
> > Of course you'd need to benchmark it to see if skipping the nops is
> > better than executing them ;P
>
> Thanks for looking through this.
>
> That does look a lot better. In the first version there wasn't quite
> as much to nop out (the cache line size was hardcoded to 128
> bytes) so I wasn't so worried but I'll definitely try this with an
> alternative section like you describe.
>
> The jump probably will turn out to be better because I'd imagine
> that the same chips that don't need the dcbt and dcbz because
> they've got beefy enough hardware prefetchers also won't be
> disturbed by the jump (but benchmarks tomorrow will confirm;
> or prove me wrong :) )
Yeah, that would make sense. But you never know :)
cheers
--
Michael Ellerman
OzLabs, IBM Australia Development Lab
wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)
We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC 2/2] powerpc: copy_4K_page tweaked for Cell - add CPU feature
2008-08-14 12:10 ` Michael Ellerman
@ 2008-08-15 6:33 ` Mark Nelson
0 siblings, 0 replies; 5+ messages in thread
From: Mark Nelson @ 2008-08-15 6:33 UTC (permalink / raw)
To: michael; +Cc: linuxppc-dev, cbe-oss-dev
On Thu, 14 Aug 2008 10:10:48 pm Michael Ellerman wrote:
> On Thu, 2008-08-14 at 21:48 +1000, Mark Nelson wrote:
> > Hi Michael,
> >
> > On Thu, 14 Aug 2008 08:51:35 pm Michael Ellerman wrote:
> > > On Thu, 2008-08-14 at 16:18 +1000, Mark Nelson wrote:
> > > > Add a new CPU feature, CPU_FTR_CP_USE_DCBTZ, to be added to the CPUs that benefit
> > > > from having dcbt and dcbz instructions used in copy_4K_page(). So far Cell, PPC970
> > > > and Power4 benefit.
> > > >
> > > > This way all the other 64bit powerpc chips will have the whole prefetching loop
> > > > nop'ed out.
> > >
> > > > Index: upstream/arch/powerpc/lib/copypage_64.S
> > > > ===================================================================
> > > > --- upstream.orig/arch/powerpc/lib/copypage_64.S
> > > > +++ upstream/arch/powerpc/lib/copypage_64.S
> > > > @@ -18,6 +18,7 @@ PPC64_CACHES:
> > > >
> > > > _GLOBAL(copy_4K_page)
> > > > li r5,4096 /* 4K page size */
> > > > +BEGIN_FTR_SECTION
> > > > ld r10,PPC64_CACHES@toc(r2)
> > > > lwz r11,DCACHEL1LOGLINESIZE(r10) /* log2 of cache line size */
> > > > lwz r12,DCACHEL1LINESIZE(r10) /* Get cache line size */
> > > > @@ -30,7 +31,7 @@ setup:
> > > > dcbz r9,r3
> > > > add r9,r9,r12
> > > > bdnz setup
> > > > -
> > > > +END_FTR_SECTION_IFSET(CPU_FTR_CP_USE_DCBTZ)
> > > > addi r3,r3,-8
> > > > srdi r8,r5,7 /* page is copied in 128 byte strides */
> > > > addi r8,r8,-1 /* one stride copied outside loop */
> > >
> > > Instead of nop'ing it out, we could use an alternative feature section
> > > to either run it or jump over it. It would look something like:
> > >
> > >
> > > _GLOBAL(copy_4K_page)
> > > BEGIN_FTR_SECTION
> > > li r5,4096 /* 4K page size */
> > > ld r10,PPC64_CACHES@toc(r2)
> > > lwz r11,DCACHEL1LOGLINESIZE(r10) /* log2 of cache line size */
> > > lwz r12,DCACHEL1LINESIZE(r10) /* Get cache line size */
> > > li r9,0
> > > srd r8,r5,r11
> > >
> > > mtctr r8
> > > setup:
> > > dcbt r9,r4
> > > dcbz r9,r3
> > > add r9,r9,r12
> > > bdnz setup
> > > FTR_SECTION_ELSE
> > > b 1f
> > > ALT_FTR_SECTION_END_IFSET(CPU_FTR_CP_USE_DCBTZ)
> > > 1:
> > > addi r3,r3,-8
> > >
> > > So in the no-dcbtz case you'd get a branch instead of 11 nops.
> > >
> > > Of course you'd need to benchmark it to see if skipping the nops is
> > > better than executing them ;P
> >
> > Thanks for looking through this.
> >
> > That does look a lot better. In the first version there wasn't quite
> > as much to nop out (the cache line size was hardcoded to 128
> > bytes) so I wasn't so worried but I'll definitely try this with an
> > alternative section like you describe.
> >
> > The jump probably will turn out to be better because I'd imagine
> > that the same chips that don't need the dcbt and dcbz because
> > they've got beefy enough hardware prefetchers also won't be
> > disturbed by the jump (but benchmarks tomorrow will confirm;
> > or prove me wrong :) )
>
> Yeah, that would make sense. But you never know :)
It turns out that on Power6 using the alternative section doesn't
have any noticeable effect on performance. On Power5 though it
actually made the compile test a tiny bit slower:
with alternative feature section:
real 5m7.549s
user 17m24.256s
sys 1m0.621s
real 5m7.829s
user 17m24.879s
sys 1m0.465s
original implementation:
real 5m6.468s
user 17m22.891s
sys 0m59.765s
real 5m6.965s
user 17m23.160s
sys 0m59.844s
So I guess I'll leave it the way it is...
Thanks!
Mark
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2008-08-15 6:33 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-08-14 6:18 [RFC 2/2] powerpc: copy_4K_page tweaked for Cell - add CPU feature Mark Nelson
2008-08-14 10:51 ` Michael Ellerman
2008-08-14 11:48 ` Mark Nelson
2008-08-14 12:10 ` Michael Ellerman
2008-08-15 6:33 ` Mark Nelson
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).