linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
* [RFC 2/2] powerpc: copy_4K_page tweaked for Cell - add CPU feature
@ 2008-08-14  6:18 Mark Nelson
  2008-08-14 10:51 ` Michael Ellerman
  0 siblings, 1 reply; 5+ messages in thread
From: Mark Nelson @ 2008-08-14  6:18 UTC (permalink / raw)
  To: linuxppc-dev, cbe-oss-dev

Add a new CPU feature, CPU_FTR_CP_USE_DCBTZ, to be added to the CPUs that benefit
from having dcbt and dcbz instructions used in copy_4K_page(). So far Cell, PPC970
and Power4 benefit.

This way all the other 64bit powerpc chips will have the whole prefetching loop
nop'ed out.

Signed-off-by: Mark Nelson <markn@au1.ibm.com>
---
 arch/powerpc/include/asm/cputable.h |    9 ++++++---
 arch/powerpc/lib/copypage_64.S      |    3 ++-
 2 files changed, 8 insertions(+), 4 deletions(-)

Index: upstream/arch/powerpc/include/asm/cputable.h
===================================================================
--- upstream.orig/arch/powerpc/include/asm/cputable.h
+++ upstream/arch/powerpc/include/asm/cputable.h
@@ -192,6 +192,7 @@ extern const char *powerpc_base_platform
 #define CPU_FTR_NO_SLBIE_B		LONG_ASM_CONST(0x0008000000000000)
 #define CPU_FTR_VSX			LONG_ASM_CONST(0x0010000000000000)
 #define CPU_FTR_SAO			LONG_ASM_CONST(0x0020000000000000)
+#define CPU_FTR_CP_USE_DCBTZ		LONG_ASM_CONST(0x0040000000000000)
 
 #ifndef __ASSEMBLY__
 
@@ -387,10 +388,11 @@ extern const char *powerpc_base_platform
 	    CPU_FTR_MMCRA | CPU_FTR_CTRL)
 #define CPU_FTRS_POWER4	(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
 	    CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
-	    CPU_FTR_MMCRA)
+	    CPU_FTR_MMCRA | CPU_FTR_CP_USE_DCBTZ)
 #define CPU_FTRS_PPC970	(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
 	    CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
-	    CPU_FTR_ALTIVEC_COMP | CPU_FTR_CAN_NAP | CPU_FTR_MMCRA)
+	    CPU_FTR_ALTIVEC_COMP | CPU_FTR_CAN_NAP | CPU_FTR_MMCRA | \
+	    CPU_FTR_CP_USE_DCBTZ)
 #define CPU_FTRS_POWER5	(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
 	    CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
 	    CPU_FTR_MMCRA | CPU_FTR_SMT | \
@@ -411,7 +413,8 @@ extern const char *powerpc_base_platform
 #define CPU_FTRS_CELL	(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
 	    CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
 	    CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \
-	    CPU_FTR_PAUSE_ZERO | CPU_FTR_CI_LARGE_PAGE | CPU_FTR_CELL_TB_BUG)
+	    CPU_FTR_PAUSE_ZERO | CPU_FTR_CI_LARGE_PAGE | \
+	    CPU_FTR_CELL_TB_BUG | CPU_FTR_CP_USE_DCBTZ)
 #define CPU_FTRS_PA6T (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
 	    CPU_FTR_HPTE_TABLE | CPU_FTR_PPCAS_ARCH_V2 | \
 	    CPU_FTR_ALTIVEC_COMP | CPU_FTR_CI_LARGE_PAGE | \
Index: upstream/arch/powerpc/lib/copypage_64.S
===================================================================
--- upstream.orig/arch/powerpc/lib/copypage_64.S
+++ upstream/arch/powerpc/lib/copypage_64.S
@@ -18,6 +18,7 @@ PPC64_CACHES:
 
 _GLOBAL(copy_4K_page)
 	li	r5,4096		/* 4K page size */
+BEGIN_FTR_SECTION
 	ld      r10,PPC64_CACHES@toc(r2)
 	lwz	r11,DCACHEL1LOGLINESIZE(r10)	/* log2 of cache line size */
 	lwz     r12,DCACHEL1LINESIZE(r10)	/* Get cache line size */
@@ -30,7 +31,7 @@ setup:
 	dcbz	r9,r3
 	add	r9,r9,r12
 	bdnz	setup
-
+END_FTR_SECTION_IFSET(CPU_FTR_CP_USE_DCBTZ)
 	addi	r3,r3,-8
 	srdi    r8,r5,7		/* page is copied in 128 byte strides */
 	addi	r8,r8,-1	/* one stride copied outside loop */

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC 2/2] powerpc: copy_4K_page tweaked for Cell - add CPU feature
  2008-08-14  6:18 [RFC 2/2] powerpc: copy_4K_page tweaked for Cell - add CPU feature Mark Nelson
@ 2008-08-14 10:51 ` Michael Ellerman
  2008-08-14 11:48   ` Mark Nelson
  0 siblings, 1 reply; 5+ messages in thread
From: Michael Ellerman @ 2008-08-14 10:51 UTC (permalink / raw)
  To: Mark Nelson; +Cc: linuxppc-dev, cbe-oss-dev

[-- Attachment #1: Type: text/plain, Size: 2279 bytes --]

On Thu, 2008-08-14 at 16:18 +1000, Mark Nelson wrote:
> Add a new CPU feature, CPU_FTR_CP_USE_DCBTZ, to be added to the CPUs that benefit
> from having dcbt and dcbz instructions used in copy_4K_page(). So far Cell, PPC970
> and Power4 benefit.
> 
> This way all the other 64bit powerpc chips will have the whole prefetching loop
> nop'ed out.

> Index: upstream/arch/powerpc/lib/copypage_64.S
> ===================================================================
> --- upstream.orig/arch/powerpc/lib/copypage_64.S
> +++ upstream/arch/powerpc/lib/copypage_64.S
> @@ -18,6 +18,7 @@ PPC64_CACHES:
>  
>  _GLOBAL(copy_4K_page)
>  	li	r5,4096		/* 4K page size */
> +BEGIN_FTR_SECTION
>  	ld      r10,PPC64_CACHES@toc(r2)
>  	lwz	r11,DCACHEL1LOGLINESIZE(r10)	/* log2 of cache line size */
>  	lwz     r12,DCACHEL1LINESIZE(r10)	/* Get cache line size */
> @@ -30,7 +31,7 @@ setup:
>  	dcbz	r9,r3
>  	add	r9,r9,r12
>  	bdnz	setup
> -
> +END_FTR_SECTION_IFSET(CPU_FTR_CP_USE_DCBTZ)
>  	addi	r3,r3,-8
>  	srdi    r8,r5,7		/* page is copied in 128 byte strides */
>  	addi	r8,r8,-1	/* one stride copied outside loop */

Instead of nop'ing it out, we could use an alternative feature section
to either run it or jump over it. It would look something like:


_GLOBAL(copy_4K_page)
BEGIN_FTR_SECTION
        li      r5,4096         /* 4K page size */
        ld      r10,PPC64_CACHES@toc(r2)
        lwz     r11,DCACHEL1LOGLINESIZE(r10)    /* log2 of cache line size */
        lwz     r12,DCACHEL1LINESIZE(r10)       /* Get cache line size */
        li      r9,0
        srd     r8,r5,r11

        mtctr   r8
setup:
        dcbt    r9,r4
        dcbz    r9,r3
        add     r9,r9,r12
        bdnz    setup
FTR_SECTION_ELSE
	b	1f
ALT_FTR_SECTION_END_IFSET(CPU_FTR_CP_USE_DCBTZ)
1:
        addi    r3,r3,-8

So in the no-dcbtz case you'd get a branch instead of 11 nops.

Of course you'd need to benchmark it to see if skipping the nops is
better than executing them ;P

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC 2/2] powerpc: copy_4K_page tweaked for Cell - add CPU feature
  2008-08-14 10:51 ` Michael Ellerman
@ 2008-08-14 11:48   ` Mark Nelson
  2008-08-14 12:10     ` Michael Ellerman
  0 siblings, 1 reply; 5+ messages in thread
From: Mark Nelson @ 2008-08-14 11:48 UTC (permalink / raw)
  To: michael; +Cc: linuxppc-dev, cbe-oss-dev

Hi Michael,

On Thu, 14 Aug 2008 08:51:35 pm Michael Ellerman wrote:
> On Thu, 2008-08-14 at 16:18 +1000, Mark Nelson wrote:
> > Add a new CPU feature, CPU_FTR_CP_USE_DCBTZ, to be added to the CPUs that benefit
> > from having dcbt and dcbz instructions used in copy_4K_page(). So far Cell, PPC970
> > and Power4 benefit.
> > 
> > This way all the other 64bit powerpc chips will have the whole prefetching loop
> > nop'ed out.
> 
> > Index: upstream/arch/powerpc/lib/copypage_64.S
> > ===================================================================
> > --- upstream.orig/arch/powerpc/lib/copypage_64.S
> > +++ upstream/arch/powerpc/lib/copypage_64.S
> > @@ -18,6 +18,7 @@ PPC64_CACHES:
> >  
> >  _GLOBAL(copy_4K_page)
> >  	li	r5,4096		/* 4K page size */
> > +BEGIN_FTR_SECTION
> >  	ld      r10,PPC64_CACHES@toc(r2)
> >  	lwz	r11,DCACHEL1LOGLINESIZE(r10)	/* log2 of cache line size */
> >  	lwz     r12,DCACHEL1LINESIZE(r10)	/* Get cache line size */
> > @@ -30,7 +31,7 @@ setup:
> >  	dcbz	r9,r3
> >  	add	r9,r9,r12
> >  	bdnz	setup
> > -
> > +END_FTR_SECTION_IFSET(CPU_FTR_CP_USE_DCBTZ)
> >  	addi	r3,r3,-8
> >  	srdi    r8,r5,7		/* page is copied in 128 byte strides */
> >  	addi	r8,r8,-1	/* one stride copied outside loop */
> 
> Instead of nop'ing it out, we could use an alternative feature section
> to either run it or jump over it. It would look something like:
> 
> 
> _GLOBAL(copy_4K_page)
> BEGIN_FTR_SECTION
>         li      r5,4096         /* 4K page size */
>         ld      r10,PPC64_CACHES@toc(r2)
>         lwz     r11,DCACHEL1LOGLINESIZE(r10)    /* log2 of cache line size */
>         lwz     r12,DCACHEL1LINESIZE(r10)       /* Get cache line size */
>         li      r9,0
>         srd     r8,r5,r11
> 
>         mtctr   r8
> setup:
>         dcbt    r9,r4
>         dcbz    r9,r3
>         add     r9,r9,r12
>         bdnz    setup
> FTR_SECTION_ELSE
> 	b	1f
> ALT_FTR_SECTION_END_IFSET(CPU_FTR_CP_USE_DCBTZ)
> 1:
>         addi    r3,r3,-8
> 
> So in the no-dcbtz case you'd get a branch instead of 11 nops.
> 
> Of course you'd need to benchmark it to see if skipping the nops is
> better than executing them ;P

Thanks for looking through this.

That does look a lot better. In the first version there wasn't quite
as much to nop out (the cache line size was hardcoded to 128
bytes) so I wasn't so worried but I'll definitely try this with an
alternative section like you describe.

The jump probably will turn out to be better because I'd imagine
that the same chips that don't need the dcbt and dcbz because
they've got beefy enough hardware prefetchers also won't be
disturbed by the jump (but benchmarks tomorrow will confirm;
or prove me wrong :) )

Thanks!

Mark

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC 2/2] powerpc: copy_4K_page tweaked for Cell - add CPU feature
  2008-08-14 11:48   ` Mark Nelson
@ 2008-08-14 12:10     ` Michael Ellerman
  2008-08-15  6:33       ` Mark Nelson
  0 siblings, 1 reply; 5+ messages in thread
From: Michael Ellerman @ 2008-08-14 12:10 UTC (permalink / raw)
  To: Mark Nelson; +Cc: linuxppc-dev, cbe-oss-dev

[-- Attachment #1: Type: text/plain, Size: 3284 bytes --]

On Thu, 2008-08-14 at 21:48 +1000, Mark Nelson wrote:
> Hi Michael,
> 
> On Thu, 14 Aug 2008 08:51:35 pm Michael Ellerman wrote:
> > On Thu, 2008-08-14 at 16:18 +1000, Mark Nelson wrote:
> > > Add a new CPU feature, CPU_FTR_CP_USE_DCBTZ, to be added to the CPUs that benefit
> > > from having dcbt and dcbz instructions used in copy_4K_page(). So far Cell, PPC970
> > > and Power4 benefit.
> > > 
> > > This way all the other 64bit powerpc chips will have the whole prefetching loop
> > > nop'ed out.
> > 
> > > Index: upstream/arch/powerpc/lib/copypage_64.S
> > > ===================================================================
> > > --- upstream.orig/arch/powerpc/lib/copypage_64.S
> > > +++ upstream/arch/powerpc/lib/copypage_64.S
> > > @@ -18,6 +18,7 @@ PPC64_CACHES:
> > >  
> > >  _GLOBAL(copy_4K_page)
> > >  	li	r5,4096		/* 4K page size */
> > > +BEGIN_FTR_SECTION
> > >  	ld      r10,PPC64_CACHES@toc(r2)
> > >  	lwz	r11,DCACHEL1LOGLINESIZE(r10)	/* log2 of cache line size */
> > >  	lwz     r12,DCACHEL1LINESIZE(r10)	/* Get cache line size */
> > > @@ -30,7 +31,7 @@ setup:
> > >  	dcbz	r9,r3
> > >  	add	r9,r9,r12
> > >  	bdnz	setup
> > > -
> > > +END_FTR_SECTION_IFSET(CPU_FTR_CP_USE_DCBTZ)
> > >  	addi	r3,r3,-8
> > >  	srdi    r8,r5,7		/* page is copied in 128 byte strides */
> > >  	addi	r8,r8,-1	/* one stride copied outside loop */
> > 
> > Instead of nop'ing it out, we could use an alternative feature section
> > to either run it or jump over it. It would look something like:
> > 
> > 
> > _GLOBAL(copy_4K_page)
> > BEGIN_FTR_SECTION
> >         li      r5,4096         /* 4K page size */
> >         ld      r10,PPC64_CACHES@toc(r2)
> >         lwz     r11,DCACHEL1LOGLINESIZE(r10)    /* log2 of cache line size */
> >         lwz     r12,DCACHEL1LINESIZE(r10)       /* Get cache line size */
> >         li      r9,0
> >         srd     r8,r5,r11
> > 
> >         mtctr   r8
> > setup:
> >         dcbt    r9,r4
> >         dcbz    r9,r3
> >         add     r9,r9,r12
> >         bdnz    setup
> > FTR_SECTION_ELSE
> > 	b	1f
> > ALT_FTR_SECTION_END_IFSET(CPU_FTR_CP_USE_DCBTZ)
> > 1:
> >         addi    r3,r3,-8
> > 
> > So in the no-dcbtz case you'd get a branch instead of 11 nops.
> > 
> > Of course you'd need to benchmark it to see if skipping the nops is
> > better than executing them ;P
> 
> Thanks for looking through this.
> 
> That does look a lot better. In the first version there wasn't quite
> as much to nop out (the cache line size was hardcoded to 128
> bytes) so I wasn't so worried but I'll definitely try this with an
> alternative section like you describe.
> 
> The jump probably will turn out to be better because I'd imagine
> that the same chips that don't need the dcbt and dcbz because
> they've got beefy enough hardware prefetchers also won't be
> disturbed by the jump (but benchmarks tomorrow will confirm;
> or prove me wrong :) )

Yeah, that would make sense. But you never know :)

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC 2/2] powerpc: copy_4K_page tweaked for Cell - add CPU feature
  2008-08-14 12:10     ` Michael Ellerman
@ 2008-08-15  6:33       ` Mark Nelson
  0 siblings, 0 replies; 5+ messages in thread
From: Mark Nelson @ 2008-08-15  6:33 UTC (permalink / raw)
  To: michael; +Cc: linuxppc-dev, cbe-oss-dev

On Thu, 14 Aug 2008 10:10:48 pm Michael Ellerman wrote:
> On Thu, 2008-08-14 at 21:48 +1000, Mark Nelson wrote:
> > Hi Michael,
> > 
> > On Thu, 14 Aug 2008 08:51:35 pm Michael Ellerman wrote:
> > > On Thu, 2008-08-14 at 16:18 +1000, Mark Nelson wrote:
> > > > Add a new CPU feature, CPU_FTR_CP_USE_DCBTZ, to be added to the CPUs that benefit
> > > > from having dcbt and dcbz instructions used in copy_4K_page(). So far Cell, PPC970
> > > > and Power4 benefit.
> > > > 
> > > > This way all the other 64bit powerpc chips will have the whole prefetching loop
> > > > nop'ed out.
> > > 
> > > > Index: upstream/arch/powerpc/lib/copypage_64.S
> > > > ===================================================================
> > > > --- upstream.orig/arch/powerpc/lib/copypage_64.S
> > > > +++ upstream/arch/powerpc/lib/copypage_64.S
> > > > @@ -18,6 +18,7 @@ PPC64_CACHES:
> > > >  
> > > >  _GLOBAL(copy_4K_page)
> > > >  	li	r5,4096		/* 4K page size */
> > > > +BEGIN_FTR_SECTION
> > > >  	ld      r10,PPC64_CACHES@toc(r2)
> > > >  	lwz	r11,DCACHEL1LOGLINESIZE(r10)	/* log2 of cache line size */
> > > >  	lwz     r12,DCACHEL1LINESIZE(r10)	/* Get cache line size */
> > > > @@ -30,7 +31,7 @@ setup:
> > > >  	dcbz	r9,r3
> > > >  	add	r9,r9,r12
> > > >  	bdnz	setup
> > > > -
> > > > +END_FTR_SECTION_IFSET(CPU_FTR_CP_USE_DCBTZ)
> > > >  	addi	r3,r3,-8
> > > >  	srdi    r8,r5,7		/* page is copied in 128 byte strides */
> > > >  	addi	r8,r8,-1	/* one stride copied outside loop */
> > > 
> > > Instead of nop'ing it out, we could use an alternative feature section
> > > to either run it or jump over it. It would look something like:
> > > 
> > > 
> > > _GLOBAL(copy_4K_page)
> > > BEGIN_FTR_SECTION
> > >         li      r5,4096         /* 4K page size */
> > >         ld      r10,PPC64_CACHES@toc(r2)
> > >         lwz     r11,DCACHEL1LOGLINESIZE(r10)    /* log2 of cache line size */
> > >         lwz     r12,DCACHEL1LINESIZE(r10)       /* Get cache line size */
> > >         li      r9,0
> > >         srd     r8,r5,r11
> > > 
> > >         mtctr   r8
> > > setup:
> > >         dcbt    r9,r4
> > >         dcbz    r9,r3
> > >         add     r9,r9,r12
> > >         bdnz    setup
> > > FTR_SECTION_ELSE
> > > 	b	1f
> > > ALT_FTR_SECTION_END_IFSET(CPU_FTR_CP_USE_DCBTZ)
> > > 1:
> > >         addi    r3,r3,-8
> > > 
> > > So in the no-dcbtz case you'd get a branch instead of 11 nops.
> > > 
> > > Of course you'd need to benchmark it to see if skipping the nops is
> > > better than executing them ;P
> > 
> > Thanks for looking through this.
> > 
> > That does look a lot better. In the first version there wasn't quite
> > as much to nop out (the cache line size was hardcoded to 128
> > bytes) so I wasn't so worried but I'll definitely try this with an
> > alternative section like you describe.
> > 
> > The jump probably will turn out to be better because I'd imagine
> > that the same chips that don't need the dcbt and dcbz because
> > they've got beefy enough hardware prefetchers also won't be
> > disturbed by the jump (but benchmarks tomorrow will confirm;
> > or prove me wrong :) )
> 
> Yeah, that would make sense. But you never know :)

It turns out that on Power6 using the alternative section doesn't
have any noticeable effect on performance. On Power5 though it
actually made the compile test a tiny bit slower:

with alternative feature section:

real    5m7.549s
user    17m24.256s
sys     1m0.621s

real    5m7.829s
user    17m24.879s
sys     1m0.465s

original implementation:

real    5m6.468s
user    17m22.891s
sys     0m59.765s

real    5m6.965s
user    17m23.160s
sys     0m59.844s

So I guess I'll leave it the way it is...

Thanks!

Mark

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2008-08-15  6:33 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-08-14  6:18 [RFC 2/2] powerpc: copy_4K_page tweaked for Cell - add CPU feature Mark Nelson
2008-08-14 10:51 ` Michael Ellerman
2008-08-14 11:48   ` Mark Nelson
2008-08-14 12:10     ` Michael Ellerman
2008-08-15  6:33       ` Mark Nelson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).