[RFC 3/3] powerpc: copy_4K_page tweaked for Cell

* [RFC 3/3] powerpc: copy_4K_page tweaked for Cell
@ 2008-06-19  7:54 Mark Nelson
  2008-06-19 21:28 ` [Cbe-oss-dev] " Arnd Bergmann
  0 siblings, 1 reply; 3+ messages in thread
From: Mark Nelson @ 2008-06-19  7:54 UTC
  To: linuxppc-dev, cbe-oss-dev; +Cc: Gunnar von Boehn, Michael Ellerman

/*
 * Copyright (C) 2008 Gunnar von Boehn, IBM Corp.
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version
 * 2 of the License, or (at your option) any later version.
 *
 *
 * copy_4K_page routine optimized for CELL-BE-PPC
 *
 * The CELL PPC core has 1 integer unit and 1 load/store unit.
 * CELL: 1st level data cache = 32K - 2nd level data cache = 512K
 * - 3rd level data cache = 0K
 * To improve copy performance we need to prefetch source data
 * far ahead to hide the main-memory latency.
 * For best performance, instruction forms ending in "." like "andi."
 * should be avoided as they are implemented in microcode on CELL.
 *
 * The below code is loop unrolled for the CELL cache line of 128 bytes.
 */

#include <asm/processor.h>
#include <asm/ppc_asm.h>

#define PREFETCH_AHEAD 6
#define ZERO_AHEAD 4

	.align  7
_GLOBAL(copy_4K_page)
	dcbt	0,r4		/* Prefetch ONE SRC cacheline */

	addi	r6,r3,-8	/* prepare for stdu */
	addi	r4,r4,-8	/* prepare for ldu */

	li	r10,32		/* copy 32 cache lines for a 4K page */
	li	r12,128+8		/* prefetch distance*/

	subi	r11,r10,PREFETCH_AHEAD
	li	r10,PREFETCH_AHEAD

	mtctr	r10
.LprefetchSRC:
	dcbt    r12,r4
	addi    r12,r12,128
	bdnz    .LprefetchSRC

.Louterloop:				/* copy whole cache lines */
	mtctr	r11

	li	r11,128*ZERO_AHEAD +8		/* DCBZ dist */

	.align	4
	/* Copy whole cachelines, optimized by prefetching SRC cacheline */
.Lloop: 				/* Copy aligned body */
	dcbt    r12,r4			/* PREFETCH SOURCE some cache lines ahead*/
	ld      r9, 0x08(r4)
	dcbz	r11,r6
	ld      r7, 0x10(r4)    	/* 4 register stride copy */
	ld      r8, 0x18(r4)		/* 4 are optimal to hide 1st level cache latency */
	ld      r0, 0x20(r4)
	std     r9, 0x08(r6)
	std     r7, 0x10(r6)
	std     r8, 0x18(r6)
	std     r0, 0x20(r6)
	ld      r9, 0x28(r4)
	ld      r7, 0x30(r4)
	ld      r8, 0x38(r4)
	ld      r0, 0x40(r4)
	std     r9, 0x28(r6)
	std     r7, 0x30(r6)
	std     r8, 0x38(r6)
	std     r0, 0x40(r6)
	ld      r9, 0x48(r4)
	ld      r7, 0x50(r4)
	ld      r8, 0x58(r4)
	ld      r0, 0x60(r4)
	std     r9, 0x48(r6)
	std     r7, 0x50(r6)
	std     r8, 0x58(r6)
	std     r0, 0x60(r6)
	ld      r9, 0x68(r4)
	ld      r7, 0x70(r4)
	ld      r8, 0x78(r4)
	ldu     r0, 0x80(r4)
	std     r9, 0x68(r6)
	std     r7, 0x70(r6)
	std     r8, 0x78(r6)
	stdu    r0, 0x80(r6)

	bdnz    .Lloop

	sldi    r10,r10,2         	/* adjust from 128 to 32 byte stride */
	mtctr 	r10
.Lloop2: 				/* Copy aligned body */
	ld      r9, 0x08(r4)
	ld      r7, 0x10(r4)
	ld      r8, 0x18(r4)
	ldu     r0, 0x20(r4)
	std     r9, 0x08(r6)
	std     r7, 0x10(r6)
	std     r8, 0x18(r6)
	stdu    r0, 0x20(r6)

	bdnz    .Lloop2

.Lendloop2:
	blr
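
For readers following the loop bookkeeping, here is a minimal C model of the schedule the assembly above appears to implement. It only mirrors the counter arithmetic (PREFETCH_AHEAD, ZERO_AHEAD and the 128-byte line size) and copies no memory. The names struct schedule and model_copy_4K_page are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define LINE_BYTES     128  /* Cell cache line size, from the patch */
#define LINES_PER_PAGE 32   /* 4096 / 128, from the patch */
#define PREFETCH_AHEAD 6    /* from the patch */
#define ZERO_AHEAD     4    /* from the patch */

/* Hypothetical bookkeeping structure, used only by this model. */
struct schedule {
	int main_iters;               /* passes through .Lloop */
	int tail_iters;               /* passes through .Lloop2 */
	int last_prefetched;          /* highest source line touched by dcbt */
	bool zeroed[LINES_PER_PAGE];  /* destination lines hit by dcbz */
	int copied_bytes;
};

static struct schedule model_copy_4K_page(void)
{
	struct schedule s = {0};
	int r12 = LINE_BYTES + 8;  /* offsets are relative to r4 = src - 8 */
	int i;

	/* "dcbt 0,r4" touches line 0; .LprefetchSRC then touches lines 1..6 */
	for (i = 0; i < PREFETCH_AHEAD; i++) {
		s.last_prefetched = (r12 - 8) / LINE_BYTES;
		r12 += LINE_BYTES;
	}

	/* .Lloop: r12 stays fixed while r4 and r6 advance one line per pass */
	s.main_iters = LINES_PER_PAGE - PREFETCH_AHEAD;
	for (i = 0; i < s.main_iters; i++) {
		int zero_line = i + ZERO_AHEAD;  /* dcbz r11,r6 */

		s.last_prefetched = i + (r12 - 8) / LINE_BYTES;
		assert(zero_line < LINES_PER_PAGE);  /* dcbz must stay in the page */
		s.zeroed[zero_line] = true;
		s.copied_bytes += LINE_BYTES;
	}

	/* .Lloop2: "sldi r10,r10,2" turns 6 lines into 24 passes of 32 bytes */
	s.tail_iters = PREFETCH_AHEAD << 2;
	s.copied_bytes += s.tail_iters * 32;
	return s;
}
```

Under this model the main loop runs 26 times and the tail loop 24 times, for 26*128 + 24*32 = 4096 bytes. dcbz is applied to destination lines 4 through 29, and the last dcbt lands on line 32, the first line past the source page, which should be harmless since dcbt is only a hint.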

* Re: [Cbe-oss-dev] [RFC 3/3] powerpc: copy_4K_page tweaked for Cell
  2008-06-19  7:54 [RFC 3/3] powerpc: copy_4K_page tweaked for Cell Mark Nelson
@ 2008-06-19 21:28 ` Arnd Bergmann
  2008-06-20  2:25   ` Mark Nelson
  0 siblings, 1 reply; 3+ messages in thread
From: Arnd Bergmann @ 2008-06-19 21:28 UTC
  To: cbe-oss-dev; +Cc: Mark Nelson, linuxppc-dev, Gunnar von Boehn, Michael Ellerman

On Thursday 19 June 2008, Mark Nelson wrote:
> 	.align  7
> _GLOBAL(copy_4K_page)
> 	dcbt	0,r4		/* Prefetch ONE SRC cacheline */
>
> 	addi	r6,r3,-8	/* prepare for stdu */
> 	addi	r4,r4,-8	/* prepare for ldu */
>
> 	li	r10,32		/* copy 32 cache lines for a 4K page */
> 	li	r12,128+8		/* prefetch distance*/

Since you have a loop here anyway instead of the fully unrolled
code, why not provide a copy_64K_page function as well, jumping in
here?

The inline 64k copy_page function otherwise just adds code size,
as well as being a tiny bit slower. It may even be good to
have an out-of-line copy_64K_page for the regular code, just
calling copy_4K_page repeatedly.

	Arnd <><

* Re: [Cbe-oss-dev] [RFC 3/3] powerpc: copy_4K_page tweaked for Cell
  2008-06-19 21:28 ` [Cbe-oss-dev] " Arnd Bergmann
@ 2008-06-20  2:25   ` Mark Nelson
  0 siblings, 0 replies; 3+ messages in thread
From: Mark Nelson @ 2008-06-20  2:25 UTC
  To: Arnd Bergmann
  Cc: linuxppc-dev, Gunnar von Boehn, cbe-oss-dev, Michael Ellerman

On Fri, 20 Jun 2008 07:28:50 am Arnd Bergmann wrote:
> On Thursday 19 June 2008, Mark Nelson wrote:
> > 	.align  7
> > _GLOBAL(copy_4K_page)
> > 	dcbt	0,r4		/* Prefetch ONE SRC cacheline */
> >
> > 	addi	r6,r3,-8	/* prepare for stdu */
> > 	addi	r4,r4,-8	/* prepare for ldu */
> >
> > 	li	r10,32		/* copy 32 cache lines for a 4K page */
> > 	li	r12,128+8		/* prefetch distance*/
>
> Since you have a loop here anyway instead of the fully unrolled
> code, why not provide a copy_64K_page function as well, jumping in
> here?

That is a good idea. What effect will that have on how the code
patching will work?

>
> The inline 64k copy_page function otherwise just adds code size,
> as well as being a tiny bit slower. It may even be good to
> have an out-of-line copy_64K_page for the regular code, just
> calling copy_4K_page repeatedly.

Doing that sounds like it'll make the code patching easier.

Thanks!

Mark
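
As a rough illustration of the second suggestion in this thread (an out-of-line copy_64K_page that just calls copy_4K_page repeatedly), a sketch in C could look like the following. The memcpy-based copy_4K_page is only a user-space stand-in for the assembly routine. The macro names PAGE_4K_BYTES and PAGE_64K_BYTES and the prototypes are assumptions made for this example, not taken from the patch series.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_4K_BYTES  4096u   /* assumed name, for illustration only */
#define PAGE_64K_BYTES 65536u  /* assumed name, for illustration only */

/* Stand-in for the assembly routine so the sketch builds in user space. */
static void copy_4K_page(void *to, const void *from)
{
	memcpy(to, from, PAGE_4K_BYTES);
}

/* Hypothetical out-of-line helper: 16 back-to-back 4K copies. */
void copy_64K_page(void *to, const void *from)
{
	unsigned char *dst = to;
	const unsigned char *src = from;
	size_t off;

	assert(to != NULL && from != NULL);

	for (off = 0; off < PAGE_64K_BYTES; off += PAGE_4K_BYTES)
		copy_4K_page(dst + off, src + off);
}
```

With this shape only copy_4K_page would need a Cell-specific variant, and the 64K helper picks it up automatically. The other suggestion, a 64K entry point that jumps into the shared loop with a larger line count, would have to be done in the assembly itself and is not shown here.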
