* Efficient memcpy()/memmove() for G2/G3 cores...
@ 2008-08-25 9:31 David Jander
2008-08-25 11:00 ` Matt Sealey
0 siblings, 1 reply; 27+ messages in thread
From: David Jander @ 2008-08-25 9:31 UTC (permalink / raw)
To: linuxppc-dev
Hello,
I was wondering if there is a good replacement for the glibc memcpy() functions
that doesn't have horrendous performance on embedded PowerPC processors (as
glibc's does).
I did some simple benchmarks with this implementation on our custom MPC5121
based board (Freescale e300 core, something like a PPC603e, G2, without VMX):
...
unsigned long int a,b,c,d;
unsigned long int a1,b1,c1,d1;
...
while (len >= 32)
{
	a = plSrc[0];
	b = plSrc[1];
	c = plSrc[2];
	d = plSrc[3];
	a1 = plSrc[4];
	b1 = plSrc[5];
	c1 = plSrc[6];
	d1 = plSrc[7];
	plSrc += 8;
	plDst[0] = a;
	plDst[1] = b;
	plDst[2] = c;
	plDst[3] = d;
	plDst[4] = a1;
	plDst[5] = b1;
	plDst[6] = c1;
	plDst[7] = d1;
	plDst += 8;
	len -= 32;
}
...
And the results are more than telling... by linking this with LD_PRELOAD,
some programs get an enormous performance boost.
For example a small test program that copies frames into video memory (just
RAM) improved throughput from 13.2 MiB/s to 69.5 MiB/s.
I have googled for this issue, but most optimized versions of memcpy() and
friends seem to focus on AltiVec/VMX, which this processor does not have.
Now I am certain that most of the G2/G3 users on this list _must_ have a
better solution for this. Any suggestions?
Btw, the tests are done on Ubuntu/PowerPC 7.10, don't know if that matters
though...
Best regards,
--
David Jander
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient memcpy()/memmove() for G2/G3 cores...
2008-08-25 9:31 Efficient memcpy()/memmove() for G2/G3 cores David Jander
@ 2008-08-25 11:00 ` Matt Sealey
2008-08-25 13:06 ` David Jander
0 siblings, 1 reply; 27+ messages in thread
From: Matt Sealey @ 2008-08-25 11:00 UTC (permalink / raw)
To: David Jander; +Cc: linuxppc-dev
Hi David,
The focus has definitely been on VMX but that's not to say lower power
processors were forgotten :)
Gunnar von Boehn did some benchmarking with an assembly-optimized routine,
for Cell, 603e and so on (basically the whole gamut from embedded up to
server-class IBM chips) and got some pretty good results:
http://www.powerdeveloper.org/forums/viewtopic.php?t=1426
It is definitely something that needs fixing. The generic routine in glibc
just copies words, without the benefit of knowing the cache line size or any
cache block buffers in the chip, and certainly without any use of cache
control or data streaming on higher end chips.
With knowledge of the right way to unroll the loops, how many copies to
do at once to try and get a burst, how to reduce cache usage, etc., you can get
very impressive performance (as you can see, from 50MB/s up to 78MB/s at the
smallest size; the basic improvement is 2x performance).
I hope that helps you a little bit. Gunnar posted code to this list not
long after. I have a copy of the "e300 optimized" routine, but I thought it
best that he post it here rather than me.
There is a lot of scope, I think, for optimizing several components (glibc,
kernel, some applications) for embedded processors, which nobody is
really taking on. But not many people want to do this kind of work...
--
Matt Sealey <matt@genesi-usa.com>
Genesi, Manager, Developer Relations
David Jander wrote:
> Hello,
>
> I was wondering if there is a good replacement for GLibc memcpy() functions,
> that doesn't have horrendous performance on embedded PowerPC processors (such
> as Glibc has).
>
> I did some simple benchmarks with this implementation on our custom MPC5121
> based board (Freescale e300 core, something like a PPC603e, G2, without VMX):
>
> ...
> unsigned long int a,b,c,d;
> unsigned long int a1,b1,c1,d1;
> ...
> while (len >= 32)
> {
> a = plSrc[0];
> b = plSrc[1];
> c = plSrc[2];
> d = plSrc[3];
> a1 = plSrc[4];
> b1 = plSrc[5];
> c1 = plSrc[6];
> d1 = plSrc[7];
> plSrc += 8;
> plDst[0] = a;
> plDst[1] = b;
> plDst[2] = c;
> plDst[3] = d;
> plDst[4] = a1;
> plDst[5] = b1;
> plDst[6] = c1;
> plDst[7] = d1;
> plDst += 8;
> len -= 32;
> }
> ...
>
> And the results are more than telling.... by linking this with LD_PRELOAD,
> some programs get an enourmous performance boost.
> For example a small test program that copies frames into video memory (just
> RAM) improved throughput from 13.2 MiB/s to 69.5 MiB/s.
> I have googled for this issue, but most optimized versions of memcpy() and
> friends seem to focus on AltiVec/VMX, which this processor does not have.
> Now I am certain that most of the G2/G3 users on this list _must_ have a
> better solution for this. Any suggestions?
>
> Btw, the tests are done on Ubuntu/PowerPC 7.10, don't know if that matters
> though...
>
> Best regards,
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient memcpy()/memmove() for G2/G3 cores...
2008-08-25 11:00 ` Matt Sealey
@ 2008-08-25 13:06 ` David Jander
2008-08-25 22:28 ` Benjamin Herrenschmidt
0 siblings, 1 reply; 27+ messages in thread
From: David Jander @ 2008-08-25 13:06 UTC (permalink / raw)
To: Matt Sealey; +Cc: linuxppc-dev
[-- Attachment #1: Type: text/plain, Size: 2595 bytes --]
Hi Matt,
On Monday 25 August 2008 13:00:10 Matt Sealey wrote:
> The focus has definitely been on VMX but that's not to say lower power
> processors were forgotten :)
lower-power (pun intended) is coming on strong these days, as energy efficiency
is getting more important every day. And the MPC5121 is a brand-new embedded
processor that will most probably pop up in quite a lot of devices around
you ;-)
> Gunnar von Boehn did some benchmarking with an assembly optimized routine,
> for Cell, 603e and so on (basically the whole gamut from embedded up to
> sever class IBM chips) and got some pretty good results;
>
> http://www.powerdeveloper.org/forums/viewtopic.php?t=1426
>
> It is definitely something that needs fixing. The generic routine in glibc
> just copies words with no benefit of knowing the cache line size or any
> cache block buffers in the chip, and certainly no use of cache control or
> data streaming on higher end chips.
>
> With knowledge of the right way to unroll the loops, how many copies to
> do at once to try and get a burst, reducing cache usage etc. you can get
> very impressive performance (as you can see, 50MB up to 78MB at the
> smallest size, the basic improvement is 2x performance).
>
> I hope that helps you a little bit. Gunnar posted code to this list not
> long after. I have a copy of the "e300 optimized" routine but I thought
> best he should post it here, than myself.
Ok, I think I found it in the thread. The only problem is that AFAICS it can
be done much better... at least on my platform (e300 core), and I don't know why!
Can you explain this?
I did this:
I took Gunnar's code (copy-paste from the forum), renamed the function from
memcpy_e300 to memcpy and put it in a file called "memcpy_e300.S". Then I
did:
$ gcc -O2 -Wall -shared -o libmemcpye300.so memcpy_e300.S
I tried the performance with the small program in the attachment:
$ gcc -O2 -Wall -o pruvmem pruvmem.c
$ LD_PRELOAD=..../libmemcpye300.so ./pruvmem
Data rate: 45.9 MiB/s
Now I did the same thing with my own memcpy written in C (see attached file
mymemcpy.c):
$ LD_PRELOAD=..../libmymemcpy.so ./pruvmem
Data rate: 72.9 MiB/s
Now, can someone please explain this?
As a reference, here's glibc's performance:
$ ./pruvmem
Data rate: 14.8 MiB/s
> There is a lot of scope I think for optimizing several points (glibc,
> kernel, some applications) for embedded processors which nobody is
> really taking on. But, not many people want to do this kind of work..
They should! It makes a HUGE difference. I surely will of course.
Greetings,
--
David Jander
[-- Attachment #2: pruvmem.c --]
[-- Type: text/x-csrc, Size: 1629 bytes --]
#include <stdio.h>
#include <sys/mman.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
int main(void)
{
	int f;
	unsigned long int *mem, *src, *dst;
	int t;
	long int usecs;
	unsigned long int secs, count;
	double rate;
	struct timeval tv, tv0, tv1;
	printf("Opening fb0\n");
	f = open("/dev/fb0", O_RDWR);
	if (f < 0) {
		perror("opening fb0");
		return 1;
	}
	printf("mmapping fb0\n");
	mem = mmap(NULL, 0x00300000, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, f, 0);
	printf("mmap returned: %08x\n", (unsigned int)mem);
	perror("mmap");
	if (mem == MAP_FAILED)
		return 1;
	gettimeofday(&tv, NULL);
	for (t = 0; t < 0x000c0000; t++)
		mem[t] = (tv.tv_usec ^ tv.tv_sec) ^ t;
	count = 0;
	gettimeofday(&tv0, NULL);
	for (t = 0; t < 10; t++) {
		src = mem;
		dst = mem + 0x00040000;
		memcpy(dst, src, 0x00100000);
		count += 0x00100000;
	}
	gettimeofday(&tv1, NULL);
	secs = tv1.tv_sec - tv0.tv_sec;
	usecs = tv1.tv_usec - tv0.tv_usec;
	if (usecs < 0) {
		usecs += 1000000;
		secs -= 1;
	}
	printf("Time elapsed: %ld secs, %ld usecs data transferred: %ld bytes\n", secs, usecs, count);
	rate = (double)count / ((double)secs + (double)usecs / 1000000.0);
	printf("Data rate: %5.3g MiB/s\n", rate / (1024.0 * 1024.0));
	return 0;
}
[-- Attachment #3: mymemcpy.c --]
[-- Type: text/x-csrc, Size: 2289 bytes --]
#include <stdlib.h>
void * memcpy(void * dst, void const * src, size_t len)
{
	unsigned long int a,b,c,d;
	unsigned long int a1,b1,c1,d1;
	unsigned long int a2,b2,c2,d2;
	unsigned long int a3,b3,c3,d3;
	long * plDst = (long *) dst;
	long const * plSrc = (long const *) src;
	//if (!((unsigned long)src & 0xFFFFFFFC) && !((unsigned long)dst & 0xFFFFFFFC))
	//{
	while (len >= 64)
	{
		a = plSrc[0];
		b = plSrc[1];
		c = plSrc[2];
		d = plSrc[3];
		a1 = plSrc[4];
		b1 = plSrc[5];
		c1 = plSrc[6];
		d1 = plSrc[7];
		a2 = plSrc[8];
		b2 = plSrc[9];
		c2 = plSrc[10];
		d2 = plSrc[11];
		a3 = plSrc[12];
		b3 = plSrc[13];
		c3 = plSrc[14];
		d3 = plSrc[15];
		plSrc += 16;
		plDst[0] = a;
		plDst[1] = b;
		plDst[2] = c;
		plDst[3] = d;
		plDst[4] = a1;
		plDst[5] = b1;
		plDst[6] = c1;
		plDst[7] = d1;
		plDst[8] = a2;
		plDst[9] = b2;
		plDst[10] = c2;
		plDst[11] = d2;
		plDst[12] = a3;
		plDst[13] = b3;
		plDst[14] = c3;
		plDst[15] = d3;
		plDst += 16;
		len -= 64;
	}
	while (len >= 16) {
		a = plSrc[0];
		b = plSrc[1];
		c = plSrc[2];
		d = plSrc[3];
		plSrc += 4;
		plDst[0] = a;
		plDst[1] = b;
		plDst[2] = c;
		plDst[3] = d;
		plDst += 4;
		len -= 16;
	}
	//}
	char * pcDst = (char *) plDst;
	char const * pcSrc = (char const *) plSrc;
	while (len--)
	{
		*pcDst++ = *pcSrc++;
	}
	return (dst);
}
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient memcpy()/memmove() for G2/G3 cores...
2008-08-25 13:06 ` David Jander
@ 2008-08-25 22:28 ` Benjamin Herrenschmidt
2008-08-27 21:04 ` Steven Munroe
0 siblings, 1 reply; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2008-08-25 22:28 UTC (permalink / raw)
To: David Jander; +Cc: linuxppc-dev
On Mon, 2008-08-25 at 15:06 +0200, David Jander wrote:
> Hi Matt,
>
> On Monday 25 August 2008 13:00:10 Matt Sealey wrote:
> > The focus has definitely been on VMX but that's not to say lower power
> > processors were forgotten :)
>
> lower-power (pun intended) is coming strong these days, as energy-efficiency
> is getteing more important every day. And the MPC5121 is a brand-new embedded
> processor, that will pop-up in quite a lot devices around you most
> probably ;-)
It would be useful of somebody interested in getting things things
into glibc did the necessary FSF copyright assignment stuff and worked
toward integrating them.
Ben.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient memcpy()/memmove() for G2/G3 cores...
2008-08-25 22:28 ` Benjamin Herrenschmidt
@ 2008-08-27 21:04 ` Steven Munroe
2008-08-29 11:48 ` David Jander
0 siblings, 1 reply; 27+ messages in thread
From: Steven Munroe @ 2008-08-27 21:04 UTC (permalink / raw)
To: benh; +Cc: linuxppc-dev, David Jander
On Tue, 2008-08-26 at 08:28 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2008-08-25 at 15:06 +0200, David Jander wrote:
> > Hi Matt,
> >
> > On Monday 25 August 2008 13:00:10 Matt Sealey wrote:
> > > The focus has definitely been on VMX but that's not to say lower power
> > > processors were forgotten :)
> >
> > lower-power (pun intended) is coming strong these days, as energy-efficiency
> > is getteing more important every day. And the MPC5121 is a brand-new embedded
> > processor, that will pop-up in quite a lot devices around you most
> > probably ;-)
>
> It would be useful of somebody interested in getting things things
> into glibc did the necessary FSF copyright assignment stuff and worked
> toward integrating them.
>
Ben makes a very good point!
There is a process for contributing code to GLIBC, which starts with an
FSF copyright assignment.
There is also a framework for adding and maintaining optimizations of
this type:
http://penguinppc.org/dev/glibc/glibc-powerpc-cpu-addon.html
Since this original effort the powerpc changes have been merged into
mainline glibc (GLIBC-2.7) and no longer require a separate
(powerpc-cpu) addon. But the --with-cpu= configure option still works.
This mechanism also works with the glibc ports addon and eglibc.
So it does no good to complain here. If you have code you want to
contribute, get your FSF CR assignment and join #glibc on freenode IRC.
And we will help you.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient memcpy()/memmove() for G2/G3 cores...
2008-08-27 21:04 ` Steven Munroe
@ 2008-08-29 11:48 ` David Jander
2008-08-29 12:21 ` Joakim Tjernlund
` (2 more replies)
0 siblings, 3 replies; 27+ messages in thread
From: David Jander @ 2008-08-29 11:48 UTC (permalink / raw)
To: munroesj; +Cc: linuxppc-dev
On Wednesday 27 August 2008 23:04:39 Steven Munroe wrote:
> On Tue, 2008-08-26 at 08:28 +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2008-08-25 at 15:06 +0200, David Jander wrote:
> > > Hi Matt,
> > >
> > > On Monday 25 August 2008 13:00:10 Matt Sealey wrote:
> > > > The focus has definitely been on VMX but that's not to say lower
> > > > power processors were forgotten :)
> > >
> > > lower-power (pun intended) is coming strong these days, as
> > > energy-efficiency is getteing more important every day. And the MPC5121
> > > is a brand-new embedded processor, that will pop-up in quite a lot
> > > devices around you most probably ;-)
> >
> > It would be useful of somebody interested in getting things things
> > into glibc did the necessary FSF copyright assignment stuff and worked
> > toward integrating them.
>
> Ben makes a very good point!
Sounds reasonable... but I am still wondering about what you mean
with "things"?
AFAICS there is almost nothing there (besides the memcpy() routine from Gunnar
von Boehn, which is apparently still far from optimal). And I was asking for
someone to correct me here ;-)
> There is also a framework for adding and maintaining optimizations of
> this type:
>
> http://penguinppc.org/dev/glibc/glibc-powerpc-cpu-addon.html
I had already stumbled across this one, but it seems to focus on G3 or newer
processors (power4). There is no optimal memcpy() for G2/PPC603/e300.
>[...]
> So it does no good to complain here. If you have core you want to
> contribute, Get your FSF CR assignment and join #glibc on freenode IRC.
I am not complaining. I was only wondering if it is just me or there really is
very little that has been done (for either uClibc, glibc, or whatever for
powerpc) to improve performance of (linux-) applications on "lower"-power
platforms (G2 core), AFAICS there is a LOT that can be gained by simple
tweaks.
> And we will help you.
Thanks, now that I know which is the "correct" way to contribute, I only need
to come up with a good set of optimizations worthy of inclusion in glibc.
OTOH, maybe it is easier and simpler to start with a collection of functions
in a shared library that is suited for preloading via LD_PRELOAD
or /etc/ld.so.preload...
Maybe once this collection is more stable (in the sense that heavy tweaking has
stopped) one could try the pilgrimage towards glibc inclusion...
The problem is: I have very little experience with powerpc assembly and only
very limited time to dedicate to this and I am looking for others who have
Greetings,
--
David Jander
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient memcpy()/memmove() for G2/G3 cores...
2008-08-29 11:48 ` David Jander
@ 2008-08-29 12:21 ` Joakim Tjernlund
2008-09-01 7:23 ` David Jander
2008-08-29 20:34 ` Steven Munroe
2008-08-31 8:28 ` Benjamin Herrenschmidt
2 siblings, 1 reply; 27+ messages in thread
From: Joakim Tjernlund @ 2008-08-29 12:21 UTC (permalink / raw)
To: David Jander; +Cc: munroesj, linuxppc-dev
On Fri, 2008-08-29 at 13:48 +0200, David Jander wrote:
> On Wednesday 27 August 2008 23:04:39 Steven Munroe wrote:
> > On Tue, 2008-08-26 at 08:28 +1000, Benjamin Herrenschmidt wrote:
> > > On Mon, 2008-08-25 at 15:06 +0200, David Jander wrote:
> > > > Hi Matt,
[SNIP]
> I am not complaining. I was only wondering if it is just me or there really is
> very little that has been done (for either uClibc, glibc, or whatever for
> powerpc) to improve performance of (linux-) applications on "lower"-power
> platforms (G2 core), AFAICS there is a LOT that can be gained by simple
> tweaks.
[SNIP]
>
> The problem is: I have very little experience with powerpc assembly and only
> very limited time to dedicate to this and I am looking for others who have
I improved the PowerPC memcpy and friends in uClibc a while ago. It does
basically the same as the kernel memcpy, but without any cache
instructions. It is written in C, but in such a way that
optimal assembly is generated.
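To give an idea of the style (just a rough sketch here, not the actual uClibc
code; the function name is made up and the real routine also handles the
unaligned head and tail): the inner loop is plain C with explicit temporaries,
laid out so gcc schedules all the loads before the stores and emits the
lwz/stw (or update-form lwzu/stwu) sequences by itself:
#include <stddef.h>
/* Sketch only: copy word-aligned buffers one 32-byte cache line (8 words)
 * per iteration.  The temporaries let gcc issue the eight loads before the
 * eight stores, much like the hand-unrolled loops posted earlier in this
 * thread, but without any hand-written assembly. */
static void copy_words(unsigned long *dst, const unsigned long *src,
                       size_t nwords)
{
	while (nwords >= 8) {
		unsigned long w0 = src[0], w1 = src[1], w2 = src[2], w3 = src[3];
		unsigned long w4 = src[4], w5 = src[5], w6 = src[6], w7 = src[7];
		dst[0] = w0; dst[1] = w1; dst[2] = w2; dst[3] = w3;
		dst[4] = w4; dst[5] = w5; dst[6] = w6; dst[7] = w7;
		src += 8;
		dst += 8;
		nwords -= 8;
	}
	while (nwords--)
		*dst++ = *src++;
}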
Jocke
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient memcpy()/memmove() for G2/G3 cores...
2008-08-29 11:48 ` David Jander
2008-08-29 12:21 ` Joakim Tjernlund
@ 2008-08-29 20:34 ` Steven Munroe
2008-09-01 8:29 ` David Jander
2008-08-31 8:28 ` Benjamin Herrenschmidt
2 siblings, 1 reply; 27+ messages in thread
From: Steven Munroe @ 2008-08-29 20:34 UTC (permalink / raw)
To: David Jander; +Cc: linuxppc-dev
On Fri, 2008-08-29 at 13:48 +0200, David Jander wrote:
> On Wednesday 27 August 2008 23:04:39 Steven Munroe wrote:
> > On Tue, 2008-08-26 at 08:28 +1000, Benjamin Herrenschmidt wrote:
> > > On Mon, 2008-08-25 at 15:06 +0200, David Jander wrote:
> > > > Hi Matt,
> > > >
> > > > On Monday 25 August 2008 13:00:10 Matt Sealey wrote:
> > > > > The focus has definitely been on VMX but that's not to say lower
> > > > > power processors were forgotten :)
> > > >
[SNIP]
> > >
> > > It would be useful of somebody interested in getting things things
> > > into glibc did the necessary FSF copyright assignment stuff and worked
> > > toward integrating them.
> >
> > Ben makes a very good point!
>
> Sounds reasonable... but I am still wondering about what you mean
> with "things"?
> AFAICS there is almost nothing there (besides the memcpy() routine from Gunnar
> von Boehn, which is apparently still far from optimal). And I was asking for
> someone to correct me here ;-)
>
> > There is also a framework for adding and maintaining optimizations of
> > this type:
> >
> > http://penguinppc.org/dev/glibc/glibc-powerpc-cpu-addon.html
>
> I had already stumbled across this one, but it seems to focus on G3 or newer
> processors (power4). There is no optimal memcpy() for G2/PPC603/e300.
>
Well, folks volunteer to work on code for the hardware they have, use,
and care about. I don't have any of that hardware...
This framework can be used to add optimizations for any valid gcc
-mcpu=<cpu-type> target.
> >[...]
> > So it does no good to complain here. If you have core you want to
> > contribute, Get your FSF CR assignment and join #glibc on freenode IRC.
>
> I am not complaining. I was only wondering if it is just me or there really is
> very little that has been done (for either uClibc, glibc, or whatever for
> powerpc) to improve performance of (linux-) applications on "lower"-power
> platforms (G2 core), AFAICS there is a LOT that can be gained by simple
> tweaks.
>
This is a self-help group (free as in freedom). We help each other, and
you can help yourself. There is no free lunch.
> > And we will help you.
[SNIP]
>
> The problem is: I have very little experience with powerpc assembly and only
> very limited time to dedicate to this and I am looking for others who have
>
Well this will be a good learning experience for you. We will try to
answer questions.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient memcpy()/memmove() for G2/G3 cores...
2008-08-29 11:48 ` David Jander
2008-08-29 12:21 ` Joakim Tjernlund
2008-08-29 20:34 ` Steven Munroe
@ 2008-08-31 8:28 ` Benjamin Herrenschmidt
2008-09-01 6:42 ` David Jander
2 siblings, 1 reply; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2008-08-31 8:28 UTC (permalink / raw)
To: David Jander; +Cc: munroesj, linuxppc-dev
> > It would be useful of somebody interested in getting things things
> > > into glibc did the necessary FSF copyright assignment stuff and worked
> > > toward integrating them.
> >
> > Ben makes a very good point!
>
> Sounds reasonable... but I am still wondering about what you mean
> with "things"?
Typo. I meant "these things", that is, variants of various libc
functions optimized for a given processor type.
> AFAICS there is almost nothing there (besides the memcpy() routine from Gunnar
> von Boehn, which is apparently still far from optimal). And I was asking for
> someone to correct me here ;-)
No idea, as we said, it's mostly up to users of the processors (or to a
certain extent, manufacturers, hint hint hint) to do that work.
> > There is also a framework for adding and maintaining optimizations of
> > this type:
> >
> > http://penguinppc.org/dev/glibc/glibc-powerpc-cpu-addon.html
>
> I had already stumbled across this one, but it seems to focus on G3 or newer
> processors (power4). There is no optimal memcpy() for G2/PPC603/e300.
It focuses on what the people doing it have access to, are paid to work
on, or other material constraints. It's up to others from the community
to fill the gaps.
> >[...]
> > So it does no good to complain here. If you have core you want to
> > contribute, Get your FSF CR assignment and join #glibc on freenode IRC.
>
> I am not complaining. I was only wondering if it is just me or there really is
> very little that has been done (for either uClibc, glibc, or whatever for
> powerpc) to improve performance of (linux-) applications on "lower"-power
> platforms (G2 core), AFAICS there is a LOT that can be gained by simple
> tweaks.
Well, possibly, then you are welcome to work on those tweaks and if they
indeed improve things, submit patches to glibc :-) I'm sure Steve and
Ryan will be happy to help with the submission process.
> > And we will help you.
>
> Thanks, now that I know which is the "correct" way to contribute, I only need
> to come up with a good set of optimization, worthy of inclusion in glibc.
You don't have to do it all at once. A simple tweak of one function
such as memcpy, if it's measurably improving performance without
notable regressions, could be a first step, and then tweak after tweak...
It's a common mistake to try to do too much "out of tree" and then
struggle and give up when it's time to merge that stuff because there
are too many areas that won't necessarily be acceptable "as is".
One little bit at a time is generally a better approach.
> OTOH, maybe it is easier and simpler to start with a collection of functions
> in a shared-library, that may be suited for preloading via LD_PRELOAD
> or /etc/ld_preload...
>
> Maybe once this collection is more stable (in terms of that heavy tweaking has
> stopped) one could try the pilgrimage towards glibc inclusion....
I believe that's the wrong approach as it leads to never-merged out-of-tree
code.
> The problem is: I have very little experience with powerpc assembly and only
> very limited time to dedicate to this and I am looking for others who have
Cheers,
Ben.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient memcpy()/memmove() for G2/G3 cores...
2008-08-31 8:28 ` Benjamin Herrenschmidt
@ 2008-09-01 6:42 ` David Jander
0 siblings, 0 replies; 27+ messages in thread
From: David Jander @ 2008-09-01 6:42 UTC (permalink / raw)
To: benh; +Cc: munroesj, linuxppc-dev
On Sunday 31 August 2008 10:28:43 Benjamin Herrenschmidt wrote:
> > > It would be useful of somebody interested in getting things things
>
> > > > into glibc did the necessary FSF copyright assignment stuff and
> > > > worked toward integrating them.
> > >
> > > Ben makes a very good point!
> >
> > Sounds reasonable... but I am still wondering about what you mean
> > with "things"?
>
> Typo. I meant "these things", that is, variants of various libc
> functions optimized for a given processor type.
Ok, we'd have to _make_ those "things" first then ;-)
> > AFAICS there is almost nothing there (besides the memcpy() routine from
> > Gunnar von Boehn, which is apparently still far from optimal). And I was
> > asking for someone to correct me here ;-)
>
> No idea, as we said, it's mostly up to users of the processors (or to a
> certain extent, manufacturers, hint hint hint) to do that work.
Ok, I get the point.
> > > There is also a framework for adding and maintaining optimizations of
> > > this type:
> > >
> > > http://penguinppc.org/dev/glibc/glibc-powerpc-cpu-addon.html
> >
> > I had already stumbled across this one, but it seems to focus on G3 or
> > newer processors (power4). There is no optimal memcpy() for
> > G2/PPC603/e300.
>
> It focuses on what the people doing it have access to, are paid to work
> on, or other material constraints. It's up to others from the community
> to fill the gaps.
That's all I need to know ;-)
> > >[...]
> > > So it does no good to complain here. If you have core you want to
> > > contribute, Get your FSF CR assignment and join #glibc on freenode IRC.
> >
> > I am not complaining. I was only wondering if it is just me or there
> > really is very little that has been done (for either uClibc, glibc, or
> > whatever for powerpc) to improve performance of (linux-) applications on
> > "lower"-power platforms (G2 core), AFAICS there is a LOT that can be
> > gained by simple tweaks.
>
> Well, possibly, then you are welcome to work on those tweaks and if they
> indeed improve things, submit patches to glibc :-) I'm sure Steve and
> Ryan will be happy to help with the submission process.
Sounds encouraging. I'll try my best (in the limited amount of time I have).
>[...]
> You don't have to do it all at once. A simple tweak of one function
> such as memcpy, if it's measurably improving performances without
> notable regressions could be a first step, and then tweak after tweak...
>
> It's a common mistake to try to do too much "out of tree" and then
> struggle and give up when it's time to merge that stuff because there
> are too many areas that won't necessarily be acceptable "as is".
>
> One little bit at a time is generally a better approach.
Ok, I take your advice.
> > OTOH, maybe it is easier and simpler to start with a collection of
> > functions in a shared-library, that may be suited for preloading via
> > LD_PRELOAD or /etc/ld_preload...
> >
> > Maybe once this collection is more stable (in terms of that heavy
> > tweaking has stopped) one could try the pilgrimage towards glibc
> > inclusion....
>
> I believe that's the wrong approach as it leads to never-merged out-of
> tree code.
Hmm... you mean, it'll be easier to keep patching (improving) things once they
are already in glibc? Interesting.
Thanks a lot for your comments.
Best regards,
--
David Jander
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient memcpy()/memmove() for G2/G3 cores...
2008-08-29 12:21 ` Joakim Tjernlund
@ 2008-09-01 7:23 ` David Jander
2008-09-01 9:36 ` Joakim Tjernlund
0 siblings, 1 reply; 27+ messages in thread
From: David Jander @ 2008-09-01 7:23 UTC (permalink / raw)
To: joakim.tjernlund; +Cc: munroesj, linuxppc-dev
On Friday 29 August 2008 14:20:33 Joakim Tjernlund wrote:
>[...]
> > The problem is: I have very little experience with powerpc assembly and
> > only very limited time to dedicate to this and I am looking for others
> > who have
>
> I improved the PowerPC memcpy and friends in uClibc a while ago. It does
> basically the same a the kernel memcpy but without any cache
> instructions. It is written in C, but in such a way that
> optimal assembly is generated.
Hmm, isn't that going to break on a different version of gcc?
I just copied the latest version of trunk/uClibc/libc/string/powerpc/memcpy.c
from subversion as uclibc-memcpy.c, removed the last line and did this:
$ gcc -shared -O2 -Wall -o libucmemcpy.so uclibc-memcpy.c
(should I use other compiler options?)
Then I started my test program with LD_PRELOAD=...
My test program only copies big chunks of aligned memory, so it will only test
for maximum throughput (such as copying video frames). I will make a better
one, to measure throughput on different sized blocks of aligned and unaligned
memory, but first I want to find out why I can't seem to get even close to
the expected RAM bandwidth (bursts occur at 1.6 Gbyte/s, sustained transfers
might be able to reach 400 Mbyte/s in theory, taking into account the video
controller eating almost half of it, I'd like to get somewhere close to 200).
The result is quite a bit better than that of glibc-2.7 (13.2 Mbyte/s --> 22
Mbyte/s), but still far from the 71.5 Mbyte/s achieved when using bigger
strides of 16 registers load/store at a time.
Note that this is copy performance; one-way throughput should be double these
figures.
I'll try to learn how the cache-manipulating instructions work, to see if I can
gain some more bandwidth using them.
Regards,
--
David Jander
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient memcpy()/memmove() for G2/G3 cores...
2008-08-29 20:34 ` Steven Munroe
@ 2008-09-01 8:29 ` David Jander
0 siblings, 0 replies; 27+ messages in thread
From: David Jander @ 2008-09-01 8:29 UTC (permalink / raw)
To: munroesj; +Cc: linuxppc-dev
On Friday 29 August 2008 22:34:21 Steven Munroe wrote:
> > I am not complaining. I was only wondering if it is just me or there
> > really is very little that has been done (for either uClibc, glibc, or
> > whatever for powerpc) to improve performance of (linux-) applications on
> > "lower"-power platforms (G2 core), AFAICS there is a LOT that can be
> > gained by simple tweaks.
>
> This is a self help group (free as in freedom) We help each other. And
> you can help yourself. There is no free lunch.
I never expected to be served a free dish of any kind on a mailing-list ;-)
I was just asking around, to avoid reinventing wheels, since I intend to dig
into this problem, that's all. My intention never was to pick up work from
others and then run.
> > The problem is: I have very little experience with powerpc assembly and
> > only very limited time to dedicate to this and I am looking for others
> > who have
>
> Well this will be a good learning experience for you. We will try to
> answer questions.
Excellent. I love learning new stuff ;-)
Thanks a lot for the guidance so far...
Regards,
--
David Jander
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient memcpy()/memmove() for G2/G3 cores...
2008-09-01 7:23 ` David Jander
@ 2008-09-01 9:36 ` Joakim Tjernlund
2008-09-02 13:12 ` David Jander
0 siblings, 1 reply; 27+ messages in thread
From: Joakim Tjernlund @ 2008-09-01 9:36 UTC (permalink / raw)
To: David Jander; +Cc: munroesj, linuxppc-dev
On Mon, 2008-09-01 at 09:23 +0200, David Jander wrote:
> On Friday 29 August 2008 14:20:33 Joakim Tjernlund wrote:
> >[...]
> > > The problem is: I have very little experience with powerpc assembly and
> > > only very limited time to dedicate to this and I am looking for others
> > > who have
> >
> > I improved the PowerPC memcpy and friends in uClibc a while ago. It does
> > basically the same a the kernel memcpy but without any cache
> > instructions. It is written in C, but in such a way that
> > optimal assembly is generated.
>
> Hmm, isn't that going to break on a different version of gcc?
Not break, but gcc might generate non-optimal code. However, the code
is laid out to make it easy for gcc to do the right thing.
> I just copied the latest version of trunk/uClibc/libc/string/powerpc/memcpy.c
> from subversion as uclibc-memcpy.c, removed the last line and did this:
>
> $ gcc -shared -O2 -Wall -o libucmemcpy.so uclibc-memcpy.c
>
> (should I use other compiler options?)
These are fine.
>
> Then I started my test program with LD_PRELOAD=...
>
> My test program only copies big chunks of aligned memory, so it will only test
> for maximum throughput (such as copying video frames). I will make a better
> one, to measure throughput on different sized blocks of aligned and unaligned
> memory, but first I want to find out why I can't seem to get even close to
> the expected RAM bandwidth (bursts occur at 1.6 Gbyte/s, sustained transfers
> might be able to reach 400 Mbyte/s in theory, taking into account the video
> controller eating almost half of it, I'd like to get somewhere close to 200).
>
> The result is quite a bit better than that of glibc-2.7 (13.2 Mbyte/s --> 22
> Mbyte/s), but still far from the 71.5 Mbyte/s achieved when using bigger
> strides of 16 registers load/store at a time.
> Note, that this is copy performance, one-way througput should be double these
> figures.
Yeah, the code is trying to do a reasonable job without knowing what
micro-arch it is running on. These could probably go to glibc
as new general-purpose memxxx() routines. You will probably see
a big increase once dcbz is added to the copy/memset functions.
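To show what I mean by adding dcbz (a rough sketch only, not the uClibc code;
it assumes a cacheable, 32-byte-aligned destination, and the name is made up;
a real memset still needs the head/tail handling):
#include <stddef.h>
#define LINE 32		/* L1 cache line size on e300 */
/* Sketch: zero whole cache lines with dcbz.  dcbz allocates the line in the
 * data cache and zeroes it, so the destination is never fetched from memory.
 * Only valid for cacheable, line-aligned memory. */
static void zero_lines(void *dst, size_t len)
{
	char *p = dst;
	while (len >= LINE) {
		__asm__ volatile ("dcbz 0,%0" : : "r" (p) : "memory");
		p += LINE;
		len -= LINE;
	}
	while (len--)		/* trailing partial line, plain stores */
		*p++ = 0;
}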
Fire away :)
Jocke
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient memcpy()/memmove() for G2/G3 cores...
2008-09-01 9:36 ` Joakim Tjernlund
@ 2008-09-02 13:12 ` David Jander
2008-09-03 6:43 ` Joakim Tjernlund
2008-09-03 20:33 ` prodyut hazarika
0 siblings, 2 replies; 27+ messages in thread
From: David Jander @ 2008-09-02 13:12 UTC (permalink / raw)
To: linuxppc-dev; +Cc: munroesj, John Rigby
On Monday 01 September 2008 11:36:15 Joakim Tjernlund wrote:
>[...]
> > Then I started my test program with LD_PRELOAD=...
> >
> > My test program only copies big chunks of aligned memory, so it will only
> > test for maximum throughput (such as copying video frames). I will make a
> > better one, to measure throughput on different sized blocks of aligned
> > and unaligned memory, but first I want to find out why I can't seem to
> > get even close to the expected RAM bandwidth (bursts occur at 1.6
> > Gbyte/s, sustained transfers might be able to reach 400 Mbyte/s in
> > theory, taking into account the video controller eating almost half of
> > it, I'd like to get somewhere close to 200).
> >
> > The result is quite a bit better than that of glibc-2.7 (13.2 Mbyte/s -->
> > 22 Mbyte/s), but still far from the 71.5 Mbyte/s achieved when using
> > bigger strides of 16 registers load/store at a time.
> > Note, that this is copy performance, one-way througput should be double
> > these figures.
>
> Yeah, the code is trying to do a reasonable job without knowing what
> micro arch it is running on. These could probably go to glibc
> as new general purpose memxxx() routines. You will probably see
> a big increase once dcbz is added to the copy/memset functions.
>
> Fire away :)
Ok here I go:
I have made some astonishing discoveries, and I'd like to post the used
source-code somewhere in the meantime, any suggestions? To this list?
There seem to be some substantial differences between the e300 core used in
the MPC5200B and in the MPC5121e (besides the MPC5121 having double the
amount of cache). Memcpy()-performance-wise, these differences amount to the
following. The tests were done with vanilla glibc (versions 2.6.1 and 2.7,
without any powerpc-specific memcpy() optimizations), Gunnar von Boehn's
memcpy_e300 and my tweaked version, memcpy_e300_dj, which basically uses
16-register strides instead of 4-register strides in Gunnar's example.
memcpy() peak-performance (RAM memory throughput) on:
MPC5200B, glibc-2.6, no optimizations: 136 Mbyte/s
MPC5121e, glibc-2.7, no optimizations: 30 Mbyte/s
MPC5200B, memcpy_e300: 225 Mbyte/s
MPC5121e, memcpy_e300: 130 Mbyte/s
MPC5200B, memcpy_e300_dj: 200 Mbyte/s
MPC5121e, memcpy_e300_dj: 202 Mbyte/s
For the MPC5121e, 16-register strides seem to be most optimal, whereas for the
MPC5200B, 4-register strides give best performance. Also, plain C memcpy()
performance on the MPC5121e is terribly poor! Does anyone know why? I don't
quite seem to understand those results.
Some information on the test hardware:
MPC5200B-based board has 64 Mbyte DDR-SDRAM, 32-bit wide (two x16 chips),
running ubuntu 7.10 with kernel 2.6.19.2.
MPC5121e-based board has 256 Mbyte DDR2-SDRAM, 32-bit wide (two x16 chips),
running ubuntu 8.04.1 with kernel 2.6.24.5 from Freescale LTIB with the DIU
turned OFF. When the DIU is turned on, maximum throughput drops from 202 to
196 Mbyte/s.
The memcpy_e300 variants basically use 4- or 16-register load/store strides, cache
alignment and dcbz/dcbt cache-manipulation instructions to tweak performance.
I have not tried interleaving integer and FPU instructions.
Does anybody have any suggestion about where to start searching for an
explanation of these results? I have the impression that there is something
wrong with my setup, or with the e300c4 core, or both, but what????
Greetings,
--
David Jander
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient memcpy()/memmove() for G2/G3 cores...
2008-09-02 13:12 ` David Jander
@ 2008-09-03 6:43 ` Joakim Tjernlund
2008-09-03 20:33 ` prodyut hazarika
1 sibling, 0 replies; 27+ messages in thread
From: Joakim Tjernlund @ 2008-09-03 6:43 UTC (permalink / raw)
To: David Jander; +Cc: linuxppc-dev, John Rigby, munroesj
On Tue, 2008-09-02 at 15:12 +0200, David Jander wrote:
> I have made some astonishing discoveries, and I'd like to post the
> used source-code somewhere in the meantime, any suggestions? To this list?
Yes, mail it.
I got a mpc8323/8321 board I want to try.
> For the MPC5121e, 16-register strides seem to be most optimal, whereas for the
> MPC5200B, 4-register strides give best performance. Also, plain C memcpy()
Plain C, is that the uClibc one?
Did you try to tweak it?
Jocke
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient memcpy()/memmove() for G2/G3 cores...
2008-09-02 13:12 ` David Jander
2008-09-03 6:43 ` Joakim Tjernlund
@ 2008-09-03 20:33 ` prodyut hazarika
2008-09-04 2:04 ` Paul Mackerras
1 sibling, 1 reply; 27+ messages in thread
From: prodyut hazarika @ 2008-09-03 20:33 UTC (permalink / raw)
To: David Jander; +Cc: linuxppc-dev, John Rigby, munroesj
Hi all,
> These could probably go to glibc
> as new general purpose memxxx() routines. You will probably see
> a big increase once dcbz is added to the copy/memset functions.
glibc memxxx for powerpc are horribly inefficient. For optimal performance,
we should use the dcbt instruction to establish the source address in cache, and
dcbz to establish the destination address in cache. We should do
dcbt and dcbz such that the touches happen a line ahead of the actual copy.
The problem which I see is that the dcbt and dcbz instructions don't work on
non-cacheable memory (obviously!). But the memxxx functions are used for both
cached and non-cached memory. Thus this optimized memcpy should be smart enough
to figure out that both the source and destination addresses fall in
cacheable space, and only then
use the optimized dcbt/dcbz instructions.
You can expect to see a significant jump in the memxxx functions after
using dcbt/dcbz.
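Roughly along these lines (only a sketch of the idea with made-up names,
assuming both buffers are cacheable, 32-byte aligned, 32-bit longs and a
length that is a multiple of 32; a real memcpy still needs the alignment and
tail handling around it):
#include <stddef.h>
#define LINE 32		/* e300 L1 cache line size */
/* Sketch: for each 32-byte line, touch the *next* source line with dcbt and
 * establish/zero the current destination line with dcbz before storing to
 * it, so the destination is never read from memory. */
static void copy_lines(unsigned long *dst, const unsigned long *src, size_t len)
{
	while (len >= LINE) {
		__asm__ volatile ("dcbt 0,%0" : : "r" ((const char *)src + LINE));
		__asm__ volatile ("dcbz 0,%0" : : "r" (dst) : "memory");
		dst[0] = src[0]; dst[1] = src[1]; dst[2] = src[2]; dst[3] = src[3];
		dst[4] = src[4]; dst[5] = src[5]; dst[6] = src[6]; dst[7] = src[7];
		src += 8;
		dst += 8;
		len -= LINE;
	}
}
dcbt is only a hint, so touching one line past the end of the source is
harmless; dcbz really does zero the whole line, which is why the destination
must be line-aligned and cacheable here.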
Thanks,
Prodyut Hazarika
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient memcpy()/memmove() for G2/G3 cores...
2008-09-03 20:33 ` prodyut hazarika
@ 2008-09-04 2:04 ` Paul Mackerras
2008-09-04 12:05 ` David Jander
2008-09-04 18:14 ` prodyut hazarika
0 siblings, 2 replies; 27+ messages in thread
From: Paul Mackerras @ 2008-09-04 2:04 UTC (permalink / raw)
To: prodyut hazarika; +Cc: linuxppc-dev, David Jander, John Rigby, munroesj
prodyut hazarika writes:
> glibc memxxx for powerpc are horribly inefficient. For optimal performance,
> we should should dcbt instruction to establish the source address in cache, and
> dcbz to establish the destination address in cache. We should do
> dcbt and dcbz such that the touches happen a line ahead of the actual copy.
>
> The problem which is see is that dcbt and dcbz instructions don't work on
> non-cacheable memory (obviously!). But memxxx function are used for both
> cached and non-cached memory. Thus this optimized memcpy should be smart enough
> to figure out that both source and destination address fall in
> cacheable space, and only then
> used the optimized dcbt/dcbz instructions.
I would be careful about adding overhead to memcpy. I found that in
the kernel, almost all calls to memcpy are for less than 128 bytes (1
cache line on most 64-bit machines). So, adding a lot of code to
detect cacheability and do prefetching is just going to slow down the
common case, which is short copies. I don't have statistics for glibc
but I wouldn't be surprised if most copies were short there also.
The other thing that I have found is that code that is optimal for
cache-cold copies is usually significantly slower than optimal for
cache-hot copies, because the cache management instructions consume
cycles and don't help in the cache-hot case.
In other words, I don't think we should be tuning the glibc memcpy
based on tests of how fast it copies multiple megabytes.
Still, for 6xx/e300 cores, we probably do want to use dcbt/dcbz for
larger copies. We don't want to use dcbt/dcbz on the larger 64-bit
processors (POWER4/5/6) because the hardware prefetching and
write-combining mean that dcbt/dcbz don't help and just slow things
down.
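To make the point concrete (a hypothetical sketch, not proposed glibc code;
the names and the 128-byte cut-off are made up and would need tuning per
core), the shape would be: keep the short, common case on a dumb path with no
cache games at all, and only let big copies pay for the cache-managed path:
#include <stddef.h>
#define BIG_COPY 128	/* hypothetical cut-off, to be tuned per core */
/* dumb path for the common short case: no cacheability checks, no prefetch */
static void *byte_copy(void *dst, const void *src, size_t len)
{
	char *d = dst;
	const char *s = src;
	while (len--)
		*d++ = *s++;
	return dst;
}
/* the dcbt/dcbz cache-line routine (e.g. Gunnar's memcpy_e300), assumed to
 * be provided elsewhere */
void *memcpy_e300(void *dst, const void *src, size_t len);
void *memcpy_dispatch(void *dst, const void *src, size_t len)
{
	if (len < BIG_COPY)
		return byte_copy(dst, src, len);
	return memcpy_e300(dst, src, len);
}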
Paul.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient memcpy()/memmove() for G2/G3 cores...
2008-09-04 2:04 ` Paul Mackerras
@ 2008-09-04 12:05 ` David Jander
2008-09-04 12:19 ` Josh Boyer
2008-09-04 18:14 ` prodyut hazarika
1 sibling, 1 reply; 27+ messages in thread
From: David Jander @ 2008-09-04 12:05 UTC (permalink / raw)
To: Paul Mackerras; +Cc: linuxppc-dev, prodyut hazarika, John Rigby, munroesj
[-- Attachment #1: Type: text/plain, Size: 5031 bytes --]
On Thursday 04 September 2008 04:04:58 Paul Mackerras wrote:
> prodyut hazarika writes:
> > glibc memxxx for powerpc are horribly inefficient. For optimal
> > performance, we should should dcbt instruction to establish the source
> > address in cache, and dcbz to establish the destination address in cache.
> > We should do dcbt and dcbz such that the touches happen a line ahead of
> > the actual copy.
> >
> > The problem which is see is that dcbt and dcbz instructions don't work on
> > non-cacheable memory (obviously!). But memxxx function are used for both
> > cached and non-cached memory. Thus this optimized memcpy should be smart
> > enough to figure out that both source and destination address fall in
> > cacheable space, and only then
> > used the optimized dcbt/dcbz instructions.
>
> I would be careful about adding overhead to memcpy. I found that in
> the kernel, almost all calls to memcpy are for less than 128 bytes (1
> cache line on most 64-bit machines). So, adding a lot of code to
> detect cacheability and do prefetching is just going to slow down the
> common case, which is short copies. I don't have statistics for glibc
> but I wouldn't be surprised if most copies were short there also.
Then please explain the following. This is a memcpy() speed test for different
sized blocks on a MPC5121e (DIU is turned on). The first case is glibc code
without optimizations, and the second case is 16-register strides with
dcbt/dcbz instructions, written in assembly language (see attachment)
$ ./memcpyspeed
Fully aligned:
100000 chunks of 5 bytes : 3.48 Mbyte/s ( throughput: 6.96 Mbytes/s)
50000 chunks of 16 bytes : 14.3 Mbyte/s ( throughput: 28.6 Mbytes/s)
10000 chunks of 100 bytes : 14.4 Mbyte/s ( throughput: 28.8 Mbytes/s)
5000 chunks of 256 bytes : 14.4 Mbyte/s ( throughput: 28.7 Mbytes/s)
1000 chunks of 1000 bytes : 14.4 Mbyte/s ( throughput: 28.7 Mbytes/s)
50 chunks of 16384 bytes : 14.2 Mbyte/s ( throughput: 28.4 Mbytes/s)
1 chunks of 1048576 bytes : 14.4 Mbyte/s ( throughput: 28.8 Mbytes/s)
$ LD_PRELOAD=./libmemcpye300dj.so ./memcpyspeed
Fully aligned:
100000 chunks of 5 bytes : 7.44 Mbyte/s ( throughput: 14.9 Mbytes/s)
50000 chunks of 16 bytes : 13.1 Mbyte/s ( throughput: 26.2 Mbytes/s)
10000 chunks of 100 bytes : 29.4 Mbyte/s ( throughput: 58.8 Mbytes/s)
5000 chunks of 256 bytes : 90.2 Mbyte/s ( throughput: 180 Mbytes/s)
1000 chunks of 1000 bytes : 77 Mbyte/s ( throughput: 154 Mbytes/s)
50 chunks of 16384 bytes : 96.8 Mbyte/s ( throughput: 194 Mbytes/s)
1 chunks of 1048576 bytes : 97.6 Mbyte/s ( throughput: 195 Mbytes/s)
(I have edited the output of this tool to fit into an e-mail without wrapping
lines for readability).
Please tell me how on earth there can be such a big difference???
Note that on a MPC5200B this is TOTALLY different, and both processors have an
e300 core (different versions of it though).
> The other thing that I have found is that code that is optimal for
> cache-cold copies is usually significantly slower than optimal for
> cache-hot copies, because the cache management instructions consume
> cycles and don't help in the cache-hot case.
>
> In other words, I don't think we should be tuning the glibc memcpy
> based on tests of how fast it copies multiple megabytes.
I don't just copy multiple megabytes! See the above example. Also, I do constant
performance testing of different applications using LD_PRELOAD, to see the
impact. Recently I even tried prboom (a free doom port), to remember the
good old days of PC benchmarking ;-)
I have yet to come across a test that has lower performance with this
optimization (on an MPC5121e that is).
> Still, for 6xx/e300 cores, we probably do want to use dcbt/dcbz for
> larger copies. We don't want to use dcbt/dcbz on the larger 64-bit
At least for MPC5121e you really, really need it!!
> processors (POWER4/5/6) because the hardware prefetching and
> write-combining mean that dcbt/dcbz don't help and just slow things
> down.
That's explainable.
What's not explainable, are the results I am getting on the MPC5121e.
Please, could someone tell me what I am doing wrong? (I must be doing
something wrong, I'm almost sure).
One thing that I realize is not quite "right" with memcpyspeed.c is the fact
that it copies consecutive blocks of memory, which should have an impact on
the 5-byte and 16-byte copy results I guess (a cache line for the following block
may already be fetched), but not anymore for 100-byte blocks and bigger (with
32-byte cache lines). In fact, 16 bytes seems to be the only size where the
additional overhead has some impact (which is negligible).
Another thing is that performance probably matters most to the end user when
applications need to copy big amounts of data (e.g. video frames or bitmap
data), which is most probably done using big blocks of memcpy(), so
possibly hurting performance for small copies probably has less weight on the
overall experience.
Best regards,
--
David Jander
[-- Attachment #2: memcpy_e300_dj.S --]
[-- Type: text/x-objcsrc, Size: 5413 bytes --]
/* Optimized memcpy() implementation for PowerPC e300c4 core (Freescale MPC5121)
*
* Written by Gunnar von Boehn
* Tweaked by David Jander to improve performance on MPC5121e processor.
*/
#include "ppc_asm.h"
#define L1_CACHE_SHIFT 5
#define MAX_COPY_PREFETCH 4
#define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT)
CACHELINE_BYTES = L1_CACHE_BYTES
LG_DOUBLE_CACHELINE = (L1_CACHE_SHIFT+1)
CACHELINE_MASK = (L1_CACHE_BYTES-1)
/*
* Memcpy optimized for PPC e300
*
* This relatively simple memcpy does the following to optimize performance
*
* For sizes > 32 byte:
* DST is aligned to 32bit boundary - using 8bit copies
* DST is aligned to cache line boundary (32byte) - using aligned 32bit copies
* The main copy loop processes two cache lines (64 bytes) per iteration in this tweaked version
* The DST cache lines are cleared using DCBZ
* The clearing of the aligned DST cache line is very important for performance
* it prevents the CPU from fetching the DST line from memory - this saves 33% of memory accesses.
* To optimize SRC read performance the SRC is prefetched using DCBT
*
* The trick for getting good performance is to use a good match of prefetch distance
* for SRC reading and for DST clearing.
* Typically you DCBZ the DST 0 or 1 cache line ahead
* Typically you DCBT the SRC 2 - 4 cache lines ahead
* on the e300 prefetching the SRC too far ahead will be slower than not prefetching at all.
*
* We use DCBZ DST[0] and DCBT SRC[0-1] depending on the SRC alignment
*
*/
.align 2
/* parameters r3=DST, r4=SRC, r5=size */
/* returns r3=DST */
.global memcpy
memcpy:
mr r7,r3 /* Save DST in r7 for return */
dcbt 0,r4 /* Prefetch SRC cache line 32byte */
neg r0,r3 /* DST alignment */
addi r4,r4,-4
andi. r0,r0,CACHELINE_MASK /* # of bytes away from cache line boundary */
addi r6,r3,-4
cmplw cr1,r5,r0 /* is this more than total to do? */
beq .Lcachelinealigned
blt cr1,.Lcopyrest /* if not much to do */
andi. r8,r0,3 /* get it word-aligned first */
mtctr r8
beq+ .Ldstwordaligned
.Laligntoword:
lbz r9,4(r4) /* we copy bytes (8bit) 0-3 */
stb r9,4(r6) /* to get the DST 32bit aligned */
addi r4,r4,1
addi r6,r6,1
bdnz .Laligntoword
.Ldstwordaligned:
subf r5,r0,r5
srwi. r0,r0,2
mtctr r0
beq .Lcachelinealigned
.Laligntocacheline:
lwzu r9,4(r4) /* do copy 32bit words (0-7) */
stwu r9,4(r6) /* to get DST cache line aligned (32byte) */
bdnz .Laligntocacheline
.Lcachelinealigned:
srwi. r0,r5,LG_DOUBLE_CACHELINE /* # complete cachelines */
clrlwi r5,r5,32-LG_DOUBLE_CACHELINE
li r11,32
beq .Lcopyrest
addi r3,r4,4 /* Find out which SRC cacheline to prefetch */
neg r3,r3
andi. r3,r3,31
addi r3,r3,32
mtctr r0
stwu r1,-76(r1) /* Save some tmp registers */
stw r23,28(r1)
stw r30,56(r1)
stw r31,60(r1)
stw r24,32(r1)
stw r25,36(r1)
stw r26,40(r1)
stw r27,44(r1)
stw r28,48(r1)
stw r29,52(r1)
stw r13,64(r1)
stw r14,68(r1)
stw r15,72(r1)
.align 7
.Lloop: /* the main body of the cacheline loop */
dcbt r3,r4 /* SRC cache line prefetch */
dcbz r11,r6 /* clear DST cache line */
lwz r31, 0x04(r4) /* copy using a 8 register stride for best performance on e300 */
lwz r8, 0x08(r4)
lwz r9, 0x0c(r4)
lwz r10, 0x10(r4)
lwz r12, 0x14(r4)
lwz r13, 0x18(r4)
lwz r14, 0x1c(r4)
lwzu r23, 0x20(r4)
dcbt r3,r4 /* SRC cache line prefetch */
lwz r24, 0x04(r4)
lwz r25, 0x08(r4)
lwz r26, 0x0c(r4)
lwz r27, 0x10(r4)
lwz r28, 0x14(r4)
lwz r29, 0x18(r4)
lwz r30, 0x1c(r4)
lwzu r15, 0x20(r4)
stw r31, 0x04(r6)
stw r8, 0x08(r6)
stw r9, 0x0c(r6)
stw r10, 0x10(r6)
stw r12, 0x14(r6)
stw r13, 0x18(r6)
stw r14, 0x1c(r6)
stwu r23, 0x20(r6)
dcbz r11,r6 /* clear DST cache line */
stw r24, 0x04(r6)
stw r25, 0x08(r6)
stw r26, 0x0c(r6)
stw r27, 0x10(r6)
stw r28, 0x14(r6)
stw r29, 0x18(r6)
stw r30, 0x1c(r6)
stwu r15, 0x20(r6)
bdnz .Lloop
lwz r24,32(r1) /* restore tmp registers */
lwz r23,28(r1)
lwz r25,36(r1)
lwz r26,40(r1)
lwz r27,44(r1)
lwz r28,48(r1)
lwz r29,52(r1)
lwz r30,56(r1)
lwz r31,60(r1)
lwz r13,64(r1)
lwz r14,68(r1)
lwz r15,72(r1)
addi r1,r1,76
.Lcopyrest:
srwi. r0,r5,2
mtctr r0
beq .Llastbytes
.Lcopywords:
lwzu r0,4(r4) /* we copy remaining words (0-7) */
stwu r0,4(r6)
bdnz .Lcopywords
.Llastbytes:
andi. r0,r5,3
mtctr r0
beq+ .Lend
.Lcopybytes:
lbz r0,4(r4) /* we copy remaining bytes (0-3) */
stb r0,4(r6)
addi r4,r4,1
addi r6,r6,1
bdnz .Lcopybytes
.Lend: /* done : return 0 for Linux / DST for glibc*/
mr r3, r7
blr
[-- Attachment #3: memcpyspeed.c --]
[-- Type: text/x-csrc, Size: 4073 bytes --]
#include <stdio.h>
#include <sys/mman.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
// #define VIDEO_MMAP
// #define TEST_UNALIGNED
static void *srcpool;
static void *dstpool;
unsigned int sizes[] = {5, 16, 100, 256, 1000, 16384, 1048576};
unsigned int nums[] = {100000, 50000, 10000, 5000, 1000, 50, 1};
#define TESTRUNS 10
unsigned int memtest(int size, int num, int srcaligned, int dstaligned)
{
struct timeval tv0, tv1;
unsigned char *src=(unsigned char *)srcpool, *dst=(unsigned char *)dstpool;
unsigned char *sp, *dp;
unsigned int t,i;
long int usecs;
unsigned long int secs;
/* Get src and dst 32-byte aligned */
src = (unsigned char *)((unsigned int)(src+31) & 0xffffffe0);
dst = (unsigned char *)((unsigned int)(dst+31) & 0xffffffe0);
/* Now unalign them if desired (some random offset) */
if(!srcaligned)
src += 11;
if(!dstaligned)
dst += 13;
/* "Train" the system (caches, paging, etc...) */
sp = src;
dp = dst;
for(i=0; i<num; i++) {
memcpy(dp, sp, size);
sp += size;
dp += size;
}
/* Start measurement */
gettimeofday(&tv0, NULL);
for(t=0; t<TESTRUNS; t++) {
sp = src;
dp = dst;
for(i=0; i<num; i++) {
memcpy(dp, sp, size);
sp += size;
dp += size;
}
}
gettimeofday(&tv1, NULL);
secs = tv1.tv_sec-tv0.tv_sec;
usecs = tv1.tv_usec-tv0.tv_usec;
if(usecs<0) {
usecs += 1000000;
secs -= 1;
}
return usecs+1000000L*secs;
}
unsigned int memverify(int size, int num, int srcaligned, int dstaligned)
{
unsigned char *src=(unsigned char *)srcpool, *dst=(unsigned char *)dstpool;
/* Get src and dst 32-byte aligned */
src = (unsigned char *)((unsigned int)(src+31) & 0xffffffe0);
dst = (unsigned char *)((unsigned int)(dst+31) & 0xffffffe0);
/* Now unalign them if desired (some random offset) */
if(!srcaligned)
src += 11;
if(!dstaligned)
dst += 13;
return memcmp(dst, src, size*num);
}
void evaluate(char *name, unsigned int totalsize, unsigned int usecs)
{
double rate;
rate = (double)totalsize*(double)TESTRUNS/((double)usecs/1000000.0);
rate /= (1024.0*1024.0);
printf("Memcpy %-30s: %5.3g Mbyte/s (memory throughput: %5.3g Mbytes/s)\n",name, rate, rate*2.0);
}
int main(void)
{
int t,i;
unsigned int usecs;
char buf[50];
struct timeval tv;
#ifdef VIDEO_MMAP
unsigned long int *mem;
int f;
printf("Opening fb0\n");
f = open("/dev/fb0", O_RDWR);
if(f<0) {
perror("opening fb0");
return 1;
}
printf("mmapping fb0\n");
mem = mmap(NULL, 0x00300000, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,f,0);
printf("mmap returned: %08x\n",(unsigned int)mem);
perror("mmap");
if(mem == MAP_FAILED)
return 1;
#else
unsigned long int mem[786432];
#endif
srcpool = (unsigned char *)mem;
dstpool = (unsigned char *)mem;
dstpool += 1572864; /* 1.5 Mbyte offset into 3 Mbyte framebuffer */
gettimeofday(&tv, NULL);
for(t=0; t<0x000c0000; t++)
mem[t] = (tv.tv_usec ^ tv.tv_sec) ^ t;
printf("Fully aligned:\n");
for(t=0; t<(sizeof(nums)/sizeof(nums[0])); t++) {
snprintf(buf, 50, "%d chunks of %d bytes", nums[t], sizes[t]);
usecs = memtest(sizes[t], nums[t], 1, 1);
evaluate(buf, nums[t]*sizes[t], usecs);
if(memverify(sizes[t], nums[t], 1, 1)) {
printf("Verify faild!\n");
}
}
#ifdef TEST_UNALIGNED
printf("source unaligned:\n");
for(t=0; t<(sizeof(nums)/sizeof(nums[0])); t++) {
snprintf(buf, 50, "%d chunks of %d bytes", nums[t], sizes[t]);
usecs = memtest(sizes[t], nums[t], 0, 1);
evaluate(buf, nums[t]*sizes[t], usecs);
}
printf("destination unaligned:\n");
for(t=0; t<(sizeof(nums)/sizeof(nums[0])); t++) {
snprintf(buf, 50, "%d chunks of %d bytes", nums[t], sizes[t]);
usecs = memtest(sizes[t], nums[t], 1, 0);
evaluate(buf, nums[t]*sizes[t], usecs);
}
printf("both unaligned:\n");
for(t=0; t<(sizeof(nums)/sizeof(nums[0])); t++) {
snprintf(buf, 50, "%d chunks of %d bytes", nums[t], sizes[t]);
usecs = memtest(sizes[t], nums[t], 0, 0);
evaluate(buf, nums[t]*sizes[t], usecs);
}
#endif
return 0;
}
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient memcpy()/memmove() for G2/G3 cores...
2008-09-04 12:05 ` David Jander
@ 2008-09-04 12:19 ` Josh Boyer
2008-09-04 12:59 ` David Jander
0 siblings, 1 reply; 27+ messages in thread
From: Josh Boyer @ 2008-09-04 12:19 UTC (permalink / raw)
To: David Jander
Cc: linuxppc-dev, Paul Mackerras, munroesj, John Rigby,
prodyut hazarika
On Thu, Sep 04, 2008 at 02:05:16PM +0200, David Jander wrote:
>> I would be careful about adding overhead to memcpy. I found that in
>> the kernel, almost all calls to memcpy are for less than 128 bytes (1
>> cache line on most 64-bit machines). So, adding a lot of code to
>> detect cacheability and do prefetching is just going to slow down the
>> common case, which is short copies. I don't have statistics for glibc
>> but I wouldn't be surprised if most copies were short there also.
>
>Then please explain the following. This is a memcpy() speed test for different
>sized blocks on a MPC5121e (DIU is turned on). The first case is glibc code
>without optimizations, and the second case is 16-register strides with
>dcbt/dcbz instructions, written in assembly language (see attachment)
>
>$ ./memcpyspeed
>Fully aligned:
>100000 chunks of 5 bytes : 3.48 Mbyte/s ( throughput: 6.96 Mbytes/s)
>50000 chunks of 16 bytes : 14.3 Mbyte/s ( throughput: 28.6 Mbytes/s)
>10000 chunks of 100 bytes : 14.4 Mbyte/s ( throughput: 28.8 Mbytes/s)
>5000 chunks of 256 bytes : 14.4 Mbyte/s ( throughput: 28.7 Mbytes/s)
>1000 chunks of 1000 bytes : 14.4 Mbyte/s ( throughput: 28.7 Mbytes/s)
>50 chunks of 16384 bytes : 14.2 Mbyte/s ( throughput: 28.4 Mbytes/s)
>1 chunks of 1048576 bytes : 14.4 Mbyte/s ( throughput: 28.8 Mbytes/s)
>
>$ LD_PRELOAD=./libmemcpye300dj.so ./memcpyspeed
>Fully aligned:
>100000 chunks of 5 bytes : 7.44 Mbyte/s ( throughput: 14.9 Mbytes/s)
>50000 chunks of 16 bytes : 13.1 Mbyte/s ( throughput: 26.2 Mbytes/s)
>10000 chunks of 100 bytes : 29.4 Mbyte/s ( throughput: 58.8 Mbytes/s)
>5000 chunks of 256 bytes : 90.2 Mbyte/s ( throughput: 180 Mbytes/s)
>1000 chunks of 1000 bytes : 77 Mbyte/s ( throughput: 154 Mbytes/s)
>50 chunks of 16384 bytes : 96.8 Mbyte/s ( throughput: 194 Mbytes/s)
>1 chunks of 1048576 bytes : 97.6 Mbyte/s ( throughput: 195 Mbytes/s)
>
>(I have edited the output of this tool to fit into an e-mail without wrapping
>lines for readability).
>Please tell me how on earth there can be such a big difference???
>Note that on a MPC5200B this is TOTALLY different, and both processors have an
>e300 core (different versions of it though).
How can there be such a big difference in throughput? Well, your algorithm
seems better optimized than the glibc one for your testcase :).
>> The other thing that I have found is that code that is optimal for
>> cache-cold copies is usually significantly slower than optimal for
>> cache-hot copies, because the cache management instructions consume
>> cycles and don't help in the cache-hot case.
>>
>> In other words, I don't think we should be tuning the glibc memcpy
>> based on tests of how fast it copies multiple megabytes.
>
>I don't just copy multiple megabytes! See above example. Also I do constant
>performance testing of different applications using LD_PRELOAD, to see the
>impact. Recently I even tried prboom (a free doom port), to remember the
>good old days of PC benchmarking ;-)
>I have yet to come across a test that has lower performance with this
>optimization (on an MPC5121e that is).
>
>> Still, for 6xx/e300 cores, we probably do want to use dcbt/dcbz for
>> larger copies. We don't want to use dcbt/dcbz on the larger 64-bit
>
>At least for MPC5121e you really, really need it!!
>
>> processors (POWER4/5/6) because the hardware prefetching and
>> write-combining mean that dcbt/dcbz don't help and just slow things
>> down.
>
>That's explainable.
>What's not explainable, are the results I am getting on the MPC5121e.
>Please, could someone tell me what I am doing wrong? (I must be doing
>something wrong, I'm almost sure).
I don't think you're doing anything wrong exactly. But it seems that
your testcase sits there and just copies data with memcpy in varying
sizes and amounts. That's not exactly a real-world usecase is it?
I think what Paul was saying is that during the course of runtime for a
normal program (the kernel or userspace), most memcpy operations will be of
a small order of magnitude. They will also be scattered among code that does
_other_ stuff than just memcpy. So he's concerned about the overhead of an
implementation that sets up the cache to do a single 32 byte memcpy.
Of course, I could be totally wrong. I haven't had my coffee yet this
morning after all.
josh
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient memcpy()/memmove() for G2/G3 cores...
2008-09-04 12:19 ` Josh Boyer
@ 2008-09-04 12:59 ` David Jander
2008-09-04 14:31 ` Steven Munroe
2008-09-04 15:01 ` Gunnar Von Boehn
0 siblings, 2 replies; 27+ messages in thread
From: David Jander @ 2008-09-04 12:59 UTC (permalink / raw)
To: Josh Boyer
Cc: munroesj, John Rigby, Gunnar Von Boehn, linuxppc-dev,
Paul Mackerras, prodyut hazarika
On Thursday 04 September 2008 14:19:26 Josh Boyer wrote:
>[...]
> >$ ./memcpyspeed
> >Fully aligned:
> >100000 chunks of 5 bytes : 3.48 Mbyte/s ( throughput: 6.96 Mbytes/s)
> >50000 chunks of 16 bytes : 14.3 Mbyte/s ( throughput: 28.6 Mbytes/s)
> >10000 chunks of 100 bytes : 14.4 Mbyte/s ( throughput: 28.8 Mbytes/s)
> >5000 chunks of 256 bytes : 14.4 Mbyte/s ( throughput: 28.7 Mbytes/s)
> >1000 chunks of 1000 bytes : 14.4 Mbyte/s ( throughput: 28.7 Mbytes/s)
> >50 chunks of 16384 bytes : 14.2 Mbyte/s ( throughput: 28.4 Mbytes/s)
> >1 chunks of 1048576 bytes : 14.4 Mbyte/s ( throughput: 28.8 Mbytes/s)
> >
> >$ LD_PRELOAD=./libmemcpye300dj.so ./memcpyspeed
> >Fully aligned:
> >100000 chunks of 5 bytes : 7.44 Mbyte/s ( throughput: 14.9 Mbytes/s)
> >50000 chunks of 16 bytes : 13.1 Mbyte/s ( throughput: 26.2 Mbytes/s)
> >10000 chunks of 100 bytes : 29.4 Mbyte/s ( throughput: 58.8 Mbytes/s)
> >5000 chunks of 256 bytes : 90.2 Mbyte/s ( throughput: 180 Mbytes/s)
> >1000 chunks of 1000 bytes : 77 Mbyte/s ( throughput: 154 Mbytes/s)
> >50 chunks of 16384 bytes : 96.8 Mbyte/s ( throughput: 194 Mbytes/s)
> >1 chunks of 1048576 bytes : 97.6 Mbyte/s ( throughput: 195 Mbytes/s)
> >
> >(I have edited the output of this tool to fit into an e-mail without
> > wrapping lines for readability).
> >Please tell me how on earth there can be such a big difference???
> >Note that on a MPC5200B this is TOTALLY different, and both processors
> > have an e300 core (different versions of it though).
>
> How can there be such a big difference in throughput? Well, your algorithm
> seems better optimized than the glibc one for your testcase :).
Yes, I admit my testcase is focussing on optimizing memcpy() of uncached data,
and that interest stems from the fact that I was testing X11 performance
(using xorg kdrive and xorg-server), and wondering why this processor wasn't
able to get more FPS when moving frames on screen or scrolling, when in
theory the on-board RAM should have bandwidth enough to get a smooth image.
What I mean is that I have a hard time believing that this processor core is
so dependent on tweaks in order to get some decent memory throughput. The
MPC5200B does get higher throughput with much less effort, and the two cores
should be fairly identical (besides the MPC5200B having less cache memory and
some other details).
>[...]
> I don't think you're doing anything wrong exactly. But it seems that
> your testcase sits there and just copies data with memcpy in varying
> sizes and amounts. That's not exactly a real-world usecase is it?
No, of course it's not. I made this program to test the performance difference
of different tweaks quickly. Once I found something that worked, I started
LD_PRELOADing it to different other programs (among others the kdrive
Xserver, mplayer, and x11perf) to see its impact on performance of some
real-life apps. There the difference in performance is not so impressive of
course, but it is still there (almost always either noticeably in favor of
the tweaked version of memcpy(), or with a negligible or no difference).
I have not studied the different application's uses of memcpy(), and only done
empirical tests so far.
> I think what Paul was saying is that during the course of runtime for a
> normal program (the kernel or userspace), most memcpy operations will be of
> a small order of magnitude. They will also be scattered among code that
> does _other_ stuff than just memcpy. So he's concerned about the overhead
> of an implementation that sets up the cache to do a single 32 byte memcpy.
I understand. I also have this concern, especially for other processors, such as
the MPC5200B, where there doesn't seem to be so much to gain anyway.
> Of course, I could be totally wrong. I haven't had my coffee yet this
> morning after all.
You're doing quite good regardless of your lack of caffeine ;-)
Greetings,
--
David Jander
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient memcpy()/memmove() for G2/G3 cores...
2008-09-04 12:59 ` David Jander
@ 2008-09-04 14:31 ` Steven Munroe
2008-09-04 14:45 ` Gunnar Von Boehn
` (2 more replies)
2008-09-04 15:01 ` Gunnar Von Boehn
1 sibling, 3 replies; 27+ messages in thread
From: Steven Munroe @ 2008-09-04 14:31 UTC (permalink / raw)
To: David Jander
Cc: munroesj, Gunnar Von Boehn, John Rigby, linuxppc-dev,
Paul Mackerras, prodyut hazarika
On Thu, 2008-09-04 at 14:59 +0200, David Jander wrote:
> On Thursday 04 September 2008 14:19:26 Josh Boyer wrote:
> >[...]
> > >(I have edited the output of this tool to fit into an e-mail without
> > > wrapping lines for readability).
> > >Please tell me how on earth there can be such a big difference???
> > >Note that on a MPC5200B this is TOTALLY different, and both processors
> > > have an e300 core (different versions of it though).
> >
> > How can there be such a big difference in throughput? Well, your algorithm
> > seems better optimized than the glibc one for your testcase :).
>
> Yes, I admit my testcase is focussing on optimizing memcpy() of uncached data,
> and that interest stems from the fact that I was testing X11 performance
> (using xorg kdrive and xorg-server), and wondering why this processor wasn't
> able to get more FPS when moving frames on screen or scrolling, when in
> theory the on-board RAM should have bandwidth enough to get a smooth image.
> What I mean is that I have a hard time believing that this processor core is
> so dependent of tweaks in order to get some decent memory throughput. The
> MPC5200B does get higher througput with much less effort, and the two cores
> should be fairly identical (besides the MPC5200B having less cache memory and
> some other details).
>
I have personally optimized memcpy for power4/5/6 and they are all
different. There are dozens of different PPC implementations from
different manufacturers and design, every one is different! With painful
negotiation I was able to get the --with-cpu= framework added to glibc
but not all distros use it. You can thank me later ...
MPC5200B? never heard of it, don't care. I am busy with power7.
So don't assume we are stupid because we have not dropped everything to
optimize memcpy for YOUR processor and YOUR specific case.
You care, you are a programmer? Write code! If you care about the
community then fit your optimization into the framework provided for CPU
specific optimization and submit it so others can benefit.
> >[...]
> > I don't think you're doing anything wrong exactly. But it seems that
> > your testcase sits there and just copies data with memcpy in varying
> > sizes and amounts. That's not exactly a real-world usecase is it?
>
> No, of course it's not. I made this program to test the performance difference
> of different tweaks quickly. Once I found something that worked, I started
> LD_PRELOADing it to different other programs (among others the kdrive
> Xserver, mplayer, and x11perf) to see its impact on performance of some
> real-life apps. There the difference in performance is not so impressive of
> course, but it is still there (almost always either noticeably in favor of
> the tweaked version of memcpy(), or with a negligible or no difference).
>
The trick is that the code built into glibc has to be optimal for the
average case (4-256, average 12 bytes). Actually most memcpy
implementations are a series of special cases for length and alignment.
You can always do better if you know exactly what processor you are on
and what specific sizes and alignment your application uses.
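A minimal sketch of what that "series of special cases" structure can look
like in C follows (illustrative only, not the actual glibc code; the cutoffs
and names are invented for the example):

/* Illustrative dispatch skeleton only -- not the glibc implementation.
 * The 16-byte cutoff is an arbitrary value chosen for the sketch. */
void *memcpy_sketch(void *dst, const void *src, unsigned long len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (len < 16) {                      /* tiny copies: plain byte loop */
        while (len--)
            *d++ = *s++;
        return dst;
    }
    while ((unsigned long)d & 3) {       /* align the destination to a word */
        *d++ = *s++;
        len--;
    }
    if (((unsigned long)s & 3) == 0) {   /* both word aligned: copy words */
        unsigned long *dw = (unsigned long *)d;
        const unsigned long *sw = (const unsigned long *)s;
        while (len >= 4) {
            *dw++ = *sw++;
            len -= 4;
        }
        d = (unsigned char *)dw;
        s = (const unsigned char *)sw;
    }
    while (len--)                        /* unaligned source, plus the tail */
        *d++ = *s++;
    return dst;
}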
> I have not studied the different application's uses of memcpy(), and only done
> empirical tests so far.
>
> > I think what Paul was saying is that during the course of runtime for a
> > normal program (the kernel or userspace), most memcpy operations will be of
> > a small order of magnitude. They will also be scattered among code that
> > does _other_ stuff than just memcpy. So he's concerned about the overhead
> > of an implementation that sets up the cache to do a single 32 byte memcpy.
>
> I understand. I also have this concern, especially for other processors, as
> the MPC5200B, where there doesn't seem to be so much to gain anyway.
>
> > Of course, I could be totally wrong. I haven't had my coffee yet this
> > morning after all.
>
> You're doing quite good regardless of your lack of caffeine ;-)
>
> Greetings,
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient memcpy()/memmove() for G2/G3 cores...
2008-09-04 14:31 ` Steven Munroe
@ 2008-09-04 14:45 ` Gunnar Von Boehn
2008-09-04 15:14 ` Gunnar Von Boehn
2008-09-04 16:25 ` David Jander
2 siblings, 0 replies; 27+ messages in thread
From: Gunnar Von Boehn @ 2008-09-04 14:45 UTC (permalink / raw)
To: munroesj
Cc: David Jander, John Rigby, linuxppc-dev, Paul Mackerras,
prodyut hazarika
Steve,
I think we should be grateful for people being interested in improving
performance for PPC,
and we should not bash them.
The proposal to optimize the memcopy for the 5200 is good.
Steve, you said that you've never heard about the 5200...
Maybe I can refresh your memory:
I did send you an optimized 32bit memcopy version for the 5200 about
half a year ago,
with the kind request for inclusion.
As you might recall, the optimized 5200 memcopy version that I had sent
you improved the performance by 50%.
Kind regards
Gunnar
On Thu, Sep 4, 2008 at 4:31 PM, Steven Munroe
<munroesj@linux.vnet.ibm.com> wrote:
> On Thu, 2008-09-04 at 14:59 +0200, David Jander wrote:
>> On Thursday 04 September 2008 14:19:26 Josh Boyer wrote:
>> >[...]
>> > >(I have edited the output of this tool to fit into an e-mail without
>> > > wrapping lines for readability).
>> > >Please tell me how on earth there can be such a big difference???
>> > >Note that on a MPC5200B this is TOTALLY different, and both processors
>> > > have an e300 core (different versions of it though).
>> >
>> > How can there be such a big difference in throughput? Well, your algorithm
>> > seems better optimized than the glibc one for your testcase :).
>>
>> Yes, I admit my testcase is focussing on optimizing memcpy() of uncached data,
>> and that interest stems from the fact that I was testing X11 performance
>> (using xorg kdrive and xorg-server), and wondering why this processor wasn't
>> able to get more FPS when moving frames on screen or scrolling, when in
>> theory the on-board RAM should have bandwidth enough to get a smooth image.
>> What I mean is that I have a hard time believing that this processor core is
>> so dependent of tweaks in order to get some decent memory throughput. The
>> MPC5200B does get higher througput with much less effort, and the two cores
>> should be fairly identical (besides the MPC5200B having less cache memory and
>> some other details).
>>
>
> I have personally optimized memcpy for power4/5/6 and they are all
> different. There are dozens of different PPC implementations from
> different manufacturers and design, every one is different! With painful
> negotiation I was able to get the --with-cpu= framework added to glibc
> but not all distro use it. You can thank me later ...
>
> MPC5200B? never heard of it, don't care. I am busy with power7.
>
> So don't assume we are stupid because we have not dropped everything to
> optimize memcpy for YOUR processor and YOUR specific case.
>
> You care, your are a programmer? write code! If you care about the
> community then fit your optimization into the framework provided for CPU
> specific optimization and submit it so others can benefit.
>
>> >[...]
>> > I don't think you're doing anything wrong exactly. But it seems that
>> > your testcase sits there and just copies data with memcpy in varying
>> > sizes and amounts. That's not exactly a real-world usecase is it?
>>
>> No, of course it's not. I made this program to test the performance difference
>> of different tweaks quickly. Once I found something that worked, I started
>> LD_PRELOADing it to different other programs (among others the kdrive
>> Xserver, mplayer, and x11perf) to see its impact on performance of some
>> real-life apps. There the difference in performance is not so impressive of
>> course, but it is still there (almost always either noticeably in favor of
>> the tweaked version of memcpy(), or with a negligible or no difference).
>>
> The trick is that the code built into glibc has to be optimal for the
> average case (4-256, average 12 bytes). Actually most memcpy
> implementations are a series of special cases for length and alignment.
>
> You can always do better if you know exactly what processor you are on
> and what specific sizes and alignment your application uses.
>
>> I have not studied the different application's uses of memcpy(), and only done
>> empirical tests so far.
>>
>> > I think what Paul was saying is that during the course of runtime for a
>> > normal program (the kernel or userspace), most memcpy operations will be of
>> > a small order of magnitude. They will also be scattered among code that
>> > does _other_ stuff than just memcpy. So he's concerned about the overhead
>> > of an implementation that sets up the cache to do a single 32 byte memcpy.
>>
>> I understand. I also have this concern, especially for other processors, as
>> the MPC5200B, where there doesn't seem to be so much to gain anyway.
>>
>> > Of course, I could be totally wrong. I haven't had my coffee yet this
>> > morning after all.
>>
>> You're doing quite good regardless of your lack of caffeine ;-)
>>
>> Greetings,
>>
>
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient memcpy()/memmove() for G2/G3 cores...
2008-09-04 12:59 ` David Jander
2008-09-04 14:31 ` Steven Munroe
@ 2008-09-04 15:01 ` Gunnar Von Boehn
2008-09-04 16:32 ` David Jander
1 sibling, 1 reply; 27+ messages in thread
From: Gunnar Von Boehn @ 2008-09-04 15:01 UTC (permalink / raw)
To: David Jander
Cc: munroesj, John Rigby, linuxppc-dev, Paul Mackerras,
prodyut hazarika
Hi David,
Regarding your testcase.
I think we all agree with you that improving the performance for PPC
is a noble quest
and we should all try to improve the performance where possible.
Regarding the 5200B and 5221 CPUs.
As we all know the 5200B is a G2 PowerPC from Freescale.
Two factors determine the memory performance of this PPC:
A) This CPU has ZERO 2nd level cache
B) This CPU can remember exactly one prefetched memory line.
This means the normal memcopy routines that prefetch several cache
lines ahead DO NOT WORK!
To get good/best performance you need to prefetch EXACTLY ONE cache line ahead.
Altering the Linux kernel or glibc memcopy routines for the G2/PPC
core to work like this is actually very simple, and doing so will
increase performance by 100%.
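As a rough illustration, the inner loop of such a routine could look like the
C sketch below (this is not Gunnar's posted code; it assumes a 32-bit target
with 32-byte cache lines, cacheable and 32-byte-aligned buffers, and a length
that is a multiple of 32 -- dcbz will trap on non-cacheable memory):

/* Sketch only: copy one cache line per iteration, prefetching exactly
 * one source line ahead and establishing the destination line with dcbz
 * so it is never read from RAM. */
void copy_lines_one_ahead(unsigned long *dst, const unsigned long *src,
                          unsigned long len)
{
    while (len >= 32) {
        __asm__ volatile ("dcbt 0,%0" : : "r" (src + 8));           /* next src line */
        __asm__ volatile ("dcbz 0,%0" : : "r" (dst) : "memory");    /* claim dst line */
        dst[0] = src[0]; dst[1] = src[1]; dst[2] = src[2]; dst[3] = src[3];
        dst[4] = src[4]; dst[5] = src[5]; dst[6] = src[6]; dst[7] = src[7];
        src += 8;
        dst += 8;
        len -= 32;
    }
}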
Regarding the 5121.
David, you did create a very special memcopy for the 5121e CPU.
Your test showed us that the normal glibc memcopy is about 10 times
slower than expected on the 5121.
I really wonder why this is the case.
I would have expected the 5121 to perform just like the 5200B.
What we saw is that switching from READ to WRITE and back is very
costly on 5121.
There seems to be a huge difference between the 5200 and its successor the 5121.
Is this performance difference caused by the CPU or by the board/memory?
Cheers
Gunnar
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient memcpy()/memmove() for G2/G3 cores...
2008-09-04 14:31 ` Steven Munroe
2008-09-04 14:45 ` Gunnar Von Boehn
@ 2008-09-04 15:14 ` Gunnar Von Boehn
2008-09-04 16:25 ` David Jander
2 siblings, 0 replies; 27+ messages in thread
From: Gunnar Von Boehn @ 2008-09-04 15:14 UTC (permalink / raw)
To: munroesj
Cc: David Jander, John Rigby, linuxppc-dev, Paul Mackerras,
prodyut hazarika
Hi Steve,
> I have personally optimized memcpy for power4/5/6 and they are all
> different. There are dozens of different PPC implementations from
> different manufacturers and design, every one is different! With painful
> negotiation I was able to get the --with-cpu= framework added to glibc
> but not all distro use it. You can thank me later
Steve, you make it sound like there are very many different PowerPC chips:
you said you did the Power4, Power5, Power6 and now Power7 routines,
and there are also the 970 and the Cell.
While this sounds like 7 different PPC chips,
aren't these actually only 2 main families?
Wouldn't it be possible to create two main routines to cover them all?
One that performs well on the Power4/5/7 family,
and one that performs well on the Power6/Cell family?
How are the Linux hackers handling this?
Maybe there is room for consolidating?
Cheers
Gunnar
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient memcpy()/memmove() for G2/G3 cores...
2008-09-04 14:31 ` Steven Munroe
2008-09-04 14:45 ` Gunnar Von Boehn
2008-09-04 15:14 ` Gunnar Von Boehn
@ 2008-09-04 16:25 ` David Jander
2 siblings, 0 replies; 27+ messages in thread
From: David Jander @ 2008-09-04 16:25 UTC (permalink / raw)
To: munroesj; +Cc: linuxppc-dev, Gunnar Von Boehn
Hi Steven,
On Thursday 04 September 2008 16:31:13 Steven Munroe wrote:
>[...]
> > Yes, I admit my testcase is focussing on optimizing memcpy() of uncached
> > data, and that interest stems from the fact that I was testing X11
> > performance (using xorg kdrive and xorg-server), and wondering why this
> > processor wasn't able to get more FPS when moving frames on screen or
> > scrolling, when in theory the on-board RAM should have bandwidth enough
> > to get a smooth image. What I mean is that I have a hard time believing
> > that this processor core is so dependent of tweaks in order to get some
> > decent memory throughput. The MPC5200B does get higher througput with
> > much less effort, and the two cores should be fairly identical (besides
> > the MPC5200B having less cache memory and some other details).
>
> I have personally optimized memcpy for power4/5/6 and they are all
> different. There are dozens of different PPC implementations from
> different manufacturers and design, every one is different! With painful
> negotiation I was able to get the --with-cpu= framework added to glibc
> but not all distro use it. You can thank me later ...
Well, thank you ;-)
> MPC5200B? never heard of it, don't care. I am busy with power7.
Ok, keep up your work with power7, it's great you care about that one ;-)
> So don't assume we are stupid because we have not dropped everything to
> optimize memcpy for YOUR processor and YOUR specific case.
Ho! I never, ever assumed that anyone (on this list) is stupid. I think you
got me totally wrong (and _that_ may be my fault). I was asking for other
users' experience. You make it appear as if I was complaining about your
optimizations for Power4/5/6/970/Cell, but in fact, if you read correctly, I
haven't even touched them... they are useless to me, since this is an e300
core. My comparisons are all against vanilla glibc _without_ any optimized
code... that is (most probably) simple loops with char copy, or at most
32-bit word copies. What I want to know is why this processor (MPC5121e, not
the MPC5200B) is so terribly inefficient at this without optimizations and if
someone has done something about it before me (I am doing it right now). I
have never stated that specifically _you_ did a bad job or something, so why
are you reacting like that??
In fact, your framework for specific optimizations in glibc will most probably
come in VERY handy, once I have sorted out the root of the problem with my
specific case.... so thanks a lot for your valuable work... yes, I mean it.
> You care, your are a programmer? write code! If you care about the
> community then fit your optimization into the framework provided for CPU
> specific optimization and submit it so others can benefit.
I _am_ writing code, and Gunnar is helping me find an explanation for the
bizarre behaviour of this particular chip. If the result is usable to
others, I _will_ fit it into your framework for optimizations.
> > >[...]
> > > I don't think you're doing anything wrong exactly. But it seems that
> > > your testcase sits there and just copies data with memcpy in varying
> > > sizes and amounts. That's not exactly a real-world usecase is it?
> >
> > No, of course it's not. I made this program to test the performance
> > difference of different tweaks quickly. Once I found something that
> > worked, I started LD_PRELOADing it to different other programs (among
> > others the kdrive Xserver, mplayer, and x11perf) to see its impact on
> > performance of some real-life apps. There the difference in performance
> > is not so impressive of course, but it is still there (almost always
> > either noticeably in favor of the tweaked version of memcpy(), or with a
> > negligible or no difference).
>
> The trick is that the code built into glibc has to be optimal for the
> average case (4-256, average 12 bytes). Actually most memcpy
> implementations are a series of special cases for length and alignment.
>
> You can always do better if you know exactly what processor you are on
> and what specific sizes and alignment your application uses.
Yes, I know that's a problem. Thanks for the information about the "average size"; I
don't know where it comes from, but I'll take your word for it.
I am trying to be as polite and friendly as I can, so if you think I am not,
please tell me where and when... I'll try to improve my social skills for the
next time ;-)
Greetings,
--
David Jander
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient memcpy()/memmove() for G2/G3 cores...
2008-09-04 15:01 ` Gunnar Von Boehn
@ 2008-09-04 16:32 ` David Jander
0 siblings, 0 replies; 27+ messages in thread
From: David Jander @ 2008-09-04 16:32 UTC (permalink / raw)
To: Gunnar Von Boehn
Cc: munroesj, John Rigby, linuxppc-dev, Paul Mackerras,
prodyut hazarika
On Thursday 04 September 2008 17:01:21 Gunnar Von Boehn wrote:
>[...]
> Regarding the 5121.
> David, you did create a very special memcopy for the 5121e CPU.
> Your test showed us that the normal glibc memcopy is about 10 times
> slower than expected on the 5121.
>
> I really wonder why this is the case.
> I would have expected the 5121 to perform just like the 5200B.
> What we saw is that switching from READ to WRITE and back is very
> costly on 5121.
>
> There seems to be a huge difference between the 5200 and its successor the
> 5121. Is this performance difference caused by the CPU or by the board
> /memory?
I have some new insight now, and I will look more closely at the workings of
the DRAM controller... there has to be something wrong somewhere, and I am
going to find it... whether it is some strange bug in my u-boot code
(initializing the DRAM controller and prio-manager, for example) or a
silicon erratum (John?)
Thanks a lot for your help so far.
--
David Jander
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Efficient memcpy()/memmove() for G2/G3 cores...
2008-09-04 2:04 ` Paul Mackerras
2008-09-04 12:05 ` David Jander
@ 2008-09-04 18:14 ` prodyut hazarika
1 sibling, 0 replies; 27+ messages in thread
From: prodyut hazarika @ 2008-09-04 18:14 UTC (permalink / raw)
To: Paul Mackerras; +Cc: linuxppc-dev, David Jander, John Rigby, munroesj
> I would be careful about adding overhead to memcpy. I found that in
> the kernel, almost all calls to memcpy are for less than 128 bytes (1
> cache line on most 64-bit machines). So, adding a lot of code to
> detect cacheability and do prefetching is just going to slow down the
> common case, which is short copies. I don't have statistics for glibc
> but I wouldn't be surprised if most copies were short there also.
>
You are right. For small copies, it is not advisable.
The way I did it was to put a small check at the beginning of memcpy: if the copy
is less than 5 cache lines, I don't do dcbt/dcbz. Thus we see a big jump
for copies of more than 5 cache lines. The overhead is only 2 assembly instructions
(compare the number of bytes, followed by a jump).
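In C, the idea reduces to something like the sketch below (a hedged reading of
the description, not Prodyut's actual code; it assumes a 32-bit target with
32-byte cache lines, the 5-line threshold from above, and cacheable,
line-aligned buffers for the dcbt/dcbz path):

/* Sketch only: skip the cache hints for short copies, use them for long,
 * line-aligned ones. LINE and the 5-line threshold are assumptions here. */
#define LINE 32
static void *memcpy_threshold_sketch(void *dst, const void *src, unsigned long n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (n >= 5 * LINE && !((unsigned long)d & (LINE - 1))
                      && !((unsigned long)s & (LINE - 1))) {
        unsigned long *dw = (unsigned long *)d;
        const unsigned long *sw = (const unsigned long *)s;
        while (n >= LINE) {
            __asm__ volatile ("dcbt 0,%0" : : "r" (sw + 8));          /* prefetch next src line */
            __asm__ volatile ("dcbz 0,%0" : : "r" (dw) : "memory");   /* claim dst line, no read */
            dw[0] = sw[0]; dw[1] = sw[1]; dw[2] = sw[2]; dw[3] = sw[3];
            dw[4] = sw[4]; dw[5] = sw[5]; dw[6] = sw[6]; dw[7] = sw[7];
            dw += 8;
            sw += 8;
            n -= LINE;
        }
        d = (unsigned char *)dw;
        s = (const unsigned char *)sw;
    }
    while (n--)     /* short copies and the tail: plain byte loop */
        *d++ = *s++;
    return dst;
}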
One question - how can we quickly determine whether both the source and destination
address ranges fall in cacheable memory? The user can mmap a region of memory as
non-cacheable, but then call memcpy with that address.
The optimized version must quickly determine that dcbt/dcbz must not
be used in this case.
I don't know what would be a good way to achieve this.
Regards,
Prodyut Hazarika
^ permalink raw reply [flat|nested] 27+ messages in thread
Thread overview: 27+ messages
2008-08-25 9:31 Efficient memcpy()/memmove() for G2/G3 cores David Jander
2008-08-25 11:00 ` Matt Sealey
2008-08-25 13:06 ` David Jander
2008-08-25 22:28 ` Benjamin Herrenschmidt
2008-08-27 21:04 ` Steven Munroe
2008-08-29 11:48 ` David Jander
2008-08-29 12:21 ` Joakim Tjernlund
2008-09-01 7:23 ` David Jander
2008-09-01 9:36 ` Joakim Tjernlund
2008-09-02 13:12 ` David Jander
2008-09-03 6:43 ` Joakim Tjernlund
2008-09-03 20:33 ` prodyut hazarika
2008-09-04 2:04 ` Paul Mackerras
2008-09-04 12:05 ` David Jander
2008-09-04 12:19 ` Josh Boyer
2008-09-04 12:59 ` David Jander
2008-09-04 14:31 ` Steven Munroe
2008-09-04 14:45 ` Gunnar Von Boehn
2008-09-04 15:14 ` Gunnar Von Boehn
2008-09-04 16:25 ` David Jander
2008-09-04 15:01 ` Gunnar Von Boehn
2008-09-04 16:32 ` David Jander
2008-09-04 18:14 ` prodyut hazarika
2008-08-29 20:34 ` Steven Munroe
2008-09-01 8:29 ` David Jander
2008-08-31 8:28 ` Benjamin Herrenschmidt
2008-09-01 6:42 ` David Jander