* Efficient memcpy()/memmove() for G2/G3 cores...
@ 2008-08-25  9:31 David Jander
  2008-08-25 11:00 ` Matt Sealey
  0 siblings, 1 reply; 27+ messages in thread
From: David Jander @ 2008-08-25  9:31 UTC (permalink / raw)
  To: linuxppc-dev


Hello,

I was wondering if there is a good replacement for the glibc memcpy() functions, 
one that doesn't have the horrendous performance on embedded PowerPC processors 
that glibc currently has.

I did some simple benchmarks with this implementation on our custom MPC5121-based 
board (Freescale e300 core, something like a PPC603e, G2, without VMX):

...
unsigned long int a,b,c,d;
unsigned long int a1,b1,c1,d1;
...
while (len >= 32)
{
    a =  plSrc[0];
    b =  plSrc[1];
    c =  plSrc[2];
    d =  plSrc[3];
    a1 = plSrc[4];
    b1 = plSrc[5];
    c1 = plSrc[6];
    d1 = plSrc[7];
    plSrc += 8;
    plDst[0] = a;
    plDst[1] = b;
    plDst[2] = c;
    plDst[3] = d;
    plDst[4] = a1;
    plDst[5] = b1;
    plDst[6] = c1;
    plDst[7] = d1;
    plDst += 8;
    len -= 32;
}
...

And the results are more than telling... by preloading this with LD_PRELOAD, 
some programs get an enormous performance boost.
For example a small test program that copies frames into video memory (just 
RAM) improved throughput from 13.2 MiB/s to 69.5 MiB/s.
I have googled for this issue, but most optimized versions of memcpy() and 
friends seem to focus on AltiVec/VMX, which this processor does not have.
Now I am certain that most of the G2/G3 users on this list _must_ have a 
better solution for this. Any suggestions?

Btw, the tests were done on Ubuntu/PowerPC 7.10; I don't know if that matters 
though...

Best regards,

-- 
David Jander


* Re: Efficient memcpy()/memmove() for G2/G3 cores...
  2008-08-25  9:31 Efficient memcpy()/memmove() for G2/G3 cores David Jander
@ 2008-08-25 11:00 ` Matt Sealey
  2008-08-25 13:06   ` David Jander
  0 siblings, 1 reply; 27+ messages in thread
From: Matt Sealey @ 2008-08-25 11:00 UTC (permalink / raw)
  To: David Jander; +Cc: linuxppc-dev

Hi David,

The focus has definitely been on VMX but that's not to say lower power
processors were forgotten :)

Gunnar von Boehn did some benchmarking with an assembly-optimized routine
for Cell, 603e and so on (basically the whole gamut from embedded up to
server-class IBM chips) and got some pretty good results:

http://www.powerdeveloper.org/forums/viewtopic.php?t=1426

It is definitely something that needs fixing. The generic routine in glibc
just copies words with no benefit of knowing the cache line size or any
cache block buffers in the chip, and certainly no use of cache control or
data streaming on higher end chips.
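
For illustration (a rough sketch, not glibc code; the /proc/self/auxv approach
and the 32-byte fallback are just assumptions of mine): a user-space copy
routine can at least learn the data cache block size from the ELF aux vector,
which the PowerPC kernel passes in as AT_DCACHEBSIZE:

#include <stdio.h>
#include <elf.h>                       /* AT_DCACHEBSIZE, AT_NULL */

/* Read the d-cache block size from the aux vector; fall back to 32 bytes. */
static unsigned long dcache_block_size(void)
{
	unsigned long type, val, size = 32;
	FILE *f = fopen("/proc/self/auxv", "rb");

	if (!f)
		return size;
	/* auxv is a list of (type, value) pairs terminated by AT_NULL */
	while (fread(&type, sizeof type, 1, f) == 1 &&
	       fread(&val, sizeof val, 1, f) == 1 && type != AT_NULL) {
		if (type == AT_DCACHEBSIZE && val)
			size = val;
	}
	fclose(f);
	return size;
}

int main(void)
{
	printf("dcache block size: %lu bytes\n", dcache_block_size());
	return 0;
}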

With knowledge of the right way to unroll the loops, how many copies to
do at once to try and get a burst, how to reduce cache usage etc., you can get
very impressive performance (as you can see, 50MB/s up to 78MB/s at the
smallest size; the basic improvement is 2x performance).

I hope that helps you a little bit. Gunnar posted code to this list not
long after. I have a copy of the "e300 optimized" routine but I thought it
best that he post it here rather than me.

There is a lot of scope, I think, for optimizing several places (glibc, the
kernel, some applications) for embedded processors, which nobody is
really taking on. But not many people want to do this kind of work...

-- 
Matt Sealey <matt@genesi-usa.com>
Genesi, Manager, Developer Relations


* Re: Efficient memcpy()/memmove() for G2/G3 cores...
  2008-08-25 11:00 ` Matt Sealey
@ 2008-08-25 13:06   ` David Jander
  2008-08-25 22:28     ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 27+ messages in thread
From: David Jander @ 2008-08-25 13:06 UTC (permalink / raw)
  To: Matt Sealey; +Cc: linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 2595 bytes --]


Hi Matt,

On Monday 25 August 2008 13:00:10 Matt Sealey wrote:
> The focus has definitely been on VMX but that's not to say lower power
> processors were forgotten :)

lower-power (pun intended) is coming on strong these days, as energy efficiency 
is getting more important every day. And the MPC5121 is a brand-new embedded 
processor that will most probably pop up in quite a lot of devices around 
you ;-)

> Gunnar von Boehn did some benchmarking with an assembly optimized routine,
> for Cell, 603e and so on (basically the whole gamut from embedded up to
> sever class IBM chips) and got some pretty good results;
>
> http://www.powerdeveloper.org/forums/viewtopic.php?t=1426
>
> It is definitely something that needs fixing. The generic routine in glibc
> just copies words with no benefit of knowing the cache line size or any
> cache block buffers in the chip, and certainly no use of cache control or
> data streaming on higher end chips.
>
> With knowledge of the right way to unroll the loops, how many copies to
> do at once to try and get a burst, reducing cache usage etc. you can get
> very impressive performance (as you can see, 50MB up to 78MB at the
> smallest size, the basic improvement is 2x performance).
>
> I hope that helps you a little bit. Gunnar posted code to this list not
> long after. I have a copy of the "e300 optimized" routine but I thought
> best he should post it here, than myself.

Ok, I think I found it in that thread. The only problem is that, AFAICS, it can 
be done much better... at least on my platform (e300 core), and I don't know why! 
Can you explain this?

I did this:

I took Gunnar's code (copy-pasted from the forum), renamed the function from 
memcpy_e300 to memcpy and put it in a file called "memcpy_e300.S". Then I 
did:

$ gcc -O2 -Wall -shared -o libmemcpye300.so memcpy_e300.S

I tried the performance with the small program in the attachment:

$ gcc -O2 -Wall -o pruvmem pruvmem.c
$ LD_PRELOAD=..../libmemcpye300.so ./pruvmem

Data rate:  45.9 MiB/s

Now I did the same thing with my own memcpy written in C (see attached file 
mymemcpy.c):

$ LD_PRELOAD=..../libmymemcpy.so ./pruvmem

Data rate:  72.9 MiB/s

Now, can someone please explain this?

As a reference, here's glibc's performance:

$ ./pruvmem

Data rate:  14.8 MiB/s

> There is a lot of scope I think for optimizing several points (glibc,
> kernel, some applications) for embedded processors which nobody is
> really taking on. But, not many people want to do this kind of work..

They should! It makes a HUGE difference. I surely will of course.

Greetings,

-- 
David Jander

[-- Attachment #2: pruvmem.c --]
[-- Type: text/x-csrc, Size: 1629 bytes --]

#include <stdio.h>
#include <sys/mman.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(void)
{
        int f;
        unsigned long int *mem,*src,*dst;
        int t;
        long int usecs;
        unsigned long int secs, count;
        double rate;
        struct timeval tv, tv0, tv1;

        printf("Opening fb0\n");
        f = open("/dev/fb0", O_RDWR);
        if(f<0) {
                perror("opening fb0");
                return 1;
        }
        printf("mmapping fb0\n");

        mem = mmap(NULL, 0x00300000, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,f,0);

        printf("mmap returned: %08x\n",(unsigned int)mem);
        perror("mmap");
        if(mem == MAP_FAILED)
                return 1;

        gettimeofday(&tv, NULL);
        for(t=0; t<0x000c0000; t++)
                mem[t] = (tv.tv_usec ^ tv.tv_sec) ^ t;
        count = 0;
        gettimeofday(&tv0, NULL);
        for(t=0; t<10; t++) {
                src = mem;
                dst = mem+0x00040000;
                memcpy(dst, src, 0x00100000);
                count += 0x00100000;
        }
        gettimeofday(&tv1, NULL);
        secs = tv1.tv_sec-tv0.tv_sec;
        usecs = tv1.tv_usec-tv0.tv_usec;
        if(usecs<0) {
                usecs += 1000000;
                secs -= 1;
        }
        printf("Time elapsed: %ld secs, %ld usecs data transferred: %ld bytes\n",secs, usecs, count);
        rate = (double)count/((double)secs + (double)usecs/1000000.0);
        printf("Data rate: %5.3g MiB/s\n", rate/(1024.0*1024.0));

        return 0;
}

[-- Attachment #3: mymemcpy.c --]
[-- Type: text/x-csrc, Size: 2289 bytes --]

#include <stdlib.h>

void *memcpy(void *dst, void const *src, size_t len)
{
    unsigned long int a, b, c, d;
    unsigned long int a1, b1, c1, d1;
    unsigned long int a2, b2, c2, d2;
    unsigned long int a3, b3, c3, d3;
    long *plDst = (long *) dst;
    long const *plSrc = (long const *) src;

    //if (!((unsigned long)src & 0xFFFFFFFC) && !((unsigned long)dst & 0xFFFFFFFC))
    //{
        /* Main loop: copy 64 bytes (16 words) per iteration using a
         * 16-register stride. */
        while (len >= 64)
        {
            a  = plSrc[0];
            b  = plSrc[1];
            c  = plSrc[2];
            d  = plSrc[3];
            a1 = plSrc[4];
            b1 = plSrc[5];
            c1 = plSrc[6];
            d1 = plSrc[7];
            a2 = plSrc[8];
            b2 = plSrc[9];
            c2 = plSrc[10];
            d2 = plSrc[11];
            a3 = plSrc[12];
            b3 = plSrc[13];
            c3 = plSrc[14];
            d3 = plSrc[15];
            plSrc += 16;
            plDst[0]  = a;
            plDst[1]  = b;
            plDst[2]  = c;
            plDst[3]  = d;
            plDst[4]  = a1;
            plDst[5]  = b1;
            plDst[6]  = c1;
            plDst[7]  = d1;
            plDst[8]  = a2;
            plDst[9]  = b2;
            plDst[10] = c2;
            plDst[11] = d2;
            plDst[12] = a3;
            plDst[13] = b3;
            plDst[14] = c3;
            plDst[15] = d3;
            plDst += 16;
            len -= 64;
        }
        /* Tail: copy 16 bytes (4 words) per iteration. */
        while (len >= 16) {
            a = plSrc[0];
            b = plSrc[1];
            c = plSrc[2];
            d = plSrc[3];
            plSrc += 4;
            plDst[0] = a;
            plDst[1] = b;
            plDst[2] = c;
            plDst[3] = d;
            plDst += 4;
            len -= 16;
        }
    //}

    /* Remaining bytes (0-15). */
    char *pcDst = (char *) plDst;
    char const *pcSrc = (char const *) plSrc;

    while (len--)
        *pcDst++ = *pcSrc++;

    return dst;
}


* Re: Efficient memcpy()/memmove() for G2/G3 cores...
  2008-08-25 13:06   ` David Jander
@ 2008-08-25 22:28     ` Benjamin Herrenschmidt
  2008-08-27 21:04       ` Steven Munroe
  0 siblings, 1 reply; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2008-08-25 22:28 UTC (permalink / raw)
  To: David Jander; +Cc: linuxppc-dev

On Mon, 2008-08-25 at 15:06 +0200, David Jander wrote:
> Hi Matt,
> 
> On Monday 25 August 2008 13:00:10 Matt Sealey wrote:
> > The focus has definitely been on VMX but that's not to say lower power
> > processors were forgotten :)
> 
> lower-power (pun intended) is coming strong these days, as energy-efficiency 
> is getteing more important every day. And the MPC5121 is a brand-new embedded 
> processor, that will pop-up in quite a lot devices around you most 
> probably ;-)

It would be useful of somebody interested in getting things things
into glibc did the necessary FSF copyright assignment stuff and worked
toward integrating them.

Ben.


* Re: Efficient memcpy()/memmove() for G2/G3 cores...
  2008-08-25 22:28     ` Benjamin Herrenschmidt
@ 2008-08-27 21:04       ` Steven Munroe
  2008-08-29 11:48         ` David Jander
  0 siblings, 1 reply; 27+ messages in thread
From: Steven Munroe @ 2008-08-27 21:04 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev, David Jander

On Tue, 2008-08-26 at 08:28 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2008-08-25 at 15:06 +0200, David Jander wrote:
> > Hi Matt,
> > 
> > On Monday 25 August 2008 13:00:10 Matt Sealey wrote:
> > > The focus has definitely been on VMX but that's not to say lower power
> > > processors were forgotten :)
> > 
> > lower-power (pun intended) is coming strong these days, as energy-efficiency 
> > is getteing more important every day. And the MPC5121 is a brand-new embedded 
> > processor, that will pop-up in quite a lot devices around you most 
> > probably ;-)
> 
> It would be useful of somebody interested in getting things things
> into glibc did the necessary FSF copyright assignment stuff and worked
> toward integrating them.
> 

Ben makes a very good point!

There is a process for contributing code to GLIBC, which starts with an
FSF copyright assignment.

There is also a framework for adding and maintaining optimizations of
this type:

http://penguinppc.org/dev/glibc/glibc-powerpc-cpu-addon.html

Since this original effort the powerpc changes have been merged into
mainline glibc (GLIBC-2.7) and no longer require a separate
(powerpc-cpu) addon. But the --with-cpu= configure option still works.
This mechanism also works with the glibc ports addon and eglibc.

So it does no good to complain here. If you have code you want to
contribute, get your FSF CR assignment and join #glibc on freenode IRC.

And we will help you.


* Re: Efficient memcpy()/memmove() for G2/G3 cores...
  2008-08-27 21:04       ` Steven Munroe
@ 2008-08-29 11:48         ` David Jander
  2008-08-29 12:21           ` Joakim Tjernlund
                             ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: David Jander @ 2008-08-29 11:48 UTC (permalink / raw)
  To: munroesj; +Cc: linuxppc-dev

On Wednesday 27 August 2008 23:04:39 Steven Munroe wrote:
> On Tue, 2008-08-26 at 08:28 +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2008-08-25 at 15:06 +0200, David Jander wrote:
> > > Hi Matt,
> > >
> > > On Monday 25 August 2008 13:00:10 Matt Sealey wrote:
> > > > The focus has definitely been on VMX but that's not to say lower
> > > > power processors were forgotten :)
> > >
> > > lower-power (pun intended) is coming strong these days, as
> > > energy-efficiency is getteing more important every day. And the MPC5121
> > > is a brand-new embedded processor, that will pop-up in quite a lot
> > > devices around you most probably ;-)
> >
> > It would be useful of somebody interested in getting things things
> > into glibc did the necessary FSF copyright assignment stuff and worked
> > toward integrating them.
>
> Ben makes a very good point!

Sounds reasonable... but I am still wondering about what you mean 
by "things"?
AFAICS there is almost nothing there (besides the memcpy() routine from Gunnar 
von Boehn, which is apparently still far from optimal). And I was asking for 
someone to correct me here ;-)

> There is also a framework for adding and maintaining optimizations of
> this type:
> 
> http://penguinppc.org/dev/glibc/glibc-powerpc-cpu-addon.html

I had already stumbled across this one, but it seems to focus on G3 or newer 
processors (power4). There is no optimal memcpy() for G2/PPC603/e300.

>[...]
> So it does no good to complain here. If you have core you want to
> contribute, Get your FSF CR assignment and join #glibc on freenode IRC.

I am not complaining. I was only wondering if it is just me or there really is 
very little that has been done (for either uClibc, glibc, or whatever for 
powerpc) to improve performance of (linux-) applications on "lower"-power 
platforms (G2 core), AFAICS there is a LOT that can be gained by simple 
tweaks.

> And we will help you.

Thanks, now that I know which is the "correct" way to contribute, I only need 
to come up with a good set of optimizations worthy of inclusion in glibc.
OTOH, maybe it is easier and simpler to start with a collection of functions 
in a shared library suited for preloading via LD_PRELOAD 
or /etc/ld.so.preload...

Maybe once this collection is more stable (in the sense that heavy tweaking has 
stopped), one could try the pilgrimage towards glibc inclusion...

The problem is: I have very little experience with powerpc assembly and only 
very limited time to dedicate to this and I am looking for others who have 

Greetings,

-- 
David Jander


* Re: Efficient memcpy()/memmove() for G2/G3 cores...
  2008-08-29 11:48         ` David Jander
@ 2008-08-29 12:21           ` Joakim Tjernlund
  2008-09-01  7:23             ` David Jander
  2008-08-29 20:34           ` Steven Munroe
  2008-08-31  8:28           ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 27+ messages in thread
From: Joakim Tjernlund @ 2008-08-29 12:21 UTC (permalink / raw)
  To: David Jander; +Cc: munroesj, linuxppc-dev

On Fri, 2008-08-29 at 13:48 +0200, David Jander wrote:
> On Wednesday 27 August 2008 23:04:39 Steven Munroe wrote:
> > On Tue, 2008-08-26 at 08:28 +1000, Benjamin Herrenschmidt wrote:
> > > On Mon, 2008-08-25 at 15:06 +0200, David Jander wrote:
> > > > Hi Matt,

[SNIP]

> I am not complaining. I was only wondering if it is just me or there really is 
> very little that has been done (for either uClibc, glibc, or whatever for 
> powerpc) to improve performance of (linux-) applications on "lower"-power 
> platforms (G2 core), AFAICS there is a LOT that can be gained by simple 
> tweaks.

[SNIP]

> 
> The problem is: I have very little experience with powerpc assembly and only 
> very limited time to dedicate to this and I am looking for others who have 

I improved the PowerPC memcpy and friends in uClibc a while ago. It does
basically the same as the kernel memcpy but without any cache
instructions. It is written in C, but in such a way that
optimal assembly is generated.
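
Roughly like this (a simplified sketch of the idea, not the actual uClibc
source; the function name is made up):

#include <stddef.h>

/* Word-at-a-time copy with a byte tail; the loops are kept simple so gcc
 * can turn them into tight lwzu/stwu sequences on powerpc. */
void *memcpy_words(void *dst, const void *src, size_t len)
{
	unsigned long *d = dst;
	const unsigned long *s = src;

	/* only take the word loop when both pointers are word aligned */
	if (!(((unsigned long)dst | (unsigned long)src) & (sizeof(long) - 1))) {
		size_t n = len / sizeof(long);
		len -= n * sizeof(long);
		while (n--)
			*d++ = *s++;
	}

	{
		char *dc = (char *)d;
		const char *sc = (const char *)s;
		while (len--)
			*dc++ = *sc++;
	}
	return dst;
}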

 Jocke


* Re: Efficient memcpy()/memmove() for G2/G3 cores...
  2008-08-29 11:48         ` David Jander
  2008-08-29 12:21           ` Joakim Tjernlund
@ 2008-08-29 20:34           ` Steven Munroe
  2008-09-01  8:29             ` David Jander
  2008-08-31  8:28           ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 27+ messages in thread
From: Steven Munroe @ 2008-08-29 20:34 UTC (permalink / raw)
  To: David Jander; +Cc: linuxppc-dev

On Fri, 2008-08-29 at 13:48 +0200, David Jander wrote:
> On Wednesday 27 August 2008 23:04:39 Steven Munroe wrote:
> > On Tue, 2008-08-26 at 08:28 +1000, Benjamin Herrenschmidt wrote:
> > > On Mon, 2008-08-25 at 15:06 +0200, David Jander wrote:
> > > > Hi Matt,
> > > >
> > > > On Monday 25 August 2008 13:00:10 Matt Sealey wrote:
> > > > > The focus has definitely been on VMX but that's not to say lower
> > > > > power processors were forgotten :)
> > > >
[SNIP]
> > >
> > > It would be useful of somebody interested in getting things things
> > > into glibc did the necessary FSF copyright assignment stuff and worked
> > > toward integrating them.
> >
> > Ben makes a very good point!
> 
> Sounds reasonable... but I am still wondering about what you mean 
> with "things"?
> AFAICS there is almost nothing there (besides the memcpy() routine from Gunnar 
> von Boehn, which is apparently still far from optimal). And I was asking for 
> someone to correct me here ;-)
> 
> > There is also a framework for adding and maintaining optimizations of
> > this type:
> > 
> > http://penguinppc.org/dev/glibc/glibc-powerpc-cpu-addon.html
> 
> I had already stumbled across this one, but it seems to focus on G3 or newer 
> processors (power4). There is no optimal memcpy() for G2/PPC603/e300.
> 
Well, folks volunteer to work on code for the hardware they have, use,
and care about. I don't have any of that hardware...

This framework can be used to add optimizations for any valid gcc
-mcpu=<cpu-type> target.

> >[...]
> > So it does no good to complain here. If you have core you want to
> > contribute, Get your FSF CR assignment and join #glibc on freenode IRC.
> 
> I am not complaining. I was only wondering if it is just me or there really is 
> very little that has been done (for either uClibc, glibc, or whatever for 
> powerpc) to improve performance of (linux-) applications on "lower"-power 
> platforms (G2 core), AFAICS there is a LOT that can be gained by simple 
> tweaks.
> 
This is a self-help group (free as in freedom). We help each other, and
you can help yourself. There is no free lunch.

> > And we will help you.
[SNIP]
> 
> The problem is: I have very little experience with powerpc assembly and only 
> very limited time to dedicate to this and I am looking for others who have 
> 
Well this will be a good learning experience for you. We will try to
answer questions.


* Re: Efficient memcpy()/memmove() for G2/G3 cores...
  2008-08-29 11:48         ` David Jander
  2008-08-29 12:21           ` Joakim Tjernlund
  2008-08-29 20:34           ` Steven Munroe
@ 2008-08-31  8:28           ` Benjamin Herrenschmidt
  2008-09-01  6:42             ` David Jander
  2 siblings, 1 reply; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2008-08-31  8:28 UTC (permalink / raw)
  To: David Jander; +Cc: munroesj, linuxppc-dev

> > > It would be useful of somebody interested in getting things things
> > > into glibc did the necessary FSF copyright assignment stuff and worked
> > > toward integrating them.
> >
> > Ben makes a very good point!
> 
> Sounds reasonable... but I am still wondering about what you mean 
> with "things"?

Typo. I meant "these things", that is, variants of various libc
functions optimized for a given processor type.

> AFAICS there is almost nothing there (besides the memcpy() routine from Gunnar 
> von Boehn, which is apparently still far from optimal). And I was asking for 
> someone to correct me here ;-)

No idea, as we said, it's mostly up to users of the processors (or to a
certain extent, manufacturers, hint hint hint) to do that work.

> > There is also a framework for adding and maintaining optimizations of
> > this type:
> > 
> > http://penguinppc.org/dev/glibc/glibc-powerpc-cpu-addon.html
> 
> I had already stumbled across this one, but it seems to focus on G3 or newer 
> processors (power4). There is no optimal memcpy() for G2/PPC603/e300.

It focuses on what the people doing it have access to, are paid to work
on, or other material constraints. It's up to others from the community
to fill the gaps.

> >[...]
> > So it does no good to complain here. If you have core you want to
> > contribute, Get your FSF CR assignment and join #glibc on freenode IRC.
> 
> I am not complaining. I was only wondering if it is just me or there really is 
> very little that has been done (for either uClibc, glibc, or whatever for 
> powerpc) to improve performance of (linux-) applications on "lower"-power 
> platforms (G2 core), AFAICS there is a LOT that can be gained by simple 
> tweaks.

Well, possibly, then you are welcome to work on those tweaks and if they
indeed improve things, submit patches to glibc :-) I'm sure Steve and
Ryan will be happy to help with the submission process.

> > And we will help you.
> 
> Thanks, now that I know which is the "correct" way to contribute, I only need 
> to come up with a good set of optimization, worthy of inclusion in glibc.

You don't have to do it all at once. A simple tweak of one function
such as memcpy, if it measurably improves performance without
notable regressions, could be a first step, and then tweak after tweak...

It's a common mistake to try to do too much "out of tree" and then
struggle and give up when it's time to merge that stuff because there
are too many areas that won't necessarily be acceptable "as is".

One little bit at a time is generally a better approach.

> OTOH, maybe it is easier and simpler to start with a collection of functions 
> in a shared-library, that may be suited for preloading via LD_PRELOAD 
> or /etc/ld_preload...
> 
> Maybe once this collection is more stable (in terms of that heavy tweaking has 
> stopped) one could try the pilgrimage towards glibc inclusion....

I believe that's the wrong approach as it leads to never-merged out-of-tree
code.

> The problem is: I have very little experience with powerpc assembly and only 
> very limited time to dedicate to this and I am looking for others who have 

Cheers,
Ben.


* Re: Efficient memcpy()/memmove() for G2/G3 cores...
  2008-08-31  8:28           ` Benjamin Herrenschmidt
@ 2008-09-01  6:42             ` David Jander
  0 siblings, 0 replies; 27+ messages in thread
From: David Jander @ 2008-09-01  6:42 UTC (permalink / raw)
  To: benh; +Cc: munroesj, linuxppc-dev

On Sunday 31 August 2008 10:28:43 Benjamin Herrenschmidt wrote:
> > > > It would be useful of somebody interested in getting things things
>
> > > > into glibc did the necessary FSF copyright assignment stuff and
> > > > worked toward integrating them.
> > >
> > > Ben makes a very good point!
> >
> > Sounds reasonable... but I am still wondering about what you mean
> > with "things"?
>
> Typo. I meant "these things", that is, variants of various libc
> functions optimized for a given processor type.

Ok, we'd have to _make_ those "things" first then ;-)

> > AFAICS there is almost nothing there (besides the memcpy() routine from
> > Gunnar von Boehn, which is apparently still far from optimal). And I was
> > asking for someone to correct me here ;-)
>
> No idea, as we said, it's mostly up to users of the processors (or to a
> certain extent, manufacturers, hint hint hint) to do that work.

Ok, I get the point.

> > > There is also a framework for adding and maintaining optimizations of
> > > this type:
> > >
> > > http://penguinppc.org/dev/glibc/glibc-powerpc-cpu-addon.html
> >
> > I had already stumbled across this one, but it seems to focus on G3 or
> > newer processors (power4). There is no optimal memcpy() for
> > G2/PPC603/e300.
>
> It focuses on what the people doing it have access to, are paid to work
> on, or other material constraints. It's up to others from the community
> to fill the gaps.

That's all I need to know ;-)

> > >[...]
> > > So it does no good to complain here. If you have core you want to
> > > contribute, Get your FSF CR assignment and join #glibc on freenode IRC.
> >
> > I am not complaining. I was only wondering if it is just me or there
> > really is very little that has been done (for either uClibc, glibc, or
> > whatever for powerpc) to improve performance of (linux-) applications on
> > "lower"-power platforms (G2 core), AFAICS there is a LOT that can be
> > gained by simple tweaks.
>
> Well, possibly, then you are welcome to work on those tweaks and if they
> indeed improve things, submit patches to glibc :-) I'm sure Steve and
> Ryan will be happy to help with the submission process.

Sounds encouraging. I'll try my best (in the limited amount of time I have).

>[...]
> You don't have to do it all at once. A  simple tweak of one function
> such as memcpy, if it's measurably improving performances without
> notable regressions could be a first step, and then tweak after tweak...
>
> It's a common mistake to try to do too much "out of tree" and then
> struggle and give up when it's time to merge that stuff because there
> are too many areas that won't necessarily be acceptable "as is".
>
> One little bit at a time is generally a better approach.

Ok, I take your advice.

> > OTOH, maybe it is easier and simpler to start with a collection of
> > functions in a shared-library, that may be suited for preloading via
> > LD_PRELOAD or /etc/ld_preload...
> >
> > Maybe once this collection is more stable (in terms of that heavy
> > tweaking has stopped) one could try the pilgrimage towards glibc
> > inclusion....
>
> I believe that's the wrong approach as it leads to never-merged out-of
> tree code.

Hmm... you mean, it'll be easier to keep patching (improving) things once they 
are already in glibc? Interesting.

Thanks a lot for your comments.

Best regards,

-- 
David Jander


* Re: Efficient memcpy()/memmove() for G2/G3 cores...
  2008-08-29 12:21           ` Joakim Tjernlund
@ 2008-09-01  7:23             ` David Jander
  2008-09-01  9:36               ` Joakim Tjernlund
  0 siblings, 1 reply; 27+ messages in thread
From: David Jander @ 2008-09-01  7:23 UTC (permalink / raw)
  To: joakim.tjernlund; +Cc: munroesj, linuxppc-dev

On Friday 29 August 2008 14:20:33 Joakim Tjernlund wrote:
>[...]
> > The problem is: I have very little experience with powerpc assembly and
> > only very limited time to dedicate to this and I am looking for others
> > who have
>
> I improved the PowerPC memcpy and friends in uClibc a while ago. It does
> basically the same a the kernel memcpy but without any cache
> instructions. It is written in C, but in such a way that
> optimal assembly is generated.

Hmm, isn't that going to break on a different version of gcc?
I just copied the latest version of trunk/uClibc/libc/string/powerpc/memcpy.c 
from subversion as uclibc-memcpy.c, removed the last line and did this:

$ gcc -shared -O2 -Wall -o libucmemcpy.so uclibc-memcpy.c

(should I use other compiler options?)

Then I started my test program with LD_PRELOAD=...

My test program only copies big chunks of aligned memory, so it will only test 
for maximum throughput (such as copying video frames). I will make a better 
one, to measure throughput on different-sized blocks of aligned and unaligned 
memory, but first I want to find out why I can't seem to get even close to 
the expected RAM bandwidth (bursts occur at 1.6 Gbyte/s, sustained transfers 
might be able to reach 400 Mbyte/s in theory; taking into account the video 
controller eating almost half of it, I'd like to get somewhere close to 200 Mbyte/s).

The result is quite a bit better than that of glibc-2.7 (13.2 Mbyte/s --> 22 
Mbyte/s), but still far from the 71.5 Mbyte/s achieved when using bigger 
strides of 16 register loads/stores at a time.
Note that this is copy performance; one-way throughput should be double these 
figures.

I'll try to learn how the cache-manipulating instructions work, to see if I can 
gain some more bandwidth using them.

Regards,

-- 
David Jander


* Re: Efficient memcpy()/memmove() for G2/G3 cores...
  2008-08-29 20:34           ` Steven Munroe
@ 2008-09-01  8:29             ` David Jander
  0 siblings, 0 replies; 27+ messages in thread
From: David Jander @ 2008-09-01  8:29 UTC (permalink / raw)
  To: munroesj; +Cc: linuxppc-dev

On Friday 29 August 2008 22:34:21 Steven Munroe wrote:
> > I am not complaining. I was only wondering if it is just me or there
> > really is very little that has been done (for either uClibc, glibc, or
> > whatever for powerpc) to improve performance of (linux-) applications on
> > "lower"-power platforms (G2 core), AFAICS there is a LOT that can be
> > gained by simple tweaks.
>
> This is a self help group (free as in freedom) We help each other. And
> you can help yourself. There is no free lunch.

I never expected to be served a free dish of any kind on a mailing list ;-)
I was just asking around to avoid reinventing wheels, since I intend to dig 
into this problem, that's all. My intention was never to pick up work from 
others and then run.

> > The problem is: I have very little experience with powerpc assembly and
> > only very limited time to dedicate to this and I am looking for others
> > who have
>
> Well this will be a good learning experience for you. We will try to
> answer questions.

Excellent. I love learning new stuff ;-)

Thanks a lot for the guidance so far...

Regards,

-- 
David Jander


* Re: Efficient memcpy()/memmove() for G2/G3 cores...
  2008-09-01  7:23             ` David Jander
@ 2008-09-01  9:36               ` Joakim Tjernlund
  2008-09-02 13:12                 ` David Jander
  0 siblings, 1 reply; 27+ messages in thread
From: Joakim Tjernlund @ 2008-09-01  9:36 UTC (permalink / raw)
  To: David Jander; +Cc: munroesj, linuxppc-dev

On Mon, 2008-09-01 at 09:23 +0200, David Jander wrote:
> On Friday 29 August 2008 14:20:33 Joakim Tjernlund wrote:
> >[...]
> > > The problem is: I have very little experience with powerpc assembly and
> > > only very limited time to dedicate to this and I am looking for others
> > > who have
> >
> > I improved the PowerPC memcpy and friends in uClibc a while ago. It does
> > basically the same a the kernel memcpy but without any cache
> > instructions. It is written in C, but in such a way that
> > optimal assembly is generated.
> 
> Hmm, isn't that going to break on a different version of gcc?

Not break, but gcc might generate non-optimal code. However, the code
is laid out to make it easy for gcc to do the right thing.

> I just copied the latest version of trunk/uClibc/libc/string/powerpc/memcpy.c 
> from subversion as uclibc-memcpy.c, removed the last line and did this:
> 
> $ gcc -shared -O2 -Wall -o libucmemcpy.so uclibc-memcpy.c
> 
> (should I use other compiler options?)

These are fine.

> 
> Then I started my test program with LD_PRELOAD=...
> 
> My test program only copies big chunks of aligned memory, so it will only test 
> for maximum throughput (such as copying video frames). I will make a better 
> one, to measure throughput on different sized blocks of aligned and unaligned 
> memory, but first I want to find out why I can't seem to get even close to 
> the expected RAM bandwidth (bursts occur at 1.6 Gbyte/s, sustained transfers 
> might be able to reach 400 Mbyte/s in theory, taking into account the video 
> controller eating almost half of it, I'd like to get somewhere close to 200).
> 
> The result is quite a bit better than that of glibc-2.7 (13.2 Mbyte/s --> 22 
> Mbyte/s), but still far from the 71.5 Mbyte/s achieved when using bigger 
> strides of 16 registers load/store at a time.
> Note, that this is copy performance, one-way througput should be double these 
> figures.

Yeah, the code is trying to do a reasonable job without knowing what
micro-arch it is running on. These could probably go to glibc
as new general-purpose memxxx() routines. You will probably see
a big increase once dcbz is added to the copy/memset functions.
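
For memset(p, 0, n), for example, dcbz alone can clear whole lines. A sketch
(not the uClibc code) assuming a 32-byte cache line and a cacheable,
line-aligned buffer with n a multiple of 32; dcbz must not be used on
non-cacheable memory:

static void zero_lines(void *p, unsigned long n)
{
	char *q = p;

	while (n >= 32) {
		/* allocate and zero one destination cache line without
		 * reading it from RAM first */
		__asm__ __volatile__ ("dcbz 0,%0" : : "r" (q) : "memory");
		q += 32;
		n -= 32;
	}
}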

Fire away :)

 Jocke


* Re: Efficient memcpy()/memmove() for G2/G3 cores...
  2008-09-01  9:36               ` Joakim Tjernlund
@ 2008-09-02 13:12                 ` David Jander
  2008-09-03  6:43                   ` Joakim Tjernlund
  2008-09-03 20:33                   ` prodyut hazarika
  0 siblings, 2 replies; 27+ messages in thread
From: David Jander @ 2008-09-02 13:12 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: munroesj, John Rigby

On Monday 01 September 2008 11:36:15 Joakim Tjernlund wrote:
>[...]
> > Then I started my test program with LD_PRELOAD=...
> >
> > My test program only copies big chunks of aligned memory, so it will only
> > test for maximum throughput (such as copying video frames). I will make a
> > better one, to measure throughput on different sized blocks of aligned
> > and unaligned memory, but first I want to find out why I can't seem to
> > get even close to the expected RAM bandwidth (bursts occur at 1.6
> > Gbyte/s, sustained transfers might be able to reach 400 Mbyte/s in
> > theory, taking into account the video controller eating almost half of
> > it, I'd like to get somewhere close to 200).
> >
> > The result is quite a bit better than that of glibc-2.7 (13.2 Mbyte/s -->
> > 22 Mbyte/s), but still far from the 71.5 Mbyte/s achieved when using
> > bigger strides of 16 registers load/store at a time.
> > Note, that this is copy performance, one-way througput should be double
> > these figures.
>
> Yeah, the code is trying to do a reasonable job without knowing what
> micro arch it is running on. These could probably go to glibc
> as new general purpose memxxx() routines. You will probably see
> a big increase once dcbz is added to the copy/memset functions.
>
> Fire away :)

Ok here I go:

I have made some astonishing discoveries, and I'd like to post the source 
code I used somewhere in the meantime. Any suggestions? To this list?

There seem to be some substantial differences between the e300 core used in 
the MPC5200B and in the MPC5121e (besides the MPC5121 having double the 
amount of cache). Memcpy()-performance-wise, these differences amount to the 
following. The tests were done with vanilla glibc (versions 2.6.1 and 2.7, 
without any powerpc-specific memcpy() optimizations), Gunnar von Boehn's 
memcpy_e300 and my tweaked version, memcpy_e300_dj, which basically uses 
16-register strides instead of 4-register strides in Gunnar's example.

memcpy() peak-performance (RAM memory throughput) on:

MPC5200B, glibc-2.6, no optimizations: 136 Mbyte/s
MPC5121e, glibc-2.7, no optimizations:  30 Mbyte/s

MPC5200B, memcpy_e300: 225 Mbyte/s
MPC5121e, memcpy_e300: 130 Mbyte/s

MPC5200B, memcpy_e300_dj: 200 Mbyte/s
MPC5121e, memcpy_e300_dj: 202 Mbyte/s

For the MPC5121e, 16-register strides seem to be most optimal, whereas for the 
MPC5200B, 4-register strides give best performance. Also, plain C memcpy() 
performance on the MPC5121e is terribly poor! Does anyone know why? I don't quite 
seem to understand those results.

Some information on the test hardware:

MPC5200B-based board has 64 Mbyte DDR-SDRAM, 32-bit wide (two x16 chips), 
running ubuntu 7.10 with kernel 2.6.19.2.

MPC5121e-based board has 256 Mbyte DDR2-SDRAM, 32-bit wide (two x16 chips), 
running ubuntu 8.04.1 with kernel 2.6.24.5 from Freescale LTIB with the DIU 
turned OFF. When the DIU is turned on, maximum throughput drops from 202 to 
196 Mbyte/s.

The memcpy_e300 variants basically use 4- or 16-register load/store strides, cache 
alignment and dcbz/dcbt cache-manipulation instructions to tweak performance.

I have not tried interleaving integer and fpu instructions.

Does anybody have any suggestion about where to start searching for an 
explanation of these results? I have the impression that there is something 
wrong with my setup, or with the e300c4 core, or both, but what????

Greetings,

-- 
David Jander


* Re: Efficient memcpy()/memmove() for G2/G3 cores...
  2008-09-02 13:12                 ` David Jander
@ 2008-09-03  6:43                   ` Joakim Tjernlund
  2008-09-03 20:33                   ` prodyut hazarika
  1 sibling, 0 replies; 27+ messages in thread
From: Joakim Tjernlund @ 2008-09-03  6:43 UTC (permalink / raw)
  To: David Jander; +Cc: linuxppc-dev, John Rigby, munroesj

On Tue, 2008-09-02 at 15:12 +0200, David Jander wrote:
> I have made some astonishing discoveries, and I'd like to post the
> used source-code somewhere in the meantime, any suggestions? To this list?

Yes, mail it.
I've got an mpc8323/8321 board I want to try it on.

> For the MPC5121e, 16-register strides seem to be most optimal, whereas for the 
> MPC5200B, 4-register strides give best performance. Also, plain C memcpy() 

Plain C, is that the uClibc one?
Did you try to tweak it?

 Jocke


* Re: Efficient memcpy()/memmove() for G2/G3 cores...
  2008-09-02 13:12                 ` David Jander
  2008-09-03  6:43                   ` Joakim Tjernlund
@ 2008-09-03 20:33                   ` prodyut hazarika
  2008-09-04  2:04                     ` Paul Mackerras
  1 sibling, 1 reply; 27+ messages in thread
From: prodyut hazarika @ 2008-09-03 20:33 UTC (permalink / raw)
  To: David Jander; +Cc: linuxppc-dev, John Rigby, munroesj

Hi all,

>  These could probably go to glibc
> as new general purpose memxxx() routines. You will probably see
> a big increase once dcbz is added to the copy/memset functions.

glibc memxxx for powerpc are horribly inefficient. For optimal performance,
we should use the dcbt instruction to establish the source address in cache, and
dcbz to establish the destination address in cache. We should do the
dcbt and dcbz such that the touches happen a line ahead of the actual copy.

The problem which I see is that the dcbt and dcbz instructions don't work on
non-cacheable memory (obviously!). But the memxxx functions are used for both
cached and non-cached memory. Thus this optimized memcpy should be smart enough
to figure out that both the source and destination addresses fall in
cacheable space, and only then
use the optimized dcbt/dcbz instructions.

You can expect to see a significant jump in the memxxx functions after
using dcbt/dcbz.
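
The inner loop would look roughly like this (a sketch of mine, with 32-bit
words, 32-byte cache lines and line-aligned, cacheable src/dst assumed, and
len a multiple of 32):

static void copy_lines(unsigned int *dst, const unsigned int *src,
			unsigned long len)
{
	while (len >= 32) {
		/* touch the NEXT source line while this one is being copied */
		__asm__ __volatile__ ("dcbt 0,%0" : : "r" (src + 8));
		/* establish THIS destination line without reading it from RAM */
		__asm__ __volatile__ ("dcbz 0,%0" : : "r" (dst) : "memory");
		dst[0] = src[0]; dst[1] = src[1];
		dst[2] = src[2]; dst[3] = src[3];
		dst[4] = src[4]; dst[5] = src[5];
		dst[6] = src[6]; dst[7] = src[7];
		dst += 8;
		src += 8;
		len -= 32;
	}
}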

Thanks,
Prodyut Hazarika


* Re: Efficient memcpy()/memmove() for G2/G3 cores...
  2008-09-03 20:33                   ` prodyut hazarika
@ 2008-09-04  2:04                     ` Paul Mackerras
  2008-09-04 12:05                       ` David Jander
  2008-09-04 18:14                       ` prodyut hazarika
  0 siblings, 2 replies; 27+ messages in thread
From: Paul Mackerras @ 2008-09-04  2:04 UTC (permalink / raw)
  To: prodyut hazarika; +Cc: linuxppc-dev, David Jander, John Rigby, munroesj

prodyut hazarika writes:

> glibc memxxx for powerpc are horribly inefficient. For optimal performance,
> we should should dcbt instruction to establish the source address in cache, and
> dcbz to establish the destination address in cache. We should do
> dcbt and dcbz such that the touches happen a line ahead of the actual copy.
> 
> The problem which is see is that dcbt and dcbz instructions don't work on
> non-cacheable memory (obviously!). But memxxx function are used for both
> cached and non-cached memory. Thus this optimized memcpy should be smart enough
> to figure out that both source and destination address fall in
> cacheable space, and only then
> used the optimized dcbt/dcbz instructions.

I would be careful about adding overhead to memcpy.  I found that in
the kernel, almost all calls to memcpy are for less than 128 bytes (1
cache line on most 64-bit machines).  So, adding a lot of code to
detect cacheability and do prefetching is just going to slow down the
common case, which is short copies.  I don't have statistics for glibc
but I wouldn't be surprised if most copies were short there also.

The other thing that I have found is that code that is optimal for
cache-cold copies is usually significantly slower than optimal for
cache-hot copies, because the cache management instructions consume
cycles and don't help in the cache-hot case.

In other words, I don't think we should be tuning the glibc memcpy
based on tests of how fast it copies multiple megabytes.

Still, for 6xx/e300 cores, we probably do want to use dcbt/dcbz for
larger copies.  We don't want to use dcbt/dcbz on the larger 64-bit
processors (POWER4/5/6) because the hardware prefetching and
write-combining mean that dcbt/dcbz don't help and just slow things
down.
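
For anyone who wants to check that distribution for their own workload, a
quick LD_PRELOAD shim that counts copy sizes would do. A sketch (the 128-byte
split and the names are made up):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <dlfcn.h>

static void *(*real_memcpy)(void *, const void *, size_t);
static unsigned long small_calls, large_calls;

static void __attribute__((constructor)) resolve_real_memcpy(void)
{
	real_memcpy = (void *(*)(void *, const void *, size_t))
		      dlsym(RTLD_NEXT, "memcpy");
}

void *memcpy(void *dst, const void *src, size_t len)
{
	if (len < 128)
		small_calls++;
	else
		large_calls++;
	if (real_memcpy)
		return real_memcpy(dst, src, len);
	/* byte loop for any call made before the constructor has run */
	{
		char *d = dst;
		const char *s = src;
		while (len--)
			*d++ = *s++;
	}
	return dst;
}

static void __attribute__((destructor)) report(void)
{
	fprintf(stderr, "memcpy: %lu calls < 128 bytes, %lu calls >= 128 bytes\n",
		small_calls, large_calls);
}

Built with something like "gcc -shared -fPIC -O2 -o libcountcpy.so countcpy.c -ldl"
and run via LD_PRELOAD, it prints the split when the program exits.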

Paul.


* Re: Efficient memcpy()/memmove() for G2/G3 cores...
  2008-09-04  2:04                     ` Paul Mackerras
@ 2008-09-04 12:05                       ` David Jander
  2008-09-04 12:19                         ` Josh Boyer
  2008-09-04 18:14                       ` prodyut hazarika
  1 sibling, 1 reply; 27+ messages in thread
From: David Jander @ 2008-09-04 12:05 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, prodyut hazarika, John Rigby, munroesj

[-- Attachment #1: Type: text/plain, Size: 5031 bytes --]

On Thursday 04 September 2008 04:04:58 Paul Mackerras wrote:
> prodyut hazarika writes:
> > glibc memxxx for powerpc are horribly inefficient. For optimal
> > performance, we should should dcbt instruction to establish the source
> > address in cache, and dcbz to establish the destination address in cache.
> > We should do dcbt and dcbz such that the touches happen a line ahead of
> > the actual copy.
> >
> > The problem which is see is that dcbt and dcbz instructions don't work on
> > non-cacheable memory (obviously!). But memxxx function are used for both
> > cached and non-cached memory. Thus this optimized memcpy should be smart
> > enough to figure out that both source and destination address fall in
> > cacheable space, and only then
> > used the optimized dcbt/dcbz instructions.
>
> I would be careful about adding overhead to memcpy.  I found that in
> the kernel, almost all calls to memcpy are for less than 128 bytes (1
> cache line on most 64-bit machines).  So, adding a lot of code to
> detect cacheability and do prefetching is just going to slow down the
> common case, which is short copies.  I don't have statistics for glibc
> but I wouldn't be surprised if most copies were short there also.

Then please explain the following. This is a memcpy() speed test for different-sized 
blocks on an MPC5121e (DIU is turned on). The first case is glibc code 
without optimizations, and the second case is 16-register strides with 
dcbt/dcbz instructions, written in assembly language (see attachment).

$ ./memcpyspeed
Fully aligned:
100000 chunks of 5 bytes   :  3.48 Mbyte/s ( throughput:  6.96 Mbytes/s)
50000 chunks of 16 bytes   :  14.3 Mbyte/s ( throughput:  28.6 Mbytes/s)
10000 chunks of 100 bytes  :  14.4 Mbyte/s ( throughput:  28.8 Mbytes/s)
5000 chunks of 256 bytes   :  14.4 Mbyte/s ( throughput:  28.7 Mbytes/s)
1000 chunks of 1000 bytes  :  14.4 Mbyte/s ( throughput:  28.7 Mbytes/s)
50 chunks of 16384 bytes   :  14.2 Mbyte/s ( throughput:  28.4 Mbytes/s)
1 chunks of 1048576 bytes  :  14.4 Mbyte/s ( throughput:  28.8 Mbytes/s)

$ LD_PRELOAD=./libmemcpye300dj.so ./memcpyspeed
Fully aligned:
100000 chunks of 5 bytes   :  7.44 Mbyte/s ( throughput:  14.9 Mbytes/s)
50000 chunks of 16 bytes   :  13.1 Mbyte/s ( throughput:  26.2 Mbytes/s)
10000 chunks of 100 bytes  :  29.4 Mbyte/s ( throughput:  58.8 Mbytes/s)
5000 chunks of 256 bytes   :  90.2 Mbyte/s ( throughput:   180 Mbytes/s)
1000 chunks of 1000 bytes  :    77 Mbyte/s ( throughput:   154 Mbytes/s)
50 chunks of 16384 bytes   :  96.8 Mbyte/s ( throughput:   194 Mbytes/s)
1 chunks of 1048576 bytes  :  97.6 Mbyte/s ( throughput:   195 Mbytes/s)

(I have edited the output of this tool to fit into an e-mail without wrapping 
lines for readability).
Please tell me how on earth there can be such a big difference???
Note that on a MPC5200B this is TOTALLY different, and both processors have an 
e300 core (different versions of it though).

> The other thing that I have found is that code that is optimal for
> cache-cold copies is usually significantly slower than optimal for
> cache-hot copies, because the cache management instructions consume
> cycles and don't help in the cache-hot case.
>
> In other words, I don't think we should be tuning the glibc memcpy
> based on tests of how fast it copies multiple megabytes.

I don't just copy multiple megabytes! See the above example. Also, I do constant 
performance testing of different applications using LD_PRELOAD, to see the 
impact. Recently I even tried prboom (a free doom port), to remember the 
good old days of PC benchmarking ;-)
I have yet to come across a test that has lower performance with this 
optimization (on an MPC5121e that is).

> Still, for 6xx/e300 cores, we probably do want to use dcbt/dcbz for
> larger copies.  We don't want to use dcbt/dcbz on the larger 64-bit

At least for MPC5121e you really, really need it!!

> processors (POWER4/5/6) because the hardware prefetching and
> write-combining mean that dcbt/dcbz don't help and just slow things
> down.

That's explainable.
What's not explainable are the results I am getting on the MPC5121e.
Please, could someone tell me what I am doing wrong? (I must be doing 
something wrong, I'm almost sure.)
One thing that I realize is not quite "right" with memcpyspeed.c is the fact 
that it copies consecutive blocks of memory; that should have an impact on the 
5-byte and 16-byte copy results I guess (a cache line for the following block 
may already be fetched), but not anymore for 100-byte blocks and bigger (with 
32-byte cache lines). In fact, 16 bytes seems to be the only size where the 
additional overhead has some impact (which is negligible).

Another thing is that performance probably matters most to the end user when 
applications need to copy big amounts of data (e.g. video frames or bitmap 
data), which is most probably done using big blocks of memcpy(), so possibly 
hurting performance for small copies probably has less weight on 
the overall experience.

Best regards,

-- 
David Jander

[-- Attachment #2: memcpy_e300_dj.S --]
[-- Type: text/x-objcsrc, Size: 5413 bytes --]

/* Optimized memcpy() implementation for PowerPC e300c4 core (Freescale MPC5121)
 *
 * Written by Gunnar von Boehn
 * Tweaked by David Jander to improve performance on MPC5121e processor.
 */

#include "ppc_asm.h"

#define L1_CACHE_SHIFT          5 
#define MAX_COPY_PREFETCH       4 
#define L1_CACHE_BYTES          (1 << L1_CACHE_SHIFT) 
 
CACHELINE_BYTES = L1_CACHE_BYTES 
LG_DOUBLE_CACHELINE = (L1_CACHE_SHIFT+1)
CACHELINE_MASK = (L1_CACHE_BYTES-1) 

/* 
 * Memcpy optimized for PPC e300 
 * 
 * This relatively simple memcpy does the following to optimize performance 
 * 
 * For sizes > 32 byte: 
 * DST is aligned to a 32bit boundary - using 8bit copies 
 * DST is aligned to a cache line boundary (32byte) - using aligned 32bit copies 
 * The main copy loop processes one cache line (32byte) per iteration 
 * The DST cache line is cleared using DCBZ 
 * The clearing of the aligned DST cache line is very important for performance 
 * it prevents the CPU from fetching the DST line from memory - this saves 33% of memory accesses. 
 * To optimize SRC read performance the SRC is prefetched using DCBT 
 * 
 * The trick for getting good performance is to use a good match of prefetch distance 
 * for SRC reading and for DST clearing. 
 * Typically you DCBZ the DST 0 or 1 cache lines ahead 
 * Typically you DCBT the SRC 2 - 4 cache lines ahead 
 * on the e300 prefetching the SRC too far ahead will be slower than not prefetching at all. 
 * 
 * We use  DCBZ DST[0]  and DCBT SRC[0-1] depending on the SRC alignment 
 * 
 */ 
.align 2 

/* parameters r3=DST, r4=SRC, r5=size */ 
/* returns r3=DST */ 


.global memcpy
memcpy:
	mr      r7,r3                    /* Save DST in r7 for return */
	dcbt    0,r4                     /* Prefetch SRC cache line 32byte */ 
	neg     r0,r3                    /* DST alignment */ 
	addi    r4,r4,-4
	andi.   r0,r0,CACHELINE_MASK     /* # of bytes away from cache line boundary */ 
	addi    r6,r3,-4
	cmplw   cr1,r5,r0                /* is this more than total to do? */ 
	beq     .Lcachelinealigned 
	
	blt     cr1,.Lcopyrest                  /* if not much to do */ 
 
	andi.   r8,r0,3                         /* get it word-aligned first */ 
	mtctr   r8 
	beq+    .Ldstwordaligned 
.Laligntoword:  
	lbz     r9,4(r4)                        /* we copy bytes (8bit) 0-3  */ 
	stb     r9,4(r6)                        /* to get the DST 32bit aligned */ 
	addi    r4,r4,1 
	addi    r6,r6,1 
	bdnz    .Laligntoword 

.Ldstwordaligned: 
	subf    r5,r0,r5 
	srwi.   r0,r0,2 
	mtctr   r0 
	beq     .Lcachelinealigned 

.Laligntocacheline: 
	lwzu    r9,4(r4)                        /* do copy 32bit words (0-7) */ 
	stwu    r9,4(r6)                        /* to get DST cache line aligned (32byte) */ 
	bdnz    .Laligntocacheline 

.Lcachelinealigned:
	srwi.   r0,r5,LG_DOUBLE_CACHELINE        /* # of complete 64-byte (double cache line) blocks */ 
	clrlwi  r5,r5,32-LG_DOUBLE_CACHELINE 
	li      r11,32
	beq     .Lcopyrest 

	addi    r3,r4,4                         /* Find out which SRC cacheline to prefetch */ 
	neg     r3,r3    
	andi.   r3,r3,31 
	addi    r3,r3,32 
	
	mtctr   r0 

	stwu    r1,-76(r1) /* Save some tmp registers */
	stw     r23,28(r1)
	stw     r30,56(r1)
	stw     r31,60(r1)
	stw     r24,32(r1)
	stw     r25,36(r1)
	stw     r26,40(r1)
	stw     r27,44(r1)
	stw     r28,48(r1)
	stw     r29,52(r1)
	stw     r13,64(r1)
	stw     r14,68(r1)
	stw     r15,72(r1)
	
.align 7 
.Lloop:                                         /* the main body of the cacheline loop */ 
	dcbt    r3,r4                           /* SRC cache line prefetch */ 
	dcbz    r11,r6                          /* clear DST cache line */ 
	lwz     r31, 0x04(r4)                   /* copy using a 16-register stride (two cache lines per iteration) */ 
	lwz     r8,  0x08(r4)
	lwz     r9,  0x0c(r4)
	lwz     r10, 0x10(r4)
	lwz     r12, 0x14(r4)
	lwz     r13, 0x18(r4)
	lwz     r14, 0x1c(r4)
	lwzu    r23, 0x20(r4)
	dcbt    r3,r4                           /* SRC cache line prefetch */ 
	lwz     r24, 0x04(r4)
	lwz     r25, 0x08(r4)
	lwz     r26, 0x0c(r4)
	lwz     r27, 0x10(r4)
	lwz     r28, 0x14(r4)
	lwz     r29, 0x18(r4)
	lwz     r30, 0x1c(r4)
	lwzu    r15, 0x20(r4)
	stw     r31, 0x04(r6)
	stw     r8,  0x08(r6)
	stw     r9,  0x0c(r6)
	stw     r10, 0x10(r6)
	stw     r12, 0x14(r6)
	stw     r13, 0x18(r6)
	stw     r14, 0x1c(r6)
	stwu    r23, 0x20(r6)
	dcbz    r11,r6                          /* clear DST cache line */ 
	stw     r24, 0x04(r6)
	stw     r25, 0x08(r6)
	stw     r26, 0x0c(r6)
	stw     r27, 0x10(r6)
	stw     r28, 0x14(r6)
	stw     r29, 0x18(r6)
	stw     r30, 0x1c(r6)
	stwu    r15, 0x20(r6)
	bdnz    .Lloop 
		
	lwz     r24,32(r1) /* restore tmp registers */
	lwz     r23,28(r1)
	lwz     r25,36(r1)
	lwz     r26,40(r1)
	lwz     r27,44(r1)
	lwz     r28,48(r1)
	lwz     r29,52(r1)
	lwz     r30,56(r1)
	lwz     r31,60(r1)
	lwz     r13,64(r1)
	lwz     r14,68(r1)
	lwz     r15,72(r1)
	addi    r1,r1,76
 
.Lcopyrest:
	srwi.   r0,r5,2 
	mtctr   r0 
	beq     .Llastbytes

.Lcopywords:    
	lwzu    r0,4(r4)                        /* we copy remaining words (0-7) */ 
	stwu    r0,4(r6)    
	bdnz    .Lcopywords 

.Llastbytes: 
	andi.   r0,r5,3 
	mtctr   r0 
	beq+    .Lend

.Lcopybytes:    
	lbz     r0,4(r4)                        /* we copy remaining bytes (0-3)  */ 
	stb     r0,4(r6) 
	addi    r4,r4,1 
	addi    r6,r6,1 
	bdnz    .Lcopybytes 

.Lend:  /* done : return 0 for Linux / DST for glibc*/ 
	mr      r3, r7
	blr 

[-- Attachment #3: memcpyspeed.c --]
[-- Type: text/x-csrc, Size: 4073 bytes --]

#include <stdio.h>
#include <sys/mman.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

// #define VIDEO_MMAP
// #define TEST_UNALIGNED

static void *srcpool;
static void *dstpool;

unsigned int sizes[] = {5,      16,    100,   256,  1000, 16384, 1048576};
unsigned int nums[] =  {100000, 50000, 10000, 5000, 1000, 50,    1};

#define TESTRUNS 10

unsigned int memtest(int size, int num, int srcaligned, int dstaligned)
{
	struct timeval tv0, tv1;
	unsigned char *src=(unsigned char *)srcpool, *dst=(unsigned char *)dstpool;
	unsigned char *sp, *dp;
	unsigned int t,i;
	long int usecs;
	unsigned long int secs;
	
	/* Get src and dst 32-byte aligned */
	src = (unsigned char *)((unsigned int)(src+31) & 0xffffffe0);
	dst = (unsigned char *)((unsigned int)(dst+31) & 0xffffffe0);
	
	/* Now unalign them if desired (some random offset) */
	if(!srcaligned)
		src += 11;
	if(!dstaligned)
		dst += 13;
	
	/* "Train" the system (caches, paging, etc...) */
	sp = src;
	dp = dst;
	for(i=0; i<num; i++) {
		memcpy(dp, sp, size);
		sp += size;
		dp += size;
	}
	
	/* Start measurement */
	gettimeofday(&tv0, NULL);
	for(t=0; t<TESTRUNS; t++) {
		sp = src;
		dp = dst;
		for(i=0; i<num; i++) {
			memcpy(dp, sp, size);
			sp += size;
			dp += size;
		}
	}
	gettimeofday(&tv1, NULL);
	secs = tv1.tv_sec-tv0.tv_sec;
	usecs = tv1.tv_usec-tv0.tv_usec;
	if(usecs<0) {
		usecs += 1000000;
		secs -= 1;
	}
	return usecs+1000000L*secs;
}

unsigned int memverify(int size, int num, int srcaligned, int dstaligned)
{
	unsigned char *src=(unsigned char *)srcpool, *dst=(unsigned char *)dstpool;
	
	/* Get src and dst 32-byte aligned */
	src = (unsigned char *)((unsigned int)(src+31) & 0xffffffe0);
	dst = (unsigned char *)((unsigned int)(dst+31) & 0xffffffe0);
	
	/* Now unalign them if desired (some random offset) */
	if(!srcaligned)
		src += 11;
	if(!dstaligned)
		dst += 13;
	
	return memcmp(dst, src, size*num);
}


void evaluate(char *name, unsigned int totalsize, unsigned int usecs)
{
	double rate;
	
	rate = (double)totalsize*(double)TESTRUNS/((double)usecs/1000000.0);
	rate /= (1024.0*1024.0);
	printf("Memcpy %-30s: %5.3g Mbyte/s (memory throughput: %5.3g Mbytes/s)\n",name, rate, rate*2.0);
}

int main(void)
{
	int t,i;
	unsigned int usecs;
	char buf[50];
	struct timeval tv;
#ifdef VIDEO_MMAP
	unsigned long int *mem;
	int f;
	
	printf("Opening fb0\n");
	f = open("/dev/fb0", O_RDWR);
	if(f<0) {
		perror("opening fb0");
		return 1;
	}
	printf("mmapping fb0\n");
	
	mem = mmap(NULL, 0x00300000, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,f,0);
	
	printf("mmap returned: %08x\n",(unsigned int)mem);
	perror("mmap");
	if(mem==-1)
		return 1;
#else
	unsigned long int mem[786432];
#endif
	
	srcpool = (unsigned char *)mem;
	dstpool = (unsigned char *)mem;
	dstpool += 1572864; /* 1.5 Mbyte offset into 3 Mbyte framebuffer */
	
	gettimeofday(&tv, NULL);
	for(t=0; t<0x000c0000; t++)
		mem[t] = (tv.tv_usec ^ tv.tv_sec) ^ t;

	printf("Fully aligned:\n");	
	for(t=0; t<(sizeof(nums)/sizeof(nums[0])); t++) {
		snprintf(buf, 50, "%d chunks of %d bytes", nums[t], sizes[t]);
		usecs = memtest(sizes[t], nums[t], 1, 1);
		evaluate(buf, nums[t]*sizes[t], usecs);
		if(memverify(sizes[t], nums[t], 1, 1)) {
			printf("Verify faild!\n");
		}
	}
#ifdef TEST_UNALIGNED
	printf("source unaligned:\n");	
	for(t=0; t<(sizeof(nums)/sizeof(nums[0])); t++) {
		snprintf(buf, 50, "%d chunks of %d bytes", nums[t], sizes[t]);
		usecs = memtest(sizes[t], nums[t], 0, 1);
		evaluate(buf, nums[t]*sizes[t], usecs);
	}
	
	printf("destination unaligned:\n");	
	for(t=0; t<(sizeof(nums)/sizeof(nums[0])); t++) {
		snprintf(buf, 50, "%d chunks of %d bytes", nums[t], sizes[t]);
		usecs = memtest(sizes[t], nums[t], 1, 0);
		evaluate(buf, nums[t]*sizes[t], usecs);
	}
	
	printf("both unaligned:\n");	
	for(t=0; t<(sizeof(nums)/sizeof(nums[0])); t++) {
		snprintf(buf, 50, "%d chunks of %d bytes", nums[t], sizes[t]);
		usecs = memtest(sizes[t], nums[t], 0, 0);
		evaluate(buf, nums[t]*sizes[t], usecs);
	}
#endif	
	return 0;
}

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Efficient memcpy()/memmove() for G2/G3 cores...
  2008-09-04 12:05                       ` David Jander
@ 2008-09-04 12:19                         ` Josh Boyer
  2008-09-04 12:59                           ` David Jander
  0 siblings, 1 reply; 27+ messages in thread
From: Josh Boyer @ 2008-09-04 12:19 UTC (permalink / raw)
  To: David Jander
  Cc: linuxppc-dev, Paul Mackerras, munroesj, John Rigby,
	prodyut hazarika

On Thu, Sep 04, 2008 at 02:05:16PM +0200, David Jander wrote:
>> I would be careful about adding overhead to memcpy.  I found that in
>> the kernel, almost all calls to memcpy are for less than 128 bytes (1
>> cache line on most 64-bit machines).  So, adding a lot of code to
>> detect cacheability and do prefetching is just going to slow down the
>> common case, which is short copies.  I don't have statistics for glibc
>> but I wouldn't be surprised if most copies were short there also.
>
>Then please explain the following. This is a memcpy() speed test for different 
>sized blocks on a MPC5121e (DIU is turned on). The first case is glibc code 
>without optimizations, and the second case is 16-register strides with 
>dcbt/dcbz instructions, written in assembly language (see attachment)
>
>$ ./memcpyspeed
>Fully aligned:
>100000 chunks of 5 bytes   :  3.48 Mbyte/s ( throughput:  6.96 Mbytes/s)
>50000 chunks of 16 bytes   :  14.3 Mbyte/s ( throughput:  28.6 Mbytes/s)
>10000 chunks of 100 bytes  :  14.4 Mbyte/s ( throughput:  28.8 Mbytes/s)
>5000 chunks of 256 bytes   :  14.4 Mbyte/s ( throughput:  28.7 Mbytes/s)
>1000 chunks of 1000 bytes  :  14.4 Mbyte/s ( throughput:  28.7 Mbytes/s)
>50 chunks of 16384 bytes   :  14.2 Mbyte/s ( throughput:  28.4 Mbytes/s)
>1 chunks of 1048576 bytes  :  14.4 Mbyte/s ( throughput:  28.8 Mbytes/s)
>
>$ LD_PRELOAD=./libmemcpye300dj.so ./memcpyspeed
>Fully aligned:
>100000 chunks of 5 bytes   :  7.44 Mbyte/s ( throughput:  14.9 Mbytes/s)
>50000 chunks of 16 bytes   :  13.1 Mbyte/s ( throughput:  26.2 Mbytes/s)
>10000 chunks of 100 bytes  :  29.4 Mbyte/s ( throughput:  58.8 Mbytes/s)
>5000 chunks of 256 bytes   :  90.2 Mbyte/s ( throughput:   180 Mbytes/s)
>1000 chunks of 1000 bytes  :    77 Mbyte/s ( throughput:   154 Mbytes/s)
>50 chunks of 16384 bytes   :  96.8 Mbyte/s ( throughput:   194 Mbytes/s)
>1 chunks of 1048576 bytes  :  97.6 Mbyte/s ( throughput:   195 Mbytes/s)
>
>(I have edited the output of this tool to fit into an e-mail without wrapping 
>lines for readability).
>Please tell me how on earth there can be such a big difference???
>Note that on a MPC5200B this is TOTALLY different, and both processors have an 
>e300 core (different versions of it though).

How can there be such a big difference in throughput?  Well, your algorithm
seems better optimized than the glibc one for your testcase :).

>> The other thing that I have found is that code that is optimal for
>> cache-cold copies is usually significantly slower than optimal for
>> cache-hot copies, because the cache management instructions consume
>> cycles and don't help in the cache-hot case.
>>
>> In other words, I don't think we should be tuning the glibc memcpy
>> based on tests of how fast it copies multiple megabytes.
>
>I don't just copy multiple megabytes! See above example. Also I do constant 
>performance testing of different applications using LD_PRELOAD, to see the 
>impact. Recently I even tried prboom (a free doom port), to remember the 
>good old days of PC benchmarking ;-)
>I have yet to come across a test that has lower performance with this 
>optimization (on an MPC5121e that is).
>
>> Still, for 6xx/e300 cores, we probably do want to use dcbt/dcbz for
>> larger copies.  We don't want to use dcbt/dcbz on the larger 64-bit
>
>At least for MPC5121e you really, really need it!!
>
>> processors (POWER4/5/6) because the hardware prefetching and
>> write-combining mean that dcbt/dcbz don't help and just slow things
>> down.
>
>That's explainable.
>What's not explainable, are the results I am getting on the MPC5121e.
>Please, could someone tell me what I am doing wrong? (I must be doing 
>something wrong, I'm almost sure).

I don't think you're doing anything wrong exactly.  But it seems that
your testcase sits there and just copies data with memcpy in varying
sizes and amounts.  That's not exactly a real-world usecase is it?

I think what Paul was saying is that during the course of runtime for a
normal program (the kernel or userspace), most memcpy operations will be of
a small order of magnitude.  They will also be scattered among code that does
_other_ stuff than just memcpy.  So he's concerned about the overhead of an
implementation that sets up the cache to do a single 32 byte memcpy.

Of course, I could be totally wrong.  I haven't had my coffee yet this
morning after all.

josh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Efficient memcpy()/memmove() for G2/G3 cores...
  2008-09-04 12:19                         ` Josh Boyer
@ 2008-09-04 12:59                           ` David Jander
  2008-09-04 14:31                             ` Steven Munroe
  2008-09-04 15:01                             ` Gunnar Von Boehn
  0 siblings, 2 replies; 27+ messages in thread
From: David Jander @ 2008-09-04 12:59 UTC (permalink / raw)
  To: Josh Boyer
  Cc: munroesj, John Rigby, Gunnar Von Boehn, linuxppc-dev,
	Paul Mackerras, prodyut hazarika

On Thursday 04 September 2008 14:19:26 Josh Boyer wrote:
>[...]
> >$ ./memcpyspeed
> >Fully aligned:
> >100000 chunks of 5 bytes   :  3.48 Mbyte/s ( throughput:  6.96 Mbytes/s)
> >50000 chunks of 16 bytes   :  14.3 Mbyte/s ( throughput:  28.6 Mbytes/s)
> >10000 chunks of 100 bytes  :  14.4 Mbyte/s ( throughput:  28.8 Mbytes/s)
> >5000 chunks of 256 bytes   :  14.4 Mbyte/s ( throughput:  28.7 Mbytes/s)
> >1000 chunks of 1000 bytes  :  14.4 Mbyte/s ( throughput:  28.7 Mbytes/s)
> >50 chunks of 16384 bytes   :  14.2 Mbyte/s ( throughput:  28.4 Mbytes/s)
> >1 chunks of 1048576 bytes  :  14.4 Mbyte/s ( throughput:  28.8 Mbytes/s)
> >
> >$ LD_PRELOAD=./libmemcpye300dj.so ./memcpyspeed
> >Fully aligned:
> >100000 chunks of 5 bytes   :  7.44 Mbyte/s ( throughput:  14.9 Mbytes/s)
> >50000 chunks of 16 bytes   :  13.1 Mbyte/s ( throughput:  26.2 Mbytes/s)
> >10000 chunks of 100 bytes  :  29.4 Mbyte/s ( throughput:  58.8 Mbytes/s)
> >5000 chunks of 256 bytes   :  90.2 Mbyte/s ( throughput:   180 Mbytes/s)
> >1000 chunks of 1000 bytes  :    77 Mbyte/s ( throughput:   154 Mbytes/s)
> >50 chunks of 16384 bytes   :  96.8 Mbyte/s ( throughput:   194 Mbytes/s)
> >1 chunks of 1048576 bytes  :  97.6 Mbyte/s ( throughput:   195 Mbytes/s)
> >
> >(I have edited the output of this tool to fit into an e-mail without
> > wrapping lines for readability).
> >Please tell me how on earth there can be such a big difference???
> >Note that on a MPC5200B this is TOTALLY different, and both processors
> > have an e300 core (different versions of it though).
>
> How can there be such a big difference in throughput?  Well, your algorithm
> seems better optimized than the glibc one for your testcase :).

Yes, I admit my testcase is focussing on optimizing memcpy() of uncached data, 
and that interest stems from the fact that I was testing X11 performance 
(using xorg kdrive and xorg-server), and wondering why this processor wasn't 
able to get more FPS when moving frames on screen or scrolling, when in 
theory the on-board RAM should have bandwidth enough to get a smooth image.
What I mean is that I have a hard time believing that this processor core is 
so dependent of tweaks in order to get some decent memory throughput. The 
MPC5200B does get higher througput with much less effort, and the two cores 
should be fairly identical (besides the MPC5200B having less cache memory and 
some other details).

>[...]
> I don't think you're doing anything wrong exactly.  But it seems that
> your testcase sits there and just copies data with memcpy in varying
> sizes and amounts.  That's not exactly a real-world usecase is it?

No, of course it's not. I made this program to test the performance difference 
of different tweaks quickly. Once I found something that worked, I started 
LD_PRELOADing it to different other programs (among others the kdrive 
Xserver, mplayer, and x11perf) to see its impact on performance of some 
real-life apps. There the difference in performance is not so impressive of 
course, but it is still there (almost always either noticeably in favor of 
the tweaked version of memcpy(), or with a negligible or no difference).

I have not studied the different application's uses of memcpy(), and only done 
empirical tests so far.

> I think what Paul was saying is that during the course of runtime for a
> normal program (the kernel or userspace), most memcpy operations will be of
> a small order of magnitude.  They will also be scattered among code that
> does _other_ stuff than just memcpy.  So he's concerned about the overhead
> of an implementation that sets up the cache to do a single 32 byte memcpy.

I understand. I also have this concern, especially for other processors, as 
the MPC5200B, where there doesn't seem to be so much to gain anyway.

> Of course, I could be totally wrong.  I haven't had my coffee yet this
> morning after all.

You're doing quite good regardless of your lack of caffeine ;-)

Greetings,

-- 
David Jander

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Efficient memcpy()/memmove() for G2/G3 cores...
  2008-09-04 12:59                           ` David Jander
@ 2008-09-04 14:31                             ` Steven Munroe
  2008-09-04 14:45                               ` Gunnar Von Boehn
                                                 ` (2 more replies)
  2008-09-04 15:01                             ` Gunnar Von Boehn
  1 sibling, 3 replies; 27+ messages in thread
From: Steven Munroe @ 2008-09-04 14:31 UTC (permalink / raw)
  To: David Jander
  Cc: munroesj, Gunnar Von Boehn, John Rigby, linuxppc-dev,
	Paul Mackerras, prodyut hazarika

On Thu, 2008-09-04 at 14:59 +0200, David Jander wrote:
> On Thursday 04 September 2008 14:19:26 Josh Boyer wrote:
> >[...]
> > >(I have edited the output of this tool to fit into an e-mail without
> > > wrapping lines for readability).
> > >Please tell me how on earth there can be such a big difference???
> > >Note that on a MPC5200B this is TOTALLY different, and both processors
> > > have an e300 core (different versions of it though).
> >
> > How can there be such a big difference in throughput?  Well, your algorithm
> > seems better optimized than the glibc one for your testcase :).
> 
> Yes, I admit my testcase is focussing on optimizing memcpy() of uncached data, 
> and that interest stems from the fact that I was testing X11 performance 
> (using xorg kdrive and xorg-server), and wondering why this processor wasn't 
> able to get more FPS when moving frames on screen or scrolling, when in 
> theory the on-board RAM should have bandwidth enough to get a smooth image.
> What I mean is that I have a hard time believing that this processor core is 
> so dependent on tweaks in order to get some decent memory throughput. The 
> MPC5200B does get higher throughput with much less effort, and the two cores 
> should be fairly identical (besides the MPC5200B having less cache memory and 
> some other details).
> 

I have personally optimized memcpy for power4/5/6 and they are all
different. There are dozens of different PPC implementations from
different manufacturers and design, every one is different! With painful
negotiation I was able to get the --with-cpu= framework added to glibc
but not all distro use it. You can thank me later ...

MPC5200B? never heard of it, don't care. I am busy with power7.

So don't assume we are stupid because we have not dropped everything to
optimize memcpy for YOUR processor and YOUR specific case.

You care, you are a programmer? Write code! If you care about the
community then fit your optimization into the framework provided for CPU
specific optimization and submit it so others can benefit. 

> >[...]
> > I don't think you're doing anything wrong exactly.  But it seems that
> > your testcase sits there and just copies data with memcpy in varying
> > sizes and amounts.  That's not exactly a real-world usecase is it?
> 
> No, of course it's not. I made this program to test the performance difference 
> of different tweaks quickly. Once I found something that worked, I started 
> LD_PRELOADing it to different other programs (among others the kdrive 
> Xserver, mplayer, and x11perf) to see its impact on performance of some 
> real-life apps. There the difference in performance is not so impressive of 
> course, but it is still there (almost always either noticeably in favor of 
> the tweaked version of memcpy(), or with a negligible or no difference).
> 
The trick is that the code built into glibc has to be optimal for the
average case (4-256, average 12 bytes). Actually most memcpy
implementations are a series of special cases for length and alignment. 

You can always do better if you know exactly what processor you are on
and what specific sizes and alignment your application uses.
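
Roughly, the shape of such an implementation is this (a simplified sketch
only, not the actual glibc code; the function name and the 16-byte cut-off
are just for illustration):

#include <stddef.h>

void *memcpy_sketch(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	/* special case: short copies get a plain byte loop, no setup cost */
	if (len < 16) {
		while (len--)
			*d++ = *s++;
		return dst;
	}

	/* special case: peel bytes until the destination is word aligned */
	while ((unsigned long)d & 3) {
		*d++ = *s++;
		len--;
	}

	if (((unsigned long)s & 3) == 0) {
		/* both pointers now word aligned: bulk word copy (a tuned
		   version unrolls this and adds cache hints for large len) */
		while (len >= 4) {
			*(unsigned int *)d = *(const unsigned int *)s;
			d += 4; s += 4; len -= 4;
		}
	}

	/* tail bytes (and the misaligned-source case) */
	while (len--)
		*d++ = *s++;
	return dst;
}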

> I have not studied the different application's uses of memcpy(), and only done 
> empirical tests so far.
> 
> > I think what Paul was saying is that during the course of runtime for a
> > normal program (the kernel or userspace), most memcpy operations will be of
> > a small order of magnitude.  They will also be scattered among code that
> > does _other_ stuff than just memcpy.  So he's concerned about the overhead
> > of an implementation that sets up the cache to do a single 32 byte memcpy.
> 
> I understand. I also have this concern, especially for other processors, as 
> the MPC5200B, where there doesn't seem to be so much to gain anyway.
> 
> > Of course, I could be totally wrong.  I haven't had my coffee yet this
> > morning after all.
> 
> You're doing quite good regardless of your lack of caffeine ;-)
> 
> Greetings,
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Efficient memcpy()/memmove() for G2/G3 cores...
  2008-09-04 14:31                             ` Steven Munroe
@ 2008-09-04 14:45                               ` Gunnar Von Boehn
  2008-09-04 15:14                               ` Gunnar Von Boehn
  2008-09-04 16:25                               ` David Jander
  2 siblings, 0 replies; 27+ messages in thread
From: Gunnar Von Boehn @ 2008-09-04 14:45 UTC (permalink / raw)
  To: munroesj
  Cc: David Jander, John Rigby, linuxppc-dev, Paul Mackerras,
	prodyut hazarika

Steve,

I think we should be grateful for people being interested in improving
performance for PPC,
and we should not bash them.

The proposal to optimize the memcopy for the 5200 is good.


Steve, you said that you've never heard about the 5200...
Maybe I can refresh your memory:
I sent you an optimized 32-bit memcopy version for the 5200 about
half a year ago, with the kind request for inclusion.
As you might recall, the optimized 5200 memcopy version that I had sent
you improved the performance by 50%.


Kind regards
Gunnar


On Thu, Sep 4, 2008 at 4:31 PM, Steven Munroe
<munroesj@linux.vnet.ibm.com> wrote:
> On Thu, 2008-09-04 at 14:59 +0200, David Jander wrote:
>> On Thursday 04 September 2008 14:19:26 Josh Boyer wrote:
>> >[...]
>> > >(I have edited the output of this tool to fit into an e-mail without
>> > > wrapping lines for readability).
>> > >Please tell me how on earth there can be such a big difference???
>> > >Note that on a MPC5200B this is TOTALLY different, and both processors
>> > > have an e300 core (different versions of it though).
>> >
>> > How can there be such a big difference in throughput?  Well, your algorithm
>> > seems better optimized than the glibc one for your testcase :).
>>
>> Yes, I admit my testcase is focussing on optimizing memcpy() of uncached data,
>> and that interest stems from the fact that I was testing X11 performance
>> (using xorg kdrive and xorg-server), and wondering why this processor wasn't
>> able to get more FPS when moving frames on screen or scrolling, when in
>> theory the on-board RAM should have bandwidth enough to get a smooth image.
>> What I mean is that I have a hard time believing that this processor core is
>> so dependent on tweaks in order to get some decent memory throughput. The
>> MPC5200B does get higher throughput with much less effort, and the two cores
>> should be fairly identical (besides the MPC5200B having less cache memory and
>> some other details).
>>
>
> I have personally optimized memcpy for power4/5/6 and they are all
> different. There are dozens of different PPC implementations from
> different manufacturers and design, every one is different! With painful
> negotiation I was able to get the --with-cpu= framework added to glibc
> but not all distro use it. You can thank me later ...
>
> MPC5200B? never heard of it, don't care. I am busy with power7.
>
> So don't assume we are stupid because we have not dropped everything to
> optimize memcpy for YOUR processor and YOUR specific case.
>
> You care, you are a programmer? Write code! If you care about the
> community then fit your optimization into the framework provided for CPU
> specific optimization and submit it so others can benefit.
>
>> >[...]
>> > I don't think you're doing anything wrong exactly.  But it seems that
>> > your testcase sits there and just copies data with memcpy in varying
>> > sizes and amounts.  That's not exactly a real-world usecase is it?
>>
>> No, of course it's not. I made this program to test the performance difference
>> of different tweaks quickly. Once I found something that worked, I started
>> LD_PRELOADing it to different other programs (among others the kdrive
>> Xserver, mplayer, and x11perf) to see its impact on performance of some
>> real-life apps. There the difference in performance is not so impressive of
>> course, but it is still there (almost always either noticeably in favor of
>> the tweaked version of memcpy(), or with a negligible or no difference).
>>
> The trick is that the code built into glibc has to be optimal for the
> average case (4-256, average 12 bytes). Actually most memcpy
> implementations are a series of special cases for length and alignment.
>
> You can always do better if you know exactly what processor you are on
> and what specific sizes and alignment your application uses.
>
>> I have not studied the different application's uses of memcpy(), and only done
>> empirical tests so far.
>>
>> > I think what Paul was saying is that during the course of runtime for a
>> > normal program (the kernel or userspace), most memcpy operations will be of
>> > a small order of magnitude.  They will also be scattered among code that
>> > does _other_ stuff than just memcpy.  So he's concerned about the overhead
>> > of an implementation that sets up the cache to do a single 32 byte memcpy.
>>
>> I understand. I also have this concern, especially for other processors, as
>> the MPC5200B, where there doesn't seem to be so much to gain anyway.
>>
>> > Of course, I could be totally wrong.  I haven't had my coffee yet this
>> > morning after all.
>>
>> You're doing quite good regardless of your lack of caffeine ;-)
>>
>> Greetings,
>>
>
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Efficient memcpy()/memmove() for G2/G3 cores...
  2008-09-04 12:59                           ` David Jander
  2008-09-04 14:31                             ` Steven Munroe
@ 2008-09-04 15:01                             ` Gunnar Von Boehn
  2008-09-04 16:32                               ` David Jander
  1 sibling, 1 reply; 27+ messages in thread
From: Gunnar Von Boehn @ 2008-09-04 15:01 UTC (permalink / raw)
  To: David Jander
  Cc: munroesj, John Rigby, linuxppc-dev, Paul Mackerras,
	prodyut hazarika

Hi David,

Regarding your testcase.

I think we all agree with you that improving the performance for PPC
is a noble quest
and we should all try to improve the performance where possible.


Regarding the 5200B and 5121 CPUs.


As we all know the 5200B is a G2 PowerPC from Freescale.

Two factors determine the memory performance of this PPC:
A) This CPU has ZERO 2nd level cache
B) This CPU can remember exactly one prefetched memory line.

This means the normal memcopy routines that prefetch several cache
lines ahead DO NOT WORK!
To get good/best performance you need to prefetch EXACTLY ONE cache line ahead.

Altering the Linux kernel or glibc memcopy routines for the G2/PPC
core to work like this is actually very simple, and doing so will
increase performance by 100%.
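
The inner loop then looks roughly like this in C (only a sketch: it assumes
32-byte cache lines, cacheable buffers that are 32-byte aligned, a length
that is a multiple of 32, and 4-byte words; head/tail handling and the
function name are made up for illustration):

#include <stddef.h>

static void copy_lines_one_ahead(unsigned int *dst, const unsigned int *src,
                                 size_t len)
{
	while (len >= 32) {
		/* touch only the NEXT source line - the G2 tracks one prefetch */
		__asm__ volatile ("dcbt 0,%0" : : "r" (src + 8));
		/* establish the destination line without reading it from RAM */
		__asm__ volatile ("dcbz 0,%0" : : "r" (dst) : "memory");
		dst[0] = src[0]; dst[1] = src[1]; dst[2] = src[2]; dst[3] = src[3];
		dst[4] = src[4]; dst[5] = src[5]; dst[6] = src[6]; dst[7] = src[7];
		src += 8;
		dst += 8;
		len -= 32;
	}
}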



Regarding the 5121.
David, you did create a very special memcopy for the 5121e CPU.
Your test showed us that the normal glibc memcopy is about 10 times
slower than expected on the 5121.

I really wonder why this is the case.
I would have expected the 5121 to perform just like the 5200B.
What we saw is that switching from READ to WRITE and back is very
costly on 5121.

There seems to be a huge difference between the 5200 and its successor the 5121.
Is this performance difference caused by the CPU or by the board /memory?

Cheers
Gunnar

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Efficient memcpy()/memmove() for G2/G3 cores...
  2008-09-04 14:31                             ` Steven Munroe
  2008-09-04 14:45                               ` Gunnar Von Boehn
@ 2008-09-04 15:14                               ` Gunnar Von Boehn
  2008-09-04 16:25                               ` David Jander
  2 siblings, 0 replies; 27+ messages in thread
From: Gunnar Von Boehn @ 2008-09-04 15:14 UTC (permalink / raw)
  To: munroesj
  Cc: David Jander, John Rigby, linuxppc-dev, Paul Mackerras,
	prodyut hazarika

Hi Steve,

> I have personally optimized memcpy for power4/5/6 and they are all
> different. There are dozens of different PPC implementations from
> different manufacturers and design, every one is different! With painful
> negotiation I was able to get the --with-cpu= framework added to glibc
> but not all distro use it. You can thank me later

Steve, you make it sound like there are very many different PowerPC chips:

You said you did the Power 4, Power 5, Power 6 and now Power 7 routines.
And there are the 970 and the Cell.

While this sounds like 7 different PPC chips,
aren't these actually only 2 main families?


Wouldn't it be possible to create two main routines to cover them all?
One type that performs well on the family of Power4/5 and 7.
And one that performs well on the family of P6 and Cell?

How are the Linux hackers handling this?
Maybe there is room for consolidating?


Cheers
Gunnar

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Efficient memcpy()/memmove() for G2/G3 cores...
  2008-09-04 14:31                             ` Steven Munroe
  2008-09-04 14:45                               ` Gunnar Von Boehn
  2008-09-04 15:14                               ` Gunnar Von Boehn
@ 2008-09-04 16:25                               ` David Jander
  2 siblings, 0 replies; 27+ messages in thread
From: David Jander @ 2008-09-04 16:25 UTC (permalink / raw)
  To: munroesj; +Cc: linuxppc-dev, Gunnar Von Boehn


Hi Steven,

On Thursday 04 September 2008 16:31:13 Steven Munroe wrote:
>[...]
> > Yes, I admit my testcase is focussing on optimizing memcpy() of uncached
> > data, and that interest stems from the fact that I was testing X11
> > performance (using xorg kdrive and xorg-server), and wondering why this
> > processor wasn't able to get more FPS when moving frames on screen or
> > scrolling, when in theory the on-board RAM should have bandwidth enough
> > to get a smooth image. What I mean is that I have a hard time believing
> > that this processor core is so dependent on tweaks in order to get some
> > decent memory throughput. The MPC5200B does get higher throughput with
> > much less effort, and the two cores should be fairly identical (besides
> > the MPC5200B having less cache memory and some other details).
>
> I have personally optimized memcpy for power4/5/6 and they are all
> different. There are dozens of different PPC implementations from
> different manufacturers and design, every one is different! With painful
> negotiation I was able to get the --with-cpu= framework added to glibc
> but not all distro use it. You can thank me later ...

Well, thank you ;-)

> MPC5200B? never heard of it, don't care. I am busy with power7.

Ok, keep up your work with power7, it's great you care about that one ;-)

> So don't assume we are stupid because we have not dropped everything to
> optimize memcpy for YOUR processor and YOUR specific case.

Ho! I never, ever assumed that anyone (on this list) is stupid. I think you 
got me totally wrong (and _that_ may be my fault). I was asking for other 
users' experience. You make it appear as if I was complaining about your 
optimizations for Power4/5/6/970/Cell, but in fact, if you read correctly, I 
haven't even touched them... they are useless to me, since this is an e300 
core. My comparisons are all against vanilla glibc _without_ any optimized 
code... that is (most probably) simple loops with char copy, or at most 
32-bit word copies. What I want to know is why this processor (MPC5121e, not 
the MPC5200B) is so terribly inefficient at this without optimizations and if 
someone has done something about it before me (I am doing it right now). I 
have never stated that specifically _you_ did a bad job or something, so why 
are you reacting like that??
In fact, your framework for specific optimizations in glibc will most probably 
come in VERY handy, once I have sorted out the root of the problem with my 
specific case.... so thanks a lot for your valuable work... yes, I mean it.

> You care, your are a programmer? write code! If you care about the
> community then fit your optimization into the framework provided for CPU
> specific optimization and submit it so others can benefit.

I _am_ writing code, and Gunnar is helping me find an explanation for the 
bizarre behaviour of this particular chip. If the result is usable to 
others, I _will_ fit it into your framework for optimizations.

> > >[...]
> > > I don't think you're doing anything wrong exactly.  But it seems that
> > > your testcase sits there and just copies data with memcpy in varying
> > > sizes and amounts.  That's not exactly a real-world usecase is it?
> >
> > No, of course it's not. I made this program to test the performance
> > difference of different tweaks quickly. Once I found something that
> > worked, I started LD_PRELOADing it to different other programs (among
> > others the kdrive Xserver, mplayer, and x11perf) to see its impact on
> > performance of some real-life apps. There the difference in performance
> > is not so impressive of course, but it is still there (almost always
> > either noticeably in favor of the tweaked version of memcpy(), or with a
> > negligible or no difference).
>
> The trick is that the code built into glibc has to be optimal for the
> average case (4-256, average 12 bytes). Actually most memcpy
> implementations are a series of special cases for length and alignment.
> 
> You can always do better if you know exactly what processor you are on
> and what specific sizes and alignment your application uses.

Yes, I know that's a problem. Thanks for the information on the "average size"; I 
don't know where it comes from, but I'll take your word for it.

I am trying to be as polite and friendly as I can, so if you think I am not, 
please tell me where and when... I'll try to improve my social skills for the 
next time ;-)

Greetings,

-- 
David Jander

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Efficient memcpy()/memmove() for G2/G3 cores...
  2008-09-04 15:01                             ` Gunnar Von Boehn
@ 2008-09-04 16:32                               ` David Jander
  0 siblings, 0 replies; 27+ messages in thread
From: David Jander @ 2008-09-04 16:32 UTC (permalink / raw)
  To: Gunnar Von Boehn
  Cc: munroesj, John Rigby, linuxppc-dev, Paul Mackerras,
	prodyut hazarika

On Thursday 04 September 2008 17:01:21 Gunnar Von Boehn wrote:
>[...]
> Regarding the 5121.
> David, you did create a very special memcopy for the 5121e CPU.
> Your test showed us that the normal glibc memcopy is about 10 times
> slower than expected on the 5121.
>
> I really wonder why this is the case.
> I would have expected the 5121 to perform just like the 5200B.
> What we saw is that switching from READ to WRITE and back is very
> costly on 5121.
>
> There seems to be a huge difference between the 5200 and its successor the
> 5121. Is this performance difference caused by the CPU or by the board
> /memory?

I have some new insight now, and I will look more closely at the working of  
the DRAM controller... there has to be something wrong somewhere, and I am 
going to find it... whether it is some strange bug in my u-boot code 
(initializing the DRAM controller and prio-manager for example) or a 
silicon-errata (John?)

Thanks a lot for your help so far.

-- 
David Jander

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Efficient memcpy()/memmove() for G2/G3 cores...
  2008-09-04  2:04                     ` Paul Mackerras
  2008-09-04 12:05                       ` David Jander
@ 2008-09-04 18:14                       ` prodyut hazarika
  1 sibling, 0 replies; 27+ messages in thread
From: prodyut hazarika @ 2008-09-04 18:14 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, David Jander, John Rigby, munroesj

> I would be careful about adding overhead to memcpy.  I found that in
> the kernel, almost all calls to memcpy are for less than 128 bytes (1
> cache line on most 64-bit machines).  So, adding a lot of code to
> detect cacheability and do prefetching is just going to slow down the
> common case, which is short copies.  I don't have statistics for glibc
> but I wouldn't be surprised if most copies were short there also.
>

You are right. For small copies, it is not advisable.
The way I did it was to put a small check at the beginning of memcpy. If the copy
is less than 5 cache lines, I don't do dcbt/dcbz. Thus we see a big jump
for copies of more than 5 cache lines. The overhead is only 2 assembly instructions
(a compare on the number of bytes followed by a jump).
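
In C the dispatch is basically this (a sketch with made-up names; in the real
thing the large path is a dcbt/dcbz assembly loop like the one posted earlier
in this thread, and dcbz of course assumes the destination is cacheable, which
is exactly the open question below):

#include <stddef.h>
#include <string.h>

#define CACHE_LINE	32
#define CUTOFF		(5 * CACHE_LINE)	/* the "5 cache lines" threshold */

void *memcpy_with_cutoff(void *dst, const void *src, size_t len)
{
	unsigned int *d = dst;
	const unsigned int *s = src;

	/* short or misaligned copies take the plain path, no cache setup */
	if (len < CUTOFF ||
	    (((unsigned long)dst | (unsigned long)src) & (CACHE_LINE - 1)))
		return memcpy(dst, src, len);

	/* large, cache-line aligned copy: clear each DST line, then fill it */
	while (len >= CACHE_LINE) {
		__asm__ volatile ("dcbz 0,%0" : : "r" (d) : "memory");
		d[0] = s[0]; d[1] = s[1]; d[2] = s[2]; d[3] = s[3];
		d[4] = s[4]; d[5] = s[5]; d[6] = s[6]; d[7] = s[7];
		d += 8; s += 8; len -= CACHE_LINE;
	}
	memcpy(d, s, len);			/* remaining tail */
	return dst;
}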


One question - how can we quickly determine whether both the source and destination
address ranges fall in a cacheable range? The user can mmap a region of memory as
non-cacheable, but then call memcpy with that address.

The optimized version must quickly determine that dcbt/dcbz must not
be used in this case.
I don't know what would be a good way to achieve this.

Regards,
Prodyut Hazarika

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2008-09-04 18:14 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-08-25  9:31 Efficient memcpy()/memmove() for G2/G3 cores David Jander
2008-08-25 11:00 ` Matt Sealey
2008-08-25 13:06   ` David Jander
2008-08-25 22:28     ` Benjamin Herrenschmidt
2008-08-27 21:04       ` Steven Munroe
2008-08-29 11:48         ` David Jander
2008-08-29 12:21           ` Joakim Tjernlund
2008-09-01  7:23             ` David Jander
2008-09-01  9:36               ` Joakim Tjernlund
2008-09-02 13:12                 ` David Jander
2008-09-03  6:43                   ` Joakim Tjernlund
2008-09-03 20:33                   ` prodyut hazarika
2008-09-04  2:04                     ` Paul Mackerras
2008-09-04 12:05                       ` David Jander
2008-09-04 12:19                         ` Josh Boyer
2008-09-04 12:59                           ` David Jander
2008-09-04 14:31                             ` Steven Munroe
2008-09-04 14:45                               ` Gunnar Von Boehn
2008-09-04 15:14                               ` Gunnar Von Boehn
2008-09-04 16:25                               ` David Jander
2008-09-04 15:01                             ` Gunnar Von Boehn
2008-09-04 16:32                               ` David Jander
2008-09-04 18:14                       ` prodyut hazarika
2008-08-29 20:34           ` Steven Munroe
2008-09-01  8:29             ` David Jander
2008-08-31  8:28           ` Benjamin Herrenschmidt
2008-09-01  6:42             ` David Jander
