From: David Jander <david.jander@protonic.nl>
To: Matt Sealey <matt@genesi-usa.com>
Cc: linuxppc-dev@ozlabs.org
Subject: Re: Efficient memcpy()/memmove() for G2/G3 cores...
Date: Mon, 25 Aug 2008 15:06:33 +0200 [thread overview]
Message-ID: <200808251506.34450.david.jander@protonic.nl> (raw)
In-Reply-To: <48B290BA.7060202@genesi-usa.com>
[-- Attachment #1: Type: text/plain, Size: 2595 bytes --]
Hi Matt,
On Monday 25 August 2008 13:00:10 Matt Sealey wrote:
> The focus has definitely been on VMX but that's not to say lower power
> processors were forgotten :)
Lower power (pun intended) is coming on strong these days, as energy efficiency
is getting more important every day. And the MPC5121 is a brand-new embedded
processor that will most probably pop up in quite a lot of devices around
you ;-)
> Gunnar von Boehn did some benchmarking with an assembly optimized routine,
> for Cell, 603e and so on (basically the whole gamut from embedded up to
> server-class IBM chips) and got some pretty good results;
>
> http://www.powerdeveloper.org/forums/viewtopic.php?t=1426
>
> It is definitely something that needs fixing. The generic routine in glibc
> just copies words with no benefit of knowing the cache line size or any
> cache block buffers in the chip, and certainly no use of cache control or
> data streaming on higher end chips.
>
> With knowledge of the right way to unroll the loops, how many copies to
> do at once to try and get a burst, reducing cache usage etc. you can get
> very impressive performance (as you can see, 50 MB/s up to 78 MB/s at the
> smallest size; the basic improvement is 2x).
>
> I hope that helps you a little bit. Gunnar posted code to this list not
> long after. I have a copy of the "e300 optimized" routine but I thought it
> best that he post it here rather than I.
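The loop-unrolling idea described above can be sketched in portable C. This is a hypothetical helper (the name copy_by_line and the constants are mine, not Gunnar's code); it assumes word-aligned buffers and a 32-byte cache line, as on the e300, and loads a full line before storing any of it so the compiler can schedule the loads back-to-back:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch only: copy len bytes one 32-byte "cache line" at a time.
 * All eight words of a line are loaded before any are stored, which
 * gives the memory subsystem a chance to burst. Assumes src and dst
 * are word-aligned; the tail is copied bytewise. */
static void *copy_by_line(void *dst, const void *src, size_t len)
{
	uint32_t *d = dst;
	const uint32_t *s = src;

	while (len >= 32) {		/* one cache line per pass */
		uint32_t w0 = s[0], w1 = s[1], w2 = s[2], w3 = s[3];
		uint32_t w4 = s[4], w5 = s[5], w6 = s[6], w7 = s[7];
		d[0] = w0; d[1] = w1; d[2] = w2; d[3] = w3;
		d[4] = w4; d[5] = w5; d[6] = w6; d[7] = w7;
		s += 8;
		d += 8;
		len -= 32;
	}

	/* Byte tail for the remaining 0..31 bytes. */
	char *cd = (char *)d;
	const char *cs = (const char *)s;
	while (len--)
		*cd++ = *cs++;
	return dst;
}
```

On a real PowerPC implementation one would additionally consider dcbz on the destination line to avoid the read-for-ownership traffic, but that only pays off for cacheable memory.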
Ok, I think I found it in the thread. The only problem is that, AFAICS, it can
be done much better... at least on my platform (e300 core), and I don't know
why! Can you explain this?
I did this:
I took Gunnar's code (copy-pasted from the forum), renamed the function from
memcpy_e300 to memcpy, and put it in a file called "memcpy_e300.S". Then I
did:
$ gcc -O2 -Wall -shared -o libmemcpye300.so memcpy_e300.S
I measured the performance with the small program in the attachment (pruvmem.c):
$ gcc -O2 -Wall -o pruvmem pruvmem.c
$ LD_PRELOAD=..../libmemcpye300.so ./pruvmem
Data rate: 45.9 MiB/s
Then I did the same thing with my own memcpy written in C (see the attached
file mymemcpy.c):
$ LD_PRELOAD=..../libmymemcpy.so ./pruvmem
Data rate: 72.9 MiB/s
Now, can someone please explain this?
As a reference, here's glibc's performance:
$ ./pruvmem
Data rate: 14.8 MiB/s
> There is a lot of scope I think for optimizing several points (glibc,
> kernel, some applications) for embedded processors which nobody is
> really taking on. But, not many people want to do this kind of work..
They should! It makes a HUGE difference. I certainly will, of course.
Greetings,
--
David Jander
[-- Attachment #2: pruvmem.c --]
[-- Type: text/x-csrc, Size: 1629 bytes --]
#include <stdio.h>
#include <sys/mman.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
int main(void)
{
	int f;
	unsigned long int *mem, *src, *dst;
	int t;
	long int usecs;
	long int secs;
	unsigned long int count;
	double rate;
	struct timeval tv, tv0, tv1;

	printf("Opening fb0\n");
	f = open("/dev/fb0", O_RDWR);
	if (f < 0) {
		perror("opening fb0");
		return 1;
	}

	printf("mmapping fb0\n");
	mem = mmap(NULL, 0x00300000, PROT_READ | PROT_WRITE,
		   MAP_SHARED | MAP_LOCKED, f, 0);
	printf("mmap returned: %p\n", (void *)mem);
	perror("mmap");
	if (mem == MAP_FAILED)	/* compare against MAP_FAILED, not -1 */
		return 1;

	/* Fill the first 3 MiB with pseudo-random words. */
	gettimeofday(&tv, NULL);
	for (t = 0; t < 0x000c0000; t++)
		mem[t] = (tv.tv_usec ^ tv.tv_sec) ^ t;

	count = 0;
	gettimeofday(&tv0, NULL);
	for (t = 0; t < 10; t++) {
		src = mem;
		dst = mem + 0x00040000;	/* 1 MiB further (word pointer) */
		memcpy(dst, src, 0x00100000);
		count += 0x00100000;
	}
	gettimeofday(&tv1, NULL);

	secs = tv1.tv_sec - tv0.tv_sec;
	usecs = tv1.tv_usec - tv0.tv_usec;
	if (usecs < 0) {	/* borrow from the seconds column */
		usecs += 1000000;
		secs -= 1;
	}
	printf("Time elapsed: %ld secs, %ld usecs data transferred: %lu bytes\n",
	       secs, usecs, count);
	rate = (double)count / ((double)secs + (double)usecs / 1000000.0);
	printf("Data rate: %5.3g MiB/s\n", rate / (1024.0 * 1024.0));
	return 0;
}
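The elapsed-time and throughput arithmetic at the end of pruvmem.c can be factored into two small helpers. These are illustrative only (the names elapsed_secs and mib_per_sec are mine); they show the microsecond-borrow normalization and the MiB/s conversion in isolation:

```c
#include <sys/time.h>

/* Elapsed wall-clock time between two gettimeofday() samples, in
 * seconds. A negative microsecond difference borrows one second,
 * exactly as in pruvmem.c. */
static double elapsed_secs(struct timeval t0, struct timeval t1)
{
	long secs  = t1.tv_sec  - t0.tv_sec;
	long usecs = t1.tv_usec - t0.tv_usec;

	if (usecs < 0) {	/* borrow from the seconds column */
		usecs += 1000000;
		secs  -= 1;
	}
	return (double)secs + (double)usecs / 1e6;
}

/* Throughput in MiB/s for `bytes` transferred in `secs` seconds. */
static double mib_per_sec(unsigned long bytes, double secs)
{
	return ((double)bytes / (1024.0 * 1024.0)) / secs;
}
```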
[-- Attachment #3: mymemcpy.c --]
[-- Type: text/x-csrc, Size: 2289 bytes --]
#include <stdlib.h>

void *memcpy(void *dst, void const *src, size_t len)
{
	unsigned long int a, b, c, d;
	unsigned long int a1, b1, c1, d1;
	unsigned long int a2, b2, c2, d2;
	unsigned long int a3, b3, c3, d3;
	long *plDst = (long *)dst;
	long const *plSrc = (long const *)src;

	// if (!((unsigned long)src & 0xFFFFFFFC) && !((unsigned long)dst & 0xFFFFFFFC))
	// {
	while (len >= 64) {
		a  = plSrc[0];
		b  = plSrc[1];
		c  = plSrc[2];
		d  = plSrc[3];
		a1 = plSrc[4];
		b1 = plSrc[5];
		c1 = plSrc[6];
		d1 = plSrc[7];
		a2 = plSrc[8];
		b2 = plSrc[9];
		c2 = plSrc[10];
		d2 = plSrc[11];
		a3 = plSrc[12];
		b3 = plSrc[13];
		c3 = plSrc[14];
		d3 = plSrc[15];
		plSrc += 16;
		plDst[0]  = a;
		plDst[1]  = b;
		plDst[2]  = c;
		plDst[3]  = d;
		plDst[4]  = a1;
		plDst[5]  = b1;
		plDst[6]  = c1;
		plDst[7]  = d1;
		plDst[8]  = a2;
		plDst[9]  = b2;
		plDst[10] = c2;
		plDst[11] = d2;
		plDst[12] = a3;
		plDst[13] = b3;
		plDst[14] = c3;
		plDst[15] = d3;
		plDst += 16;
		len -= 64;
	}
	while (len >= 16) {
		a = plSrc[0];
		b = plSrc[1];
		c = plSrc[2];
		d = plSrc[3];
		plSrc += 4;
		plDst[0] = a;
		plDst[1] = b;
		plDst[2] = c;
		plDst[3] = d;
		plDst += 4;
		len -= 16;
	}
	// }
	char *pcDst = (char *)plDst;
	char const *pcSrc = (char const *)plSrc;
	while (len--)
		*pcDst++ = *pcSrc++;
	return dst;
}
Thread overview: 27+ messages
2008-08-25 9:31 Efficient memcpy()/memmove() for G2/G3 cores David Jander
2008-08-25 11:00 ` Matt Sealey
2008-08-25 13:06 ` David Jander [this message]
2008-08-25 22:28 ` Benjamin Herrenschmidt
2008-08-27 21:04 ` Steven Munroe
2008-08-29 11:48 ` David Jander
2008-08-29 12:21 ` Joakim Tjernlund
2008-09-01 7:23 ` David Jander
2008-09-01 9:36 ` Joakim Tjernlund
2008-09-02 13:12 ` David Jander
2008-09-03 6:43 ` Joakim Tjernlund
2008-09-03 20:33 ` prodyut hazarika
2008-09-04 2:04 ` Paul Mackerras
2008-09-04 12:05 ` David Jander
2008-09-04 12:19 ` Josh Boyer
2008-09-04 12:59 ` David Jander
2008-09-04 14:31 ` Steven Munroe
2008-09-04 14:45 ` Gunnar Von Boehn
2008-09-04 15:14 ` Gunnar Von Boehn
2008-09-04 16:25 ` David Jander
2008-09-04 15:01 ` Gunnar Von Boehn
2008-09-04 16:32 ` David Jander
2008-09-04 18:14 ` prodyut hazarika
2008-08-29 20:34 ` Steven Munroe
2008-09-01 8:29 ` David Jander
2008-08-31 8:28 ` Benjamin Herrenschmidt
2008-09-01 6:42 ` David Jander