linuxppc-dev.lists.ozlabs.org archive mirror
* MPC5200B memory performance
@ 2007-05-15 11:22 Daniel Schnell
  2007-05-16 18:03 ` Matthias Fechner
  0 siblings, 1 reply; 4+ messages in thread
From: Daniel Schnell @ 2007-05-15 11:22 UTC (permalink / raw)
  To: linuxppc-embedded

[-- Attachment #1: Type: text/plain, Size: 2432 bytes --]

Hi,


I am doing some memory performance measurements on our custom MPC5200B
board, which runs at 396 MHz internally and is connected to DDR RAM
clocked at 132 MHz.

With the attached program (compile with -lrt) I am testing memcpy()
throughput. In theory the raw memory throughput should be twice the
memcpy() throughput, since every byte is read once and written once
when source and destination buffers both live in DDR RAM.

So one could make the simple calculation:

132 MHz * 32 bit (data bus width) * 2 (DDR) ~ 1 GByte/sec gross memory
throughput.

For a memcpy this should then be ~500 MB/sec.

Of course in real-world scenarios we cannot reach the theoretical limit,
but I would guess we should get within about 30 % of it.


I get the following values on my board:

bash-2.05b# ./memcpy_perf
Test (10000) memcpy of sizes (1024) ....
10000 memcpy. Time per memcpy: 1567 [nsec] (653 MB/sec)
 finished.
Test (10000) memcpy of sizes (2048) ....
10000 memcpy. Time per memcpy: 2939 [nsec] (696 MB/sec)
 finished.
Test (10000) memcpy of sizes (4096) ....
10000 memcpy. Time per memcpy: 5706 [nsec] (717 MB/sec)
 finished.
Test (10000) memcpy of sizes (8192) ....
10000 memcpy. Time per memcpy: 17077 [nsec] (479 MB/sec)
 finished.
Test (10000) memcpy of sizes (16384) ....
10000 memcpy. Time per memcpy: 133314 [nsec] (122 MB/sec)
 finished.
Test (1000) memcpy of sizes (32768) ....
1000 memcpy. Time per memcpy: 243417 [nsec] (134 MB/sec)
 finished.
Test (1000) memcpy of sizes (51200) ....
1000 memcpy. Time per memcpy: 403455 [nsec] (126 MB/sec)
 finished.
Test (1000) memcpy of sizes (102400) ....
1000 memcpy. Time per memcpy: 713316 [nsec] (143 MB/sec)
 finished.
Test (100) memcpy of sizes (1048576) ....
100 memcpy. Time per memcpy: 7210570 [nsec] (145 MB/sec)
 finished.
Test (10) memcpy of sizes (10485760) ....
10 memcpy. Time per memcpy: 78162400 [nsec] (134 MB/sec)
 finished.
Test (5) memcpy of sizes (52428800) ....
5 memcpy. Time per memcpy: 425281800 [nsec] (123 MB/sec)
 finished.



The first four values fit in the data cache, so there we are really
testing cache performance. All the other values exercise the memory
controller interface.

All in all, I am not sure why memory access is so much slower than
I expected.
Which factors did I miss in my calculation? Can anybody run this
program on their 5200B-based board as a comparison?


Best regards,

Daniel Schnell.

[-- Attachment #2: memcpy_perf.c --]
[-- Type: application/octet-stream, Size: 1979 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>	/* memcpy, memset */
#include <time.h>	/* clock_gettime, struct timespec */

int osa_now_timespec(struct timespec* t)
{
	return clock_gettime (CLOCK_REALTIME, t);
}


int osa_timediff(const struct timespec* t1, const struct timespec* t2, struct timespec* diff)
{
	if (t1!=NULL && t2!=NULL && diff!=NULL)
	{
		unsigned long long a_nsec, b_nsec, diff_nsec; 

		// calculate difference time
		a_nsec = t1->tv_sec*1000000000ULL + t1->tv_nsec;
		b_nsec = t2->tv_sec*1000000000ULL + t2->tv_nsec;
		diff_nsec = b_nsec - a_nsec;
		diff->tv_sec =  diff_nsec/1000000000ULL;
		diff->tv_nsec = diff_nsec%1000000000ULL;
		return 0;
	}
	return -1;
}


unsigned long long osa_to_ns(const struct timespec* t)
{
	 return (t->tv_sec*1000000000ULL + t->tv_nsec);
}


int testMemcpy(unsigned long num, size_t msgsize)
{
	unsigned long i;
	unsigned long long nstime;
	struct timespec t1, t2, t3;

	printf("Test (%lu) memcpy of sizes (%zu) ....\n", num, msgsize);

	char *buf1 = malloc(msgsize);
	char *buf2 = malloc(msgsize);

	/* touch both buffers so page faults are not part of the measurement */
	memset(buf1, 0, msgsize);
	memset(buf2, 1, msgsize);

	// measure
	osa_now_timespec(&t1);
	for (i = 0; i < num; i++)
	{
		memcpy(buf1, buf2, msgsize);
	}
	// measure
	osa_now_timespec(&t2);
	osa_timediff(&t1, &t2, &t3);

	free(buf2);
	free(buf1);

	nstime = osa_to_ns(&t3) / (unsigned long long) num;
	printf("%lu memcpy. Time per memcpy: %llu [nsec] (%llu MB/sec)\n",
	       num, nstime,
	       (1000ULL * (unsigned long long) msgsize) / nstime);
	fflush(stdout);

	printf(" finished.\n");
	fflush(stdout);
	return 0;
}


int main()
{
    testMemcpy(10000,      1*1024);
    testMemcpy(10000,      2*1024);
    testMemcpy(10000,      4*1024);
    testMemcpy(10000,      8*1024);
    testMemcpy(10000,     16*1024);
    testMemcpy(1000 ,     32*1024);
    testMemcpy(1000 ,     50*1024);
    testMemcpy(1000 ,    100*1024);
    testMemcpy( 100 ,   1024*1024);
    testMemcpy(  10 ,10*1024*1024);
    testMemcpy(   5 ,50*1024*1024);

    return 0;
}

^ permalink raw reply	[flat|nested] 4+ messages in thread

* RE: MPC5200B memory performance
@ 2007-05-15 15:14 Fillod Stephane
  0 siblings, 0 replies; 4+ messages in thread
From: Fillod Stephane @ 2007-05-15 15:14 UTC (permalink / raw)
  To: Daniel Schnell, linuxppc-embedded

Daniel Schnell wrote:
>With the attached program (compile with -lrt) I am testing the memcpy()
>throughput. In theory the memory throughput should be the double of the
>memcpy() throughput if source and destination buffers are same size and
>inside the DDR-RAM.

Theory says that write speed is a little different from read speed,
but that only matters if you want to be picky. RTFD(*).
(*) Datasheets

>So one could make the simple calculation:
>
>132 MHz * 32 bit (data bus width) * 2 (DDR) ~ 1 GByte/sec gross memory
>throughput.
>
>For a memcpy this should then be ~500 MB/sec.

All you can say, assuming a 100% efficient CPU/cache/bus/DDR controller,
is that a memcpy (hitting the DRAM) cannot be faster than that
value :-)

>Of course in real-world scenarios we cannot reach the theoretical limit,
>but I would guess we should get within about 30 % of it.

IMO, real-world scenarios *should* achieve at least 70% with an
appropriate memcpy implementation. I've been disappointed lately by the
PQ3, which cannot do better than ~50% efficiency. I'd love anyone, esp.
from Freescale, to prove me wrong or show my mistake. The FAE didn't
give an answer, but I saw that newer parts will have a "queue manager"
helping the DDR controller. Any idea?

[...]
>The first four values fit in the data cache, so there we are really
>testing

What's your data cache size, BTW? Do you have an L2 cache?

>cache performance. All the other values exercise the memory
>controller interface.

Well, you're also testing part of the cache and memory subsystem.
On the read side, you pay extra for cache misses. On the write side,
there is read-on-write: I don't know the mpc5200 details, but most cache
subsystems think it's smart to fill up (read) the rest of a line you
began to write. In the big-memcpy case that read is useless, because
the cache didn't know you were about to overwrite the whole line.
That's why the dcbz ppc instruction comes in handy to prevent the R-O-W.

In that regard, glibc is very suboptimal. For better performance,
I recommend reading and understanding the cacheable_memcpy assembly
function in the Linux kernel (arch/ppc/lib/string.S). It's missing
some read prefetch (dcbt), though.
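To illustrate the idea (a hedged sketch in C, not the kernel's cacheable_memcpy; the 32-byte line size and the name dcbz_memcpy are my own assumptions): zero each whole, line-aligned destination cache line with dcbz before filling it, so the stores allocate the line in cache without the useless read from DRAM.

```c
#include <string.h>

#define CACHE_LINE 32  /* data cache line size on the MPC5200B's 603e core */

/* Copy that avoids read-on-write: dcbz zeroes (and thus allocates) each
 * whole, line-aligned destination cache line before it is overwritten.
 * On non-PowerPC builds it falls back to a plain memcpy. */
static void dcbz_memcpy(char *dst, const char *src, size_t msgsize)
{
#if defined(__powerpc__)
    if (((unsigned long) dst % CACHE_LINE) == 0) {
        while (msgsize >= CACHE_LINE) {
            __asm__ volatile ("dcbz 0,%0" : : "r" (dst) : "memory");
            memcpy(dst, src, CACHE_LINE);
            dst += CACHE_LINE;
            src += CACHE_LINE;
            msgsize -= CACHE_LINE;
        }
    }
    memcpy(dst, src, msgsize);  /* unaligned buffers and the tail */
#else
    memcpy(dst, src, msgsize);
#endif
}
```

A real implementation would do the per-line copy with registers rather than calling memcpy; this only shows where dcbz fits.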

>All in all, I am not sure why memory access is so much slower than
>I expected.
>Which factors did I miss in my calculation? Can anybody run this
>program on their 5200B-based board as a comparison?

The values on PQ3 won't be of any help to you, esp. with such a
disappointing result (50% efficiency max). If you doubt your memcpy
implementation, you may implement the same bench with DMA (to get
50 MiB of contiguous RAM, do it in the kernel or under U-Boot).

Best Regards,
--
Stephane

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: MPC5200B memory performance
  2007-05-15 11:22 MPC5200B memory performance Daniel Schnell
@ 2007-05-16 18:03 ` Matthias Fechner
  2007-05-18 10:27   ` Daniel Schnell
  0 siblings, 1 reply; 4+ messages in thread
From: Matthias Fechner @ 2007-05-16 18:03 UTC (permalink / raw)
  To: linuxppc-embedded

Hello Daniel,

* Daniel Schnell <daniel.schnell@marel.com> [15-05-07 11:22]:
> I get the following values on my board:

my result is:
Test (10000) memcpy of sizes (1024) ....
10000 memcpy. Time per memcpy: 1814 [nsec] (564 MB/sec)
 finished.
Test (10000) memcpy of sizes (2048) ....
10000 memcpy. Time per memcpy: 3433 [nsec] (596 MB/sec)
 finished.
Test (10000) memcpy of sizes (4096) ....
10000 memcpy. Time per memcpy: 6687 [nsec] (612 MB/sec)
 finished.
Test (10000) memcpy of sizes (8192) ....
10000 memcpy. Time per memcpy: 21454 [nsec] (381 MB/sec)
 finished.
Test (10000) memcpy of sizes (16384) ....
10000 memcpy. Time per memcpy: 205551 [nsec] (79 MB/sec)
 finished.
Test (1000) memcpy of sizes (32768) ....
1000 memcpy. Time per memcpy: 379875 [nsec] (86 MB/sec)
 finished.
Test (1000) memcpy of sizes (51200) ....
1000 memcpy. Time per memcpy: 588792 [nsec] (86 MB/sec)
 finished.
Test (1000) memcpy of sizes (102400) ....
1000 memcpy. Time per memcpy: 1126511 [nsec] (90 MB/sec)
 finished.
Test (100) memcpy of sizes (1048576) ....
100 memcpy. Time per memcpy: 11307890 [nsec] (92 MB/sec)
 finished.
Test (10) memcpy of sizes (10485760) ....
10 memcpy. Time per memcpy: 120783600 [nsec] (86 MB/sec)
 finished.
Test (5) memcpy of sizes (52428800) ....
5 memcpy. Time per memcpy: 673867800 [nsec] (77 MB/sec)
 finished.
	   


Best regards,
Matthias

-- 

"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs, and the universe trying to
produce bigger and better idiots. So far, the universe is winning." --
Rich Cook

^ permalink raw reply	[flat|nested] 4+ messages in thread

* RE: MPC5200B memory performance
  2007-05-16 18:03 ` Matthias Fechner
@ 2007-05-18 10:27   ` Daniel Schnell
  0 siblings, 0 replies; 4+ messages in thread
From: Daniel Schnell @ 2007-05-18 10:27 UTC (permalink / raw)
  To: linuxppc-embedded

Hi,

I found the following link to get more insight into the memcpy
performance of my board:

http://www.greyhound-data.com/gunnar/glibc/index.htm

One can download a test suite for memory performance and run it. Not
only memcpy, but also memcmp and memset are tested.

This gave quite interesting results. Most of the time I get better
values from the optimized functions than from the glibc functions. As a
rule of thumb: depending on the buffer size used, a 15-20 % speedup,
sometimes > 100% !

I will definitely integrate some of these functions into our software
and then use
-Wl,--wrap,memcpy on the linker line to replace all occurrences of
memcpy, memcmp, etc., also for external libraries.
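GNU ld's --wrap does the redirection at link time: every undefined reference to memcpy, including those in external libraries linked in as objects or archives, is resolved to __wrap_memcpy, while the original stays reachable as __real_memcpy. A minimal sketch (fast_memcpy is a hypothetical stand-in for one of the optimized routines from the test suite):

```c
#include <stddef.h>

/* Hypothetical stand-in for an optimized copy routine. */
static void *fast_memcpy(void *dst, const void *src, size_t n)
{
    char *d = dst;
    const char *s = src;
    while (n--)            /* simple byte loop as a placeholder */
        *d++ = *s++;
    return dst;
}

/* The original memcpy stays reachable under this name when the
 * objects are linked with -Wl,--wrap,memcpy (declared, not used here). */
void *__real_memcpy(void *dst, const void *src, size_t n);

/* With -Wl,--wrap,memcpy on the link line, the linker rewrites every
 * reference to memcpy into a reference to __wrap_memcpy. */
void *__wrap_memcpy(void *dst, const void *src, size_t n)
{
    return fast_memcpy(dst, src, n);
}
```

Something like `gcc main.o wrap_memcpy.o -Wl,--wrap,memcpy` would then route all memcpy calls in main.o through the wrapper.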


Best regards,

Daniel Schnell

--
Daniel Schnell                   | daniel.schnell@marel.com
Hugbúnaðargerð                   | www.marel.com


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2007-05-18 10:27 UTC | newest]

Thread overview: 4+ messages
2007-05-15 11:22 MPC5200B memory performance Daniel Schnell
2007-05-16 18:03 ` Matthias Fechner
2007-05-18 10:27   ` Daniel Schnell
  -- strict thread matches above, loose matches on Subject: below --
2007-05-15 15:14 Fillod Stephane
