* [Xenomai-help] memcpy performance on Xenomai
From: Daniel Schnell @ 2007-05-15 11:38 UTC
  To: xenomai

Hi,

I am testing the memcpy() performance of Xenomai on my board in
comparison to the memcpy() performance of native Linux, and I see
significant differences.

Attached is a program that compiles on native Linux simply by linking
with -lrt.
It gives me the following output:

=======
bash-2.05b# ./memcpy_perf
Test (10000) memcpy of sizes (1024) ....
10000 memcpy. Time per memcpy: 1567 [nsec] (653 MB/sec)
 finished.
Test (10000) memcpy of sizes (2048) ....
10000 memcpy. Time per memcpy: 2939 [nsec] (696 MB/sec)
 finished.
Test (10000) memcpy of sizes (4096) ....
10000 memcpy. Time per memcpy: 5706 [nsec] (717 MB/sec)
 finished.
Test (10000) memcpy of sizes (8192) ....
10000 memcpy. Time per memcpy: 17077 [nsec] (479 MB/sec)
 finished.
Test (10000) memcpy of sizes (16384) ....
10000 memcpy. Time per memcpy: 133314 [nsec] (122 MB/sec)
 finished.
Test (1000) memcpy of sizes (32768) ....
1000 memcpy. Time per memcpy: 243417 [nsec] (134 MB/sec)
 finished.
Test (1000) memcpy of sizes (51200) ....
1000 memcpy. Time per memcpy: 403455 [nsec] (126 MB/sec)
 finished.
Test (1000) memcpy of sizes (102400) ....
1000 memcpy. Time per memcpy: 713316 [nsec] (143 MB/sec)
 finished.
Test (100) memcpy of sizes (1048576) ....
100 memcpy. Time per memcpy: 7210570 [nsec] (145 MB/sec)
 finished.
Test (10) memcpy of sizes (10485760) ....
10 memcpy. Time per memcpy: 78162400 [nsec] (134 MB/sec)
 finished.
Test (5) memcpy of sizes (52428800) ....
5 memcpy. Time per memcpy: 425281800 [nsec] (123 MB/sec)
 finished.

======

Spawning the function testMemcpy() as a POSIX thread inside another
program yields the following results:

bash-2.05b# bin/testspecs
Test (10000) memcpy of sizes (1024) ....
10000 memcpy. Time per memcpy: 1566 [nsec] (653 MB/sec)
 finished.
Test (10000) memcpy of sizes (2048) ....
10000 memcpy. Time per memcpy: 2943 [nsec] (695 MB/sec)
 finished.
Test (10000) memcpy of sizes (4096) ....
10000 memcpy. Time per memcpy: 5696 [nsec] (719 MB/sec)
 finished.
Test (10000) memcpy of sizes (8192) ....
10000 memcpy. Time per memcpy: 17325 [nsec] (472 MB/sec)
 finished.
Test (10000) memcpy of sizes (16384) ....
10000 memcpy. Time per memcpy: 200892 [nsec] (81 MB/sec)
 finished.
Test (1000) memcpy of sizes (32768) ....
1000 memcpy. Time per memcpy: 400213 [nsec] (81 MB/sec)
 finished.
Test (1000) memcpy of sizes (51200) ....
1000 memcpy. Time per memcpy: 555240 [nsec] (92 MB/sec)
 finished.
Test (1000) memcpy of sizes (102400) ....
1000 memcpy. Time per memcpy: 1253123 [nsec] (81 MB/sec)
 finished.
Test (100) memcpy of sizes (1048576) ....
100 memcpy. Time per memcpy: 12413170 [nsec] (84 MB/sec)
 finished.
Test (10) memcpy of sizes (10485760) ....
10 memcpy. Time per memcpy: 124039572 [nsec] (84 MB/sec)
 finished.
Test (5) memcpy of sizes (52428800) ....
5 memcpy. Time per memcpy: 596899212 [nsec] (87 MB/sec)
 finished.

As long as the memcpy stays within the cache, the results are
identical. As soon as the real DDR memory is used, performance drops
sharply: the time per memcpy rises by roughly 40 to 75% (throughput
falls from 122-145 MB/sec to 81-92 MB/sec).

I assume that, because of the different linked-in time functions
(clock_gettime()), I am somehow measuring differently. But at the moment
I have no clue whether performance is really being lost and, if so,
where.

Could anybody please try to reproduce this behaviour on their board?


Best regards,

Daniel Schnell.

[-- Attachment #2: memcpy_perf.c --]
[-- Type: application/octet-stream, Size: 1979 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>     /* memcpy() */
#include <time.h>       /* clock_gettime(), struct timespec */
#include <sys/time.h>
#include <pthread.h>

int osa_now_timespec(struct timespec* t)
{
	return clock_gettime (CLOCK_REALTIME, t);
}


int osa_timediff(const struct timespec* t1, const struct timespec* t2, struct timespec* diff)
{
	if (t1!=NULL && t2!=NULL && diff!=NULL)
	{
		unsigned long long a_nsec, b_nsec, diff_nsec; 

		// calculate difference time
		a_nsec = t1->tv_sec*1000000000ULL + t1->tv_nsec;
		b_nsec = t2->tv_sec*1000000000ULL + t2->tv_nsec;
		diff_nsec = b_nsec - a_nsec;
		diff->tv_sec =  diff_nsec/1000000000ULL;
		diff->tv_nsec = diff_nsec%1000000000ULL;
		return 0;
	}
	return -1;
}


unsigned long long osa_to_ns(const struct timespec* t)
{
	 return (t->tv_sec*1000000000ULL + t->tv_nsec);
}


int testMemcpy(unsigned long num, size_t msgsize)
{
    unsigned long i;
    unsigned long long nstime;
    struct timespec t1, t2, t3;
    char *buf1, *buf2;

    printf("Test (%lu) memcpy of sizes (%lu) ....\n",
           num, (unsigned long) msgsize);

    buf1 = malloc(msgsize);
    buf2 = malloc(msgsize);
    if (buf1 == NULL || buf2 == NULL)
    {
        free(buf2);
        free(buf1);
        return -1;
    }

    // start of measurement
    osa_now_timespec(&t1);
    for (i = 0; i < num; i++)
    {
        memcpy(buf1, buf2, msgsize);
    }
    // end of measurement
    osa_now_timespec(&t2);
    osa_timediff(&t1, &t2, &t3);

    free(buf2);
    free(buf1);

    // average time per copy; bytes per nsec times 1000 gives MB/sec
    nstime = osa_to_ns(&t3) / (unsigned long long) i;
    printf("%lu memcpy. Time per memcpy: %llu [nsec] (%llu MB/sec)\n",
           i, nstime,
           (1000ULL * (unsigned long long) msgsize) / nstime);
    fflush(stdout);

    printf(" finished.\n");
    fflush(stdout);
    return 0;
}


int main()
{
    testMemcpy(10000,      1*1024);
    testMemcpy(10000,      2*1024);
    testMemcpy(10000,      4*1024);
    testMemcpy(10000,      8*1024);
    testMemcpy(10000,     16*1024);
    testMemcpy(1000 ,     32*1024);
    testMemcpy(1000 ,     50*1024);
    testMemcpy(1000 ,    100*1024);
    testMemcpy( 100 ,   1024*1024);
    testMemcpy(  10 ,10*1024*1024);
    testMemcpy(   5 ,50*1024*1024);

    return 0;
}

* Re: [Xenomai-help] memcpy performance on Xenomai
From: Fillod Stephane @ 2007-05-15 15:59 UTC
  To: Daniel Schnell; +Cc: xenomai

Daniel Schnell wrote:
>1.) The culprit is mlockall(MCL_FUTURE|MCL_CURRENT);
>
>As soon I leave this away, I get much better results:

That's puzzling. Can it be explained?
Have you tried a memset of buf1 and buf2 before the first
osa_now_timespec() call?

>2.) rt_timer_tsc
>If I use clock_gettime() this needs 3100 ns,
>If I use rt_timer_tsc() this needs 74 (!) ns.

Have you tried clock_gettime() with one of the *_HR clocks, such as
CLOCK_MONOTONIC_HR? You need a recent glibc to benefit from that.
One of those clocks relies on the timebase (tsc) instead of issuing
an expensive syscall.

Regards,
-- 
Stephane

