[Xenomai-help] memcpy performance on Xenomai


* [Xenomai-help] memcpy performance on Xenomai
@ 2007-05-15 11:38 Daniel Schnell
  2007-05-15 12:16 ` Gilles Chanteperdrix
  0 siblings, 1 reply; 15+ messages in thread
From: Daniel Schnell @ 2007-05-15 11:38 UTC (permalink / raw)
  To: xenomai

[-- Attachment #1: Type: text/plain, Size: 3387 bytes --]

Hi,
 
 
I am testing the memcpy() performance of Xenomai on my board in
comparison to the memcpy() performance of native Linux, and I get
significant differences.

Attached is a program which compiles on native Linux simply by linking
with -lrt.
It gives me the following output:

=======
bash-2.05b# ./memcpy_perf
Test (10000) memcpy of sizes (1024) ....
10000 memcpy. Time per memcpy: 1567 [nsec] (653 MB/sec)
 finished.
Test (10000) memcpy of sizes (2048) ....
10000 memcpy. Time per memcpy: 2939 [nsec] (696 MB/sec)
 finished.
Test (10000) memcpy of sizes (4096) ....
10000 memcpy. Time per memcpy: 5706 [nsec] (717 MB/sec)
 finished.
Test (10000) memcpy of sizes (8192) ....
10000 memcpy. Time per memcpy: 17077 [nsec] (479 MB/sec)
 finished.
Test (10000) memcpy of sizes (16384) ....
10000 memcpy. Time per memcpy: 133314 [nsec] (122 MB/sec)
 finished.
Test (1000) memcpy of sizes (32768) ....
1000 memcpy. Time per memcpy: 243417 [nsec] (134 MB/sec)
 finished.
Test (1000) memcpy of sizes (51200) ....
1000 memcpy. Time per memcpy: 403455 [nsec] (126 MB/sec)
 finished.
Test (1000) memcpy of sizes (102400) ....
1000 memcpy. Time per memcpy: 713316 [nsec] (143 MB/sec)
 finished.
Test (100) memcpy of sizes (1048576) ....
100 memcpy. Time per memcpy: 7210570 [nsec] (145 MB/sec)
 finished.
Test (10) memcpy of sizes (10485760) ....
10 memcpy. Time per memcpy: 78162400 [nsec] (134 MB/sec)
 finished.
Test (5) memcpy of sizes (52428800) ....
5 memcpy. Time per memcpy: 425281800 [nsec] (123 MB/sec)
 finished.

======

Spawning the function testMemcpy() as a POSIX thread inside another
program
yields the following results:

bash-2.05b# bin/testspecs
Test (10000) memcpy of sizes (1024) ....
10000 memcpy. Time per memcpy: 1566 [nsec] (653 MB/sec)
 finished.
Test (10000) memcpy of sizes (2048) ....
10000 memcpy. Time per memcpy: 2943 [nsec] (695 MB/sec)
 finished.
Test (10000) memcpy of sizes (4096) ....
10000 memcpy. Time per memcpy: 5696 [nsec] (719 MB/sec)
 finished.
Test (10000) memcpy of sizes (8192) ....
10000 memcpy. Time per memcpy: 17325 [nsec] (472 MB/sec)
 finished.
Test (10000) memcpy of sizes (16384) ....
10000 memcpy. Time per memcpy: 200892 [nsec] (81 MB/sec)
 finished.
Test (1000) memcpy of sizes (32768) ....
1000 memcpy. Time per memcpy: 400213 [nsec] (81 MB/sec)
 finished.
Test (1000) memcpy of sizes (51200) ....
1000 memcpy. Time per memcpy: 555240 [nsec] (92 MB/sec)
 finished.
Test (1000) memcpy of sizes (102400) ....
1000 memcpy. Time per memcpy: 1253123 [nsec] (81 MB/sec)
 finished.
Test (100) memcpy of sizes (1048576) ....
100 memcpy. Time per memcpy: 12413170 [nsec] (84 MB/sec)
 finished.
Test (10) memcpy of sizes (10485760) ....
10 memcpy. Time per memcpy: 124039572 [nsec] (84 MB/sec)
 finished.
Test (5) memcpy of sizes (52428800) ....
5 memcpy. Time per memcpy: 596899212 [nsec] (87 MB/sec)
 finished.

As long as the memcpy works within the cache, the results are
identical. As soon as the real DDR memory is used, throughput drops to
about 66% of the native figure (e.g. 81 vs. 122 MB/sec at 16 KiB)!

I am assuming that, because of different linked-in time functions
(clock_gettime()), I am somehow measuring differently. But I am clueless
at the moment as to where, and whether, the performance is being eaten up.

Could anybody please try to reproduce this behaviour on their board?


Best regards,

Daniel Schnell.

[-- Attachment #2: memcpy_perf.c --]
[-- Type: application/octet-stream, Size: 1979 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>	/* memcpy() */
#include <time.h>	/* clock_gettime(), struct timespec */
#include <sys/time.h>
#include <pthread.h>

int osa_now_timespec(struct timespec* t)
{
	return clock_gettime (CLOCK_REALTIME, t);
}


int osa_timediff(const struct timespec* t1, const struct timespec* t2, struct timespec* diff)
{
	if (t1!=NULL && t2!=NULL && diff!=NULL)
	{
		unsigned long long a_nsec, b_nsec, diff_nsec; 

		// calculate difference time
		a_nsec = t1->tv_sec*1000000000ULL + t1->tv_nsec;
		b_nsec = t2->tv_sec*1000000000ULL + t2->tv_nsec;
		diff_nsec = b_nsec - a_nsec;
		diff->tv_sec =  diff_nsec/1000000000ULL;
		diff->tv_nsec = diff_nsec%1000000000ULL;
		return 0;
	}
	return -1;
}


unsigned long long osa_to_ns(const struct timespec* t)
{
	 return (t->tv_sec*1000000000ULL + t->tv_nsec);
}


int testMemcpy(unsigned long num, size_t msgsize)
{
	/* %lu and %zu match the unsigned long and size_t arguments */
	printf("Test (%lu) memcpy of sizes (%zu) ....\n",
            num, msgsize);
    unsigned long i;
    unsigned long long nstime;
    struct timespec t1, t2, t3;

    char *buf1=malloc (msgsize);
    char *buf2=malloc (msgsize);

    // measure
	osa_now_timespec(&t1);
    for (i=0; i<num; i++)
        {
        memcpy (buf1, buf2, msgsize);
        }
    // measure
	osa_now_timespec(&t2);
	osa_timediff(&t1, &t2, &t3);
    
    free (buf2);
    free (buf1);
    
    nstime = osa_to_ns(&t3)/(unsigned long long) i;
	printf("%lu memcpy. Time per memcpy: %llu [nsec] (%llu MB/sec)\n",
            i, nstime,
            (1000ULL * (unsigned long long) msgsize)/(nstime)) ;
    fflush (stdout);


	printf(" finished.\n");
    fflush (stdout);
    return 0;	/* the function is declared to return int */
}


int main()
{
    testMemcpy(10000,      1*1024);
    testMemcpy(10000,      2*1024);
    testMemcpy(10000,      4*1024);
    testMemcpy(10000,      8*1024);
    testMemcpy(10000,     16*1024);
    testMemcpy(1000 ,     32*1024);
    testMemcpy(1000 ,     50*1024);
    testMemcpy(1000 ,    100*1024);
    testMemcpy( 100 ,   1024*1024);
    testMemcpy(  10 ,10*1024*1024);
    testMemcpy(   5 ,50*1024*1024);

    return 0;
}


* Re: [Xenomai-help] memcpy performance on Xenomai
  2007-05-15 11:38 [Xenomai-help] memcpy performance on Xenomai Daniel Schnell
@ 2007-05-15 12:16 ` Gilles Chanteperdrix
  2007-05-15 14:40   ` Daniel Schnell
  0 siblings, 1 reply; 15+ messages in thread
From: Gilles Chanteperdrix @ 2007-05-15 12:16 UTC (permalink / raw)
  To: Daniel Schnell; +Cc: xenomai

Daniel Schnell wrote:
> I am assuming because of different linked-in time functions
> (clock_gettime())) I am measuring somehow differently. But I am clueless
> at the moment where and if the performance is eaten up.

Reducing clock_gettime overhead by reading the tsc directly is my very
next task. If you want to check whether the effect you measure is the result
of clock_gettime overhead, you can measure the duration of memcpy with
the native API service rt_timer_tsc, and convert the tsc difference with
rt_timer_tsc2ns.
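
A rough way to see how much of a measurement is timer overhead is to
time the clock call itself. The sketch below is not from the thread: it
uses only the standard POSIX clock_gettime() rather than the Xenomai
rt_timer_tsc()/rt_timer_tsc2ns() services mentioned above, and the
helper names ts_to_ns() and clock_gettime_cost_ns() are made up for
illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stdint.h>
#include <time.h>

/* Convert a timespec to a single nanosecond count. */
static uint64_t ts_to_ns(const struct timespec *t)
{
    return (uint64_t)t->tv_sec * 1000000000ULL + (uint64_t)t->tv_nsec;
}

/*
 * Estimate the average cost of one clock_gettime() call, in ns, by
 * issuing `calls` back-to-back calls and dividing the elapsed time.
 * Returns 0 if `calls` is 0 or the clock cannot be read.
 * (Hypothetical helper; pass a monotonic clock so the difference
 * cannot go negative.)
 */
static uint64_t clock_gettime_cost_ns(clockid_t clk, unsigned long calls)
{
    struct timespec start, end, scratch;
    unsigned long i;

    if (calls == 0 || clock_gettime(clk, &start) != 0)
        return 0;
    for (i = 0; i < calls; i++)
        clock_gettime(clk, &scratch);
    if (clock_gettime(clk, &end) != 0)
        return 0;
    return (ts_to_ns(&end) - ts_to_ns(&start)) / calls;
}
```

Calling clock_gettime_cost_ns(CLOCK_MONOTONIC, 100000) gives the average
cost of one call in nanoseconds. If that figure is small compared with
the time per memcpy, the timer is unlikely to explain the difference.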

-- 
                                                 Gilles Chanteperdrix



* Re: [Xenomai-help] memcpy performance on Xenomai
  2007-05-15 12:16 ` Gilles Chanteperdrix
@ 2007-05-15 14:40   ` Daniel Schnell
  2007-05-15 14:50     ` Gilles Chanteperdrix
  2007-05-15 15:18     ` Philippe Gerum
  0 siblings, 2 replies; 15+ messages in thread
From: Daniel Schnell @ 2007-05-15 14:40 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai

This was not the culprit. Same results.

Does Xenomai replace the memcpy() call with its own implementation? (I don't think so.)

What about thrashing of cache lines through context switches? But then, if we run it on Linux alone, we should also have thrashed cache lines. There should not be any difference.
Does the presence of a Xenomai POSIX thread perhaps cause a lot of context switches, even if only a memcpy is executed inside the thread? Shouldn't Xenomai threads run totally uninterrupted if they have the highest priority?

Could somebody please actually run this test on their hardware and see whether these differences between the Xenomai POSIX skin and native Linux happen there as well?

Best regards,

Daniel Schnell


* Re: [Xenomai-help] memcpy performance on Xenomai
  2007-05-15 14:40   ` Daniel Schnell
@ 2007-05-15 14:50     ` Gilles Chanteperdrix
  2007-05-15 15:28       ` Daniel Schnell
  2007-05-15 15:18     ` Philippe Gerum
  1 sibling, 1 reply; 15+ messages in thread
From: Gilles Chanteperdrix @ 2007-05-15 14:50 UTC (permalink / raw)
  To: Daniel Schnell; +Cc: xenomai

Daniel Schnell wrote:
> > -----Original Message-----
> > From: Gilles Chanteperdrix [mailto:gilles.chanteperdrix@xenomai.org
> > Sent: 15. maí 2007 12:16
> > To: Daniel Schnell
> > Cc: xenomai@xenomai.org
> > Subject: Re: [Xenomai-help] memcpy performance on Xenomai
> >
> >
> > Improving clock_gettime overhead by reading directly the tsc is my very next task. If you want to check if the effect you measure is the result of clock_gettime overhead, you can measure the duration of memcpy with the native api service rt_timer_tsc, and convert the tsc difference with rt_timer_tsc2ns.
> 
> This was not the culprit. Same results.

Does your processor have a tsc? If so, did you compile Xenomai with
--enable-x86-tsc? What happens if you disable interrupts?

> 
> Does Xenomai replace the memcpy() call with an own implementation ? (I don't think so.)
> 
> What about trashing of cash lines through context switches ? But then if we run it on Linux alone we should also have trashed cache lines. There should not be any difference.
> Is maybe the presence of a Xenomai POSIX thread cause a lot of ctx switches, even if only a memcpy is executed inside the thread ? Shouldn't Xenomai threads run totally uninterrupted if they have the highest prio ?
> 
> Please could somebody actually run this test on his hardware and see if these differences between Xenomai POSIX skin and Linux native are happening there as well ?

If you want us to test the code, please send it; I mean the version
adapted to the native skin.

-- 
                                                 Gilles Chanteperdrix



* Re: [Xenomai-help] memcpy performance on Xenomai
  2007-05-15 14:40   ` Daniel Schnell
  2007-05-15 14:50     ` Gilles Chanteperdrix
@ 2007-05-15 15:18     ` Philippe Gerum
  1 sibling, 0 replies; 15+ messages in thread
From: Philippe Gerum @ 2007-05-15 15:18 UTC (permalink / raw)
  To: Daniel Schnell; +Cc: xenomai

On Tue, 2007-05-15 at 14:40 +0000, Daniel Schnell wrote:
> This was not the culprit. Same results.
> 
> Does Xenomai replace the memcpy() call with an own implementation ? (I don't think so.)

No.

> 
> What about trashing of cash lines through context switches ? 

Interrupts also participate in cache trashing.

> But then if we run it on Linux alone we should also have trashed cache lines. There should not be any difference.

It depends. You are running a 2.4.25/ppc kernel IIRC, which means that
your system endures far fewer preemptions on a vanilla kernel (100 Hz
timer, no kernel preemption). Depending on the Xenomai timer frequency and
the number of RT thread switches in your app, your cache may be under
permanent pressure.

> Is maybe the presence of a Xenomai POSIX thread cause a lot of ctx switches,
> even if only a memcpy is executed inside the thread ? Shouldn't Xenomai threads
>  run totally uninterrupted if they have the highest prio ?

I don't get what you mean actually. If your thread needs no switching,
then Xenomai does no switches, period. However, if your RT thread is
continuously moving from primary to secondary mode and back for
instance, then switches would occur at a high rate;
see /proc/xenomai/stats to check this.

2.4/ppc kernels could possibly cause secondary mode switches to Xenomai
threads, due to on-demand mapping and COW management issues when copying
data, especially to/from large buffers. So, memcpy in primary mode ->
page_fault -> mode_transition -> internal context_switch -> back to
memcpy in secondary mode for the same thread.

High prio threads can also be preempted by interrupts.

> 
> Please could somebody actually run this test on his hardware and see if these differences between Xenomai POSIX skin and Linux native are happening there as well ?
> 

FWIW, you have all the needed tools to check this yourself.

First, sampling /proc/xenomai/stats would tell you the average number of
ctx switches, and the number of mode transitions, on a per-thread
basis. 
Then, you could move to a 2.6.x kernel for the purpose of testing,
without having to change anything else runtime-wise; this would let you
enable the latency tracer facility (Kernel hacking -> I-pipe debugging). A
simple log showing how/by whom a given user-space memcpy has been
preempted would definitely shed some light on this issue.

> 
> Best regards,
> 
> Daniel Schnell
> 
> 
> -----Original Message-----
> From: Gilles Chanteperdrix [mailto:gilles.chanteperdrix@xenomai.org] 
> Sent: 15. maí 2007 12:16
> To: Daniel Schnell
> Cc: xenomai@xenomai.org
> Subject: Re: [Xenomai-help] memcpy performance on Xenomai
> 
> 
> Improving clock_gettime overhead by reading directly the tsc is my very next task. If you want to check if the effect you measure is the result of clock_gettime overhead, you can measure the duration of memcpy with the native api service rt_timer_tsc, and convert the tsc difference with rt_timer_tsc2ns.
> 
-- 
Philippe.


* Re: [Xenomai-help] memcpy performance on Xenomai
  2007-05-15 14:50     ` Gilles Chanteperdrix
@ 2007-05-15 15:28       ` Daniel Schnell
  2007-05-15 15:41         ` Gilles Chanteperdrix
  2007-05-15 17:54         ` Eric Noulard
  0 siblings, 2 replies; 15+ messages in thread
From: Daniel Schnell @ 2007-05-15 15:28 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai

Hi,

Some interesting insights about my last tests.

1.) The culprit is mlockall(MCL_FUTURE|MCL_CURRENT);

As soon as I leave it out, I get much better results:

Without mlockall():
Test (10) memcpy of sizes (10485760)
10 memcpy. Time per memcpy: 78147209 [nsec] (134 MB/sec)
 finished.

With mlockall():
Test (10) memcpy of sizes (10485760) ....
10 memcpy. Time per memcpy: 124194618 [nsec] (84 MB/sec)
 finished.

Then again I cannot use Xenomai without mlockall() 
:(

2.) rt_timer_tsc
A clock_gettime() call takes 3100 ns,
an rt_timer_tsc() call takes 74 (!) ns.


Oooh, we are using clock_gettime() a lot, so this makes a big difference for us.

Best regards,

-- 
Daniel Schnell                   | daniel.schnell@domain.hid
Hugbúnaðargerð                   | www.marel.com


* Re: [Xenomai-help] memcpy performance on Xenomai
  2007-05-15 15:28       ` Daniel Schnell
@ 2007-05-15 15:41         ` Gilles Chanteperdrix
  2007-05-15 17:54         ` Eric Noulard
  1 sibling, 0 replies; 15+ messages in thread
From: Gilles Chanteperdrix @ 2007-05-15 15:41 UTC (permalink / raw)
  To: Daniel Schnell; +Cc: xenomai

Daniel Schnell wrote:
> 2.) rt_timer_tsc
> If I use clock_gettime() this needs 3100 ns,
> If I use rt_timer_tsc() this needs 74 (!) ns.
> 
> 
> Oooh, we are using clock_gettime() a lot, so this makes a big difference for us.

Again, this issue should be addressed real soon.


-- 
                                                 Gilles Chanteperdrix



* Re: [Xenomai-help] memcpy performance on Xenomai
@ 2007-05-15 15:59 Fillod Stephane
  2007-05-15 16:59 ` Daniel Schnell
  0 siblings, 1 reply; 15+ messages in thread
From: Fillod Stephane @ 2007-05-15 15:59 UTC (permalink / raw)
  To: Daniel Schnell; +Cc: xenomai

Daniel Schnell wrote:
>1.) The culprit is mlockall(MCL_FUTURE|MCL_CURRENT);
>
>As soon I leave this away, I get much better results:

That's puzzling. Can it be explained?
Have you tried a memset of buf1&buf2 before the first osa_now_timespec?

>2.) rt_timer_tsc
>If I use clock_gettime() this needs 3100 ns,
>If I use rt_timer_tsc() this needs 74 (!) ns.

Have you tried clock_gettime with an *_HR clock such as CLOCK_MONOTONIC_HR?
You need a recent glibc to benefit from that.
One of those clocks relies on the timebase (tsc) instead of issuing
an expensive syscall.

Regards,
-- 
Stephane


* Re: [Xenomai-help] memcpy performance on Xenomai
  2007-05-15 15:59 Fillod Stephane
@ 2007-05-15 16:59 ` Daniel Schnell
  2007-05-15 18:03   ` Gilles Chanteperdrix
  0 siblings, 1 reply; 15+ messages in thread
From: Daniel Schnell @ 2007-05-15 16:59 UTC (permalink / raw)
  To: Fillod Stephane; +Cc: xenomai

Hi,

To 1.)
Doing memset() on the buffers makes a difference... Now the performance is as bad as with mlockall() :(
Which means that the Linux VM, without mlockall, somehow optimizes for uninitialized heap buffers.
So there is actually nothing to blame on Xenomai here.

The reason I was looking at the native memcpy performance is that I found Xenomai POSIX message queues somewhat slow:
With two tasks, one sending a 32 KB message and the other receiving it, I have an average cycle time of more than 1 ms.
Looking at the memcpy performance of my system tells me something:
memcpy(32 KB) -> 400 us
Two msg_send(32 KB) calls cannot take less than 2 x 400 us + 2 context switches, and probably even more, because we first have to write to kernel memory, then comes the context switch, then the copy from kernel memory to the user buffer, then another context switch. This simply means I cannot use message queues for anything bigger than 1 KB, so I have to use shared memory for that.
As you proposed in another mail, it may make a difference to use our own memcpy implementation for ppc which does no read on write.

To 2.)
I don't have a recent glibc, so that is not an option. However, I can wait for the new clock_gettime() of Xenomai and use the rt_timer_tsc function in the meantime.

Best regards,

Daniel Schnell.


* Re: [Xenomai-help] memcpy performance on Xenomai
  2007-05-15 15:28       ` Daniel Schnell
  2007-05-15 15:41         ` Gilles Chanteperdrix
@ 2007-05-15 17:54         ` Eric Noulard
  2007-05-16  6:36           ` M. Koehrer
  1 sibling, 1 reply; 15+ messages in thread
From: Eric Noulard @ 2007-05-15 17:54 UTC (permalink / raw)
  To: Daniel Schnell; +Cc: xenomai

[-- Attachment #1: Type: text/plain, Size: 2202 bytes --]

2007/5/15, Daniel Schnell <daniel.schnell@domain.hid>:
> Hi,
>
> Some interesting insights about my last tests.
>
> 1.) The culprit is mlockall(MCL_FUTURE|MCL_CURRENT);
>
> As soon I leave this away, I get much better results:
>
> Without mlockall():
> Test (10) memcpy of sizes (10485760)
> 10 memcpy. Time per memcpy: 78147209 [nsec] (134 MB/sec)
>  finished.
>
> With mlockall():
> Test (10) memcpy of sizes (10485760) ....
> 10 memcpy. Time per memcpy: 124194618 [nsec] (84 MB/sec)
>  finished.


I think you are not measuring the same thing in both cases.
I did some tests on 2.6.20 (precompiled Debian etch kernel)
on a 1.6 GHz Pentium M.

I think the fact that you malloc your buffers and then
immediately memcpy them makes the measurement non-repeatable
(at least on my side),
depending on something I do not understand.

Could you try my modified version of your code which
adds:

memset(buf1,'\0',msgsize);
memset(buf2,'\0',msgsize);

just after malloc (you may try calloc too).

With this modification
I get figures similar to the mlockall version on my (quasi-)vanilla kernel.

That is:

./memcpy_perf_mlockall
Test (10) memcpy of sizes (10485760) ....
10 memcpy. Time per memcpy: 35716568 [nsec] (293 MB/sec)
 finished.

./memcpy_perf_memset
Test (10) memcpy of sizes (10485760) ....
10 memcpy. Time per memcpy: 36004454 [nsec] (291 MB/sec)
 finished.

./memcpy_perf
Test (10) memcpy of sizes (10485760) ....
10 memcpy. Time per memcpy: 23881352 [nsec] (439 MB/sec)
 finished.


I think that without mlockall or memset, the memory pages you
requested with malloc, but did not --really-- get, are brought into
physical memory only when memcpy touches them.

What puzzles me is WHY it is faster WITHOUT touching the pages
BEFORE memcpy???

Any memory handling expert is welcome to answer.
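
One way to check whether pages are only mapped on first touch is to
count minor page faults during a first and a second pass over a freshly
allocated buffer. The sketch below is an illustration under assumptions,
not something posted in the thread: it relies on the standard
getrusage() counter ru_minflt, and the helper names (minor_faults_now,
touch_pages, measure_touch_faults) are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>

/* Minor page faults taken by this process so far, or -1 on error. */
static long minor_faults_now(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt;
}

/* Write one byte per page so that every page gets backed by memory. */
static void touch_pages(volatile char *buf, size_t size)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t step = page > 0 ? (size_t)page : 4096;
    size_t off;

    for (off = 0; off < size; off += step)
        buf[off] = 0;
}

/*
 * Allocate `size` bytes, touch every page twice, and report the minor
 * faults seen during the first and the second pass.
 * Returns 0 on success, -1 on allocation or getrusage() failure.
 */
static int measure_touch_faults(size_t size, long *first, long *second)
{
    long f0, f1, f2;
    char *buf = malloc(size);

    if (buf == NULL)
        return -1;
    f0 = minor_faults_now();
    touch_pages(buf, size);   /* first touch: pages get mapped here */
    f1 = minor_faults_now();
    touch_pages(buf, size);   /* second touch: pages already mapped */
    f2 = minor_faults_now();
    free(buf);
    if (f0 < 0 || f1 < 0 || f2 < 0)
        return -1;
    *first = f1 - f0;
    *second = f2 - f1;
    return 0;
}
```

On a typical Linux system one would expect the first pass over a large
buffer to report many faults and the second almost none, although the
exact numbers depend on the kernel and the allocator. A possible
explanation for the faster untouched case, offered here only as an
assumption, is that reads from never-written anonymous pages can be
served from a single shared zero page, so the source side of the copy
stays in cache.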

> Then again I cannot use Xenomai without mlockall()
> :(

And you cannot design a realtime application without
ensuring you really have the memory you requested;
this is not a Xenomai issue (in my opinion, though).

PS: command lines used for compilation:

gcc memcpy_perf-erk.c -o memcpy_perf -lrt
gcc -DMLOCK memcpy_perf-erk.c -o memcpy_perf_mlockall -lrt
gcc -DMEMSET memcpy_perf-erk.c -o memcpy_perf_memset -lrt

-- 
Erk

[-- Attachment #2: memcpy_perf-erk.c --]
[-- Type: text/x-csrc, Size: 2161 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <time.h>	/* clock_gettime(), struct timespec */
#include <sys/time.h>
#include <pthread.h>
#include <string.h>
#include <sys/mman.h>

int osa_now_timespec(struct timespec* t)
{
	return clock_gettime (CLOCK_REALTIME, t);
}


int osa_timediff(const struct timespec* t1, const struct timespec* t2, struct timespec* diff)
{
	if (t1!=NULL && t2!=NULL && diff!=NULL)
	{
		unsigned long long a_nsec, b_nsec, diff_nsec; 

		// calculate difference time
		a_nsec = t1->tv_sec*1000000000ULL + t1->tv_nsec;
		b_nsec = t2->tv_sec*1000000000ULL + t2->tv_nsec;
		diff_nsec = b_nsec - a_nsec;
		diff->tv_sec =  diff_nsec/1000000000ULL;
		diff->tv_nsec = diff_nsec%1000000000ULL;
		return 0;
	}
	return -1;
}


unsigned long long osa_to_ns(const struct timespec* t)
{
	 return (t->tv_sec*1000000000ULL + t->tv_nsec);
}


int testMemcpy(unsigned long num, size_t msgsize)
{
	/* %lu and %zu match the unsigned long and size_t arguments */
	printf("Test (%lu) memcpy of sizes (%zu) ....\n",
            num, msgsize);
    unsigned long i;
    unsigned long long nstime;
    struct timespec t1, t2, t3;

    char *buf1=malloc (msgsize);
    char *buf2=malloc (msgsize);
#ifdef MEMSET
    memset(buf1,'\0',msgsize);
    memset(buf2,'\0',msgsize);
#endif
    // measure
	osa_now_timespec(&t1);
    for (i=0; i<num; i++)
        {
        memcpy (buf1, buf2, msgsize);
        }
    // measure
	osa_now_timespec(&t2);
	osa_timediff(&t1, &t2, &t3);
    
    free (buf2);
    free (buf1);
    
    nstime = osa_to_ns(&t3)/(unsigned long long) i;
	printf("%lu memcpy. Time per memcpy: %llu [nsec] (%llu MB/sec)\n",
            i, nstime,
            (1000ULL * (unsigned long long) msgsize)/(nstime)) ;
    fflush (stdout);


	printf(" finished.\n");
    fflush (stdout);
    return 0;	/* the function is declared to return int */
}


int main()
{
#ifdef MLOCK
    mlockall(MCL_FUTURE|MCL_CURRENT);
#endif
    testMemcpy(10000,      1*1024);
    testMemcpy(10000,      2*1024);
    testMemcpy(10000,      4*1024);
    testMemcpy(10000,      8*1024);
    testMemcpy(10000,     16*1024);
    testMemcpy(1000 ,     32*1024);
    testMemcpy(1000 ,     50*1024);
    testMemcpy(1000 ,    100*1024);
    testMemcpy( 100 ,   1024*1024);
    testMemcpy(  10 ,10*1024*1024);
    testMemcpy(   5 ,50*1024*1024);

    return 0;
}


* Re: [Xenomai-help] memcpy performance on Xenomai
  2007-05-15 16:59 ` Daniel Schnell
@ 2007-05-15 18:03   ` Gilles Chanteperdrix
  2007-05-15 20:26     ` Eric Noulard
  0 siblings, 1 reply; 15+ messages in thread
From: Gilles Chanteperdrix @ 2007-05-15 18:03 UTC (permalink / raw)
  To: Daniel Schnell; +Cc: xenomai

Daniel Schnell wrote:
 > Hi,
 > 
 > 
 > To 1.)
 > doing memset() on the buffers makes a change ... Now the performance is as bad as with mlockall() :(
 > Which means the Linux VM without mlockall makes somehow an optimization for uninitialized heap buffers.
 > Actually nothing to blame on Xenomai here.
 > 
 > The reason I was looking at the memcpy native performance was because I found Xenomai Posix message queues somewhat slow:
 > In two tasks where one is receiving a 32 kBytes msg and the other sending to this task 32 kbytes of data, I have an average cycle time of more than 1 ms.
 > Looking at the memcpy performance of my system tells me something:
 > Memcpy(32 KB) ->  400 us
 > Two msg_send(32KB) cannot be less than 2x400 us + 2 ctx switch away, probably even more because we have to write first to Kernel memory, then the ctx switch, then copying from Kernel memory to user buffer, than ctx switch. This simply means I cannot use msg queues for anything bigger than 1K, which means I have to use shared memory for that.

Actually, when using message queues in user-space, there are two more
copies of the messages, plus the allocation of a temporary buffer for
messages larger than 64 bytes. Using shared memory in the implementation
of message queues would bring this down to two copies in user-space as well;
this is another optimization that has been on my todo list for some
time. But the cost of POSIX message queues is already high by design
(at least two copies in user-space, maybe one copy in kernel-space),
so it is probably better to use message queues to pass pointers to a
shared memory region, allowing zero copies.
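
To put numbers on the copy cost, here is a small back-of-the-envelope
sketch. It is not from the thread: the function names are made up, it
ignores context switches and allocation, and reading "two more copies"
as four copies in total is an interpretation. It simply inverts the
MB/sec formula used by the test program (MB/sec = 1000 * bytes / ns).

```c
#include <stdint.h>

/*
 * Time in ns needed to copy `bytes` at `mb_per_sec` (1 MB = 10^6 bytes).
 * Returns 0 if the bandwidth is given as 0.
 */
static uint64_t copy_time_ns(uint64_t bytes, uint64_t mb_per_sec)
{
    return mb_per_sec ? bytes * 1000ULL / mb_per_sec : 0;
}

/*
 * Lower bound for moving one message when the payload is copied
 * `copies` times; context switches come on top of this.
 */
static uint64_t transfer_floor_ns(uint64_t bytes, uint64_t mb_per_sec,
                                  unsigned copies)
{
    return (uint64_t)copies * copy_time_ns(bytes, mb_per_sec);
}
```

With the 81 MB/sec reported earlier for 32 KB buffers, one copy costs
about 405 us, two copies about 0.81 ms and four copies about 1.6 ms,
which is in line with the observed cycle time of more than 1 ms. Passing
only a pointer or offset through the queue removes the payload term.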

 > As you proposed in another mail maybe it makes a difference to use an own memcpy implementation for ppc which does no read on write.
 > 
 > 
 > To 2.)
 > I don't have a recent glibc, so that is no option. I can wait however for the new clock_gettime() of Xenomai and use in between the rt_timer_tsc function.

Changing glibc for an embedded Linux project is not that traumatic, and
is probably a good idea given how many improvements there have been.

-- 


					    Gilles Chanteperdrix.



* Re: [Xenomai-help] memcpy performance on Xenomai
  2007-05-15 18:03   ` Gilles Chanteperdrix
@ 2007-05-15 20:26     ` Eric Noulard
  2007-05-16 20:17       ` Gilles Chanteperdrix
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Noulard @ 2007-05-15 20:26 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai

2007/5/15, Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>:
> Daniel Schnell wrote:
>  > The reason I was looking at the memcpy native performance was because I found Xenomai Posix message queues somewhat slow:
>  > In two tasks where one is receiving a 32 kBytes msg and the other sending to this task 32 kbytes of data, I have an average cycle time of more than 1 ms.
>  > Looking at the memcpy performance of my system tells me something:
>  > Memcpy(32 KB) ->  400 us
>  > Two msg_send(32KB) calls cannot take less than 2 x 400 us plus 2 ctx switches, and probably more, because we first have to write to kernel memory, then comes a ctx switch, then the copy from kernel memory to the user buffer, then another ctx switch. This simply means I cannot use msg queues for anything bigger than 1K, so I have to use shared memory for that.

Are your 2 tasks both in user-space, or 1 in user-space and 1 in kernel?

> so it is probably better to use message queues to pass pointers to a
> shared memory region, allowing zero copies.

I was about to say just the same, but before developing my idea
I have a Xenomai question:
Are the shared memory regions obtained by shm_open
shareable between a kernel task and a user task, just like those from
rt_heap_create/rt_heap_alloc (as I read in the doc)?

Now back to shared memory design ideas.
Maybe you can create a shared memory region onto which
you map a RINGBUFFER structure
(a ringbuffer is simple and efficient if you only have 1 writer and 1 reader).
Since a ringbuffer is non-blocking, you may add a semaphore in order
to get the implicit synchronization the message queue gives you when the RINGBUF
is empty (I think the full case may be forbidden by design).

It may (to be confirmed) cost you less than
SHM + MESSAGE QUEUE if sem_wait/sem_post is faster;
moreover, with a ringbuf you only have to sem_wait if it is empty.

The ringbuf elements may contain the offset in the SHM where you can find the data,
or, if your data has a fixed size, you may set up a ringbuf which contains
the actual data as elements.
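
A rough sketch of such a single-writer/single-reader ringbuf, meant to live
inside the shared region. This is not the TSP code; struct ringbuf, rb_elem,
rb_init, rb_put, rb_get and RB_SLOTS are made-up names, and it assumes C11
atomics and process-shared unnamed semaphores are available on the target.
For simplicity the reader does a sem_wait on every get rather than only when
the ring is empty, and a full ring is reported to the writer instead of
being forbidden by design.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <semaphore.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RB_SLOTS 8u  /* hypothetical capacity, must be a power of two */
_Static_assert((RB_SLOTS & (RB_SLOTS - 1u)) == 0, "RB_SLOTS must be 2^n");

struct rb_elem {
    uint32_t offset;  /* where the data sits in the shared region */
    uint32_t len;     /* size of the data in bytes */
};

struct ringbuf {
    _Atomic uint32_t head;  /* advanced by the writer only */
    _Atomic uint32_t tail;  /* advanced by the reader only */
    sem_t items;            /* counts elements ready to be read */
    struct rb_elem slot[RB_SLOTS];
};

int rb_init(struct ringbuf *rb)
{
    atomic_init(&rb->head, 0);
    atomic_init(&rb->tail, 0);
    return sem_init(&rb->items, 1, 0);  /* pshared = 1: usable across processes */
}

/* Writer side, never blocks. Returns false when the ring is full. */
bool rb_put(struct ringbuf *rb, struct rb_elem e)
{
    uint32_t head = atomic_load_explicit(&rb->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&rb->tail, memory_order_acquire);
    if (head - tail == RB_SLOTS)
        return false;
    rb->slot[head % RB_SLOTS] = e;
    /* Publish the slot before the reader can observe the new head. */
    atomic_store_explicit(&rb->head, head + 1, memory_order_release);
    return sem_post(&rb->items) == 0;
}

/* Reader side, blocks while the ring is empty. */
bool rb_get(struct ringbuf *rb, struct rb_elem *out)
{
    while (sem_wait(&rb->items) == -1) {
        if (errno != EINTR)
            return false;
    }
    uint32_t tail = atomic_load_explicit(&rb->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&rb->head, memory_order_acquire);
    assert(head != tail);  /* the semaphore count guarantees one element */
    (void)head;
    *out = rb->slot[tail % RB_SLOTS];
    /* Hand the slot back to the writer. */
    atomic_store_explicit(&rb->tail, tail + 1, memory_order_release);
    return true;
}
```

The writer would fill a block in the region, then call rb_put with its
offset and length; the reader calls rb_get and uses the block in place.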

We use an in-memory ringbuf for a sampling protocol designed for and used
in test environments: https://savannah.nongnu.org/projects/tsp

The ringbuf code is implemented with C macros (for efficiency):
http://cvs.savannah.nongnu.org/viewvc/tsp/src/core/misc_utils/tsp_ringbuf.h?root=tsp&view=markup

-- 
Erk


* Re: [Xenomai-help] memcpy performance on Xenomai
  2007-05-15 17:54         ` Eric Noulard
@ 2007-05-16  6:36           ` M. Koehrer
  0 siblings, 0 replies; 15+ messages in thread
From: M. Koehrer @ 2007-05-16  6:36 UTC (permalink / raw)
  To: eric.noulard, daniel.schnell; +Cc: xenomai

Hi!

I am not a memory expert.
However, I think that a zero-only page is handled specially by the MMU
(it does not actually use physical memory).
This is the reason why a malloc of a huge amount of memory typically succeeds even if
not that much physical memory is available.
With malloc followed by a memset to zero only, this will typically not lead to physical RAM usage (I think this
is the "copy-on-write" (COW) mechanism).
Thus, I recommend doing a memset with a non-zero value after allocating the memory.

memset(buf1,123,msgsize);  
memset(buf2,123,msgsize);

This should lead to a fair comparison. 
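
For illustration, a small sketch of a measurement done that way (this is not
the attached memcpy_perf program; time_memcpy_ns is a made-up helper). Both
buffers are written with a non-zero value before the clock starts, so every
page has been touched and the timed loop contains nothing but the copies.
The empty asm statement is a GCC/Clang-specific barrier, assumed here only to
keep the compiler from dropping the copies.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Returns the average time of one memcpy in nanoseconds, or -1 on error. */
int64_t time_memcpy_ns(size_t msgsize, int rounds)
{
    assert(msgsize > 0 && rounds > 0);

    char *buf1 = malloc(msgsize);
    char *buf2 = malloc(msgsize);
    if (buf1 == NULL || buf2 == NULL) {
        free(buf1);
        free(buf2);
        return -1;
    }

    /* Touch every page with a non-zero value before the measurement. */
    memset(buf1, 123, msgsize);
    memset(buf2, 123, msgsize);

    struct timespec t0, t1;
    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0) {
        free(buf1);
        free(buf2);
        return -1;
    }
    for (int i = 0; i < rounds; i++) {
        memcpy(buf2, buf1, msgsize);
        /* Compiler barrier (GCC/Clang): the copy must really be performed. */
        __asm__ __volatile__("" : : "r"(buf2) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    int64_t total = (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL
                  + (int64_t)(t1.tv_nsec - t0.tv_nsec);
    free(buf1);
    free(buf2);
    return total / rounds;
}
```

Calling it as time_memcpy_ns(10485760, 10) would correspond to the 10 MB
case of the tests quoted below.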

Regards

Mathias
> > Some interesting insights about my last tests.
> >
> > 1.) The culprit is mlockall(MCL_FUTURE|MCL_CURRENT);
> >
> > As soon as I leave this out, I get much better results:
> >
> > Without mlockall():
> > Test (10) memcpy of sizes (10485760)
> > 10 memcpy. Time per memcpy: 78147209 [nsec] (134 MB/sec)
> >  finished.
> >
> > With mlockall():
> > Test (10) memcpy of sizes (10485760) ....
> > 10 memcpy. Time per memcpy: 124194618 [nsec] (84 MB/sec)
> >  finished.
> 
> 
> I think you are not measuring the same thing in both cases.
> I did some tests on 2.6.20 (precompiled Debian etch kernel)
> on a 1.6 GHz Pentium M.
> 
> I think the fact that you malloc your buffers and then
> immediately memcpy them gives a non-repeatable measurement
> (at least on my side),
> depending on something I do not understand.
> 
> Could you try my modified version of your code which
> adds:
> 
> memset(buf1,'\0',msgsize);
> memset(buf2,'\0',msgsize);
> 
> just after malloc (you may try calloc too).
> 
> With this modification
> I get figures similar to the mlockall version on my (quasi-)vanilla kernel.
> 
> that is:
> 
> ./memcpy_perf_mlockall
> Test (10) memcpy of sizes (10485760) ....
> 10 memcpy. Time per memcpy: 35716568 [nsec] (293 MB/sec)
>  finished.
> 
> ./memcpy_perf_memset
> Test (10) memcpy of sizes (10485760) ....
> 10 memcpy. Time per memcpy: 36004454 [nsec] (291 MB/sec)
>  finished.
> 
> ./memcpy_perf
> Test (10) memcpy of sizes (10485760) ....
> 10 memcpy. Time per memcpy: 23881352 [nsec] (439 MB/sec)
>  finished.
> 
> 
> I think that without mlockall or memset, the memory pages you
> requested with malloc but did not --really-- get are brought into
> physical memory only when memcpy runs.
> 
> What puzzles me is WHY it is faster WITHOUT touching the pages
> BEFORE memcpy???
> 
> Any memory handling expert is welcome to answer.
> 
> > Then again I cannot use Xenomai without mlockall()
> > :(
> 
> And you cannot design a realtime application without
> ensuring you really have the memory you requested,
> this is not a xenomai issue (my opinion though).
> 
> PS: command lines used for compilation:
> 
> gcc memcpy_perf-erk.c -o memcpy_perf -lrt
> gcc -DMLOCK memcpy_perf-erk.c -o memcpy_perf_mlockall -lrt
> gcc -DMEMSET memcpy_perf-erk.c -o memcpy_perf_memset -lrt
> 


-- 
Mathias Koehrer
mathias_koehrer@domain.hid





* Re: [Xenomai-help] memcpy performance on Xenomai
  2007-05-15 20:26     ` Eric Noulard
@ 2007-05-16 20:17       ` Gilles Chanteperdrix
  2007-05-16 20:34         ` Eric Noulard
  0 siblings, 1 reply; 15+ messages in thread
From: Gilles Chanteperdrix @ 2007-05-16 20:17 UTC (permalink / raw)
  To: Eric Noulard; +Cc: xenomai

Eric Noulard wrote:
 > 2007/5/15, Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>:
 > > Daniel Schnell wrote:
 > >  > The reason I was looking at the memcpy native performance was because I found Xenomai Posix message queues somewhat slow:
 > >  > In two tasks where one is receiving a 32 kBytes msg and the other sending to this task 32 kbytes of data, I have an average cycle time of more than 1 ms.
 > >  > Looking at the memcpy performance of my system tells me something:
 > >  > Memcpy(32 KB) ->  400 us
 > >  > Two msg_send(32KB) calls cannot take less than 2 x 400 us plus 2 ctx switches, and probably more, because we first have to write to kernel memory, then comes a ctx switch, then the copy from kernel memory to the user buffer, then another ctx switch. This simply means I cannot use msg queues for anything bigger than 1K, so I have to use shared memory for that.
 > 
 > Are your 2 tasks in user-space or 1 in user 1 in kernel?
 > 
 > > so it is probably better to use message queues to pass pointers to a
 > > shared memory region, allowing zero copies.
 > 
 > I was about to say just the same but before developing my idea
 > I have xenomai question:
 > Are the share memory region obtained by shm_open
 > shareable between kernel task and user task just like
 > rt_heap_create/rt_heap_alloc does (read from the doc).

Well, if you found the rt_heap doc, it would have been easy to have a look
at the POSIX shared memory doc as well... Anyway, yes, on platforms where
sharing a memory area between kernel space and user-space is not a
problem, POSIX shared memory regions are shareable. On ARM, it is a no-go
because of the cache architecture.

-- 


					    Gilles Chanteperdrix.



* Re: [Xenomai-help] memcpy performance on Xenomai
  2007-05-16 20:17       ` Gilles Chanteperdrix
@ 2007-05-16 20:34         ` Eric Noulard
  0 siblings, 0 replies; 15+ messages in thread
From: Eric Noulard @ 2007-05-16 20:34 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai

2007/5/16, Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>:
> Eric Noulard wrote:
>  >
>  > I was about to say just the same but before developing my idea
>  > I have xenomai question:
>  > Are the share memory region obtained by shm_open
>  > shareable between kernel task and user task just like
>  > rt_heap_create/rt_heap_alloc does (read from the doc).
>
> Well, if you found rt_heaps doc, it would have been easy to have a look
> at posix shared memory doc as well...

I should have said so, but in fact I did read the doc
yesterday and didn't find the answer.

I reread it more carefully tonight and found it.
Sorry about that.

> Anyway, yes, on platforms where
> sharing a memory area between kernel space and user-space is not a
> problem, posix shared memory are shareable. On ARM, it is a no go
> because of the cache architecture.

Thank you very much for your answer :))

And please excuse my overly hasty reading.
I will read the doc more carefully next time.


-- 
Erk



end of thread, other threads:[~2007-05-16 20:34 UTC | newest]

Thread overview: 15+ messages
2007-05-15 11:38 [Xenomai-help] memcpy performance on Xenomai Daniel Schnell
2007-05-15 12:16 ` Gilles Chanteperdrix
2007-05-15 14:40   ` Daniel Schnell
2007-05-15 14:50     ` Gilles Chanteperdrix
2007-05-15 15:28       ` Daniel Schnell
2007-05-15 15:41         ` Gilles Chanteperdrix
2007-05-15 17:54         ` Eric Noulard
2007-05-16  6:36           ` M. Koehrer
2007-05-15 15:18     ` Philippe Gerum
  -- strict thread matches above, loose matches on Subject: below --
2007-05-15 15:59 Fillod Stephane
2007-05-15 16:59 ` Daniel Schnell
2007-05-15 18:03   ` Gilles Chanteperdrix
2007-05-15 20:26     ` Eric Noulard
2007-05-16 20:17       ` Gilles Chanteperdrix
2007-05-16 20:34         ` Eric Noulard
