* Re: [Xenomai-help] Xenomai and futexes - Native API optimized
@ 2007-05-11 7:08 M. Koehrer
2007-05-11 7:37 ` M. Koehrer
2007-05-11 7:44 ` Jan Kiszka
0 siblings, 2 replies; 13+ messages in thread
From: M. Koehrer @ 2007-05-11 7:08 UTC (permalink / raw)
To: stephane.fillod, jan.kiszka, mathias_koehrer; +Cc: xenomai
Hi,
here are my latest results:
I included xeno_nucleus and xeno_native into the kernel. (see first oprofile result).
(I also added debug information to the kernel which was of no additional help).
I detected that most of the time was spent in __ipipe_hard_cpuid.
I looked at that routine and split it up using 2 additional helper functions that are used
to "monitor" the apic_read() and the GET_APIC_ID calls, respectively (see patch below).
The results of this experiment can be found at "second oprofile result" below.
If I interpret the results correctly, it looks as if apic_read() is what is actually
eating up the performance. As this call "leaves" the CPU-internal "area" and accesses the external
APIC, this sounds plausible.
I think this is done here to detect the current CPU.
Is it possible to detect differently or to store that information somehow with a thread (TLS)
to avoid requesting it frequently?
I hope that helps a little bit to identify this issue and (perhaps) to find a faster solution.
Thanks for all feedback on this!
Regards
Mathias
------------ Begin of patch -----------------
--- ipipe.c.orig 2007-05-09 16:16:32.000000000 +0200
+++ ipipe.c 2007-05-11 08:47:41.000000000 +0200
@@ -72,13 +72,30 @@
 int (*__ipipe_logical_cpuid)(void) = &__ipipe_boot_cpuid;
+
+unsigned long __ipipe_hard_cpuid_apic_read(void)
+{
+	return apic_read(APIC_ID);
+}
+
+unsigned __ipipe_hard_cpuid_get_apic_id(unsigned long apic)
+{
+	return GET_APIC_ID(apic);
+}
+
+
 static notrace int __ipipe_hard_cpuid(void)
 {
 	unsigned long flags;
 	int cpu;
+	unsigned long apic;
+	unsigned apic_id;
 	local_irq_save_hw_notrace(flags);
-	cpu = __ipipe_apicid_2_cpu[GET_APIC_ID(apic_read(APIC_ID))];
+	// cpu = __ipipe_apicid_2_cpu[GET_APIC_ID(apic_read(APIC_ID))];
+	apic = __ipipe_hard_cpuid_apic_read();
+	apic_id = __ipipe_hard_cpuid_get_apic_id(apic);
+	cpu = __ipipe_apicid_2_cpu[apic_id];
 	local_irq_restore_hw_notrace(flags);
 	return cpu;
 }
--------- End of patch
------------- First oprofile result: -----------------------------
Using default event: GLOBAL_POWER_EVENTS:100000:1:1:1
Daemon started.
Profiler running.
delta is 50404054895 per step: 5040
Stopping profiling.
CPU: P4 / Xeon, speed 3192.16 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
samples % image name app name symbol name
6273 56.8928 vmlinux vmlinux __ipipe_hard_cpuid
900 8.1625 vmlinux vmlinux rt_sem_v
694 6.2942 vmlinux vmlinux xnregistry_fetch
321 2.9113 vmlinux vmlinux __ipipe_dispatch_event
293 2.6574 bash bash (no symbols)
250 2.2674 vmlinux vmlinux hrtimer_run_queues
227 2.0588 libc-2.3.6.so libc-2.3.6.so (no symbols)
140 1.2697 vmlinux vmlinux delay_tsc
123 1.1155 libnative.so.0.0.0 libnative.so.0.0.0 rt_sem_p
103 0.9342 vmlinux vmlinux __ipipe_stall_root
78 0.7074 vmlinux vmlinux __ipipe_test_and_stall_root
67 0.6077 vmlinux vmlinux apic_timer_interrupt
66 0.5986 vmlinux vmlinux sysenter_past_esp
60 0.5442 vmlinux vmlinux __ipipe_restore_pipeline_head
56 0.5079 vmlinux vmlinux do_wp_page
56 0.5079 vmlinux vmlinux search_by_key
54 0.4898 oprofiled oprofiled (no symbols)
41 0.3718 vmlinux vmlinux __ipipe_sync_stage
37 0.3356 ld-2.3.6.so ld-2.3.6.so do_lookup_x
37 0.3356 vmlinux vmlinux __handle_mm_fault
25 0.2267 vmlinux vmlinux __ipipe_handle_exception
25 0.2267 vmlinux vmlinux find_get_page
25 0.2267 vmlinux vmlinux get_page_from_freelist
25 0.2267 vmlinux vmlinux run_timer_softirq
25 0.2267 vmlinux vmlinux scheduler_tick
24 0.2177 vmlinux vmlinux sysenter_exit
23 0.2086 vmlinux vmlinux __ipipe_syscall_root
23 0.2086 vmlinux vmlinux __ipipe_unstall_root
22 0.1995 libnative.so.0.0.0 libnative.so.0.0.0 rt_sem_v
21 0.1905 vmlinux vmlinux __ipipe_test_root
21 0.1905 vmlinux vmlinux ata_bmdma_start
19 0.1723 ld-2.3.6.so ld-2.3.6.so strcmp
19 0.1723 vmlinux vmlinux ata_altstatus
19 0.1723 vmlinux vmlinux ata_bmdma_irq_clear
18 0.1633 oprofile oprofile (no symbols)
18 0.1633 vmlinux vmlinux find_vma
18 0.1633 vmlinux vmlinux flush_tlb_page
18 0.1633 vmlinux vmlinux release_pages
18 0.1633 vmlinux vmlinux unmap_vmas
17 0.1542 ld-2.3.6.so ld-2.3.6.so _dl_relocate_object
17 0.1542 vmlinux vmlinux __ipipe_unstall_iret_root
---------------------------------------------
------------- Second oprofile result: -----------------------------
Using default event: GLOBAL_POWER_EVENTS:100000:1:1:1
Daemon started.
Profiler running.
delta is 51846556098 per step: 5184
Stopping profiling.
CPU: P4 / Xeon, speed 3192.33 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
samples % image name app name symbol name
4350 39.4307 vmlinux vmlinux __ipipe_hard_cpuid_apic_read
2336 21.1748 vmlinux vmlinux __ipipe_hard_cpuid
306 2.7737 bash bash (no symbols)
287 2.6015 vmlinux vmlinux __ipipe_dispatch_event
276 2.5018 vmlinux vmlinux sysenter_past_esp
269 2.4384 vmlinux vmlinux hrtimer_run_queues
264 2.3930 vmlinux vmlinux __ipipe_syscall_root
245 2.2208 vmlinux vmlinux xnregistry_fetch
209 1.8945 libc-2.3.6.so libc-2.3.6.so (no symbols)
173 1.5682 vmlinux vmlinux rt_sem_v
154 1.3959 vmlinux vmlinux __ipipe_restore_pipeline_head
128 1.1603 vmlinux vmlinux __copy_from_user_ll_nozero
102 0.9246 vmlinux vmlinux delay_tsc
100 0.9065 vmlinux vmlinux __ipipe_stall_root
100 0.9065 vmlinux vmlinux hisyscall_event
90 0.8158 vmlinux vmlinux apic_timer_interrupt
89 0.8067 vmlinux vmlinux __ipipe_test_and_stall_root
80 0.7252 vmlinux vmlinux rt_sem_p
74 0.6708 vmlinux vmlinux sysenter_exit
58 0.5257 vmlinux vmlinux do_wp_page
53 0.4804 oprofiled oprofiled (no symbols)
53 0.4804 vmlinux vmlinux search_by_key
50 0.4532 vmlinux vmlinux __ipipe_sync_stage
40 0.3626 vmlinux vmlinux __rt_sem_v
39 0.3535 vmlinux vmlinux __rt_sem_p
31 0.2810 ld-2.3.6.so ld-2.3.6.so do_lookup_x
30 0.2719 vmlinux vmlinux find_get_page
27 0.2447 vmlinux vmlinux ata_bmdma_start
24 0.2175 vmlinux vmlinux __ipipe_unstall_root
24 0.2175 vmlinux vmlinux run_timer_softirq
23 0.2085 vmlinux vmlinux unmap_vmas
22 0.1994 vmlinux vmlinux flush_tlb_page
21 0.1904 ld-2.3.6.so ld-2.3.6.so strcmp
21 0.1904 vmlinux vmlinux __handle_mm_fault
19 0.1722 vmlinux vmlinux __ipipe_test_root
18 0.1632 vmlinux vmlinux do_page_fault
17 0.1541 oprofile oprofile (no symbols)
17 0.1541 vmlinux vmlinux __ipipe_handle_exception
16 0.1450 vmlinux vmlinux __d_lookup
15 0.1360 vmlinux vmlinux filemap_nopage
15 0.1360 vmlinux vmlinux get_page_from_freelist
15 0.1360 vmlinux vmlinux page_remove_rmap
15 0.1360 vmlinux vmlinux restore_nocheck_notrace
14 0.1269 vmlinux vmlinux _atomic_dec_and_lock
14 0.1269 vmlinux vmlinux copy_page_range
14 0.1269 vmlinux vmlinux scheduler_tick
12 0.1088 ld-2.3.6.so ld-2.3.6.so _dl_relocate_object
12 0.1088 vmlinux vmlinux __find_get_block
12 0.1088 vmlinux vmlinux __ipipe_unstall_iret_root
--------------------------------------------------------------
> define CONFIG_XENO_OPT_DEBUG and CONFIG_DEBUG_KERNEL/CONFIG_DEBUG_INFO
> to have symbols and all. OProfile is only able to look up virtual
> addresses when debug symbols are present in the file. You may have to pass the
> --vmlinux option to opcontrol. Compiling Xenomai like suggested by Jan
> will make your life easier, and will bring you an extra mini speedup
> if you're in that kind of business.
>
> --
> Stephane
>
--
Mathias Koehrer
mathias_koehrer@domain.hid
A lot or a little? Fast or slow? Unlimited surfing + phone calls
with no time or volume limits? THE TOP OFFER NOW at Arcor: affordable
and fast with DSL - the all-inclusive package for clever double savers,
only 39.85 incl. DSL and ISDN basic fee!
http://www.arcor.de/rd/emf-dsl-2
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Xenomai-help] Xenomai and futexes - Native API optimized
2007-05-11 7:08 [Xenomai-help] Xenomai and futexes - Native API optimized M. Koehrer
@ 2007-05-11 7:37 ` M. Koehrer
2007-05-11 7:44 ` Jan Kiszka
1 sibling, 0 replies; 13+ messages in thread
From: M. Koehrer @ 2007-05-11 7:37 UTC (permalink / raw)
To: stephane.fillod, jan.kiszka, mathias_koehrer; +Cc: xenomai
Hi!
As written in my previous mail, I measured that the time is lost in __ipipe_hard_cpuid.
This function is only called in an SMP setup.
I have now disabled SMP in my kernel (but kept APIC and IO APIC enabled) to repeat the experiment.
And the result on a UP system is dramatically better!
It is now less than 1 microsecond per loop (compared with about 5 microseconds on SMP).
See below the oprofile output for UP.
Regards
Mathias
--------------------------------------------- oprofile output for UP ----------------
Using default event: GLOBAL_POWER_EVENTS:100000:1:1:1
Daemon started.
Profiler running.
delta is 9190565309 per step: 919
Stopping profiling.
CPU: P4 / Xeon, speed 3192.2 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
samples % image name app name symbol name
1669 18.5115 vmlinux vmlinux __ipipe_dispatch_event
985 10.9250 vmlinux vmlinux sysenter_past_esp
711 7.8860 vmlinux vmlinux xnregistry_fetch
550 6.1003 libc-2.3.6.so libc-2.3.6.so (no symbols)
479 5.3128 vmlinux vmlinux __copy_from_user_ll_nozero
474 5.2573 vmlinux vmlinux __ipipe_restore_pipeline_head
471 5.2240 vmlinux vmlinux rt_sem_p
451 5.0022 vmlinux vmlinux search_by_key
294 3.2609 bash bash (no symbols)
240 2.6619 vmlinux vmlinux hisyscall_event
237 2.6287 vmlinux vmlinux rt_sem_v
236 2.6176 libnative.so.0.0.0 libnative.so.0.0.0 rt_sem_p
156 1.7303 libnative.so.0.0.0 libnative.so.0.0.0 rt_sem_v
145 1.6083 performance performance mytaska
61 0.6766 vmlinux vmlinux do_wp_page
60 0.6655 vmlinux vmlinux acpi_pm_read
56 0.6211 vmlinux vmlinux __find_get_block
53 0.5878 vmlinux vmlinux memset
52 0.5768 oprofiled oprofiled (no symbols)
43 0.4769 ld-2.3.6.so ld-2.3.6.so do_lookup_x
34 0.3771 syslogd syslogd (no symbols)
34 0.3771 vmlinux vmlinux release_console_sem
33 0.3660 vmlinux vmlinux __ipipe_unstall_root
33 0.3660 vmlinux vmlinux memcpy
31 0.3438 vmlinux vmlinux do_journal_end
30 0.3327 klogd klogd (no symbols)
30 0.3327 vmlinux vmlinux unmap_vmas
29 0.3217 oprofile oprofile (no symbols)
27 0.2995 vmlinux vmlinux __copy_from_user_ll
27 0.2995 vmlinux vmlinux do_con_write
27 0.2995 vmlinux vmlinux ll_rw_block
25 0.2773 vmlinux vmlinux do_syslog
21 0.2329 ld-2.3.6.so ld-2.3.6.so strcmp
21 0.2329 vmlinux vmlinux get_page_from_freelist
21 0.2329 vmlinux vmlinux sysenter_exit
21 0.2329 vmlinux vmlinux write_chan
19 0.2107 vmlinux vmlinux __handle_mm_fault
19 0.2107 vmlinux vmlinux __ipipe_unstall_iret_root
19 0.2107 vmlinux vmlinux generic_file_buffered_write
18 0.1996 vmlinux vmlinux bit_waitqueue
18 0.1996 vmlinux vmlinux reiserfs_update_sd_size
17 0.1886 vmlinux vmlinux journal_mark_dirty
17 0.1886 vmlinux vmlinux reiserfs_prepare_for_journal
17 0.1886 vmlinux vmlinux vsnprintf
16 0.1775 vmlinux vmlinux unlock_buffer
15 0.1664 vmlinux vmlinux memmove
15 0.1664 vmlinux vmlinux number
14 0.1553 vmlinux vmlinux __link_path_walk
14 0.1553 vmlinux vmlinux conv_uni_to_pc
14 0.1553 vmlinux vmlinux kmem_cache_alloc
14 0.1553 vmlinux vmlinux radix_tree_lookup
13 0.1442 ld-2.3.6.so ld-2.3.6.so _dl_relocate_object
13 0.1442 vmlinux vmlinux get_num_ver
12 0.1331 vmlinux vmlinux __d_lookup
12 0.1331 vmlinux vmlinux __ipipe_handle_exception
11 0.1220 vmlinux vmlinux __block_prepare_write
--------------------------------------
--
Mathias Koehrer
mathias_koehrer@domain.hid
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Xenomai-help] Xenomai and futexes - Native API optimized
2007-05-11 7:08 [Xenomai-help] Xenomai and futexes - Native API optimized M. Koehrer
2007-05-11 7:37 ` M. Koehrer
@ 2007-05-11 7:44 ` Jan Kiszka
2007-05-11 8:10 ` M. Koehrer
1 sibling, 1 reply; 13+ messages in thread
From: Jan Kiszka @ 2007-05-11 7:44 UTC (permalink / raw)
To: M. Koehrer; +Cc: xenomai
[-- Attachment #1: Type: text/plain, Size: 1901 bytes --]
M. Koehrer wrote:
> Hi,
>
> here are my latest results:
> I included xeno_nucleus and xeno_native into the kernel. (see first oprofile result).
> (I also added debug information to the kernel which was of no additional help).
> I detected that most of the time was spent in __ipipe_hard_cpuid.
> I looked at that routine and split it up using 2 additional helper functions that are used
> to "monitor" the apic_read() and the GET_APIC_ID calls, respectively (see patch below).
> The results of this experiment can be found at "second oprofile result" below.
> If I interpret the results correctly, it looks as if apic_read() is what is actually
> eating up the performance. As this call "leaves" the CPU-internal "area" and accesses the external
> APIC, this sounds plausible.
> I think this is done here to detect the current CPU.
> Is it possible to detect differently or to store that information somehow with a thread (TLS)
> to avoid requesting it frequently?
Actually, this is how Linux works. On x86, it used to derive the CPU ID
from current_thread_info, which requires a valid "current", something Xenomai
kernel threads do not guarantee (their stack size is variable, but Linux
needs a fixed, predefined one to resolve current). Nowadays (>= 2.6.20)
we have the PDA (Per-processor Data Area) on x86, a register-based approach, and
I think we could safely use it for I-pipe/Xenomai as well. That would
make A LOT of things easier, because replacing smp_processor_id all over
the place to make some code I-pipe aware is a boring job (I recently did
so for LTTng...) - and it results in slower code on SMP.
As an experiment, you could try implementing the cpuid lookup via
raw_smp_processor_id(), given you use 2.6.20. Hmm, maybe/likely some
Xenomai tuning is required as well to make sure that the PDA is set
correctly when a Xenomai task is created/migrated. Results welcome!
Jan
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 249 bytes --]
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Xenomai-help] Xenomai and futexes - Native API optimized
2007-05-11 7:44 ` Jan Kiszka
@ 2007-05-11 8:10 ` M. Koehrer
2007-05-11 8:21 ` Philippe Gerum
0 siblings, 1 reply; 13+ messages in thread
From: M. Koehrer @ 2007-05-11 8:10 UTC (permalink / raw)
To: jan.kiszka, mathias_koehrer; +Cc: xenomai
Hi Jan,
> As an experiment, you could try implementing the cpuid lookup via
> raw_smp_processor_id(), given you use 2.6.20. Hmm, maybe/likely some
> Xenomai tuning is required as well to make sure that the PDA is set
> correctly when a Xenomai task is created/migrated. Results welcome!
I have applied a patch (below) to my SMP kernel 2.6.20.4.
And the results are indeed much better now.
I went back to my "original" kernel version without the oprofile support.
The "original" ipipe.c delivered times of 4.5 us per step with my test project.
The patched ipipe.c (using raw_smp_processor_id()) delivers 1.25 us per step!
This is really an excellent improvement!
I do not know if there are any side effects - if not: why not use this method instead
of the previous one...
Regards
Mathias
------------------------- patch start ------------------
--- ipipe.c.orig 2007-05-09 16:16:32.000000000 +0200
+++ ipipe.c 2007-05-11 10:00:44.000000000 +0200
@@ -74,6 +74,7 @@
 static notrace int __ipipe_hard_cpuid(void)
 {
+#if 0
 	unsigned long flags;
 	int cpu;
@@ -81,6 +82,9 @@
 	cpu = __ipipe_apicid_2_cpu[GET_APIC_ID(apic_read(APIC_ID))];
 	local_irq_restore_hw_notrace(flags);
 	return cpu;
+#endif
+	return raw_smp_processor_id();
+
 }
 #endif /* CONFIG_SMP */
------------------------- patch end ------------------
--
Mathias Koehrer
mathias_koehrer@domain.hid
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Xenomai-help] Xenomai and futexes - Native API optimized
2007-05-11 8:10 ` M. Koehrer
@ 2007-05-11 8:21 ` Philippe Gerum
0 siblings, 0 replies; 13+ messages in thread
From: Philippe Gerum @ 2007-05-11 8:21 UTC (permalink / raw)
To: M. Koehrer; +Cc: xenomai, jan.kiszka
On Fri, 2007-05-11 at 10:10 +0200, M. Koehrer wrote:
> Hi Jan,
>
> > As an experiment, you could try implementing the cpuid lookup via
> > raw_smp_processor_id(), given you use 2.6.20. Hmm, maybe/likely some
> > Xenomai tuning is required as well to make sure that the PDA is set
> > correctly when a Xenomai task is created/migrated. Results welcome!
>
> I have applied a patch (below) to my SMP kernel 2.6.20.4.
> And the results are indeed much better now.
> I went back to my "original" kernel version without the oprofile support.
> The "original" ipipe.c delivered times of 4.5 us per step with my test project.
> The patched ipipe.c (using raw_smp_processor_id()) delivers 1.25 us per step!
> This is really an excellent improvement!
> I do not know if there are any side effects - if not: why not use this method instead
> of the previous one...
We are going to use it. PDAs have been recently introduced and the
I-pipe patch for x86_64 already uses them, but I simply overlooked the
fact that they have been made available for x86 too. Will fix, thanks.
>
> Regards
>
> Mathias
>
> ------------------------- patch start ------------------
> --- ipipe.c.orig 2007-05-09 16:16:32.000000000 +0200
> +++ ipipe.c 2007-05-11 10:00:44.000000000 +0200
> @@ -74,6 +74,7 @@
>
> static notrace int __ipipe_hard_cpuid(void)
> {
> +#if 0
> unsigned long flags;
> int cpu;
>
> @@ -81,6 +82,9 @@
> cpu = __ipipe_apicid_2_cpu[GET_APIC_ID(apic_read(APIC_ID))];
> local_irq_restore_hw_notrace(flags);
> return cpu;
> +#endif
> + return raw_smp_processor_id();
> +
> }
>
> #endif /* CONFIG_SMP */
> ------------------------- patch end ------------------
>
>
--
Philippe.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Xenomai-help] Xenomai and futexes - Native API optimized
@ 2007-05-10 17:23 Fillod Stephane
0 siblings, 0 replies; 13+ messages in thread
From: Fillod Stephane @ 2007-05-10 17:23 UTC (permalink / raw)
To: Jan Kiszka, M. Koehrer; +Cc: xenomai
Jan Kiszka wrote:
[...]
>> 2309 5.4062 xeno_nucleus xeno_nucleus (no symbols)
>
>We are lacking module symbols here. Dunno if one can teach them to
>oprofile (likely somehow), but the easiest approach would be to compile
>xeno_native and nucleus into the kernel.
You can pass --enable-debug to the configure script in order to enable
debug symbols (and don't strip programs/libs). For the kernel and modules,
define CONFIG_XENO_OPT_DEBUG and CONFIG_DEBUG_KERNEL/CONFIG_DEBUG_INFO
to have symbols and all. OProfile is only able to look up virtual addresses
when debug symbols are present in the file. You may have to pass the
--vmlinux option to opcontrol. Compiling Xenomai as suggested by Jan
will make your life easier, and will bring you an extra mini speedup
if you're in that kind of business.
--
Stephane
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Xenomai-help] Xenomai and futexes - Native API optimized for user space only applications
@ 2007-05-09 10:52 M. Koehrer
2007-05-09 16:14 ` Philippe Gerum
0 siblings, 1 reply; 13+ messages in thread
From: M. Koehrer @ 2007-05-09 10:52 UTC (permalink / raw)
To: xenomai
Hi everybody,
I am using Xenomai for a high performance real time simulation system.
All of the simulation code is executed within user space. One application is running that consists
of several Xenomai real time threads.
For performance reasons I am always using the latest PC technology available (which is currently
Pentium D or Core2Duo PCs, we are thinking even of using quad-core CPUs).
There has to be some kind of thread synchronisation, e.g. when accessing shared data.
This is done e.g. by using semaphores, mutexes etc.
Hardware I/O is done via rtnet or via PCI I/O boards that work in user space as well (PCI memory
mapped into the user space).
I like the Xenomai-native skin as it provides a very clear API that is easy to use.
However, for a user space only application it is a performance killer, that all API calls
lead to a mode switch from user to kernel space. Each API call takes about 1-2 microseconds (us)
on my PC which is really expensive.
Especially when inter process communication is used to protect the access to shared data
it is mostly the case that the calling thread does not have to wait. In this situation there is no
need for a context switch. The API call did not lead to a rescheduling of the available tasks.
And for this the required 1-2 us do really hurt.
Thus my question/proposal is if there is a plan to use a "variant" of the native API that is optimized for
user space only applications. In this case e.g. futexes can be used. If there is a need to
reschedule to another task it is fine to "invest" the 2us but it can be avoided mostly which should
increase the overall performance dramatically.
This would lead to a library where a big part of the functionality is handled directly in the library
(in user space). Currently the skin passes the (user space) API call via a Xenomai System call
to the kernel space to execute there the actual functionality.
Thanks for any feedback on this proposal.
Regards
Mathias
--
Mathias Koehrer
mathias_koehrer@domain.hid
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Xenomai-help] Xenomai and futexes - Native API optimized for user space only applications
2007-05-09 10:52 [Xenomai-help] Xenomai and futexes - Native API optimized for user space only applications M. Koehrer
@ 2007-05-09 16:14 ` Philippe Gerum
2007-05-10 8:31 ` [Xenomai-help] Xenomai and futexes - Native API optimized M. Koehrer
0 siblings, 1 reply; 13+ messages in thread
From: Philippe Gerum @ 2007-05-09 16:14 UTC (permalink / raw)
To: M. Koehrer; +Cc: xenomai
On Wed, 2007-05-09 at 12:52 +0200, M. Koehrer wrote:
> Hi everybody,
>
> I am using Xenomai for a high performance real time simulation system.
> All of the simulation code is executed within user space. One application is running that consists
> of several Xenomai real time threads.
> For performance reasons I am always using the latest PC technology available (which is currently
> Pentium D or Core2Duo PCs, we are thinking even of using quad-core CPUs).
> There has to be some kind of thread synchronisation, e.g. when accessing shared data.
> This is done e.g. by using semaphores, mutexes etc.
> Hardware I/O is done via rtnet or via PCI I/O boards that work in user space as well (PCI memory
> mapped into the user space).
> I like the Xenomai-native skin as it provides a very clear API that is easy to use.
> However, for a user space only application it is a performance killer, that all API calls
> lead to a mode switch from user to kernel space. Each API call takes about 1-2 microseconds (us)
> on my PC which is really expensive.
> Especially when inter process communication is used to protect the access to shared data
> it is mostly the case that the calling thread does not have to wait. In this situation there is no
> need for a context switch. The API call did not lead to a rescheduling of the available tasks.
> And for this the required 1-2 us do really hurt.
>
> Thus my question/proposal is if there is a plan to use a "variant" of the native API that is optimized for
> user space only applications. In this case e.g. futexes can be used. If there is a need to
> reschedule to another task it is fine to "invest" the 2us but it can be avoided mostly which should
> increase the overall performance dramatically.
> This would lead to a library where a big part of the functionality is handled directly in the library
> (in user space). Currently the skin passes the (user space) API call via a Xenomai System call
> to the kernel space to execute there the actual functionality.
>
> Thanks for any feedback on this proposal.
Yes, this is the way to go. It's the kind of rework introduced by the
NPTL for the glibc, and there is no reason to pay the kernel/user space
transition when no contention exists on the synch object.
I'm not sure the cost is as much as 2 us, unless you don't use the
sysenter/sysexit protocol for syscalls Xenomai manages properly
(--enable-x86-sep). It seems you are going through the ancient 0x80
exception vector, but still, I agree that the point remains: no
contention should mean no transition to kernel.
We could not use futexes, because we don't want to depend on the vanilla
infrastructure, which would lead to unbounded latencies. The point is
about properly sharing a piece of data between both spaces for each
synch object, and have some user-space accessible atomic ops to operate
on them. That's the simplest part; the other, more complex one is to
invest the time needed to achieve this (for all archs that may run this
way, i.e. not all archs Xenomai supports have atomic ops available
from user space, but those are a minority).
>
> Regards
>
> Mathias
>
>
>
>
>
--
Philippe.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Xenomai-help] Xenomai and futexes - Native API optimized
2007-05-09 16:14 ` Philippe Gerum
@ 2007-05-10 8:31 ` M. Koehrer
2007-05-10 9:26 ` Jan Kiszka
2007-05-10 9:46 ` Daniel Schnell
0 siblings, 2 replies; 13+ messages in thread
From: M. Koehrer @ 2007-05-10 8:31 UTC (permalink / raw)
To: rpm, mathias_koehrer; +Cc: xenomai
Hi Philippe,
>
> I'm not sure the cost is as much as 2 us, unless you don't use the
> sysenter/sysexit protocol for syscalls Xenomai manages properly
> (--enable-x86-sep). It seems you are going through the ancient 0x80
> exception vector, but still, I agree that the point remains: no
> contention should mean no transition to kernel.
>
I did some measurements to get precise values here.
For this I used the following Xenomai program that accesses a semaphore twice
within a loop (p followed by v).
---------------------------------
#include <stdio.h>
#include <sys/mman.h>
#include <native/task.h>
#include <native/sem.h>
#include <native/timer.h>

#define LOOPS 100000

RT_TASK taska_desc;
RT_SEM sem;

/* Time LOOPS uncontended P/V pairs on the semaphore. */
void mytaska(void *cookie)
{
    int i;
    RTIME tim1, tim2;

    tim1 = rt_timer_read();
    for (i = 0; i < LOOPS; i++) {
        rt_sem_p(&sem, TM_INFINITE);
        rt_sem_v(&sem);
    }
    tim2 = rt_timer_read();
    printf("delta is %llu per step: %llu\n",
           tim2 - tim1,
           (tim2 - tim1) / LOOPS);
}

int main(void)
{
    /* Lock all pages to avoid page faults in the real-time task. */
    mlockall(MCL_CURRENT | MCL_FUTURE);
    /* Initial count is 1, so rt_sem_p() never has to block in the loop. */
    rt_sem_create(&sem, "mysem", 1, 0);
    rt_task_create(&taska_desc, "mytaska", 0, 81, T_JOINABLE);
    rt_task_start(&taska_desc, &mytaska, NULL);
    rt_task_join(&taska_desc);
    rt_sem_delete(&sem);
    return 0;
}
-----------------------------
I measured the following runtime:
For configuring Xenomai without any option: 4.8us per step
Configuring Xenomai with --enable-x86-sep: 4.5us per step.
I ran this experiment on a 3.2 GHz Pentium D on a server main board with
E7230 chipset.
Xenomai 2.3.1, kernel 2.6.20.4, SMP
Thus the performance here is not really excellent (as there is no need to do
a task switch).
I agree with all the comments that recommended to improve the design first before
trying to improve the OS performance.
However, if there is an (RT)OS, I expect it to be as fast as possible.
Because, even if I have a "perfect" design, I need the OS - otherwise I would just
live without the OS.
A better design could help to reduce OS calls but it cannot avoid them completely.
Also, a lock-less design might lead to additional data copies, which are also a known
performance killer. So I think this is a decision that has to be made on the basis of the concrete
project. General rules do not help here.
Thanks for all comments!
Regards
Mathias
--
Mathias Koehrer
mathias_koehrer@domain.hid
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Xenomai-help] Xenomai and futexes - Native API optimized
2007-05-10 8:31 ` [Xenomai-help] Xenomai and futexes - Native API optimized M. Koehrer
@ 2007-05-10 9:26 ` Jan Kiszka
2007-05-10 13:02 ` M. Koehrer
2007-05-10 9:46 ` Daniel Schnell
1 sibling, 1 reply; 13+ messages in thread
From: Jan Kiszka @ 2007-05-10 9:26 UTC (permalink / raw)
To: M. Koehrer; +Cc: xenomai
[-- Attachment #1: Type: text/plain, Size: 1206 bytes --]
M. Koehrer wrote:
> ...
> I measured the following runtime:
> For configuring Xenomai without any option: 4.8us per step
> Configuring Xenomai with --enable-x86-sep: 4.5us per step.
>
> I ran this experiment on a 3.2 GHz Pentium D on a server main board with
> E7230 chipset.
> Xenomai 2.3.1, kernel 2.6.20.4, SMP
> Thus the performance here is not really excellent (as there is no need to do
> a task switch).
Mind running oprofile on this setup to see where the costs primarily come from?
>
> I agree with all the comments that recommended to improve the design first before
> trying to improve the OS performance.
> However, if there is an (RT)OS, I expect it to be as fast as possible.
And there it is again, the common misunderstanding: RTOSes are not
GPOSes, only faster. They provide services optimised for predictable and
low (in that order) worst-case performance.
So adding optimisations for whatever average case must not hurt those
goals significantly. And that often means your overall performance is
even worse than with a GPOS. No excuses for keeping Xenomai unoptimised
if it can be tuned smartly, just an explanation why we may not help in
every case.
Jan
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Xenomai-help] Xenomai and futexes - Native API optimized
2007-05-10 9:26 ` Jan Kiszka
@ 2007-05-10 13:02 ` M. Koehrer
2007-05-10 13:41 ` M. Koehrer
0 siblings, 1 reply; 13+ messages in thread
From: M. Koehrer @ 2007-05-10 13:02 UTC (permalink / raw)
To: jan.kiszka, mathias_koehrer; +Cc: xenomai
Hi Jan,
this is the first time I have used oprofile; I hope the results here are meaningful...
I wrote a bash script to start everything:
--------------------
#!/bin/bash
opcontrol --start
./performance
opcontrol --stop
opreport
--------------------
My real time test application is called "performance".
Here is the output:
------------------------------------------------------------
Profiler running.
delta is 2023521331 per step: 5058
Stopping profiling.
CPU: P4 / Xeon, speed 3192.18 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
GLOBAL_POWER_E...|
samples| %|
------------------
38103 76.4368 vmlinux
2659 5.3341 xeno_nucleus
2327 4.6681 libnative.so.0.0.0
2078 4.1686 bash
GLOBAL_POWER_E...|
samples| %|
------------------
2069 99.5669 bash
3 0.1444 anon (tgid:4769 range:0xb7fea000-0xb7feb000)
2 0.0962 anon (tgid:22703 range:0xb7fea000-0xb7feb000)
2 0.0962 anon (tgid:4883 range:0xb7fea000-0xb7feb000)
1 0.0481 anon (tgid:1883 range:0xb7fea000-0xb7feb000)
1 0.0481 anon (tgid:4942 range:0xb7fea000-0xb7feb000)
1587 3.1836 xeno_native
1514 3.0372 libc-2.3.6.so
907 1.8195 ld-2.3.6.so
317 0.6359 oprofiled
GLOBAL_POWER_E...|
samples| %|
------------------
315 99.3691 oprofiled
2 0.6309 anon (tgid:1855 range:0xb7fea000-0xb7feb000)
142 0.2849 performance
GLOBAL_POWER_E...|
samples| %|
------------------
130 91.5493 anon (tgid:22701 range:0xb7fea000-0xb7feb000)
12 8.4507 performance
91 0.1826 oprofile
33 0.0662 gawk
25 0.0502 e1000
24 0.0481 grep
7 0.0140 ls
4 0.0080 cat
4 0.0080 libncurses.so.5.5
3 0.0060 libpopt.so.0.0.0
3 0.0060 libselinux.so.1
3 0.0060 libsepol.so.1
3 0.0060 libdl-2.3.6.so
3 0.0060 libpthread-2.3.6.so
2 0.0040 mkdir
2 0.0040 ophelp
1 0.0020 lsmod
1 0.0020 mktemp
1 0.0020 libnss_compat-2.3.6.so
1 0.0020 libnss_files-2.3.6.so
1 0.0020 dirname
1 0.0020 id
1 0.0020 tr
1 0.0020 in.telnetd
------------------------------------------
>
> Mind to run oprofile on this setup to see where costs primarily come from?
>
Does this help??
Regards
Mathias
--
Mathias Koehrer
mathias_koehrer@domain.hid
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Xenomai-help] Xenomai and futexes - Native API optimized
2007-05-10 13:02 ` M. Koehrer
@ 2007-05-10 13:41 ` M. Koehrer
2007-05-10 16:21 ` Jan Kiszka
0 siblings, 1 reply; 13+ messages in thread
From: M. Koehrer @ 2007-05-10 13:41 UTC (permalink / raw)
To: jan.kiszka, mathias_koehrer; +Cc: xenomai
Hi Jan,
here is the (truncated) output of opreport -l (which seems to be more useful...).
Regards
Mathias
------------------
Profiler running.
delta is 2030247695 per step: 5075
Stopping profiling.
CPU: P4 / Xeon, speed 3192.18 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
samples % image name app name symbol name
27423 64.2074 vmlinux vmlinux __ipipe_hard_cpuid
2309 5.4062 xeno_nucleus xeno_nucleus (no symbols)
1489 3.4863 bash bash (no symbols)
1226 2.8705 vmlinux vmlinux sysenter_past_esp
1209 2.8307 xeno_native xeno_native (no symbols)
1091 2.5544 libc-2.3.6.so libc-2.3.6.so (no symbols)
792 1.8544 vmlinux vmlinux __ipipe_syscall_root
700 1.6390 vmlinux vmlinux __ipipe_dispatch_event
581 1.3603 vmlinux vmlinux page_fault
316 0.7399 vmlinux vmlinux __ipipe_stall_root
264 0.6181 vmlinux vmlinux do_wp_page
232 0.5432 oprofiled oprofiled (no symbols)
227 0.5315 vmlinux vmlinux __ipipe_test_and_stall_root
177 0.4144 ld-2.3.6.so ld-2.3.6.so do_lookup_x
157 0.3676 vmlinux vmlinux find_get_page
156 0.3653 vmlinux vmlinux __copy_from_user_ll_nozero
156 0.3653 vmlinux vmlinux __handle_mm_fault
128 0.2997 performance performance mytaska
122 0.2856 ld-2.3.6.so ld-2.3.6.so strcmp
106 0.2482 vmlinux vmlinux unmap_vmas
105 0.2458 vmlinux vmlinux __ipipe_handle_exception
97 0.2271 vmlinux vmlinux get_page_from_freelist
92 0.2154 vmlinux vmlinux search_by_key
86 0.2014 vmlinux vmlinux page_remove_rmap
85 0.1990 vmlinux vmlinux flush_tlb_page
82 0.1920 vmlinux vmlinux timer_interrupt
78 0.1826 vmlinux vmlinux filemap_nopage
77 0.1803 vmlinux vmlinux __ipipe_unstall_root
75 0.1756 vmlinux vmlinux release_pages
73 0.1709 oprofile oprofile (no symbols)
66 0.1545 vmlinux vmlinux __ipipe_restore_pipeline_head
64 0.1498 vmlinux vmlinux mwait_idle_with_hints
62 0.1452 ld-2.3.6.so ld-2.3.6.so _dl_relocate_object
60 0.1405 vmlinux vmlinux copy_page_range
60 0.1405 vmlinux vmlinux find_vma
56 0.1311 vmlinux vmlinux hrtimer_run_queues
52 0.1218 vmlinux vmlinux __ipipe_test_root
52 0.1218 vmlinux vmlinux up_read
50 0.1171 vmlinux vmlinux __ipipe_unstall_iret_root
49 0.1147 vmlinux vmlinux do_page_fault
47 0.1100 e1000 e1000 (no symbols)
47 0.1100 vmlinux vmlinux down_read_trylock
46 0.1077 vmlinux vmlinux error_code
40 0.0937 vmlinux vmlinux __ipipe_sync_stage
---------------------------
>
> >
> > Mind to run oprofile on this setup to see where costs primarily come
> from?
> >
--
Mathias Koehrer
mathias_koehrer@domain.hid
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Xenomai-help] Xenomai and futexes - Native API optimized
2007-05-10 13:41 ` M. Koehrer
@ 2007-05-10 16:21 ` Jan Kiszka
0 siblings, 0 replies; 13+ messages in thread
From: Jan Kiszka @ 2007-05-10 16:21 UTC (permalink / raw)
To: M. Koehrer; +Cc: xenomai
[-- Attachment #1: Type: text/plain, Size: 1217 bytes --]
M. Koehrer wrote:
> Hi Jan,
>
> here is the (truncated) output of opreport -l (which seems to be more useful...).
Are you sure you predominantly measured the semaphore scenario, and not
also things that happen before, after (or during) that test? Make
sure the test loop runs longer than setup/cleanup.
>
> Regards
>
> Mathias
> ------------------
> Profiler running.
> delta is 2030247695 per step: 5075
> Stopping profiling.
> CPU: P4 / Xeon, speed 3192.18 MHz (estimated)
> Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
> samples % image name app name symbol name
> 27423 64.2074 vmlinux vmlinux __ipipe_hard_cpuid
Mmpf, that's heavy. Can someone comment on this? Is the related hardware
access that costly or are we calling it too often?
> 2309 5.4062 xeno_nucleus xeno_nucleus (no symbols)
We are lacking module symbols here. Dunno if one can teach them to
oprofile (likely somehow), but the easiest approach would be to compile
xeno_native and nucleus into the kernel.
Jan
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Xenomai-help] Xenomai and futexes - Native API optimized
2007-05-10 8:31 ` [Xenomai-help] Xenomai and futexes - Native API optimized M. Koehrer
2007-05-10 9:26 ` Jan Kiszka
@ 2007-05-10 9:46 ` Daniel Schnell
2007-05-10 15:16 ` Philippe Gerum
1 sibling, 1 reply; 13+ messages in thread
From: Daniel Schnell @ 2007-05-10 9:46 UTC (permalink / raw)
To: M. Koehrer, rpm; +Cc: xenomai
Hi,
I find the results interesting.
Running your program on an MPC5200B at 396 MHz yields slightly better results: 3847 ns ~ 3.8 us per step.
For the rt_sem_v() operation alone it is 1778 ns ~ 1.8 us, which leaves 2069 ns ~ 2.1 us for rt_sem_p().
Indeed, for an operation that involves no context switch this is a _very_ long time. It would be interesting to see the figures for kernel-based operations.
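
To put these numbers next to the 4.5 us per step reported for the 3.2 GHz
Pentium D in the original message, it may help to express them in CPU
cycles. The following is only a back-of-the-envelope sketch: ns_to_cycles()
and remaining_ns() are hypothetical helpers, and the nominal clock rates
are assumed to be the effective ones.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a duration in ns into CPU cycles at a nominal clock given in
 * MHz. Integer arithmetic, result truncated. */
uint64_t ns_to_cycles(uint64_t ns, uint64_t cpu_mhz)
{
    assert(cpu_mhz > 0);
    return ns * cpu_mhz / 1000;
}

/* Cost of the second operation of a pair when only the whole pair and
 * the first operation alone were timed. */
uint64_t remaining_ns(uint64_t pair_ns, uint64_t first_ns)
{
    assert(first_ns <= pair_ns);
    return pair_ns - first_ns;
}
```

With the figures above, remaining_ns(3847, 1778) gives the 2069 ns
attributed to rt_sem_p(), ns_to_cycles(3847, 396) gives about 1523 cycles
per step on the MPC5200B, and ns_to_cycles(4500, 3200) gives 14400 cycles
per step on the Pentium D. Under these assumptions the x86 box spends
roughly nine times as many cycles per step, which suggests the cost there
is not dominated by raw CPU speed.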
I have not measured mutexes so far, but the figures are probably similar. If so, this is really something one should try to improve, because one has to "pay" a lot for the relatively unlikely event of a reschedule. And mutexes are, and will/should be, used a lot.
I have to support Mathias here: it is not as easy as blaming bad application design or the tooling used. One often has dependencies on legacy code, tools, or external libraries. This is especially true in industrial environments, where standard protocols and interfaces are used; just think of a full-featured CANopen protocol stack, which typically costs a couple of thousand €. One simply cannot create a software "master plan" and fix all possible design flaws in those external components, least of all in external binary-only libraries.
We are currently working out the specifications of our Xenomai platform, i.e. max. context switch times, performance of the POSIX skin, etc.
I will post the results and the source code when we are finished.
Best regards,
--
Daniel Schnell | daniel.schnell@marel.com
Hugbúnaðargerð | www.marel.com
-----Original Message-----
From: xenomai-help-bounces@gna.org [mailto:xenomai-help-bounces@gna.org] On Behalf Of M. Koehrer
Sent: 10 May 2007 08:31
To: rpm@xenomai.org; mathias_koehrer@arcor.de
Cc: xenomai-help@gna.org
Subject: Re: [Xenomai-help] Xenomai and futexes - Native API optimized
Hi Philippe,
>
> I'm not sure the cost is as much as 2 us, unless you don't use the
> sysenter/sysexit protocol for syscalls Xenomai manages properly
> (--enable-x86-sep). It seems you are going through the ancient 0x80
> exception vector, but still, I agree that the point remains: no
> contention should mean no transition to kernel.
>
I did some measurements to get precise values here.
For this I used the following Xenomai program that accesses a semaphore twice within a loop (p followed by v).
---------------------------------
#include <stdio.h>      /* printf() */
#include <sys/mman.h>   /* mlockall() */
#include <native/task.h>
#include <native/sem.h>
#include <native/timer.h>

#define LOOPS 100000

RT_TASK taska_desc;
RT_SEM sem;

/* Time LOOPS uncontended rt_sem_p()/rt_sem_v() pairs. */
void mytaska(void *cookie)
{
    int i;
    RTIME tim1, tim2;

    tim1 = rt_timer_read();
    for (i = 0; i < LOOPS; i++) {
        rt_sem_p(&sem, TM_INFINITE);
        rt_sem_v(&sem);
    }
    tim2 = rt_timer_read();

    /* Total time and time per p/v pair, both in nanoseconds. */
    printf("delta is %llu per step: %llu\n",
           tim2 - tim1,
           (tim2 - tim1) / LOOPS);
}

int main(void)
{
    /* Lock all pages so the real-time task never takes a page fault. */
    mlockall(MCL_CURRENT | MCL_FUTURE);
    rt_sem_create(&sem, "mysem", 1, 0);
    rt_task_create(&taska_desc, "mytaska", 0, 81, T_JOINABLE);
    rt_task_start(&taska_desc, &mytaska, NULL);
    rt_task_join(&taska_desc);
    rt_sem_delete(&sem);
    return 0;
}
-----------------------------
I measured the following runtime:
Configuring Xenomai without any option: 4.8 us per step.
Configuring Xenomai with --enable-x86-sep: 4.5 us per step.
I ran this experiment on a 3.2 GHz Pentium D on a server main board with E7230 chipset.
Xenomai 2.3.1, kernel 2.6.20.4, SMP
Thus the performance here is not really excellent (as there is no need to do a task switch).
I agree with all the comments that recommended improving the design first before trying to improve the OS performance.
However, if there is an (RT)OS, I expect it to be as fast as possible,
because even with a "perfect" design I still need the OS - otherwise I would just live without it.
A better design can help to reduce OS calls, but it cannot avoid them completely.
Also, a lock-less design might lead to additional data copies, which are also a known performance killer. So I think this is a decision that has to be made on the basis of the concrete project. General rules do not help here.
Thanks for all comments!
Regards
Mathias
--
Mathias Koehrer
mathias_koehrer@arcor.de
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Xenomai-help] Xenomai and futexes - Native API optimized
2007-05-10 9:46 ` Daniel Schnell
@ 2007-05-10 15:16 ` Philippe Gerum
0 siblings, 0 replies; 13+ messages in thread
From: Philippe Gerum @ 2007-05-10 15:16 UTC (permalink / raw)
To: Daniel Schnell; +Cc: xenomai, M. Koehrer
On Thu, 2007-05-10 at 09:46 +0000, Daniel Schnell wrote:
> If so, this is really something one should try to improve, because one has to "pay" a lot
> for the relatively unlikely event of a reschedule. And mutexes are and will/should be used a lot.
>
> I have to support Mathias here: it is not so easy as blaming bad design of the application or the used tooling.
> One often has dependencies on legacy code / tools / external libraries.
Ack. Xenomai is very much about helping people to migrate from legacy RT
environments to Linux-based systems; this is why Xenomai has a skin
layer instead of a single wired-in interface in the first place. In that
sense, we _must_ accept the fact that such code needs to be ported with
the fewest changes that make the transition possible, basically because
however mis-designed or sloppy it may be, such code often has one
single very desirable property: it _works in the field_ (damnit! (c)).
An awful lot of effort has likely been put into fixing the real-world
issues which popped up during the application's lifetime, effort no one
wants to repeat, and any knowledge about those bugs and how they were
fixed has often vanished many moons ago.
To sum up, I have no problem trying hard to make some ugly legacy code
shine by means of a smart implementation on our side, that's business
as usual after all.
This is not to say that we should plumb any number of evil core
interfaces into Xenomai, or tweak the existing ones, just for making new
things as silly or ugly as they might have been in the legacy context.
But, improving the performance of mutual exclusion would _not_ impact
any Xenomai interface, but only the innards of some skins, for the best
in terms of performance. So, FWIW, I agree, we must improve this when
time allows, and yes, it should probably be high enough priority-wise in
the TODO list, especially if measurements confirm the problem.
I see two distinct improvements we could achieve:
- optimize our syscall path, between user-space and the nucleus. This
would have a positive impact on every syscall from every skin. My
feeling is that the I-cache penalty is way too high presently when
dispatching syscall events within the Adeos layer. We must do so while
keeping the pipeline scheme, so that we can still downgrade syscalls from
the Xenomai domain to the Linux one when needed.
- implement the fast acquisition path for core Xenomai mutexes (to be
defined) in the non-contended case, using a piece of pinned memory
shared between kernel and user spaces.
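
The second point could be sketched roughly as follows. This is not a
Xenomai design: it borrows the three-state lock word commonly used for
futex-based mutexes, and every name in it (fast_mutex, fm_lock(),
fm_slow_wait(), fm_slow_wake()) is hypothetical. The two slow-path
functions are placeholders for whatever syscalls would block and wake
waiters; here they only count invocations and yield the CPU.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>

enum { FM_FREE = 0, FM_LOCKED = 1, FM_CONTENDED = 2 };

/* In the scheme described above, 'state' would live in pinned memory
 * shared between kernel and user space. */
typedef struct {
    atomic_int state;
    atomic_ulong slow_waits;  /* number of simulated blocking syscalls */
    atomic_ulong slow_wakes;  /* number of simulated wake-up syscalls */
} fast_mutex;

void fm_init(fast_mutex *m)
{
    atomic_init(&m->state, FM_FREE);
    atomic_init(&m->slow_waits, 0);
    atomic_init(&m->slow_wakes, 0);
}

/* Placeholder for the syscall that would put the caller to sleep. */
void fm_slow_wait(fast_mutex *m)
{
    atomic_fetch_add(&m->slow_waits, 1);
    sched_yield();
}

/* Placeholder for the syscall that would wake up one waiter. */
void fm_slow_wake(fast_mutex *m)
{
    atomic_fetch_add(&m->slow_wakes, 1);
}

/* Fast path only: FREE -> LOCKED with one atomic operation in user
 * space. Returns non-zero on success. */
int fm_trylock(fast_mutex *m)
{
    int expected = FM_FREE;

    return atomic_compare_exchange_strong_explicit(
        &m->state, &expected, FM_LOCKED,
        memory_order_acquire, memory_order_relaxed);
}

void fm_lock(fast_mutex *m)
{
    if (fm_trylock(m))
        return;  /* non-contended case: no transition to the kernel */

    /* Contended: flag the lock so the holder knows it must wake us,
     * then wait until the exchange finds the lock free. */
    while (atomic_exchange_explicit(&m->state, FM_CONTENDED,
                                    memory_order_acquire) != FM_FREE)
        fm_slow_wait(m);
}

void fm_unlock(fast_mutex *m)
{
    int prev = atomic_exchange_explicit(&m->state, FM_FREE,
                                        memory_order_release);

    assert(prev != FM_FREE);  /* unlocking a free mutex is a caller bug */
    if (prev == FM_CONTENDED)
        fm_slow_wake(m);      /* only now is a kernel call needed */
}
```

In the non-contended case a lock/unlock pair costs two atomic operations
and neither placeholder is reached. A real implementation would also have
to deal with priority inheritance, owner tracking and cleanup when a
holder dies, all of which this sketch ignores.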
--
Philippe.
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2007-05-11 8:21 UTC | newest]
Thread overview: 13+ messages
2007-05-11 7:08 [Xenomai-help] Xenomai and futexes - Native API optimized M. Koehrer
2007-05-11 7:37 ` M. Koehrer
2007-05-11 7:44 ` Jan Kiszka
2007-05-11 8:10 ` M. Koehrer
2007-05-11 8:21 ` Philippe Gerum
-- strict thread matches above, loose matches on Subject: below --
2007-05-10 17:23 Fillod Stephane
2007-05-09 10:52 [Xenomai-help] Xenomai and futexes - Native API optimized for user space only applications M. Koehrer
2007-05-09 16:14 ` Philippe Gerum
2007-05-10 8:31 ` [Xenomai-help] Xenomai and futexes - Native API optimized M. Koehrer
2007-05-10 9:26 ` Jan Kiszka
2007-05-10 13:02 ` M. Koehrer
2007-05-10 13:41 ` M. Koehrer
2007-05-10 16:21 ` Jan Kiszka
2007-05-10 9:46 ` Daniel Schnell
2007-05-10 15:16 ` Philippe Gerum