* [Xenomai-core] [RFC] Micro-optimisations for the libs
From: Jan Kiszka @ 2006-05-06 8:09 UTC
To: xenomai-core; +Cc: Daniel.Rossier

Hi,

[Daniel, I put you in the CC as you showed some interest in this topic.]

As I indicated some weeks ago, I had a closer look at the code the
user space libs currently produce (on x86). The following considerations
are certainly not worth noticeable microseconds on GHz boxes, but they
may buy us (yet another) few micros on low-end hardware.

First of all, there is some redundant code in the syscall path of each
skin service. This is due to the fact that the function code is
calculated based on the skin mux id each time a service is invoked.
The mux id has to be shifted and masked in order to combine it with the
constant function code part. This could just as easily happen ahead of
time, saving code and cycles for each service entry point.

Here is a commented disassembly of a simple native skin service which
takes only one argument.


Function prologue:
 460:   55                      push   %ebp
 461:   89 e5                   mov    %esp,%ebp
 463:   57                      push   %edi
 464:   83 ec 10                sub    $0x10,%esp

Loading the skin mux-id:
 467:   a1 00 00 00 00          mov    0x0,%eax

Loading the argument (here: some pointer):
 46c:   8b 7d 08                mov    0x8(%ebp),%edi

Calculating the function code:
 46f:   c1 e0 10                shl    $0x10,%eax
 472:   25 00 00 ff 00          and    $0xff0000,%eax
 477:   0d 2b 02 00 08          or     $0x800022b,%eax

Saving the code:
 47c:   89 45 f8                mov    %eax,0xfffffff8(%ebp)

 47f:   53                      push   %ebx

Loading the arguments (here only one):
 480:   89 fb                   mov    %edi,%ebx

Restoring the code again, issuing the syscall:
 482:   8b 45 f8                mov    0xfffffff8(%ebp),%eax
 485:   cd 80                   int    $0x80

 487:   5b                      pop    %ebx

Function epilogue:
 488:   83 c4 10                add    $0x10,%esp
 48b:   5f                      pop    %edi
 48c:   5d                      pop    %ebp
 48d:   c3                      ret


Looking at this code, I also started thinking about inlining short and
probably heavily-used functions into the user code. This would save the
function prologue/epilogue both in the lib and the user code itself. For
sure, it only makes sense for time-critical functions (think of
mutex_lock/unlock or rt_timer_read). But inlining could be made optional
for the user by providing both the library variant and the inlined
version. The users could then select the preferred one by #defining some
control switch before including the skin headers.

Any thoughts on this? And, almost more important, anyone around willing
to work on these optimisations and evaluate the results? I can't ATM.

Jan
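
The ahead-of-time idea from the message above can be sketched in C as
follows. This is only an illustration, not the actual library code: the
shift, mask and constant are taken from the disassembly, while
bind_skin(), raw_muxid, premixed_muxid and the other names are
hypothetical. It assumes the library has some one-time bind step where
the mux id becomes known.

```c
#include <assert.h>
#include <stdint.h>

/* Shift, mask and constant as they appear in the disassembly above. */
#define MUX_SHIFT 16
#define MUX_MASK  0x00ff0000u
#define OP_CONST  0x0800022bu   /* constant part of one service's code */

/* Hypothetical library state, not the actual Xenomai variables. */
static uint32_t raw_muxid;      /* id as obtained when the skin is bound */
static uint32_t premixed_muxid; /* id shifted and masked once, at bind time */

/* Hypothetical bind hook: do the shift and the mask exactly once. */
static void bind_skin(uint32_t muxid)
{
    raw_muxid = muxid;
    premixed_muxid = (muxid << MUX_SHIFT) & MUX_MASK;
}

/* What each entry point computes in the listing: shl + and + or. */
static uint32_t code_per_call(void)
{
    return ((raw_muxid << MUX_SHIFT) & MUX_MASK) | OP_CONST;
}

/* Ahead-of-time variant: a single or on the stored value. */
static uint32_t code_precomputed(void)
{
    return premixed_muxid | OP_CONST;
}

/* Both variants must produce the same code for every 8-bit mux id. */
static void check_equivalence(void)
{
    for (uint32_t id = 0; id < 256; id++) {
        bind_skin(id);
        assert(code_per_call() == code_precomputed());
    }
}
```

With a mux id of 3, both variants yield 0x0803022b. Whether the saved
shl/and pair is measurable on low-end hardware is exactly what would
need benchmarking.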
* Re: [Xenomai-core] [RFC] Micro-optimisations for the libs
From: Philippe Gerum @ 2006-05-06 10:37 UTC
To: Jan Kiszka; +Cc: Daniel.Rossier, xenomai-core

Jan Kiszka wrote:
> Looking at this code, I also started thinking about inlining short and
> probably heavily-used functions into the user code. This would save the
> function prologue/epilogue both in the lib and the user code itself. For
> sure, it only makes sense for time-critical functions (think of
> mutex_lock/unlock or rt_timer_read). But inlining could be made optional

The best optimization for rt_timer_read() would be to do the
cycles-to-ns conversion in user-space from a direct TSC reading if the
arch supports it (most do). Of course, this would only be possible for
strictly aperiodic timing setups (i.e. CONFIG_XENO_OPT_PERIODIC_TIMING
off).

For rt_mutex_lock()/unlock(), we still need to refrain from calling
the kernel for uncontended access by using some Xeno equivalent of the
futex approach, which would suppress most of the incentive to
micro-optimize the call itself.

> for the user by providing both the library variant and the inlined
> version. The users could then select the preferred one by #defining some
> control switch before including the skin headers.
>
> Any thoughts on this? And, almost more important, anyone around willing
> to work on these optimisations and evaluate the results? I can't ATM.
>

Quite frankly, I remember that I once had to clean up the LXRT inlining
support in RTAI 3.0/3.1, and this was far from being fun stuff to do.
Basically, AFAICT, having both inline and out-of-line support for
library calls almost invariably ends up as a maintenance nightmare of
some sort, e.g. depending on whether you compile with gcc's optimization
on or not, which might be dictated by whether one also wants
(exploitable) debug information or not, and so on. Not to speak of the
fact that you end up having two implementations to maintain separately.

This said, only the figures would tell us whether such inlining brings
something significant to the picture performance-wise on low-end hw, so
I'd be interested to see those first.

--

Philippe.
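
The cycles-to-ns conversion mentioned for rt_timer_read() is plain
integer arithmetic once a counter value and its frequency are at hand.
The sketch below shows one overflow-safe way to do it. It is an
assumption-laden illustration: cycles_to_ns() and its parameters are
hypothetical names, the counter frequency is assumed to be made
available to user space by some means not shown here, and reading the
TSC itself is left to the caller.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ull

/*
 * Convert a raw cycle count to nanoseconds without a system call.
 * freq_hz is the counter frequency in Hz (assumed to be known to
 * user space already). Whole seconds and the remainder are converted
 * separately so that the intermediate product cannot overflow
 * 64 bits for frequencies up to 10 GHz.
 */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t freq_hz)
{
    assert(freq_hz != 0 && freq_hz <= 10000000000ull);

    uint64_t secs = cycles / freq_hz;   /* whole seconds */
    uint64_t rem  = cycles % freq_hz;   /* leftover cycles, < freq_hz */

    /* rem * NSEC_PER_SEC < 1e10 * 1e9 = 1e19, which fits in uint64_t. */
    return secs * NSEC_PER_SEC + (rem * NSEC_PER_SEC) / freq_hz;
}
```

For example, 3000000000 cycles of a 1.5 GHz counter convert to
2000000000 ns. The result is truncated, not rounded, and as noted in
the message this only applies when timing is strictly aperiodic.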
* Re: [Xenomai-core] [RFC] Micro-optimisations for the libs
From: Wolfgang Grandegger @ 2006-05-08 10:51 UTC
To: Philippe Gerum; +Cc: Daniel.Rossier, Jan Kiszka, xenomai-core

Philippe Gerum wrote:
> Quite frankly, I remember that I once had to clean up the LXRT inlining
> support in RTAI 3.0/3.1, and this was far from being fun stuff to do.
> Basically, AFAICT, having both inline and out-of-line support for
> library calls almost invariably ends up as a maintenance nightmare of
> some sort, e.g. depending on whether you compile with gcc's optimization
> on or not, which might be dictated by whether one also wants
> (exploitable) debug information or not, and so on. Not to speak of the
> fact that you end up having two implementations to maintain separately.
>
> This said, only the figures would tell us whether such inlining brings
> something significant to the picture performance-wise on low-end hw, so
> I'd be interested to see those first.

I agree! I also doubt that using inline functions will improve
latencies. Larger code also results in more TLB misses and cache
refills.

Wolfgang.
* Re: [Xenomai-core] [RFC] Micro-optimisations for the libs
From: Jan Kiszka @ 2006-05-08 12:01 UTC
To: Wolfgang Grandegger, Philippe Gerum; +Cc: Daniel.Rossier, xenomai-core

Wolfgang Grandegger wrote:
> Philippe Gerum wrote:
>> The best optimization for rt_timer_read() would be to do the
>> cycles-to-ns conversion in user-space from a direct TSC reading if the
>> arch supports it (most do). Of course, this would only be possible for
>> strictly aperiodic timing setups (i.e. CONFIG_XENO_OPT_PERIODIC_TIMING
>> off).
>>
>> For rt_mutex_lock()/unlock(), we still need to refrain from
>> calling the kernel for uncontended access by using some Xeno
>> equivalent of the futex approach, which would suppress most of the
>> incentive to micro-optimize the call itself.

Ack. That's a bit more complex to realise but should be put on our
to-do list as well.

>> Quite frankly, I remember that I once had to clean up the LXRT
>> inlining support in RTAI 3.0/3.1, and this was far from being fun
>> stuff to do. Basically, AFAICT, having both inline and out-of-line
>> support for library calls almost invariably ends up as a maintenance
>> nightmare of some sort, e.g. depending on whether you compile with
>> gcc's optimization on or not, which might be dictated by whether one
>> also wants (exploitable) debug information or not, and so on. Not to
>> speak of the fact that you end up having two implementations to
>> maintain separately.
>>
>> This said, only the figures would tell us whether such inlining
>> brings something significant to the picture performance-wise on
>> low-end hw, so I'd be interested to see those first.
>
> I agree! I also doubt that using inline functions will improve
> latencies. Larger code also results in more TLB misses and cache
> refills.
>

I'm definitely not arguing for the aggressive inlining RTAI provides
(all or nothing). I'm rather suggesting selective, optional inlining of
trivial and heavily-used stubs. Think of something like rt_sem_v() for
the library call vs. rt_sem_v_inlined() for an unrolled variant.
rt_sem_v() could simply be implemented by calling rt_sem_v_inlined(),
so there should be no reason for the maintenance hell to be reopened.

I think the order I presented should be kept when looking at these
optimisations:

1. reduce the complexity of existing syscall code,
2. consider if we can provide any benefit to the user by offering
   /some/ functions as optional inlines.

Jan
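
The rt_sem_v() / rt_sem_v_inlined() arrangement described in the last
message might look roughly like the sketch below. It is not the real
native skin code: the semaphore type, the fake syscall and the counter
are stand-ins invented here so that the example compiles on its own.
Only the two function names come from the thread. The point it tries to
show is that the stub body exists once, in the inline function, and the
library entry point merely forwards to it.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a semaphore descriptor; the real type differs. */
typedef struct sketch_sem {
    int count;
} sketch_sem;

/* Counts how often the (fake) kernel entry was reached. */
static int sketch_syscalls;

/* Stand-in for the actual syscall trap issued by the stub. */
static int sketch_sem_v_syscall(sketch_sem *sem)
{
    assert(sem != NULL);
    sem->count++;
    sketch_syscalls++;
    return 0;
}

/*
 * Would live in the skin header: the single implementation of the
 * stub, available to users who opt in to inlining.
 */
static inline int rt_sem_v_inlined(sketch_sem *sem)
{
    return sketch_sem_v_syscall(sem);
}

/*
 * Would live in the library: the out-of-line variant simply calls
 * the inline one, so there is no second implementation to maintain.
 */
int rt_sem_v(sketch_sem *sem)
{
    return rt_sem_v_inlined(sem);
}
```

A user-side control switch, as proposed in the first message, could
then be a macro in the header that maps the service name to one variant
or the other. Whether the inlined variant helps at all remains subject
to the measurements asked for earlier in the thread.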