Wolfgang Grandegger wrote:
> Philippe Gerum wrote:
>> Jan Kiszka wrote:
>>> Hi,
>>>
>>> [Daniel, I put you in the CC as you showed some interest in this topic.]
>>>
>>> as I indicated some weeks ago, I had a closer look at the code the
>>> user space libs currently produce (on x86). The following considerations
>>> are certainly not worth noticeable microseconds on GHz boxes, but they
>>> may buy us (yet another) few micros on low-end.
>>>
>>> First of all, there is some redundant code in the syscall path of each
>>> skin service. This is due to the fact that the function code is
>>> calculated based on the skin mux id each time a service is invoked.
>>> The mux id has to be shifted and masked in order to combine it with the
>>> constant function code part - this could also easily happen
>>> ahead-of-time, saving code and cycles for each service entry point.
>>>
>>> Here is a commented disassembly of some simple native skin service which
>>> only takes one argument.
>>>
>>>
>>> Function prologue:
>>>  460:   55                      push   %ebp
>>>  461:   89 e5                   mov    %esp,%ebp
>>>  463:   57                      push   %edi
>>>  464:   83 ec 10                sub    $0x10,%esp
>>>
>>> Loading the skin mux-id:
>>>  467:   a1 00 00 00 00          mov    0x0,%eax
>>>
>>> Loading the argument (here: some pointer):
>>>  46c:   8b 7d 08                mov    0x8(%ebp),%edi
>>>
>>> Calculating the function code:
>>>  46f:   c1 e0 10                shl    $0x10,%eax
>>>  472:   25 00 00 ff 00          and    $0xff0000,%eax
>>>  477:   0d 2b 02 00 08          or     $0x800022b,%eax
>>>
>>> Saving the code:
>>>  47c:   89 45 f8                mov    %eax,0xfffffff8(%ebp)
>>>
>>>  47f:   53                      push   %ebx
>>>
>>> Loading the arguments (here only one):
>>>  480:   89 fb                   mov    %edi,%ebx
>>>
>>> Restoring the code again, issuing the syscall:
>>>  482:   8b 45 f8                mov    0xfffffff8(%ebp),%eax
>>>  485:   cd 80                   int    $0x80
>>>
>>>  487:   5b                      pop    %ebx
>>>
>>> Function epilogue:
>>>  488:   83 c4 10                add    $0x10,%esp
>>>  48b:   5f                      pop    %edi
>>>  48c:   5d                      pop    %ebp
>>>  48d:   c3                      ret
>>>
>>>
>>> Looking at this code, I also started thinking about inlining short and
>>> probably heavily-used functions into the user code. This would save the
>>> function prologue/epilogue both in the lib and the user code itself. For
>>> sure, it only makes sense for time-critical functions (think of
>>> mutex_lock/unlock or rt_timer_read). But inlining could be made optional
>>
>> The best optimization for rt_timer_read() would be to do the
>> cycles-to-ns conversion in user-space from a direct TSC reading if the
>> arch supports it (most do). Of course, this would only be possible for
>> strictly aperiodic timing setups (i.e. CONFIG_XENO_OPT_PERIODIC_TIMING
>> off).
>>
>> For the rt_mutex_lock()/unlock(), we still need to refrain from
>> calling the kernel for uncontended access by using some Xeno
>> equivalent of the futex approach, which would suppress most of the
>> incentive to micro-optimize the call itself.

Ack. That's a bit more complex to realise but should be put on our to-do
list as well.
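
To make clear what "futex approach" means here, a rough sketch follows.
It is only an illustration under loud assumptions: the demo_* names are
made up, the wait/wake helpers are stand-ins for nucleus syscalls that
do not exist in this form, and priority inheritance and owner tracking
are ignored completely. The point is just that lock and unlock stay in
user space as long as there is no contention.

```c
#include <sched.h>
#include <stdatomic.h>

/* 0 = free, 1 = locked, 2 = locked and possibly contended */
typedef struct {
    atomic_int state;
} demo_mutex_t;

/* Counts trips through the (simulated) kernel, for illustration only. */
static atomic_int demo_slow_calls;

/* Hypothetical stand-ins for the sleep/wake syscalls. */
static void demo_kernel_wait(demo_mutex_t *m)
{
    (void)m;
    atomic_fetch_add(&demo_slow_calls, 1);
    sched_yield();      /* a real version would block in the nucleus */
}

static void demo_kernel_wake(demo_mutex_t *m)
{
    (void)m;
    atomic_fetch_add(&demo_slow_calls, 1);
}

static inline void demo_mutex_lock(demo_mutex_t *m)
{
    int c = 0;

    /* Fast path: free -> locked, no syscall at all. */
    if (atomic_compare_exchange_strong(&m->state, &c, 1))
        return;

    /* Slow path: mark as contended, then wait until we grab the lock. */
    if (c != 2)
        c = atomic_exchange(&m->state, 2);
    while (c != 0) {
        demo_kernel_wait(m);
        c = atomic_exchange(&m->state, 2);
    }
}

static inline void demo_mutex_unlock(demo_mutex_t *m)
{
    /* Fast path: nobody flagged contention, plain user-space release. */
    if (atomic_fetch_sub(&m->state, 1) == 1)
        return;

    /* Slow path: release and wake a possible waiter. */
    atomic_store(&m->state, 0);
    demo_kernel_wake(m);
}
```

With a single task locking and unlocking, demo_slow_calls stays at zero,
i.e. the kernel is never entered.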

>>
>>> for the user by providing both the library variant and the inlined
>>> version. The users could then select the preferred one by #defining some
>>> control switch before including the skin headers.
>>>
>>> Any thoughts on this? And, almost more important, anyone around willing
>>> to work on these optimisations and evaluate the results? I can't ATM.
>>>
>>
>> Quite frankly, I remember that I once had to clean up the LXRT
>> inlining support in RTAI 3.0/3.1, and this was far from being fun
>> stuff to do. Basically, AFAICT, having both inline and out-of-line
>> support for library calls almost invariably ends up as a maintenance
>> nightmare of some sort, e.g. depending on whether to compile with gcc's
>> optimization on or not, which might be dictated by the fact that one
>> also wants (exploitable) debug information or not, and so on. Not to
>> speak of the fact that you end up having two implementations to
>> maintain separately.
>>
>> This said, only the figures would tell us if such inlining brings
>> something significant or not to the picture performance-wise on
>> low-end hw, so I'd be interested to see those first.
> 
> I agree! I also have doubts that using inline functions will improve
> latencies. Larger code also results in more TLB misses and cache refills.
> 

I'm definitely not arguing for the aggressive inlining RTAI provides
(all or nothing). I'm rather suggesting selective optional inlining of
trivial and heavily-used stubs. Think of something like rt_sem_v() for
the library call vs. rt_sem_v_inlined() for an unrolled variant.
rt_sem_v() could simply be implemented by calling rt_sem_v_inlined(),
so there is only one implementation to maintain and no reason for the
maintenance hell to reopen.
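
As a minimal sketch of that layering (all demo_* names are hypothetical,
and the syscall trap is replaced by a counting stand-in so the fragment
compiles anywhere), the header would carry the inline variant and the
library would just wrap it:

```c
/* Stand-in for the real syscall trap; it only counts invocations. */
static int demo_trap_count;

static int demo_skincall(unsigned long code, void *arg)
{
    (void)code;
    (void)arg;
    demo_trap_count++;
    return 0;
}

/* Placeholder descriptor type, not the real semaphore layout. */
typedef struct {
    unsigned long handle;
} DEMO_SEM;

/* Header part: the one and only implementation of the stub. */
static inline int demo_sem_v_inlined(DEMO_SEM *sem)
{
    /* The function code here is just the constant from the disassembly. */
    return demo_skincall(0x0800022bUL, sem);
}

/* Library part: the out-of-line entry point reuses the inline body. */
int demo_sem_v(DEMO_SEM *sem)
{
    return demo_sem_v_inlined(sem);
}
```

Users who do not care keep calling demo_sem_v(); users who want to save
the extra call on a hot path use demo_sem_v_inlined() directly. Both end
up in the same code, so a fix to the stub is made in one place only.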

I think the order I presented should be kept when looking at these
optimisations:
1. reduce the complexity of existing syscall code,
2. consider if we can provide any benefit to the user by offering /some/
functions as optional inlines.
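
For point 1, the ahead-of-time calculation could look roughly like the
sketch below. The shift, mask and constant are taken from the
disassembly of the service discussed in this thread; the demo_* names
and the bind hook are assumptions made up for illustration, not existing
library code.

```c
#include <stdint.h>

/* Constant function code part, as seen in the disassembly. */
#define DEMO_CONST_PART 0x0800022bu

static uint32_t demo_muxid;          /* raw mux id, as stored today */
static uint32_t demo_muxid_shifted;  /* (id << 16) & 0xff0000, computed once */

/* Hypothetical hook, called once when the skin gets bound. */
void demo_bind(uint32_t muxid)
{
    demo_muxid = muxid;
    demo_muxid_shifted = (muxid << 16) & 0x00ff0000u;
}

/* Current scheme: shift, mask and OR in every service stub. */
static inline uint32_t demo_code_per_call(void)
{
    return ((demo_muxid << 16) & 0x00ff0000u) | DEMO_CONST_PART;
}

/* Proposed scheme: a single OR in every service stub. */
static inline uint32_t demo_code_precomputed(void)
{
    return demo_muxid_shifted | DEMO_CONST_PART;
}
```

Both helpers return the same value for any mux id; the second one just
moves the shl/and pair out of each entry point.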

Jan