[Xenomai-core] ns vs. tsc as internal timer base

* [Xenomai-core] ns vs. tsc as internal timer base
@ 2006-06-13 10:51 Jan Kiszka
  2006-06-13 11:16 ` Philippe Gerum
                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Jan Kiszka @ 2006-06-13 10:51 UTC (permalink / raw)
  To: xenomai-core

[-- Attachment #1: Type: text/plain, Size: 1384 bytes --]

Hi,

between some football half-times of the last days ;), I played a bit
with a hand-optimised xnarch_tsc_to_ns() for x86. Using scaled math, I
achieved conversions that are 3 (P-I 133 MHz) to 4 times (P-M 1.3 GHz)
faster than with the current variant. While this optimisation only
saves a few tens of nanoseconds on high-end CPUs, slow processors can
gain several hundred nanoseconds per conversion (my P-133: -600 ns).

This does not come for free: accuracy of very large values is slightly
worse, but that's likely negligible compared to the clock accuracy of
TSCs (does anyone have any real numbers on the latter, BTW?).
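
To put a rough number on that accuracy loss (a back-of-the-envelope
sketch, not a measurement): assuming the multiplier is the rounded-down
value of 10^9 * 2^shift / cpu_freq, the scaled result falls short of the
exact one by less than tsc / 2^shift + 1 ns. The helper name below is
made up for this illustration.

```c
#include <stdint.h>

/*
 * Upper bound, in ns, on how far (tsc * mult) >> shift can fall below
 * the exact conversion, assuming mult = floor(10^9 * 2^shift / cpu_hz).
 * The real-valued error is < tsc / 2^shift + 1, hence the "+ 2" once
 * tsc / 2^shift itself is rounded down. Assumes shift < 64.
 */
static uint64_t sketch_error_bound_ns(uint64_t tsc, unsigned int shift)
{
    return (tsc >> shift) + 2;
}
```

For one hour worth of cycles at 1.3 GHz and a shift of 31 this yields
about 2.2 us, i.e. below 1 ppb of the converted interval, whereas
crystal oscillators are commonly specified in the range of tens of ppm.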

As we lose some bits in one direction, converting back still requires
"real" division (i.e. the use of the existing slower xnarch_ns_to_tsc).
Otherwise, we would get significant errors already for small intervals.

To avoid losing the optimisation again in ns_to_tsc, I thought about
basing the whole internal timer arithmetic on nanoseconds instead of
TSCs as it is now. Although I have dug quite a lot into the current
timer subsystem over the last weeks, I may still overlook some aspects,
and I'm x86-biased. Therefore my question before thinking or even
patching further this way: What was the motivation to choose TSCs as the
internal time base? Any pitfalls down the road (except introducing
regressions)?

Jan

PS: All this would be 2.3-stuff, for sure.

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Xenomai-core] ns vs. tsc as internal timer base
  2006-06-13 10:51 [Xenomai-core] ns vs. tsc as internal timer base Jan Kiszka
@ 2006-06-13 11:16 ` Philippe Gerum
  2006-06-13 11:56   ` Jan Kiszka
  2006-06-13 11:59 ` [Xenomai-core] ns vs. tsc as internal timer base Gilles Chanteperdrix
  2006-06-13 12:00 ` Anders Blomdell
  2 siblings, 1 reply; 27+ messages in thread
From: Philippe Gerum @ 2006-06-13 11:16 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Hi,
> 
> between some football half-times of the last days ;), I played a bit
> with a hand-optimised xnarch_tsc_to_ns() for x86. Using scaled math, I
> achieved between 3 (P-I 133 MHz) to 4 times (P-M 1.3 GHz) faster
> conversions than with the current variant. While this optimisation only
> saves a few ten nanoseconds on high-end, slow processors can gain
> several hundreds of nanos per conversion (my P-133: -600 ns).
> 

I did exactly the same a few weeks ago, based on Anzinger's scaled math
from i386/kernel/timers/timer_tsc.c. And indeed, I saw performance
improvements of up to 20x in some cases.

> This does not come for free: accuracy of very large values is slightly
> worse, but that's likely negligible compared to the clock accuracy of
> TSCs (does anyone have any real numbers on the latter, BTW?).
> 

We do start losing significant precision for 2 ms delays and above, 
IIRC. This could be an issue for some events in aperiodic mode, albeit 
we could use a plain divide for those. The cost of conditionally doing 
this remains to be evaluated though.

> As we loose some bits the one way, converting back still requires "real"
> division (i.e. the use of the existing slower xnarch_ns_to_tsc).
> Otherwise, we would get significant errors already for small intervals.
> 
> To avoid loosing the optimisation again in ns_to_tsc, I thought about
> basing the whole internal timer arithmetics on nanoseconds instead of
> TSCs as it is now. Although I dug quite a lot in the current timer
> subsystem the last weeks, I may still oversee aspects and I'm
> x86-biased. Therefore my question before thinking or even patching
> further this way: What was the motivation to choose TSCs as internal
> time base?

TSCs are not the time base of the whole nucleus, only that of the timer
management code. The motivation to use TSCs in nucleus/timer.c was to
pick a unit which would not require any conversion beyond the initial
one in xntimer_start.
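
The "convert once, then stay in ticks" idea can be sketched roughly as
follows. This is only an illustration under assumptions: the structure
and function names are hypothetical and do not mirror the real
nucleus/timer.c, and cpu_hz is assumed to be non-zero.

```c
#include <stdint.h>

/* Hypothetical timer record; both fields are in hardware ticks. */
struct sketch_timer {
    uint64_t date;      /* next expiry, absolute tick count */
    uint64_t interval;  /* period in ticks, 0 for a one-shot timer */
};

/* Exact ns -> ticks conversion; only needed when a timer is started. */
static uint64_t sketch_ns_to_ticks(uint64_t ns, uint32_t cpu_hz)
{
    uint64_t sec = ns / 1000000000u;
    uint64_t rem = ns % 1000000000u;   /* < 10^9, so rem * cpu_hz fits */

    return sec * cpu_hz + rem * cpu_hz / 1000000000u;
}

static void sketch_timer_start(struct sketch_timer *t, uint64_t now_ticks,
                               uint64_t delay_ns, uint64_t period_ns,
                               uint32_t cpu_hz)
{
    t->date = now_ticks + sketch_ns_to_ticks(delay_ns, cpu_hz);
    t->interval = sketch_ns_to_ticks(period_ns, cpu_hz);
}

/* Periodic reload on expiry: a plain addition, no unit conversion. */
static void sketch_timer_reload(struct sketch_timer *t)
{
    t->date += t->interval;
}
```

With such a layout the division cost is paid in the start routine only;
every later reload of a periodic timer is a single 64-bit addition.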

> Any pitfalls down the road (except introducing regressions)?

Well, pitfalls expected from changing the core idea of time of the timer 
management code... :o>

> 
> Jan
> 
> 
> PS: All this would be 2.3-stuff, for sure.
> 


-- 

Philippe.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Xenomai-core] ns vs. tsc as internal timer base
  2006-06-13 11:16 ` Philippe Gerum
@ 2006-06-13 11:56   ` Jan Kiszka
  2006-06-13 12:31     ` Philippe Gerum
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Kiszka @ 2006-06-13 11:56 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai-core

[-- Attachment #1: Type: text/plain, Size: 4043 bytes --]

Philippe Gerum wrote:
> Jan Kiszka wrote:
>> Hi,
>>
>> between some football half-times of the last days ;), I played a bit
>> with a hand-optimised xnarch_tsc_to_ns() for x86. Using scaled math, I
>> achieved between 3 (P-I 133 MHz) to 4 times (P-M 1.3 GHz) faster
>> conversions than with the current variant. While this optimisation only
>> saves a few ten nanoseconds on high-end, slow processors can gain
>> several hundreds of nanos per conversion (my P-133: -600 ns).
>>
> 
> I did exactely the same a few weeks ago, based on Anzinger's scaled math

:) We should coordinate better.

> from i386/kernel/timers/timer_tsc.c. And indeed, I had x 20 performance
> improvements in some cases.

Oops, that sounds like a rather extreme optimisation. Does the original
version vary that much? I didn't observe this.

Here is my current version, BTW:

long tsc_scale;
unsigned int tsc_shift = 31;

static inline long long fast_tsc_to_ns(long long ts)
{
    long long ret;

    __asm__ (
        /* HI = HIWORD(ts) * tsc_scale */
        "mov  %%eax,%%ebx\n\t"
        "mov  %%edx,%%eax\n\t"
        "imull %2\n\t"
        "mov  %%eax,%%esi\n\t"
        "mov  %%edx,%%edi\n\t"

        /* LO = LOWORD(ts) * tsc_scale */
        "mov  %%ebx,%%eax\n\t"
        "mull %2\n\t"

        /* ret = (HI << 32) + LO */
        "add  %%esi,%%edx\n\t"
        "adc  $0,%%edi\n\t"

        /* ret = ret >> tsc_shift */
        "shrd %%cl,%%edx,%%eax\n\t"
        "shrd %%cl,%%edi,%%edx\n\t"
        : "=A"(ret)
        : "A" (ts), "m" (tsc_scale), "c" (tsc_shift)
        : "ebx", "esi", "edi");

    return ret;
}

void init_tsc(unsigned long cpu_freq)
{
    unsigned long long scale;

    while (1) {
        scale = do_div(1000000000LL << tsc_shift, cpu_freq);
        if (scale <= 0x7FFFFFFF)
            break;
        tsc_shift--;
    }
    tsc_scale = scale;
}

This version will use shifts from 31 (GHz-range cpu_freq) down to 26
(~32 MHz), i.e. a bit more than the Linux kernel's 22 bits.
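
For readers without an x86 box at hand, the same scheme can be written
in portable C roughly as below. This is a sketch under assumptions, not
a drop-in replacement: the sketch_* names are invented, the operand is
treated as unsigned (the assembly above uses a signed multiply for the
high word), and a plain 64-bit division stands in for do_div() in the
setup code.

```c
#include <stdint.h>

/* Hypothetical conversion parameters: ns = (tsc * mult) >> shift. */
struct sketch_tsc_conv {
    uint32_t mult;
    unsigned int shift;
};

/* One-time setup; assumes cpu_hz != 0. */
static void sketch_tsc_conv_init(struct sketch_tsc_conv *c, uint32_t cpu_hz)
{
    unsigned int shift = 31;
    uint64_t mult;

    /* At shift 0 the multiplier is at most 10^9 < 2^31, so this ends. */
    while ((mult = ((uint64_t)1000000000u << shift) / cpu_hz) > 0x7FFFFFFFu)
        shift--;

    c->mult = (uint32_t)mult;
    c->shift = shift;
}

/* 64 x 32 -> 96-bit multiply, then shift right; low 64 bits returned. */
static uint64_t sketch_tsc_to_ns(uint64_t tsc, const struct sketch_tsc_conv *c)
{
    uint64_t lo = (tsc & 0xFFFFFFFFu) * c->mult;  /* product bits 0..63  */
    uint64_t hi = (tsc >> 32) * c->mult;          /* product bits 32..95 */

    hi += lo >> 32;        /* fold the carry word; cannot overflow */
    lo &= 0xFFFFFFFFu;

    /* product == hi * 2^32 + lo; shift is in 0..31 */
    return (hi << (32 - c->shift)) | (lo >> c->shift);
}
```

With cpu_hz = 133000000 the setup loop settles on a shift of 28, and one
second worth of cycles converts to within 1 ns of 10^9.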

> 
>> This does not come for free: accuracy of very large values is slightly
>> worse, but that's likely negligible compared to the clock accuracy of
>> TSCs (does anyone have any real numbers on the latter, BTW?).
>>
> 
> We do start losing significant precision for 2 ms delays and above,
> IIRC. This could be an issue for some events in aperiodic mode, albeit
> we could use a plain divide for those. The cost of conditionally doing
> this remains to be evaluated though.

Maybe I tested (not calculated - math is too hard for me :o)) the wrong
values, but I didn't see such high regressions.

> 
>> As we loose some bits the one way, converting back still requires "real"
>> division (i.e. the use of the existing slower xnarch_ns_to_tsc).
>> Otherwise, we would get significant errors already for small intervals.
>>
>> To avoid loosing the optimisation again in ns_to_tsc, I thought about
>> basing the whole internal timer arithmetics on nanoseconds instead of
>> TSCs as it is now. Although I dug quite a lot in the current timer
>> subsystem the last weeks, I may still oversee aspects and I'm
>> x86-biased. Therefore my question before thinking or even patching
>> further this way: What was the motivation to choose TSCs as internal
>> time base?
> 
> TSC are not the whole nucleus time base, but only the timer management
> one. The motivation to use TSCs in nucleus/timer.c was to pick a unit
> which would not require any conversion beyond the initial one in
> xntimer_start.

That helps strictly periodic application timers, not aperiodic ones like
timeouts.

> 
>> Any pitfalls down the road (except introducing regressions)?
> 
> Well, pitfalls expected from changing the core idea of time of the timer
> management code... :o>
> 

You mean turning

rthal_timer_program_shot(rthal_imuldiv(delay,RTHAL_TIMER_FREQ,RTHAL_CPU_FREQ));

into

rthal_timer_program_shot(rthal_imuldiv(delay,RTHAL_TIMER_FREQ,1000000000));

e.g. ?
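
To make the difference concrete, here is a small sketch of what such a
multiply-then-divide helper computes. The function below is a
hypothetical stand-in, not the real rthal_imuldiv, and the nominal PC
PIT input clock (1193182 Hz) is only assumed as the timer frequency for
the sake of the example.

```c
#include <stdint.h>

#define SKETCH_TIMER_HZ 1193182u   /* assumed: nominal PC PIT input clock */

/* op * m / d with a 64-bit intermediate product; assumes d != 0. */
static uint32_t sketch_imuldiv(uint32_t op, uint32_t m, uint32_t d)
{
    return (uint32_t)(((uint64_t)op * m) / d);
}

/* Delay given in TSC cycles: the divisor is the CPU frequency. */
static uint32_t sketch_tsc_to_timer(uint32_t delay_tsc, uint32_t cpu_hz)
{
    return sketch_imuldiv(delay_tsc, SKETCH_TIMER_HZ, cpu_hz);
}

/* Delay given in nanoseconds: the divisor becomes a constant. */
static uint32_t sketch_ns_to_timer(uint32_t delay_ns)
{
    return sketch_imuldiv(delay_ns, SKETCH_TIMER_HZ, 1000000000u);
}
```

Both variants map a 1 ms delay to 1193 timer ticks; with a constant
divisor the compiler is additionally free to replace the division by a
multiplication.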

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Xenomai-core] ns vs. tsc as internal timer base
  2006-06-13 10:51 [Xenomai-core] ns vs. tsc as internal timer base Jan Kiszka
  2006-06-13 11:16 ` Philippe Gerum
@ 2006-06-13 11:59 ` Gilles Chanteperdrix
  2006-06-13 12:00 ` Anders Blomdell
  2 siblings, 0 replies; 27+ messages in thread
From: Gilles Chanteperdrix @ 2006-06-13 11:59 UTC (permalink / raw)
  To: Jan Kiszka

Jan Kiszka wrote:
 > Hi,
 > 
 > between some football half-times of the last days ;), I played a bit
 > with a hand-optimised xnarch_tsc_to_ns() for x86. Using scaled math, I
 > achieved between 3 (P-I 133 MHz) to 4 times (P-M 1.3 GHz) faster
 > conversions than with the current variant. While this optimisation only
 > saves a few ten nanoseconds on high-end, slow processors can gain
 > several hundreds of nanos per conversion (my P-133: -600 ns).

Some time ago, I also did some experiments on avoiding divisions. I came
to a solution that precomputes fractions using a real division, and that
only uses additions, multiplications and shifts for imuldiv and ullimd.
I thought there would be no loss in accuracy, but well, sometimes the
last bit is wrong.

Anyway, here is the code if you want to benchmark it, div96by32 and
u64(to|from)u32 are defined in asm-i386/hal.h or asm-generic/hal.h:

typedef struct {
    unsigned long long frac;    /* Fractional part. */
    unsigned long integ;        /* Integer part. */
} u32frac_t;

/* m/d == integ + frac / 2^64 */
void precalc(u32frac_t *const f,
             const unsigned long m,
             const unsigned long d)
{
    f->integ = m >= d ? m / d : 0;  /* >= so that m == d yields 1 */
    f->frac = div96by32(u64fromu32(m % d, 0), 0, d, NULL);
}

inline unsigned long nodiv_imuldiv(unsigned long op, u32frac_t f)
{
    const unsigned long tmp = (ullmul(op, f.frac >> 32)) >> 32;

    if(f.integ)
        return tmp + op * f.integ;

    return tmp;
}

#define add64and32(h, l, s) do {                \
    __asm__ ("addl %2, %1\n\t"                  \
             "adcl $0, %0"                      \
             : "+r"(h), "+r"(l)                 \
             : "r"(s));                         \
    } while(0)

#define add96and64(l0, l1, l2, s0, s1) do {     \
    __asm__ ("addl %4, %2\n\t"                  \
             "adcl %3, %1\n\t"                  \
             "adcl $0, %0\n\t"                  \
             : "+r"(l0), "+r"(l1), "+r"(l2)     \
             : "r"(s0), "r"(s1));               \
    } while(0)

inline unsigned long long mul64by64_high(const unsigned long long op,
                                      const unsigned long long m)
{
    /* Compute high 64 bits of multiplication 64 bits x 64 bits. */
    unsigned long long t1, t2, t3;
    u_long oph, opl, mh, ml, t0, t1h, t1l, t2h, t2l, t3h, t3l;

    u64tou32(op, oph, opl);
    u64tou32(m, mh, ml);
    t0 = ullmul(opl, ml) >> 32;
    t1 = ullmul(oph, ml); u64tou32(t1, t1h, t1l);
    add64and32(t1h, t1l, t0);
    t2 = ullmul(opl, mh); u64tou32(t2, t2h, t2l);
    t3 = ullmul(oph, mh); u64tou32(t3, t3h, t3l);
    add64and32(t3h, t3l, t2h);
    add96and64(t3h, t3l, t2l, t1h, t1l);

    return u64fromu32(t3h, t3l);
}

inline unsigned long long nodiv_ullimd(const unsigned long long op,
                                   const u32frac_t f)
{
    const unsigned long long tmp = mul64by64_high(op, f.frac);

    if(f.integ)
        return tmp + op * f.integ;

    return tmp;
}
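
The "last bit is wrong" effect is easy to reproduce with a reduced
32-bit variant of the same idea. The sketch below is an assumption-laden
illustration (hypothetical names, a 32-bit instead of a 64-bit fraction,
plain C division instead of div96by32), not the code above.

```c
#include <stdint.h>

/* m / d == integ + frac / 2^32, with frac rounded down. */
struct sketch_frac {
    uint32_t integ;
    uint32_t frac;
};

/* Setup with a real division; assumes d != 0. */
static void sketch_precalc(struct sketch_frac *f, uint32_t m, uint32_t d)
{
    f->integ = m / d;
    f->frac = (uint32_t)(((uint64_t)(m % d) << 32) / d);
}

/* op * m / d using one multiplication per part and a shift. */
static uint32_t sketch_nodiv_imuldiv(uint32_t op, const struct sketch_frac *f)
{
    return op * f->integ + (uint32_t)(((uint64_t)op * f->frac) >> 32);
}
```

With m = 1 and d = 3 the stored fraction is floor(2^32 / 3), so an
operand of 3 yields 0 where the exact 3 * 1 / 3 is 1: rounding the
fraction down can cost one unit in the last place.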

-- 


					    Gilles Chanteperdrix.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Xenomai-core] ns vs. tsc as internal timer base
  2006-06-13 10:51 [Xenomai-core] ns vs. tsc as internal timer base Jan Kiszka
  2006-06-13 11:16 ` Philippe Gerum
  2006-06-13 11:59 ` [Xenomai-core] ns vs. tsc as internal timer base Gilles Chanteperdrix
@ 2006-06-13 12:00 ` Anders Blomdell
  2 siblings, 0 replies; 27+ messages in thread
From: Anders Blomdell @ 2006-06-13 12:00 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Hi,
> 
> To avoid loosing the optimisation again in ns_to_tsc, I thought about
> basing the whole internal timer arithmetics on nanoseconds instead of
> TSCs as it is now. 
Good idea, makes it simpler to adapt to laptop frequency scaling and deep
ACPI sleep, i.e. to sync Xenomai time to the ACPI timer.

/Anders

-- 
Anders Blomdell                  Email: anders.blomdell@domain.hid
Department of Automatic Control
Lund University                  Phone:    +46 46 222 4625
P.O. Box 118                     Fax:      +46 46 138118
SE-221 00 Lund, Sweden


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Xenomai-core] ns vs. tsc as internal timer base
  2006-06-13 11:56   ` Jan Kiszka
@ 2006-06-13 12:31     ` Philippe Gerum
  2006-06-13 13:07       ` Gilles Chanteperdrix
                         ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Philippe Gerum @ 2006-06-13 12:31 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Philippe Gerum wrote:
> 
>>Jan Kiszka wrote:
>>
>>>Hi,
>>>
>>>between some football half-times of the last days ;), I played a bit
>>>with a hand-optimised xnarch_tsc_to_ns() for x86. Using scaled math, I
>>>achieved between 3 (P-I 133 MHz) to 4 times (P-M 1.3 GHz) faster
>>>conversions than with the current variant. While this optimisation only
>>>saves a few ten nanoseconds on high-end, slow processors can gain
>>>several hundreds of nanos per conversion (my P-133: -600 ns).
>>>
>>
>>I did exactely the same a few weeks ago, based on Anzinger's scaled math
> 
> 
> :) We should coordinate better.
> 

The answer is a published roadmap + todo list, but this requires some
organisation we have not been able to set up yet.

> 
>>from i386/kernel/timers/timer_tsc.c. And indeed, I had x 20 performance
>>improvements in some cases.
> 
> 
> Oops, that sounds like a bit too extreme optimisations. Is the original
> version varying that much? I didn't observe this.
> 
> Here is my current version, BTW:
> 
> long tsc_scale;
> unsigned int tsc_shift = 31;
> 
> static inline long long fast_tsc_to_ns(long long ts)
> {
>     long long ret;
> 
>     __asm__ (
>         /* HI = HIWORD(ts) * tsc_scale */
>         "mov  %%eax,%%ebx\n\t"
>         "mov  %%edx,%%eax\n\t"
>         "imull %2\n\t"
>         "mov  %%eax,%%esi\n\t"
>         "mov  %%edx,%%edi\n\t"
> 
>         /* LO = LOWORD(ts) * tsc_scale */
>         "mov  %%ebx,%%eax\n\t"
>         "mull %2\n\t"
> 
>         /* ret = (HI << 32) + LO */
>         "add  %%esi,%%edx\n\t"
>         "adc  $0,%%edi\n\t"
> 
>         /* ret = ret >> tsc_shift */
>         "shrd %%cl,%%edx,%%eax\n\t"
>         "shrd %%cl,%%edi,%%edx\n\t"
>         : "=A"(ret)
>         : "A" (ts), "m" (tsc_scale), "c" (tsc_shift)
>         : "ebx", "esi", "edi");
> 
>     return ret;
> }
> 
> void init_tsc(unsigned long cpu_freq)
> {
>     unsigned long long scale;
> 
>     while (1) {
>         scale = do_div(1000000000LL << tsc_shift, cpu_freq);
>         if (scale <= 0x7FFFFFFF)
>             break;
>         tsc_shift--;
>     }
>     tsc_scale = scale;
> }
> 
> This version will use 31 (GHz cpu_freq) to 26 (~32 MHz) shifts, i.e. a
> bit more than the Linux kernel's 22 bits.
>

This is likely why we see different levels of accuracy and performance:
firstly, my version is bluntly based on the kHz frequency; secondly, it
calculates the other way around, i.e. ns2tsc, so that TSCs are kept in
the inner code, but are converted more efficiently from the ns counts
passed to the outer interface:

static unsigned long ns2cyc_scale;
#define NS2CYC_SCALE_FACTOR 10 /* 2^10, carefully chosen */

static inline void set_ns2cyc_scale(unsigned long cpu_khz)
{
     ns2cyc_scale = (cpu_khz << NS2CYC_SCALE_FACTOR) / 1000000;
}

static inline unsigned long long ns_2_cycles(unsigned long long ns)
{
     return ns * ns2cyc_scale >> NS2CYC_SCALE_FACTOR;
}
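
To see what a 10-bit scale costs in accuracy, the formula above can be
replayed for sample frequencies. This is a sketch with hypothetical
names; the 133 MHz and 1.3 GHz values are simply the two CPUs mentioned
earlier in the thread, and the exact variant is only a reference that
assumes ns * cpu_khz fits in 64 bits.

```c
#include <stdint.h>

#define SKETCH_SCALE_FACTOR 10

/* Same formula as set_ns2cyc_scale(), computed in 64 bits. */
static uint32_t sketch_ns2cyc_scale(uint32_t cpu_khz)
{
    return (uint32_t)(((uint64_t)cpu_khz << SKETCH_SCALE_FACTOR) / 1000000u);
}

/* Scaled conversion: one multiplication and one shift. */
static uint64_t sketch_ns_2_cycles(uint64_t ns, uint32_t scale)
{
    return ns * scale >> SKETCH_SCALE_FACTOR;
}

/* Reference conversion with a real division. */
static uint64_t sketch_ns_2_cycles_exact(uint64_t ns, uint32_t cpu_khz)
{
    return ns * cpu_khz / 1000000u;
}
```

At 133 MHz the scale is 136 where 136.192 would be exact, so a 2 ms
delay comes out as 265625 cycles instead of 266000, i.e. 375 cycles or
roughly 2.8 us short. The relative error (about 0.14 % here) is constant,
so the absolute error grows linearly with the delay.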

>>
>>TSC are not the whole nucleus time base, but only the timer management
>>one. The motivation to use TSCs in nucleus/timer.c was to pick a unit
>>which would not require any conversion beyond the initial one in
>>xntimer_start.
> 
> 
> That helps strictly periodic application timers, not aperiodic ones like
> timeouts.
>

It depends, periodic timers usually exhibit larger delays, so the gain 
is more significant with oneshot timings incurring smaller delays, hence 
a higher number of calculations.

> 
>>>Any pitfalls down the road (except introducing regressions)?
>>
>>Well, pitfalls expected from changing the core idea of time of the timer
>>management code... :o>
>>
> 
> You mean turning
> 
> rthal_timer_program_shot(rthal_imuldiv(delay,RTHAL_TIMER_FREQ,RTHAL_CPU_FREQ));
> 
> into
> 
> rthal_timer_program_shot(rthal_imuldiv(delay,RTHAL_TIMER_FREQ,1000000000));
> 

Not really, it was a general remark about changing code that might
have some assumptions on using TSCs. Additionally, only x86 needs to
rescale TSC values to the timer frequency, other archs use the same unit
on both sides, and that unit might even have nothing to do with any CPU
accounting (e.g. blackfin uses a free running timer, ppc uses the
internal timebase, etc).

This said, it should not have that many assumptions, and in any case,
they should be confined to nucleus/timer.c. I think we should give this
kind of optimization a try.

-- 

Philippe.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Xenomai-core] ns vs. tsc as internal timer base
  2006-06-13 12:31     ` Philippe Gerum
@ 2006-06-13 13:07       ` Gilles Chanteperdrix
  2006-06-13 13:28         ` Philippe Gerum
  2006-06-13 13:33       ` Jan Kiszka
  2006-06-13 16:19       ` Jan Kiszka
  2 siblings, 1 reply; 27+ messages in thread
From: Gilles Chanteperdrix @ 2006-06-13 13:07 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai

Philippe Gerum wrote:
 > static inline unsigned long long ns_2_cycles(unsigned long long ns)
 > {
 >      return ns * ns2cyc_scale >> NS2CYC_SCALE_FACTOR;

This multiplication is 64 bits * 32 bits, the intermediate result may
need more than 64 bits, so you should compute it the same way as the
beginning of ullimd. Something like:

static inline unsigned long long ns_2_cycles(unsigned long long ns)
{
    unsigned nsh, nsl, tlh, tll;
    unsigned long long th, tl;

    __rthal_u64tou32(ns, nsh, nsl);
    tl = rthal_ullmul(nsl, ns2cyc_scale);
    __rthal_u64tou32(tl, tlh, tll);
    th = rthal_ullmul(nsh, ns2cyc_scale);
    th += tlh;

    tll = (unsigned) th << (32 - NS2CYC_SCALE_FACTOR) | tll >> NS2CYC_SCALE_FACTOR;
    th >>= NS2CYC_SCALE_FACTOR;
    return __rthal_u64fromu32(th, tll);
}
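
How large ns has to be before the plain 64-bit product wraps can be
worked out directly. The helper below is a hypothetical illustration; it
assumes a non-zero scale, and the value 1331 used in the note is what
the 10-bit formula yields for cpu_khz = 1300000.

```c
#include <stdint.h>

/* Largest ns for which ns * scale still fits into 64 bits. */
static uint64_t sketch_max_safe_ns(uint32_t scale)
{
    return UINT64_MAX / scale;
}
```

For a scale of 1331 the limit is about 1.39e16 ns, or roughly 160 days.
Relative delays stay far below that; absolute dates counted from boot
eventually may not.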


-- 


					    Gilles Chanteperdrix.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Xenomai-core] ns vs. tsc as internal timer base
  2006-06-13 13:07       ` Gilles Chanteperdrix
@ 2006-06-13 13:28         ` Philippe Gerum
  2006-06-13 13:34           ` Gilles Chanteperdrix
  0 siblings, 1 reply; 27+ messages in thread
From: Philippe Gerum @ 2006-06-13 13:28 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai

Gilles Chanteperdrix wrote:
> Philippe Gerum wrote:
>  > static inline unsigned long long ns_2_cycles(unsigned long long ns)
>  > {
>  >      return ns * ns2cyc_scale >> NS2CYC_SCALE_FACTOR;
> 
> This multiplication is 64 bits * 32 bits, the intermediate result may
> need more than 64 bits, so you should compute it the same way as the
> beginning of ullimd. Something like:

Sure, but the point is that if we were to use such code, we should bound
the 64-bit operand and not use it beyond the tolerable loss of accuracy
on output (e.g. 2 ms). This would require breaking longer shots into
several smaller ones, relying on the internal timer management logic to
redo the shot until it has actually elapsed (which should be a rare case
for oneshot timing), a bit like we currently do by bounding the values
to 2^32-1. Going for an ullimd-like implementation somewhat defeats the
overall effort of reducing the CPU footprint, I guess. That said, I
still have no clue whether the gain in computation cycles is worth the
additional overhead of dealing with possibly early shots - I tend to
think it would be better on average though.
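
The "break longer shots into several smaller ones" part could look
roughly like this. It is a sketch only: the names are hypothetical, the
2 ms bound is the example figure from above, and the real timer code
would re-arm from its tick handler rather than from a helper like this.

```c
#include <stdint.h>

#define SKETCH_MAX_SHOT_NS 2000000u   /* 2 ms, the example bound */

/*
 * Return the delay to program for the next shot and subtract it from
 * the remaining time. The caller re-arms until *remaining_ns is 0, so
 * the bounded (and cheap) conversion only ever sees small operands.
 */
static uint32_t sketch_next_shot(uint64_t *remaining_ns)
{
    uint32_t shot = *remaining_ns > SKETCH_MAX_SHOT_NS
                        ? SKETCH_MAX_SHOT_NS
                        : (uint32_t)*remaining_ns;

    *remaining_ns -= shot;
    return shot;
}
```

A 5 ms delay would then be served as shots of 2 ms, 2 ms and 1 ms, at
the price of two extra timer interrupts.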

> 
> static inline unsigned long long ns_2_cycles(unsigned long long ns)
> {
>     unsigned nsh, nsl, tlh, tll;
>     unsigned long long th, tl;
> 
>     __rthal_u64tou32(ns, nsh, nsl);
>     tl = rthal_ullmul(nsl, ns2cyc_scale);
>     __rthal_u64tou32(tl, tlh, tll);
>     th = rthal_ullmul(nsh, ns2cyc_scale);
>     th += tlh;
> 
>     tll = (unsigned) th << (32 - NS2CYC_SCALE_FACTOR) | tll >> NS2CYC_SCALE_FACTOR;
>     th >>= NS2CYC_SCALE_FACTOR;
>     return __rthal_u64fromu32(th, tll);
> }
> 
> 

-- 

Philippe.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Xenomai-core] ns vs. tsc as internal timer base
  2006-06-13 12:31     ` Philippe Gerum
  2006-06-13 13:07       ` Gilles Chanteperdrix
@ 2006-06-13 13:33       ` Jan Kiszka
  2006-06-13 13:51         ` Philippe Gerum
  2006-06-13 16:19       ` Jan Kiszka
  2 siblings, 1 reply; 27+ messages in thread
From: Jan Kiszka @ 2006-06-13 13:33 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai-core

[-- Attachment #1: Type: text/plain, Size: 4549 bytes --]

Philippe Gerum wrote:
> Jan Kiszka wrote:
>> Philippe Gerum wrote:
>>> from i386/kernel/timers/timer_tsc.c. And indeed, I had x 20 performance
>>> improvements in some cases.
>>
>> Oops, that sounds like a bit too extreme optimisations. Is the original
>> version varying that much? I didn't observe this.
>>
>> Here is my current version, BTW:
>>
>> long tsc_scale;
>> unsigned int tsc_shift = 31;
>>
>> static inline long long fast_tsc_to_ns(long long ts)
>> {
>>     long long ret;
>>
>>     __asm__ (
>>         /* HI = HIWORD(ts) * tsc_scale */
>>         "mov  %%eax,%%ebx\n\t"
>>         "mov  %%edx,%%eax\n\t"
>>         "imull %2\n\t"
>>         "mov  %%eax,%%esi\n\t"
>>         "mov  %%edx,%%edi\n\t"
>>
>>         /* LO = LOWORD(ts) * tsc_scale */
>>         "mov  %%ebx,%%eax\n\t"
>>         "mull %2\n\t"
>>
>>         /* ret = (HI << 32) + LO */
>>         "add  %%esi,%%edx\n\t"
>>         "adc  $0,%%edi\n\t"
>>
>>         /* ret = ret >> tsc_shift */
>>         "shrd %%cl,%%edx,%%eax\n\t"
>>         "shrd %%cl,%%edi,%%edx\n\t"
>>         : "=A"(ret)
>>         : "A" (ts), "m" (tsc_scale), "c" (tsc_shift)
>>         : "ebx", "esi", "edi");
>>
>>     return ret;
>> }
>>
>> void init_tsc(unsigned long cpu_freq)
>> {
>>     unsigned long long scale;
>>
>>     while (1) {
>>         scale = do_div(1000000000LL << tsc_shift, cpu_freq);
>>         if (scale <= 0x7FFFFFFF)
>>             break;
>>         tsc_shift--;
>>     }
>>     tsc_scale = scale;
>> }
>>
>> This version will use 31 (GHz cpu_freq) to 26 (~32 MHz) shifts, i.e. a
>> bit more than the Linux kernel's 22 bits.
>>
> 
> Here is likely why we have different levels of accuracy and performance,
>  firstly my version is bluntly based on the khz freq, secondly it
> calculates the other way around, i.e. ns2tsc, so that tsc are keep in
> the inner code, but more efficiently converted from ns counts passed to
> the outer interface:
> 
> static unsigned long ns2cyc_scale;
> #define NS2CYC_SCALE_FACTOR 10 /* 2^10, carefully chosen */

Linux only uses 10 bits for scheduling time calculation, which is
tick-based (low-res) anyway. The tsc clock_source uses 22 bits. The
latter overflows after an hour or so, because it drops all bits above 64
after the multiplication - which is only insignificantly faster than
keeping them when using optimised code anyway.
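
The "hour or so" can be checked with a little arithmetic, assuming the
multiplier is close to 2^shift * 10^9 / cpu_freq: the product
cycles * mult is then roughly ns * 2^shift, whatever the CPU frequency,
and wraps once ns reaches 2^(64 - shift). The helper name is made up
for this sketch.

```c
#include <stdint.h>

/*
 * Approximate number of seconds after which cycles * mult no longer
 * fits into 64 bits, for a multiplier scaled by 2^shift.
 * Valid for 1 <= shift <= 63.
 */
static uint64_t sketch_overflow_seconds(unsigned int shift)
{
    return (UINT64_C(1) << (64 - shift)) / 1000000000u;
}
```

A shift of 22 gives 4398 s, i.e. about 73 minutes, while a shift of 10
pushes the wrap-around out to roughly 208 days.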

> 
> static inline void set_ns2cyc_scale(unsigned long cpu_khz)
> {
>     ns2cyc_scale = (cpu_khz << NS2CYC_SCALE_FACTOR) / 1000000;
> }
> 
> static inline unsigned long long ns_2_cycles(unsigned long long ns)
> {
>     return ns * ns2cyc_scale >> NS2CYC_SCALE_FACTOR;
> }
> 
>>>
>>> TSC are not the whole nucleus time base, but only the timer management
>>> one. The motivation to use TSCs in nucleus/timer.c was to pick a unit
>>> which would not require any conversion beyond the initial one in
>>> xntimer_start.
>>
>>
>> That helps strictly periodic application timers, not aperiodic ones like
>> timeouts.
>>
> 
> It depends, periodic timers usually exhibit larger delays, so the gain
> is more significant with oneshot timings incurring smaller delays, hence
> a higher number of calculations.
> 
>>
>>>> Any pitfalls down the road (except introducing regressions)?
>>>
>>> Well, pitfalls expected from changing the core idea of time of the timer
>>> management code... :o>
>>>
>>
>> You mean turning
>>
>> rthal_timer_program_shot(rthal_imuldiv(delay,RTHAL_TIMER_FREQ,RTHAL_CPU_FREQ));
>>
>>
>> into
>>
>> rthal_timer_program_shot(rthal_imuldiv(delay,RTHAL_TIMER_FREQ,1000000000));
>>
>>
> 
> Not really, it was a general remark about changing a code that might
> have some assumtions on using TSCs. Additionally, only x86 needs to
> rescale TSC values to the timer frequency, other archs use the same unit
> on both sides, and such unit might even have nothing to do with any CPU
> accounting (e.g. blackfin uses a free running timer, ppc uses the
> internal timebase, etc).

Ok, an interesting aspect that I had already assumed but not yet checked
in detail. That makes dealing with TSCs interesting again on != x86. In
contrast, on x86, there is the aspect of frequency scaling that Anders
brought up, which would speak in favour of nanoseconds.

> 
> This said, it should not have that many assumptions, and in any case,
> they should be confined to nucleus/timers.c. I think we should give this
> kind of optimization a try.
> 

Yep, it just needs some more brain cycles on how to do this precisely.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Xenomai-core] ns vs. tsc as internal timer base
  2006-06-13 13:28         ` Philippe Gerum
@ 2006-06-13 13:34           ` Gilles Chanteperdrix
  2006-06-13 13:45             ` Philippe Gerum
  0 siblings, 1 reply; 27+ messages in thread
From: Gilles Chanteperdrix @ 2006-06-13 13:34 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai

Philippe Gerum wrote:
 > Gilles Chanteperdrix wrote:
 > > Philippe Gerum wrote:
 > >  > static inline unsigned long long ns_2_cycles(unsigned long long ns)
 > >  > {
 > >  >      return ns * ns2cyc_scale >> NS2CYC_SCALE_FACTOR;
 > > 
 > > This multiplication is 64 bits * 32 bits, the intermediate result may
 > > need more than 64 bits, so you should compute it the same way as the
 > > beginning of ullimd. Something like:
 > 
 > Sure, but the point is that if we were to use such code, we should bound 
 > the 64bit operand and would not use it beyond the tolerable loss of 
 > accuracy on output (e.g. 2ms).  This would require to break longer shots 
 > in several smaller ones, relying on the internal timer management logic 
 > to redo the shot until it has actually elapsed (which should be a rare 
 > case for oneshot timing), a bit like we are currently doing in bounding 
 > the values to 2^32-1 right now. Going for ullimd alike implementation 
 > somehow impedes the overall effort in reducing the CPU footprint, I 
 > guess. This said, I have still no clue if the gain in computation cycles 
 > is worth the additional overhead of dealing with possibly early shots - 
 > I tend to think it would be better on average though.

Ok, we could then write:

static inline unsigned long long ns_2_cycles(unsigned ns)
{
    return (unsigned long long) ns * ns2cyc_scale >> NS2CYC_SCALE_FACTOR;
}

-- 


					    Gilles Chanteperdrix.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Xenomai-core] ns vs. tsc as internal timer base
  2006-06-13 13:34           ` Gilles Chanteperdrix
@ 2006-06-13 13:45             ` Philippe Gerum
  0 siblings, 0 replies; 27+ messages in thread
From: Philippe Gerum @ 2006-06-13 13:45 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai

Gilles Chanteperdrix wrote:
> Philippe Gerum wrote:
>  > Gilles Chanteperdrix wrote:
>  > > Philippe Gerum wrote:
>  > >  > static inline unsigned long long ns_2_cycles(unsigned long long ns)
>  > >  > {
>  > >  >      return ns * ns2cyc_scale >> NS2CYC_SCALE_FACTOR;
>  > > 
>  > > This multiplication is 64 bits * 32 bits, the intermediate result may
>  > > need more than 64 bits, so you should compute it the same way as the
>  > > beginning of ullimd. Something like:
>  > 
>  > Sure, but the point is that if we were to use such code, we should bound 
>  > the 64bit operand and would not use it beyond the tolerable loss of 
>  > accuracy on output (e.g. 2ms).  This would require to break longer shots 
>  > in several smaller ones, relying on the internal timer management logic 
>  > to redo the shot until it has actually elapsed (which should be a rare 
>  > case for oneshot timing), a bit like we are currently doing in bounding 
>  > the values to 2^32-1 right now. Going for ullimd alike implementation 
>  > somehow impedes the overall effort in reducing the CPU footprint, I 
>  > guess. This said, I have still no clue if the gain in computation cycles 
>  > is worth the additional overhead of dealing with possibly early shots - 
>  > I tend to think it would be better on average though.
> 
> Ok, we could then write:
> 
> static inline unsigned long long ns_2_cycles(unsigned ns)
> {
>     return (unsigned long long) ns * ns2cyc_scale >> NS2CYC_SCALE_FACTOR;
> }
> 

Yep.

-- 

Philippe.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Xenomai-core] ns vs. tsc as internal timer base
  2006-06-13 13:33       ` Jan Kiszka
@ 2006-06-13 13:51         ` Philippe Gerum
  0 siblings, 0 replies; 27+ messages in thread
From: Philippe Gerum @ 2006-06-13 13:51 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Philippe Gerum wrote:
> 
>>Jan Kiszka wrote:
>>
>>>Philippe Gerum wrote:
>>>
>>>>from i386/kernel/timers/timer_tsc.c. And indeed, I had x 20 performance
>>>>improvements in some cases.
>>>
>>>Oops, that sounds like a bit too extreme optimisations. Is the original
>>>version varying that much? I didn't observe this.
>>>
>>>Here is my current version, BTW:
>>>
>>>long tsc_scale;
>>>unsigned int tsc_shift = 31;
>>>
>>>static inline long long fast_tsc_to_ns(long long ts)
>>>{
>>>    long long ret;
>>>
>>>    __asm__ (
>>>        /* HI = HIWORD(ts) * tsc_scale */
>>>        "mov  %%eax,%%ebx\n\t"
>>>        "mov  %%edx,%%eax\n\t"
>>>        "imull %2\n\t"
>>>        "mov  %%eax,%%esi\n\t"
>>>        "mov  %%edx,%%edi\n\t"
>>>
>>>        /* LO = LOWORD(ts) * tsc_scale */
>>>        "mov  %%ebx,%%eax\n\t"
>>>        "mull %2\n\t"
>>>
>>>        /* ret = (HI << 32) + LO */
>>>        "add  %%esi,%%edx\n\t"
>>>        "adc  $0,%%edi\n\t"
>>>
>>>        /* ret = ret >> tsc_shift */
>>>        "shrd %%cl,%%edx,%%eax\n\t"
>>>        "shrd %%cl,%%edi,%%edx\n\t"
>>>        : "=A"(ret)
>>>        : "A" (ts), "m" (tsc_scale), "c" (tsc_shift)
>>>        : "ebx", "esi", "edi");
>>>
>>>    return ret;
>>>}
>>>
>>>void init_tsc(unsigned long cpu_freq)
>>>{
>>>    unsigned long long scale;
>>>
>>>    while (1) {
>>>        scale = do_div(1000000000LL << tsc_shift, cpu_freq);
>>>        if (scale <= 0x7FFFFFFF)
>>>            break;
>>>        tsc_shift--;
>>>    }
>>>    tsc_scale = scale;
>>>}
>>>
>>>This version will use 31 (GHz cpu_freq) to 26 (~32 MHz) shifts, i.e. a
>>>bit more than the Linux kernel's 22 bits.
>>>
>>
>>Here is likely why we have different levels of accuracy and performance,
>> firstly my version is bluntly based on the khz freq, secondly it
>>calculates the other way around, i.e. ns2tsc, so that TSCs are kept in
>>the inner code, but more efficiently converted from ns counts passed to
>>the outer interface:
>>
>>static unsigned long ns2cyc_scale;
>>#define NS2CYC_SCALE_FACTOR 10 /* 2^10, carefully chosen */
> 
> 
> Linux only uses 10 bits for scheduling time calculation, which is
> tick-based (low-res) anyway.

This code is rather used to compute TSC offsets within a tick, so the 
max operand is short, bounded and known by design. Hence the scale 
factor, AFAICS.

> The tsc clock_source uses 22 bits. The
> latter overflows after an hour or so, because they drop all bits > 64
> after the multiplication - insignificantly faster when using optimised
> code anyway.
>

This path to optimizing is about computing reasonably short delays this 
way, so roll-over and precision would not be a key factor.
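
For what it's worth, the "an hour or so" figure quoted above can be checked with a small sketch. It assumes the cycles-to-ns conversion has the form cycles * scale >> shift with scale = (10^9 << shift) / freq; the helper names are made up for illustration and are not taken from the kernel:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: scale factor for a cycles -> ns conversion of the
 * form (cycles * scale) >> shift, i.e. scale = (10^9 << shift) / freq_hz. */
static uint64_t cyc2ns_scale(uint64_t freq_hz, unsigned shift)
{
    /* 10^9 < 2^30, so shifting by up to 33 bits still fits in 64 bits */
    assert(freq_hz > 0 && shift < 34);
    return (UINT64_C(1000000000) << shift) / freq_hz;
}

/* Seconds of counter time after which cycles * scale no longer fits
 * into 64 bits, i.e. when bits above 64 start being dropped. */
static uint64_t overflow_seconds(uint64_t freq_hz, unsigned shift)
{
    uint64_t scale = cyc2ns_scale(freq_hz, shift);
    uint64_t max_cycles;

    assert(scale > 0);
    max_cycles = UINT64_MAX / scale;
    return max_cycles / freq_hz;
}
```

Under these assumptions, a 1 GHz counter with a 22-bit shift gives 4398 seconds (about 73 minutes), while a 10-bit shift only overflows after roughly 208 days.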

> 
>>static inline void set_ns2cyc_scale(unsigned long cpu_khz)
>>{
>>    ns2cyc_scale = (cpu_khz << NS2CYC_SCALE_FACTOR) / 1000000;
>>}
>>
>>static inline unsigned long long ns_2_cycles(unsigned long long ns)
>>{
>>    return ns * ns2cyc_scale >> NS2CYC_SCALE_FACTOR;
>>}
>>
>>
>>>>TSC are not the whole nucleus time base, but only the timer management
>>>>one. The motivation to use TSCs in nucleus/timer.c was to pick a unit
>>>>which would not require any conversion beyond the initial one in
>>>>xntimer_start.
>>>
>>>
>>>That helps strictly periodic application timers, not aperiodic ones like
>>>timeouts.
>>>
>>
>>It depends, periodic timers usually exhibit larger delays, so the gain
>>is more significant with oneshot timings incurring smaller delays, hence
>>a higher number of calculations.
>>
>>
>>>>>Any pitfalls down the road (except introducing regressions)?
>>>>
>>>>Well, pitfalls expected from changing the core idea of time of the timer
>>>>management code... :o>
>>>>
>>>You mean turning
>>>
>>>rthal_timer_program_shot(rthal_imuldiv(delay,RTHAL_TIMER_FREQ,RTHAL_CPU_FREQ));
>>>
>>>
>>>into
>>>
>>>rthal_timer_program_shot(rthal_imuldiv(delay,RTHAL_TIMER_FREQ,1000000000));
>>>
>>>
>>
>>Not really, it was a general remark about changing a code that might
>>have some assumptions on using TSCs. Additionally, only x86 needs to
>>rescale TSC values to the timer frequency, other archs use the same unit
>>on both sides, and such unit might even have nothing to do with any CPU
>>accounting (e.g. blackfin uses a free running timer, ppc uses the
>>internal timebase, etc).
> 
> 
> Ok, an interesting aspect I already assumed but didn't check in details
> yet. That makes dealing with TSCs interesting again on != x86. In
> contrast, on x86, there is the aspect of frequency scaling that Anders
> brought up and which would speak pro nanos.
> 
> 
>>This said, it should not have that many assumptions, and in any case,
>>they should be confined to nucleus/timers.c. I think we should give this
>>kind of optimization a try.
>>
> 
> 
> Yep, it just needs some more brain cycles how to do this precisely.
> 
> Jan
> 


-- 

Philippe.



* Re: [Xenomai-core] ns vs. tsc as internal timer base
  2006-06-13 12:31     ` Philippe Gerum
  2006-06-13 13:07       ` Gilles Chanteperdrix
  2006-06-13 13:33       ` Jan Kiszka
@ 2006-06-13 16:19       ` Jan Kiszka
  2006-06-13 16:29         ` Gilles Chanteperdrix
  2006-06-13 17:04         ` Philippe Gerum
  2 siblings, 2 replies; 27+ messages in thread
From: Jan Kiszka @ 2006-06-13 16:19 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai-core

[-- Attachment #1: Type: text/plain, Size: 1015 bytes --]

Philippe Gerum wrote:
> Here is likely why we have different levels of accuracy and performance,
>  firstly my version is bluntly based on the khz freq, secondly it
> calculates the other way around, i.e. ns2tsc, so that TSCs are kept in
> the inner code, but more efficiently converted from ns counts passed to
> the outer interface:
> 
> static unsigned long ns2cyc_scale;
> #define NS2CYC_SCALE_FACTOR 10 /* 2^10, carefully chosen */
> 
> static inline void set_ns2cyc_scale(unsigned long cpu_khz)
> {
>     ns2cyc_scale = (cpu_khz << NS2CYC_SCALE_FACTOR) / 1000000;
> }
> 
> static inline unsigned long long ns_2_cycles(unsigned long long ns)
> {
>     return ns * ns2cyc_scale >> NS2CYC_SCALE_FACTOR;
> }

Your version performs ~50% better than mine (outperforming the original
version by a factor of 7 on a 1 GHz box, vs. 4.8). I think you compared
non-optimised code, didn't you? Without -O2, I see 15 times better
performance.

[Gilles' variant still refuses to get benchmarked.]

Jan




* Re: [Xenomai-core] ns vs. tsc as internal timer base
  2006-06-13 16:19       ` Jan Kiszka
@ 2006-06-13 16:29         ` Gilles Chanteperdrix
  2006-06-13 17:04         ` Philippe Gerum
  1 sibling, 0 replies; 27+ messages in thread
From: Gilles Chanteperdrix @ 2006-06-13 16:29 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai

Jan Kiszka wrote:
 > Philippe Gerum wrote:
 > > Here is likely why we have different levels of accuracy and performance,
 > >  firstly my version is bluntly based on the khz freq, secondly it
 > > calculates the other way around, i.e. ns2tsc, so that TSCs are kept in
 > > the inner code, but more efficiently converted from ns counts passed to
 > > the outer interface:
 > > 
 > > static unsigned long ns2cyc_scale;
 > > #define NS2CYC_SCALE_FACTOR 10 /* 2^10, carefully chosen */
 > > 
 > > static inline void set_ns2cyc_scale(unsigned long cpu_khz)
 > > {
 > >     ns2cyc_scale = (cpu_khz << NS2CYC_SCALE_FACTOR) / 1000000;
 > > }
 > > 
 > > static inline unsigned long long ns_2_cycles(unsigned long long ns)
 > > {
 > >     return ns * ns2cyc_scale >> NS2CYC_SCALE_FACTOR;
 > > }
 > 
 > Your version performs ~50% better than mine (outperforming the original
 > version by factor 7 on a 1 GHz box, vs. 4.8). I think you compared
 > non-optimised code, didn't you? Without -O2, I see 15 times better
 > performance.
 > 
 > [Gilles' variant still refuses to get benchmarked.]

Since we accept a smaller range, I think you should benchmark
nodiv_imuldiv instead of nodiv_ullimd. And it should perform better
since it uses 32-bit shifts, which are not real shifts.
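
To illustrate why a 32-bit shift is "not a real shift" on a 32-bit machine, here is a minimal sketch of a division-free muldiv. It is not the actual nodiv_imuldiv code; the struct and function names are hypothetical, and the ratio m/d is assumed to be precomputed once at init time:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical precomputed ratio m/d, split into an integer part and a
 * 32-bit binary fraction. */
struct nodiv_ratio {
    uint32_t integ;   /* floor(m / d) */
    uint32_t frac;    /* floor((m % d) * 2^32 / d) */
};

/* Done once (e.g. at calibration time); this is where the division lives. */
static struct nodiv_ratio nodiv_ratio_init(uint32_t m, uint32_t d)
{
    struct nodiv_ratio r;

    assert(d != 0);
    r.integ = m / d;
    r.frac = (uint32_t)(((uint64_t)(m % d) << 32) / d);
    return r;
}

/* Approximates op * m / d without dividing. The ">> 32" merely selects
 * the upper half of the 64-bit product (EDX after a mull on x86), so no
 * shift instruction is needed. The result is at most 1 below the exact
 * truncated quotient. */
static uint64_t nodiv_muldiv32(uint32_t op, struct nodiv_ratio r)
{
    return (uint64_t)op * r.integ + (((uint64_t)op * r.frac) >> 32);
}
```

The price for dropping the division is the truncated fraction: the result can be one unit too small, which matches the "varying in the last bit" kind of error one would expect from such a scheme.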

-- 


					    Gilles Chanteperdrix.



* Re: [Xenomai-core] ns vs. tsc as internal timer base
  2006-06-13 16:19       ` Jan Kiszka
  2006-06-13 16:29         ` Gilles Chanteperdrix
@ 2006-06-13 17:04         ` Philippe Gerum
  2006-06-13 17:13           ` Gilles Chanteperdrix
  1 sibling, 1 reply; 27+ messages in thread
From: Philippe Gerum @ 2006-06-13 17:04 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Philippe Gerum wrote:
> 
>>Here is likely why we have different levels of accuracy and performance,
>> firstly my version is bluntly based on the khz freq, secondly it
>>calculates the other way around, i.e. ns2tsc, so that TSCs are kept in
>>the inner code, but more efficiently converted from ns counts passed to
>>the outer interface:
>>
>>static unsigned long ns2cyc_scale;
>>#define NS2CYC_SCALE_FACTOR 10 /* 2^10, carefully chosen */
>>
>>static inline void set_ns2cyc_scale(unsigned long cpu_khz)
>>{
>>    ns2cyc_scale = (cpu_khz << NS2CYC_SCALE_FACTOR) / 1000000;
>>}
>>
>>static inline unsigned long long ns_2_cycles(unsigned long long ns)
>>{
>>    return ns * ns2cyc_scale >> NS2CYC_SCALE_FACTOR;
>>}
> 
> 
> Your version performs ~50% better than mine (outperforming the original
> version by factor 7 on a 1 GHz box, vs. 4.8). I think you compared
> non-optimised code, didn't you?

Nah, I'm not that drunk!

> Without -O2, I see 15 times better
> performance.

Redone the check here on a Centrino 1.6 GHz, and still have roughly x20 
improvement (a bit better actually). I'm using Debian/sarge gcc 3.3.5.

> 
> [Gilles' variant still refuses to get benchmarked.]
> 
> Jan
> 


-- 

Philippe.



* Re: [Xenomai-core] ns vs. tsc as internal timer base
  2006-06-13 17:04         ` Philippe Gerum
@ 2006-06-13 17:13           ` Gilles Chanteperdrix
  2006-06-13 17:58             ` Philippe Gerum
  2006-07-25 18:26             ` [Xenomai-core] Timer optimisations, continued Jan Kiszka
  0 siblings, 2 replies; 27+ messages in thread
From: Gilles Chanteperdrix @ 2006-06-13 17:13 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: Jan Kiszka, xenomai-core

Philippe Gerum wrote:
 > Redone the check here on a Centrino 1.6 GHz, and still have roughly x20 
 > improvement (a bit better actually). I'm using Debian/sarge gcc 3.3.5.

I think I remember that Pentium M has a much shorter mull instruction
than other processors of the family.

-- 


					    Gilles Chanteperdrix.



* Re: [Xenomai-core] ns vs. tsc as internal timer base
  2006-06-13 17:13           ` Gilles Chanteperdrix
@ 2006-06-13 17:58             ` Philippe Gerum
  2006-06-14  9:25               ` Jim Cromie
  2006-07-25 18:26             ` [Xenomai-core] Timer optimisations, continued Jan Kiszka
  1 sibling, 1 reply; 27+ messages in thread
From: Philippe Gerum @ 2006-06-13 17:58 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: Jan Kiszka, xenomai-core

Gilles Chanteperdrix wrote:
> Philippe Gerum wrote:
>  > Redone the check here on a Centrino 1.6 GHz, and still have roughly x20 
>  > improvement (a bit better actually). I'm using Debian/sarge gcc 3.3.5.
> 
> I think I remember that Pentium M has a much shorter mull instruction
> than other processors of the family.
> 

That would explain. Anyway, as John Stulz put it:
"math is hard, lets go shopping!"

-- 

Philippe.



* Re: [Xenomai-core] ns vs. tsc as internal timer base
  2006-06-13 17:58             ` Philippe Gerum
@ 2006-06-14  9:25               ` Jim Cromie
  2006-06-14 12:29                 ` Philippe Gerum
  0 siblings, 1 reply; 27+ messages in thread
From: Jim Cromie @ 2006-06-14  9:25 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: Jan Kiszka, xenomai-core

Philippe Gerum wrote:
> Gilles Chanteperdrix wrote:
>> Philippe Gerum wrote:
>>  > Redone the check here on a Centrino 1.6 GHz, and still have roughly x20
>>  > improvement (a bit better actually). I'm using Debian/sarge gcc 3.3.5.
>>
>> I think I remember that Pentium M has a much shorter mull instruction
>> than other processors of the family.
>>
>
> That would explain. Anyway, as John Stulz put it:
> "math is hard, lets go shopping!"
>

Heh.  Appropriate that his name (Stultz) comes up here, as his
generic-time (GTOD) patchset looks headed for 2.6.18, bringing with it
a full re-working of Linux timers / timeofday.  In this new world, time
is kept on free-running counters.

I've been running this patchset on my Soekris for some time, since
GTOD detects that the TSC counts slowly, calls it insane, and does timing
with the PIT.

With GTOD, writing a new clocksource driver is easy, enough so that I
could do it.
My clocksource patch uses the 27 MHz timer on the Geode CPU.
Once the TSC is de-rated, mine becomes the best clocksource, and GTOD
switches to it.

All of which is to say: new mainline code is coming. Should this current
rework notion wait, given that it will all need to be revisited again
later?


* Re: [Xenomai-core] ns vs. tsc as internal timer base
  2006-06-14  9:25               ` Jim Cromie
@ 2006-06-14 12:29                 ` Philippe Gerum
  2006-06-14 13:07                   ` Jan Kiszka
  0 siblings, 1 reply; 27+ messages in thread
From: Philippe Gerum @ 2006-06-14 12:29 UTC (permalink / raw)
  To: Jim Cromie; +Cc: Jan Kiszka, xenomai-core

Jim Cromie wrote:
> Philippe Gerum wrote:
> 
>> Gilles Chanteperdrix wrote:
>>
>>> Philippe Gerum wrote:
>>>  > Redone the check here on a Centrino 1.6 GHz, and still have roughly x20
>>>  > improvement (a bit better actually). I'm using Debian/sarge gcc 3.3.5.
>>>
>>> I think I remember that Pentium M has a much shorter mull instruction
>>> than other processors of the family.
>>>
>>
>> That would explain. Anyway, as John Stulz put it:
>> "math is hard, lets go shopping!"
>>
> 
> Heh.  Appropriate that his name (Stultz) comes up here, as his 
> generic-time (GTOD)
> patchset looks headed for 2.6.18, bringing with it a full re-working
> of linux timers / timeofday.  IN this new world, time is kept on 
> free-running counters.
> 
> Ive been running this patchset on my soekris for some time, since
> GTOD detects that the TSC counts slowly, calls it insane, and does timing
> with the PIT.
> 
> With GTOD, writing a new clocksource driver is easy, enough so I could 
> do it.
> My clocksource patch uses the 27 mhz timer on the Geode CPU.
> Once the TSC is de-rated, mine becomes the best clocksource, and GTOD 
> switches to it.
> 
> All of which is to say ..
> new mainline code is coming, should this current rework notion wait,
> given that its will all need revisited again later
> 

Clearly yes, since this is going to impact Adeos too. GTOD is going to 
fiddle with the PIT channels in a way Adeos needs to be aware of, in 
order for the client RTOS to reuse such timer. Added to the flow of 
other core changes planned for 2.6.18, this is likely going to be funky.

"Find wall. Beat head against same."

-- 

Philippe.



* Re: [Xenomai-core] ns vs. tsc as internal timer base
  2006-06-14 12:29                 ` Philippe Gerum
@ 2006-06-14 13:07                   ` Jan Kiszka
  2006-06-14 16:04                     ` Jan Kiszka
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Kiszka @ 2006-06-14 13:07 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai-core

[-- Attachment #1: Type: text/plain, Size: 2035 bytes --]

Philippe Gerum wrote:
> Jim Cromie wrote:
>> Philippe Gerum wrote:
>>
>>> Gilles Chanteperdrix wrote:
>>>
>>>> Philippe Gerum wrote:
>>>>  > Redone the check here on a Centrino 1.6 GHz, and still have roughly x20
>>>>  > improvement (a bit better actually). I'm using Debian/sarge gcc 3.3.5.
>>>>
>>>> I think I remember that Pentium M has a much shorter mull instruction
>>>> than other processors of the family.
>>>>
>>>
>>> That would explain. Anyway, as John Stulz put it:
>>> "math is hard, lets go shopping!"
>>>
>>
>> Heh.  Appropriate that his name (Stultz) comes up here, as his
>> generic-time (GTOD)
>> patchset looks headed for 2.6.18, bringing with it a full re-working
>> of linux timers / timeofday.  IN this new world, time is kept on
>> free-running counters.
>>
>> Ive been running this patchset on my soekris for some time, since
>> GTOD detects that the TSC counts slowly, calls it insane, and does timing
>> with the PIT.
>>
>> With GTOD, writing a new clocksource driver is easy, enough so I could
>> do it.
>> My clocksource patch uses the 27 mhz timer on the Geode CPU.
>> Once the TSC is de-rated, mine becomes the best clocksource, and GTOD
>> switches to it.
>>
>> All of which is to say ..
>> new mainline code is coming, should this current rework notion wait,
>> given that its will all need revisited again later
>>
> 
> Clearly yes, since this is going to impact Adeos too. GTOD is going to
> fiddle with the PIT channels in a way Adeos needs to be aware of, in
> order for the client RTOS to reuse such timer. Added to the flow of
> other core changes planned for 2.6.18, this is likely going to be funky.
> 
> "Find wall. Beat head against same."
> 

May not be required: the GTOD and clocksource abstractions could provide
a clean way to register some virtual, Adeos- or RTOS-provided clock with
Linux. And that clock may even lose ticks without Linux losing its
system time! So much for the theory; practice may still require walls...

Jan




* Re: [Xenomai-core] ns vs. tsc as internal timer base
  2006-06-14 13:07                   ` Jan Kiszka
@ 2006-06-14 16:04                     ` Jan Kiszka
  0 siblings, 0 replies; 27+ messages in thread
From: Jan Kiszka @ 2006-06-14 16:04 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai-core

[-- Attachment #1: Type: text/plain, Size: 2432 bytes --]

Jan Kiszka wrote:
> Philippe Gerum wrote:
>> Jim Cromie wrote:
>>> Philippe Gerum wrote:
>>>
>>>> Gilles Chanteperdrix wrote:
>>>>
>>>>> Philippe Gerum wrote:
>>>>>  > Redone the check here on a Centrino 1.6 GHz, and still have roughly x20
>>>>>  > improvement (a bit better actually). I'm using Debian/sarge gcc 3.3.5.
>>>>>
>>>>> I think I remember that Pentium M has a much shorter mull instruction
>>>>> than other processors of the family.
>>>>>
>>>> That would explain. Anyway, as John Stulz put it:
>>>> "math is hard, lets go shopping!"
>>>>
>>> Heh.  Appropriate that his name (Stultz) comes up here, as his
>>> generic-time (GTOD)
>>> patchset looks headed for 2.6.18, bringing with it a full re-working
>>> of linux timers / timeofday.  IN this new world, time is kept on
>>> free-running counters.
>>>
>>> Ive been running this patchset on my soekris for some time, since
>>> GTOD detects that the TSC counts slowly, calls it insane, and does timing
>>> with the PIT.
>>>
>>> With GTOD, writing a new clocksource driver is easy, enough so I could
>>> do it.
>>> My clocksource patch uses the 27 mhz timer on the Geode CPU.
>>> Once the TSC is de-rated, mine becomes the best clocksource, and GTOD
>>> switches to it.
>>>
>>> All of which is to say ..
>>> new mainline code is coming, should this current rework notion wait,
>>> given that its will all need revisited again later
>>>
>> Clearly yes, since this is going to impact Adeos too. GTOD is going to
>> fiddle with the PIT channels in a way Adeos needs to be aware of, in
>> order for the client RTOS to reuse such timer. Added to the flow of
>> other core changes planned for 2.6.18, this is likely going to be funky.
>>
>> "Find wall. Beat head against same."
>>
> 
> May not be required: the GTOD and clocksource abstractions could provide
> a clean way to register some virtual, Adeos- or RTOS-provided clock with
> Linux. And that clock may even lose ticks without Linux losing its
> system time! So far for the theory, practice may still require walls...
> 

Some refinement: clocksource may either remain TSC or become a
Xenomai-provided clock if its handling (PIT...) requires
synchronisation. The clockevent, the one thing that triggers timer IRQs,
could become a virtual device driven by Xenomai. And GTOD should happily
make use of them instead of messing with shared hardware.

Jan




* [Xenomai-core] Timer optimisations, continued
  2006-06-13 17:13           ` Gilles Chanteperdrix
  2006-06-13 17:58             ` Philippe Gerum
@ 2006-07-25 18:26             ` Jan Kiszka
  2006-07-27  8:53               ` Philippe Gerum
  1 sibling, 1 reply; 27+ messages in thread
From: Jan Kiszka @ 2006-07-25 18:26 UTC (permalink / raw)
  To: xenomai-core

[-- Attachment #1: Type: text/plain, Size: 18489 bytes --]

Hi all,

to continue the discussion about improving the timer subsystem,
specifically with respect to unit conversion overhead, I'm posting here
a (fairly long) report of my findings and considerations.

First of all I did some benchmarking of the various optimised conversion
routines that popped up. I stressed them on different x86 platforms.
The numbers are for 1000 iterations (loop overhead compensated); the
compiler used was gcc-4.1. Just to recall the actors:

xnarch_tsc_to_ns - original accurate 64-bit division for converting TSC
                   ticks into nanoseconds (and vice versa)
fast_tsc_to_ns   - my scaled-math-based assembler variant, suffering
                   from some inaccuracy for large intervals, still
                   requires normal 64-bit muldiv for the ns-to-TSC
                   return path
ns_2_cycles      - Philippe's similar version, a bit more inaccurate
nodiv_ullimd     - Gilles' 64-bit conversion routine, only sometimes
                   varying in the last bit from the original result
nodiv_imuldiv    - Gilles' 32-bit div-less conversion for small
                   intervals (haven't checked, but I assume it's as
                   accurate as the 64-bit variant in the limited domain)
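
To give a feeling for the "a bit more inaccurate" remark on ns_2_cycles, here is a minimal sketch comparing the 2^10 scaled conversion against a plain division. It assumes the 10-bit scale factor from Philippe's earlier mail; the function names with the _for/_scaled/_exact suffixes are mine and only serve the comparison:

```c
#include <assert.h>
#include <stdint.h>

#define NS2CYC_SCALE_FACTOR 10   /* 2^10, as in the ns_2_cycles proposal */

/* Scale factor for a given CPU frequency in kHz (truncating division). */
static uint32_t ns2cyc_scale_for(uint32_t cpu_khz)
{
    return (uint32_t)(((uint64_t)cpu_khz << NS2CYC_SCALE_FACTOR) / 1000000u);
}

/* Scaled-math conversion: one multiply and one shift. */
static uint64_t ns_2_cycles_scaled(uint64_t ns, uint32_t scale)
{
    return (ns * scale) >> NS2CYC_SCALE_FACTOR;
}

/* Reference conversion using a real division; only valid while
 * ns * cpu_khz still fits into 64 bits. */
static uint64_t ns_2_cycles_exact(uint64_t ns, uint32_t cpu_khz)
{
    assert(cpu_khz != 0 && ns <= UINT64_MAX / cpu_khz);
    return ns * cpu_khz / 1000000u;
}
```

For a 1.3 GHz CPU (cpu_khz = 1300000), the scale truncates from 1331.2 to 1331, so 1 ms converts to 1299804 cycles instead of 1300000, i.e. about 150 ppm short. At exactly 1 GHz the scale is 1024 and the error vanishes, so the inaccuracy depends on how well the frequency fits into the 10 scale bits.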

And here are the results (ugly test code available on request):

VIA C2, 600 MHz:
xnarch_tsc_to_ns:  160680 cycles /  267800 ns
fast_tsc_to_ns:    119842 cycles /  199736 ns
ns_2_cycles:        69376 cycles /  115626 ns
nodiv_ullimd:      179042 cycles /  298403 ns
nodiv_imuldiv:      41336 cycles /   68893 ns

P-III, 1 GHz:
xnarch_tsc_to_ns:  108475 cycles /  107935 ns
fast_tsc_to_ns:     24127 cycles /   24006 ns
ns_2_cycles:        21338 cycles /   21231 ns
nodiv_ullimd:       67974 cycles /   67635 ns
nodiv_imuldiv:      13269 cycles /   13202 ns

P-MMX, 266 MHz:
xnarch_tsc_to_ns:  131886 cycles /  495812 ns
fast_tsc_to_ns:     47697 cycles /  179312 ns
ns_2_cycles:        43627 cycles /  164011 ns
nodiv_ullimd:      141915 cycles /  533515 ns
nodiv_imuldiv:      44761 cycles /  168274 ns

P-M, 1.3 GHz:
xnarch_tsc_to_ns:  113219 cycles /   87091 ns
fast_tsc_to_ns:     26718 cycles /   20552 ns
ns_2_cycles:        15024 cycles /   11556 ns
nodiv_ullimd:       49620 cycles /   38169 ns
nodiv_imuldiv:      17036 cycles /   13104 ns

Opteron 275 (32-bit mode), 1.8 GHz:
xnarch_tsc_to_ns:  112507 cycles /   62503 ns
fast_tsc_to_ns:     21857 cycles /   12142 ns
ns_2_cycles:        12545 cycles /    6969 ns
nodiv_ullimd:       41175 cycles /   22875 ns
nodiv_imuldiv:       7261 cycles /    4033 ns

For sure, working with only 32 bits is the fastest variant on all
platforms. Other variants do not always perform well or have limited
accuracy. Unfortunately, 32-bit conversions cannot be applied in all
scenarios, as we will see below.


After hacking my fast_tsc_to_ns, my original plan was to switch the
internal timer base completely to nanoseconds in the hope of reducing the
number of conversions in the timer hot-paths. Luckily I decided to
analyse the typical scenarios first before starting to develop any
patch. I consider the following 5 scenarios for heavy timer usage. Both
TSC and nanoseconds as time base are analysed, as well as a potential
timer_start() variant that accepts absolute timeout values. The pseudo
code /should/ be self-explanatory. If not, do not hesitate to ask.


1. Periodic Timers
==================

Start once, run continuously
=> hot-path is the timer IRQ


1.1 TSC-based
-------------

task_set_periodic(start, interval)              [rarely]
        delay = start - get_time()
                get_time(): tsc -> ns           [64-bit]
        timer_start(delay, interval)
                delay: ns -> tsc                [32-bit candidate]
                date = get_tsc() + delay
                interval: ns -> tsc             [32-bit candidate]
                program_timer(date)
                        delay = date - get_tsc()
                        set_hw_timer(delay)
-or-
task_set_periodic(start, interval)              [rarely]
        timer_start_abs(start, interval)
                date: ns -> tsc                 [64-bit]
                interval: ns -> tsc             [32-bit candidate]
                program_timer(date)
                        delay = date - get_tsc()
                        set_hw_timer(delay)

timer_irq()                                     [hot-path]
        date <= get_tsc()?
        date = get_tsc() + interval
        program_timer(date)
                delay = date - get_tsc()
                set_hw_timer(delay)


1.2 ns-based
------------

task_set_periodic(start, interval)              [rarely]
        delay = start - get_time()
                get_time(): tsc -> ns           [64-bit]
        timer_start(delay, interval)
                date = get_time() + delay
                        get_time(): tsc -> ns   [64-bit]
                program_timer(date)
                        date: ns -> tsc         [64-bit]
                        delay = date - get_tsc()
                        set_hw_timer(delay)
-or-
task_set_periodic(start, interval)              [rarely]
        timer_start_abs(start, interval)
                date = start
                program_timer(date)
                        date: ns -> tsc         [64-bit]
                        delay = date - get_tsc()
                        set_hw_timer(delay)

timer_irq()                                     [hot-path]
        now = get_time()
                get_time: tsc -> ns             [64-bit]
        date <= now?
        date = now + interval
        program_timer(date)
                date: ns -> tsc                 [64-bit]
                delay = date - get_tsc()
                set_hw_timer(delay)


1.3 Summary of (only!) the hot-path
-----------------------------------

              | total       | tsc->ns | ns->tsc | possible
              | conversions |         |         | 32-bit ns->tsc
--------------+-------------+---------+---------+----------------
TSC-based     | 0           | 0       | 0       | 0
TSC-based+ABS | 0           | 0       | 0       | 0
ns-based      | 2           | 1       | 1       | 0
ns-based+ABS  | 2           | 1       | 1       | 0
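
The hot-path counts in the table above can be reproduced with a small simulation. This is only a sketch under stated assumptions: the counter is faked, the CPU is assumed to run at 1 GHz so that one tick equals one nanosecond, and the conversion helpers are stand-ins that merely count how often they are called:

```c
#include <assert.h>
#include <stdint.h>

static uint64_t fake_tsc;      /* simulated free-running counter */
static unsigned conversions;   /* number of unit conversions performed */

static uint64_t get_tsc(void) { return fake_tsc; }

/* Stand-in conversions: identity at the assumed 1 GHz, they only count. */
static uint64_t tsc_to_ns(uint64_t t)  { conversions++; return t; }
static uint64_t ns_to_tsc(uint64_t ns) { conversions++; return ns; }

struct ptimer {
    uint64_t date;       /* next expiry, in the timer base unit */
    uint64_t interval;   /* period, in the timer base unit */
};

/* Timer IRQ with a TSC time base; returns the delay for set_hw_timer(). */
static uint64_t irq_tsc_based(struct ptimer *t)
{
    if (t->date <= get_tsc())
        t->date = get_tsc() + t->interval;
    return t->date - get_tsc();
}

/* Timer IRQ with a ns time base: one tsc->ns for "now", one ns->tsc
 * when programming the hardware. */
static uint64_t irq_ns_based(struct ptimer *t)
{
    uint64_t now = tsc_to_ns(get_tsc());

    if (t->date <= now)
        t->date = now + t->interval;
    return ns_to_tsc(t->date) - get_tsc();
}
```

Running both handlers on an expired timer yields the same hardware delay, but the TSC-based one leaves the conversion counter at 0 while the ns-based one bumps it to 2.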



2. Relative Timers
==================
(explicit relative delays)

Started often, typically time out
=> hot-path is timer_start() and the timer IRQ


2.1 TSC-based
-------------

task_sleep(delay)                               [hot-path]
        timer_start(delay, 0)
                delay: ns -> tsc                [32-bit candidate]
                date = get_tsc() + delay
                program_timer(date)
                        delay = date - get_tsc()
                        set_hw_timer(delay)

timer_irq()                                     [hot-path]
        date <= get_tsc()?
        (programming of succeeding timer intentionally not included)


2.2 ns-based
------------

task_sleep(delay)                               [hot-path]
        timer_start(delay, 0)
                date = get_time() + delay
                        get_time: tsc -> ns     [64-bit]
                program_timer(date)
                        date: ns -> tsc         [64-bit]
                        delay = date - get_tsc()
                        set_hw_timer(delay)

timer_irq()                                     [hot-path]
        now = get_time()
                get_time: tsc -> ns             [64-bit]
        date <= now?
        (programming of succeeding timer intentionally not included)


2.3 Summary of the hot-path
---------------------------

              | total       | tsc->ns | ns->tsc | possible
              | conversions |         |         | 32-bit ns->tsc
--------------+-------------+---------+---------+----------------
TSC-based     | 1           | 0       | 1       | 1
ns-based      | 3           | 2       | 1       | 0
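
What "32-bit candidate" could mean in timer_start() is sketched below: delays that fit into 32 bits (up to about 4.29 s in ns) take the cheap scaled-math path, anything longer falls back to a real division. The struct, the helper names and the 2^10 scale are assumptions for illustration, not the nucleus code:

```c
#include <assert.h>
#include <stdint.h>

#define NS2CYC_SHIFT 10   /* assumed 2^10 scale, as in ns_2_cycles */

/* Hypothetical conversion context, set up once at calibration time. */
struct tsc_conv {
    uint32_t cpu_khz;
    uint32_t scale;       /* (cpu_khz << NS2CYC_SHIFT) / 10^6 */
};

static struct tsc_conv tsc_conv_init(uint32_t cpu_khz)
{
    struct tsc_conv c;

    assert(cpu_khz > 0);
    c.cpu_khz = cpu_khz;
    c.scale = (uint32_t)(((uint64_t)cpu_khz << NS2CYC_SHIFT) / 1000000u);
    return c;
}

/* Converts a relative delay from ns to TSC ticks. */
static uint64_t delay_ns_to_tsc(const struct tsc_conv *c, uint64_t delay_ns)
{
    const uint64_t ns_per_sec = UINT64_C(1000000000);
    uint64_t cpu_hz = (uint64_t)c->cpu_khz * 1000u;

    if (delay_ns <= UINT32_MAX)
        /* 32-bit candidate: one 32x32->64 multiply plus a shift */
        return ((uint64_t)(uint32_t)delay_ns * c->scale) >> NS2CYC_SHIFT;

    /* Long delay: slower path with real 64-bit divisions; seconds and
     * sub-second remainder are converted separately to avoid overflow. */
    return delay_ns / ns_per_sec * cpu_hz
         + delay_ns % ns_per_sec * cpu_hz / ns_per_sec;
}
```

Since relative delays passed to task_sleep() are usually far below 4 seconds, the slow branch should rarely be taken, which is why the table counts the single ns->tsc conversion as a possible 32-bit one.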



3. Relative Timeouts
====================
(IPC mechanisms, device operations, etc.)

Started often, typically do not fire, often comparably large timeout
values that do not make it down to program_timer() before cancellation
=> hot-path is timer_start() and timer_stop()


3.1 TSC-based
-------------

mutex_lock(..., delay)                          [hot-path]
        timer_start(delay, 0)
                delay: ns -> tsc                [32-bit candidate]
                date = get_tsc() + delay
                program_timer(date)             [rarely]
                        delay = date - get_tsc()
                        set_hw_timer(delay)
        block_on_mutex()
        timer_stop()
                program_timer(date)             [rarely]
                        delay = date - get_tsc()
                        set_hw_timer(delay)

timer_irq()                                     [rarely]
        date <= get_tsc()?


3.2 ns-based
------------

mutex_lock(..., delay)                          [hot-path]
        timer_start(delay, 0)
                date = get_time() + delay
                        get_time: tsc -> ns     [64-bit]
                program_timer(date)             [rarely]
                        date: ns -> tsc         [64-bit]
                        delay = date - get_tsc()
                        set_hw_timer(delay)
        block_on_mutex()
        timer_stop()
                program_timer(date)             [rarely]
                        date: ns -> tsc         [64-bit]
                        delay = date - get_tsc()
                        set_hw_timer(delay)

timer_irq()                                     [rarely]
        now = get_time()
                get_time: tsc -> ns             [64-bit]
        date <= now?


3.3 Summary of the hot-path
---------------------------

              | total       | tsc->ns | ns->tsc | possible
              | conversions |         |         | 32-bit ns->tsc
--------------+-------------+---------+---------+----------------
TSC-based     | 1           | 0       | 1       | 1
ns-based      | 1           | 1       | 0       | 0



4. Absolute Timers
==================
(e.g. TDMA slot timing in RTnet)

Started often, include time-stamp acquisition and conversion, typically
fire => hot-path is get_time(), timer_start(), and the timer IRQ


4.1 TSC-based
-------------

date = get_time() + delay                       [hot-path]
        get_time: tsc -> ns                     [64-bit]

task_sleep_until(date)                          [hot-path]
        delay = date - get_time()
                get_time: tsc -> ns             [64-bit]
        timer_start(delay, 0)
                delay: ns -> tsc                [32-bit candidate]
                date = get_tsc() + delay
                program_timer(date)
                        delay = date - get_tsc()
                        set_hw_timer(delay)
-or-
task_sleep_until(date)                          [hot-path]
        timer_start_abs(date, 0)
                date: ns -> tsc                 [64-bit]
                program_timer(date)
                        delay = date - get_tsc()
                        set_hw_timer(delay)

timer_irq()                                     [hot-path]
        test date <= get_tsc()?


4.2 ns-based
------------

date = get_time() + delay                       [hot-path]
        get_time: tsc -> ns                     [64-bit]

task_sleep_until(date)                          [hot-path]
        delay = date - get_time()
                get_time: tsc -> ns             [64-bit]
        timer_start(delay, 0)
                date = get_time() + delay
                        get_time(): tsc -> ns   [64-bit]
                program_timer(date)
                        date: ns -> tsc         [64-bit]
                        delay = date - get_tsc()
                        set_hw_timer(delay)
-or-
task_sleep_until(date)                          [hot-path]
        timer_start_abs(date, 0)
                program_timer(date)
                        date: ns -> tsc         [64-bit]
                        delay = date - get_tsc()
                        set_hw_timer(delay)

timer_irq()                                     [hot-path]
        now = get_time()
                get_time: tsc -> ns             [64-bit]
        date <= now?


4.3 Summary of the hot-path
---------------------------

              | total       | tsc->ns | ns->tsc | possible
              | conversions |         |         | 32-bit ns->tsc
--------------+-------------+---------+---------+----------------
TSC-based     | 3           | 2       | 1       | 1
TSC-based+ABS | 2           | 1       | 1       | 0
ns-based      | 5           | 4       | 1       | 0
ns-based+ABS  | 3           | 2       | 1       | 0
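
The TSC-based rows above can be cross-checked with a small C model of 4.1.
This is only a sketch: the function names mirror the pseudocode, while the
1:1 tsc/ns ratio, the call counters and fake_tsc are assumptions made up for
illustration. None of it is actual nucleus code.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model only: a pretend 1 GHz TSC makes both conversions the
 * identity, so all that is observed is how often each one is called. */
static unsigned int n_tsc_to_ns, n_ns_to_tsc;
static uint64_t fake_tsc;   /* stand-in for the hardware counter */
static uint64_t hw_delay;   /* last delay handed to the pretend hw timer */

static uint64_t get_tsc(void) { return fake_tsc; }
static uint64_t tsc_to_ns(uint64_t tsc) { n_tsc_to_ns++; return tsc; }
static uint64_t ns_to_tsc(uint64_t ns) { n_ns_to_tsc++; return ns; }
static uint64_t get_time(void) { return tsc_to_ns(get_tsc()); }

static void reset_counters(void) { n_tsc_to_ns = n_ns_to_tsc = 0; }

static void program_timer(uint64_t date_tsc)
{
    assert(date_tsc >= get_tsc());      /* the model ignores past dates */
    hw_delay = date_tsc - get_tsc();    /* set_hw_timer(delay) */
}

/* Relative start: the delay arrives in ns, the date is kept in TSC ticks. */
static void timer_start(uint64_t delay_ns)
{
    program_timer(get_tsc() + ns_to_tsc(delay_ns));
}

/* Absolute start: the date arrives in ns and is converted exactly once. */
static void timer_start_abs(uint64_t date_ns)
{
    program_timer(ns_to_tsc(date_ns));
}

/* First variant of 4.1: the absolute date is turned back into a delay. */
static void task_sleep_until_rel(uint64_t date_ns)
{
    timer_start(date_ns - get_time());
}

/* Second variant of 4.1: the absolute date is passed straight through. */
static void task_sleep_until_abs(uint64_t date_ns)
{
    timer_start_abs(date_ns);
}
```

Computing date = get_time() + delay and then calling task_sleep_until_rel(date)
leaves the counters at 2 tsc->ns and 1 ns->tsc; the _abs variant ends at 1 and
1, which is what the "TSC-based" and "TSC-based+ABS" rows claim.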



5. Absolute Timeouts
====================
(e.g. POSIX IPC mechanisms)

Started often, typically do not fire, include time-stamp acquisition,
often comparably large timeout values
=> hot-path is get_time(), timer_start(), and timer_stop()


5.1 TSC-based
-------------

date = get_time() + delay                       [hot-path]
        get_time: tsc -> ns                     [64-bit]

sem_timeddown(..., date)                        [hot-path]
        delay = date - get_time()
                get_time: tsc -> ns             [64-bit]
        timer_start(delay, 0)
                delay: ns -> tsc                [32-bit candidate]
                date = get_tsc() + delay
                program_timer(date)             [rarely]
                        delay = date - get_tsc()
                        set_hw_timer(delay)
        block_on_sem()
        timer_stop()
                program_timer(date)             [rarely]
                        delay = date - get_tsc()
                        set_hw_timer(delay)
-or-
sem_timeddown(..., date)                        [hot-path]
        timer_start_abs(date, 0)
                date: ns -> tsc                 [64-bit]
                program_timer(date)             [rarely]
                        delay = date - get_tsc()
                        set_hw_timer(delay)
        block_on_sem()
        timer_stop()
                program_timer(date)             [rarely]
                        delay = date - get_tsc()
                        set_hw_timer(delay)

timer_irq()                                     [rarely]
        date <= get_tsc()?


5.2 ns-based
------------

date = get_time() + delay                       [hot-path]
        get_time: tsc -> ns                     [64-bit]

sem_timeddown(..., date)                        [hot-path]
        delay = date - get_time()
                get_time: tsc -> ns             [64-bit]
        timer_start(delay, 0)
                date = get_time() + delay
                        get_time(): tsc -> ns   [64-bit]
                program_timer(date)             [rarely]
                        date: ns -> tsc         [64-bit]
                        delay = date - get_tsc()
                        set_hw_timer(delay)
        block_on_sem()
        timer_stop()
                program_timer(date)             [rarely]
                        date: ns -> tsc         [64-bit]
                        delay = date - get_tsc()
                        set_hw_timer(delay)
-or-
sem_timeddown(..., date)                        [hot-path]
        timer_start_abs(date, 0)
                program_timer(date)             [rarely]
                        date: ns -> tsc         [64-bit]
                        delay = date - get_tsc()
                        set_hw_timer(delay)
        block_on_sem()
        timer_stop()
                program_timer(date)             [rarely]
                        date: ns -> tsc         [64-bit]
                        delay = date - get_tsc()
                        set_hw_timer(delay)

timer_irq()                                     [rarely]
        now = get_time()
                get_time: tsc -> ns             [64-bit]
        date <= now?


5.3 Summary of the hot-path
---------------------------

              | total       | tsc->ns | ns->tsc | possible
              | conversions |         |         | 32-bit ns->tsc
--------------+-------------+---------+---------+----------------
TSC-based     | 3           | 2       | 1       | 1
TSC-based+ABS | 2           | 1       | 1       | 0
ns-based      | 3           | 3       | 0       | 0
ns-based+ABS  | 1           | 1       | 0       | 0
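
One plausible reading of the [rarely] annotations in 5.1 and 5.2 is that the
hardware timer only needs reprogramming when the earliest pending date
changes, and a comparably large timeout seldom sits at the head of the queue.
The C sketch below models that with a sorted list. struct timer, the queue
layout and the hw_programmed counter are hypothetical and say nothing about
how the nucleus actually organises its timers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct timer {
    uint64_t date;              /* expiry date, in whatever unit the core uses */
    struct timer *next;
};

static struct timer *queue;             /* sorted, earliest date first */
static unsigned int hw_programmed;      /* counts pretend set_hw_timer() calls */

/* Reprogram the hardware for the current head, if there is one. */
static void program_timer(void)
{
    if (queue)
        hw_programmed++;
}

static void timer_start(struct timer *t, uint64_t date)
{
    struct timer **pp = &queue;

    t->date = date;
    while (*pp && (*pp)->date <= date)  /* find the sorted position */
        pp = &(*pp)->next;
    t->next = *pp;
    *pp = t;

    if (queue == t)                     /* new earliest date: touch hw */
        program_timer();
}

static void timer_stop(struct timer *t)
{
    struct timer **pp = &queue;
    int was_head = (queue == t);

    while (*pp && *pp != t)
        pp = &(*pp)->next;
    assert(*pp == t);                   /* caller must pass a queued timer */
    *pp = t->next;
    t->next = NULL;

    if (was_head)                       /* earliest date changed: touch hw */
        program_timer();
}
```

With a short periodic timer pending, starting and stopping a long timeout
never reaches program_timer() in this model, so under the ns-based scheme the
ns->tsc conversion inside it would stay off the hot path.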

[Please don't take every detail above at face value. Some bugs may lurk
even there. Too many conversions...]


To summarise these lengthy results:

 o ns-based xntimers are nice at first sight, but not on second. Most
   use-cases (except 5) require fewer conversions when we keep the
   abstraction as it is.

 o Performance should be improvable by combining fast_tsc_to_ns for full
   64-bit conversions with nodiv_imuldiv for short relative ns-to-tsc.
   It should be ok to lose some accuracy wrt long periods given that
   TSCs are AFAIK not very accurate themselves. Nevertheless, to keep
   precision on 64-bit ns-to-tsc reverse conversions, those should
   remain implemented as they are:
   "if (ns <= ULONG_MAX) nodiv_imuldiv else xnarch_ns_to_tsc"

 o A further improvement should be achievable for scenarios 4 and 5 by
   introducing absolute xntimers (more precisely: a flag to
   differentiate between the mode on xntimer_start). I have an outdated
   patch for this in my repos, needs re-basing.
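
The split proposed in the second item could look roughly like the C sketch
below. It is not the code of fast_tsc_to_ns, nodiv_imuldiv or
xnarch_ns_to_tsc, none of which is shown in this mail; struct clock_scale and
all function names are hypothetical stand-ins. UINT32_MAX takes the place of
ULONG_MAX as found on 32-bit x86, and the TSC frequency is assumed to fit in
32 bits.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC UINT64_C(1000000000)

struct clock_scale {
    uint64_t freq;          /* TSC frequency in Hz (assumed below 2^32) */
    uint32_t mul;           /* freq / 1e9, scaled by 2^shift */
    unsigned int shift;
};

/* Pick the largest shift for which freq * 2^shift / 1e9 fits in 32 bits. */
static void clock_scale_init(struct clock_scale *cs, uint64_t freq)
{
    unsigned int shift = 32;
    uint64_t mul;

    assert(freq > 0 && freq <= UINT32_MAX);
    do {
        shift--;
        mul = (freq << shift) / NSEC_PER_SEC;
    } while (mul > UINT32_MAX);

    cs->freq = freq;
    cs->mul = (uint32_t)mul;
    cs->shift = shift;
}

/* Fast path: one 32x32->64 multiplication and a shift, no division.
 * The truncated factor means the result may undershoot slightly. */
static uint64_t ns_to_tsc_fast(const struct clock_scale *cs, uint32_t ns)
{
    return ((uint64_t)ns * cs->mul) >> cs->shift;
}

/* Slow path: floor(ns * freq / 1e9), split into whole seconds and the
 * remainder so that the intermediate products stay within 64 bits. */
static uint64_t ns_to_tsc_exact(const struct clock_scale *cs, uint64_t ns)
{
    uint64_t sec = ns / NSEC_PER_SEC;
    uint64_t rem = ns % NSEC_PER_SEC;

    return sec * cs->freq + rem * cs->freq / NSEC_PER_SEC;
}

/* Short relative delays take the fast path, everything else the exact one. */
static uint64_t ns_to_tsc(const struct clock_scale *cs, uint64_t ns)
{
    if (ns <= UINT32_MAX)
        return ns_to_tsc_fast(cs, (uint32_t)ns);
    return ns_to_tsc_exact(cs, ns);
}
```

In this model the fast path never overshoots and stays within a couple of
ticks of the exact result for any 32-bit delay, while absolute dates and long
delays keep full precision at the cost of the divisions.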

To verify that we actually improve something with each of the changes
above, some kind of fine-grained test suite will be required. The
timerbench could be extended to support all 5 scenarios. But does
anyone have a quick idea how best to evaluate the overall performance?
The new per-task statistics code is not accurate enough as it
accounts IRQs mostly to the preempted task, not the preempting one. Mm,
execution time of some long-running number-crunching Linux task in the
background?

Looking forward to feedback!

Jan


PS: Finally, after stabilising the xntimers again, we will see a nice
rtdm_timer API as well. But those patches need even more re-basing then...



* Re: [Xenomai-core] Timer optimisations, continued
  2006-07-25 18:26             ` [Xenomai-core] Timer optimisations, continued Jan Kiszka
@ 2006-07-27  8:53               ` Philippe Gerum
  2006-07-27 12:42                 ` Gilles Chanteperdrix
  0 siblings, 1 reply; 27+ messages in thread
From: Philippe Gerum @ 2006-07-27  8:53 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

On Tue, 2006-07-25 at 20:26 +0200, Jan Kiszka wrote:

<massive snippage>

> 
> To summarise these lengthy results:
> 
>  o ns-based xntimers are nice at first sight, but not on second. Most
>    use-cases (except 5) require fewer conversions when we keep the
>    abstraction as it is.
> 

The current approach was a deliberate choice to favour accuracy of
timers, at the - reasonably small - expense of not optimizing the
"timeout" use case. The net result is that the core timing code is
TSC-based, so that no time unit conversion occurs after a timer has been
started, except in the case where the hw timer has a different time unit
than the TSC used (that said, this last conversion before programming
the hw timer would be needed regardless of the time unit maintained by
the timing core).

>  o Performance should be improvable by combining fast_tsc_to_ns for full
>    64-bit conversions with nodiv_imuldiv for short relative ns-to-tsc.
>    It should be ok to lose some accuracy wrt long periods given that
>    TSCs are AFAIK not very accurate themselves. Nevertheless, to keep
>    precision on 64-bit ns-to-tsc reverse conversions, those should
>    remain implemented as they are:
>    "if (ns <= ULONG_MAX) nodiv_imuldiv else xnarch_ns_to_tsc"
> 

I basically agree with that, including the 64/32 optimization on delay
ranges. IOW, we could optimize time conversions in the timing core
_locally_ (i.e. nucleus/timer.c exclusively) even at the expense of a
small loss of accuracy in the dedicated converters. In any case, we are
implicitly talking of the oneshot mode here, and as such, it would be
acceptable to trigger an early shot once in a while - i.e. due to the
loss of accuracy - that would cause the existing code to restart the
timer until it eventually elapses past the expected time, given that
this would only occur with large delays. But: we must leave the existing
converters as they are in the xnarch layer, keeping the most accurate
operations provided there, since a lot of code depends on their
accuracy.
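
The "early shot, then restart" behaviour described above can be sketched in
C as follows. Everything here is hypothetical: struct oneshot, the helper
names, and in particular lossy_delay(), which merely imitates a converter
that undershoots large delays by about one part in 2^20 and is exact for
small ones. It is not the existing nucleus code.

```c
#include <assert.h>
#include <stdint.h>

struct oneshot {
    uint64_t date;          /* exact expiry date, in TSC ticks */
    uint64_t hw_delay;      /* delay last programmed into the hw timer */
    unsigned int restarts;  /* number of early shots seen so far */
    int fired;              /* set once the timer has really elapsed */
};

/* Pretend fast conversion: loses roughly 2^-20 of the value, i.e. nothing
 * for delays below 2^20 ticks and a little for large ones. */
static uint64_t lossy_delay(uint64_t exact)
{
    return exact - (exact >> 20);
}

static void program_timer(struct oneshot *t, uint64_t now)
{
    assert(t->date > now);                  /* only future dates are armed */
    t->hw_delay = lossy_delay(t->date - now);
}

/* Oneshot handler: an early shot simply re-arms for the remaining time. */
static void timer_irq(struct oneshot *t, uint64_t now)
{
    if (t->date > now) {
        t->restarts++;
        program_timer(t, now);
        return;
    }
    t->fired = 1;
}
```

Arming a date 2^30 ticks ahead yields one early shot 1024 ticks before the
date in this model; the re-armed delay is small enough to be exact, so the
second shot lands on time. Short delays never take the extra round trip.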

>  o A further improvement should be achievable for scenarios 4 and 5 by
>    introducing absolute xntimers (more precisely: a flag to
>    differentiate between the mode on xntimer_start). I have an outdated
>    patch for this in my repos, needs re-basing.
> 

Grmblm... Well, I would have preferred that we don't add that kind of
complexity to the nucleus interface, but I must admit that some
important use cases are definitely better served by absolute timespecs,
so I would surrender to this requirement, provided the implementation is
confined to xnpod_suspend_thread() + xntimer_start().

> To verify that we actually improve something with each of the changes
> above, some kind of fine-grained test suite will be required. The
> timerbench could be extended to support all 5 scenarios. But does
> anyone have a quick idea how best to evaluate the overall performance?
> The new per-task statistics code is not accurate enough as it
> accounts IRQs mostly to the preempted task, not the preempting one. Mm,
> execution time of some long-running number-crunching Linux task in the
> background?

Better use a kernel-based low priority RT task running in the
background, limiting the sampling period to a duration that Linux could
bear with (maybe running multiple subsequent periods with warmup phases,
just to let the penguin breathe). The effect of TLB misses would be much
lower, and no need to block the Linux IRQs using Xenomai's I-shield.

> Looking forward to feedback!
> 
> Jan
> 
> 
> PS: Finally, after stabilising the xntimers again, we will see a nice
> rtdm_timer API as well. But those patches need even more re-basing then...
> 
-- 
Philippe.


* Re: [Xenomai-core] Timer optimisations, continued
  2006-07-27  8:53               ` Philippe Gerum
@ 2006-07-27 12:42                 ` Gilles Chanteperdrix
  2006-07-27 13:19                   ` Philippe Gerum
  0 siblings, 1 reply; 27+ messages in thread
From: Gilles Chanteperdrix @ 2006-07-27 12:42 UTC (permalink / raw)
  To: rpm; +Cc: Jan Kiszka, xenomai-core

Philippe Gerum wrote:
 > >  o A further improvement should be achievable for scenarios 4 and 5 by
 > >    introducing absolute xntimers (more precisely: a flag to
 > >    differentiate between the mode on xntimer_start). I have an outdated
 > >    patch for this in my repos, needs re-basing.
 > > 
 > 
 > Grmblm... Well, I would have preferred that we don't add that kind of
 > complexity to the nucleus interface, but I must admit that some
 > important use cases are definitely better served by absolute timespecs,
 > so I would surrender to this requirement, provided the implementation is
 > confined to xnpod_suspend_thread() + xntimer_start().

It would be nice if absolute timeouts were also available when using
xnsynch_sleep_on. There are a few use cases in the POSIX skin.

-- 


					    Gilles Chanteperdrix.



* Re: [Xenomai-core] Timer optimisations, continued
  2006-07-27 12:42                 ` Gilles Chanteperdrix
@ 2006-07-27 13:19                   ` Philippe Gerum
  2006-07-27 13:54                     ` Jan Kiszka
  0 siblings, 1 reply; 27+ messages in thread
From: Philippe Gerum @ 2006-07-27 13:19 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: Jan Kiszka, xenomai-core

On Thu, 2006-07-27 at 14:42 +0200, Gilles Chanteperdrix wrote:
> Philippe Gerum wrote:
>  > >  o A further improvement should be achievable for scenarios 4 and 5 by
>  > >    introducing absolute xntimers (more precisely: a flag to
>  > >    differentiate between the mode on xntimer_start). I have an outdated
>  > >    patch for this in my repos, needs re-basing.
>  > > 
>  > 
>  > Grmblm... Well, I would have preferred that we don't add that kind of
>  > complexity to the nucleus interface, but I must admit that some
>  > important use cases are definitely better served by absolute timespecs,
>  > so I would surrender to this requirement, provided the implementation is
>  > confined to xnpod_suspend_thread() + xntimer_start().
> 
> It would be nice if absolute timeouts were also available when using
> xnsynch_sleep_on. There are a few use cases in the POSIX skin.

Makes sense, since xnpod_suspend_thread() and xnsynch_sleep_on() are
tightly integrated interfaces.

> 
-- 
Philippe.





* Re: [Xenomai-core] Timer optimisations, continued
  2006-07-27 13:19                   ` Philippe Gerum
@ 2006-07-27 13:54                     ` Jan Kiszka
  2006-07-27 14:10                       ` Philippe Gerum
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Kiszka @ 2006-07-27 13:54 UTC (permalink / raw)
  To: rpm; +Cc: xenomai-core

Philippe Gerum wrote:
> On Thu, 2006-07-27 at 14:42 +0200, Gilles Chanteperdrix wrote:
>> Philippe Gerum wrote:
>>  > >  o A further improvement should be achievable for scenarios 4 and 5 by
>>  > >    introducing absolute xntimers (more precisely: a flag to
>>  > >    differentiate between the mode on xntimer_start). I have an outdated
>>  > >    patch for this in my repos, needs re-basing.
>>  > > 
>>  > 
>>  > Grmblm... Well, I would have preferred that we don't add that kind of
>>  > complexity to the nucleus interface, but I must admit that some
>>  > important use cases are definitely better served by absolute timespecs,
>>  > so I would surrender to this requirement, provided the implementation is
>>  > confined to xnpod_suspend_thread() + xntimer_start().
>>
>> It would be nice if absolute timeouts were also available when using
>> xnsynch_sleep_on. There are a few use cases in the POSIX skin.
> 
> Makes sense, since xnpod_suspend_thread() and xnsynch_sleep_on() are
> tightly integrated interfaces.
> 

Anyone any idea how to extend both function interfaces best to
differentiate absolute/relative timeouts? I guess we need an additional
argument to the functions, don't we?

I had the weird idea of using the sign bit of the timeout value for
this. But the potential side effects of halving the absolute time domain
this way scare me.

Jan



* Re: [Xenomai-core] Timer optimisations, continued
  2006-07-27 13:54                     ` Jan Kiszka
@ 2006-07-27 14:10                       ` Philippe Gerum
  0 siblings, 0 replies; 27+ messages in thread
From: Philippe Gerum @ 2006-07-27 14:10 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

On Thu, 2006-07-27 at 15:54 +0200, Jan Kiszka wrote:
> Philippe Gerum wrote:
> > On Thu, 2006-07-27 at 14:42 +0200, Gilles Chanteperdrix wrote:
> >> Philippe Gerum wrote:
> >>  > >  o A further improvement should be achievable for scenarios 4 and 5 by
> >>  > >    introducing absolute xntimers (more precisely: a flag to
> >>  > >    differentiate between the mode on xntimer_start). I have an outdated
> >>  > >    patch for this in my repos, needs re-basing.
> >>  > > 
> >>  > 
> >>  > Grmblm... Well, I would have preferred that we don't add that kind of
> >>  > complexity to the nucleus interface, but I must admit that some
> >>  > important use cases are definitely better served by absolute timespecs,
> >>  > so I would surrender to this requirement, provided the implementation is
> >>  > confined to xnpod_suspend_thread() + xntimer_start().
> >>
> >> It would be nice if absolute timeouts were also available when using
> >> xnsynch_sleep_on. There are a few use cases in the POSIX skin.
> > 
> > Makes sense, since xnpod_suspend_thread() and xnsynch_sleep_on() are
> > tightly integrated interfaces.
> > 
> 
> Anyone any idea how to extend both function interfaces best to
> differentiate absolute/relative timeouts? I guess we need an additional
> argument to the functions, don't we?

Yes, I'm afraid we do. The other approach that would basically make the
timeout a non-scalar value in order to store the rel/abs qualifier would
be just overkill.

> 
> I had the weird idea of using the sign bit of the timeout value for
> this. But the potential side effects of halving the absolute time domain
> this way scare me.
> 

Same here, this looks like a very fragile solution to a general issue.
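
To illustrate the two options weighed here, a C sketch with an explicit mode
argument next to the sign-bit encoding. All names (ticks_t, timeout_mode,
expiry_explicit, expiry_signbit, ABS_FLAG) are invented for the example; the
real signatures of xnpod_suspend_thread() and xnsynch_sleep_on() are not
reproduced.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t ticks_t;

enum timeout_mode { TIMEOUT_RELATIVE, TIMEOUT_ABSOLUTE };

/* Option 1: an additional argument carries the rel/abs qualifier. */
static ticks_t expiry_explicit(ticks_t now, ticks_t value,
                               enum timeout_mode mode)
{
    return mode == TIMEOUT_ABSOLUTE ? value : now + value;
}

/* Option 2: the top bit of the value itself marks an absolute date. */
#define ABS_FLAG (UINT64_C(1) << 63)

static ticks_t encode_abs(ticks_t date)
{
    assert(!(date & ABS_FLAG));     /* only half of the domain is usable */
    return date | ABS_FLAG;
}

static ticks_t expiry_signbit(ticks_t now, ticks_t encoded)
{
    if (encoded & ABS_FLAG)
        return encoded & ~ABS_FLAG; /* absolute: strip the marker */
    return now + encoded;           /* relative: add to the current time */
}
```

Both variants agree on ordinary values. The fragility shows up for values
with the top bit set: expiry_explicit() handles them as given, whereas
expiry_signbit() silently reinterprets them as small absolute dates.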

-- 
Philippe.




Thread overview: 27+ messages
2006-06-13 10:51 [Xenomai-core] ns vs. tsc as internal timer base Jan Kiszka
2006-06-13 11:16 ` Philippe Gerum
2006-06-13 11:56   ` Jan Kiszka
2006-06-13 12:31     ` Philippe Gerum
2006-06-13 13:07       ` Gilles Chanteperdrix
2006-06-13 13:28         ` Philippe Gerum
2006-06-13 13:34           ` Gilles Chanteperdrix
2006-06-13 13:45             ` Philippe Gerum
2006-06-13 13:33       ` Jan Kiszka
2006-06-13 13:51         ` Philippe Gerum
2006-06-13 16:19       ` Jan Kiszka
2006-06-13 16:29         ` Gilles Chanteperdrix
2006-06-13 17:04         ` Philippe Gerum
2006-06-13 17:13           ` Gilles Chanteperdrix
2006-06-13 17:58             ` Philippe Gerum
2006-06-14  9:25               ` Jim Cromie
2006-06-14 12:29                 ` Philippe Gerum
2006-06-14 13:07                   ` Jan Kiszka
2006-06-14 16:04                     ` Jan Kiszka
2006-07-25 18:26             ` [Xenomai-core] Timer optimisations, continued Jan Kiszka
2006-07-27  8:53               ` Philippe Gerum
2006-07-27 12:42                 ` Gilles Chanteperdrix
2006-07-27 13:19                   ` Philippe Gerum
2006-07-27 13:54                     ` Jan Kiszka
2006-07-27 14:10                       ` Philippe Gerum
2006-06-13 11:59 ` [Xenomai-core] ns vs. tsc as internal timer base Gilles Chanteperdrix
2006-06-13 12:00 ` Anders Blomdell
