* [Xenomai-core] [PATCH-STACK] Updates, timerstats, rtdm-timers
@ 2007-06-04 23:25 Jan Kiszka
  2007-06-05  8:31 ` Jan Kiszka
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Kiszka @ 2007-06-04 23:25 UTC
  To: xenomai-core

An updated and extended version of my patch stack for I-pipe and Xenomai
is now available at

	http://www.rts.uni-hannover.de/rtaddon/patches

Once again it contains some new stuff from my would-be-nice-in-2.4 list,
namely the timerstat /proc output and a preview of my current
rtdm_timer draft. Here is an overview of the content:

/ipipe-kernel
-------------

ipipe-janitorial.patch

    [unchanged]
    Removes useless hunks from the I-pipe patch, specifically over i386.

disable-context-check-v3.patch
disable-context-check-v2-i386.patch

    [Fixed broken !CONFIG_IPIPE_DEBUG_CONTEXT build]
    Infrastructure for temporarily or permanently disabling the context
    checker. Applies it to ipipe_trace_panic_freeze() and NMI.

refactor-ipipe_walk_pipeline.patch

    [broken-out ipipe_processor_id removal]
    Remove cpuid from __ipipe_walk_pipeline parameters, remove fastcall.

cleanup-processor_id-i386.patch

    [broken-out ipipe_processor_id removal]
    Drop legacy code related to i386 ipipe_processor_id.

instrument-smp_processor_id.patch
instrument-smp_processor_id-i386.patch

    [broken-out ipipe_processor_id removal]
    Catch smp_processor_id invocations over non-root domains on archs
    that retrieve the CPU number from the kernel stack. Archs without
    this problem need to define IPIPE_STACK_INVARIANT_CPUID (only
    i386 is supported so far).

hard-irq-disable-on-suspend-resume.patch

    [unchanged]
    Old patch of mine to enable software-suspend over I-pipe.

add-ipipe_preempt_disable.patch

    [unchanged]
    Introduces ipipe_preempt_disable as an I-pipe-safe alternative to
    preempt_disable. Required for kernel markers that come with LTTng.

prepare-lttng.patch
ltt-ipipe-v2.patch

    [updated to use ipipe_processor_id() where now required]
    LTTng preparation and I-pipe adoption patches. See README.lttng for
    more details.


/xenomai
--------

refactor-queue-init.patch

    Refactor DECLARE_XNQUEUE to DEFINE_XNQUEUE. Break out
    XNQUEUE_INITIALIZER.

cleanup-proc-stuff.patch

    Remove xnskentry::proc, track proc registration via xnskentry::name.
    Clean up redundant typecasts. Remove unneeded code in *_seq_next().

destroy-thread-timers.patch

    Unconditionally destroy xnthread timers on thread deletion. Besides
    the consistency aspect, timerstats.patch will require clean
    destruction.

fast-tsc-to-ns-v2.patch

    [Rebased, improved rounding of least significant digit]
    Integration of my scaled-math-based xnarch_tsc_to_ns service for
    i386 at least.

[RFC] timerstat.patch

    Dump currently or previously active timers per timebase under
    /proc/xenomai/timerstat. Output looks like this:

# cat /proc/xenomai/timerstat/master
CPU  SCHEDULED   FIRED       TIMEOUT     INTERVAL    HANDLER      NAME
0    5959        5958        1           4000000     NULL         [host tick]
0    25          24          659464312   1000000000  xnpod_watch  [watchdog]
0    368         367         5042333     10000000    xnthread_pe  sampling-831

    The idea is to have an overview of timer activity *on the target*,
    just like we already have for threads. This can help to quickly
    find out
     - how many timers there are on a system
     - how they are programmed
     - how often they are scheduled -- and actually fired
     - who may have installed them
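
To make the columns above concrete, here is a minimal sketch of a
per-timer record and a row formatter. It is not taken from
timerstat.patch: the struct, its field names, the column widths and
sketch_timerstat_format() are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical per-timer statistics record; not the layout used by the patch. */
struct sketch_timerstat {
	int cpu;
	unsigned long scheduled;	/* how often the timer was programmed */
	unsigned long fired;		/* how often it actually expired */
	unsigned long long timeout;	/* time until the next shot, in ns */
	unsigned long long interval;	/* period in ns, 0 for one-shot timers */
	const char *handler;		/* handler symbol, NULL if unknown */
	const char *name;		/* name of the timer owner */
};

/* Format one output row; the handler name is cut to 11 characters. */
static int sketch_timerstat_format(char *buf, size_t len,
				   const struct sketch_timerstat *ts)
{
	assert(buf != NULL && ts != NULL && ts->name != NULL);
	return snprintf(buf, len,
			"%-3d  %-10lu  %-10lu  %-10llu  %-10llu  %-11.11s  %s",
			ts->cpu, ts->scheduled, ts->fired,
			ts->timeout, ts->interval,
			ts->handler ? ts->handler : "NULL", ts->name);
}
```

Comparing the SCHEDULED and FIRED counters of a row is what reveals
timers that are frequently re-programmed but rarely expire.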

[RFC] xntimer-monotonic.patch

    Use a new flag, XNTIMER_MONOTONIC, to control whether absolute
    timeouts skip the wallclock_offset correction on start, making them
    independent of clock adjustments. So far only used by RTDM timers,
    but the POSIX skin /might/ be able to exploit it as well.

[PREVIEW] rtdm-timers.patch

    Add rtdm_timer_* services, turn timerbench into the first user. This
    patch also introduces monotonic timers to rtdm_task_* and adds the
    new clock service rtdm_clock_read_monotonic. The monotonic clocks
    for drivers are motivated by my concern that, once we start tuning
    the master timebase according to external sources, we /might/ be
    happy to provide non-adjustable clocks and timers for device drivers
    that need strictly continuous timing.

librtutils.patch

    [refreshed]
    Contains rt_print services so far. The naming question is still open.

[RFC] rtsystrace-v2.patch

    [refreshed]
    Proposal to add rt_print-based Xenomai syscall tracing. Looking for
    a less code-invasive approach.

[PREVIEW] lttng.patch

    [rebased]
    Very rough patch to make LTTng work with Xenomai.


As usual: Testers and reviewers are welcome, feedback is appreciated.

Jan



* Re: [Xenomai-core] [PATCH-STACK] Updates, timerstats, rtdm-timers
  2007-06-04 23:25 [Xenomai-core] [PATCH-STACK] Updates, timerstats, rtdm-timers Jan Kiszka
@ 2007-06-05  8:31 ` Jan Kiszka
  2007-06-05 22:28   ` Gilles Chanteperdrix
  2007-06-06 12:49   ` Jan Kiszka
  0 siblings, 2 replies; 20+ messages in thread
From: Jan Kiszka @ 2007-06-05  8:31 UTC
  To: xenomai-core

Jan Kiszka wrote:
...
> fast-tsc-to-ns-v2.patch
> 
>     [Rebased, improved rounding of least significant digit]

Rounding in the fast path for the sake of the last digit was silly.
Instead, I'm now addressing the ugly interval printing via
xnarch_precise_tsc_to_ns when converting the timer interval back into
nanos. -v3 incorporating this has just been uploaded.

Jan



* Re: [Xenomai-core] [PATCH-STACK] Updates, timerstats, rtdm-timers
  2007-06-05  8:31 ` Jan Kiszka
@ 2007-06-05 22:28   ` Gilles Chanteperdrix
  2007-06-06 10:30     ` Jan Kiszka
  2007-06-06 12:49   ` Jan Kiszka
  1 sibling, 1 reply; 20+ messages in thread
From: Gilles Chanteperdrix @ 2007-06-05 22:28 UTC
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
 > Jan Kiszka wrote:
 > ...
 > > fast-tsc-to-ns-v2.patch
 > > 
 > >     [Rebased, improved rounding of least significant digit]
 > 
 > Rounding in the fast path for the sake of the last digit was silly.
 > Instead, I'm now addressing the ugly interval printing via
 > xnarch_precise_tsc_to_ns when converting the timer interval back into
 > nanos. -v3 incorporating this has just been uploaded.

Hi,

I had a look at the fast-tsc-to-ns implementation, here is how I would
rewrite it:

static inline void xnarch_init_llmulshft(const unsigned m_in,
					 const unsigned d_in,
					 unsigned *m_out,
					 unsigned *s_out)
{
	unsigned long long mult;

	*s_out = 31;
	while (1) {
		mult = ((unsigned long long)m_in) << *s_out;
		do_div(mult, d_in);
		if (mult <= INT_MAX)
			break;
		(*s_out)--;
	}
	*m_out = (unsigned)mult;
}

/* Non x86. Requires 0 < s < 32: shifting a 32-bit value by 32 is undefined in C. */
#define __rthal_u96shift(h, m, l, s) ({		\
	unsigned _l = (l);			\
	unsigned _m = (m);			\
	unsigned _s = (s);			\
	_l >>= _s;				\
	_l |= (_m << (32 - _s));		\
	_m >>= _s;				\
	_m |= ((h) << (32 - _s));		\
	__rthal_u64fromu32(_m, _l);		\
})

/* x86 */
#define __rthal_u96shift(h, m, l, s) ({		\
	unsigned _l = (l);			\
	unsigned _m = (m);			\
	unsigned _s = (s);			\
	asm ("shrdl\t%%cl,%1,%0"		\
	     : "+r,?m"(_l)			\
	     : "r,r"(_m), "c,c"(_s));		\
	asm ("shrdl\t%%cl,%1,%0"		\
	     : "+r,?m"(_m)			\
	     : "r,r"(h), "c,c"(_s));		\
	__rthal_u64fromu32(_m, _l);		\
})

static inline long long rthal_llmi(int i, int j)
{
        /* Signed fast 32x32->64 multiplication */
	return (long long) i * j;
}

static inline long long gilles_llmulshft(const long long op,
					 const unsigned m,
					 const unsigned s)
{
	unsigned oph, opl, tlh, tll, thh, thl;
	unsigned long long th, tl;

	__rthal_u64tou32(op, oph, opl);
	tl = rthal_ullmul(opl, m);
	__rthal_u64tou32(tl, tlh, tll);
	th = rthal_llmi(oph, m);
	th += tlh;
	__rthal_u64tou32(th, thh, thl);
	
	return __rthal_u96shift(thh, thl, tll, s);
}



-- 


					    Gilles Chanteperdrix.



* Re: [Xenomai-core] [PATCH-STACK] Updates, timerstats, rtdm-timers
  2007-06-05 22:28   ` Gilles Chanteperdrix
@ 2007-06-06 10:30     ` Jan Kiszka
  2007-06-06 12:47       ` Gilles Chanteperdrix
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Kiszka @ 2007-06-06 10:30 UTC
  To: Gilles Chanteperdrix; +Cc: xenomai-core

Gilles Chanteperdrix wrote:
> [...]

Thanks for your suggestion.

While your generic version produces comparable code, the x86 variant is
about twice as large as the full-assembly version. And code size
translates into I-cache occupation, which may have latency costs.

[gcc 4.1, i386]
-O2 -mregparm=3 -fomit-frame-pointer:
    63: 08048490   119 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
    68: 08048510   121 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
    77: 08048450    57 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
    78: 080483c0   135 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft

-Os -mregparm=3 -fomit-frame-pointer:
    63: 0804843b    93 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
    68: 08048498    97 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
    77: 08048410    43 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
    78: 080483b4    92 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft

-O2:
    63: 08048480   120 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
    68: 08048500   105 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
    77: 08048440    60 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
    78: 080483c0   117 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft

-Os:
    63: 08048438   104 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
    68: 080484a0    83 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
    77: 0804840b    45 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
    78: 080483b4    87 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft

I'm not arguing that we should turn all Xenomai arch code into pure
assembly. But in this case it has already happened, the source is less
scattered, and the object code is more compact. So I would prefer to
keep it as is.

Jan



* Re: [Xenomai-core] [PATCH-STACK] Updates, timerstats, rtdm-timers
  2007-06-06 10:30     ` Jan Kiszka
@ 2007-06-06 12:47       ` Gilles Chanteperdrix
  2007-06-06 12:59         ` Jan Kiszka
  0 siblings, 1 reply; 20+ messages in thread
From: Gilles Chanteperdrix @ 2007-06-06 12:47 UTC
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> [...]
>
> I'm not arguing we should turn each and every Xenomai arch code into
> pure assembly. But in this case it already happened, it's less scattered
> source code-wise, and it is compacter object-wise. So I would prefer to
> keep it as is.

I would say the advantages of a C version outweigh those of the
full-assembly version. C is really easier to understand and debug.

The differences between the two versions are some register moves, which
cost almost nothing. Each operation in the assembly version depends on
the result of the previous one, which means lots of pipeline stalls, so
the register moves will just feed the pipeline. I do not think they
really matter. Look at the assembly produced for gilles_llmulshft on
ARM, a low-end architecture where each instruction really costs:
gilles_llmulshft:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        stmfd   sp!, {r4, r5, r6, r7}
        umull   r6, r7, r0, r2
        mov     r4, r7
        mov     r5, #0
        smlal   r4, r5, r2, r1
        rsb     ip, r3, #32
        mov     r2, r4, lsr r3
        orr     r1, r2, r5, asl ip
        mov     r2, r2, asl ip
        orr     r0, r2, r6, lsr r3
        @ lr needed for prologue
        ldmfd   sp!, {r4, r5, r6, r7}
        mov     pc, lr

Pretty minimal, no?

The full-assembly version has another big drawback: it is one big block
that the optimizer cannot split, whereas with a C version the optimizer
can decide to interleave the surrounding code. So a C version will
inline better.

There is one thing I do not like about llmulshft (in any
implementation): it rounds towards minus infinity. llmulshft(-1, 2/3)
returns -1 whereas llimd would return 0.
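
The difference between the two rounding policies can be reproduced with
a small portable sketch. The sketch_* functions below are stand-ins
written for this illustration, not the rthal services: the first models
what an arithmetic right shift of the product yields, the second models
a multiply-divide with C's truncating division. Operands are limited to
32 bits to keep the intermediate product within 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* floor(op * mul / 2^shift): the result of an arithmetic right shift. */
static int64_t sketch_mulshft_floor(int32_t op, uint32_t mul, unsigned int shift)
{
	int64_t prod, div, q;

	assert(shift < 32 && mul <= INT32_MAX);
	prod = (int64_t)op * (int64_t)mul;
	div = (int64_t)1 << shift;
	q = prod / div;			/* C division truncates toward zero */
	if (prod % div < 0)
		q--;			/* step down to the floor for negative products */
	return q;
}

/* op * m / d, truncated toward zero as C division does. */
static int64_t sketch_muldiv_trunc(int32_t op, int32_t m, int32_t d)
{
	assert(d != 0);
	return (int64_t)op * m / d;
}
```

With mul = 1431655765 and shift = 31 as the scaled form of 2/3, the
shift-based variant maps -1 to -1 while the dividing variant maps it to
0. The same pair also maps 3 to 1 instead of 2, because the scaled
multiplier is itself rounded down.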

-- 
                                                 Gilles Chanteperdrix



* Re: [Xenomai-core] [PATCH-STACK] Updates, timerstats, rtdm-timers
  2007-06-05  8:31 ` Jan Kiszka
  2007-06-05 22:28   ` Gilles Chanteperdrix
@ 2007-06-06 12:49   ` Jan Kiszka
  2007-06-06 13:29     ` Gilles Chanteperdrix
  1 sibling, 1 reply; 20+ messages in thread
From: Jan Kiszka @ 2007-06-06 12:49 UTC
  To: xenomai-core

Jan Kiszka wrote:
> Jan Kiszka wrote:
> ...
>> fast-tsc-to-ns-v2.patch
>>
>>     [Rebased, improved rounding of least significant digit]
> 
> Rounding in the fast path for the sake of the last digit was silly.
> Instead, I'm now addressing the ugly interval printing via
> xnarch_precise_tsc_to_ns when converting the timer interval back into
> nanos. -v3 incorporating this has just been uploaded.
> 

Yesterday I noticed that even unpatched Xenomai sometimes converts
inaccurately when showing small timer intervals under /proc, and I just
had an idea how to address this beautification issue even better: -v4
now rounds up in the slow, precise tsc-to-ns path, see

http://www.rts.uni-hannover.de/rtaddon/patches/xenomai/fast-tsc-to-ns-v4.patch
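
Why rounding in the precise path cleans up the printed intervals can be
shown with a small sketch. It is not the code from the -v4 patch: the
sketch_* helpers are hypothetical, "rounds up" is assumed here to mean
rounding to the nearest nanosecond, and the plain 64-bit arithmetic only
works while ticks * 10^9 does not overflow.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NS_PER_SEC 1000000000ULL

/* Precise conversion with truncation: always rounds down. */
static uint64_t sketch_ticks_to_ns_trunc(uint64_t ticks, uint64_t freq_hz)
{
	assert(freq_hz != 0 && ticks <= UINT64_MAX / SKETCH_NS_PER_SEC);
	return ticks * SKETCH_NS_PER_SEC / freq_hz;
}

/* Precise conversion rounding to the nearest nanosecond. */
static uint64_t sketch_ticks_to_ns_round(uint64_t ticks, uint64_t freq_hz)
{
	assert(freq_hz != 0);
	assert(ticks <= (UINT64_MAX - freq_hz / 2) / SKETCH_NS_PER_SEC);
	return (ticks * SKETCH_NS_PER_SEC + freq_hz / 2) / freq_hz;
}
```

Taking an assumed TSC frequency of 2400000001 Hz, a 10 ms interval is
stored as 24000000 ticks. Converting it back yields 9999999 ns with
truncation but 10000000 ns with rounding, which is the value the user
originally programmed.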

Jan



* Re: [Xenomai-core] [PATCH-STACK] Updates, timerstats, rtdm-timers
  2007-06-06 12:47       ` Gilles Chanteperdrix
@ 2007-06-06 12:59         ` Jan Kiszka
  2007-06-06 13:21           ` Gilles Chanteperdrix
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Kiszka @ 2007-06-06 12:59 UTC
  To: Gilles Chanteperdrix; +Cc: xenomai-core

Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> [...]
>>
>> I'm not arguing we should turn each and every Xenomai arch code into
>> pure assembly. But in this case it already happened, it's less scattered
>> source code-wise, and it is compacter object-wise. So I would prefer to
>> keep it as is.
> 
> I would say the advantage of having a C version outperform the
> advantages of the full assembly version. C is really easier to
> understand and debug.

Personally, I prefer the clear (and commented) assembly over the nested
macros and inlines.

> 
> The differences between the two versions are some register moves, which
> cost almost nothing, especially since each operation in the assembly

Cycle-wise, you are right. But what bites us more in the worst case are
memory accesses, specifically uncached ones. In my experience, code
size matters more.

> version depends on the result of the previous operation, which means
> lots of pipeline stall, the register moves will just feed the pipeline.
> I do not think they really matter. Look at the assembly produced for
> gilles_llmulshft on ARM, a low end architecture where each instruction
> really costs:
> gilles_llmulshft:
>         @ args = 0, pretend = 0, frame = 0
>         @ frame_needed = 0, uses_anonymous_args = 0
>         @ link register save eliminated.
>         stmfd   sp!, {r4, r5, r6, r7}
>         umull   r6, r7, r0, r2
>         mov     r4, r7
>         mov     r5, #0
>         smlal   r4, r5, r2, r1
>         rsb     ip, r3, #32
>         mov     r2, r4, lsr r3
>         orr     r1, r2, r5, asl ip
>         mov     r2, r2, asl ip
>         orr     r0, r2, r6, lsr r3
>         @ lr needed for prologue
>         ldmfd   sp!, {r4, r5, r6, r7}
>         mov     pc, lr
> 
> pretty minimal, no ?

OK, your version is a perfect fit for the ARM arch. But i386 is
different: fewer registers, thus easily a lot of variable shuffling...

> 
> The full assembly version has another big drawback, it is a big block
> that the optimizer can not split, whereas in a C version, the optimizer
> can decide to interleave the surrounding code. So a C version will
> inline better.

We are not inlining that service anymore, at least not for its primary
use, tsc-to-ns. Inlining costs object size and thus increases latency
(although it saves us a few cycles).

> 
> There is one thing I do not like with llmulshft (any implementation), it
> is the rounding policy towards minus infinity. llmulshft(-1, 2/3)
> returns -1 whereas llimd would return 0.

See other postings: rounding of the last digit doesn't matter with
scaled math, it's already inaccurate by nature. That's also why we have
it only one-way.

Jan



* Re: [Xenomai-core] [PATCH-STACK] Updates, timerstats, rtdm-timers
  2007-06-06 12:59         ` Jan Kiszka
@ 2007-06-06 13:21           ` Gilles Chanteperdrix
  2007-06-06 13:31             ` Jan Kiszka
  0 siblings, 1 reply; 20+ messages in thread
From: Gilles Chanteperdrix @ 2007-06-06 13:21 UTC
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Gilles Chanteperdrix wrote:
> 
>>> [...]
>>
>>I would say the advantage of having a C version outperform the
>>advantages of the full assembly version. C is really easier to
>>understand and debug.
> 
> 
> Personally, I prefer the clear (and commented) assembly over the nested
> macros and inlines.

Not when the macro and inline bear names that are easy to understand. If
you do not find the names easy to understand, then change them (I do not
like rthal_llmul either, but I could not find a name). To make the
assembly fully understandable, you would need to comment every
statement. And now, run the assembly code in gdb and try to print the
value of a 64-bit intermediate result: you can't.

> 
> 
>>The differences between the two versions are some register moves, which
>>cost almost nothing, especially since each operation in the assembly
> 
> 
> Cycle-wise, you are right. But what bites us more in the worst case are
> memory accesses, specifically when they are not cached. Code size
> matters more according to my experience.
> 
> 
>>version depends on the result of the previous operation, which means
>>lots of pipeline stall, the register moves will just feed the pipeline.
>>I do not think they really matter. Look at the assembly produced for
>>gilles_llmulshft on ARM, a low end architecture where each instruction
>>really costs:
>>gilles_llmulshft:
>>        @ args = 0, pretend = 0, frame = 0
>>        @ frame_needed = 0, uses_anonymous_args = 0
>>        @ link register save eliminated.
>>        stmfd   sp!, {r4, r5, r6, r7}
>>        umull   r6, r7, r0, r2
>>        mov     r4, r7
>>        mov     r5, #0
>>        smlal   r4, r5, r2, r1
>>        rsb     ip, r3, #32
>>        mov     r2, r4, lsr r3
>>        orr     r1, r2, r5, asl ip
>>        mov     r2, r2, asl ip
>>        orr     r0, r2, r6, lsr r3
>>        @ lr needed for prologue
>>        ldmfd   sp!, {r4, r5, r6, r7}
>>        mov     pc, lr
>>
>>pretty minimal, no ?
> 
> 
> OK, your version can perfectly go into the ARM arch. But i386 is
> different: less registers, thus easily a lot of variable shuffling...

variable shuffling which does not really matter, that is my point,
otherwise the x86 family would not be as fast as it is.

> 
> 
>>The full assembly version has another big drawback, it is a big block
>>that the optimizer can not split, whereas in a C version, the optimizer
>>can decide to interleave the surrounding code. So a C version will
>>inline better.
> 
> 
> We are not inlining that service anymore, at least not for its primary
> usage tsc-to-ns. Inlining costs object size, thus increases the latency
> (although it saves us a few cycles).

it *is* inlined, in tsc_to/from_ns. Another question that I forgot in my
previous mails: why not use llmulshft for both services?

> 
> 
>>There is one thing I do not like with llmulshft (any implementation), it
>>is the rounding policy towards minus infinity. llmulshft(-1, 2/3)
>>returns -1 whereas llimd would return 0.
> 
> 
> See other postings: rounding of the last digit doesn't matter with
> scaled math, it's already inaccurate by nature. That's also why we have
> it only one-way.

When returning -1 instead of 0, it is not the last digit that is wrong,
but the first (and only) one.
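
To make the difference concrete, here is a minimal sketch. The helper
names are made up for illustration and are not the Xenomai routines, and
the code assumes that >> on a negative signed value is an arithmetic
shift (implementation-defined in C, but what gcc does). The ratio 2/3 is
approximated as 1431655765 / 2^31.

```c
#include <stdint.h>

/* Hypothetical helper: scale op by m / 2^s with a right shift, the way a
 * scaled-math routine does. An arithmetic shift rounds towards minus
 * infinity. Only meant for small operands, op * m must fit in 64 bits. */
static int64_t scale_by_shift(int64_t op, uint32_t m, unsigned s)
{
	return (op * (int64_t)m) >> s;
}

/* Hypothetical helper: scale op by m / d with C integer division, which
 * truncates towards zero. */
static int64_t scale_by_div(int64_t op, uint32_t m, uint32_t d)
{
	return (op * (int64_t)m) / (int64_t)d;
}
```

With op = -1, the shift-based helper yields -1 and the division-based one
yields 0; for op = 1, both yield 0.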

-- 
                                                 Gilles Chanteperdrix



* Re: [Xenomai-core] [PATCH-STACK] Updates, timerstats, rtdm-timers
  2007-06-06 12:49   ` Jan Kiszka
@ 2007-06-06 13:29     ` Gilles Chanteperdrix
  2007-06-06 13:36       ` Jan Kiszka
  0 siblings, 1 reply; 20+ messages in thread
From: Gilles Chanteperdrix @ 2007-06-06 13:29 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Jan Kiszka wrote:
> 
>>Jan Kiszka wrote:
>>...
>>
>>>fast-tsc-to-ns-v2.patch
>>>
>>>    [Rebased, improved rounding of least significant digit]
>>
>>Rounding in the fast path for the sake of the last digit was silly.
>>Instead, I'm now addressing the ugly interval printing via
>>xnarch_precise_tsc_to_ns when converting the timer interval back into
>>nanos. -v3 incorporating this has just been uploaded.
>>
> 
> 
> After noticing yesterday that even unpatched Xenomai sometimes converts
> inaccurately when showing small timer intervals under /proc, I just got
> an idea how to address this beautification issue even better: -v4 now
> rounds up in the slow, precise tsc-to-ns path, see
> 
> http://www.rts.uni-hannover.de/rtaddon/patches/xenomai/fast-tsc-to-ns-v4.patch

I am the one who decided on the rounding behaviour of llimd. The RTAI
version had the same rounding policy as the one you propose, and I made
a different choice for the following reasons:
- rounding towards 0 is the policy used by the C language, so doing this
for llimd makes it consistent with what one expects from C code;
- values computed by llimd are used to program timers, and we prefer a
timer to be programmed for too short a value rather than too long a one.

-- 
                                                 Gilles Chanteperdrix



* Re: [Xenomai-core] [PATCH-STACK] Updates, timerstats, rtdm-timers
  2007-06-06 13:21           ` Gilles Chanteperdrix
@ 2007-06-06 13:31             ` Jan Kiszka
  2007-06-06 18:23               ` Gilles Chanteperdrix
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Kiszka @ 2007-06-06 13:31 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai-core

Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Gilles Chanteperdrix wrote:
>>
>>> Jan Kiszka wrote:
>>>
>>>> Gilles Chanteperdrix wrote:
>>>>
>>>>
>>>>> Jan Kiszka wrote:
>>>>>
>>>>>> Jan Kiszka wrote:
>>>>>> ...
>>>>>>
>>>>>>> fast-tsc-to-ns-v2.patch
>>>>>>>
>>>>>>>    [Rebased, improved rounding of least significant digit]
>>>>>> Rounding in the fast path for the sake of the last digit was silly.
>>>>>> Instead, I'm now addressing the ugly interval printing via
>>>>>> xnarch_precise_tsc_to_ns when converting the timer interval back into
>>>>>> nanos. -v3 incorporating this has just been uploaded.
>>>>> Hi,
>>>>>
>>>>> I had a look at the fast-tsc-to-ns implementation, here is how I would
>>>>> rewrite it:
>>>>>
>>>>> static inline void xnarch_init_llmulshft(const unsigned m_in,
>>>>> 					 const unsigned d_in,
>>>>> 					 unsigned *m_out,
>>>>> 					 unsigned *s_out)
>>>>> {
>>>>> 	unsigned long long mult;
>>>>>
>>>>> 	*s_out = 31;
>>>>> 	while (1) {
>>>>> 		mult = ((unsigned long long)m_in) << *s_out;
>>>>> 		do_div(mult, d_in);
>>>>> 		if (mult <= INT_MAX)
>>>>> 			break;
>>>>> 		(*s_out)--;
>>>>> 	}
>>>>> 	*m_out = (unsigned)mult;
>>>>> }
>>>>>
>>>>> /* Non x86. */
>>>>> #define __rthal_u96shift(h, m, l, s) ({		\
>>>>> 	unsigned _l = (l);			\
>>>>> 	unsigned _m = (m);			\
>>>>> 	unsigned _s = (s);			\
>>>>> 	_l >>= _s;				\
>>>>> 	_m >>= s;				\
>>>>> 	_l |= (_m << (32 - s));			\
>>>>> 	_m |= ((h) << (32 - s));		\
>>>>>       __rthal_u64fromu32(_m, _l);		\
>>>>> })
>>>>>
>>>>> /* x86 */
>>>>> #define __rthal_u96shift(h, m, l, s) ({		\
>>>>> 	unsigned _l = (l);			\
>>>>> 	unsigned _m = (m);			\
>>>>> 	unsigned _s = (s);			\
>>>>> 	asm ("shrdl\t%%cl,%1,%0"		\
>>>>> 	     : "+r,?m"(_l)			\
>>>>> 	     : "r,r"(_m), "c,c"(_s));		\
>>>>> 	asm ("shrdl\t%%cl,%1,%0"		\
>>>>> 	     : "+r,?m"(_m)			\
>>>>> 	     : "r,r"(h), "c,c"(_s));		\
>>>>> 	__rthal_u64fromu32(_m, _l);		\
>>>>> })
>>>>>
>>>>> static inline long long rthal_llmi(int i, int j)
>>>>> {
>>>>>       /* Signed fast 32x32->64 multiplication */
>>>>> 	return (long long) i * j;
>>>>> }
>>>>>
>>>>> static inline long long gilles_llmulshft(const long long op,
>>>>> 					 const unsigned m,
>>>>> 					 const unsigned s)
>>>>> {
>>>>> 	unsigned oph, opl, tlh, tll, thh, thl;
>>>>> 	unsigned long long th, tl;
>>>>>
>>>>> 	__rthal_u64tou32(op, oph, opl);
>>>>> 	tl = rthal_ullmul(opl, m);
>>>>> 	__rthal_u64tou32(tl, tlh, tll);
>>>>> 	th = rthal_llmi(oph, m);
>>>>> 	th += tlh;
>>>>> 	__rthal_u64tou32(th, thh, thl);
>>>>> 	
>>>>> 	return __rthal_u96shift(thh, thl, tll, s);
>>>>> }
>>>>>
>>>>>
>>>> Thanks for your suggestion.
>>>>
>>>> While your generic version produces comparable code, the x86 variant is
>>>> about twice as large as the full-assembly version. And code size
>>>> translates into I-cache occupation, which may have latency costs.
>>>>
>>>> [gcc 4.1, i386]
>>>> -O2 -mregparm=3 -fomit-frame-pointer:
>>>>    63: 08048490   119 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
>>>>    68: 08048510   121 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
>>>>    77: 08048450    57 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
>>>>    78: 080483c0   135 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft
>>>>
>>>> -Os -mregparm=3 -fomit-frame-pointer:
>>>>    63: 0804843b    93 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
>>>>    68: 08048498    97 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
>>>>    77: 08048410    43 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
>>>>    78: 080483b4    92 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft
>>>>
>>>> -O2:
>>>>    63: 08048480   120 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
>>>>    68: 08048500   105 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
>>>>    77: 08048440    60 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
>>>>    78: 080483c0   117 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft
>>>>
>>>> -Os:
>>>>    63: 08048438   104 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
>>>>    68: 080484a0    83 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
>>>>    77: 0804840b    45 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
>>>>    78: 080483b4    87 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft
>>>>
>>>> I'm not arguing we should turn each and every Xenomai arch code into
>>>> pure assembly. But in this case it already happened, it's less scattered
>>>> source code-wise, and it is compacter object-wise. So I would prefer to
>>>> keep it as is.
>>> I would say the advantage of having a C version outperform the
>>> advantages of the full assembly version. C is really easier to
>>> understand and debug.
>>
>> Personally, I prefer the clear (and commented) assembly over the nested
>> macros and inlines.
> 
> Not when the macro and inline bear names that are easy to understand. If
> you do not find the names easy to understand, then change them (I do not
> like rthal_llmul either, but I could not find a name). To make the
> assembly fully understandable, you would need to comment every
> statement. And now, run the assembly code in gdb, and try and print the
> value of a 64 bits intermediate result: you can't.

No question, this is a matter of taste.

> 
>>
>>> The differences between the two versions are some register moves, which
>>> cost almost nothing, especially since each operation in the assembly
>>
>> Cycle-wise, you are right. But what bites us more in the worst case are
>> memory accesses, specifically when they are not cached. Code size
>> matters more according to my experience.
>>
>>
>>> version depends on the result of the previous operation, which means
>>> lots of pipeline stall, the register moves will just feed the pipeline.
>>> I do not think they really matter. Look at the assembly produced for
>>> gilles_llmulshft on ARM, a low end architecture where each instruction
>>> really costs:
>>> gilles_llmulshft:
>>>        @ args = 0, pretend = 0, frame = 0
>>>        @ frame_needed = 0, uses_anonymous_args = 0
>>>        @ link register save eliminated.
>>>        stmfd   sp!, {r4, r5, r6, r7}
>>>        umull   r6, r7, r0, r2
>>>        mov     r4, r7
>>>        mov     r5, #0
>>>        smlal   r4, r5, r2, r1
>>>        rsb     ip, r3, #32
>>>        mov     r2, r4, lsr r3
>>>        orr     r1, r2, r5, asl ip
>>>        mov     r2, r2, asl ip
>>>        orr     r0, r2, r6, lsr r3
>>>        @ lr needed for prologue
>>>        ldmfd   sp!, {r4, r5, r6, r7}
>>>        mov     pc, lr
>>>
>>> pretty minimal, no ?
>>
>> OK, your version can perfectly go into the ARM arch. But i386 is
>> different: less registers, thus easily a lot of variable shuffling...
> 
> variable shuffling which does not really matter, that is my point,
> otherwise the x86 family would not be as fast as it is.

Think of the *code size*...

> 
>>
>>> The full assembly version has another big drawback, it is a big block
>>> that the optimizer can not split, whereas in a C version, the optimizer
>>> can decide to interleave the surrounding code. So a C version will
>>> inline better.
>>
>> We are not inlining that service anymore, at least not for its primary
>> usage tsc-to-ns. Inlining costs object size, thus increases the latency
>> (although it saves us a few cycles).
> 
> it *is* inlined, in tsc_to/from_ns. Another question that I forgot in my

xnarch_tsc_to_ns uninlines this service, and I don't see other, larger
users so far.

> previous mails: why not using llmulshft for the two services ?

See below, see my original post on all the conversion approaches: scaled
math is inaccurate, doing it both ways may cause noticeable errors when
dealing with calculated vs. measured time stamps over, granted, fairly
long periods.
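
As a rough illustration of that drift, here is a sketch that compares a
scaled tsc-to-ns conversion against exact division. Everything in it is an
assumption made for the example: the helper names, the 2.4 GHz clock used
in the note below, and the use of gcc's unsigned __int128 for the wide
intermediate. The setup loop mirrors the one quoted earlier in this
thread, with plain 64-bit division standing in for do_div().

```c
#include <stdint.h>
#include <limits.h>

/* Derive mult and shift so that x * m_in / d_in ~= (x * mult) >> shift.
 * Assumes m_in / d_in is below 2^31, otherwise the loop cannot end. */
static void init_mulshft(uint32_t m_in, uint32_t d_in,
			 uint32_t *mult, unsigned *shift)
{
	uint64_t m;
	unsigned s = 31;

	for (;;) {
		m = ((uint64_t)m_in << s) / d_in;
		if (m <= INT_MAX)
			break;
		s--;
	}
	*mult = (uint32_t)m;
	*shift = s;
}

/* Nanoseconds by which the scaled conversion falls short of the exact
 * one for a given tick count (hypothetical helper). */
static uint64_t conversion_error_ns(uint64_t tsc, uint32_t freq_hz)
{
	uint32_t mult;
	unsigned shift;
	uint64_t exact, scaled;

	init_mulshft(1000000000u, freq_hz, &mult, &shift);
	exact = (uint64_t)(((unsigned __int128)tsc * 1000000000u) / freq_hz);
	scaled = (uint64_t)(((unsigned __int128)tsc * mult) >> shift);
	return exact - scaled;
}
```

For this made-up 2.4 GHz clock, the shortfall is about a nanosecond after
one second of ticks but grows to tens of microseconds after one day, and
that is for the one-way conversion only.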

> 
>>
>>> There is one thing I do not like with llmulshft (any implementation), it
>>> is the rounding policy towards minus infinity. llmulshft(-1, 2/3)
>>> returns -1 whereas llimd would return 0.
>>
>> See other postings: rounding of the last digit doesn't matter with
>> scaled math, it's already inaccurate by nature. That's also why we have
>> it only one-way.
> 
> When returning -1 instead of 0, it is not the last digit that is wrong,
> but the first (and only) one.

So this is about -1 nanoseconds vs. 0 nanoseconds. Well, does this error
matter in real life? :->




* Re: [Xenomai-core] [PATCH-STACK] Updates, timerstats, rtdm-timers
  2007-06-06 13:29     ` Gilles Chanteperdrix
@ 2007-06-06 13:36       ` Jan Kiszka
  2007-06-06 15:08         ` Jan Kiszka
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Kiszka @ 2007-06-06 13:36 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai-core

Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Jan Kiszka wrote:
>>
>>> Jan Kiszka wrote:
>>> ...
>>>
>>>> fast-tsc-to-ns-v2.patch
>>>>
>>>>    [Rebased, improved rounding of least significant digit]
>>> Rounding in the fast path for the sake of the last digit was silly.
>>> Instead, I'm now addressing the ugly interval printing via
>>> xnarch_precise_tsc_to_ns when converting the timer interval back into
>>> nanos. -v3 incorporating this has just been uploaded.
>>>
>>
>> After noticing yesterday that even unpatched Xenomai sometimes converts
>> inaccurately when showing small timer intervals under /proc, I just got
>> an idea how to address this beautification issue even better: -v4 now
>> rounds up in the slow, precise tsc-to-ns path, see
>>
>> http://www.rts.uni-hannover.de/rtaddon/patches/xenomai/fast-tsc-to-ns-v4.patch
> 
> I am the one who decided of the rounding behaviour of llimd, RTAI
> version had the same rounding policy as the one you propose, and I made
> it for the following reasons:
> - rouding towards 0 is the policy used by the C language, so doing this
> for llimd made it consistent with what one expects from C code;
> - values computed by llimd are used to program timers, and we prefer the
> timer to be programmed for a too short value than for a too long value.

That's OK, I agree. In my patch for i386, this rounding is only relevant
for display purposes. It just helps me find the expected period T of my
task in /proc instead of, sometimes, T-1. Beautification. All we
need for other archs is xnarch_tsc_to_ns according to the old scheme.
Will rework this.

Jan



* Re: [Xenomai-core] [PATCH-STACK] Updates, timerstats, rtdm-timers
  2007-06-06 13:36       ` Jan Kiszka
@ 2007-06-06 15:08         ` Jan Kiszka
  0 siblings, 0 replies; 20+ messages in thread
From: Jan Kiszka @ 2007-06-06 15:08 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai-core

Jan Kiszka wrote:
> Gilles Chanteperdrix wrote:
>> Jan Kiszka wrote:
>>> Jan Kiszka wrote:
>>>
>>>> Jan Kiszka wrote:
>>>> ...
>>>>
>>>>> fast-tsc-to-ns-v2.patch
>>>>>
>>>>>    [Rebased, improved rounding of least significant digit]
>>>> Rounding in the fast path for the sake of the last digit was silly.
>>>> Instead, I'm now addressing the ugly interval printing via
>>>> xnarch_precise_tsc_to_ns when converting the timer interval back into
>>>> nanos. -v3 incorporating this has just been uploaded.
>>>>
>>> After noticing yesterday that even unpatched Xenomai sometimes converts
>>> inaccurately when showing small timer intervals under /proc, I just got
>>> an idea how to address this beautification issue even better: -v4 now
>>> rounds up in the slow, precise tsc-to-ns path, see
>>>
>>> http://www.rts.uni-hannover.de/rtaddon/patches/xenomai/fast-tsc-to-ns-v4.patch
>> I am the one who decided of the rounding behaviour of llimd, RTAI
>> version had the same rounding policy as the one you propose, and I made
>> it for the following reasons:
>> - rouding towards 0 is the policy used by the C language, so doing this
>> for llimd made it consistent with what one expects from C code;
>> - values computed by llimd are used to program timers, and we prefer the
>> timer to be programmed for a too short value than for a too long value.
> 
> That's OK, I agree. In my patch for i386, this rounding is only relevant
> for display purposes. It's just to help me finding the expected period T
> of my task in /proc instead of T-1 sometimes. Beautification. All we
> need for other archs is xnarch_tsc_to_ns according to the old scheme.
> Will rework this.

Done, -v5 is online.

This does not yet incorporate any change to the generic llmulshft. I
have neither thought about nor tried our code on a 64-bit arch yet. Did
you check this already? I wonder, e.g., if it makes sense to exploit 128 bit
with 64 bit shifts there or stick with 94/32 bit accuracy and related
conversion errors. Depending on this, the generic version might have to
be reconsidered.

Jan



* Re: [Xenomai-core] [PATCH-STACK] Updates, timerstats, rtdm-timers
  2007-06-06 13:31             ` Jan Kiszka
@ 2007-06-06 18:23               ` Gilles Chanteperdrix
  2007-06-06 18:46                 ` Jan Kiszka
  0 siblings, 1 reply; 20+ messages in thread
From: Gilles Chanteperdrix @ 2007-06-06 18:23 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
 > Gilles Chanteperdrix wrote:
 > > Not when the macro and inline bear names that are easy to understand. If
 > > you do not find the names easy to understand, then change them (I do not
 > > like rthal_llmul either, but I could not find a name). To make the
 > > assembly fully understandable, you would need to comment every
 > > statement. And now, run the assembly code in gdb, and try and print the
 > > value of a 64 bits intermediate result: you can't.
 > 
 > No question, this is a matter of taste.

No, really, being able to debug the code inside gdb appears to me as
something more than a "matter of taste". I thought that, as the person
who made Xenomai run with kgdb, you would have agreed with me.

Now, about the way I wrote the arithmetic code: there are reasons behind
my choices. There are some repetitive patterns in this arithmetic code and
I wanted to factor them out. The first pattern is the conversion between
32 bits and 64 bits. We have to do this in a way that is understood by
the compiler on a particular platform, hence the definition of
rthal_u64from/tou32, which is different for each platform. gcc for x86
understands shifts and masks (or casts), but gcc for PowerPC or ARM
prefers the union trick.
There is also only one way to cause gcc to use the fast 32x32->64
multiplication, and it is exactly what rthal_ullmul does. If you want to
do the same thing, but write it differently, you invariably cause gcc to
use a full 64-bit multiplication.

So, when in rthal_generic_llmulshft, I read:

    long long hi = (ll >> BITS_PER_LONG) * m;
    unsigned long long lo = ((long)ll) * m;

I think this is all wrong:
- on a 64 bits machine, BITS_PER_LONG is 64 and ll is 64 bits, so ll >>
  BITS_PER_LONG is 0

- for the first multiplication, the compiler will not detect the
  "fastmult" condition, and will use a full 64 bits multiplication. In
  order to get it to generate the minimal multiplication, you should
  have used:

  long long hi = (long long)(int)(ll >> 32) * (int) m;

 I find:
static inline long long rthal_llmi(const int i, const int j)
{
        /* Fast 32x32->64 multiplication */
	return (long long) i * j;
}

/* (...) */
	__rthal_u64tou32(op, oph, opl);
	hi = rthal_llmi(oph, m);

easier to write, maintain, and understand once you know what
rthal_llmi does, and it generates better code with regard to the
32-bit/64-bit conversion.

- for the second multiplication, since the two arguments are 32 bits, the
  compiler will use a 32 bits multiplication, and since you (wrongly)
  cast the first argument to long, it will use a signed multiplication,
  whereas we would want it to use an unsigned multiplication, as the
  assembly routine correctly does.

Here again, using:

      lo = rthal_ullmul(opl, m);
would have been less error-prone.
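
The two pitfalls can be reproduced outside the kernel with fixed-width
types. This is only a sketch: the function names are hypothetical, the
operands are fixed at 32 bits to mimic a 32-bit "long", and the unsigned
to signed cast relies on gcc's wrap-around behaviour.

```c
#include <stdint.h>

/* Pitfall 1: both operands are 32 bits, so the multiplication itself is
 * done in 32 bits and the upper half of the product is lost before the
 * result is widened. */
static uint64_t lo_mul_narrow(uint32_t lo, uint32_t m)
{
	return lo * m;
}

/* Pitfall 2: the low word is reinterpreted as signed, so a low word with
 * bit 31 set turns into a negative factor. */
static int64_t lo_mul_signed(uint32_t lo, uint32_t m)
{
	return (int64_t)(int32_t)lo * (int32_t)m;
}

/* Intended form: widen one unsigned operand first, which is the pattern
 * gcc recognises as an unsigned 32x32->64 multiplication. */
static uint64_t lo_mul_wide(uint32_t lo, uint32_t m)
{
	return (uint64_t)lo * m;
}
```

With lo = 0x80000000 and m = 2, the narrow form returns 0, the signed form
returns -4294967296, and only the wide form returns 0x100000000.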
   
So, OK, I will try to do something for x86 (either reduce the number of
registers used by the C code, or reduce the assembly to the bare
minimum). But, please, pick my generic implementation of llmulshft, it
was carefully written.

-- 


					    Gilles Chanteperdrix.



* Re: [Xenomai-core] [PATCH-STACK] Updates, timerstats, rtdm-timers
  2007-06-06 18:23               ` Gilles Chanteperdrix
@ 2007-06-06 18:46                 ` Jan Kiszka
  2007-06-07 12:52                   ` Jan Kiszka
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Kiszka @ 2007-06-06 18:46 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai-core

Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>  > Gilles Chanteperdrix wrote:
>  > > Not when the macro and inline bear names that are easy to understand. If
>  > > you do not find the names easy to understand, then change them (I do not
>  > > like rthal_llmul either, but I could not find a name). To make the
>  > > assembly fully understandable, you would need to comment every
>  > > statement. And now, run the assembly code in gdb, and try and print the
>  > > value of a 64 bits intermediate result: you can't.
>  > 
>  > No question, this is a matter of taste.
> 
> No, really, being able to debug the code inside gdb appears to me as
> something more than a "matter of taste", I thought that as the person
> who made Xenomai run with kgdb you would have agreed with me.

Do we optimise hot paths for debuggability? I really don't expect such a
well-defined small function to be the target of a debugging session.
Moreover, you typically debug such micro services with stepi anyway,
watching registers, not variables (which are often undefined due to gcc
optimisations).

> Now, about the way I wrote arithmetic code, their are reasons behind my
> choices. There are some repetitive patterns in this arithmetic code and
> I wanted to facter them out. The first pattern is the conversion between
> 32 bits and 64 bits, we have to do this in a way that is understood by
> the compiler on a particular platform, hence the definition of
> rthal_u64from/tou32 which is different for each platform. x86
> understands shifts and mask (or cast), but gcc for power pc or arm
> prefers the union trick.
> There is also only one way to cause gcc to use the 32x32->64 fast
> multiplication it is exactly what does rthal_ullmul. If you want to do
> the same thing, but write it differently, you invariably cause gcc to
> use a full 64 bits multiplication.
> 
> So, when in rthal_generic_llmulshft, I read:
> 
>     long long hi = (ll >> BITS_PER_LONG) * m;
>     unsigned long long lo = ((long)ll) * m;
> 
> I think this is all wrong:
> - on a 64 bits machine, BITS_PER_LONG is 64 and ll is 64 bits, so ll >>
>   BITS_PER_LONG is 0

Yes, utterly wrong, I noticed this as well. We must set 32 bits in stone.
And doing things with true 64 bit requires 128-bit math for the setup; I
guess that's not worth the trouble.

> 
> - for the first multiplication, the compiler will not detect the
>   "fastmult" condition, and will use a full 64 bits multiplication. In
>   order to get it to generate the minimal multiplication, you should
>   have used:
> 
>   long long hi = (long long)(int)(ll >> 32) * (int) m
> 
>  I find:
> static inline long long rthal_llmi(const int i, const int j)
> {
>         /* Fast 32x32->64 multiplication */
> 	return (long long) i * j;
> }
> 
> /* (...) */
> 	__rthal_u64tou32(op, oph, opl);
> 	hi = rthal_llmi(oph, m);
> 
> easier to write and maintain, understand once you know what
> rthal_llmi does, and generates better code with regard to the
> 32bits/64bits conversion.
> 
> - for the second multiplication, since the two arguments are 32 bits, the
>   compiler will use a 32 bits multiplication, and since you (wrongly)
>   cast the first argument to long, it will use a signed multiplication,
>   whereas we would want it to use an unsigned multiplication, as the
>   assembly routine correctly does.
> 
> Here again, using:
> 
>       lo = rthal_ullmul(opl, m);
> would have been less error-prone.
>    
> So, Ok, I will try to do something for x86 (either reduce the numbers of
> registers used by the C code, or reduce the assembly to the bare
> minimum). But, please, pick my generic implementation of llmulshft, it
> was carefully written.

Yes, it is the better choice for 32-bit archs (my previous tests didn't
truly reflect the usage in Xenomai; redoing them made my generic
version fall behind yours). Will include it.

But your generic code produces worse binaries on 64 bit. Anyway, given
the potential of 64-bit instructions, we had better do this
differently there, e.g. like this for x86_64:

#define rthal_llmulshft(ll, m, s)                                      \
({                                                                     \
       long long __ret;                                                \
                                                                       \
       __asm__ (                                                       \
               /* HI:LO = ll * m */                                    \
               "imull %[__m]\n\t"                                      \
                                                                       \
               /* ret = HI:LO >> s */                                  \
               "shrd %%cl,%%rdx,%%rax\n\t"                             \
               : "=a" (__ret)                                          \
               : "a" (ll), [__m] "m" (m), "c" (s));                    \
       __ret;                                                          \
})

This version actually makes inlining xnarch_tsc_to_ns on that arch
interesting again...
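
For comparison, the same operation can be sketched in portable-ish C. This
is not the code from the patch: the function name is made up, and it
assumes gcc or clang on a 64-bit target, where __int128 exists and >> on a
negative signed value is an arithmetic shift.

```c
#include <stdint.h>

/* Hypothetical C counterpart of a 64-bit llmulshft: form the full
 * product in 128 bits, then shift it right by s. The caller must make
 * sure the shifted result fits into 64 bits. */
static inline int64_t llmulshft_int128(int64_t ll, uint32_t m, unsigned s)
{
	return (int64_t)(((__int128)ll * m) >> s);
}
```

The wide intermediate keeps large operands from overflowing, e.g.
llmulshft_int128(1LL << 40, 1431655765, 31) gives 733007751680, and
negative operands still round towards minus infinity.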

Jan




* Re: [Xenomai-core] [PATCH-STACK] Updates, timerstats, rtdm-timers
  2007-06-06 18:46                 ` Jan Kiszka
@ 2007-06-07 12:52                   ` Jan Kiszka
  2007-06-07 13:02                     ` Gilles Chanteperdrix
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Kiszka @ 2007-06-07 12:52 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai-core

Jan Kiszka wrote:
> Gilles Chanteperdrix wrote:
>> So, Ok, I will try to do something for x86 (either reduce the numbers of
>> registers used by the C code, or reduce the assembly to the bare
>> minimum). But, please, pick my generic implementation of llmulshft, it
>> was carefully written.
> 
> Yes, it is the better choice for 32 bit archs (my previous tests didn't
> reflect the usage in Xenomai truely, redoing them made my generic
> version fall behind yours). Will include it.

Done, see -v6. Then I added that two-liner for x86_64 rthal_llmulshft,
fixed the BITS_PER_LONG bug, and enabled generic-based support for ARM
(testing welcome!).

While at it: my series now also includes rthal_llimd for x86_64,
another two-liner.

Jan



* Re: [Xenomai-core] [PATCH-STACK] Updates, timerstats, rtdm-timers
  2007-06-07 12:52                   ` Jan Kiszka
@ 2007-06-07 13:02                     ` Gilles Chanteperdrix
  2007-06-07 14:06                       ` Jan Kiszka
  0 siblings, 1 reply; 20+ messages in thread
From: Gilles Chanteperdrix @ 2007-06-07 13:02 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Jan Kiszka wrote:
> 
>>Gilles Chanteperdrix wrote:
>>
>>>So, Ok, I will try to do something for x86 (either reduce the numbers of
>>>registers used by the C code, or reduce the assembly to the bare
>>>minimum). But, please, pick my generic implementation of llmulshft, it
>>>was carefully written.
>>
>>Yes, it is the better choice for 32 bit archs (my previous tests didn't
>>reflect the usage in Xenomai truely, redoing them made my generic
>>version fall behind yours). Will include it.
> 
> 
> Done, see -v6. Then I added that two-liner for x86_64 rthal_llmulshft,
> fixed the BITS_PER_LONG bug, and enabled generic-based support for ARM
> (testing welcome!).
> 
> At this chance: My series now also includes rthal_llimd for x86_64,
> another two-liner.

v6 is not in the download area.

-- 
                                                 Gilles Chanteperdrix



* Re: [Xenomai-core] [PATCH-STACK] Updates, timerstats, rtdm-timers
  2007-06-07 13:02                     ` Gilles Chanteperdrix
@ 2007-06-07 14:06                       ` Jan Kiszka
  2007-06-07 14:24                         ` Gilles Chanteperdrix
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Kiszka @ 2007-06-07 14:06 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai-core

Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Jan Kiszka wrote:
>>
>>> Gilles Chanteperdrix wrote:
>>>
>>>> So, Ok, I will try to do something for x86 (either reduce the numbers of
>>>> registers used by the C code, or reduce the assembly to the bare
>>>> minimum). But, please, pick my generic implementation of llmulshft, it
>>>> was carefully written.
>>> Yes, it is the better choice for 32 bit archs (my previous tests didn't
>>> reflect the usage in Xenomai truely, redoing them made my generic
>>> version fall behind yours). Will include it.
>>
>> Done, see -v6. Then I added that two-liner for x86_64 rthal_llmulshft,
>> fixed the BITS_PER_LONG bug, and enabled generic-based support for ARM
>> (testing welcome!).
>>
>> At this chance: My series now also includes rthal_llimd for x86_64,
>> another two-liner.
> 
> v6 is not in the download area.
> 

Mpf, forgot to press "update". Done.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]


* Re: [Xenomai-core] [PATCH-STACK] Updates, timerstats, rtdm-timers
  2007-06-07 14:06                       ` Jan Kiszka
@ 2007-06-07 14:24                         ` Gilles Chanteperdrix
  2007-06-07 14:40                           ` Jan Kiszka
  0 siblings, 1 reply; 20+ messages in thread
From: Gilles Chanteperdrix @ 2007-06-07 14:24 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Gilles Chanteperdrix wrote:
> 
>>Jan Kiszka wrote:
>>
>>>Jan Kiszka wrote:
>>>
>>>
>>>>Gilles Chanteperdrix wrote:
>>>>
>>>>
>>>>>So, Ok, I will try to do something for x86 (either reduce the numbers of
>>>>>registers used by the C code, or reduce the assembly to the bare
>>>>>minimum). But, please, pick my generic implementation of llmulshft, it
>>>>>was carefully written.
>>>>
>>>>Yes, it is the better choice for 32 bit archs (my previous tests didn't
>>>>reflect the usage in Xenomai truly, redoing them made my generic
>>>>version fall behind yours). Will include it.
>>>
>>>Done, see -v6. Then I added that two-liner for x86_64 rthal_llmulshft,
>>>fixed the BITS_PER_LONG bug, and enabled generic-based support for ARM
>>>(testing welcome!).
>>>
>>>At this chance: My series now also includes rthal_llimd for x86_64,
>>>another two-liner.
>>
>>v6 is not in the download area.
>>
> 
> 
> Mpf, forgot to press "update". Done.

Ok, I agree with the fast-tsc-to-ns patch: I could not get gcc to
generate code with fewer moves on x86 (which is, for me, if it was still
needed, yet another proof that these register moves are harmless).

However, I do not agree with the x86_64 llimd: it will not work if m is
greater than 2G (2^31); that is why we implement llimd in terms of
ullimd on other architectures.
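
To illustrate the layering mentioned above (a signed llimd built on top
of an unsigned ullimd), here is a minimal sketch. It is not the Xenomai
implementation: the names sketch_ullimd and sketch_llimd_via_ullimd are
hypothetical, and the 128-bit intermediate relies on the GCC/Clang
unsigned __int128 extension, which is assumed to be available (64-bit
targets only).

```c
#include <stdint.h>

/* Hypothetical unsigned helper: (op * m) / d with a 128-bit intermediate,
 * so the product cannot overflow. Assumes d != 0 and that the quotient
 * fits into 64 bits. */
static uint64_t sketch_ullimd(uint64_t op, uint32_t m, uint32_t d)
{
    unsigned __int128 prod = (unsigned __int128)op * m;
    return (uint64_t)(prod / d);
}

/* Signed variant layered on the unsigned one: strip the sign, scale the
 * magnitude, then restore the sign. m and d are never interpreted as
 * signed values, so m above 2^31 needs no special care. */
static int64_t sketch_llimd_via_ullimd(int64_t op, uint32_t m, uint32_t d)
{
    if (op < 0)
        return -(int64_t)sketch_ullimd(-(uint64_t)op, m, d);
    return (int64_t)sketch_ullimd((uint64_t)op, m, d);
}
```

Because the sign is handled outside the multiplication, this shape works
regardless of how wide or how signed the underlying multiply is.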

-- 
                                                 Gilles Chanteperdrix



* Re: [Xenomai-core] [PATCH-STACK] Updates, timerstats, rtdm-timers
  2007-06-07 14:24                         ` Gilles Chanteperdrix
@ 2007-06-07 14:40                           ` Jan Kiszka
  2007-06-07 14:54                             ` Gilles Chanteperdrix
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Kiszka @ 2007-06-07 14:40 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai-core


Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Gilles Chanteperdrix wrote:
>>
>>> Jan Kiszka wrote:
>>>
>>>> Jan Kiszka wrote:
>>>>
>>>>
>>>>> Gilles Chanteperdrix wrote:
>>>>>
>>>>>
>>>>>> So, Ok, I will try to do something for x86 (either reduce the numbers of
>>>>>> registers used by the C code, or reduce the assembly to the bare
>>>>>> minimum). But, please, pick my generic implementation of llmulshft, it
>>>>>> was carefully written.
>>>>> Yes, it is the better choice for 32 bit archs (my previous tests didn't
>>>>> reflect the usage in Xenomai truly, redoing them made my generic
>>>>> version fall behind yours). Will include it.
>>>> Done, see -v6. Then I added that two-liner for x86_64 rthal_llmulshft,
>>>> fixed the BITS_PER_LONG bug, and enabled generic-based support for ARM
>>>> (testing welcome!).
>>>>
>>>> At this chance: My series now also includes rthal_llimd for x86_64,
>>>> another two-liner.
>>> v6 is not in the download area.
>>>
>>
>> Mpf, forgot to press "update". Done.
> 
> Ok, I agree with the fast-tsc-to-ns patch: I could not get gcc to
> generate code with fewer moves on x86 (which is, for me, if it was still
> needed, yet another proof that these register moves are harmless).

No question -- from the average performance POV.

> 
> However, I do not agree with the x86_64 llimd: it will not work if m is
> greater than 2G, that is why we implement llimd in terms of ullimd on
> other architectures.
> 

Please help me, I don't see it yet:

m is 32 bits wide and gets zero-extended to 64 bits, i.e. without
considering any sign (as it should be). Then we do a signed 64x64-bit
multiplication, but we know for sure that the second factor is never
negative. The same holds for the division. Basic tests
((-1*1000000000)/2 vs. (-1*3000000000)/2) confirmed this on the target.
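
The argument above can be sketched in plain C. This is only an
illustration, not the actual rthal_llimd two-liner: both function names
are hypothetical, and the 128-bit product is modelled with the GCC/Clang
__int128 extension (assumed available). The second function shows, for
contrast, what would go wrong if m were sign-extended instead.

```c
#include <stdint.h>

/* Hypothetical model of the x86_64 approach: m and d are zero-extended
 * to 64 bits, so they are never negative, and a signed multiply/divide
 * yields the right result for any sign of op. Assumes d != 0. */
static int64_t sketch_llimd_zero_ext(int64_t op, uint32_t m, uint32_t d)
{
    __int128 prod = (__int128)op * (int64_t)m;  /* (int64_t)m >= 0 */
    return (int64_t)(prod / (int64_t)d);        /* truncates toward zero */
}

/* Faulty variant for contrast: m is sign-extended, so any m >= 2^31
 * turns negative. The uint32_t -> int32_t conversion is
 * implementation-defined; GCC and Clang wrap modulo 2^32. */
static int64_t sketch_llimd_sign_ext(int64_t op, uint32_t m, uint32_t d)
{
    __int128 prod = (__int128)op * (int64_t)(int32_t)m;
    return (int64_t)(prod / (int64_t)d);
}
```

For the two test inputs quoted above, the zero-extending variant returns
-500000000 and -1500000000. The sign-extending variant agrees on the
first input but not on the second, which is the failure mode a value of
m above 2G would trigger if the extension were done the wrong way.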

Jan




* Re: [Xenomai-core] [PATCH-STACK] Updates, timerstats, rtdm-timers
  2007-06-07 14:40                           ` Jan Kiszka
@ 2007-06-07 14:54                             ` Gilles Chanteperdrix
  0 siblings, 0 replies; 20+ messages in thread
From: Gilles Chanteperdrix @ 2007-06-07 14:54 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Gilles Chanteperdrix wrote:
> 
>>Jan Kiszka wrote:
>>
>>>Gilles Chanteperdrix wrote:
>>>
>>>
>>>>Jan Kiszka wrote:
>>>>
>>>>
>>>>>Jan Kiszka wrote:
>>>>>
>>>>>
>>>>>
>>>>>>Gilles Chanteperdrix wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>>So, Ok, I will try to do something for x86 (either reduce the numbers of
>>>>>>>registers used by the C code, or reduce the assembly to the bare
>>>>>>>minimum). But, please, pick my generic implementation of llmulshft, it
>>>>>>>was carefully written.
>>>>>>
>>>>>>Yes, it is the better choice for 32 bit archs (my previous tests didn't
>>>>>>reflect the usage in Xenomai truly, redoing them made my generic
>>>>>>version fall behind yours). Will include it.
>>>>>
>>>>>Done, see -v6. Then I added that two-liner for x86_64 rthal_llmulshft,
>>>>>fixed the BITS_PER_LONG bug, and enabled generic-based support for ARM
>>>>>(testing welcome!).
>>>>>
>>>>>At this chance: My series now also includes rthal_llimd for x86_64,
>>>>>another two-liner.
>>>>
>>>>v6 is not in the download area.
>>>>
>>>
>>>Mpf, forgot to press "update". Done.
>>
>>Ok, I agree with the fast-tsc-to-ns patch: I could not get gcc to
>>generate code with fewer moves on x86 (which is, for me, if it was still
>>needed, yet another proof that these register moves are harmless).
> 
> 
> No question -- from the average performance POV.
> 
> 
>>However, I do not agree with the x86_64 llimd: it will not work if m is
>>greater than 2G, that is why we implement llimd in terms of ullimd on
>>other architectures.
>>
> 
> 
> Please help me, I don't see it yet:
> 
> m is 32 bit and gets extended to 64 bit without considering any sign (as
> it should be). Then we multiply 64x64 bit signed, but we know for sure
> that the second multiplier is always positive. Same for division. Basic
> tests ((-1*1000000000)/2 vs. (-1*3000000000)/2) confirmed this on the
> target.

No, you are right. It works.

-- 
                                                 Gilles Chanteperdrix



end of thread, other threads:[~2007-06-07 14:54 UTC | newest]

Thread overview: 20+ messages
2007-06-04 23:25 [Xenomai-core] [PATCH-STACK] Updates, timerstats, rtdm-timers Jan Kiszka
2007-06-05  8:31 ` Jan Kiszka
2007-06-05 22:28   ` Gilles Chanteperdrix
2007-06-06 10:30     ` Jan Kiszka
2007-06-06 12:47       ` Gilles Chanteperdrix
2007-06-06 12:59         ` Jan Kiszka
2007-06-06 13:21           ` Gilles Chanteperdrix
2007-06-06 13:31             ` Jan Kiszka
2007-06-06 18:23               ` Gilles Chanteperdrix
2007-06-06 18:46                 ` Jan Kiszka
2007-06-07 12:52                   ` Jan Kiszka
2007-06-07 13:02                     ` Gilles Chanteperdrix
2007-06-07 14:06                       ` Jan Kiszka
2007-06-07 14:24                         ` Gilles Chanteperdrix
2007-06-07 14:40                           ` Jan Kiszka
2007-06-07 14:54                             ` Gilles Chanteperdrix
2007-06-06 12:49   ` Jan Kiszka
2007-06-06 13:29     ` Gilles Chanteperdrix
2007-06-06 13:36       ` Jan Kiszka
2007-06-06 15:08         ` Jan Kiszka
