Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>  > Jan Kiszka wrote:
>  > ...
>  > > fast-tsc-to-ns-v2.patch
>  > > 
>  > >     [Rebased, improved rounding of least significant digit]
>  > 
>  > Rounding in the fast path for the sake of the last digit was silly.
>  > Instead, I'm now addressing the ugly interval printing via
>  > xnarch_precise_tsc_to_ns when converting the timer interval back into
>  > nanos. -v3 incorporating this has just been uploaded.
> 
> Hi,
> 
> I had a look at the fast-tsc-to-ns implementation, here is how I would
> rewrite it:
> 
> static inline void xnarch_init_llmulshft(const unsigned m_in,
> 					 const unsigned d_in,
> 					 unsigned *m_out,
> 					 unsigned *s_out)
> {
> 	unsigned long long mult;
> 
> 	*s_out = 31;
> 	while (1) {
> 		mult = ((unsigned long long)m_in) << *s_out;
> 		do_div(mult, d_in);
> 		if (mult <= INT_MAX)
> 			break;
> 		(*s_out)--;
> 	}
> 	*m_out = (unsigned)mult;
> }
> 
> /* Non x86. */
> #define __rthal_u96shift(h, m, l, s) ({		\
> 	unsigned _l = (l);			\
> 	unsigned _m = (m);			\
> 	unsigned _s = (s);			\
> 	_l >>= _s;				\
> 	_l |= (_m << (32 - _s));		\
> 	_m >>= _s;				\
> 	_m |= ((h) << (32 - _s));		\
> 	__rthal_u64fromu32(_m, _l);		\
> })
> 
> /* x86 */
> #define __rthal_u96shift(h, m, l, s) ({		\
> 	unsigned _l = (l);			\
> 	unsigned _m = (m);			\
> 	unsigned _s = (s);			\
> 	asm ("shrdl\t%%cl,%1,%0"		\
> 	     : "+r,?m"(_l)			\
> 	     : "r,r"(_m), "c,c"(_s));		\
> 	asm ("shrdl\t%%cl,%1,%0"		\
> 	     : "+r,?m"(_m)			\
> 	     : "r,r"(h), "c,c"(_s));		\
> 	__rthal_u64fromu32(_m, _l);		\
> })
> 
> static inline long long rthal_llmi(int i, int j)
> {
>         /* Signed fast 32x32->64 multiplication */
> 	return (long long) i * j;
> }
> 
> static inline long long gilles_llmulshft(const long long op,
> 					 const unsigned m,
> 					 const unsigned s)
> {
> 	unsigned oph, opl, tlh, tll, thh, thl;
> 	unsigned long long th, tl;
> 
> 	__rthal_u64tou32(op, oph, opl);
> 	tl = rthal_ullmul(opl, m);
> 	__rthal_u64tou32(tl, tlh, tll);
> 	th = rthal_llmi(oph, m);
> 	th += tlh;
> 	__rthal_u64tou32(th, thh, thl);
> 	
> 	return __rthal_u96shift(thh, thl, tll, s);
> }
> 
> 

Thanks for your suggestion.

While your generic version compiles to code of about the same size as
__rthal_generic_llmulshft, your x86 variant is about twice as large as
the full-assembly rthal_llmulshft. Code size translates into I-cache
footprint, which can cost latency.

[gcc 4.1, i386; symbol table excerpts, third column is the size in bytes]
-O2 -mregparm=3 -fomit-frame-pointer:
    63: 08048490   119 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
    68: 08048510   121 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
    77: 08048450    57 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
    78: 080483c0   135 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft

-Os -mregparm=3 -fomit-frame-pointer:
    63: 0804843b    93 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
    68: 08048498    97 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
    77: 08048410    43 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
    78: 080483b4    92 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft

-O2:
    63: 08048480   120 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
    68: 08048500   105 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
    77: 08048440    60 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
    78: 080483c0   117 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft

-Os:
    63: 08048438   104 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
    68: 080484a0    83 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
    77: 0804840b    45 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
    78: 080483b4    87 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft

I'm not arguing that we should turn each and every piece of Xenomai arch
code into pure assembly. But in this case it has already happened, the
source is less scattered, and the object code is more compact. So I
would prefer to keep it as is.
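
For anyone following the thread without the patch at hand, here is a
minimal portable C sketch of what all four variants compute: a 64-bit
operand times a 32-bit factor, shifted right, so that op * m_in / d_in is
approximated by (op * mult) >> shift. The sketch_* names are made up for
illustration and are not part of Xenomai, and this is not the code from
the patch. It assumes d_in is non-zero, m_in / d_in stays below 2^31, the
shift is in 0..31, and the compiler uses arithmetic right shifts and
wrap-around conversions for signed values (as gcc does).

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Pick mult/shift so that (x * mult) >> shift approximates x * m_in / d_in.
 * Starts at the largest shift and backs off until mult fits in 31 bits. */
static void sketch_init_llmulshft(uint32_t m_in, uint32_t d_in,
				  uint32_t *m_out, uint32_t *s_out)
{
	uint32_t shift = 31;
	uint64_t mult;

	assert(d_in != 0);
	for (;;) {
		mult = ((uint64_t)m_in << shift) / d_in;
		if (mult <= INT_MAX || shift == 0)
			break;
		shift--;
	}
	*m_out = (uint32_t)mult;
	*s_out = shift;
}

/* Shift the 96-bit value h:m:l right by s bits (0 <= s <= 31) and
 * return the low 64 bits of the result. */
static uint64_t sketch_u96shift(uint32_t h, uint32_t m, uint32_t l,
				unsigned s)
{
	uint32_t lo, hi;

	if (s == 0)	/* avoid an undefined shift by 32 below */
		return ((uint64_t)m << 32) | l;
	lo = (l >> s) | (m << (32 - s));
	hi = (m >> s) | (h << (32 - s));
	return ((uint64_t)hi << 32) | lo;
}

/* Signed 64-bit operand times 31-bit multiplier, shifted right by s. */
static int64_t sketch_llmulshft(int64_t op, uint32_t m, unsigned s)
{
	int64_t oph = op >> 32;		/* signed high half (assumed arithmetic shift) */
	uint32_t opl = (uint32_t)op;	/* unsigned low half */
	uint64_t tl = (uint64_t)opl * m;	/* low partial product */
	int64_t th = oph * (int64_t)m;		/* high partial product */
	uint32_t thh, thl;

	th += (int64_t)(tl >> 32);	/* carry the upper word of tl into th */
	thh = (uint32_t)((uint64_t)th >> 32);
	thl = (uint32_t)th;
	return (int64_t)sketch_u96shift(thh, thl, (uint32_t)tl, s);
}
```

With m_in = 1000000000 and d_in = 2000000000 (a 2 GHz TSC converted to
nanoseconds), the init step should yield mult = 2^30 and shift = 31, so
2000000000 ticks map to 1000000000 ns.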

Jan