Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
> > Jan Kiszka wrote:
> > ...
> > > fast-tsc-to-ns-v2.patch
> > >
> > > [Rebased, improved rounding of least significant digit]
> >
> > Rounding in the fast path for the sake of the last digit was silly.
> > Instead, I'm now addressing the ugly interval printing via
> > xnarch_precise_tsc_to_ns when converting the timer interval back into
> > nanos. -v3 incorporating this has just been uploaded.
>
> Hi,
>
> I had a look at the fast-tsc-to-ns implementation, here is how I would
> rewrite it:
>
> static inline void xnarch_init_llmulshft(const unsigned m_in,
>                                          const unsigned d_in,
>                                          unsigned *m_out,
>                                          unsigned *s_out)
> {
>         unsigned long long mult;
>
>         *s_out = 31;
>         while (1) {
>                 mult = ((unsigned long long)m_in) << *s_out;
>                 do_div(mult, d_in);
>                 if (mult <= INT_MAX)
>                         break;
>                 (*s_out)--;
>         }
>         *m_out = (unsigned)mult;
> }
>
> /* Non x86. */
> #define __rthal_u96shift(h, m, l, s) ({         \
>         unsigned _l = (l);                      \
>         unsigned _m = (m);                      \
>         unsigned _s = (s);                      \
>         _l >>= _s;                              \
>         _m >>= s;                               \
>         _l |= (_m << (32 - s));                 \
>         _m |= ((h) << (32 - s));                \
>         __rthal_u64fromu32(_m, _l);             \
> })
>
> /* x86 */
> #define __rthal_u96shift(h, m, l, s) ({         \
>         unsigned _l = (l);                      \
>         unsigned _m = (m);                      \
>         unsigned _s = (s);                      \
>         asm ("shrdl\t%%cl,%1,%0"                \
>              : "+r,?m"(_l)                      \
>              : "r,r"(_m), "c,c"(_s));           \
>         asm ("shrdl\t%%cl,%1,%0"                \
>              : "+r,?m"(_m)                      \
>              : "r,r"(h), "c,c"(_s));            \
>         __rthal_u64fromu32(_m, _l);             \
> })
>
> static inline long long rthal_llmi(int i, int j)
> {
>         /* Signed fast 32x32->64 multiplication */
>         return (long long) i * j;
> }
>
> static inline long long gilles_llmulshft(const long long op,
>                                          const unsigned m,
>                                          const unsigned s)
> {
>         unsigned oph, opl, tlh, tll, thh, thl;
>         unsigned long long th, tl;
>
>         __rthal_u64tou32(op, oph, opl);
>         tl = rthal_ullmul(opl, m);
>         __rthal_u64tou32(tl, tlh, tll);
>         th = rthal_llmi(oph, m);
>         th += tlh;
>         __rthal_u64tou32(th, thh, thl);
>
>         return __rthal_u96shift(thh, thl, tll, s);
> }
>
Thanks for your suggestion. While your generic version produces
comparable code, the x86 variant is about twice as large as the
full-assembly version. And code size translates into I-cache occupation,
which may have latency costs.

[gcc 4.1, i386]

-O2 -mregparm=3 -fomit-frame-pointer:
    63: 08048490   119 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
    68: 08048510   121 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
    77: 08048450    57 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
    78: 080483c0   135 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft

-Os -mregparm=3 -fomit-frame-pointer:
    63: 0804843b    93 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
    68: 08048498    97 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
    77: 08048410    43 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
    78: 080483b4    92 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft

-O2:
    63: 08048480   120 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
    68: 08048500   105 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
    77: 08048440    60 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
    78: 080483c0   117 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft

-Os:
    63: 08048438   104 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
    68: 080484a0    83 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
    77: 0804840b    45 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
    78: 080483b4    87 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft

I'm not arguing we should turn each and every piece of Xenomai arch code
into pure assembly. But in this case it has already happened, it's less
scattered source code-wise, and it is more compact object-wise. So I
would prefer to keep it as is.

Jan
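
For readers following the thread, here is a minimal portable C sketch of the scaled multiply-and-shift conversion under discussion: a ratio m_in/d_in is precomputed as a 31-bit multiplier m and a shift s, and a 64-bit operand is then converted as (op * m) >> s using a 96-bit intermediate product. The function names, the <stdint.h> types, and the guards for s == 0 are assumptions made for illustration. This is not the Xenomai HAL code, and in particular not the assembly rthal_llmulshft measured above. Note that in the 96-bit shift the low word has to be combined with the middle word before the middle word itself is shifted.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Illustrative only: pick the largest shift s <= 31 such that
 * m = (m_in << s) / d_in still fits in a positive 32-bit int.
 * Assumes d_in != 0 and m_in / d_in <= INT_MAX. */
static void init_llmulshft(uint32_t m_in, uint32_t d_in,
                           uint32_t *m_out, uint32_t *s_out)
{
    uint32_t s = 31;
    uint64_t mult;

    assert(d_in != 0);
    mult = ((uint64_t)m_in << s) / d_in;
    while (mult > INT_MAX && s > 0) {
        s--;
        mult = ((uint64_t)m_in << s) / d_in;
    }
    *m_out = (uint32_t)mult;
    *s_out = s;
}

/* Shift the 96-bit value h:m:l right by s bits (0 <= s < 32) and
 * return the low 64 bits of the result. */
static uint64_t u96shift(uint32_t h, uint32_t m, uint32_t l, uint32_t s)
{
    uint32_t lo, hi;

    assert(s < 32);
    if (s == 0) /* a shift by 32 - 0 bits would be undefined in C */
        return ((uint64_t)m << 32) | l;

    lo = (l >> s) | (m << (32 - s)); /* uses the unshifted middle word */
    hi = (m >> s) | (h << (32 - s));
    return ((uint64_t)hi << 32) | lo;
}

/* Compute floor(op * m / 2^s) for a signed 64-bit op, with m <= INT_MAX.
 * Relies on two's complement conversions, as the quoted code does. */
static int64_t llmulshft(int64_t op, uint32_t m, uint32_t s)
{
    uint32_t opl = (uint32_t)op;                    /* low half, unsigned */
    int32_t oph = (int32_t)((uint64_t)op >> 32);    /* high half, signed */
    uint64_t tl = (uint64_t)opl * m;                /* low partial product */
    int64_t th = (int64_t)oph * (int32_t)m          /* high partial product */
                 + (int64_t)(tl >> 32);             /* plus carry from low */

    return (int64_t)u96shift((uint32_t)((uint64_t)th >> 32),
                             (uint32_t)th, (uint32_t)tl, s);
}
```

As a usage example under these assumptions: calling init_llmulshft(1000000000, 2400000000, &m, &s) for a hypothetical 2.4 GHz TSC yields m = 894784853 and s = 31, and llmulshft(2400000000, m, s) returns 999999999 rather than 1000000000, because m is truncated. That off-by-one in the last digit is the kind of rounding artifact the earlier part of this thread refers to.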