From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <4666B716.6010909@domain.hid>
Date: Wed, 06 Jun 2007 15:31:02 +0200
From: Jan Kiszka
MIME-Version: 1.0
References: <46649F7E.3060104@domain.hid> <46651F7D.9090702@domain.hid> <18021.58231.177931.286548@domain.hid> <46668CC3.8050002@domain.hid> <4666ACE5.7030200@domain.hid> <4666AFB3.6040602@domain.hid> <4666B4C7.6020308@domain.hid>
In-Reply-To: <4666B4C7.6020308@domain.hid>
Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="------------enig7C97BF8F381F315AB457CDFD"
Sender: jan.kiszka@domain.hid
Subject: Re: [Xenomai-core] [PATCH-STACK] Updates, timerstats, rtdm-timers
List-Id: "Xenomai life and development \(bug reports, patches, discussions\)"
To: Gilles Chanteperdrix
Cc: xenomai-core

This is an OpenPGP/MIME signed message (RFC 2440 and 3156)
--------------enig7C97BF8F381F315AB457CDFD
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: quoted-printable

Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Gilles Chanteperdrix wrote:
>>
>>> Jan Kiszka wrote:
>>>
>>>> Gilles Chanteperdrix wrote:
>>>>
>>>>
>>>>> Jan Kiszka wrote:
>>>>>
>>>>>> Jan Kiszka wrote:
>>>>>> ...
>>>>>>
>>>>>>> fast-tsc-to-ns-v2.patch
>>>>>>>
>>>>>>> [Rebased, improved rounding of least significant digit]
>>>>>> Rounding in the fast path for the sake of the last digit was silly.
>>>>>> Instead, I'm now addressing the ugly interval printing via
>>>>>> xnarch_precise_tsc_to_ns when converting the timer interval back into
>>>>>> nanos. -v3 incorporating this has just been uploaded.
>>>>> Hi,
>>>>>
>>>>> I had a look at the fast-tsc-to-ns implementation, here is how I would
>>>>> rewrite it:
>>>>>
>>>>> static inline void xnarch_init_llmulshft(const unsigned m_in,
>>>>>                                          const unsigned d_in,
>>>>>                                          unsigned *m_out,
>>>>>                                          unsigned *s_out)
>>>>> {
>>>>>     unsigned long long mult;
>>>>>
>>>>>     *s_out = 31;
>>>>>     while (1) {
>>>>>         mult = ((unsigned long long)m_in) << *s_out;
>>>>>         do_div(mult, d_in);
>>>>>         if (mult <= INT_MAX)
>>>>>             break;
>>>>>         (*s_out)--;
>>>>>     }
>>>>>     *m_out = (unsigned)mult;
>>>>> }
>>>>>
>>>>> /* Non x86. */
>>>>> #define __rthal_u96shift(h, m, l, s) ({ \
>>>>>     unsigned _l = (l); \
>>>>>     unsigned _m = (m); \
>>>>>     unsigned _s = (s); \
>>>>>     _l >>= _s; \
>>>>>     _m >>= s; \
>>>>>     _l |= (_m << (32 - s)); \
>>>>>     _m |= ((h) << (32 - s)); \
>>>>>     __rthal_u64fromu32(_m, _l); \
>>>>> })
>>>>>
>>>>> /* x86 */
>>>>> #define __rthal_u96shift(h, m, l, s) ({ \
>>>>>     unsigned _l = (l); \
>>>>>     unsigned _m = (m); \
>>>>>     unsigned _s = (s); \
>>>>>     asm ("shrdl\t%%cl,%1,%0" \
>>>>>          : "+r,?m"(_l) \
>>>>>          : "r,r"(_m), "c,c"(_s)); \
>>>>>     asm ("shrdl\t%%cl,%1,%0" \
>>>>>          : "+r,?m"(_m) \
>>>>>          : "r,r"(h), "c,c"(_s)); \
>>>>>     __rthal_u64fromu32(_m, _l); \
>>>>> })
>>>>>
>>>>> static inline long long rthal_llmi(int i, int j)
>>>>> {
>>>>>     /* Signed fast 32x32->64 multiplication */
>>>>>     return (long long) i * j;
>>>>> }
>>>>>
>>>>> static inline long long gilles_llmulshft(const long long op,
>>>>>                                          const unsigned m,
>>>>>                                          const unsigned s)
>>>>> {
>>>>>     unsigned oph, opl, tlh, tll, thh, thl;
>>>>>     unsigned long long th, tl;
>>>>>
>>>>>     __rthal_u64tou32(op, oph, opl);
>>>>>     tl = rthal_ullmul(opl, m);
>>>>>     __rthal_u64tou32(tl, tlh, tll);
>>>>>     th = rthal_llmi(oph, m);
>>>>>     th += tlh;
>>>>>     __rthal_u64tou32(th, thh, thl);
>>>>>
>>>>>     return __rthal_u96shift(thh, thl, tll, s);
>>>>> }
>>>>>
>>>>>
>>>> Thanks for your suggestion.
>>>>
>>>> While your generic version produces comparable code, the x86 variant is
>>>> about twice as large as the full-assembly version. And code size
>>>> translates into I-cache occupation, which may have latency costs.
>>>>
>>>> [gcc 4.1, i386]
>>>> -O2 -mregparm=3 -fomit-frame-pointer:
>>>>     63: 08048490   119 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
>>>>     68: 08048510   121 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
>>>>     77: 08048450    57 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
>>>>     78: 080483c0   135 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft
>>>>
>>>> -Os -mregparm=3 -fomit-frame-pointer:
>>>>     63: 0804843b    93 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
>>>>     68: 08048498    97 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
>>>>     77: 08048410    43 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
>>>>     78: 080483b4    92 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft
>>>>
>>>> -O2:
>>>>     63: 08048480   120 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
>>>>     68: 08048500   105 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
>>>>     77: 08048440    60 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
>>>>     78: 080483c0   117 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft
>>>>
>>>> -Os:
>>>>     63: 08048438   104 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
>>>>     68: 080484a0    83 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
>>>>     77: 0804840b    45 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
>>>>     78: 080483b4    87 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft
>>>>
>>>> I'm not arguing we should turn each and every Xenomai arch code into
>>>> pure assembly. But in this case it already happened, it's less scattered
>>>> source code-wise, and it is compacter object-wise. So I would prefer to
>>>> keep it as is.
>>> I would say the advantage of having a C version outperform the
>>> advantages of the full assembly version. C is really easier to
>>> understand and debug.
>>
>> Personally, I prefer the clear (and commented) assembly over the nested
>> macros and inlines.
>
> Not when the macro and inline bear names that are easy to understand. If
> you do not find the names easy to understand, then change them (I do not
> like rthal_llmul either, but I could not find a name). To make the
> assembly fully understandable, you would need to comment every
> statement. And now, run the assembly code in gdb, and try and print the
> value of a 64 bits intermediate result: you can't.

No question, this is a matter of taste.

>
>>
>>> The differences between the two versions are some register moves, which
>>> cost almost nothing, especially since each operation in the assembly
>>
>> Cycle-wise, you are right. But what bites us more in the worst case are
>> memory accesses, specifically when they are not cached. Code size
>> matters more according to my experience.
>>
>>
>>> version depends on the result of the previous operation, which means
>>> lots of pipeline stall, the register moves will just feed the pipeline.
>>> I do not think they really matter. Look at the assembly produced for
>>> gilles_llmulshft on ARM, a low end architecture where each instruction
>>> really costs:
>>> gilles_llmulshft:
>>>     @ args = 0, pretend = 0, frame = 0
>>>     @ frame_needed = 0, uses_anonymous_args = 0
>>>     @ link register save eliminated.
>>>     stmfd   sp!, {r4, r5, r6, r7}
>>>     umull   r6, r7, r0, r2
>>>     mov     r4, r7
>>>     mov     r5, #0
>>>     smlal   r4, r5, r2, r1
>>>     rsb     ip, r3, #32
>>>     mov     r2, r4, lsr r3
>>>     orr     r1, r2, r5, asl ip
>>>     mov     r2, r2, asl ip
>>>     orr     r0, r2, r6, lsr r3
>>>     @ lr needed for prologue
>>>     ldmfd   sp!, {r4, r5, r6, r7}
>>>     mov     pc, lr
>>>
>>> pretty minimal, no ?
>>
>> OK, your version can perfectly go into the ARM arch. But i386 is
>> different: less registers, thus easily a lot of variable shuffling...
>
> variable shuffling which does not really matter, that is my point,
> otherwise the x86 family would not be as fast as it is.

Think of the *code size*...
>
>>
>>> The full assembly version has another big drawback, it is a big block
>>> that the optimizer can not split, whereas in a C version, the optimizer
>>> can decide to interleave the surrounding code. So a C version will
>>> inline better.
>>
>> We are not inlining that service anymore, at least not for its primary
>> usage tsc-to-ns. Inlining costs object size, thus increases the latency
>> (although it saves us a few cycles).
>
> it *is* inlined, in tsc_to/from_ns. Another question that I forgot in my

xnarch_tsc_to_ns uninlines this service, and I don't see other, larger
users so far.

> previous mails: why not using llmulshft for the two services ?

See below, see my original post on all the conversion approaches: scaled
math is inaccurate, doing it both ways may cause noticeable errors when
dealing with calculated vs. measured time stamps over, granted, fairly
long periods.

>
>>
>>> There is one thing I do not like with llmulshft (any implementation), it
>>> is the rounding policy towards minus infinity. llmulshft(-1, 2/3)
>>> returns -1 whereas llimd would return 0.
>>
>> See other postings: rounding of the last digit doesn't matter with
>> scaled math, it's already inaccurate by nature. That's also why we have
>> it only one-way.
>
> When returning -1 instead of 0, it is not the last digit that is wrong,
> but the first (and only) one.

So this is about -1 nanoseconds vs. 0 nanoseconds. Well, does this error
matter in real life? :->


--------------enig7C97BF8F381F315AB457CDFD
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGZrcWniDOoMHTA+kRAmULAJ9PGiemQWJe1mQ1XUCdfDnh3qQ5fQCffH4w
fQ8B3PanFHYyCOIGJOMjJaQ=
=d/0F
-----END PGP SIGNATURE-----

--------------enig7C97BF8F381F315AB457CDFD--
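
Archive note: the rounding disagreement at the end of this message (scaled math rounding towards minus infinity versus a truncating multiply-divide) may be easier to follow with a small portable C sketch. This is not the Xenomai code from the patch. The names sketch_init_mulshift, sketch_mulshift_floor and sketch_muldiv_trunc are hypothetical, and the sketch assumes a two's complement target and results that fit in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the Xenomai API): pick mult and shift so that
 * mult / 2^shift approximates m / d, with mult fitting in 31 bits.
 */
static void sketch_init_mulshift(uint32_t m, uint32_t d,
                                 uint32_t *mult_out, uint32_t *shift_out)
{
    uint32_t shift = 31;
    uint64_t mult;

    assert(d != 0);
    for (;;) {
        mult = ((uint64_t)m << shift) / d;
        /* The shift == 0 guard is an addition of this sketch: it stops
         * the search for ratios of 2^31 or more, which do not fit. */
        if (mult <= INT32_MAX || shift == 0)
            break;
        shift--;
    }
    *mult_out = (uint32_t)mult;
    *shift_out = shift;
}

/*
 * Scaled multiplication: floor(op * mult / 2^shift), computed through a
 * 96-bit intermediate assembled from 32-bit halves.
 */
static int64_t sketch_mulshift_floor(int64_t op, uint32_t mult, uint32_t shift)
{
    uint64_t uop = (uint64_t)op;
    uint32_t opl = (uint32_t)uop;                  /* low half, unsigned */
    int32_t oph = (int32_t)(uint32_t)(uop >> 32);  /* high half, signed */
    uint64_t tl;
    int64_t th;

    assert(shift < 32 && mult <= INT32_MAX);
    tl = (uint64_t)opl * mult;                     /* low partial product */
    th = (int64_t)oph * (int64_t)mult + (int64_t)(tl >> 32);
    /* Bits 32..95 of the product are in th, bits 0..31 in the low word
     * of tl; dropping the low 'shift' bits rounds towards minus infinity. */
    return (int64_t)(((uint64_t)th << (32 - shift)) | ((uint32_t)tl >> shift));
}

/*
 * Reference: exact product followed by C division, which truncates
 * towards zero. No overflow protection, so small operands only.
 */
static int64_t sketch_muldiv_trunc(int64_t op, uint32_t m, uint32_t d)
{
    return op * (int64_t)m / (int64_t)d;
}
```

For the ratio 2/3, the init helper yields mult = 1431655765 and shift = 31. With these values sketch_mulshift_floor(-1, ...) returns -1 while sketch_muldiv_trunc(-1, 2, 3) returns 0, which is the case quoted in the thread. The scaled variant is also slightly off for positive inputs: an operand of 3 gives 1 rather than the exact 2, because mult is itself rounded down.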