From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <4666AFB3.6040602@domain.hid>
Date: Wed, 06 Jun 2007 14:59:31 +0200
From: Jan Kiszka
MIME-Version: 1.0
References: <46649F7E.3060104@domain.hid> <46651F7D.9090702@domain.hid> <18021.58231.177931.286548@domain.hid> <46668CC3.8050002@domain.hid> <4666ACE5.7030200@domain.hid>
In-Reply-To: <4666ACE5.7030200@domain.hid>
Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="------------enig97361200A2AA3A43EFC5250F"
Sender: jan.kiszka@domain.hid
Subject: Re: [Xenomai-core] [PATCH-STACK] Updates, timerstats, rtdm-timers
List-Id: "Xenomai life and development \(bug reports, patches, discussions\)"
To: Gilles Chanteperdrix
Cc: xenomai-core

This is an OpenPGP/MIME signed message (RFC 2440 and 3156)
--------------enig97361200A2AA3A43EFC5250F
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: quoted-printable

Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Gilles Chanteperdrix wrote:
>>
>>> Jan Kiszka wrote:
>>>> Jan Kiszka wrote:
>>>> ...
>>>>> fast-tsc-to-ns-v2.patch
>>>>>
>>>>> [Rebased, improved rounding of least significant digit]
>>>> Rounding in the fast path for the sake of the last digit was silly.
>>>> Instead, I'm now addressing the ugly interval printing via
>>>> xnarch_precise_tsc_to_ns when converting the timer interval back into
>>>> nanos. -v3 incorporating this has just been uploaded.
>>> Hi,
>>>
>>> I had a look at the fast-tsc-to-ns implementation, here is how I would
>>> rewrite it:
>>>
>>> static inline void xnarch_init_llmulshft(const unsigned m_in,
>>>                                          const unsigned d_in,
>>>                                          unsigned *m_out,
>>>                                          unsigned *s_out)
>>> {
>>>         unsigned long long mult;
>>>
>>>         *s_out = 31;
>>>         while (1) {
>>>                 mult = ((unsigned long long)m_in) << *s_out;
>>>                 do_div(mult, d_in);
>>>                 if (mult <= INT_MAX)
>>>                         break;
>>>                 (*s_out)--;
>>>         }
>>>         *m_out = (unsigned)mult;
>>> }
>>>
>>> /* Non x86. */
>>> #define __rthal_u96shift(h, m, l, s) ({ \
>>>         unsigned _l = (l); \
>>>         unsigned _m = (m); \
>>>         unsigned _s = (s); \
>>>         _l >>= _s; \
>>>         _m >>= s; \
>>>         _l |= (_m << (32 - s)); \
>>>         _m |= ((h) << (32 - s)); \
>>>         __rthal_u64fromu32(_m, _l); \
>>> })
>>>
>>> /* x86 */
>>> #define __rthal_u96shift(h, m, l, s) ({ \
>>>         unsigned _l = (l); \
>>>         unsigned _m = (m); \
>>>         unsigned _s = (s); \
>>>         asm ("shrdl\t%%cl,%1,%0" \
>>>              : "+r,?m"(_l) \
>>>              : "r,r"(_m), "c,c"(_s)); \
>>>         asm ("shrdl\t%%cl,%1,%0" \
>>>              : "+r,?m"(_m) \
>>>              : "r,r"(h), "c,c"(_s)); \
>>>         __rthal_u64fromu32(_m, _l); \
>>> })
>>>
>>> static inline long long rthal_llmi(int i, int j)
>>> {
>>>         /* Signed fast 32x32->64 multiplication */
>>>         return (long long) i * j;
>>> }
>>>
>>> static inline long long gilles_llmulshft(const long long op,
>>>                                          const unsigned m,
>>>                                          const unsigned s)
>>> {
>>>         unsigned oph, opl, tlh, tll, thh, thl;
>>>         unsigned long long th, tl;
>>>
>>>         __rthal_u64tou32(op, oph, opl);
>>>         tl = rthal_ullmul(opl, m);
>>>         __rthal_u64tou32(tl, tlh, tll);
>>>         th = rthal_llmi(oph, m);
>>>         th += tlh;
>>>         __rthal_u64tou32(th, thh, thl);
>>>
>>>         return __rthal_u96shift(thh, thl, tll, s);
>>> }
>>>
>>>
>>
>> Thanks for your suggestion.
>>
>> While your generic version produces comparable code, the x86 variant is
>> about twice as large as the full-assembly version. And code size
>> translates into I-cache occupation, which may have latency costs.
>>
>> [gcc 4.1, i386]
>> -O2 -mregparm=3 -fomit-frame-pointer:
>>     63: 08048490   119 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
>>     68: 08048510   121 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
>>     77: 08048450    57 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
>>     78: 080483c0   135 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft
>>
>> -Os -mregparm=3 -fomit-frame-pointer:
>>     63: 0804843b    93 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
>>     68: 08048498    97 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
>>     77: 08048410    43 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
>>     78: 080483b4    92 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft
>>
>> -O2:
>>     63: 08048480   120 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
>>     68: 08048500   105 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
>>     77: 08048440    60 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
>>     78: 080483c0   117 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft
>>
>> -Os:
>>     63: 08048438   104 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft
>>     68: 080484a0    83 FUNC    GLOBAL DEFAULT   13 gilles_llmulshft_x86
>>     77: 0804840b    45 FUNC    GLOBAL DEFAULT   13 rthal_llmulshft
>>     78: 080483b4    87 FUNC    GLOBAL DEFAULT   13 __rthal_generic_llmulshft
>>
>> I'm not arguing we should turn each and every Xenomai arch code into
>> pure assembly. But in this case it already happened, it's less scattered
>> source code-wise, and it is compacter object-wise. So I would prefer to
>> keep it as is.
>
> I would say the advantage of having a C version outperform the
> advantages of the full assembly version. C is really easier to
> understand and debug.

Personally, I prefer the clear (and commented) assembly over the nested
macros and inlines.

>
> The differences between the two versions are some register moves, which
> cost almost nothing, especially since each operation in the assembly

Cycle-wise, you are right. But what bites us more in the worst case are
memory accesses, specifically when they are not cached. Code size
matters more according to my experience.
> version depends on the result of the previous operation, which means
> lots of pipeline stall, the register moves will just feed the pipeline.
> I do not think they really matter. Look at the assembly produced for
> gilles_llmulshft on ARM, a low end architecture where each instruction
> really costs:
> gilles_llmulshft:
>         @ args = 0, pretend = 0, frame = 0
>         @ frame_needed = 0, uses_anonymous_args = 0
>         @ link register save eliminated.
>         stmfd   sp!, {r4, r5, r6, r7}
>         umull   r6, r7, r0, r2
>         mov     r4, r7
>         mov     r5, #0
>         smlal   r4, r5, r2, r1
>         rsb     ip, r3, #32
>         mov     r2, r4, lsr r3
>         orr     r1, r2, r5, asl ip
>         mov     r2, r2, asl ip
>         orr     r0, r2, r6, lsr r3
>         @ lr needed for prologue
>         ldmfd   sp!, {r4, r5, r6, r7}
>         mov     pc, lr
>
> pretty minimal, no ?

OK, your version can perfectly go into the ARM arch. But i386 is
different: fewer registers, thus easily a lot of variable shuffling...

>
> The full assembly version has another big drawback, it is a big block
> that the optimizer can not split, whereas in a C version, the optimizer
> can decide to interleave the surrounding code. So a C version will
> inline better.

We are not inlining that service anymore, at least not for its primary
usage tsc-to-ns. Inlining costs object size, thus increases the latency
(although it saves us a few cycles).

>
> There is one thing I do not like with llmulshft (any implementation), it
> is the rounding policy towards minus infinity. llmulshft(-1, 2/3)
> returns -1 whereas llimd would return 0.

See other postings: rounding of the last digit doesn't matter with
scaled math, it's already inaccurate by nature. That's also why we have
it only one-way.
Jan

--------------enig97361200A2AA3A43EFC5250F
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGZq+zniDOoMHTA+kRAmafAJ95LXksoyWyshFBxF8+9hSkLBicKACfb8/J
zXpT32QiSaaq1jde/af96bo=
=ZUdJ
-----END PGP SIGNATURE-----

--------------enig97361200A2AA3A43EFC5250F--
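
For readers following the rounding argument in this thread, here is a
minimal sketch of multiply-and-shift scaling in portable C. It is not the
Xenomai implementation: the names init_mulshift, mulshift and scale_trunc
are hypothetical, the 96-bit intermediate is built from plain 64-bit
arithmetic instead of the rthal helpers, and it assumes (as GCC does) that
right-shifting a negative value is an arithmetic shift. It only illustrates
why scaling by 2/3 maps -1 to -1 with the shift, while a truncating C
division gives 0.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* All names below are hypothetical; this is not the Xenomai code. */

/* Pick mult and shift so that x * m / d is approximated by
 * (x * mult) >> shift, using the largest shift <= 31 whose
 * multiplier still fits into INT_MAX. */
void init_mulshift(uint32_t m, uint32_t d,
                   uint32_t *mult_out, uint32_t *shift_out)
{
    uint32_t shift = 31;
    uint64_t mult;

    assert(d != 0);
    for (;;) {
        mult = ((uint64_t)m << shift) / d;
        if (mult <= INT_MAX || shift == 0)
            break;
        shift--;
    }
    assert(mult <= INT_MAX); /* ratio too large for this scheme otherwise */
    *mult_out = (uint32_t)mult;
    *shift_out = shift;
}

/* Compute (op * mult) >> shift through a 96-bit intermediate,
 * for shift in 0..31. The result is rounded towards minus infinity. */
int64_t mulshift(int64_t op, uint32_t mult, uint32_t shift)
{
    uint32_t opl = (uint32_t)op;        /* low word, unsigned */
    int32_t oph = (int32_t)(op >> 32);  /* high word, carries the sign */
    uint64_t tl = (uint64_t)opl * mult; /* low partial product */
    int64_t th = (int64_t)oph * mult + (int64_t)(tl >> 32);
    uint32_t tll = (uint32_t)tl;

    /* Upper 64 bits of the product shifted into place, then the
     * surviving bits of the lowest word merged in. */
    return (int64_t)(((uint64_t)th << (32 - shift)) | (tll >> shift));
}

/* Reference: plain C division, which truncates towards zero. */
int64_t scale_trunc(int64_t op, uint32_t m, uint32_t d)
{
    return op * (int64_t)m / (int64_t)d;
}
```

With m = 2 and d = 3 the sketch selects shift 31 and multiplier 1431655765.
mulshift() then maps -1 to -1 and 3000 to 1999, where scale_trunc() returns
0 and 2000, so the last digit is off in either direction of the input range.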