From: Konstantinos Margaritis
To: rsa@us.ibm.com
Cc: linuxppc-dev@ozlabs.org
Subject: Re: libfreevec benchmarks
Date: Sun, 24 Aug 2008 11:03:56 +0300
Message-Id: <200808241103.58485.markos@codex.gr>
In-Reply-To: <1219427047.7917.38.camel@localhost>
References: <200808211909.16852.markos@codex.gr> <1219427047.7917.38.camel@localhost>
List-Id: Linux on PowerPC Developers Mail List

On Friday 22 August 2008 20:44:07, Ryan S. Arnold wrote:
> Do you have FSF (Free Software Foundation) copyright assignment yet?

Copyright assignment is not the issue. If there had been interest in the
first place, that would never have deterred me.

> How've you implemented the optimizations?

Scalar for small sizes, AltiVec for larger (>16 bytes, depending on the
routine).

> Optimizations for individual architectures should follow the powerpc-cpu
> precedent for providing these routines, e.g.
>
> sysdeps/powerpc/powerpc32/power6/memcpy.S
> sysdeps/powerpc/powerpc64/power6/memcpy.S

That's the idea I got, but so far I understood that only 64-bit
PowerPC/POWER CPUs are supported. What about 32-bit CPUs? libfreevec isn't
ported to 64-bit yet (though I will finish that soon). Would it be enough
to have one directory, e.g.:

  sysdeps/powerpc/powerpc32/altivec/

or would I have to refer to specific CPU models, e.g. 74xx, and use
Implies for the rest?
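To make the "scalar for small sizes, AltiVec for larger" split mentioned earlier in this reply concrete, here is a minimal sketch of a size-threshold dispatch. It is not libfreevec code: the names `sketch_memcpy`, `copy_scalar`, `copy_blocks` and `VEC_THRESHOLD` are hypothetical, and the 16-byte block loop uses a fixed-size `memcpy` as a portable stand-in for an AltiVec load/store pair, so it only illustrates the control flow, not the vector instructions or the alignment handling a real routine needs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed threshold; the mail only says ">16 bytes, depending on the routine". */
#define VEC_THRESHOLD 16

/* Scalar path: plain byte loop, used for short copies and for tails. */
static void copy_scalar(unsigned char *dst, const unsigned char *src, size_t n)
{
    while (n--)
        *dst++ = *src++;
}

/* Block path: moves 16 bytes per iteration. The fixed-size memcpy stands in
   for one vector load plus one vector store; the remainder goes scalar. */
static void copy_blocks(unsigned char *dst, const unsigned char *src, size_t n)
{
    while (n >= 16) {
        memcpy(dst, src, 16);
        dst += 16;
        src += 16;
        n -= 16;
    }
    copy_scalar(dst, src, n);
}

/* Dispatch on size: small copies stay scalar, larger ones use the block path. */
void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);
    if (n <= VEC_THRESHOLD)
        copy_scalar(d, s, n);
    else
        copy_blocks(d, s, n);
    return dst;
}
```

The point of the threshold is that for very short buffers the setup cost of the vector path can outweigh its throughput, which is why the cut-over size may differ per routine.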
> Today, if glibc is configured with --with-cpu=970 it will actually
> default to the power optimizations for the string routines, as indicated
> by the sysdeps/powerpc/powerpc[32|64]/970/Implies files. It'd be worth
> verifying that your baseline glibc runs are against existing optimized
> versions of glibc. If they're not then this is a fault of the distro
> you're testing on.

Well, I used Debian Lenny and openSUSE 11.0 (with glibc 2.7 and glibc 2.8
respectively). If that does not work as intended, then two popular distros
ship a broken glibc, which I don't think is very likely.

> I'm not aware of the status of some of the embedded PowerPC processors
> with-regard to powerpc-cpu optimizations.

Would the G4 and 8610 fall under the "embedded" PowerPC category?

> Our research found that for some tasks on some PowerPC processors the
> expense of reserving the floating point pipeline for vector operations
> exceeds the benefit of using vector insns for the task.

Well, I would advise *strongly* against that, except for specific cases,
and not for OS-wide functions. For example, in a popular 3D application
such as Blender (or the Mesa 3D library), a lot of memory copying is done
along with lots of FPU math. If you use the FPU for plain memcpy and
similar routines, you essentially prevent the app from using it for the
important stuff, i.e. math, and in the end you lose performance. The
AltiVec unit, on the other hand, remains unused all the time, and it's
certainly more capable and more generic than the FPU for most of this
work, not to mention that inside the same app the issue of context
switching becomes unimportant.

> Generally our optimizations tend to favor data an average of 12 bytes
> with 1000 byte max. We also favor aligned data and use the existing
> implementation as a model as a baseline for where we try to keep
> unaligned data performance from dropping below.
Please check the graphs of most libfreevec functions for sizes of
12-1000 bytes. Apart from strlen(), the only function where glibc performs
better overall than libfreevec, most other functions offer the same
performance for sizes up to 48/96 bytes, but beyond that performance
increases dramatically due to the use of the vector unit.

> This research would be a good candidate for selectively replacing some
> of the existing libm functionality. Do these results hold for all
> permutations of long double support? Do they hold for x86/x86_64 as
> well as PowerPC? I would suggest against a massive patch to libc-alpha
> and would instead recommend selective, individual replacement of
> fundamental routines to start with accompanied by exhaustive profile
> data. You have to show that you're dedicated to maintenance of these
> routines and you can't overwhelm the reviewers with massive patches.

For the moment, my focus is on 32-bit floats only, but the algorithm is
the same even for 64-bit/128-bit floating point numbers. It will just use
more terms. And yes, as I said, it doesn't use AltiVec and is totally
cross-platform (just plain C), and the code is very short. I tested the
code on an Athlon X2 as well and I get even better performance than on the
PowerPC CPUs.

For some reason, glibc (and FreeBSD libc for that matter, as I had a look
around) uses very complex source trees for no good reason. The
implementation of a sinf(), for example, is no more than 20 lines of C.

As for commitment, well, I've been working on this stuff since 2004 (with
a ~2 year break because of other obligations: army, family, baby, etc. :),
and unless IBM/Freescale choose to dump AltiVec altogether, I don't see
myself stopping work on it. To tell you the truth, the promotion of the
vector unit by both companies has been a disappointment, in my eyes at
least, so I might just as well switch platform... But that won't happen
yet anyway.
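As an illustration of the claim above that a sinf() fits in roughly 20 lines of plain C and that wider types "just use more terms", here is a minimal sketch. The mail does not say which algorithm is actually used, so the range reduction plus Taylor polynomial below is an assumption, and `sketch_sinf`, `SK_PI` and `SK_INV_PI` are hypothetical names. It is only meant for moderate arguments and makes no claim to match libm's accuracy or special-case handling.

```c
#include <assert.h>

#define SK_PI     3.14159265358979f
#define SK_INV_PI 0.318309886183791f

/* Hypothetical helper, not glibc or libfreevec code: approximates sin(x)
   for moderate |x| with an odd Taylor polynomial. */
float sketch_sinf(float x)
{
    /* Sketch only: keep the int cast below well inside the range of int. */
    assert(x > -1.0e6f && x < 1.0e6f);

    /* Range reduction: x = k*pi + r with r in about [-pi/2, pi/2]. */
    int k = (int)(x * SK_INV_PI + (x >= 0.0f ? 0.5f : -0.5f));
    float r = x - (float)k * SK_PI;
    float r2 = r * r;

    /* sin(r) ~ r - r^3/3! + r^5/5! - r^7/7! + r^9/9! - r^11/11! (Horner).
       A double version would append further terms of the same series. */
    float p = -1.0f / 39916800.0f;
    p = p * r2 + 1.0f / 362880.0f;
    p = p * r2 - 1.0f / 5040.0f;
    p = p * r2 + 1.0f / 120.0f;
    p = p * r2 - 1.0f / 6.0f;
    p = p * r2 + 1.0f;
    p *= r;

    /* sin(k*pi + r) = (-1)^k * sin(r). */
    return (k & 1) ? -p : p;
}
```

On the reduced interval the first omitted term, r^13/13!, is below 1e-7, which is about the resolution of a 32-bit float; that is the sense in which higher precision only needs more terms of the same polynomial. A production routine would also need careful range reduction for large arguments and handling of NaN and infinity.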
> Any submission to GLIBC is going to require that you and your code
> follow the GLIBC process or it'll probably be ignored. You can engage
> me directly via CC and I can help you understand how to integrate the
> code but I can't give you a free pass or do the work for you.

I never asked for that. However, it's more important to me to show first
that the code is worth including; then, *if* it proves worthy, we can
worry about things like copyright assignment.

> The new libc-help mailing list was also created as a place for people to
> learn the process and get the patches in a state where they're ready to
> be submitted to libc-alpha.

I will take a look, thanks for the info.

Konstantinos Margaritis
Codex
http://www.codex.gr