* libfreevec benchmarks
@ 2008-08-21 16:09 UTC
From: Konstantinos Margaritis
To: linuxppc-dev

Benh suggested that I make this more known, and in particular on this list, so I am sending this mail in the hope that some people might be interested. In particular, I ran the following benchmarks against libfreevec/glibc:

http://www.freevec.org/content/libfreevec_104_benchmarks_updated

libfreevec has reached a very stable point, where a couple of others (the OpenSuse PowerPC port developer being one) and I have been using it for weeks (personally I've been using it for months), using the LD_PRELOAD mechanism (as explained here: http://www.freevec.org/content/howto_using_libfreevec_using_ld_preload). The OpenSuse developers are even considering using it by default on the ppc port, but that's not final of course.

glibc integration _might_ happen if the glibc developers change their attitude (my mails have been mostly ignored).

Last, I've also been working on a libm rewrite, though this will still take some time. I've reimplemented most math functions at the algorithm level; so far, most functions achieve a 50%-200% speed increase at full IEEE754 accuracy (mathematically proven, soon to be published online) without using AltiVec yet, just by choosing a different approximation method (Taylor approximation is pretty dumb if you ask me anyway).

Regards

Konstantinos Margaritis
Codex
http://www.codex.gr
* Re: libfreevec benchmarks
@ 2008-08-22 17:44 UTC
From: Ryan S. Arnold
To: Konstantinos Margaritis; Cc: linuxppc-dev

On Thu, 2008-08-21 at 19:09 +0300, Konstantinos Margaritis wrote:
> Benh suggested that I make this more known, and in particular on this
> list, so I am sending this mail in the hope that some people might be
> interested. In particular, I ran the following benchmarks against
> libfreevec/glibc:
>
> http://www.freevec.org/content/libfreevec_104_benchmarks_updated

Nice results.

> libfreevec has reached a very stable point, where a couple of others
> (the OpenSuse PowerPC port developer being one) and I have been using
> it for weeks (personally I've been using it for months), using the
> LD_PRELOAD mechanism (as explained here:
> http://www.freevec.org/content/howto_using_libfreevec_using_ld_preload).
> The OpenSuse developers are even considering using it by default on
> the ppc port, but that's not final of course.
>
> glibc integration _might_ happen if the glibc developers change their
> attitude (my mails have been mostly ignored).

Konstantinos,

Do you have FSF (Free Software Foundation) copyright assignment yet?

How have you implemented the optimizations? Vector insns are allowed in the PowerPC code in GLIBC if guarded by PPC_FEATURE_HAS_ALTIVEC (look at setjmp/_longjmp). The use of unguarded vector code is allowed in --with-cpu powerpc-cpu override directories for CPUs that support AltiVec/VMX.

Optimizations for individual architectures should follow the powerpc-cpu precedent for providing these routines, e.g.
sysdeps/powerpc/powerpc32/power6/memcpy.S
sysdeps/powerpc/powerpc64/power6/memcpy.S

I believe that optimizations for the G5 processor would go into the existing 970 directories:

sysdeps/powerpc/powerpc32/970
sysdeps/powerpc/powerpc64/970

Today, if glibc is configured with --with-cpu=970 it will actually default to the POWER optimizations for the string routines, as indicated by the sysdeps/powerpc/powerpc[32|64]/970/Implies files. It'd be worth verifying that your baseline glibc runs are against existing optimized versions of glibc. If they're not, then this is a fault of the distro you're testing on.

I'm not aware of the status of some of the embedded PowerPC processors with regard to powerpc-cpu optimizations.

Our research found that for some tasks on some PowerPC processors the expense of reserving the floating point pipeline for vector operations exceeds the benefit of using vector insns for the task. In these cases we tend to optimize based on pipeline characteristics rather than using the vector facility.

Generally our optimizations tend to favor data with an average size of 12 bytes and a 1000-byte maximum. We also favor aligned data, and we use the existing implementation as the baseline below which we try to keep unaligned-data performance from dropping.

> Last, I've also been working on a libm rewrite, though this will
> still take some time. I've reimplemented most math functions at the
> algorithm level; so far, most functions achieve a 50%-200% speed
> increase at full IEEE754 accuracy (mathematically proven, soon to be
> published online) without using AltiVec yet, just by choosing a
> different approximation method (Taylor approximation is pretty dumb
> if you ask me anyway).

This research would be a good candidate for selectively replacing some of the existing libm functionality. Do these results hold for all permutations of long double support? Do they hold for x86/x86_64 as well as PowerPC?
I would suggest against a massive patch to libc-alpha and would instead recommend selective, individual replacement of fundamental routines to start with, accompanied by exhaustive profile data. You have to show that you're dedicated to the maintenance of these routines, and you can't overwhelm the reviewers with massive patches.

Any submission to GLIBC is going to require that you and your code follow the GLIBC process or it'll probably be ignored. You can engage me directly via CC and I can help you understand how to integrate the code, but I can't give you a free pass or do the work for you.

The new libc-help mailing list was also created as a place for people to learn the process and get patches into a state where they're ready to be submitted to libc-alpha.

Regards,

Ryan S. Arnold
IBM Linux Technology Center
Linux Toolchain Development
* Re: libfreevec benchmarks
@ 2008-08-22 17:50 UTC
From: Ryan S. Arnold
To: Konstantinos Margaritis; Cc: linuxppc-dev

On Fri, 2008-08-22 at 12:44 -0500, Ryan S. Arnold wrote:
> Today, if glibc is configured with --with-cpu=970 it will actually
> default to the power optimizations for the string routines, as
> indicated by the sysdeps/powerpc/powerpc[32|64]/970/Implies files.
> It'd be worth verifying that your baseline glibc runs are against
> existing optimized versions of glibc. If they're not, then this is a
> fault of the distro you're testing on.

I intended to say that "it will actually default to the POWER4 optimizations".

Regards,

Ryan S. Arnold
IBM Linux Technology Center
Linux Toolchain Development
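For context, the Implies files referenced in both messages above are plain text files inside a CPU's sysdeps directory, listing the directories glibc should fall back to when that CPU directory has no override. The fallback chain for the 970 described here (defaulting to the POWER4 routines) would correspond to a sysdeps/powerpc/powerpc32/970/Implies file along these lines; this is reconstructed from the description in the thread, so verify against an actual glibc source tree before relying on it:

```text
powerpc/powerpc32/power4/fpu
powerpc/powerpc32/power4
```

At configure time, --with-cpu=970 selects the 970 directory first, and each line of Implies is searched in order after it, which is exactly why a 970 build picks up the POWER4 string routines.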
* Re: libfreevec benchmarks
@ 2008-08-24 8:03 UTC
From: Konstantinos Margaritis
To: rsa; Cc: linuxppc-dev

On Friday 22 August 2008 20:44:07, Ryan S. Arnold wrote:
> Do you have FSF (Free Software Foundation) copyright assignment yet?

Copyright assignment is not the issue; if there was interest in the first place, that would never have deterred me.

> How have you implemented the optimizations?

Scalar for small sizes, AltiVec for larger (>16 bytes, depending on the routine).

> Optimizations for individual architectures should follow the
> powerpc-cpu precedent for providing these routines, e.g.
>
> sysdeps/powerpc/powerpc32/power6/memcpy.S
> sysdeps/powerpc/powerpc64/power6/memcpy.S

That's the idea I got, but so far I understood that only 64-bit PowerPC/POWER CPUs are supported; what about 32-bit CPUs? libfreevec isn't ported to 64-bit yet (though I will finish that soon). Would it be enough to have one dir like, e.g.:

sysdeps/powerpc/powerpc32/altivec/

or would I have to refer to specific CPU models, e.g. 74xx, and use Implies for the rest?

> Today, if glibc is configured with --with-cpu=970 it will actually
> default to the power optimizations for the string routines, as
> indicated by the sysdeps/powerpc/powerpc[32|64]/970/Implies files.
> It'd be worth verifying that your baseline glibc runs are against
> existing optimized versions of glibc. If they're not, then this is a
> fault of the distro you're testing on.

Well, I used Debian Lenny and OpenSuse 11.0 (using glibc 2.7 and glibc 2.8 respectively). If it doesn't work as it's supposed to, then these are two popular distros with a broken glibc, which I think is not very likely.
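The "scalar for small sizes, AltiVec for larger" scheme described above can be sketched as a dispatch wrapper. Everything below is illustrative: the 16-byte cutoff matches the figure quoted in the thread, but the function names are invented, and the library memcpy stands in for the vector path so the sketch runs on any machine:

```c
#include <string.h>

#define VEC_THRESHOLD 16  /* matches the ">16 bytes" figure above */

/* Simple byte-wise copy for short buffers, where vector setup
 * overhead would dominate the cost of the copy itself. */
static void *copy_scalar(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--) *d++ = *s++;
    return dst;
}

/* Dispatch: scalar path for small sizes, "vector" path for larger. */
void *copy_dispatch(void *dst, const void *src, size_t n) {
    if (n <= VEC_THRESHOLD)
        return copy_scalar(dst, src, n);
    /* A real AltiVec path would align to a 16-byte boundary and use
     * vec_ld/vec_st loops here; plain memcpy stands in for portability. */
    return memcpy(dst, src, n);
}
```

The point of the split is that the vector path pays a fixed cost (alignment handling, register setup) that only amortizes once the copy is long enough; below the threshold the plain scalar loop wins.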
> I'm not aware of the status of some of the embedded PowerPC
> processors with regard to powerpc-cpu optimizations.

Would the G4 and 8610 fall under the "embedded" PowerPC category?

> Our research found that for some tasks on some PowerPC processors the
> expense of reserving the floating point pipeline for vector
> operations exceeds the benefit of using vector insns for the task.

Well, I would advise *strongly* against that except for specific cases, not for OS-wide functions. For example, in a popular 3D application such as Blender (or the Mesa 3D library), a lot of memory copying is done along with lots of FPU math. If you use the FPU for plain memcpy/etc. stuff, you essentially forbid the app from using it for the important stuff, i.e. the math, and in the end you lose performance. On the other hand, the AltiVec unit remains unused all the time, and it's certainly more capable and more generic than the FPU for most of this stuff, not to mention that inside the same app the issue of context switching becomes unimportant.

> Generally our optimizations tend to favor data with an average size
> of 12 bytes and a 1000-byte maximum. We also favor aligned data, and
> we use the existing implementation as the baseline below which we try
> to keep unaligned-data performance from dropping.

Please check the graphs of most libfreevec functions for sizes of 12-1000 bytes. Apart from strlen(), which is the only glibc function that performs better overall than its libfreevec counterpart, most other functions offer the same performance for sizes up to 48/96 bytes, but then performance increases dramatically due to the use of the vector unit.

> This research would be a good candidate for selectively replacing
> some of the existing libm functionality. Do these results hold for
> all permutations of long double support? Do they hold for x86/x86_64
> as well as PowerPC?
> I would suggest against a massive patch to libc-alpha and would
> instead recommend selective, individual replacement of fundamental
> routines to start with, accompanied by exhaustive profile data. You
> have to show that you're dedicated to the maintenance of these
> routines, and you can't overwhelm the reviewers with massive patches.

For the moment my focus is on 32-bit floats only, but the algorithm is the same even for 64-bit/128-bit floating point numbers; it will just use more terms. And yes, as I said, it doesn't use AltiVec and is totally cross-platform (just plain C), and it is very short code. I tested the code on an Athlon X2 as well, and I get even better performance than on the PowerPC CPUs. For some reason, glibc (and the FreeBSD libc for that matter, as I had a look around) uses a very complex source tree for no good reason. The implementation of a sinf(), for example, is no more than 20 C lines.

As for commitment, well, I've been working on this stuff since 2004 (with a ~2-year break because of other obligations: army, family, baby, etc. :), but unless IBM/Freescale choose to dump AltiVec altogether, I don't see myself stopping work on it. To tell you the truth, the promotion of the vector unit by both companies has been a disappointment in my eyes at least, so I might just as well switch platforms... But that won't happen yet anyway.

> Any submission to GLIBC is going to require that you and your code
> follow the GLIBC process or it'll probably be ignored. You can engage
> me directly via CC and I can help you understand how to integrate the
> code, but I can't give you a free pass or do the work for you.

I never asked that.
However, first it's more important to me to show that the code is worth including, and then, *if* it's proven worthy, we can worry about stuff like copyright assignment, etc.

> The new libc-help mailing list was also created as a place for people
> to learn the process and get patches into a state where they're ready
> to be submitted to libc-alpha.

I will take a look; thanks for that info.

Konstantinos Margaritis
Codex
http://www.codex.gr
* Re: libfreevec benchmarks
@ 2008-09-02 22:24 UTC
From: Ryan S. Arnold
To: Konstantinos Margaritis; Cc: linuxppc-dev, Anton Blanchard, Steve Munroe

Hi Konstantinos,

I've been on vacation. Here are my responses.

On Sun, 2008-08-24 at 11:03 +0300, Konstantinos Margaritis wrote:
> Copyright assignment is not the issue; if there was interest in the
> first place, that would never have deterred me.

Okay. This does deter some people when they understand the restrictions.

> > How have you implemented the optimizations?
>
> Scalar for small sizes, AltiVec for larger (>16 bytes, depending on
> the routine).

Okay, this is a reasonable approach.

> > Optimizations for individual architectures should follow the
> > powerpc-cpu precedent for providing these routines, e.g.
> >
> > sysdeps/powerpc/powerpc32/power6/memcpy.S
> > sysdeps/powerpc/powerpc64/power6/memcpy.S
>
> That's the idea I got, but so far I understood that only 64-bit
> PowerPC/POWER CPUs are supported; what about 32-bit CPUs? libfreevec
> isn't ported to 64-bit yet (though I will finish that soon). Would it
> be enough to have one dir like, e.g.:
>
> sysdeps/powerpc/powerpc32/altivec/

My team doesn't deal with 32-bit-only processors at this time and we haven't been asked to do so, so our focus tends to gravitate toward 64-bit, but we still enable biarch.

> or would I have to refer to specific CPU models, e.g. 74xx, and use
> Implies for the rest?

You'd have something like:

sysdeps/powerpc/powerpc32/74xx

AltiVec would be a category decision you'd make in the code, or you may be able to do what we do for fpu-only code (though I'm not saying this is _the_ solution), e.g.
sysdeps/powerpc/powerpc32/74xx/altivec

> > Today, if glibc is configured with --with-cpu=970 it will actually
> > default to the power optimizations for the string routines, as
> > indicated by the sysdeps/powerpc/powerpc[32|64]/970/Implies files.
> > It'd be worth verifying that your baseline glibc runs are against
> > existing optimized versions of glibc. If they're not, then this is
> > a fault of the distro you're testing on.
>
> Well, I used Debian Lenny and OpenSuse 11.0 (using glibc 2.7 and
> glibc 2.8 respectively). If it doesn't work as it's supposed to, then
> these are two popular distros with a broken glibc, which I think is
> not very likely.

The term 'broken' isn't relevant here. They may have made a choice to select a base build that conforms to an ABI that precedes ppc970. Or they may have chosen not to ship an optimized /lib/970/libc.so.6 and instead defer to the 'default' /lib/libc.so.6.

> > I'm not aware of the status of some of the embedded PowerPC
> > processors with regard to powerpc-cpu optimizations.
>
> Would the G4 and 8610 fall under the "embedded" PowerPC category?

I think these processors precede the ISA categories. As long as these processors exist in a desktop or server machine they could probably make it into GLIBC main; otherwise I'm sure 'ports' would accept the overrides.

> > Our research found that for some tasks on some PowerPC processors
> > the expense of reserving the floating point pipeline for vector
> > operations exceeds the benefit of using vector insns for the task.
>
> Well, I would advise *strongly* against that except for specific
> cases, not for OS-wide functions. For example, in a popular 3D
> application such as Blender (or the Mesa 3D library), a lot of memory
> copying is done along with lots of FPU math. If you use the FPU for
> plain memcpy/etc. stuff, you essentially forbid the app from using it
> for the important stuff, i.e. the math, and in the end you lose
> performance.
> On the other hand, the AltiVec unit remains unused all the time, and
> it's certainly more capable and more generic than the FPU for most of
> this stuff, not to mention that inside the same app the issue of
> context switching becomes unimportant.

I didn't describe the situation adequately. The microarchitecture requires that the floating point pipeline be reserved if one wants to perform vector operations on such systems. Therefore, in these cases we choose not to use vector operations for memcpy/etc. This tends to be a system-by-system thing and isn't an OS decision.

> > Generally our optimizations tend to favor data with an average size
> > of 12 bytes and a 1000-byte maximum. We also favor aligned data,
> > and we use the existing implementation as the baseline below which
> > we try to keep unaligned-data performance from dropping.
>
> Please check the graphs of most libfreevec functions for sizes of
> 12-1000 bytes. Apart from strlen(), which is the only glibc function
> that performs better overall than its libfreevec counterpart, most
> other functions offer the same performance for sizes up to 48/96
> bytes, but then performance increases dramatically due to the use of
> the vector unit.

Even for our own optimizations that choose not to use vector, we may want to consider doing so for sizes in excess of 1000 bytes if your research holds true on our hardware. This is interesting.

> For the moment my focus is on 32-bit floats only, but the algorithm
> is the same even for 64-bit/128-bit floating point numbers; it will
> just use more terms. And yes, as I said, it doesn't use AltiVec and
> is totally cross-platform (just plain C), and it is very short code.
> I tested the code on an Athlon X2 as well, and I get even better
> performance than on the PowerPC CPUs. For some reason, glibc (and the
> FreeBSD libc for that matter, as I had a look around) uses a very
> complex source tree for no good reason. The implementation of a
> sinf(), for example, is no more than 20 C lines.
Currently, outside of operations performed in the vector unit, 32-bit floats are not in the Power ISA. Were you running on a 32-bit machine with 64-bit floats (like the ISA describes) and only using 'float' and not 'double'?

The convoluted function layout is due to various spec-conformance layers, i.e. exceptions and errno. On some functions these wrappers contribute up to 40% of the execution time of the function. Some functions also include wrappers for re-computation to increase precision.

> As for commitment, well, I've been working on this stuff since 2004
> (with a ~2-year break because of other obligations: army, family,
> baby, etc. :), but unless IBM/Freescale choose to dump AltiVec
> altogether, I don't see myself stopping work on it. To tell you the
> truth, the promotion of the vector unit by both companies has been a
> disappointment in my eyes at least, so I might just as well switch
> platforms... But that won't happen yet anyway.

IBM doesn't plan on dumping AltiVec/VMX; in fact, we're coming out with VSX.

> > Any submission to GLIBC is going to require that you and your code
> > follow the GLIBC process or it'll probably be ignored. You can
> > engage me directly via CC and I can help you understand how to
> > integrate the code, but I can't give you a free pass or do the work
> > for you.
>
> I never asked that. However, first it's more important to me to show
> that the code is worth including, and then, *if* it's proven worthy,
> we can worry about stuff like copyright assignment, etc.

I'm just giving you the party line and trying to be helpful. If you go to libc-alpha without your papers you'll be ignored.

Ryan S. Arnold
IBM Linux Technology Center
Linux Toolchain Development