* ARM NEON optimisations for gf-complete/jerasure/ceph-erasure
@ 2014-09-04 14:42 Janne Grunau
From: Janne Grunau @ 2014-09-04 14:42 UTC (permalink / raw)
  To: ceph-devel

Hi,

I've started writing ARM/AArch64 NEON optimisations for gf-complete.
http://git.jannau.net/gf-complete.git/log/?h=neon has proof-of-concept
AArch64 NEON optimisations for w8.

The methods implemented so far are carry-less (polynomial)
multiplication and the split table. The polynomial multiplication is
reasonably fast for region multiplications (~2000MB/s on an Apple A7 at
1.3GHz) since NEON has an 8-bit to 16-bit SIMD polynomial multiplication.
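
As an illustration of the polynomial multiplication approach, here is a
minimal scalar sketch in portable C. It models what one lane of an 8-bit
to 16-bit polynomial multiply computes and then reduces the product back
to 8 bits. The ex_ names are hypothetical and the field polynomial 0x11d
is an assumption; this is not the code in the neon branch.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed primitive polynomial for w=8: x^8 + x^4 + x^3 + x^2 + 1. */
#define EX_POLY_W8 0x11du

/* Carry-less multiply of two 8-bit polynomials over GF(2).
 * The result has at most 15 significant bits. */
static uint16_t ex_clmul8(uint8_t a, uint8_t b)
{
    uint16_t r = 0;
    for (int i = 0; i < 8; i++) {
        if (b & (1u << i))
            r ^= (uint16_t)((uint16_t)a << i);
    }
    return r;
}

/* Reduce a 15-bit product modulo EX_POLY_W8, clearing bits 14..8. */
static uint8_t ex_gf8_reduce(uint16_t p)
{
    for (int bit = 14; bit >= 8; bit--) {
        if (p & (1u << bit))
            p ^= (uint16_t)(EX_POLY_W8 << (bit - 8));
    }
    return (uint8_t)p;
}

/* Multiply two field elements: carry-less multiply, then reduce. */
static uint8_t ex_gf8_mul(uint8_t a, uint8_t b)
{
    return ex_gf8_reduce(ex_clmul8(a, b));
}

/* Region multiplication: dst[i] = c * src[i] for n bytes. */
static void ex_gf8_region_mul(const uint8_t *src, uint8_t *dst,
                              uint8_t c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = ex_gf8_mul(c, src[i]);
}
```

A SIMD version would perform the multiply step on several bytes at once
and do the reduction with further polynomial multiplies instead of the
bit loop used here.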

The split table method is still faster though, at 5700MB/s on the same
CPU. I'm actually surprised by that, since it is faster (per cycle) than
the Core i7-3770 result from gf-complete's manual (page 14). That
suggests that the SSE3 code might not be optimal.
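
For readers unfamiliar with the split table method, a scalar sketch under
the same assumptions (polynomial 0x11d, hypothetical ex_ names) looks
roughly like this. Each source byte is split into two 4-bit halves, each
half indexes a 16-entry table of precomputed products, and the two
results are XORed. A SIMD version can hold each 16-entry table in one
register and do the lookups with a byte table-lookup instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed primitive polynomial for w=8. */
#define EX_POLY_W8 0x11du

/* Bitwise reference multiply, used only to fill the tables. */
static uint8_t ex_gf8_mul_slow(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        b >>= 1;
        /* Multiply a by x and reduce if bit 7 was set. */
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? (EX_POLY_W8 & 0xff) : 0));
    }
    return r;
}

/* Two 16-entry tables for a fixed multiplier c. */
struct ex_split_tables {
    uint8_t lo[16]; /* lo[i] = c * i        */
    uint8_t hi[16]; /* hi[i] = c * (i << 4) */
};

static void ex_split_init(struct ex_split_tables *t, uint8_t c)
{
    for (int i = 0; i < 16; i++) {
        t->lo[i] = ex_gf8_mul_slow(c, (uint8_t)i);
        t->hi[i] = ex_gf8_mul_slow(c, (uint8_t)(i << 4));
    }
}

/* dst[i] = c * src[i], using c * (h*16 ^ l) = c*(h*16) ^ c*l. */
static void ex_split_region_mul(const struct ex_split_tables *t,
                                const uint8_t *src, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = t->lo[src[i] & 0x0f] ^ t->hi[src[i] >> 4];
}
```

The tables depend only on the multiplier, so they are built once per
region call and the inner loop contains no field arithmetic at all.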

I'm currently working on integrating NEON into the build system and will
then extend the existing code to work on ARMv7-a too. Those two tasks are
straightforward. There are a couple of other issues I would like to
discuss before I start to work on them.

The #if/#ifdefs in the source start to make it hard to read when more
than one optimisation is added. Separating arch-specific implementations
from each other and from the generic implementation works reasonably
well for the multimedia-related projects I have experience with
(libav/FFmpeg, x264). There would be arch-specific init functions which
set the appropriate function pointers. The NEON optimisations would then
live in w8_arm.c, which would only be compiled for ARM. If someone has
another idea for how to avoid the #ifdefs, I'm open to that too.
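
A rough sketch of that function pointer layout follows. All names
(ex_gf_w8, ex_w8_init, ex_w8_init_arm) are hypothetical, the polynomial
0x11d is an assumption, and the single #if only stands in for the build
system choosing whether the ARM file gets compiled. The "NEON" function
is a placeholder that calls the generic code so the sketch builds
anywhere.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*ex_region_mul_fn)(const uint8_t *src, uint8_t *dst,
                                 uint8_t c, size_t n);

/* Per-field dispatch table filled in by the init functions. */
struct ex_gf_w8 {
    ex_region_mul_fn region_mul;
    const char *impl_name;
};

/* ---- generic implementation (would live in the generic w8 file) ---- */
static uint8_t ex_gf8_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        b >>= 1;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
    }
    return r;
}

static void ex_region_mul_generic(const uint8_t *src, uint8_t *dst,
                                  uint8_t c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = ex_gf8_mul(c, src[i]);
}

/* ---- ARM part (would live in w8_arm.c, compiled only for ARM) ---- */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
static void ex_region_mul_neon(const uint8_t *src, uint8_t *dst,
                               uint8_t c, size_t n)
{
    /* Placeholder: a real version would use NEON instructions here. */
    ex_region_mul_generic(src, dst, c, n);
}

static int ex_w8_init_arm(struct ex_gf_w8 *gf)
{
    gf->region_mul = ex_region_mul_neon;
    gf->impl_name = "neon";
    return 1;
}
#else
/* Stub used when the ARM file is not part of the build. */
static int ex_w8_init_arm(struct ex_gf_w8 *gf)
{
    (void)gf;
    return 0;
}
#endif

/* Set generic pointers first, then let the arch init override them. */
static void ex_w8_init(struct ex_gf_w8 *gf)
{
    gf->region_mul = ex_region_mul_generic;
    gf->impl_name = "generic";
    ex_w8_init_arm(gf);
}
```

Callers only ever go through gf->region_mul, so the generic file needs no
knowledge of which SIMD variants exist.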

I'm currently using the SSE/NOSSE region option, which is bogus. I'm
wondering whether I should just rename it to SIMD/NOSIMD (not strictly
accurate, since the carry-less operations for w64 and w128 use the SIMD
instruction set but operate on single data). That rename would need to
keep backward compatibility for SSE/NOSSE. The other option would be to
add NEON/NONEON flags.
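
To make the rename option concrete, here is one way the backward
compatibility could be sketched: the old names become aliases for the new
flag values, and the option parser accepts both spellings. The EX_ names,
the numeric values and the parser are all hypothetical and are not
gf-complete's actual definitions.

```c
#include <stddef.h>
#include <string.h>

enum {
    EX_REGION_DEFAULT = 0x0,
    EX_REGION_SIMD    = 0x8,  /* hypothetical value */
    EX_REGION_NOSIMD  = 0x10, /* hypothetical value */
    /* Old names kept as aliases so existing callers still compile. */
    EX_REGION_SSE     = EX_REGION_SIMD,
    EX_REGION_NOSSE   = EX_REGION_NOSIMD
};

/* Parse one region option string and OR its flag into *flags.
 * Returns 1 if the option was recognised, 0 otherwise. */
static int ex_parse_region_option(const char *s, int *flags)
{
    static const struct {
        const char *name;
        int flag;
    } opts[] = {
        { "SIMD",   EX_REGION_SIMD   },
        { "NOSIMD", EX_REGION_NOSIMD },
        { "SSE",    EX_REGION_SIMD   }, /* legacy spelling */
        { "NOSSE",  EX_REGION_NOSIMD }  /* legacy spelling */
    };

    for (size_t i = 0; i < sizeof(opts) / sizeof(opts[0]); i++) {
        if (strcmp(s, opts[i].name) == 0) {
            *flags |= opts[i].flag;
            return 1;
        }
    }
    return 0;
}

/* Requesting SIMD and NOSIMD at the same time is contradictory. */
static int ex_region_flags_valid(int flags)
{
    return !((flags & EX_REGION_SIMD) && (flags & EX_REGION_NOSIMD));
}
```

With separate NEON/NONEON flags instead, the table would gain two new
entries with their own bits and the old SSE/NOSSE entries would stay
unchanged.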

I'm sure I will find other issues to discuss when I start integrating
the NEON optimisations into jerasure and ceph.

thanks

Janne


