Hello all,

The attached patch optimizes the scale factor calculation, which provides an additional SBC encoder speedup.

For non-gcc compilers, the CLZ function is implemented with very simple, straightforward and slow code (but even when it is used instead of __builtin_clz, the result is still faster than the current git code). Something better could be done, like:
http://groups.google.com/group/comp.sys.arm/msg/5ae56e3a95a2345e?hl=en
But I'm not sure about the license/copyright of the code at this link, so I decided not to touch it. Anyway, I don't think that the gcc implementation of __builtin_clz for CPU cores which do not support a CLZ instruction is any worse.

Joint stereo processing also involves recalculation of scale factors, which can use a similar optimization or even exactly the same function. I intentionally did not benchmark encoding with joint stereo yet, as it would spoil the nice numbers :) That's something to improve next.

Benchmark results (sbcenc with default settings):

====
ARM Cortex-A8:

before:
real    1m 4.84s
user    1m 1.05s
sys     0m 3.78s

after:
real    0m 58.93s
user    0m 55.15s
sys     0m 3.78s

Intel Core2:

before:
real    0m7.729s
user    0m7.268s
sys     0m0.376s

after:
real    0m6.473s
user    0m6.116s
sys     0m0.292s
====

Overall, CPU usage in the SBC encoder looks more or less like this (oprofile log from ARM Cortex-A8):

samples  %        image name       symbol name
2173     30.6791  sbcenc.neon_new  sbc_encode
1774     25.0459  sbcenc.neon_new  sbc_analyze_4b_8s_neon
1525     21.5304  sbcenc.neon_new  sbc_calculate_bits
916      12.9324  sbcenc.neon_new  sbc_calc_scalefactors
600       8.4710  sbcenc.neon_new  sbc_enc_process_input_8s_be
75        1.0589  libc-2.5.so      memcpy
13        0.1835  sbcenc.neon_new  main
4         0.0565  libc-2.5.so      write
2         0.0282  sbcenc.neon_new  .plt
1         0.0141  ld-2.5.so        _dl_relocate_object

Best regards,
Siarhei Siamashka
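
P.S. For illustration, the general idea of a CLZ-based scale factor calculation can be sketched roughly as below. This is only a minimal sketch and not the code from the attached patch: the names clz32 and calc_scale_factor, the flat per-subband sample array and the SCALE_OUT_BITS value of 15 are all assumptions made for the example. Instead of running a shift-and-compare loop for every sample, the magnitudes of all samples in a subband are ORed together and a single CLZ gives the position of the highest set bit.

```c
#include <stdint.h>

/* Assumed fixed-point output scale; hypothetical value for this sketch. */
#define SCALE_OUT_BITS 15

/* Count leading zeros of a nonzero 32-bit value.
 * Uses the gcc builtin when available, otherwise a simple slow loop. */
static inline int clz32(uint32_t x)
{
#ifdef __GNUC__
	return __builtin_clz(x);
#else
	int n = 0;
	while (!(x & 0x80000000u)) {
		x <<= 1;
		n++;
	}
	return n;
#endif
}

/* Scale factor for one subband: the smallest sf such that
 * |sample| <= (2 << (SCALE_OUT_BITS + sf)) holds for every sample. */
static inline uint32_t calc_scale_factor(const int32_t *samples, int blocks)
{
	/* Bit SCALE_OUT_BITS is always set, so x is never zero
	 * and the result is never negative. */
	uint32_t x = 1u << SCALE_OUT_BITS;
	int blk;

	for (blk = 0; blk < blocks; blk++) {
		/* Magnitude computed in unsigned arithmetic (safe for INT32_MIN). */
		uint32_t mag = samples[blk] < 0 ?
			0u - (uint32_t)samples[blk] : (uint32_t)samples[blk];
		if (mag != 0)
			x |= mag - 1;
	}

	/* (31 - clz32(x)) is the index of the highest set bit in x. */
	return (uint32_t)((31 - SCALE_OUT_BITS) - clz32(x));
}
```

With these assumptions, a subband whose largest magnitude is 65536 gets scale factor 0, and one whose largest magnitude is 65537 gets scale factor 1, which matches what a per-sample doubling loop starting from (2 << SCALE_OUT_BITS) would produce.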