[PATCH/RFC] SIMD optimizations for SBC encoder analysis filter

* [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter
@ 2008-12-31 16:03 Siarhei Siamashka
  2008-12-31 20:55 ` Luiz Augusto von Dentz
  2009-01-01  8:58 ` Marcel Holtmann
  0 siblings, 2 replies; 20+ messages in thread
From: Siarhei Siamashka @ 2008-12-31 16:03 UTC
  To: linux-bluetooth

[-- Attachment #1: Type: text/plain, Size: 2410 bytes --]

Hello all,

This is a preliminary preview of SIMD optimizations for the SBC encoder
analysis filter.

It already contains an MMX optimization for the 4 subbands case (yes, this
insane number of extra lines of code finally starts to pay off) ;)

Important notice: in order to test the MMX optimizations, you need to pass
the extra '-mmmx' command line option to gcc. Runtime MMX autodetection can
easily be added later. Also don't forget to pass the -s4 option to sbcenc,
because the 8 subbands case is not accelerated yet. By the way, SSE2 registers
are twice as wide as MMX ones, so SSE2 should be a lot faster. MMX, however,
is supported on virtually every x86 CPU in use nowadays and can be considered
the "lowest common denominator".
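
A minimal sketch of what such runtime autodetection could look like. It
assumes a compiler that provides the x86 '__builtin_cpu_supports' builtin
(GCC 4.8 or later, so newer than the toolchains this patch was written for).
The helper name 'sbc_cpu_has_mmx' is hypothetical and not part of the patch:

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of the patch): returns 1 when the CPU
 * reports MMX support at run time, 0 otherwise.
 * Assumption: GCC >= 4.8 or a compatible compiler on x86. On any other
 * compiler or architecture it conservatively reports "no MMX".
 */
static int sbc_cpu_has_mmx(void)
{
	int has_mmx = 0;

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
	/* Make sure the CPU feature data is initialized before querying it */
	__builtin_cpu_init();
	/* The builtin returns some positive value, normalize it to 0/1 */
	has_mmx = __builtin_cpu_supports("mmx") ? 1 : 0;
#endif

	/* postcondition: strictly boolean result */
	assert(has_mmx == 0 || has_mmx == 1);
	return has_mmx;
}
```

With a check like this, the MMX code path could be compiled in unconditionally
and selected only when the helper returns 1.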

My quick benchmark showed that performance improves by about 10% overall
(and by about a factor of two for the analysis filter function alone)
compared with the bluez-4.23 release, which had the old buggy code. The
improvement is much more noticeable over release 4.25, which contains the
new, fixed and mostly unoptimized filter.

So now the performance is better than ever. I guess all platforms should
use SIMD optimizations nowadays, so they should see performance improvements
too. Those 'anamatrix' style optimizations in the older code feel so much
like the previous century ;)

I'm going to focus primarily on NEON and maybe ARMv6 SIMD optimizations;
these will be submitted a bit later. Also, as I have written before, the
other parts of the code are quite inefficient too and can be optimized. There
are still lots of things to improve.


But right now I would like to hear some opinions about the following things
regarding the attached patch:

The first question is about the use of an extra source file for the SIMD
optimizations and the introduction of the
'sbc_encoder_init_simd_optimized_analyze' function into the global namespace.
The rationale is to stop adding changes to 'sbc.c' (otherwise it will become
bloated pretty soon as optimizations for various platforms are added). If
anyone has a better idea, I'm very interested in hearing it.
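
The override mechanism in question can be sketched in isolation as follows.
This is only an illustration of the pattern (default pointers set first, then
optionally replaced by an init function living in another file). All names
here are hypothetical stand-ins, and the two "filters" merely record which
one ran:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same signature as the analyze callbacks in the patch */
typedef void (*analyze_fn)(int16_t *pcm, int16_t *x,
			   int32_t *out, int out_stride);

/* Stand-in for the plain C filter: marks its output with 0 */
static void analyze_4s_c(int16_t *pcm, int16_t *x,
			 int32_t *out, int out_stride)
{
	(void) pcm; (void) x; (void) out_stride;
	out[0] = 0;
}

/* Stand-in for a platform specific filter: marks its output with 1 */
static void analyze_4s_fast(int16_t *pcm, int16_t *x,
			    int32_t *out, int out_stride)
{
	(void) pcm; (void) x; (void) out_stride;
	out[0] = 1;
}

struct encoder_state {
	analyze_fn analyze_4s;
};

/*
 * Would live in the separate SIMD source file. Leaves the pointer
 * untouched when no faster implementation is available.
 */
static void init_optimized_analyze(analyze_fn *analyze_4s, int have_simd)
{
	assert(analyze_4s != NULL);
	if (have_simd)
		*analyze_4s = analyze_4s_fast;
}

static void encoder_init(struct encoder_state *state, int have_simd)
{
	/* Default implementation first */
	state->analyze_4s = analyze_4s_c;
	/* Then try to override it with something faster */
	init_optimized_analyze(&state->analyze_4s, have_simd);
}
```

The caller only ever goes through 'state->analyze_4s', so adding another
platform means touching the init function and nothing in the encoder itself.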

And if the addition of a new source file gets approved, what text should go
into the copyright header?

Now we have two "reference" C implementations of the analysis filter. Is it
OK to keep both? Or should only the SIMD-friendly one remain in the end?
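
For context on what "SIMD-friendly" means here: the reference code in the
patch accumulates two adjacent 16-bit products into each 32-bit accumulator,
which is the shape of the MMX 'pmaddwd' instruction used by the optimized
version. A scalar model of one 64-bit 'pmaddwd' is sketched below. The
function name 'pmaddwd_model' is hypothetical and only for illustration:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scalar model of one 64-bit MMX pmaddwd: four signed 16-bit products,
 * with adjacent pairs summed into two 32-bit lanes.
 * Illustration only, not part of the patch.
 */
static void pmaddwd_model(const int16_t a[4], const int16_t b[4],
			  int32_t out[2])
{
	int lane;

	assert(a != 0 && b != 0 && out != 0);

	for (lane = 0; lane < 2; lane++) {
		int32_t p0 = (int32_t) a[2 * lane] * b[2 * lane];
		int32_t p1 = (int32_t) a[2 * lane + 1] * b[2 * lane + 1];
		/*
		 * Add in unsigned arithmetic so that the single overflow
		 * case (all four inputs equal to -32768) wraps around
		 * instead of being undefined behaviour in C.
		 */
		out[lane] = (int32_t) ((uint32_t) p0 + (uint32_t) p1);
	}
}
```

Reordering the input samples and the constant tables so that each pair of
adjacent elements feeds the same accumulator is what lets five such
multiply-add steps replace the eight scalar multiply-accumulates per loop
iteration in the 4 subbands filter.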

PS. Happy New Year

Best regards,
Siarhei Siamashka

[-- Attachment #2: preview-0002-SIMD-optimizations-for-SBC-encoder-analysis-filter.patch --]
[-- Type: text/x-diff, Size: 25534 bytes --]

From e8f98db87085f8394c68363a4a971aea5b025a9b Mon Sep 17 00:00:00 2001
From: Siarhei Siamashka <siarhei.siamashka@nokia.com>
Date: Wed, 31 Dec 2008 16:52:08 +0200
Subject: [PATCH] SIMD optimizations for SBC encoder analysis filter

Added SIMD-friendly "reference" C implementation of SBC
analysis filter (code layout had to be changed a bit and
constants in the tables reshuffled). This code can be used
as a starting point for MMX/SSE2/NEON/ARMv6 and probably
some others (MIPS?, SPARC?, PPC?) platform specific
optimizations. Initial test version of MMX optimization
for 4 subbands case is also included.
---
 sbc/Makefile.am  |    2 +-
 sbc/sbc.c        |   16 +++-
 sbc/sbc.h        |    6 +
 sbc/sbc_simd.c   |  335 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 sbc/sbc_tables.h |  256 ++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 609 insertions(+), 6 deletions(-)
 create mode 100644 sbc/sbc_simd.c

diff --git a/sbc/Makefile.am b/sbc/Makefile.am
index c42f162..45c2e09 100644
--- a/sbc/Makefile.am
+++ b/sbc/Makefile.am
@@ -8,7 +8,7 @@ endif
 if SBC
 noinst_LTLIBRARIES = libsbc.la
 
-libsbc_la_SOURCES = sbc.h sbc.c sbc_math.h sbc_tables.h
+libsbc_la_SOURCES = sbc.h sbc.c sbc_simd.c sbc_math.h sbc_tables.h
 
 libsbc_la_CFLAGS = -finline-functions -funswitch-loops -fgcse-after-reload
 
diff --git a/sbc/sbc.c b/sbc/sbc.c
index 01b4011..e313d4a 100644
--- a/sbc/sbc.c
+++ b/sbc/sbc.c
@@ -94,7 +94,8 @@ struct sbc_decoder_state {
 struct sbc_encoder_state {
 	int subbands;
 	int position[2];
-	int16_t X[2][256];
+	int16_t buffer[2][256 + 15];
+	int16_t *X[2];
 	void (*sbc_analyze_4b_4s)(int16_t *pcm, int16_t *x,
 				  int32_t *out, int out_stride);
 	void (*sbc_analyze_4b_8s)(int16_t *pcm, int16_t *x,
@@ -1053,9 +1054,22 @@ static void sbc_encoder_init(struct sbc_encoder_state *state,
 	state->subbands = frame->subbands;
 	state->position[0] = state->position[1] = 12 * frame->subbands;
 
+	/* Initialize X pointers (ensure 16 byte alignment) */
+	state->X[0] = state->buffer[0];
+	state->X[1] = state->buffer[1];
+	while ((uintptr_t) state->X[0] & 0xF)
+		state->X[0]++;
+	while ((uintptr_t) state->X[1] & 0xF)
+		state->X[1]++;
+
 	/* Default implementation for analyze function */
 	state->sbc_analyze_4b_4s = sbc_analyze_4b_4s;
 	state->sbc_analyze_4b_8s = sbc_analyze_4b_8s;
+
+	/* Try to override it with something faster */
+	sbc_encoder_init_simd_optimized_analyze(
+		&state->sbc_analyze_4b_4s,
+		&state->sbc_analyze_4b_8s);
 }
 
 struct sbc_priv {
diff --git a/sbc/sbc.h b/sbc/sbc.h
index ab47e32..fd6f18e 100644
--- a/sbc/sbc.h
+++ b/sbc/sbc.h
@@ -90,6 +90,12 @@ int sbc_get_frame_duration(sbc_t *sbc);
 int sbc_get_codesize(sbc_t *sbc);
 void sbc_finish(sbc_t *sbc);
 
+void sbc_encoder_init_simd_optimized_analyze(
+	void (**sbc_analyze_4b_4s)(int16_t *pcm, int16_t *x,
+				   int32_t *out, int out_stride),
+	void (**sbc_analyze_4b_8s)(int16_t *pcm, int16_t *x,
+				   int32_t *out, int out_stride));
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/sbc/sbc_simd.c b/sbc/sbc_simd.c
new file mode 100644
index 0000000..865f88e
--- /dev/null
+++ b/sbc/sbc_simd.c
@@ -0,0 +1,335 @@
+#include <stdint.h>
+#include <stdio.h>
+#include <limits.h>
+#include "sbc.h"
+#include "sbc_math.h"
+#include "sbc_tables.h"
+
+/*
+ * A reference C code with SIMD-friendly tables reordering and code layout.
+ * This code can be used to develop platform specific SIMD optimizations.
+ * Also it may be theoretically used as some kind of test for compiler
+ * autovectorization capabilities :)
+ */
+
+static inline void _sbc_analyze_four_simd(const int16_t *in, int32_t *out,
+					  const FIXED_T *const_table)
+{
+	FIXED_A t1[4];
+	FIXED_T t2[4];
+	int hop = 0;
+
+	/* rounding coefficient */
+	t1[0] = t1[1] = t1[2] = t1[3] =
+		(FIXED_A) 1 << (SBC_PROTO_FIXED4_SCALE - 1);
+
+	/* low pass polyphase filter */
+	for (hop = 0; hop < 40; hop += 8) {
+		t1[0] += (FIXED_A) in[hop] * const_table[hop];
+		t1[0] += (FIXED_A) in[hop + 1] * const_table[hop + 1];
+		t1[1] += (FIXED_A) in[hop + 2] * const_table[hop + 2];
+		t1[1] += (FIXED_A) in[hop + 3] * const_table[hop + 3];
+		t1[2] += (FIXED_A) in[hop + 4] * const_table[hop + 4];
+		t1[2] += (FIXED_A) in[hop + 5] * const_table[hop + 5];
+		t1[3] += (FIXED_A) in[hop + 6] * const_table[hop + 6];
+		t1[3] += (FIXED_A) in[hop + 7] * const_table[hop + 7];
+	}
+
+	/* scaling */
+	t2[0] = t1[0] >> SBC_PROTO_FIXED4_SCALE;
+	t2[1] = t1[1] >> SBC_PROTO_FIXED4_SCALE;
+	t2[2] = t1[2] >> SBC_PROTO_FIXED4_SCALE;
+	t2[3] = t1[3] >> SBC_PROTO_FIXED4_SCALE;
+
+	/* do the cos transform */
+	t1[0]  = (FIXED_A) t2[0] * const_table[40 + 0];
+	t1[0] += (FIXED_A) t2[1] * const_table[40 + 1];
+	t1[1]  = (FIXED_A) t2[0] * const_table[40 + 2];
+	t1[1] += (FIXED_A) t2[1] * const_table[40 + 3];
+
+	t1[2]  = (FIXED_A) t2[0] * const_table[40 + 4];
+	t1[2] += (FIXED_A) t2[1] * const_table[40 + 5];
+	t1[3]  = (FIXED_A) t2[0] * const_table[40 + 6];
+	t1[3] += (FIXED_A) t2[1] * const_table[40 + 7];
+
+	t1[0] += (FIXED_A) t2[2] * const_table[40 + 8];
+	t1[0] += (FIXED_A) t2[3] * const_table[40 + 9];
+	t1[1] += (FIXED_A) t2[2] * const_table[40 + 10];
+	t1[1] += (FIXED_A) t2[3] * const_table[40 + 11];
+	t1[2] += (FIXED_A) t2[2] * const_table[40 + 12];
+	t1[2] += (FIXED_A) t2[3] * const_table[40 + 13];
+	t1[3] += (FIXED_A) t2[2] * const_table[40 + 14];
+	t1[3] += (FIXED_A) t2[3] * const_table[40 + 15];
+
+	out[0] = t1[0] >>
+		(SBC_COS_TABLE_FIXED4_SCALE - SCALE_OUT_BITS);
+	out[1] = t1[1] >>
+		(SBC_COS_TABLE_FIXED4_SCALE - SCALE_OUT_BITS);
+	out[2] = t1[2] >>
+		(SBC_COS_TABLE_FIXED4_SCALE - SCALE_OUT_BITS);
+	out[3] = t1[3] >>
+		(SBC_COS_TABLE_FIXED4_SCALE - SCALE_OUT_BITS);
+}
+
+static inline void _sbc_analyze_eight_simd(const int16_t *in, int32_t *out,
+					   const FIXED_T *consts)
+{
+	FIXED_A t1[8];
+	FIXED_T t2[8];
+	int i, hop;
+
+	/* rounding coefficient */
+	t1[0] = t1[1] = t1[2] = t1[3] = t1[4] = t1[5] = t1[6] = t1[7] =
+		(FIXED_A) 1 << (SBC_PROTO_FIXED8_SCALE-1);
+
+	/* low pass polyphase filter */
+	for (hop = 0; hop < 80; hop += 16) {
+		t1[0] += (FIXED_A) in[hop] * consts[hop];
+		t1[0] += (FIXED_A) in[hop + 1] * consts[hop + 1];
+		t1[1] += (FIXED_A) in[hop + 2] * consts[hop + 2];
+		t1[1] += (FIXED_A) in[hop + 3] * consts[hop + 3];
+		t1[2] += (FIXED_A) in[hop + 4] * consts[hop + 4];
+		t1[2] += (FIXED_A) in[hop + 5] * consts[hop + 5];
+		t1[3] += (FIXED_A) in[hop + 6] * consts[hop + 6];
+		t1[3] += (FIXED_A) in[hop + 7] * consts[hop + 7];
+		t1[4] += (FIXED_A) in[hop + 8] * consts[hop + 8];
+		t1[4] += (FIXED_A) in[hop + 9] * consts[hop + 9];
+		t1[5] += (FIXED_A) in[hop + 10] * consts[hop + 10];
+		t1[5] += (FIXED_A) in[hop + 11] * consts[hop + 11];
+		t1[6] += (FIXED_A) in[hop + 12] * consts[hop + 12];
+		t1[6] += (FIXED_A) in[hop + 13] * consts[hop + 13];
+		t1[7] += (FIXED_A) in[hop + 14] * consts[hop + 14];
+		t1[7] += (FIXED_A) in[hop + 15] * consts[hop + 15];
+	}
+
+	/* scaling */
+	t2[0] = t1[0] >> SBC_PROTO_FIXED8_SCALE;
+	t2[1] = t1[1] >> SBC_PROTO_FIXED8_SCALE;
+	t2[2] = t1[2] >> SBC_PROTO_FIXED8_SCALE;
+	t2[3] = t1[3] >> SBC_PROTO_FIXED8_SCALE;
+	t2[4] = t1[4] >> SBC_PROTO_FIXED8_SCALE;
+	t2[5] = t1[5] >> SBC_PROTO_FIXED8_SCALE;
+	t2[6] = t1[6] >> SBC_PROTO_FIXED8_SCALE;
+	t2[7] = t1[7] >> SBC_PROTO_FIXED8_SCALE;
+
+
+	/* do the cos transform */
+	t1[0] = t1[1] = t1[2] = t1[3] = t1[4] = t1[5] = t1[6] = t1[7] = 0;
+
+	for (i = 0; i < 4; i++) {
+		t1[0] += (FIXED_A) t2[i * 2 + 0] * consts[80 + i * 16 + 0];
+		t1[0] += (FIXED_A) t2[i * 2 + 1] * consts[80 + i * 16 + 1];
+		t1[1] += (FIXED_A) t2[i * 2 + 0] * consts[80 + i * 16 + 2];
+		t1[1] += (FIXED_A) t2[i * 2 + 1] * consts[80 + i * 16 + 3];
+		t1[2] += (FIXED_A) t2[i * 2 + 0] * consts[80 + i * 16 + 4];
+		t1[2] += (FIXED_A) t2[i * 2 + 1] * consts[80 + i * 16 + 5];
+		t1[3] += (FIXED_A) t2[i * 2 + 0] * consts[80 + i * 16 + 6];
+		t1[3] += (FIXED_A) t2[i * 2 + 1] * consts[80 + i * 16 + 7];
+		t1[4] += (FIXED_A) t2[i * 2 + 0] * consts[80 + i * 16 + 8];
+		t1[4] += (FIXED_A) t2[i * 2 + 1] * consts[80 + i * 16 + 9];
+		t1[5] += (FIXED_A) t2[i * 2 + 0] * consts[80 + i * 16 + 10];
+		t1[5] += (FIXED_A) t2[i * 2 + 1] * consts[80 + i * 16 + 11];
+		t1[6] += (FIXED_A) t2[i * 2 + 0] * consts[80 + i * 16 + 12];
+		t1[6] += (FIXED_A) t2[i * 2 + 1] * consts[80 + i * 16 + 13];
+		t1[7] += (FIXED_A) t2[i * 2 + 0] * consts[80 + i * 16 + 14];
+		t1[7] += (FIXED_A) t2[i * 2 + 1] * consts[80 + i * 16 + 15];
+	}
+
+	for (i = 0; i < 8; i++)
+		out[i] = t1[i] >>
+			(SBC_COS_TABLE_FIXED8_SCALE - SCALE_OUT_BITS);
+}
+
+static inline void sbc_analyze_4b_4s_simd(int16_t *pcm, int16_t *x,
+					  int32_t *out, int out_stride)
+{
+	int i;
+	/* Input audio samples and do reordering for SIMD */
+	for (i = 0; i < 16; i += 8) {
+		int16_t *pcm1 = pcm + 8 - i;
+		int16_t *pcm2 = pcm + 8 - i + 4;
+		x[i + 64] = x[i + 0] = pcm2[3];
+		x[i + 65] = x[i + 1] = pcm1[3];
+		x[i + 66] = x[i + 2] = pcm2[2];
+		x[i + 67] = x[i + 3] = pcm2[0];
+		x[i + 68] = x[i + 4] = pcm1[0];
+		x[i + 69] = x[i + 5] = pcm1[2];
+		x[i + 70] = x[i + 6] = pcm1[1];
+		x[i + 71] = x[i + 7] = pcm2[1];
+	}
+
+	/* Analyze blocks */
+	_sbc_analyze_four_simd(x + 12, out, analysis_consts_fixed4_simd_odd);
+	out += out_stride;
+	_sbc_analyze_four_simd(x + 8, out, analysis_consts_fixed4_simd_even);
+	out += out_stride;
+	_sbc_analyze_four_simd(x + 4, out, analysis_consts_fixed4_simd_odd);
+	out += out_stride;
+	_sbc_analyze_four_simd(x + 0, out, analysis_consts_fixed4_simd_even);
+}
+
+static inline void sbc_analyze_4b_8s_simd(int16_t *pcm, int16_t *x,
+					  int32_t *out, int out_stride)
+{
+	int i;
+	/* Input audio samples and do reordering for SIMD */
+	for (i = 0; i < 32; i += 16) {
+		int16_t *pcm1 = pcm + 16 - i;
+		int16_t *pcm2 = pcm + 16 - i + 8;
+		x[i + 128] = x[i + 0] = pcm2[7];
+		x[i + 129] = x[i + 1] = pcm1[7];
+		x[i + 130] = x[i + 2] = pcm2[6];
+		x[i + 131] = x[i + 3] = pcm2[0];
+		x[i + 132] = x[i + 4] = pcm2[5];
+		x[i + 133] = x[i + 5] = pcm2[1];
+		x[i + 134] = x[i + 6] = pcm2[4];
+		x[i + 135] = x[i + 7] = pcm2[2];
+		x[i + 136] = x[i + 8] = pcm2[3];
+		x[i + 137] = x[i + 9] = pcm1[3];
+		x[i + 138] = x[i + 10] = pcm1[6];
+		x[i + 139] = x[i + 11] = pcm1[0];
+		x[i + 140] = x[i + 12] = pcm1[5];
+		x[i + 141] = x[i + 13] = pcm1[1];
+		x[i + 142] = x[i + 14] = pcm1[4];
+		x[i + 143] = x[i + 15] = pcm1[2];
+	}
+
+	/* Analyze blocks */
+	_sbc_analyze_eight_simd(x + 24, out, analysis_consts_fixed8_simd_odd);
+	out += out_stride;
+	_sbc_analyze_eight_simd(x + 16, out, analysis_consts_fixed8_simd_even);
+	out += out_stride;
+	_sbc_analyze_eight_simd(x + 8, out, analysis_consts_fixed8_simd_odd);
+	out += out_stride;
+	_sbc_analyze_eight_simd(x + 0, out, analysis_consts_fixed8_simd_even);
+}
+
+/*
+ * MMX optimized implementation
+ */
+
+#if defined(__GNUC__) && defined(__MMX__) && !defined(SBC_HIGH_PRECISION)
+#define USE_MMX
+#endif
+
+#ifdef USE_MMX
+
+static inline void _sbc_analyze_four_mmx(const int16_t *in, int32_t *out,
+					 const FIXED_T *const_table)
+{
+	static int32_t round_c[2] = {
+		1 << (SBC_PROTO_FIXED4_SCALE - 1),
+		1 << (SBC_PROTO_FIXED4_SCALE - 1),
+	};
+	asm volatile (
+		"movq       (%0), %%mm0\n"
+		"movq      8(%0), %%mm1\n"
+		"pmaddwd    (%1), %%mm0\n"
+		"pmaddwd   8(%1), %%mm1\n"
+		"paddd      (%2), %%mm0\n"
+		"paddd      (%2), %%mm1\n"
+		"\n"
+		"movq     16(%0), %%mm2\n"
+		"movq     24(%0), %%mm3\n"
+		"pmaddwd  16(%1), %%mm2\n"
+		"pmaddwd  24(%1), %%mm3\n"
+		"paddd     %%mm2, %%mm0\n"
+		"paddd     %%mm3, %%mm1\n"
+		"\n"
+		"movq     32(%0), %%mm2\n"
+		"movq     40(%0), %%mm3\n"
+		"pmaddwd  32(%1), %%mm2\n"
+		"pmaddwd  40(%1), %%mm3\n"
+		"paddd     %%mm2, %%mm0\n"
+		"paddd     %%mm3, %%mm1\n"
+		"\n"
+		"movq     48(%0), %%mm2\n"
+		"movq     56(%0), %%mm3\n"
+		"pmaddwd  48(%1), %%mm2\n"
+		"pmaddwd  56(%1), %%mm3\n"
+		"paddd     %%mm2, %%mm0\n"
+		"paddd     %%mm3, %%mm1\n"
+		"\n"
+		"movq     64(%0), %%mm2\n"
+		"movq     72(%0), %%mm3\n"
+		"pmaddwd  64(%1), %%mm2\n"
+		"pmaddwd  72(%1), %%mm3\n"
+		"paddd     %%mm2, %%mm0\n"
+		"paddd     %%mm3, %%mm1\n"
+		"\n"
+		"psrad        %4, %%mm0\n"
+		"psrad        %4, %%mm1\n"
+		"pshufw    $0x88, %%mm0, %%mm0\n"
+		"pshufw    $0x88, %%mm1, %%mm1\n"
+		"\n"
+		"movq      %%mm0, %%mm2\n"
+		"pmaddwd  80(%1), %%mm0\n"
+		"pmaddwd  88(%1), %%mm2\n"
+		"\n"
+		"movq      %%mm1, %%mm3\n"
+		"pmaddwd  96(%1), %%mm1\n"
+		"pmaddwd 104(%1), %%mm3\n"
+		"paddd     %%mm1, %%mm0\n"
+		"paddd     %%mm3, %%mm2\n"
+		"\n"
+		"movq      %%mm0, (%3)\n"
+		"movq      %%mm2, 8(%3)\n"
+		:
+		: "r" (in), "r" (const_table), "r" (&round_c), "r" (out),
+		  "i" (SBC_PROTO_FIXED4_SCALE)
+		: "memory");
+}
+
+static inline void sbc_analyze_4b_4s_mmx(int16_t *pcm, int16_t *x,
+					 int32_t *out, int out_stride)
+{
+	/* Input audio samples and do reordering for SIMD */
+	asm volatile (
+		"pshufw $0x23,  24(%0), %%mm0\n"
+		"pshufw $0x18,  16(%0), %%mm1\n"
+		"pinsrw    $1,  22(%0), %%mm0\n"
+		"pinsrw    $3,  26(%0), %%mm1\n"
+		"movq   %%mm0,   (%1)\n"
+		"movq   %%mm1,  8(%1)\n"
+		"movq   %%mm0, 128(%1)\n"
+		"movq   %%mm1, 136(%1)\n"
+		"\n"
+		"pshufw $0x23,   8(%0), %%mm0\n"
+		"pshufw $0x18,    (%0), %%mm1\n"
+		"pinsrw    $1,   6(%0), %%mm0\n"
+		"pinsrw    $3,  10(%0), %%mm1\n"
+		"movq   %%mm0,  16(%1)\n"
+		"movq   %%mm1,  24(%1)\n"
+		"movq   %%mm0, 144(%1)\n"
+		"movq   %%mm1, 152(%1)\n"
+		:
+		: "r" (pcm), "r" (x)
+		: "memory");
+
+	/* Analyze blocks */
+	_sbc_analyze_four_mmx(x + 12, out, analysis_consts_fixed4_simd_odd);
+	out += out_stride;
+	_sbc_analyze_four_mmx(x + 8, out, analysis_consts_fixed4_simd_even);
+	out += out_stride;
+	_sbc_analyze_four_mmx(x + 4, out, analysis_consts_fixed4_simd_odd);
+	out += out_stride;
+	_sbc_analyze_four_mmx(x + 0, out, analysis_consts_fixed4_simd_even);
+
+	asm volatile ("emms");
+}
+
+#endif
+
+/*
+ * TODO: runtime MMX detection (right now -mmmx gcc option is required)
+ */
+void sbc_encoder_init_simd_optimized_analyze(
+	void (**sbc_analyze_4b_4s)(int16_t *pcm, int16_t *x,
+				   int32_t *out, int out_stride),
+	void (**sbc_analyze_4b_8s)(int16_t *pcm, int16_t *x,
+				   int32_t *out, int out_stride))
+{
+#ifdef USE_MMX
+	*sbc_analyze_4b_4s = sbc_analyze_4b_4s_mmx;
+#endif
+}
diff --git a/sbc/sbc_tables.h b/sbc/sbc_tables.h
index 8df8c1f..4955f93 100644
--- a/sbc/sbc_tables.h
+++ b/sbc/sbc_tables.h
@@ -157,8 +157,9 @@ static const int32_t synmatrix8[16][8] = {
  */
 #define SBC_PROTO_FIXED4_SCALE \
 	((sizeof(FIXED_T) * CHAR_BIT - 1) - SBC_FIXED_EXTRA_BITS + 1)
-#define F(x) (FIXED_A) ((x * 2) * \
+#define F_PROTO4(x) (FIXED_A) ((x * 2) * \
 	((FIXED_A) 1 << (sizeof(FIXED_T) * CHAR_BIT - 1)) + 0.5)
+#define F(x) F_PROTO4(x)
 static const FIXED_T _sbc_proto_fixed4[40] = {
 	 F(0.00000000E+00),  F(5.36548976E-04),
 	-F(1.49188357E-03),  F(2.73370904E-03),
@@ -206,8 +207,9 @@ static const FIXED_T _sbc_proto_fixed4[40] = {
  */
 #define SBC_COS_TABLE_FIXED4_SCALE \
 	((sizeof(FIXED_T) * CHAR_BIT - 1) + SBC_FIXED_EXTRA_BITS)
-#define F(x) (FIXED_A) ((x) * \
+#define F_COS4(x) (FIXED_A) ((x) * \
 	((FIXED_A) 1 << (sizeof(FIXED_T) * CHAR_BIT - 1)) + 0.5)
+#define F(x) F_COS4(x)
 static const FIXED_T cos_table_fixed_4[32] = {
 	 F(0.7071067812),  F(0.9238795325), -F(1.0000000000),  F(0.9238795325),
 	 F(0.7071067812),  F(0.3826834324),  F(0.0000000000),  F(0.3826834324),
@@ -233,8 +235,9 @@ static const FIXED_T cos_table_fixed_4[32] = {
  */
 #define SBC_PROTO_FIXED8_SCALE \
 	((sizeof(FIXED_T) * CHAR_BIT - 1) - SBC_FIXED_EXTRA_BITS + 2)
-#define F(x) (FIXED_A) ((x * 4) * \
+#define F_PROTO8(x) (FIXED_A) ((x * 4) * \
 	((FIXED_A) 1 << (sizeof(FIXED_T) * CHAR_BIT - 1)) + 0.5)
+#define F(x) F_PROTO8(x)
 static const FIXED_T _sbc_proto_fixed8[80] = {
 	 F(0.00000000E+00),  F(1.56575398E-04),
 	 F(3.43256425E-04),  F(5.54620202E-04),
@@ -301,8 +304,9 @@ static const FIXED_T _sbc_proto_fixed8[80] = {
  */
 #define SBC_COS_TABLE_FIXED8_SCALE \
 	((sizeof(FIXED_T) * CHAR_BIT - 1) + SBC_FIXED_EXTRA_BITS)
-#define F(x) (FIXED_A) ((x) * \
+#define F_COS8(x) (FIXED_A) ((x) * \
 	((FIXED_A) 1 << (sizeof(FIXED_T) * CHAR_BIT - 1)) + 0.5)
+#define F(x) F_COS8(x)
 static const FIXED_T cos_table_fixed_8[128] = {
 	 F(0.7071067812),  F(0.8314696123),  F(0.9238795325),  F(0.9807852804),
 	-F(1.0000000000),  F(0.9807852804),  F(0.9238795325),  F(0.8314696123),
@@ -345,3 +349,247 @@ static const FIXED_T cos_table_fixed_8[128] = {
 	-F(0.0000000000), -F(0.1950903220),  F(0.3826834324), -F(0.5555702330),
 };
 #undef F
+
+/*
+ * Constant tables for the use in SIMD optimized analysis filters
+ * Each table consists of two parts:
+ * 1. reordered "proto" table
+ * 2. reordered "cos" table
+ *
+ * Due to non-symmetrical reordering, separate tables for "even"
+ * and "odd" cases are needed
+ */
+
+#ifdef __GNUC__
+#define SIMD_ALIGNED __attribute__((aligned(16)))
+#else
+#define SIMD_ALIGNED
+#endif
+
+static const FIXED_T SIMD_ALIGNED analysis_consts_fixed4_simd_even[40 + 16] = {
+#define F(x) F_PROTO4(x)
+	 F(0.00000000E+00),  F(3.83720193E-03),
+	 F(5.36548976E-04),  F(2.73370904E-03),
+	 F(3.06012286E-03),  F(3.89205149E-03),
+	 F(0.00000000E+00), -F(1.49188357E-03),
+	 F(1.09137620E-02),  F(2.58767811E-02),
+	 F(2.04385087E-02),  F(3.21939290E-02),
+	 F(7.76463494E-02),  F(6.13245186E-03),
+	 F(0.00000000E+00), -F(2.88757392E-02),
+	 F(1.35593274E-01),  F(2.94315332E-01),
+	 F(1.94987841E-01),  F(2.81828203E-01),
+	-F(1.94987841E-01),  F(2.81828203E-01),
+	 F(0.00000000E+00), -F(2.46636662E-01),
+	-F(1.35593274E-01),  F(2.58767811E-02),
+	-F(7.76463494E-02),  F(6.13245186E-03),
+	-F(2.04385087E-02),  F(3.21939290E-02),
+	 F(0.00000000E+00),  F(2.88217274E-02),
+	-F(1.09137620E-02),  F(3.83720193E-03),
+	-F(3.06012286E-03),  F(3.89205149E-03),
+	-F(5.36548976E-04),  F(2.73370904E-03),
+	 F(0.00000000E+00), -F(1.86581691E-03),
+#undef F
+#define F(x) F_COS4(x)
+	 F(0.7071067812),  F(0.9238795325),
+	-F(0.7071067812),  F(0.3826834324),
+	-F(0.7071067812), -F(0.3826834324),
+	 F(0.7071067812), -F(0.9238795325),
+	 F(0.3826834324), -F(1.0000000000),
+	-F(0.9238795325), -F(1.0000000000),
+	 F(0.9238795325), -F(1.0000000000),
+	-F(0.3826834324), -F(1.0000000000),
+#undef F
+};
+
+static const FIXED_T SIMD_ALIGNED analysis_consts_fixed4_simd_odd[40 + 16] = {
+#define F(x) F_PROTO4(x)
+	 F(2.73370904E-03),  F(5.36548976E-04),
+	-F(1.49188357E-03),  F(0.00000000E+00),
+	 F(3.83720193E-03),  F(1.09137620E-02),
+	 F(3.89205149E-03),  F(3.06012286E-03),
+	 F(3.21939290E-02),  F(2.04385087E-02),
+	-F(2.88757392E-02),  F(0.00000000E+00),
+	 F(2.58767811E-02),  F(1.35593274E-01),
+	 F(6.13245186E-03),  F(7.76463494E-02),
+	 F(2.81828203E-01),  F(1.94987841E-01),
+	-F(2.46636662E-01),  F(0.00000000E+00),
+	 F(2.94315332E-01), -F(1.35593274E-01),
+	 F(2.81828203E-01), -F(1.94987841E-01),
+	 F(6.13245186E-03), -F(7.76463494E-02),
+	 F(2.88217274E-02),  F(0.00000000E+00),
+	 F(2.58767811E-02), -F(1.09137620E-02),
+	 F(3.21939290E-02), -F(2.04385087E-02),
+	 F(3.89205149E-03), -F(3.06012286E-03),
+	-F(1.86581691E-03),  F(0.00000000E+00),
+	 F(3.83720193E-03),  F(0.00000000E+00),
+	 F(2.73370904E-03), -F(5.36548976E-04),
+#undef F
+#define F(x) F_COS4(x)
+	 F(0.9238795325), -F(1.0000000000),
+	 F(0.3826834324), -F(1.0000000000),
+	-F(0.3826834324), -F(1.0000000000),
+	-F(0.9238795325), -F(1.0000000000),
+	 F(0.7071067812),  F(0.3826834324),
+	-F(0.7071067812), -F(0.9238795325),
+	-F(0.7071067812),  F(0.9238795325),
+	 F(0.7071067812), -F(0.3826834324),
+#undef F
+};
+
+static const FIXED_T SIMD_ALIGNED analysis_consts_fixed8_simd_even[80 + 64] = {
+#define F(x) F_PROTO8(x)
+	 F(0.00000000E+00),  F(2.01182542E-03),
+	 F(1.56575398E-04),  F(1.78371725E-03),
+	 F(3.43256425E-04),  F(1.47640169E-03),
+	 F(5.54620202E-04),  F(1.13992507E-03),
+	-F(8.23919506E-04),  F(0.00000000E+00),
+	 F(2.10371989E-03),  F(3.49717454E-03),
+	 F(1.99454554E-03),  F(1.64973098E-03),
+	 F(1.61656283E-03),  F(1.78805361E-04),
+	 F(5.65949473E-03),  F(1.29371806E-02),
+	 F(8.02941163E-03),  F(1.53184106E-02),
+	 F(1.04584443E-02),  F(1.62208471E-02),
+	 F(1.27472335E-02),  F(1.59045603E-02),
+	-F(1.46525263E-02),  F(0.00000000E+00),
+	 F(8.85757540E-03),  F(5.31873032E-02),
+	 F(2.92408442E-03),  F(3.90751381E-02),
+	-F(4.91578024E-03),  F(2.61098752E-02),
+	 F(6.79989431E-02),  F(1.46955068E-01),
+	 F(8.29847578E-02),  F(1.45389847E-01),
+	 F(9.75753918E-02),  F(1.40753505E-01),
+	 F(1.11196689E-01),  F(1.33264415E-01),
+	-F(1.23264548E-01),  F(0.00000000E+00),
+	 F(1.45389847E-01), -F(8.29847578E-02),
+	 F(1.40753505E-01), -F(9.75753918E-02),
+	 F(1.33264415E-01), -F(1.11196689E-01),
+	-F(6.79989431E-02),  F(1.29371806E-02),
+	-F(5.31873032E-02),  F(8.85757540E-03),
+	-F(3.90751381E-02),  F(2.92408442E-03),
+	-F(2.61098752E-02), -F(4.91578024E-03),
+	 F(1.46404076E-02),  F(0.00000000E+00),
+	 F(1.53184106E-02), -F(8.02941163E-03),
+	 F(1.62208471E-02), -F(1.04584443E-02),
+	 F(1.59045603E-02), -F(1.27472335E-02),
+	-F(5.65949473E-03),  F(2.01182542E-03),
+	-F(3.49717454E-03),  F(2.10371989E-03),
+	-F(1.64973098E-03),  F(1.99454554E-03),
+	-F(1.78805361E-04),  F(1.61656283E-03),
+	-F(9.02154502E-04),  F(0.00000000E+00),
+	 F(1.78371725E-03), -F(1.56575398E-04),
+	 F(1.47640169E-03), -F(3.43256425E-04),
+	 F(1.13992507E-03), -F(5.54620202E-04),
+#undef F
+#define F(x) F_COS8(x)
+	 F(0.7071067812),  F(0.8314696123),
+	-F(0.7071067812), -F(0.1950903220),
+	-F(0.7071067812), -F(0.9807852804),
+	 F(0.7071067812), -F(0.5555702330),
+	 F(0.7071067812),  F(0.5555702330),
+	-F(0.7071067812),  F(0.9807852804),
+	-F(0.7071067812),  F(0.1950903220),
+	 F(0.7071067812), -F(0.8314696123),
+	 F(0.9238795325),  F(0.9807852804),
+	 F(0.3826834324),  F(0.8314696123),
+	-F(0.3826834324),  F(0.5555702330),
+	-F(0.9238795325),  F(0.1950903220),
+	-F(0.9238795325), -F(0.1950903220),
+	-F(0.3826834324), -F(0.5555702330),
+	 F(0.3826834324), -F(0.8314696123),
+	 F(0.9238795325), -F(0.9807852804),
+	-F(1.0000000000),  F(0.5555702330),
+	-F(1.0000000000), -F(0.9807852804),
+	-F(1.0000000000),  F(0.1950903220),
+	-F(1.0000000000),  F(0.8314696123),
+	-F(1.0000000000), -F(0.8314696123),
+	-F(1.0000000000), -F(0.1950903220),
+	-F(1.0000000000),  F(0.9807852804),
+	-F(1.0000000000), -F(0.5555702330),
+	 F(0.3826834324),  F(0.1950903220),
+	-F(0.9238795325), -F(0.5555702330),
+	 F(0.9238795325),  F(0.8314696123),
+	-F(0.3826834324), -F(0.9807852804),
+	-F(0.3826834324),  F(0.9807852804),
+	 F(0.9238795325), -F(0.8314696123),
+	-F(0.9238795325),  F(0.5555702330),
+	 F(0.3826834324), -F(0.1950903220),
+#undef F
+};
+
+static const FIXED_T SIMD_ALIGNED analysis_consts_fixed8_simd_odd[80 + 64] = {
+#define F(x) F_PROTO8(x)
+	 F(0.00000000E+00), -F(8.23919506E-04),
+	 F(1.56575398E-04),  F(1.78371725E-03),
+	 F(3.43256425E-04),  F(1.47640169E-03),
+	 F(5.54620202E-04),  F(1.13992507E-03),
+	 F(2.01182542E-03),  F(5.65949473E-03),
+	 F(2.10371989E-03),  F(3.49717454E-03),
+	 F(1.99454554E-03),  F(1.64973098E-03),
+	 F(1.61656283E-03),  F(1.78805361E-04),
+	 F(0.00000000E+00), -F(1.46525263E-02),
+	 F(8.02941163E-03),  F(1.53184106E-02),
+	 F(1.04584443E-02),  F(1.62208471E-02),
+	 F(1.27472335E-02),  F(1.59045603E-02),
+	 F(1.29371806E-02),  F(6.79989431E-02),
+	 F(8.85757540E-03),  F(5.31873032E-02),
+	 F(2.92408442E-03),  F(3.90751381E-02),
+	-F(4.91578024E-03),  F(2.61098752E-02),
+	 F(0.00000000E+00), -F(1.23264548E-01),
+	 F(8.29847578E-02),  F(1.45389847E-01),
+	 F(9.75753918E-02),  F(1.40753505E-01),
+	 F(1.11196689E-01),  F(1.33264415E-01),
+	 F(1.46955068E-01), -F(6.79989431E-02),
+	 F(1.45389847E-01), -F(8.29847578E-02),
+	 F(1.40753505E-01), -F(9.75753918E-02),
+	 F(1.33264415E-01), -F(1.11196689E-01),
+	 F(0.00000000E+00),  F(1.46404076E-02),
+	-F(5.31873032E-02),  F(8.85757540E-03),
+	-F(3.90751381E-02),  F(2.92408442E-03),
+	-F(2.61098752E-02), -F(4.91578024E-03),
+	 F(1.29371806E-02), -F(5.65949473E-03),
+	 F(1.53184106E-02), -F(8.02941163E-03),
+	 F(1.62208471E-02), -F(1.04584443E-02),
+	 F(1.59045603E-02), -F(1.27472335E-02),
+	 F(0.00000000E+00), -F(9.02154502E-04),
+	-F(3.49717454E-03),  F(2.10371989E-03),
+	-F(1.64973098E-03),  F(1.99454554E-03),
+	-F(1.78805361E-04),  F(1.61656283E-03),
+	 F(2.01182542E-03),  F(0.00000000E+00),
+	 F(1.78371725E-03), -F(1.56575398E-04),
+	 F(1.47640169E-03), -F(3.43256425E-04),
+	 F(1.13992507E-03), -F(5.54620202E-04),
+#undef F
+#define F(x) F_COS8(x)
+	-F(1.0000000000),  F(0.8314696123),
+	-F(1.0000000000), -F(0.1950903220),
+	-F(1.0000000000), -F(0.9807852804),
+	-F(1.0000000000), -F(0.5555702330),
+	-F(1.0000000000),  F(0.5555702330),
+	-F(1.0000000000),  F(0.9807852804),
+	-F(1.0000000000),  F(0.1950903220),
+	-F(1.0000000000), -F(0.8314696123),
+	 F(0.9238795325),  F(0.9807852804),
+	 F(0.3826834324),  F(0.8314696123),
+	-F(0.3826834324),  F(0.5555702330),
+	-F(0.9238795325),  F(0.1950903220),
+	-F(0.9238795325), -F(0.1950903220),
+	-F(0.3826834324), -F(0.5555702330),
+	 F(0.3826834324), -F(0.8314696123),
+	 F(0.9238795325), -F(0.9807852804),
+	 F(0.7071067812),  F(0.5555702330),
+	-F(0.7071067812), -F(0.9807852804),
+	-F(0.7071067812),  F(0.1950903220),
+	 F(0.7071067812),  F(0.8314696123),
+	 F(0.7071067812), -F(0.8314696123),
+	-F(0.7071067812), -F(0.1950903220),
+	-F(0.7071067812),  F(0.9807852804),
+	 F(0.7071067812), -F(0.5555702330),
+	 F(0.3826834324),  F(0.1950903220),
+	-F(0.9238795325), -F(0.5555702330),
+	 F(0.9238795325),  F(0.8314696123),
+	-F(0.3826834324), -F(0.9807852804),
+	-F(0.3826834324),  F(0.9807852804),
+	 F(0.9238795325), -F(0.8314696123),
+	-F(0.9238795325),  F(0.5555702330),
+	 F(0.3826834324), -F(0.1950903220),
+#undef F
+};
-- 
1.5.6.5


* Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter
  2008-12-31 16:03 [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter Siarhei Siamashka
@ 2008-12-31 20:55 ` Luiz Augusto von Dentz
  2009-01-02 16:33   ` Siarhei Siamashka
  2009-01-06  2:50   ` Marcel Holtmann
  2009-01-01  8:58 ` Marcel Holtmann
  1 sibling, 2 replies; 20+ messages in thread
From: Luiz Augusto von Dentz @ 2008-12-31 20:55 UTC
  To: Siarhei Siamashka; +Cc: linux-bluetooth

I wonder why we don't use liboil (http://liboil.freedesktop.org/wiki/),
since we can't keep implementing, or don't want to keep implementing,
optimized code for every instruction set extension around. Liboil detects
at runtime which implementation is faster, and many other codec
implementations depend on it. It would actually make a lot of sense for
GStreamer and PulseAudio, which already use liboil. I know that means adding
another dependency to BlueZ, or perhaps it is time to make libsbc a real
library?

-- 
Luiz Augusto von Dentz
Computer Engineer

* Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter
  2008-12-31 16:03 [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter Siarhei Siamashka
  2008-12-31 20:55 ` Luiz Augusto von Dentz
@ 2009-01-01  8:58 ` Marcel Holtmann
  2009-01-02 16:07   ` Siarhei Siamashka
  1 sibling, 1 reply; 20+ messages in thread
From: Marcel Holtmann @ 2009-01-01  8:58 UTC
  To: Siarhei Siamashka; +Cc: linux-bluetooth

Hi Siarhei,

> This is a preliminary preview of SIMD optimizations for the SBC encoder
> analysis filter.
> 
> It already contains an MMX optimization for the 4 subbands case (yes, this
> insane number of extra lines of code finally starts to pay off) ;)
> 
> Important notice: in order to test the MMX optimizations, you need to pass
> the extra '-mmmx' command line option to gcc. Runtime MMX autodetection can
> easily be added later. Also don't forget to pass the -s4 option to sbcenc,
> because the 8 subbands case is not accelerated yet. By the way, SSE2 registers
> are twice as wide as MMX ones, so SSE2 should be a lot faster. MMX, however,
> is supported on virtually every x86 CPU in use nowadays and can be considered
> the "lowest common denominator".
> 
> My quick benchmark showed that performance improves by about 10% overall
> (and by about a factor of two for the analysis filter function alone)
> compared with the bluez-4.23 release, which had the old buggy code. The
> improvement is much more noticeable over release 4.25, which contains the
> new, fixed and mostly unoptimized filter.
> 
> So now the performance is better than ever. I guess all platforms should
> use SIMD optimizations nowadays, so they should see performance improvements
> too. Those 'anamatrix' style optimizations in the older code feel so much
> like the previous century ;)
> 
> I'm going to focus primarily on NEON and maybe ARMv6 SIMD optimizations;
> these will be submitted a bit later. Also, as I have written before, the
> other parts of the code are quite inefficient too and can be optimized. There
> are still lots of things to improve.
> 
> 
> But right now I would like to hear some opinions about the following things
> regarding the attached patch:
> 
> The first question is about the use of an extra source file for the SIMD
> optimizations and the introduction of the
> 'sbc_encoder_init_simd_optimized_analyze' function into the global namespace.
> The rationale is to stop adding changes to 'sbc.c' (otherwise it will become
> bloated pretty soon as optimizations for various platforms are added). If
> anyone has a better idea, I'm very interested in hearing it.
> 
> And if the addition of a new source file gets approved, what text should go
> into the copyright header?
> 
> Now we have two "reference" C implementations of the analysis filter. Is it
> OK to keep both? Or should only the SIMD-friendly one remain in the end?

I am fine with keeping both, but if one is just not useful, we are going
to remove it. Also two separate files are fine for me. Personally I
prefer a runtime selection since compile time options are always painful
to test before making the release.

For the copyright header it is pretty simple. We copy the current header
and then later on I will add the appropriate Nokia copyright to it. So
don't worry about that part, I take care of that.

Regards

Marcel


* Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter
  2009-01-01  8:58 ` Marcel Holtmann
@ 2009-01-02 16:07   ` Siarhei Siamashka
  2009-01-02 16:27     ` Brad Midgley
                       ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Siarhei Siamashka @ 2009-01-02 16:07 UTC (permalink / raw)
  To: ext Marcel Holtmann; +Cc: linux-bluetooth

[-- Attachment #1: Type: text/plain, Size: 4172 bytes --]

On Thursday 01 January 2009 10:58:03 ext Marcel Holtmann wrote:
> Hi Siarhei,
>
> > This is a preliminary preview of SIMD optimizations for SBC encoder
> > analysis filter.
> >
> > It already contains MMX optimization for 4 subbands case (yes, all this
> > insane amount of extra lines of code finally starts to pay off) ;)
> >
> > Important notice: in order to test MMX optimizations, you need to have
> > extra '-mmmx' command line option passed to gcc. Runtime MMX
> > autodetection can be easily added later. Also don't forget to pass -s4
> > option to sbcenc because 8 subbands case is still not accelerated. By the
> > way, SSE2 is twice wider than MMX and should be a lot faster. Though MMX
> > is supported on virtually every x86 cpu that is in use nowadays and can
> > be considered "lowest common denominator".
> >
> > My quick benchmark showed that the performance gets improved about ~10%
> > overall (and about twice better for the analysis filter function alone)
> > when compared with bluez-4.23 release which had the old buggy code.
> > Improvement is much more noticeable over the release 4.25 which contains
> > a new fixed and mostly nonoptimized filter.
> >
> > So now the performance is better than ever. And I guess, all the
> > platforms should use SIMD optimizations nowadays, so they should gain
> > performance improvements too. Those 'anamatrix' style optimizations in
> > older code feel so much like the previous century ;)
> >
> > I'm going to primarily focus on NEON and maybe ARMv6 SIMD optimizations,
> > these will be submitted a bit later. Also, as I have already written
> > before, the other parts of code are quite inefficient too and can be
> > optimized. There are still lots of things to improve.
> >
> >
> > But right now I would like to hear some opinions about the following
> > things regarding the attached patch:
> >
> > The first question is about the use of extra source file for SIMD
> > optimizations and introduction of
> > 'sbc_encoder_init_simd_optimized_analyze' function to the global
> > namespace. The rationale for that is the intention to stop adding changes
> > to 'sbc.c' (otherwise it will become bloated pretty soon with the
> > addition of multiple optimizations for various platforms). If anyone has
> > a better idea, I'm very much interested to hear it.
> >
> > And if the addition of a new source file gets approved, I wonder about
> > what text should go to the copyright header?
> >
> > Now we have two "reference" C implementations of analysis filter. Is it
> > OK to keep both? Or only SIMD-friendly one should remain in the end?
>
> I am fine with keeping both, but if one is just not useful, we are going
> to remove it.

The only problem with SIMD-friendly code is that it uses two tables instead of
one (that's a sacrifice for the nice and symmetric code layout which fits SIMD
instructions of modern processors quite well). It may be somewhat less
optimal for the legacy processors without SIMD capabilities.
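To illustrate the kind of reordering meant here (a toy sketch only, not code
from the attached patch): MMX 'pmaddwd' multiplies four 16-bit lanes and adds
the products in adjacent pairs, so samples and constants only need to be
permuted in the same way for the sums to stay unchanged. The names
pmaddwd_model, dot_plain and dot_reordered and the permutation are made up for
this example, and it does not model why separate "even" and "odd" tables are
needed.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 64-bit MMX pmaddwd: four int16 lanes in,
   two int32 sums of adjacent products out. */
static void pmaddwd_model(const int16_t *a, const int16_t *b, int32_t *out)
{
	out[0] = (int32_t) a[0] * b[0] + (int32_t) a[1] * b[1];
	out[1] = (int32_t) a[2] * b[2] + (int32_t) a[3] * b[3];
}

/* Straightforward dot product over n samples. */
static int32_t dot_plain(const int16_t *x, const int16_t *c, int n)
{
	int32_t acc = 0;
	int i;

	for (i = 0; i < n; i++)
		acc += (int32_t) x[i] * c[i];
	return acc;
}

/* Same dot product, computed after samples and constants were both
   reordered by the same (hypothetical) permutation and then consumed
   four lanes at a time through the pmaddwd model. */
static int32_t dot_reordered(const int16_t *x, const int16_t *c,
			     const int *perm, int n)
{
	int16_t xp[16], cp[16];
	int32_t pair[2], acc = 0;
	int i;

	assert(n > 0 && n <= 16 && n % 4 == 0);

	for (i = 0; i < n; i++) {
		xp[i] = x[perm[i]];
		cp[i] = c[perm[i]];
	}
	for (i = 0; i < n; i += 4) {
		pmaddwd_model(xp + i, cp + i, pair);
		acc += pair[0] + pair[1];
	}
	return acc;
}
```

Both functions return the same value for any permutation, which is why the
constants can be stored in whatever order suits the SIMD instructions as long
as the input samples are shuffled to match.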

I wonder what CPU architectures are the most important for bluez?

> Also two separate files are fine for me. Personally I prefer a runtime
> selection since compile time options are always painful 
> to test before making the release.

The attached patch contains what I would consider to be a final variant. MMX
support is now complete: it works on both x86 and amd64, has runtime
autodetection of MMX availability, and supports both the 4 and 8 subbands
cases. I also ensured that only original MMX instructions are used (and no SSE
or other later additions), so the code should work fine even on the old
Pentium1 MMX. The new MMX optimized functions produce bit-identical results
when compared with the bluez-4.25 release.
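For readers who only want the general shape of runtime selection, here is a
minimal sketch. It is not the code from the patch: the patch does its own
CPUID check in inline assembly, while this sketch assumes a compiler that
provides the GCC builtins __builtin_cpu_init and __builtin_cpu_supports. The
names analyze_fn, analyze_c, analyze_fast, have_mmx and init_analyze are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef void (*analyze_fn)(int16_t *pcm, int16_t *x,
			   int32_t *out, int out_stride);

/* Hypothetical stand-ins for the plain C and the MMX filter variants;
   both just copy one sample so that the sketch stays runnable anywhere. */
static void analyze_c(int16_t *pcm, int16_t *x, int32_t *out, int out_stride)
{
	(void) x;
	(void) out_stride;
	out[0] = pcm[0];
}

static void analyze_fast(int16_t *pcm, int16_t *x, int32_t *out,
			 int out_stride)
{
	(void) x;
	(void) out_stride;
	out[0] = pcm[0];
}

/* Returns nonzero if the CPU reports MMX support. On non-x86 builds
   (or compilers without the builtins) the answer is simply "no". */
static int have_mmx(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
	__builtin_cpu_init();
	return __builtin_cpu_supports("mmx") != 0;
#else
	return 0;
#endif
}

/* Start from the portable default and override it only when the
   faster variant is known to be usable. */
static void init_analyze(analyze_fn *fn)
{
	assert(fn);
	*fn = analyze_c;
	if (have_mmx())
		*fn = analyze_fast;
}
```

The caller keeps a function pointer in its encoder state and always calls
through it, so nothing else in the encoder needs to know which variant was
picked.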

With this patch applied, the new filtering functions are noticeably faster
than the old ones on x86 (so now they are both faster and have better
quality). Assembly optimizations for the other platforms can be easily
added too.
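When optimizations for another platform are added, a bit-identical check
against the C code could look roughly like the sketch below. This is only an
assumed test harness, not something included in the patch: filter_fn,
outputs_match and the three toy filters are hypothetical, and the toy filters
take 16 samples and produce 4 sums instead of running the real analysis
filter.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef void (*filter_fn)(const int16_t *in, int32_t *out);

/* Toy reference: 4 sums over 16 samples, walking forward. */
static void toy_ref(const int16_t *in, int32_t *out)
{
	int i, k;

	for (k = 0; k < 4; k++) {
		out[k] = 0;
		for (i = 0; i < 4; i++)
			out[k] += in[4 * i + k];
	}
}

/* Toy "optimized" variant: same sums, different evaluation order. */
static void toy_opt(const int16_t *in, int32_t *out)
{
	int i, k;

	for (k = 3; k >= 0; k--) {
		out[k] = 0;
		for (i = 3; i >= 0; i--)
			out[k] += in[4 * i + k];
	}
}

/* Deliberately wrong variant, to show that the check can fail. */
static void toy_off_by_one(const int16_t *in, int32_t *out)
{
	toy_ref(in, out);
	out[0] += 1;
}

/* Feed both functions the same pseudo-random samples and compare the
   raw output bytes. Returns 1 if every round matched, 0 otherwise. */
static int outputs_match(filter_fn ref, filter_fn opt,
			 int in_len, int out_len, int rounds)
{
	int16_t in[64];
	int32_t a[8], b[8];
	uint32_t seed = 1;
	int i, r;

	assert(in_len > 0 && in_len <= 64 && out_len > 0 && out_len <= 8);

	for (r = 0; r < rounds; r++) {
		for (i = 0; i < in_len; i++) {
			/* simple LCG, mapped to the full int16 range */
			seed = seed * 1664525u + 1013904223u;
			in[i] = (int16_t) ((int32_t) (seed >> 16) - 32768);
		}
		memset(a, 0, sizeof(a));
		memset(b, 0, sizeof(b));
		ref(in, a);
		opt(in, b);
		if (memcmp(a, b, (size_t) out_len * sizeof(a[0])) != 0)
			return 0;
	}
	return 1;
}
```

Comparing with memcmp rather than with a tolerance is the point here: any
rounding difference between two implementations shows up as a mismatch.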

> For the copyright header it is pretty simple. We copy the current header
> and then later on I will add the appropriate Nokia copyright to it. So
> don't worry about that part, I take care of that.

OK, thanks


Best regards,
Siarhei Siamashka

[-- Attachment #2: 0001-SIMD-optimizations-for-SBC-encoder-analysis-filter.patch --]
[-- Type: text/x-diff, Size: 33162 bytes --]

From 42543fb826b4f86d878a997c0adb0b428b459ffd Mon Sep 17 00:00:00 2001
From: Siarhei Siamashka <siarhei.siamashka@nokia.com>
Date: Wed, 31 Dec 2008 16:52:08 +0200
Subject: [PATCH] SIMD optimizations for SBC encoder analysis filter

Added SIMD-friendly C implementation of SBC analysis filter (the
structure of code had to be changed a bit and constants in the
tables reordered). This code can be used as a reference for
developing platform specific SIMD optimizations. MMX optimizations
for x86/amd64 processors are included.
---
 sbc/Makefile.am   |    2 +-
 sbc/sbc.c         |   17 ++-
 sbc/sbc.h         |    6 +
 sbc/sbc_analyze.c |  617 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 sbc/sbc_tables.h  |  256 ++++++++++++++++++++++-
 5 files changed, 892 insertions(+), 6 deletions(-)
 create mode 100644 sbc/sbc_analyze.c

diff --git a/sbc/Makefile.am b/sbc/Makefile.am
index c42f162..d0d48ad 100644
--- a/sbc/Makefile.am
+++ b/sbc/Makefile.am
@@ -8,7 +8,7 @@ endif
 if SBC
 noinst_LTLIBRARIES = libsbc.la
 
-libsbc_la_SOURCES = sbc.h sbc.c sbc_math.h sbc_tables.h
+libsbc_la_SOURCES = sbc.h sbc.c sbc_analyze.c sbc_math.h sbc_tables.h
 
 libsbc_la_CFLAGS = -finline-functions -funswitch-loops -fgcse-after-reload
 
diff --git a/sbc/sbc.c b/sbc/sbc.c
index b349090..0b64b4c 100644
--- a/sbc/sbc.c
+++ b/sbc/sbc.c
@@ -94,7 +94,8 @@ struct sbc_decoder_state {
 struct sbc_encoder_state {
 	int subbands;
 	int position[2];
-	int16_t X[2][256];
+	int16_t buffer[2][256 + 8];
+	int16_t *X[2];
 	void (*sbc_analyze_4b_4s)(int16_t *pcm, int16_t *x,
 				  int32_t *out, int out_stride);
 	void (*sbc_analyze_4b_8s)(int16_t *pcm, int16_t *x,
@@ -1053,9 +1054,23 @@ static void sbc_encoder_init(struct sbc_encoder_state *state,
 	state->subbands = frame->subbands;
 	state->position[0] = state->position[1] = 12 * frame->subbands;
 
+	/* Initialize X pointers (ensure 16 byte alignment) */
+	state->X[0] = state->buffer[0];
+	state->X[1] = state->buffer[1];
+	while ((uintptr_t) state->X[0] & 0xF)
+		state->X[0]++;
+	while ((uintptr_t) state->X[1] & 0xF)
+		state->X[1]++;
+
 	/* Default implementation for analyze function */
 	state->sbc_analyze_4b_4s = sbc_analyze_4b_4s;
 	state->sbc_analyze_4b_8s = sbc_analyze_4b_8s;
+
+	/* Try to override the default implementation with faster SIMD
+	   optimized functions if possible */
+	sbc_encoder_init_simd_optimized_analyze(
+		&state->sbc_analyze_4b_4s,
+		&state->sbc_analyze_4b_8s);
 }
 
 struct sbc_priv {
diff --git a/sbc/sbc.h b/sbc/sbc.h
index 2838b1f..5beff88 100644
--- a/sbc/sbc.h
+++ b/sbc/sbc.h
@@ -90,6 +90,12 @@ int sbc_get_frame_duration(sbc_t *sbc);
 int sbc_get_codesize(sbc_t *sbc);
 void sbc_finish(sbc_t *sbc);
 
+void sbc_encoder_init_simd_optimized_analyze(
+	void (**sbc_analyze_4b_4s)(int16_t *pcm, int16_t *x,
+				   int32_t *out, int out_stride),
+	void (**sbc_analyze_4b_8s)(int16_t *pcm, int16_t *x,
+				   int32_t *out, int out_stride));
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/sbc/sbc_analyze.c b/sbc/sbc_analyze.c
new file mode 100644
index 0000000..dbd9d65
--- /dev/null
+++ b/sbc/sbc_analyze.c
@@ -0,0 +1,617 @@
+#include <stdint.h>
+#include <limits.h>
+#include "sbc.h"
+#include "sbc_math.h"
+#include "sbc_tables.h"
+
+/*
+ * A reference C code of analysis filter with SIMD-friendly tables
+ * reordering and code layout. This code can be used to develop platform
+ * specific SIMD optimizations. Also it may be used as some kind of test
+ * for compiler autovectorization capabilities (who knows, if the compiler
+ * is very good at this stuff, hand optimized assembly may be not strictly
+ * needed for some platform).
+ */
+
+static inline void _sbc_analyze_four_simd(const int16_t *in, int32_t *out,
+					  const FIXED_T *const_table)
+{
+	FIXED_A t1[4];
+	FIXED_T t2[4];
+	int hop = 0;
+
+	/* rounding coefficient */
+	t1[0] = t1[1] = t1[2] = t1[3] =
+		(FIXED_A) 1 << (SBC_PROTO_FIXED4_SCALE - 1);
+
+	/* low pass polyphase filter */
+	for (hop = 0; hop < 40; hop += 8) {
+		t1[0] += (FIXED_A) in[hop] * const_table[hop];
+		t1[0] += (FIXED_A) in[hop + 1] * const_table[hop + 1];
+		t1[1] += (FIXED_A) in[hop + 2] * const_table[hop + 2];
+		t1[1] += (FIXED_A) in[hop + 3] * const_table[hop + 3];
+		t1[2] += (FIXED_A) in[hop + 4] * const_table[hop + 4];
+		t1[2] += (FIXED_A) in[hop + 5] * const_table[hop + 5];
+		t1[3] += (FIXED_A) in[hop + 6] * const_table[hop + 6];
+		t1[3] += (FIXED_A) in[hop + 7] * const_table[hop + 7];
+	}
+
+	/* scaling */
+	t2[0] = t1[0] >> SBC_PROTO_FIXED4_SCALE;
+	t2[1] = t1[1] >> SBC_PROTO_FIXED4_SCALE;
+	t2[2] = t1[2] >> SBC_PROTO_FIXED4_SCALE;
+	t2[3] = t1[3] >> SBC_PROTO_FIXED4_SCALE;
+
+	/* do the cos transform */
+	t1[0]  = (FIXED_A) t2[0] * const_table[40 + 0];
+	t1[0] += (FIXED_A) t2[1] * const_table[40 + 1];
+	t1[1]  = (FIXED_A) t2[0] * const_table[40 + 2];
+	t1[1] += (FIXED_A) t2[1] * const_table[40 + 3];
+	t1[2]  = (FIXED_A) t2[0] * const_table[40 + 4];
+	t1[2] += (FIXED_A) t2[1] * const_table[40 + 5];
+	t1[3]  = (FIXED_A) t2[0] * const_table[40 + 6];
+	t1[3] += (FIXED_A) t2[1] * const_table[40 + 7];
+
+	t1[0] += (FIXED_A) t2[2] * const_table[40 + 8];
+	t1[0] += (FIXED_A) t2[3] * const_table[40 + 9];
+	t1[1] += (FIXED_A) t2[2] * const_table[40 + 10];
+	t1[1] += (FIXED_A) t2[3] * const_table[40 + 11];
+	t1[2] += (FIXED_A) t2[2] * const_table[40 + 12];
+	t1[2] += (FIXED_A) t2[3] * const_table[40 + 13];
+	t1[3] += (FIXED_A) t2[2] * const_table[40 + 14];
+	t1[3] += (FIXED_A) t2[3] * const_table[40 + 15];
+
+	out[0] = t1[0] >>
+		(SBC_COS_TABLE_FIXED4_SCALE - SCALE_OUT_BITS);
+	out[1] = t1[1] >>
+		(SBC_COS_TABLE_FIXED4_SCALE - SCALE_OUT_BITS);
+	out[2] = t1[2] >>
+		(SBC_COS_TABLE_FIXED4_SCALE - SCALE_OUT_BITS);
+	out[3] = t1[3] >>
+		(SBC_COS_TABLE_FIXED4_SCALE - SCALE_OUT_BITS);
+}
+
+static inline void _sbc_analyze_eight_simd(const int16_t *in, int32_t *out,
+					   const FIXED_T *consts)
+{
+	FIXED_A t1[8];
+	FIXED_T t2[8];
+	int i, hop;
+
+	/* rounding coefficient */
+	t1[0] = t1[1] = t1[2] = t1[3] = t1[4] = t1[5] = t1[6] = t1[7] =
+		(FIXED_A) 1 << (SBC_PROTO_FIXED8_SCALE-1);
+
+	/* low pass polyphase filter */
+	for (hop = 0; hop < 80; hop += 16) {
+		t1[0] += (FIXED_A) in[hop] * consts[hop];
+		t1[0] += (FIXED_A) in[hop + 1] * consts[hop + 1];
+		t1[1] += (FIXED_A) in[hop + 2] * consts[hop + 2];
+		t1[1] += (FIXED_A) in[hop + 3] * consts[hop + 3];
+		t1[2] += (FIXED_A) in[hop + 4] * consts[hop + 4];
+		t1[2] += (FIXED_A) in[hop + 5] * consts[hop + 5];
+		t1[3] += (FIXED_A) in[hop + 6] * consts[hop + 6];
+		t1[3] += (FIXED_A) in[hop + 7] * consts[hop + 7];
+		t1[4] += (FIXED_A) in[hop + 8] * consts[hop + 8];
+		t1[4] += (FIXED_A) in[hop + 9] * consts[hop + 9];
+		t1[5] += (FIXED_A) in[hop + 10] * consts[hop + 10];
+		t1[5] += (FIXED_A) in[hop + 11] * consts[hop + 11];
+		t1[6] += (FIXED_A) in[hop + 12] * consts[hop + 12];
+		t1[6] += (FIXED_A) in[hop + 13] * consts[hop + 13];
+		t1[7] += (FIXED_A) in[hop + 14] * consts[hop + 14];
+		t1[7] += (FIXED_A) in[hop + 15] * consts[hop + 15];
+	}
+
+	/* scaling */
+	t2[0] = t1[0] >> SBC_PROTO_FIXED8_SCALE;
+	t2[1] = t1[1] >> SBC_PROTO_FIXED8_SCALE;
+	t2[2] = t1[2] >> SBC_PROTO_FIXED8_SCALE;
+	t2[3] = t1[3] >> SBC_PROTO_FIXED8_SCALE;
+	t2[4] = t1[4] >> SBC_PROTO_FIXED8_SCALE;
+	t2[5] = t1[5] >> SBC_PROTO_FIXED8_SCALE;
+	t2[6] = t1[6] >> SBC_PROTO_FIXED8_SCALE;
+	t2[7] = t1[7] >> SBC_PROTO_FIXED8_SCALE;
+
+
+	/* do the cos transform */
+	t1[0] = t1[1] = t1[2] = t1[3] = t1[4] = t1[5] = t1[6] = t1[7] = 0;
+
+	for (i = 0; i < 4; i++) {
+		t1[0] += (FIXED_A) t2[i * 2 + 0] * consts[80 + i * 16 + 0];
+		t1[0] += (FIXED_A) t2[i * 2 + 1] * consts[80 + i * 16 + 1];
+		t1[1] += (FIXED_A) t2[i * 2 + 0] * consts[80 + i * 16 + 2];
+		t1[1] += (FIXED_A) t2[i * 2 + 1] * consts[80 + i * 16 + 3];
+		t1[2] += (FIXED_A) t2[i * 2 + 0] * consts[80 + i * 16 + 4];
+		t1[2] += (FIXED_A) t2[i * 2 + 1] * consts[80 + i * 16 + 5];
+		t1[3] += (FIXED_A) t2[i * 2 + 0] * consts[80 + i * 16 + 6];
+		t1[3] += (FIXED_A) t2[i * 2 + 1] * consts[80 + i * 16 + 7];
+		t1[4] += (FIXED_A) t2[i * 2 + 0] * consts[80 + i * 16 + 8];
+		t1[4] += (FIXED_A) t2[i * 2 + 1] * consts[80 + i * 16 + 9];
+		t1[5] += (FIXED_A) t2[i * 2 + 0] * consts[80 + i * 16 + 10];
+		t1[5] += (FIXED_A) t2[i * 2 + 1] * consts[80 + i * 16 + 11];
+		t1[6] += (FIXED_A) t2[i * 2 + 0] * consts[80 + i * 16 + 12];
+		t1[6] += (FIXED_A) t2[i * 2 + 1] * consts[80 + i * 16 + 13];
+		t1[7] += (FIXED_A) t2[i * 2 + 0] * consts[80 + i * 16 + 14];
+		t1[7] += (FIXED_A) t2[i * 2 + 1] * consts[80 + i * 16 + 15];
+	}
+
+	for (i = 0; i < 8; i++)
+		out[i] = t1[i] >>
+			(SBC_COS_TABLE_FIXED8_SCALE - SCALE_OUT_BITS);
+}
+
+static inline void sbc_analyze_4b_4s_simd(int16_t *pcm, int16_t *x,
+					  int32_t *out, int out_stride)
+{
+	/* Fetch audio samples and do input data reordering for SIMD */
+	x[64] = x[0]  = pcm[8 + 7];
+	x[65] = x[1]  = pcm[8 + 3];
+	x[66] = x[2]  = pcm[8 + 6];
+	x[67] = x[3]  = pcm[8 + 4];
+	x[68] = x[4]  = pcm[8 + 0];
+	x[69] = x[5]  = pcm[8 + 2];
+	x[70] = x[6]  = pcm[8 + 1];
+	x[71] = x[7]  = pcm[8 + 5];
+
+	x[72] = x[8]  = pcm[0 + 7];
+	x[73] = x[9]  = pcm[0 + 3];
+	x[74] = x[10] = pcm[0 + 6];
+	x[75] = x[11] = pcm[0 + 4];
+	x[76] = x[12] = pcm[0 + 0];
+	x[77] = x[13] = pcm[0 + 2];
+	x[78] = x[14] = pcm[0 + 1];
+	x[79] = x[15] = pcm[0 + 5];
+
+	/* Analyze blocks */
+	_sbc_analyze_four_simd(x + 12, out, analysis_consts_fixed4_simd_odd);
+	out += out_stride;
+	_sbc_analyze_four_simd(x + 8, out, analysis_consts_fixed4_simd_even);
+	out += out_stride;
+	_sbc_analyze_four_simd(x + 4, out, analysis_consts_fixed4_simd_odd);
+	out += out_stride;
+	_sbc_analyze_four_simd(x + 0, out, analysis_consts_fixed4_simd_even);
+}
+
+static inline void sbc_analyze_4b_8s_simd(int16_t *pcm, int16_t *x,
+					  int32_t *out, int out_stride)
+{
+	/* Fetch audio samples and do input data reordering for SIMD */
+	x[128] = x[0]  = pcm[16 + 15];
+	x[129] = x[1]  = pcm[16 + 7];
+	x[130] = x[2]  = pcm[16 + 14];
+	x[131] = x[3]  = pcm[16 + 8];
+	x[132] = x[4]  = pcm[16 + 13];
+	x[133] = x[5]  = pcm[16 + 9];
+	x[134] = x[6]  = pcm[16 + 12];
+	x[135] = x[7]  = pcm[16 + 10];
+	x[136] = x[8]  = pcm[16 + 11];
+	x[137] = x[9]  = pcm[16 + 3];
+	x[138] = x[10] = pcm[16 + 6];
+	x[139] = x[11] = pcm[16 + 0];
+	x[140] = x[12] = pcm[16 + 5];
+	x[141] = x[13] = pcm[16 + 1];
+	x[142] = x[14] = pcm[16 + 4];
+	x[143] = x[15] = pcm[16 + 2];
+
+	x[144] = x[16] = pcm[0 + 15];
+	x[145] = x[17] = pcm[0 + 7];
+	x[146] = x[18] = pcm[0 + 14];
+	x[147] = x[19] = pcm[0 + 8];
+	x[148] = x[20] = pcm[0 + 13];
+	x[149] = x[21] = pcm[0 + 9];
+	x[150] = x[22] = pcm[0 + 12];
+	x[151] = x[23] = pcm[0 + 10];
+	x[152] = x[24] = pcm[0 + 11];
+	x[153] = x[25] = pcm[0 + 3];
+	x[154] = x[26] = pcm[0 + 6];
+	x[155] = x[27] = pcm[0 + 0];
+	x[156] = x[28] = pcm[0 + 5];
+	x[157] = x[29] = pcm[0 + 1];
+	x[158] = x[30] = pcm[0 + 4];
+	x[159] = x[31] = pcm[0 + 2];
+
+	/* Analyze blocks */
+	_sbc_analyze_eight_simd(x + 24, out, analysis_consts_fixed8_simd_odd);
+	out += out_stride;
+	_sbc_analyze_eight_simd(x + 16, out, analysis_consts_fixed8_simd_even);
+	out += out_stride;
+	_sbc_analyze_eight_simd(x + 8, out, analysis_consts_fixed8_simd_odd);
+	out += out_stride;
+	_sbc_analyze_eight_simd(x + 0, out, analysis_consts_fixed8_simd_even);
+}
+
+/*
+ * MMX optimizations
+ */
+
+#if defined(__GNUC__) && (defined(__i386__) || defined(__amd64__))
+#ifndef SBC_HIGH_PRECISION
+#define SBC_BUILD_WITH_MMX_SUPPORT
+#endif
+#endif
+
+#ifdef SBC_BUILD_WITH_MMX_SUPPORT
+
+static inline void _sbc_analyze_four_mmx(const int16_t *in, int32_t *out,
+					 const FIXED_T *consts)
+{
+	static const SIMD_ALIGNED int32_t round_c[2] = {
+		1 << (SBC_PROTO_FIXED4_SCALE - 1),
+		1 << (SBC_PROTO_FIXED4_SCALE - 1),
+	};
+	asm volatile (
+		"movq        (%0), %%mm0\n"
+		"movq       8(%0), %%mm1\n"
+		"pmaddwd     (%1), %%mm0\n"
+		"pmaddwd    8(%1), %%mm1\n"
+		"paddd       (%2), %%mm0\n"
+		"paddd       (%2), %%mm1\n"
+		"\n"
+		"movq      16(%0), %%mm2\n"
+		"movq      24(%0), %%mm3\n"
+		"pmaddwd   16(%1), %%mm2\n"
+		"pmaddwd   24(%1), %%mm3\n"
+		"paddd      %%mm2, %%mm0\n"
+		"paddd      %%mm3, %%mm1\n"
+		"\n"
+		"movq      32(%0), %%mm2\n"
+		"movq      40(%0), %%mm3\n"
+		"pmaddwd   32(%1), %%mm2\n"
+		"pmaddwd   40(%1), %%mm3\n"
+		"paddd      %%mm2, %%mm0\n"
+		"paddd      %%mm3, %%mm1\n"
+		"\n"
+		"movq      48(%0), %%mm2\n"
+		"movq      56(%0), %%mm3\n"
+		"pmaddwd   48(%1), %%mm2\n"
+		"pmaddwd   56(%1), %%mm3\n"
+		"paddd      %%mm2, %%mm0\n"
+		"paddd      %%mm3, %%mm1\n"
+		"\n"
+		"movq      64(%0), %%mm2\n"
+		"movq      72(%0), %%mm3\n"
+		"pmaddwd   64(%1), %%mm2\n"
+		"pmaddwd   72(%1), %%mm3\n"
+		"paddd      %%mm2, %%mm0\n"
+		"paddd      %%mm3, %%mm1\n"
+		"\n"
+		"psrad         %4, %%mm0\n"
+		"psrad         %4, %%mm1\n"
+		"packssdw   %%mm0, %%mm0\n"
+		"packssdw   %%mm1, %%mm1\n"
+		"\n"
+		"movq       %%mm0, %%mm2\n"
+		"pmaddwd   80(%1), %%mm0\n"
+		"pmaddwd   88(%1), %%mm2\n"
+		"\n"
+		"movq       %%mm1, %%mm3\n"
+		"pmaddwd   96(%1), %%mm1\n"
+		"pmaddwd  104(%1), %%mm3\n"
+		"paddd      %%mm1, %%mm0\n"
+		"paddd      %%mm3, %%mm2\n"
+		"\n"
+		"movq       %%mm0, (%3)\n"
+		"movq       %%mm2, 8(%3)\n"
+		:
+		: "r" (in), "r" (consts), "r" (&round_c), "r" (out),
+		  "i" (SBC_PROTO_FIXED4_SCALE)
+		: "memory");
+}
+
+static inline void _sbc_analyze_eight_mmx(const int16_t *in, int32_t *out,
+					  const FIXED_T *consts)
+{
+	static const SIMD_ALIGNED int32_t round_c[2] = {
+		1 << (SBC_PROTO_FIXED8_SCALE - 1),
+		1 << (SBC_PROTO_FIXED8_SCALE - 1),
+	};
+	asm volatile (
+		"movq        (%0), %%mm0\n"
+		"movq       8(%0), %%mm1\n"
+		"movq      16(%0), %%mm2\n"
+		"movq      24(%0), %%mm3\n"
+		"pmaddwd     (%1), %%mm0\n"
+		"pmaddwd    8(%1), %%mm1\n"
+		"pmaddwd   16(%1), %%mm2\n"
+		"pmaddwd   24(%1), %%mm3\n"
+		"paddd       (%2), %%mm0\n"
+		"paddd       (%2), %%mm1\n"
+		"paddd       (%2), %%mm2\n"
+		"paddd       (%2), %%mm3\n"
+		"\n"
+		"movq      32(%0), %%mm4\n"
+		"movq      40(%0), %%mm5\n"
+		"movq      48(%0), %%mm6\n"
+		"movq      56(%0), %%mm7\n"
+		"pmaddwd   32(%1), %%mm4\n"
+		"pmaddwd   40(%1), %%mm5\n"
+		"pmaddwd   48(%1), %%mm6\n"
+		"pmaddwd   56(%1), %%mm7\n"
+		"paddd      %%mm4, %%mm0\n"
+		"paddd      %%mm5, %%mm1\n"
+		"paddd      %%mm6, %%mm2\n"
+		"paddd      %%mm7, %%mm3\n"
+		"\n"
+		"movq      64(%0), %%mm4\n"
+		"movq      72(%0), %%mm5\n"
+		"movq      80(%0), %%mm6\n"
+		"movq      88(%0), %%mm7\n"
+		"pmaddwd   64(%1), %%mm4\n"
+		"pmaddwd   72(%1), %%mm5\n"
+		"pmaddwd   80(%1), %%mm6\n"
+		"pmaddwd   88(%1), %%mm7\n"
+		"paddd      %%mm4, %%mm0\n"
+		"paddd      %%mm5, %%mm1\n"
+		"paddd      %%mm6, %%mm2\n"
+		"paddd      %%mm7, %%mm3\n"
+		"\n"
+		"movq      96(%0), %%mm4\n"
+		"movq     104(%0), %%mm5\n"
+		"movq     112(%0), %%mm6\n"
+		"movq     120(%0), %%mm7\n"
+		"pmaddwd   96(%1), %%mm4\n"
+		"pmaddwd  104(%1), %%mm5\n"
+		"pmaddwd  112(%1), %%mm6\n"
+		"pmaddwd  120(%1), %%mm7\n"
+		"paddd      %%mm4, %%mm0\n"
+		"paddd      %%mm5, %%mm1\n"
+		"paddd      %%mm6, %%mm2\n"
+		"paddd      %%mm7, %%mm3\n"
+		"\n"
+		"movq     128(%0), %%mm4\n"
+		"movq     136(%0), %%mm5\n"
+		"movq     144(%0), %%mm6\n"
+		"movq     152(%0), %%mm7\n"
+		"pmaddwd  128(%1), %%mm4\n"
+		"pmaddwd  136(%1), %%mm5\n"
+		"pmaddwd  144(%1), %%mm6\n"
+		"pmaddwd  152(%1), %%mm7\n"
+		"paddd      %%mm4, %%mm0\n"
+		"paddd      %%mm5, %%mm1\n"
+		"paddd      %%mm6, %%mm2\n"
+		"paddd      %%mm7, %%mm3\n"
+		"\n"
+		"psrad         %4, %%mm0\n"
+		"psrad         %4, %%mm1\n"
+		"psrad         %4, %%mm2\n"
+		"psrad         %4, %%mm3\n"
+		"\n"
+		"packssdw   %%mm0, %%mm0\n"
+		"packssdw   %%mm1, %%mm1\n"
+		"packssdw   %%mm2, %%mm2\n"
+		"packssdw   %%mm3, %%mm3\n"
+		"\n"
+		"movq       %%mm0, %%mm4\n"
+		"movq       %%mm0, %%mm5\n"
+		"pmaddwd  160(%1), %%mm4\n"
+		"pmaddwd  168(%1), %%mm5\n"
+		"\n"
+		"movq       %%mm1, %%mm6\n"
+		"movq       %%mm1, %%mm7\n"
+		"pmaddwd  192(%1), %%mm6\n"
+		"pmaddwd  200(%1), %%mm7\n"
+		"paddd      %%mm6, %%mm4\n"
+		"paddd      %%mm7, %%mm5\n"
+		"\n"
+		"movq       %%mm2, %%mm6\n"
+		"movq       %%mm2, %%mm7\n"
+		"pmaddwd  224(%1), %%mm6\n"
+		"pmaddwd  232(%1), %%mm7\n"
+		"paddd      %%mm6, %%mm4\n"
+		"paddd      %%mm7, %%mm5\n"
+		"\n"
+		"movq       %%mm3, %%mm6\n"
+		"movq       %%mm3, %%mm7\n"
+		"pmaddwd  256(%1), %%mm6\n"
+		"pmaddwd  264(%1), %%mm7\n"
+		"paddd      %%mm6, %%mm4\n"
+		"paddd      %%mm7, %%mm5\n"
+		"\n"
+		"movq       %%mm4, (%3)\n"
+		"movq       %%mm5, 8(%3)\n"
+		"\n"
+		"movq       %%mm0, %%mm5\n"
+		"pmaddwd  176(%1), %%mm0\n"
+		"pmaddwd  184(%1), %%mm5\n"
+		"\n"
+		"movq       %%mm1, %%mm7\n"
+		"pmaddwd  208(%1), %%mm1\n"
+		"pmaddwd  216(%1), %%mm7\n"
+		"paddd      %%mm1, %%mm0\n"
+		"paddd      %%mm7, %%mm5\n"
+		"\n"
+		"movq       %%mm2, %%mm7\n"
+		"pmaddwd  240(%1), %%mm2\n"
+		"pmaddwd  248(%1), %%mm7\n"
+		"paddd      %%mm2, %%mm0\n"
+		"paddd      %%mm7, %%mm5\n"
+		"\n"
+		"movq       %%mm3, %%mm7\n"
+		"pmaddwd  272(%1), %%mm3\n"
+		"pmaddwd  280(%1), %%mm7\n"
+		"paddd      %%mm3, %%mm0\n"
+		"paddd      %%mm7, %%mm5\n"
+		"\n"
+		"movq       %%mm0, 16(%3)\n"
+		"movq       %%mm5, 24(%3)\n"
+		:
+		: "r" (in), "r" (consts), "r" (&round_c), "r" (out),
+		  "i" (SBC_PROTO_FIXED8_SCALE)
+		: "memory");
+}
+
+static inline void sbc_analyze_4b_4s_mmx(int16_t *pcm, int16_t *x,
+					 int32_t *out, int out_stride)
+{
+	/* Fetch audio samples and do input data reordering for SIMD */
+	x[0]  = pcm[8 + 7];
+	x[1]  = pcm[8 + 3];
+	x[2]  = pcm[8 + 6];
+	x[3]  = pcm[8 + 4];
+	x[4]  = pcm[8 + 0];
+	x[5]  = pcm[8 + 2];
+	x[6]  = pcm[8 + 1];
+	x[7]  = pcm[8 + 5];
+
+	x[8]  = pcm[0 + 7];
+	x[9]  = pcm[0 + 3];
+	x[10] = pcm[0 + 6];
+	x[11] = pcm[0 + 4];
+	x[12] = pcm[0 + 0];
+	x[13] = pcm[0 + 2];
+	x[14] = pcm[0 + 1];
+	x[15] = pcm[0 + 5];
+
+	/* Analyze blocks */
+	_sbc_analyze_four_mmx(x + 12, out, analysis_consts_fixed4_simd_odd);
+	out += out_stride;
+	_sbc_analyze_four_mmx(x + 8, out, analysis_consts_fixed4_simd_even);
+	out += out_stride;
+	_sbc_analyze_four_mmx(x + 4, out, analysis_consts_fixed4_simd_odd);
+	out += out_stride;
+	_sbc_analyze_four_mmx(x + 0, out, analysis_consts_fixed4_simd_even);
+
+	/* Copy x[0 .. 15] to x[64 .. 79] using MMX */
+	asm volatile (
+		"movq        (%0), %%mm0\n"
+		"movq       8(%0), %%mm1\n"
+		"movq      16(%0), %%mm2\n"
+		"movq      24(%0), %%mm3\n"
+		"\n"
+		"movq       %%mm0, 128(%0)\n"
+		"movq       %%mm1, 136(%0)\n"
+		"movq       %%mm2, 144(%0)\n"
+		"movq       %%mm3, 152(%0)\n"
+		"\n"
+		"emms\n"
+		:
+		: "r" (x)
+		: "memory");
+}
+
+static inline void sbc_analyze_4b_8s_mmx(int16_t *pcm, int16_t *x,
+					 int32_t *out, int out_stride)
+{
+	/* Fetch audio samples and do input data reordering for SIMD */
+	x[0]  = pcm[16 + 15];
+	x[1]  = pcm[16 + 7];
+	x[2]  = pcm[16 + 14];
+	x[3]  = pcm[16 + 8];
+	x[4]  = pcm[16 + 13];
+	x[5]  = pcm[16 + 9];
+	x[6]  = pcm[16 + 12];
+	x[7]  = pcm[16 + 10];
+	x[8]  = pcm[16 + 11];
+	x[9]  = pcm[16 + 3];
+	x[10] = pcm[16 + 6];
+	x[11] = pcm[16 + 0];
+	x[12] = pcm[16 + 5];
+	x[13] = pcm[16 + 1];
+	x[14] = pcm[16 + 4];
+	x[15] = pcm[16 + 2];
+
+	x[16] = pcm[0 + 15];
+	x[17] = pcm[0 + 7];
+	x[18] = pcm[0 + 14];
+	x[19] = pcm[0 + 8];
+	x[20] = pcm[0 + 13];
+	x[21] = pcm[0 + 9];
+	x[22] = pcm[0 + 12];
+	x[23] = pcm[0 + 10];
+	x[24] = pcm[0 + 11];
+	x[25] = pcm[0 + 3];
+	x[26] = pcm[0 + 6];
+	x[27] = pcm[0 + 0];
+	x[28] = pcm[0 + 5];
+	x[29] = pcm[0 + 1];
+	x[30] = pcm[0 + 4];
+	x[31] = pcm[0 + 2];
+
+	/* Analyze blocks */
+	_sbc_analyze_eight_mmx(x + 24, out, analysis_consts_fixed8_simd_odd);
+	out += out_stride;
+	_sbc_analyze_eight_mmx(x + 16, out, analysis_consts_fixed8_simd_even);
+	out += out_stride;
+	_sbc_analyze_eight_mmx(x + 8, out, analysis_consts_fixed8_simd_odd);
+	out += out_stride;
+	_sbc_analyze_eight_mmx(x + 0, out, analysis_consts_fixed8_simd_even);
+
+	/* Copy x[0 .. 31] to x[128 .. 159] using MMX */
+	asm volatile (
+		"movq        (%0), %%mm0\n"
+		"movq       8(%0), %%mm1\n"
+		"movq      16(%0), %%mm2\n"
+		"movq      24(%0), %%mm3\n"
+		"movq      32(%0), %%mm4\n"
+		"movq      40(%0), %%mm5\n"
+		"movq      48(%0), %%mm6\n"
+		"movq      56(%0), %%mm7\n"
+		"\n"
+		"movq       %%mm0, 256(%0)\n"
+		"movq       %%mm1, 264(%0)\n"
+		"movq       %%mm2, 272(%0)\n"
+		"movq       %%mm3, 280(%0)\n"
+		"movq       %%mm4, 288(%0)\n"
+		"movq       %%mm5, 296(%0)\n"
+		"movq       %%mm6, 304(%0)\n"
+		"movq       %%mm7, 312(%0)\n"
+		"\n"
+		"emms\n"
+		:
+		: "r" (x)
+		: "memory");
+}
+
+static int check_mmx_support(void)
+{
+#ifdef __amd64__
+	return 1; /* We assume that all 64-bit processors have MMX support */
+#else
+	int cpuid_feature_information;
+	asm volatile (
+		/* According to Intel manual, CPUID instruction is supported
+		   if the value of ID bit (bit 21) in EFLAGS can be modified */
+		"pushf\n"
+		"movl     (%%esp),   %0\n"
+		"xorl     $0x200000, (%%esp)\n" /* try to modify ID bit */
+		"popf\n"
+		"pushf\n"
+		"xorl     (%%esp),   %0\n"      /* check if ID bit changed */
+		"jz       1f\n"
+		"push     %%eax\n"
+		"push     %%ebx\n"
+		"push     %%ecx\n"
+		"mov      $1,        %%eax\n"
+		"cpuid\n"
+		"pop      %%ecx\n"
+		"pop      %%ebx\n"
+		"pop      %%eax\n"
+		"1:\n"
+		"popf\n"
+		: "=d" (cpuid_feature_information)
+		:
+		: "cc");
+	return cpuid_feature_information & (1 << 23);
+#endif
+}
+
+#endif
+
+/*
+ * Detect CPU features and setup the best implementation of
+ * the SBC analysis filter
+ */
+
+void sbc_encoder_init_simd_optimized_analyze(
+	void (**sbc_analyze_4b_4s)(int16_t *pcm, int16_t *x,
+				   int32_t *out, int out_stride),
+	void (**sbc_analyze_4b_8s)(int16_t *pcm, int16_t *x,
+				   int32_t *out, int out_stride))
+{
+#ifdef SBC_BUILD_WITH_MMX_SUPPORT
+	if (check_mmx_support()) {
+		*sbc_analyze_4b_4s = sbc_analyze_4b_4s_mmx;
+		*sbc_analyze_4b_8s = sbc_analyze_4b_8s_mmx;
+	}
+#endif
+}
diff --git a/sbc/sbc_tables.h b/sbc/sbc_tables.h
index f1dfe6c..cd3ecfb 100644
--- a/sbc/sbc_tables.h
+++ b/sbc/sbc_tables.h
@@ -157,8 +157,9 @@ static const int32_t synmatrix8[16][8] = {
  */
 #define SBC_PROTO_FIXED4_SCALE \
 	((sizeof(FIXED_T) * CHAR_BIT - 1) - SBC_FIXED_EXTRA_BITS + 1)
-#define F(x) (FIXED_A) ((x * 2) * \
+#define F_PROTO4(x) (FIXED_A) ((x * 2) * \
 	((FIXED_A) 1 << (sizeof(FIXED_T) * CHAR_BIT - 1)) + 0.5)
+#define F(x) F_PROTO4(x)
 static const FIXED_T _sbc_proto_fixed4[40] = {
 	 F(0.00000000E+00),  F(5.36548976E-04),
 	-F(1.49188357E-03),  F(2.73370904E-03),
@@ -206,8 +207,9 @@ static const FIXED_T _sbc_proto_fixed4[40] = {
  */
 #define SBC_COS_TABLE_FIXED4_SCALE \
 	((sizeof(FIXED_T) * CHAR_BIT - 1) + SBC_FIXED_EXTRA_BITS)
-#define F(x) (FIXED_A) ((x) * \
+#define F_COS4(x) (FIXED_A) ((x) * \
 	((FIXED_A) 1 << (sizeof(FIXED_T) * CHAR_BIT - 1)) + 0.5)
+#define F(x) F_COS4(x)
 static const FIXED_T cos_table_fixed_4[32] = {
 	 F(0.7071067812),  F(0.9238795325), -F(1.0000000000),  F(0.9238795325),
 	 F(0.7071067812),  F(0.3826834324),  F(0.0000000000),  F(0.3826834324),
@@ -233,8 +235,9 @@ static const FIXED_T cos_table_fixed_4[32] = {
  */
 #define SBC_PROTO_FIXED8_SCALE \
 	((sizeof(FIXED_T) * CHAR_BIT - 1) - SBC_FIXED_EXTRA_BITS + 2)
-#define F(x) (FIXED_A) ((x * 4) * \
+#define F_PROTO8(x) (FIXED_A) ((x * 4) * \
 	((FIXED_A) 1 << (sizeof(FIXED_T) * CHAR_BIT - 1)) + 0.5)
+#define F(x) F_PROTO8(x)
 static const FIXED_T _sbc_proto_fixed8[80] = {
 	 F(0.00000000E+00),  F(1.56575398E-04),
 	 F(3.43256425E-04),  F(5.54620202E-04),
@@ -301,8 +304,9 @@ static const FIXED_T _sbc_proto_fixed8[80] = {
  */
 #define SBC_COS_TABLE_FIXED8_SCALE \
 	((sizeof(FIXED_T) * CHAR_BIT - 1) + SBC_FIXED_EXTRA_BITS)
-#define F(x) (FIXED_A) ((x) * \
+#define F_COS8(x) (FIXED_A) ((x) * \
 	((FIXED_A) 1 << (sizeof(FIXED_T) * CHAR_BIT - 1)) + 0.5)
+#define F(x) F_COS8(x)
 static const FIXED_T cos_table_fixed_8[128] = {
 	 F(0.7071067812),  F(0.8314696123),  F(0.9238795325),  F(0.9807852804),
 	-F(1.0000000000),  F(0.9807852804),  F(0.9238795325),  F(0.8314696123),
@@ -345,3 +349,247 @@ static const FIXED_T cos_table_fixed_8[128] = {
 	-F(0.0000000000), -F(0.1950903220),  F(0.3826834324), -F(0.5555702330),
 };
 #undef F
+
+/*
+ * Constant tables for the use in SIMD optimized analysis filters
+ * Each table consists of two parts:
+ * 1. reordered "proto" table
+ * 2. reordered "cos" table
+ *
+ * Due to non-symmetrical reordering, separate tables for "even"
+ * and "odd" cases are needed
+ */
+
+#ifdef __GNUC__
+#define SIMD_ALIGNED __attribute__((aligned(16)))
+#else
+#define SIMD_ALIGNED
+#endif
+
+static const FIXED_T SIMD_ALIGNED analysis_consts_fixed4_simd_even[40 + 16] = {
+#define F(x) F_PROTO4(x)
+	 F(0.00000000E+00),  F(3.83720193E-03),
+	 F(5.36548976E-04),  F(2.73370904E-03),
+	 F(3.06012286E-03),  F(3.89205149E-03),
+	 F(0.00000000E+00), -F(1.49188357E-03),
+	 F(1.09137620E-02),  F(2.58767811E-02),
+	 F(2.04385087E-02),  F(3.21939290E-02),
+	 F(7.76463494E-02),  F(6.13245186E-03),
+	 F(0.00000000E+00), -F(2.88757392E-02),
+	 F(1.35593274E-01),  F(2.94315332E-01),
+	 F(1.94987841E-01),  F(2.81828203E-01),
+	-F(1.94987841E-01),  F(2.81828203E-01),
+	 F(0.00000000E+00), -F(2.46636662E-01),
+	-F(1.35593274E-01),  F(2.58767811E-02),
+	-F(7.76463494E-02),  F(6.13245186E-03),
+	-F(2.04385087E-02),  F(3.21939290E-02),
+	 F(0.00000000E+00),  F(2.88217274E-02),
+	-F(1.09137620E-02),  F(3.83720193E-03),
+	-F(3.06012286E-03),  F(3.89205149E-03),
+	-F(5.36548976E-04),  F(2.73370904E-03),
+	 F(0.00000000E+00), -F(1.86581691E-03),
+#undef F
+#define F(x) F_COS4(x)
+	 F(0.7071067812),  F(0.9238795325),
+	-F(0.7071067812),  F(0.3826834324),
+	-F(0.7071067812), -F(0.3826834324),
+	 F(0.7071067812), -F(0.9238795325),
+	 F(0.3826834324), -F(1.0000000000),
+	-F(0.9238795325), -F(1.0000000000),
+	 F(0.9238795325), -F(1.0000000000),
+	-F(0.3826834324), -F(1.0000000000),
+#undef F
+};
+
+static const FIXED_T SIMD_ALIGNED analysis_consts_fixed4_simd_odd[40 + 16] = {
+#define F(x) F_PROTO4(x)
+	 F(2.73370904E-03),  F(5.36548976E-04),
+	-F(1.49188357E-03),  F(0.00000000E+00),
+	 F(3.83720193E-03),  F(1.09137620E-02),
+	 F(3.89205149E-03),  F(3.06012286E-03),
+	 F(3.21939290E-02),  F(2.04385087E-02),
+	-F(2.88757392E-02),  F(0.00000000E+00),
+	 F(2.58767811E-02),  F(1.35593274E-01),
+	 F(6.13245186E-03),  F(7.76463494E-02),
+	 F(2.81828203E-01),  F(1.94987841E-01),
+	-F(2.46636662E-01),  F(0.00000000E+00),
+	 F(2.94315332E-01), -F(1.35593274E-01),
+	 F(2.81828203E-01), -F(1.94987841E-01),
+	 F(6.13245186E-03), -F(7.76463494E-02),
+	 F(2.88217274E-02),  F(0.00000000E+00),
+	 F(2.58767811E-02), -F(1.09137620E-02),
+	 F(3.21939290E-02), -F(2.04385087E-02),
+	 F(3.89205149E-03), -F(3.06012286E-03),
+	-F(1.86581691E-03),  F(0.00000000E+00),
+	 F(3.83720193E-03),  F(0.00000000E+00),
+	 F(2.73370904E-03), -F(5.36548976E-04),
+#undef F
+#define F(x) F_COS4(x)
+	 F(0.9238795325), -F(1.0000000000),
+	 F(0.3826834324), -F(1.0000000000),
+	-F(0.3826834324), -F(1.0000000000),
+	-F(0.9238795325), -F(1.0000000000),
+	 F(0.7071067812),  F(0.3826834324),
+	-F(0.7071067812), -F(0.9238795325),
+	-F(0.7071067812),  F(0.9238795325),
+	 F(0.7071067812), -F(0.3826834324),
+#undef F
+};
+
+static const FIXED_T SIMD_ALIGNED analysis_consts_fixed8_simd_even[80 + 64] = {
+#define F(x) F_PROTO8(x)
+	 F(0.00000000E+00),  F(2.01182542E-03),
+	 F(1.56575398E-04),  F(1.78371725E-03),
+	 F(3.43256425E-04),  F(1.47640169E-03),
+	 F(5.54620202E-04),  F(1.13992507E-03),
+	-F(8.23919506E-04),  F(0.00000000E+00),
+	 F(2.10371989E-03),  F(3.49717454E-03),
+	 F(1.99454554E-03),  F(1.64973098E-03),
+	 F(1.61656283E-03),  F(1.78805361E-04),
+	 F(5.65949473E-03),  F(1.29371806E-02),
+	 F(8.02941163E-03),  F(1.53184106E-02),
+	 F(1.04584443E-02),  F(1.62208471E-02),
+	 F(1.27472335E-02),  F(1.59045603E-02),
+	-F(1.46525263E-02),  F(0.00000000E+00),
+	 F(8.85757540E-03),  F(5.31873032E-02),
+	 F(2.92408442E-03),  F(3.90751381E-02),
+	-F(4.91578024E-03),  F(2.61098752E-02),
+	 F(6.79989431E-02),  F(1.46955068E-01),
+	 F(8.29847578E-02),  F(1.45389847E-01),
+	 F(9.75753918E-02),  F(1.40753505E-01),
+	 F(1.11196689E-01),  F(1.33264415E-01),
+	-F(1.23264548E-01),  F(0.00000000E+00),
+	 F(1.45389847E-01), -F(8.29847578E-02),
+	 F(1.40753505E-01), -F(9.75753918E-02),
+	 F(1.33264415E-01), -F(1.11196689E-01),
+	-F(6.79989431E-02),  F(1.29371806E-02),
+	-F(5.31873032E-02),  F(8.85757540E-03),
+	-F(3.90751381E-02),  F(2.92408442E-03),
+	-F(2.61098752E-02), -F(4.91578024E-03),
+	 F(1.46404076E-02),  F(0.00000000E+00),
+	 F(1.53184106E-02), -F(8.02941163E-03),
+	 F(1.62208471E-02), -F(1.04584443E-02),
+	 F(1.59045603E-02), -F(1.27472335E-02),
+	-F(5.65949473E-03),  F(2.01182542E-03),
+	-F(3.49717454E-03),  F(2.10371989E-03),
+	-F(1.64973098E-03),  F(1.99454554E-03),
+	-F(1.78805361E-04),  F(1.61656283E-03),
+	-F(9.02154502E-04),  F(0.00000000E+00),
+	 F(1.78371725E-03), -F(1.56575398E-04),
+	 F(1.47640169E-03), -F(3.43256425E-04),
+	 F(1.13992507E-03), -F(5.54620202E-04),
+#undef F
+#define F(x) F_COS8(x)
+	 F(0.7071067812),  F(0.8314696123),
+	-F(0.7071067812), -F(0.1950903220),
+	-F(0.7071067812), -F(0.9807852804),
+	 F(0.7071067812), -F(0.5555702330),
+	 F(0.7071067812),  F(0.5555702330),
+	-F(0.7071067812),  F(0.9807852804),
+	-F(0.7071067812),  F(0.1950903220),
+	 F(0.7071067812), -F(0.8314696123),
+	 F(0.9238795325),  F(0.9807852804),
+	 F(0.3826834324),  F(0.8314696123),
+	-F(0.3826834324),  F(0.5555702330),
+	-F(0.9238795325),  F(0.1950903220),
+	-F(0.9238795325), -F(0.1950903220),
+	-F(0.3826834324), -F(0.5555702330),
+	 F(0.3826834324), -F(0.8314696123),
+	 F(0.9238795325), -F(0.9807852804),
+	-F(1.0000000000),  F(0.5555702330),
+	-F(1.0000000000), -F(0.9807852804),
+	-F(1.0000000000),  F(0.1950903220),
+	-F(1.0000000000),  F(0.8314696123),
+	-F(1.0000000000), -F(0.8314696123),
+	-F(1.0000000000), -F(0.1950903220),
+	-F(1.0000000000),  F(0.9807852804),
+	-F(1.0000000000), -F(0.5555702330),
+	 F(0.3826834324),  F(0.1950903220),
+	-F(0.9238795325), -F(0.5555702330),
+	 F(0.9238795325),  F(0.8314696123),
+	-F(0.3826834324), -F(0.9807852804),
+	-F(0.3826834324),  F(0.9807852804),
+	 F(0.9238795325), -F(0.8314696123),
+	-F(0.9238795325),  F(0.5555702330),
+	 F(0.3826834324), -F(0.1950903220),
+#undef F
+};
+
+static const FIXED_T SIMD_ALIGNED analysis_consts_fixed8_simd_odd[80 + 64] = {
+#define F(x) F_PROTO8(x)
+	 F(0.00000000E+00), -F(8.23919506E-04),
+	 F(1.56575398E-04),  F(1.78371725E-03),
+	 F(3.43256425E-04),  F(1.47640169E-03),
+	 F(5.54620202E-04),  F(1.13992507E-03),
+	 F(2.01182542E-03),  F(5.65949473E-03),
+	 F(2.10371989E-03),  F(3.49717454E-03),
+	 F(1.99454554E-03),  F(1.64973098E-03),
+	 F(1.61656283E-03),  F(1.78805361E-04),
+	 F(0.00000000E+00), -F(1.46525263E-02),
+	 F(8.02941163E-03),  F(1.53184106E-02),
+	 F(1.04584443E-02),  F(1.62208471E-02),
+	 F(1.27472335E-02),  F(1.59045603E-02),
+	 F(1.29371806E-02),  F(6.79989431E-02),
+	 F(8.85757540E-03),  F(5.31873032E-02),
+	 F(2.92408442E-03),  F(3.90751381E-02),
+	-F(4.91578024E-03),  F(2.61098752E-02),
+	 F(0.00000000E+00), -F(1.23264548E-01),
+	 F(8.29847578E-02),  F(1.45389847E-01),
+	 F(9.75753918E-02),  F(1.40753505E-01),
+	 F(1.11196689E-01),  F(1.33264415E-01),
+	 F(1.46955068E-01), -F(6.79989431E-02),
+	 F(1.45389847E-01), -F(8.29847578E-02),
+	 F(1.40753505E-01), -F(9.75753918E-02),
+	 F(1.33264415E-01), -F(1.11196689E-01),
+	 F(0.00000000E+00),  F(1.46404076E-02),
+	-F(5.31873032E-02),  F(8.85757540E-03),
+	-F(3.90751381E-02),  F(2.92408442E-03),
+	-F(2.61098752E-02), -F(4.91578024E-03),
+	 F(1.29371806E-02), -F(5.65949473E-03),
+	 F(1.53184106E-02), -F(8.02941163E-03),
+	 F(1.62208471E-02), -F(1.04584443E-02),
+	 F(1.59045603E-02), -F(1.27472335E-02),
+	 F(0.00000000E+00), -F(9.02154502E-04),
+	-F(3.49717454E-03),  F(2.10371989E-03),
+	-F(1.64973098E-03),  F(1.99454554E-03),
+	-F(1.78805361E-04),  F(1.61656283E-03),
+	 F(2.01182542E-03),  F(0.00000000E+00),
+	 F(1.78371725E-03), -F(1.56575398E-04),
+	 F(1.47640169E-03), -F(3.43256425E-04),
+	 F(1.13992507E-03), -F(5.54620202E-04),
+#undef F
+#define F(x) F_COS8(x)
+	-F(1.0000000000),  F(0.8314696123),
+	-F(1.0000000000), -F(0.1950903220),
+	-F(1.0000000000), -F(0.9807852804),
+	-F(1.0000000000), -F(0.5555702330),
+	-F(1.0000000000),  F(0.5555702330),
+	-F(1.0000000000),  F(0.9807852804),
+	-F(1.0000000000),  F(0.1950903220),
+	-F(1.0000000000), -F(0.8314696123),
+	 F(0.9238795325),  F(0.9807852804),
+	 F(0.3826834324),  F(0.8314696123),
+	-F(0.3826834324),  F(0.5555702330),
+	-F(0.9238795325),  F(0.1950903220),
+	-F(0.9238795325), -F(0.1950903220),
+	-F(0.3826834324), -F(0.5555702330),
+	 F(0.3826834324), -F(0.8314696123),
+	 F(0.9238795325), -F(0.9807852804),
+	 F(0.7071067812),  F(0.5555702330),
+	-F(0.7071067812), -F(0.9807852804),
+	-F(0.7071067812),  F(0.1950903220),
+	 F(0.7071067812),  F(0.8314696123),
+	 F(0.7071067812), -F(0.8314696123),
+	-F(0.7071067812), -F(0.1950903220),
+	-F(0.7071067812),  F(0.9807852804),
+	 F(0.7071067812), -F(0.5555702330),
+	 F(0.3826834324),  F(0.1950903220),
+	-F(0.9238795325), -F(0.5555702330),
+	 F(0.9238795325),  F(0.8314696123),
+	-F(0.3826834324), -F(0.9807852804),
+	-F(0.3826834324),  F(0.9807852804),
+	 F(0.9238795325), -F(0.8314696123),
+	-F(0.9238795325),  F(0.5555702330),
+	 F(0.3826834324), -F(0.1950903220),
+#undef F
+};
-- 
1.5.6.5



* Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter
  2009-01-02 16:07   ` Siarhei Siamashka
@ 2009-01-02 16:27     ` Brad Midgley
  2009-01-02 17:11       ` Siarhei Siamashka
  2009-01-05  8:57     ` Siarhei Siamashka
  2009-01-06  2:49     ` Marcel Holtmann
  2 siblings, 1 reply; 20+ messages in thread
From: Brad Midgley @ 2009-01-02 16:27 UTC (permalink / raw)
  To: Siarhei Siamashka; +Cc: linux-bluetooth

Siarhei

> I wonder what CPU architectures are the most important for bluez?

This is not an easy question, but one perspective is to consider the
impact on battery life. Running sbc encoding on a phone will have a
greater impact on battery life than it does to run it on a laptop.

The ideal is to have portable devices mitigate this with dsp hardware,
but we can't count on the hardware or the driver to be there in all
cases. (see https://garage.maemo.org/projects/dsp-sbc/ for some work
using the TI dsp)

-- 
Brad Midgley


* Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter
  2008-12-31 20:55 ` Luiz Augusto von Dentz
@ 2009-01-02 16:33   ` Siarhei Siamashka
  2009-01-02 19:40     ` Luiz Augusto von Dentz
  2009-01-06  2:50   ` Marcel Holtmann
  1 sibling, 1 reply; 20+ messages in thread
From: Siarhei Siamashka @ 2009-01-02 16:33 UTC (permalink / raw)
  To: ext Luiz Augusto von Dentz; +Cc: linux-bluetooth

On Wednesday 31 December 2008 22:55:24 ext Luiz Augusto von Dentz wrote:
> I wonder why don't we use liboil
> (http://liboil.freedesktop.org/wiki/).

Can you clarify your proposal a bit? Which functions/implementations from
liboil do you suggest for use in bluez sbc?

> Since we can't keep implementing, or don't want to, optimization code for
> each instruction extension around.

Or do you suggest submitting the sbc analysis filter function to liboil, adding
it as an sbc dependency, and hoping that somebody will translate the code to the
instruction sets of other architectures? Will it turn out to be beneficial?
IMHO it may easily become just an unnecessary burden and wasted effort too.

> Liboil detects which implementation is 
> faster at runtime and there are many other codec implementations that depend
> on it, it actually makes a lot of sense to gstream and PulseAudio which
> already uses liboil. I know that means adding another dependency to
> BlueZ, or perhaps it is time to make libsbc a real library?

I had a quick look at liboil and it did not impress me that much yet.

Best regards,
Siarhei Siamashka


* Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter
  2009-01-02 16:27     ` Brad Midgley
@ 2009-01-02 17:11       ` Siarhei Siamashka
  2009-01-02 18:03         ` Brad Midgley
  2009-01-05 11:08         ` Simon Pickering
  0 siblings, 2 replies; 20+ messages in thread
From: Siarhei Siamashka @ 2009-01-02 17:11 UTC (permalink / raw)
  To: ext Brad Midgley; +Cc: linux-bluetooth

On Friday 02 January 2009 18:27:33 ext Brad Midgley wrote:
> Siarhei
>
> > I wonder what CPU architectures are the most important for bluez?
>
> This is not an easy question, but one perspective is to consider the
> impact on battery life. Running sbc encoding on a phone will have a
> greater impact on battery life than it does to run it on a laptop.

I see. I'm mostly interested in ARM, so this one should be quite fine. On the
other hand, if we sacrifice performance let's say for MIPS when adding
some of the changes beneficial for other platforms, will it be considered an
important regression?

I also submitted the MMX implementation first because it is the code that can
hopefully be tested by more people. Anyway, the hardest part was transforming
the code to be efficiently vectorizable (done by writing several additional
scripts which were used to find an optimal input data permutation and generate
the tables). After that, converting the C code to the appropriate MMX
instructions in gcc inline assembly took probably only ~1 day of working time,
including testing. Support for the other architectures should be quite easy
from this point (ARM will follow next).
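
To illustrate why a table layout with coefficients stored in adjacent pairs is
convenient for SIMD, here is a minimal scalar sketch of a packed multiply-add
on four 16-bit lanes, the kind of operation MMX provides as 'pmaddwd'. That
this is the instruction the pair layout targets is an assumption made for the
sake of the example, and 'madd_pairs' is a made-up helper name, not something
taken from the patch:

```c
#include <stdint.h>

/*
 * Scalar model of a packed multiply-add: four 16-bit lanes are multiplied
 * element by element and the products of adjacent lanes are summed into
 * two 32-bit results. Hypothetical helper, for illustration only.
 */
static void madd_pairs(const int16_t a[4], const int16_t b[4], int32_t out[2])
{
	out[0] = (int32_t) a[0] * b[0] + (int32_t) a[1] * b[1];
	out[1] = (int32_t) a[2] * b[2] + (int32_t) a[3] * b[3];
}
```

With such an operation, samples and constants that contribute to the same
accumulator have to sit next to each other in memory, which is the kind of
constraint an input data permutation has to satisfy.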

> The ideal is to have portable devices mitigate this with dsp hardware,
> but we can't count on the hardware or the driver to be there in all
> cases. (see https://garage.maemo.org/projects/dsp-sbc/ for some work
> using the TI dsp)

Yes, I know about this project. And its maintainer should be subscribed here
too :)

Making the code a bit more C55x friendly is not difficult at all and can surely
be done.

-- 
Best regards,
Siarhei Siamashka


* Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter
  2009-01-02 17:11       ` Siarhei Siamashka
@ 2009-01-02 18:03         ` Brad Midgley
  2009-01-05 11:08         ` Simon Pickering
  1 sibling, 0 replies; 20+ messages in thread
From: Brad Midgley @ 2009-01-02 18:03 UTC (permalink / raw)
  To: linux-bluetooth@vger.kernel.org

Siarhei

> other hand, if we sacrifice performance let's say for MIPS when adding
> some of the changes beneficial for other platforms, will it be considered an
> important regression?

I think it would be best to consider these case-by-case.

btw, I have a mips-based access point I will try things out on to
assess it. (It's a first-rev asus wl-500gp)

Brad Midgley


* Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter
  2009-01-02 16:33   ` Siarhei Siamashka
@ 2009-01-02 19:40     ` Luiz Augusto von Dentz
  2009-01-04 17:56       ` Siarhei Siamashka
  0 siblings, 1 reply; 20+ messages in thread
From: Luiz Augusto von Dentz @ 2009-01-02 19:40 UTC (permalink / raw)
  To: Siarhei Siamashka; +Cc: linux-bluetooth

Hi Siarhei,

On Fri, Jan 2, 2009 at 1:33 PM, Siarhei Siamashka
<siarhei.siamashka@nokia.com> wrote:
> On Wednesday 31 December 2008 22:55:24 ext Luiz Augusto von Dentz wrote:
>> I wonder why don't we use liboil
>> (http://liboil.freedesktop.org/wiki/).
>
> Can you clarify your proposal a bit? Which functions/implementations from
> liboil do you suggest for use in bluez sbc?

Liboil stands for optimized inner loops, which is exactly what we need.
Transforming the whole code will depend (and already does) on each SIMD
extension being implemented. What we basically do is multiply and
accumulate arrays, which could be done with:
http://liboil.freedesktop.org/documentation/liboil-liboilfuncs-math.html#oil-multsum-f32

> Or do you suggest to submit the sbc analysis filter function to liboil, add it
> as sbc dependency and hope that somebody would translate the code to the
> instruction sets of other architectures? Will it turn out to be beneficial?
> IMHO It may easily become just an unnecessary burden and wasted effort too.

What if there is some other codec that might benefit from the
code we are producing? I'm not talking about the whole sbc analysis
filter, only the inner loops.

Also read carefully what liboil does: there is a whole instruction set
detection/benchmark system very similar to what you have proposed for
choosing the implementation at runtime.

-- 
Luiz Augusto von Dentz
Engenheiro de Computação


* Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter
  2009-01-02 19:40     ` Luiz Augusto von Dentz
@ 2009-01-04 17:56       ` Siarhei Siamashka
  0 siblings, 0 replies; 20+ messages in thread
From: Siarhei Siamashka @ 2009-01-04 17:56 UTC (permalink / raw)
  To: ext Luiz Augusto von Dentz; +Cc: linux-bluetooth

On Friday 02 January 2009 21:40:48 ext Luiz Augusto von Dentz wrote:
> Hi Siarhei,
>
> On Fri, Jan 2, 2009 at 1:33 PM, Siarhei Siamashka
>
> <siarhei.siamashka@nokia.com> wrote:
> > On Wednesday 31 December 2008 22:55:24 ext Luiz Augusto von Dentz wrote:
> >> I wonder why don't we use liboil
> >> (http://liboil.freedesktop.org/wiki/).
> >
> > Can you clarify your proposal a bit? Which functions/implementations from
> > liboil do you suggest for use in bluez sbc?
>
> Liboil stands to optimized inner loops, that exactly what we need,
> transforming the whole code will, already is, depend on each simd
> extention to be implemented. 
>
> What we basically do is multiply and 
> accumulate arrays, what could be done with:
> http://liboil.freedesktop.org/documentation/liboil-liboilfuncs-math.html#oil-multsum-f32

Right now from what I see, we need SIMD optimized versions of:
- analysis filter
- channels deinterleaving with optional endian conversion
- scalefactors processing
- joining channels
- maybe quantization

Liboil does not seem to directly provide any of these (I really looked through
all of it, but could of course have missed something). Your example is not very
good and does not clarify anything, because it is a floating point function,
while the SBC code here works on fixed point data.

Let's take the SBC analysis filter as an example. It's a function, which reads
data from the samples buffer, constants buffer and writes some results in the
output buffer. We want all the operations inside of it to be done with
registers only, avoiding any intermediate stores to memory. The arrays t1[8]
and t2[8] are supposed to be mapped directly on the registers. If you try to
implement analysis function using liboil 'inner loop' functions, the resulting
performance would be simply horrible. If you don't trust me, just have a look
at some more stuff from liboil such as DCT functions. The analysis filter from
SBC falls exactly into the same category.

The other functions that need to be done and that I have mentioned above are
also the same.

Moreover, the arrays which SBC operates on are rather small. That's why
special care needs to be taken with proper loop unrolling, alignment and
other details in order not to have any unneeded overhead.
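
The point about intermediate stores can be sketched in plain C. This is only
an illustration under made-up assumptions: the sizes (TAPS, BANDS), the helper
'multsum_s16' and both filter functions are hypothetical and are not taken from
liboil or from the SBC code. The first variant is composed from a generic dot
product helper and has to pass its intermediate results through memory; the
second variant is one fused function whose accumulators are plain locals that a
compiler (or hand-written assembly) can keep in registers:

```c
#include <stdint.h>

#define TAPS  8	/* illustrative sizes, not the real SBC dimensions */
#define BANDS 2

/* Generic 'inner loop' helper: one dot product, result stored to memory. */
static void multsum_s16(int32_t *dst, const int16_t *a, const int16_t *b, int n)
{
	int32_t sum = 0;
	int i;

	for (i = 0; i < n; i++)
		sum += (int32_t) a[i] * b[i];
	*dst = sum;
}

/* Variant 1: built from helper calls, intermediates go through tmp[]. */
static void filter_via_helpers(const int16_t *in, const int16_t *win,
			       const int16_t *mix, int32_t *out)
{
	int32_t tmp[BANDS];
	int b;

	for (b = 0; b < BANDS; b++)
		multsum_s16(&tmp[b], in + b * TAPS, win + b * TAPS, TAPS);
	for (b = 0; b < BANDS; b++)
		out[b] = tmp[0] * mix[b * BANDS] + tmp[1] * mix[b * BANDS + 1];
}

/* Variant 2: one fused function, accumulators are local variables. */
static void filter_fused(const int16_t *in, const int16_t *win,
			 const int16_t *mix, int32_t *out)
{
	int32_t t0 = 0, t1 = 0;
	int i;

	for (i = 0; i < TAPS; i++) {
		t0 += (int32_t) in[i] * win[i];
		t1 += (int32_t) in[TAPS + i] * win[TAPS + i];
	}
	out[0] = t0 * mix[0] + t1 * mix[1];
	out[1] = t0 * mix[2] + t1 * mix[3];
}
```

Both variants compute the same values. Whether a given compiler really keeps
t0 and t1 in registers is not guaranteed, but the fused form at least makes it
possible, and it is the form that maps directly onto hand-written SIMD code.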

> > Or do you suggest to submit the sbc analysis filter function to liboil,
> > add it as sbc dependency and hope that somebody would translate the code
> > to the instruction sets of other architectures? Will it turn out to be
> > beneficial? IMHO It may easily become just an unnecessary burden and
> > wasted effort too.
>
> What about if there is any other codec that might benefit from the
> code we are producing, Im not talking about the whole sbc analysis
> filter but the inner loops.

Then it is good for these other codecs :) They will be able to take some
code from SBC (either directly, or via the liboil library if it ends up pulling
in code from bluez as it did with some other samples of optimized code).

> Also read careful what liboil does,  there is a whole instruction set 
> detection/benchmark system very similar to what you have proposed for
> choosing implementation in runtime.

The detection of MMX needs only a dozen lines of trivial code (checking
EFLAGS and CPUID). Adding a big library as a dependency just for a few lines
of code is overkill.

In addition, by spending 15 minutes on writing and testing this trivial code
using just an Architecture Software Developer's Manual from Intel, I avoid
all the hassle of making sure that I don't violate the licenses or copyrights
of somebody else :)

By the way, I had a look and didn't quite like the way liboil does this CPU
capability check. Instead of checking EFLAGS first, it tries to execute CPUID
directly and has the code to catch SIGILL. I'm not sure if it is a good idea
to mess with the signals from a *library*.
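
For reference, a minimal sketch of such a check that does not touch signals.
It is not the code from the patch: instead of hand-written assembly it relies
on the '__get_cpuid' helper from the <cpuid.h> header shipped with GCC and
clang, which to my understanding performs the CPUID availability test itself
where one is needed. The function name 'have_mmx' is made up for the example,
and on non-x86 targets it simply reports no MMX:

```c
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#endif

/* Returns 1 if the CPU reports MMX support, 0 otherwise. */
static int have_mmx(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
	unsigned int eax, ebx, ecx, edx;

	/* __get_cpuid() returns 0 when the requested leaf is unavailable */
	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return 0;

	/* CPUID leaf 1, EDX bit 23 is the MMX feature flag */
	return (edx >> 23) & 1;
#else
	return 0;
#endif
}
```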

-- 
Best regards,
Siarhei Siamashka


* Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter
  2009-01-02 16:07   ` Siarhei Siamashka
  2009-01-02 16:27     ` Brad Midgley
@ 2009-01-05  8:57     ` Siarhei Siamashka
  2009-01-06  2:49     ` Marcel Holtmann
  2 siblings, 0 replies; 20+ messages in thread
From: Siarhei Siamashka @ 2009-01-05  8:57 UTC (permalink / raw)
  To: linux-bluetooth

On Friday 02 January 2009 18:07:17 ext Siarhei Siamashka wrote:
> On Thursday 01 January 2009 10:58:03 ext Marcel Holtmann wrote:
[...]
> > > But right now I would like to hear some opinions about the following
> > > things regarding the attached patch:
> > >
> > > The first question is about the use of extra source file for SIMD
> > > optimizations and introduction of
> > > 'sbc_encoder_init_simd_optimized_analyze' function to the global
> > > namespace. The rationale for that is the intention to stop adding
> > > changes to 'sbc.c' (otherwise it will become bloated pretty soon with
> > > the addition of multiple optimizations for various platforms). If
> > > anyone has a better idea, I'm very much interested to hear it.
> > >
> > > And if the addition of a new source file gets approved, I wonder about
> > > what text should go to the copyright header?
> > >
> > > Now we have two "reference" C implementations of analysis filter. Is it
> > > OK to keep both? Or only SIMD-friendly one should remain in the end?
> >
> > I am fine with keeping both, but if one is just not useful, we are going
> > to remove it.
>
> The only problem with SIMD-friendly code is that it uses two tables instead
> of one (that's a sacrifice for the nice and symmetric code layout which
> fits SIMD instructions of modern processors quite well). It may be somewhat
> less optimal for the legacy processors without SIMD capabilities.
>
> I wonder what CPU architectures are the most important for bluez?
>
> > Also two separate files are fine for me. Personally I prefer a runtime
> > selection since compile time options are always painful
> > to test before making the release.
>
> The attached patch contains what I would consider to be a final variant.
> MMX support is now complete. It works for both x86 and amd64, has runtime
> autodetection of MMX availability, supports 4 and 8 subbands cases. I also
> ensured that only original MMX instructions are used (and no SSE or other
> later additions), so the code should work fine even on the old Pentium1
> MMX. New MMX optimized functions produce bit identical results when
> compared with bluez-4.25 release.
>
> With this patch applied, new filtering functions are noticeably faster than
> the old ones on x86 (so now they are both faster and have better
> quality). Assembly optimizations for the other platforms can be easily
> added too.
>
> > For the copyright header it is pretty simple. We copy the current header
> > and then later on I will add the appropriate Nokia copyright to it. So
> > don't worry about that part, I take care of that.
>
> OK, thanks

I understand that it is too early to ping you regarding the status of the
patch :)

But it would be nice if all the SBC encoder optimizations that are relatively
easy to implement got done and integrated quickly (keeping the encoder output
bit identical to that of version 4.25 for now).

On second thought, I propose the following source file layout:
sbc_dsplib.c, sbc_dsplib.h - reference C code for the supplementary
helper functions which can be used in the SBC encoder/decoder and can be
efficiently SIMD/assembly optimized
sbc_dsplib_mmx.c - x86 MMX optimizations
sbc_dsplib_sse2.c - x86 SSE2 optimizations
sbc_dsplib_neon.c - ARM NEON optimizations
sbc_dsplib_armv6.c - ARMv6 optimizations
...

sbc_dsplib.c would also contain an initialization function, which sets up the
function pointers in the 'sbc_encoder_state' structure to the best available
implementations for the current platform.
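
A rough sketch of what such an initialization function could look like. All
names here are hypothetical ('sbc_encoder_state_sketch', 'analyze_4b',
'sbc_dsplib_init_sketch'), the two filter functions are empty stand-ins, and
the 'have_simd' argument stands for the result of a runtime CPU check:

```c
#include <stdint.h>

/* Hypothetical, trimmed-down encoder state; field names are illustrative. */
struct sbc_encoder_state_sketch {
	const char *implementation_info;
	void (*analyze_4b)(const int16_t *in, int32_t *out);
};

/* Stand-in for the reference C analysis filter. */
static void analyze_4b_c(const int16_t *in, int32_t *out)
{
	out[0] = in[0];
}

/* Stand-in for a SIMD version; it must produce identical output. */
static void analyze_4b_simd(const int16_t *in, int32_t *out)
{
	out[0] = in[0];
}

/* Start from the portable C code, then override with the best match. */
static void sbc_dsplib_init_sketch(struct sbc_encoder_state_sketch *state,
				   int have_simd)
{
	state->analyze_4b = analyze_4b_c;
	state->implementation_info = "C";

	if (have_simd) {
		state->analyze_4b = analyze_4b_simd;
		state->implementation_info = "SIMD";
	}
}
```

The encoder would then only ever call state->analyze_4b(), so adding support
for a new platform means adding one more source file and one more branch in
the init function, without touching the callers.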

The content of the sbc_dsplib* files can then be considered for future
submission into liboil if this is desired.

Would you prefer an updated patchset, which implements all of this stuff one
step at a time?

-- 
Best regards,
Siarhei Siamashka


* RE: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter
  2009-01-02 17:11       ` Siarhei Siamashka
  2009-01-02 18:03         ` Brad Midgley
@ 2009-01-05 11:08         ` Simon Pickering
  1 sibling, 0 replies; 20+ messages in thread
From: Simon Pickering @ 2009-01-05 11:08 UTC (permalink / raw)
  To: 'Siarhei Siamashka', 'ext Brad Midgley'; +Cc: linux-bluetooth


> > The ideal is to have portable devices mitigate this with dsp hardware,
> > but we can't count on the hardware or the driver to be there in all
> > cases. (see https://garage.maemo.org/projects/dsp-sbc/ for some work
> > using the TI dsp)
> 
> Yes, I know about this project. And its maintainer should be subscribed here
> too :)

He is now :)

> Making code a bit more C55x friendly is not difficult at all and can be surely
> done.

I'm catching up on the flood of patches and will have a look at this soon.

Cheers,


Simon



* Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter
  2009-01-02 16:07   ` Siarhei Siamashka
  2009-01-02 16:27     ` Brad Midgley
  2009-01-05  8:57     ` Siarhei Siamashka
@ 2009-01-06  2:49     ` Marcel Holtmann
  2009-01-06  5:27       ` Christian Hoene
  2009-01-09 16:50       ` Siarhei Siamashka
  2 siblings, 2 replies; 20+ messages in thread
From: Marcel Holtmann @ 2009-01-06  2:49 UTC (permalink / raw)
  To: Siarhei Siamashka; +Cc: linux-bluetooth

Hi Siarhei,

> > > This is a preliminary preview of SIMD optimizations for SBC encoder
> > > analysis filter.
> > >
> > > It already contains MMX optimization for 4 subbands case (yes, all this
> > > insane amount of extra lines of code finally starts to pay off) ;)
> > >
> > > Important notice: in order to test MMX optimizations, you need to have
> > > extra '-mmmx' command line option passed to gcc. Runtime MMX
> > > autodetection can be easily added later. Also don't forget to pass -s4
> > > option to sbcenc because 8 subbands case is still not accelerated. By the
> > > way, SSE2 is twice wider than MMX and should be a lot faster. Though MMX
> > > is supported on virtually every x86 cpu that is in use nowadays and can
> > > be considered "lowest common denominator".
> > >
> > > My quick benchmark showed that the performance gets improved about ~10%
> > > overall (and about twice better for the analysis filter function alone)
> > > when compared with bluez-4.23 release which had the old buggy code.
> > > Improvement is much more noticeable over the release 4.25 which contains
> > > a new fixed and mostly nonoptimized filter.
> > >
> > > So now the performance is better than ever. And I guess, all the
> > > platforms should use SIMD optimizations nowadays, so they should gain
> > > performance improvements too. Those 'anamatrix' style optimizations in
> > > older code feel so much like the previous century ;)
> > >
> > > I'm going to primarily focus on NEON and maybe ARMv6 SIMD optimizations,
> > > these will be submitted a bit later. Also, as I have already written
> > > before, the other parts of code are quite inefficient too and can be
> > > optimized. There are still lots of things to improve.
> > >
> > >
> > > But right now I would like to hear some opinions about the following
> > > things regarding the attached patch:
> > >
> > > The first question is about the use of extra source file for SIMD
> > > optimizations and introduction of
> > > 'sbc_encoder_init_simd_optimized_analyze' function to the global
> > > namespace. The rationale for that is the intention to stop adding changes
> > > to 'sbc.c' (otherwise it will become bloated pretty soon with the
> > > addition of multiple optimizations for various platforms). If anyone has
> > > a better idea, I'm very much interested to hear it.
> > >
> > > And if the addition of a new source file gets approved, I wonder about
> > > what text should go to the copyright header?
> > >
> > > Now we have two "reference" C implementations of analysis filter. Is it
> > > OK to keep both? Or only SIMD-friendly one should remain in the end?
> >
> > I am fine with keeping both, but if one is just not useful, we are going
> > to remove it.
> 
> The only problem with SIMD-friendly code is that it uses two tables instead of
> one (that's a sacrifice for the nice and symmetric code layout which fits SIMD
> instructions of modern processors quite well). It may be somewhat less
> optimal for the legacy processors without SIMD capabilities.
> 
> I wonder what CPU architectures are the most important for bluez?
> 
> > Also two separate files are fine for me. Personally I prefer a runtime
> > selection since compile time options are always painful 
> > to test before making the release.
> 
> The attached patch contains what I would consider to be a final variant. MMX
> support is now complete. It works for both x86 and amd64, has runtime
> autodetection of MMX availability, supports 4 and 8 subbands cases. I also
> ensured that only original MMX instructions are used (and no SSE or other
> later additions), so the code should work fine even on the old Pentium1 MMX.
> New MMX optimized functions produce bit identical results when compared
> with bluez-4.25 release.
> 
> With this patch applied, new filtering functions are noticeably faster than
> the old ones on x86 (so now they are both faster and have better
> quality). Assembly optimizations for the other platforms can be easily
> added too. 

Can you rebase your patch against the latest tree and resend it?

Do we still need the high precision stuff? I want to cut down the number
of ifdefs in the code as much as possible.

Regards

Marcel




* Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter
  2008-12-31 20:55 ` Luiz Augusto von Dentz
  2009-01-02 16:33   ` Siarhei Siamashka
@ 2009-01-06  2:50   ` Marcel Holtmann
  1 sibling, 0 replies; 20+ messages in thread
From: Marcel Holtmann @ 2009-01-06  2:50 UTC (permalink / raw)
  To: Luiz Augusto von Dentz; +Cc: Siarhei Siamashka, linux-bluetooth

Hi Luiz,

> I wonder why don't we use liboil
> (http://liboil.freedesktop.org/wiki/). Since we can't keep
> implementing, or don't want to, optimization code for each instruction
> extension around. Liboil detects which implementation is faster at
> runtime and there are many other codec implementations that depend on
> it, it actually makes a lot of sense to gstream and PulseAudio which
> already uses liboil. I know that means adding another dependency to
> BlueZ, or perhaps it is time to make libsbc a real library?

let me stop the discussion about liboil right now. I don't see any big
advantage for BlueZ right now. This might change at some point in the
future, but right now, we will not base around liboil.

Regards

Marcel




* RE: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter
  2009-01-06  2:49     ` Marcel Holtmann
@ 2009-01-06  5:27       ` Christian Hoene
  2009-01-06  5:45         ` Marcel Holtmann
  2009-01-09 16:50       ` Siarhei Siamashka
  1 sibling, 1 reply; 20+ messages in thread
From: Christian Hoene @ 2009-01-06  5:27 UTC (permalink / raw)
  To: 'Marcel Holtmann', 'Siarhei Siamashka'; +Cc: linux-bluetooth

> Do we still need the high precision stuff? I want to cut down the number
> of ifdefs in the code as much as possible.

Yes, because it provides better audio quality.

Greetings

 Christian


* RE: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter
  2009-01-06  5:27       ` Christian Hoene
@ 2009-01-06  5:45         ` Marcel Holtmann
  2009-01-07  9:31           ` Siarhei Siamashka
  0 siblings, 1 reply; 20+ messages in thread
From: Marcel Holtmann @ 2009-01-06  5:45 UTC (permalink / raw)
  To: hoene; +Cc: 'Siarhei Siamashka', linux-bluetooth

Hi Christian,

> > Do we still need the high precision stuff? I want to cut down the number
> > of ifdefs in the code as much as possible.
> 
> Yes, because it provides better audio quality.

okay, but we have to make a choice in what we want. We can't just have
lots of ifdefs around. They will be killing us eventually. It is a
nightmare from a release engineering perspective.

What is the downside of doing high precision only?

Regards

Marcel




* Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter
  2009-01-06  5:45         ` Marcel Holtmann
@ 2009-01-07  9:31           ` Siarhei Siamashka
  0 siblings, 0 replies; 20+ messages in thread
From: Siarhei Siamashka @ 2009-01-07  9:31 UTC (permalink / raw)
  To: ext Marcel Holtmann; +Cc: hoene, linux-bluetooth

On Tuesday 06 January 2009 07:45:01 ext Marcel Holtmann wrote:
> Hi Christian,
>
> > > Do we still need the high precision stuff? I want to cut down the number
> > > of ifdefs in the code as much as possible.
> >
> > Yes, because it provides better audio quality.
>
> okay, but we have to make a choice in what we want. We can't just have
> lots of ifdefs around. They will be killing us eventually. It is a
> nightmare from a release engineering perspective.

That's a single ifdef, which was added for testing purposes. The analysis
filter code itself is flexible enough to work in both configurations, as the
shift constants are derived with the 'sizeof' operator. The original floating
point constants are also wrapped in macros, which expand to the needed
fixed point data type automagically.
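
As a rough illustration of the 'sizeof' driven scaling, here is a minimal
sketch modeled on the shape of the F_COS4 macro from sbc_tables.h in the
patch further down this thread. The names fixed_t_demo, fixed_a_demo,
FRAC_BITS_DEMO, F_DEMO and mul_demo are made up for this example, and the
real tables additionally account for SBC_FIXED_EXTRA_BITS, which is left out
here:

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical stand-ins for FIXED_T / FIXED_A. Switching them to
 * int32_t / int64_t would give the high precision variant without
 * touching the code below. */
typedef int16_t fixed_t_demo;
typedef int32_t fixed_a_demo;

/* The number of fractional bits follows from the storage type size */
#define FRAC_BITS_DEMO (sizeof(fixed_t_demo) * CHAR_BIT - 1)

/* Convert a floating point constant to fixed point. Adding 0.5 rounds
 * correctly for non-negative x only, so negative table entries have to
 * be written as -F_DEMO(x). */
#define F_DEMO(x) ((fixed_a_demo) ((x) * \
	((fixed_a_demo) 1 << FRAC_BITS_DEMO) + 0.5))

static const fixed_t_demo half_demo = F_DEMO(0.5);
static const fixed_t_demo cos_pi_4_demo = F_DEMO(0.7071067812);

/* Multiply a sample by a fixed point constant and scale the result back.
 * Right shift of a negative value is implementation-defined in C, which
 * is what the ASR macro in sbc_math.h guards against. */
fixed_a_demo mul_demo(int16_t sample, fixed_t_demo c)
{
	return ((fixed_a_demo) sample * c) >> FRAC_BITS_DEMO;
}
```

With the 16-bit types above, FRAC_BITS_DEMO evaluates to 15. The shift
amount and the constants follow the typedefs, so no separate set of scale
constants is needed per configuration.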

And as was discussed before, it is possible to have both fast and high
precision implementations compiled in at the same time. Something like having:

sbc_analysis_filter_template.h - with the tables and implementation of
analysis function as a static inline function, with a custom preprocessor
managed suffix for its name

And 'sbc_analysis_filter.c' having code like this:

#define SBC_HIGH_PRECISION
#define SB_ANALYSIS_FUNCTION_SUFFIX _hq
#include "sbc_analysis_filter_template.h"

#undef SBC_HIGH_PRECISION
#undef SB_ANALYSIS_FUNCTION_SUFFIX
#define SB_ANALYSIS_FUNCTION_SUFFIX _fast
#include "sbc_analysis_filter_template.h"

This double include will instantiate both implementations from the same
template. Or something like this. It does not increase source code size
much.
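
The name suffixing part of this idea can be sketched in a single file. This
is only an illustration under a few assumptions: SB_TEMPLATE_BODY stands in
for the contents of the proposed template header (so that the example does
not need a second file), SB_ACC_T stands in for whatever SBC_HIGH_PRECISION
would select inside the real template, and sb_dot4 is a made-up function
which is much simpler than the analysis filter:

```c
#include <stdint.h>

/* Two-level paste, so that the suffix macro gets expanded before ##
 * glues it onto the base name */
#define SB_PASTE2(a, b) a##b
#define SB_PASTE(a, b) SB_PASTE2(a, b)
#define SB_NAME(base) SB_PASTE(base, SB_ANALYSIS_FUNCTION_SUFFIX)

/* Stand-in for the template header. With the layout described above,
 * this code would live in sbc_analysis_filter_template.h and would be
 * pulled in by #include instead of by expanding a macro. */
#define SB_TEMPLATE_BODY \
	static inline SB_ACC_T SB_NAME(sb_dot4)(const int16_t *in, \
						const int16_t *c) \
	{ \
		SB_ACC_T acc = 0; \
		int i; \
		for (i = 0; i < 4; i++) \
			acc += (SB_ACC_T) in[i] * c[i]; \
		return acc; \
	}

/* First instantiation: wide accumulator, defines sb_dot4_hq() */
#define SB_ANALYSIS_FUNCTION_SUFFIX _hq
#define SB_ACC_T int64_t
SB_TEMPLATE_BODY

/* Second instantiation: narrow accumulator, defines sb_dot4_fast() */
#undef SB_ANALYSIS_FUNCTION_SUFFIX
#undef SB_ACC_T
#define SB_ANALYSIS_FUNCTION_SUFFIX _fast
#define SB_ACC_T int32_t
SB_TEMPLATE_BODY
```

After this, both sb_dot4_hq() and sb_dot4_fast() exist in the same
translation unit and can be selected at runtime, for example through a
function pointer.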

> What is the downside of doing high precision only?

Performance is a lot better for the 16-bit fixed point version because it can
benefit from the DSP/multimedia instruction set extensions of modern processors.
The performance difference can be seen when benchmarking an MMX enabled build
against a high precision build. The relative difference will get even bigger
after optimizing other parts of the SBC encoder.
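
To make the difference more concrete, here is a scalar C model of what a
single packed multiply-add instruction does with 16-bit data. It assumes the
MMX 'pmaddwd' instruction as the example (four signed 16x16 multiplications,
with adjacent products summed into two 32-bit results). The function names
are made up for this sketch and it is not code from the patch:

```c
#include <stdint.h>

/* Scalar model of one MMX 'pmaddwd': four 16x16->32 bit multiplications,
 * adjacent products are summed into two 32-bit lanes. The single corner
 * case where both factors of both products in a pair are -32768 would
 * overflow int32_t in this C model and is not handled here. */
void pmaddwd_model(const int16_t a[4], const int16_t b[4], int32_t out[2])
{
	out[0] = (int32_t) a[0] * b[0] + (int32_t) a[1] * b[1];
	out[1] = (int32_t) a[2] * b[2] + (int32_t) a[3] * b[3];
}

/* The same amount of work with 32-bit samples and constants needs four
 * widening 32x32->64 bit multiplications, which original MMX does not
 * provide as a packed operation */
void mac_pairs_hq_model(const int32_t a[4], const int32_t b[4],
			int64_t out[2])
{
	out[0] = (int64_t) a[0] * b[0] + (int64_t) a[1] * b[1];
	out[1] = (int64_t) a[2] * b[2] + (int64_t) a[3] * b[3];
}
```

The first function maps to one instruction on 64-bit MMX registers, while
the second one has to be composed from several scalar operations.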

-- 
Best regards,
Siarhei Siamashka


* Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter
  2009-01-06  2:49     ` Marcel Holtmann
  2009-01-06  5:27       ` Christian Hoene
@ 2009-01-09 16:50       ` Siarhei Siamashka
  2009-01-15 19:34         ` Siarhei Siamashka
  1 sibling, 1 reply; 20+ messages in thread
From: Siarhei Siamashka @ 2009-01-09 16:50 UTC (permalink / raw)
  To: ext Marcel Holtmann; +Cc: linux-bluetooth

On Tuesday 06 January 2009 04:49:06 ext Marcel Holtmann wrote:
> > The attached patch contains what I would consider to be a final variant.
> > MMX support is now complete. It works for both x86 and amd64, has runtime
> > autodetection of MMX availability, supports 4 and 8 subbands cases. I
> > also ensured that only original MMX instructions are used (and no SSE or
> > other later additions), so the code should work fine even on the old
> > Pentium1 MMX. New MMX optimized functions produce bit identical results
> > when compared with bluez-4.25 release.
> >
> > With this patch applied, new filtering functions are noticeably faster
> > than the old ones on x86 (so now they are both faster and have
> > better quality). Assembly optimizations for the other platforms can be
> > easily added too.
>
> can you re-base your patch against the latest tree and re-send the
> patch.

Yes, I will submit an updated SIMD optimizations patchset in a few days.


Best regards,
Siarhei Siamashka


* Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter
  2009-01-09 16:50       ` Siarhei Siamashka
@ 2009-01-15 19:34         ` Siarhei Siamashka
  2009-01-15 23:29           ` Marcel Holtmann
  0 siblings, 1 reply; 20+ messages in thread
From: Siarhei Siamashka @ 2009-01-15 19:34 UTC (permalink / raw)
  To: ext Marcel Holtmann; +Cc: linux-bluetooth

[-- Attachment #1: Type: text/plain, Size: 1851 bytes --]

On Friday 09 January 2009 18:50:54 ext Siarhei Siamashka wrote:
> On Tuesday 06 January 2009 04:49:06 ext Marcel Holtmann wrote:
> > > The attached patch contains what I would consider to be a final
> > > variant. MMX support is now complete. It works for both x86 and amd64,
> > > has runtime autodetection of MMX availability, supports 4 and 8
> > > subbands cases. I also ensured that only original MMX instructions are
> > > used (and no SSE or other later additions), so the code should work
> > > fine even on the old Pentium1 MMX. New MMX optimized functions produce
> > > bit identical results when compared with bluez-4.25 release.
> > >
> > > With this patch applied, new filtering functions are noticeably faster
> > > than the old ones on x86 (so now they are both faster and have
> > > better quality). Assembly optimizations for the other platforms can be
> > > easily added too.
> >
> > can you re-base your patch against the latest tree and re-send the
> > patch.
>
> Yes, I will submit an updated SIMD optimizations patchset in a few days.

Updated patches are attached.

The performance improvement when testing with the Big Buck Bunny soundtrack
varies between about 1.4x (4 subbands, MMX analysis filter, Intel Core2 CPU)
and 2x (8 subbands, NEON analysis filter, ARM Cortex-A8 CPU). Note that these
numbers are for the default bitpool setting (32) and no joint stereo, a
configuration which is quite sensitive to analysis filter performance.

The SIMD optimized code provides exactly the same output as the C version.

But even with this optimization done, there are still a lot more things
to improve. I'm going to work on input data permutation/endian
conversion/channel deinterleaving next. Scalefactor processing
can also be vectorized. Audio quality can still be improved by tweaking
the constant tables.


Best regards,
Siarhei Siamashka

[-- Attachment #2: 0001-SIMD-friendly-variant-of-SBC-encoder-analysis-filter.patch --]
[-- Type: text/x-diff, Size: 35467 bytes --]

>From 45aab0c1d41ec949a7db83d17ba1e1bb5093dfaf Mon Sep 17 00:00:00 2001
From: Siarhei Siamashka <siarhei.siamashka@nokia.com>
Date: Thu, 15 Jan 2009 19:11:23 +0200
Subject: [PATCH] SIMD-friendly variant of SBC encoder analysis filter

Added SIMD-friendly C implementation of SBC analysis filter (the
structure of code had to be changed a bit and constants in the
tables reordered). This code can be used as a reference for
developing platform specific SIMD optimizations. These functions
are put into a new file 'sbc_primitives.c', which is going to
contain all the basic stuff for SBC codec.
---
 sbc/Makefile.am      |    3 +-
 sbc/sbc.c            |  155 +-------------------
 sbc/sbc_math.h       |    2 -
 sbc/sbc_primitives.c |  401 ++++++++++++++++++++++++++++++++++++++++++++++++++
 sbc/sbc_primitives.h |   52 +++++++
 sbc/sbc_tables.h     |  250 +++++++++++++++++++++++++++++++-
 6 files changed, 703 insertions(+), 160 deletions(-)
 create mode 100644 sbc/sbc_primitives.c
 create mode 100644 sbc/sbc_primitives.h

diff --git a/sbc/Makefile.am b/sbc/Makefile.am
index c42f162..cd068e7 100644
--- a/sbc/Makefile.am
+++ b/sbc/Makefile.am
@@ -8,7 +8,8 @@ endif
 if SBC
 noinst_LTLIBRARIES = libsbc.la
 
-libsbc_la_SOURCES = sbc.h sbc.c sbc_math.h sbc_tables.h
+libsbc_la_SOURCES = sbc.h sbc.c sbc_math.h sbc_tables.h \
+	sbc_primitives.c
 
 libsbc_la_CFLAGS = -finline-functions -funswitch-loops -fgcse-after-reload
 
diff --git a/sbc/sbc.c b/sbc/sbc.c
index 651981f..534c935 100644
--- a/sbc/sbc.c
+++ b/sbc/sbc.c
@@ -46,6 +46,7 @@
 #include "sbc_tables.h"
 
 #include "sbc.h"
+#include "sbc_primitives.h"
 
 #define SBC_SYNCWORD	0x9C
 
@@ -91,16 +92,6 @@ struct sbc_decoder_state {
 	int offset[2][16];
 };
 
-struct sbc_encoder_state {
-	int subbands;
-	int position[2];
-	int16_t X[2][256];
-	void (*sbc_analyze_4b_4s)(int16_t *pcm, int16_t *x,
-				  int32_t *out, int out_stride);
-	void (*sbc_analyze_4b_8s)(int16_t *pcm, int16_t *x,
-				  int32_t *out, int out_stride);
-};
-
 /*
  * Calculates the CRC-8 of the first len bits in data
  */
@@ -653,146 +644,6 @@ static int sbc_synthesize_audio(struct sbc_decoder_state *state,
 	}
 }
 
-static inline void _sbc_analyze_four(const int16_t *in, int32_t *out)
-{
-	FIXED_A t1[4];
-	FIXED_T t2[4];
-	int i = 0, hop = 0;
-
-	/* rounding coefficient */
-	t1[0] = t1[1] = t1[2] = t1[3] =
-		(FIXED_A) 1 << (SBC_PROTO_FIXED4_SCALE - 1);
-
-	/* low pass polyphase filter */
-	for (hop = 0; hop < 40; hop += 8) {
-		t1[0] += (FIXED_A) in[hop] * _sbc_proto_fixed4[hop];
-		t1[1] += (FIXED_A) in[hop + 1] * _sbc_proto_fixed4[hop + 1];
-		t1[2] += (FIXED_A) in[hop + 2] * _sbc_proto_fixed4[hop + 2];
-		t1[1] += (FIXED_A) in[hop + 3] * _sbc_proto_fixed4[hop + 3];
-		t1[0] += (FIXED_A) in[hop + 4] * _sbc_proto_fixed4[hop + 4];
-		t1[3] += (FIXED_A) in[hop + 5] * _sbc_proto_fixed4[hop + 5];
-		t1[3] += (FIXED_A) in[hop + 7] * _sbc_proto_fixed4[hop + 7];
-	}
-
-	/* scaling */
-	t2[0] = t1[0] >> SBC_PROTO_FIXED4_SCALE;
-	t2[1] = t1[1] >> SBC_PROTO_FIXED4_SCALE;
-	t2[2] = t1[2] >> SBC_PROTO_FIXED4_SCALE;
-	t2[3] = t1[3] >> SBC_PROTO_FIXED4_SCALE;
-
-	/* do the cos transform */
-	for (i = 0, hop = 0; i < 4; hop += 8, i++) {
-		out[i] = ((FIXED_A) t2[0] * cos_table_fixed_4[0 + hop] +
-			  (FIXED_A) t2[1] * cos_table_fixed_4[1 + hop] +
-			  (FIXED_A) t2[2] * cos_table_fixed_4[2 + hop] +
-			  (FIXED_A) t2[3] * cos_table_fixed_4[5 + hop]) >>
-			(SBC_COS_TABLE_FIXED4_SCALE - SCALE_OUT_BITS);
-	}
-}
-
-static void sbc_analyze_4b_4s(int16_t *pcm, int16_t *x,
-			      int32_t *out, int out_stride)
-{
-	int i;
-
-	/* Input 4 x 4 Audio Samples */
-	for (i = 0; i < 16; i += 4) {
-		x[64 + i] = x[0 + i] = pcm[15 - i];
-		x[65 + i] = x[1 + i] = pcm[14 - i];
-		x[66 + i] = x[2 + i] = pcm[13 - i];
-		x[67 + i] = x[3 + i] = pcm[12 - i];
-	}
-
-	/* Analyze four blocks */
-	_sbc_analyze_four(x + 12, out);
-	out += out_stride;
-	_sbc_analyze_four(x + 8, out);
-	out += out_stride;
-	_sbc_analyze_four(x + 4, out);
-	out += out_stride;
-	_sbc_analyze_four(x, out);
-}
-
-static inline void _sbc_analyze_eight(const int16_t *in, int32_t *out)
-{
-	FIXED_A t1[8];
-	FIXED_T t2[8];
-	int i, hop;
-
-	/* rounding coefficient */
-	t1[0] = t1[1] = t1[2] = t1[3] = t1[4] = t1[5] = t1[6] = t1[7] =
-		(FIXED_A) 1 << (SBC_PROTO_FIXED8_SCALE-1);
-
-	/* low pass polyphase filter */
-	for (hop = 0; hop < 80; hop += 16) {
-		t1[0] += (FIXED_A) in[hop] * _sbc_proto_fixed8[hop];
-		t1[1] += (FIXED_A) in[hop + 1] * _sbc_proto_fixed8[hop + 1];
-		t1[2] += (FIXED_A) in[hop + 2] * _sbc_proto_fixed8[hop + 2];
-		t1[3] += (FIXED_A) in[hop + 3] * _sbc_proto_fixed8[hop + 3];
-		t1[4] += (FIXED_A) in[hop + 4] * _sbc_proto_fixed8[hop + 4];
-		t1[3] += (FIXED_A) in[hop + 5] * _sbc_proto_fixed8[hop + 5];
-		t1[2] += (FIXED_A) in[hop + 6] * _sbc_proto_fixed8[hop + 6];
-		t1[1] += (FIXED_A) in[hop + 7] * _sbc_proto_fixed8[hop + 7];
-		t1[0] += (FIXED_A) in[hop + 8] * _sbc_proto_fixed8[hop + 8];
-		t1[5] += (FIXED_A) in[hop + 9] * _sbc_proto_fixed8[hop + 9];
-		t1[6] += (FIXED_A) in[hop + 10] * _sbc_proto_fixed8[hop + 10];
-		t1[7] += (FIXED_A) in[hop + 11] * _sbc_proto_fixed8[hop + 11];
-		t1[7] += (FIXED_A) in[hop + 13] * _sbc_proto_fixed8[hop + 13];
-		t1[6] += (FIXED_A) in[hop + 14] * _sbc_proto_fixed8[hop + 14];
-		t1[5] += (FIXED_A) in[hop + 15] * _sbc_proto_fixed8[hop + 15];
-	}
-
-	/* scaling */
-	t2[0] = t1[0] >> SBC_PROTO_FIXED8_SCALE;
-	t2[1] = t1[1] >> SBC_PROTO_FIXED8_SCALE;
-	t2[2] = t1[2] >> SBC_PROTO_FIXED8_SCALE;
-	t2[3] = t1[3] >> SBC_PROTO_FIXED8_SCALE;
-	t2[4] = t1[4] >> SBC_PROTO_FIXED8_SCALE;
-	t2[5] = t1[5] >> SBC_PROTO_FIXED8_SCALE;
-	t2[6] = t1[6] >> SBC_PROTO_FIXED8_SCALE;
-	t2[7] = t1[7] >> SBC_PROTO_FIXED8_SCALE;
-
-	/* do the cos transform */
-	for (i = 0, hop = 0; i < 8; hop += 16, i++) {
-		out[i] = ((FIXED_A) t2[0] * cos_table_fixed_8[0 + hop] +
-			  (FIXED_A) t2[1] * cos_table_fixed_8[1 + hop] +
-			  (FIXED_A) t2[2] * cos_table_fixed_8[2 + hop] +
-			  (FIXED_A) t2[3] * cos_table_fixed_8[3 + hop] +
-			  (FIXED_A) t2[4] * cos_table_fixed_8[4 + hop] +
-			  (FIXED_A) t2[5] * cos_table_fixed_8[9 + hop] +
-			  (FIXED_A) t2[6] * cos_table_fixed_8[10 + hop] +
-			  (FIXED_A) t2[7] * cos_table_fixed_8[11 + hop]) >>
-			(SBC_COS_TABLE_FIXED8_SCALE - SCALE_OUT_BITS);
-	}
-}
-
-static void sbc_analyze_4b_8s(int16_t *pcm, int16_t *x,
-			      int32_t *out, int out_stride)
-{
-	int i;
-
-	/* Input 4 x 8 Audio Samples */
-	for (i = 0; i < 32; i += 8) {
-		x[128 + i] = x[0 + i] = pcm[31 - i];
-		x[129 + i] = x[1 + i] = pcm[30 - i];
-		x[130 + i] = x[2 + i] = pcm[29 - i];
-		x[131 + i] = x[3 + i] = pcm[28 - i];
-		x[132 + i] = x[4 + i] = pcm[27 - i];
-		x[133 + i] = x[5 + i] = pcm[26 - i];
-		x[134 + i] = x[6 + i] = pcm[25 - i];
-		x[135 + i] = x[7 + i] = pcm[24 - i];
-	}
-
-	/* Analyze four blocks */
-	_sbc_analyze_eight(x + 24, out);
-	out += out_stride;
-	_sbc_analyze_eight(x + 16, out);
-	out += out_stride;
-	_sbc_analyze_eight(x + 8, out);
-	out += out_stride;
-	_sbc_analyze_eight(x, out);
-}
-
 static int sbc_analyze_audio(struct sbc_encoder_state *state,
 				struct sbc_frame *frame)
 {
@@ -1056,9 +907,7 @@ static void sbc_encoder_init(struct sbc_encoder_state *state,
 	state->subbands = frame->subbands;
 	state->position[0] = state->position[1] = 12 * frame->subbands;
 
-	/* Default implementation for analyze function */
-	state->sbc_analyze_4b_4s = sbc_analyze_4b_4s;
-	state->sbc_analyze_4b_8s = sbc_analyze_4b_8s;
+	sbc_init_primitives(state);
 }
 
 struct sbc_priv {
diff --git a/sbc/sbc_math.h b/sbc/sbc_math.h
index 6ca4f52..b87bc81 100644
--- a/sbc/sbc_math.h
+++ b/sbc/sbc_math.h
@@ -29,8 +29,6 @@
 #define ASR(val, bits) ((-2 >> 1 == -1) ? \
 		 ((int32_t)(val)) >> (bits) : ((int32_t) (val)) / (1 << (bits)))
 
-#define SCALE_OUT_BITS 15
-
 #define SCALE_SPROTO4_TBL	12
 #define SCALE_SPROTO8_TBL	14
 #define SCALE_NPROTO4_TBL	11
diff --git a/sbc/sbc_primitives.c b/sbc/sbc_primitives.c
new file mode 100644
index 0000000..f2e75b4
--- /dev/null
+++ b/sbc/sbc_primitives.c
@@ -0,0 +1,401 @@
+/*
+ *
+ *  Bluetooth low-complexity, subband codec (SBC) library
+ *
+ *  Copyright (C) 2004-2009  Marcel Holtmann <marcel@holtmann.org>
+ *  Copyright (C) 2004-2005  Henryk Ploetz <henryk@ploetzli.ch>
+ *  Copyright (C) 2005-2006  Brad Midgley <bmidgley@xmission.com>
+ *
+ *
+ *  This library is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU Lesser General Public
+ *  License as published by the Free Software Foundation; either
+ *  version 2.1 of the License, or (at your option) any later version.
+ *
+ *  This library is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ *  Lesser General Public License for more details.
+ *
+ *  You should have received a copy of the GNU Lesser General Public
+ *  License along with this library; if not, write to the Free Software
+ *  Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ *
+ */
+
+#include <stdint.h>
+#include <limits.h>
+#include "sbc.h"
+#include "sbc_math.h"
+#include "sbc_tables.h"
+
+#include "sbc_primitives.h"
+
+/*
+ * A standard C code of analysis filter.
+ */
+static inline void sbc_analyze_four(const int16_t *in, int32_t *out)
+{
+	FIXED_A t1[4];
+	FIXED_T t2[4];
+	int i = 0, hop = 0;
+
+	/* rounding coefficient */
+	t1[0] = t1[1] = t1[2] = t1[3] =
+		(FIXED_A) 1 << (SBC_PROTO_FIXED4_SCALE - 1);
+
+	/* low pass polyphase filter */
+	for (hop = 0; hop < 40; hop += 8) {
+		t1[0] += (FIXED_A) in[hop] * _sbc_proto_fixed4[hop];
+		t1[1] += (FIXED_A) in[hop + 1] * _sbc_proto_fixed4[hop + 1];
+		t1[2] += (FIXED_A) in[hop + 2] * _sbc_proto_fixed4[hop + 2];
+		t1[1] += (FIXED_A) in[hop + 3] * _sbc_proto_fixed4[hop + 3];
+		t1[0] += (FIXED_A) in[hop + 4] * _sbc_proto_fixed4[hop + 4];
+		t1[3] += (FIXED_A) in[hop + 5] * _sbc_proto_fixed4[hop + 5];
+		t1[3] += (FIXED_A) in[hop + 7] * _sbc_proto_fixed4[hop + 7];
+	}
+
+	/* scaling */
+	t2[0] = t1[0] >> SBC_PROTO_FIXED4_SCALE;
+	t2[1] = t1[1] >> SBC_PROTO_FIXED4_SCALE;
+	t2[2] = t1[2] >> SBC_PROTO_FIXED4_SCALE;
+	t2[3] = t1[3] >> SBC_PROTO_FIXED4_SCALE;
+
+	/* do the cos transform */
+	for (i = 0, hop = 0; i < 4; hop += 8, i++) {
+		out[i] = ((FIXED_A) t2[0] * cos_table_fixed_4[0 + hop] +
+			  (FIXED_A) t2[1] * cos_table_fixed_4[1 + hop] +
+			  (FIXED_A) t2[2] * cos_table_fixed_4[2 + hop] +
+			  (FIXED_A) t2[3] * cos_table_fixed_4[5 + hop]) >>
+			(SBC_COS_TABLE_FIXED4_SCALE - SCALE_OUT_BITS);
+	}
+}
+
+static void sbc_analyze_4b_4s(int16_t *pcm, int16_t *x,
+			      int32_t *out, int out_stride)
+{
+	int i;
+
+	/* Input 4 x 4 Audio Samples */
+	for (i = 0; i < 16; i += 4) {
+		x[64 + i] = x[0 + i] = pcm[15 - i];
+		x[65 + i] = x[1 + i] = pcm[14 - i];
+		x[66 + i] = x[2 + i] = pcm[13 - i];
+		x[67 + i] = x[3 + i] = pcm[12 - i];
+	}
+
+	/* Analyze four blocks */
+	sbc_analyze_four(x + 12, out);
+	out += out_stride;
+	sbc_analyze_four(x + 8, out);
+	out += out_stride;
+	sbc_analyze_four(x + 4, out);
+	out += out_stride;
+	sbc_analyze_four(x, out);
+}
+
+static inline void sbc_analyze_eight(const int16_t *in, int32_t *out)
+{
+	FIXED_A t1[8];
+	FIXED_T t2[8];
+	int i, hop;
+
+	/* rounding coefficient */
+	t1[0] = t1[1] = t1[2] = t1[3] = t1[4] = t1[5] = t1[6] = t1[7] =
+		(FIXED_A) 1 << (SBC_PROTO_FIXED8_SCALE-1);
+
+	/* low pass polyphase filter */
+	for (hop = 0; hop < 80; hop += 16) {
+		t1[0] += (FIXED_A) in[hop] * _sbc_proto_fixed8[hop];
+		t1[1] += (FIXED_A) in[hop + 1] * _sbc_proto_fixed8[hop + 1];
+		t1[2] += (FIXED_A) in[hop + 2] * _sbc_proto_fixed8[hop + 2];
+		t1[3] += (FIXED_A) in[hop + 3] * _sbc_proto_fixed8[hop + 3];
+		t1[4] += (FIXED_A) in[hop + 4] * _sbc_proto_fixed8[hop + 4];
+		t1[3] += (FIXED_A) in[hop + 5] * _sbc_proto_fixed8[hop + 5];
+		t1[2] += (FIXED_A) in[hop + 6] * _sbc_proto_fixed8[hop + 6];
+		t1[1] += (FIXED_A) in[hop + 7] * _sbc_proto_fixed8[hop + 7];
+		t1[0] += (FIXED_A) in[hop + 8] * _sbc_proto_fixed8[hop + 8];
+		t1[5] += (FIXED_A) in[hop + 9] * _sbc_proto_fixed8[hop + 9];
+		t1[6] += (FIXED_A) in[hop + 10] * _sbc_proto_fixed8[hop + 10];
+		t1[7] += (FIXED_A) in[hop + 11] * _sbc_proto_fixed8[hop + 11];
+		t1[7] += (FIXED_A) in[hop + 13] * _sbc_proto_fixed8[hop + 13];
+		t1[6] += (FIXED_A) in[hop + 14] * _sbc_proto_fixed8[hop + 14];
+		t1[5] += (FIXED_A) in[hop + 15] * _sbc_proto_fixed8[hop + 15];
+	}
+
+	/* scaling */
+	t2[0] = t1[0] >> SBC_PROTO_FIXED8_SCALE;
+	t2[1] = t1[1] >> SBC_PROTO_FIXED8_SCALE;
+	t2[2] = t1[2] >> SBC_PROTO_FIXED8_SCALE;
+	t2[3] = t1[3] >> SBC_PROTO_FIXED8_SCALE;
+	t2[4] = t1[4] >> SBC_PROTO_FIXED8_SCALE;
+	t2[5] = t1[5] >> SBC_PROTO_FIXED8_SCALE;
+	t2[6] = t1[6] >> SBC_PROTO_FIXED8_SCALE;
+	t2[7] = t1[7] >> SBC_PROTO_FIXED8_SCALE;
+
+	/* do the cos transform */
+	for (i = 0, hop = 0; i < 8; hop += 16, i++) {
+		out[i] = ((FIXED_A) t2[0] * cos_table_fixed_8[0 + hop] +
+			  (FIXED_A) t2[1] * cos_table_fixed_8[1 + hop] +
+			  (FIXED_A) t2[2] * cos_table_fixed_8[2 + hop] +
+			  (FIXED_A) t2[3] * cos_table_fixed_8[3 + hop] +
+			  (FIXED_A) t2[4] * cos_table_fixed_8[4 + hop] +
+			  (FIXED_A) t2[5] * cos_table_fixed_8[9 + hop] +
+			  (FIXED_A) t2[6] * cos_table_fixed_8[10 + hop] +
+			  (FIXED_A) t2[7] * cos_table_fixed_8[11 + hop]) >>
+			(SBC_COS_TABLE_FIXED8_SCALE - SCALE_OUT_BITS);
+	}
+}
+
+static void sbc_analyze_4b_8s(int16_t *pcm, int16_t *x,
+			      int32_t *out, int out_stride)
+{
+	int i;
+
+	/* Input 4 x 8 Audio Samples */
+	for (i = 0; i < 32; i += 8) {
+		x[128 + i] = x[0 + i] = pcm[31 - i];
+		x[129 + i] = x[1 + i] = pcm[30 - i];
+		x[130 + i] = x[2 + i] = pcm[29 - i];
+		x[131 + i] = x[3 + i] = pcm[28 - i];
+		x[132 + i] = x[4 + i] = pcm[27 - i];
+		x[133 + i] = x[5 + i] = pcm[26 - i];
+		x[134 + i] = x[6 + i] = pcm[25 - i];
+		x[135 + i] = x[7 + i] = pcm[24 - i];
+	}
+
+	/* Analyze four blocks */
+	sbc_analyze_eight(x + 24, out);
+	out += out_stride;
+	sbc_analyze_eight(x + 16, out);
+	out += out_stride;
+	sbc_analyze_eight(x + 8, out);
+	out += out_stride;
+	sbc_analyze_eight(x, out);
+}
+
+/*
+ * A reference C code of analysis filter with SIMD-friendly tables
+ * reordering and code layout. This code can be used to develop platform
+ * specific SIMD optimizations. Also it may be used as some kind of test
+ * for compiler autovectorization capabilities (who knows, if the compiler
+ * is very good at this stuff, hand optimized assembly may be not strictly
+ * needed for some platform).
+ */
+
+static inline void sbc_analyze_four_simd(const int16_t *in, int32_t *out,
+					 const FIXED_T *consts)
+{
+	FIXED_A t1[4];
+	FIXED_T t2[4];
+	int hop = 0;
+
+	/* rounding coefficient */
+	t1[0] = t1[1] = t1[2] = t1[3] =
+		(FIXED_A) 1 << (SBC_PROTO_FIXED4_SCALE - 1);
+
+	/* low pass polyphase filter */
+	for (hop = 0; hop < 40; hop += 8) {
+		t1[0] += (FIXED_A) in[hop] * consts[hop];
+		t1[0] += (FIXED_A) in[hop + 1] * consts[hop + 1];
+		t1[1] += (FIXED_A) in[hop + 2] * consts[hop + 2];
+		t1[1] += (FIXED_A) in[hop + 3] * consts[hop + 3];
+		t1[2] += (FIXED_A) in[hop + 4] * consts[hop + 4];
+		t1[2] += (FIXED_A) in[hop + 5] * consts[hop + 5];
+		t1[3] += (FIXED_A) in[hop + 6] * consts[hop + 6];
+		t1[3] += (FIXED_A) in[hop + 7] * consts[hop + 7];
+	}
+
+	/* scaling */
+	t2[0] = t1[0] >> SBC_PROTO_FIXED4_SCALE;
+	t2[1] = t1[1] >> SBC_PROTO_FIXED4_SCALE;
+	t2[2] = t1[2] >> SBC_PROTO_FIXED4_SCALE;
+	t2[3] = t1[3] >> SBC_PROTO_FIXED4_SCALE;
+
+	/* do the cos transform */
+	t1[0]  = (FIXED_A) t2[0] * consts[40 + 0];
+	t1[0] += (FIXED_A) t2[1] * consts[40 + 1];
+	t1[1]  = (FIXED_A) t2[0] * consts[40 + 2];
+	t1[1] += (FIXED_A) t2[1] * consts[40 + 3];
+	t1[2]  = (FIXED_A) t2[0] * consts[40 + 4];
+	t1[2] += (FIXED_A) t2[1] * consts[40 + 5];
+	t1[3]  = (FIXED_A) t2[0] * consts[40 + 6];
+	t1[3] += (FIXED_A) t2[1] * consts[40 + 7];
+
+	t1[0] += (FIXED_A) t2[2] * consts[40 + 8];
+	t1[0] += (FIXED_A) t2[3] * consts[40 + 9];
+	t1[1] += (FIXED_A) t2[2] * consts[40 + 10];
+	t1[1] += (FIXED_A) t2[3] * consts[40 + 11];
+	t1[2] += (FIXED_A) t2[2] * consts[40 + 12];
+	t1[2] += (FIXED_A) t2[3] * consts[40 + 13];
+	t1[3] += (FIXED_A) t2[2] * consts[40 + 14];
+	t1[3] += (FIXED_A) t2[3] * consts[40 + 15];
+
+	out[0] = t1[0] >>
+		(SBC_COS_TABLE_FIXED4_SCALE - SCALE_OUT_BITS);
+	out[1] = t1[1] >>
+		(SBC_COS_TABLE_FIXED4_SCALE - SCALE_OUT_BITS);
+	out[2] = t1[2] >>
+		(SBC_COS_TABLE_FIXED4_SCALE - SCALE_OUT_BITS);
+	out[3] = t1[3] >>
+		(SBC_COS_TABLE_FIXED4_SCALE - SCALE_OUT_BITS);
+}
+
+static inline void sbc_analyze_eight_simd(const int16_t *in, int32_t *out,
+					  const FIXED_T *consts)
+{
+	FIXED_A t1[8];
+	FIXED_T t2[8];
+	int i, hop;
+
+	/* rounding coefficient */
+	t1[0] = t1[1] = t1[2] = t1[3] = t1[4] = t1[5] = t1[6] = t1[7] =
+		(FIXED_A) 1 << (SBC_PROTO_FIXED8_SCALE-1);
+
+	/* low pass polyphase filter */
+	for (hop = 0; hop < 80; hop += 16) {
+		t1[0] += (FIXED_A) in[hop] * consts[hop];
+		t1[0] += (FIXED_A) in[hop + 1] * consts[hop + 1];
+		t1[1] += (FIXED_A) in[hop + 2] * consts[hop + 2];
+		t1[1] += (FIXED_A) in[hop + 3] * consts[hop + 3];
+		t1[2] += (FIXED_A) in[hop + 4] * consts[hop + 4];
+		t1[2] += (FIXED_A) in[hop + 5] * consts[hop + 5];
+		t1[3] += (FIXED_A) in[hop + 6] * consts[hop + 6];
+		t1[3] += (FIXED_A) in[hop + 7] * consts[hop + 7];
+		t1[4] += (FIXED_A) in[hop + 8] * consts[hop + 8];
+		t1[4] += (FIXED_A) in[hop + 9] * consts[hop + 9];
+		t1[5] += (FIXED_A) in[hop + 10] * consts[hop + 10];
+		t1[5] += (FIXED_A) in[hop + 11] * consts[hop + 11];
+		t1[6] += (FIXED_A) in[hop + 12] * consts[hop + 12];
+		t1[6] += (FIXED_A) in[hop + 13] * consts[hop + 13];
+		t1[7] += (FIXED_A) in[hop + 14] * consts[hop + 14];
+		t1[7] += (FIXED_A) in[hop + 15] * consts[hop + 15];
+	}
+
+	/* scaling */
+	t2[0] = t1[0] >> SBC_PROTO_FIXED8_SCALE;
+	t2[1] = t1[1] >> SBC_PROTO_FIXED8_SCALE;
+	t2[2] = t1[2] >> SBC_PROTO_FIXED8_SCALE;
+	t2[3] = t1[3] >> SBC_PROTO_FIXED8_SCALE;
+	t2[4] = t1[4] >> SBC_PROTO_FIXED8_SCALE;
+	t2[5] = t1[5] >> SBC_PROTO_FIXED8_SCALE;
+	t2[6] = t1[6] >> SBC_PROTO_FIXED8_SCALE;
+	t2[7] = t1[7] >> SBC_PROTO_FIXED8_SCALE;
+
+
+	/* do the cos transform */
+	t1[0] = t1[1] = t1[2] = t1[3] = t1[4] = t1[5] = t1[6] = t1[7] = 0;
+
+	for (i = 0; i < 4; i++) {
+		t1[0] += (FIXED_A) t2[i * 2 + 0] * consts[80 + i * 16 + 0];
+		t1[0] += (FIXED_A) t2[i * 2 + 1] * consts[80 + i * 16 + 1];
+		t1[1] += (FIXED_A) t2[i * 2 + 0] * consts[80 + i * 16 + 2];
+		t1[1] += (FIXED_A) t2[i * 2 + 1] * consts[80 + i * 16 + 3];
+		t1[2] += (FIXED_A) t2[i * 2 + 0] * consts[80 + i * 16 + 4];
+		t1[2] += (FIXED_A) t2[i * 2 + 1] * consts[80 + i * 16 + 5];
+		t1[3] += (FIXED_A) t2[i * 2 + 0] * consts[80 + i * 16 + 6];
+		t1[3] += (FIXED_A) t2[i * 2 + 1] * consts[80 + i * 16 + 7];
+		t1[4] += (FIXED_A) t2[i * 2 + 0] * consts[80 + i * 16 + 8];
+		t1[4] += (FIXED_A) t2[i * 2 + 1] * consts[80 + i * 16 + 9];
+		t1[5] += (FIXED_A) t2[i * 2 + 0] * consts[80 + i * 16 + 10];
+		t1[5] += (FIXED_A) t2[i * 2 + 1] * consts[80 + i * 16 + 11];
+		t1[6] += (FIXED_A) t2[i * 2 + 0] * consts[80 + i * 16 + 12];
+		t1[6] += (FIXED_A) t2[i * 2 + 1] * consts[80 + i * 16 + 13];
+		t1[7] += (FIXED_A) t2[i * 2 + 0] * consts[80 + i * 16 + 14];
+		t1[7] += (FIXED_A) t2[i * 2 + 1] * consts[80 + i * 16 + 15];
+	}
+
+	for (i = 0; i < 8; i++)
+		out[i] = t1[i] >>
+			(SBC_COS_TABLE_FIXED8_SCALE - SCALE_OUT_BITS);
+}
+
+static inline void sbc_analyze_4b_4s_simd(int16_t *pcm, int16_t *x,
+					  int32_t *out, int out_stride)
+{
+	/* Fetch audio samples and do input data reordering for SIMD */
+	x[64] = x[0]  = pcm[8 + 7];
+	x[65] = x[1]  = pcm[8 + 3];
+	x[66] = x[2]  = pcm[8 + 6];
+	x[67] = x[3]  = pcm[8 + 4];
+	x[68] = x[4]  = pcm[8 + 0];
+	x[69] = x[5]  = pcm[8 + 2];
+	x[70] = x[6]  = pcm[8 + 1];
+	x[71] = x[7]  = pcm[8 + 5];
+
+	x[72] = x[8]  = pcm[0 + 7];
+	x[73] = x[9]  = pcm[0 + 3];
+	x[74] = x[10] = pcm[0 + 6];
+	x[75] = x[11] = pcm[0 + 4];
+	x[76] = x[12] = pcm[0 + 0];
+	x[77] = x[13] = pcm[0 + 2];
+	x[78] = x[14] = pcm[0 + 1];
+	x[79] = x[15] = pcm[0 + 5];
+
+	/* Analyze blocks */
+	sbc_analyze_four_simd(x + 12, out, analysis_consts_fixed4_simd_odd);
+	out += out_stride;
+	sbc_analyze_four_simd(x + 8, out, analysis_consts_fixed4_simd_even);
+	out += out_stride;
+	sbc_analyze_four_simd(x + 4, out, analysis_consts_fixed4_simd_odd);
+	out += out_stride;
+	sbc_analyze_four_simd(x + 0, out, analysis_consts_fixed4_simd_even);
+}
+
+static inline void sbc_analyze_4b_8s_simd(int16_t *pcm, int16_t *x,
+					  int32_t *out, int out_stride)
+{
+	/* Fetch audio samples and do input data reordering for SIMD */
+	x[128] = x[0]  = pcm[16 + 15];
+	x[129] = x[1]  = pcm[16 + 7];
+	x[130] = x[2]  = pcm[16 + 14];
+	x[131] = x[3]  = pcm[16 + 8];
+	x[132] = x[4]  = pcm[16 + 13];
+	x[133] = x[5]  = pcm[16 + 9];
+	x[134] = x[6]  = pcm[16 + 12];
+	x[135] = x[7]  = pcm[16 + 10];
+	x[136] = x[8]  = pcm[16 + 11];
+	x[137] = x[9]  = pcm[16 + 3];
+	x[138] = x[10] = pcm[16 + 6];
+	x[139] = x[11] = pcm[16 + 0];
+	x[140] = x[12] = pcm[16 + 5];
+	x[141] = x[13] = pcm[16 + 1];
+	x[142] = x[14] = pcm[16 + 4];
+	x[143] = x[15] = pcm[16 + 2];
+
+	x[144] = x[16] = pcm[0 + 15];
+	x[145] = x[17] = pcm[0 + 7];
+	x[146] = x[18] = pcm[0 + 14];
+	x[147] = x[19] = pcm[0 + 8];
+	x[148] = x[20] = pcm[0 + 13];
+	x[149] = x[21] = pcm[0 + 9];
+	x[150] = x[22] = pcm[0 + 12];
+	x[151] = x[23] = pcm[0 + 10];
+	x[152] = x[24] = pcm[0 + 11];
+	x[153] = x[25] = pcm[0 + 3];
+	x[154] = x[26] = pcm[0 + 6];
+	x[155] = x[27] = pcm[0 + 0];
+	x[156] = x[28] = pcm[0 + 5];
+	x[157] = x[29] = pcm[0 + 1];
+	x[158] = x[30] = pcm[0 + 4];
+	x[159] = x[31] = pcm[0 + 2];
+
+	/* Analyze blocks */
+	sbc_analyze_eight_simd(x + 24, out, analysis_consts_fixed8_simd_odd);
+	out += out_stride;
+	sbc_analyze_eight_simd(x + 16, out, analysis_consts_fixed8_simd_even);
+	out += out_stride;
+	sbc_analyze_eight_simd(x + 8, out, analysis_consts_fixed8_simd_odd);
+	out += out_stride;
+	sbc_analyze_eight_simd(x + 0, out, analysis_consts_fixed8_simd_even);
+}
+
+/*
+ * Detect CPU features and setup function pointers
+ */
+void sbc_init_primitives(struct sbc_encoder_state *state)
+{
+	/* Default implementation for analyze functions */
+	state->sbc_analyze_4b_4s = sbc_analyze_4b_4s;
+	state->sbc_analyze_4b_8s = sbc_analyze_4b_8s;
+}
diff --git a/sbc/sbc_primitives.h b/sbc/sbc_primitives.h
new file mode 100644
index 0000000..ca1ec27
--- /dev/null
+++ b/sbc/sbc_primitives.h
@@ -0,0 +1,52 @@
+/*
+ *
+ *  Bluetooth low-complexity, subband codec (SBC) library
+ *
+ *  Copyright (C) 2004-2009  Marcel Holtmann <marcel@holtmann.org>
+ *  Copyright (C) 2004-2005  Henryk Ploetz <henryk@ploetzli.ch>
+ *  Copyright (C) 2005-2006  Brad Midgley <bmidgley@xmission.com>
+ *
+ *
+ *  This library is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU Lesser General Public
+ *  License as published by the Free Software Foundation; either
+ *  version 2.1 of the License, or (at your option) any later version.
+ *
+ *  This library is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ *  Lesser General Public License for more details.
+ *
+ *  You should have received a copy of the GNU Lesser General Public
+ *  License along with this library; if not, write to the Free Software
+ *  Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ *
+ */
+
+#ifndef __SBC_PRIMITIVES_H
+#define __SBC_PRIMITIVES_H
+
+#define SCALE_OUT_BITS 15
+
+struct sbc_encoder_state {
+	int subbands;
+	int position[2];
+	int16_t X[2][256];
+	/* Polyphase analysis filter for 4 subbands configuration,
+	   it handles 4 blocks at once */
+	void (*sbc_analyze_4b_4s)(int16_t *pcm, int16_t *x,
+				  int32_t *out, int out_stride);
+	/* Polyphase analysis filter for 8 subbands configuration,
+	   it handles 4 blocks at once */
+	void (*sbc_analyze_4b_8s)(int16_t *pcm, int16_t *x,
+				  int32_t *out, int out_stride);
+};
+
+/*
+ * Initialize pointers to the functions which are the basic "building bricks"
+ * of SBC codec. Best implementation is selected based on target CPU
+ * capabilities.
+ */
+void sbc_init_primitives(struct sbc_encoder_state *encoder_state);
+
+#endif
diff --git a/sbc/sbc_tables.h b/sbc/sbc_tables.h
index f1dfe6c..a9a995f 100644
--- a/sbc/sbc_tables.h
+++ b/sbc/sbc_tables.h
@@ -157,8 +157,9 @@ static const int32_t synmatrix8[16][8] = {
  */
 #define SBC_PROTO_FIXED4_SCALE \
 	((sizeof(FIXED_T) * CHAR_BIT - 1) - SBC_FIXED_EXTRA_BITS + 1)
-#define F(x) (FIXED_A) ((x * 2) * \
+#define F_PROTO4(x) (FIXED_A) ((x * 2) * \
 	((FIXED_A) 1 << (sizeof(FIXED_T) * CHAR_BIT - 1)) + 0.5)
+#define F(x) F_PROTO4(x)
 static const FIXED_T _sbc_proto_fixed4[40] = {
 	 F(0.00000000E+00),  F(5.36548976E-04),
 	-F(1.49188357E-03),  F(2.73370904E-03),
@@ -206,8 +207,9 @@ static const FIXED_T _sbc_proto_fixed4[40] = {
  */
 #define SBC_COS_TABLE_FIXED4_SCALE \
 	((sizeof(FIXED_T) * CHAR_BIT - 1) + SBC_FIXED_EXTRA_BITS)
-#define F(x) (FIXED_A) ((x) * \
+#define F_COS4(x) (FIXED_A) ((x) * \
 	((FIXED_A) 1 << (sizeof(FIXED_T) * CHAR_BIT - 1)) + 0.5)
+#define F(x) F_COS4(x)
 static const FIXED_T cos_table_fixed_4[32] = {
 	 F(0.7071067812),  F(0.9238795325), -F(1.0000000000),  F(0.9238795325),
 	 F(0.7071067812),  F(0.3826834324),  F(0.0000000000),  F(0.3826834324),
@@ -233,8 +235,9 @@ static const FIXED_T cos_table_fixed_4[32] = {
  */
 #define SBC_PROTO_FIXED8_SCALE \
 	((sizeof(FIXED_T) * CHAR_BIT - 1) - SBC_FIXED_EXTRA_BITS + 2)
-#define F(x) (FIXED_A) ((x * 4) * \
+#define F_PROTO8(x) (FIXED_A) ((x * 4) * \
 	((FIXED_A) 1 << (sizeof(FIXED_T) * CHAR_BIT - 1)) + 0.5)
+#define F(x) F_PROTO8(x)
 static const FIXED_T _sbc_proto_fixed8[80] = {
 	 F(0.00000000E+00),  F(1.56575398E-04),
 	 F(3.43256425E-04),  F(5.54620202E-04),
@@ -301,8 +304,9 @@ static const FIXED_T _sbc_proto_fixed8[80] = {
  */
 #define SBC_COS_TABLE_FIXED8_SCALE \
 	((sizeof(FIXED_T) * CHAR_BIT - 1) + SBC_FIXED_EXTRA_BITS)
-#define F(x) (FIXED_A) ((x) * \
+#define F_COS8(x) (FIXED_A) ((x) * \
 	((FIXED_A) 1 << (sizeof(FIXED_T) * CHAR_BIT - 1)) + 0.5)
+#define F(x) F_COS8(x)
 static const FIXED_T cos_table_fixed_8[128] = {
 	 F(0.7071067812),  F(0.8314696123),  F(0.9238795325),  F(0.9807852804),
 	-F(1.0000000000),  F(0.9807852804),  F(0.9238795325),  F(0.8314696123),
@@ -345,3 +349,241 @@ static const FIXED_T cos_table_fixed_8[128] = {
 	-F(0.0000000000), -F(0.1950903220),  F(0.3826834324), -F(0.5555702330),
 };
 #undef F
+
+/*
+ * Constant tables for the use in SIMD optimized analysis filters
+ * Each table consists of two parts:
+ * 1. reordered "proto" table
+ * 2. reordered "cos" table
+ *
+ * Due to non-symmetrical reordering, separate tables for "even"
+ * and "odd" cases are needed
+ */
+
+static const FIXED_T analysis_consts_fixed4_simd_even[40 + 16] = {
+#define F(x) F_PROTO4(x)
+	 F(0.00000000E+00),  F(3.83720193E-03),
+	 F(5.36548976E-04),  F(2.73370904E-03),
+	 F(3.06012286E-03),  F(3.89205149E-03),
+	 F(0.00000000E+00), -F(1.49188357E-03),
+	 F(1.09137620E-02),  F(2.58767811E-02),
+	 F(2.04385087E-02),  F(3.21939290E-02),
+	 F(7.76463494E-02),  F(6.13245186E-03),
+	 F(0.00000000E+00), -F(2.88757392E-02),
+	 F(1.35593274E-01),  F(2.94315332E-01),
+	 F(1.94987841E-01),  F(2.81828203E-01),
+	-F(1.94987841E-01),  F(2.81828203E-01),
+	 F(0.00000000E+00), -F(2.46636662E-01),
+	-F(1.35593274E-01),  F(2.58767811E-02),
+	-F(7.76463494E-02),  F(6.13245186E-03),
+	-F(2.04385087E-02),  F(3.21939290E-02),
+	 F(0.00000000E+00),  F(2.88217274E-02),
+	-F(1.09137620E-02),  F(3.83720193E-03),
+	-F(3.06012286E-03),  F(3.89205149E-03),
+	-F(5.36548976E-04),  F(2.73370904E-03),
+	 F(0.00000000E+00), -F(1.86581691E-03),
+#undef F
+#define F(x) F_COS4(x)
+	 F(0.7071067812),  F(0.9238795325),
+	-F(0.7071067812),  F(0.3826834324),
+	-F(0.7071067812), -F(0.3826834324),
+	 F(0.7071067812), -F(0.9238795325),
+	 F(0.3826834324), -F(1.0000000000),
+	-F(0.9238795325), -F(1.0000000000),
+	 F(0.9238795325), -F(1.0000000000),
+	-F(0.3826834324), -F(1.0000000000),
+#undef F
+};
+
+static const FIXED_T analysis_consts_fixed4_simd_odd[40 + 16] = {
+#define F(x) F_PROTO4(x)
+	 F(2.73370904E-03),  F(5.36548976E-04),
+	-F(1.49188357E-03),  F(0.00000000E+00),
+	 F(3.83720193E-03),  F(1.09137620E-02),
+	 F(3.89205149E-03),  F(3.06012286E-03),
+	 F(3.21939290E-02),  F(2.04385087E-02),
+	-F(2.88757392E-02),  F(0.00000000E+00),
+	 F(2.58767811E-02),  F(1.35593274E-01),
+	 F(6.13245186E-03),  F(7.76463494E-02),
+	 F(2.81828203E-01),  F(1.94987841E-01),
+	-F(2.46636662E-01),  F(0.00000000E+00),
+	 F(2.94315332E-01), -F(1.35593274E-01),
+	 F(2.81828203E-01), -F(1.94987841E-01),
+	 F(6.13245186E-03), -F(7.76463494E-02),
+	 F(2.88217274E-02),  F(0.00000000E+00),
+	 F(2.58767811E-02), -F(1.09137620E-02),
+	 F(3.21939290E-02), -F(2.04385087E-02),
+	 F(3.89205149E-03), -F(3.06012286E-03),
+	-F(1.86581691E-03),  F(0.00000000E+00),
+	 F(3.83720193E-03),  F(0.00000000E+00),
+	 F(2.73370904E-03), -F(5.36548976E-04),
+#undef F
+#define F(x) F_COS4(x)
+	 F(0.9238795325), -F(1.0000000000),
+	 F(0.3826834324), -F(1.0000000000),
+	-F(0.3826834324), -F(1.0000000000),
+	-F(0.9238795325), -F(1.0000000000),
+	 F(0.7071067812),  F(0.3826834324),
+	-F(0.7071067812), -F(0.9238795325),
+	-F(0.7071067812),  F(0.9238795325),
+	 F(0.7071067812), -F(0.3826834324),
+#undef F
+};
+
+static const FIXED_T analysis_consts_fixed8_simd_even[80 + 64] = {
+#define F(x) F_PROTO8(x)
+	 F(0.00000000E+00),  F(2.01182542E-03),
+	 F(1.56575398E-04),  F(1.78371725E-03),
+	 F(3.43256425E-04),  F(1.47640169E-03),
+	 F(5.54620202E-04),  F(1.13992507E-03),
+	-F(8.23919506E-04),  F(0.00000000E+00),
+	 F(2.10371989E-03),  F(3.49717454E-03),
+	 F(1.99454554E-03),  F(1.64973098E-03),
+	 F(1.61656283E-03),  F(1.78805361E-04),
+	 F(5.65949473E-03),  F(1.29371806E-02),
+	 F(8.02941163E-03),  F(1.53184106E-02),
+	 F(1.04584443E-02),  F(1.62208471E-02),
+	 F(1.27472335E-02),  F(1.59045603E-02),
+	-F(1.46525263E-02),  F(0.00000000E+00),
+	 F(8.85757540E-03),  F(5.31873032E-02),
+	 F(2.92408442E-03),  F(3.90751381E-02),
+	-F(4.91578024E-03),  F(2.61098752E-02),
+	 F(6.79989431E-02),  F(1.46955068E-01),
+	 F(8.29847578E-02),  F(1.45389847E-01),
+	 F(9.75753918E-02),  F(1.40753505E-01),
+	 F(1.11196689E-01),  F(1.33264415E-01),
+	-F(1.23264548E-01),  F(0.00000000E+00),
+	 F(1.45389847E-01), -F(8.29847578E-02),
+	 F(1.40753505E-01), -F(9.75753918E-02),
+	 F(1.33264415E-01), -F(1.11196689E-01),
+	-F(6.79989431E-02),  F(1.29371806E-02),
+	-F(5.31873032E-02),  F(8.85757540E-03),
+	-F(3.90751381E-02),  F(2.92408442E-03),
+	-F(2.61098752E-02), -F(4.91578024E-03),
+	 F(1.46404076E-02),  F(0.00000000E+00),
+	 F(1.53184106E-02), -F(8.02941163E-03),
+	 F(1.62208471E-02), -F(1.04584443E-02),
+	 F(1.59045603E-02), -F(1.27472335E-02),
+	-F(5.65949473E-03),  F(2.01182542E-03),
+	-F(3.49717454E-03),  F(2.10371989E-03),
+	-F(1.64973098E-03),  F(1.99454554E-03),
+	-F(1.78805361E-04),  F(1.61656283E-03),
+	-F(9.02154502E-04),  F(0.00000000E+00),
+	 F(1.78371725E-03), -F(1.56575398E-04),
+	 F(1.47640169E-03), -F(3.43256425E-04),
+	 F(1.13992507E-03), -F(5.54620202E-04),
+#undef F
+#define F(x) F_COS8(x)
+	 F(0.7071067812),  F(0.8314696123),
+	-F(0.7071067812), -F(0.1950903220),
+	-F(0.7071067812), -F(0.9807852804),
+	 F(0.7071067812), -F(0.5555702330),
+	 F(0.7071067812),  F(0.5555702330),
+	-F(0.7071067812),  F(0.9807852804),
+	-F(0.7071067812),  F(0.1950903220),
+	 F(0.7071067812), -F(0.8314696123),
+	 F(0.9238795325),  F(0.9807852804),
+	 F(0.3826834324),  F(0.8314696123),
+	-F(0.3826834324),  F(0.5555702330),
+	-F(0.9238795325),  F(0.1950903220),
+	-F(0.9238795325), -F(0.1950903220),
+	-F(0.3826834324), -F(0.5555702330),
+	 F(0.3826834324), -F(0.8314696123),
+	 F(0.9238795325), -F(0.9807852804),
+	-F(1.0000000000),  F(0.5555702330),
+	-F(1.0000000000), -F(0.9807852804),
+	-F(1.0000000000),  F(0.1950903220),
+	-F(1.0000000000),  F(0.8314696123),
+	-F(1.0000000000), -F(0.8314696123),
+	-F(1.0000000000), -F(0.1950903220),
+	-F(1.0000000000),  F(0.9807852804),
+	-F(1.0000000000), -F(0.5555702330),
+	 F(0.3826834324),  F(0.1950903220),
+	-F(0.9238795325), -F(0.5555702330),
+	 F(0.9238795325),  F(0.8314696123),
+	-F(0.3826834324), -F(0.9807852804),
+	-F(0.3826834324),  F(0.9807852804),
+	 F(0.9238795325), -F(0.8314696123),
+	-F(0.9238795325),  F(0.5555702330),
+	 F(0.3826834324), -F(0.1950903220),
+#undef F
+};
+
+static const FIXED_T analysis_consts_fixed8_simd_odd[80 + 64] = {
+#define F(x) F_PROTO8(x)
+	 F(0.00000000E+00), -F(8.23919506E-04),
+	 F(1.56575398E-04),  F(1.78371725E-03),
+	 F(3.43256425E-04),  F(1.47640169E-03),
+	 F(5.54620202E-04),  F(1.13992507E-03),
+	 F(2.01182542E-03),  F(5.65949473E-03),
+	 F(2.10371989E-03),  F(3.49717454E-03),
+	 F(1.99454554E-03),  F(1.64973098E-03),
+	 F(1.61656283E-03),  F(1.78805361E-04),
+	 F(0.00000000E+00), -F(1.46525263E-02),
+	 F(8.02941163E-03),  F(1.53184106E-02),
+	 F(1.04584443E-02),  F(1.62208471E-02),
+	 F(1.27472335E-02),  F(1.59045603E-02),
+	 F(1.29371806E-02),  F(6.79989431E-02),
+	 F(8.85757540E-03),  F(5.31873032E-02),
+	 F(2.92408442E-03),  F(3.90751381E-02),
+	-F(4.91578024E-03),  F(2.61098752E-02),
+	 F(0.00000000E+00), -F(1.23264548E-01),
+	 F(8.29847578E-02),  F(1.45389847E-01),
+	 F(9.75753918E-02),  F(1.40753505E-01),
+	 F(1.11196689E-01),  F(1.33264415E-01),
+	 F(1.46955068E-01), -F(6.79989431E-02),
+	 F(1.45389847E-01), -F(8.29847578E-02),
+	 F(1.40753505E-01), -F(9.75753918E-02),
+	 F(1.33264415E-01), -F(1.11196689E-01),
+	 F(0.00000000E+00),  F(1.46404076E-02),
+	-F(5.31873032E-02),  F(8.85757540E-03),
+	-F(3.90751381E-02),  F(2.92408442E-03),
+	-F(2.61098752E-02), -F(4.91578024E-03),
+	 F(1.29371806E-02), -F(5.65949473E-03),
+	 F(1.53184106E-02), -F(8.02941163E-03),
+	 F(1.62208471E-02), -F(1.04584443E-02),
+	 F(1.59045603E-02), -F(1.27472335E-02),
+	 F(0.00000000E+00), -F(9.02154502E-04),
+	-F(3.49717454E-03),  F(2.10371989E-03),
+	-F(1.64973098E-03),  F(1.99454554E-03),
+	-F(1.78805361E-04),  F(1.61656283E-03),
+	 F(2.01182542E-03),  F(0.00000000E+00),
+	 F(1.78371725E-03), -F(1.56575398E-04),
+	 F(1.47640169E-03), -F(3.43256425E-04),
+	 F(1.13992507E-03), -F(5.54620202E-04),
+#undef F
+#define F(x) F_COS8(x)
+	-F(1.0000000000),  F(0.8314696123),
+	-F(1.0000000000), -F(0.1950903220),
+	-F(1.0000000000), -F(0.9807852804),
+	-F(1.0000000000), -F(0.5555702330),
+	-F(1.0000000000),  F(0.5555702330),
+	-F(1.0000000000),  F(0.9807852804),
+	-F(1.0000000000),  F(0.1950903220),
+	-F(1.0000000000), -F(0.8314696123),
+	 F(0.9238795325),  F(0.9807852804),
+	 F(0.3826834324),  F(0.8314696123),
+	-F(0.3826834324),  F(0.5555702330),
+	-F(0.9238795325),  F(0.1950903220),
+	-F(0.9238795325), -F(0.1950903220),
+	-F(0.3826834324), -F(0.5555702330),
+	 F(0.3826834324), -F(0.8314696123),
+	 F(0.9238795325), -F(0.9807852804),
+	 F(0.7071067812),  F(0.5555702330),
+	-F(0.7071067812), -F(0.9807852804),
+	-F(0.7071067812),  F(0.1950903220),
+	 F(0.7071067812),  F(0.8314696123),
+	 F(0.7071067812), -F(0.8314696123),
+	-F(0.7071067812), -F(0.1950903220),
+	-F(0.7071067812),  F(0.9807852804),
+	 F(0.7071067812), -F(0.5555702330),
+	 F(0.3826834324),  F(0.1950903220),
+	-F(0.9238795325), -F(0.5555702330),
+	 F(0.9238795325),  F(0.8314696123),
+	-F(0.3826834324), -F(0.9807852804),
+	-F(0.3826834324),  F(0.9807852804),
+	 F(0.9238795325), -F(0.8314696123),
+	-F(0.9238795325),  F(0.5555702330),
+	 F(0.3826834324), -F(0.1950903220),
+#undef F
+};
-- 
1.5.6.5
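
The F_PROTO4/F_COS4/F_PROTO8/F_COS8 macros in the patch above turn
floating-point constants into fixed-point table entries. A minimal sketch of
that arithmetic follows, assuming FIXED_T is int16_t and FIXED_A is int32_t
(the real definitions are not part of this patch); the names fixed_t, fixed_a,
to_fixed and to_fixed_proto4 are hypothetical and exist only for illustration.

```c
#include <limits.h>
#include <stdint.h>

typedef int16_t fixed_t;	/* assumed stand-in for FIXED_T */
typedef int32_t fixed_a;	/* assumed stand-in for FIXED_A */

/*
 * Same arithmetic as F_COS4()/F_COS8(): scale by 2^(bits - 1), add 0.5
 * and truncate. The tables apply the sign outside the macro, so this
 * helper is only meant for non-negative arguments.
 */
static fixed_a to_fixed(double x)
{
	return (fixed_a) (x *
		((fixed_a) 1 << (sizeof(fixed_t) * CHAR_BIT - 1)) + 0.5);
}

/* Same arithmetic as F_PROTO4(): the constant is doubled before scaling */
static fixed_a to_fixed_proto4(double x)
{
	return to_fixed(x * 2);
}
```

Under these assumptions to_fixed(1.0) yields 32768, which does not fit into
int16_t, while the negated value -32768 does. The tables in the patch only
ever use -F(1.0000000000), which is consistent with that limit.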


[-- Attachment #3: 0002-SBC-arrays-and-constant-tables-aligned-at-16-byte-bo.patch --]
[-- Type: text/x-diff, Size: 5257 bytes --]

From 7e96a2769e8559fc8b90acfa5671029b75254fa5 Mon Sep 17 00:00:00 2001
From: Siarhei Siamashka <siarhei.siamashka@nokia.com>
Date: Thu, 15 Jan 2009 19:45:36 +0200
Subject: [PATCH] SBC arrays and constant tables aligned at 16 byte boundary for SIMD

Most SIMD instruction sets benefit from naturally aligned data. Even
where alignment is not strictly required, performance is usually better
with aligned data. ARM NEON and SSE2 have different instruction variants
for aligned and unaligned memory accesses.
---
 sbc/sbc.c            |   26 ++++++++++++++++----------
 sbc/sbc.h            |    1 +
 sbc/sbc_primitives.h |    2 +-
 sbc/sbc_tables.h     |   22 ++++++++++++++++++----
 4 files changed, 36 insertions(+), 15 deletions(-)

diff --git a/sbc/sbc.c b/sbc/sbc.c
index 534c935..0699ae0 100644
--- a/sbc/sbc.c
+++ b/sbc/sbc.c
@@ -80,10 +80,13 @@ struct sbc_frame {
 	uint8_t scale_factor[2][8];
 
 	/* raw integer subband samples in the frame */
+	int32_t SBC_ALIGNED sb_sample_f[16][2][8];
 
-	int32_t sb_sample_f[16][2][8];
-	int32_t sb_sample[16][2][8];	/* modified subband samples */
-	int16_t pcm_sample[2][16*8];	/* original pcm audio samples */
+	/* modified subband samples */
+	int32_t SBC_ALIGNED sb_sample[16][2][8];
+
+	/* original pcm audio samples */
+	int16_t SBC_ALIGNED pcm_sample[2][16*8];
 };
 
 struct sbc_decoder_state {
@@ -912,9 +915,9 @@ static void sbc_encoder_init(struct sbc_encoder_state *state,
 
 struct sbc_priv {
 	int init;
-	struct sbc_frame frame;
-	struct sbc_decoder_state dec_state;
-	struct sbc_encoder_state enc_state;
+	struct SBC_ALIGNED sbc_frame frame;
+	struct SBC_ALIGNED sbc_decoder_state dec_state;
+	struct SBC_ALIGNED sbc_encoder_state enc_state;
 };
 
 static void sbc_set_defaults(sbc_t *sbc, unsigned long flags)
@@ -940,10 +943,13 @@ int sbc_init(sbc_t *sbc, unsigned long flags)
 
 	memset(sbc, 0, sizeof(sbc_t));
 
-	sbc->priv = malloc(sizeof(struct sbc_priv));
-	if (!sbc->priv)
+	sbc->priv_alloc_base = malloc(sizeof(struct sbc_priv) + SBC_ALIGN_MASK);
+	if (!sbc->priv_alloc_base)
 		return -ENOMEM;
 
+	sbc->priv = (void *) (((uintptr_t) sbc->priv_alloc_base +
+			SBC_ALIGN_MASK) & ~((uintptr_t) SBC_ALIGN_MASK));
+
 	memset(sbc->priv, 0, sizeof(struct sbc_priv));
 
 	sbc_set_defaults(sbc, flags);
@@ -1091,8 +1097,8 @@ void sbc_finish(sbc_t *sbc)
 	if (!sbc)
 		return;
 
-	if (sbc->priv)
-		free(sbc->priv);
+	if (sbc->priv_alloc_base)
+		free(sbc->priv_alloc_base);
 
 	memset(sbc, 0, sizeof(sbc_t));
 }
diff --git a/sbc/sbc.h b/sbc/sbc.h
index 8ac5930..b0a1488 100644
--- a/sbc/sbc.h
+++ b/sbc/sbc.h
@@ -74,6 +74,7 @@ struct sbc_struct {
 	uint8_t endian;
 
 	void *priv;
+	void *priv_alloc_base;
 };
 
 typedef struct sbc_struct sbc_t;
diff --git a/sbc/sbc_primitives.h b/sbc/sbc_primitives.h
index ca1ec27..a8b3df6 100644
--- a/sbc/sbc_primitives.h
+++ b/sbc/sbc_primitives.h
@@ -31,7 +31,7 @@
 struct sbc_encoder_state {
 	int subbands;
 	int position[2];
-	int16_t X[2][256];
+	int16_t SBC_ALIGNED X[2][256];
 	/* Polyphase analysis filter for 4 subbands configuration,
 	   it handles 4 blocks at once */
 	void (*sbc_analyze_4b_4s)(int16_t *pcm, int16_t *x,
diff --git a/sbc/sbc_tables.h b/sbc/sbc_tables.h
index a9a995f..7c2af07 100644
--- a/sbc/sbc_tables.h
+++ b/sbc/sbc_tables.h
@@ -351,6 +351,20 @@ static const FIXED_T cos_table_fixed_8[128] = {
 #undef F
 
 /*
+ * Enforce 16-byte alignment for the data that is used by
+ * SIMD optimized code.
+ */
+
+#define SBC_ALIGN_BITS 4
+#define SBC_ALIGN_MASK ((1 << (SBC_ALIGN_BITS)) - 1)
+
+#ifdef __GNUC__
+#define SBC_ALIGNED __attribute__((aligned(1 << (SBC_ALIGN_BITS))))
+#else
+#define SBC_ALIGNED
+#endif
+
+/*
  * Constant tables for the use in SIMD optimized analysis filters
  * Each table consists of two parts:
  * 1. reordered "proto" table
@@ -360,7 +374,7 @@ static const FIXED_T cos_table_fixed_8[128] = {
  * and "odd" cases are needed
  */
 
-static const FIXED_T analysis_consts_fixed4_simd_even[40 + 16] = {
+static const FIXED_T SBC_ALIGNED analysis_consts_fixed4_simd_even[40 + 16] = {
 #define F(x) F_PROTO4(x)
 	 F(0.00000000E+00),  F(3.83720193E-03),
 	 F(5.36548976E-04),  F(2.73370904E-03),
@@ -395,7 +409,7 @@ static const FIXED_T analysis_consts_fixed4_simd_even[40 + 16] = {
 #undef F
 };
 
-static const FIXED_T analysis_consts_fixed4_simd_odd[40 + 16] = {
+static const FIXED_T SBC_ALIGNED analysis_consts_fixed4_simd_odd[40 + 16] = {
 #define F(x) F_PROTO4(x)
 	 F(2.73370904E-03),  F(5.36548976E-04),
 	-F(1.49188357E-03),  F(0.00000000E+00),
@@ -430,7 +444,7 @@ static const FIXED_T analysis_consts_fixed4_simd_odd[40 + 16] = {
 #undef F
 };
 
-static const FIXED_T analysis_consts_fixed8_simd_even[80 + 64] = {
+static const FIXED_T SBC_ALIGNED analysis_consts_fixed8_simd_even[80 + 64] = {
 #define F(x) F_PROTO8(x)
 	 F(0.00000000E+00),  F(2.01182542E-03),
 	 F(1.56575398E-04),  F(1.78371725E-03),
@@ -509,7 +523,7 @@ static const FIXED_T analysis_consts_fixed8_simd_even[80 + 64] = {
 #undef F
 };
 
-static const FIXED_T analysis_consts_fixed8_simd_odd[80 + 64] = {
+static const FIXED_T SBC_ALIGNED analysis_consts_fixed8_simd_odd[80 + 64] = {
 #define F(x) F_PROTO8(x)
 	 F(0.00000000E+00), -F(8.23919506E-04),
 	 F(1.56575398E-04),  F(1.78371725E-03),
-- 
1.5.6.5
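
The sbc_init() change in the patch above over-allocates by SBC_ALIGN_MASK
bytes and rounds the pointer up to the next 16-byte boundary, keeping the
original pointer around for free(). A standalone sketch of the same idea
follows; struct aligned_block and both helper functions are hypothetical
names and are not part of the library.

```c
#include <stdint.h>
#include <stdlib.h>

#define ALIGN_BITS 4
#define ALIGN_MASK ((1u << ALIGN_BITS) - 1)

struct aligned_block {
	void *base;	/* pointer returned by malloc(), passed to free() */
	void *ptr;	/* 16-byte aligned pointer inside the allocation */
};

/* Allocate 'size' usable bytes starting at a 16-byte aligned address */
static int aligned_block_alloc(struct aligned_block *b, size_t size)
{
	b->base = malloc(size + ALIGN_MASK);
	if (!b->base) {
		b->ptr = NULL;
		return -1;
	}

	/* Round up: add the mask, then clear the low ALIGN_BITS bits */
	b->ptr = (void *) (((uintptr_t) b->base + ALIGN_MASK) &
				~((uintptr_t) ALIGN_MASK));
	return 0;
}

/* Release the block through the original, possibly unaligned pointer */
static void aligned_block_free(struct aligned_block *b)
{
	free(b->base);
	b->base = NULL;
	b->ptr = NULL;
}
```

The aligned pointer must never be handed to free() directly, which is why
the patch adds the priv_alloc_base field next to priv.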


[-- Attachment #4: 0003-MMX-and-ARM-NEON-optimized-versions-of-analysis-filt.patch --]
[-- Type: text/x-diff, Size: 25593 bytes --]

From fd46776a2734d800ecc2db6fd226b6cb9cacda36 Mon Sep 17 00:00:00 2001
From: Siarhei Siamashka <siarhei.siamashka@nokia.com>
Date: Thu, 15 Jan 2009 20:25:49 +0200
Subject: [PATCH] MMX and ARM NEON optimized versions of analysis filter for SBC encoder

---
 sbc/Makefile.am           |    2 +-
 sbc/sbc_primitives.c      |   12 ++
 sbc/sbc_primitives_mmx.c  |  373 +++++++++++++++++++++++++++++++++++++++++++++
 sbc/sbc_primitives_mmx.h  |   40 +++++
 sbc/sbc_primitives_neon.c |  299 ++++++++++++++++++++++++++++++++++++
 sbc/sbc_primitives_neon.h |   40 +++++
 6 files changed, 765 insertions(+), 1 deletions(-)
 create mode 100644 sbc/sbc_primitives_mmx.c
 create mode 100644 sbc/sbc_primitives_mmx.h
 create mode 100644 sbc/sbc_primitives_neon.c
 create mode 100644 sbc/sbc_primitives_neon.h

diff --git a/sbc/Makefile.am b/sbc/Makefile.am
index cd068e7..5e47c77 100644
--- a/sbc/Makefile.am
+++ b/sbc/Makefile.am
@@ -9,7 +9,7 @@ if SBC
 noinst_LTLIBRARIES = libsbc.la
 
 libsbc_la_SOURCES = sbc.h sbc.c sbc_math.h sbc_tables.h \
-	sbc_primitives.c
+	sbc_primitives.c sbc_primitives_mmx.c sbc_primitives_neon.c
 
 libsbc_la_CFLAGS = -finline-functions -funswitch-loops -fgcse-after-reload
 
diff --git a/sbc/sbc_primitives.c b/sbc/sbc_primitives.c
index f2e75b4..c77a138 100644
--- a/sbc/sbc_primitives.c
+++ b/sbc/sbc_primitives.c
@@ -30,6 +30,8 @@
 #include "sbc_tables.h"
 
 #include "sbc_primitives.h"
+#include "sbc_primitives_mmx.h"
+#include "sbc_primitives_neon.h"
 
 /*
  * A standard C code of analysis filter.
@@ -398,4 +400,14 @@ void sbc_init_primitives(struct sbc_encoder_state *state)
 	/* Default implementation for analyze functions */
 	state->sbc_analyze_4b_4s = sbc_analyze_4b_4s;
 	state->sbc_analyze_4b_8s = sbc_analyze_4b_8s;
+
+	/* X86/AMD64 optimizations */
+#ifdef SBC_BUILD_WITH_MMX_SUPPORT
+	sbc_init_primitives_mmx(state);
+#endif
+
+	/* ARM optimizations */
+#ifdef SBC_BUILD_WITH_NEON_SUPPORT
+	sbc_init_primitives_neon(state);
+#endif
 }
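
The hunk above installs the plain C analysis functions first and then lets
the platform-specific init routines override them. A reduced sketch of that
dispatch pattern follows; struct encoder_ops, the two analyze functions and
the have_simd flag are hypothetical stand-ins, not the library's real types
or detection logic.

```c
#include <stddef.h>

struct encoder_ops {
	/* Hypothetical callback; the return value only identifies the variant */
	int (*analyze)(const short *pcm, size_t n);
};

/* Portable fallback, always available */
static int analyze_generic(const short *pcm, size_t n)
{
	(void) pcm;
	(void) n;
	return 0;
}

/* Placeholder for an optimized variant */
static int analyze_simd(const short *pcm, size_t n)
{
	(void) pcm;
	(void) n;
	return 1;
}

/* Install the default first, then override it if the CPU check passed */
static void encoder_ops_init(struct encoder_ops *ops, int have_simd)
{
	ops->analyze = analyze_generic;

	if (have_simd)
		ops->analyze = analyze_simd;
}
```

Because the default is always set first, a failed feature check simply
leaves the C implementation in place.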
diff --git a/sbc/sbc_primitives_mmx.c b/sbc/sbc_primitives_mmx.c
new file mode 100644
index 0000000..9f29220
--- /dev/null
+++ b/sbc/sbc_primitives_mmx.c
@@ -0,0 +1,373 @@
+/*
+ *
+ *  Bluetooth low-complexity, subband codec (SBC) library
+ *
+ *  Copyright (C) 2004-2009  Marcel Holtmann <marcel@holtmann.org>
+ *  Copyright (C) 2004-2005  Henryk Ploetz <henryk@ploetzli.ch>
+ *  Copyright (C) 2005-2006  Brad Midgley <bmidgley@xmission.com>
+ *
+ *
+ *  This library is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU Lesser General Public
+ *  License as published by the Free Software Foundation; either
+ *  version 2.1 of the License, or (at your option) any later version.
+ *
+ *  This library is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ *  Lesser General Public License for more details.
+ *
+ *  You should have received a copy of the GNU Lesser General Public
+ *  License along with this library; if not, write to the Free Software
+ *  Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ *
+ */
+
+#include <stdint.h>
+#include <limits.h>
+#include "sbc.h"
+#include "sbc_math.h"
+#include "sbc_tables.h"
+
+#include "sbc_primitives_mmx.h"
+
+/*
+ * MMX optimizations
+ */
+
+#ifdef SBC_BUILD_WITH_MMX_SUPPORT
+
+static inline void sbc_analyze_four_mmx(const int16_t *in, int32_t *out,
+					const FIXED_T *consts)
+{
+	static const SBC_ALIGNED int32_t round_c[2] = {
+		1 << (SBC_PROTO_FIXED4_SCALE - 1),
+		1 << (SBC_PROTO_FIXED4_SCALE - 1),
+	};
+	asm volatile (
+		"movq        (%0), %%mm0\n"
+		"movq       8(%0), %%mm1\n"
+		"pmaddwd     (%1), %%mm0\n"
+		"pmaddwd    8(%1), %%mm1\n"
+		"paddd       (%2), %%mm0\n"
+		"paddd       (%2), %%mm1\n"
+		"\n"
+		"movq      16(%0), %%mm2\n"
+		"movq      24(%0), %%mm3\n"
+		"pmaddwd   16(%1), %%mm2\n"
+		"pmaddwd   24(%1), %%mm3\n"
+		"paddd      %%mm2, %%mm0\n"
+		"paddd      %%mm3, %%mm1\n"
+		"\n"
+		"movq      32(%0), %%mm2\n"
+		"movq      40(%0), %%mm3\n"
+		"pmaddwd   32(%1), %%mm2\n"
+		"pmaddwd   40(%1), %%mm3\n"
+		"paddd      %%mm2, %%mm0\n"
+		"paddd      %%mm3, %%mm1\n"
+		"\n"
+		"movq      48(%0), %%mm2\n"
+		"movq      56(%0), %%mm3\n"
+		"pmaddwd   48(%1), %%mm2\n"
+		"pmaddwd   56(%1), %%mm3\n"
+		"paddd      %%mm2, %%mm0\n"
+		"paddd      %%mm3, %%mm1\n"
+		"\n"
+		"movq      64(%0), %%mm2\n"
+		"movq      72(%0), %%mm3\n"
+		"pmaddwd   64(%1), %%mm2\n"
+		"pmaddwd   72(%1), %%mm3\n"
+		"paddd      %%mm2, %%mm0\n"
+		"paddd      %%mm3, %%mm1\n"
+		"\n"
+		"psrad         %4, %%mm0\n"
+		"psrad         %4, %%mm1\n"
+		"packssdw   %%mm0, %%mm0\n"
+		"packssdw   %%mm1, %%mm1\n"
+		"\n"
+		"movq       %%mm0, %%mm2\n"
+		"pmaddwd   80(%1), %%mm0\n"
+		"pmaddwd   88(%1), %%mm2\n"
+		"\n"
+		"movq       %%mm1, %%mm3\n"
+		"pmaddwd   96(%1), %%mm1\n"
+		"pmaddwd  104(%1), %%mm3\n"
+		"paddd      %%mm1, %%mm0\n"
+		"paddd      %%mm3, %%mm2\n"
+		"\n"
+		"movq       %%mm0, (%3)\n"
+		"movq       %%mm2, 8(%3)\n"
+		:
+		: "r" (in), "r" (consts), "r" (&round_c), "r" (out),
+		  "i" (SBC_PROTO_FIXED4_SCALE)
+		: "memory");
+}
+
+static inline void sbc_analyze_eight_mmx(const int16_t *in, int32_t *out,
+					 const FIXED_T *consts)
+{
+	static const SBC_ALIGNED int32_t round_c[2] = {
+		1 << (SBC_PROTO_FIXED8_SCALE - 1),
+		1 << (SBC_PROTO_FIXED8_SCALE - 1),
+	};
+	asm volatile (
+		"movq        (%0), %%mm0\n"
+		"movq       8(%0), %%mm1\n"
+		"movq      16(%0), %%mm2\n"
+		"movq      24(%0), %%mm3\n"
+		"pmaddwd     (%1), %%mm0\n"
+		"pmaddwd    8(%1), %%mm1\n"
+		"pmaddwd   16(%1), %%mm2\n"
+		"pmaddwd   24(%1), %%mm3\n"
+		"paddd       (%2), %%mm0\n"
+		"paddd       (%2), %%mm1\n"
+		"paddd       (%2), %%mm2\n"
+		"paddd       (%2), %%mm3\n"
+		"\n"
+		"movq      32(%0), %%mm4\n"
+		"movq      40(%0), %%mm5\n"
+		"movq      48(%0), %%mm6\n"
+		"movq      56(%0), %%mm7\n"
+		"pmaddwd   32(%1), %%mm4\n"
+		"pmaddwd   40(%1), %%mm5\n"
+		"pmaddwd   48(%1), %%mm6\n"
+		"pmaddwd   56(%1), %%mm7\n"
+		"paddd      %%mm4, %%mm0\n"
+		"paddd      %%mm5, %%mm1\n"
+		"paddd      %%mm6, %%mm2\n"
+		"paddd      %%mm7, %%mm3\n"
+		"\n"
+		"movq      64(%0), %%mm4\n"
+		"movq      72(%0), %%mm5\n"
+		"movq      80(%0), %%mm6\n"
+		"movq      88(%0), %%mm7\n"
+		"pmaddwd   64(%1), %%mm4\n"
+		"pmaddwd   72(%1), %%mm5\n"
+		"pmaddwd   80(%1), %%mm6\n"
+		"pmaddwd   88(%1), %%mm7\n"
+		"paddd      %%mm4, %%mm0\n"
+		"paddd      %%mm5, %%mm1\n"
+		"paddd      %%mm6, %%mm2\n"
+		"paddd      %%mm7, %%mm3\n"
+		"\n"
+		"movq      96(%0), %%mm4\n"
+		"movq     104(%0), %%mm5\n"
+		"movq     112(%0), %%mm6\n"
+		"movq     120(%0), %%mm7\n"
+		"pmaddwd   96(%1), %%mm4\n"
+		"pmaddwd  104(%1), %%mm5\n"
+		"pmaddwd  112(%1), %%mm6\n"
+		"pmaddwd  120(%1), %%mm7\n"
+		"paddd      %%mm4, %%mm0\n"
+		"paddd      %%mm5, %%mm1\n"
+		"paddd      %%mm6, %%mm2\n"
+		"paddd      %%mm7, %%mm3\n"
+		"\n"
+		"movq     128(%0), %%mm4\n"
+		"movq     136(%0), %%mm5\n"
+		"movq     144(%0), %%mm6\n"
+		"movq     152(%0), %%mm7\n"
+		"pmaddwd  128(%1), %%mm4\n"
+		"pmaddwd  136(%1), %%mm5\n"
+		"pmaddwd  144(%1), %%mm6\n"
+		"pmaddwd  152(%1), %%mm7\n"
+		"paddd      %%mm4, %%mm0\n"
+		"paddd      %%mm5, %%mm1\n"
+		"paddd      %%mm6, %%mm2\n"
+		"paddd      %%mm7, %%mm3\n"
+		"\n"
+		"psrad         %4, %%mm0\n"
+		"psrad         %4, %%mm1\n"
+		"psrad         %4, %%mm2\n"
+		"psrad         %4, %%mm3\n"
+		"\n"
+		"packssdw   %%mm0, %%mm0\n"
+		"packssdw   %%mm1, %%mm1\n"
+		"packssdw   %%mm2, %%mm2\n"
+		"packssdw   %%mm3, %%mm3\n"
+		"\n"
+		"movq       %%mm0, %%mm4\n"
+		"movq       %%mm0, %%mm5\n"
+		"pmaddwd  160(%1), %%mm4\n"
+		"pmaddwd  168(%1), %%mm5\n"
+		"\n"
+		"movq       %%mm1, %%mm6\n"
+		"movq       %%mm1, %%mm7\n"
+		"pmaddwd  192(%1), %%mm6\n"
+		"pmaddwd  200(%1), %%mm7\n"
+		"paddd      %%mm6, %%mm4\n"
+		"paddd      %%mm7, %%mm5\n"
+		"\n"
+		"movq       %%mm2, %%mm6\n"
+		"movq       %%mm2, %%mm7\n"
+		"pmaddwd  224(%1), %%mm6\n"
+		"pmaddwd  232(%1), %%mm7\n"
+		"paddd      %%mm6, %%mm4\n"
+		"paddd      %%mm7, %%mm5\n"
+		"\n"
+		"movq       %%mm3, %%mm6\n"
+		"movq       %%mm3, %%mm7\n"
+		"pmaddwd  256(%1), %%mm6\n"
+		"pmaddwd  264(%1), %%mm7\n"
+		"paddd      %%mm6, %%mm4\n"
+		"paddd      %%mm7, %%mm5\n"
+		"\n"
+		"movq       %%mm4, (%3)\n"
+		"movq       %%mm5, 8(%3)\n"
+		"\n"
+		"movq       %%mm0, %%mm5\n"
+		"pmaddwd  176(%1), %%mm0\n"
+		"pmaddwd  184(%1), %%mm5\n"
+		"\n"
+		"movq       %%mm1, %%mm7\n"
+		"pmaddwd  208(%1), %%mm1\n"
+		"pmaddwd  216(%1), %%mm7\n"
+		"paddd      %%mm1, %%mm0\n"
+		"paddd      %%mm7, %%mm5\n"
+		"\n"
+		"movq       %%mm2, %%mm7\n"
+		"pmaddwd  240(%1), %%mm2\n"
+		"pmaddwd  248(%1), %%mm7\n"
+		"paddd      %%mm2, %%mm0\n"
+		"paddd      %%mm7, %%mm5\n"
+		"\n"
+		"movq       %%mm3, %%mm7\n"
+		"pmaddwd  272(%1), %%mm3\n"
+		"pmaddwd  280(%1), %%mm7\n"
+		"paddd      %%mm3, %%mm0\n"
+		"paddd      %%mm7, %%mm5\n"
+		"\n"
+		"movq       %%mm0, 16(%3)\n"
+		"movq       %%mm5, 24(%3)\n"
+		:
+		: "r" (in), "r" (consts), "r" (&round_c), "r" (out),
+		  "i" (SBC_PROTO_FIXED8_SCALE)
+		: "memory");
+}
+
+static inline void sbc_analyze_4b_4s_mmx(int16_t *pcm, int16_t *x,
+					 int32_t *out, int out_stride)
+{
+	/* Fetch audio samples and do input data reordering for SIMD */
+	x[64] = x[0]  = pcm[8 + 7];
+	x[65] = x[1]  = pcm[8 + 3];
+	x[66] = x[2]  = pcm[8 + 6];
+	x[67] = x[3]  = pcm[8 + 4];
+	x[68] = x[4]  = pcm[8 + 0];
+	x[69] = x[5]  = pcm[8 + 2];
+	x[70] = x[6]  = pcm[8 + 1];
+	x[71] = x[7]  = pcm[8 + 5];
+
+	x[72] = x[8]  = pcm[0 + 7];
+	x[73] = x[9]  = pcm[0 + 3];
+	x[74] = x[10] = pcm[0 + 6];
+	x[75] = x[11] = pcm[0 + 4];
+	x[76] = x[12] = pcm[0 + 0];
+	x[77] = x[13] = pcm[0 + 2];
+	x[78] = x[14] = pcm[0 + 1];
+	x[79] = x[15] = pcm[0 + 5];
+
+	/* Analyze blocks */
+	sbc_analyze_four_mmx(x + 12, out, analysis_consts_fixed4_simd_odd);
+	out += out_stride;
+	sbc_analyze_four_mmx(x + 8, out, analysis_consts_fixed4_simd_even);
+	out += out_stride;
+	sbc_analyze_four_mmx(x + 4, out, analysis_consts_fixed4_simd_odd);
+	out += out_stride;
+	sbc_analyze_four_mmx(x + 0, out, analysis_consts_fixed4_simd_even);
+
+	asm volatile ("emms\n");
+}
+
+static inline void sbc_analyze_4b_8s_mmx(int16_t *pcm, int16_t *x,
+					 int32_t *out, int out_stride)
+{
+	/* Fetch audio samples and do input data reordering for SIMD */
+	x[128] = x[0]  = pcm[16 + 15];
+	x[129] = x[1]  = pcm[16 + 7];
+	x[130] = x[2]  = pcm[16 + 14];
+	x[131] = x[3]  = pcm[16 + 8];
+	x[132] = x[4]  = pcm[16 + 13];
+	x[133] = x[5]  = pcm[16 + 9];
+	x[134] = x[6]  = pcm[16 + 12];
+	x[135] = x[7]  = pcm[16 + 10];
+	x[136] = x[8]  = pcm[16 + 11];
+	x[137] = x[9]  = pcm[16 + 3];
+	x[138] = x[10] = pcm[16 + 6];
+	x[139] = x[11] = pcm[16 + 0];
+	x[140] = x[12] = pcm[16 + 5];
+	x[141] = x[13] = pcm[16 + 1];
+	x[142] = x[14] = pcm[16 + 4];
+	x[143] = x[15] = pcm[16 + 2];
+
+	x[144] = x[16] = pcm[0 + 15];
+	x[145] = x[17] = pcm[0 + 7];
+	x[146] = x[18] = pcm[0 + 14];
+	x[147] = x[19] = pcm[0 + 8];
+	x[148] = x[20] = pcm[0 + 13];
+	x[149] = x[21] = pcm[0 + 9];
+	x[150] = x[22] = pcm[0 + 12];
+	x[151] = x[23] = pcm[0 + 10];
+	x[152] = x[24] = pcm[0 + 11];
+	x[153] = x[25] = pcm[0 + 3];
+	x[154] = x[26] = pcm[0 + 6];
+	x[155] = x[27] = pcm[0 + 0];
+	x[156] = x[28] = pcm[0 + 5];
+	x[157] = x[29] = pcm[0 + 1];
+	x[158] = x[30] = pcm[0 + 4];
+	x[159] = x[31] = pcm[0 + 2];
+
+	/* Analyze blocks */
+	sbc_analyze_eight_mmx(x + 24, out, analysis_consts_fixed8_simd_odd);
+	out += out_stride;
+	sbc_analyze_eight_mmx(x + 16, out, analysis_consts_fixed8_simd_even);
+	out += out_stride;
+	sbc_analyze_eight_mmx(x + 8, out, analysis_consts_fixed8_simd_odd);
+	out += out_stride;
+	sbc_analyze_eight_mmx(x + 0, out, analysis_consts_fixed8_simd_even);
+
+	asm volatile ("emms\n");
+}
+
+static int check_mmx_support(void)
+{
+#ifdef __amd64__
+	return 1; /* We assume that all 64-bit processors have MMX support */
+#else
+	int cpuid_feature_information;
+	asm volatile (
+		/* According to Intel manual, CPUID instruction is supported
+		   if the value of ID bit (bit 21) in EFLAGS can be modified */
+		"pushf\n"
+		"movl     (%%esp),   %0\n"
+		"xorl     $0x200000, (%%esp)\n" /* try to modify ID bit */
+		"popf\n"
+		"pushf\n"
+		"xorl     (%%esp),   %0\n"      /* check if ID bit changed */
+		"jz       1f\n"
+		"push     %%eax\n"
+		"push     %%ebx\n"
+		"push     %%ecx\n"
+		"mov      $1,        %%eax\n"
+		"cpuid\n"
+		"pop      %%ecx\n"
+		"pop      %%ebx\n"
+		"pop      %%eax\n"
+		"1:\n"
+		"popf\n"
+		: "=d" (cpuid_feature_information)
+		:
+		: "cc");
+	return cpuid_feature_information & (1 << 23);
+#endif
+}
+
+void sbc_init_primitives_mmx(struct sbc_encoder_state *state)
+{
+	if (check_mmx_support()) {
+		state->sbc_analyze_4b_4s = sbc_analyze_4b_4s_mmx;
+		state->sbc_analyze_4b_8s = sbc_analyze_4b_8s_mmx;
+	}
+}
+
+#endif
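
The MMX routines above are built almost entirely from pmaddwd, which
multiplies four pairs of signed 16-bit values and adds neighbouring products
into two 32-bit sums. A scalar model of that single instruction is sketched
below; pmaddwd_model is a hypothetical helper for illustration and is not
used by the patch.

```c
#include <stdint.h>

/*
 * Scalar model of one pmaddwd on a 64-bit MMX register:
 * four 16x16 -> 32 bit signed multiplies, then pairwise addition.
 * The corner case where all four inputs of a pair are -32768 wraps
 * around in hardware and is not handled by this sketch.
 */
static void pmaddwd_model(const int16_t a[4], const int16_t b[4],
			  int32_t out[2])
{
	out[0] = (int32_t) a[0] * b[0] + (int32_t) a[1] * b[1];
	out[1] = (int32_t) a[2] * b[2] + (int32_t) a[3] * b[3];
}
```

This pairwise summation is the reason the constant tables and the input
samples are reordered: values that belong to the same output sum have to sit
next to each other in memory.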
diff --git a/sbc/sbc_primitives_mmx.h b/sbc/sbc_primitives_mmx.h
new file mode 100644
index 0000000..c1e44a5
--- /dev/null
+++ b/sbc/sbc_primitives_mmx.h
@@ -0,0 +1,40 @@
+/*
+ *
+ *  Bluetooth low-complexity, subband codec (SBC) library
+ *
+ *  Copyright (C) 2004-2009  Marcel Holtmann <marcel@holtmann.org>
+ *  Copyright (C) 2004-2005  Henryk Ploetz <henryk@ploetzli.ch>
+ *  Copyright (C) 2005-2006  Brad Midgley <bmidgley@xmission.com>
+ *
+ *
+ *  This library is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU Lesser General Public
+ *  License as published by the Free Software Foundation; either
+ *  version 2.1 of the License, or (at your option) any later version.
+ *
+ *  This library is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ *  Lesser General Public License for more details.
+ *
+ *  You should have received a copy of the GNU Lesser General Public
+ *  License along with this library; if not, write to the Free Software
+ *  Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ *
+ */
+
+#ifndef __SBC_PRIMITIVES_MMX_H
+#define __SBC_PRIMITIVES_MMX_H
+
+#include "sbc_primitives.h"
+
+#if defined(__GNUC__) && (defined(__i386__) || defined(__amd64__)) && \
+		!defined(SBC_HIGH_PRECISION) && (SCALE_OUT_BITS == 15)
+
+#define SBC_BUILD_WITH_MMX_SUPPORT
+
+void sbc_init_primitives_mmx(struct sbc_encoder_state *encoder_state);
+
+#endif
+
+#endif
diff --git a/sbc/sbc_primitives_neon.c b/sbc/sbc_primitives_neon.c
new file mode 100644
index 0000000..ea8446f
--- /dev/null
+++ b/sbc/sbc_primitives_neon.c
@@ -0,0 +1,299 @@
+/*
+ *
+ *  Bluetooth low-complexity, subband codec (SBC) library
+ *
+ *  Copyright (C) 2004-2009  Marcel Holtmann <marcel@holtmann.org>
+ *  Copyright (C) 2004-2005  Henryk Ploetz <henryk@ploetzli.ch>
+ *  Copyright (C) 2005-2006  Brad Midgley <bmidgley@xmission.com>
+ *
+ *
+ *  This library is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU Lesser General Public
+ *  License as published by the Free Software Foundation; either
+ *  version 2.1 of the License, or (at your option) any later version.
+ *
+ *  This library is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ *  Lesser General Public License for more details.
+ *
+ *  You should have received a copy of the GNU Lesser General Public
+ *  License along with this library; if not, write to the Free Software
+ *  Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ *
+ */
+
+#include <stdint.h>
+#include <limits.h>
+#include "sbc.h"
+#include "sbc_math.h"
+#include "sbc_tables.h"
+
+#include "sbc_primitives_neon.h"
+
+/*
+ * ARM NEON optimizations
+ */
+
+#ifdef SBC_BUILD_WITH_NEON_SUPPORT
+
+static inline void _sbc_analyze_four_neon(const int16_t *in, int32_t *out,
+					 const FIXED_T *consts)
+{
+	/* TODO: merge even and odd cases (or even merge all four calls to this
+		 function) in order to have only aligned reads from 'in' array
+		 and reduce number of load instructions */
+	asm volatile (
+		"vld1.16    {d4, d5}, [%0, :64]!\n"
+		"vld1.16    {d8, d9}, [%1, :128]!\n"
+
+		"vmull.s16  q0, d4, d8\n"
+		"vld1.16    {d6,  d7}, [%0, :64]!\n"
+		"vmull.s16  q1, d5, d9\n"
+		"vld1.16    {d10, d11}, [%1, :128]!\n"
+
+		"vmlal.s16  q0, d6, d10\n"
+		"vld1.16    {d4, d5}, [%0, :64]!\n"
+		"vmlal.s16  q1, d7, d11\n"
+		"vld1.16    {d8, d9}, [%1, :128]!\n"
+
+		"vmlal.s16  q0, d4, d8\n"
+		"vld1.16    {d6,  d7}, [%0, :64]!\n"
+		"vmlal.s16  q1, d5, d9\n"
+		"vld1.16    {d10, d11}, [%1, :128]!\n"
+
+		"vmlal.s16  q0, d6, d10\n"
+		"vld1.16    {d4, d5}, [%0, :64]!\n"
+		"vmlal.s16  q1, d7, d11\n"
+		"vld1.16    {d8, d9}, [%1, :128]!\n"
+
+		"vmlal.s16  q0, d4, d8\n"
+		"vmlal.s16  q1, d5, d9\n"
+
+		"vpadd.s32  d0, d0, d1\n"
+		"vpadd.s32  d1, d2, d3\n"
+
+		"vrshrn.s32 d0, q0, %3\n"
+
+		"vld1.16    {d2, d3, d4, d5}, [%1, :128]!\n"
+
+		"vdup.i32   d1, d0[1]\n"  /* TODO: can be eliminated */
+		"vdup.i32   d0, d0[0]\n"  /* TODO: can be eliminated */
+
+		"vmull.s16  q3, d2, d0\n"
+		"vmull.s16  q4, d3, d0\n"
+		"vmlal.s16  q3, d4, d1\n"
+		"vmlal.s16  q4, d5, d1\n"
+
+		"vpadd.s32  d0, d6, d7\n" /* TODO: can be eliminated */
+		"vpadd.s32  d1, d8, d9\n" /* TODO: can be eliminated */
+
+		"vst1.32    {d0, d1}, [%2, :128]\n"
+		: "+r" (in), "+r" (consts)
+		: "r" (out),
+		  "i" (SBC_PROTO_FIXED4_SCALE)
+		: "memory",
+		  "d0", "d1", "d2", "d3", "d4", "d5",
+		  "d6", "d7", "d8", "d9", "d10", "d11");
+}
+
+static inline void _sbc_analyze_eight_neon(const int16_t *in, int32_t *out,
+					 const FIXED_T *consts)
+{
+	/* TODO: merge even and odd cases (or even merge all four calls to this
+		 function) in order to have only aligned reads from 'in' array
+		 and reduce number of load instructions */
+	asm volatile (
+		"vld1.16    {d4, d5}, [%0, :64]!\n"
+		"vld1.16    {d8, d9}, [%1, :128]!\n"
+
+		"vmull.s16  q6, d4, d8\n"
+		"vld1.16    {d6,  d7}, [%0, :64]!\n"
+		"vmull.s16  q7, d5, d9\n"
+		"vld1.16    {d10, d11}, [%1, :128]!\n"
+		"vmull.s16  q8, d6, d10\n"
+		"vld1.16    {d4, d5}, [%0, :64]!\n"
+		"vmull.s16  q9, d7, d11\n"
+		"vld1.16    {d8, d9}, [%1, :128]!\n"
+
+		"vmlal.s16  q6, d4, d8\n"
+		"vld1.16    {d6,  d7}, [%0, :64]!\n"
+		"vmlal.s16  q7, d5, d9\n"
+		"vld1.16    {d10, d11}, [%1, :128]!\n"
+		"vmlal.s16  q8, d6, d10\n"
+		"vld1.16    {d4, d5}, [%0, :64]!\n"
+		"vmlal.s16  q9, d7, d11\n"
+		"vld1.16    {d8, d9}, [%1, :128]!\n"
+
+		"vmlal.s16  q6, d4, d8\n"
+		"vld1.16    {d6,  d7}, [%0, :64]!\n"
+		"vmlal.s16  q7, d5, d9\n"
+		"vld1.16    {d10, d11}, [%1, :128]!\n"
+		"vmlal.s16  q8, d6, d10\n"
+		"vld1.16    {d4, d5}, [%0, :64]!\n"
+		"vmlal.s16  q9, d7, d11\n"
+		"vld1.16    {d8, d9}, [%1, :128]!\n"
+
+		"vmlal.s16  q6, d4, d8\n"
+		"vld1.16    {d6,  d7}, [%0, :64]!\n"
+		"vmlal.s16  q7, d5, d9\n"
+		"vld1.16    {d10, d11}, [%1, :128]!\n"
+		"vmlal.s16  q8, d6, d10\n"
+		"vld1.16    {d4, d5}, [%0, :64]!\n"
+		"vmlal.s16  q9, d7, d11\n"
+		"vld1.16    {d8, d9}, [%1, :128]!\n"
+
+		"vmlal.s16  q6, d4, d8\n"
+		"vld1.16    {d6,  d7}, [%0, :64]!\n"
+		"vmlal.s16  q7, d5, d9\n"
+		"vld1.16    {d10, d11}, [%1, :128]!\n"
+
+		"vmlal.s16  q8, d6, d10\n"
+		"vmlal.s16  q9, d7, d11\n"
+
+		"vpadd.s32  d0, d12, d13\n"
+		"vpadd.s32  d1, d14, d15\n"
+		"vpadd.s32  d2, d16, d17\n"
+		"vpadd.s32  d3, d18, d19\n"
+
+		"vrshr.s32 q0, q0, %3\n"
+		"vrshr.s32 q1, q1, %3\n"
+		"vmovn.s32 d0, q0\n"
+		"vmovn.s32 d1, q1\n"
+
+		"vdup.i32   d3, d1[1]\n"  /* TODO: can be eliminated */
+		"vdup.i32   d2, d1[0]\n"  /* TODO: can be eliminated */
+		"vdup.i32   d1, d0[1]\n"  /* TODO: can be eliminated */
+		"vdup.i32   d0, d0[0]\n"  /* TODO: can be eliminated */
+
+		"vld1.16    {d4, d5}, [%1, :128]!\n"
+		"vmull.s16  q6, d4, d0\n"
+		"vld1.16    {d6, d7}, [%1, :128]!\n"
+		"vmull.s16  q7, d5, d0\n"
+		"vmull.s16  q8, d6, d0\n"
+		"vmull.s16  q9, d7, d0\n"
+
+		"vld1.16    {d4, d5}, [%1, :128]!\n"
+		"vmlal.s16  q6, d4, d1\n"
+		"vld1.16    {d6, d7}, [%1, :128]!\n"
+		"vmlal.s16  q7, d5, d1\n"
+		"vmlal.s16  q8, d6, d1\n"
+		"vmlal.s16  q9, d7, d1\n"
+
+		"vld1.16    {d4, d5}, [%1, :128]!\n"
+		"vmlal.s16  q6, d4, d2\n"
+		"vld1.16    {d6, d7}, [%1, :128]!\n"
+		"vmlal.s16  q7, d5, d2\n"
+		"vmlal.s16  q8, d6, d2\n"
+		"vmlal.s16  q9, d7, d2\n"
+
+		"vld1.16    {d4, d5}, [%1, :128]!\n"
+		"vmlal.s16  q6, d4, d3\n"
+		"vld1.16    {d6, d7}, [%1, :128]!\n"
+		"vmlal.s16  q7, d5, d3\n"
+		"vmlal.s16  q8, d6, d3\n"
+		"vmlal.s16  q9, d7, d3\n"
+
+		"vpadd.s32  d0, d12, d13\n" /* TODO: can be eliminated */
+		"vpadd.s32  d1, d14, d15\n" /* TODO: can be eliminated */
+		"vpadd.s32  d2, d16, d17\n" /* TODO: can be eliminated */
+		"vpadd.s32  d3, d18, d19\n" /* TODO: can be eliminated */
+
+		"vst1.32    {d0, d1, d2, d3}, [%2, :128]\n"
+		: "+r" (in), "+r" (consts)
+		: "r" (out),
+		  "i" (SBC_PROTO_FIXED8_SCALE)
+		: "memory",
+		  "d0", "d1", "d2", "d3", "d4", "d5",
+		  "d6", "d7", "d8", "d9", "d10", "d11",
+		  "d12", "d13", "d14", "d15", "d16", "d17",
+		  "d18", "d19");
+}
+
+static inline void sbc_analyze_4b_4s_neon(int16_t *pcm, int16_t *x,
+					 int32_t *out, int out_stride)
+{
+	/* Fetch audio samples and do input data reordering for SIMD */
+	x[64] = x[0]  = pcm[8 + 7];
+	x[65] = x[1]  = pcm[8 + 3];
+	x[66] = x[2]  = pcm[8 + 6];
+	x[67] = x[3]  = pcm[8 + 4];
+	x[68] = x[4]  = pcm[8 + 0];
+	x[69] = x[5]  = pcm[8 + 2];
+	x[70] = x[6]  = pcm[8 + 1];
+	x[71] = x[7]  = pcm[8 + 5];
+
+	x[72] = x[8]  = pcm[0 + 7];
+	x[73] = x[9]  = pcm[0 + 3];
+	x[74] = x[10] = pcm[0 + 6];
+	x[75] = x[11] = pcm[0 + 4];
+	x[76] = x[12] = pcm[0 + 0];
+	x[77] = x[13] = pcm[0 + 2];
+	x[78] = x[14] = pcm[0 + 1];
+	x[79] = x[15] = pcm[0 + 5];
+
+	/* Analyze blocks */
+	_sbc_analyze_four_neon(x + 12, out, analysis_consts_fixed4_simd_odd);
+	out += out_stride;
+	_sbc_analyze_four_neon(x + 8, out, analysis_consts_fixed4_simd_even);
+	out += out_stride;
+	_sbc_analyze_four_neon(x + 4, out, analysis_consts_fixed4_simd_odd);
+	out += out_stride;
+	_sbc_analyze_four_neon(x + 0, out, analysis_consts_fixed4_simd_even);
+}
+
+static inline void sbc_analyze_4b_8s_neon(int16_t *pcm, int16_t *x,
+					  int32_t *out, int out_stride)
+{
+	/* Fetch audio samples and do input data reordering for SIMD */
+	x[128] = x[0]  = pcm[16 + 15];
+	x[129] = x[1]  = pcm[16 + 7];
+	x[130] = x[2]  = pcm[16 + 14];
+	x[131] = x[3]  = pcm[16 + 8];
+	x[132] = x[4]  = pcm[16 + 13];
+	x[133] = x[5]  = pcm[16 + 9];
+	x[134] = x[6]  = pcm[16 + 12];
+	x[135] = x[7]  = pcm[16 + 10];
+	x[136] = x[8]  = pcm[16 + 11];
+	x[137] = x[9]  = pcm[16 + 3];
+	x[138] = x[10] = pcm[16 + 6];
+	x[139] = x[11] = pcm[16 + 0];
+	x[140] = x[12] = pcm[16 + 5];
+	x[141] = x[13] = pcm[16 + 1];
+	x[142] = x[14] = pcm[16 + 4];
+	x[143] = x[15] = pcm[16 + 2];
+
+	x[144] = x[16] = pcm[0 + 15];
+	x[145] = x[17] = pcm[0 + 7];
+	x[146] = x[18] = pcm[0 + 14];
+	x[147] = x[19] = pcm[0 + 8];
+	x[148] = x[20] = pcm[0 + 13];
+	x[149] = x[21] = pcm[0 + 9];
+	x[150] = x[22] = pcm[0 + 12];
+	x[151] = x[23] = pcm[0 + 10];
+	x[152] = x[24] = pcm[0 + 11];
+	x[153] = x[25] = pcm[0 + 3];
+	x[154] = x[26] = pcm[0 + 6];
+	x[155] = x[27] = pcm[0 + 0];
+	x[156] = x[28] = pcm[0 + 5];
+	x[157] = x[29] = pcm[0 + 1];
+	x[158] = x[30] = pcm[0 + 4];
+	x[159] = x[31] = pcm[0 + 2];
+
+	/* Analyze blocks */
+	_sbc_analyze_eight_neon(x + 24, out, analysis_consts_fixed8_simd_odd);
+	out += out_stride;
+	_sbc_analyze_eight_neon(x + 16, out, analysis_consts_fixed8_simd_even);
+	out += out_stride;
+	_sbc_analyze_eight_neon(x + 8, out, analysis_consts_fixed8_simd_odd);
+	out += out_stride;
+	_sbc_analyze_eight_neon(x + 0, out, analysis_consts_fixed8_simd_even);
+}
+
+void sbc_init_primitives_neon(struct sbc_encoder_state *state)
+{
+	state->sbc_analyze_4b_4s = sbc_analyze_4b_4s_neon;
+	state->sbc_analyze_4b_8s = sbc_analyze_4b_8s_neon;
+}
+
+#endif
diff --git a/sbc/sbc_primitives_neon.h b/sbc/sbc_primitives_neon.h
new file mode 100644
index 0000000..30766ed
--- /dev/null
+++ b/sbc/sbc_primitives_neon.h
@@ -0,0 +1,40 @@
+/*
+ *
+ *  Bluetooth low-complexity, subband codec (SBC) library
+ *
+ *  Copyright (C) 2004-2009  Marcel Holtmann <marcel@holtmann.org>
+ *  Copyright (C) 2004-2005  Henryk Ploetz <henryk@ploetzli.ch>
+ *  Copyright (C) 2005-2006  Brad Midgley <bmidgley@xmission.com>
+ *
+ *
+ *  This library is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU Lesser General Public
+ *  License as published by the Free Software Foundation; either
+ *  version 2.1 of the License, or (at your option) any later version.
+ *
+ *  This library is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ *  Lesser General Public License for more details.
+ *
+ *  You should have received a copy of the GNU Lesser General Public
+ *  License along with this library; if not, write to the Free Software
+ *  Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ *
+ */
+
+#ifndef __SBC_PRIMITIVES_NEON_H
+#define __SBC_PRIMITIVES_NEON_H
+
+#include "sbc_primitives.h"
+
+#if defined(__GNUC__) && defined(__ARM_NEON__) && \
+		!defined(SBC_HIGH_PRECISION) && (SCALE_OUT_BITS == 15)
+
+#define SBC_BUILD_WITH_NEON_SUPPORT
+
+void sbc_init_primitives_neon(struct sbc_encoder_state *encoder_state);
+
+#endif
+
+#endif
-- 
1.5.6.5
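
For readers who do not read NEON assembly, here is a scalar C sketch of
what _sbc_analyze_four_neon appears to compute, worked out from the
instruction sequence in the patch above. It is not part of the patch.
The name analyze_four_ref and the 'scale' parameter (standing in for
SBC_PROTO_FIXED4_SCALE) are hypothetical. The sketch assumes that FIXED_T
is int16_t, that 1 <= scale <= 16, and that the 32-bit accumulators do
not overflow. It ignores the alignment hints used by the vld1/vst1
instructions.

```c
#include <stdint.h>

/*
 * Scalar model of the 4-subband NEON kernel (illustrative only).
 * 'consts' holds 40 window constants followed by 16 constants for
 * the second stage, in the SIMD-friendly order used by the patch.
 */
static void analyze_four_ref(const int16_t *in, int32_t *out,
			     const int16_t *consts, int scale)
{
	int32_t acc[4] = { 0, 0, 0, 0 };
	int16_t t[4];
	const int16_t *c2 = consts + 40;
	int i, k;

	/* Stage 1: 40 multiply-accumulates folded into 4 sums
	 * (the vmull/vmlal rounds plus the first pair of vpadd) */
	for (i = 0; i < 5; i++)
		for (k = 0; k < 4; k++)
			acc[k] += in[8 * i + 2 * k] * consts[8 * i + 2 * k] +
				  in[8 * i + 2 * k + 1] * consts[8 * i + 2 * k + 1];

	/* Rounding shift and narrowing to 16 bits (vrshrn.s32); relies on
	 * arithmetic right shift of negative values, as gcc provides */
	for (k = 0; k < 4; k++)
		t[k] = (int16_t)((acc[k] + (1 << (scale - 1))) >> scale);

	/* Stage 2: 4x4 multiply by the 16 constants that follow
	 * (vdup, vmull/vmlal and the final pair of vpadd) */
	for (k = 0; k < 4; k++)
		out[k] = c2[2 * k] * t[0] + c2[2 * k + 1] * t[1] +
			 c2[2 * k + 8] * t[2] + c2[2 * k + 9] * t[3];
}
```

A model like this can be run side by side with the assembly on the same
input to check the "exactly the same output" property on a NEON-capable
target. The 8-subband kernel follows the same two-stage pattern with 80
window constants and an 8x8 second stage.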



* Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter
  2009-01-15 19:34         ` Siarhei Siamashka
@ 2009-01-15 23:29           ` Marcel Holtmann
  0 siblings, 0 replies; 20+ messages in thread
From: Marcel Holtmann @ 2009-01-15 23:29 UTC (permalink / raw)
  To: Siarhei Siamashka; +Cc: linux-bluetooth

Hi Siarhei,

> > > > The attached patch contains what I would consider to be a final
> > > > variant. MMX support is now complete. It works for both x86 and amd64,
> > > > has runtime autodetection of MMX availability, supports 4 and 8
> > > > subbands cases. I also ensured that only original MMX instructions are
> > > > used (and no SSE or other later additions), so the code should work
> > > > fine even on the old Pentium1 MMX. New MMX optimized functions produce
> > > > bit identical results when compared with bluez-4.25 release.
> > > >
> > > > With this patch applied, new filtering functions are noticeably faster
> > > > than the old ones on x86 (so now they are both faster and have
> > > > better quality). Assembly optimizations for the other platforms can be
> > > > easily added too.
> > >
> > > can you re-base your patch against the latest tree and re-send the
> > > patch.
> >
> > Yes, I will submit an updated SIMD optimizations patchset in a few days.
> 
> Updated patches are attached.
> 
> Performance improvement when testing with the Big Buck Bunny soundtrack varies
> somewhere between 1.4x (4 subbands, MMX analysis filter, Intel Core2 CPU) and
> 2x factor (8 subbands, NEON analysis filter, ARM Cortex-A8 CPU). But these
> numbers are for default bitpool settings (32) and no joint stereo, this
> configuration is quite sensitive to analysis filter performance.
> 
> SIMD optimized code provides exactly the same output as C version.
> 
> But even with this optimization done, there are still a lot more things
> to improve. I'm going to improve input data permutation/endian
> conversion/channels deinterleaving next. Also scalefactors processing
> can be vectorized. Audio quality can be still improved by tweaking
> constant tables.

patch has been applied. Thanks.

Regards

Marcel



