I have attached a short test project that demonstrates what I am doing.
I time this simply with the time command, i.e.

$ time ./mul_SSE 100000000

real    0m1.037s
user    0m1.036s
sys     0m0.001s

$ time ./mul_SSE4_1 100000000

real    0m2.006s
user    0m2.003s
sys     0m0.002s

I should mention that I have prepared the A matrix a little for the SSE
version by "dilating" the elements into A = { A11, A11, A11, A11, A12,
A12, ... }, while for SSE4.1 I am calling the multiply with the transpose
of B. As these matrices are really small, they should fit completely in
L1, so the movaps loads should have pretty low latency. Since the SSE
version uses 4 times more data for A than the SSE4.1 version and therefore
does more data movement, I am surprised that it still beats the SSE4.1
version. But maybe I am just not coding this very intelligently.

Any suggestions would be very welcome.

Thanks already,

nick

On 03/12/11 01:20, Frederic Marmond wrote:
> Hello Nicolas,
>
> Yes, it's the right place :)
> Could you please paste your code as well as your benchmark context?
>
> Fred
>
> 2011/3/11 Nicolas Bock
>
>     Hello list,
>
>     I am writing an assembly function that multiplies two 4x4 single
>     precision matrices. I wrote two versions, one using SSE and the
>     other using SSE4.1. What surprised me is that the SSE4.1 version
>     fails to beat the SSE version; it is in fact slightly slower.
>
>     Is this the right place to ask for help? If anyone is interested I
>     can post some code, which would maybe clarify the situation a bit.
>
>     If this is not the right place, please ignore me...
>
>     nick
>
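
P.S. For readers without the attachment: below is a minimal sketch, in C
intrinsics rather than the hand-written assembly of the test project, of
the two strategies I described above. The function names, the row-major
layout, and the 16-byte alignment are my assumptions here, not the actual
attached code; it is only meant to illustrate the data layouts.

/* Hypothetical sketch of the two 4x4 single-precision multiply strategies.
 * Compile with -msse4.1; all pointers are assumed 16-byte aligned. */
#include <xmmintrin.h>   /* SSE: _mm_load_ps, _mm_mul_ps, _mm_add_ps */
#include <smmintrin.h>   /* SSE4.1: _mm_dp_ps */

/* SSE version: A_dilated holds each A(i,j) replicated four times, so a
 * plain mulps against a row of B works without any broadcast.
 * C(i,:) = sum_j A(i,j) * B(j,:) with B stored row-major. */
void mul_sse(const float *A_dilated, const float *B, float *C)
{
    for (int i = 0; i < 4; i++) {
        __m128 row = _mm_mul_ps(_mm_load_ps(&A_dilated[16*i +  0]),
                                _mm_load_ps(&B[0]));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_load_ps(&A_dilated[16*i +  4]),
                                         _mm_load_ps(&B[4])));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_load_ps(&A_dilated[16*i +  8]),
                                         _mm_load_ps(&B[8])));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_load_ps(&A_dilated[16*i + 12]),
                                         _mm_load_ps(&B[12])));
        _mm_store_ps(&C[4*i], row);
    }
}

/* SSE4.1 version: A is row-major and Bt is the transpose of B, so every
 * C(i,j) becomes one dpps dot product (mask 0xF1: use all four lanes,
 * place the result in lane 0). */
void mul_sse4_1(const float *A, const float *Bt, float *C)
{
    for (int i = 0; i < 4; i++) {
        __m128 a = _mm_load_ps(&A[4*i]);
        for (int j = 0; j < 4; j++) {
            __m128 d = _mm_dp_ps(a, _mm_load_ps(&Bt[4*j]), 0xF1);
            C[4*i + j] = _mm_cvtss_f32(d);
        }
    }
}

Note the trade-off this is meant to show: the SSE path loads 64 floats of
dilated A and does 4 mulps + 3 addps per output row, while the SSE4.1 path
loads only the plain 16 floats of A but issues one dpps per output element.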