I have attached a short test project that demonstrates what I am doing.
I time this simply with the time command, i.e.

$ time ./mul_SSE 100000000

real    0m1.037s
user    0m1.036s
sys     0m0.001s

$ time ./mul_SSE4_1 100000000

real    0m2.006s
user    0m2.003s
sys     0m0.002s

I should mention that I have prepared the A matrix a little for the SSE
version by "dilating" the elements into A = { A11, A11, A11, A11, A12,
A12, ... }, while for SSE4.1 I am calling the multiply with the transpose
of B. As these matrices are really small, they should fit completely in
L1, so the movaps loads should have pretty low latency. Since the SSE
version uses 4 times more data for A than the SSE4.1 version and therefore
does more data movement, I am surprised that it still beats the SSE4.1
version. But maybe I am just not coding this very intelligently.

Any suggestions would be very welcome.

Thanks already,

nick

On 03/12/11 01:20, Frederic Marmond wrote:
> Hello Nicolas,
>
> Yes, it's the right place :)
> Could you please paste your code as well as your benchmark context?
>
> Fred
>
> 2011/3/11 Nicolas Bock
>
>     Hello list,
>
>     I am writing an assembly function that multiplies two 4x4 single
>     precision matrices. I wrote two versions, one using SSE and the
>     other using SSE4.1. What surprised me is that the SSE4.1 version
>     fails to beat the SSE version; it is in fact slightly slower.
>
>     Is this the right place to ask for help? If anyone is interested I
>     can post some code, which would maybe clarify the situation a bit.
>
>     If this is not the right place, please ignore me...
>
>     nick
>
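
P.S. For readers without the attachment: below is a minimal sketch, in C
intrinsics rather than the hand-written assembly of the test project, of
the two strategies I described above. The function names, the row-major
layout, and the 16-byte alignment are my assumptions here, not the actual
attached code; it is only meant to illustrate the data layouts.

/* Hypothetical sketch of the two 4x4 single-precision multiply strategies.
 * Compile with -msse4.1; all pointers are assumed 16-byte aligned. */
#include <xmmintrin.h>   /* SSE: _mm_load_ps, _mm_mul_ps, _mm_add_ps */
#include <smmintrin.h>   /* SSE4.1: _mm_dp_ps */

/* SSE version: A_dilated holds each A(i,j) replicated four times, so a
 * plain mulps against a row of B works without any broadcast.
 * C(i,:) = sum_j A(i,j) * B(j,:) with B stored row-major. */
void mul_sse(const float *A_dilated, const float *B, float *C)
{
    for (int i = 0; i < 4; i++) {
        __m128 row = _mm_mul_ps(_mm_load_ps(&A_dilated[16*i +  0]),
                                _mm_load_ps(&B[0]));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_load_ps(&A_dilated[16*i +  4]),
                                         _mm_load_ps(&B[4])));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_load_ps(&A_dilated[16*i +  8]),
                                         _mm_load_ps(&B[8])));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_load_ps(&A_dilated[16*i + 12]),
                                         _mm_load_ps(&B[12])));
        _mm_store_ps(&C[4*i], row);
    }
}

/* SSE4.1 version: A is row-major and Bt is the transpose of B, so every
 * C(i,j) becomes one dpps dot product (mask 0xF1: use all four lanes,
 * place the result in lane 0). */
void mul_sse4_1(const float *A, const float *Bt, float *C)
{
    for (int i = 0; i < 4; i++) {
        __m128 a = _mm_load_ps(&A[4*i]);
        for (int j = 0; j < 4; j++) {
            __m128 d = _mm_dp_ps(a, _mm_load_ps(&Bt[4*j]), 0xF1);
            C[4*i + j] = _mm_cvtss_f32(d);
        }
    }
}

Note the trade-off this is meant to show: the SSE path loads 64 floats of
dilated A and does 4 mulps + 3 addps per output row, while the SSE4.1 path
loads only the plain 16 floats of A but issues one dpps per output element.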