Hello list,

I am writing an assembly function that multiplies two 4x4 single-precision matrices. I wrote two versions, one using SSE and the other using SSE4.1. What surprised me is that the SSE4.1 version fails to beat the SSE version; it is in fact slightly slower.

Is this the right place to ask for help? If anyone is interested, I can post some code, which might clarify the situation a bit. If this is not the right place, please ignore me...

nick