On 09/08/2011 10:35 AM, Borislav Petkov wrote:
> On Thu, Sep 01, 2011 at 09:18:32AM -0700, Linus Torvalds wrote:
>> On Thu, Sep 1, 2011 at 8:15 AM, Maarten Lankhorst wrote:
>>> This work intrigued me, in some cases kernel memcpy was a lot faster than sse memcpy,
>>> and I finally figured out why. I also extended the test to an optimized avx memcpy,
>>> but I think the kernel memcpy will always win in the aligned case.
>> "rep movs" is generally optimized in microcode on most modern Intel
>> CPU's for some easyish cases, and it will outperform just about
>> anything.
>>
>> Atom is a notable exception, but if you expect performance on any
>> general loads from Atom, you need to get your head examined. Atom is a
>> disaster for anything but tuned loops.
>>
>> The "easyish cases" depend on microarchitecture. They are improving,
>> so long-term "rep movs" is the best way regardless, but for most
>> current ones it's something like "source aligned to 8 bytes *and*
>> source and destination are equal "mod 64"".
>>
>> And that's true in a lot of common situations. It's true for the page
>> copy, for example, and it's often true for big user "read()/write()"
>> calls (but "often" may not be "often enough" - high-performance
>> userland should strive to align read/write buffers to 64 bytes, for
>> example).
>>
>> Many other cases of "memcpy()" are the fairly small, constant-sized
>> ones, where the optimal strategy tends to be "move words by hand".
> Yeah,
>
> this probably makes enabling SSE memcpy in the kernel a task
> with diminishing returns. There are also the additional costs of
> saving/restoring FPU context in the kernel which eat off from any SSE
> speedup.
>
> And then there's the additional I$ pressure because "rep movs" is
> much smaller than all those mov[au]ps stanzas.
> Btw, mov[au]ps are the
> smallest (two-byte) instructions I could use - in the AVX case they can
> get up to 4 Bytes of length with the VEX prefix and the additional SIB,
> size override, etc. fields.
>
> Oh, and then there's copy_*_user which also does fault handling and
> replacing that with a SSE version of memcpy could get quite hairy quite
> fast.
>
> Anyway, I'll try to benchmark an asm version of SSE memcpy in the kernel
> when I get the time to see whether it still makes sense, at all.
>

I have changed your SSE memcpy test to try various alignments with
explicit source/destination offsets instead of random ones. From that
you can see that you don't really get a speedup at all. It seems to be
more a case of 'kernel memcpy is significantly slower with some
alignments' than 'AVX memcpy is just that much faster'.

For example, size 3754 with src misalignment 4 and target misalignment
20 takes 1185 units with AVX memcpy, but 1480 units with kernel memcpy.

The modified testcase is attached. I did some optimizations in the AVX
memcpy, but I fear I may be missing something: when I tried to put it
in the kernel, it complained about SATA errors I never had before, so I
immediately went for the power button to prevent more errors.
Fortunately it only corrupted some kernel object files, and btrfs threw
checksum errors. :)

All in all I think testing in userspace is safer. You might want to run
it on an idle CPU with schedtool, with a high FIFO priority, and set
the cpufreq governor to performance.

~Maarten
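
For readers without the attachment, here is a rough userspace sketch of
the kind of offset sweep described above. It is not the attached
testcase. The names rep_movs_easy_case(), align_up(), time_copy() and
copy_fn are made up for illustration, the predicate only encodes the
"aligned to 8 bytes and equal mod 64" rule of thumb quoted above (which
depends on microarchitecture), and timing uses clock(), which is much
coarser than whatever "units" the real test reports.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Any memcpy-compatible routine (libc memcpy, an SSE/AVX variant, ...). */
typedef void *(*copy_fn)(void *, const void *, size_t);

/*
 * Rule of thumb from the quoted mail, NOT a guarantee for any given CPU:
 * "source aligned to 8 bytes *and* source and destination equal mod 64".
 */
static int rep_movs_easy_case(uintptr_t dst, uintptr_t src)
{
    return (src % 8 == 0) && (dst % 64 == src % 64);
}

/* Round p up to the next multiple of a (a > 0). */
static unsigned char *align_up(unsigned char *p, size_t a)
{
    uintptr_t v = (uintptr_t)p;
    return p + ((a - v % a) % a);
}

/*
 * Time `iters` copies of `len` bytes where the source sits `src_off`
 * bytes past a 64-byte boundary and the destination `dst_off` bytes
 * past one. Returns elapsed CPU seconds, or -1.0 on allocation failure
 * or if the copy produced wrong data.
 */
static double time_copy(copy_fn copy, size_t len,
                        size_t src_off, size_t dst_off, int iters)
{
    assert(copy != NULL && iters > 0);

    /* Over-allocate so that aligning and offsetting stays in bounds. */
    unsigned char *src_raw = malloc(len + src_off + 64);
    unsigned char *dst_raw = malloc(len + dst_off + 64);
    if (!src_raw || !dst_raw) {
        free(src_raw);
        free(dst_raw);
        return -1.0;
    }

    unsigned char *src = align_up(src_raw, 64) + src_off;
    unsigned char *dst = align_up(dst_raw, 64) + dst_off;
    memset(src, 0xA5, len);

    /* Volatile pointer: forces a real call on every iteration. */
    copy_fn volatile fn = copy;

    clock_t t0 = clock();
    for (int i = 0; i < iters; i++)
        fn(dst, src, len);
    clock_t t1 = clock();

    int ok = memcmp(dst, src, len) == 0;
    free(src_raw);
    free(dst_raw);
    return ok ? (double)(t1 - t0) / CLOCKS_PER_SEC : -1.0;
}
```

To sweep, call time_copy() in two nested loops over src_off and dst_off
(say 0..63) for a fixed len such as 3754, once per copy routine you want
to compare, and print rep_movs_easy_case(dst_off, src_off) next to each
result to see whether the slow spots line up with the rule of thumb.
Passing plain memcpy only measures libc, not the kernel's memcpy.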