Re: x86 memcpy performance - Maarten Lankhorst

From: Maarten Lankhorst <m.b.lankhorst@gmail.com>
To: Borislav Petkov <bp@alien8.de>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Borislav Petkov <bp@amd64.org>,
	"Valdis.Kletnieks@vt.edu" <Valdis.Kletnieks@vt.edu>,
	Ingo Molnar <mingo@elte.hu>, melwyn lobo <linux.melwyn@gmail.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: x86 memcpy performance
Date: Thu, 08 Sep 2011 12:58:13 +0200
Message-ID: <4E689FC5.8010005@gmail.com>
In-Reply-To: <20110908083551.GA5646@liondog.tnic>

[-- Attachment #1: Type: text/plain, Size: 3330 bytes --]

On 09/08/2011 10:35 AM, Borislav Petkov wrote:
> On Thu, Sep 01, 2011 at 09:18:32AM -0700, Linus Torvalds wrote:
>> On Thu, Sep 1, 2011 at 8:15 AM, Maarten Lankhorst
>> <m.b.lankhorst@gmail.com> wrote:
>>> This work intrigued me, in some cases kernel memcpy was a lot faster than sse memcpy,
>>> and I finally figured out why. I also extended the test to an optimized avx memcpy,
>>> but I think the kernel memcpy will always win in the aligned case.
>> "rep movs" is generally optimized in microcode on most modern Intel
>> CPU's for some easyish cases, and it will outperform just about
>> anything.
>>
>> Atom is a notable exception, but if you expect performance on any
>> general loads from Atom, you need to get your head examined. Atom is a
>> disaster for anything but tuned loops.
>>
>> The "easyish cases" depend on microarchitecture. They are improving,
>> so long-term "rep movs" is the best way regardless, but for most
>> current ones it's something like "source aligned to 8 bytes *and*
>> source and destination are equal "mod 64"".
>>
>> And that's true in a lot of common situations. It's true for the page
>> copy, for example, and it's often true for big user "read()/write()"
>> calls (but "often" may not be "often enough" - high-performance
>> userland should strive to align read/write buffers to 64 bytes, for
>> example).
>>
>> Many other cases of "memcpy()" are the fairly small, constant-sized
>> ones, where the optimal strategy tends to be "move words by hand".
> Yeah,
>
> this probably makes enabling SSE memcpy in the kernel a task
> with diminishing returns. There are also the additional costs of
> saving/restoring FPU context in the kernel which eat off from any SSE
> speedup.
>
> And then there's the additional I$ pressure because "rep movs" is
> much smaller than all those mov[au]ps stanzas. Btw, mov[au]ps are the
> smallest (two-byte) instructions I could use - in the AVX case they can
> get up to 4 Bytes of length with the VEX prefix and the additional SIB,
> size override, etc. fields.
>
> Oh, and then there's copy_*_user which also does fault handling and
> replacing that with a SSE version of memcpy could get quite hairy quite
> fast.
>
> Anyway, I'll try to benchmark an asm version of SSE memcpy in the kernel
> when I get the time to see whether it still makes sense, at all.
>
I have changed your sse memcpy testcase to test various alignments with
explicit source/destination offsets instead of random ones. From that you
can see that you don't really get a speedup at all. It seems to be more
a case of 'kernel memcpy is significantly slower with some alignments'
than 'avx memcpy is just that much faster'.

For example, a copy of length 3754 with src misalignment 4 and target
misalignment 20 takes 1185 units with avx memcpy, but 1480 units with
kernel memcpy.
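
To make that kind of measurement concrete, here is a minimal userspace
sketch in C that times copies at a chosen source/destination misalignment.
It is not the attached testcase: the names copy_fn, now_ns, time_copy and
sweep_offsets are made up for illustration, the 64-byte base alignment and
the nanosecond clock are assumptions, and the "units" quoted above are not
necessarily nanoseconds.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Any routine with memcpy's signature can be measured (hypothetical type). */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Monotonic clock in nanoseconds (illustrative helper). */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Time `reps` copies of `len` bytes, with src and dst placed `src_off` and
 * `dst_off` bytes past a 64-byte boundary. Stores the elapsed time in
 * *elapsed_ns. Returns 0 on success, -1 on allocation failure or if the
 * destination does not match the source afterwards.
 */
static int time_copy(copy_fn fn, size_t len, size_t src_off, size_t dst_off,
		     unsigned reps, uint64_t *elapsed_ns)
{
	void *sbuf = NULL, *dbuf = NULL;
	unsigned char *src, *dst;
	uint64_t t0;
	int ret = -1;

	assert(src_off < 64 && dst_off < 64);
	if (posix_memalign(&sbuf, 64, len + 64))
		return -1;
	if (posix_memalign(&dbuf, 64, len + 64)) {
		dbuf = NULL;
		goto out;
	}
	src = (unsigned char *)sbuf + src_off;
	dst = (unsigned char *)dbuf + dst_off;
	for (size_t i = 0; i < len; i++)
		src[i] = (unsigned char)(i * 31u + 7u);
	memset(dst, 0, len);

	t0 = now_ns();
	for (unsigned r = 0; r < reps; r++) {
		fn(dst, src, len);
		/* GCC/Clang barrier so the copies are not optimized away. */
		__asm__ __volatile__("" : : "r"(dst) : "memory");
	}
	*elapsed_ns = now_ns() - t0;

	ret = memcmp(dst, src, len) == 0 ? 0 : -1;
out:
	free(dbuf);
	free(sbuf);
	return ret;
}

/* Print the elapsed time for every src/dst offset pair in [0, 64). */
static void sweep_offsets(copy_fn fn, size_t len, unsigned reps, size_t step)
{
	assert(step > 0);
	for (size_t s = 0; s < 64; s += step)
		for (size_t d = 0; d < 64; d += step) {
			uint64_t ns;

			if (time_copy(fn, len, s, d, reps, &ns) == 0)
				printf("len=%zu src+%zu dst+%zu: %llu ns\n",
				       len, s, d, (unsigned long long)ns);
		}
}
```

Calling time_copy(memcpy, 3754, 4, 20, reps, &ns) would measure libc memcpy
for the case above; a hand-written sse/avx routine with the same signature
can be passed in its place, and sweep_offsets(fn, 3754, reps, 4) walks all
offsets in steps of 4.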

The modified testcase is attached. I did some optimizations in avx memcpy,
but I fear I may be missing something: when I tried to put it in the kernel,
it complained about sata errors I never had before, so I immediately went
for the power button to prevent more errors. Fortunately it only corrupted
some kernel object files, and btrfs threw checksum errors. :)

All in all I think testing in userspace is safer. You might want to run it
on an idle cpu with schedtool, with a high fifo priority, and set the
cpufreq governor to performance.
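
If you would rather do the pinning and priority from inside the test than
through schedtool, something along these lines should work on Linux. This
is only a sketch: the function names are hypothetical, it pins to the first
cpu the process is allowed on (which is not necessarily an idle one), and
SCHED_FIFO normally needs root or CAP_SYS_NICE, so the second call may
simply fail. Setting the cpufreq governor is left out.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <string.h>

/*
 * Pin the calling thread to the first cpu in its current affinity mask.
 * Returns the cpu number, or -1 on failure.
 */
static int pin_to_first_allowed_cpu(void)
{
	cpu_set_t allowed, one;

	if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
		return -1;
	for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
		if (!CPU_ISSET(cpu, &allowed))
			continue;
		CPU_ZERO(&one);
		CPU_SET(cpu, &one);
		if (sched_setaffinity(0, sizeof(one), &one) != 0)
			return -1;
		return cpu;
	}
	return -1;
}

/*
 * Ask for SCHED_FIFO at priority `prio`, clamped to the valid range.
 * Returns 0 on success, -1 with errno set (typically EPERM) otherwise.
 */
static int try_fifo(int prio)
{
	struct sched_param sp;
	int min = sched_get_priority_min(SCHED_FIFO);
	int max = sched_get_priority_max(SCHED_FIFO);

	if (min < 0 || max < 0)
		return -1;
	assert(min <= max);
	if (prio < min)
		prio = min;
	if (prio > max)
		prio = max;
	memset(&sp, 0, sizeof(sp));
	sp.sched_priority = prio;
	return sched_setscheduler(0, SCHED_FIFO, &sp);
}
```

Both calls would go at the start of the benchmark, before any timing. A
failed try_fifo() is worth reporting but need not be fatal; the numbers
will just be noisier.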

~Maarten

[-- Attachment #2: memcpy.tar.gz --]
[-- Type: application/x-gzip, Size: 4352 bytes --]
