From: Maarten Lankhorst <m.b.lankhorst@gmail.com>
To: Borislav Petkov <bp@amd64.org>
Cc: "Valdis.Kletnieks@vt.edu" <Valdis.Kletnieks@vt.edu>,
Borislav Petkov <bp@alien8.de>, Ingo Molnar <mingo@elte.hu>,
melwyn lobo <linux.melwyn@gmail.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"H. Peter Anvin" <hpa@zytor.com>,
Thomas Gleixner <tglx@linutronix.de>,
Linus Torvalds <torvalds@linux-foundation.org>,
Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: x86 memcpy performance
Date: Thu, 01 Sep 2011 17:15:22 +0200
Message-ID: <4E5FA18A.7010205@gmail.com>
In-Reply-To: <20110816121604.GA29251@aftab>
[-- Attachment #1: Type: text/plain, Size: 3418 bytes --]
Hey,
2011/8/16 Borislav Petkov <bp@amd64.org>:
> On Mon, Aug 15, 2011 at 10:34:35PM -0400, Valdis.Kletnieks@vt.edu wrote:
>> On Sun, 14 Aug 2011 11:59:10 +0200, Borislav Petkov said:
>>
>> > Benchmarking with 10000 iterations, average results:
>> > size XM MM speedup
>> > 119 540.58 449.491 0.8314969419
>>
>> > 12273 2307.86 4042.88 1.751787902
>> > 13924 2431.8 4224.48 1.737184756
>> > 14335 2469.4 4218.82 1.708440514
>> > 15018 2675.67 1904.07 0.711622886
>> > 16374 2989.75 5296.26 1.771470902
>> > 24564 4262.15 7696.86 1.805863077
>> > 27852 4362.53 3347.72 0.7673805572
>> > 28672 5122.8 7113.14 1.388524413
>> > 30033 4874.62 8740.04 1.792967931
>>
>> The numbers for 15018 and 27852 are *way* odd for the MM case. I don't feel
>> really good about this till we understand what happened for those two cases.
>
> Yep.
>
>> Also, anytime I see "10000 iterations", I ask myself if the benchmark
>> rigging took proper note of hot/cold cache issues. That *may* explain
>> the two oddball results we see above - but not knowing more about how
>> it was benched, it's hard to say.
>
> Yeah, the more scrutiny this gets the better. So I've cleaned up my
> setup and have attached it.
>
> xm_mem.c does the benchmarking and in bench_memcpy() there's the
> sse_memcpy call which is the SSE memcpy implementation using inline asm.
> It looks like gcc produces pretty crappy code here because if I replace
> the sse_memcpy call with xm_memcpy() from xm_memcpy.S - this is the
> same function but in pure asm - I get much better numbers, sometimes
> even over 2x. It all depends on the alignment of the buffers though.
> Also, those numbers don't include the context saving/restoring which the
> kernel does for us.
>
> 7491 1509.89 2346.94 1.554378381
> 8170 2166.81 2857.78 1.318890326
> 12277 2659.03 4179.31 1.571744176
> 13907 2571.24 4125.7 1.604558427
> 14319 2638.74 5799.67 2.19789466 <----
> 14993 2752.42 4413.85 1.603625603
> 16371 3479.11 5562.65 1.59887055
This work intrigued me: in some cases the kernel memcpy was a lot faster than the SSE memcpy,
and I finally figured out why. I also extended the test to an optimized AVX memcpy,
but I think the kernel memcpy will always win in the aligned case.
The numbers you posted don't seem right. The result depends a lot on alignment:
for example, if src and dst are 64-byte aligned relative to each other (same offset
modulo 64), the kernel memcpy beats the AVX memcpy on my machine.
I replaced the malloc calls with memalign(65536, size + 256) so I could toy
around with the alignments a little. This explains why, for some sizes, the kernel
memcpy was faster than the SSE memcpy in your test results.
When (src & 63) == (dst & 63), the kernel memcpy always seems to win; otherwise
the AVX memcpy might.
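A minimal sketch of how such an alignment experiment could be set up in C, for anyone who wants to reproduce it. The helper names (line_offset, same_relative_alignment, alloc_skewed) are made up for illustration and are not from the posted benchmark, and posix_memalign stands in for memalign here because it is the portable call:

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Offset of an address within its 64-byte line (the low six bits). */
static unsigned line_offset(const void *p)
{
    return (unsigned)((uintptr_t)p & 63);
}

/*
 * True when (src & 63) == (dst & 63). The parentheses matter:
 * in C, == binds tighter than &.
 */
static int same_relative_alignment(const void *dst, const void *src)
{
    return line_offset(dst) == line_offset(src);
}

/*
 * Hypothetical helper: allocate size + 256 bytes on a 64 KiB boundary
 * and return a pointer skew bytes into the buffer (skew < 256).
 * *base receives the pointer that must later be passed to free().
 */
static unsigned char *alloc_skewed(size_t size, size_t skew, void **base)
{
    assert(skew < 256);
    if (posix_memalign(base, 65536, size + 256) != 0)
        return NULL;
    return (unsigned char *)*base + skew;
}
```

Because the base address is a multiple of 65536, the skew alone decides the offset within a 64-byte line, so a benchmark loop can sweep the skew of src and dst independently and record which memcpy wins for each pair.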
If you want to speed up memcpy, I think your best bet is to find out why it's
so much slower when src and dst aren't 64-byte aligned relative to each other.
Cheers,
Maarten
---
Attached: my modified version of the SSE memcpy you posted.
I changed it a bit and used AVX, but some of the other changes might
be better for your SSE memcpy too.
[-- Attachment #2: ym_memcpy.txt --]
[-- Type: text/plain, Size: 2668 bytes --]
/*
 * ym_memcpy - AVX version of memcpy
 *
 * Input:
 *	rdi destination
 *	rsi source
 *	rdx count
 *
 * Output:
 *	rax original destination
 */
.globl ym_memcpy
.type ym_memcpy, @function
ym_memcpy:
mov %rdi, %rax
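/* Target align: copy byte-wise until rdi reaches a 32-byte boundary */
movzbq %dil, %rcx
negb %cl
andb $0x1f, %cl
/* never copy more than count, otherwise the subq below underflows */
cmpq %rdx, %rcx
cmova %rdx, %rcx
subq %rcx, %rdx
rep movsb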
/* rcx = number of 512-byte blocks, rdx = leftover bytes (< 512) */
movq %rdx, %rcx
andq $0x1ff, %rdx
shrq $9, %rcx
jz .trailer
/* source also 32-byte aligned? then use the aligned-load loop */
testb $0x1f, %sil
jz .repeat_a
.align 32
.repeat_ua:
vmovups 0x0(%rsi), %ymm0
vmovups 0x20(%rsi), %ymm1
vmovups 0x40(%rsi), %ymm2
vmovups 0x60(%rsi), %ymm3
vmovups 0x80(%rsi), %ymm4
vmovups 0xa0(%rsi), %ymm5
vmovups 0xc0(%rsi), %ymm6
vmovups 0xe0(%rsi), %ymm7
vmovups 0x100(%rsi), %ymm8
vmovups 0x120(%rsi), %ymm9
vmovups 0x140(%rsi), %ymm10
vmovups 0x160(%rsi), %ymm11
vmovups 0x180(%rsi), %ymm12
vmovups 0x1a0(%rsi), %ymm13
vmovups 0x1c0(%rsi), %ymm14
vmovups 0x1e0(%rsi), %ymm15
vmovaps %ymm0, 0x0(%rdi)
vmovaps %ymm1, 0x20(%rdi)
vmovaps %ymm2, 0x40(%rdi)
vmovaps %ymm3, 0x60(%rdi)
vmovaps %ymm4, 0x80(%rdi)
vmovaps %ymm5, 0xa0(%rdi)
vmovaps %ymm6, 0xc0(%rdi)
vmovaps %ymm7, 0xe0(%rdi)
vmovaps %ymm8, 0x100(%rdi)
vmovaps %ymm9, 0x120(%rdi)
vmovaps %ymm10, 0x140(%rdi)
vmovaps %ymm11, 0x160(%rdi)
vmovaps %ymm12, 0x180(%rdi)
vmovaps %ymm13, 0x1a0(%rdi)
vmovaps %ymm14, 0x1c0(%rdi)
vmovaps %ymm15, 0x1e0(%rdi)
/* advance pointers */
addq $0x200, %rsi
addq $0x200, %rdi
subq $1, %rcx
jnz .repeat_ua
jmp .trailer
.align 32
.repeat_a:
prefetchnta 0x80(%rsi)
prefetchnta 0x100(%rsi)
prefetchnta 0x180(%rsi)
vmovaps 0x0(%rsi), %ymm0
vmovaps 0x20(%rsi), %ymm1
vmovaps 0x40(%rsi), %ymm2
vmovaps 0x60(%rsi), %ymm3
vmovaps 0x80(%rsi), %ymm4
vmovaps 0xa0(%rsi), %ymm5
vmovaps 0xc0(%rsi), %ymm6
vmovaps 0xe0(%rsi), %ymm7
vmovaps 0x100(%rsi), %ymm8
vmovaps 0x120(%rsi), %ymm9
vmovaps 0x140(%rsi), %ymm10
vmovaps 0x160(%rsi), %ymm11
vmovaps 0x180(%rsi), %ymm12
vmovaps 0x1a0(%rsi), %ymm13
vmovaps 0x1c0(%rsi), %ymm14
vmovaps 0x1e0(%rsi), %ymm15
vmovaps %ymm0, 0x0(%rdi)
vmovaps %ymm1, 0x20(%rdi)
vmovaps %ymm2, 0x40(%rdi)
vmovaps %ymm3, 0x60(%rdi)
vmovaps %ymm4, 0x80(%rdi)
vmovaps %ymm5, 0xa0(%rdi)
vmovaps %ymm6, 0xc0(%rdi)
vmovaps %ymm7, 0xe0(%rdi)
vmovaps %ymm8, 0x100(%rdi)
vmovaps %ymm9, 0x120(%rdi)
vmovaps %ymm10, 0x140(%rdi)
vmovaps %ymm11, 0x160(%rdi)
vmovaps %ymm12, 0x180(%rdi)
vmovaps %ymm13, 0x1a0(%rdi)
vmovaps %ymm14, 0x1c0(%rdi)
vmovaps %ymm15, 0x1e0(%rdi)
/* advance pointers */
addq $0x200, %rsi
addq $0x200, %rdi
subq $1, %rcx
jnz .repeat_a
.align 32
.trailer:
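/* copy the leftover bytes: quadwords first, then single bytes */
movq %rdx, %rcx
shrq $3, %rcx
rep movsq
movq %rdx, %rcx
andq $0x7, %rcx
rep movsb
/* clear the upper ymm halves to avoid AVX/SSE transition penalties */
vzeroupper
retq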
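
To make the bookkeeping in the attachment easier to follow, here is a small C model of how the count is split up: head bytes until the destination reaches a 32-byte boundary, 512-byte blocks for the AVX loops, then quadwords and single bytes in the trailer. The names copy_plan and plan_copy are hypothetical and exist only for this sketch. The model clamps the head so that copies shorter than the alignment distance stay in bounds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* How many bytes each phase of the copy handles (illustrative model). */
struct copy_plan {
    size_t head;   /* bytes moved by the first rep movsb */
    size_t blocks; /* iterations of the 512-byte AVX loop */
    size_t quads;  /* 8-byte moves done by rep movsq in the trailer */
    size_t tail;   /* final single bytes moved by rep movsb */
};

static struct copy_plan plan_copy(uintptr_t dst, size_t count)
{
    struct copy_plan p;
    size_t rest;

    /* Distance from dst to the next 32-byte boundary (0 if aligned). */
    p.head = (size_t)((0 - dst) & 0x1f);
    if (p.head > count)
        p.head = count;          /* short copy: head is the whole copy */

    rest = count - p.head;
    p.blocks = rest >> 9;        /* 512 bytes per loop iteration */
    rest &= 0x1ff;               /* what the trailer has to copy */
    p.quads = rest >> 3;
    p.tail = rest & 7;

    /* The four phases must add up to the requested count. */
    assert(p.head + p.blocks * 512 + p.quads * 8 + p.tail == count);
    return p;
}
```

For example, plan_copy(0x1003, 14319) gives a 29-byte head, 27 blocks, 58 quadwords and 2 tail bytes, which shows how little of a mid-sized copy runs outside the unrolled loop.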