Re: x86 memcpy performance - Maarten Lankhorst

From: Maarten Lankhorst <m.b.lankhorst@gmail.com>
To: Borislav Petkov <bp@amd64.org>
Cc: "Valdis.Kletnieks@vt.edu" <Valdis.Kletnieks@vt.edu>,
	Borislav Petkov <bp@alien8.de>, Ingo Molnar <mingo@elte.hu>,
	melwyn lobo <linux.melwyn@gmail.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: x86 memcpy performance
Date: Thu, 01 Sep 2011 17:15:22 +0200
Message-ID: <4E5FA18A.7010205@gmail.com>
In-Reply-To: <20110816121604.GA29251@aftab>

[-- Attachment #1: Type: text/plain, Size: 3418 bytes --]

Hey,

2011/8/16 Borislav Petkov <bp@amd64.org>:
> On Mon, Aug 15, 2011 at 10:34:35PM -0400, Valdis.Kletnieks@vt.edu wrote:
>> On Sun, 14 Aug 2011 11:59:10 +0200, Borislav Petkov said:
>>
>> > Benchmarking with 10000 iterations, average results:
>> > size    XM              MM              speedup
>> > 119     540.58          449.491         0.8314969419
>>
>> > 12273   2307.86         4042.88         1.751787902
>> > 13924   2431.8          4224.48         1.737184756
>> > 14335   2469.4          4218.82         1.708440514
>> > 15018   2675.67         1904.07         0.711622886
>> > 16374   2989.75         5296.26         1.771470902
>> > 24564   4262.15         7696.86         1.805863077
>> > 27852   4362.53         3347.72         0.7673805572
>> > 28672   5122.8          7113.14         1.388524413
>> > 30033   4874.62         8740.04         1.792967931
>>
>> The numbers for 15018 and 27852 are *way* odd for the MM case. I don't feel
>> really good about this till we understand what happened for those two cases.
>
> Yep.
>
>> Also, anytime I see "10000 iterations", I ask myself if the benchmark
>> rigging took proper note of hot/cold cache issues. That *may* explain
>> the two oddball results we see above - but not knowing more about how
>> it was benched, it's hard to say.
>
> Yeah, the more scrutiny this gets the better. So I've cleaned up my
> setup and have attached it.
>
> xm_mem.c does the benchmarking and in bench_memcpy() there's the
> sse_memcpy call which is the SSE memcpy implementation using inline asm.
> It looks like gcc produces pretty crappy code here because if I replace
> the sse_memcpy call with xm_memcpy() from xm_memcpy.S - this is the
> same function but in pure asm - I get much better numbers, sometimes
> even over 2x. It all depends on the alignment of the buffers though.
> Also, those numbers don't include the context saving/restoring which the
> kernel does for us.
>
> 7491    1509.89         2346.94         1.554378381
> 8170    2166.81         2857.78         1.318890326
> 12277   2659.03         4179.31         1.571744176
> 13907   2571.24         4125.7          1.604558427
> 14319   2638.74         5799.67         2.19789466      <----
> 14993   2752.42         4413.85         1.603625603
> 16371   3479.11         5562.65         1.59887055

This work intrigued me: in some cases the kernel memcpy was a lot faster than the
sse memcpy, and I finally figured out why. I also extended the test to an optimized
avx memcpy, but I think the kernel memcpy will always win in the aligned case.

The numbers you posted don't seem right. The result depends a lot on the alignment:
for example, if src and dst are aligned to 64 bytes relative to each other,
the kernel memcpy beats the avx memcpy on my machine.

I replaced the malloc calls with memalign(65536, size + 256) so I could toy
around with the alignments a little. This explains why, for some sizes, the kernel
memcpy was faster than the sse memcpy in your test results.
When ((src & 63) == (dst & 63)), the kernel memcpy seems to always win; otherwise
the avx memcpy might.
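
The alignment check above can be sketched in C roughly as follows. This is
only an illustration, not the code from the benchmark: the helper names
same_line_offset() and alloc_aligned() are hypothetical, and posix_memalign()
is used here as a portable stand-in for the memalign() call mentioned above.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: nonzero when src and dst sit at the same offset
 * inside a 64-byte line, i.e. (src & 63) == (dst & 63).
 * Note the parentheses: in C, == binds tighter than &.
 */
static int same_line_offset(const void *src, const void *dst)
{
	return ((uintptr_t)src & 63) == ((uintptr_t)dst & 63);
}

/*
 * Hypothetical helper: a 64 KiB-aligned buffer with 256 bytes of slack,
 * so the caller can add any offset below 256 to the base pointer.
 * Assumption: posix_memalign() replaces memalign(65536, size + 256).
 */
static void *alloc_aligned(size_t size)
{
	void *p = NULL;

	if (posix_memalign(&p, 65536, size + 256) != 0)
		return NULL;
	return p;
}
```

With a base pointer from alloc_aligned(), base + 5 and base + 64 + 5 share the
same offset within a line, while base + 5 and base + 64 + 6 do not.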

If you want to speed up memcpy, I think your best bet is to find out why it's
so much slower when src and dst aren't 64-byte aligned relative to each other.
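
One way to look into this is to time the same copy at different src and dst
offsets. The sketch below is an assumption about how such a probe could look;
it is not the xm_mem.c harness. time_copy() and copy_fn are hypothetical names,
posix_memalign() and clock_gettime() are standard POSIX calls, and the empty
asm statement is a GCC/Clang-specific compiler barrier.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/*
 * Average nanoseconds per call of fn for one (src_off, dst_off) pair.
 * Returns a negative value on bad arguments, allocation failure or a
 * copy that did not reproduce the source. Offsets must be below 256,
 * matching the slack added to each allocation.
 */
static double time_copy(copy_fn fn, size_t size, unsigned src_off,
			unsigned dst_off, unsigned iters)
{
	void *sbase = NULL, *dbase = NULL;
	struct timespec t0, t1;
	double ns = -1.0;

	if (iters == 0 || src_off >= 256 || dst_off >= 256)
		return -1.0;
	if (posix_memalign(&sbase, 65536, size + 256) != 0)
		return -1.0;
	if (posix_memalign(&dbase, 65536, size + 256) != 0) {
		free(sbase);
		return -1.0;
	}

	unsigned char *src = (unsigned char *)sbase + src_off;
	unsigned char *dst = (unsigned char *)dbase + dst_off;

	/* Fill the source with a pattern and touch the destination. */
	for (size_t k = 0; k < size; k++)
		src[k] = (unsigned char)(k * 31 + 7);
	memset(dst, 0, size);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (unsigned i = 0; i < iters; i++) {
		fn(dst, src, size);
		/* Keep the compiler from merging or dropping the copies. */
		__asm__ volatile("" ::: "memory");
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	if (memcmp(dst, src, size) == 0)
		ns = ((t1.tv_sec - t0.tv_sec) * 1e9 +
		      (t1.tv_nsec - t0.tv_nsec)) / iters;

	free(sbase);
	free(dbase);
	return ns;
}
```

Calling time_copy(memcpy, size, 0, off, iters) for off from 0 to 63 would give
one number per relative alignment. Because the buffers are reused on every
iteration, this measures the hot-cache case only.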

Cheers,
Maarten

---
Attached: my modified version of the sse memcpy you posted.

I changed it a bit, and used avx, but some of the other changes might
be better for your sse memcpy too.
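
To make the structure of the attached routine easier to follow, here is a
small C model of how it splits the byte count: a head of up to 31 bytes to
align the destination, 512-byte AVX blocks, then a trailer of 8-byte and
1-byte moves. struct copy_plan and plan_copy() are hypothetical names used
only for this illustration, and the clamp on the head length is this
sketch's own guard for counts shorter than the alignment prefix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct copy_plan {
	size_t head;	 /* bytes moved by the first rep movsb (0..31) */
	size_t blocks;	 /* iterations of the 512-byte ymm loop */
	size_t qwords;	 /* 8-byte moves done by rep movsq in the trailer */
	size_t bytes;	 /* final 0..7 bytes done by rep movsb */
	int src_aligned; /* nonzero: the vmovaps load loop (.repeat_a) applies */
};

/* Model of the count arithmetic in ym_memcpy; no memory is touched. */
static struct copy_plan plan_copy(uintptr_t dst, uintptr_t src, size_t count)
{
	struct copy_plan p;
	size_t rest, tail;

	p.head = (size_t)(-dst & 0x1f);	/* negb %cl; andb $0x1f, %cl */
	if (p.head > count)		/* guard for count < head */
		p.head = count;

	rest = count - p.head;		/* subq %rcx, %rdx */
	p.blocks = rest >> 9;		/* shrq $9, %rcx */
	tail = rest & 0x1ff;		/* andq $0x1ff, %rdx */
	p.qwords = tail >> 3;		/* shrq $3, %rcx */
	p.bytes = tail & 7;		/* andq $0x7, %rcx */

	/* The source is tested after the head copy has advanced it. */
	p.src_aligned = ((src + p.head) & 0x1f) == 0;
	return p;
}
```

For example, a 1000-byte copy with dst = 0x1001 and src = 0x2001 gives a
31-byte head, one 512-byte block, 57 quadwords and 1 trailing byte, and it
takes the aligned-load loop because src + 31 is a multiple of 32.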

[-- Attachment #2: ym_memcpy.txt --]
[-- Type: text/plain, Size: 2668 bytes --]

/*
 * ym_memcpy - AVX version of memcpy
 *
 * Input:
 *  rdi destination
 *  rsi source
 *  rdx count
 *
 * Output:
 * rax original destination
 */
.globl ym_memcpy
.type ym_memcpy, @function

ym_memcpy:
	mov %rdi, %rax

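	/* Target align: copy 0..31 bytes so that rdi becomes 32-byte aligned */
	movzbq %dil, %rcx
	negb %cl
	andb $0x1f, %cl
	/* clamp to count, so a copy shorter than the head cannot underflow rdx */
	cmpq %rdx, %rcx
	cmova %rdx, %rcx
	subq %rcx, %rdx
	rep movsb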

	movq %rdx, %rcx
	andq $0x1ff, %rdx
	shrq $9, %rcx
	jz .trailer

	movb %sil, %r8b
	andb $0x1f, %r8b
	test %r8b, %r8b
	jz .repeat_a

	.align 32
.repeat_ua:
	vmovups 0x0(%rsi), %ymm0
	vmovups 0x20(%rsi), %ymm1
	vmovups 0x40(%rsi), %ymm2
	vmovups 0x60(%rsi), %ymm3
	vmovups 0x80(%rsi), %ymm4
	vmovups 0xa0(%rsi), %ymm5
	vmovups 0xc0(%rsi), %ymm6
	vmovups 0xe0(%rsi), %ymm7
	vmovups 0x100(%rsi), %ymm8
	vmovups 0x120(%rsi), %ymm9
	vmovups 0x140(%rsi), %ymm10
	vmovups 0x160(%rsi), %ymm11
	vmovups 0x180(%rsi), %ymm12
	vmovups 0x1a0(%rsi), %ymm13
	vmovups 0x1c0(%rsi), %ymm14
	vmovups 0x1e0(%rsi), %ymm15

	vmovaps %ymm0, 0x0(%rdi)
	vmovaps %ymm1, 0x20(%rdi)
	vmovaps %ymm2, 0x40(%rdi)
	vmovaps %ymm3, 0x60(%rdi)
	vmovaps %ymm4, 0x80(%rdi)
	vmovaps %ymm5, 0xa0(%rdi)
	vmovaps %ymm6, 0xc0(%rdi)
	vmovaps %ymm7, 0xe0(%rdi)
	vmovaps %ymm8, 0x100(%rdi)
	vmovaps %ymm9, 0x120(%rdi)
	vmovaps %ymm10, 0x140(%rdi)
	vmovaps %ymm11, 0x160(%rdi)
	vmovaps %ymm12, 0x180(%rdi)
	vmovaps %ymm13, 0x1a0(%rdi)
	vmovaps %ymm14, 0x1c0(%rdi)
	vmovaps %ymm15, 0x1e0(%rdi)

	/* advance pointers */
	addq $0x200, %rsi
	addq $0x200, %rdi
	subq $1, %rcx
	jnz .repeat_ua
	jz .trailer

	.align 32
.repeat_a:
	prefetchnta 0x80(%rsi)
	prefetchnta 0x100(%rsi)
	prefetchnta 0x180(%rsi)
	vmovaps 0x0(%rsi), %ymm0
	vmovaps 0x20(%rsi), %ymm1
	vmovaps 0x40(%rsi), %ymm2
	vmovaps 0x60(%rsi), %ymm3
	vmovaps 0x80(%rsi), %ymm4
	vmovaps 0xa0(%rsi), %ymm5
	vmovaps 0xc0(%rsi), %ymm6
	vmovaps 0xe0(%rsi), %ymm7
	vmovaps 0x100(%rsi), %ymm8
	vmovaps 0x120(%rsi), %ymm9
	vmovaps 0x140(%rsi), %ymm10
	vmovaps 0x160(%rsi), %ymm11
	vmovaps 0x180(%rsi), %ymm12
	vmovaps 0x1a0(%rsi), %ymm13
	vmovaps 0x1c0(%rsi), %ymm14
	vmovaps 0x1e0(%rsi), %ymm15

	vmovaps %ymm0, 0x0(%rdi)
	vmovaps %ymm1, 0x20(%rdi)
	vmovaps %ymm2, 0x40(%rdi)
	vmovaps %ymm3, 0x60(%rdi)
	vmovaps %ymm4, 0x80(%rdi)
	vmovaps %ymm5, 0xa0(%rdi)
	vmovaps %ymm6, 0xc0(%rdi)
	vmovaps %ymm7, 0xe0(%rdi)
	vmovaps %ymm8, 0x100(%rdi)
	vmovaps %ymm9, 0x120(%rdi)
	vmovaps %ymm10, 0x140(%rdi)
	vmovaps %ymm11, 0x160(%rdi)
	vmovaps %ymm12, 0x180(%rdi)
	vmovaps %ymm13, 0x1a0(%rdi)
	vmovaps %ymm14, 0x1c0(%rdi)
	vmovaps %ymm15, 0x1e0(%rdi)

	/* advance pointers */
	addq $0x200, %rsi
	addq $0x200, %rdi
	subq $1, %rcx
	jnz .repeat_a

	.align 32
.trailer:
	movq %rdx, %rcx
	shrq $3, %rcx
	rep; movsq
	movq %rdx, %rcx
	andq $0x7, %rcx
	rep; movsb
	retq
