linux-kernel.vger.kernel.org archive mirror
From: Maarten Lankhorst <m.b.lankhorst@gmail.com>
To: Borislav Petkov <bp@amd64.org>
Cc: "Valdis.Kletnieks@vt.edu" <Valdis.Kletnieks@vt.edu>,
	Borislav Petkov <bp@alien8.de>, Ingo Molnar <mingo@elte.hu>,
	melwyn lobo <linux.melwyn@gmail.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: x86 memcpy performance
Date: Thu, 01 Sep 2011 17:15:22 +0200	[thread overview]
Message-ID: <4E5FA18A.7010205@gmail.com> (raw)
In-Reply-To: <20110816121604.GA29251@aftab>

[-- Attachment #1: Type: text/plain, Size: 3418 bytes --]

Hey,

2011/8/16 Borislav Petkov <bp@amd64.org>:
> On Mon, Aug 15, 2011 at 10:34:35PM -0400, Valdis.Kletnieks@vt.edu wrote:
>> On Sun, 14 Aug 2011 11:59:10 +0200, Borislav Petkov said:
>>
>> > Benchmarking with 10000 iterations, average results:
>> > size    XM              MM              speedup
>> > 119     540.58          449.491         0.8314969419
>>
>> > 12273   2307.86         4042.88         1.751787902
>> > 13924   2431.8          4224.48         1.737184756
>> > 14335   2469.4          4218.82         1.708440514
>> > 15018   2675.67         1904.07         0.711622886
>> > 16374   2989.75         5296.26         1.771470902
>> > 24564   4262.15         7696.86         1.805863077
>> > 27852   4362.53         3347.72         0.7673805572
>> > 28672   5122.8          7113.14         1.388524413
>> > 30033   4874.62         8740.04         1.792967931
>>
>> The numbers for 15018 and 27852 are *way* odd for the MM case. I don't feel
>> really good about this till we understand what happened for those two cases.
>
> Yep.
>
>> Also, anytime I see "10000 iterations", I ask myself if the benchmark
>> rigging took proper note of hot/cold cache issues. That *may* explain
>> the two oddball results we see above - but not knowing more about how
>> it was benched, it's hard to say.
>
> Yeah, the more scrutiny this gets the better. So I've cleaned up my
> setup and have attached it.
>
> xm_mem.c does the benchmarking and in bench_memcpy() there's the
> sse_memcpy call which is the SSE memcpy implementation using inline asm.
> It looks like gcc produces pretty crappy code here because if I replace
> the sse_memcpy call with xm_memcpy() from xm_memcpy.S - this is the
> same function but in pure asm - I get much better numbers, sometimes
> even over 2x. It all depends on the alignment of the buffers though.
> Also, those numbers don't include the context saving/restoring which the
> kernel does for us.
>
> 7491    1509.89         2346.94         1.554378381
> 8170    2166.81         2857.78         1.318890326
> 12277   2659.03         4179.31         1.571744176
> 13907   2571.24         4125.7          1.604558427
> 14319   2638.74         5799.67         2.19789466      <----
> 14993   2752.42         4413.85         1.603625603
> 16371   3479.11         5562.65         1.59887055

This work intrigued me: in some cases the kernel memcpy was a lot faster than the
sse memcpy, and I finally figured out why. I also extended the test to an optimized
avx memcpy, but I think the kernel memcpy will always win in the aligned case.

The numbers you posted don't seem right, though. It depends a lot on the alignment:
for example, if src and dst are 64-byte aligned relative to each other,
the kernel memcpy beats the avx memcpy on my machine.

I replaced the malloc calls with memalign(65536, size + 256) so I could toy
around with the alignments a little. This explains why, for some sizes, the kernel
memcpy was faster than the sse memcpy in your test results.
When ((src & 63) == (dst & 63)), the kernel memcpy always seems to win; otherwise
the avx memcpy might.

If you want to speed up memcpy, I think your best bet is to find out why it's
so much slower when src and dst aren't 64-byte aligned relative to each other.

Cheers,
Maarten

---
Attached: my modified version of the sse memcpy you posted.

I changed it a bit, and used avx, but some of the other changes might
be better for your sse memcpy too.

[-- Attachment #2: ym_memcpy.txt --]
[-- Type: text/plain, Size: 2668 bytes --]

/*
 * ym_memcpy - AVX version of memcpy
 *
 * Input:
 *  rdi destination
 *  rsi source
 *  rdx count
 *
 * Output:
 *  rax original destination
 */
.globl ym_memcpy
.type ym_memcpy, @function

ym_memcpy:
	mov %rdi, %rax

	/* Copy head bytes until the destination is 32-byte aligned
	 * (note: assumes count >= 31, otherwise rdx underflows below) */
	movzbq %dil, %rcx
	negb %cl
	andb $0x1f, %cl		/* rcx = (-dst) & 31 */
	subq %rcx, %rdx
	rep movsb

	movq %rdx, %rcx
	andq $0x1ff, %rdx	/* rdx = trailer bytes (count % 512) */
	shrq $9, %rcx		/* rcx = number of 512-byte blocks */
	jz .trailer

	movb %sil, %r8b
	andb $0x1f, %r8b	/* is the source 32-byte aligned too? */
	jz .repeat_a		/* andb already set ZF */

	.align 32
.repeat_ua:
	vmovups 0x0(%rsi), %ymm0
	vmovups 0x20(%rsi), %ymm1
	vmovups 0x40(%rsi), %ymm2
	vmovups 0x60(%rsi), %ymm3
	vmovups 0x80(%rsi), %ymm4
	vmovups 0xa0(%rsi), %ymm5
	vmovups 0xc0(%rsi), %ymm6
	vmovups 0xe0(%rsi), %ymm7
	vmovups 0x100(%rsi), %ymm8
	vmovups 0x120(%rsi), %ymm9
	vmovups 0x140(%rsi), %ymm10
	vmovups 0x160(%rsi), %ymm11
	vmovups 0x180(%rsi), %ymm12
	vmovups 0x1a0(%rsi), %ymm13
	vmovups 0x1c0(%rsi), %ymm14
	vmovups 0x1e0(%rsi), %ymm15

	vmovaps %ymm0, 0x0(%rdi)
	vmovaps %ymm1, 0x20(%rdi)
	vmovaps %ymm2, 0x40(%rdi)
	vmovaps %ymm3, 0x60(%rdi)
	vmovaps %ymm4, 0x80(%rdi)
	vmovaps %ymm5, 0xa0(%rdi)
	vmovaps %ymm6, 0xc0(%rdi)
	vmovaps %ymm7, 0xe0(%rdi)
	vmovaps %ymm8, 0x100(%rdi)
	vmovaps %ymm9, 0x120(%rdi)
	vmovaps %ymm10, 0x140(%rdi)
	vmovaps %ymm11, 0x160(%rdi)
	vmovaps %ymm12, 0x180(%rdi)
	vmovaps %ymm13, 0x1a0(%rdi)
	vmovaps %ymm14, 0x1c0(%rdi)
	vmovaps %ymm15, 0x1e0(%rdi)

	/* advance pointers */
	addq $0x200, %rsi
	addq $0x200, %rdi
	subq $1, %rcx
	jnz .repeat_ua
	jmp .trailer

	.align 32
.repeat_a:
	prefetchnta 0x80(%rsi)
	prefetchnta 0x100(%rsi)
	prefetchnta 0x180(%rsi)
	vmovaps 0x0(%rsi), %ymm0
	vmovaps 0x20(%rsi), %ymm1
	vmovaps 0x40(%rsi), %ymm2
	vmovaps 0x60(%rsi), %ymm3
	vmovaps 0x80(%rsi), %ymm4
	vmovaps 0xa0(%rsi), %ymm5
	vmovaps 0xc0(%rsi), %ymm6
	vmovaps 0xe0(%rsi), %ymm7
	vmovaps 0x100(%rsi), %ymm8
	vmovaps 0x120(%rsi), %ymm9
	vmovaps 0x140(%rsi), %ymm10
	vmovaps 0x160(%rsi), %ymm11
	vmovaps 0x180(%rsi), %ymm12
	vmovaps 0x1a0(%rsi), %ymm13
	vmovaps 0x1c0(%rsi), %ymm14
	vmovaps 0x1e0(%rsi), %ymm15

	vmovaps %ymm0, 0x0(%rdi)
	vmovaps %ymm1, 0x20(%rdi)
	vmovaps %ymm2, 0x40(%rdi)
	vmovaps %ymm3, 0x60(%rdi)
	vmovaps %ymm4, 0x80(%rdi)
	vmovaps %ymm5, 0xa0(%rdi)
	vmovaps %ymm6, 0xc0(%rdi)
	vmovaps %ymm7, 0xe0(%rdi)
	vmovaps %ymm8, 0x100(%rdi)
	vmovaps %ymm9, 0x120(%rdi)
	vmovaps %ymm10, 0x140(%rdi)
	vmovaps %ymm11, 0x160(%rdi)
	vmovaps %ymm12, 0x180(%rdi)
	vmovaps %ymm13, 0x1a0(%rdi)
	vmovaps %ymm14, 0x1c0(%rdi)
	vmovaps %ymm15, 0x1e0(%rdi)

	/* advance pointers */
	addq $0x200, %rsi
	addq $0x200, %rdi
	subq $1, %rcx
	jnz .repeat_a

	.align 32
.trailer:
	movq %rdx, %rcx
	shrq $3, %rcx
	rep; movsq
	movq %rdx, %rcx
	andq $0x7, %rcx
	rep; movsb
	retq

Thread overview: 40+ messages
2011-08-12 17:59 x86 memcpy performance melwyn lobo
2011-08-12 18:33 ` Andi Kleen
2011-08-12 19:52 ` Ingo Molnar
2011-08-14  9:59   ` Borislav Petkov
2011-08-14 11:13     ` Denys Vlasenko
2011-08-14 12:40       ` Borislav Petkov
2011-08-15 13:27         ` melwyn lobo
2011-08-15 13:44         ` Denys Vlasenko
2011-08-16  2:34     ` Valdis.Kletnieks
2011-08-16 12:16       ` Borislav Petkov
2011-09-01 15:15         ` Maarten Lankhorst [this message]
2011-09-01 16:18           ` Linus Torvalds
2011-09-08  8:35             ` Borislav Petkov
2011-09-08 10:58               ` Maarten Lankhorst
2011-09-09  8:14                 ` Borislav Petkov
2011-09-09 10:12                   ` Maarten Lankhorst
2011-09-09 11:23                     ` Maarten Lankhorst
2011-09-09 13:42                       ` Borislav Petkov
2011-09-09 14:39                   ` Linus Torvalds
2011-09-09 15:35                     ` Borislav Petkov
2011-12-05 12:20                       ` melwyn lobo
2011-12-05 12:54           ` melwyn lobo
2011-12-05 14:36             ` Alan Cox
  -- strict thread matches above, loose matches on Subject: below --
2011-08-15 14:55 Borislav Petkov
2011-08-15 14:59 ` Andy Lutomirski
2011-08-15 15:29   ` Borislav Petkov
2011-08-15 15:36     ` Andrew Lutomirski
2011-08-15 16:12       ` Borislav Petkov
2011-08-15 17:04         ` Andrew Lutomirski
2011-08-15 18:49           ` Borislav Petkov
2011-08-15 19:11             ` Andrew Lutomirski
2011-08-15 20:05               ` Borislav Petkov
2011-08-15 20:08                 ` Andrew Lutomirski
2011-08-15 16:12       ` H. Peter Anvin
2011-08-15 16:58         ` Andrew Lutomirski
2011-08-15 18:26           ` H. Peter Anvin
2011-08-15 18:35             ` Andrew Lutomirski
2011-08-15 18:52               ` H. Peter Anvin
2011-08-16  7:19 ` melwyn lobo
2011-08-16  7:43   ` Borislav Petkov
