Re: x86 memcpy performance - Maarten Lankhorst

From: Maarten Lankhorst <m.b.lankhorst@gmail.com>
To: Borislav Petkov <bp@alien8.de>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Borislav Petkov <bp@amd64.org>,
	"Valdis.Kletnieks@vt.edu" <Valdis.Kletnieks@vt.edu>,
	Ingo Molnar <mingo@elte.hu>, melwyn lobo <linux.melwyn@gmail.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: x86 memcpy performance
Date: Fri, 09 Sep 2011 12:12:05 +0200
Message-ID: <4E69E675.1010809@gmail.com>
In-Reply-To: <20110909081407.GA29251@liondog.tnic>

Hey,

On 09/09/2011 10:14 AM, Borislav Petkov wrote:
> On Thu, Sep 08, 2011 at 12:58:13PM +0200, Maarten Lankhorst wrote:
>> I have changed your sse memcpy to test various alignments with
>> source/destination offsets instead of random, from that you can
>> see that you don't really get a speedup at all. It seems to be more
>> a case of 'kernel memcpy is significantly slower with some alignments',
>> than 'avx memcpy is just that much faster'.
>>
>> For example 3754 with src misalignment 4 and target misalignment 20
>> takes 1185 units on avx memcpy, but 1480 units with kernel memcpy
> Right, so the idea is to check whether with the bigger buffer sizes
> (and misaligned, although this should not be that often the case in
> the kernel) the SSE version would outperform a "rep movs" with ucode
> optimizations not kicking in.
>
> With your version modified back to SSE memcpy (don't have an AVX box
> right now) I get on an AMD F10h:
>
> ...
> 16384(12/40)    4756.24         7867.74         1.654192552
> 16384(40/12)    5067.81         6068.71         1.197500008
> 16384(12/44)    4341.3          8474.96         1.952172387
> 16384(44/12)    4277.13         7107.64         1.661777347
> 16384(12/48)    4989.16         7964.54         1.596369011
> 16384(48/12)    4644.94         6499.5          1.399264281
> ...
>
> which looks like pretty nice numbers to me. I can't say whether there
> ever is 16K buffer we copy in the kernel but if there were... But <16K
> buffers also show up to 1.5x speedup. So I'd say it's a uarch thing.
> As I said, best it would be to put it in the kernel and run a bunch of
> benchmarks...
I think for bigger memcpys it might make sense to demand stricter
alignment. What are your numbers for (0/0)? In my case kernel memcpy
is always faster for that. In fact, kernel memcpy seems to be
generally faster whenever src&63 == dst&63.
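
To spell out the condition I mean, here is a minimal userspace sketch.
The helper names (line_offset, same_line_offset) are made up for
illustration and are not kernel functions; 64 is simply the line size
implied by the "& 63" mask.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Line size implied by the "& 63" mask; an assumption, not probed at runtime. */
#define CACHELINE_SIZE 64u

/* Hypothetical helper: offset of a pointer within its 64-byte line. */
static inline unsigned line_offset(const void *p)
{
	return (unsigned)((uintptr_t)p & (CACHELINE_SIZE - 1));
}

/*
 * Hypothetical helper: true when src and dst share the same offset
 * within a line, i.e. src&63 == dst&63. Neither pointer has to be
 * 64-byte aligned itself.
 */
static inline bool same_line_offset(const void *dst, const void *src)
{
	return line_offset(dst) == line_offset(src);
}
```

With both buffers starting on a 64-byte boundary, the (0/0) case
satisfies this, while pairs like (12/40) or (4/20) from the tables
above do not.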

After patching my tree to WARN_ON_ONCE when this condition isn't true
(for copies larger than 1024 bytes), I get the following warnings:

WARNING: at arch/x86/kernel/head64.c:49 x86_64_start_reservations+0x3b/0x18d()
WARNING: at arch/x86/kernel/head64.c:52 x86_64_start_reservations+0xcb/0x18d()
WARNING: at arch/x86/kernel/e820.c:1077 setup_memory_map+0x3b/0x72()
WARNING: at kernel/fork.c:938 copy_process+0x148f/0x1550()
WARNING: at arch/x86/vdso/vdso32-setup.c:306 sysenter_setup+0xd4/0x301()
WARNING: at mm/util.c:72 kmemdup+0x75/0x80()
WARNING: at fs/btrfs/disk-io.c:1742 open_ctree+0x1ab5/0x1bb0()
WARNING: at fs/btrfs/disk-io.c:1744 open_ctree+0x1b35/0x1bb0()
WARNING: at fs/btrfs/extent_io.c:3634 write_extent_buffer+0x209/0x240()
WARNING: at fs/exec.c:1002 flush_old_exec+0x6c3/0x750()
WARNING: at fs/btrfs/extent_io.c:3496 read_extent_buffer+0x1b1/0x1e0()
WARNING: at kernel/module.c:2585 load_module+0x1933/0x1c30()
WARNING: at fs/btrfs/extent_io.c:3748 memcpy_extent_buffer+0x2aa/0x2f0()
WARNING: at fs/btrfs/disk-io.c:2276 write_dev_supers+0x34e/0x360()
WARNING: at lib/swiotlb.c:367 swiotlb_bounce+0xc6/0xe0()
WARNING: at fs/btrfs/transaction.c:1387 btrfs_commit_transaction+0x867/0x8a0()
WARNING: at drivers/tty/serial/serial_core.c:527 uart_write+0x14a/0x160()
WARNING: at mm/memory.c:3830 __access_remote_vm+0x251/0x270()

The most persistent offenders appear to be the btrfs *_extent_buffer
functions; they trigger the most warnings on my system. Apart from
those there's not much to gain here, since the alignment is already
close to optimal.

My ext4 /home doesn't throw warnings, so I'd gain the most
by figuring out whether I could improve btrfs/extent_io.c in some way.
The patch that triggers those warnings is below; change it to WARN_ON
if you want to see which one happens the most for you.

I was pleasantly surprised though.
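
For anyone who wants to try the same check outside the kernel first,
a rough userspace analogue could look like the following. This is only
a sketch: counted_memcpy and mismatched_copies are hypothetical names,
and it counts hits instead of printing a backtrace like WARN_ON would.
The 1024-byte threshold and the 63 mask are the ones from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Threshold and mask as in the patch: only copies above 1024 bytes count. */
#define MISMATCH_MIN_LEN 1024u
#define LINE_MASK        63u

/* Hypothetical counter of large copies whose src/dst line offsets differ. */
static unsigned long mismatched_copies;

/*
 * Hypothetical wrapper: performs a normal memcpy, but first records
 * whether a large copy has src&63 != dst&63.
 */
static void *counted_memcpy(void *dst, const void *src, size_t len)
{
	if (len > MISMATCH_MIN_LEN &&
	    (((uintptr_t)src & LINE_MASK) != ((uintptr_t)dst & LINE_MASK)))
		mismatched_copies++;

	return memcpy(dst, src, len);
}
```

Using uintptr_t instead of the (long) cast in the patch is just to keep
the sketch portable in userspace; the comparison is the same.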

>> The modified testcase is attached, I did some optimizations in avx
>> memcpy, but I fear I may be missing something, when I tried to put it
>> in the kernel, it complained about sata errors I never had before,
>> so I immediately went for the power button to prevent more errors,
>> fortunately it only corrupted some kernel object files, and btrfs
>> threw checksum errors. :)
> Well, your version should do something similar to what _mmx_memcpy does:
> save FPU state and not execute in IRQ context.
>
>> All in all I think testing in userspace is safer, you might want to
>> run it on an idle cpu with schedtool, with a high fifo priority, and
>> set cpufreq governor to performance.
> No, you need a generic system with default settings - otherwise it is
> blatant benchmark lying :-)

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 19e2c46..77180bb 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -30,6 +30,14 @@ static __always_inline void *__inline_memcpy(void *to, const void *from, size_t
 #ifndef CONFIG_KMEMCHECK
 #if (__GNUC__ == 4 && __GNUC_MINOR__ >= 3) || __GNUC__ > 4
 extern void *memcpy(void *to, const void *from, size_t len);
+#define memcpy(dst, src, len)					\
+({								\
+	size_t __len = (len);					\
+	const void *__src = (src);				\
+	void *__dst = (dst);					\
+	WARN_ON_ONCE(__len > 1024 && (((long)__src & 63) != ((long)__dst & 63))); \
+	memcpy(__dst, __src, __len);				\
+})
 #else
 extern void *__memcpy(void *to, const void *from, size_t len);
 #define memcpy(dst, src, len)					\
