From: Denys Vlasenko
To: Borislav Petkov
Cc: Ingo Molnar, melwyn lobo, linux-kernel@vger.kernel.org,
    "H. Peter Anvin", Thomas Gleixner, Linus Torvalds,
    Peter Zijlstra, borislav.petkov@amd.com
Subject: Re: x86 memcpy performance
Date: Sun, 14 Aug 2011 13:13:56 +0200
Message-Id: <201108141313.56926.vda.linux@googlemail.com>
References: <20110812195220.GA29051@elte.hu> <20110814095910.GA18809@liondog.tnic>
In-Reply-To: <20110814095910.GA18809@liondog.tnic>

On Sunday 14 August 2011 11:59, Borislav Petkov wrote:
> Here's the SSE memcpy version I have so far. I haven't wired in the
> proper CPU feature detection yet because we want to run more benchmarks
> like netperf and such to see whether we get any positive results there.
>
> The SYSTEM_RUNNING check is there to take care of early boot situations
> where we can't handle FPU exceptions but still use memcpy. There's an
> aligned and a misaligned variant, which should handle any buffers and
> sizes, although I've set the SSE memcpy threshold at a minimum buffer
> size of 512 bytes to somewhat amortize the context save/restore cost.
>
> Comments are much appreciated! :-)
>
> --- a/arch/x86/include/asm/string_64.h
> +++ b/arch/x86/include/asm/string_64.h
> @@ -28,10 +28,20 @@ static __always_inline void *__inline_memcpy(void *to, const void *from, size_t
>
>  #define __HAVE_ARCH_MEMCPY 1
>  #ifndef CONFIG_KMEMCHECK
> +extern void *__memcpy(void *to, const void *from, size_t len);
> +extern void *__sse_memcpy(void *to, const void *from, size_t len);
>  #if (__GNUC__ == 4 && __GNUC_MINOR__ >= 3) || __GNUC__ > 4
> -extern void *memcpy(void *to, const void *from, size_t len);
> +#define memcpy(dst, src, len)					\
> +({								\
> +	size_t __len = (len);					\
> +	void *__ret;						\
> +	if (__len >= 512)					\
> +		__ret = __sse_memcpy((dst), (src), __len);	\
> +	else							\
> +		__ret = __memcpy((dst), (src), __len);		\
> +	__ret;							\
> +})

Please, no. Do not inline every memcpy invocation. This is pure bloat
(considering how many memcpy calls there are), and it doesn't even win
anything in speed, since there will be a function call either way.

Put the __len >= 512 check inside your memcpy instead.

You may still do the check inline when you know that __len is constant:

	if (__builtin_constant_p(__len) && __len >= 512) ...

because in this case gcc will evaluate it at compile time.

--
vda
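
For concreteness, here is a minimal sketch of the shape Denys is
describing, not the actual patch: the runtime threshold check lives in
one out-of-line helper so call sites stay small, and only the
compile-time-constant case is decided inline via __builtin_constant_p().
The helper name memcpy_dispatch() is made up for illustration;
__memcpy() and __sse_memcpy() are the routines from the quoted patch.

#include <stddef.h>

extern void *__memcpy(void *to, const void *from, size_t len);
extern void *__sse_memcpy(void *to, const void *from, size_t len);

/* One shared copy of the size check, instead of one per call site. */
void *memcpy_dispatch(void *to, const void *from, size_t len)
{
	if (len >= 512)
		return __sse_memcpy(to, from, len);
	return __memcpy(to, from, len);
}

/*
 * Only constant lengths are resolved inline: __builtin_constant_p()
 * evaluates to 0/1 at compile time without evaluating its argument,
 * and when it is true gcc folds the whole conditional away. A
 * non-constant length short-circuits past the comparison and becomes
 * a plain call, so there is no per-call-site bloat and `len` is
 * evaluated exactly once in every case.
 */
#define memcpy(dst, src, len)						\
	((__builtin_constant_p(len) && (len) >= 512)			\
		? __sse_memcpy((dst), (src), (len))			\
		: memcpy_dispatch((dst), (src), (len)))

Either way the macro expands to a single function call, which is the
point of the objection: the 512-byte decision either disappears at
compile time or costs one predictable branch inside the callee, rather
than duplicated compare-and-branch code at every memcpy call site.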