From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([208.118.235.92]:37850)
 by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
 id 1UN2sa-0006Xy-Gd
 for qemu-devel@nongnu.org; Tue, 02 Apr 2013 11:12:11 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from ) id 1UN2sX-0006PR-Oq
 for qemu-devel@nongnu.org; Tue, 02 Apr 2013 11:12:08 -0400
Received: from mail-vc0-f182.google.com ([209.85.220.182]:48539)
 by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from )
 id 1UN2sX-0006PL-Jk
 for qemu-devel@nongnu.org; Tue, 02 Apr 2013 11:12:05 -0400
Received: by mail-vc0-f182.google.com with SMTP id ht11so532614vcb.41
 for ; Tue, 02 Apr 2013 08:12:05 -0700 (PDT)
Sender: Richard Henderson
Message-ID: <515AF541.1040900@twiddle.net>
Date: Tue, 02 Apr 2013 08:12:01 -0700
From: Richard Henderson
MIME-Version: 1.0
References: <1364876610-3933-1-git-send-email-rth@twiddle.net>
 <1364876610-3933-18-git-send-email-rth@twiddle.net>
 <515AE0B2.5090005@twiddle.net>
 <34B5C19C-785B-43E0-B1A5-F75D16EA6F09@suse.de>
In-Reply-To: <34B5C19C-785B-43E0-B1A5-F75D16EA6F09@suse.de>
Content-Type: multipart/mixed;
 boundary="------------010505030903020303030306"
Subject: Re: [Qemu-devel] [PATCH v3 17/27] tcg-ppc64: Implement bswap64
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: Alexander Graf
Cc: "av1474@comtv.ru" , "qemu-devel@nongnu.org" ,
 "aurelien@aurel32.net"

This is a multi-part message in MIME format.
--------------010505030903020303030306
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

On 2013-04-02 07:41, Alexander Graf wrote:
>> On 2013-04-01 23:34, Alexander Graf wrote:
>>> Is this faster than a load/store with std/ldbrx?
>>
>> Hmm. Almost certainly not. And since we've got stack space
>> allocated for function calls, we've got scratch space to do it in.
>>
>> Probably similar for bswap32 too, eh?
>
> Depends - memory load/store doesn't come for free and bswap32 is quite short.
>
>>
>> I'll do a tiny bit o benchmarking for power7.
>
> Cool, thanks a bunch :)

Heh. "Almost certainly not" indeed. Unless I've made some silly mistake,
going through memory stalls badly. No store buffer forwarding on power7?

With the following test case, time reports:

f1	2.967s
f2	8.930s
f3	7.071s
f4	7.166s

Here f1 is the register-only rotate/insert sequence, f2 is std + ldbrx,
and f3 is stdbrx + ld. Note that f4 is a normal store/load pair with no
byte swap at all, included to determine what the store buffer forwarding
delay might be.


r~

--------------010505030903020303030306
Content-Type: text/x-csrc;
 name="z.c"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="z.c"

/* Register-only bswap64: byte-swap each 32-bit half with rotate-and-insert
   and exchange the halves.  Operands: %0 = r, %1 = t (scratch), %2 = x.  */
static long __attribute__((noinline)) f1(long x, long *mem)
{
  long r, t;
  asm volatile (
    "rlwinm %0,%2,8,0,31\n\t"
    "rlwimi %0,%2,24,0,7\n\t"
    "rlwimi %0,%2,24,16,23\n\t"
    "rldicl %0,%0,32,0\n\t"
    "rldicl %1,%2,32,0\n\t"
    "rlwimi %0,%1,8,0,31\n\t"
    "rlwimi %0,%1,24,0,7\n\t"
    "rlwimi %0,%1,24,16,23"
    : "=&r"(r), "=r"(t) : "r"(x));
  return r;
}

/* Normal store, byte-reversed load.  */
static long __attribute__((noinline)) f2(long x, long *mem)
{
  long r;
  asm volatile ("std %1,0(%2); ldbrx %0,0,%2"
                : "=r"(r) : "r"(x), "b"(mem) : "memory");
  return r;
}

/* Byte-reversed store, normal load.  */
static long __attribute__((noinline)) f3(long x, long *mem)
{
  long r;
  asm volatile ("stdbrx %1,0,%2; ld %0,0(%2)"
                : "=r"(r) : "r"(x), "b"(mem) : "memory");
  return r;
}

/* Normal store/load pair, no byte swap: store-to-load baseline.  */
static long __attribute__((noinline)) f4(long x, long *mem)
{
  long r;
  asm volatile ("std %1,0(%2); ld %0,0(%2)"
                : "=r"(r) : "r"(x), "b"(mem) : "memory");
  return r;
}

/* N is not defined in this file; it is expected from the compiler
   command line (e.g. -DN=1) and selects which of f1..f4 is timed.  */
#define D1(x,y) x##y
#define DO(x) D1(f,x)

int main(void)
{
  long tmp, i;
  for (i = 0; i < 1000000000; ++i)
    DO(N)(i, &tmp);
  return 0;
}

--------------010505030903020303030306--
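
For checking results on a machine without these instructions, the operation that f1 through f3 are meant to compute can be sketched in portable C. This is only a reference under stated assumptions, not the code from the patch: the helper names ref_bswap32 and ref_bswap64 are hypothetical, and the structure (swap each 32-bit half, then exchange the halves) is chosen to mirror the register-only sequence in f1.

```c
#include <stdint.h>

/* Hypothetical helper: byte-swap one 32-bit word with shifts and masks. */
static uint32_t ref_bswap32(uint32_t x)
{
    return (x << 24)
         | ((x << 8) & 0x00ff0000u)
         | ((x >> 8) & 0x0000ff00u)
         | (x >> 24);
}

/* Hypothetical helper: byte-swap a 64-bit value by swapping each
   32-bit half and exchanging the two halves. */
static uint64_t ref_bswap64(uint64_t x)
{
    uint64_t lo = ref_bswap32((uint32_t)x);          /* swapped low word  */
    uint64_t hi = ref_bswap32((uint32_t)(x >> 32));  /* swapped high word */
    return (lo << 32) | hi;
}
```

Comparing ref_bswap64(x) against the value returned by f1, f2 or f3 for a few inputs would be one way to rule out the "silly mistake" case before trusting the timings.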