Message-ID: <515AF7F5.1020401@suse.de>
Date: Tue, 02 Apr 2013 17:23:33 +0200
From: Alexander Graf
References: <1364876610-3933-1-git-send-email-rth@twiddle.net>
 <1364876610-3933-18-git-send-email-rth@twiddle.net>
 <515AE0B2.5090005@twiddle.net>
 <34B5C19C-785B-43E0-B1A5-F75D16EA6F09@suse.de>
 <515AF541.1040900@twiddle.net>
In-Reply-To: <515AF541.1040900@twiddle.net>
Subject: Re: [Qemu-devel] [PATCH v3 17/27] tcg-ppc64: Implement bswap64
To: Richard Henderson
Cc: "av1474@comtv.ru", "qemu-devel@nongnu.org", "aurelien@aurel32.net"

On 04/02/2013 05:12 PM, Richard Henderson wrote:
> On 2013-04-02 07:41, Alexander Graf wrote:
>>> On 2013-04-01 23:34, Alexander Graf wrote:
>>>> Is this faster than a load/store with std/ldbrx?
>>>
>>> Hmm. Almost certainly not. And since we've got stack space
>>> allocated for function calls, we've got scratch space to do it in.
>>>
>>> Probably similar for bswap32 too, eh?
>>
>> Depends - memory load/store doesn't come for free and bswap32 is
>> quite short.
>>
>>>
>>> I'll do a tiny bit o benchmarking for power7.
>>
>> Cool, thanks a bunch :)
>
> Heh. "Almost certainly not" indeed. Unless I've made some silly
> mistake, going through memory stalls badly.
> No store buffer forwarding on power7?
>
> With the following test case, time reports:
>
> f1 2.967s
> f2 8.930s
> f3 7.071s
> f4 7.166s
>
> And note that f4 is a normal store/load pair, trying to determine
> what the store buffer forwarding delay might be.

Yeah, doesn't look like it makes any sense at all to do a load/store
cycle then. What a shame :).

Keep in mind that this tests icache hot cycles. However, you might get
bad icache penalties due to the long bswap64 sequence. So all the memory
latency you see here might also affect the instruction stream when it
gets executed.

But then again we only care about performance of cache hot sequences in
the first place....

Alex
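
For anyone following the thread, the two bswap64 strategies compared above can be sketched roughly as below. This is only a model of the semantics in Python, not the test case Richard refers to (which is not reproduced in this message) and not the tcg-ppc64 code from the patch. The function names are made up for illustration, and the sketch says nothing about timing on power7.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF


def bswap64_shifts(x):
    """Register-only variant (hypothetical name): three mask-and-shift rounds."""
    x &= MASK64
    # Swap adjacent bytes within each 16-bit group.
    x = ((x & 0x00FF00FF00FF00FF) << 8) | ((x >> 8) & 0x00FF00FF00FF00FF)
    # Swap adjacent 16-bit groups within each 32-bit group.
    x = ((x & 0x0000FFFF0000FFFF) << 16) | ((x >> 16) & 0x0000FFFF0000FFFF)
    # Swap the two 32-bit halves.
    x = ((x & 0x00000000FFFFFFFF) << 32) | (x >> 32)
    return x


def bswap64_via_memory(x):
    """Memory round trip (hypothetical name), modelling std followed by ldbrx."""
    # Store the doubleword in big-endian byte order, as std does on ppc64.
    buf = struct.pack(">Q", x & MASK64)
    # Read the same bytes back in reversed order, as ldbrx does.
    return struct.unpack("<Q", buf)[0]
```

Both functions map 0x0102030405060708 to 0x0807060504030201. The second one is the shorter sequence, but it is the one that depends on the store being visible to the following load, which is where the stall discussed above comes from.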