From: Segher Boessenkool
Subject: Re: [patch 2/2] powerpc: optimise smp_wmb
Date: Wed, 21 May 2008 22:12:03 +0200
To: Nick Piggin
Cc: linuxppc-dev@ozlabs.org, paulus@samba.org

> From memory, I measured lwsync is 5 times faster than eieio on
> a dual G5.  This was on a simple microbenchmark that made use of
> smp_wmb for store ordering, but it did not involve any IO access
> (which presumably would disadvantage eieio further).

This is very much specific to your particular benchmark.  On the
970, there are two differences between lwsync and eieio:

1) lwsync cannot issue before all previous loads are done; eieio
does not have this restriction.  After that, both fly through the
execution core; the core does not wait for the barrier insn to
complete in the storage system.  In both cases, a barrier is
inserted into both the L2 queues and the non-cacheable queues.
These barriers are removed at the same time, that is, when each
is the oldest in its queue and has done its thing.

2) For eieio, the non-cacheable unit waits for all previous
(non-cacheable) stores to complete, and then arbitrates for the
bus and sends an EIEIO transaction.

Your benchmark doesn't do non-cacheable stores, so it would seem
the five-fold slowdown is caused by that bus arbitration (and the
bus transaction).  Maybe your cacheable stores hit the bus as
well; that would make this worse.  Your benchmark also doesn't
see the negative effects from 1).

In "real" code, I expect 2) to be pretty much invisible (the
store queues will never be completely filled up), but 1)
shouldn't be very bad either.  So it's a wash.  But only a real
benchmark will tell.

> Given the G5 speedup, I'd be surprised if there is not an
> improvement on POWER4 and 5 as well,

The 970 storage subsystem and the POWER4 one are very different.
Or maybe these queues are just about the last thing that _is_
identical, I dunno; there aren't public POWER4 docs for this ;-)

> although no idea about POWER6 or cell...

No idea about POWER6; for CBE, the backend works similarly to the
970 one.

Given that the architecture says to use lwsync for cases like
this, it would be very surprising if it performed (much) worse
than eieio, eh? ;-)

So I think your patch is a win; just wanted to clarify on your
five-fold slowdown number.

HTH,


Segher
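
P.S.  For concreteness, the two barriers compared above boil down
to inline-asm wrappers along these lines (a minimal sketch of the
usual GCC idiom, not the actual patch text; the macro names here
are made up):

    /* One hardware barrier instruction each, plus a compiler
       barrier: the "memory" clobber stops GCC from reordering
       memory accesses across the asm statement. */
    #define WMB_LWSYNC() __asm__ __volatile__ ("lwsync" : : : "memory")
    #define WMB_EIEIO()  __asm__ __volatile__ ("eieio"  : : : "memory")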
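
And the store-ordering pattern such a microbenchmark exercises is
the classic flag-passing producer (again only a sketch, with
hypothetical names, assuming the macros above):

    /* The write barrier orders the data store before the flag
       store, so a consumer on another CPU that observes
       flag == 1 also observes the new data. */
    static volatile int data, flag;

    static void produce(int value)
    {
            data = value;
            WMB_LWSYNC();   /* the insn under test: lwsync vs. eieio */
            flag = 1;
    }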