From: Waiman Long
Subject: Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
Date: Tue, 04 Mar 2014 10:27:03 -0500
Message-ID: <5315F0C7.8090909@hp.com>
References: <1393427668-60228-1-git-send-email-Waiman.Long@hp.com>
	<1393427668-60228-4-git-send-email-Waiman.Long@hp.com>
	<20140226162057.GW6835@laptop.programming.kicks-ass.net>
	<530FA32B.8010202@hp.com>
	<20140228092945.GG27965@twins.programming.kicks-ass.net>
	<5310BB81.3090508@hp.com>
	<20140303174305.GK9987@twins.programming.kicks-ass.net>
In-Reply-To: <20140303174305.GK9987@twins.programming.kicks-ass.net>
To: Peter Zijlstra
Cc: Jeremy Fitzhardinge, Raghavendra K T, Boris Ostrovsky,
	virtualization@lists.linux-foundation.org, Andi Kleen,
	"H. Peter Anvin", Michel Lespinasse, Alok Kataria,
	linux-arch@vger.kernel.org, x86@kernel.org, Ingo Molnar,
	Scott J Norton, xen-devel@lists.xenproject.org,
	"Paul E. McKenney", Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Thomas Gleixner
List-Id: linux-arch.vger.kernel.org

On 03/03/2014 12:43 PM, Peter Zijlstra wrote:
> Hi,
>
> Here are some numbers for my version -- also attached is the test code.
> I found that booting big machines is tediously slow, so I lifted the
> whole lot to userspace.
>
> I measure the cycles spent in arch_spin_lock() + arch_spin_unlock().
>
> The machines used are a 4 node (2 socket) AMD Interlagos, and a 2 node
> (2 socket) Intel Westmere-EP.
>
>             AMD (ticket)             AMD (qspinlock + pending + opt)
>
>   Local:                             Local:
>
>    1:    324.425530                   1:    324.102142
>    2:  17141.324050                   2:    620.185930
>    3:  52212.232343                   3:  25242.574661
>    4:  93136.458314                   4:  47982.037866
>    6: 167967.455965                   6:  95345.011864
>    8: 245402.534869                   8: 142412.451438
>
>   2 - nodes:                         2 - nodes:
>
>    2:  12763.640956                   2:   1879.460823
>    4:  94423.027123                   4:  48278.719130
>    6: 167903.698361                   6:  96747.767310
>    8: 257243.508294                   8: 144672.846317
>
>   4 - nodes:                         4 - nodes:
>
>    4:  82408.853603                   4:  49820.323075
>    8: 260492.952355                   8: 143538.264724
>   16: 630099.031148                  16: 337796.553795
>
>
>             Intel (ticket)           Intel (qspinlock + pending + opt)
>
>   Local:                             Local:
>
>    1:     19.002249                   1:     29.002844
>    2:   5093.275530                   2:   1282.209519
>    3:  22300.859761                   3:  22127.477388
>    4:  44929.922325                   4:  44493.881832
>    6:  86338.755247                   6:  86360.083940
>
>   2 - nodes:                         2 - nodes:
>
>    2:   1509.193824                   2:   1209.090219
>    4:  48154.495998                   4:  48547.242379
>    8: 137946.787244                   8: 141381.498125
>
> ---
>
> There are a few curious facts I found (assuming my test code is sane):
>
>  - Intel seems to be an order of magnitude faster on uncontended LOCKed
>    ops compared to AMD.
>
>  - On Intel the uncontended qspinlock fast path (cmpxchg) seems slower
>    than the uncontended ticket xadd -- although both are plenty fast
>    when compared to AMD.
>
>  - In general, replacing cmpxchg loops with unconditional atomic ops
>    doesn't seem to matter a whole lot when the thing is contended.
>
> Below is the (rather messy) qspinlock slow path code (the only thing
> that really differs between our versions).
>
> I'll try and slot your version in tomorrow.
>
> ---
>

It is curious to see that the qspinlock code offers a big benefit on AMD
machines, but not so much on Intel. Anyway, I am working on a revised
version of the patch that includes some of your comments. I will also try
to see if I can get an AMD machine to run tests on.

-Longman
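
For context, the measurement Peter describes above boils down to timing a
lock/unlock pair in cycles from userspace. Below is a minimal sketch of
such a harness; it is not the attached test code, and lock_t, my_lock(),
my_unlock(), NR_THREADS and NR_LOOPS are hypothetical stand-ins for the
lock implementation actually under test:

#include <stdint.h>
#include <stdio.h>
#include <pthread.h>

#define NR_THREADS	2		/* number of contending tasks */
#define NR_LOOPS	1000000

/* Hypothetical lock under test: a trivial test-and-set spinlock. */
typedef struct { volatile int val; } lock_t;

static lock_t lock;
static uint64_t total_cycles;

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

static void my_lock(lock_t *l)
{
	while (__sync_lock_test_and_set(&l->val, 1))
		while (l->val)
			;		/* spin until free, then retry */
}

static void my_unlock(lock_t *l)
{
	__sync_lock_release(&l->val);
}

static void *worker(void *arg)
{
	uint64_t cycles = 0;
	int i;

	(void)arg;
	for (i = 0; i < NR_LOOPS; i++) {
		uint64_t start = rdtsc();

		my_lock(&lock);
		my_unlock(&lock);
		cycles += rdtsc() - start;
	}
	__sync_fetch_and_add(&total_cycles, cycles);
	return NULL;
}

int main(void)
{
	pthread_t tid[NR_THREADS];
	int i;

	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(tid[i], NULL);

	printf("avg cycles per lock+unlock: %.6f\n",
	       (double)total_cycles / ((uint64_t)NR_THREADS * NR_LOOPS));
	return 0;
}

Build with gcc -O2 -pthread; reproducing the "Local" versus "N - nodes"
cases would additionally require pinning the threads to particular cores
or nodes (e.g. with numactl or pthread_setaffinity_np()).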
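On the second curious fact (uncontended ticket xadd versus qspinlock
cmpxchg), the comparison is roughly between the two fast paths sketched
below. These are simplified illustrations, not the kernel's exact code;
the field layout and helper names are assumptions:

#include <stdint.h>

/*
 * Ticket lock fast path: a single unconditional LOCK XADD both takes a
 * ticket and reads the current owner, so the uncontended cost is one
 * atomic op with no retry.
 */
struct ticket_lock {
	uint16_t head;		/* ticket currently being served */
	uint16_t tail;		/* next ticket to hand out */
};

static void ticket_spin_lock(struct ticket_lock *l)
{
	uint16_t me = __atomic_fetch_add(&l->tail, 1, __ATOMIC_ACQUIRE);

	while (__atomic_load_n(&l->head, __ATOMIC_ACQUIRE) != me)
		;		/* wait for my turn */
}

/*
 * Queue spinlock fast path: a LOCK CMPXCHG that only succeeds when the
 * whole lock word is 0 (no owner, no pending bit, no queue tail).
 */
#define Q_LOCKED	1U

static void queue_spin_lock_slowpath(uint32_t *val)
{
	uint32_t old;

	/* Stand-in only: the real slow path sets a pending bit or queues
	 * an MCS node; here we simply retry the cmpxchg. */
	do {
		old = 0;
	} while (!__atomic_compare_exchange_n(val, &old, Q_LOCKED, 0,
					      __ATOMIC_ACQUIRE,
					      __ATOMIC_RELAXED));
}

static void queue_spin_lock(uint32_t *val)
{
	uint32_t old = 0;

	if (__atomic_compare_exchange_n(val, &old, Q_LOCKED, 0,
					__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
		return;		/* uncontended: got it in one shot */

	queue_spin_lock_slowpath(val);
}

int main(void)
{
	struct ticket_lock tl = { 0, 0 };
	uint32_t ql = 0;

	ticket_spin_lock(&tl);	/* uncontended: one xadd */
	queue_spin_lock(&ql);	/* uncontended: one cmpxchg */
	return 0;
}

Both fast paths are a single LOCKed instruction, but on some
microarchitectures a LOCK CMPXCHG costs somewhat more than a LOCK XADD,
which would be consistent with the 19- versus 29-cycle single-thread
Intel numbers quoted above.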