Message-ID: <5228C559.1090509@hp.com>
Date: Thu, 05 Sep 2013 13:54:33 -0400
From: Waiman Long
To: Linus Torvalds
CC: Heiko Carstens, Tony Luck, Linux Kernel Mailing List
Subject: Re: [PATCH] lockref: remove cpu_relax() again
References: <20130905131814.GA24274@osiris>
List-ID: linux-kernel@vger.kernel.org

On 09/05/2013 11:31 AM, Linus Torvalds wrote:
> On Thu, Sep 5, 2013 at 6:18 AM, Heiko Carstens wrote:
>> *If* however the cpu_relax() makes sense on other platforms maybe we could
>> add something like we have already with "arch_mutex_cpu_relax()":
>
> I actually think it won't.
>
> The lockref cmpxchg isn't waiting for something to change - it only
> loops _if_ something has changed, and rather than cpu_relax(), we most
> likely want to try to take advantage of the fact that we have the
> changed data in our exclusive cacheline, and try to get our ref update
> out as soon as possible.
>
> IOW, the lockref loop is not an idle loop like a spinlock "wait for
> lock to be released", it's very much an active loop of "oops,
> something changed".
>
> And there can't be any livelock, since by definition somebody else
> _did_ make progress. In fact, adding the cpu_relax() probably just
> makes things much less fair - once somebody else raced on you, the
> cpu_relax() now makes it more likely that _another_ cpu does so too.
>
> That said, let's see what Tony's numbers are. On x86, it doesn't seem to
> matter, but as Tony noticed, the variability can be quite high (for
> me, the numbers tend to be quite stable when running the test program
> multiple times in a loop, but then variation between boots or after
> having done something else can be quite big - I suspect the cache
> access patterns end up varying wildly with different dentry layout and
> hash chain depth).
>
> Linus

I have found that having a cpu_relax() at the bottom of the while loop
in the CMPXCHG_LOOP() macro does help performance in some cases on
x86-64 processors. I saw no noticeable difference for AIM7's short
workload. On the high_systime workload, however, I saw about 5% better
performance with cpu_relax(). Below are the perf profiles of the two
high_systime runs at 1500 users on an 80-core DL980 with HT off.

Without cpu_relax():

  4.60%  ls     [kernel.kallsyms]  [k] _raw_spin_lock
          |--48.50%-- lockref_put_or_lock
          |--46.97%-- lockref_get_or_lock
          |--1.04%--  lockref_get

  2.95%  reaim  [kernel.kallsyms]  [k] _raw_spin_lock
          |--29.70%-- lockref_put_or_lock
          |--28.87%-- lockref_get_or_lock
          |--0.63%--  lockref_get

With cpu_relax():

  1.67%  reaim  [kernel.kallsyms]  [k] _raw_spin_lock
          |--14.80%-- lockref_put_or_lock
          |--14.04%-- lockref_get_or_lock

  1.36%  ls     [kernel.kallsyms]  [k] _raw_spin_lock
          |--44.94%-- lockref_put_or_lock
          |--43.12%-- lockref_get_or_lock

So I would suggest having some kind of conditional cpu_relax() in the
loop.

-Longman