* [PATCH] lockref: remove cpu_relax() again
@ 2013-09-05 13:18 Heiko Carstens
From: Heiko Carstens @ 2013-09-05 13:18 UTC (permalink / raw)
To: Linus Torvalds, Tony Luck; +Cc: linux-kernel
d472d9d9 "lockref: Relax in cmpxchg loop" added a cpu_relax() call to the
CMPXCHG_LOOP() macro. However, this seems wrong to me, since it is very
likely that the next round will succeed (or the loop will be left).
Even worse: cpu_relax() is very expensive on s390, since it means "yield
my virtual cpu to the hypervisor" - so we are talking about several
thousand cycles.

In fact, some measurements show the bad impact of the cpu_relax() call on
s390, using Linus' test case that does stat()s like mad:
Without converting s390 to lockref:
Total loops: 81236173
After converting s390 to lockref:
Total loops: 31896802
After converting s390 to lockref but with removed cpu_relax() call:
Total loops: 86242190
So the cpu_relax() call completely contradicts the intention of
CONFIG_CMPXCHG_LOCKREF at least on s390.
*If*, however, the cpu_relax() makes sense on other platforms, maybe we
could add something like we already have with "arch_mutex_cpu_relax()":
include/linux/mutex.h:
#ifndef CONFIG_HAVE_ARCH_MUTEX_CPU_RELAX
#define arch_mutex_cpu_relax() cpu_relax()
#endif
arch/s390/include/asm/mutex.h:
#define arch_mutex_cpu_relax() barrier()
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
---
lib/lockref.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/lib/lockref.c b/lib/lockref.c
index 9d76f40..7819c2d 100644
--- a/lib/lockref.c
+++ b/lib/lockref.c
@@ -19,7 +19,6 @@
 		if (likely(old.lock_count == prev.lock_count)) {	\
 			SUCCESS;					\
 		}							\
-		cpu_relax();						\
 	}								\
 } while (0)
--
1.8.3.4
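[Editorial note: for readers following along, the shape of the loop under
discussion can be sketched in portable C11 atomics. This is a simplified
user-space stand-in, not the kernel's lib/lockref.c: the field layout, the
names, and the arch_lockref_relax() hook are all illustrative - the hook
mirrors the arch_mutex_cpu_relax() idea Heiko proposes above, where s390
would override it to a plain compiler barrier.]

```c
#include <stdatomic.h>
#include <stdint.h>
#include <assert.h>

/* Hypothetical stand-in for the kernel's lockref: a spinlock and a
 * reference count packed into one 64-bit word, so both can be updated
 * with a single compare-and-swap. */
struct lockref_sketch {
	_Atomic uint64_t lock_count;	/* low 32 bits: lock, high 32 bits: count */
};

/* Arch-overridable relax hook, mirroring arch_mutex_cpu_relax():
 * an arch like s390 would define this as a no-op / barrier. */
#ifndef arch_lockref_relax
#define arch_lockref_relax() ((void)0)
#endif

/* Retry loop in the spirit of CMPXCHG_LOOP(): read, modify the count,
 * try to publish; loop only if somebody else changed the word first. */
static int lockref_get_sketch(struct lockref_sketch *lr)
{
	uint64_t old = atomic_load_explicit(&lr->lock_count,
					    memory_order_relaxed);
	for (;;) {
		if ((uint32_t)old != 0)		/* lock held: caller falls back */
			return 0;
		uint64_t new = old + (1ULL << 32);	/* bump count half */
		if (atomic_compare_exchange_weak(&lr->lock_count, &old, new))
			return 1;		/* SUCCESS */
		arch_lockref_relax();		/* the contested call */
	}
}
```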
* Re: [PATCH] lockref: remove cpu_relax() again
From: Heiko Carstens @ 2013-09-05 14:13 UTC (permalink / raw)
To: Linus Torvalds, Tony Luck, linux-kernel

On Thu, Sep 05, 2013 at 03:18:14PM +0200, Heiko Carstens wrote:
> d472d9d9 "lockref: Relax in cmpxchg loop" added a cpu_relax() call to the
> CMPXCHG_LOOP() macro. However, this seems wrong to me, since it is very
> likely that the next round will succeed (or the loop will be left).
> Even worse: cpu_relax() is very expensive on s390, since it means "yield
> my virtual cpu to the hypervisor" - so we are talking about several
> thousand cycles.
>
> In fact, some measurements show the bad impact of the cpu_relax() call on
> s390, using Linus' test case that does stat()s like mad:
>
> Without converting s390 to lockref:
> Total loops: 81236173
>
> After converting s390 to lockref:
> Total loops: 31896802
>
> After converting s390 to lockref but with removed cpu_relax() call:
> Total loops: 86242190

All of those should have been "converting s390 to ARCH_USE_CMPXCHG_LOCKREF"
instead of "to lockref", of course .. ;)
* RE: [PATCH] lockref: remove cpu_relax() again
From: Luck, Tony @ 2013-09-05 14:48 UTC (permalink / raw)
To: Heiko Carstens, Linus Torvalds; +Cc: linux-kernel@vger.kernel.org

> *If* however the cpu_relax() makes sense on other platforms maybe we could
> add something like we have already with "arch_mutex_cpu_relax()":

I'll do some more measurements on ia64. During my first tests cpu_relax()
seemed to be a big win - but I only ran "./t" a couple of times. Later
(with the cpu_relax() in place) I ran a bunch more iterations, and found
that the variation from run to run is much larger with lockref. The mean
score is 60% higher, but the standard deviation is an order of magnitude
bigger (enough that one run out of 20 with lockref scored lower than the
pre-lockref kernel).

I think this is expected ... cmpxchg is a free-for-all - and sometimes
poor placement across the four-socket system might cause short-term
starvation to a thread while threads on another socket monopolize the
cache line.

-Tony
* Re: [PATCH] lockref: remove cpu_relax() again
From: Linus Torvalds @ 2013-09-05 15:31 UTC (permalink / raw)
To: Heiko Carstens; +Cc: Tony Luck, Linux Kernel Mailing List

On Thu, Sep 5, 2013 at 6:18 AM, Heiko Carstens
<heiko.carstens@de.ibm.com> wrote:
>
> *If* however the cpu_relax() makes sense on other platforms maybe we could
> add something like we have already with "arch_mutex_cpu_relax()":

I actually think it won't.

The lockref cmpxchg isn't waiting for something to change - it only
loops _if_ something has changed, and rather than cpu_relax(), we most
likely want to try to take advantage of the fact that we have the
changed data in our exclusive cacheline, and try to get our ref update
out as soon as possible.

IOW, the lockref loop is not an idle loop like a spinlock "wait for
lock to be released", it's very much an active loop of "oops,
something changed".

And there can't be any livelock, since by definition somebody else
_did_ make progress. In fact, adding the cpu_relax() probably just
makes things much less fair - once somebody else raced on you, the
cpu_relax() now makes it more likely that _another_ cpu does so too.

That said, let's see what Tony's numbers are. On x86, it doesn't seem
to matter, but as Tony noticed, the variability can be quite high (for
me, the numbers tend to be quite stable when running the test program
multiple times in a loop, but then variation between boots or after
having done something else can be quite big - I suspect the cache
access patterns end up varying wildly with different dentry layout and
hash chain depth).

Linus
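[Editorial note: Linus's distinction between the two kinds of loops can be
made concrete with a small C11 sketch. Both helper names are invented for
illustration; neither is kernel code.]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Idle wait: nothing can happen until ANOTHER cpu releases the lock,
 * so pausing between reads (cpu_relax()) is pure win - we are not
 * making progress anyway. */
static void spin_wait_sketch(atomic_bool *locked)
{
	while (atomic_exchange(locked, true)) {
		while (atomic_load_explicit(locked, memory_order_relaxed))
			;	/* cpu_relax() belongs here */
	}
}

/* Active retry: a failed compare-and-swap means somebody else DID make
 * progress and handed us the fresh value - the best response is to
 * retry immediately, while the line is still hot in our cache. */
static int active_retry_sketch(_Atomic int *counter)
{
	int old = atomic_load_explicit(counter, memory_order_relaxed);
	while (!atomic_compare_exchange_weak(counter, &old, old + 1))
		;	/* no relax: every failure implies global progress */
	return old + 1;
}
```

The livelock argument in the mail applies to the second loop: each CAS
failure is evidence that some other thread's update committed.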
* RE: [PATCH] lockref: remove cpu_relax() again
From: Luck, Tony @ 2013-09-05 17:35 UTC (permalink / raw)
To: Linus Torvalds, Heiko Carstens; +Cc: Linux Kernel Mailing List

> And there can't be any livelock, since by definition somebody else
> _did_ make progress. In fact, adding the cpu_relax() probably just
> makes things much less fair - once somebody else raced on you, the
> cpu_relax() now makes it more likely that _another_ cpu does so too.
>
> That said, let's see what Tony's numbers are.

Data from 20 runs of "./t":

3.11 + Linus' enabling patches, but ia64 not enabled
(commit bc08b449ee14a from Linus' tree):
	mean	3469553.800000
	min	3367709.000000
	max	3494154.000000
	stddev	  43613.722742

Now add ia64 enabling (including the cpu_relax()):
	mean	5509067.150000	// nice boost
	min	3191639.000000	// worst case is worse than worst case before we made the change
	max	6508629.000000
	stddev	 793243.943875	// much more variation from run to run

Comment out the cpu_relax():
	mean	2185864.400000	// this sucks
	min	2141242.000000
	max	2286505.000000
	stddev	  40847.960152	// but it consistently sucks

So Linus is right that the cpu_relax() makes things less fair ... but
without it performance sucks so much that I don't want to use the
clever cmpxchg at all - I'm much better off without it!

This may be caused by Itanium hyper-threading (SOEMT - switch on event
multi-threading), where the spinning thread means that its buddy
retires no instructions until h/w times it out and forces a switch.
But that's just a guess - losing the cacheline to whoever made the
change that caused the cmpxchg to fail should also force a thread
switch.

-Tony
* Re: [PATCH] lockref: remove cpu_relax() again
From: Linus Torvalds @ 2013-09-05 17:53 UTC (permalink / raw)
To: Luck, Tony; +Cc: Heiko Carstens, Linux Kernel Mailing List

On Thu, Sep 5, 2013 at 10:35 AM, Luck, Tony <tony.luck@intel.com> wrote:
>
> So Linus is right that the cpu_relax() makes things less fair ... but
> without it performance sucks so much that I don't want to use the
> clever cmpxchg at all - I'm much better off without it!

Hmm. Is this single-socket? The thing really only is supposed to help
if there's tons of contention.

Also, it strikes me that ia64 has tons of different versions of
cmpxchg, and the one you use by default is the one with "acquire"
semantics. That may well be the right thing to do, but I have this
possibly unfounded gut feeling that you might want to try using a
relaxed cmpxchg and then perhaps have a memory barrier *after* it has
successfully executed.

I'll have to think a bit more about what the exact memory ordering
requirements for lockrefs are, but my gut feel is that just for
incrementing the reference count you don't actually have any real
memory ordering requirements.

Linus
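[Editorial note: the hunch above - that a pure reference-count bump may
need no ordering at all - maps onto C11's explicit memory orders roughly
as follows. This is a user-space sketch in C11 terms, not the kernel's
cmpxchg() API, and whether relaxed ordering is actually safe for lockref
is exactly the open question in this thread.]

```c
#include <stdatomic.h>

/* A refcount increment publishes no data and consumes none: the CAS
 * only has to be atomic, not ordered. In C11 that is a fully relaxed
 * read-modify-write; any barrier needed later (e.g. before freeing on
 * the final put) would be issued at that point instead. */
static int refcount_inc_relaxed(_Atomic int *refs)
{
	int old = atomic_load_explicit(refs, memory_order_relaxed);
	while (!atomic_compare_exchange_weak_explicit(
			refs, &old, old + 1,
			memory_order_relaxed,	/* success: no acquire/release */
			memory_order_relaxed))	/* failure: just reload and retry */
		;
	return old + 1;
}
```

On ia64, per the thread, the distinction is moot in practice: both
cmpxchg.acq and cmpxchg.rel execute as a full fence on the hardware
being discussed.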
* RE: [PATCH] lockref: remove cpu_relax() again
From: Luck, Tony @ 2013-09-05 18:57 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Heiko Carstens, Linux Kernel Mailing List

> Also, it strikes me that ia64 has tons of different versions of
> cmpxchg, and the one you use by default is the one with "acquire"
> semantics

Not "tons", just two. You can ask for "acquire" or "release" semantics;
there is no relaxed option.

Worse still - early processor implementations actually just ignored
the acquire/release and did a full fence all the time. Unfortunately
this meant a lot of badly written code that used .acq when they really
wanted .rel became legacy out in the wild - so when we made a cpu that
strictly did the .acq or .rel ... all that code started breaking - so
we had to back-pedal and keep the "legacy" behavior of a full fence :-(

-Tony
* Re: [PATCH] lockref: remove cpu_relax() again
From: Linus Torvalds @ 2013-09-05 19:21 UTC (permalink / raw)
To: Luck, Tony; +Cc: Heiko Carstens, Linux Kernel Mailing List

On Thu, Sep 5, 2013 at 11:57 AM, Luck, Tony <tony.luck@intel.com> wrote:
>
> Not "tons", just two. You can ask for "acquire" or "release" semantics,
> there is no relaxed option.

Seriously? You can't just do a cache-coherent cmpxchg without extra
serialization? Oh well.

> Worse still - early processor implementations actually just ignored
> the acquire/release and did a full fence all the time. Unfortunately
> this meant a lot of badly written code that used .acq when they really
> wanted .rel became legacy out in the wild - so when we made a cpu
> that strictly did the .acq or .rel ... all that code started breaking - so
> we had to back-pedal and keep the "legacy" behavior of a full fence :-(

Ugh. Can you try what happens with the weaker release semantics
performance-wise for that code? Do it *just* for the lockref code..

Linus
* RE: [PATCH] lockref: remove cpu_relax() again
From: Luck, Tony @ 2013-09-05 19:45 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Heiko Carstens, Linux Kernel Mailing List

>> Worse still - early processor implementations actually just ignored
>> the acquire/release and did a full fence all the time. Unfortunately
>> this meant a lot of badly written code that used .acq when they really
>> wanted .rel became legacy out in the wild - so when we made a cpu
>> that strictly did the .acq or .rel ... all that code started breaking - so
>> we had to back-pedal and keep the "legacy" behavior of a full fence :-(
>
> Ugh. Can you try what happens with the weaker release semantics
> performance-wise for that code? Do it *just* for the lockref code..

No. I can change the Linux code to say "cmpxchg.rel" here ... but the
h/w will do exactly the same thing it did when I had "cmpxchg.acq".

-Tony
* Re: [PATCH] lockref: remove cpu_relax() again
From: Linus Torvalds @ 2013-09-05 19:50 UTC (permalink / raw)
To: Luck, Tony; +Cc: Heiko Carstens, Linux Kernel Mailing List

On Thu, Sep 5, 2013 at 12:45 PM, Luck, Tony <tony.luck@intel.com> wrote:
>
> No. I can change the Linux code to say "cmpxchg.rel" here ... but the
> h/w will do exactly the same thing it did when I had "cmpxchg.acq".

Oh, so when you said "we had to back-pedal and keep the 'legacy'
behavior of a full fence", you meant the hardware design itself, not
(as I assumed) the Linux kernel header behavior.

Oh well. Hopefully somebody in hardware learnt how stupid it is to
expose weak memory ordering to software. But probably not. Ugh.

Your four-socket machine certainly should have been able to see the
performance improvements of not spinning.

That said, another thing that strikes me is that you have 32 CPU
threads, and the stupid test-program I sent out had MAX_THREADS set to
16. Did you change that? Because if not, then some of the extreme
performance profile might be about how the threads get scheduled on
your machine (HT threads vs full cores etc).

Linus
* RE: [PATCH] lockref: remove cpu_relax() again
From: Luck, Tony @ 2013-09-05 19:56 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Heiko Carstens, Linux Kernel Mailing List

> That said, another thing that strikes me is that you have 32 CPU
> threads, and the stupid test-program I sent out had MAX_THREADS set to
> 16. Did you change that? Because if not, then some of the extreme
> performance profile might be about how the threads get scheduled on
> your machine (HT threads vs full cores etc).

I'll try to get new numbers with 32 threads[*] - but even if they look
good, I'd be upset about the 16-thread case being worse with the
cmpxchg/no-cpu-relax case than the original code.

-Tony

[*] probably not till tomorrow
* Re: [PATCH] lockref: remove cpu_relax() again
From: Tony Luck @ 2013-09-06 18:36 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Heiko Carstens, Linux Kernel Mailing List

No new Itanium numbers yet ... but I did wonder how this works on
multi-socket x86 ... so I tweaked "t.c" to increase threads to 64 to
max out my 4-socket Xeon E5-4650 (8 cores/socket, 2 threads/core) and
also print out the individual scores from each thread.

$ ./t /tmp 64
 389827  717666 1540293  130764  681839   33357  606966   33716
  33183   33230   69685   76422  352851   34940  257132   34192
  34200   34098   34053   34459  234399   33678  241571  545912
 620857   65818   32853  739440   33697  683655  741366   36208
 385775  446198   45974   33056  403944  717415  254782  166754
 702745   43661 1042180  437367   43751  503342  154223  706917
 878167   43802   51667  660875   33261  522425   33627   33637
  33446   33604   52963   33688  406088  551690  446474   33289
Threads = 64 Total loops: 19109114

Individual thread performance varies from 32853 to 1540293 - a factor
of 46.9. Sometimes it is good to sacrifice fairness for throughput.
But wow!

Running for longer [ s/sleep(10)/sleep(300) ] gave things a chance to
even out - but I still see a factor of 3.5 between the fastest and the
slowest.

-Tony
* Re: [PATCH] lockref: remove cpu_relax() again
From: Waiman Long @ 2013-09-05 17:54 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Heiko Carstens, Tony Luck, Linux Kernel Mailing List

On 09/05/2013 11:31 AM, Linus Torvalds wrote:
> The lockref cmpxchg isn't waiting for something to change - it only
> loops _if_ something has changed, and rather than cpu_relax(), we most
> likely want to try to take advantage of the fact that we have the
> changed data in our exclusive cacheline, and try to get our ref update
> out as soon as possible.
>
> IOW, the lockref loop is not an idle loop like a spinlock "wait for
> lock to be released", it's very much an active loop of "oops,
> something changed".
>
> And there can't be any livelock, since by definition somebody else
> _did_ make progress. In fact, adding the cpu_relax() probably just
> makes things much less fair - once somebody else raced on you, the
> cpu_relax() now makes it more likely that _another_ cpu does so too.

I have found that having a cpu_relax() at the bottom of the while loop
in the CMPXCHG_LOOP() macro does help performance in some cases on
x86-64 processors.

I saw no noticeable difference for the AIM7 short workload. On the
high_systime workload, however, I saw about 5% better performance with
cpu_relax(). Below are the perf profiles of the two high_systime runs
at 1500 users on an 80-core DL980 with HT off.

Without cpu_relax():

  4.60%  ls     [kernel.kallsyms]  [k] _raw_spin_lock
         |--48.50%-- lockref_put_or_lock
         |--46.97%-- lockref_get_or_lock
         |--1.04%--  lockref_get
  2.95%  reaim  [kernel.kallsyms]  [k] _raw_spin_lock
         |--29.70%-- lockref_put_or_lock
         |--28.87%-- lockref_get_or_lock
         |--0.63%--  lockref_get

With cpu_relax():

  1.67%  reaim  [kernel.kallsyms]  [k] _raw_spin_lock
         |--14.80%-- lockref_put_or_lock
         |--14.04%-- lockref_get_or_lock
  1.36%  ls     [kernel.kallsyms]  [k] _raw_spin_lock
         |--44.94%-- lockref_put_or_lock
         |--43.12%-- lockref_get_or_lock

So I would suggest having some kind of conditional cpu_relax() in the
loop.

-Longman
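[Editorial note: the "conditional cpu_relax()" suggestion could look
something like the following C11 sketch. The threshold, the helper
names, and the use of the x86 pause hint are all invented for
illustration; the real trade-off would have to be measured, as this
thread makes clear.]

```c
#include <stdatomic.h>

/* Retry at full speed for the common almost-always-succeeds case, and
 * only start relaxing once contention is clearly sustained. The
 * threshold of 4 is arbitrary. */
#define RELAX_THRESHOLD 4

static inline void cheap_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();		/* x86 spin-loop hint */
#endif
}

static int refcount_inc_backoff(_Atomic int *refs)
{
	int failures = 0;
	int old = atomic_load_explicit(refs, memory_order_relaxed);

	while (!atomic_compare_exchange_weak(refs, &old, old + 1)) {
		if (++failures >= RELAX_THRESHOLD)
			cheap_relax();	/* relax only under heavy contention */
	}
	return old + 1;
}
```

This keeps the fast path identical to the unrelaxed loop while bounding
how aggressively a losing thread hammers the cache line.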
end of thread, other threads: [~2013-09-06 18:36 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed;
links below jump to the message on this page):

2013-09-05 13:18 [PATCH] lockref: remove cpu_relax() again Heiko Carstens
2013-09-05 14:13 ` Heiko Carstens
2013-09-05 14:48 ` Luck, Tony
2013-09-05 15:31 ` Linus Torvalds
2013-09-05 17:35   ` Luck, Tony
2013-09-05 17:53     ` Linus Torvalds
2013-09-05 18:57       ` Luck, Tony
2013-09-05 19:21         ` Linus Torvalds
2013-09-05 19:45           ` Luck, Tony
2013-09-05 19:50             ` Linus Torvalds
2013-09-05 19:56               ` Luck, Tony
2013-09-06 18:36                 ` Tony Luck
2013-09-05 17:54   ` Waiman Long