From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752900Ab0AJBLs (ORCPT ); Sat, 9 Jan 2010 20:11:48 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1752291Ab0AJBLr (ORCPT ); Sat, 9 Jan 2010 20:11:47 -0500 Received: from tomts10-srv.bellnexxia.net ([209.226.175.54]:42479 "EHLO tomts10-srv.bellnexxia.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751529Ab0AJBLr (ORCPT ); Sat, 9 Jan 2010 20:11:47 -0500 Date: Sat, 9 Jan 2010 20:11:44 -0500 From: Mathieu Desnoyers To: "Paul E. McKenney" Cc: Steven Rostedt , Oleg Nesterov , Peter Zijlstra , linux-kernel@vger.kernel.org, Ingo Molnar , akpm@linux-foundation.org, josh@joshtriplett.org, tglx@linutronix.de, Valdis.Kletnieks@vt.edu, dhowells@redhat.com, laijs@cn.fujitsu.com, dipankar@in.ibm.com Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier Message-ID: <20100110011144.GD25790@Krystal> References: <20100107205830.GR6764@linux.vnet.ibm.com> <1262900140.28171.3773.camel@gandalf.stny.rr.com> <20100108235338.GA18050@Krystal> <20100109002043.GD6816@linux.vnet.ibm.com> <20100109010231.GA25368@Krystal> <20100109012128.GF6816@linux.vnet.ibm.com> <20100109023842.GA1696@Krystal> <20100109054215.GB9044@linux.vnet.ibm.com> <20100109192006.GA23672@Krystal> <20100109235937.GC9044@linux.vnet.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Content-Disposition: inline In-Reply-To: <20100109235937.GC9044@linux.vnet.ibm.com> X-Editor: vi X-Info: http://krystal.dyndns.org:8080 X-Operating-System: Linux/2.6.27.31-grsec (i686) X-Uptime: 20:05:57 up 24 days, 9:24, 5 users, load average: 0.11, 0.25, 0.17 User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote: > On Sat, Jan 09, 2010 at 02:20:06PM -0500, Mathieu Desnoyers wrote: > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote: > > > On Fri, Jan 08, 2010 at 09:38:42PM -0500, Mathieu Desnoyers wrote: > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote: > > > > > On Fri, Jan 08, 2010 at 08:02:31PM -0500, Mathieu Desnoyers wrote: > > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote: > > > > > > > On Fri, Jan 08, 2010 at 06:53:38PM -0500, Mathieu Desnoyers wrote: > > > > > > > > * Steven Rostedt (rostedt@goodmis.org) wrote: > > > > > > > > > Well, if we just grab the task_rq(task)->lock here, then we should be > > > > > > > > > OK? We would guarantee that curr is either the task we want or not. > > > > > > > > > > > > > > > > Hrm, I just tested it, and there seems to be a significant performance > > > > > > > > penality involved with taking these locks for each CPU, even with just 8 > > > > > > > > cores. So if we can do without the locks, that would be preferred. > > > > > > > > > > > > > > How significant? Factor of two? Two orders of magnitude? > > > > > > > > > > > > > > > > > > > On a 8-core Intel Xeon (T is the number of threads receiving the IPIs): > > > > > > > > > > > > Without runqueue locks: > > > > > > > > > > > > T=1: 0m13.911s > > > > > > T=2: 0m20.730s > > > > > > T=3: 0m21.474s > > > > > > T=4: 0m27.952s > > > > > > T=5: 0m26.286s > > > > > > T=6: 0m27.855s > > > > > > T=7: 0m29.695s > > > > > > > > > > > > With runqueue locks: > > > > > > > > > > > > T=1: 0m15.802s > > > > > > T=2: 0m22.484s > > > > > > T=3: 0m24.751s > > > > > > T=4: 0m29.134s > > > > > > T=5: 0m30.094s > > > > > > T=6: 0m33.090s > > > > > > T=7: 0m33.897s > > > > > > > > > > > > So on 8 cores, taking spinlocks for each of the 8 runqueues adds about > > > > > > 15% overhead when doing an IPI to 1 thread. Therefore, that won't be > > > > > > pretty on 128+-core machines. > > > > > > > > > > But isn't the bulk of the overhead the IPIs rather than the runqueue > > > > > locks? > > > > > > > > > > W/out RQ W/RQ % degradation > > > > fix: > > > > W/out RQ W/RQ ratio > > > > > T=1: 13.91 15.8 1.14 > > > > > T=2: 20.73 22.48 1.08 > > > > > T=3: 21.47 24.75 1.15 > > > > > T=4: 27.95 29.13 1.04 > > > > > T=5: 26.29 30.09 1.14 > > > > > T=6: 27.86 33.09 1.19 > > > > > T=7: 29.7 33.9 1.14 > > > > > > > > These numbers tell you that the degradation is roughly constant as we > > > > add more threads (let's consider 1 thread per core, 1 IPI per thread, > > > > with active threads). It is all run on a 8-core system will all cpus > > > > active. As we increase the number of IPIs (e.g. T=2 -> T=7) we add 9s, > > > > for 1.8s/IPI (always for 10,000,000 sys_membarrier() calls), for an > > > > added 180 ns/core per call. (note: T=1 is a special-case, as I do not > > > > allocate any cpumask.) > > > > > > > > Using the spinlocks adds about 3s for 10,000,000 sys_membarrier() calls > > > > or a 8-core system, for an added 300 ns/core per call. > > > > > > > > So the overhead of taking the task lock is about twice higher, per core, > > > > than the overhead of the IPIs. This is understandable if the > > > > architecture does an IPI broadcast: the scalability problem then boils > > > > down to exchange cache-lines to inform the ipi sender that the other > > > > cpus have completed. An atomic operation exchanging a cache-line would > > > > be expected to be within the irqoff+spinlock+spinunlock+irqon overhead. > > > > > > Let me rephrase the question... Isn't the vast bulk of the overhead > > > something other than the runqueue spinlocks? > > > > I don't think so. What we have here is: > > > > O(1) > > - a system call > > - cpumask allocation > > - IPI broadcast > > > > O(nr cpus) > > - wait for IPI handlers to complete > > - runqueue spinlocks > > > > The O(1) operations seems to be about 5x slower than the combined > > O(nr cpus) wait and spinlock operations, but this only means that as > > soon as we have 8 cores, then the bulk of the overhead sits in the > > runqueue spinlock (if we have to take them). > > > > If we don't take spinlocks, then we can go up to 16 cores before the > > bulk of the overhead starts to be the "wait for IPI handlers to > > complete" phase. As you pointed out, we could turn this wait phase into > > a tree hierarchy. However, we cannot do this with the spinlocks, as they > > have to be taken for the initial cpumask iteration. > > > > Therefore, if we don't have to take those spinlocks, we can have a very > > significant gain over this system call overhead, especially on large > > systems. Not taking spinlocks here allows us to use a tree hierarchy to > > turn the bulk of the scalability overhead (waiting for IPI handlers to > > complete) into a O(log(nb cpus)) complexity order, which is quite > > interesting. > > All this would sound plausible if it weren't for the ratio of overheads > for runqueue-lock and non-runqueue-lock versions not varying much from > one to seven CPUs. ;-) Hrm, right. for_each_cpu(cpu, mm_cpumask(current->mm)) only iterates (thus takes locks) on active threads. So this overhead being constant is a bit unexpected. Unless for some weird reason the mm_cpumask would always contain all cpus, but I doubt so. > > And please keep in mind that this operations happens on the URCU slowpath. > Further, there may be opportunities for much larger savings by batching > grace-period requests, given that a single grace period can serve an > arbitrarily large number of synchronize_rcu() requests. That's right. > > > > > > So if we had lots of CPUs, we might want to fan the IPIs out through > > > > > intermediate CPUs in a tree fashion, but the runqueue locks are not > > > > > causing excessive pain. > > > > > > > > A tree hierarchy may not be useful for sending the IPIs (as, hopefully, > > > > they can be broadcasted pretty efficiciently), but however could be > > > > useful when waiting for the IPIs to complete efficiently. > > > > > > OK, given that you precompute the CPU mask, you might be able to take > > > advantage of hardware broadcast, on architectures having it. > > > > > > > > How does this compare to use of POSIX signals? Never mind, POSIX > > > > > signals are arbitrarily bad if you have way more threads than are > > > > > actually running at the time... > > > > > > > > POSIX signals to all threads are terrible in that they require to wake > > > > up all those threads. I have not even thought it useful to compare > > > > these two approaches with benchmarks yet (I'll do that when the > > > > sys_membarrier() support is implemented in liburcu). > > > > > > It would be of some interest. I bet that the runqueue spinlock overhead > > > is -way- down in the noise by comparison to POSIX signals, even when all > > > the threads are running. ;-) > > > > For 1,000,000 iterations, sending signals to execute a remote mb and > > waiting for it to complete: > > Adding the previous results for comparison, and please keep in mind the > need to multiply the left-hand column by ten before comparing to the > right-hand column: > > > W/out RQ W/RQ ratio > > T=1: 0m3.107s T=1: 13.91 15.8 1.14 > > T=2: 0m5.772s T=2: 20.73 22.48 1.08 > > T=3: 0m8.662s T=3: 21.47 24.75 1.15 > > T=4: 0m12.239s T=4: 27.95 29.13 1.04 > > T=5: 0m16.213s T=5: 26.29 30.09 1.14 > > T=6: 0m19.482s T=6: 27.86 33.09 1.19 > > T=7: 0m23.227s T=7: 29.7 33.9 1.14 > > So sys_membarrier() is roughly a factor of two better than POSIX signals > for a single thread, rising to not quite a factor of eight for seven > threads. And this data -does- support the notion that POSIX signals > get increasingly worse with increasing numbers of threads, as one > would expect. Yep. > > > So, per iteration: > > > > T=1: 3107 ns > > T=2: 5772 ns > > T=3: 8662 ns > > T=4: 12239 ns > > T=5: 16213 ns > > T=6: 19482 ns > > T=7: 23227 ns > > > > For an added 3000 ns per extra thread. So, yes, the added 300 ns/core > > for spinlocks is almost lost in the noise compared to the signal-based > > solution, but it's not because the old solution was behaving so poorly > > that we can rely on it to say what is noise vs not in the current > > implementation. Looking at what the scalability bottlenecks are, and > > looking at what is noise within the current implementation seems like > > a more appropriate way to design an efficient system call. > > I agree that your measurements show a marked improvement compared to > POSIX signals that increases with increasing numbers of threads, again, > as expected. I knew it! Why did we need a proof again ? (just kidding) ;) > > > So all in all, we can expect around 6.25-fold improvement because we > > diminish the per-core overhead if we use the spinlocks (480 ns/core vs > > 3000 ns/core), but if we don't take the runqueue spinlocks (180 > > ns/core), then we are contemplating a 16.7-fold improvement. And this is > > without considering a tree-hierarchy for waiting for IPIs to complete, > > which would additionally change the order of the scalability overhead > > from O(n) to O(log(n)). > > Eh? You should be able to use the locks to accumulate a cpumask, then > use whatever you want, including a tree hierarchy, to both send and wait > for the IPIs, right? Yes, although I don't see that a tree hierarchy is useful at all when the architecture supports IPI broadcast efficiently. It's only useful when waiting for these IPIs to complete. > > Keep in mind that the runqueue locks just force memory ordering on the > sampling. It should not be necessary to hold them while sending the > IPIs. Or am I missing something? Yes, that's true. We're just talking about a constant cost per thread here (taking/releasing the runqueue spinlock). Mathieu > > Thanx, Paul -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68