From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
To: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>,
Oleg Nesterov <oleg@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
linux-kernel@vger.kernel.org, Ingo Molnar <mingo@elte.hu>,
akpm@linux-foundation.org, josh@joshtriplett.org,
tglx@linutronix.de, Valdis.Kletnieks@vt.edu, dhowells@redhat.com,
laijs@cn.fujitsu.com, dipankar@in.ibm.com
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
Date: Sat, 9 Jan 2010 20:11:44 -0500 [thread overview]
Message-ID: <20100110011144.GD25790@Krystal> (raw)
In-Reply-To: <20100109235937.GC9044@linux.vnet.ibm.com>
* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Sat, Jan 09, 2010 at 02:20:06PM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Fri, Jan 08, 2010 at 09:38:42PM -0500, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > On Fri, Jan 08, 2010 at 08:02:31PM -0500, Mathieu Desnoyers wrote:
> > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > > On Fri, Jan 08, 2010 at 06:53:38PM -0500, Mathieu Desnoyers wrote:
> > > > > > > > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > > > > > > > Well, if we just grab the task_rq(task)->lock here, then we should be
> > > > > > > > > OK? We would guarantee that curr is either the task we want or not.
> > > > > > > >
> > > > > > > > Hrm, I just tested it, and there seems to be a significant performance
> > > > > > > > penality involved with taking these locks for each CPU, even with just 8
> > > > > > > > cores. So if we can do without the locks, that would be preferred.
> > > > > > >
> > > > > > > How significant? Factor of two? Two orders of magnitude?
> > > > > > >
> > > > > >
> > > > > > On a 8-core Intel Xeon (T is the number of threads receiving the IPIs):
> > > > > >
> > > > > > Without runqueue locks:
> > > > > >
> > > > > > T=1: 0m13.911s
> > > > > > T=2: 0m20.730s
> > > > > > T=3: 0m21.474s
> > > > > > T=4: 0m27.952s
> > > > > > T=5: 0m26.286s
> > > > > > T=6: 0m27.855s
> > > > > > T=7: 0m29.695s
> > > > > >
> > > > > > With runqueue locks:
> > > > > >
> > > > > > T=1: 0m15.802s
> > > > > > T=2: 0m22.484s
> > > > > > T=3: 0m24.751s
> > > > > > T=4: 0m29.134s
> > > > > > T=5: 0m30.094s
> > > > > > T=6: 0m33.090s
> > > > > > T=7: 0m33.897s
> > > > > >
> > > > > > So on 8 cores, taking spinlocks for each of the 8 runqueues adds about
> > > > > > 15% overhead when doing an IPI to 1 thread. Therefore, that won't be
> > > > > > pretty on 128+-core machines.
> > > > >
> > > > > But isn't the bulk of the overhead the IPIs rather than the runqueue
> > > > > locks?
> > > > >
> > > > > W/out RQ W/RQ % degradation
> > > > fix:
> > > > W/out RQ W/RQ ratio
> > > > > T=1: 13.91 15.8 1.14
> > > > > T=2: 20.73 22.48 1.08
> > > > > T=3: 21.47 24.75 1.15
> > > > > T=4: 27.95 29.13 1.04
> > > > > T=5: 26.29 30.09 1.14
> > > > > T=6: 27.86 33.09 1.19
> > > > > T=7: 29.7 33.9 1.14
> > > >
> > > > These numbers tell you that the degradation is roughly constant as we
> > > > add more threads (let's consider 1 thread per core, 1 IPI per thread,
> > > > with active threads). It is all run on a 8-core system will all cpus
> > > > active. As we increase the number of IPIs (e.g. T=2 -> T=7) we add 9s,
> > > > for 1.8s/IPI (always for 10,000,000 sys_membarrier() calls), for an
> > > > added 180 ns/core per call. (note: T=1 is a special-case, as I do not
> > > > allocate any cpumask.)
> > > >
> > > > Using the spinlocks adds about 3s for 10,000,000 sys_membarrier() calls
> > > > or a 8-core system, for an added 300 ns/core per call.
> > > >
> > > > So the overhead of taking the task lock is about twice higher, per core,
> > > > than the overhead of the IPIs. This is understandable if the
> > > > architecture does an IPI broadcast: the scalability problem then boils
> > > > down to exchange cache-lines to inform the ipi sender that the other
> > > > cpus have completed. An atomic operation exchanging a cache-line would
> > > > be expected to be within the irqoff+spinlock+spinunlock+irqon overhead.
> > >
> > > Let me rephrase the question... Isn't the vast bulk of the overhead
> > > something other than the runqueue spinlocks?
> >
> > I don't think so. What we have here is:
> >
> > O(1)
> > - a system call
> > - cpumask allocation
> > - IPI broadcast
> >
> > O(nr cpus)
> > - wait for IPI handlers to complete
> > - runqueue spinlocks
> >
> > The O(1) operations seems to be about 5x slower than the combined
> > O(nr cpus) wait and spinlock operations, but this only means that as
> > soon as we have 8 cores, then the bulk of the overhead sits in the
> > runqueue spinlock (if we have to take them).
> >
> > If we don't take spinlocks, then we can go up to 16 cores before the
> > bulk of the overhead starts to be the "wait for IPI handlers to
> > complete" phase. As you pointed out, we could turn this wait phase into
> > a tree hierarchy. However, we cannot do this with the spinlocks, as they
> > have to be taken for the initial cpumask iteration.
> >
> > Therefore, if we don't have to take those spinlocks, we can have a very
> > significant gain over this system call overhead, especially on large
> > systems. Not taking spinlocks here allows us to use a tree hierarchy to
> > turn the bulk of the scalability overhead (waiting for IPI handlers to
> > complete) into a O(log(nb cpus)) complexity order, which is quite
> > interesting.
>
> All this would sound plausible if it weren't for the ratio of overheads
> for runqueue-lock and non-runqueue-lock versions not varying much from
> one to seven CPUs. ;-)
Hrm, right. for_each_cpu(cpu, mm_cpumask(current->mm)) only iterates
(thus takes locks) on active threads. So this overhead being constant is
a bit unexpected. Unless for some weird reason the mm_cpumask would
always contain all cpus, but I doubt so.
>
> And please keep in mind that this operations happens on the URCU slowpath.
> Further, there may be opportunities for much larger savings by batching
> grace-period requests, given that a single grace period can serve an
> arbitrarily large number of synchronize_rcu() requests.
That's right.
>
> > > > > So if we had lots of CPUs, we might want to fan the IPIs out through
> > > > > intermediate CPUs in a tree fashion, but the runqueue locks are not
> > > > > causing excessive pain.
> > > >
> > > > A tree hierarchy may not be useful for sending the IPIs (as, hopefully,
> > > > they can be broadcasted pretty efficiciently), but however could be
> > > > useful when waiting for the IPIs to complete efficiently.
> > >
> > > OK, given that you precompute the CPU mask, you might be able to take
> > > advantage of hardware broadcast, on architectures having it.
> > >
> > > > > How does this compare to use of POSIX signals? Never mind, POSIX
> > > > > signals are arbitrarily bad if you have way more threads than are
> > > > > actually running at the time...
> > > >
> > > > POSIX signals to all threads are terrible in that they require to wake
> > > > up all those threads. I have not even thought it useful to compare
> > > > these two approaches with benchmarks yet (I'll do that when the
> > > > sys_membarrier() support is implemented in liburcu).
> > >
> > > It would be of some interest. I bet that the runqueue spinlock overhead
> > > is -way- down in the noise by comparison to POSIX signals, even when all
> > > the threads are running. ;-)
> >
> > For 1,000,000 iterations, sending signals to execute a remote mb and
> > waiting for it to complete:
>
> Adding the previous results for comparison, and please keep in mind the
> need to multiply the left-hand column by ten before comparing to the
> right-hand column:
>
> > W/out RQ W/RQ ratio
> > T=1: 0m3.107s T=1: 13.91 15.8 1.14
> > T=2: 0m5.772s T=2: 20.73 22.48 1.08
> > T=3: 0m8.662s T=3: 21.47 24.75 1.15
> > T=4: 0m12.239s T=4: 27.95 29.13 1.04
> > T=5: 0m16.213s T=5: 26.29 30.09 1.14
> > T=6: 0m19.482s T=6: 27.86 33.09 1.19
> > T=7: 0m23.227s T=7: 29.7 33.9 1.14
>
> So sys_membarrier() is roughly a factor of two better than POSIX signals
> for a single thread, rising to not quite a factor of eight for seven
> threads. And this data -does- support the notion that POSIX signals
> get increasingly worse with increasing numbers of threads, as one
> would expect.
Yep.
>
> > So, per iteration:
> >
> > T=1: 3107 ns
> > T=2: 5772 ns
> > T=3: 8662 ns
> > T=4: 12239 ns
> > T=5: 16213 ns
> > T=6: 19482 ns
> > T=7: 23227 ns
> >
> > For an added 3000 ns per extra thread. So, yes, the added 300 ns/core
> > for spinlocks is almost lost in the noise compared to the signal-based
> > solution, but it's not because the old solution was behaving so poorly
> > that we can rely on it to say what is noise vs not in the current
> > implementation. Looking at what the scalability bottlenecks are, and
> > looking at what is noise within the current implementation seems like
> > a more appropriate way to design an efficient system call.
>
> I agree that your measurements show a marked improvement compared to
> POSIX signals that increases with increasing numbers of threads, again,
> as expected.
I knew it! Why did we need a proof again ? (just kidding) ;)
>
> > So all in all, we can expect around 6.25-fold improvement because we
> > diminish the per-core overhead if we use the spinlocks (480 ns/core vs
> > 3000 ns/core), but if we don't take the runqueue spinlocks (180
> > ns/core), then we are contemplating a 16.7-fold improvement. And this is
> > without considering a tree-hierarchy for waiting for IPIs to complete,
> > which would additionally change the order of the scalability overhead
> > from O(n) to O(log(n)).
>
> Eh? You should be able to use the locks to accumulate a cpumask, then
> use whatever you want, including a tree hierarchy, to both send and wait
> for the IPIs, right?
Yes, although I don't see that a tree hierarchy is useful at all when
the architecture supports IPI broadcast efficiently. It's only useful
when waiting for these IPIs to complete.
>
> Keep in mind that the runqueue locks just force memory ordering on the
> sampling. It should not be necessary to hold them while sending the
> IPIs. Or am I missing something?
Yes, that's true. We're just talking about a constant cost per thread
here (taking/releasing the runqueue spinlock).
Mathieu
>
> Thanx, Paul
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
next prev parent reply other threads:[~2010-01-10 1:11 UTC|newest]
Thread overview: 107+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-01-07 4:40 [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier Mathieu Desnoyers
2010-01-07 5:02 ` Paul E. McKenney
2010-01-07 5:39 ` Mathieu Desnoyers
2010-01-07 8:32 ` Peter Zijlstra
2010-01-07 16:39 ` Paul E. McKenney
2010-01-07 5:28 ` Josh Triplett
2010-01-07 6:04 ` Mathieu Desnoyers
2010-01-07 6:32 ` Josh Triplett
2010-01-07 17:45 ` Mathieu Desnoyers
2010-01-07 16:46 ` Paul E. McKenney
2010-01-07 5:40 ` Steven Rostedt
2010-01-07 6:19 ` Mathieu Desnoyers
2010-01-07 6:35 ` Josh Triplett
2010-01-07 8:44 ` Peter Zijlstra
2010-01-07 13:15 ` Steven Rostedt
2010-01-07 15:07 ` Mathieu Desnoyers
2010-01-07 16:52 ` Paul E. McKenney
2010-01-07 17:18 ` Peter Zijlstra
2010-01-07 17:31 ` Paul E. McKenney
2010-01-07 17:44 ` Mathieu Desnoyers
2010-01-07 17:55 ` Paul E. McKenney
2010-01-07 17:44 ` Steven Rostedt
2010-01-07 17:56 ` Paul E. McKenney
2010-01-07 18:04 ` Steven Rostedt
2010-01-07 18:40 ` Paul E. McKenney
2010-01-07 17:36 ` Mathieu Desnoyers
2010-01-07 14:27 ` Steven Rostedt
2010-01-07 15:10 ` Mathieu Desnoyers
2010-01-07 16:49 ` Paul E. McKenney
2010-01-07 17:00 ` Steven Rostedt
2010-01-07 8:27 ` Peter Zijlstra
2010-01-07 18:30 ` Oleg Nesterov
2010-01-07 18:39 ` Paul E. McKenney
2010-01-07 18:59 ` Steven Rostedt
2010-01-07 19:16 ` Paul E. McKenney
2010-01-07 19:40 ` Steven Rostedt
2010-01-07 20:58 ` Paul E. McKenney
2010-01-07 21:35 ` Steven Rostedt
2010-01-07 22:34 ` Paul E. McKenney
2010-01-08 22:28 ` Mathieu Desnoyers
2010-01-08 23:53 ` Mathieu Desnoyers
2010-01-09 0:20 ` Paul E. McKenney
2010-01-09 1:02 ` Mathieu Desnoyers
2010-01-09 1:21 ` Paul E. McKenney
2010-01-09 1:22 ` Paul E. McKenney
2010-01-09 2:38 ` Mathieu Desnoyers
2010-01-09 5:42 ` Paul E. McKenney
2010-01-09 19:20 ` Mathieu Desnoyers
2010-01-09 23:05 ` Steven Rostedt
2010-01-09 23:16 ` Steven Rostedt
2010-01-10 0:03 ` Paul E. McKenney
2010-01-10 0:41 ` Steven Rostedt
2010-01-10 1:14 ` Mathieu Desnoyers
2010-01-10 1:44 ` Mathieu Desnoyers
2010-01-10 2:12 ` Steven Rostedt
2010-01-10 5:25 ` Paul E. McKenney
2010-01-10 11:50 ` Steven Rostedt
2010-01-10 16:03 ` Mathieu Desnoyers
2010-01-10 16:21 ` Steven Rostedt
2010-01-10 17:10 ` Mathieu Desnoyers
2010-01-10 21:02 ` Steven Rostedt
2010-01-10 21:41 ` Mathieu Desnoyers
2010-01-11 1:21 ` Paul E. McKenney
2010-01-10 17:45 ` Paul E. McKenney
2010-01-10 18:24 ` Mathieu Desnoyers
2010-01-11 1:17 ` Paul E. McKenney
2010-01-11 4:25 ` Mathieu Desnoyers
2010-01-11 4:29 ` [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3a) Mathieu Desnoyers
2010-01-11 17:27 ` Paul E. McKenney
2010-01-11 17:35 ` Mathieu Desnoyers
2010-01-11 17:50 ` Peter Zijlstra
2010-01-11 20:52 ` Mathieu Desnoyers
2010-01-11 21:19 ` Peter Zijlstra
2010-01-11 22:04 ` Mathieu Desnoyers
2010-01-11 22:20 ` Peter Zijlstra
2010-01-11 22:48 ` Paul E. McKenney
2010-01-11 22:48 ` Mathieu Desnoyers
2010-01-11 21:19 ` Peter Zijlstra
2010-01-11 21:31 ` Peter Zijlstra
2010-01-11 4:30 ` [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3b) Mathieu Desnoyers
2010-01-11 22:43 ` Paul E. McKenney
2010-01-12 15:38 ` Mathieu Desnoyers
2010-01-12 16:27 ` Steven Rostedt
2010-01-12 16:38 ` Mathieu Desnoyers
2010-01-12 16:54 ` Paul E. McKenney
2010-01-12 18:12 ` Paul E. McKenney
2010-01-12 18:56 ` Mathieu Desnoyers
2010-01-13 0:23 ` Paul E. McKenney
2010-01-11 16:25 ` [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier Paul E. McKenney
2010-01-11 20:21 ` Mathieu Desnoyers
2010-01-11 21:48 ` Paul E. McKenney
2010-01-14 2:56 ` Lai Jiangshan
2010-01-14 5:13 ` Paul E. McKenney
2010-01-14 5:39 ` Mathieu Desnoyers
2010-01-10 5:18 ` Paul E. McKenney
2010-01-10 1:12 ` Mathieu Desnoyers
2010-01-10 5:19 ` Paul E. McKenney
2010-01-10 1:04 ` Mathieu Desnoyers
2010-01-10 1:01 ` Mathieu Desnoyers
2010-01-09 23:59 ` Paul E. McKenney
2010-01-10 1:11 ` Mathieu Desnoyers [this message]
2010-01-07 9:50 ` Andi Kleen
2010-01-07 15:12 ` Mathieu Desnoyers
2010-01-07 16:56 ` Paul E. McKenney
2010-01-07 11:04 ` David Howells
2010-01-07 15:15 ` Mathieu Desnoyers
2010-01-07 15:47 ` David Howells
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20100110011144.GD25790@Krystal \
--to=mathieu.desnoyers@polymtl.ca \
--cc=Valdis.Kletnieks@vt.edu \
--cc=akpm@linux-foundation.org \
--cc=dhowells@redhat.com \
--cc=dipankar@in.ibm.com \
--cc=josh@joshtriplett.org \
--cc=laijs@cn.fujitsu.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@elte.hu \
--cc=oleg@redhat.com \
--cc=paulmck@linux.vnet.ibm.com \
--cc=peterz@infradead.org \
--cc=rostedt@goodmis.org \
--cc=tglx@linutronix.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox