public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
To: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>,
	Oleg Nesterov <oleg@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-kernel@vger.kernel.org, Ingo Molnar <mingo@elte.hu>,
	akpm@linux-foundation.org, josh@joshtriplett.org,
	tglx@linutronix.de, Valdis.Kletnieks@vt.edu, dhowells@redhat.com,
	laijs@cn.fujitsu.com, dipankar@in.ibm.com
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
Date: Sat, 9 Jan 2010 20:11:44 -0500	[thread overview]
Message-ID: <20100110011144.GD25790@Krystal> (raw)
In-Reply-To: <20100109235937.GC9044@linux.vnet.ibm.com>

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Sat, Jan 09, 2010 at 02:20:06PM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Fri, Jan 08, 2010 at 09:38:42PM -0500, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > On Fri, Jan 08, 2010 at 08:02:31PM -0500, Mathieu Desnoyers wrote:
> > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > > On Fri, Jan 08, 2010 at 06:53:38PM -0500, Mathieu Desnoyers wrote:
> > > > > > > > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > > > > > > > Well, if we just grab the task_rq(task)->lock here, then we should be
> > > > > > > > > OK? We would guarantee that curr is either the task we want or not.
> > > > > > > > 
> > > > > > > > Hrm, I just tested it, and there seems to be a significant performance
> > > > > > > > penality involved with taking these locks for each CPU, even with just 8
> > > > > > > > cores. So if we can do without the locks, that would be preferred.
> > > > > > > 
> > > > > > > How significant?  Factor of two?  Two orders of magnitude?
> > > > > > > 
> > > > > > 
> > > > > > On a 8-core Intel Xeon (T is the number of threads receiving the IPIs):
> > > > > > 
> > > > > > Without runqueue locks:
> > > > > > 
> > > > > > T=1: 0m13.911s
> > > > > > T=2: 0m20.730s
> > > > > > T=3: 0m21.474s
> > > > > > T=4: 0m27.952s
> > > > > > T=5: 0m26.286s
> > > > > > T=6: 0m27.855s
> > > > > > T=7: 0m29.695s
> > > > > > 
> > > > > > With runqueue locks:
> > > > > > 
> > > > > > T=1: 0m15.802s
> > > > > > T=2: 0m22.484s
> > > > > > T=3: 0m24.751s
> > > > > > T=4: 0m29.134s
> > > > > > T=5: 0m30.094s
> > > > > > T=6: 0m33.090s
> > > > > > T=7: 0m33.897s
> > > > > > 
> > > > > > So on 8 cores, taking spinlocks for each of the 8 runqueues adds about
> > > > > > 15% overhead when doing an IPI to 1 thread. Therefore, that won't be
> > > > > > pretty on 128+-core machines.
> > > > > 
> > > > > But isn't the bulk of the overhead the IPIs rather than the runqueue
> > > > > locks?
> > > > > 
> > > > >      W/out RQ       W/RQ   % degradation
> > > > fix:
> > > >        W/out RQ       W/RQ   ratio
> > > > > T=1:    13.91      15.8    1.14
> > > > > T=2:    20.73      22.48   1.08
> > > > > T=3:    21.47      24.75   1.15
> > > > > T=4:    27.95      29.13   1.04
> > > > > T=5:    26.29      30.09   1.14
> > > > > T=6:    27.86      33.09   1.19
> > > > > T=7:    29.7       33.9    1.14
> > > > 
> > > > These numbers tell you that the degradation is roughly constant as we
> > > > add more threads (let's consider 1 thread per core, 1 IPI per thread,
> > > > with active threads). It is all run on a 8-core system will all cpus
> > > > active. As we increase the number of IPIs (e.g. T=2 -> T=7) we add 9s,
> > > > for 1.8s/IPI (always for 10,000,000 sys_membarrier() calls), for an
> > > > added 180 ns/core per call. (note: T=1 is a special-case, as I do not
> > > > allocate any cpumask.)
> > > > 
> > > > Using the spinlocks adds about 3s for 10,000,000 sys_membarrier() calls
> > > > or a 8-core system, for an added 300 ns/core per call.
> > > > 
> > > > So the overhead of taking the task lock is about twice higher, per core,
> > > > than the overhead of the IPIs. This is understandable if the
> > > > architecture does an IPI broadcast: the scalability problem then boils
> > > > down to exchange cache-lines to inform the ipi sender that the other
> > > > cpus have completed. An atomic operation exchanging a cache-line would
> > > > be expected to be within the irqoff+spinlock+spinunlock+irqon overhead.
> > > 
> > > Let me rephrase the question...  Isn't the vast bulk of the overhead
> > > something other than the runqueue spinlocks?
> > 
> > I don't think so. What we have here is:
> > 
> > O(1)
> > - a system call
> > - cpumask allocation
> > - IPI broadcast
> > 
> > O(nr cpus)
> > - wait for IPI handlers to complete
> > - runqueue spinlocks
> > 
> > The O(1) operations seems to be about 5x slower than the combined
> > O(nr cpus) wait and spinlock operations, but this only means that as
> > soon as we have 8 cores, then the bulk of the overhead sits in the
> > runqueue spinlock (if we have to take them).
> > 
> > If we don't take spinlocks, then we can go up to 16 cores before the
> > bulk of the overhead starts to be the "wait for IPI handlers to
> > complete" phase. As you pointed out, we could turn this wait phase into
> > a tree hierarchy. However, we cannot do this with the spinlocks, as they
> > have to be taken for the initial cpumask iteration.
> > 
> > Therefore, if we don't have to take those spinlocks, we can have a very
> > significant gain over this system call overhead, especially on large
> > systems. Not taking spinlocks here allows us to use a tree hierarchy to
> > turn the bulk of the scalability overhead (waiting for IPI handlers to
> > complete) into a O(log(nb cpus)) complexity order, which is quite
> > interesting.
> 
> All this would sound plausible if it weren't for the ratio of overheads
> for runqueue-lock and non-runqueue-lock versions not varying much from
> one to seven CPUs.  ;-)

Hrm, right. for_each_cpu(cpu, mm_cpumask(current->mm)) only iterates
(thus takes locks) on active threads. So this overhead being constant is
a bit unexpected. Unless for some weird reason the mm_cpumask would
always contain all cpus, but I doubt so.

> 
> And please keep in mind that this operations happens on the URCU slowpath.
> Further, there may be opportunities for much larger savings by batching
> grace-period requests, given that a single grace period can serve an
> arbitrarily large number of synchronize_rcu() requests.

That's right.

> 
> > > > > So if we had lots of CPUs, we might want to fan the IPIs out through
> > > > > intermediate CPUs in a tree fashion, but the runqueue locks are not
> > > > > causing excessive pain.
> > > > 
> > > > A tree hierarchy may not be useful for sending the IPIs (as, hopefully,
> > > > they can be broadcasted pretty efficiciently), but however could be
> > > > useful when waiting for the IPIs to complete efficiently.
> > > 
> > > OK, given that you precompute the CPU mask, you might be able to take
> > > advantage of hardware broadcast, on architectures having it.
> > > 
> > > > > How does this compare to use of POSIX signals?  Never mind, POSIX
> > > > > signals are arbitrarily bad if you have way more threads than are
> > > > > actually running at the time...
> > > > 
> > > > POSIX signals to all threads are terrible in that they require to wake
> > > > up all those threads. I have not even thought it useful to compare
> > > > these two approaches with benchmarks yet (I'll do that when the
> > > > sys_membarrier() support is implemented in liburcu).
> > > 
> > > It would be of some interest.  I bet that the runqueue spinlock overhead
> > > is -way- down in the noise by comparison to POSIX signals, even when all
> > > the threads are running.  ;-)
> > 
> > For 1,000,000 iterations, sending signals to execute a remote mb and
> > waiting for it to complete:
> 
> Adding the previous results for comparison, and please keep in mind the
> need to multiply the left-hand column by ten before comparing to the
> right-hand column:
> 
> >                              W/out RQ       W/RQ   ratio
> > T=1: 0m3.107s           T=1:    13.91      15.8    1.14
> > T=2: 0m5.772s           T=2:    20.73      22.48   1.08
> > T=3: 0m8.662s           T=3:    21.47      24.75   1.15
> > T=4: 0m12.239s          T=4:    27.95      29.13   1.04
> > T=5: 0m16.213s          T=5:    26.29      30.09   1.14
> > T=6: 0m19.482s          T=6:    27.86      33.09   1.19
> > T=7: 0m23.227s          T=7:    29.7       33.9    1.14
> 
> So sys_membarrier() is roughly a factor of two better than POSIX signals
> for a single thread, rising to not quite a factor of eight for seven
> threads.  And this data -does- support the notion that POSIX signals
> get increasingly worse with increasing numbers of threads, as one
> would expect.

Yep.

> 
> > So, per iteration:
> > 
> > T=1: 3107 ns
> > T=2: 5772 ns
> > T=3: 8662 ns
> > T=4: 12239 ns
> > T=5: 16213 ns
> > T=6: 19482 ns
> > T=7: 23227 ns
> > 
> > For an added 3000 ns per extra thread. So, yes, the added 300 ns/core
> > for spinlocks is almost lost in the noise compared to the signal-based
> > solution, but it's not because the old solution was behaving so poorly
> > that we can rely on it to say what is noise vs not in the current
> > implementation. Looking at what the scalability bottlenecks are, and
> > looking at what is noise within the current implementation seems like
> > a more appropriate way to design an efficient system call.
> 
> I agree that your measurements show a marked improvement compared to
> POSIX signals that increases with increasing numbers of threads, again,
> as expected.

I knew it! Why did we need a proof again ? (just kidding) ;)

> 
> > So all in all, we can expect around 6.25-fold improvement because we
> > diminish the per-core overhead if we use the spinlocks (480 ns/core vs
> > 3000 ns/core), but if we don't take the runqueue spinlocks (180
> > ns/core), then we are contemplating a 16.7-fold improvement. And this is
> > without considering a tree-hierarchy for waiting for IPIs to complete,
> > which would additionally change the order of the scalability overhead
> > from O(n) to O(log(n)).
> 
> Eh?  You should be able to use the locks to accumulate a cpumask, then
> use whatever you want, including a tree hierarchy, to both send and wait
> for the IPIs, right?

Yes, although I don't see that a tree hierarchy is useful at all when
the architecture supports IPI broadcast efficiently. It's only useful
when waiting for these IPIs to complete.

> 
> Keep in mind that the runqueue locks just force memory ordering on the
> sampling.  It should not be necessary to hold them while sending the
> IPIs.  Or am I missing something?

Yes, that's true. We're just talking about a constant cost per thread
here (taking/releasing the runqueue spinlock).

Mathieu

> 
> 							Thanx, Paul

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

  reply	other threads:[~2010-01-10  1:11 UTC|newest]

Thread overview: 107+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-01-07  4:40 [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier Mathieu Desnoyers
2010-01-07  5:02 ` Paul E. McKenney
2010-01-07  5:39   ` Mathieu Desnoyers
2010-01-07  8:32   ` Peter Zijlstra
2010-01-07 16:39     ` Paul E. McKenney
2010-01-07  5:28 ` Josh Triplett
2010-01-07  6:04   ` Mathieu Desnoyers
2010-01-07  6:32     ` Josh Triplett
2010-01-07 17:45       ` Mathieu Desnoyers
2010-01-07 16:46     ` Paul E. McKenney
2010-01-07  5:40 ` Steven Rostedt
2010-01-07  6:19   ` Mathieu Desnoyers
2010-01-07  6:35     ` Josh Triplett
2010-01-07  8:44       ` Peter Zijlstra
2010-01-07 13:15         ` Steven Rostedt
2010-01-07 15:07         ` Mathieu Desnoyers
2010-01-07 16:52         ` Paul E. McKenney
2010-01-07 17:18           ` Peter Zijlstra
2010-01-07 17:31             ` Paul E. McKenney
2010-01-07 17:44               ` Mathieu Desnoyers
2010-01-07 17:55                 ` Paul E. McKenney
2010-01-07 17:44               ` Steven Rostedt
2010-01-07 17:56                 ` Paul E. McKenney
2010-01-07 18:04                   ` Steven Rostedt
2010-01-07 18:40                     ` Paul E. McKenney
2010-01-07 17:36             ` Mathieu Desnoyers
2010-01-07 14:27     ` Steven Rostedt
2010-01-07 15:10       ` Mathieu Desnoyers
2010-01-07 16:49   ` Paul E. McKenney
2010-01-07 17:00     ` Steven Rostedt
2010-01-07  8:27 ` Peter Zijlstra
2010-01-07 18:30   ` Oleg Nesterov
2010-01-07 18:39     ` Paul E. McKenney
2010-01-07 18:59       ` Steven Rostedt
2010-01-07 19:16         ` Paul E. McKenney
2010-01-07 19:40           ` Steven Rostedt
2010-01-07 20:58             ` Paul E. McKenney
2010-01-07 21:35               ` Steven Rostedt
2010-01-07 22:34                 ` Paul E. McKenney
2010-01-08 22:28                 ` Mathieu Desnoyers
2010-01-08 23:53                 ` Mathieu Desnoyers
2010-01-09  0:20                   ` Paul E. McKenney
2010-01-09  1:02                     ` Mathieu Desnoyers
2010-01-09  1:21                       ` Paul E. McKenney
2010-01-09  1:22                         ` Paul E. McKenney
2010-01-09  2:38                         ` Mathieu Desnoyers
2010-01-09  5:42                           ` Paul E. McKenney
2010-01-09 19:20                             ` Mathieu Desnoyers
2010-01-09 23:05                               ` Steven Rostedt
2010-01-09 23:16                                 ` Steven Rostedt
2010-01-10  0:03                                   ` Paul E. McKenney
2010-01-10  0:41                                     ` Steven Rostedt
2010-01-10  1:14                                       ` Mathieu Desnoyers
2010-01-10  1:44                                       ` Mathieu Desnoyers
2010-01-10  2:12                                         ` Steven Rostedt
2010-01-10  5:25                                           ` Paul E. McKenney
2010-01-10 11:50                                             ` Steven Rostedt
2010-01-10 16:03                                               ` Mathieu Desnoyers
2010-01-10 16:21                                                 ` Steven Rostedt
2010-01-10 17:10                                                   ` Mathieu Desnoyers
2010-01-10 21:02                                                     ` Steven Rostedt
2010-01-10 21:41                                                       ` Mathieu Desnoyers
2010-01-11  1:21                                                       ` Paul E. McKenney
2010-01-10 17:45                                               ` Paul E. McKenney
2010-01-10 18:24                                                 ` Mathieu Desnoyers
2010-01-11  1:17                                                   ` Paul E. McKenney
2010-01-11  4:25                                                     ` Mathieu Desnoyers
2010-01-11  4:29                                                       ` [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3a) Mathieu Desnoyers
2010-01-11 17:27                                                         ` Paul E. McKenney
2010-01-11 17:35                                                           ` Mathieu Desnoyers
2010-01-11 17:50                                                         ` Peter Zijlstra
2010-01-11 20:52                                                           ` Mathieu Desnoyers
2010-01-11 21:19                                                             ` Peter Zijlstra
2010-01-11 22:04                                                               ` Mathieu Desnoyers
2010-01-11 22:20                                                                 ` Peter Zijlstra
2010-01-11 22:48                                                                   ` Paul E. McKenney
2010-01-11 22:48                                                                   ` Mathieu Desnoyers
2010-01-11 21:19                                                             ` Peter Zijlstra
2010-01-11 21:31                                                             ` Peter Zijlstra
2010-01-11  4:30                                                       ` [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3b) Mathieu Desnoyers
2010-01-11 22:43                                                         ` Paul E. McKenney
2010-01-12 15:38                                                           ` Mathieu Desnoyers
2010-01-12 16:27                                                             ` Steven Rostedt
2010-01-12 16:38                                                               ` Mathieu Desnoyers
2010-01-12 16:54                                                               ` Paul E. McKenney
2010-01-12 18:12                                                             ` Paul E. McKenney
2010-01-12 18:56                                                               ` Mathieu Desnoyers
2010-01-13  0:23                                                                 ` Paul E. McKenney
2010-01-11 16:25                                                       ` [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier Paul E. McKenney
2010-01-11 20:21                                                         ` Mathieu Desnoyers
2010-01-11 21:48                                                           ` Paul E. McKenney
2010-01-14  2:56                                                             ` Lai Jiangshan
2010-01-14  5:13                                                               ` Paul E. McKenney
2010-01-14  5:39                                                                 ` Mathieu Desnoyers
2010-01-10  5:18                                         ` Paul E. McKenney
2010-01-10  1:12                                     ` Mathieu Desnoyers
2010-01-10  5:19                                       ` Paul E. McKenney
2010-01-10  1:04                                   ` Mathieu Desnoyers
2010-01-10  1:01                                 ` Mathieu Desnoyers
2010-01-09 23:59                               ` Paul E. McKenney
2010-01-10  1:11                                 ` Mathieu Desnoyers [this message]
2010-01-07  9:50 ` Andi Kleen
2010-01-07 15:12   ` Mathieu Desnoyers
2010-01-07 16:56   ` Paul E. McKenney
2010-01-07 11:04 ` David Howells
2010-01-07 15:15   ` Mathieu Desnoyers
2010-01-07 15:47     ` David Howells

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100110011144.GD25790@Krystal \
    --to=mathieu.desnoyers@polymtl.ca \
    --cc=Valdis.Kletnieks@vt.edu \
    --cc=akpm@linux-foundation.org \
    --cc=dhowells@redhat.com \
    --cc=dipankar@in.ibm.com \
    --cc=josh@joshtriplett.org \
    --cc=laijs@cn.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=oleg@redhat.com \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox