From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Steven Rostedt <rostedt@goodmis.org>,
	Oleg Nesterov <oleg@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-kernel@vger.kernel.org, Ingo Molnar <mingo@elte.hu>,
	akpm@linux-foundation.org, josh@joshtriplett.org,
	tglx@linutronix.de, Valdis.Kletnieks@vt.edu, dhowells@redhat.com,
	laijs@cn.fujitsu.com, dipankar@in.ibm.com
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
Date: Fri, 8 Jan 2010 21:42:15 -0800
Message-ID: <20100109054215.GB9044@linux.vnet.ibm.com>
In-Reply-To: <20100109023842.GA1696@Krystal>

On Fri, Jan 08, 2010 at 09:38:42PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Fri, Jan 08, 2010 at 08:02:31PM -0500, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > On Fri, Jan 08, 2010 at 06:53:38PM -0500, Mathieu Desnoyers wrote:
> > > > > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > > > > Well, if we just grab the task_rq(task)->lock here, then we should be
> > > > > > OK? We would guarantee that curr is either the task we want or not.
> > > > > 
> > > > > Hrm, I just tested it, and there seems to be a significant performance
> > > > > > penalty involved with taking these locks for each CPU, even with just 8
> > > > > cores. So if we can do without the locks, that would be preferred.
> > > > 
> > > > How significant?  Factor of two?  Two orders of magnitude?
> > > > 
> > > 
> > > On an 8-core Intel Xeon (T is the number of threads receiving the IPIs):
> > > 
> > > Without runqueue locks:
> > > 
> > > T=1: 0m13.911s
> > > T=2: 0m20.730s
> > > T=3: 0m21.474s
> > > T=4: 0m27.952s
> > > T=5: 0m26.286s
> > > T=6: 0m27.855s
> > > T=7: 0m29.695s
> > > 
> > > With runqueue locks:
> > > 
> > > T=1: 0m15.802s
> > > T=2: 0m22.484s
> > > T=3: 0m24.751s
> > > T=4: 0m29.134s
> > > T=5: 0m30.094s
> > > T=6: 0m33.090s
> > > T=7: 0m33.897s
> > > 
> > > So on 8 cores, taking spinlocks for each of the 8 runqueues adds about
> > > 15% overhead when doing an IPI to 1 thread. Therefore, that won't be
> > > pretty on 128+-core machines.
> > 
> > But isn't the bulk of the overhead the IPIs rather than the runqueue
> > locks?
> > 
> >      W/out RQ       W/RQ   % degradation
> fix:
>        W/out RQ       W/RQ   ratio
> > T=1:    13.91      15.8    1.14
> > T=2:    20.73      22.48   1.08
> > T=3:    21.47      24.75   1.15
> > T=4:    27.95      29.13   1.04
> > T=5:    26.29      30.09   1.14
> > T=6:    27.86      33.09   1.19
> > T=7:    29.7       33.9    1.14
> 
> These numbers tell you that the degradation is roughly constant as we
> add more threads (let's consider 1 thread per core, 1 IPI per thread,
> with active threads). It is all run on an 8-core system with all cpus
> active. As we increase the number of IPIs (e.g. T=2 -> T=7) we add 9s,
> for 1.8s/IPI (always for 10,000,000 sys_membarrier() calls), for an
> added 180 ns/core per call. (note: T=1 is a special-case, as I do not
> allocate any cpumask.)
> 
> Using the spinlocks adds about 3s for 10,000,000 sys_membarrier() calls
> on an 8-core system, for an added 300 ns/core per call.
> 
> So the overhead of taking the task lock is about twice as high, per core,
> than the overhead of the IPIs. This is understandable if the
> architecture does an IPI broadcast: the scalability problem then boils
> down to exchanging cache lines to inform the IPI sender that the other
> cpus have completed. An atomic operation exchanging a cache-line would
> be expected to be within the irqoff+spinlock+spinunlock+irqon overhead.

Let me rephrase the question...  Isn't the vast bulk of the overhead
something other than the runqueue spinlocks?
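
As a rough check of that question against the timings quoted above, the
sketch below recomputes the with-lock/without-lock ratio and the share of
total run time that the locks account for. The timings are copied from the
quoted tables; the function names (lock_share, added_ns_per_call) are made
up for illustration, and the run-to-run noise visible in the T=4 and T=5
rows is ignored.

```python
# Wall-clock seconds for 10,000,000 sys_membarrier() calls, copied from the
# tables quoted above. T is the number of threads receiving the IPIs.
without_rq = {1: 13.911, 2: 20.730, 3: 21.474, 4: 27.952,
              5: 26.286, 6: 27.855, 7: 29.695}
with_rq = {1: 15.802, 2: 22.484, 3: 24.751, 4: 29.134,
           5: 30.094, 6: 33.090, 7: 33.897}

CALLS = 10_000_000  # number of sys_membarrier() calls per run


def lock_share(t):
    """Fraction of the with-lock run time attributable to the runqueue locks."""
    return (with_rq[t] - without_rq[t]) / with_rq[t]


def added_ns_per_call(t):
    """Extra nanoseconds per call when the runqueue locks are taken."""
    return (with_rq[t] - without_rq[t]) / CALLS * 1e9


for t in sorted(with_rq):
    ratio = with_rq[t] / without_rq[t]
    print(f"T={t}: ratio {ratio:.2f}, lock share {lock_share(t):.1%}, "
          f"+{added_ns_per_call(t):.0f} ns/call")
```

On these figures the locks account for well under a fifth of the total
time in every row, so most of the time is going somewhere else.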

> > So if we had lots of CPUs, we might want to fan the IPIs out through
> > intermediate CPUs in a tree fashion, but the runqueue locks are not
> > causing excessive pain.
> 
> A tree hierarchy may not be useful for sending the IPIs (as, hopefully,
> they can be broadcast pretty efficiently), but it could be useful for
> waiting efficiently for the IPIs to complete.

OK, given that you precompute the CPU mask, you might be able to take
advantage of hardware broadcast, on architectures having it.
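
To put a rough scale on the tree idea discussed above, here is a small
sketch of how deep a fan-out (or completion-combining) tree would need to
be for a given CPU count. This is only back-of-the-envelope arithmetic, not
anything from the patch: tree_depth() and the fan-out values are
hypothetical, and the model assumes a complete tree in which each CPU
signals, or collects completions from, at most "fanout" children.

```python
def tree_depth(ncpus, fanout):
    """Depth of the shallowest complete fanout-ary tree holding ncpus nodes.

    Depth 0 is the root alone (the CPU issuing the IPIs). Each extra level
    multiplies the number of CPUs on that level by fanout.
    """
    if ncpus < 1 or fanout < 2:
        raise ValueError("need ncpus >= 1 and fanout >= 2")
    depth = 0
    level_size = 1   # CPUs on the current (deepest) level
    covered = 1      # CPUs in the tree so far, root included
    while covered < ncpus:
        level_size *= fanout
        covered += level_size
        depth += 1
    return depth


for ncpus in (8, 128, 4096):
    for fanout in (2, 8, 16):
        print(f"{ncpus} CPUs, fan-out {fanout}: depth {tree_depth(ncpus, fanout)}")
```

With a flat scheme every CPU reports to the sender directly, while in a
tree each node only exchanges cache lines with its own children, at the
price of the extra levels computed here.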

> > How does this compare to use of POSIX signals?  Never mind, POSIX
> > signals are arbitrarily bad if you have way more threads than are
> > actually running at the time...
> 
> POSIX signals to all threads are terrible in that they require waking
> up all those threads. I have not even thought it useful to compare
> these two approaches with benchmarks yet (I'll do that when the
> sys_membarrier() support is implemented in liburcu).

It would be of some interest.  I bet that the runqueue spinlock overhead
is -way- down in the noise by comparison to POSIX signals, even when all
the threads are running.  ;-)

							Thanx, Paul
