From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
To: Josh Triplett <josh@joshtriplett.org>
Cc: linux-kernel@vger.kernel.org,
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
Ingo Molnar <mingo@elte.hu>,
akpm@linux-foundation.org, tglx@linutronix.de,
peterz@infradead.org, rostedt@goodmis.org,
Valdis.Kletnieks@vt.edu, dhowells@redhat.com,
laijs@cn.fujitsu.com, dipankar@in.ibm.com
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
Date: Thu, 7 Jan 2010 12:45:05 -0500 [thread overview]
Message-ID: <20100107174505.GA9612@Krystal> (raw)
In-Reply-To: <20100107063207.GA12939@feather>
* Josh Triplett (josh@joshtriplett.org) wrote:
> On Thu, Jan 07, 2010 at 01:04:39AM -0500, Mathieu Desnoyers wrote:
[...]
> > Just tried it with a 10,000,000 iterations loop.
> >
> > The thread doing the system call loop takes 2.0% of user time, 98% of
> > system time. All other cpus are nearly 100.0% idle. Just to give a bit
> > more info about my test setup, I also have a thread sitting on a CPU
> > busy-waiting for the loop to complete. This thread takes 97.7% user
> > time (but it really is just there to make sure we are indeed doing the
> > IPIs, not skipping it through the thread_group_empty(current) test). If
> > I remove this thread, the execution time of the test program shrinks
> > from 32 seconds down to 1.9 seconds. So yes, the IPI is actually
> > executed in the first place, because removing the extra thread
> > accelerates the loop tremendously. I used a 8-core Xeon to test.
>
> Do you know if the kernel properly measures the overhead of IPIs? The
> CPUs might have only looked idle. What about running some kind of
> CPU-bound benchmark on the other CPUs and testing the completion time
> with and without the process running the membarrier loop?
Good point. Just tried with a cache-hot kernel compilation using 6/8 CPUs.
Normally: real 2m41.852s
With the sys_membarrier+1 busy-looping thread running: real 5m41.830s
So... the unrelated processes become 2x slower. That hurts.
So let's try allocating a cpu mask for PeterZ scheme. I prefer to have a
small allocation overhead and benefit from cpumask broadcast if
possible so we scale better. But that all depends on how big the
allocation overhead is.
Impact of allocating a cpumask (time for 10,000,000 sys_membarrier
calls, one thread is doing the sys_membarrier, the others are busy
looping)):
IPI to all: real 0m44.708s
alloc cpumask+local mb()+IPI-many to 1 thread: real 1m2.034s
So, roughly, the cpumask allocation overhead is 17s here, not exactly
cheap. So let's see when it becomes better than single IPIs:
local mb()+single IPI to 1 thread: real 0m29.502s
local mb()+single IPI to 7 threads: real 2m30.971s
So, roughly, the single IPI overhead is 120s here for 6 more threads,
for 20s per thread.
Here is what we can do: Given that it costs almost half as much to
perform the cpumask allocation than to send a single IPI, as we iterate
on the CPUs, for the, say, first N CPUs (ourself and 1 cpu that needs to
have an IPI sent), we send a "single IPI". This will be N-1 IPI and a
local function call. If we need more than that, then we switch to the
cpumask allocation and send a broadcast IPI to the cpumask we construct
for the rest of the CPUs. Let's call it the "adaptative IPI scheme".
For my Intel Xeon E5405:
Just doing local mb()+single IPI to T other threads:
T=1: 0m29.219s
T=2: 0m46.310s
T=3: 1m10.172s
T=4: 1m24.822s
T=5: 1m43.205s
T=6: 2m15.405s
T=7: 2m31.207s
Just doing cpumask alloc+IPI-many to T other threads:
T=1: 0m39.605s
T=2: 0m48.566s
T=3: 0m50.167s
T=4: 0m57.896s
T=5: 0m56.411s
T=6: 1m0.536s
T=7: 1m12.532s
So I think the right threshold should be around 2 threads (assuming
other architecture will behave like mine). So starting 3 threads, we
allocate the cpumask and send IPIs.
How does that sound ?
[...]
>
> > > - Part of me thinks this ought to become slightly more general, and just
> > > deliver a signal that the receiving thread could handle as it likes.
> > > However, that would certainly prove more expensive than this, and I
> > > don't know that the generality would buy anything.
> >
> > A general scheme would have to call every threads, even those which are
> > not running. In the case of this system call, this is a particular case
> > where we can forget about non-running threads, because the memory
> > barrier is implied by the scheduler activity that brought them offline.
> > So I really don't see how we can use this IPI scheme for other things
> > that this kind of synchronization.
>
> No, I don't mean non-running threads. If you wanted that, you could do
> what urcu currently does, and send a signal to all threads. I meant
> something like "signal all *running* threads from my process".
Well, if you find me a real-life use-case, then we can surely look into
that ;)
>
> > > - Could you somehow register reader threads with the kernel, in a way
> > > that makes them easy to detect remotely?
> >
> > There are two ways I figure out we could do this. One would imply adding
> > extra shared data between kernel and userspace (which I'd like to avoid,
> > to keep coupling low). The other alternative would be to add per
> > task_struct information about this, and new system calls. The added per
> > task_struct information would use up cache lines (which are very
> > important, especially in the task_struct) and the added system call at
> > rcu_read_lock/unlock() would simply kill performance.
>
> No, I didn't mean that you would do a syscall in rcu_read_{lock,unlock}.
> I meant that you would do a system call when the reader threads start,
> saying "hey, reader thread here".
Hrm, we need to inform the userspace RCU library that this thread is
present too. So I don't see how going through the kernel helps us there.
Thanks,
Mathieu
>
> - Josh Triplett
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
next prev parent reply other threads:[~2010-01-07 17:45 UTC|newest]
Thread overview: 107+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-01-07 4:40 [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier Mathieu Desnoyers
2010-01-07 5:02 ` Paul E. McKenney
2010-01-07 5:39 ` Mathieu Desnoyers
2010-01-07 8:32 ` Peter Zijlstra
2010-01-07 16:39 ` Paul E. McKenney
2010-01-07 5:28 ` Josh Triplett
2010-01-07 6:04 ` Mathieu Desnoyers
2010-01-07 6:32 ` Josh Triplett
2010-01-07 17:45 ` Mathieu Desnoyers [this message]
2010-01-07 16:46 ` Paul E. McKenney
2010-01-07 5:40 ` Steven Rostedt
2010-01-07 6:19 ` Mathieu Desnoyers
2010-01-07 6:35 ` Josh Triplett
2010-01-07 8:44 ` Peter Zijlstra
2010-01-07 13:15 ` Steven Rostedt
2010-01-07 15:07 ` Mathieu Desnoyers
2010-01-07 16:52 ` Paul E. McKenney
2010-01-07 17:18 ` Peter Zijlstra
2010-01-07 17:31 ` Paul E. McKenney
2010-01-07 17:44 ` Mathieu Desnoyers
2010-01-07 17:55 ` Paul E. McKenney
2010-01-07 17:44 ` Steven Rostedt
2010-01-07 17:56 ` Paul E. McKenney
2010-01-07 18:04 ` Steven Rostedt
2010-01-07 18:40 ` Paul E. McKenney
2010-01-07 17:36 ` Mathieu Desnoyers
2010-01-07 14:27 ` Steven Rostedt
2010-01-07 15:10 ` Mathieu Desnoyers
2010-01-07 16:49 ` Paul E. McKenney
2010-01-07 17:00 ` Steven Rostedt
2010-01-07 8:27 ` Peter Zijlstra
2010-01-07 18:30 ` Oleg Nesterov
2010-01-07 18:39 ` Paul E. McKenney
2010-01-07 18:59 ` Steven Rostedt
2010-01-07 19:16 ` Paul E. McKenney
2010-01-07 19:40 ` Steven Rostedt
2010-01-07 20:58 ` Paul E. McKenney
2010-01-07 21:35 ` Steven Rostedt
2010-01-07 22:34 ` Paul E. McKenney
2010-01-08 22:28 ` Mathieu Desnoyers
2010-01-08 23:53 ` Mathieu Desnoyers
2010-01-09 0:20 ` Paul E. McKenney
2010-01-09 1:02 ` Mathieu Desnoyers
2010-01-09 1:21 ` Paul E. McKenney
2010-01-09 1:22 ` Paul E. McKenney
2010-01-09 2:38 ` Mathieu Desnoyers
2010-01-09 5:42 ` Paul E. McKenney
2010-01-09 19:20 ` Mathieu Desnoyers
2010-01-09 23:05 ` Steven Rostedt
2010-01-09 23:16 ` Steven Rostedt
2010-01-10 0:03 ` Paul E. McKenney
2010-01-10 0:41 ` Steven Rostedt
2010-01-10 1:14 ` Mathieu Desnoyers
2010-01-10 1:44 ` Mathieu Desnoyers
2010-01-10 2:12 ` Steven Rostedt
2010-01-10 5:25 ` Paul E. McKenney
2010-01-10 11:50 ` Steven Rostedt
2010-01-10 16:03 ` Mathieu Desnoyers
2010-01-10 16:21 ` Steven Rostedt
2010-01-10 17:10 ` Mathieu Desnoyers
2010-01-10 21:02 ` Steven Rostedt
2010-01-10 21:41 ` Mathieu Desnoyers
2010-01-11 1:21 ` Paul E. McKenney
2010-01-10 17:45 ` Paul E. McKenney
2010-01-10 18:24 ` Mathieu Desnoyers
2010-01-11 1:17 ` Paul E. McKenney
2010-01-11 4:25 ` Mathieu Desnoyers
2010-01-11 4:29 ` [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3a) Mathieu Desnoyers
2010-01-11 17:27 ` Paul E. McKenney
2010-01-11 17:35 ` Mathieu Desnoyers
2010-01-11 17:50 ` Peter Zijlstra
2010-01-11 20:52 ` Mathieu Desnoyers
2010-01-11 21:19 ` Peter Zijlstra
2010-01-11 22:04 ` Mathieu Desnoyers
2010-01-11 22:20 ` Peter Zijlstra
2010-01-11 22:48 ` Paul E. McKenney
2010-01-11 22:48 ` Mathieu Desnoyers
2010-01-11 21:19 ` Peter Zijlstra
2010-01-11 21:31 ` Peter Zijlstra
2010-01-11 4:30 ` [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3b) Mathieu Desnoyers
2010-01-11 22:43 ` Paul E. McKenney
2010-01-12 15:38 ` Mathieu Desnoyers
2010-01-12 16:27 ` Steven Rostedt
2010-01-12 16:38 ` Mathieu Desnoyers
2010-01-12 16:54 ` Paul E. McKenney
2010-01-12 18:12 ` Paul E. McKenney
2010-01-12 18:56 ` Mathieu Desnoyers
2010-01-13 0:23 ` Paul E. McKenney
2010-01-11 16:25 ` [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier Paul E. McKenney
2010-01-11 20:21 ` Mathieu Desnoyers
2010-01-11 21:48 ` Paul E. McKenney
2010-01-14 2:56 ` Lai Jiangshan
2010-01-14 5:13 ` Paul E. McKenney
2010-01-14 5:39 ` Mathieu Desnoyers
2010-01-10 5:18 ` Paul E. McKenney
2010-01-10 1:12 ` Mathieu Desnoyers
2010-01-10 5:19 ` Paul E. McKenney
2010-01-10 1:04 ` Mathieu Desnoyers
2010-01-10 1:01 ` Mathieu Desnoyers
2010-01-09 23:59 ` Paul E. McKenney
2010-01-10 1:11 ` Mathieu Desnoyers
2010-01-07 9:50 ` Andi Kleen
2010-01-07 15:12 ` Mathieu Desnoyers
2010-01-07 16:56 ` Paul E. McKenney
2010-01-07 11:04 ` David Howells
2010-01-07 15:15 ` Mathieu Desnoyers
2010-01-07 15:47 ` David Howells
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20100107174505.GA9612@Krystal \
--to=mathieu.desnoyers@polymtl.ca \
--cc=Valdis.Kletnieks@vt.edu \
--cc=akpm@linux-foundation.org \
--cc=dhowells@redhat.com \
--cc=dipankar@in.ibm.com \
--cc=josh@joshtriplett.org \
--cc=laijs@cn.fujitsu.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@elte.hu \
--cc=paulmck@linux.vnet.ibm.com \
--cc=peterz@infradead.org \
--cc=rostedt@goodmis.org \
--cc=tglx@linutronix.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.