From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 7 Jan 2010 12:45:05 -0500
From: Mathieu Desnoyers
To: Josh Triplett
Cc: linux-kernel@vger.kernel.org, "Paul E. McKenney", Ingo Molnar,
	akpm@linux-foundation.org, tglx@linutronix.de, peterz@infradead.org,
	rostedt@goodmis.org, Valdis.Kletnieks@vt.edu, dhowells@redhat.com,
	laijs@cn.fujitsu.com, dipankar@in.ibm.com
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier
Message-ID: <20100107174505.GA9612@Krystal>
References: <20100107044007.GA22863@Krystal> <20100107052851.GA12419@feather>
	<20100107060439.GB25786@Krystal> <20100107063207.GA12939@feather>
In-Reply-To: <20100107063207.GA12939@feather>
X-Mailing-List: linux-kernel@vger.kernel.org

* Josh Triplett (josh@joshtriplett.org) wrote:
> On Thu, Jan 07, 2010 at 01:04:39AM -0500, Mathieu Desnoyers wrote:
[...]
> > Just tried it with a 10,000,000 iterations loop.
> >
> > The thread doing the system call loop takes 2.0% of user time, 98% of
> > system time. All other cpus are nearly 100.0% idle.
> > Just to give a bit more info about my test setup, I also have a thread
> > sitting on a CPU busy-waiting for the loop to complete. This thread
> > takes 97.7% user time (but it really is just there to make sure we are
> > indeed doing the IPIs, not skipping it through the
> > thread_group_empty(current) test). If I remove this thread, the
> > execution time of the test program shrinks from 32 seconds down to 1.9
> > seconds. So yes, the IPI is actually executed in the first place,
> > because removing the extra thread accelerates the loop tremendously. I
> > used a 8-core Xeon to test.
>
> Do you know if the kernel properly measures the overhead of IPIs? The
> CPUs might have only looked idle. What about running some kind of
> CPU-bound benchmark on the other CPUs and testing the completion time
> with and without the process running the membarrier loop?

Good point. Just tried with a cache-hot kernel compilation using 6/8 CPUs.

Normally:                                                     real 2m41.852s
With the sys_membarrier loop + 1 busy-looping thread running: real 5m41.830s

So... the unrelated processes become about 2x slower. That hurts.

So let's try allocating a cpumask for PeterZ's scheme. I prefer to have a
small allocation overhead and benefit from cpumask broadcast if possible,
so we scale better. But that all depends on how big the allocation
overhead is.

Impact of allocating a cpumask (time for 10,000,000 sys_membarrier calls;
one thread is doing the sys_membarrier, the others are busy-looping):

IPI to all:                                    real 0m44.708s
alloc cpumask+local mb()+IPI-many to 1 thread: real 1m2.034s

So, roughly, the cpumask allocation overhead is 17s here, which is not
exactly cheap. So let's see when it becomes better than single IPIs:

local mb()+single IPI to 1 thread:  real 0m29.502s
local mb()+single IPI to 7 threads: real 2m30.971s

So, roughly, the single IPI overhead is 120s here for 6 more threads,
or about 20s per thread.
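
To make the arithmetic behind these two estimates explicit, here is a small C sketch that redoes the subtractions and converts them to a per-call cost. The totals are copied from the timings above; the macro and function names are made up for this illustration, and attributing the whole difference to the cpumask allocation (or to the extra IPIs) is the same rough approximation used in the text, not a precise measurement.

```c
#include <stdio.h>

/* Number of sys_membarrier calls per run, as stated in the message. */
#define NR_CALLS 10000000.0

/* Wall-clock totals in seconds, copied from the timings above. */
#define T_IPI_ALL       44.708  /* IPI to all */
#define T_ALLOC_MANY_1  62.034  /* alloc cpumask+local mb()+IPI-many, 1 thread */
#define T_SINGLE_1      29.502  /* local mb()+single IPI, 1 thread */
#define T_SINGLE_7     150.971  /* local mb()+single IPI, 7 threads (2m30.971s) */

/* Extra time over the whole run attributed to the cpumask allocation. */
double cpumask_alloc_overhead_s(void)
{
	return T_ALLOC_MANY_1 - T_IPI_ALL;
}

/* Extra time per additional thread when one IPI is sent per thread. */
double single_ipi_per_thread_s(void)
{
	return (T_SINGLE_7 - T_SINGLE_1) / 6.0;
}

/* Convert a whole-run total (seconds) to a per-call cost (microseconds). */
double per_call_us(double total_s)
{
	return total_s / NR_CALLS * 1e6;
}

/* Print both estimates; call this from main() to reproduce the figures. */
void print_overheads(void)
{
	double alloc = cpumask_alloc_overhead_s();
	double ipi = single_ipi_per_thread_s();

	printf("cpumask alloc: %.1f s total, %.2f us per call\n",
	       alloc, per_call_us(alloc));
	printf("single IPI:    %.1f s per extra thread, %.2f us per call\n",
	       ipi, per_call_us(ipi));
}
```

Spread over the 10,000,000 calls, this works out to a bit under 2 microseconds per call for the allocation and about 2 microseconds per call for each extra thread that gets its own IPI.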

Here is what we can do:

Given that it costs almost half as much to perform the cpumask allocation
as to send a single IPI, as we iterate on the CPUs, for the, say, first N
CPUs (ourself and 1 cpu that needs to have an IPI sent), we send a
"single IPI". This will be N-1 IPIs and a local function call. If we need
more than that, then we switch to the cpumask allocation and send a
broadcast IPI to the cpumask we construct for the rest of the CPUs. Let's
call it the "adaptive IPI scheme".

For my Intel Xeon E5405:

Just doing local mb()+single IPI to T other threads:

T=1: 0m29.219s
T=2: 0m46.310s
T=3: 1m10.172s
T=4: 1m24.822s
T=5: 1m43.205s
T=6: 2m15.405s
T=7: 2m31.207s

Just doing cpumask alloc+IPI-many to T other threads:

T=1: 0m39.605s
T=2: 0m48.566s
T=3: 0m50.167s
T=4: 0m57.896s
T=5: 0m56.411s
T=6: 1m0.536s
T=7: 1m12.532s

So I think the right threshold should be around 2 threads (assuming other
architectures will behave like mine). So starting at 3 threads, we
allocate the cpumask and send IPIs.

How does that sound?

[...]

>
> > > - Part of me thinks this ought to become slightly more general, and just
> > >   deliver a signal that the receiving thread could handle as it likes.
> > >   However, that would certainly prove more expensive than this, and I
> > >   don't know that the generality would buy anything.
> >
> > A general scheme would have to call every threads, even those which are
> > not running. In the case of this system call, this is a particular case
> > where we can forget about non-running threads, because the memory
> > barrier is implied by the scheduler activity that brought them offline.
> > So I really don't see how we can use this IPI scheme for other things
> > that this kind of synchronization.
>
> No, I don't mean non-running threads. If you wanted that, you could do
> what urcu currently does, and send a signal to all threads. I meant
> something like "signal all *running* threads from my process".

Well, if you find me a real-life use-case, then we can surely look into
that ;)

>
> > > - Could you somehow register reader threads with the kernel, in a way
> > >   that makes them easy to detect remotely?
> >
> > There are two ways I figure out we could do this. One would imply adding
> > extra shared data between kernel and userspace (which I'd like to avoid,
> > to keep coupling low). The other alternative would be to add per
> > task_struct information about this, and new system calls. The added per
> > task_struct information would use up cache lines (which are very
> > important, especially in the task_struct) and the added system call at
> > rcu_read_lock/unlock() would simply kill performance.
>
> No, I didn't mean that you would do a syscall in rcu_read_{lock,unlock}.
> I meant that you would do a system call when the reader threads start,
> saying "hey, reader thread here".

Hrm, we need to inform the userspace RCU library that this thread is
present too. So I don't see how going through the kernel helps us there.

Thanks,

Mathieu

>
> - Josh Triplett

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
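
For illustration only, the adaptive IPI scheme proposed earlier in this message could be modelled in user space roughly as follows. This is a sketch under stated assumptions, not the patch: CPUs are represented as bits in a 64-bit word, the IPIs and the cpumask allocation are only counted rather than performed, and every name here (`adaptive_membarrier`, `struct ipi_stats`, `SINGLE_IPI_THRESHOLD`) is hypothetical. It follows one reading of the proposal: the first two remote CPUs get a single IPI each, and any remaining CPUs are collected into a mask for one broadcast.

```c
#include <stdint.h>

/* Threshold of about 2 threads, as suggested in the message. */
#define SINGLE_IPI_THRESHOLD 2

/* Counters standing in for the real work (hypothetical, illustration only). */
struct ipi_stats {
	int single_ipis;     /* CPUs that would get an individual IPI */
	int mask_allocs;     /* cpumask allocations that would be needed */
	int broadcast_cpus;  /* CPUs covered by the IPI-many broadcast */
};

/* Count the set bits in v. */
static int count_bits(uint64_t v)
{
	int n = 0;

	while (v) {
		v &= v - 1;  /* clear the lowest set bit */
		n++;
	}
	return n;
}

/*
 * targets: bit i is set if CPU i currently runs a thread of the calling
 *          process and therefore needs a memory barrier.
 * self:    CPU of the caller, which would issue a local mb() instead.
 */
struct ipi_stats adaptive_membarrier(uint64_t targets, int self)
{
	struct ipi_stats st = { 0, 0, 0 };
	uint64_t rest = 0;
	int cpu;

	/* The caller handles itself with a local barrier, not an IPI. */
	if (self >= 0 && self < 64)
		targets &= ~(UINT64_C(1) << self);

	for (cpu = 0; cpu < 64; cpu++) {
		if (!(targets & (UINT64_C(1) << cpu)))
			continue;
		if (st.single_ipis < SINGLE_IPI_THRESHOLD)
			st.single_ipis++;            /* cheap path: single IPI */
		else
			rest |= UINT64_C(1) << cpu;  /* defer to the broadcast */
	}

	/* Only pay for the mask allocation when CPUs are left over. */
	if (rest) {
		st.mask_allocs = 1;
		st.broadcast_cpus = count_bits(rest);
	}
	return st;
}
```

With one or two other running threads the model never allocates a mask; with seven it counts two single IPIs, one allocation, and a broadcast to the remaining five CPUs.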