From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759942AbZBMReK (ORCPT ); Fri, 13 Feb 2009 12:34:10 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1751410AbZBMRd4 (ORCPT ); Fri, 13 Feb 2009 12:33:56 -0500 Received: from tomts43-srv.bellnexxia.net ([209.226.175.110]:56579 "EHLO tomts43-srv.bellnexxia.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751298AbZBMRdz (ORCPT ); Fri, 13 Feb 2009 12:33:55 -0500 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Ah0FABQ+lUlMQWt2/2dsb2JhbACBbtBuhBgG Date: Fri, 13 Feb 2009 12:33:52 -0500 From: Mathieu Desnoyers To: Linus Torvalds Cc: Nick Piggin , Bryan Wu , linux-kernel@vger.kernel.org, ltt-dev@lists.casi.polymtl.ca, uclinux-dist-devel@blackfin.uclinux.org, "Paul E. McKenney" Subject: Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost) Message-ID: <20090213173352.GB4684@Krystal> References: <20090212023308.GA21157@linux.vnet.ibm.com> <20090212215959.GN6759@linux.vnet.ibm.com> <200902140050.44550.nickpiggin@yahoo.com.au> <20090213145653.GA6854@linux.vnet.ibm.com> <20090213151045.GA1574@Krystal> <20090213155507.GA2838@Krystal> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Content-Disposition: inline In-Reply-To: X-Editor: vi X-Info: http://krystal.dyndns.org:8080 X-Operating-System: Linux/2.6.21.3-grsec (i686) X-Uptime: 11:57:05 up 43 days, 16:55, 4 users, load average: 0.20, 0.21, 0.32 User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Linus Torvalds (torvalds@linux-foundation.org) wrote: > > > On Fri, 13 Feb 2009, Mathieu Desnoyers wrote: > > > > I created also > > > > _STORE_SHARED() > > _LOAD_SHARED() > > > > which identify the variables which need to have cache flush done before > > (load) or after (store). So we get both speed and identification when > > needed (if we need to do batch updates linked with a single cache flush). > > e.g. > > The thing is, THAT JUST ABSOLUTELY SUCKS. > > Lookie here - we don't want to flush the cache at every load of a shared > variable. There's no reason to. If you don't care about the orderign, you > might as well get the old values. That's what memory ordering _means_, for > chissake! In the absense of locks, loads may get stale values. It's that > easy. > > A lot of code wants to access multiple variables, and they are potentially > nearby, and in the same cacheline. Making them all use _LOAD_SHARED() adds > absolutely no value - and makes it MUCH MUCH SLOWER. > Hrm, I think there is a misunderstanding here, because _LOAD_SHARED() is not much more than a simple comment. The whole idea behind _LOAD_SHARED() is that it does not translate in any different assembly output than a standard load. So no, it cannot be possibly slower. It has no more side-effect than a simple comment in the code, and that's its purpose : to identify those variables. So if we find a code path doing _STORE_SHARED(x, v); smp_mc(); while (_LOAD_SHARED(z) != val) cpu_relax(); We can verify very easily the code correctness : A write cache flush is required after _STORE_SHARED A read cache flush is required before _LOAD_SHARED Read cache flushes are required to happen eventually between _LOAD_SHARED in the loop. It's basically the same as having something like an eventual : _STORE_ORDERED(x, v); smp_mb(); _LOAD_ORDERED(z); Instead of relying on a comment around smp_mb(); stating which variables it orders. I would understand if you dislike it, but I find it rather useful to have this information in the source code around the variable access rather than formulated as a comment around the barrier. Actually having both the barrier comment *and* this identification seems rather good for code review. > So what's the answer? > > I already outlined it: either you use locks (which will do the magic for > you), or you use memory barriers. In no case do you make the access magic, > unless you have a compiler issue where you are afraid that the compiler > would turn it into _multiple_ accesses and potentially get inconsistent > results. > > So the point about ACCESS_ONCE() is not, and never has been, about > re-ordering. We know that the CPU may re-order the accesses and give us > stale values (or values from the "future" wrt the other accesses around > it). That's not the point. The point of ACCESS_ONCE() is that we get > exactly _one_ value, and not two different ones (or none at all) because > of the compiler either re-loading it several times or not re-loading it at > all. > > Anybody who confuses ACCESS_ONCE() with ordering is simply confused. > > And we don't want to make any "load with cache flush" either. Which side > should the cache flush be on? Before? After? Both? Atomically? There is no > sane semantics for that. > We might want to simply scrap the "safe and slow" version without underscores (LOAD_SHARED, STORE_SHARED) which contain smp_rmc and smp_wmc statements within the macro. But Paul insisted that he likes having the proper memory ordering/cache coherency enforced within the accessor macros. Personnally, I see much more value in the simple "comment-only" versions _LOAD_SHARED/_STORE_SHARED matched with an explicit cache flush statement because in a lot of cases, we will want to do a batch of read/writes between cache flushes. Note that memory barriers are already implicit in a lot of kernel primitives, namely rcu_dereference, cmpxchg, spinlock, ... so this is debatable I guess. > The only remaining sane semantics is to depend on memory barriers, and > then make a magic memory barrier that is extra weak and doesn't order > anythign at all, but just says "syncronize very weakly". I agree completely. What I am proposing here is just to add syntaxic sugar to better identify the variables related to those extra weak barriers. > > And I think we have that in "cpu_relax()". Because if you have somebody > doing shared memory accesses in a loop without any memory barriers or > locks or anything (ie the _ordering_ doesn't matter, only that some value > has been seen), then dang it, I can't see how you can _possibly_ use > anything else than that "cpu_relax()" somewhere in that loop. > It must also be matched with the equivalent write flush barrier at the write side, so hiding this deep within cpu_relax() only at the read-side seems to hide a lot of what must be performed by the cores to exchange the data properly. (ok we don't care about write cache flush for Blackfin particularly, but I don't see why we should not start thinking about what non-coherent caches small embedded devices can bring) It's also worth noting that Paul and I have no agenda to push anything into the mainline kernel to enforce anything like "wmc"-type cache flush barriers. We are barely trying to find the best semantic to express our userspace RCU algorithm, and I happen to have noticed this loophole about non-coherent cache architectures. But ideally it would be good to stay in sync with the Linux kernel's primitives, so this is why your criticism is much appreciated. Thanks, Mathieu > Linus > > _______________________________________________ > ltt-dev mailing list > ltt-dev@lists.casi.polymtl.ca > http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68