Message-ID: <51C1CAE3.6050908@redhat.com>
Date: Wed, 19 Jun 2013 17:14:43 +0200
From: Paolo Bonzini
Subject: Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)
To: Torvald Riegel
Cc: Andrew Haley, qemu-devel@nongnu.org, Liu Ping Fan, Anthony Liguori, paulmck@linux.vnet.ibm.com, Richard Henderson

On 19/06/2013 15:15, Torvald Riegel wrote:
>> One reason is that implementing SC for POWER is quite expensive,
>
> Sure, but you don't have to use SC fences or atomics if you don't want
> them.  Note that C11/C++11 as well as the __atomic* builtins allow you
> to specify a memory order.  It's perfectly fine to use acquire fences or
> release fences.  There shouldn't be just one kind of barrier/fence.

Agreed.
For example Linux uses four: consume (read_barrier_depends), acquire
(rmb), release (wmb), SC (mb).  In addition, in Linux loads and stores
are always relaxed; some RMW ops are SC but others are relaxed.

I want to do something similar in QEMU, with as few changes as possible.
In the end I settled on the following:

(1) I don't care about relaxed RMW ops (loads/stores occur in hot paths,
but RMW shouldn't be that bad.  I don't care if reference counting is a
little slower than it could be, for example);

(2) I'd like to have some kind of non-reordering load/store too, either
SC (which I've improperly referred to as C11/C++11 in my previous email)
or Java volatile.

   [An aside: Java guarantees that volatile stores are not reordered
   with volatile loads.  This is not guaranteed by just using release
   stores and acquire loads, and is why IIUC acq_rel < Java < seq_cst].

As long as you only have a producer and a consumer, C11 is fine, because
all you need is load-acquire/store-release.  In fact, if it weren't for
the experience factor, C11 would be easier than manually placing acquire
and release barriers.  But as soon as two or more threads are reading
_and_ writing the shared memory, it gets complicated and I want to
provide something simple that people can use.  This is the reason for
(2) above.

There will still be a few cases that need to be optimized, and this is
where the difficult requirements come in:

(R1) the primitives *should* not be alien to people who know Linux.

(R2) those optimizations *must* be easy to do and review; at least as
easy as these things go.

The two are obviously related.  Ease of review is why it is important to
make things familiar to people who know Linux.

In C11, relaxing SC loads and stores is complicated, and more
specifically hard to explain!  I cannot do that myself, much less
explain that to the community.  I cannot make them do that.
Unfortunately, relaxing SC loads and stores is important on POWER, which
has efficient acq/rel but inefficient SC (hwsync in the loads).  So, C11
fails both requirements. :(

By contrast, Java volatile semantics are easily converted to a sequence
of relaxed loads, relaxed stores, and acq/rel/sc fences.  It's almost an
algorithm; I tried to do that myself and succeeded, and I could document
it nicely.  Even better, there are authoritative sources that confirm my
writing and should be accessible to people who did synchronization
"stuff" in Linux (no formal models :)).  In this respect, Java satisfies
both requirements.

And the loss is limited, since things such as Dekker's algorithm are
rare in practice.  (In particular, RCU can be implemented just fine with
Java volatile semantics, but load-acquire/store-release is not enough).

[Nothing really important after this point, I think].

> Note that there is a reason why C11/C++11 don't just have barriers
> combined with ordinary memory accesses: The compiler needs to be aware
> which accesses are sequential code (so it can assume that they are
> data-race-free) and which are potentially concurrent with other accesses
> to the same data. [...]
> you can try to make this very likely be correct by careful
> placement of asm compiler barriers, but this is likely to be more
> difficult than just using atomics, which will do the right thing.

Note that asm is just for older compilers (and even then I try to use
GCC intrinsics as much as possible).

On newer compilers I do use atomics (SC RMW ops, acq/rel/SC/consume
thread fences) to properly annotate references.  rth also suggested that
I use load/store(relaxed) instead of C volatile.

> Maybe the issue that you see with C11/C++11 is that it offers more than
> you actually need.  If you can summarize what kind of synchronization /
> concurrent code you are primarily looking at, I can try to help outline
> a subset of it (i.e., something like code style but just for
> synchronization).

Is the above explanation clearer?

>> I obviously trust Cambridge for
>> C11/C++11, but their material is very concise or just refers to the
>> formal model.
>
> Yes, their publications are really about the model.  It's not a
> tutorial, but useful for reference.  BTW, have you read their C++ paper
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3132.pdf
> or the POPL paper?  The former has more detail (no page limit).

I know it, but I cannot say I tried hard to understand it.

> If you haven't yet, I suggest giving their cppmem tool a try too:
> http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/

I saw and tried the similar tool for POWER.  The problem with these
tools is that they require you to abstract your program into an input
to the tool.  It works when _writing_ the code, but not when reviewing
it.

>> The formal model is not what I want when my question is
>> simply "why is lwsync good for acquire and release, but not for
>> seqcst?", for example.
>
> But that's a question that is not about the programming-language-level
> memory model, but about POWER's memory model.  I suppose that's not
> something you'd have to deal with frequently, or are you indeed
> primarily interested in emulating a particular architecture's model?
> The Cambridge group also has a formal model for POWER's memory model,
> perhaps this can help answer this question.

I'm more familiar with Cambridge's POWER work than the C++ work, and it
didn't. :)

I care about the POWER memory model because it's roughly the only one
where

    load(seqcst) => load(relaxed) + fence(acquire)
    store(seqcst) => fence(release) + store(relaxed) + fence(seqcst)

is not a valid (though perhaps suboptimal) mapping.  Note that the
primitives on the RHS are basically what they use in the Linux kernel.

>> In short, the C11/C++11 model is not what most developers are used to
>> here
>
> I guess so.  But you also have to consider the legacy that you create.
> I do think the C11/C++11 model will be used widely, and more and more
> people will get used to it.

I don't think many people will learn how to use the various non-seqcst
modes...  At least so far I punted. :)

Many idioms will certainly be wrapped within the C++ standard library
(smart pointers and all that), but C starts at a disadvantage.  You
have a point, though.

>> In case you really need SC _now_, it is easy to do it using
>> fetch-and-add (for loads) or xchg (for stores).
>
> Not on POWER and other architectures with similarly weak memory models :)
> If the fetch-and-add abstractions in your model always give you SC, then
> that's more an indication that they don't allow you the fine-grained
> control that you want, especially on POWER :)

Indeed, they don't allow fine-grained control.  But I need to strike a
balance between control and ease of use, otherwise C11 would have been
an obvious choice (except for the detail of older compilers, but let's
ignore it).

Paolo
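
For reference, here is a minimal sketch in C11 of the two ways of getting
SC accesses that come up in this message: the fence-based mapping (a
relaxed access surrounded by fences) and the RMW-based fallback
(fetch-and-add for loads, xchg for stores).  The helper names are
hypothetical and are not QEMU's actual API.  As said above, the
fence-based mapping is not valid on POWER, so treat this as an
illustration of the idea rather than a portable implementation.

```c
#include <stdatomic.h>

/* Hypothetical helper names for illustration; not QEMU's actual API. */

/* load(seqcst) => load(relaxed) + fence(acquire) */
static inline int sc_load_fenced(atomic_int *p)
{
    int v = atomic_load_explicit(p, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    return v;
}

/* store(seqcst) => fence(release) + store(relaxed) + fence(seqcst) */
static inline void sc_store_fenced(atomic_int *p, int v)
{
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(p, v, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
}

/* SC load via an RMW op: fetch-and-add of zero returns the current value. */
static inline int sc_load_rmw(atomic_int *p)
{
    return atomic_fetch_add(p, 0);   /* seq_cst is the C11 default order */
}

/* SC store via an RMW op: xchg, discarding the old value. */
static inline void sc_store_rmw(atomic_int *p, int v)
{
    (void)atomic_exchange(p, v);     /* seq_cst is the C11 default order */
}
```

The RMW variants trade fine-grained control for simplicity, which is the
balance discussed in the last paragraph of the message.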