From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <51C1CAE3.6050908@redhat.com>
Date: Wed, 19 Jun 2013 17:14:43 +0200
From: Paolo Bonzini <pbonzini@redhat.com>
MIME-Version: 1.0
References: <1371381681-14252-1-git-send-email-pingfanl@linux.vnet.ibm.com>
	<1371381681-14252-2-git-send-email-pingfanl@linux.vnet.ibm.com>
	<51BF5C0F.6020209@twiddle.net> <51C05F88.2090308@redhat.com>
	<20130618145033.GN5146@linux.vnet.ibm.com>
	<51C085EF.1040303@redhat.com>
	<1371573518.16968.23603.camel@triegel.csb>
	<51C17A5D.909@redhat.com>
	<1371647713.16968.25060.camel@triegel.csb>
In-Reply-To: <1371647713.16968.25060.camel@triegel.csb>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2
 1/2] add a header file for atomic operations)
List-Id: <qemu-devel.nongnu.org>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
To: Torvald Riegel <triegel@redhat.com>
Cc: Andrew Haley <aph@redhat.com>, qemu-devel@nongnu.org, Liu Ping Fan <qemulist@gmail.com>, Anthony Liguori <anthony@codemonkey.ws>, paulmck@linux.vnet.ibm.com, Richard Henderson <rth@twiddle.net>

On 19/06/2013 15:15, Torvald Riegel wrote:
>> One reason is that implementing SC for POWER is quite expensive,
> 
> Sure, but you don't have to use SC fences or atomics if you don't want
> them.  Note that C11/C++11 as well as the __atomic* builtins allow you
> to specify a memory order.  It's perfectly fine to use acquire fences or
> release fences.  There shouldn't be just one kind of barrier/fence.

Agreed.  For example, Linux uses four: consume (read_barrier_depends),
acquire (rmb), release (wmb), SC (mb).  In addition, loads and stores in
Linux are always relaxed; some RMW ops are SC but others are relaxed.

I want to do something similar in QEMU, with as few changes as possible.
In the end I settled on the following:

(1) I don't care about relaxed RMW ops (loads/stores occur in hot paths,
but RMW shouldn't be that bad.  I don't care if reference counting is a
little slower than it could be, for example);

(2) I'd like to have some kind of non-reordering load/store too, either
SC (which I've improperly referred to as C11/C++11 in my previous email)
or Java volatile.

   [An aside: Java guarantees that volatile stores are not reordered
   with volatile loads.  This is not guaranteed by just using release
   stores and acquire loads, and is why IIUC acq_rel < Java < seq_cst.]
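
To make the aside concrete, here is a minimal sketch of the
store-buffering pattern that sits at the heart of Dekker's algorithm,
written with C11 atomics.  The variable and function names are made up
for this illustration; nothing here is taken from the QEMU patch.

```c
#include <stdatomic.h>

/* Hypothetical shared flags; static storage, so they start at zero. */
static atomic_int x, y;

/* Reset both flags (illustration helper). */
void sb_reset(void)
{
    atomic_store_explicit(&x, 0, memory_order_relaxed);
    atomic_store_explicit(&y, 0, memory_order_relaxed);
}

/* Thread 0: raise its own flag, then look at the other thread's flag. */
int sb_thread0(void)
{
    atomic_store_explicit(&x, 1, memory_order_release);
    /* Release/acquire does not order this load after the store above. */
    return atomic_load_explicit(&y, memory_order_acquire);
}

/* Thread 1: the mirror image of thread 0. */
int sb_thread1(void)
{
    atomic_store_explicit(&y, 1, memory_order_release);
    return atomic_load_explicit(&x, memory_order_acquire);
}
```

If the two functions run concurrently on different threads,
release/acquire permits both of them to return 0.  With Java volatile
or seq_cst accesses, at least one of them must return 1.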

As long as you only have a producer and a consumer, C11 is fine, because
all you need is load-acquire/store-release.  In fact, if it weren't for
the experience factor, C11 would be easier than manually placing acquire
and release barriers.  But as soon as two or more threads are reading
_and_ writing the shared memory, it gets complicated and I want to
provide something simple that people can use.  This is the reason for
(2) above.
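
For the simple producer/consumer case, a minimal sketch with C11
load-acquire/store-release could look like the following.  The names
(payload, ready, producer_publish, consumer_try_read) are hypothetical
and exist only for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared state: plain data plus an atomic "ready" flag. */
static int payload;
static atomic_bool ready;

/* Producer: write the data, then publish it with a store-release. */
void producer_publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer: a load-acquire that sees the flag also sees the data. */
bool consumer_try_read(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire)) {
        return false;           /* nothing published yet */
    }
    *out = payload;
    return true;
}
```

Only one side writes and the other side reads, so no store ever has to
be ordered against a later load and acquire/release is sufficient.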

There will still be a few cases that need to be optimized, and this is
where the difficult requirements come in:

(R1) the primitives *should* not be alien to people who know Linux.

(R2) those optimizations *must* be easy to do and review; at least as
easy as these things go.

The two are obviously related.  Ease of review is why it is important to
make things familiar to people who know Linux.

In C11, relaxing SC loads and stores is complicated, and more
specifically hard to explain!  I cannot do that myself, much less
explain it to the community, and I cannot make them do it.
Unfortunately, relaxing SC loads and stores is important on POWER, which
has efficient acq/rel but inefficient SC (hwsync in the loads).  So, C11
fails both requirements. :(

By contrast, Java volatile semantics are easily converted to a sequence
of relaxed loads, relaxed stores, and acq/rel/sc fences.  It's almost an
algorithm; I tried to do that myself and succeeded, and I could document
it nicely.  Even better, there are authoritative sources that confirm my
writing and should be accessible to people who did synchronization
"stuff" in Linux (no formal models :)).  In this respect, Java satisfies
both requirements.

And the loss is limited, since things such as Dekker's algorithm are
rare in practice.  (In particular, RCU can be implemented just fine with
Java volatile semantics, whereas load-acquire/store-release alone is not
enough).

[Nothing really important after this point, I think].

> Note that there is a reason why C11/C++11 don't just have barriers
> combined with ordinary memory accesses: The compiler needs to be aware
> which accesses are sequential code (so it can assume that they are
> data-race-free) and which are potentially concurrent with other accesses
> to the same data.  [...]
> you can try to make this very likely be correct by careful
> placement of asm compiler barriers, but this is likely to be more
> difficult than just using atomics, which will do the right thing.

Note that asm is just for older compilers (and even then I try to use
GCC intrinsics as much as possible).

On newer compilers I do use atomics (SC RMW ops, acq/rel/SC/consume
thread fences) to properly annotate references.  rth also suggested that
I use load/store(relaxed) instead of C volatile.
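
As a rough illustration of that suggestion, the two styles of access
could be spelled as below.  The macro names are hypothetical; the
__atomic builtins are the GCC ones mentioned earlier in the thread, and
__typeof__ is a GCC extension as well.

```c
/* Older style: force a real memory access through a volatile cast. */
#define volatile_read(ptr) \
    (*(volatile __typeof__(*(ptr)) *)(ptr))

/* Newer style: relaxed atomic accesses.  The compiler knows these may
 * race with other threads, but no ordering is implied. */
#define relaxed_read(ptr) \
    __atomic_load_n((ptr), __ATOMIC_RELAXED)
#define relaxed_set(ptr, val) \
    __atomic_store_n((ptr), (val), __ATOMIC_RELAXED)
```

Neither form orders the access against other memory operations; any
ordering still has to come from explicit fences.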

> Maybe the issue that you see with C11/C++11 is that it offers more than
> you actually need.  If you can summarize what kind of synchronization /
> concurrent code you are primarily looking at, I can try to help outline
> a subset of it (i.e., something like code style but just for
> synchronization).

Is the above explanation clearer?

>> I obviously trust Cambridge for
>> C11/C++11, but their material is very concise or just refers to the
>> formal model.
> 
> Yes, their publications are really about the model.  It's not a
> tutorial, but useful for reference.  BTW, have you read their C++ paper
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3132.pdf
> or the POPL paper?  The former has more detail (no page limit).

I know it, but I cannot say I tried hard to understand it.

> If you haven't yet, I suggest giving their cppmem tool a try too:
> http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/

I saw and tried the similar tool for POWER.  The problem with these
tools is that they require you to abstract your program into an input
to the tool.  That works when _writing_ the code, but not when reviewing
it.

>> The formal model is not what I want when my question is
>> simply "why is lwsync good for acquire and release, but not for
>> seqcst?", for example.
> 
> But that's a question that is not about the programming-language-level
> memory model, but about POWER's memory model.  I suppose that's not
> something you'd have to deal with frequently, or are you indeed
> primarily interested in emulating a particular architecture's model?
> The Cambridge group also has a formal model for POWER's memory model,
> perhaps this can help answer this question.

I'm more familiar with Cambridge's POWER work than the C++ work, and it
didn't help. :)  I care about the POWER memory model because it's roughly
the only one where

    load(seqcst) => load(relaxed) + fence(acquire)
    store(seqcst) => fence(release) + store(relaxed) + fence(seqcst)

is not a valid (though perhaps suboptimal) mapping.  Note that the
primitives on the RHS are basically what they use in the Linux kernel.
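
Spelled out with C11 primitives, the mapping above could be sketched as
follows.  The function names mb_read and mb_set are hypothetical, and
this is only an illustration of the two right-hand sides, not a claim
about what the final header will contain.

```c
#include <stdatomic.h>

/* load(relaxed) + fence(acquire) */
int mb_read(atomic_int *p)
{
    int val = atomic_load_explicit(p, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    return val;
}

/* fence(release) + store(relaxed) + fence(seqcst) */
void mb_set(atomic_int *p, int val)
{
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(p, val, memory_order_relaxed);
    /* The trailing full fence keeps this store from being reordered
     * with a later mb_read(). */
    atomic_thread_fence(memory_order_seq_cst);
}
```

As said above, on POWER this sequence should not be assumed to be a
substitute for real seq_cst loads and stores.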

>> In short, the C11/C++11 model is not what most developers are used to
>> here
> 
> I guess so.  But you also have to consider the legacy that you create.
> I do think the C11/C++11 model will be used widely, and more and more
> people will get used to it.

I don't think many people will learn how to use the various non-seqcst
modes...  At least so far I punted. :)

Many idioms will certainly be wrapped within the C++ standard library
(smart pointers and all that), but C starts at a disadvantage.  You have
a point, though.

>> In case you really need SC _now_, it is easy to do it using
>> fetch-and-add (for loads) or xchg (for stores).
> 
> Not on POWER and other architectures with similarly weak memory models :)  
> If the fetch-and-add abstractions in your model always give you SC, then
> that's more an indication that they don't allow you the fine-grained
> control that you want, especially on POWER :)

Indeed, they don't allow fine-grained control.  But I need to strike a
balance between control and ease of use, otherwise C11 would have been
an obvious choice (except for the detail of older compilers, but let's
ignore it).
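
For reference, the fetch-and-add and xchg trick quoted above can be
sketched in C11 as below.  The function names are hypothetical.  The
sketch relies on the C11 rule that atomic_fetch_add and atomic_exchange
without an explicit memory order are seq_cst; how expensive that is
depends on the architecture.

```c
#include <stdatomic.h>

/* SC load: add zero and return the value that was there. */
int sc_load_via_rmw(atomic_int *p)
{
    return atomic_fetch_add(p, 0);
}

/* SC store: exchange and discard the old value. */
void sc_store_via_rmw(atomic_int *p, int val)
{
    (void)atomic_exchange(p, val);
}
```

Both are full read-modify-write operations, so they trade the
fine-grained control discussed above for simplicity.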

Paolo