From: Paolo Bonzini
Date: Fri, 18 May 2012 08:53:49 +0200
Subject: Re: [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function
To: Benjamin Herrenschmidt
Cc: David Gibson, qemu-devel@nongnu.org, Anthony Liguori, "Michael S. Tsirkin"
Message-ID: <4FB5F1FD.9020009@redhat.com>
In-Reply-To: <1337214293.30558.25.camel@pasglop>

On 17/05/2012 02:24, Benjamin Herrenschmidt wrote:
>>> Also, should I make the barrier conditional to kvm_enabled()? I.e. it's
>>> pointless in full emulation and might actually be a performance hit on
>>> something already quite slow...
>
> Finally ... something like smp_mb() in qemu will turn into a lock op or
> an mfence on x86, i.e. not a nop.
>
> That means overhead from today's implementation, which leads to the
> question ... is today's implementation correct? I.e. is a barrier
> needed on x86 as well or not?

It depends on what semantics you attach to dma_mb. In my opinion, having
a separate barrier for DMA is wrong, because you want the same semantics
on all architectures.

The x86 requirements are roughly as follows:

1) x86 never needs an explicit rmb or wmb (as long as you don't use
   non-temporal stores etc.);

2) synchronized operations have an implicit mb before and after (unlike
   LL/SC on PowerPC);

3) everywhere else, i.e. to order a store against a later load, you
   need an mb.

So, on x86 you have more or less an implicit wmb after each write and an
implicit rmb before each read. This explains why these kinds of bugs are
very hard, and often impossible, to see on x86.

Adding these barriers in cpu_physical_memory_rw has the advantage that
x86 performance is not affected, but it would not cover the case of a
device model doing a DMA read after a DMA write. In that case the device
model would need to issue an smp_mb manually, on all architectures. I
think this is too brittle.
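To illustrate the brittleness, here is a rough sketch (purely
illustrative: RingState and its fields are invented, and it assumes the
current cpu_physical_memory_read/write helpers plus an smp_mb() along
the lines we are discussing) of what every such device model would have
to remember to write by hand:

    /* Purely illustrative; RingState and its fields are made up.
     * Assumes cpu-common.h's DMA helpers and qemu-barrier.h's
     * smp_mb() (or an equivalent full barrier). */
    #include "cpu-common.h"     /* cpu_physical_memory_read/write */
    #include "qemu-barrier.h"   /* smp_mb */

    typedef struct RingState {
        target_phys_addr_t status_addr; /* guest addr of status word */
        target_phys_addr_t prod_addr;   /* guest addr of producer index */
        uint32_t status;
    } RingState;

    static uint32_t complete_and_poll(RingState *s)
    {
        uint32_t next;

        /* DMA write: publish the completion status to the guest. */
        cpu_physical_memory_write(s->status_addr, &s->status,
                                  sizeof(s->status));

        /* If cpu_physical_memory_rw only gave us an implicit wmb
         * after writes and rmb before reads, this store->load pair
         * would still be unordered without a manual full barrier: */
        smp_mb();

        /* DMA read: re-read the producer index from guest memory
         * (endianness ignored for brevity). */
        cpu_physical_memory_read(s->prod_addr, &next, sizeof(next));
        return next;
    }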
> If not (I'm trying to figure out why exactly x86 has a barrier in
> the first place and when it's in order), then I might add a new barrier
> type in qemu-barriers.h, something like dma_mb(), and define it as a nop
> on x86, a lwsync or sync (still thinking about it) on ppc, and
> __sync_synchronize() on unknown archs.

I don't think it is correct to think of this in terms of low-level
operations such as sync/lwsync. Rather, I think what we want is
sequentially consistent accesses; it's heavyweight, but you cannot go
wrong. http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html says this
is how you do it:

    x86 ->  Load Seq Cst:  mov                 or  mfence; mov
            Store Seq Cst: mov; mfence         or  mov

    ARM ->  Load Seq Cst:  ldr; dmb            or  dmb; ldr; dmb
            Store Seq Cst: dmb; str; dmb       or  dmb; str

    PPC ->  Load Seq Cst:  sync; ld; cmp; bc; isync
            Store Seq Cst: sync; st

    (where cmp; bc; isync can be replaced by sync)

(For x86 and ARM the two columns are alternative complete mappings; you
have to pick one column and use it for both loads and stores.)

It also says "As far as the memory model is concerned, the ARM processor
is broadly similar to PowerPC, differing mainly in having a DMB barrier
(analogous to the PowerPC sync in its programmer-observable behaviour
for normal memory) and no analogue of the PowerPC lwsync".

So one of the two ARM mappings, with smp_mb instead of dmb, is what we
want in cpu_physical_memory_rw. Device models that want to do better can
just use cpu_physical_memory_map, or we can add a
cpu_physical_memory_rw_direct for them.
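Concretely, taking the first ARM mapping (ldr; dmb / dmb; str; dmb) and
transcribing dmb as smp_mb(), the result would look roughly like this
sketch (mine, not a patch; the helper name is made up, and the real
change would of course go inside cpu_physical_memory_rw itself):

    /* Sketch of the first ARM seq-cst mapping with smp_mb() in
     * place of dmb; dma_rw_seq_cst is an illustrative name only. */
    static void dma_rw_seq_cst(target_phys_addr_t addr, uint8_t *buf,
                               int len, int is_write)
    {
        if (is_write) {
            smp_mb();                            /* dmb; str ...    */
            cpu_physical_memory_rw(addr, buf, len, 1);
            smp_mb();                            /* ... dmb         */
        } else {
            cpu_physical_memory_rw(addr, buf, len, 0);
            smp_mb();                            /* ldr; dmb        */
        }
    }

Note that for this to match the table, smp_mb() on PPC has to be sync,
not lwsync.

Paolo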