From: Paolo Bonzini
Date: Fri, 18 May 2012 08:53:49 +0200
Subject: Re: [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function
To: Benjamin Herrenschmidt
Cc: David Gibson, qemu-devel@nongnu.org, Anthony Liguori, "Michael S. Tsirkin"
Message-ID: <4FB5F1FD.9020009@redhat.com>
In-Reply-To: <1337214293.30558.25.camel@pasglop>

On 17/05/2012 02:24, Benjamin Herrenschmidt wrote:
>>> Also, should I make the barrier conditional to kvm_enabled()? I.e. it's
>>> pointless in full emulation and might actually be a performance hit on
>>> something already quite slow...
>
> Finally ... something like smp_mb() in qemu will turn into a lock op or
> an mfence on x86, i.e. not a nop.
>
> That means overhead from today's implementation, which leads to the
> question ... is today's implementation correct? I.e. is a barrier
> needed on x86 as well or not?

It depends on what semantics you attach to dma_mb. In my opinion, having
a separate barrier for DMA is wrong, because you want the same semantics
on all architectures.

The x86 requirements are roughly as follows:

1) x86 never needs an explicit rmb or wmb (as long as you don't use
   non-temporal stores etc.);

2) synchronized operations have an implicit mb before and after (unlike
   LL/SC on PowerPC);

3) everywhere else, i.e. to order a store against a later load, you
   need an mb.

So, on x86 you have more or less an implicit wmb after each write and an
implicit rmb before each read. This explains why these kinds of bugs are
very hard, and often impossible, to see on x86.

Adding these barriers in cpu_physical_memory_rw has the advantage that
x86 performance is not affected, but it would not cover the case of a
device model doing a DMA read after a DMA write. In that case the device
model would need to issue an smp_mb manually, on all architectures. I
think this is too brittle.
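To illustrate the brittleness, here is a rough sketch (purely
illustrative: RingState and its fields are invented, and it assumes the
current cpu_physical_memory_read/write helpers plus an smp_mb() along
the lines we are discussing) of what every such device model would have
to remember to write by hand:

    /* Purely illustrative; RingState and its fields are made up.
     * Assumes cpu-common.h's DMA helpers and qemu-barrier.h's
     * smp_mb() (or an equivalent full barrier). */
    #include "cpu-common.h"     /* cpu_physical_memory_read/write */
    #include "qemu-barrier.h"   /* smp_mb */

    typedef struct RingState {
        target_phys_addr_t status_addr; /* guest addr of status word */
        target_phys_addr_t prod_addr;   /* guest addr of producer index */
        uint32_t status;
    } RingState;

    static uint32_t complete_and_poll(RingState *s)
    {
        uint32_t next;

        /* DMA write: publish the completion status to the guest. */
        cpu_physical_memory_write(s->status_addr, &s->status,
                                  sizeof(s->status));

        /* If cpu_physical_memory_rw only gave us an implicit wmb
         * after writes and rmb before reads, this store->load pair
         * would still be unordered without a manual full barrier: */
        smp_mb();

        /* DMA read: re-read the producer index from guest memory
         * (endianness ignored for brevity). */
        cpu_physical_memory_read(s->prod_addr, &next, sizeof(next));
        return next;
    }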
> If not (I'm trying to figure out why exactly x86 has a barrier in
> the first place and when it's in order), then I might add a new barrier
> type in qemu-barriers.h, something like dma_mb(), and define it as a nop
> on x86, a lwsync or sync (still thinking about it) on ppc, and
> __sync_synchronize() on unknown archs.

I don't think it is correct to think of this in terms of low-level
operations such as sync/lwsync. Rather, I think what we want is
sequentially consistent accesses; it's heavyweight, but you cannot go
wrong. http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html says this
is how you do it:

    x86 ->  Load Seq Cst:  mov                 or  mfence; mov
            Store Seq Cst: mov; mfence         or  mov

    ARM ->  Load Seq Cst:  ldr; dmb            or  dmb; ldr; dmb
            Store Seq Cst: dmb; str; dmb       or  dmb; str

    PPC ->  Load Seq Cst:  sync; ld; cmp; bc; isync
            Store Seq Cst: sync; st

    (where cmp; bc; isync can be replaced by sync)

(For x86 and ARM the two columns are alternative complete mappings; you
have to pick one column and use it for both loads and stores.)

It also says "As far as the memory model is concerned, the ARM processor
is broadly similar to PowerPC, differing mainly in having a DMB barrier
(analogous to the PowerPC sync in its programmer-observable behaviour
for normal memory) and no analogue of the PowerPC lwsync".

So one of the two ARM mappings, with smp_mb instead of dmb, is what we
want in cpu_physical_memory_rw. Device models that want to do better can
just use cpu_physical_memory_map, or we can add a
cpu_physical_memory_rw_direct for them.
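Concretely, taking the first ARM mapping (ldr; dmb / dmb; str; dmb) and
transcribing dmb as smp_mb(), the result would look roughly like this
sketch (mine, not a patch; the helper name is made up, and the real
change would of course go inside cpu_physical_memory_rw itself):

    /* Sketch of the first ARM seq-cst mapping with smp_mb() in
     * place of dmb; dma_rw_seq_cst is an illustrative name only. */
    static void dma_rw_seq_cst(target_phys_addr_t addr, uint8_t *buf,
                               int len, int is_write)
    {
        if (is_write) {
            smp_mb();                            /* dmb; str ...    */
            cpu_physical_memory_rw(addr, buf, len, 1);
            smp_mb();                            /* ... dmb         */
        } else {
            cpu_physical_memory_rw(addr, buf, len, 0);
            smp_mb();                            /* ldr; dmb        */
        }
    }

Note that for this to match the table, smp_mb() on PPC has to be sync,
not lwsync.

Paolo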