From: Anthony Liguori
Date: Wed, 16 May 2012 14:39:59 -0500
Subject: Re: [Qemu-devel] [PATCH 13/13] iommu: Add a memory barrier to DMA RW function
Message-ID: <4FB4028F.7070003@codemonkey.ws>
In-Reply-To: <1337142938.6727.122.camel@pasglop>
To: Benjamin Herrenschmidt
Cc: "Michael S. Tsirkin", qemu-devel@nongnu.org, David Gibson

On 05/15/2012 11:35 PM, Benjamin Herrenschmidt wrote:
>
>>>>
>>>> +    /* HACK: full memory barrier here */
>>>> +    __sync_synchronize();
>>>
>>> I thought you were going to limit this to the TCE iommu?
>>
>> So, it wasn't my intention to send this one with the rest, but I
>> forgot to explain that to Ben when he resent.  As the comment
>> suggests, this is a hack we have been using internally to see if
>> certain ordering problems were what we thought they were.  If that
>> turned out to be the case (and it now looks like it is), we need to
>> work out where to correctly place this barrier.  As Ben says, this
>> should probably really be in the PCI accessors, and we should use the
>> finer grained primitives from qemu-barrier.h rather than the brute
>> force __sync_synchronize().
>
> Well, I knew you didn't intend to send them but I still think that's the
> right patch for now :-)
>
> So we -could- put it in the PCI accessors ... but that would mean fixing
> all drivers to actually use them.  For example, ide/ahci or usb/ohci
> don't, and they aren't the only ones.
>
> In the end, I don't think there's anything we care about which would not
> benefit from ensuring that the DMAs it does appear to the guest kernel in
> the order they were issued.  Most busses provide that guarantee to some
> extent, and while some busses do have the ability to explicitly request
> relaxed ordering, I don't think that is the case with anything we care
> about emulating at this stage (and we can always handle that later with
> separate accessors, or flags added to the direction, for example).

I must confess, I have no idea what PCI et al. guarantee with respect to
ordering.

What's nasty about this patch is that you're not just ordering with respect
to device reads/writes, but also with respect to the other VCPUs.  I don't
expect this to be prohibitively expensive, but it still worries me.

> So by putting the barrier right in the dma_* accessor we kill all the
> birds with one stone without having to audit all drivers for use of the
> right accessors and all bus types.
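
For illustration, the proposal boils down to something like the following
self-contained sketch: a single full barrier issued in the central
guest-DMA helper that every device model funnels through.  The names here
(guest_dma_rw, GuestRAM, DMA_READ/DMA_WRITE) are hypothetical stand-ins,
not QEMU's actual dma_memory_rw() API; only the __sync_synchronize() call
is what the patch under discussion actually adds.

#include <stdint.h>
#include <string.h>

typedef enum { DMA_READ, DMA_WRITE } dma_dir_t;

typedef struct {
    uint8_t  *ram;        /* base of the emulated guest RAM */
    uint64_t  ram_size;
} GuestRAM;

static int guest_dma_rw(GuestRAM *g, uint64_t addr,
                        void *buf, uint64_t len, dma_dir_t dir)
{
    if (addr > g->ram_size || len > g->ram_size - addr) {
        return -1;                      /* access outside guest RAM */
    }

    /*
     * HACK: full memory barrier here (the line the patch adds).
     * __sync_synchronize() is the GCC builtin full barrier: it orders
     * this access against every earlier load and store from this
     * thread, so the DMA becomes visible to the guest (and to other
     * vCPU threads) in the order the device model issued it.
     */
    __sync_synchronize();

    if (dir == DMA_WRITE) {
        memcpy(g->ram + addr, buf, len);    /* device writes guest memory */
    } else {
        memcpy(buf, g->ram + addr, len);    /* device reads guest memory */
    }
    return 0;
}

Anything that already funnels its DMA through the common helper picks the
ordering up automatically, which is the "all the birds with one stone"
point above.
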
> Also while the goal of using more targeted barriers might be worthwhile
> in the long run, it's not totally trivial because we do want to order
> store vs. subsequent loads in all cases and load vs. loads, and we don't
> want to have to keep track of what the previous access was, so at this
> stage it's simply easier to just use a full barrier.
>
> So my suggestion is to see if that patch introduces a measurable
> performance regression anywhere we care about (i.e. on x86) and if not,
> just go for it; it will solve a very real problem and we can ponder ways
> to do it better as a second step if it's worthwhile.
>
> Anthony, how do you usually benchmark these things?  Any chance you can
> run a few tests to see if there's any visible loss?

My concern would really be limited to virtio ring processing.

It all depends on where you place the barriers in the end.  I really don't
want to just conservatively stick barriers everywhere either.  I'd like to
have a specific ordering guarantee, and then implement that and deal with
the performance consequences.

I also wonder if the "fix" that you see from this is papering over a
bigger problem.  Can you explain the ohci problem that led you to do this
in the first place?

Regards,

Anthony Liguori

> Cheers,
> Ben.
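
As a footnote on the virtio concern above, device-side ring processing has
roughly two ordering points, sketched below.  The structs and the
do-the-I/O step are simplified stand-ins (not hw/virtio.c), and the
smp_rmb()/smp_wmb() macros here are crude placeholders for the
finer-grained qemu-barrier.h primitives mentioned earlier in the thread.

#include <stdint.h>

#define RING_SIZE 256

struct used_elem  { uint32_t id; uint32_t len; };
struct avail_ring { volatile uint16_t idx; uint16_t ring[RING_SIZE]; };
struct used_ring  { volatile uint16_t idx; struct used_elem ring[RING_SIZE]; };

/* Placeholders: a real implementation would use finer-grained barriers
 * rather than a full barrier for both directions. */
#define smp_rmb() __sync_synchronize()
#define smp_wmb() __sync_synchronize()

static void process_ring(struct avail_ring *avail, struct used_ring *used,
                         uint16_t *last_avail, uint16_t *last_used)
{
    uint16_t avail_idx = avail->idx;  /* 1. see how far the guest has got */
    smp_rmb();                        /* 2. read barrier: don't read ring  */
                                      /*    entries before seeing the idx  */
                                      /*    that published them            */

    while (*last_avail != avail_idx) {
        uint16_t head = avail->ring[(*last_avail)++ % RING_SIZE];

        /* 3. ...DMA into/out of the buffers described by 'head'... */
        uint32_t written = 0;         /* bytes the device produced */

        used->ring[*last_used % RING_SIZE].id  = head;
        used->ring[*last_used % RING_SIZE].len = written;
        (*last_used)++;
    }

    smp_wmb();                        /* 4. write barrier: used elements   */
    used->idx = *last_used;           /*    must be visible before the idx */
}

The point being: the ring protocol itself mainly needs a read barrier
between fetching avail->idx and the ring entries, and a write barrier
between filling used elements and publishing used->idx, so a full barrier
on every dma_* call is strictly stronger than that, which is where the
performance question comes from.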