Date: Tue, 15 May 2012 09:02:58 -0500
From: Anthony Liguori
Subject: Re: [Qemu-devel] [PATCH 08/13] iommu: Introduce IOMMU emulation infrastructure
To: Benjamin Herrenschmidt
Cc: Alex Williamson, Richard Henderson, "Michael S. Tsirkin",
 qemu-devel@nongnu.org, Eduard - Gabriel Munteanu

On 05/14/2012 10:02 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2012-05-14 at 21:50 -0500, Anthony Liguori wrote:
>> On 05/14/2012 09:32 PM, Benjamin Herrenschmidt wrote:
>>> On Mon, 2012-05-14 at 21:03 -0500, Anthony Liguori wrote:
>>>> So the CPU thread runs in lock-step with the I/O thread. Dropping the
>>>> CPU thread lock to let the I/O thread run is a dangerous thing to do
>>>> in a place like this.
>>>>
>>>> Also, I think you'd effectively block the CPU until pending DMA
>>>> operations complete? This could be many, many milliseconds, no?
>>>> That's going to make guests very upset.
>>>
>>> Do you see any other option?
>>
>> Yes, ignore it.
>>
>> I have a hard time believing software depends on changing a DMA
>> translation mid-way through a transaction.
>
> It's a correctness issue. It won't happen in normal circumstances, but it
> can, and thus should be handled gracefully.

I think the crux of your argument is that a change to the translation
table acts as a barrier: the moment the update operation returns, you're
guaranteed that no DMA is in flight with the old mapping.

That's not my understanding of at least VT-d, and I have a hard time
believing it's true of other IOMMUs, because that kind of synchronization
would be very expensive to implement in hardware.

Rather, when the IOTLB is flushed, I believe the only guarantee you get is
that future IOTLB lookups will return the new mapping. That doesn't mean
there is no request in flight that still uses the old mapping.

I will grant you that PCI transactions are typically much smaller than
QEMU transactions, so we may keep using old mappings for much longer than
real hardware would. But I think that still puts us well within the realm
of correctness.
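To make the distinction concrete, here is a minimal sketch of the
semantics I'm describing, written with C11 atomics. This is not QEMU code
and the names are made up; the point is only that invalidation atomically
publishes the new table, so translations that start afterwards see the new
mapping, while a transaction that has already translated keeps the result
it got:

  #include <stdatomic.h>
  #include <stdint.h>

  /* Hypothetical translation table: bus address -> host address. */
  typedef struct {
      uint64_t entries[512];
  } IOMMUTable;

  /* The current table, published atomically. */
  static _Atomic(IOMMUTable *) current_table;

  /* A DMA transaction translates once, up front.  The acquire load
   * pairs with the release store in iommu_invalidate(). */
  IOMMUTable *dma_begin(void)
  {
      return atomic_load_explicit(&current_table, memory_order_acquire);
  }

  /* Invalidation only guarantees that *future* translations see the
   * new table.  It does not wait for, or cancel, a transaction that
   * already holds a pointer to the old one; that transaction completes
   * with the old mapping. */
  void iommu_invalidate(IOMMUTable *new_table)
  {
      atomic_store_explicit(&current_table, new_table,
                            memory_order_release);
  }

The window between dma_begin() and completion is the whole dispute;
nothing in the flush itself closes it, and I don't believe real IOTLB
flushes close it either.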
> Cases where that matter are unloading of a (broken) driver, kexec/kdump
> from one guest to another, etc. These all involve potentially clearing
> all iommu tables while a driver might have left a device DMA'ing. The
> expectation is that the device will get target aborts from the iommu
> until the situation gets "cleaned up" in SW.

Yes, and this would be worse in QEMU than on bare metal because we
essentially have a much larger translation TLB. But as I said above, I
think we're well within the specified behavior here.

>> Why does this need to be guaranteed? How can software depend on this in
>> a meaningful way?
>
> The same as TLB invalidations :-)
>
> In real HW, this is a property of the HW itself, ie, whatever MMIO is
> used to invalidate the HW TLB provides a way to ensure (usually by
> reading back) that any request pending in the iommu pipeline has either
> been completed or canned.

Can you point to a spec that says this? It doesn't match my understanding.

> When we start having page-fault-capable iommus this will be even more
> important, as faults will be part of the non-error case.

We can revisit this discussion after every PCI device has been changed to
cope with a page-fault-capable IOMMU ;-)

>>> David's approach may not be the best long term, but provided it's not
>>> totally broken (I don't know qemu locking well enough to judge how
>>> dangerous it is) then it might be a "good enough" first step until we
>>> come up with something better?
>>
>> No, it's definitely not good enough. Dropping the global mutex in
>> random places is asking for worlds of hurt.
>>
>> If this is really important, then we need some sort of cancellation API
>> to go along with map/unmap, although I doubt that's really possible.
>>
>> MMIO/PIO operations cannot block.
>
> Well, there's a truckload of cases in real HW where an MMIO/PIO read is
> used to synchronize some sort of HW operation... I suppose nothing that
> involves blocking at this stage in qemu, but I would be careful with
> your expectations here... writes are usually pipelined, but blocking on
> a read response does make a lot of sense.

Blocking on an MMIO/PIO request effectively freezes a CPU, and all sorts
of badness results from that. Best case, you trigger soft-lockup warnings
in the guest.

> In any case, for the problem at hand, I can just drop the wait for now
> and maybe just print a warning if I see an existing map.
>
> We still need some kind of locking or barrier to ensure that updates to
> the TCE table are visible to other processors, but that can be done in
> the backend.
>
> But I wouldn't just forget about the issue; it's going to come back and
> bite...

I agree that working out the exact semantics of what we need to provide is
absolutely important. But I think you're taking an overly conservative
approach to what we need here.

Regards,

Anthony Liguori

> Cheers,
> Ben.
>
>> Regards,
>>
>> Anthony Liguori
>>
>>> The normal case will be that no map exists, ie, it will almost always
>>> be a guest programming error to remove an iommu mapping while a device
>>> is actively using it, so having this case be slow is probably a
>>> non-issue.
>>>
>>> Cheers,
>>> Ben.
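P.S. On the locking/barrier for TCE table updates: agreed that it belongs
in the backend. A sketch of the shape I'd expect, again with C11 atomics;
this is not the spapr code and the names are made up. A release store when
publishing an entry, paired with an acquire load on the translation side,
gives you the cross-processor visibility without any global lock:

  #include <stdatomic.h>
  #include <stddef.h>
  #include <stdint.h>

  #define TCE_VALID 0x1ULL

  /* Hypothetical guest-visible TCE table: one 64-bit entry per page. */
  typedef struct {
      _Atomic uint64_t *entries;
      size_t nb_entries;
  } TCETable;

  /* Publish an entry.  The release store orders it after any prior
   * initialization of the page it maps, so a reader that observes
   * TCE_VALID also observes a fully-written entry. */
  void tce_set(TCETable *t, size_t index, uint64_t host_page)
  {
      atomic_store_explicit(&t->entries[index], host_page | TCE_VALID,
                            memory_order_release);
  }

  /* Translate on the I/O side; pairs with the release store above. */
  uint64_t tce_get(TCETable *t, size_t index)
  {
      return atomic_load_explicit(&t->entries[index],
                                  memory_order_acquire);
  }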