From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:38699)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <aik@ozlabs.ru>) id 1WshOX-0001sk-3f
	for qemu-devel@nongnu.org; Thu, 05 Jun 2014 19:48:35 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <aik@ozlabs.ru>) id 1WshOQ-0003yz-QF
	for qemu-devel@nongnu.org; Thu, 05 Jun 2014 19:48:29 -0400
Received: from mail-pb0-f41.google.com ([209.85.160.41]:52658)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <aik@ozlabs.ru>) id 1WshOQ-0003yr-Ij
	for qemu-devel@nongnu.org; Thu, 05 Jun 2014 19:48:22 -0400
Received: by mail-pb0-f41.google.com with SMTP id uo5so1800666pbc.28
	for <qemu-devel@nongnu.org>; Thu, 05 Jun 2014 16:48:21 -0700 (PDT)
Message-ID: <539101C0.2090004@ozlabs.ru>
Date: Fri, 06 Jun 2014 09:48:16 +1000
From: Alexey Kardashevskiy <aik@ozlabs.ru>
MIME-Version: 1.0
References: <1401947401-21329-1-git-send-email-aik@ozlabs.ru>
	<1401947401-21329-2-git-send-email-aik@ozlabs.ru>
	<5390119D.8040201@ozlabs.ru> <53906B56.3080007@suse.de>
	<53906C50.50308@ozlabs.ru> <53906D54.4030105@suse.de>
	<5390718C.4020005@ozlabs.ru> <53907267.1090000@suse.de>
	<53907FBA.8060604@ozlabs.ru> <5390A01D.7020004@suse.de>
	<5390FA95.2090509@ozlabs.ru> <5390FEF3.4080108@suse.de>
In-Reply-To: <5390FEF3.4080108@suse.de>
Content-Type: text/plain; charset=KOI8-R
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH v7 1/4] spapr_iommu: Make in-kernel TCE
	table optional
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Alexander Graf <agraf@suse.de>, qemu-devel@nongnu.org
Cc: Alex Williamson <alex.williamson@redhat.com>, qemu-ppc@nongnu.org, Gavin Shan <gwshan@linux.vnet.ibm.com>

On 06/06/2014 09:36 AM, Alexander Graf wrote:
> 
> On 06.06.14 01:17, Alexey Kardashevskiy wrote:
>> On 06/06/2014 02:51 AM, Alexander Graf wrote:
>>> On 05.06.14 16:33, Alexey Kardashevskiy wrote:
>>>> On 06/05/2014 11:36 PM, Alexander Graf wrote:
>>>>> On 05.06.14 15:33, Alexey Kardashevskiy wrote:
>>>>>> On 06/05/2014 11:15 PM, Alexander Graf wrote:
>>>>>>> On 05.06.14 15:10, Alexey Kardashevskiy wrote:
>>>>>>>> On 06/05/2014 11:06 PM, Alexander Graf wrote:
>>>>>>>>> On 05.06.14 08:43, Alexey Kardashevskiy wrote:
>>>>>>>>>> On 06/05/2014 03:49 PM, Alexey Kardashevskiy wrote:
>>>>>>>>>>> POWER KVM supports a KVM_CAP_SPAPR_TCE capability which allows
>>>>>>>>>>> allocating
>>>>>>>>>>> TCE tables in the host kernel memory and handle H_PUT_TCE requests
>>>>>>>>>>> targeted to specific LIOBN (logical bus number) right in the host
>>>>>>>>>>> without
>>>>>>>>>>> switching to QEMU. At the moment this is used for emulated devices
>>>>>>>>>>> only
>>>>>>>>>>> and the handler only puts TCE to the table. If the in-kernel
>>>>>>>>>>> H_PUT_TCE
>>>>>>>>>>> handler finds a LIOBN and corresponding table, it will put a TCE to
>>>>>>>>>>> the table and complete hypercall execution. The user space will
>>>>>>>>>>> not be
>>>>>>>>>>> notified.
>>>>>>>>>>>
>>>>>>>>>>> Upcoming VFIO support is going to use the same sPAPRTCETable device
>>>>>>>>>>> class
>>>>>>>>>>> so KVM_CAP_SPAPR_TCE is going to be used as well. That means
>>>>>>>>>>> that TCE
>>>>>>>>>>> tables for VFIO are going to be allocated in the host as well.
>>>>>>>>>>> However VFIO operates with real IOMMU tables and simple copying of
>>>>>>>>>>> a TCE to the real hardware TCE table will not work as guest
>>>>>>>>>>> physical
>>>>>>>>>>> to host physical address translation is required.
>>>>>>>>>>>
>>>>>>>>>>> So until the host kernel gets VFIO support for H_PUT_TCE, we had
>>>>>>>>>>> better not
>>>>>>>>>>> register VFIO's TCE in the host.
>>>>>>>>>>>
>>>>>>>>>>> This adds a bool @kvm_accel flag to the sPAPRTCETable device
>>>>>>>>>>> telling
>>>>>>>>>>> that sPAPRTCETable should not try allocating TCE table in the host
>>>>>>>>>>> kernel.
>>>>>>>>>>> Instead, the table will be created in QEMU.
>>>>>>>>>>>
>>>>>>>>>>> This adds a kvm_accel parameter to spapr_tce_new_table() to let
>>>>>>>>>>> users
>>>>>>>>>>> choose whether to use acceleration or not. At the moment it is
>>>>>>>>>>> enabled
>>>>>>>>>>> for VIO and emulated PCI. Upcoming VFIO support will set it to
>>>>>>>>>>> false.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>>> ---
>>>>>>>>>>>
>>>>>>>>>>> This is a workaround but it lets me have one IOMMU device for VIO,
>>>>>>>>>>> emulated
>>>>>>>>>>> PCI and VFIO which is a good thing.
>>>>>>>>>>>
>>>>>>>>>>> The other way around would be a new KVM_CAP_SPAPR_TCE_VFIO
>>>>>>>>>>> capability but
>>>>>>>>>>> this needs kernel update.
>>>>>>>>>> Never mind, I'll make it a capability. I'll post capability
>>>>>>>>>> reservation
>>>>>>>>>> patch separately.
>>>>>>>>> Just rename the flag from "kvm_accel" to "vfio_accel", set it to
>>>>>>>>> true for
>>>>>>>>> vfio and false for emulated devices. Then the spapr_iommu file can
>>>>>>>>> check on
>>>>>>>>> the capability (and default to false for now, since it doesn't exist
>>>>>>>>> yet).
>>>>>>>> Is that ok if the flag does not have anything to do with VFIO per
>>>>>>>> se? :)
>>>>>>> The flag means "use in-kernel acceleration if the vfio coupling
>>>>>>> capability
>>>>>>> is available", no?
>>>>>> It is a flag of sPAPRTCETable which is not supposed to know about
>>>>>> VFIO at
>>>>>> all, it is just an IOMMU. But if you are ok with it, I have no reason
>>>>>> to be
>>>>>> unhappy either :)
>>>>>>
>>>>>>
>>>>>>
>>>>>>>>> That way you don't have to reserve a CAP today.
>>>>>>>> Why exactly cannot we do that today?
>>>>>>> Because the CAP namespace isn't a garbage bin we can just throw IDs at.
>>>>>>> Maybe we realize during patch review that we need completely different
>>>>>>> CAPs.
>>>>>> That was my first plan - to wait for KVM_CAP_SPAPR_TCE_64 to be
>>>>>> available in
>>>>>> the kernel.
>>>>> So all you need are 64bit TCEs with bus_offset?
>>>> No. I need 64bit IOBAs a.k.a. PCI bus addresses. The default DMA window is
>>>> just 1 or 2GB and it is mapped at 0 on PCI bus.
>>>>
>>>> TCEs are 64 bit already.
>>> Ok, so the guest has to tell the PCI device to write to a specific window.
>>> That's a shame :).
>> No. Guest tells the device some address, that's it.  Guest allocates those
>> addresses from some window which host, guest and PHB know about but not the
>> device. What is a shame here?
> 
> It would be nicer if the guest had full control over the virtual address
> range of a PCI device.
>
>>>>> What about the missing
>>>>> in-kernel modification of the shadow TCEs on H_PUT_TCE? I thought that's
>>>>> what this is really about.
>>>> This I do not understand :(
>>> How does real mode H_PUT_TCE emulation know that it needs to notify user
>>> space to establish the map?
>> If it wants to pass control to the user space, it returns H_TOO_HARD. This
>> happens, for example, if LIOBN was not registered in KVM.
> 
> So how does KVM_CAP_SPAPR_TCE_64 help here? With KVM_CAP_SPAPR_TCE_64 we
> can still not map VFIO devices' TCE tables because we're missing all the
> magic to link the virtual TCE table to a physical TCE table.


It does not help here indeed, I did not say it would ;) I just wanted to
do the preparations first, and this means I need to reserve capability
numbers (which is normally a very tough process). Since one capability is
straightforward to implement, I included it in the set.
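
For reference, the H_PUT_TCE fallback discussed above (handle the call in
the host if the LIOBN is registered, otherwise return H_TOO_HARD so user
space takes over) can be sketched roughly as below. This is only a toy
model, not the actual KVM or QEMU code: every toy_* name, the table
limits and the 4K page shift are assumptions made up for illustration.
Only the H_* return code names and the kvm_accel idea come from the
thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypercall return codes as used by the PAPR interface. */
#define H_SUCCESS    0
#define H_PARAMETER  (-4)
#define H_TOO_HARD   9999   /* "cannot handle here, exit to user space" */

/* Hypothetical limits for this sketch only. */
#define TOY_MAX_TABLES 4
#define TOY_MAX_TCES   16
#define TOY_PAGE_SHIFT 12   /* assume 4K IOMMU pages */

/* Hypothetical stand-in for a TCE table registered with the host kernel. */
struct toy_tce_table {
    bool in_use;
    uint32_t liobn;
    uint64_t tces[TOY_MAX_TCES];
};

static struct toy_tce_table toy_kernel_tables[TOY_MAX_TABLES];

/*
 * Models the kvm_accel choice: only an accelerated table is registered
 * with the "kernel". Returns true if the LIOBN is now handled in-kernel,
 * false if the table stays in user space (or no slot is free).
 */
bool toy_new_table(uint32_t liobn, bool kvm_accel)
{
    if (!kvm_accel) {
        return false;
    }
    for (size_t i = 0; i < TOY_MAX_TABLES; i++) {
        if (!toy_kernel_tables[i].in_use) {
            toy_kernel_tables[i].in_use = true;
            toy_kernel_tables[i].liobn = liobn;
            return true;
        }
    }
    return false;
}

/*
 * Models the in-kernel H_PUT_TCE handler: a known LIOBN is completed
 * here, an unknown one is punted to user space via H_TOO_HARD.
 */
long toy_h_put_tce(uint32_t liobn, uint64_t ioba, uint64_t tce)
{
    for (size_t i = 0; i < TOY_MAX_TABLES; i++) {
        struct toy_tce_table *t = &toy_kernel_tables[i];

        if (t->in_use && t->liobn == liobn) {
            uint64_t idx = ioba >> TOY_PAGE_SHIFT;

            if (idx >= TOY_MAX_TCES) {
                return H_PARAMETER;   /* IOBA outside the window */
            }
            t->tces[idx] = tce;
            return H_SUCCESS;         /* user space is not notified */
        }
    }
    return H_TOO_HARD;                /* LIOBN not registered in kernel */
}

/* Small usage example; the LIOBN values are arbitrary. */
int toy_demo(void)
{
    assert(toy_new_table(0x71000000u, true));    /* accelerated table */
    assert(!toy_new_table(0x72000000u, false));  /* stays in user space */
    assert(toy_h_put_tce(0x71000000u, 0x2000, 0x1003) == H_SUCCESS);
    assert(toy_h_put_tce(0x72000000u, 0x2000, 0x1003) == H_TOO_HARD);
    return 0;
}
```

With kvm_accel set to false the LIOBN is never registered, so every
H_PUT_TCE for it ends up in user space, which is what the VFIO case
needs until the host kernel can do the translation itself.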


-- 
Alexey