From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:41377)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <aik@ozlabs.ru>) id 1Z7iIW-0001ET-Dx
	for qemu-devel@nongnu.org; Wed, 24 Jun 2015 06:52:53 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <aik@ozlabs.ru>) id 1Z7iIS-0002Cr-Kd
	for qemu-devel@nongnu.org; Wed, 24 Jun 2015 06:52:52 -0400
Received: from mail-pa0-f43.google.com ([209.85.220.43]:33857)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <aik@ozlabs.ru>) id 1Z7iIS-0002CU-Cg
	for qemu-devel@nongnu.org; Wed, 24 Jun 2015 06:52:48 -0400
Received: by pabvl15 with SMTP id vl15so26962921pab.1
	for <qemu-devel@nongnu.org>; Wed, 24 Jun 2015 03:52:47 -0700 (PDT)
References: <1434627456-13745-1-git-send-email-aik@ozlabs.ru>
	<20150623064442.GC13352@voom.redhat.com>
From: Alexey Kardashevskiy <aik@ozlabs.ru>
Message-ID: <558A8BF8.3080509@ozlabs.ru>
Date: Wed, 24 Jun 2015 20:52:40 +1000
MIME-Version: 1.0
In-Reply-To: <20150623064442.GC13352@voom.redhat.com>
Content-Type: text/plain; charset=koi8-r; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH qemu v8 00/14] spapr: vfio: Enable Dynamic
 DMA windows (DDW)
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: Alex Williamson <alex.williamson@redhat.com>, qemu-ppc@nongnu.org, qemu-devel@nongnu.org, Gavin Shan <gwshan@linux.vnet.ibm.com>, Alexander Graf <agraf@suse.de>

On 06/23/2015 04:44 PM, David Gibson wrote:
> On Thu, Jun 18, 2015 at 09:37:22PM +1000, Alexey Kardashevskiy wrote:
>>
>> (cut-n-paste from kernel patchset)
>>
>> Each Partitionable Endpoint (IOMMU group) has an address range on a PCI bus
>> where devices are allowed to do DMA. These ranges are called DMA windows.
>> By default, there is a single DMA window, 1 or 2GB big, mapped at zero
>> on a PCI bus.
>>
>> PAPR defines a DDW RTAS API which allows pseries guests
>> to query the hypervisor about DDW support and capabilities (page size mask
>> for now). A pseries guest may request additional (to the default)
>> DMA windows using this RTAS API.
>> The existing pseries Linux guests request an additional window as big as
>> the guest RAM and map the entire guest window which effectively creates
>> direct mapping of the guest memory to a PCI bus.
>>
>> This patchset reworks PPC64 IOMMU code and adds necessary structures
>> to support big windows.
>>
>> Once a Linux guest discovers the presence of DDW, it does:
>> 1. query hypervisor about number of available windows and page size masks;
>> 2. create a window with the biggest possible page size (today 4K/64K/16M);
>> 3. map the entire guest RAM via H_PUT_TCE* hypercalls;
>> 4. switch dma_ops to direct_dma_ops on the selected PE.
>>
>> Once this is done, H_PUT_TCE is not called anymore for 64bit devices and
>> the guest does not waste time on DMA map/unmap operations.
>>
>> Note that 32bit devices won't use DDW and will keep using the default
>> DMA window so KVM optimizations will be required (to be posted later).
>>
>> This patchset adds DDW support for pseries. The host kernel changes are
>> required, posted as:
>>
>> [PATCH kernel v11 00/34] powerpc/iommu/vfio: Enable Dynamic DMA windows
>>
>> This patchset is based on git://github.com/dgibson/qemu.git spapr-next branch.
>
> A couple of general queries - this touches on the kernel part as well
> as the qemu part:
>
>   * Am I correct in thinking that the point in doing the
>     pre-registration stuff is to allow the kernel to handle PUT_TCE
>     in real mode?  i.e. that the advantage of doing preregistration
>     rather than accounting on the DMA_MAP and DMA_UNMAP itself only
>     appears once you have kernel KVM+VFIO acceleration?


Handling PUT_TCE includes 2 things:
1. get_user_pages_fast() and put_page()
2. update locked_vm

Both are tricky in real mode, but 2) is also tricky in virtual mode, as I 
have to deal with multiple unrelated 32bit and 64bit windows (VFIO does not 
care if they belong to one or many processes) with IOMMU page size==4K and 
gup/put_page working with 64K pages (our default page size for the host kernel).
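
To illustrate why 2) is awkward, here is a toy model (plain Python, not the 
kernel code; the class and all names in it are hypothetical). It assumes the 
difficulty is that several 4K TCEs, possibly from different windows, can 
point into the same 64K host page, so a per-TCE counter would overcount 
locked_vm and the page has to be counted once:

```python
from collections import Counter

IOMMU_PAGE_SHIFT = 12   # 4K IOMMU pages, as in the text
HOST_PAGE_SHIFT = 16    # 64K host pages, as in the text


class LockedVmModel:
    """Toy model: count each pinned host page once, however many TCEs use it."""

    def __init__(self):
        self.tce = {}          # (window, index) -> host page frame number
        self.refs = Counter()  # host page frame number -> number of TCEs
        self.locked_vm = 0     # pinned host pages, counted once each

    def map(self, window, index, host_addr):
        self.unmap(window, index)  # overwriting an entry drops the old reference
        pfn = host_addr >> HOST_PAGE_SHIFT
        self.tce[(window, index)] = pfn
        self.refs[pfn] += 1
        if self.refs[pfn] == 1:    # first TCE to reference this host page
            self.locked_vm += 1

    def unmap(self, window, index):
        pfn = self.tce.pop((window, index), None)
        if pfn is None:
            return
        self.refs[pfn] -= 1
        if self.refs[pfn] == 0:    # last TCE referencing this host page is gone
            del self.refs[pfn]
            self.locked_vm -= 1
```

In this model, sixteen 4K entries of a 32bit window that all fall into one 
64K host page, plus one entry of a 64bit window pointing at the same page, 
leave locked_vm at 1 until the last of them is unmapped.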

But yes, without keeping real mode handlers in mind, this thing could have 
been made simpler.


>   * Do you have test numbers to show that it's still worthwhile to have
>     kernel acceleration once you have a guest using DDW?  With DDW in
>     play, even if PUT_TCE is slow, it should be called a lot less
>     often.

With DDW, the whole RAM is mapped once, at the first set_dma_mask(64bit) call 
made by the guest; that takes just a few PUT_TCE_INDIRECT calls.
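
A rough back-of-the-envelope for that one-off cost, as a sketch: the helper 
name and the 8 GiB guest size are made up for illustration, and the limit of 
512 TCEs per PUT_TCE_INDIRECT call (one 4K page of 8-byte entries) is my 
assumption here, so adjust it if your hypervisor differs.

```python
def put_tce_indirect_calls(ram_bytes, iommu_page_bytes, tces_per_call=512):
    """Number of indirect hypercalls needed to map all guest RAM once.

    tces_per_call=512 is an assumption: one 4K list page of 8-byte TCEs.
    """
    tces = -(-ram_bytes // iommu_page_bytes)  # ceiling division
    return -(-tces // tces_per_call)


GiB = 1 << 30
ram = 8 * GiB  # hypothetical guest size

for name, page in (("4K", 1 << 12), ("64K", 1 << 16), ("16M", 1 << 24)):
    print(name, put_tce_indirect_calls(ram, page))
```

Whatever the window page size, this is paid once per window rather than on 
every DMA map/unmap.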

If the guest uses DDW, real mode handlers cannot possibly beat it, and I 
have reports that real mode handlers are noticeably slower than direct DMA 
mapping (i.e. DDW) for 40Gb devices (10Gb seems to be fine, but I have not 
tried a dozen guests yet).


> The reason I ask is that the preregistration handling is a pretty big
> chunk of code that inserts itself into some pretty core kernel data
> structures, all for one pretty specific use case.  We only want to do
> that if there's a strong justification for it.

Exactly. I keep asking Ben and Paul periodically if we want to keep it and 
the answer is always yes :)


About "vfio: spapr: Move SPAPR-related code to a separate file" - I guess I 
am better off removing it for now, right?


-- 
Alexey