Message-ID: <57356F64.1010406@intel.com>
Date: Fri, 13 May 2016 14:08:36 +0800
From: Jike Song
Subject: Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
In-Reply-To: <20160512194924.GA24334@nvidia.com>
To: Neo Jia
Cc: Jike Song, Alex Williamson, "Tian, Kevin", Kirti Wankhede, "pbonzini@redhat.com", "kraxel@redhat.com", "qemu-devel@nongnu.org", "kvm@vger.kernel.org", "Ruan, Shuai", "Lv, Zhiyuan"

On 05/13/2016 03:49 AM, Neo Jia wrote:
> On Thu, May 12, 2016 at 12:11:00PM +0800, Jike Song wrote:
>> On Thu, May 12, 2016 at 6:06 AM, Alex Williamson wrote:
>>> On Wed, 11 May 2016 17:15:15 +0800 Jike Song wrote:
>>>
>>>> On 05/11/2016 12:02 AM, Neo Jia wrote:
>>>>> On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:
>>>>>> On 05/05/2016 05:27 PM, Tian, Kevin wrote:
>>>>>>>> From: Song, Jike
>>>>>>>>
>>>>>>>> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
>>>>>>>> hardware. It just, as you said in another mail, "rather than
>>>>>>>> programming them into an IOMMU for a device, it simply stores the
>>>>>>>> translations for use by later requests".
>>>>>>>>
>>>>>>>> That imposes a constraint on the gfx driver: the hardware IOMMU must
>>>>>>>> be disabled. Otherwise, if an IOMMU is present, the gfx driver
>>>>>>>> eventually programs the hardware IOMMU with the IOVA returned by
>>>>>>>> pci_map_page or dma_map_page; meanwhile, the IOMMU backend for vgpu
>>>>>>>> only maintains GPA <-> HPA translations without any knowledge of the
>>>>>>>> hardware IOMMU. How is the device model supposed to get an IOVA for a
>>>>>>>> given GPA (thereby an HPA, via the IOMMU backend here)?
>>>>>>>>
>>>>>>>> If things go as guessed above, as vfio_pin_pages() indicates, it
>>>>>>>> pins & translates vaddr to PFN; then it will be very difficult for the
>>>>>>>> device model to figure out:
>>>>>>>>
>>>>>>>> 1, for a given GPA, how to avoid calling dma_map_page multiple times?
>>>>>>>> 2, for which page to call dma_unmap_page?
>>>>>>>>
>>>>>>>> --
>>>>>>>
>>>>>>> We have to support both the w/ IOMMU and w/o IOMMU cases, since
>>>>>>> that fact is out of the GPU driver's control. A simple way is to use
>>>>>>> dma_map_page, which internally copes with the w/ and w/o IOMMU
>>>>>>> cases gracefully, i.e. returns an HPA w/o IOMMU and an IOVA w/ IOMMU.
>>>>>>> Then in this file we only need to cache GPA to whatever dma_addr_t
>>>>>>> is returned by dma_map_page.
>>>>>>>
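A minimal sketch of that approach, assuming a per-vGPU cache keyed by GPA
(gpa_cache_insert() below is a hypothetical helper, not from the actual
patches), could look like:

#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/mm.h>

static void gpa_cache_insert(unsigned long gpa, dma_addr_t dma);	/* hypothetical */

static int vgpu_map_gpa(struct device *dev, struct page *page,
			unsigned long gpa, dma_addr_t *dma)
{
	/*
	 * The DMA API hides the IOMMU question from the caller: with a
	 * hardware IOMMU this returns an IOVA, without one it returns
	 * the HPA.
	 */
	*dma = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
	if (dma_mapping_error(dev, *dma))
		return -EFAULT;

	/* Cache GPA -> dma_addr_t so later requests reuse the mapping. */
	gpa_cache_insert(gpa, *dma);
	return 0;
}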
>>>>>>
>>>>>> Hi Alex, Kirti and Neo, any thoughts on the IOMMU compatibility here?
>>>>>
>>>>> Hi Jike,
>>>>>
>>>>> With mediated passthrough, you can still use the hardware IOMMU, but more
>>>>> importantly, that part is actually orthogonal to what we are discussing
>>>>> here, as we will only cache the mapping between <...>; once we have
>>>>> pinned pages later, with the help of the above info you can map them
>>>>> into the proper IOMMU domain if the system is configured so.
>>>>>
>>>>
>>>> Hi Neo,
>>>>
>>>> Technically yes, you can map a pfn into the proper IOMMU domain elsewhere,
>>>> but to find out whether a pfn was previously mapped or not, you have to
>>>> track it with another rbtree-like data structure (the IOMMU driver simply
>>>> doesn't bother to track it), which seems to duplicate the vGPU IOMMU
>>>> backend we are discussing here.
>>>>
>>>> And is it also semantically correct for an IOMMU backend to handle both
>>>> the w/ and w/o IOMMU hardware cases? :)
>>>
>>> A problem with the iommu doing the dma_map_page(), though, is: for what
>>> device does it do this? In the mediated case the vfio infrastructure
>>> is dealing with a software representation of a device. For all we
>>> know, that software model could transparently migrate from one physical
>>> GPU to another. There may not even be a physical device backing
>>> the mediated device. Those are details left to the vgpu driver itself.
>>>
>>
>> Great point :) Yes, I agree it's a bit intrusive to do the mapping for
>> a particular pdev in a vGPU IOMMU BE.
>>
>>> Perhaps one possibility would be to allow the vgpu driver to register
>>> map and unmap callbacks. The unmap callback might provide the
>>> invalidation interface that we're so far missing. The combination of
>>> map and unmap callbacks might simplify the Intel approach of pinning the
>>> entire VM memory space, i.e. for each map callback do a translation
>>> (pin) and dma_map_page, and for each unmap do a dma_unmap_page and
>>> release the translation.
>>
>> Yes, adding map/unmap ops in the pGPU driver (I assume you are referring
>> to gpu_device_ops as implemented in Kirti's patch) sounds like a good
>> idea, satisfying both: 1) keeping vGPU purely virtual; 2) dealing with
>> the Linux DMA API to achieve hardware IOMMU compatibility.
>>
>> PS, this has very little to do with pinning wholly or partially. Intel
>> KVMGT once had the whole guest memory pinned, only because we used a
>> spinlock, which can't sleep at runtime. We have removed that spinlock in
>> another upstreaming effort, not here but in the i915 driver, so probably
>> no biggie.
>>
>
> OK, then you guys don't need to pin everything.

Yes :)

> The next question will be if you can send the pinning request from your
> mediated driver backend to request memory pinning, like we have
> demonstrated in the v3 patch with the functions vfio_pin_pages and
> vfio_unpin_pages?

Kind of, but not exactly. IMO the mediated driver backend cares not only
about pinning, but also, more importantly, about translation. The
vfio_pin_pages of the v3 patch does the pinning and translation
simultaneously, whereas I do think the API is better named 'translate'
instead of 'pin', like v2 did.

We possibly have the same requirement from the mediated driver backend:

	a) get a GFN, when the guest tries to tell it to the hardware;
	b) consult the vfio iommu with that GFN [1]: will you find me a
	   proper dma_addr?

The vfio iommu backend then searches its tracking table with this GFN [1]:

	c) if an entry is found, return the dma_addr;
	d) if nothing is found, call GUP to pin the page, and dma_map_page
	   to get the dma_addr [2], then return it.

The dma_addr will be told to the real GPU hardware.
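Roughly, as a sketch of steps a)-d) above (struct vgpu_iommu,
vgpu_dma_lookup(), vgpu_dma_insert() and gfn_to_vaddr() are illustrative
names only, not the actual v3 patch code):

#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/rbtree.h>

struct vgpu_iommu;	/* per-container tracking state (illustrative) */

struct vgpu_dma {
	struct rb_node	node;		/* keyed by gfn */
	unsigned long	gfn;
	struct page	*page;
	dma_addr_t	dma_addr;
};

static struct vgpu_dma *vgpu_dma_lookup(struct vgpu_iommu *iommu,
					unsigned long gfn);
static void vgpu_dma_insert(struct vgpu_iommu *iommu, unsigned long gfn,
			    struct page *page, dma_addr_t dma_addr);
static unsigned long gfn_to_vaddr(struct vgpu_iommu *iommu, unsigned long gfn);

static int vgpu_gfn_to_dma(struct vgpu_iommu *iommu, struct device *dev,
			   unsigned long gfn, dma_addr_t *dma_addr)
{
	struct vgpu_dma *entry;
	struct page *page;
	int ret;

	/* c) already translated: just return the cached dma_addr */
	entry = vgpu_dma_lookup(iommu, gfn);
	if (entry) {
		*dma_addr = entry->dma_addr;
		return 0;
	}

	/* d) first time: pin the backing page (write=1, 2016-era GUP API) */
	ret = get_user_pages_fast(gfn_to_vaddr(iommu, gfn), 1, 1, &page);
	if (ret != 1)
		return -EFAULT;

	/* ... map it through the DMA API (HPA w/o IOMMU, IOVA w/ IOMMU) ... */
	*dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
	if (dma_mapping_error(dev, *dma_addr)) {
		put_page(page);
		return -EFAULT;
	}

	/* ... and record the translation for later requests. */
	vgpu_dma_insert(iommu, gfn, page, *dma_addr);
	return 0;
}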
I can't simply say 'yes' here, since we may consult the dma_addr for a GFN
multiple times, but only the first time do we need to pin the page. IOW,
pinning is kind of an internal action in the iommu backend.

// Sorry for the long, maybe boring explanation... :)

[1] GFN or vaddr, no biggie
[2] As pointed out by Alex, dma_map_page can be called elsewhere, e.g. from
    a callback.

--
Thanks,
Jike