Date: Wed, 27 Jan 2016 13:48:37 -0800
From: Neo Jia
Message-ID: <20160127214837.GA6182@nvidia.com>
References: <56A6083E.10703@intel.com> <1453757426.32741.614.camel@redhat.com> <20160126102003.GA14400@nvidia.com> <1453838773.15515.1.camel@redhat.com> <20160126222830.GB21927@nvidia.com> <1453851038.18049.9.camel@redhat.com> <20160127091432.GA10936@nvidia.com> <1453911016.6261.3.camel@redhat.com>
In-Reply-To: <1453911016.6261.3.camel@redhat.com>
Subject: Re: [Qemu-devel] VFIO based vGPU (was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
To: Alex Williamson
Cc: "Ruan, Shuai", "Tian, Kevin", kvm@vger.kernel.org, igvt-g@lists.01.org, "Song, Jike", qemu-devel, Kirti Wankhede, "Lv, Zhiyuan", Paolo Bonzini, Gerd Hoffmann

On Wed, Jan 27, 2016 at 09:10:16AM -0700, Alex Williamson wrote:
> On Wed, 2016-01-27 at 01:14 -0800, Neo Jia wrote:
> > On Tue, Jan 26, 2016 at 04:30:38PM -0700, Alex Williamson wrote:
> > > On Tue, 2016-01-26 at 14:28 -0800, Neo Jia wrote:
> > > > On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote:
> > > > > > 1.1 Under per-physical device sysfs:
> > > > > > ----------------------------------------------------------------------------------
> > > > > >
> > > > > > vgpu_supported_types - RO, list the currently supported virtual GPU types and their
> > > > > > VGPU_ID. VGPU_ID - a vGPU type identifier returned from reads of
> > > > > > "vgpu_supported_types".
> > > > > >
> > > > > > vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, create a virtual
> > > > > > gpu device on a target physical GPU. idx: virtual device index inside a VM
> > > > > >
> > > > > > vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroy a virtual gpu device on a
> > > > > > target physical GPU
> > > > >
> > > > >
> > > > > I've noted in previous discussions that we need to separate user policy
> > > > > from kernel policy here, the kernel policy should not require a "VM
> > > > > UUID".  A UUID simply represents a set of one or more devices and an
> > > > > index picks the device within the set.  Whether that UUID matches a VM
> > > > > or is independently used is up to the user policy when creating the
> > > > > device.
> > > > >
> > > > > Personally I'd also prefer to get rid of the concept of indexes within a
> > > > > UUID set of devices and instead have each device be independent.  This
> > > > > seems to be an imposition of the nvidia implementation onto the kernel
> > > > > interface design.
> > > > >
> > > >
> > > > Hi Alex,
> > > >
> > > > I agree with you that we should not put the UUID concept into a kernel API. At
> > > > this point (without any prototyping), I am thinking of using a list of virtual
> > > > devices instead of a UUID.
> > >
> > > Hi Neo,
> > >
> > > A UUID is a perfectly fine name, so long as we let it be just a UUID and
> > > not the UUID matching some specific use case.
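
To make that direction concrete, below is a very rough sketch of what a
policy-free "vgpu_create" store could look like if the UUID stays an opaque
name for a set of devices (no per-UUID index) and the type id comes from
"vgpu_supported_types". The "<uuid>:<type_id>" input format and the
vgpu_core_create() helper are placeholders for illustration only, not a
proposed ABI:

/* Illustrative sketch only -- all names below are placeholders. */
#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/string.h>

/* hypothetical helper implemented by the vgpu core module */
extern int vgpu_core_create(struct device *parent, const char *uuid,
			    unsigned int type_id);

static ssize_t vgpu_create_store(struct device *dev,
				 struct device_attribute *attr,
				 const char *buf, size_t count)
{
	char *input, *type_str, *uuid;
	unsigned int type_id;
	int ret;

	input = kstrndup(buf, count, GFP_KERNEL);
	if (!input)
		return -ENOMEM;

	/* split "<uuid>:<type_id>" -- the uuid is just an opaque set name */
	type_str = input;
	uuid = strsep(&type_str, ":");
	if (!type_str || kstrtouint(strim(type_str), 0, &type_id)) {
		ret = -EINVAL;
		goto out;
	}

	ret = vgpu_core_create(dev, strim(uuid), type_id);
out:
	kfree(input);
	return ret ? ret : count;
}
static DEVICE_ATTR_WO(vgpu_create);

From userspace that would just be, e.g., echo "$uuid:3" > .../vgpu_create,
with "3" being whatever type id the user picked from vgpu_supported_types.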
> > >
> > > > > >
> > > > > > int vgpu_map_virtual_bar
> > > > > > (
> > > > > >     uint64_t virt_bar_addr,
> > > > > >     uint64_t phys_bar_addr,
> > > > > >     uint32_t len,
> > > > > >     uint32_t flags
> > > > > > )
> > > > > >
> > > > > > EXPORT_SYMBOL(vgpu_map_virtual_bar);
> > > > >
> > > > >
> > > > > Per the implementation provided, this needs to be implemented in the
> > > > > vfio device driver, not in the iommu interface.  Finding the DMA mapping
> > > > > of the device and replacing it is wrong.  It should be remapped at the
> > > > > vfio device file interface using vm_ops.
> > > > >
> > > >
> > > > So you are basically suggesting that we are going to take an mmap fault and,
> > > > within that fault handler, go into the vendor driver to look up the
> > > > "pre-registered" mapping and remap there.
> > > >
> > > > Is my understanding correct?
> > >
> > > Essentially, hopefully the vendor driver will have already registered
> > > the backing for the mmap prior to the fault, but either way could work.
> > > I think the key though is that you want to remap it onto the vma
> > > accessing the vfio device file, not scanning it out of an IOVA mapping
> > > that might be dynamic and doing a vma lookup based on the point in time
> > > mapping of the BAR.  The latter doesn't give me much confidence that
> > > mappings couldn't change while the former should be a one time fault.
> >
> > Hi Alex,
> >
> > The fact is that the vendor driver can only prevent such an mmap fault by looking
> > up the mapping table that we have saved from the IOMMU memory listener
>
> Why do we need to prevent the fault?  We need to handle the fault when
> it occurs.
>
> > when the guest region gets programmed. Also, as you have mentioned below, such a
> > mapping between iova and hva shouldn't change as long as the SBIOS and
> > guest OS are done with their job.
>
> But you don't know they're done with their job.
>
> > Yes, you are right that it is a one-time fault, but the gpu work is heavily pipelined.
>
> Why does that matter?  We're talking about the first time the VM
> accesses the range of the BAR that will be direct mapped to the physical
> GPU.  This isn't going to happen in the middle of a benchmark, it's
> going to happen during driver initialization in the guest.
>
> > Probably we should just limit this interface to the guest MMIO region, and we can
> > crosscheck with the VFIO driver, which has monitored the config space
> > accesses, to make sure nothing gets moved around?
>
> No, the solution for the bar is very clear, map on fault to the vma
> accessing the mmap and be done with it for the remainder of this
> instance of the VM.
>

Hi Alex,

I totally get your points; my previous comments were just trying to explain the
reasoning behind our current implementation. I think I have found a way to hide
the latency of the mmap fault, which might happen in the middle of running a
benchmark.

I will add a new registration interface to allow the driver vendor to provide a
fault handler callback, and from there pages will be installed.
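
Very roughly, the idea is something like the sketch below, written against the
4.x-era vm_operations_struct fault signature. The vgpu_device and
vgpu_vendor_ops structures and the bar_fault() callback are placeholder names
meant to show the shape of the registration, not the final code:

#include <linux/mm.h>

struct vgpu_device;

/* placeholder vendor callback table registered with the vgpu core */
struct vgpu_vendor_ops {
	/* resolve a virtual BAR offset to a host pfn; returns 0 or -errno */
	int (*bar_fault)(struct vgpu_device *vgpu, u64 offset,
			 unsigned long *pfn);
};

struct vgpu_device {
	const struct vgpu_vendor_ops *ops;
};

static int vgpu_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct vgpu_device *vgpu = vma->vm_private_data;
	unsigned long addr = (unsigned long)vmf->virtual_address;
	u64 offset = (addr - vma->vm_start) + (vma->vm_pgoff << PAGE_SHIFT);
	unsigned long pfn;
	int ret;

	/* ask the vendor driver which host page backs this BAR offset */
	if (vgpu->ops->bar_fault(vgpu, offset, &pfn))
		return VM_FAULT_SIGBUS;

	/* install the real MMIO page into the vma of the vfio device file */
	ret = vm_insert_pfn(vma, addr, pfn);
	if (ret && ret != -EBUSY)
		return VM_FAULT_SIGBUS;

	return VM_FAULT_NOPAGE;
}

static const struct vm_operations_struct vgpu_mmio_vm_ops = {
	.fault = vgpu_mmio_fault,
};

/* called from the vfio device file mmap path: map nothing up front and
 * let the first guest access fault the real BAR pages in on demand */
static int vgpu_mmio_mmap(struct vgpu_device *vgpu, struct vm_area_struct *vma)
{
	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
	vma->vm_private_data = vgpu;
	vma->vm_ops = &vgpu_mmio_vm_ops;
	return 0;
}

The bar_fault() callback could also map the whole direct-mapped segment at
once instead of a page at a time; the sketch is only meant to show where the
registration and the fault-time insertion would sit.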
> > > In case it's not clear to folks at Intel, the purpose of this is that a
> > > vGPU may directly map a segment of the physical GPU MMIO space, but we
> > > may not know what segment that is at setup time, when QEMU does an mmap
> > > of the vfio device file descriptor.  The thought is that we can create
> > > an invalid mapping when QEMU calls mmap(), knowing that it won't be
> > > accessed until later, then we can fault in the real mmap on demand.  Do
> > > you need anything similar?
> > >
> > > > >
> > > > > > int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
> > > > > >
> > > > > > EXPORT_SYMBOL(vgpu_dma_do_translate);
> > > > > >
> > > > > > Still a lot to be added and modified, such as supporting multiple VMs and
> > > > > > multiple virtual devices, tracking the mapped / pinned regions within the VGPU IOMMU
> > > > > > kernel driver, error handling, roll-back, locked memory size per user, etc.
> > > > >
> > > > > Particularly, handling of mapping changes is completely missing.  This
> > > > > cannot be a point in time translation, the user is free to remap
> > > > > addresses whenever they wish and device translations need to be updated
> > > > > accordingly.
> > > > >
> > > >
> > > > When you say "user", do you mean QEMU?
> > >
> > > vfio is a generic userspace driver interface, QEMU is a very, very
> > > important user of the interface, but not the only user.  So for this
> > > conversation, we're mostly talking about QEMU as the user, but we should
> > > be careful about assuming QEMU is the only user.
> > >
> >
> > Understood. I have to say that our focus at this moment is to support QEMU and
> > KVM, but I know the VFIO interface is much more than that, and that is why I think
> > it is right to leverage this framework so we can explore future use cases in
> > userland together.
> >
> >
> > > > Here, whatever DMA the guest driver is going to launch will first be pinned
> > > > within the VM and then registered to QEMU, and therefore to the IOMMU memory
> > > > listener; eventually the pages will be pinned by the GPU or DMA engine.
> > > >
> > > > Since we are keeping the upper level code the same, thinking about the passthrough
> > > > case, where the GPU has already put the real IOVA into its PTEs, I don't know how
> > > > QEMU can change that mapping without causing an IOMMU fault on an active DMA device.
> > >
> > > For the virtual BAR mapping above, it's easy to imagine that mapping a
> > > BAR to a given address is at the guest's discretion, it may be mapped and
> > > unmapped, it may be mapped to different addresses at different points in
> > > time, the guest BIOS may choose to map it at yet another address, etc.
> > > So if somehow we were trying to set up a mapping for peer-to-peer, there
> > > are lots of ways that IOVA could change.  But even with RAM, we can
> > > support memory hotplug in a VM.  What was once a DMA target may be
> > > removed or may now be backed by something else.  Chipset configuration
> > > on the emulated platform may change how guest physical memory appears
> > > and that might change between VM boots.
> > >
> > > Currently with physical device assignment the memory listener watches
> > > for both maps and unmaps and updates the iotlb to match.  Just like real
> > > hardware doing these same sorts of things, we rely on the guest to stop
> > > using memory that's going to be moved as a DMA target prior to moving
> > > it.
> >
> > Right, you can only do that when the device is quiescent.
> >
> > As long as this is notified to the guest, I think we should be able to
> > support it, although the real implementation will depend on how the device gets
> > into a quiescent state.
> >
> > This is definitely a very interesting feature we should explore, but I hope we
> > can first focus on the most basic functionality.
>
> If we only do a point-in-time translation and assume it never changes,
> that's good enough for a proof of concept, but it's not a complete
> solution.  I think this is a practical problem, not just an academic
> problem.  There needs to be a mechanism for mappings to be invalidated based
> on VM memory changes.  Thanks,
>

Sorry, probably my previous comment was not very clear. I highly value your input
and the information related to the memory hotplug scenarios, and I have never
excluded support for such a feature. The only question is when, and that is why I
would like to defer the VM memory hotplug feature to phase 2, after the initial
official launch.
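
To make sure phase 2 can slot in cleanly, I am thinking of pairing the translate
call with an invalidation callback from the very beginning, so the same memory
listener unmap events that physical assignment already handles can be forwarded
to the vendor driver. A very rough sketch, where all the names are placeholders
rather than a proposed API:

#include <linux/types.h>

struct vgpu_device;

/* placeholder callback table between the vGPU IOMMU module and the
 * vendor driver */
struct vgpu_iommu_ops {
	/* pin the guest pfns and return the host dma addresses in place,
	 * mirroring vgpu_dma_do_translate() above */
	int  (*translate)(struct vgpu_device *vgpu, dma_addr_t *gfn_buffer,
			  uint32_t count);
	/* driven by memory listener unmaps / layout changes: unpin and
	 * flush device translations for [gfn, gfn + count) before the
	 * backing goes away */
	void (*invalidate)(struct vgpu_device *vgpu, uint64_t gfn,
			   uint64_t count);
};

The initial release would only wire invalidate() into teardown, but having the
hook in place should let the memory hotplug case be added later without an
interface change.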
Thanks,
Neo

> Alex
>