From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 27 Jan 2016 01:14:53 -0800
From: Neo Jia
To: Alex Williamson
Cc: "Ruan, Shuai", "Tian, Kevin", "kvm@vger.kernel.org", "igvt-g@lists.01.org",
 "Song, Jike", qemu-devel, Kirti Wankhede, "Lv, Zhiyuan", Paolo Bonzini,
 Gerd Hoffmann
Subject: Re: [Qemu-devel] VFIO based vGPU (was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
Message-ID: <20160127091432.GA10936@nvidia.com>
In-Reply-To: <1453851038.18049.9.camel@redhat.com>
References: <1453143919.32741.169.camel@redhat.com> <569F4C86.2070501@intel.com>
 <56A6083E.10703@intel.com> <1453757426.32741.614.camel@redhat.com>
 <20160126102003.GA14400@nvidia.com> <1453838773.15515.1.camel@redhat.com>
 <20160126222830.GB21927@nvidia.com> <1453851038.18049.9.camel@redhat.com>

On Tue, Jan 26, 2016 at 04:30:38PM -0700, Alex Williamson wrote:
> On Tue, 2016-01-26 at 14:28 -0800, Neo Jia wrote:
> > On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote:
> > > > 1.1 Under per-physical device sysfs:
> > > > ----------------------------------------------------------------------------------
> > > >
> > > > vgpu_supported_types - RO, lists the currently supported virtual GPU types and
> > > > their VGPU_IDs. VGPU_ID - a vGPU type identifier returned from reads of
> > > > "vgpu_supported_types".
> > > >
> > > > vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, creates a virtual
> > > > gpu device on a target physical GPU. idx: virtual device index inside a VM
> > > >
> > > > vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroys a virtual gpu device on a
> > > > target physical GPU
> > >
> > >
> > > I've noted in previous discussions that we need to separate user policy
> > > from kernel policy here, the kernel policy should not require a "VM
> > > UUID".  A UUID simply represents a set of one or more devices and an
> > > index picks the device within the set.  Whether that UUID matches a VM
> > > or is independently used is up to the user policy when creating the
> > > device.
> > >
> > > Personally I'd also prefer to get rid of the concept of indexes within a
> > > UUID set of devices and instead have each device be independent.  This
> > > seems to be an imposition of the nvidia implementation onto the kernel
> > > interface design.
> > >
> >
> > Hi Alex,
> >
> > I agree with you that we should not put the UUID concept into a kernel API. At
> > this point (without any prototyping), I am thinking of using a list of virtual
> > devices instead of a UUID.
>
> Hi Neo,
>
> A UUID is a perfectly fine name, so long as we let it be just a UUID and
> not the UUID matching some specific use case.
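(To make the create path above concrete, here is a rough, untested sketch of how
the vgpu_create store handler could parse that input.  It is only an illustration:
vgpu_create_device() is a made-up helper, and the real code will obviously differ.)

#include <linux/device.h>
#include <linux/pci.h>
#include <linux/string.h>

/* Hypothetical vendor entry point, not part of the proposal as posted. */
int vgpu_create_device(struct pci_dev *pdev, const char *uuid,
                       unsigned int idx, unsigned int vgpu_id);

/* Parse "<UUID>:<device index>:<vGPU type id>" written to vgpu_create. */
static ssize_t vgpu_create_store(struct device *dev,
                                 struct device_attribute *attr,
                                 const char *buf, size_t count)
{
        char uuid[37] = {0};
        const char *p;
        unsigned int idx, vgpu_id;
        int ret;

        p = strchr(buf, ':');
        if (!p || p - buf != 36)        /* canonical UUID is 36 chars */
                return -EINVAL;
        memcpy(uuid, buf, 36);

        if (sscanf(p + 1, "%u:%u", &idx, &vgpu_id) != 2)
                return -EINVAL;

        ret = vgpu_create_device(to_pci_dev(dev), uuid, idx, vgpu_id);
        return ret ? ret : count;
}
static DEVICE_ATTR_WO(vgpu_create);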
>
> > > >
> > > > int vgpu_map_virtual_bar
> > > > (
> > > >     uint64_t virt_bar_addr,
> > > >     uint64_t phys_bar_addr,
> > > >     uint32_t len,
> > > >     uint32_t flags
> > > > )
> > > >
> > > > EXPORT_SYMBOL(vgpu_map_virtual_bar);
> > >
> > >
> > > Per the implementation provided, this needs to be implemented in the
> > > vfio device driver, not in the iommu interface.  Finding the DMA mapping
> > > of the device and replacing it is wrong.  It should be remapped at the
> > > vfio device file interface using vm_ops.
> > >
> >
> > So you are basically suggesting that we are going to take a mmap fault and,
> > within that fault handler, we will go into the vendor driver to look up the
> > "pre-registered" mapping and remap there.
> >
> > Is my understanding correct?
>
> Essentially, hopefully the vendor driver will have already registered
> the backing for the mmap prior to the fault, but either way could work.
> I think the key though is that you want to remap it onto the vma
> accessing the vfio device file, not scanning it out of an IOVA mapping
> that might be dynamic and doing a vma lookup based on the point in time
> mapping of the BAR.  The latter doesn't give me much confidence that
> mappings couldn't change while the former should be a one time fault.

Hi Alex,

The fact is that the vendor driver can only resolve such a mmap fault by looking
up the mapping table that we have saved from the IOMMU memory listener when the
guest region gets programmed. Also, like you have mentioned below, such a mapping
between iova and hva shouldn't change as long as the SBIOS and guest OS are done
with their job.

Yes, you are right that it is a one time fault, but the gpu work is heavily
pipelined.

Probably we should just limit this interface to the guest MMIO region, and we can
have some crosscheck with the VFIO driver, which has monitored the config space
access, to make sure nothing is getting moved around?
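Just to make sure we are looking at the same picture, below is a rough sketch of
the fault-based remap as I understand your suggestion. It is illustrative only:
struct vgpu_vdev and vgpu_find_bar_backing() are made-up names, and locking and
error handling are omitted.

#include <linux/mm.h>

/* Hypothetical per-vGPU state and lookup helper. */
struct vgpu_vdev;
int vgpu_find_bar_backing(struct vgpu_vdev *vdev, pgoff_t pgoff,
                          unsigned long *pfn);

static int vgpu_mmap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
        struct vgpu_vdev *vdev = vma->vm_private_data;
        unsigned long pfn;

        /*
         * The vendor driver already knows which segment of the physical
         * GPU BAR backs this vGPU, so insert that page into the vma of
         * the vfio device file instead of chasing the guest IOVA mapping.
         */
        if (vgpu_find_bar_backing(vdev, vmf->pgoff, &pfn))
                return VM_FAULT_SIGBUS;

        if (vm_insert_pfn(vma, (unsigned long)vmf->virtual_address, pfn))
                return VM_FAULT_SIGBUS;

        return VM_FAULT_NOPAGE;
}

static const struct vm_operations_struct vgpu_mmap_ops = {
        .fault = vgpu_mmap_fault,
};

/* In the vfio device mmap() callback: leave the range unpopulated and let
 * the fault handler above fill it in on first access. */
static int vgpu_mmap(struct vgpu_vdev *vdev, struct vm_area_struct *vma)
{
        vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
        vma->vm_private_data = vdev;
        vma->vm_ops = &vgpu_mmap_ops;
        return 0;
}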
>
> In case it's not clear to folks at Intel, the purpose of this is that a
> vGPU may directly map a segment of the physical GPU MMIO space, but we
> may not know what segment that is at setup time, when QEMU does an mmap
> of the vfio device file descriptor.  The thought is that we can create
> an invalid mapping when QEMU calls mmap(), knowing that it won't be
> accessed until later, then we can fault in the real mmap on demand.  Do
> you need anything similar?
>
> > >
> > > > int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
> > > >
> > > > EXPORT_SYMBOL(vgpu_dma_do_translate);
> > > >
> > > > Still a lot to be added and modified, such as supporting multiple VMs and
> > > > multiple virtual devices, tracking the mapped / pinned region within the VGPU
> > > > IOMMU kernel driver, error handling, roll-back and locked memory size per user, etc.
> > >
> > > Particularly, handling of mapping changes is completely missing.  This
> > > cannot be a point in time translation, the user is free to remap
> > > addresses whenever they wish and device translations need to be updated
> > > accordingly.
> > >
> >
> > When you say "user", do you mean QEMU?
>
> vfio is a generic userspace driver interface, QEMU is a very, very
> important user of the interface, but not the only user.  So for this
> conversation, we're mostly talking about QEMU as the user, but we should
> be careful about assuming QEMU is the only user.
>

Understood.

I have to say that our focus at this moment is to support QEMU and KVM, but I
know the VFIO interface is much more than that, and that is why I think it is
right to leverage this framework so we can explore future use cases in userland
together.

> > Here, the memory that the guest driver is going to use for DMA will first be
> > pinned within the VM and then registered to QEMU, and therefore to the IOMMU
> > memory listener; eventually the pages will be pinned by the GPU or DMA engine.
> >
> > Since we are keeping the upper level code the same, thinking about the passthru
> > case, where the GPU has already put the real IOVA into its PTEs, I don't know
> > how QEMU can change that mapping without causing an IOMMU fault on an active
> > DMA device.
>
> For the virtual BAR mapping above, it's easy to imagine that mapping a
> BAR to a given address is at the guest's discretion, it may be mapped and
> unmapped, it may be mapped to different addresses at different points in
> time, the guest BIOS may choose to map it at yet another address, etc.
> So if somehow we were trying to setup a mapping for peer-to-peer, there
> are lots of ways that IOVA could change.  But even with RAM, we can
> support memory hotplug in a VM.  What was once a DMA target may be
> removed or may now be backed by something else.  Chipset configuration
> on the emulated platform may change how guest physical memory appears
> and that might change between VM boots.
>
> Currently with physical device assignment the memory listener watches
> for both maps and unmaps and updates the iotlb to match.  Just like real
> hardware doing these same sorts of things, we rely on the guest to stop
> using memory that's going to be moved as a DMA target prior to moving
> it.

Right, you can only do that when the device is quiescent.

As long as this is notified to the guest, I think we should be able to support
it, although the real implementation will depend on how the device gets into a
quiescent state.

This is definitely a very interesting feature we should explore, but I hope we
can first focus on the most basic functionality.

Thanks,
Neo

>
> > > > 4. Modules
> > > > ================================================================================
> > > >
> > > > Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko
> > > >
> > > > vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU
> > > >                            TYPE1 v1 and v2 interface.
> > >
> > > Depending on how intrusive it is, this can possibly be done within the
> > > existing type1 driver.  Either that or we can split out common code for
> > > use by a separate module.
> > >
> > > > vgpu.ko                  - provides the registration interface and virtual
> > > >                            device VFIO access.
> > > >
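(To give a rough idea of the registration interface provided by vgpu.ko: a GPU
vendor driver such as nvidia.ko would register each physical GPU together with a
set of callbacks, and vgpu.ko would expose the per-device sysfs files and the
VFIO device on its behalf.  The names below are purely illustrative and are not
taken from the actual code.)

#include <linux/device.h>
#include <linux/pci.h>

struct vgpu_device_ops {
        /* called when vgpu_create is written for this physical GPU */
        int  (*vgpu_create)(struct pci_dev *pdev, const char *uuid,
                            unsigned int idx, unsigned int vgpu_id);
        /* called when vgpu_destroy is written */
        void (*vgpu_destroy)(struct pci_dev *pdev, const char *uuid,
                             unsigned int idx);
        /* VFIO region access for the emulated config and MMIO spaces */
        ssize_t (*read)(struct pci_dev *pdev, char *buf, size_t count,
                        loff_t pos);
        ssize_t (*write)(struct pci_dev *pdev, const char *buf, size_t count,
                         loff_t pos);
};

int vgpu_register_device(struct pci_dev *pdev,
                         const struct vgpu_device_ops *ops);
void vgpu_unregister_device(struct pci_dev *pdev);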
> > > > 5. QEMU note
> > > > ================================================================================
> > > >
> > > > To allow us to focus on the VGPU kernel driver prototyping, we have introduced
> > > > a new VFIO class - vgpu - inside QEMU, so we don't have to change the existing
> > > > vfio/pci.c file and can use it as a reference for our implementation. It is
> > > > basically just a quick c & p from vfio/pci.c to quickly meet our needs.
> > > >
> > > > Once this proposal is finalized, we will move to vfio/pci.c instead of a new
> > > > class, and probably the only thing required is to have a new way to discover the
> > > > device.
> > > >
> > > > 6. Examples
> > > > ================================================================================
> > > >
> > > > On this server, we have two NVIDIA M60 GPUs.
> > > >
> > > > [root@cjia-vgx-kvm ~]# lspci -d 10de:13f2
> > > > 86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> > > > 87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> > > >
> > > > After nvidia.ko gets initialized, we can query the supported vGPU types by
> > > > reading "vgpu_supported_types" like the following:
> > > >
> > > > [root@cjia-vgx-kvm ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
> > > > 11:GRID M60-0B
> > > > 12:GRID M60-0Q
> > > > 13:GRID M60-1B
> > > > 14:GRID M60-1Q
> > > > 15:GRID M60-2B
> > > > 16:GRID M60-2Q
> > > > 17:GRID M60-4Q
> > > > 18:GRID M60-8Q
> > > >
> > > > For example, the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and we would
> > > > like to create a "GRID M60-4Q" vGPU on it for that VM:
> > > >
> > > > echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" >
> > > > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create
> > > >
> > > > Note: the number 0 here is the vGPU device index. So far the change has not been
> > > > tested with multiple vgpu devices yet, but we will support that.
> > > >
> > > > At this moment, if you query "vgpu_supported_types" it will still show all
> > > > supported virtual GPU types, as no virtual GPU resource is committed yet.
> > > >
> > > > Starting the VM:
> > > >
> > > > echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start
> > > >
> > > > then, the supported vGPU type query will return:
> > > >
> > > > [root@cjia-vgx-kvm /home/cjia]$
> > > > cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
> > > > 17:GRID M60-4Q
> > > >
> > > > So "vgpu_supported_types" needs to be re-read whenever a new virtual device gets
> > > > created, as the underlying HW might limit the supported types if there are
> > > > any existing VMs running.
> > > >
> > > > Then, when the VM gets shut down, a write to /sys/class/vgpu/vgpu_shutdown will
> > > > inform the GPU vendor driver to clean up resources.
> > > >
> > > > Eventually, those virtual GPUs can be removed by writing to vgpu_destroy under
> > > > device sysfs.
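(For reference, one possible shape of the handler behind /sys/class/vgpu/vgpu_start;
again this is only a sketch, and vgpu_start_devices() is a hypothetical helper that
asks the vendor driver to commit resources for every vGPU created under the given
UUID before the VM starts.)

#include <linux/device.h>
#include <linux/string.h>

int vgpu_start_devices(const char *uuid);   /* hypothetical helper */

static ssize_t vgpu_start_store(struct class *class,
                                struct class_attribute *attr,
                                const char *buf, size_t count)
{
        char uuid[37] = {0};
        int ret;

        if (count < 36)
                return -EINVAL;
        memcpy(uuid, buf, 36);

        ret = vgpu_start_devices(uuid);
        return ret ? ret : count;
}

static struct class_attribute vgpu_start_attr =
        __ATTR(vgpu_start, S_IWUSR, NULL, vgpu_start_store);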
> > >
> > >
> > > I'd like to hear Intel's thoughts on this interface.  Are there
> > > different vgpu capacities or priority classes that would necessitate
> > > different types of vgpus on Intel?
> > >
> > > I think there are some gaps in translating from named vgpu types to
> > > indexes here, along with my previous mention of the UUID/set oddity.
> > >
> > > Does Intel have a need for start and shutdown interfaces?
> > >
> > > Neo, wasn't there at some point information about how many of each type
> > > could be supported through these interfaces?  How does a user know their
> > > capacity limits?
> > >
> >
> > Thanks for reminding me of that; I think we probably forgot to put that *important*
> > information in the output of "vgpu_supported_types".
> >
> > Regarding the capacity, we can provide the frame buffer size as part of the
> > "vgpu_supported_types" output as well; I would imagine those will eventually
> > show up in the OpenStack management interface or virt-mgr.
> >
> > Basically, yes, there would be a separate column to show the number of instances
> > you can create for each type of vGPU on a specific physical GPU.
>
> Ok, Thanks,
>
> Alex
>