From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:58852) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aOBmJ-0004Ty-RZ for qemu-devel@nongnu.org; Tue, 26 Jan 2016 17:08:01 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1aOBmG-0000xu-Fi for qemu-devel@nongnu.org; Tue, 26 Jan 2016 17:07:59 -0500
Received: from mx1.redhat.com ([209.132.183.28]:48399) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aOBmG-0000xO-5M for qemu-devel@nongnu.org; Tue, 26 Jan 2016 17:07:56 -0500
Message-ID: <1453846073.18049.3.camel@redhat.com>
From: Alex Williamson
Date: Tue, 26 Jan 2016 15:07:53 -0700
In-Reply-To:
References: <569C5071.6080004@intel.com> <1453092476.32741.67.camel@redhat.com> <569CA8AD.6070200@intel.com> <1453143919.32741.169.camel@redhat.com> <569F4C86.2070501@intel.com> <56A6083E.10703@intel.com> <1453757426.32741.614.camel@redhat.com> <56A72313.9030009@intel.com> <56A77D2D.40109@gmail.com> <1453826249.26652.54.camel@redhat.com> <1453844613.18049.1.camel@redhat.com>
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
To: "Tian, Kevin", Yang Zhang, "Song, Jike"
Cc: "Ruan, Shuai", Neo Jia, "kvm@vger.kernel.org", "igvt-g@lists.01.org", qemu-devel, Gerd Hoffmann, Paolo Bonzini, "Lv, Zhiyuan"

On Tue, 2016-01-26 at 21:50 +0000, Tian, Kevin wrote:
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Wednesday, January 27, 2016 5:44 AM
> >
> > On Tue, 2016-01-26 at 21:21 +0000, Tian, Kevin wrote:
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Wednesday, January 27, 2016 12:37 AM
> > > >
> > > > On Tue, 2016-01-26 at 22:05 +0800, Yang Zhang wrote:
> > > > > On 2016/1/26 15:41, Jike Song wrote:
> > > > > > On 01/26/2016 05:30 AM, Alex Williamson wrote:
> > > > > > > [cc +Neo @Nvidia]
> > > > > > >
> > > > > > > Hi Jike,
> > > > > > >
> > > > > > > On Mon, 2016-01-25 at 19:34 +0800, Jike Song wrote:
> > > > > > > > On 01/20/2016 05:05 PM, Tian, Kevin wrote:
> > > > > > > > > I would expect we can spell out next level tasks toward above
> > > > > > > > > direction, upon which Alex can easily judge whether there are
> > > > > > > > > some common VFIO framework changes that he can help :-)
> > > > > > > >
> > > > > > > > Hi Alex,
> > > > > > > >
> > > > > > > > Here is a draft task list after a short discussion w/ Kevin,
> > > > > > > > would you please have a look?
> > > > > > > >
> > > > > > > >   Bus Driver
> > > > > > > >
> > > > > > > >   { in i915/vgt/xxx.c }
> > > > > > > >
> > > > > > > >   - define a subset of vfio_pci interfaces
> > > > > > > >   - selective pass-through (say aperture)
> > > > > > > >   - trap MMIO: interface w/ QEMU
> > > > > > >
> > > > > > > What's included in the subset?  Certainly the bus reset ioctls really
> > > > > > > don't apply, but you'll need to support the full device interface,
> > > > > > > right?  That includes the region info ioctl and access through the vfio
> > > > > > > device file descriptor as well as the interrupt info and setup ioctls.
> > > > > > >
> > > > > >
> > > > > > [All interfaces I thought are via ioctl:)  For other stuff like file
> > > > > > descriptor we'll definitely keep it.]
> > > > > >
> > > > > > The list of ioctl commands provided by vfio_pci:
> > > > > >
> > > > > >   - VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> > > > > >   - VFIO_DEVICE_PCI_HOT_RESET
> > > > > >
> > > > > > As you said, above 2 don't apply. But for this:
> > > > > >
> > > > > >   - VFIO_DEVICE_RESET
> > > > > >
> > > > > > In my opinion it should be kept, no matter what will be provided in
> > > > > > the bus driver.
> > > > > >
> > > > > >   - VFIO_PCI_ROM_REGION_INDEX
> > > > > >   - VFIO_PCI_VGA_REGION_INDEX
> > > > > >
> > > > > > I suppose above 2 don't apply neither? For a vgpu we don't provide a
> > > > > > ROM BAR or VGA region.
> > > > > >
> > > > > >   - VFIO_DEVICE_GET_INFO
> > > > > >   - VFIO_DEVICE_GET_REGION_INFO
> > > > > >   - VFIO_DEVICE_GET_IRQ_INFO
> > > > > >   - VFIO_DEVICE_SET_IRQS
> > > > > >
> > > > > > Above 4 are needed of course.
> > > > > >
> > > > > > We will need to extend:
> > > > > >
> > > > > >   - VFIO_DEVICE_GET_REGION_INFO
> > > > > >
> > > > > >
> > > > > > a) adding a flag: DONT_MAP. For example, the MMIO of vgpu
> > > > > > should be trapped instead of being mmap-ed.
> > > > >
> > > > > I may not in the context, but i am curious how to handle the DONT_MAP in
> > > > > vfio driver? Since there are no real MMIO maps into the region and i
> > > > > suppose the access to the region should be handled by vgpu in i915
> > > > > driver, but currently most of the mmio accesses are handled by Qemu.
> > > >
> > > > VFIO supports the following region attributes:
> > > >
> > > > #define VFIO_REGION_INFO_FLAG_READ      (1 << 0) /* Region supports read */
> > > > #define VFIO_REGION_INFO_FLAG_WRITE     (1 << 1) /* Region supports write */
> > > > #define VFIO_REGION_INFO_FLAG_MMAP      (1 << 2) /* Region supports mmap */
> > > >
> > > > If MMAP is not set, then the QEMU driver will do pread and/or pwrite to
> > > > the specified offsets of the device file descriptor, depending on what
> > > > accesses are supported.  This is all reported through the REGION_INFO
> > > > ioctl for a given index.  If mmap is supported, the VM will have direct
> > > > access to the area, without faulting to KVM other than to populate the
> > > > mapping.  Without mmap support, a VM MMIO access traps into KVM, which
> > > > returns out to QEMU to service the request, which then finds the
> > > > MemoryRegion serviced through vfio, which will then perform a
> > > > pread/pwrite through to the kernel vfio bus driver to handle the
> > > > access.
> > > > Thanks,
> > > >
> > >
> > > Today KVMGT (not using VFIO yet) registers I/O emulation callbacks to
> > > KVM, so VM MMIO access will be forwarded to KVMGT directly for
> > > emulation in kernel. If we reuse above R/W flags, the whole emulation
> > > path would be unnecessarily long with obvious performance impact. We
> > > either need a new flag here to indicate in-kernel emulation (bias from
> > > passthrough support), or just hide the region alternatively (let KVMGT
> > > to handle I/O emulation itself like today).
> >
> > That sounds like a future optimization TBH.  There's very strict
> > layering between vfio and kvm.  Physical device assignment could make
> > use of it as well, avoiding a round trip through userspace when an
> > ioread/write would do.  Userspace also needs to orchestrate those kinds
> > of accelerators, there might be cases where userspace wants to see those
> > transactions for debugging or manipulating the device.  We can't simply
> > take shortcuts to provide such direct access.  Thanks,
> >
>
> But we have to balance such debugging flexibility and acceptable performance.
> To me the latter one is more important otherwise there'd be no real usage
> around this technique, while for debugging there are other alternative (e.g.
> ftrace) Consider some extreme case with 100k traps/second and then see
> how much impact a 2-3x longer emulation path can bring...

Are you jumping to the conclusion that it cannot be done with proper
layering in place?  Performance is important, but it's not an excuse to
abandon designing interfaces between independent components.  Thanks,

Alex
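
As a rough illustration of the region-flag behaviour described in the quoted text above, here is a minimal C sketch. The three flag values are the ones quoted in the thread. The `access_path` enum and the `region_access_path()` and `region_access_path_selfcheck()` functions are hypothetical names made up for this sketch; they are not part of VFIO or QEMU. The sketch only models the decision as the thread describes it (mmap means direct access, otherwise read/write goes through the trap-and-pread/pwrite path) and does not touch a real device.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values as quoted in the thread above. */
#define VFIO_REGION_INFO_FLAG_READ  (1 << 0) /* Region supports read */
#define VFIO_REGION_INFO_FLAG_WRITE (1 << 1) /* Region supports write */
#define VFIO_REGION_INFO_FLAG_MMAP  (1 << 2) /* Region supports mmap */

/* Hypothetical type for this sketch; not a VFIO or QEMU definition. */
enum access_path {
    ACCESS_NONE,    /* no access type is advertised for the region */
    ACCESS_DIRECT,  /* region can be mmap-ed; guest accesses do not trap */
    ACCESS_TRAPPED  /* guest access traps to KVM, QEMU does pread/pwrite */
};

/* Hypothetical helper: classify a region by the flags REGION_INFO reports. */
enum access_path region_access_path(uint32_t flags)
{
    if (flags & VFIO_REGION_INFO_FLAG_MMAP)
        return ACCESS_DIRECT;
    if (flags & (VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE))
        return ACCESS_TRAPPED;
    return ACCESS_NONE;
}

/* Usage example covering the cases discussed in the thread. */
void region_access_path_selfcheck(void)
{
    /* A vGPU MMIO region without MMAP: every access is emulated. */
    assert(region_access_path(VFIO_REGION_INFO_FLAG_READ |
                              VFIO_REGION_INFO_FLAG_WRITE) == ACCESS_TRAPPED);

    /* A pass-through region (say, the aperture) with MMAP set. */
    assert(region_access_path(VFIO_REGION_INFO_FLAG_READ |
                              VFIO_REGION_INFO_FLAG_WRITE |
                              VFIO_REGION_INFO_FLAG_MMAP) == ACCESS_DIRECT);
}
```

Under this reading, the DONT_MAP behaviour asked for earlier in the thread would correspond to reporting a region with READ and WRITE set but MMAP clear, rather than to a new flag.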