From: Alex Williamson
Subject: Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
Date: Wed, 27 Jan 2016 09:19:49 -0700
Message-ID: <1453911589.6261.5.camel@redhat.com>
In-Reply-To: <56A85913.1020506@intel.com>
To: Jike Song
Cc: "Tian, Kevin", Yang Zhang, Gerd Hoffmann, Paolo Bonzini,
    "Lv, Zhiyuan", "Ruan, Shuai", "kvm@vger.kernel.org", qemu-devel,
    "igvt-g@lists.01.org", Neo Jia

On Wed, 2016-01-27 at 13:43 +0800, Jike Song wrote:
> On 01/27/2016 11:07 AM, Alex Williamson wrote:
> > On Wed, 2016-01-27 at 09:47 +0800, Jike Song wrote:
> > > On 01/27/2016 06:56 AM, Alex Williamson wrote:
> > > > On Tue, 2016-01-26 at 22:39 +0000, Tian, Kevin wrote:
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Wednesday, January 27, 2016 6:27 AM
> > > > > >
> > > > > > On Tue, 2016-01-26 at 22:15 +0000, Tian, Kevin wrote:
> > > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > > > Sent: Wednesday, January 27, 2016 6:08 AM
> > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Today KVMGT (not using VFIO yet) registers I/O emulation callbacks to
> > > > > > > > > > > KVM, so VM MMIO access will be forwarded to KVMGT directly for
> > > > > > > > > > > emulation in kernel. If we reuse above R/W flags, the whole emulation
> > > > > > > > > > > path would be unnecessarily long with obvious performance impact. We
> > > > > > > > > > > either need a new flag here to indicate in-kernel emulation (bias from
> > > > > > > > > > > passthrough support), or just hide the region alternatively (let KVMGT
> > > > > > > > > > > to handle I/O emulation itself like today).
> > > > > > > > > >
> > > > > > > > > > That sounds like a future optimization TBH.  There's very strict
> > > > > > > > > > layering between vfio and kvm.  Physical device assignment could make
> > > > > > > > > > use of it as well, avoiding a round trip through userspace when an
> > > > > > > > > > ioread/write would do.  Userspace also needs to orchestrate those kinds
> > > > > > > > > > of accelerators, there might be cases where userspace wants to see those
> > > > > > > > > > transactions for debugging or manipulating the device.  We can't simply
> > > > > > > > > > take shortcuts to provide such direct access.  Thanks,
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > But we have to balance such debugging flexibility and acceptable performance.
> > > > > > > > > To me the latter one is more important otherwise there'd be no real usage
> > > > > > > > > around this technique, while for debugging there are other alternative (e.g.
> > > > > > > > > ftrace) Consider some extreme case with 100k traps/second and then see
> > > > > > > > > how much impact a 2-3x longer emulation path can bring...
> > > > > > > >
> > > > > > > > Are you jumping to the conclusion that it cannot be done with proper
> > > > > > > > layering in place?  Performance is important, but it's not an excuse to
> > > > > > > > abandon designing interfaces between independent components.  Thanks,
> > > > > > > >
> > > > > > >
> > > > > > > Two are not controversial. My point is to remove unnecessary long trip
> > > > > > > as possible. After another thought, yes we can reuse existing read/write
> > > > > > > flags:
> > > > > > >   - KVMGT will expose a private control variable whether in-kernel
> > > > > > > delivery is required;
> > > > > >
> > > > > > But in-kernel delivery is never *required*.  Wouldn't userspace want to
> > > > > > deliver in-kernel any time it possibly could?
> > > > > >
> > > > > > >   - when the variable is true, KVMGT will register in-kernel MMIO
> > > > > > > emulation callbacks then VM MMIO request will be delivered to KVMGT
> > > > > > > directly;
> > > > > > >   - when the variable is false, KVMGT will not register anything.
> > > > > > > VM MMIO request will then be delivered to Qemu and then ioread/write
> > > > > > > will be used to finally reach KVMGT emulation logic;
> > > > > >
> > > > > > No, that means the interface is entirely dependent on a backdoor through
> > > > > > KVM.  Why can't userspace (QEMU) do something like register an MMIO
> > > > > > region with KVM handled via a provided file descriptor and offset,
> > > > > > couldn't KVM then call the file ops without a kernel exit?  Thanks,
> > > > > >
> > > > >
> > > > > Could you elaborate this thought? If it can achieve the purpose w/o
> > > > > a kernel exit definitely we can adapt to it. :-)
> > > >
> > > > I only thought of it when replying to the last email and have been doing
> > > > some research, but we already do quite a bit of synchronization through
> > > > file descriptors.  The kvm-vfio pseudo device uses a group file
> > > > descriptor to ensure a user has access to a group, allowing some degree
> > > > of interaction between modules.  Eventfds and irqfds already make use of
> > > > f_ops on file descriptors to poke data.  So, if KVM had information that
> > > > an MMIO region was backed by a file descriptor for which it already has
> > > > a reference via fdget() (and verified access rights and whatnot), then
> > > > it ought to be a simple matter to get to f_ops->read/write knowing the
> > > > base offset of that MMIO region.  Perhaps it could even simply use
> > > > __vfs_read/write().  Then we've got a proper reference to the file
> > > > descriptor for ownership purposes and we've transparently jumped across
> > > > modules without any implicit knowledge of the other end.  Could it work?
> > >
> > > This is OK for KVMGT, from fops to vgpu device-model would always be simple.
> > > The only question is, how is KVM hypervisor supposed to get the fd on VM-exitings?
> >
> > Hi Jike,
> >
> > Sorry, I don't understand "on VM-exiting".  KVM would hold a reference
> > to the fd via fdget(), so the vfio device wouldn't be closed until the
> > VM exits and KVM releases that reference.
> >
>
> Sorry for my bad English, I meant VMEXIT, from non-root to kvm hypervisor.
>
> > > copy-and-paste the current implementation of vcpu_mmio_write(), seems
> > > nothing but GPA and len are provided:
> >
> > I presume that an MMIO region is already registered with a GPA and
> > length, the additional information necessary would be a file descriptor
> > and offset into the file descriptor for the base of the MMIO space.
> >
> > >   static int vcpu_mmio_write(struct kvm_vcpu *vcpu, gpa_t addr, int len,
> > >                              const void *v)
> > >   {
> > >           int handled = 0;
> > >           int n;
> > >
> > >           do {
> > >                   n = min(len, 8);
> > >                   if (!(vcpu->arch.apic &&
> > >                         !kvm_iodevice_write(vcpu, &vcpu->arch.apic->dev, addr, n, v))
> > >                       && kvm_io_bus_write(vcpu, KVM_MMIO_BUS, addr, n, v))
> > >                           break;
> > >                   handled += n;
> > >                   addr += n;
> > >                   len -= n;
> > >                   v += n;
> > >           } while (len);
> > >
> > >           return handled;
> > >   }
> > >
> > > If we back a GPA range with a fd, this will also be a 'backdoor'?
> >
> > KVM would simply be able to service the MMIO access using the provided
> > fd and offset.  It's not a back door because we will have created an API
> > for KVM to have a file descriptor and offset registered (by userspace)
> > to handle the access.  Also, KVM does not know the file descriptor is
> > handled by a VFIO device and VFIO doesn't know the read/write accesses
> > are initiated by KVM.  Seems like the question is whether we can fit
> > something like that into the existing KVM MMIO bus/device handlers
> > in-kernel.  Thanks,
> >
>
> Had a look at eventfd, I would say yes, technically we are able to
> achieve the goal: introduce a fd, with fop->{read|write} defined in KVM,
> call into vgpu device-model, also an iodev registered for a MMIO GPA
> range to invoke the fop->{read|write}.  I just didn't understand why
> userspace can't register an iodev via API directly.

Please elaborate on how it would work via iodev.

> Besides, this doesn't necessarily require another thread, right?
> I guess it can be within the VCPU thread?

I would think so too, the vcpu is blocked on the MMIO access, we should
be able to service it in that context.  I hope.

> And this brought another question: except the vfio bus driver and
> iommu backend (and the page_track utility used for guest memory write-protection),
> is it KVMGT allowed to call into kvm.ko (or modify)? Though we are
> becoming less and less willing to do that with VFIO, it's still better
> to know that before going wrong.

kvm and vfio are separate modules, for the most part, they know nothing
about each other and have no hard dependencies between them.  We do have
various accelerations we can use to avoid paths through userspace, but
these are all via APIs that are agnostic of the party on the other end.
For example, vfio signals interrupts through eventfds and has no concept
of whether that eventfd terminates in userspace or into an irqfd in KVM.
vfio supports direct access to device MMIO regions via mmaps, but vfio
has no idea if that mmap gets directly mapped into a VM address space.
Even with posted interrupts, we've introduced an irq bypass manager
allowing interrupt producers and consumers to register independently to
form a connection without directly knowing anything about the other
module.  That sort of proper software layering needs to continue.  It
would be wrong for a vfio bus driver to assume KVM is the user and
directly call into KVM interfaces.  Thanks,

Alex
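
The fd-plus-offset idea discussed in this thread can be illustrated with a
rough userspace sketch.  This is not kernel code and not an existing KVM or
VFIO interface: the names fd_mmio_region, fd_mmio_bus, fd_mmio_register(),
fd_mmio_write(), fd_mmio_read() and fd_mmio_demo() are all hypothetical, and
pread()/pwrite() merely stand in for what an in-kernel handler might do via
the file's read/write ops.  The point it tries to show is the layering: the
dispatcher only knows a GPA range, an fd and a base offset, and nothing about
who implements the fd.

```c
/*
 * Userspace sketch only.  Every name below is hypothetical; none of this
 * is an existing KVM or VFIO API.
 */
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define FD_MMIO_MAX_REGIONS 8

struct fd_mmio_region {
	uint64_t gpa;    /* guest-physical base of the range */
	uint64_t len;    /* length of the range in bytes */
	int fd;          /* backing descriptor, supplied by the registrant */
	off_t offset;    /* offset within fd that corresponds to gpa */
};

struct fd_mmio_bus {
	struct fd_mmio_region regions[FD_MMIO_MAX_REGIONS];
	size_t nr;
};

/* Register a GPA range as backed by (fd, offset).  Returns 0 or -1. */
int fd_mmio_register(struct fd_mmio_bus *bus, uint64_t gpa, uint64_t len,
		     int fd, off_t offset)
{
	if (bus->nr == FD_MMIO_MAX_REGIONS || len == 0 || fd < 0)
		return -1;
	bus->regions[bus->nr++] = (struct fd_mmio_region){ gpa, len, fd, offset };
	return 0;
}

/* Find the region that fully contains [addr, addr + n), or NULL. */
static const struct fd_mmio_region *
fd_mmio_find(const struct fd_mmio_bus *bus, uint64_t addr, size_t n)
{
	for (size_t i = 0; i < bus->nr; i++) {
		const struct fd_mmio_region *r = &bus->regions[i];

		if (addr >= r->gpa && n <= r->len && addr - r->gpa <= r->len - n)
			return r;
	}
	return NULL;
}

/* Service a write through the backing fd; -1 means "not handled here". */
int fd_mmio_write(const struct fd_mmio_bus *bus, uint64_t addr,
		  const void *v, size_t n)
{
	const struct fd_mmio_region *r = fd_mmio_find(bus, addr, n);

	if (!r)
		return -1;
	return (int)pwrite(r->fd, v, n, r->offset + (off_t)(addr - r->gpa));
}

/* Service a read through the backing fd; -1 means "not handled here". */
int fd_mmio_read(const struct fd_mmio_bus *bus, uint64_t addr,
		 void *v, size_t n)
{
	const struct fd_mmio_region *r = fd_mmio_find(bus, addr, n);

	if (!r)
		return -1;
	return (int)pread(r->fd, v, n, r->offset + (off_t)(addr - r->gpa));
}

/* Round-trip one 32-bit access; a temporary file stands in for a device fd. */
int fd_mmio_demo(void)
{
	struct fd_mmio_bus bus = { .nr = 0 };
	FILE *backing = tmpfile();
	uint32_t val = 0xdeadbeef, readback = 0;
	int ret = -1;

	if (!backing)
		return -1;
	if (fd_mmio_register(&bus, 0xfe000000, 0x1000, fileno(backing), 0) == 0 &&
	    fd_mmio_write(&bus, 0xfe000010, &val, sizeof(val)) == (int)sizeof(val) &&
	    fd_mmio_read(&bus, 0xfe000010, &readback, sizeof(readback)) == (int)sizeof(readback) &&
	    readback == val)
		ret = 0;
	fclose(backing);
	return ret;
}
```

Calling fd_mmio_demo() from a main() exercises one write and one read at
GPA 0xfe000010.  An access outside every registered range returns -1, which
in this sketch plays the role of falling back to the ordinary userspace
emulation path.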