From mboxrd@z Thu Jan  1 00:00:00 1970
From: Alex Williamson <alex.williamson@redhat.com>
Subject: Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT
 - a Mediated ...)
Date: Wed, 27 Jan 2016 09:19:49 -0700
Message-ID: <1453911589.6261.5.camel@redhat.com>
References: <569C5071.6080004@intel.com>
	   <1453092476.32741.67.camel@redhat.com> <569CA8AD.6070200@intel.com>
	   <1453143919.32741.169.camel@redhat.com> <569F4C86.2070501@intel.com>
	   <AADFC41AFE54684AB9EE6CBC0274A5D15F786B4B@SHSMSX101.ccr.corp.intel.com>
	   <56A6083E.10703@intel.com> <1453757426.32741.614.camel@redhat.com>
	   <56A72313.9030009@intel.com> <56A77D2D.40109@gmail.com>
	   <1453826249.26652.54.camel@redhat.com>
	   <AADFC41AFE54684AB9EE6CBC0274A5D15F78EAEA@SHSMSX101.ccr.corp.intel.com>
	   <1453844613.18049.1.camel@redhat.com>
	   <AADFC41AFE54684AB9EE6CBC0274A5D15F78EB95@SHSMSX101.ccr.corp.intel.com>
	   <1453846073.18049.3.camel@redhat.com>
	   <AADFC41AFE54684AB9EE6CBC0274A5D15F78ECBB@SHSMSX101.ccr.corp.intel.com>
	   <1453847250.18049.5.camel@redhat.com>
	   <AADFC41AFE54684AB9EE6CBC0274A5D15F78ED63@SHSMSX101.ccr.corp.intel.com>
	  <1453848975.18049.7.camel@redhat.com> <56A821AD.5090606@intel.com>
	 <1453864068.3107.3.camel@redhat.com> <56A85913.1020506@intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: "Tian, Kevin" <kevin.tian@intel.com>,
	Yang Zhang <yang.zhang.wz@gmail.com>,
	Gerd Hoffmann <kraxel@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	"Lv, Zhiyuan" <zhiyuan.lv@intel.com>,
	"Ruan, Shuai" <shuai.ruan@intel.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	qemu-devel <qemu-devel@nongnu.org>,
	"igvt-g@lists.01.org" <igvt-g@ml01.01.org>,
	Neo Jia <cjia@nvidia.com>
To: Jike Song <jike.song@intel.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mx1.redhat.com ([209.132.183.28]:44598 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S933539AbcA0QTv (ORCPT <rfc822;kvm@vger.kernel.org>);
	Wed, 27 Jan 2016 11:19:51 -0500
In-Reply-To: <56A85913.1020506@intel.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

On Wed, 2016-01-27 at 13:43 +0800, Jike Song wrote:
> On 01/27/2016 11:07 AM, Alex Williamson wrote:
> > On Wed, 2016-01-27 at 09:47 +0800, Jike Song wrote:
> > > On 01/27/2016 06:56 AM, Alex Williamson wrote:
> > > > On Tue, 2016-01-26 at 22:39 +0000, Tian, Kevin wrote:
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Wednesday, January 27, 2016 6:27 AM
> > > > > >
> > > > > > On Tue, 2016-01-26 at 22:15 +0000, Tian, Kevin wrote:
> > > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > > > Sent: Wednesday, January 27, 2016 6:08 AM
> > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Today KVMGT (not using VFIO yet) registers I/O emulation callbacks to
> > > > > > > > > > > KVM, so VM MMIO access will be forwarded to KVMGT directly for
> > > > > > > > > > > emulation in kernel. If we reuse above R/W flags, the whole emulation
> > > > > > > > > > > path would be unnecessarily long with obvious performance impact. We
> > > > > > > > > > > either need a new flag here to indicate in-kernel emulation (bias from
> > > > > > > > > > > passthrough support), or just hide the region alternatively (let KVMGT
> > > > > > > > > > > to handle I/O emulation itself like today).
> > > > > > > > > >
> > > > > > > > > > That sounds like a future optimization TBH.  There's very strict
> > > > > > > > > > layering between vfio and kvm.  Physical device assignment could make
> > > > > > > > > > use of it as well, avoiding a round trip through userspace when an
> > > > > > > > > > ioread/write would do.  Userspace also needs to orchestrate those kinds
> > > > > > > > > > of accelerators, there might be cases where userspace wants to see those
> > > > > > > > > > transactions for debugging or manipulating the device.  We can't simply
> > > > > > > > > > take shortcuts to provide such direct access.  Thanks,
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > But we have to balance such debugging flexibility and acceptable performance.
> > > > > > > > > To me the latter one is more important otherwise there'd be no real usage
> > > > > > > > > around this technique, while for debugging there are other alternative (e.g.
> > > > > > > > > ftrace) Consider some extreme case with 100k traps/second and then see
> > > > > > > > > how much impact a 2-3x longer emulation path can bring...
> > > > > > > >
> > > > > > > > Are you jumping to the conclusion that it cannot be done with proper
> > > > > > > > layering in place?  Performance is important, but it's not an excuse to
> > > > > > > > abandon designing interfaces between independent components.  Thanks,
> > > > > > > >
> > > > > > >
> > > > > > > Two are not controversial. My point is to remove unnecessary long trip
> > > > > > > as possible. After another thought, yes we can reuse existing read/write
> > > > > > > flags:
> > > > > > > 	- KVMGT will expose a private control variable whether in-kernel
> > > > > > > delivery is required;
> > > > > >
> > > > > > But in-kernel delivery is never *required*.  Wouldn't userspace want to
> > > > > > deliver in-kernel any time it possibly could?
> > > > > >
> > > > > > > 	- when the variable is true, KVMGT will register in-kernel MMIO
> > > > > > > emulation callbacks then VM MMIO request will be delivered to KVMGT
> > > > > > > directly;
> > > > > > > 	- when the variable is false, KVMGT will not register anything.
> > > > > > > VM MMIO request will then be delivered to Qemu and then ioread/write
> > > > > > > will be used to finally reach KVMGT emulation logic;
> > > > > >
> > > > > > No, that means the interface is entirely dependent on a backdoor through
> > > > > > KVM.  Why can't userspace (QEMU) do something like register an MMIO
> > > > > > region with KVM handled via a provided file descriptor and offset,
> > > > > > couldn't KVM then call the file ops without a kernel exit?  Thanks,
> > > > > >
> > > > >
> > > > > Could you elaborate this thought? If it can achieve the purpose w/o
> > > > > a kernel exit definitely we can adapt to it. :-)
> > > >
> > > > I only thought of it when replying to the last email and have been doing
> > > > some research, but we already do quite a bit of synchronization through
> > > > file descriptors.  The kvm-vfio pseudo device uses a group file
> > > > descriptor to ensure a user has access to a group, allowing some degree
> > > > of interaction between modules.  Eventfds and irqfds already make use of
> > > > f_ops on file descriptors to poke data.  So, if KVM had information that
> > > > an MMIO region was backed by a file descriptor for which it already has
> > > > a reference via fdget() (and verified access rights and whatnot), then
> > > > it ought to be a simple matter to get to f_ops->read/write knowing the
> > > > base offset of that MMIO region.  Perhaps it could even simply use
> > > > __vfs_read/write().  Then we've got a proper reference to the file
> > > > descriptor for ownership purposes and we've transparently jumped across
> > > > modules without any implicit knowledge of the other end.  Could it work?
> > >
> > > This is OK for KVMGT, from fops to vgpu device-model would always be simple.
> > > The only question is, how is KVM hypervisor supposed to get the fd on VM-exitings?
> >
> > Hi Jike,
> >
> > Sorry, I don't understand "on VM-exiting".  KVM would hold a reference
> > to the fd via fdget(), so the vfio device wouldn't be closed until the
> > VM exits and KVM releases that reference.
> >
>
> Sorry for my bad English, I meant VMEXIT, from non-root to kvm hypervisor.
>
> > > copy-and-paste the current implementation of vcpu_mmio_write(), seems
> > > nothing but GPA and len are provided:
> >
> > I presume that an MMIO region is already registered with a GPA and
> > length, the additional information necessary would be a file descriptor
> > and offset into the file descriptor for the base of the MMIO space.
> >
> > > 	static int vcpu_mmio_write(struct kvm_vcpu *vcpu, gpa_t addr, int len,
> > > 				   const void *v)
> > > 	{
> > > 		int handled = 0;
> > > 		int n;
> > >
> > > 		do {
> > > 			n = min(len, 8);
> > > 			if (!(vcpu->arch.apic &&
> > > 			      !kvm_iodevice_write(vcpu, &vcpu->arch.apic->dev, addr, n, v))
> > > 			    && kvm_io_bus_write(vcpu, KVM_MMIO_BUS, addr, n, v))
> > > 				break;
> > > 			handled += n;
> > > 			addr += n;
> > > 			len -= n;
> > > 			v += n;
> > > 		} while (len);
> > >
> > > 		return handled;
> > > 	}
> > >
> > > If we back a GPA range with a fd, this will also be a 'backdoor'?
> >
> > KVM would simply be able to service the MMIO access using the provided
> > fd and offset.  It's not a back door because we will have created an API
> > for KVM to have a file descriptor and offset registered (by userspace)
> > to handle the access.  Also, KVM does not know the file descriptor is
> > handled by a VFIO device and VFIO doesn't know the read/write accesses
> > is initiated by KVM.  Seems like the question is whether we can fit
> > something like that into the existing KVM MMIO bus/device handlers
> > in-kernel.  Thanks,
> >
>
> Had a look at eventfd, I would say yes, technically we are able to
> achieve the goal: introduce a fd, with fop->{read|write} defined in KVM,
> call into vgpu device-model, also an iodev registered for a MMIO GPA
> range to invoke the fop->{read|write}.  I just didn't understand why
> userspace can't register an iodev via API directly.

Please elaborate on how it would work via iodev.

> Besides, this doesn't necessarily require another thread, right?
> I guess it can be within the VCPU thread?

I would think so too.  The vcpu is blocked on the MMIO access, so we
should be able to service it in that context.  I hope.
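
To make the fd-plus-offset idea a bit more concrete, here is a rough
userspace analogy.  It is not kernel code and not a proposed KVM API:
struct fd_mmio_region, fd_mmio_find(), fd_mmio_read(), fd_mmio_write()
and fd_mmio_selftest() are names invented for this sketch, and
pread()/pwrite() merely stand in for reaching f_ops->read/write.  The
point is only that a (GPA, length) range plus a registered fd and base
offset is enough to service an access synchronously in the caller,
without knowing what sits behind the fd.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical descriptor: a guest-physical range backed by (fd, offset). */
struct fd_mmio_region {
	uint64_t gpa;	/* guest-physical base of the range */
	uint64_t len;	/* length of the range in bytes */
	int fd;		/* file descriptor registered by "userspace" */
	off_t offset;	/* offset within fd that corresponds to gpa */
};

/* Return the region fully containing [addr, addr + len), or NULL. */
static const struct fd_mmio_region *
fd_mmio_find(const struct fd_mmio_region *regions, size_t nr,
	     uint64_t addr, size_t len)
{
	for (size_t i = 0; i < nr; i++) {
		const struct fd_mmio_region *r = &regions[i];

		if (addr >= r->gpa && len <= r->len &&
		    addr - r->gpa <= r->len - len)
			return r;
	}
	return NULL;
}

/* Service a write in the caller's context; 0 on success, -1 otherwise. */
static int fd_mmio_write(const struct fd_mmio_region *regions, size_t nr,
			 uint64_t addr, const void *v, size_t len)
{
	const struct fd_mmio_region *r = fd_mmio_find(regions, nr, addr, len);

	assert(v != NULL);
	if (!r)
		return -1;
	/* pwrite() plays the role of f_ops->write at base offset + delta. */
	return pwrite(r->fd, v, len, r->offset + (off_t)(addr - r->gpa)) ==
	       (ssize_t)len ? 0 : -1;
}

/* Service a read in the caller's context; 0 on success, -1 otherwise. */
static int fd_mmio_read(const struct fd_mmio_region *regions, size_t nr,
			uint64_t addr, void *v, size_t len)
{
	const struct fd_mmio_region *r = fd_mmio_find(regions, nr, addr, len);

	assert(v != NULL);
	if (!r)
		return -1;
	/* pread() plays the role of f_ops->read at base offset + delta. */
	return pread(r->fd, v, len, r->offset + (off_t)(addr - r->gpa)) ==
	       (ssize_t)len ? 0 : -1;
}

/* Usage: back 4K at GPA 0xf0000000 with a temp file at offset 0x200. */
static int fd_mmio_selftest(void)
{
	FILE *f = tmpfile();
	uint32_t val = 0xdeadbeef, out = 0;
	int ret = -1;

	if (!f)
		return -1;

	struct fd_mmio_region region = {
		.gpa = 0xf0000000ULL, .len = 0x1000,
		.fd = fileno(f), .offset = 0x200,
	};

	/* Round-trip one register, then check an unregistered GPA fails. */
	if (!fd_mmio_write(&region, 1, 0xf0000010ULL, &val, sizeof(val)) &&
	    !fd_mmio_read(&region, 1, 0xf0000010ULL, &out, sizeof(out)) &&
	    out == val &&
	    fd_mmio_write(&region, 1, 0xe0000000ULL, &val, sizeof(val)) == -1)
		ret = 0;

	fclose(f);
	return ret;
}
```

Neither side of the fd knows about the other here: the dispatcher only
sees a descriptor and an offset, and whatever implements the file only
sees reads and writes.  Whether that maps cleanly onto the existing KVM
MMIO bus handlers is the open question.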

> And this brought another question: except the vfio bus driver and
> iommu backend (and the page_track utility used for guest memory write-protection),
> is KVMGT allowed to call into kvm.ko (or modify)? Though we are
> becoming less and less willing to do that with VFIO, it's still better
> to know that before going wrong.

kvm and vfio are separate modules; for the most part, they know nothing
about each other and have no hard dependencies between them.  We do have
various accelerations we can use to avoid paths through userspace, but
these are all via APIs that are agnostic of the party on the other end.
For example, vfio signals interrupts through eventfds and has no concept
of whether that eventfd terminates in userspace or in an irqfd in KVM.
vfio supports direct access to device MMIO regions via mmaps, but vfio
has no idea if that mmap gets directly mapped into a VM address space.
Even with posted interrupts, we've introduced an irq bypass manager
allowing interrupt producers and consumers to register independently to
form a connection without directly knowing anything about the other
module.  That sort of proper software layering needs to continue.  It
would be wrong for a vfio bus driver to assume KVM is the user and
directly call into KVM interfaces.  Thanks,

Alex
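
As an illustration of the eventfd point above, here is a small Linux
userspace sketch.  It is not vfio or KVM code: signal_interrupt() and
drain_interrupts() are hypothetical helper names, and the only real
interfaces used are eventfd(2), read(2) and write(2).  The producer is
handed a bare fd and cannot tell whether a userspace poll loop, an
irqfd, or nothing at all is on the other end.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Producer side (the role vfio plays): add 1 to the eventfd counter.
 * Nothing here depends on who consumes the event.
 */
static int signal_interrupt(int efd)
{
	uint64_t one = 1;

	assert(efd >= 0);
	return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Consumer side: return the number of signals accumulated since the
 * last read and reset the counter.  Returns 0 if nothing is pending on
 * a non-blocking eventfd, -1 on any other error.
 */
static int64_t drain_interrupts(int efd)
{
	uint64_t count;
	ssize_t ret;

	assert(efd >= 0);
	ret = read(efd, &count, sizeof(count));
	if (ret == (ssize_t)sizeof(count))
		return (int64_t)count;
	if (ret < 0 && errno == EAGAIN)
		return 0;
	return -1;
}
```

A caller would create the descriptor with eventfd(0, EFD_NONBLOCK), hand
it to the producer, and separately decide where it terminates.  Swapping
the consumer requires no change on the producer side, which is the kind
of layering described above.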