From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:51115)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <alex.williamson@redhat.com>) id 1aOSp3-0006Ri-3X
	for qemu-devel@nongnu.org; Wed, 27 Jan 2016 11:20:02 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <alex.williamson@redhat.com>) id 1aOSox-00041y-JK
	for qemu-devel@nongnu.org; Wed, 27 Jan 2016 11:19:57 -0500
Received: from mx1.redhat.com ([209.132.183.28]:58397)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <alex.williamson@redhat.com>) id 1aOSox-00041m-9V
	for qemu-devel@nongnu.org; Wed, 27 Jan 2016 11:19:51 -0500
Message-ID: <1453911589.6261.5.camel@redhat.com>
From: Alex Williamson <alex.williamson@redhat.com>
Date: Wed, 27 Jan 2016 09:19:49 -0700
In-Reply-To: <56A85913.1020506@intel.com>
References: <569C5071.6080004@intel.com>
	<1453092476.32741.67.camel@redhat.com> <569CA8AD.6070200@intel.com>
	<1453143919.32741.169.camel@redhat.com> <569F4C86.2070501@intel.com>
	<AADFC41AFE54684AB9EE6CBC0274A5D15F786B4B@SHSMSX101.ccr.corp.intel.com>
	<56A6083E.10703@intel.com> <1453757426.32741.614.camel@redhat.com>
	<56A72313.9030009@intel.com> <56A77D2D.40109@gmail.com>
	<1453826249.26652.54.camel@redhat.com>
	<AADFC41AFE54684AB9EE6CBC0274A5D15F78EAEA@SHSMSX101.ccr.corp.intel.com>
	<1453844613.18049.1.camel@redhat.com>
	<AADFC41AFE54684AB9EE6CBC0274A5D15F78EB95@SHSMSX101.ccr.corp.intel.com>
	<1453846073.18049.3.camel@redhat.com>
	<AADFC41AFE54684AB9EE6CBC0274A5D15F78ECBB@SHSMSX101.ccr.corp.intel.com>
	<1453847250.18049.5.camel@redhat.com>
	<AADFC41AFE54684AB9EE6CBC0274A5D15F78ED63@SHSMSX101.ccr.corp.intel.com>
	<1453848975.18049.7.camel@redhat.com> <56A821AD.5090606@intel.com>
	<1453864068.3107.3.camel@redhat.com> <56A85913.1020506@intel.com>
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3
 release of XenGT - a Mediated ...)
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Jike Song <jike.song@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>, "Ruan, Shuai" <shuai.ruan@intel.com>, "Tian, Kevin" <kevin.tian@intel.com>, Neo Jia <cjia@nvidia.com>, "kvm@vger.kernel.org" <kvm@vger.kernel.org>, "igvt-g@lists.01.org" <igvt-g@ml01.01.org>, qemu-devel <qemu-devel@nongnu.org>, Gerd Hoffmann <kraxel@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>, "Lv, Zhiyuan" <zhiyuan.lv@intel.com>

On Wed, 2016-01-27 at 13:43 +0800, Jike Song wrote:
> On 01/27/2016 11:07 AM, Alex Williamson wrote:
> > On Wed, 2016-01-27 at 09:47 +0800, Jike Song wrote:
> > > On 01/27/2016 06:56 AM, Alex Williamson wrote:
> > > > On Tue, 2016-01-26 at 22:39 +0000, Tian, Kevin wrote:
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Wednesday, January 27, 2016 6:27 AM
> > > > > >
> > > > > > On Tue, 2016-01-26 at 22:15 +0000, Tian, Kevin wrote:
> > > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > > > Sent: Wednesday, January 27, 2016 6:08 AM
> > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Today KVMGT (not using VFIO yet) registers I/O emulation callbacks to
> > > > > > > > > > > KVM, so VM MMIO access will be forwarded to KVMGT directly for
> > > > > > > > > > > emulation in kernel. If we reuse above R/W flags, the whole emulation
> > > > > > > > > > > path would be unnecessarily long with obvious performance impact. We
> > > > > > > > > > > either need a new flag here to indicate in-kernel emulation (bias from
> > > > > > > > > > > passthrough support), or just hide the region alternatively (let KVMGT
> > > > > > > > > > > to handle I/O emulation itself like today).
> > > > > > > > > >
> > > > > > > > > > That sounds like a future optimization TBH.  There's very strict
> > > > > > > > > > layering between vfio and kvm.  Physical device assignment could make
> > > > > > > > > > use of it as well, avoiding a round trip through userspace when an
> > > > > > > > > > ioread/write would do.  Userspace also needs to orchestrate those kinds
> > > > > > > > > > of accelerators, there might be cases where userspace wants to see those
> > > > > > > > > > transactions for debugging or manipulating the device.  We can't simply
> > > > > > > > > > take shortcuts to provide such direct access.  Thanks,
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > But we have to balance such debugging flexibility and acceptable performance.
> > > > > > > > > To me the latter one is more important otherwise there'd be no real usage
> > > > > > > > > around this technique, while for debugging there are other alternative (e.g.
> > > > > > > > > ftrace) Consider some extreme case with 100k traps/second and then see
> > > > > > > > > how much impact a 2-3x longer emulation path can bring...
> > > > > > > >
> > > > > > > > Are you jumping to the conclusion that it cannot be done with proper
> > > > > > > > layering in place?  Performance is important, but it's not an excuse to
> > > > > > > > abandon designing interfaces between independent components.  Thanks,
> > > > > > > >
> > > > > > >
> > > > > > > Two are not controversial. My point is to remove unnecessary long trip
> > > > > > > as possible. After another thought, yes we can reuse existing read/write
> > > > > > > flags:
> > > > > > > 	- KVMGT will expose a private control variable whether in-kernel
> > > > > > > delivery is required;
> > > > > >
> > > > > > But in-kernel delivery is never *required*.  Wouldn't userspace want to
> > > > > > deliver in-kernel any time it possibly could?
> > > > > >
> > > > > > > 	- when the variable is true, KVMGT will register in-kernel MMIO
> > > > > > > emulation callbacks then VM MMIO request will be delivered to KVMGT
> > > > > > > directly;
> > > > > > > 	- when the variable is false, KVMGT will not register anything.
> > > > > > > VM MMIO request will then be delivered to Qemu and then ioread/write
> > > > > > > will be used to finally reach KVMGT emulation logic;
> > > > > >
> > > > > > No, that means the interface is entirely dependent on a backdoor through
> > > > > > KVM.  Why can't userspace (QEMU) do something like register an MMIO
> > > > > > region with KVM handled via a provided file descriptor and offset,
> > > > > > couldn't KVM then call the file ops without a kernel exit?  Thanks,
> > > > > >
> > > > >
> > > > > Could you elaborate this thought? If it can achieve the purpose w/o
> > > > > a kernel exit definitely we can adapt to it. :-)
> > > >
> > > > I only thought of it when replying to the last email and have been doing
> > > > some research, but we already do quite a bit of synchronization through
> > > > file descriptors.  The kvm-vfio pseudo device uses a group file
> > > > descriptor to ensure a user has access to a group, allowing some degree
> > > > of interaction between modules.  Eventfds and irqfds already make use of
> > > > f_ops on file descriptors to poke data.  So, if KVM had information that
> > > > an MMIO region was backed by a file descriptor for which it already has
> > > > a reference via fdget() (and verified access rights and whatnot), then
> > > > it ought to be a simple matter to get to f_ops->read/write knowing the
> > > > base offset of that MMIO region.  Perhaps it could even simply use
> > > > __vfs_read/write().  Then we've got a proper reference to the file
> > > > descriptor for ownership purposes and we've transparently jumped across
> > > > modules without any implicit knowledge of the other end.  Could it work?
> > >
> > > This is OK for KVMGT, from fops to vgpu device-model would always be simple.
> > > The only question is, how is KVM hypervisor supposed to get the fd on VM-exitings?
> >
> > Hi Jike,
> >
> > Sorry, I don't understand "on VM-exiting".  KVM would hold a reference
> > to the fd via fdget(), so the vfio device wouldn't be closed until the
> > VM exits and KVM releases that reference.
> >
>
> Sorry for my bad English, I meant VMEXIT, from non-root to kvm hypervisor.
>
> > > copy-and-paste the current implementation of vcpu_mmio_write(), seems
> > > nothing but GPA and len are provided:
> >
> > I presume that an MMIO region is already registered with a GPA and
> > length, the additional information necessary would be a file descriptor
> > and offset into the file descriptor for the base of the MMIO space.
> >
> > > 	static int vcpu_mmio_write(struct kvm_vcpu *vcpu, gpa_t addr, int len,
> > > 				   const void *v)
> > > 	{
> > > 		int handled = 0;
> > > 		int n;
> > >
> > > 		do {
> > > 			n = min(len, 8);
> > > 			if (!(vcpu->arch.apic &&
> > > 			      !kvm_iodevice_write(vcpu, &vcpu->arch.apic->dev, addr, n, v))
> > > 			    && kvm_io_bus_write(vcpu, KVM_MMIO_BUS, addr, n, v))
> > > 				break;
> > > 			handled += n;
> > > 			addr += n;
> > > 			len -= n;
> > > 			v += n;
> > > 		} while (len);
> > >
> > > 		return handled;
> > > 	}
> > >
> > > If we back a GPA range with a fd, this will also be a 'backdoor'?
> >
> > KVM would simply be able to service the MMIO access using the provided
> > fd and offset.  It's not a back door because we will have created an API
> > for KVM to have a file descriptor and offset registered (by userspace)
> > to handle the access.  Also, KVM does not know the file descriptor is
> > handled by a VFIO device and VFIO doesn't know the read/write accesses
> > is initiated by KVM.  Seems like the question is whether we can fit
> > something like that into the existing KVM MMIO bus/device handlers
> > in-kernel.  Thanks,
> >
>
> Had a look at eventfd, I would say yes, technically we are able to
> achieve the goal: introduce a fd, with fop->{read|write} defined in KVM,
> call into vgpu device-model, also an iodev registered for a MMIO GPA
> range to invoke the fop->{read|write}.  I just didn't understand why
> userspace can't register an iodev via API directly.

Please elaborate on how it would work via iodev.

> Besides, this doesn't necessarily require another thread, right?
> I guess it can be within the VCPU thread?

I would think so too, the vcpu is blocked on the MMIO access, we should
be able to service it in that context.  I hope.
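
As a rough userspace analogy of the fd-plus-offset idea (a sketch only,
not KVM code: struct fd_mmio_region and the fd_mmio_*() helpers are
hypothetical names, and a temporary file stands in for the device fd),
the lookup and dispatch could look something like this.  The access is
translated to an offset in the backing file and serviced synchronously
with pread()/pwrite() in the caller's own context, with no extra thread:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical record: a guest-physical MMIO range backed by (fd, offset). */
struct fd_mmio_region {
	uint64_t gpa;		/* guest-physical base of the range */
	uint64_t len;		/* size of the range in bytes */
	int fd;			/* descriptor registered by userspace */
	off_t fd_offset;	/* offset within fd that corresponds to gpa */
};

/* Return the region fully containing [addr, addr + len), or NULL. */
const struct fd_mmio_region *
fd_mmio_find(const struct fd_mmio_region *tbl, size_t n,
	     uint64_t addr, size_t len)
{
	assert(tbl != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (addr >= tbl[i].gpa && len <= tbl[i].len &&
		    addr - tbl[i].gpa <= tbl[i].len - len)
			return &tbl[i];
	}
	return NULL;
}

/* Service a write in the caller's context; 0 on success, -1 otherwise. */
int fd_mmio_write(const struct fd_mmio_region *tbl, size_t n,
		  uint64_t addr, const void *v, size_t len)
{
	const struct fd_mmio_region *r = fd_mmio_find(tbl, n, addr, len);

	if (!r)
		return -1;	/* not fd-backed: caller uses another path */
	return pwrite(r->fd, v, len, r->fd_offset + (off_t)(addr - r->gpa))
	       == (ssize_t)len ? 0 : -1;
}

/* Service a read in the caller's context; 0 on success, -1 otherwise. */
int fd_mmio_read(const struct fd_mmio_region *tbl, size_t n,
		 uint64_t addr, void *v, size_t len)
{
	const struct fd_mmio_region *r = fd_mmio_find(tbl, n, addr, len);

	if (!r)
		return -1;
	return pread(r->fd, v, len, r->fd_offset + (off_t)(addr - r->gpa))
	       == (ssize_t)len ? 0 : -1;
}

/* Round-trip one 32-bit access through a temporary backing file. */
int fd_mmio_demo(void)
{
	FILE *f = tmpfile();
	uint32_t val = 0xdeadbeef, back = 0;
	int ret = -1;

	if (!f)
		return -1;
	struct fd_mmio_region tbl[] = {
		{ .gpa = 0xfe000000, .len = 0x1000,
		  .fd = fileno(f), .fd_offset = 0x4000 },
	};
	if (!fd_mmio_write(tbl, 1, 0xfe000010, &val, sizeof(val)) &&
	    !fd_mmio_read(tbl, 1, 0xfe000010, &back, sizeof(back)) &&
	    back == val)
		ret = 0;
	fclose(f);
	return ret;
}
```

fd_mmio_demo() returns 0 when the value written at GPA 0xfe000010 reads
back intact through the backing file.  Neither side of the pread()/
pwrite() call needs to know what implements the other end, which is the
property the in-kernel version would want from f_ops->read/write.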

> And this brought another question: except the vfio bus driver and
> iommu backend (and the page_track utility used for guest memory write-protection),
> is KVMGT allowed to call into kvm.ko (or modify)? Though we are
> becoming less and less willing to do that with VFIO, it's still better
> to know that before going wrong.

kvm and vfio are separate modules; for the most part, they know nothing
about each other and have no hard dependencies between them.  We do have
various accelerations we can use to avoid paths through userspace, but
these are all via APIs that are agnostic of the party on the other end.
For example, vfio signals interrupts through eventfds and has no concept
of whether that eventfd terminates in userspace or into an irqfd in KVM.
vfio supports direct access to device MMIO regions via mmaps, but vfio
has no idea if that mmap gets directly mapped into a VM address space.
Even with posted interrupts, we've introduced an irq bypass manager
allowing interrupt producers and consumers to register independently to
form a connection without directly knowing anything about the other
module.  That sort of proper software layering needs to continue.  It
would be wrong for a vfio bus driver to assume KVM is the user and
directly call into KVM interfaces.  Thanks,

Alex
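
The eventfd case mentioned above can be illustrated from userspace (a
sketch only, Linux-specific: signal_event(), consume_events() and
eventfd_demo() are made-up names, and the consumer here is a plain
read() standing in for whatever sits on the other end, such as an irqfd
in KVM).  The producer only ever sees a file descriptor:

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Producer: signal one event, knowing nothing about who consumes it. */
int signal_event(int efd)
{
	uint64_t one = 1;

	assert(efd >= 0);
	return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * One possible consumer: read and reset the counter.  Returns the number
 * of events signalled since the last read, or -1 if none are pending
 * (the descriptor is opened non-blocking in the demo below).
 */
int64_t consume_events(int efd)
{
	uint64_t count;

	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return -1;
	return (int64_t)count;
}

/* Signal three events, then collect them in a single read. */
int64_t eventfd_demo(void)
{
	int efd = eventfd(0, EFD_NONBLOCK);
	int64_t seen;

	if (efd < 0)
		return -1;
	for (int i = 0; i < 3; i++) {
		if (signal_event(efd)) {
			close(efd);
			return -1;
		}
	}
	seen = consume_events(efd);
	close(efd);
	return seen;
}
```

Swapping the consumer for something else requires no change to
signal_event(), which is the kind of agnostic interface described above.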