From mboxrd@z Thu Jan 1 00:00:00 1970
From: Alex Williamson
Subject: Re: VFIO based vGPU (was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
Date: Tue, 26 Jan 2016 20:07:48 -0700
Message-ID: <1453864068.3107.3.camel@redhat.com>
References: <569C5071.6080004@intel.com> <1453092476.32741.67.camel@redhat.com> <569CA8AD.6070200@intel.com> <1453143919.32741.169.camel@redhat.com> <569F4C86.2070501@intel.com> <56A6083E.10703@intel.com> <1453757426.32741.614.camel@redhat.com> <56A72313.9030009@intel.com> <56A77D2D.40109@gmail.com> <1453826249.26652.54.camel@redhat.com> <1453844613.18049.1.camel@redhat.com> <1453846073.18049.3.camel@redhat.com> <1453847250.18049.5.camel@redhat.com> <1453848975.18049.7.camel@redhat.com> <56A821AD.5090606@intel.com>
To: Jike Song
Cc: Yang Zhang, "Ruan, Shuai", "Tian, Kevin", Neo Jia, kvm@vger.kernel.org, igvt-g@lists.01.org, qemu-devel, Gerd Hoffmann, Paolo Bonzini, "Lv, Zhiyuan"
In-Reply-To: <56A821AD.5090606@intel.com>
List-Id: kvm.vger.kernel.org

On Wed, 2016-01-27 at 09:47 +0800, Jike Song wrote:
> On 01/27/2016 06:56 AM, Alex Williamson wrote:
> > On Tue, 2016-01-26 at 22:39 +0000, Tian, Kevin wrote:
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Wednesday, January 27, 2016 6:27 AM
> > > >
> > > > On Tue, 2016-01-26 at 22:15 +0000, Tian, Kevin wrote:
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Wednesday, January 27, 2016 6:08 AM
> > > > > >
> > > > > > > > > Today KVMGT (not using VFIO yet) registers I/O emulation callbacks to
> > > > > > > > > KVM, so
VM MMIO access will be forwarded to KVMGT directly for
> > > > > > > > > emulation in kernel. If we reuse the above R/W flags, the whole emulation
> > > > > > > > > path would be unnecessarily long, with obvious performance impact. We
> > > > > > > > > either need a new flag here to indicate in-kernel emulation (bias from
> > > > > > > > > passthrough support), or just hide the region alternatively (let KVMGT
> > > > > > > > > handle I/O emulation itself like today).
> > > > > > > >
> > > > > > > > That sounds like a future optimization, TBH.  There's very strict
> > > > > > > > layering between vfio and kvm.  Physical device assignment could make
> > > > > > > > use of it as well, avoiding a round trip through userspace when an
> > > > > > > > ioread/write would do.  Userspace also needs to orchestrate those kinds
> > > > > > > > of accelerators; there might be cases where userspace wants to see those
> > > > > > > > transactions for debugging or manipulating the device.  We can't simply
> > > > > > > > take shortcuts to provide such direct access.  Thanks,
> > > > > > >
> > > > > > > But we have to balance such debugging flexibility against acceptable performance.
> > > > > > > To me the latter is more important, otherwise there'd be no real usage
> > > > > > > around this technique, while for debugging there are other alternatives (e.g.
> > > > > > > ftrace). Consider some extreme case with 100k traps/second and then see
> > > > > > > how much impact a 2-3x longer emulation path can bring...
> > > > > >
> > > > > > Are you jumping to the conclusion that it cannot be done with proper
> > > > > > layering in place?  Performance is important, but it's not an excuse to
> > > > > > abandon designing interfaces between independent components.  Thanks,
> > > > >
> > > > > The two are not contradictory.
My point is to remove the unnecessarily long trip
> > > > > where possible. After another thought, yes, we can reuse the existing read/write
> > > > > flags:
> > > > >   - KVMGT will expose a private control variable for whether in-kernel
> > > > > delivery is required;
> > > >
> > > > But in-kernel delivery is never *required*.  Wouldn't userspace want to
> > > > deliver in-kernel any time it possibly could?
> > > >
> > > > >   - when the variable is true, KVMGT will register in-kernel MMIO
> > > > > emulation callbacks, and VM MMIO requests will be delivered to KVMGT
> > > > > directly;
> > > > >   - when the variable is false, KVMGT will not register anything.
> > > > > VM MMIO requests will then be delivered to QEMU, and ioread/write
> > > > > will be used to finally reach the KVMGT emulation logic;
> > > >
> > > > No, that means the interface is entirely dependent on a backdoor through
> > > > KVM.  Why can't userspace (QEMU) do something like register an MMIO
> > > > region with KVM handled via a provided file descriptor and offset?
> > > > Couldn't KVM then call the file ops without a kernel exit?  Thanks,
> > >
> > > Could you elaborate this thought? If it can achieve the purpose w/o
> > > a kernel exit, definitely we can adapt to it.
:-)
> >
> > I only thought of it when replying to the last email and have been doing
> > some research, but we already do quite a bit of synchronization through
> > file descriptors.  The kvm-vfio pseudo device uses a group file
> > descriptor to ensure a user has access to a group, allowing some degree
> > of interaction between modules.  Eventfds and irqfds already make use of
> > f_ops on file descriptors to poke data.  So, if KVM had information that
> > an MMIO region was backed by a file descriptor for which it already has
> > a reference via fdget() (and verified access rights and whatnot), then
> > it ought to be a simple matter to get to f_ops->read/write knowing the
> > base offset of that MMIO region.  Perhaps it could even simply use
> > __vfs_read/write().  Then we've got a proper reference to the file
> > descriptor for ownership purposes and we've transparently jumped across
> > modules without any implicit knowledge of the other end.  Could it work?
>
> This is OK for KVMGT; going from fops to the vgpu device-model would always be simple.
> The only question is, how is the KVM hypervisor supposed to get the fd on VM-exitings?

Hi Jike,

Sorry, I don't understand "on VM-exiting".  KVM would hold a reference
to the fd via fdget(), so the vfio device wouldn't be closed until the
VM exits and KVM releases that reference.

> Copy-and-pasting the current implementation of vcpu_mmio_write(), it seems
> nothing but the GPA and len are provided:

I presume that an MMIO region is already registered with a GPA and
length; the additional information necessary would be a file descriptor
and an offset into the file descriptor for the base of the MMIO space.
> static int vcpu_mmio_write(struct kvm_vcpu *vcpu, gpa_t addr, int len,
> 			   const void *v)
> {
> 	int handled = 0;
> 	int n;
>
> 	do {
> 		n = min(len, 8);
> 		if (!(vcpu->arch.apic &&
> 		      !kvm_iodevice_write(vcpu, &vcpu->arch.apic->dev, addr, n, v))
> 		    && kvm_io_bus_write(vcpu, KVM_MMIO_BUS, addr, n, v))
> 			break;
> 		handled += n;
> 		addr += n;
> 		len -= n;
> 		v += n;
> 	} while (len);
>
> 	return handled;
> }
>
> If we back a GPA range with an fd, will this also be a 'backdoor'?

KVM would simply be able to service the MMIO access using the provided
fd and offset.  It's not a back door, because we will have created an
API for KVM to have a file descriptor and offset registered (by
userspace) to handle the access.  Also, KVM does not know the file
descriptor is handled by a VFIO device, and VFIO doesn't know the
read/write accesses are initiated by KVM.  Seems like the question is
whether we can fit something like that into the existing KVM MMIO
bus/device handlers in-kernel.  Thanks,

Alex