From mboxrd@z Thu Jan 1 00:00:00 1970
From: Alex Williamson
Subject: Re: [iGVT-g] VFIO based vGPU (was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
Date: Fri, 29 Jan 2016 11:50:05 -0700
Message-ID: <1454093405.23148.13.camel@redhat.com>
References: <569C5071.6080004@intel.com>
 <569F4C86.2070501@intel.com>
 <56A6083E.10703@intel.com>
 <1453757426.32741.614.camel@redhat.com>
 <56A72313.9030009@intel.com>
 <56A77D2D.40109@gmail.com>
 <1453826249.26652.54.camel@redhat.com>
 <1453844613.18049.1.camel@redhat.com>
 <1453846073.18049.3.camel@redhat.com>
 <1453847250.18049.5.camel@redhat.com>
 <1453848975.18049.7.camel@redhat.com>
 <56A821AD.5090606@intel.com>
 <1453864068.3107.3.camel@redhat.com>
 <56A85913.1020506@intel.com>
 <1453911589.6261.5.camel@redhat.com>
 <56A9AE69.3060604@intel.com>
 <1453994586.29166.1.camel@redhat.com>
 <56AB12CC.5000402@intel.com>
 <56AB27AA.80602@intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Cc: Yang Zhang, "kvm@vger.kernel.org", "igvt-g@lists.01.org",
 qemu-devel, Paolo Bonzini
To: Jike Song
In-Reply-To: <56AB27AA.80602@intel.com>
Sender: kvm-owner@vger.kernel.org
List-ID: 

Hi Jike,

On Fri, 2016-01-29 at 16:49 +0800, Jike Song wrote:
> On 01/29/2016 03:20 PM, Jike Song wrote:
> > This discussion becomes a little difficult for a newbie like me :(
> > 
> > On 01/28/2016 11:23 PM, Alex Williamson wrote:
> > > On Thu, 2016-01-28 at 14:00 +0800, Jike Song wrote:
> > > > On 01/28/2016 12:19 AM, Alex Williamson wrote:
> > > > > On Wed, 2016-01-27 at 13:43 +0800, Jike Song wrote:
> > > > {snip}
> > > > 
> > > > > > Had a look at eventfd, I would say yes, technically we are able to
> > > > > > achieve the goal: introduce a fd, with fop->{read|write} defined in KVM,
> > > > > > call into the vgpu device-model, also an iodev registered for a MMIO GPA
> > > > > > range to invoke the fop->{read|write}.  I just didn't understand why
> > > > > > userspace can't register an iodev via API directly.
> > > > > 
> > > > > Please elaborate on how it would work via iodev.
> > > > > 
> > > > 
> > > > QEMU forwards the BAR0 write to the bus driver; in the bus driver, if
> > > > we find that the MEM bit is enabled, register an iodev to KVM with an
> > > > ops:
> > > > 
> > > > 	const struct kvm_io_device_ops trap_mmio_ops = {
> > > > 		.read	= kvmgt_guest_mmio_read,
> > > > 		.write	= kvmgt_guest_mmio_write,
> > > > 	};
> > > > 
> > > > I may not be able to illustrate it clearly with descriptions but this
> > > > should not be a problem; thanks to your explanation, I can understand
> > > > and adopt it for KVMGT.
> > > 
> > > You're still crossing modules with direct callbacks, right?  What's the
> > > advantage versus using the file descriptor + offset approach which could
> > > offer the same performance and improve KVM overall by creating a new
> > > option for generically handling MMIO?
> > > 
> > 
> > Yes, the method I gave above is the current way: calling kvm_io_device_ops
> > from the KVM hypervisor, and then going to the vgpu device-model directly.
> > 
> > From KVMGT's side this is almost the same as what you suggested; I don't
> > think we have a problem here now. I will adopt your suggestion.

Great!

> > > > > > Besides, this doesn't necessarily require another thread, right?
> > > > > > I guess it can be within the VCPU thread?
> > > > > 
> > > > > I would think so too, the vcpu is blocked on the MMIO access, we should
> > > > > be able to service it in that context.  I hope.
> > > > > 
> > > > 
> > > > Thanks for confirmation.
> > > > 
> > > > > > And this brought another question: except the vfio bus driver and
> > > > > > iommu backend (and the page_track utility used for guest memory
> > > > > > write-protection), is KVMGT allowed to call into kvm.ko (or modify
> > > > > > it)? Though we are becoming less and less willing to do that with
> > > > > > VFIO, it's still better to know that before going wrong.
> > > > > 
> > > > > kvm and vfio are separate modules; for the most part, they know nothing
> > > > > about each other and have no hard dependencies between them.  We do have
> > > > > various accelerations we can use to avoid paths through userspace, but
> > > > > these are all via APIs that are agnostic of the party on the other end.
> > > > > For example, vfio signals interrupts through eventfds and has no concept
> > > > > of whether that eventfd terminates in userspace or into an irqfd in KVM.
> > > > > vfio supports direct access to device MMIO regions via mmaps, but vfio
> > > > > has no idea if that mmap gets directly mapped into a VM address space.
> > > > > Even with posted interrupts, we've introduced an irq bypass manager
> > > > > allowing interrupt producers and consumers to register independently to
> > > > > form a connection without directly knowing anything about the other
> > > > > module.  That sort of proper software layering needs to continue.  It
> > > > > would be wrong for a vfio bus driver to assume KVM is the user and
> > > > > directly call into KVM interfaces.  Thanks,
> > > > > 
> > > > 
> > > > I understand and agree with your point, it's bad if the bus driver
> > > > assumes KVM is the user and/or calls into KVM interfaces.
> > > > 
> > > > However, the vgpu device-model, in Intel's case also a part of the i915
> > > > driver, will always need to call some hypervisor-specific interfaces.
> > > 
> > > No, think differently.
> > > 
> > > > For example, when a guest gfx driver submits GPU commands, the
> > > > device-model may want to scan them for security or whatever-else purpose:
> > > > 
> > > > 	- get a GPA (from GPU page tables)
> > > > 	- want to read 16 bytes from that GPA
> > > > 	- call hypervisor-specific read_gpa() method
> > > > 		- for Xen, the GPA belongs to a foreign domain, it must find
> > > > 		  a way to map & read it - beyond our scope here;
> > > > 		- for KVM, the GPA can be converted to HVA, copy_from_user (if
> > > > 		  called from vcpu thread) or access_remote_vm (if called from
> > > > 		  other threads);
> > > > 
> > > > Please note that this is not from the vfio bus driver, but from the vgpu
> > > > device-model; also this is not a DMA addr from GPU tables, but a real GPA.
> > > 
> > > This is exactly why we're proposing that the vfio IOMMU interface be
> > > used as a database of guest translations.  The type1 IOMMU model in QEMU
> > > maps all of guest memory through the IOMMU; in the vGPU model, type1 is
> > > simply collecting these mappings, and they map GPA to process virtual
> > > memory.
> > 
> > GPA to HVA mappings are maintained in KVM/QEMU, via memslots.
> > Do you mean making type1 duplicate the GPA <-> HVA/HPA translations from
> > KVM? Even if technically this could be done, how would it be synchronized
> > with the KVM hypervisor? e.g. What is expected if the guest hot-adds a
> > memslot?

This is exactly what we do today with physical device assignment with
vfio: the vfio code in QEMU registers a MemoryListener and does DMA map
and unmap operations any time the DMA capable memory of the VM changes.
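As a rough userspace model of the database such a listener would keep
current (every name below is illustrative, not the actual vfio or QEMU
API): each map operation records a (GPA, size, HVA) triple, and the
vgpu driver later queries that record instead of reaching into KVM's
memslots:

```c
/*
 * Illustrative userspace model only -- not the real vfio/QEMU API.
 * QEMU's MemoryListener region_add/region_del callbacks would drive
 * dma_map()/dma_unmap(); the vgpu device-model calls gpa_to_hva().
 */
#include <stdint.h>
#include <stddef.h>

struct gpa_mapping {
	uint64_t gpa;	/* guest-physical start of the region */
	uint64_t size;	/* length in bytes */
	uint64_t hva;	/* process-virtual address backing it */
};

#define MAX_MAPPINGS 16
static struct gpa_mapping db[MAX_MAPPINGS];
static int nr_mappings;

/* region_add: record the translation for this chunk of guest RAM */
static int dma_map(uint64_t gpa, uint64_t size, uint64_t hva)
{
	if (nr_mappings == MAX_MAPPINGS)
		return -1;
	db[nr_mappings++] = (struct gpa_mapping){ gpa, size, hva };
	return 0;
}

/* region_del: forget it (a real backend would also notify the vgpu
 * driver so it can invalidate anything derived from this mapping) */
static int dma_unmap(uint64_t gpa)
{
	for (int i = 0; i < nr_mappings; i++) {
		if (db[i].gpa == gpa) {
			db[i] = db[--nr_mappings];
			return 0;
		}
	}
	return -1;
}

/* What the vgpu device-model asks: which HVA backs this GPA? */
static int gpa_to_hva(uint64_t gpa, uint64_t *hva)
{
	for (int i = 0; i < nr_mappings; i++) {
		if (gpa >= db[i].gpa && gpa - db[i].gpa < db[i].size) {
			*hva = db[i].hva + (gpa - db[i].gpa);
			return 0;
		}
	}
	return -1;	/* not VM memory */
}
```

Memslot hot-add then falls out naturally: it shows up as another
region_add, i.e. another dma_map(), with no KVM involvement.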
This is where the suggestion of a vgpu version of the type1 interface
comes in: one that records these translations and makes them available
to the vgpu driver for the purpose of pinning and GPA to HPA
translation.  We do need to devise a notification scheme so that vgpu
drivers can invalidate mappings when things change, but the information
is available outside of KVM.

> > What's more, GPA is totally a virtualization term. When VFIO is used for
> > device assignment, it uses GPA as IOVA, maps it to HPA, that's true.
> > But for KVMGT, since the vGPU doesn't have its own DMA requester ID, VFIO
> > won't call the IOMMU-API, but the DMA-API instead.  GPAs from different
> > guests may be identical, while the IGD can only have 1 single IOMMU
> > domain ...

The proposal is that we maintain exactly the vfio type1 API to QEMU,
where QEMU uses MemoryListeners to relay changes in the VM address space
to vfio.  While we maintain the type1 API to QEMU, the implementation is
different: the vgpu-type1 IOMMU backend does not map memory through the
IOMMU API, nor does it care about the requester ID of the device in use.
vgpus provide isolation and translation through device specific means,
such as mediation of the device and per process paging structures.  It's
therefore the responsibility of the GPU driver to call into this
database of VM mappings to get the translations it needs and register
those translations with the DMA API in case a physical IOMMU is present.
KVM uses the same type of listener to fill memory slots, so by doing
this, we can provide everything we need directly within the vfio
infrastructure without needing to assume we're operating with a
KVM-based VM.

> > > When the GPU driver wants to get a GPA, it does so from this database.
> > > If it wants to read from it, it could get the mm and read from the
> > > virtual memory, or pin the page for a GPA to HPA translation and read
> > > from the HPA.  There is no reason to poke directly through to the
> > > hypervisor here.  Let's design what you need into the vgpu version of
> > > the type1 IOMMU instead.  Thanks,
> > 
> > For KVM, to access a GPA, having it translated to HVA is enough.
> > 
> > IIUC this may be the only remaining problem between us: where should
> > a GPA be translated to HVA, KVM or VFIO?

Via the mechanism I describe above, the vgpu-type1 vfio backend will
implement a database of GPA to HVA addresses.  The architecture we're
trying to create here should provide interfaces to get that information
and keep it current with the state of the VM.

> Unfortunately it's not the only one. Another example is, the device-model
> may want to write-protect a gfn (RAM). In case that request goes to
> VFIO... how is it supposed to reach the KVM MMU?

Well, let's work through the problem.  How is the GFN related to the
device?  Is this some sort of page table for device mappings with a base
register in the vgpu hardware?  If so, then the vgpu driver can find the
HVA via the vgpu-type1 interface above.  What's required to
write-protect the page?  Can we do this at the page level, without
needing KVM?  If we wanted to write-protect a page for a user process,
how would we do it?  I think there are likely solutions to each of these
problems, but we need to start with respecting the software layering and
abstraction between various kernel components rather than calling into
them directly as our first inclination.  Thanks,

Alex