From: Alex Williamson
Date: Wed, 03 Feb 2016 13:44:47 -0700
Subject: Re: [Qemu-devel] RE: [iGVT-g] VFIO based vGPU (was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
To: "Tian, Kevin", "Lv, Zhiyuan", Gerd Hoffmann
Cc: Yang Zhang, "igvt-g@lists.01.org", qemu-devel, "kvm@vger.kernel.org", Paolo Bonzini

On Wed, 2016-02-03 at 08:04 +0000, Tian, Kevin wrote:
> > From: Zhiyuan Lv
> > Sent: Tuesday, February 02, 2016 3:35 PM
> > 
> > Hi Gerd/Alex,
> > 
> > On Mon, Feb 01, 2016 at 02:44:55PM -0700, Alex Williamson wrote:
> > > On Mon, 2016-02-01 at 14:10 +0100, Gerd Hoffmann wrote:
> > > >   Hi,
> > > > 
> > > > > > Unfortunately it's not the only one. Another example is that the device model
> > > > > > may want to write-protect a gfn (RAM). In case that request goes
> > > > > > to VFIO... how is it supposed to reach the KVM MMU?
> > > > > 
> > > > > Well, let's work through the problem.  How is the GFN related to the
> > > > > device?  Is this some sort of page table for device mappings with a base
> > > > > register in the vgpu hardware?
> > > > 
> > > > IIRC this is needed to make sure the guest can't bypass execbuffer
> > > > verification, and works like this:
> > > > 
> > > >   (1) guest submits execbuffer.
> > > >   (2) host makes execbuffer read-only for the guest.
> > > >   (3) verify the buffer (make sure it only accesses resources owned by
> > > >       the vm).
> > > >   (4) pass on execbuffer to the hardware.
> > > >   (5) when the gpu is done with it, make the execbuffer writable again.
> > > 
> > > Ok, so are there opportunities to do those page protections outside of
> > > KVM?  We should be able to get the vma for the buffer; can we do
> > > something with that to make it read-only?  Alternatively, can the vgpu
> > > driver copy it to a private buffer from which the hardware can execute?
> > > I'm not a virtual memory expert, but it doesn't seem like an
> > > insurmountable problem.  Thanks,
> > 
> > Originally iGVT-g used write-protection for privileged execbuffers, as Gerd
> > described. The latest implementation has dropped write-protection in favor
> > of a buffer copy, since the privileged command buffers are usually small. So
> > that part is fine.
> > 
> > But we need write-protection for graphics page table shadowing as well. Once
> > the guest driver modifies the GPU page table, we need to know that and update
> > the shadow page table accordingly. A buffer copy cannot help here. Thanks!
> > 
> 
> After walking through the whole thread again, let me do a summary here
> so everyone can be on the same page.
> 
> First, Jike told me before his vacation that, according to community
> comments, we cannot make any changes to the KVM module. Now I think that's
> not true. We can make necessary changes, as long as they are done in a
> structural/layered approach, without a hard assumption that KVMGT is the
> only user.
> That's the guideline we need to obey. :-)

We certainly need to separate the functionality that you're trying to
enable from the purer concept of vfio.  vfio is a userspace driver
interface, not a userspace driver interface for KVM-based virtual
machines.  Maybe it's more of a gimmick that we can assign PCI devices
to QEMU tcg VMs, but that's really just the proof of concept for more
useful capabilities, like supporting DPDK applications.  So, I
begrudgingly agree that structured/layered interactions are acceptable,
but consider what use cases may be excluded by doing so.

> Mostly we care about two aspects of a vgpu driver:
>   - services/callbacks which the vgpu driver provides to an external
>     framework (e.g. the vgpu core driver and VFIO);
>   - services/callbacks which the vgpu driver relies on for proper emulation
>     (e.g. from VFIO and/or the hypervisor).
> 
> The former is being discussed in another thread, so here let's focus
> on the latter.
> 
> In general, Intel GVT-g requires the services below for emulation:
> 
> 1) Selectively pass through a region to a VM
> --
> This can be supported by today's VFIO framework, by setting
> VFIO_REGION_INFO_FLAG_MMAP for the regions concerned. QEMU will then
> mmap those regions, which finally get added to the EPT tables of
> the target VM.
> 
> 2) Trap-and-emulate a region
> --
> Similarly, this can easily be achieved by clearing the MMAP flag for the
> regions concerned. Every access from the VM then goes through QEMU, then
> VFIO, and finally reaches the vgpu driver. The only concern is performance:
> we need some general mechanism for delivering I/O emulation requests
> directly from KVM in the kernel. For example, Alex mentioned a flavor based
> on file descriptor + offset. Let's move forward with the default QEMU
> forwarding, while brainstorming exit-less delivery in parallel.
> 
> 3) Inject a virtual interrupt
> --
> We can leverage the existing VFIO IRQ injection interface, including the
> configuration and irqfd interfaces.
> 
> 4) Map/unmap guest memory
> --
> It's there for KVM.

Map and unmap for whom?  For the vGPU or for the VM?  It seems like we
know how to map guest memory for the vGPU without KVM, but that's
covered in 7), so I'm not entirely sure what this is specifying.

> 5) Pin/unpin guest memory
> --
> IGD, or any PCI passthrough, should have the same requirement, so we should
> be able to leverage existing code in VFIO. The only tricky thing (Jike may
> elaborate after he is back) is that KVMGT requires pinning the EPT entry
> too, which requires some further change on the KVM side. But I'm not sure
> whether that still holds true after the design changes made in this thread,
> so I'll leave it to Jike to comment further.

PCI assignment requires pinning all of guest memory; I would think that
IGD would only need to pin selected memory, so is this simply stating
that both have the need to pin memory, not that they'll do it to the
same extent?

> 6) Write-protect a guest memory page
> --
> The primary purpose is GPU page table shadowing: we need to track
> modifications to the guest GPU page table so the shadow copy can be
> synchronized accordingly. Just think of CPU page table shadowing. An older
> example, as Zhiyuan pointed out, was write-protecting the guest cmd buffer,
> but that is no longer necessary.
> 
> So we need KVM to provide an interface through which agents can request
> such write-protection (not just for KVMGT; it could serve other tracking
> usages). Guangrong has been working on a general page tracking mechanism,
> upon which write-protection can easily be built. The review is still in
> progress.
I have a hard time believing we don't have the mechanics to do this
outside of KVM.  We should be able to write-protect user pages from the
kernel; this is how copy-on-write generally works.  So it seems like we
should be able to apply those same mechanics to our userspace process,
which just happens to be a KVM VM.  I'm hoping that Paolo might have
some ideas how to make this work, or maybe Intel has some virtual memory
experts who can point us in the right direction.

> 7) GPA->IOVA/HVA translation
> --
> It's required in various places, e.g.:
> - read a guest structure according to a GPA
> - replace a GPA with an IOVA in various shadow structures
> 
> We can maintain both translations in the vfio-iommu-type1 driver, since the
> necessary information is available at the map interface. And we should use
> a MemoryListener to update the database; it's already there for physical
> device passthrough (QEMU uses a MemoryListener and then relays to vfio).
> 
> vfio-vgpu will expose a query interface, through the vgpu core driver, so
> that the vgpu driver can use the above database for whatever purpose.
> 
> 
> ----
> Well, ending this write-up I realize that pretty much all the opens have
> been covered with a solution. We should then move forward to a prototype,
> upon which we can identify anything missing or overlooked (there definitely
> will be something), and also discuss the several remaining opens on top
> (such as exit-less emulation, pin/unpin, etc.). Another thing we need to
> consider is whether this new design is still compatible with the Xen side.
> 
> Thanks a lot, all, for the great discussion (especially Alex, with many
> good inputs)! I believe it is much clearer now than 2 weeks ago how to
> integrate KVMGT with VFIO.
> :-)

Thanks for your summary, Kevin.  It does seem like there are only a few
outstanding issues, which should be manageable, and hopefully the overall
approach is cleaner for QEMU and management tools, and provides a more
consistent user interface as well.  If we can translate the solution to
Xen, that's even better.  Thanks,

Alex