From mboxrd@z Thu Jan 1 00:00:00 1970
From: Neo Jia
Subject: Re: VFIO based vGPU (was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
Date: Tue, 26 Jan 2016 14:28:30 -0800
Message-ID: <20160126222830.GB21927@nvidia.com>
References: <1453092476.32741.67.camel@redhat.com> <569CA8AD.6070200@intel.com> <1453143919.32741.169.camel@redhat.com> <569F4C86.2070501@intel.com> <56A6083E.10703@intel.com> <1453757426.32741.614.camel@redhat.com> <20160126102003.GA14400@nvidia.com> <1453838773.15515.1.camel@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: "Tian, Kevin", "Song, Jike", Gerd Hoffmann, Paolo Bonzini, "Lv, Zhiyuan", "Ruan, Shuai", "kvm@vger.kernel.org", qemu-devel, "igvt-g@lists.01.org", Kirti Wankhede
To: Alex Williamson
Return-path:
Received: from hqemgate16.nvidia.com ([216.228.121.65]:6554 "EHLO hqemgate16.nvidia.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751128AbcAZW2d convert rfc822-to-8bit (ORCPT ); Tue, 26 Jan 2016 17:28:33 -0500
Content-Disposition: inline
In-Reply-To: <1453838773.15515.1.camel@redhat.com>
Sender: kvm-owner@vger.kernel.org
List-ID:

On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote:
> On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote:
> > On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> >
> > Hi Alex, Kevin and Jike,
> >
> > (Seems I shouldn't use an attachment; resending to the list with the
> > patches inline at the end.)
> >
> > Thanks for adding me to this technical discussion, a great opportunity
> > for us to design together and bring both the Intel and NVIDIA vGPU
> > solutions to the KVM platform.
> >
> > Instead of jumping directly to the proposal we have been working on
> > recently for NVIDIA vGPU on KVM, I think it is better for me to put out a
> > couple of quick comments / thoughts on the existing discussions in this
> > thread, as fundamentally I think we are solving the same problems: DMA,
> > interrupts and MMIO.
> >
> > Then we can look at what we have; hopefully we can reach some consensus soon.
> >
> > > Yes, and since you're creating and destroying the vgpu here, this is
> > > where I'd expect a struct device to be created and added to an IOMMU
> > > group.  The lifecycle management should really include links between
> > > the vGPU and physical GPU, which would be much, much easier to do with
> > > struct devices created here rather than at the point where we start
> > > doing vfio "stuff".
> >
> > In fact, to keep vfio-vgpu more generic, vgpu device creation and
> > management can be centralized and done in vfio-vgpu. That also includes
> > adding to the IOMMU group and the VFIO group.
>
> Is this really a good idea?  The concept of a vgpu is not unique to
> vfio, we want vfio to be a driver for a vgpu, not an integral part of
> the lifecycle of a vgpu.  That certainly doesn't exclude adding
> infrastructure to make lifecycle management of a vgpu more consistent
> between drivers, but it should be done independently of vfio.  I'll go
> back to the SR-IOV model, vfio is often used with SR-IOV VFs, but vfio
> does not create the VF, that's done in coordination with the PF making
> use of some PCI infrastructure for consistency between drivers.
>
> It seems like we need to take more advantage of the class and driver
> core support to perhaps set up a vgpu bus and class, with vfio-vgpu just
> being a driver for those devices.
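[Editorial aside: the driver-core pattern being suggested here -- a vgpu bus to which vfio-vgpu is just one of several possible drivers bound via probe() -- can be modeled in a few lines of userspace C. All names below are illustrative only; this is not the proposed kernel API.]

```c
#include <assert.h>
#include <stdio.h>

/* Toy userspace model of a "vgpu bus": drivers register with the bus,
 * and when a device appears each driver's probe() is offered the device
 * until one claims it. This mirrors the kernel driver-core matching
 * pattern Alex refers to; none of these names are real kernel symbols. */

struct vgpu_device {
    const char *name;
    const struct vgpu_driver *driver;   /* driver bound by probe, if any */
};

struct vgpu_driver {
    const char *name;
    int (*probe)(struct vgpu_device *dev);  /* return 0 to claim the device */
};

#define MAX_VGPU_DRIVERS 4
const struct vgpu_driver *vgpu_bus_drivers[MAX_VGPU_DRIVERS];
int vgpu_bus_ndrivers;

void vgpu_bus_register_driver(const struct vgpu_driver *drv)
{
    vgpu_bus_drivers[vgpu_bus_ndrivers++] = drv;
}

/* Device creation (e.g. via sysfs) ends with the device being offered
 * to each registered driver in turn. */
int vgpu_bus_add_device(struct vgpu_device *dev)
{
    for (int i = 0; i < vgpu_bus_ndrivers; i++) {
        if (vgpu_bus_drivers[i]->probe(dev) == 0) {
            dev->driver = vgpu_bus_drivers[i];
            return 0;
        }
    }
    return -1;  /* no driver claimed the device */
}

/* In this model, vfio-vgpu is just one driver on the bus. */
int vfio_vgpu_probe(struct vgpu_device *dev)
{
    printf("vfio-vgpu bound to %s\n", dev->name);
    return 0;
}

const struct vgpu_driver vfio_vgpu_driver = {
    .name  = "vfio-vgpu",
    .probe = vfio_vgpu_probe,
};
```

The point of the split is visible even in this sketch: device lifecycle (create/add) lives in the bus, while vfio-specific behavior lives entirely in one driver's callbacks.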
>
> > The graphics driver can register with vfio-vgpu to get management and
> > emulation callbacks.
> >
> > We already have struct vgpu_device in our proposal, which keeps a pointer
> > to the physical device.
> >
> > > - vfio_pci will inject an IRQ to guest only when physical IRQ
> > > generated; whereas vfio_vgpu may inject an IRQ for emulation
> > > purpose. Anyway they can share the same injection interface;
> >
> > The eventfd to inject the interrupt is known to vfio-vgpu; that fd should
> > be available to the graphics driver so that it can inject interrupts
> > directly when the physical device triggers an interrupt.
> >
> > Here is the proposal we have, please review.
> >
> > Please note the patches we have put out here are mainly for POC purposes,
> > to verify our understanding, reduce confusion and speed up our design,
> > although we are very happy to refine them into something that can
> > eventually be used by both parties and upstreamed.
> >
> > Linux vGPU kernel design
> > ==========================================
> >
> > Here we are proposing a generic Linux kernel module based on the VFIO
> > framework which allows different GPU vendors to plug in and provide their
> > GPU virtualization solution on KVM. The benefits of having such a generic
> > kernel module are:
> >
> > 1) Reuse the QEMU VFIO driver, supporting the VFIO UAPI
> >
> > 2) GPU HW agnostic management API for upper layer software such as libvirt
> >
> > 3) No duplicated VFIO kernel logic reimplemented by different GPU driver vendors
> >
> > 0. High level overview
> > ==========================================
> >
> >  user space:
> >                                 +-----------+  VFIO IOMMU IOCTLs
> >                       +---------| QEMU VFIO |-------------------------+
> >         VFIO IOCTLs   |         +-----------+                         |
> >                       |                                               |
> >  ---------------------|-----------------------------------------------|---------
> >                       |                                               |
> >   kernel space:       |  +--->----------->---+  (callback)            V
> >                       |  |                   v                 +------V-----+
> >   +----------+   +----V--^--+          +--+--+-----+           | VGPU       |
> >   |          |   |          |     +----| nvidia.ko +----->-----> TYPE1 IOMMU|
> >   | VFIO Bus <===| VGPU.ko  |<----|    +-----------+     |     +---++-------+
> >   |          |   |          |     | (register)           ^         ||
> >   +----------+   +-------+--+     |    +-----------+     |         ||
> >                          V        +----| i915.ko   +-----+     +---VV-------+
> >                          |             +-----^-----+           | TYPE1      |
> >                          |  (callback)       |                 | IOMMU      |
> >                          +-->------------>---+                 +------------+
> >
> >  access flow:
> >
> >   Guest MMIO / PCI config access
> >   |
> >   -------------------------------------------------
> >   |
> >   +-----> KVM VM_EXITs  (kernel)
> >           |
> >   -------------------------------------------------
> >           |
> >           +-----> QEMU VFIO driver (user)
> >                   |
> >   -------------------------------------------------
> >                   |
> >                   +----> VGPU kernel driver (kernel)
> >                          |
> >                          |
> >                          +----> vendor driver callback
> >
> >
> > 1. VGPU management interface
> > ==========================================
> >
> > This is the interface that allows upper layer software (mostly libvirt)
> > to query and configure virtual GPU devices in a HW agnostic fashion.
> > Also, this management interface gives the underlying GPU vendor the
> > flexibility to support virtual device hotplug, multiple virtual devices
> > per VM, multiple virtual devices from different physical devices, etc.
> >
> > 1.1 Under per-physical device sysfs:
> > ----------------------------------------------------------------------------------
> >
> > vgpu_supported_types - RO, lists the currently supported virtual GPU
> > types and their VGPU_IDs. VGPU_ID - a vGPU type identifier returned from
> > reads of "vgpu_supported_types".
> >
> > vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, create a virtual
> > gpu device on a target physical GPU. idx: virtual device index inside a VM
> >
> > vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroy a virtual gpu
> > device on a target physical GPU
>
>
> I've noted in previous discussions that we need to separate user policy
> from kernel policy here, the kernel policy should not require a "VM
> UUID".  A UUID simply represents a set of one or more devices and an
> index picks the device within the set.  Whether that UUID matches a VM
> or is independently used is up to the user policy when creating the
> device.
>
> Personally I'd also prefer to get rid of the concept of indexes within a
> UUID set of devices and instead have each device be independent.  This
> seems to be an imposition of the nvidia implementation onto the kernel
> interface design.
>

Hi Alex,

I agree with you that we should not put the UUID concept into a kernel API.
At this point (without any prototyping), I am thinking of using a list of
virtual devices instead of a UUID.

>
> > 1.3 Under vgpu class sysfs:
> > ----------------------------------------------------------------------------------
> >
> > vgpu_start - WO, input syntax <VM_UUID>, this will trigger the
> > registration interface to notify the GPU vendor driver to commit virtual
> > GPU resources for this target VM.
> >
> > Also, vgpu_start is a synchronous call; a successful return indicates
> > that all the requested vGPU resources have been fully committed and the
> > VMM should continue.
> >
> > vgpu_shutdown - WO, input syntax <VM_UUID>, this will trigger the
> > registration interface to notify the GPU vendor driver to release the
> > virtual GPU resources of this target VM.
> >
> > 1.4 Virtual device Hotplug
> > ----------------------------------------------------------------------------------
> >
> > To support virtual device hotplug, "vgpu_create" and "vgpu_destroy" can
> > be accessed during VM runtime, and the corresponding registration
> > callback will be invoked to allow the GPU vendor to support hotplug.
> >
> > To support hotplug, the vendor driver would take the necessary action to
> > handle the situation when a vgpu_create is done on a VM_UUID after
> > vgpu_start; that implies both create and start for that vgpu device.
> >
> > Similarly, vgpu_destroy implies a vgpu_shutdown on a running VM, but only
> > if the vendor driver supports vgpu hotplug.
> >
> > If hotplug is not supported and the VM is still running, the vendor
> > driver can return an error code to indicate that it is not supported.
> >
> > Separating create from start gives the flexibility to have:
> >
> > - multiple vgpu instances for a single VM, and
> > - the hotplug feature.
> >
> > 2. GPU driver vendor registration interface
> > ==========================================
> >
> > 2.1 Registration interface definition (include/linux/vgpu.h)
> > ----------------------------------------------------------------------------------
> >
> > extern int vgpu_register_device(struct pci_dev *dev,
> >                                 const struct gpu_device_ops *ops);
> >
> > extern void vgpu_unregister_device(struct pci_dev *dev);
> >
> > /**
> >  * struct gpu_device_ops - Structure to be registered for each physical GPU
> >  * to register the device to the vgpu module.
> >  *
> >  * @owner:                  The module owner.
> >  * @vgpu_supported_config:  Called to get information about supported vgpu
> >  *                          types.
> >  *                          @dev: pci device structure of physical GPU.
> >  *                          @config: should return string listing supported
> >  *                          config
> >  *                          Returns integer: success (0) or error (< 0)
> >  * @vgpu_create:            Called to allocate basic resources in graphics
> >  *                          driver for a particular vgpu.
> >  *                          @dev: physical pci device structure on which
> >  *                          vgpu should be created
> >  *                          @vm_uuid: uuid of the VM for which the vgpu is
> >  *                          intended
> >  *                          @instance: vgpu instance in that VM
> >  *                          @vgpu_id: the type of vgpu to be created
> >  *                          Returns integer: success (0) or error (< 0)
> >  * @vgpu_destroy:           Called to free resources in graphics driver for
> >  *                          a vgpu instance of that VM.
> >  *                          @dev: physical pci device structure to which
> >  *                          this vgpu points.
> >  *                          @vm_uuid: uuid of the VM to which the vgpu
> >  *                          belongs
> >  *                          @instance: vgpu instance in that VM
> >  *                          Returns integer: success (0) or error (< 0)
> >  *                          If the VM is running and vgpu_destroy is
> >  *                          called, the vGPU is being hot-unplugged.
> >  *                          Return an error if the VM is running and the
> >  *                          graphics driver doesn't support vgpu hotplug.
> >  * @vgpu_start:             Called to initiate the vGPU initialization
> >  *                          process in the graphics driver when the VM
> >  *                          boots, before qemu starts.
> >  *                          @vm_uuid: UUID of the VM which is booting.
> >  *                          Returns integer: success (0) or error (< 0)
> >  * @vgpu_shutdown:          Called to tear down vGPU related resources for
> >  *                          the VM.
> >  *                          @vm_uuid: UUID of the VM which is shutting
> >  *                          down.
> >  *                          Returns integer: success (0) or error (< 0)
> >  * @read:                   Read emulation callback.
> >  *                          @vdev: vgpu device structure
> >  *                          @buf: read buffer
> >  *                          @count: number of bytes to read
> >  *                          @address_space: specifies for which address
> >  *                          space the request is: pci_config_space, IO
> >  *                          register space or MMIO space.
> >  *                          Returns number of bytes read on success, or
> >  *                          error.
> >  * @write:                  Write emulation callback.
> >  *                          @vdev: vgpu device structure
> >  *                          @buf: write buffer
> >  *                          @count: number of bytes to be written
> >  *                          @address_space: specifies for which address
> >  *                          space the request is: pci_config_space, IO
> >  *                          register space or MMIO space.
> >  *                          Returns number of bytes written on success, or
> >  *                          error.
> >  * @vgpu_set_irqs:          Called to send the interrupt configuration
> >  *                          information that qemu set.
> >  *                          @vdev: vgpu device structure
> >  *                          @flags, index, start, count and *data: same as
> >  *                          in struct vfio_irq_set of the
> >  *                          VFIO_DEVICE_SET_IRQS API.
> >  *
> >  * A physical GPU that supports vGPU should be registered with the vgpu
> >  * module with a gpu_device_ops structure.
> >  */
> >
> > struct gpu_device_ops {
> >         struct module   *owner;
> >         int     (*vgpu_supported_config)(struct pci_dev *dev, char *config);
> >         int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
> >                                uint32_t instance, uint32_t vgpu_id);
> >         int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
> >                                 uint32_t instance);
> >         int     (*vgpu_start)(uuid_le vm_uuid);
> >         int     (*vgpu_shutdown)(uuid_le vm_uuid);
> >         ssize_t (*read)(struct vgpu_device *vdev, char *buf, size_t count,
> >                         uint32_t address_space, loff_t pos);
> >         ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
> >                          uint32_t address_space, loff_t pos);
> >         int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
> >                                  unsigned index, unsigned start, unsigned count,
> >                                  void *data);
> > };
>
>
> I wonder if it shouldn't be vfio-vgpu sub-drivers (ie, Intel and Nvidia)
> that register these ops with the main vfio-vgpu driver and they should
> also include a probe() function which allows us to associate a given
> vgpu device with a set of vendor ops.
>
>
> >
> > 2.2 Details for callbacks we haven't mentioned above.
> > ---------------------------------------------------------------------------------
> >
> > vgpu_supported_config: allows the vendor driver to specify the supported
> >                        vGPU type/configuration
> >
> > vgpu_create          : create a virtual GPU device, can be used for device
> >                        hotplug.
> >
> > vgpu_destroy         : destroy a virtual GPU device, can be used for
> >                        device hotplug.
> >
> > vgpu_start           : callback function to notify the vendor driver that
> >                        a vgpu device has come to life for a given virtual
> >                        machine.
> >
> > vgpu_shutdown        : callback function to notify the vendor driver
> >
> > read                 : callback to vendor driver to handle virtual device
> >                        config space or MMIO read access
> >
> > write                : callback to vendor driver to handle virtual device
> >                        config space or MMIO write access
> >
> > vgpu_set_irqs        : callback to vendor driver to pass along the
> >                        interrupt information for the target virtual
> >                        device, so the vendor driver can inject interrupts
> >                        into the virtual machine for this device.
> >
> > 2.3 Potential additional virtual device configuration registration interface:
> > ---------------------------------------------------------------------------------
> >
> > callback function to describe the MMAP behavior of the virtual GPU
> >
> > callback function to allow the GPU vendor driver to provide PCI config
> > space backing memory.
> >
> > 3. VGPU TYPE1 IOMMU
> > ==========================================
> >
> > Here we are providing a TYPE1 IOMMU for vGPU which will basically keep
> > track of the mappings and save the QEMU mm for later reference.
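[Editorial aside: the tracking described above -- remember which iova ranges the user mapped so they can be translated later and unmapped with a byte count returned -- can be sketched in userspace C. The real vfio type1 driver uses an RB tree; a linked list keeps the illustration short, and all names here are hypothetical.]

```c
#include <stdint.h>
#include <stdlib.h>

/* Toy model of type1-IOMMU-style DMA tracking: record iova -> vaddr
 * ranges on map, look them up for translation/pinning, and return the
 * unmapped byte count on unmap. No real page pinning happens here. */

struct vgpu_dma {
    uint64_t iova;          /* guest/IO virtual address */
    uint64_t vaddr;         /* user (QEMU) virtual address backing it */
    uint64_t size;
    struct vgpu_dma *next;
};

static struct vgpu_dma *dma_list;

/* IOMMU_MAP_DMA analog: just record the mapping. */
void vgpu_dma_link(uint64_t iova, uint64_t vaddr, uint64_t size)
{
    struct vgpu_dma *d = malloc(sizeof(*d));
    d->iova  = iova;
    d->vaddr = vaddr;
    d->size  = size;
    d->next  = dma_list;
    dma_list = d;
}

/* Translate an iova to the user vaddr backing it, if tracked. */
int vgpu_dma_find(uint64_t iova, uint64_t *vaddr)
{
    for (struct vgpu_dma *d = dma_list; d; d = d->next) {
        if (iova >= d->iova && iova < d->iova + d->size) {
            *vaddr = d->vaddr + (iova - d->iova);
            return 0;
        }
    }
    return -1;
}

/* IOMMU_UNMAP_DMA analog: remove the range and return the unmapped
 * bytes, since the caller (QEMU) depends on that count. */
uint64_t vgpu_dma_unlink(uint64_t iova)
{
    for (struct vgpu_dma **p = &dma_list; *p; p = &(*p)->next) {
        if ((*p)->iova == iova) {
            struct vgpu_dma *d = *p;
            uint64_t size = d->size;
            *p = d->next;
            free(d);
            return size;
        }
    }
    return 0;   /* nothing unmapped */
}
```

Keeping this logic in common code rather than in each vendor driver is exactly the maintenance argument made below.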
> >
> > You can find the quick/ugly implementation in the attached patch file,
> > which is actually just a simple version of Alex's type1 IOMMU without the
> > actual real mapping when IOMMU_MAP_DMA / IOMMU_UNMAP_DMA is called.
> >
> > We have thought about providing another vendor driver registration
> > interface so such tracking information would be sent to the vendor
> > driver, which would use the QEMU mm to do the get_user_pages /
> > remap_pfn_range when required. After doing a quick implementation within
> > our driver, I noticed the following issues:
> >
> > 1) It puts OS/VFIO logic into the vendor driver, which will be a
> > maintenance issue.
> >
> > 2) Every driver vendor has to implement their own RB tree, instead of
> > reusing the common existing VFIO code (vfio_find/link/unlink_dma)
> >
> > 3) IOMMU_UNMAP_DMA is expected to return "unmapped bytes" back to the
> > caller/QEMU; better not to have anything inside a vendor driver that the
> > VFIO caller immediately depends on.
> >
> > Based on the above considerations, we decided to implement the DMA
> > tracking logic within the VGPU TYPE1 IOMMU code (ideally, this should be
> > merged into the current TYPE1 IOMMU code) and expose two symbols to the
> > outside for MMIO mapping and page translation and pinning.
> >
> > Also, with an mmap MMIO interface between virtual and physical, this
> > allows a para-virtualized guest driver to access its virtual MMIO without
> > taking an MMAP fault hit, and we can also support different MMIO sizes
> > between the virtual and physical device.
> >
> > int vgpu_map_virtual_bar
> > (
> >     uint64_t virt_bar_addr,
> >     uint64_t phys_bar_addr,
> >     uint32_t len,
> >     uint32_t flags
> > )
> >
> > EXPORT_SYMBOL(vgpu_map_virtual_bar);
>
>
> Per the implementation provided, this needs to be implemented in the
> vfio device driver, not in the iommu interface.  Finding the DMA mapping
> of the device and replacing it is wrong.  It should be remapped at the
> vfio device file interface using vm_ops.
>

So you are basically suggesting that we take an mmap fault, and within that
fault handler, go into the vendor driver to look up the "pre-registered"
mapping and remap there. Is my understanding correct?

>
> > int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
> >
> > EXPORT_SYMBOL(vgpu_dma_do_translate);
> >
> > There is still a lot to be added and modified, such as supporting
> > multiple VMs and multiple virtual devices, tracking the mapped / pinned
> > regions within the VGPU IOMMU kernel driver, error handling, roll-back
> > and locked memory size per user, etc.
>
> Particularly, handling of mapping changes is completely missing.  This
> cannot be a point in time translation, the user is free to remap
> addresses whenever they wish and device translations need to be updated
> accordingly.
>

When you say "user", do you mean QEMU? Here, whatever DMA the guest driver
launches will first be pinned within the VM, and then registered to QEMU,
therefore the IOMMU memory listener; eventually the pages will be pinned by
the GPU or DMA engine.

Since we are keeping the upper level code the same, and thinking about the
passthrough case, where the GPU has already put the real IOVA into its PTEs,
I don't know how QEMU can change that mapping without causing an IOMMU fault
on an active DMA device.

>
> > 4. Modules
> > ==========================================
> >
> > Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko
> >
> > vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU
> >                            TYPE1 v1 and v2 interface.
>
> Depending on how intrusive it is, this can possibly be done within the
> existing type1 driver.  Either that or we can split out common code for
> use by a separate module.
>
> > vgpu.ko                  - provides the registration interface and
> >                            virtual device VFIO access.
> >
> > 5. QEMU note
> > ==========================================
> >
> > To allow us to focus on the VGPU kernel driver prototyping, we have
> > introduced a new VFIO class - vgpu - inside QEMU, so we don't have to
> > change the existing vfio/pci.c file, and we use it as a reference for our
> > implementation. It is basically just a quick c & p from vfio/pci.c to
> > quickly meet our needs.
> >
> > Once this proposal is finalized, we will move to vfio/pci.c instead of a
> > new class, and probably the only thing required is a new way to discover
> > the device.
> >
> > 6. Examples
> > ==========================================
> >
> > On this server, we have two NVIDIA M60 GPUs.
> >
> > [root@cjia-vgx-kvm ~]# lspci -d 10de:13f2
> > 86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> > 87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> >
> > After nvidia.ko gets initialized, we can query the supported vGPU types
> > by reading "vgpu_supported_types" as follows:
> >
> > [root@cjia-vgx-kvm ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
> > 11:GRID M60-0B
> > 12:GRID M60-0Q
> > 13:GRID M60-1B
> > 14:GRID M60-1Q
> > 15:GRID M60-2B
> > 16:GRID M60-2Q
> > 17:GRID M60-4Q
> > 18:GRID M60-8Q
> >
> > For example, the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and we
> > would like to create a "GRID M60-4Q" VM on it.
> >
> > echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create
> >
> > Note: the number 0 here is the vGPU device index. So far the change is
> > not tested for multiple vgpu devices yet, but we will support that.
> >
> > At this moment, if you query "vgpu_supported_types" it will still show
> > all supported virtual GPU types, as no virtual GPU resource has been
> > committed yet.
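[Editorial aside: management software consuming the "VGPU_ID:name" lines shown above would need a small parser. A hedged sketch, assuming only the "11:GRID M60-0B"-style format visible in the example output:]

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse one "vgpu_supported_types" entry of the form "VGPU_ID:name",
 * e.g. "17:GRID M60-4Q". A real consumer (libvirt, etc.) would read
 * these lines from the per-device sysfs file; here we parse a string. */
int parse_vgpu_type(const char *line, unsigned *vgpu_id, char *name, size_t len)
{
    const char *colon = strchr(line, ':');

    if (!colon || sscanf(line, "%u", vgpu_id) != 1)
        return -1;                      /* malformed entry */

    snprintf(name, len, "%s", colon + 1);
    name[strcspn(name, "\n")] = '\0';   /* strip a trailing newline */
    return 0;
}
```

Note that since the listing shrinks as resources are committed (see the vgpu_start example below in the thread), a consumer has to re-read the file rather than cache it.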
> >
> > Starting the VM:
> >
> > echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start
> >
> > Then the supported vGPU type query will return:
> >
> > [root@cjia-vgx-kvm /home/cjia]$
> > > cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
> > 17:GRID M60-4Q
> >
> > So vgpu_supported_config needs to be called whenever a new virtual
> > device gets created, as the underlying HW might limit the supported
> > types if there are any existing VMs running.
> >
> > Then, when the VM gets shut down, a write to /sys/class/vgpu/vgpu_shutdown
> > will inform the GPU vendor driver to clean up resources.
> >
> > Eventually, those virtual GPUs can be removed by writing to vgpu_destroy
> > under the device sysfs.
>
>
> I'd like to hear Intel's thoughts on this interface.  Are there
> different vgpu capacities or priority classes that would necessitate
> different types of vgpus on Intel?
>
> I think there are some gaps in translating from named vgpu types to
> indexes here, along with my previous mention of the UUID/set oddity.
>
> Does Intel have a need for start and shutdown interfaces?
>
> Neo, wasn't there at some point information about how many of each type
> could be supported through these interfaces?  How does a user know their
> capacity limits?
>

Thanks for reminding me; I think we forgot to include that *important*
information in the output of "vgpu_supported_types".

Regarding capacity, we can provide the frame buffer size as part of the
"vgpu_supported_types" output as well; I would imagine those will
eventually show up in the OpenStack management interface or virt-manager.

Basically, yes, there would be a separate column showing the number of
instances you can create for each type of vGPU on a specific physical GPU.

Thanks,
Neo

> Thanks,
> Alex
>