From: Alex Williamson
Subject: Re: VFIO based vGPU (was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
Date: Tue, 26 Jan 2016 16:30:38 -0700
Message-ID: <1453851038.18049.9.camel@redhat.com>
To: Neo Jia
Cc: "Tian, Kevin", "Song, Jike", Gerd Hoffmann, Paolo Bonzini,
 "Lv, Zhiyuan", "Ruan, Shuai", kvm@vger.kernel.org, qemu-devel,
 igvt-g@lists.01.org, Kirti Wankhede
In-Reply-To: <20160126222830.GB21927@nvidia.com>

On Tue, 2016-01-26 at 14:28 -0800, Neo Jia wrote:
> On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote:
> > > 1.1 Under per-physical device sysfs:
> > > ----------------------------------------------------------------------------------
> > >
> > > vgpu_supported_types - RO, lists the currently supported virtual GPU types
> > > and their VGPU_IDs.  A VGPU_ID is a vGPU type identifier returned from
> > > reads of "vgpu_supported_types".
> > >
> > > vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, creates a virtual
> > > gpu device on a target physical GPU.  idx: virtual device index inside a VM
> > >
> > > vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroys a virtual gpu
> > > device on a target physical GPU
> >
> >
> > I've noted in previous discussions that we need to separate user policy
> > from kernel policy here; the kernel policy should not require a "VM
> > UUID".  A UUID simply represents a set of one or more devices and an
> > index picks the device within the set.  Whether that UUID matches a VM
> > or is independently used is up to the user policy when creating the
> > device.
> >
> > Personally I'd also prefer to get rid of the concept of indexes within a
> > UUID set of devices and instead have each device be independent.  This
> > seems to be an imposition of the nvidia implementation on the kernel
> > interface design.
> >
>
> Hi Alex,
>
> I agree with you that we should not put the UUID concept into a kernel API.
> At this point (without any prototyping), I am thinking of using a list of
> virtual devices instead of a UUID.

Hi Neo,

A UUID is a perfectly fine name, so long as we let it be just a UUID and
not the UUID matching some specific use case.
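For instance, the create attribute could treat the UUID as nothing more
than an opaque group name.  A rough sketch of what I have in mind, where
vgpu_vendor_create() is a made-up placeholder for whatever entry point
the vendor driver actually provides:

  #include <linux/device.h>
  #include <linux/slab.h>
  #include <linux/string.h>

  static ssize_t vgpu_create_store(struct device *dev,
                                   struct device_attribute *attr,
                                   const char *buf, size_t count)
  {
          char *str, *cp, *uuid_str;
          unsigned int idx, type_id;
          int ret = -EINVAL;

          str = cp = kstrndup(buf, count, GFP_KERNEL);
          if (!str)
                  return -ENOMEM;

          /* "<uuid>:<idx>:<type_id>" - the uuid is only an opaque
           * name for a set of devices, nothing VM-specific. */
          uuid_str = strsep(&cp, ":");
          if (!cp || sscanf(cp, "%u:%u", &idx, &type_id) != 2)
                  goto out;

          /* The vendor driver decides whether this type can still
           * be instantiated on this physical GPU. */
          ret = vgpu_vendor_create(dev, uuid_str, idx, type_id);
  out:
          kfree(str);
          return ret ? ret : count;
  }
  static DEVICE_ATTR_WO(vgpu_create);

The kernel never needs to know or care whether the name happens to be a
VM UUID.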
> > > int vgpu_map_virtual_bar
> > > (
> > >     uint64_t virt_bar_addr,
> > >     uint64_t phys_bar_addr,
> > >     uint32_t len,
> > >     uint32_t flags
> > > )
> > >
> > > EXPORT_SYMBOL(vgpu_map_virtual_bar);
> >
> >
> > Per the implementation provided, this needs to be implemented in the
> > vfio device driver, not in the iommu interface.  Finding the DMA mapping
> > of the device and replacing it is wrong.  It should be remapped at the
> > vfio device file interface using vm_ops.
> >
>
> So you are basically suggesting that we are going to take a mmap fault and,
> within that fault handler, go into the vendor driver to look up the
> "pre-registered" mapping and remap there.
>
> Is my understanding correct?

Essentially, hopefully the vendor driver will have already registered
the backing for the mmap prior to the fault, but either way could work.
I think the key though is that you want to remap it onto the vma
accessing the vfio device file, not scan it out of an IOVA mapping that
might be dynamic and do a vma lookup based on the point-in-time mapping
of the BAR.  The latter doesn't give me much confidence that mappings
couldn't change, while the former should be a one-time fault.

In case it's not clear to the folks at Intel, the purpose of this is
that a vGPU may directly map a segment of the physical GPU MMIO space,
but we may not know which segment that is at setup time, when QEMU does
an mmap of the vfio device file descriptor.  The thought is that we can
create an invalid mapping when QEMU calls mmap(), knowing that it won't
be accessed until later, then fault in the real mmap on demand.  Do you
need anything similar?

> >
> > > int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
> > >
> > > EXPORT_SYMBOL(vgpu_dma_do_translate);
> > >
> > > Still a lot to be added and modified, such as supporting multiple VMs and
> > > multiple virtual devices, tracking the mapped / pinned region within the
> > > VGPU IOMMU kernel driver, error handling, roll-back and locked memory
> > > size per user, etc.
> >
> > Particularly, handling of mapping changes is completely missing.  This
> > cannot be a point-in-time translation; the user is free to remap
> > addresses whenever they wish and device translations need to be updated
> > accordingly.
> >
>
> When you say "user", do you mean QEMU?

vfio is a generic userspace driver interface; QEMU is a very, very
important user of the interface, but not the only user.  So for this
conversation we're mostly talking about QEMU as the user, but we should
be careful about assuming QEMU is the only user.

> Here, whatever DMA the guest driver is going to launch will first be
> pinned within the VM and then registered to QEMU, therefore to the IOMMU
> memory listener; eventually the pages will be pinned by the GPU or DMA
> engine.
>
> Since we are keeping the upper level code the same, thinking about the
> passthru case, where the GPU has already put the real IOVA into its PTEs,
> I don't know how QEMU can change that mapping without causing an IOMMU
> fault on an active DMA device.
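To be a bit more concrete about the fault-in idea above, here's a rough
sketch of the mmap path I'd expect in the vendor's vfio driver (4.x-era
vm_ops; struct vgpu_device and vgpu_get_bar_pfn() are made-up
placeholders for however the vendor driver tracks the current backing):

  #include <linux/mm.h>

  static int vgpu_vfio_fault(struct vm_area_struct *vma,
                             struct vm_fault *vmf)
  {
          struct vgpu_device *vgpu = vma->vm_private_data;
          unsigned long pfn;
          int ret;

          /* Look up the host PFN currently backing this offset of
           * the virtual BAR and insert it into the faulting vma. */
          pfn = vgpu_get_bar_pfn(vgpu, vmf->pgoff);

          ret = vm_insert_pfn(vma, (unsigned long)vmf->virtual_address,
                              pfn);
          if (ret && ret != -EBUSY)  /* -EBUSY: raced another fault */
                  return VM_FAULT_SIGBUS;

          return VM_FAULT_NOPAGE;
  }

  static const struct vm_operations_struct vgpu_vfio_vm_ops = {
          .fault = vgpu_vfio_fault,
  };

  static int vgpu_vfio_mmap(void *device_data, struct vm_area_struct *vma)
  {
          /* Nothing is mapped yet; everything faults in on demand. */
          vma->vm_private_data = device_data;
          vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
          vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
          vma->vm_ops = &vgpu_vfio_vm_ops;
          return 0;
  }

The fault resolves against the vma on the vfio device file itself, so we
never need to chase the mapping through a guest IOVA.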
For the virtual BAR mapping above, it's easy to imagine that mapping a
BAR to a given address is at the guest's discretion; it may be mapped
and unmapped, it may be mapped to different addresses at different
points in time, the guest BIOS may choose to map it at yet another
address, etc.  So if somehow we were trying to set up a mapping for
peer-to-peer, there are lots of ways that IOVA could change.  But even
with RAM, we can support memory hotplug in a VM.  What was once a DMA
target may be removed or may now be backed by something else.  Chipset
configuration on the emulated platform may change how guest physical
memory appears, and that might change between VM boots.

Currently with physical device assignment the memory listener watches
for both maps and unmaps and updates the iotlb to match.  Just like
real hardware doing these same sorts of things, we rely on the guest
to stop using memory that's going to be moved as a DMA target prior to
moving it.

> > > 4. Modules
> > > ================================================================================
> > >
> > > Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko
> > >
> > > vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU
> > >                            TYPE1 v1 and v2 interface.
> >
> > Depending on how intrusive it is, this can possibly be done within the
> > existing type1 driver.  Either that or we can split out common code for
> > use by a separate module.
> >
> > > vgpu.ko                  - provides the registration interface and
> > >                            virtual device VFIO access.
> > >
> > > 5. QEMU note
> > > ================================================================================
> > >
> > > To allow us to focus on the VGPU kernel driver prototyping, we have
> > > introduced a new VFIO class - vgpu - inside QEMU, so we don't have to
> > > change the existing vfio/pci.c file and can use it as a reference for
> > > our implementation.  It is basically just a quick c & p from vfio/pci.c
> > > to quickly meet our needs.
> > >
> > > Once this proposal is finalized, we will move to vfio/pci.c instead of a
> > > new class, and probably the only thing required is to have a new way to
> > > discover the device.
> > >
> > > 6. Examples
> > > ================================================================================
> > >
> > > On this server, we have two NVIDIA M60 GPUs.
> > >
> > > [root@cjia-vgx-kvm ~]# lspci -d 10de:13f2
> > > 86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> > > 87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> > >
> > > After nvidia.ko gets initialized, we can query the supported vGPU types by
> > > reading "vgpu_supported_types" like the following:
> > >
> > > [root@cjia-vgx-kvm ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
> > > 11:GRID M60-0B
> > > 12:GRID M60-0Q
> > > 13:GRID M60-1B
> > > 14:GRID M60-1Q
> > > 15:GRID M60-2B
> > > 16:GRID M60-2Q
> > > 17:GRID M60-4Q
> > > 18:GRID M60-8Q
> > >
> > > For example, if the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818 and we
> > > would like to create a "GRID M60-4Q" vGPU on this device:
> > >
> > > echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" >
> > > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create
> > >
> > > Note: the number 0 here is the vGPU device index.  So far the change has
> > > not been tested with multiple vgpu devices yet, but we will support that.
> > >
> > > At this moment, if you query "vgpu_supported_types" it will still show all
> > > supported virtual GPU types, as no virtual GPU resource has been committed
> > > yet.
> > >
> > > Starting the VM:
> > >
> > > echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start
> > >
> > > then the supported vGPU type query will return:
> > >
> > > [root@cjia-vgx-kvm /home/cjia]$
> > > cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
> > > 17:GRID M60-4Q
> > >
> > > So "vgpu_supported_types" needs to be read whenever a new virtual device
> > > gets created, as the underlying HW might limit the supported types if
> > > there are any existing VMs running.
> > >
> > > Then, when the VM gets shut down, a write to /sys/class/vgpu/vgpu_shutdown
> > > will inform the GPU vendor driver to clean up resources.
> > >
> > > Eventually, those virtual GPUs can be removed by writing to vgpu_destroy
> > > under the device sysfs.
> >
> >
> > I'd like to hear Intel's thoughts on this interface.  Are there
> > different vgpu capacities or priority classes that would necessitate
> > different types of vgpus on Intel?
> >
> > I think there are some gaps in translating from named vgpu types to
> > indexes here, along with my previous mention of the UUID/set oddity.
> >
> > Does Intel have a need for start and shutdown interfaces?
> >
> > Neo, wasn't there at some point information about how many of each type
> > could be supported through these interfaces?  How does a user know their
> > capacity limits?
> >
>
> Thanks for reminding me of that; I think we probably forgot to put that
> *important* information in the output of "vgpu_supported_types".
>
> Regarding the capacity, we can provide the frame buffer size as part of the
> "vgpu_supported_types" output as well; I would imagine those will eventually
> show up in the OpenStack management interface or virt-manager.
>
> Basically, yes, there would be a separate column showing the number of
> instances you can create of each type of VGPU on a specific physical GPU.

Ok.  Thanks,

Alex
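P.S.  Just so we're imagining the same thing, I assume a capacity column
would make the output look something like this (format entirely made up):

  # cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
  17:GRID M60-4Q:4096M:2
  18:GRID M60-8Q:8192M:1

i.e. <VGPU_ID>:<name>:<framebuffer size>:<instances still available>,
with the last column shrinking as vGPUs are created on the device.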