From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 27 Jan 2016 01:14:53 -0800
From: Neo Jia
To: Alex Williamson
Cc: "Ruan, Shuai", "Tian, Kevin", "kvm@vger.kernel.org", "igvt-g@lists.01.org",
 "Song, Jike", qemu-devel, Kirti Wankhede, "Lv, Zhiyuan", Paolo Bonzini,
 Gerd Hoffmann
Subject: Re: [Qemu-devel] VFIO based vGPU (was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
Message-ID: <20160127091432.GA10936@nvidia.com>
In-Reply-To: <1453851038.18049.9.camel@redhat.com>
References: <1453143919.32741.169.camel@redhat.com> <569F4C86.2070501@intel.com>
 <56A6083E.10703@intel.com> <1453757426.32741.614.camel@redhat.com>
 <20160126102003.GA14400@nvidia.com> <1453838773.15515.1.camel@redhat.com>
 <20160126222830.GB21927@nvidia.com> <1453851038.18049.9.camel@redhat.com>

On Tue, Jan 26, 2016 at 04:30:38PM -0700, Alex Williamson wrote:
> On Tue, 2016-01-26 at 14:28 -0800, Neo Jia wrote:
> > On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote:
> > > > 1.1 Under per-physical device sysfs:
> > > > ----------------------------------------------------------------------------------
> > > >
> > > > vgpu_supported_types - RO, lists the currently supported virtual GPU types and
> > > > their VGPU_IDs. VGPU_ID - a vGPU type identifier returned from reads of
> > > > "vgpu_supported_types".
> > > >
> > > > vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, creates a virtual
> > > > gpu device on a target physical GPU. idx: virtual device index inside a VM
> > > >
> > > > vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroys a virtual gpu device on a
> > > > target physical GPU
> > >
> > >
> > > I've noted in previous discussions that we need to separate user policy
> > > from kernel policy here, the kernel policy should not require a "VM
> > > UUID".  A UUID simply represents a set of one or more devices and an
> > > index picks the device within the set.  Whether that UUID matches a VM
> > > or is independently used is up to the user policy when creating the
> > > device.
> > >
> > > Personally I'd also prefer to get rid of the concept of indexes within a
> > > UUID set of devices and instead have each device be independent.  This
> > > seems to be an imposition of the nvidia implementation onto the kernel
> > > interface design.
> > >
> >
> > Hi Alex,
> >
> > I agree with you that we should not put the UUID concept into a kernel API. At
> > this point (without any prototyping), I am thinking of using a list of virtual
> > devices instead of a UUID.
>
> Hi Neo,
>
> A UUID is a perfectly fine name, so long as we let it be just a UUID and
> not the UUID matching some specific use case.
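(To make the create path above concrete, here is a rough, untested sketch of how
the vgpu_create store handler could parse that input.  It is only an illustration:
vgpu_create_device() is a made-up helper, and the real code will obviously differ.)

#include <linux/device.h>
#include <linux/pci.h>
#include <linux/string.h>

/* Hypothetical vendor entry point, not part of the proposal as posted. */
int vgpu_create_device(struct pci_dev *pdev, const char *uuid,
                       unsigned int idx, unsigned int vgpu_id);

/* Parse "<UUID>:<device index>:<vGPU type id>" written to vgpu_create. */
static ssize_t vgpu_create_store(struct device *dev,
                                 struct device_attribute *attr,
                                 const char *buf, size_t count)
{
        char uuid[37] = {0};
        const char *p;
        unsigned int idx, vgpu_id;
        int ret;

        p = strchr(buf, ':');
        if (!p || p - buf != 36)        /* canonical UUID is 36 chars */
                return -EINVAL;
        memcpy(uuid, buf, 36);

        if (sscanf(p + 1, "%u:%u", &idx, &vgpu_id) != 2)
                return -EINVAL;

        ret = vgpu_create_device(to_pci_dev(dev), uuid, idx, vgpu_id);
        return ret ? ret : count;
}
static DEVICE_ATTR_WO(vgpu_create);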
>
> > > >
> > > > int vgpu_map_virtual_bar
> > > > (
> > > >     uint64_t virt_bar_addr,
> > > >     uint64_t phys_bar_addr,
> > > >     uint32_t len,
> > > >     uint32_t flags
> > > > )
> > > >
> > > > EXPORT_SYMBOL(vgpu_map_virtual_bar);
> > >
> > >
> > > Per the implementation provided, this needs to be implemented in the
> > > vfio device driver, not in the iommu interface.  Finding the DMA mapping
> > > of the device and replacing it is wrong.  It should be remapped at the
> > > vfio device file interface using vm_ops.
> > >
> >
> > So you are basically suggesting that we are going to take a mmap fault and,
> > within that fault handler, we will go into the vendor driver to look up the
> > "pre-registered" mapping and remap there.
> >
> > Is my understanding correct?
>
> Essentially, hopefully the vendor driver will have already registered
> the backing for the mmap prior to the fault, but either way could work.
> I think the key though is that you want to remap it onto the vma
> accessing the vfio device file, not scanning it out of an IOVA mapping
> that might be dynamic and doing a vma lookup based on the point in time
> mapping of the BAR.  The latter doesn't give me much confidence that
> mappings couldn't change while the former should be a one time fault.

Hi Alex,

The fact is that the vendor driver can only resolve such a mmap fault by looking
up the mapping table that we have saved from the IOMMU memory listener when the
guest region gets programmed. Also, like you have mentioned below, such a mapping
between iova and hva shouldn't change as long as the SBIOS and guest OS are done
with their job.

Yes, you are right that it is a one time fault, but the gpu work is heavily
pipelined.

Probably we should just limit this interface to the guest MMIO region, and we can
have some crosscheck with the VFIO driver, which has monitored the config space
access, to make sure nothing is getting moved around?
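Just to make sure we are looking at the same picture, below is a rough sketch of
the fault-based remap as I understand your suggestion. It is illustrative only:
struct vgpu_vdev and vgpu_find_bar_backing() are made-up names, and locking and
error handling are omitted.

#include <linux/mm.h>

/* Hypothetical per-vGPU state and lookup helper. */
struct vgpu_vdev;
int vgpu_find_bar_backing(struct vgpu_vdev *vdev, pgoff_t pgoff,
                          unsigned long *pfn);

static int vgpu_mmap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
        struct vgpu_vdev *vdev = vma->vm_private_data;
        unsigned long pfn;

        /*
         * The vendor driver already knows which segment of the physical
         * GPU BAR backs this vGPU, so insert that page into the vma of
         * the vfio device file instead of chasing the guest IOVA mapping.
         */
        if (vgpu_find_bar_backing(vdev, vmf->pgoff, &pfn))
                return VM_FAULT_SIGBUS;

        if (vm_insert_pfn(vma, (unsigned long)vmf->virtual_address, pfn))
                return VM_FAULT_SIGBUS;

        return VM_FAULT_NOPAGE;
}

static const struct vm_operations_struct vgpu_mmap_ops = {
        .fault = vgpu_mmap_fault,
};

/* In the vfio device mmap() callback: leave the range unpopulated and let
 * the fault handler above fill it in on first access. */
static int vgpu_mmap(struct vgpu_vdev *vdev, struct vm_area_struct *vma)
{
        vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
        vma->vm_private_data = vdev;
        vma->vm_ops = &vgpu_mmap_ops;
        return 0;
}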
>
> In case it's not clear to folks at Intel, the purpose of this is that a
> vGPU may directly map a segment of the physical GPU MMIO space, but we
> may not know what segment that is at setup time, when QEMU does an mmap
> of the vfio device file descriptor.  The thought is that we can create
> an invalid mapping when QEMU calls mmap(), knowing that it won't be
> accessed until later, then we can fault in the real mmap on demand.  Do
> you need anything similar?
>
> > >
> > > > int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
> > > >
> > > > EXPORT_SYMBOL(vgpu_dma_do_translate);
> > > >
> > > > Still a lot to be added and modified, such as supporting multiple VMs and
> > > > multiple virtual devices, tracking the mapped / pinned region within the VGPU
> > > > IOMMU kernel driver, error handling, roll-back and locked memory size per user, etc.
> > >
> > > Particularly, handling of mapping changes is completely missing.  This
> > > cannot be a point in time translation, the user is free to remap
> > > addresses whenever they wish and device translations need to be updated
> > > accordingly.
> > >
> >
> > When you say "user", do you mean QEMU?
>
> vfio is a generic userspace driver interface, QEMU is a very, very
> important user of the interface, but not the only user.  So for this
> conversation, we're mostly talking about QEMU as the user, but we should
> be careful about assuming QEMU is the only user.
>

Understood.

I have to say that our focus at this moment is to support QEMU and KVM, but I
know the VFIO interface is much more than that, and that is why I think it is
right to leverage this framework so we can explore future use cases in userland
together.

> > Here, the memory that the guest driver is going to use for DMA will first be
> > pinned within the VM and then registered to QEMU, and therefore to the IOMMU
> > memory listener; eventually the pages will be pinned by the GPU or DMA engine.
> >
> > Since we are keeping the upper level code the same, thinking about the passthru
> > case, where the GPU has already put the real IOVA into its PTEs, I don't know
> > how QEMU can change that mapping without causing an IOMMU fault on an active
> > DMA device.
>
> For the virtual BAR mapping above, it's easy to imagine that mapping a
> BAR to a given address is at the guest's discretion, it may be mapped and
> unmapped, it may be mapped to different addresses at different points in
> time, the guest BIOS may choose to map it at yet another address, etc.
> So if somehow we were trying to setup a mapping for peer-to-peer, there
> are lots of ways that IOVA could change.  But even with RAM, we can
> support memory hotplug in a VM.  What was once a DMA target may be
> removed or may now be backed by something else.  Chipset configuration
> on the emulated platform may change how guest physical memory appears
> and that might change between VM boots.
>
> Currently with physical device assignment the memory listener watches
> for both maps and unmaps and updates the iotlb to match.  Just like real
> hardware doing these same sorts of things, we rely on the guest to stop
> using memory that's going to be moved as a DMA target prior to moving
> it.

Right, you can only do that when the device is quiescent.

As long as this is notified to the guest, I think we should be able to support
it, although the real implementation will depend on how the device gets into a
quiescent state.

This is definitely a very interesting feature we should explore, but I hope we
can first focus on the most basic functionality.

Thanks,
Neo

>
> > > > 4. Modules
> > > > ================================================================================
> > > >
> > > > Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko
> > > >
> > > > vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU
> > > >                            TYPE1 v1 and v2 interface.
> > >
> > > Depending on how intrusive it is, this can possibly be done within the
> > > existing type1 driver.  Either that or we can split out common code for
> > > use by a separate module.
> > >
> > > > vgpu.ko                  - provides the registration interface and virtual
> > > >                            device VFIO access.
> > > >
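(To give a rough idea of the registration interface provided by vgpu.ko: a GPU
vendor driver such as nvidia.ko would register each physical GPU together with a
set of callbacks, and vgpu.ko would expose the per-device sysfs files and the
VFIO device on its behalf.  The names below are purely illustrative and are not
taken from the actual code.)

#include <linux/device.h>
#include <linux/pci.h>

struct vgpu_device_ops {
        /* called when vgpu_create is written for this physical GPU */
        int  (*vgpu_create)(struct pci_dev *pdev, const char *uuid,
                            unsigned int idx, unsigned int vgpu_id);
        /* called when vgpu_destroy is written */
        void (*vgpu_destroy)(struct pci_dev *pdev, const char *uuid,
                             unsigned int idx);
        /* VFIO region access for the emulated config and MMIO spaces */
        ssize_t (*read)(struct pci_dev *pdev, char *buf, size_t count,
                        loff_t pos);
        ssize_t (*write)(struct pci_dev *pdev, const char *buf, size_t count,
                         loff_t pos);
};

int vgpu_register_device(struct pci_dev *pdev,
                         const struct vgpu_device_ops *ops);
void vgpu_unregister_device(struct pci_dev *pdev);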
> > > > 5. QEMU note
> > > > ================================================================================
> > > >
> > > > To allow us to focus on the VGPU kernel driver prototyping, we have introduced
> > > > a new VFIO class - vgpu - inside QEMU, so we don't have to change the existing
> > > > vfio/pci.c file and can use it as a reference for our implementation. It is
> > > > basically just a quick c & p from vfio/pci.c to quickly meet our needs.
> > > >
> > > > Once this proposal is finalized, we will move to vfio/pci.c instead of a new
> > > > class, and probably the only thing required is to have a new way to discover the
> > > > device.
> > > >
> > > > 6. Examples
> > > > ================================================================================
> > > >
> > > > On this server, we have two NVIDIA M60 GPUs.
> > > >
> > > > [root@cjia-vgx-kvm ~]# lspci -d 10de:13f2
> > > > 86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> > > > 87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> > > >
> > > > After nvidia.ko gets initialized, we can query the supported vGPU types by
> > > > reading "vgpu_supported_types" like the following:
> > > >
> > > > [root@cjia-vgx-kvm ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
> > > > 11:GRID M60-0B
> > > > 12:GRID M60-0Q
> > > > 13:GRID M60-1B
> > > > 14:GRID M60-1Q
> > > > 15:GRID M60-2B
> > > > 16:GRID M60-2Q
> > > > 17:GRID M60-4Q
> > > > 18:GRID M60-8Q
> > > >
> > > > For example, the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and we would
> > > > like to create a "GRID M60-4Q" vGPU on it for that VM:
> > > >
> > > > echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" >
> > > > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create
> > > >
> > > > Note: the number 0 here is the vGPU device index. So far the change has not been
> > > > tested with multiple vgpu devices yet, but we will support that.
> > > >
> > > > At this moment, if you query "vgpu_supported_types" it will still show all
> > > > supported virtual GPU types, as no virtual GPU resource is committed yet.
> > > >
> > > > Starting the VM:
> > > >
> > > > echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start
> > > >
> > > > then, the supported vGPU type query will return:
> > > >
> > > > [root@cjia-vgx-kvm /home/cjia]$
> > > > cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
> > > > 17:GRID M60-4Q
> > > >
> > > > So "vgpu_supported_types" needs to be re-read whenever a new virtual device gets
> > > > created, as the underlying HW might limit the supported types if there are
> > > > any existing VMs running.
> > > >
> > > > Then, when the VM gets shut down, a write to /sys/class/vgpu/vgpu_shutdown will
> > > > inform the GPU vendor driver to clean up resources.
> > > >
> > > > Eventually, those virtual GPUs can be removed by writing to vgpu_destroy under
> > > > device sysfs.
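(For reference, one possible shape of the handler behind /sys/class/vgpu/vgpu_start;
again this is only a sketch, and vgpu_start_devices() is a hypothetical helper that
asks the vendor driver to commit resources for every vGPU created under the given
UUID before the VM starts.)

#include <linux/device.h>
#include <linux/string.h>

int vgpu_start_devices(const char *uuid);   /* hypothetical helper */

static ssize_t vgpu_start_store(struct class *class,
                                struct class_attribute *attr,
                                const char *buf, size_t count)
{
        char uuid[37] = {0};
        int ret;

        if (count < 36)
                return -EINVAL;
        memcpy(uuid, buf, 36);

        ret = vgpu_start_devices(uuid);
        return ret ? ret : count;
}

static struct class_attribute vgpu_start_attr =
        __ATTR(vgpu_start, S_IWUSR, NULL, vgpu_start_store);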
> > >
> > >
> > > I'd like to hear Intel's thoughts on this interface.  Are there
> > > different vgpu capacities or priority classes that would necessitate
> > > different types of vgpus on Intel?
> > >
> > > I think there are some gaps in translating from named vgpu types to
> > > indexes here, along with my previous mention of the UUID/set oddity.
> > >
> > > Does Intel have a need for start and shutdown interfaces?
> > >
> > > Neo, wasn't there at some point information about how many of each type
> > > could be supported through these interfaces?  How does a user know their
> > > capacity limits?
> > >
> >
> > Thanks for reminding me of that; I think we probably forgot to put that *important*
> > information in the output of "vgpu_supported_types".
> >
> > Regarding the capacity, we can provide the frame buffer size as part of the
> > "vgpu_supported_types" output as well; I would imagine those will eventually
> > show up in the OpenStack management interface or virt-mgr.
> >
> > Basically, yes, there would be a separate column to show the number of instances
> > you can create for each type of vGPU on a specific physical GPU.
>
> Ok, Thanks,
>
> Alex
>