Date: Wed, 27 Jan 2016 13:48:37 -0800
From: Neo Jia
Message-ID: <20160127214837.GA6182@nvidia.com>
References: <56A6083E.10703@intel.com> <1453757426.32741.614.camel@redhat.com> <20160126102003.GA14400@nvidia.com> <1453838773.15515.1.camel@redhat.com> <20160126222830.GB21927@nvidia.com> <1453851038.18049.9.camel@redhat.com> <20160127091432.GA10936@nvidia.com> <1453911016.6261.3.camel@redhat.com>
In-Reply-To: <1453911016.6261.3.camel@redhat.com>
Subject: Re: [Qemu-devel] VFIO based vGPU (was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
To: Alex Williamson
Cc: "Ruan, Shuai", "Tian, Kevin", kvm@vger.kernel.org, igvt-g@lists.01.org, "Song, Jike", qemu-devel, Kirti Wankhede, "Lv, Zhiyuan", Paolo Bonzini, Gerd Hoffmann

On Wed, Jan 27, 2016 at 09:10:16AM -0700, Alex Williamson wrote:
> On Wed, 2016-01-27 at 01:14 -0800, Neo Jia wrote:
> > On Tue, Jan 26, 2016 at 04:30:38PM -0700, Alex Williamson wrote:
> > > On Tue, 2016-01-26 at 14:28 -0800, Neo Jia wrote:
> > > > On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote:
> > > > > > 1.1 Under per-physical device sysfs:
> > > > > > ----------------------------------------------------------------------------------
> > > > > >
> > > > > > vgpu_supported_types - RO, list the currently supported virtual GPU types and their
> > > > > > VGPU_ID. VGPU_ID - a vGPU type identifier returned from reads of
> > > > > > "vgpu_supported_types".
> > > > > >
> > > > > > vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, create a virtual
> > > > > > gpu device on a target physical GPU. idx: virtual device index inside a VM
> > > > > >
> > > > > > vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroy a virtual gpu device on a
> > > > > > target physical GPU
> > > > >
> > > > >
> > > > > I've noted in previous discussions that we need to separate user policy
> > > > > from kernel policy here, the kernel policy should not require a "VM
> > > > > UUID".  A UUID simply represents a set of one or more devices and an
> > > > > index picks the device within the set.  Whether that UUID matches a VM
> > > > > or is independently used is up to the user policy when creating the
> > > > > device.
> > > > >
> > > > > Personally I'd also prefer to get rid of the concept of indexes within a
> > > > > UUID set of devices and instead have each device be independent.  This
> > > > > seems to be an imposition of the nvidia implementation onto the kernel
> > > > > interface design.
> > > > >
> > > >
> > > > Hi Alex,
> > > >
> > > > I agree with you that we should not put the UUID concept into a kernel API. At
> > > > this point (without any prototyping), I am thinking of using a list of virtual
> > > > devices instead of a UUID.
> > >
> > > Hi Neo,
> > >
> > > A UUID is a perfectly fine name, so long as we let it be just a UUID and
> > > not the UUID matching some specific use case.
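
To make that direction concrete, below is a very rough sketch of what a
policy-free "vgpu_create" store could look like if the UUID stays an opaque
name for a set of devices (no per-UUID index) and the type id comes from
"vgpu_supported_types". The "<uuid>:<type_id>" input format and the
vgpu_core_create() helper are placeholders for illustration only, not a
proposed ABI:

/* Illustrative sketch only -- all names below are placeholders. */
#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/string.h>

/* hypothetical helper implemented by the vgpu core module */
extern int vgpu_core_create(struct device *parent, const char *uuid,
			    unsigned int type_id);

static ssize_t vgpu_create_store(struct device *dev,
				 struct device_attribute *attr,
				 const char *buf, size_t count)
{
	char *input, *type_str, *uuid;
	unsigned int type_id;
	int ret;

	input = kstrndup(buf, count, GFP_KERNEL);
	if (!input)
		return -ENOMEM;

	/* split "<uuid>:<type_id>" -- the uuid is just an opaque set name */
	type_str = input;
	uuid = strsep(&type_str, ":");
	if (!type_str || kstrtouint(strim(type_str), 0, &type_id)) {
		ret = -EINVAL;
		goto out;
	}

	ret = vgpu_core_create(dev, strim(uuid), type_id);
out:
	kfree(input);
	return ret ? ret : count;
}
static DEVICE_ATTR_WO(vgpu_create);

From userspace that would just be, e.g., echo "$uuid:3" > .../vgpu_create,
with "3" being whatever type id the user picked from vgpu_supported_types.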
> > >
> > > > > >
> > > > > > int vgpu_map_virtual_bar
> > > > > > (
> > > > > >     uint64_t virt_bar_addr,
> > > > > >     uint64_t phys_bar_addr,
> > > > > >     uint32_t len,
> > > > > >     uint32_t flags
> > > > > > )
> > > > > >
> > > > > > EXPORT_SYMBOL(vgpu_map_virtual_bar);
> > > > >
> > > > >
> > > > > Per the implementation provided, this needs to be implemented in the
> > > > > vfio device driver, not in the iommu interface.  Finding the DMA mapping
> > > > > of the device and replacing it is wrong.  It should be remapped at the
> > > > > vfio device file interface using vm_ops.
> > > > >
> > > >
> > > > So you are basically suggesting that we are going to take an mmap fault and,
> > > > within that fault handler, go into the vendor driver to look up the
> > > > "pre-registered" mapping and remap there.
> > > >
> > > > Is my understanding correct?
> > >
> > > Essentially, hopefully the vendor driver will have already registered
> > > the backing for the mmap prior to the fault, but either way could work.
> > > I think the key though is that you want to remap it onto the vma
> > > accessing the vfio device file, not scanning it out of an IOVA mapping
> > > that might be dynamic and doing a vma lookup based on the point in time
> > > mapping of the BAR.  The latter doesn't give me much confidence that
> > > mappings couldn't change while the former should be a one time fault.
> >
> > Hi Alex,
> >
> > The fact is that the vendor driver can only prevent such an mmap fault by looking
> > up the mapping table that we have saved from the IOMMU memory listener
>
> Why do we need to prevent the fault?  We need to handle the fault when
> it occurs.
>
> > when the guest region gets programmed. Also, as you have mentioned below, such a
> > mapping between iova and hva shouldn't change as long as the SBIOS and
> > guest OS are done with their job.
>
> But you don't know they're done with their job.
>
> > Yes, you are right that it is a one-time fault, but the gpu work is heavily pipelined.
>
> Why does that matter?  We're talking about the first time the VM
> accesses the range of the BAR that will be direct mapped to the physical
> GPU.  This isn't going to happen in the middle of a benchmark, it's
> going to happen during driver initialization in the guest.
>
> > Probably we should just limit this interface to the guest MMIO region, and we can
> > crosscheck with the VFIO driver, which has monitored the config space
> > accesses, to make sure nothing gets moved around?
>
> No, the solution for the bar is very clear, map on fault to the vma
> accessing the mmap and be done with it for the remainder of this
> instance of the VM.
>

Hi Alex,

I totally get your points; my previous comments were just trying to explain the
reasoning behind our current implementation. I think I have found a way to hide
the latency of the mmap fault, which might happen in the middle of running a
benchmark.

I will add a new registration interface to allow the driver vendor to provide a
fault handler callback, and from there pages will be installed.
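
Very roughly, the idea is something like the sketch below, written against the
4.x-era vm_operations_struct fault signature. The vgpu_device and
vgpu_vendor_ops structures and the bar_fault() callback are placeholder names
meant to show the shape of the registration, not the final code:

#include <linux/mm.h>

struct vgpu_device;

/* placeholder vendor callback table registered with the vgpu core */
struct vgpu_vendor_ops {
	/* resolve a virtual BAR offset to a host pfn; returns 0 or -errno */
	int (*bar_fault)(struct vgpu_device *vgpu, u64 offset,
			 unsigned long *pfn);
};

struct vgpu_device {
	const struct vgpu_vendor_ops *ops;
};

static int vgpu_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct vgpu_device *vgpu = vma->vm_private_data;
	unsigned long addr = (unsigned long)vmf->virtual_address;
	u64 offset = (addr - vma->vm_start) + (vma->vm_pgoff << PAGE_SHIFT);
	unsigned long pfn;
	int ret;

	/* ask the vendor driver which host page backs this BAR offset */
	if (vgpu->ops->bar_fault(vgpu, offset, &pfn))
		return VM_FAULT_SIGBUS;

	/* install the real MMIO page into the vma of the vfio device file */
	ret = vm_insert_pfn(vma, addr, pfn);
	if (ret && ret != -EBUSY)
		return VM_FAULT_SIGBUS;

	return VM_FAULT_NOPAGE;
}

static const struct vm_operations_struct vgpu_mmio_vm_ops = {
	.fault = vgpu_mmio_fault,
};

/* called from the vfio device file mmap path: map nothing up front and
 * let the first guest access fault the real BAR pages in on demand */
static int vgpu_mmio_mmap(struct vgpu_device *vgpu, struct vm_area_struct *vma)
{
	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
	vma->vm_private_data = vgpu;
	vma->vm_ops = &vgpu_mmio_vm_ops;
	return 0;
}

The bar_fault() callback could also map the whole direct-mapped segment at
once instead of a page at a time; the sketch is only meant to show where the
registration and the fault-time insertion would sit.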
> > > In case it's not clear to folks at Intel, the purpose of this is that a
> > > vGPU may directly map a segment of the physical GPU MMIO space, but we
> > > may not know what segment that is at setup time, when QEMU does an mmap
> > > of the vfio device file descriptor.  The thought is that we can create
> > > an invalid mapping when QEMU calls mmap(), knowing that it won't be
> > > accessed until later, then we can fault in the real mmap on demand.  Do
> > > you need anything similar?
> > >
> > > > >
> > > > > > int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
> > > > > >
> > > > > > EXPORT_SYMBOL(vgpu_dma_do_translate);
> > > > > >
> > > > > > Still a lot to be added and modified, such as supporting multiple VMs and
> > > > > > multiple virtual devices, tracking the mapped / pinned regions within the VGPU IOMMU
> > > > > > kernel driver, error handling, roll-back, locked memory size per user, etc.
> > > > >
> > > > > Particularly, handling of mapping changes is completely missing.  This
> > > > > cannot be a point in time translation, the user is free to remap
> > > > > addresses whenever they wish and device translations need to be updated
> > > > > accordingly.
> > > > >
> > > >
> > > > When you say "user", do you mean QEMU?
> > >
> > > vfio is a generic userspace driver interface, QEMU is a very, very
> > > important user of the interface, but not the only user.  So for this
> > > conversation, we're mostly talking about QEMU as the user, but we should
> > > be careful about assuming QEMU is the only user.
> > >
> >
> > Understood. I have to say that our focus at this moment is to support QEMU and
> > KVM, but I know the VFIO interface is much more than that, and that is why I think
> > it is right to leverage this framework so we can explore future use cases in
> > userland together.
> >
> >
> > > > Here, whatever DMA the guest driver is going to launch will first be pinned
> > > > within the VM and then registered to QEMU, and therefore to the IOMMU memory
> > > > listener; eventually the pages will be pinned by the GPU or DMA engine.
> > > >
> > > > Since we are keeping the upper level code the same, thinking about the passthrough
> > > > case, where the GPU has already put the real IOVA into its PTEs, I don't know how
> > > > QEMU can change that mapping without causing an IOMMU fault on an active DMA device.
> > >
> > > For the virtual BAR mapping above, it's easy to imagine that mapping a
> > > BAR to a given address is at the guest's discretion, it may be mapped and
> > > unmapped, it may be mapped to different addresses at different points in
> > > time, the guest BIOS may choose to map it at yet another address, etc.
> > > So if somehow we were trying to set up a mapping for peer-to-peer, there
> > > are lots of ways that IOVA could change.  But even with RAM, we can
> > > support memory hotplug in a VM.  What was once a DMA target may be
> > > removed or may now be backed by something else.  Chipset configuration
> > > on the emulated platform may change how guest physical memory appears
> > > and that might change between VM boots.
> > >
> > > Currently with physical device assignment the memory listener watches
> > > for both maps and unmaps and updates the iotlb to match.  Just like real
> > > hardware doing these same sorts of things, we rely on the guest to stop
> > > using memory that's going to be moved as a DMA target prior to moving
> > > it.
> >
> > Right, you can only do that when the device is quiescent.
> >
> > As long as this is notified to the guest, I think we should be able to
> > support it, although the real implementation will depend on how the device gets
> > into a quiescent state.
> >
> > This is definitely a very interesting feature we should explore, but I hope we
> > can first focus on the most basic functionality.
>
> If we only do a point-in-time translation and assume it never changes,
> that's good enough for a proof of concept, but it's not a complete
> solution.  I think this is a practical problem, not just an academic
> problem.  There needs to be a mechanism for mappings to be invalidated based
> on VM memory changes.  Thanks,
>

Sorry, probably my previous comment was not very clear. I highly value your input
and the information related to the memory hotplug scenarios, and I have never
excluded support for such a feature. The only question is when, and that is why I
would like to defer the VM memory hotplug feature to phase 2, after the initial
official launch.
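
To make sure phase 2 can slot in cleanly, I am thinking of pairing the translate
call with an invalidation callback from the very beginning, so the same memory
listener unmap events that physical assignment already handles can be forwarded
to the vendor driver. A very rough sketch, where all the names are placeholders
rather than a proposed API:

#include <linux/types.h>

struct vgpu_device;

/* placeholder callback table between the vGPU IOMMU module and the
 * vendor driver */
struct vgpu_iommu_ops {
	/* pin the guest pfns and return the host dma addresses in place,
	 * mirroring vgpu_dma_do_translate() above */
	int  (*translate)(struct vgpu_device *vgpu, dma_addr_t *gfn_buffer,
			  uint32_t count);
	/* driven by memory listener unmaps / layout changes: unpin and
	 * flush device translations for [gfn, gfn + count) before the
	 * backing goes away */
	void (*invalidate)(struct vgpu_device *vgpu, uint64_t gfn,
			   uint64_t count);
};

The initial release would only wire invalidate() into teardown, but having the
hook in place should let the memory hotplug case be added later without an
interface change.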
Thanks,
Neo

> Alex
>