From: Neo Jia <cjia@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: "Ruan, Shuai" <shuai.ruan@intel.com>,
	"Tian, Kevin" <kevin.tian@intel.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"igvt-g@lists.01.org" <igvt-g@ml01.01.org>,
	"Song, Jike" <jike.song@intel.com>,
	qemu-devel <qemu-devel@nongnu.org>,
	Kirti Wankhede <kwankhede@nvidia.com>,
	"Lv, Zhiyuan" <zhiyuan.lv@intel.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Gerd Hoffmann <kraxel@redhat.com>
Subject: Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
Date: Wed, 27 Jan 2016 13:48:37 -0800
Message-ID: <20160127214837.GA6182@nvidia.com>
In-Reply-To: <1453911016.6261.3.camel@redhat.com>

On Wed, Jan 27, 2016 at 09:10:16AM -0700, Alex Williamson wrote:
> On Wed, 2016-01-27 at 01:14 -0800, Neo Jia wrote:
> > On Tue, Jan 26, 2016 at 04:30:38PM -0700, Alex Williamson wrote:
> > > On Tue, 2016-01-26 at 14:28 -0800, Neo Jia wrote:
> > > > On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote:
> > > > > > 1.1 Under per-physical device sysfs:
> > > > > > ----------------------------------------------------------------------------------
> > > > > >  
> > > > > > vgpu_supported_types - RO, lists the currently supported virtual GPU types
> > > > > > and their VGPU_IDs. VGPU_ID - a vGPU type identifier returned from reads of
> > > > > > "vgpu_supported_types".
> > > > > >                             
> > > > > > vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, creates a virtual
> > > > > > GPU device on a target physical GPU. idx: virtual device index inside a VM.
> > > > > >  
> > > > > > vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroys a virtual GPU
> > > > > > device on a target physical GPU.
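
For concreteness, here is a minimal user-space sketch of driving the proposed
create node. The sysfs path under the physical GPU and the UUID / type values
below are illustrative assumptions, not part of the proposal itself:

#include <stdio.h>

int main(void)
{
    /* Assumed location of the per-physical-device node described above. */
    FILE *f = fopen("/sys/bus/pci/devices/0000:01:00.0/vgpu_create", "w");

    if (!f) {
        perror("vgpu_create");
        return 1;
    }
    /* <VM_UUID:idx:VGPU_ID> - the UUID names a set of devices, idx is the
     * virtual device index inside that set, and VGPU_ID is a type taken
     * from vgpu_supported_types.  All three values here are made up. */
    fprintf(f, "%s:%d:%d\n", "c0ffee00-0000-4000-8000-000000000001", 0, 11);
    fclose(f);
    return 0;
}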
> > > > >  
> > > > >  
> > > > > I've noted in previous discussions that we need to separate user policy
> > > > > from kernel policy here; the kernel policy should not require a "VM
> > > > > UUID".  A UUID simply represents a set of one or more devices and an
> > > > > index picks the device within the set.  Whether that UUID matches a VM
> > > > > or is independently used is up to the user policy when creating the
> > > > > device.
> > > > >  
> > > > > Personally I'd also prefer to get rid of the concept of indexes within a
> > > > > UUID set of devices and instead have each device be independent.  This
> > > > > seems to be an imposition of the nvidia implementation onto the kernel
> > > > > interface design.
> > > > >  
> > > >  
> > > > Hi Alex,
> > > >  
> > > > I agree with you that we should not put the UUID concept into a kernel API.
> > > > At this point (without any prototyping), I am thinking of using a list of
> > > > virtual devices instead of a UUID.
> > > 
> > > Hi Neo,
> > > 
> > > A UUID is a perfectly fine name, so long as we let it be just a UUID and
> > > not the UUID matching some specific use case.
> > > 
> > > > > >  
> > > > > > int vgpu_map_virtual_bar
> > > > > > (
> > > > > >     uint64_t virt_bar_addr,
> > > > > >     uint64_t phys_bar_addr,
> > > > > >     uint32_t len,
> > > > > >     uint32_t flags
> > > > > > )
> > > > > >  
> > > > > > EXPORT_SYMBOL(vgpu_map_virtual_bar);
> > > > >  
> > > > >  
> > > > > Per the implementation provided, this needs to be implemented in the
> > > > > vfio device driver, not in the iommu interface.  Finding the DMA mapping
> > > > > of the device and replacing it is wrong.  It should be remapped at the
> > > > > vfio device file interface using vm_ops.
> > > > >  
> > > >  
> > > > So you are basically suggesting that we are going to take an mmap fault and,
> > > > within that fault handler, go into the vendor driver to look up the
> > > > "pre-registered" mapping and remap there.
> > > >  
> > > > Is my understanding correct?
> > > 
> > > Essentially, hopefully the vendor driver will have already registered
> > > the backing for the mmap prior to the fault, but either way could work.
> > > I think the key, though, is that you want to remap it onto the vma
> > > accessing the vfio device file, not scan it out of an IOVA mapping
> > > that might be dynamic and do a vma lookup based on the point-in-time
> > > mapping of the BAR.  The latter doesn't give me much confidence that
> > > mappings couldn't change, while the former should be a one-time fault.
> > 
> > Hi Alex,
> > 
> > The fact is that the vendor driver can only prevent such an mmap fault by
> > looking up the <iova, hva> mapping table that we have saved from the IOMMU
> > memory listener
> 
> Why do we need to prevent the fault?  We need to handle the fault when
> it occurs.
> 
> > when the guest region gets programmed. Also, as you have mentioned below, such
> > a mapping between iova and hva shouldn't change once the SBIOS and
> > guest OS are done with their job.
> 
> But you don't know they're done with their job.
> 
> > Yes, you are right that it is a one-time fault, but the GPU work is heavily pipelined.
> 
> Why does that matter?  We're talking about the first time the VM
> accesses the range of the BAR that will be direct-mapped to the physical
> GPU.  This isn't going to happen in the middle of a benchmark; it's
> going to happen during driver initialization in the guest.
> 
> > Probably we should just limit this interface to the guest MMIO region, and we
> > can have some crosscheck with the VFIO driver, which has monitored the config
> > space accesses, to make sure nothing is getting moved around?
> 
> No, the solution for the BAR is very clear: map on fault to the vma
> accessing the mmap and be done with it for the remainder of this
> instance of the VM.
> 

Hi Alex,

I totally get your points; my previous comments were just trying to explain the
reasoning behind our current implementation. I think I have found a way to hide
the latency of the mmap fault, which might happen in the middle of running a
benchmark.

I will add a new registration interface to allow the vendor driver to provide a
fault handler callback, and the pages will be installed from there.
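
Roughly along the following lines. This is only a sketch using current vm_ops
signatures (which have changed across kernel versions); struct vgpu_dev and
its fields are stand-ins for whatever state the vendor driver registers:

#include <linux/fs.h>
#include <linux/mm.h>

/* Stand-in for the vendor driver's per-device state. */
struct vgpu_dev {
    unsigned long bar_base_pfn;    /* host pfn of the physical BAR segment */
    unsigned long bar_start_pgoff; /* where that BAR sits in the mmap */
};

static vm_fault_t vgpu_mmap_fault(struct vm_fault *vmf)
{
    struct vgpu_dev *vdev = vmf->vma->vm_private_data;
    unsigned long pfn;

    /* Resolve the fault against the vma that mmap'ed the vfio device
     * file: one fault per page, no IOVA scanning involved. */
    pfn = vdev->bar_base_pfn + (vmf->pgoff - vdev->bar_start_pgoff);
    return vmf_insert_pfn(vmf->vma, vmf->address, pfn);
}

static const struct vm_operations_struct vgpu_mmap_ops = {
    .fault = vgpu_mmap_fault,
};

static int vgpu_mmap(struct file *filp, struct vm_area_struct *vma)
{
    vma->vm_private_data = filp->private_data;
    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
    vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
    vma->vm_ops = &vgpu_mmap_ops;
    return 0; /* nothing is mapped yet; pages fault in on first access */
}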

> > > In case it's not clear to folks at Intel, the purpose of this is that a
> > > vGPU may directly map a segment of the physical GPU MMIO space, but we
> > > may not know what segment that is at setup time, when QEMU does an mmap
> > > of the vfio device file descriptor.  The thought is that we can create
> > > an invalid mapping when QEMU calls mmap(), knowing that it won't be
> > > accessed until later, then we can fault in the real mmap on demand.  Do
> > > you need anything similar?
> > > 
> > > > >  
> > > > > > int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
> > > > > >  
> > > > > > EXPORT_SYMBOL(vgpu_dma_do_translate);
> > > > > >  
> > > > > > Still a lot to be added and modified, such as supporting multiple VMs and
> > > > > > multiple virtual devices, tracking the mapped / pinned regions within the
> > > > > > VGPU IOMMU kernel driver, error handling, rollback, the locked memory size
> > > > > > per user, etc.
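
For reference, a sketch of what the translate call might do internally: look
up the hva saved from the QEMU registration for each guest pfn, pin the
backing page, and return the host pfn in place. vgpu_gfn_to_hva() is an
assumed helper over that saved table, and the GUP helper's exact signature
varies across kernel versions:

#include <linux/mm.h>

/* Assumed helper: consults the <gfn, hva> table saved when QEMU
 * registered the guest memory regions. */
unsigned long vgpu_gfn_to_hva(dma_addr_t gfn);

int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
{
    uint32_t i;

    for (i = 0; i < count; i++) {
        unsigned long hva = vgpu_gfn_to_hva(gfn_buffer[i]);
        struct page *page;

        /* Pin the backing page so it stays put while the GPU or DMA
         * engine uses it; unpinning and rollback are omitted here. */
        if (pin_user_pages_fast(hva, 1, FOLL_WRITE, &page) != 1)
            return -EFAULT;

        gfn_buffer[i] = page_to_pfn(page); /* host pfn returned in place */
    }
    return 0;
}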
> > > > >  
> > > > > Particularly, handling of mapping changes is completely missing.  This
> > > > > cannot be a point-in-time translation; the user is free to remap
> > > > > addresses whenever they wish, and device translations need to be updated
> > > > > accordingly.
> > > > >  
> > > >  
> > > > When you say "user", do you mean QEMU?
> > > 
> > > vfio is a generic userspace driver interface; QEMU is a very, very
> > > important user of the interface, but not the only user.  So for this
> > > conversation, we're mostly talking about QEMU as the user, but we should
> > > be careful about assuming QEMU is the only user.
> > > 
> > 
> > Understood. I have to say that our focus at this moment is to support QEMU and
> > KVM, but I know the VFIO interface is much more than that, and that is why I
> > think it is right to leverage this framework so we can explore future use
> > cases in userland together.
> > 
> > 
> > > > Here, whatever DMA the guest driver is going to launch will first be pinned
> > > > within the VM and then registered to QEMU (and therefore to the IOMMU memory
> > > > listener); eventually the pages will be pinned by the GPU or DMA engine.
> > > >  
> > > > Since we are keeping the upper-level code the same, thinking about the
> > > > passthru case, where the GPU has already put the real IOVA into its PTEs,
> > > > I don't know how QEMU can change that mapping without causing an IOMMU
> > > > fault on an active DMA device.
> > > 
> > > For the virtual BAR mapping above, it's easy to imagine that mapping a
> > > BAR to a given address is at the guest discretion, it may be mapped and
> > > unmapped, it may be mapped to different addresses at different points in
> > > time, the guest BIOS may choose to map it at yet another address, etc.
> > > So if somehow we were trying to setup a mapping for peer-to-peer, there
> > > are lots of ways that IOVA could change.  But even with RAM, we can
> > > support memory hotplug in a VM.  What was once a DMA target may be
> > > removed or may now be backed by something else.  Chipset configuration
> > > on the emulated platform may change how guest physical memory appears
> > > and that might change between VM boots.
> > > 
> > > Currently with physical device assignment the memory listener watches
> > > for both maps and unmaps and updates the iotlb to match.  Just like real
> > > hardware doing these same sorts of things, we rely on the guest to stop
> > > using memory that's going to be moved as a DMA target prior to moving
> > > it.
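
For reference, that map/unmap tracking looks roughly like the sketch below on
the QEMU side, loosely modeled on the existing vfio memory listener; the
vgpu_iommu_map()/vgpu_iommu_unmap() calls are placeholders for whatever pushes
the update down to the device:

#include "qemu/osdep.h"
#include "exec/memory.h"

/* Placeholders for the backend that updates the device's translations. */
void vgpu_iommu_map(hwaddr iova, void *vaddr, uint64_t size);
void vgpu_iommu_unmap(hwaddr iova, uint64_t size);

static void vgpu_listener_region_add(MemoryListener *listener,
                                     MemoryRegionSection *section)
{
    hwaddr iova = section->offset_within_address_space;
    void *vaddr;

    if (!memory_region_is_ram(section->mr)) {
        return; /* only RAM regions are DMA targets here */
    }
    vaddr = memory_region_get_ram_ptr(section->mr) +
            section->offset_within_region;
    /* New guest RAM appeared: install iova -> vaddr translations. */
    vgpu_iommu_map(iova, vaddr, int128_get64(section->size));
}

static void vgpu_listener_region_del(MemoryListener *listener,
                                     MemoryRegionSection *section)
{
    hwaddr iova = section->offset_within_address_space;

    /* Guest RAM went away (memory unplug, remap, etc.): the stale
     * translations for this range must be invalidated. */
    vgpu_iommu_unmap(iova, int128_get64(section->size));
}

static MemoryListener vgpu_memory_listener = {
    .region_add = vgpu_listener_region_add,
    .region_del = vgpu_listener_region_del,
};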
> > 
> > Right, you can only do that when the device is quiescent.
> > 
> > As long as the guest is notified of this, I think we should be able to
> > support it, although the real implementation will depend on how the device
> > gets into a quiescent state.
> > 
> > This is definitely a very interesting feature we should explore, but I hope
> > we can first focus on the most basic functionality.
> 
> If we only do a point-in-time translation and assume it never changes,
> that's good enough for a proof of concept, but it's not a complete
> solution.  I think this is a practical problem, not just an academic
> problem.  There needs to be a mechanism for mappings to be invalidated
> based on VM memory changes.  Thanks,
> 

Sorry, my previous comment was probably not very clear. I highly value your input
and the information related to the memory hotplug scenarios, and I have never
excluded support for such a feature. The only question is when; that is why I
would like to defer the VM memory hotplug feature to phase 2, after the initial
official launch.

Thanks,
Neo

> Alex
> 
