Re: [Qemu-devel] [PATCH v8 01/13] vfio: KABI for migration interface

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Alex Williamson <alex.williamson@redhat.com>
To: "Tian, Kevin" <kevin.tian@intel.com>
Cc: "Zhengxiao.zx@Alibaba-inc.com" <Zhengxiao.zx@Alibaba-inc.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"Liu, Yi L" <yi.l.liu@intel.com>,
	"cjia@nvidia.com" <cjia@nvidia.com>,
	"eskultet@redhat.com" <eskultet@redhat.com>,
	"Yang, Ziye" <ziye.yang@intel.com>,
	"cohuck@redhat.com" <cohuck@redhat.com>,
	"shuangtai.tst@alibaba-inc.com" <shuangtai.tst@alibaba-inc.com>,
	"dgilbert@redhat.com" <dgilbert@redhat.com>,
	"Wang, Zhi A" <zhi.a.wang@intel.com>,
	"mlevitsk@redhat.com" <mlevitsk@redhat.com>,
	"pasic@linux.ibm.com" <pasic@linux.ibm.com>,
	"aik@ozlabs.ru" <aik@ozlabs.ru>,
	Kirti Wankhede <kwankhede@nvidia.com>,
	"eauger@redhat.com" <eauger@redhat.com>,
	"felipe@nutanix.com" <felipe@nutanix.com>,
	"jonathan.davies@nutanix.com" <jonathan.davies@nutanix.com>,
	"Zhao, Yan Y" <yan.y.zhao@intel.com>,
	"Liu, Changpeng" <changpeng.liu@intel.com>,
	"Ken.Xue@amd.com" <Ken.Xue@amd.com>
Subject: Re: [Qemu-devel] [PATCH v8 01/13] vfio: KABI for migration interface
Date: Fri, 13 Sep 2019 09:47:50 -0600	[thread overview]
Message-ID: <20190913094750.03759a4d@x1.home> (raw)
In-Reply-To: <AADFC41AFE54684AB9EE6CBC0274A5D19D572135@SHSMSX104.ccr.corp.intel.com>

On Thu, 12 Sep 2019 23:00:03 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Thursday, September 12, 2019 10:41 PM
> > 
> > On Tue, 3 Sep 2019 06:57:27 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Saturday, August 31, 2019 12:33 AM
> > > >
> > > > On Fri, 30 Aug 2019 08:06:32 +0000
> > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > >  
> > > > > > From: Tian, Kevin
> > > > > > Sent: Friday, August 30, 2019 3:26 PM
> > > > > >  
> > > > > [...]  
> > > > > > > How does QEMU handle the fact that IOVAs are potentially  
> > dynamic  
> > > > while  
> > > > > > > performing the live portion of a migration?  For example, each  
> > time a  
> > > > > > > guest driver calls dma_map_page() or dma_unmap_page(), a
> > > > > > > MemoryRegionSection pops in or out of the AddressSpace for the  
> > device  
> > > > > > > (I'm assuming a vIOMMU where the device AddressSpace is not
> > > > > > > system_memory).  I don't see any QEMU code that intercepts that  
> > > > change  
> > > > > > > in the AddressSpace such that the IOVA dirty pfns could be  
> > recorded and  
> > > > > > > translated to GFNs.  The vendor driver can't track these beyond  
> > getting  
> > > > > > > an unmap notification since it only knows the IOVA pfns, which  
> > can be  
> > > > > > > re-used with different GFN backing.  Once the DMA mapping is  
> > torn  
> > > > down,  
> > > > > > > it seems those dirty pfns are lost in the ether.  If this works in  
> > QEMU,  
> > > > > > > please help me find the code that handles it.  
> > > > > >
> > > > > > I'm curious about this part too. Interestingly, I didn't find any  
> > log_sync  
> > > > > > callback registered by emulated devices in Qemu. Looks dirty pages
> > > > > > by emulated DMAs are recorded in some implicit way. But KVM  
> > always  
> > > > > > reports dirty page in GFN instead of IOVA, regardless of the  
> > presence of  
> > > > > > vIOMMU. If Qemu also tracks dirty pages in GFN for emulated DMAs
> > > > > >  (translation can be done when DMA happens), then we don't need
> > > > > > worry about transient mapping from IOVA to GFN. Along this way  
> > we  
> > > > > > also want GFN-based dirty bitmap being reported through VFIO,
> > > > > > similar to what KVM does. For vendor drivers, it needs to translate
> > > > > > from IOVA to HVA to GFN when tracking DMA activities on VFIO
> > > > > > devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can be
> > > > > > provided by KVM but I'm not sure whether it's exposed now.
> > > > > >  
> > > > >
> > > > > HVA->GFN can be done through hva_to_gfn_memslot in kvm_host.h.  
> > > >
> > > > I thought it was bad enough that we have vendor drivers that depend  
> > on  
> > > > KVM, but designing a vfio interface that only supports a KVM interface
> > > > is more undesirable.  I also note without comment that  
> > gfn_to_memslot()  
> > > > is a GPL symbol.  Thanks,  
> > >
> > > yes it is bad, but sometimes inevitable. If you recall our discussions
> > > back to 3yrs (when discussing the 1st mdev framework), there were  
> > similar  
> > > hypervisor dependencies in GVT-g, e.g. querying gpa->hpa when
> > > creating some shadow structures. gpa->hpa is definitely hypervisor
> > > specific knowledge, which is easy in KVM (gpa->hva->hpa), but needs
> > > hypercall in Xen. but VFIO already makes assumption based on KVM-
> > > only flavor when implementing vfio_{un}pin_page_external.  
> > 
> > Where's the KVM assumption there?  The MAP_DMA ioctl takes an IOVA
> > and
> > HVA.  When an mdev vendor driver calls vfio_pin_pages(), we GUP the HVA
> > to get an HPA and provide an array of HPA pfns back to the caller.  The
> > other vGPU mdev vendor manages to make use of this without KVM... the
> > KVM interface used by GVT-g is GPL-only.  
> 
> To be clear it's the assumption on the host-based hypervisors e.g. KVM.
> GUP is a perfect example, which doesn't work for Xen since DomU's
> memory doesn't belong to Dom0. VFIO in Dom0 has to find the HPA
> through Xen specific hypercalls.

VFIO does not assume a hypervisor at all.  Yes, it happens to work well
with a host-based hypervisor like KVM were we can simply use GUP, but
I'd hardly call using the standard mechanism to pin a user page and get
the pfn within the Linux kernel a KVM assumption.  The fact that Dom0
Xen requires work here while KVM does not does is not an equivalency to
VFIO assuming KVM.  Thanks,

Alex
 
> > > So GVT-g
> > > has to maintain an internal abstraction layer to support both Xen and
> > > KVM. Maybe someday we will re-consider introducing some hypervisor
> > > abstraction layer in VFIO, if this issue starts to hurt other devices and
> > > Xen guys are willing to support VFIO.  
> > 
> > Once upon a time, we had a KVM specific device assignment interface,
> > ie. legacy KVM devie assignment.  We developed VFIO specifically to get
> > KVM out of the business of being a (bad) device driver.  We do have
> > some awareness and interaction between VFIO and KVM in the vfio-kvm
> > pseudo device, but we still try to keep those interfaces generic.  In
> > some cases we're not very successful at that, see vfio_group_set_kvm(),
> > but that's largely just a mechanism to associate a cookie with a group
> > to be consumed by the mdev vendor driver such that it can work with kvm
> > external to vfio.  I don't intend to add further hypervisor awareness
> > to vfio.
> >   
> > > Back to this IOVA issue, I discussed with Yan and we found another
> > > hypervisor-agnostic alternative, by learning from vhost. vhost is very
> > > similar to VFIO - DMA also happens in the kernel, while it already
> > > supports vIOMMU.
> > >
> > > Generally speaking, there are three paths of dirty page collection
> > > in Qemu so far (as previously noted, Qemu always tracks the dirty
> > > bitmap in GFN):  
> > 
> > GFNs or simply PFNs within an AddressSpace?
> >   
> > > 1) Qemu-tracked memory writes (e.g. emulated DMAs). Dirty bitmaps
> > > are updated directly when the guest memory is being updated. For
> > > example, PCI writes are completed through pci_dma_write, which
> > > goes through vIOMMU to translate IOVA into GPA and then update
> > > the bitmap through cpu_physical_memory_set_dirty_range.  
> > 
> > Right, so the IOVA to GPA (GFN) occurs through an explicit translation
> > on the IOMMU AddressSpace.
> >   
> > > 2) Memory writes that are not tracked by Qemu are collected by
> > > registering .log_sync() callback, which is invoked in the dirty logging
> > > process. Now there are two users: kvm and vhost.
> > >
> > >   2.1) KVM tracks CPU-side memory writes, through write-protection
> > > or EPT A/D bits (+PML). This part is always based on GFN and returned
> > > to Qemu when kvm_log_sync is invoked;
> > >
> > >   2.2) vhost tracks kernel-side DMA writes, by interpreting vring
> > > data structure. It maintains an internal iotlb which is synced with
> > > Qemu vIOMMU through a specific interface:
> > > 	- new vhost message type (VHOST_IOTLB_UPDATE/INVALIDATE)
> > > for Qemu to keep vhost iotlb in sync
> > > 	- new VHOST_IOTLB_MISS message to notify Qemu in case of
> > > a miss in vhost iotlb.
> > > 	- Qemu registers a log buffer to kernel vhost driver. The latter
> > > update the buffer (using internal iotlb to get GFN) when serving vring
> > > descriptor.
> > >
> > > VFIO could also implement an internal iotlb, so vendor drivers can
> > > utilize the iotlb to update the GFN-based dirty bitmap. Ideally we
> > > don't need re-invent another iotlb protocol as vhost does. vIOMMU
> > > already sends map/unmap ioctl cmds upon any change of IOVA
> > > mapping. We may introduce a v2 map/unmap interface, allowing
> > > Qemu to pass both {iova, gpa, hva} together to keep internal iotlb
> > > in-sync. But we may also need a iotlb_miss_upcall interface, if VFIO
> > > doesn't want to cache full-size vIOMMU mappings.
> > >
> > > Definitely this alternative needs more work and possibly less
> > > performant (if maintaining a small size iotlb) than straightforward
> > > calling into KVM interface. But the gain is also obvious, since it
> > > is fully constrained with VFIO.
> > >
> > > Thoughts? :-)  
> > 
> > So vhost must then be configuring a listener across system memory
> > rather than only against the device AddressSpace like we do in vfio,
> > such that it get's log_sync() callbacks for the actual GPA space rather
> > than only the IOVA space.  OTOH, QEMU could understand that the device
> > AddressSpace has a translate function and apply the IOVA dirty bits to
> > the system memory AddressSpace.  Wouldn't it make more sense for
> > QEMU
> > to perform a log_sync() prior to removing a MemoryRegionSection within
> > an AddressSpace and update the GPA rather than pushing GPA awareness
> > and potentially large tracking structures into the host kernel?  Thanks,
> >   
> 
> It is an interesting idea.  One drawback is that log_sync might be
> frequently invoked in IOVA case, but I guess the overhead is not much 
> compared to the total overhead of emulating the IOTLB invalidation. 
> Maybe other folks can better comment why this model was not 
> considered before, e.g. when vhost iotlb was introduced.
> 
> Thanks
> Kevin

next prev parent reply	other threads:[~2019-09-13 15:49 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-26 18:55 [Qemu-devel] [PATCH v8 00/13] Add migration support for VFIO device Kirti Wankhede
2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 01/13] vfio: KABI for migration interface Kirti Wankhede
2019-08-28 20:50   ` Alex Williamson
2019-08-30  7:25     ` Tian, Kevin
2019-08-30 16:15       ` Alex Williamson
2019-09-03  6:05         ` Tian, Kevin
2019-09-04  8:28           ` Yan Zhao
     [not found]     ` <AADFC41AFE54684AB9EE6CBC0274A5D19D553133@SHSMSX104.ccr.corp.intel.com>
2019-08-30  8:06       ` Tian, Kevin
2019-08-30 16:32         ` Alex Williamson
2019-09-03  6:57           ` Tian, Kevin
2019-09-12 14:41             ` Alex Williamson
2019-09-12 23:00               ` Tian, Kevin
2019-09-13 15:47                 ` Alex Williamson [this message]
2019-09-16  1:53                   ` Tian, Kevin
     [not found]               ` <AADFC41AFE54684AB9EE6CBC0274A5D19D572142@SHSMSX104.ccr.corp.intel.com>
2019-09-24  2:19                 ` Tian, Kevin
2019-09-24 18:03                   ` Alex Williamson
2019-09-24 23:04                     ` Tian, Kevin
2019-09-25 19:06                       ` Alex Williamson
2019-09-26  3:07                         ` Tian, Kevin
2019-09-26 21:33                           ` Alex Williamson
2019-10-24 11:41                             ` Tian, Kevin
2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 02/13] vfio: Add function to unmap VFIO region Kirti Wankhede
2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 03/13] vfio: Add vfio_get_object callback to VFIODeviceOps Kirti Wankhede
2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 04/13] vfio: Add save and load functions for VFIO PCI devices Kirti Wankhede
2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 05/13] vfio: Add migration region initialization and finalize function Kirti Wankhede
2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 06/13] vfio: Add VM state change handler to know state of VM Kirti Wankhede
2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 07/13] vfio: Add migration state change notifier Kirti Wankhede
2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 08/13] vfio: Register SaveVMHandlers for VFIO device Kirti Wankhede
2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 09/13] vfio: Add save state functions to SaveVMHandlers Kirti Wankhede
2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 10/13] vfio: Add load " Kirti Wankhede
2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 11/13] vfio: Add function to get dirty page list Kirti Wankhede
2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 12/13] vfio: Add vfio_listener_log_sync to mark dirty pages Kirti Wankhede
2019-08-26 18:55 ` [Qemu-devel] [PATCH v8 13/13] vfio: Make vfio-pci device migration capable Kirti Wankhede
2019-08-26 19:43 ` [Qemu-devel] [PATCH v8 00/13] Add migration support for VFIO device no-reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190913094750.03759a4d@x1.home \
    --to=alex.williamson@redhat.com \
    --cc=Ken.Xue@amd.com \
    --cc=Zhengxiao.zx@Alibaba-inc.com \
    --cc=aik@ozlabs.ru \
    --cc=changpeng.liu@intel.com \
    --cc=cjia@nvidia.com \
    --cc=cohuck@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=eauger@redhat.com \
    --cc=eskultet@redhat.com \
    --cc=felipe@nutanix.com \
    --cc=jonathan.davies@nutanix.com \
    --cc=kevin.tian@intel.com \
    --cc=kwankhede@nvidia.com \
    --cc=mlevitsk@redhat.com \
    --cc=pasic@linux.ibm.com \
    --cc=qemu-devel@nongnu.org \
    --cc=shuangtai.tst@alibaba-inc.com \
    --cc=yan.y.zhao@intel.com \
    --cc=yi.l.liu@intel.com \
    --cc=zhi.a.wang@intel.com \
    --cc=ziye.yang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).