From: Alex Williamson <alex.williamson@redhat.com>
To: "Tian, Kevin" <kevin.tian@intel.com>
Cc: "Zhengxiao.zx@Alibaba-inc.com" <Zhengxiao.zx@Alibaba-inc.com>,
"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
"Liu, Yi L" <yi.l.liu@intel.com>,
"cjia@nvidia.com" <cjia@nvidia.com>,
"eskultet@redhat.com" <eskultet@redhat.com>,
"Yang, Ziye" <ziye.yang@intel.com>,
"cohuck@redhat.com" <cohuck@redhat.com>,
"shuangtai.tst@alibaba-inc.com" <shuangtai.tst@alibaba-inc.com>,
"dgilbert@redhat.com" <dgilbert@redhat.com>,
"Wang, Zhi A" <zhi.a.wang@intel.com>,
"mlevitsk@redhat.com" <mlevitsk@redhat.com>,
"pasic@linux.ibm.com" <pasic@linux.ibm.com>,
"aik@ozlabs.ru" <aik@ozlabs.ru>,
Kirti Wankhede <kwankhede@nvidia.com>,
"eauger@redhat.com" <eauger@redhat.com>,
"felipe@nutanix.com" <felipe@nutanix.com>,
"jonathan.davies@nutanix.com" <jonathan.davies@nutanix.com>,
"Zhao, Yan Y" <yan.y.zhao@intel.com>,
"Liu, Changpeng" <changpeng.liu@intel.com>,
"Ken.Xue@amd.com" <Ken.Xue@amd.com>
Subject: Re: [Qemu-devel] [PATCH v8 01/13] vfio: KABI for migration interface
Date: Thu, 12 Sep 2019 15:41:06 +0100 [thread overview]
Message-ID: <20190912154106.4e784906@x1.home> (raw)
In-Reply-To: <AADFC41AFE54684AB9EE6CBC0274A5D19D560D74@SHSMSX104.ccr.corp.intel.com>

On Tue, 3 Sep 2019 06:57:27 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Saturday, August 31, 2019 12:33 AM
> >
> > On Fri, 30 Aug 2019 08:06:32 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >
> > > > From: Tian, Kevin
> > > > Sent: Friday, August 30, 2019 3:26 PM
> > > >
> > > [...]
> > > > > How does QEMU handle the fact that IOVAs are potentially dynamic
> > > > > while performing the live portion of a migration? For example,
> > > > > each time a guest driver calls dma_map_page() or dma_unmap_page(),
> > > > > a MemoryRegionSection pops in or out of the AddressSpace for the
> > > > > device (I'm assuming a vIOMMU where the device AddressSpace is not
> > > > > system_memory). I don't see any QEMU code that intercepts that
> > > > > change in the AddressSpace such that the IOVA dirty pfns could be
> > > > > recorded and translated to GFNs. The vendor driver can't track
> > > > > these beyond getting an unmap notification since it only knows the
> > > > > IOVA pfns, which can be re-used with different GFN backing. Once
> > > > > the DMA mapping is torn down, it seems those dirty pfns are lost
> > > > > in the ether. If this works in QEMU, please help me find the code
> > > > > that handles it.
> > > >
> > > > I'm curious about this part too. Interestingly, I didn't find any log_sync
> > > > callback registered by emulated devices in Qemu. It looks like dirty
> > > > pages from emulated DMAs are recorded in some implicit way. But KVM
> > > > always reports dirty pages in GFN instead of IOVA, regardless of the
> > > > presence of a vIOMMU. If Qemu also tracks dirty pages in GFN for
> > > > emulated DMAs (translation can be done when the DMA happens), then we
> > > > don't need to worry about the transient IOVA-to-GFN mapping. Along
> > > > this line we also want a GFN-based dirty bitmap to be reported through
> > > > VFIO, similar to what KVM does. For vendor drivers, that means
> > > > translating from IOVA to HVA to GFN when tracking DMA activities on
> > > > VFIO devices. IOVA->HVA is provided by VFIO. For HVA->GFN, it can be
> > > > provided by KVM, but I'm not sure whether it's exposed now.
> > > >
> > >
> > > HVA->GFN can be done through hva_to_gfn_memslot in kvm_host.h.
> >
> > I thought it was bad enough that we have vendor drivers that depend on
> > KVM, but designing a vfio interface that only supports a KVM interface
> > is more undesirable. I also note without comment that gfn_to_memslot()
> > is a GPL symbol. Thanks,
>
> Yes, it is bad, but sometimes inevitable. If you recall our discussions
> from 3 years back (when discussing the 1st mdev framework), there were
> similar hypervisor dependencies in GVT-g, e.g. querying gpa->hpa when
> creating some shadow structures. gpa->hpa is definitely hypervisor-
> specific knowledge, which is easy in KVM (gpa->hva->hpa) but needs a
> hypercall in Xen. But VFIO already makes a KVM-only assumption when
> implementing vfio_{un}pin_page_external.

Where's the KVM assumption there? The MAP_DMA ioctl takes an IOVA and
HVA. When an mdev vendor driver calls vfio_pin_pages(), we GUP the HVA
to get an HPA and provide an array of HPA pfns back to the caller. The
other vGPU mdev vendor manages to make use of this without KVM... the
KVM interface used by GVT-g is GPL-only.
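
Just to illustrate (a sketch from memory of the current vfio.h, so take
the exact signatures as approximate; the helper itself is mine):

#include <linux/iommu.h>
#include <linux/vfio.h>

/*
 * Hypothetical mdev vendor driver helper: pin one page of a guest DMA
 * buffer.  The caller passes an IOVA-based pfn; vfio looks up the HVA
 * that userspace registered for that IOVA via MAP_DMA, GUPs it, and
 * returns the host physical pfn.  No KVM involvement anywhere.
 */
static int example_pin_one(struct device *mdev_dev, unsigned long iova_pfn,
                           unsigned long *hpa_pfn)
{
        int ret;

        ret = vfio_pin_pages(mdev_dev, &iova_pfn, 1,
                             IOMMU_READ | IOMMU_WRITE, hpa_pfn);

        return ret == 1 ? 0 : (ret < 0 ? ret : -EFAULT);
}

vfio_unpin_pages() takes the same IOVA-based pfn array to drop the pin.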

> So GVT-g
> has to maintain an internal abstraction layer to support both Xen and
> KVM. Maybe someday we will reconsider introducing some hypervisor
> abstraction layer in VFIO, if this issue starts to hurt other devices and
> the Xen guys are willing to support VFIO.

Once upon a time, we had a KVM specific device assignment interface,
i.e. legacy KVM device assignment. We developed VFIO specifically to get
KVM out of the business of being a (bad) device driver. We do have
some awareness and interaction between VFIO and KVM in the vfio-kvm
pseudo device, but we still try to keep those interfaces generic. In
some cases we're not very successful at that, see vfio_group_set_kvm(),
but that's largely just a mechanism to associate a cookie with a group
to be consumed by the mdev vendor driver such that it can work with kvm
external to vfio. I don't intend to add further hypervisor awareness
to vfio.
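
For reference, the consumption side in an mdev vendor driver is roughly
the following (signatures from memory, so possibly slightly off); note
that vfio never interprets the pointer, it's just handed through:

#include <linux/notifier.h>
#include <linux/vfio.h>

static void *example_kvm_cookie;        /* whatever QEMU's vfio-kvm device set */

static int example_group_notifier(struct notifier_block *nb,
                                  unsigned long action, void *data)
{
        if (action == VFIO_GROUP_NOTIFY_SET_KVM)
                example_kvm_cookie = data;      /* NULL when cleared */

        return NOTIFY_OK;
}

static int example_register(struct device *mdev_dev, struct notifier_block *nb)
{
        unsigned long events = VFIO_GROUP_NOTIFY_SET_KVM;

        nb->notifier_call = example_group_notifier;

        return vfio_register_notifier(mdev_dev, VFIO_GROUP_NOTIFY,
                                      &events, nb);
}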

> Back to this IOVA issue, I discussed with Yan and we found another
> hypervisor-agnostic alternative, learning from vhost. vhost is very
> similar to VFIO - DMA also happens in the kernel - and it already
> supports vIOMMU.
>
> Generally speaking, there are three paths of dirty page collection
> in Qemu so far (as previously noted, Qemu always tracks the dirty
> bitmap in GFN):

GFNs or simply PFNs within an AddressSpace?

> 1) Qemu-tracked memory writes (e.g. emulated DMAs). Dirty bitmaps
> are updated directly when the guest memory is being updated. For
> example, PCI writes are completed through pci_dma_write, which
> goes through the vIOMMU to translate IOVA into GPA and then updates
> the bitmap through cpu_physical_memory_set_dirty_range.

Right, so the IOVA to GPA (GFN) occurs through an explicit translation
on the IOMMU AddressSpace.
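
Illustrative only (the exact internal call chain here is from memory):

/* An emulated PCI device model doing a DMA write in QEMU */
static void demo_dma_write(PCIDevice *pdev, dma_addr_t iova,
                           const void *buf, dma_addr_t len)
{
    /*
     * The write is issued through the device's AddressSpace.  With a
     * vIOMMU present, the IOMMUMemoryRegion translate() hook converts
     * the IOVA to a GPA before the RAM is touched, and the memory core
     * marks the backing pages dirty (ultimately via
     * cpu_physical_memory_set_dirty_range()), so the bitmap QEMU keeps
     * is GPA/ram_addr based, not IOVA based.
     */
    pci_dma_write(pdev, iova, buf, len);
}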

> 2) Memory writes that are not tracked by Qemu are collected by
> registering a .log_sync() callback, which is invoked in the dirty logging
> process. Now there are two users: kvm and vhost.
>
> 2.1) KVM tracks CPU-side memory writes, through write-protection
> or EPT A/D bits (+PML). This part is always based on GFN and returned
> to Qemu when kvm_log_sync is invoked;
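
Under the hood that is the KVM_GET_DIRTY_LOG ioctl, used per memslot
roughly as below (vm_fd, slot_id and bitmap are assumed to exist):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Pull the dirty bitmap for one memslot; the bitmap is indexed by GFN
 * offset within the slot, i.e. it is GPA based, never IOVA based. */
static int get_slot_dirty_log(int vm_fd, unsigned int slot_id, void *bitmap)
{
        struct kvm_dirty_log log = {
                .slot = slot_id,
                .dirty_bitmap = bitmap,
        };

        return ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);
}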
>
> 2.2) vhost tracks kernel-side DMA writes by interpreting the vring
> data structures. It maintains an internal iotlb which is synced with the
> Qemu vIOMMU through a specific interface:
> - new vhost message type (VHOST_IOTLB_UPDATE/INVALIDATE)
> for Qemu to keep vhost iotlb in sync
> - new VHOST_IOTLB_MISS message to notify Qemu in case of
> a miss in vhost iotlb.
> - Qemu registers a log buffer with the kernel vhost driver. The latter
> updates the buffer (using the internal iotlb to get the GFN) when
> serving a vring descriptor.
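
For reference, the message described above is the vhost_iotlb_msg uapi,
roughly the following (quoting from memory of
include/uapi/linux/vhost_types.h, so double-check the exact layout):

struct vhost_iotlb_msg {
        __u64 iova;
        __u64 size;
        __u64 uaddr;                    /* HVA backing the IOVA range */
#define VHOST_ACCESS_RO      0x1
#define VHOST_ACCESS_WO      0x2
#define VHOST_ACCESS_RW      0x3
        __u8 perm;
#define VHOST_IOTLB_MISS           1
#define VHOST_IOTLB_UPDATE         2
#define VHOST_IOTLB_INVALIDATE     3
#define VHOST_IOTLB_ACCESS_FAIL    4
        __u8 type;
};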
>
> VFIO could also implement an internal iotlb, so vendor drivers can
> utilize the iotlb to update the GFN-based dirty bitmap. Ideally we
> don't need to re-invent another iotlb protocol as vhost does. The vIOMMU
> already sends map/unmap ioctl cmds upon any change of IOVA
> mapping. We may introduce a v2 map/unmap interface, allowing
> Qemu to pass both {iova, gpa, hva} together to keep internal iotlb
> in sync. But we may also need an iotlb_miss_upcall interface, if VFIO
> doesn't want to cache full-size vIOMMU mappings.
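
To make that concrete, I assume you're imagining something like the
below (purely hypothetical, nothing like it exists in the uapi today):

/*
 * Hypothetical "v2" of struct vfio_iommu_type1_dma_map, carrying the
 * GPA alongside the existing HVA and IOVA so the kernel could keep an
 * IOVA->GPA iotlb for GFN-based dirty reporting.
 */
struct vfio_iommu_type1_dma_map_v2 {
        __u32   argsz;
        __u32   flags;
        __u64   vaddr;          /* HVA, as today */
        __u64   iova;           /* IOVA, as today */
        __u64   size;
        __u64   gpa;            /* new: guest physical address of the range */
};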
>
> Definitely this alternative needs more work, and is possibly less
> performant (if maintaining a small iotlb) than calling straight into the
> KVM interface. But the gain is also obvious, since it is fully contained
> within VFIO.
>
> Thoughts? :-)

So vhost must then be configuring a listener across system memory
rather than only against the device AddressSpace like we do in vfio,
such that it gets log_sync() callbacks for the actual GPA space rather
than only the IOVA space. OTOH, QEMU could understand that the device
AddressSpace has a translate function and apply the IOVA dirty bits to
the system memory AddressSpace. Wouldn't it make more sense for QEMU
to perform a log_sync() prior to removing a MemoryRegionSection within
an AddressSpace and update the GPA rather than pushing GPA awareness
and potentially large tracking structures into the host kernel? Thanks,
Alex
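
P.S. For concreteness, what I have in mind on the QEMU side is roughly
the following; an untested sketch, and vfio_sync_iova_dirty_bitmap() is
a placeholder for whatever helper the migration series ends up
providing:

static void vfio_listener_region_del(MemoryListener *listener,
                                     MemoryRegionSection *section)
{
    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
    hwaddr iova = section->offset_within_address_space;
    hwaddr end = iova + int128_get64(section->size);

    if (memory_region_is_iommu(section->mr)) {
        /*
         * Sync the device's dirty bits for [iova, end) while the
         * IOVA->GPA translation for this section is still known, and
         * set the corresponding bits in the GPA/ram_addr based dirty
         * bitmap of the backing RAM.
         */
        vfio_sync_iova_dirty_bitmap(container, section, iova, end);
    }

    /* ... existing unmap path continues as today ... */
}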