Re: [PATCH 0/5] QEMU VFIO live migration

public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed

From: Zhao Yan <yan.y.zhao@intel.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: "cjia@nvidia.com" <cjia@nvidia.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"aik@ozlabs.ru" <aik@ozlabs.ru>,
	"Zhengxiao.zx@alibaba-inc.com" <Zhengxiao.zx@alibaba-inc.com>,
	"shuangtai.tst@alibaba-inc.com" <shuangtai.tst@alibaba-inc.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"kwankhede@nvidia.com" <kwankhede@nvidia.com>,
	"eauger@redhat.com" <eauger@redhat.com>,
	"Liu, Yi L" <yi.l.liu@intel.com>,
	Erik Skultety <eskultet@redhat.com>,
	"Yang, Ziye" <ziye.yang@intel.com>,
	"mlevitsk@redhat.com" <mlevitsk@redhat.com>,
	"pasic@linux.ibm.com" <pasic@linux.ibm.com>,
	"Gonglei \(Arei\)" <arei.gonglei@huawei.com>,
	"felipe@nutanix.com" <felipe@nutanix.com>,
	"Ken.Xue@amd.com" <Ken.Xue@amd.com>,
	"Tian, Kevin" <kevin.tian@intel.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	"intel-gvt-dev@lists.freedesktop.org"
	<intel-gvt-dev@lists.freedesktop.org>,
	"Liu, Changp
Subject: Re: [PATCH 0/5] QEMU VFIO live migration
Date: Thu, 28 Mar 2019 22:47:04 -0400	[thread overview]
Message-ID: <20190329024704.GF14681@joy-OptiPlex-7040> (raw)
In-Reply-To: <20190328100431.7fad466b@x1.home>

On Fri, Mar 29, 2019 at 12:04:31AM +0800, Alex Williamson wrote:
> On Thu, 28 Mar 2019 10:21:38 +0100
> Erik Skultety <eskultet@redhat.com> wrote:
> 
> > On Thu, Mar 28, 2019 at 04:36:03AM -0400, Zhao Yan wrote:
> > > hi Alex and Dave,
> > > Thanks for your replies.
> > > Please see my comments inline.
> > >
> > > On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:  
> > > > On Wed, 27 Mar 2019 20:18:54 +0000
> > > > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> > > >  
> > > > > * Zhao Yan (yan.y.zhao@intel.com) wrote:  
> > > > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:  
> > > > > > > > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > > > > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > > > > > > > a device that has less device memory ?  
> > > > > > > > > > Actually it's still an open for VFIO migration. Need to think about
> > > > > > > > > > whether it's better to check that in libvirt or qemu (like a device magic
> > > > > > > > > > along with verion ?).  
> > > > > > > >
> > > > > > > > We must keep the hardware generation is the same with one POD of public cloud
> > > > > > > > providers. But we still think about the live migration between from the the lower
> > > > > > > > generation of hardware migrated to the higher generation.  
> > > > > > >
> > > > > > > Agreed, lower->higher is the one direction that might make sense to
> > > > > > > support.
> > > > > > >
> > > > > > > But regardless of that, I think we need to make sure that incompatible
> > > > > > > devices/versions fail directly instead of failing in a subtle, hard to
> > > > > > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > > > > > as well.
> > > > > > >
> > > > > > > How easy is it to obtain that information in a form that can be
> > > > > > > consumed by higher layers? Can we find out the device type at least?
> > > > > > > What about some kind of revision?  
> > > > > > hi Alex and Cornelia
> > > > > > for device compatibility, do you think it's a good idea to use "version"
> > > > > > and "device version" fields?
> > > > > >
> > > > > > version field: identify live migration interface's version. it can have a
> > > > > > sort of backward compatibility, like target machine's version >= source
> > > > > > machine's version. something like that.  
> > > >
> > > > Don't we essentially already have this via the device specific region?
> > > > The struct vfio_info_cap_header includes id and version fields, so we
> > > > can declare a migration id and increment the version for any
> > > > incompatible changes to the protocol.  
> > > yes, good idea!
> > > so, what about declaring below new cap?
> > >     #define VFIO_REGION_INFO_CAP_MIGRATION 4
> > >     struct vfio_region_info_cap_migration {
> > >         struct vfio_info_cap_header header;
> > >         __u32 device_version_len;
> > >         __u8  device_version[];
> > >     };
> 
> I'm not sure why we need a new region for everything, it seems this
> could fit within the protocol of a single region.  This could simply be
> a new action to retrieve the version where the protocol would report
> the number of bytes available, just like the migration stream itself.
so, to get version of VFIO live migration device state interface (simply
call it migration interface?),
a new cap looks like this:
#define VFIO_REGION_INFO_CAP_MIGRATION 4
it contains struct vfio_info_cap_header only.
when get region info of the migration region, we query this cap and get
migration interface's version. right?

or just directly use VFIO_REGION_INFO_CAP_TYPE is fine?


> > > > > > device_version field consists two parts:
> > > > > > 1. vendor id : it takes 32 bits. e.g. 0x8086.  
> > > >
> > > > Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> > > > suggest we use a bit to flag it as such so we can reserve that portion
> > > > of the 32bit address space.  See for example:
> > > >
> > > > #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE        (1 << 31)
> > > > #define VFIO_REGION_TYPE_PCI_VENDOR_MASK        (0xffff)
> > > >
> > > > For vendor specific regions.  
> > > Yes, use PCI vendor ID.
> > > you are right, we need to use highest bit (VFIO_REGION_TYPE_PCI_VENDOR_TYPE)
> > > to identify it's a PCI ID.
> > > Thanks for pointing it out.
> > > But, I have a question. what is VFIO_REGION_TYPE_PCI_VENDOR_MASK used for?
> > > why it's 0xffff? I searched QEMU and kernel code and did not find anywhere
> > > uses it.
> 
> PCI vendor IDs are 16bits, it's just indicating that when the
> PCI_VENDOR_TYPE bit is set the valid data is the lower 16bits.

thanks:)

> > > > > > 2. vendor proprietary string: it can be any string that a vendor driver
> > > > > > thinks can identify a source device. e.g. pciid + mdev type.
> > > > > > "vendor id" is to avoid overlap of "vendor proprietary string".
> > > > > >
> > > > > >
> > > > > > struct vfio_device_state_ctl {
> > > > > >      __u32 version;            /* ro */
> > > > > >      __u8 device_version[MAX_DEVICE_VERSION_LEN];            /* ro */
> > > > > >      struct {
> > > > > >      	__u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE*/
> > > > > > 	...
> > > > > >      }data;
> > > > > >      ...
> > > > > >  };  
> > > >
> > > > We have a buffer area where we can read and write data from the vendor
> > > > driver, why would we have this IS_COMPATIBLE protocol to write the
> > > > device version string but use a static fixed length version string in
> > > > the control header to read it?  IOW, let's use GET_VERSION,
> > > > CHECK_VERSION actions that make use of the buffer area and allow vendor
> > > > specific version information length.  
> > > you are right, such static fixed length version string is bad :)
> > > To get device version, do you think which approach below is better?
> > > 1. use GET_VERSION action, and read from region buffer
> > > 2. get it when querying cap VFIO_REGION_INFO_CAP_MIGRATION
> > >
> > > seems approach 1 is better, and cap VFIO_REGION_INFO_CAP_MIGRATION is only
> > > for checking migration interface's version?
> 
> I think 1 provides the most flexibility to the vendor driver.

Got it.
For VFIO live migration, compared to reuse device state region (which takes
GET_BUFFER/SET_BUFFER actions),
could we create a new region for GET_VERSION & CHECK_VERSION ?

> > > > > > Then, an action IS_COMPATIBLE is added to check device compatibility.
> > > > > >
> > > > > > The flow to figure out whether a source device is migratable to target device
> > > > > > is like that:
> > > > > > 1. in source side's .save_setup, save source device's device_version string
> > > > > > 2. in target side's .load_state, load source device's device version string
> > > > > > and write it to data region, and call IS_COMPATIBLE action to ask vendor driver
> > > > > > to check whether the source device is compatible to it.
> > > > > >
> > > > > > The advantage of adding an IS_COMPATIBLE action is that, vendor driver can
> > > > > > maintain a compatibility table and decide whether source device is compatible
> > > > > > to target device according to its proprietary table.
> > > > > > In device_version string, vendor driver only has to describe the source
> > > > > > device as elaborately as possible and resorts to vendor driver in target side
> > > > > > to figure out whether they are compatible.  
> > > >
> > > > I agree, it's too complicated and restrictive to try to create an
> > > > interface for the user to determine compatibility, let the driver
> > > > declare it compatible or not.  
> > > :)
> > >  
> > > > > It would also be good if the 'IS_COMPATIBLE' was somehow callable
> > > > > externally - so we could be able to answer a question like 'can we
> > > > > migrate this VM to this host' - from the management layer before it
> > > > > actually starts the migration.  
> > >
> > > so qemu needs to expose two qmp/sysfs interfaces: GET_VERSION and CHECK_VERSION.
> > > GET_VERSION returns a vm's device's version string.
> > > CHECK_VERSION's input is device version string and return
> > > compatible/non-compatible.
> > > Do you think it's good?
> 
> That's the idea, but note that QEMU can only provide the QMP interface,
> the sysfs interface would of course be provided as more of a direct
> path from the vendor driver or mdev kernel layer.
> 
> > > > I think we'd need to mirror this capability in sysfs to support that,
> > > > or create a qmp interface through QEMU that the device owner could make
> > > > the request on behalf of the management layer.  Getting access to the
> > > > vfio device requires an iommu context that's already in use by the
> > > > device owner, we have no intention of supporting a model that allows
> > > > independent tasks access to a device.  Thanks,
> > > > Alex
> > > >  
> > > do you think two sysfs nodes under a device node is ok?
> > > e.g.
> > > /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/get_version
> > > /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/check_version  
> 
> I'd think it might live more in the mdev_support_types area, wouldn't
> we ideally like to know if a device is compatible even before it's
> created?  For example maybe:
> 
> /sys/class/mdev_bus/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_4/version
> 
> Where reading the sysfs attribute returns the version string and
> writing a string into the attribute return an errno for incompatibility.
yes, knowing if a device is compatible before it's created is good.
but do you think check whether a device is compatible after it's created is
also required? For live migration, user usually only queries this information
when it's really required, i.e. when a device has been created.
maybe we can add this version get/check at both places?


> 
> > Why do you need both sysfs and QMP at the same time? I can see it gives us some
> > flexibility, but is there something more to that?
> >
> > Normally, I'd prefer a QMP interface from libvirt's perspective (with an
> > appropriate capability that libvirt can check for QEMU support) because I imagine large nodes having a
> > bunch of GPUs with different revisions which might not be backwards compatible.
> > Libvirt might query the version string on source and check it on dest via the
> > QMP in a way that QEMU, talking to the driver, would return either a list or a
> > single physical device to which we can migrate, because neither QEMU nor
> > libvirt know that, only the driver does, so that's an important information
> > rather than looping through all the devices and trying to find one that is
> > compatible. However, you might have a hard time making all the necessary
> > changes in QMP introspectable, a new command would be fine, but if you also
> > wanted to extend say vfio-pci options, IIRC that would not appear in the QAPI
> > schema and libvirt would not be able to detect support for it.
> > 
> > On the other hand, the presence of a QMP interface IMO doesn't help mgmt apps
> > much, as it still carries the burden of being able to check this only at the
> > time of migration, which e.g. OpenStack would like to know long before that.
> > 
> > So, having sysfs attributes would work for both libvirt (even though libvirt
> > would benefit from a QMP much more) and OpenStack. OpenStack would IMO then
> > have to figure out how to create the mappings between compatible devices across
> > several nodes which are non-uniform.
> 
> Yep, vfio encompasses more than just QEMU, so a sysfs interface has more
> utility than a QMP interface.  For instance we couldn't predetermine if
> an mdev type on a host is compatible if we need to first create the
> device and launch a QEMU instance on it to get access to QMP.  So maybe
> the question is whether we should bother with any sort of VFIO API to
> do this comparison, perhaps only a sysfs interface is sufficient for a
> complete solution.  The downside of not having a version API in the
> user interface might be that QEMU on its own can only try a migration
> and see if it fails, it wouldn't have the ability to test expected
> compatibility without access to sysfs.  And maybe that's fine.  Thanks,
> 
So QEMU vfio uses sysfs to check device compatiblity in migration's save_setup
phase?

Thanks
Yan

next prev parent reply	other threads:[~2019-03-29  2:47 UTC|newest]

Thread overview: 77+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-02-19  8:50 [PATCH 0/5] QEMU VFIO live migration Yan Zhao
2019-02-19  8:52 ` [PATCH 1/5] vfio/migration: define kernel interfaces Yan Zhao
2019-02-19 13:09   ` Cornelia Huck
2019-02-20  7:36     ` Zhao Yan
2019-02-20 17:08       ` Cornelia Huck
2019-02-21  1:47         ` Zhao Yan
2019-02-19  8:52 ` [PATCH 2/5] vfio/migration: support device of device config capability Yan Zhao
2019-02-19 11:01   ` Dr. David Alan Gilbert
2019-02-20  5:12     ` Zhao Yan
2019-02-20 10:57       ` Dr. David Alan Gilbert
2019-02-19 14:37   ` Cornelia Huck
2019-02-20 22:54     ` Zhao Yan
2019-02-21 10:56       ` Cornelia Huck
2019-02-19  8:52 ` [PATCH 3/5] vfio/migration: tracking of dirty page in system memory Yan Zhao
2019-02-19  8:52 ` [PATCH 4/5] vfio/migration: turn on migration Yan Zhao
2019-02-19  8:53 ` [PATCH 5/5] vfio/migration: support device memory capability Yan Zhao
2019-02-19 11:25   ` Dr. David Alan Gilbert
2019-02-20  5:17     ` Zhao Yan
2019-02-19 14:42   ` Christophe de Dinechin
2019-02-20  7:58     ` Zhao Yan
2019-02-20 10:14       ` Christophe de Dinechin
2019-02-21  0:07         ` Zhao Yan
2019-02-19 11:32 ` [PATCH 0/5] QEMU VFIO live migration Dr. David Alan Gilbert
2019-02-20  5:28   ` Zhao Yan
2019-02-20 11:01     ` Dr. David Alan Gilbert
2019-02-20 11:28       ` Gonglei (Arei)
2019-02-20 11:42         ` Cornelia Huck
2019-02-20 12:07           ` Gonglei (Arei)
2019-03-27  6:35           ` Zhao Yan
2019-03-27 20:18             ` Dr. David Alan Gilbert
2019-03-27 22:10               ` Alex Williamson
2019-03-28  8:36                 ` Zhao Yan
2019-03-28  9:21                   ` Erik Skultety
2019-03-28 16:04                     ` Alex Williamson
2019-03-29  2:47                       ` Zhao Yan [this message]
2019-03-29 14:26                         ` Alex Williamson
2019-03-29 23:10                           ` Zhao Yan
2019-03-30 14:14                             ` Alex Williamson
2019-04-01  2:17                               ` Zhao Yan
2019-04-01  8:14                 ` Cornelia Huck
2019-04-01  8:40                   ` Yan Zhao
2019-04-01 14:15                     ` Alex Williamson
2019-02-21  0:31       ` Zhao Yan
2019-02-21  9:15         ` Dr. David Alan Gilbert
2019-02-20 11:56 ` Gonglei (Arei)
2019-02-21  0:24   ` Zhao Yan
2019-02-21  1:35     ` Gonglei (Arei)
2019-02-21  1:58       ` Zhao Yan
2019-02-21  3:33         ` Gonglei (Arei)
2019-02-21  4:08           ` Zhao Yan
2019-02-21  5:46             ` Gonglei (Arei)
2019-02-21  2:04       ` Zhao Yan
2019-02-21  3:16         ` Gonglei (Arei)
2019-02-21  4:21           ` Zhao Yan
2019-02-21  5:56             ` Gonglei (Arei)
2019-02-21 20:40 ` Alex Williamson
2019-02-25  2:22   ` Zhao Yan
2019-03-06  0:22     ` Zhao Yan
2019-03-07 17:44     ` Alex Williamson
2019-03-07 23:20       ` Tian, Kevin
2019-03-08 16:11         ` Alex Williamson
2019-03-08 16:21           ` Dr. David Alan Gilbert
2019-03-08 22:02             ` Alex Williamson
2019-03-11  2:33               ` Tian, Kevin
2019-03-11 20:19                 ` Alex Williamson
2019-03-12  2:48                   ` Tian, Kevin
2019-03-13 19:57                     ` Alex Williamson
2019-03-12  2:57       ` Zhao Yan
2019-03-13  1:13         ` Zhao Yan
2019-03-13 19:14           ` Alex Williamson
2019-03-14  1:12             ` Zhao Yan
2019-03-14 22:44               ` Alex Williamson
2019-03-14 23:05                 ` Zhao Yan
2019-03-15  2:24                   ` Alex Williamson
2019-03-18  2:51                     ` Zhao Yan
2019-03-18  3:09                       ` Alex Williamson
2019-03-18  3:27                         ` Zhao Yan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190329024704.GF14681@joy-OptiPlex-7040 \
    --to=yan.y.zhao@intel.com \
    --cc=Ken.Xue@amd.com \
    --cc=Zhengxiao.zx@alibaba-inc.com \
    --cc=aik@ozlabs.ru \
    --cc=alex.williamson@redhat.com \
    --cc=arei.gonglei@huawei.com \
    --cc=cjia@nvidia.com \
    --cc=dgilbert@redhat.com \
    --cc=eauger@redhat.com \
    --cc=eskultet@redhat.com \
    --cc=felipe@nutanix.com \
    --cc=intel-gvt-dev@lists.freedesktop.org \
    --cc=kevin.tian@intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=kwankhede@nvidia.com \
    --cc=mlevitsk@redhat.com \
    --cc=pasic@linux.ibm.com \
    --cc=qemu-devel@nongnu.org \
    --cc=shuangtai.tst@alibaba-inc.com \
    --cc=yi.l.liu@intel.com \
    --cc=ziye.yang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox