From: Alex Williamson <alex.williamson@redhat.com>
To: Zhao Yan <yan.y.zhao@intel.com>
Cc: "cjia@nvidia.com" <cjia@nvidia.com>,
"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
"aik@ozlabs.ru" <aik@ozlabs.ru>,
"Zhengxiao.zx@Alibaba-inc.com" <Zhengxiao.zx@Alibaba-inc.com>,
"shuangtai.tst@alibaba-inc.com" <shuangtai.tst@alibaba-inc.com>,
"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
"kwankhede@nvidia.com" <kwankhede@nvidia.com>,
"eauger@redhat.com" <eauger@redhat.com>,
"Liu, Yi L" <yi.l.liu@intel.com>,
"eskultet@redhat.com" <eskultet@redhat.com>,
"Yang, Ziye" <ziye.yang@intel.com>,
"mlevitsk@redhat.com" <mlevitsk@redhat.com>,
"pasic@linux.ibm.com" <pasic@linux.ibm.com>,
"arei.gonglei@huawei.com" <arei.gonglei@huawei.com>,
"felipe@nutanix.com" <felipe@nutanix.com>,
"Wang, Zhi A" <zhi.a.wang@intel.com>,
"Tian, Kevin" <kevin.tian@intel.com>,
"dgilbert@redhat.com" <dgilbert@redhat.com>,
"intel-gvt-dev@lists.freedesktop.org"
<intel-gvt-dev@lists.freedesktop.org>
Subject: Re: [PATCH 0/5] QEMU VFIO live migration
Date: Sun, 17 Mar 2019 21:09:04 -0600
Message-ID: <20190317210904.050f50c3@x1.home>
In-Reply-To: <20190318025126.GA14574@joy-OptiPlex-7040>
On Sun, 17 Mar 2019 22:51:27 -0400
Zhao Yan <yan.y.zhao@intel.com> wrote:
> On Fri, Mar 15, 2019 at 10:24:02AM +0800, Alex Williamson wrote:
> > On Thu, 14 Mar 2019 19:05:06 -0400
> > Zhao Yan <yan.y.zhao@intel.com> wrote:
> >
> > > On Fri, Mar 15, 2019 at 06:44:58AM +0800, Alex Williamson wrote:
> > > > On Wed, 13 Mar 2019 21:12:22 -0400
> > > > Zhao Yan <yan.y.zhao@intel.com> wrote:
> > > >
> > > > > On Thu, Mar 14, 2019 at 03:14:54AM +0800, Alex Williamson wrote:
> > > > > > On Tue, 12 Mar 2019 21:13:01 -0400
> > > > > > Zhao Yan <yan.y.zhao@intel.com> wrote:
> > > > > >
> > > > > > > hi Alex
> > > > > > > Any comments to the sequence below?
> > > > > > >
> > > > > > > Actually we have some concerns and suggestions about userspace-opaque
> > > > > > > migration data.
> > > > > > >
> > > > > > > 1. if data is opaque to userspace, kernel interface must be tightly bound to
> > > > > > > migration.
> > > > > > > e.g. the vendor driver has to know that state (running + not logging)
> > > > > > > should not return any data, and that state (running + logging) should
> > > > > > > return the whole snapshot first and dirty data later. It also has to know
> > > > > > > QEMU migration will not call GET_BUFFER in state (running + not logging);
> > > > > > > otherwise, it has to adjust its behavior.
> > > > > >
> > > > > > This all just sounds like defining the protocol we expect with the
> > > > > > interface. For instance if we define a session as beginning when
> > > > > > logging is enabled and ending when the device is stopped and the
> > > > > > interface reports no more data is available, then we can state that any
> > > > > > partial accumulation of data is incomplete relative to migration. If
> > > > > > userspace wants to initiate a new migration stream, they can simply
> > > > > > toggle logging. How the vendor driver provides the data during the
> > > > > > session is not defined, but beginning the session with a snapshot
> > > > > > followed by repeated iterations of dirtied data is certainly a valid
> > > > > > approach.
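The session rules Alex describes above can be written down as a tiny state model. This is purely an illustrative sketch of the protocol semantics (the state bits, struct, and helpers are assumptions for illustration, not the VFIO region/ioctl interface under discussion):

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative only: models the session rules described above. */
enum {
    MIG_RUNNING = 1 << 0,
    MIG_LOGGING = 1 << 1,   /* a session is open while this bit is set */
};

struct mig_session {
    unsigned int state;
    size_t pending;         /* bytes the vendor driver still has to supply */
};

/* A session begins when logging is enabled; toggling logging off and on
 * again discards the old session and initiates a new migration stream. */
static bool session_active(const struct mig_session *s)
{
    return s->state & MIG_LOGGING;
}

/* The migration stream is complete only when the device is stopped,
 * still logging, and the driver reports no more data available. */
static bool stream_complete(const struct mig_session *s)
{
    return session_active(s) && !(s->state & MIG_RUNNING) && s->pending == 0;
}
```

Any data read outside an active session is then, by definition, incomplete relative to migration.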
> > > > > >
> > > > > > > 2. vendor driver cannot ensure userspace gets all the data it intends to
> > > > > > > save in pre-copy phase.
> > > > > > > e.g. in stop-and-copy phase, vendor driver has to first check and send
> > > > > > > data in previous phase.
> > > > > >
> > > > > > First, I don't think the device has control of when QEMU switches from
> > > > > > pre-copy to stop-and-copy, the protocol needs to support that
> > > > > > transition at any point. However, it seems a simple data-available
> > > > > > counter provides an indication of when it might be optimal to make such
> > > > > > a transition. If a vendor driver follows a scheme as above, the
> > > > > > available data counter would indicate a large value, the entire initial
> > > > > > snapshot of the device. As the migration continues and pages are
> > > > > > dirtied, the device would reach a steady state amount of data
> > > > > > available, depending on the guest activity. This could indicate to the
> > > > > > user to stop the device. The migration stream would not be considered
> > > > > > completed until the available data counter reaches zero while the
> > > > > > device is in the stopped|logging state.
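A user could drive the pre-copy to stop-and-copy decision off that counter alone. A rough sketch of such a heuristic follows; the function name, plateau test, and downtime budget are all assumptions for illustration, not part of any proposed interface:

```c
#include <stdbool.h>
#include <stddef.h>

/* Heuristic sketch: once the available-data counter stops shrinking
 * (the steady state set by guest activity) or is already small enough
 * to finish within the downtime budget, it may be optimal to stop the
 * device and enter stop-and-copy. */
static bool should_stop_device(size_t prev_pending, size_t pending,
                               size_t downtime_budget_bytes)
{
    bool steady = pending >= prev_pending;           /* no longer converging */
    bool cheap  = pending <= downtime_budget_bytes;  /* finishes in budget   */
    return steady || cheap;
}
```

A real implementation would smooth the counter over several reads rather than comparing two samples.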
> > > > > >
> > > > > > > 3. if all the sequence is tightly bound to live migration, can we remove the
> > > > > > > logging state? what about adding two states migrate-in and migrate-out?
> > > > > > > so there are four states: running, stopped, migrate-in, migrate-out.
> > > > > > > migrate-out is for source side when migration starts. together with
> > > > > > > state running and stopped, it can substitute state logging.
> > > > > > > migrate-in is for target side.
> > > > > >
> > > > > > In fact, Kirti's implementation specifies a data direction, but I think
> > > > > > we still need logging to indicate sessions. I'd also assume that
> > > > > > logging implies some overhead for the vendor driver.
> > > > > >
> > > > > OK. If you prefer logging, I'm OK with it; I just found migrate-in and
> > > > > migrate-out more universal against hardware requirement changes.
> > > > >
> > > > > > > On Tue, Mar 12, 2019 at 10:57:47AM +0800, Zhao Yan wrote:
> > > > > > > > hi Alex
> > > > > > > > thanks for your reply.
> > > > > > > >
> > > > > > > > So, if we choose migration data to be userspace-opaque, do you think the
> > > > > > > > sequence below is the right behavior for the vendor driver to follow:
> > > > > > > >
> > > > > > > > 1. initially LOGGING state is not set. If userspace calls GET_BUFFER to
> > > > > > > > vendor driver, vendor driver should reject and return 0.
> > > > > >
> > > > > > What would this state mean otherwise? If we're not logging then it
> > > > > > should not be expected that we can construct dirtied data from a
> > > > > > previous read of the state before logging was enabled (it would be
> > > > > > outside of the "session"). So at best this is an incomplete segment of
> > > > > > the initial snapshot of the device, but that presumes how the vendor
> > > > > > driver constructs the data. I wouldn't necessarily mandate the vendor
> > > > > > driver reject it, but I think we should consider it undefined and
> > > > > > vendor specific relative to the migration interface.
> > > > > >
> > > > > > > > 2. then LOGGING state is set, if userspace calls GET_BUFFER to vendor
> > > > > > > > driver,
> > > > > > > > a. vendor driver should first query a whole snapshot of device memory
> > > > > > > > (let's use this term to represent device's standalone memory for now),
> > > > > > > > b. vendor driver returns a chunk of data just queried to userspace,
> > > > > > > > while recording current pos in data.
> > > > > > > > c. vendor driver finds all data just queried is finished transmitting to
> > > > > > > > userspace, and queries only dirty data in device memory now.
> > > > > > > > d. vendor driver returns a chunk of data just queried (this time dirty
> > > > > > > > data) to userspace while recording current pos in data
> > > > > > > > e. if all data is transmitted to userspace and still GET_BUFFERs come from
> > > > > > > > userspace, vendor driver starts another round of dirty data query.
> > > > > >
> > > > > > This is a valid vendor driver approach, but it's outside the scope of
> > > > > > the interface definition. A vendor driver could also decide to not
> > > > > > provide any data until both stopped and logging are set and then
> > > > > > provide a fixed, final snapshot. The interface supports either
> > > > > > approach by defining the protocol to interact with it.
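The snapshot-then-dirty chunking of steps a-e above could be bookkept roughly as below. This is a hypothetical vendor-side sketch: the struct and helper are invented for illustration, and a real driver would refill `buf` from the device hardware between rounds rather than hold a static buffer:

```c
#include <stddef.h>
#include <string.h>

enum phase { PHASE_SNAPSHOT, PHASE_DIRTY };

/* Hypothetical vendor-side state for steps a-e: buf holds the result of
 * the last device query (the snapshot first, then dirty rounds), and
 * pos is the recorded position within it (steps b and d). */
struct vendor_stream {
    enum phase phase;
    const unsigned char *buf;
    size_t len;
    size_t pos;
};

/* Serve up to 'count' bytes of the current chunk to userspace.  Once
 * the snapshot is fully transmitted, subsequent device queries would be
 * dirty-data rounds (steps c and e). */
static size_t get_buffer(struct vendor_stream *s, unsigned char *out,
                         size_t count)
{
    size_t n = s->len - s->pos;

    if (n > count)
        n = count;
    memcpy(out, s->buf + s->pos, n);
    s->pos += n;
    if (s->pos == s->len && s->phase == PHASE_SNAPSHOT)
        s->phase = PHASE_DIRTY;
    return n;
}
```

As Alex notes, this is only one valid scheme; a driver that produces nothing until stopped|logging fits the same interface.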
> > > > > >
> > > > > > > > 3. if LOGGING state is unset then, and userspace calls GET_BUFFER to vendor
> > > > > > > > driver,
> > > > > > > > a. if vendor driver finds there's previously untransmitted data, returns
> > > > > > > > them until all transmitted.
> > > > > > > > b. vendor driver then queries dirty data again and transmits them.
> > > > > > > > c. at last, vendor driver queries device config data (which has to be
> > > > > > > > queried last and sent only once) and transmits them.
> > > > > >
> > > > > > This seems broken, the vendor driver is presuming the user intentions.
> > > > > > If logging is unset, we return to bullet 1, reading data is undefined
> > > > > > and vendor specific. It's outside of the session.
> > > > > >
> > > > > > > > for bullet 1, if LOGGING state is first set and migration then aborts,
> > > > > > > > the vendor driver has to be able to detect that condition. So seemingly
> > > > > > > > the vendor driver has to know more of QEMU's migration state, like
> > > > > > > > migration having been called and failed. Do you think that's acceptable?
> > > > > >
> > > > > > If migration aborts, logging is cleared and the device continues
> > > > > > operation. If a new migration is started, the session is initiated by
> > > > > > enabling logging. Sound reasonable? Thanks,
> > > > > >
> > > > >
> > > > > For the flow, I still have a question.
> > > > > There are 2 approaches below, which one do you prefer?
> > > > >
> > > > > Approach A, in precopy stage, the sequence is
> > > > >
> > > > > (1)
> > > > > .save_live_pending --> return whole snapshot size
> > > > > .save_live_iterate --> save whole snapshot
> > > > >
> > > > > (2)
> > > > > .save_live_pending --> get dirty data, return dirty data size
> > > > > .save_live_iterate --> save all dirty data
> > > > >
> > > > > (3)
> > > > > .save_live_pending --> get dirty data again, return dirty data size
> > > > > .save_live_iterate --> save all dirty data
> > > > >
> > > > >
> > > > > Approach B, in precopy stage, the sequence is
> > > > > (1)
> > > > > .save_live_pending --> return whole snapshot size
> > > > > .save_live_iterate --> save part of snapshot
> > > > >
> > > > > (2)
> > > > > .save_live_pending --> return rest part of whole snapshot size +
> > > > > current dirty data size
> > > > > .save_live_iterate --> save part of snapshot
> > > > >
> > > > > (3) repeat (2) until whole snapshot saved.
> > > > >
> > > > > (4)
> > > > > .save_live_pending --> get dirty data and return current dirty data size
> > > > > .save_live_iterate --> save part of dirty data
> > > > >
> > > > > (5)
> > > > > .save_live_pending --> return rest part of dirty data size +
> > > > > delta size of dirty data
> > > > > .save_live_iterate --> save part of dirty data
> > > > >
> > > > > (6)
> > > > > repeat (5) until precopy stops
> > > >
> > > > I don't really understand the question here. If the vendor driver's
> > > > approach is to send a full snapshot followed by iterations of dirty
> > > > data, then when the user enables logging and reads the counter for
> > > > available data it should report the (size of the snapshot). The next
> > > > time the user reads the counter, it should report
> > > > (size of the snapshot) - (what the user has already read) + (size of
> > > > the dirty data since the snapshot). As the user continues to read past
> > > > the snapshot data, the available data counter transitions to reporting
> > > > only the size of the remaining dirty data, which is monotonically
> > > > increasing. I guess this would be more similar to your approach B,
> > > > which seems to suggest that the interface needs to continue providing
> > > > data regardless of whether the user fully exhausted the available data
> > > > from the previous cycle. Thanks,
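The counter arithmetic described above can be written out directly. This is purely a worked model of those semantics, not an interface definition:

```c
#include <stddef.h>

/* While any of the initial snapshot is unread, the counter reports
 * (snapshot size) - (already read) + (dirty since snapshot); once the
 * snapshot is consumed, it reports only the remaining dirty data, which
 * grows monotonically while the guest keeps running. */
static size_t available_data(size_t snapshot, size_t already_read,
                             size_t dirty)
{
    size_t snap_left = already_read < snapshot ? snapshot - already_read : 0;
    return snap_left + dirty;
}
```

This matches approach B: the counter keeps reporting data regardless of whether the user exhausted the previous cycle.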
> > > >
> > >
> > > Right. But regarding the VFIO migration code in QEMU, rather than saving
> > > one chunk each time, do you think it is better to exhaust all reported data
> > > from .save_live_pending in each .save_live_iterate callback? (Even though
> > > the vendor driver will handle the case where userspace cannot exhaust
> > > all data, VFIO QEMU can still try to save as much available data as it can
> > > each time.)
> >
> > Don't you suspect that some devices might have state that's too large
> > to process in each iteration? I expect we'll need to use heuristics on
> > data size or time spent on each iteration round such that some devices
> > might be able to fully process their pending data while others will
> > require multiple passes or make up the balance once we've entered stop
> > and copy. Thanks,
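The heuristic suggested here, bounding each iteration by data size (and, in practice, elapsed time), could look like the sketch below; the budget and names are illustrative assumptions:

```c
#include <stddef.h>

/* Cap the work done in one .save_live_iterate pass: small devices drain
 * fully in a single round, while large ones make partial progress and
 * leave the balance for later rounds or for stop-and-copy.  A real
 * implementation would also bound elapsed time per round. */
static size_t bytes_this_round(size_t pending, size_t byte_budget)
{
    return pending < byte_budget ? pending : byte_budget;
}
```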
> >
> hi Alex
> What about looping and draining the pending data in each iteration? :)
How is this question different than your previous question? Thanks,
Alex