From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 28 Sep 2016 11:23:52 +0100
From: "Daniel P. Berrange"
Reply-To: "Daniel P. Berrange"
Message-ID: <20160928102352.GK21583@redhat.com>
References: <03BF752A-0E6A-4AAD-A310-DFACDF0B8339@nutanix.com>
 <20160927092712.GA563@stefanha-x1.localdomain>
 <20160927095458.GA2200@work-vm>
 <87twd0bdm4.fsf@emacs.mitica>
In-Reply-To: <87twd0bdm4.fsf@emacs.mitica>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Subject: Re: [Qemu-devel] Live migration without bdrv_drain_all()
To: Juan Quintela
Cc: "Dr. David Alan Gilbert", Mike Cui, Kevin Wolf, Stefan Hajnoczi,
 qemu-devel, Felipe Franciosi, Paolo Bonzini

On Wed, Sep 28, 2016 at 11:03:15AM +0200, Juan Quintela wrote:
> "Dr. David Alan Gilbert" wrote:
> > * Stefan Hajnoczi (stefanha@gmail.com) wrote:
> >> On Mon, Aug 29, 2016 at 06:56:42PM +0000, Felipe Franciosi wrote:
> >> > Heya!
> >> >
> >> > > On 29 Aug 2016, at 08:06, Stefan Hajnoczi wrote:
> >> > >
> >> > > At KVM Forum an interesting idea was proposed to avoid
> >> > > bdrv_drain_all() during live migration. Mike Cui and Felipe Franciosi
> >> > > mentioned running at queue depth 1. It needs more thought to make it
> >> > > workable but I want to capture it here for discussion and to archive
> >> > > it.
> >> > >
> >> > > bdrv_drain_all() is synchronous and can cause VM downtime if I/O
> >> > > requests hang. We should find a better way of quiescing I/O that is
> >> > > not synchronous. Up until now I thought we should simply add a
> >> > > timeout to bdrv_drain_all() so it can at least fail (and live
> >> > > migration would fail) if I/O is stuck instead of hanging the VM. But
> >> > > the following approach is also interesting...
> >> > >
> >> > > During the iteration phase of live migration we could limit the queue
> >> > > depth so points with no I/O requests in-flight are identified. At
> >> > > these points the migration algorithm has the opportunity to move to
> >> > > the next phase without requiring bdrv_drain_all() since no requests
> >> > > are pending.
> >> >
> >> > I actually think that this "io quiesced state" is highly unlikely
> >> > to _just_ happen on a busy guest. The main idea behind running at
> >> > QD1 is to naturally throttle the guest and make it easier to
> >> > "force quiesce" the VQs.
> >> >
> >> > In other words, if the guest is busy and we run at QD1, I would
> >> > expect the rings to be quite full of pending (ie. unprocessed)
> >> > requests. At the same time, I would expect that a call to
> >> > bdrv_drain_all() (as part of do_vm_stop()) should complete much
> >> > quicker.
> >> >
> >> > Nevertheless, you mentioned that this is still problematic as that
> >> > single outstanding IO could block, leaving the VM paused for
> >> > longer.
> >> >
> >> > My suggestion is therefore that we leave the vCPUs running, but
> >> > stop picking up requests from the VQs. Provided nothing blocks,
> >> > you should reach the "io quiesced state" fairly quickly. If you
> >> > don't, then the VM is at least still running (despite seeing no
> >> > progress on its VQs).
> >> >
> >> > Thoughts on that?
> >>
> >> If the guest experiences a hung disk it may enter error recovery. QEMU
> >> should avoid this so the guest doesn't remount file systems read-only.
> >>
> >> This can be solved by only quiescing the disk for, say, 30 seconds at a
> >> time. If we don't reach a point where live migration can proceed during
> >> those 30 seconds then the disk will service requests again temporarily
> >> to avoid upsetting the guest.
> >>
> >> I wonder if Juan or David have any thoughts from the live migration
> >> perspective?
> >
> > Throttling IO to reduce the time in the final drain makes sense
> > to me, however:
> >   a) It doesn't solve the problem if the IO device dies at just the
> >      wrong time, so you can still get that hang in bdrv_drain_all
> >
> >   b) Completely stopping guest IO sounds too drastic to me unless you can
> >      time it to be just at the point before the end of migration; that feels
> >      tricky to get right unless you can somehow tie it to an estimate of
> >      remaining dirty RAM (that never works that well).
> >
> >   c) Something like a 30 second pause still feels too long; if that was
> >      a big hairy database workload it would effectively be 30 seconds
> >      of downtime.
> >
> > Dave
>
> I think something like the proposed thing could work.
>
> We can put queue depth = 1 or somesuch when we know we are near
> completion for migration. What we need then is a way to call the
> equivalent of:
>
> bdrv_drain_all() to return EAGAIN or EBUSY if it is a bad moment. In
> that case, we just do another round over the whole memory, or retry in X
> seconds. Anything is good for us, we just need a way to ask for the
> operation but that it doesn't block.
>
> Notice that migration is the equivalent of:
>
> while (true) {
>     write_some_dirty_pages();
>     if (dirty_pages < threshold) {
>         break;
>     }
> }
> bdrv_drain_all();
> write_rest_of_dirty_pages();
>
> (Lots and lots of details omitted)
>
> What we really want is to issue the call of the bdrv_drain_all()
> equivalent inside the while, so, if there is any problem, we just do
> another cycle, no problem.
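To make that concrete, a rough sketch of the loop you describe, in the
same pseudocode spirit as your mail (write_some_dirty_pages() and friends
are your placeholders). bdrv_try_drain_all() is just a made-up name for
the hypothetical non-blocking variant that reports a bad moment instead
of hanging; nothing like it exists in the tree today:

/* Hypothetical non-blocking variant of bdrv_drain_all(): returns 0 when
 * no requests are in flight, -EAGAIN (or -EBUSY) otherwise. Made-up API
 * for illustration only. */
int bdrv_try_drain_all(void);

static void migration_iterate_and_quiesce(void)
{
    /* Near convergence we would also throttle devices to queue depth 1
     * (or stop dequeuing from the virtqueues) so that an idle point is
     * actually reachable on a busy guest. */
    while (true) {
        write_some_dirty_pages();

        if (dirty_pages < threshold) {
            /* RAM has converged; check whether I/O is quiet too,
             * without blocking if it is not. */
            if (bdrv_try_drain_all() == 0) {
                break;  /* quiesced: safe to stop the guest */
            }
            /* Bad moment: requests still in flight (or stuck), so do
             * another round over memory / retry in X seconds instead
             * of hanging inside bdrv_drain_all(). */
        }
    }

    write_rest_of_dirty_pages();
}

i.e. the only remaining blocking step is the final device stop, and it
only happens once the non-blocking check has already reported an idle
point.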
It seems that the main downside of this is that it makes normal pre-copy
live migration even less likely to successfully complete than it already
is. This increases the likelihood of needing to use post-copy live
migration, which has the same bdrv_drain_all problem. This is hard to
solve because QEMU isn't in charge of when post-copy starts, so it can't
simply wait for a convenient moment to switch to post-copy if drain_all
is busy.

Regards,
Daniel

-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|