From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 28 Sep 2016 11:23:52 +0100
From: "Daniel P. Berrange"
Reply-To: "Daniel P. Berrange"
Message-ID: <20160928102352.GK21583@redhat.com>
References: <03BF752A-0E6A-4AAD-A310-DFACDF0B8339@nutanix.com>
 <20160927092712.GA563@stefanha-x1.localdomain>
 <20160927095458.GA2200@work-vm>
 <87twd0bdm4.fsf@emacs.mitica>
In-Reply-To: <87twd0bdm4.fsf@emacs.mitica>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Subject: Re: [Qemu-devel] Live migration without bdrv_drain_all()
To: Juan Quintela
Cc: "Dr. David Alan Gilbert", Mike Cui, Kevin Wolf, Stefan Hajnoczi,
 qemu-devel, Felipe Franciosi, Paolo Bonzini

On Wed, Sep 28, 2016 at 11:03:15AM +0200, Juan Quintela wrote:
> "Dr. David Alan Gilbert" wrote:
> > * Stefan Hajnoczi (stefanha@gmail.com) wrote:
> >> On Mon, Aug 29, 2016 at 06:56:42PM +0000, Felipe Franciosi wrote:
> >> > Heya!
> >> >
> >> > > On 29 Aug 2016, at 08:06, Stefan Hajnoczi wrote:
> >> > >
> >> > > At KVM Forum an interesting idea was proposed to avoid
> >> > > bdrv_drain_all() during live migration. Mike Cui and Felipe Franciosi
> >> > > mentioned running at queue depth 1. It needs more thought to make it
> >> > > workable but I want to capture it here for discussion and to archive
> >> > > it.
> >> > >
> >> > > bdrv_drain_all() is synchronous and can cause VM downtime if I/O
> >> > > requests hang. We should find a better way of quiescing I/O that is
> >> > > not synchronous. Up until now I thought we should simply add a
> >> > > timeout to bdrv_drain_all() so it can at least fail (and live
> >> > > migration would fail) if I/O is stuck instead of hanging the VM. But
> >> > > the following approach is also interesting...
> >> > >
> >> > > During the iteration phase of live migration we could limit the queue
> >> > > depth so points with no I/O requests in-flight are identified. At
> >> > > these points the migration algorithm has the opportunity to move to
> >> > > the next phase without requiring bdrv_drain_all() since no requests
> >> > > are pending.
> >> >
> >> > I actually think that this "io quiesced state" is highly unlikely
> >> > to _just_ happen on a busy guest. The main idea behind running at
> >> > QD1 is to naturally throttle the guest and make it easier to
> >> > "force quiesce" the VQs.
> >> >
> >> > In other words, if the guest is busy and we run at QD1, I would
> >> > expect the rings to be quite full of pending (ie. unprocessed)
> >> > requests. At the same time, I would expect that a call to
> >> > bdrv_drain_all() (as part of do_vm_stop()) should complete much
> >> > quicker.
> >> >
> >> > Nevertheless, you mentioned that this is still problematic as that
> >> > single outstanding IO could block, leaving the VM paused for
> >> > longer.
> >> >
> >> > My suggestion is therefore that we leave the vCPUs running, but
> >> > stop picking up requests from the VQs. Provided nothing blocks,
> >> > you should reach the "io quiesced state" fairly quickly. If you
> >> > don't, then the VM is at least still running (despite seeing no
> >> > progress on its VQs).
> >> >
> >> > Thoughts on that?
> >>
> >> If the guest experiences a hung disk it may enter error recovery. QEMU
> >> should avoid this so the guest doesn't remount file systems read-only.
> >>
> >> This can be solved by only quiescing the disk for, say, 30 seconds at a
> >> time. If we don't reach a point where live migration can proceed during
> >> those 30 seconds then the disk will service requests again temporarily
> >> to avoid upsetting the guest.
> >>
> >> I wonder if Juan or David have any thoughts from the live migration
> >> perspective?
> >
> > Throttling IO to reduce the time in the final drain makes sense
> > to me, however:
> >   a) It doesn't solve the problem if the IO device dies at just the
> >      wrong time, so you can still get that hang in bdrv_drain_all
> >
> >   b) Completely stopping guest IO sounds too drastic to me unless you can
> >      time it to be just at the point before the end of migration; that feels
> >      tricky to get right unless you can somehow tie it to an estimate of
> >      remaining dirty RAM (that never works that well).
> >
> >   c) Something like a 30 second pause still feels too long; if that was
> >      a big hairy database workload it would effectively be 30 seconds
> >      of downtime.
> >
> > Dave
>
> I think something like the proposed thing could work.
>
> We can put queue depth = 1 or somesuch when we know we are near
> completion for migration. What we need then is a way to call the
> equivalent of:
>
> bdrv_drain_all() to return EAGAIN or EBUSY if it is a bad moment. In
> that case, we just do another round over the whole memory, or retry in X
> seconds. Anything is good for us, we just need a way to ask for the
> operation but that it doesn't block.
>
> Notice that migration is the equivalent of:
>
> while (true) {
>     write_some_dirty_pages();
>     if (dirty_pages < threshold) {
>         break;
>     }
> }
> bdrv_drain_all();
> write_rest_of_dirty_pages();
>
> (Lots and lots of details omitted)
>
> What we really want is to issue the call of the bdrv_drain_all()
> equivalent inside the while, so, if there is any problem, we just do
> another cycle, no problem.
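To make that concrete, a rough sketch of the loop you describe, in the
same pseudocode spirit as your mail (write_some_dirty_pages() and friends
are your placeholders). bdrv_try_drain_all() is just a made-up name for
the hypothetical non-blocking variant that reports a bad moment instead
of hanging; nothing like it exists in the tree today:

/* Hypothetical non-blocking variant of bdrv_drain_all(): returns 0 when
 * no requests are in flight, -EAGAIN (or -EBUSY) otherwise. Made-up API
 * for illustration only. */
int bdrv_try_drain_all(void);

static void migration_iterate_and_quiesce(void)
{
    /* Near convergence we would also throttle devices to queue depth 1
     * (or stop dequeuing from the virtqueues) so that an idle point is
     * actually reachable on a busy guest. */
    while (true) {
        write_some_dirty_pages();

        if (dirty_pages < threshold) {
            /* RAM has converged; check whether I/O is quiet too,
             * without blocking if it is not. */
            if (bdrv_try_drain_all() == 0) {
                break;  /* quiesced: safe to stop the guest */
            }
            /* Bad moment: requests still in flight (or stuck), so do
             * another round over memory / retry in X seconds instead
             * of hanging inside bdrv_drain_all(). */
        }
    }

    write_rest_of_dirty_pages();
}

i.e. the only remaining blocking step is the final device stop, and it
only happens once the non-blocking check has already reported an idle
point.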
It seems that the main downside of this is that it makes normal pre-copy
live migration even less likely to successfully complete than it already
is. This increases the likelihood of needing to use post-copy live
migration, which has the same bdrv_drain_all problem. This is hard to
solve because QEMU isn't in charge of when post-copy starts, so it can't
simply wait for a convenient moment to switch to post-copy if drain_all
is busy.

Regards,
Daniel

-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|