Date: Tue, 27 Sep 2016 10:54:58 +0100
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Stefan Hajnoczi <stefanha@gmail.com>
Cc: Felipe Franciosi, qemu-devel <qemu-devel@nongnu.org>, Mike Cui,
 Kevin Wolf, Paolo Bonzini, Juan Quintela
Subject: Re: [Qemu-devel] Live migration without bdrv_drain_all()
Message-ID: <20160927095458.GA2200@work-vm>
References: <03BF752A-0E6A-4AAD-A310-DFACDF0B8339@nutanix.com>
 <20160927092712.GA563@stefanha-x1.localdomain>
In-Reply-To: <20160927092712.GA563@stefanha-x1.localdomain>

* Stefan Hajnoczi (stefanha@gmail.com) wrote:
> On Mon, Aug 29, 2016 at 06:56:42PM +0000, Felipe Franciosi wrote:
> > Heya!
> >
> > > On 29 Aug 2016, at 08:06, Stefan Hajnoczi wrote:
> > >
> > > At KVM Forum an interesting idea was proposed to avoid
> > > bdrv_drain_all() during live migration. Mike Cui and Felipe
> > > Franciosi mentioned running at queue depth 1. It needs more thought
> > > to make it workable, but I want to capture it here for discussion
> > > and to archive it.
> > >
> > > bdrv_drain_all() is synchronous and can cause VM downtime if I/O
> > > requests hang. We should find a better way of quiescing I/O that is
> > > not synchronous. Up until now I thought we should simply add a
> > > timeout to bdrv_drain_all() so it can at least fail (and live
> > > migration would fail) if I/O is stuck, instead of hanging the VM.
> > > But the following approach is also interesting...
> > >
> > > During the iteration phase of live migration we could limit the
> > > queue depth so that points with no I/O requests in flight can be
> > > identified. At these points the migration algorithm has the
> > > opportunity to move to the next phase without requiring
> > > bdrv_drain_all(), since no requests are pending.
> >
> > I actually think that this "io quiesced state" is highly unlikely to
> > _just_ happen on a busy guest. The main idea behind running at QD1 is
> > to naturally throttle the guest and make it easier to "force quiesce"
> > the VQs.
> >
> > In other words, if the guest is busy and we run at QD1, I would
> > expect the rings to be quite full of pending (i.e. unprocessed)
> > requests. At the same time, I would expect a call to bdrv_drain_all()
> > (as part of do_vm_stop()) to complete much more quickly.
> >
> > Nevertheless, you mentioned that this is still problematic, as that
> > single outstanding IO could block, leaving the VM paused for longer.
> >
> > My suggestion is therefore that we leave the vCPUs running but stop
> > picking up requests from the VQs. Provided nothing blocks, you should
> > reach the "io quiesced state" fairly quickly. If you don't, then the
> > VM is at least still running (despite seeing no progress on its VQs).
> >
> > Thoughts on that?
>
> If the guest experiences a hung disk it may enter error recovery. QEMU
> should avoid this so the guest doesn't remount file systems read-only.
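
As a rough sketch of the "timeout on the drain" idea above, the loop
would look something like this; requests_in_flight() and
process_completions() are invented stand-ins for the block layer's
bookkeeping and aio_poll(), not QEMU functions:

/*
 * Rough sketch only: a drain that gives up after a deadline instead of
 * blocking forever.  The helpers below are invented for illustration.
 */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

static int pending = 3;                 /* pretend three requests are in flight */

static int requests_in_flight(void)
{
    return pending;
}

static void process_completions(void)   /* one completion per call, for the demo */
{
    if (pending > 0) {
        pending--;
    }
}

/* Returns true if everything drained before the deadline expired. */
static bool drain_all_with_timeout(double timeout_secs)
{
    time_t start = time(NULL);

    while (requests_in_flight() > 0) {
        if (difftime(time(NULL), start) > timeout_secs) {
            return false;               /* stuck I/O: fail the migration, not the VM */
        }
        process_completions();
    }
    return true;
}

int main(void)
{
    if (drain_all_with_timeout(30.0)) {
        printf("drained, safe to complete migration\n");
    } else {
        fprintf(stderr, "drain timed out, abort migration and keep the VM running\n");
    }
    return 0;
}

The point being that a stuck request fails the migration rather than
freezing the guest.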
>
> This can be solved by only quiescing the disk for, say, 30 seconds at a
> time. If we don't reach a point where live migration can proceed during
> those 30 seconds, then the disk will service requests again temporarily
> to avoid upsetting the guest.
>
> I wonder if Juan or David have any thoughts from the live migration
> perspective?

Throttling IO to reduce the time in the final drain makes sense to me;
however:

  a) It doesn't solve the problem if the IO device dies at just the
     wrong time, so you can still get that hang in bdrv_drain_all().
  b) Completely stopping guest IO sounds too drastic to me unless you
     can time it to be just at the point before the end of migration;
     that feels tricky to get right unless you can somehow tie it to an
     estimate of the remaining dirty RAM (and that never works that
     well).
  c) Something like a 30-second pause still feels too long; if that were
     a big hairy database workload, it would effectively be 30 seconds
     of downtime.

Dave

>
> Stefan

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
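
P.S. For concreteness, a toy model of the "quiesce for at most 30
seconds, then service requests again" scheme being discussed; all helper
names below are made up for illustration and are not QEMU's API:

/*
 * Toy model of the time-boxed quiesce idea: stop fetching new requests
 * from the virtqueues, wait for in-flight I/O to hit zero, and resume
 * servicing if that doesn't happen within the window, so the guest
 * never sees a disk hung long enough to trigger error recovery.
 */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define QUIESCE_WINDOW_SECS 30

static int outstanding = 5;                /* pretend five requests are in flight */

static void stop_fetching_requests(void)   { printf("virtqueues: stop fetching\n"); }
static void resume_fetching_requests(void) { printf("virtqueues: resume fetching\n"); }
static int  in_flight_requests(void)       { return outstanding; }
static void poll_completions(void)         { if (outstanding > 0) outstanding--; }

/*
 * Try to reach a point with no I/O in flight while the vCPUs keep
 * running.  Give up when the window expires so the guest is not upset
 * for too long.
 */
static bool try_reach_quiesced_state(void)
{
    time_t start = time(NULL);

    stop_fetching_requests();
    while (in_flight_requests() > 0) {
        if (difftime(time(NULL), start) > QUIESCE_WINDOW_SECS) {
            resume_fetching_requests();    /* avoid guest error recovery */
            return false;                  /* try again on a later iteration */
        }
        poll_completions();
    }
    return true;                           /* migration can move to the next phase */
}

int main(void)
{
    if (try_reach_quiesced_state()) {
        printf("quiesced: complete migration without a blocking bdrv_drain_all()\n");
    } else {
        printf("window expired: keep iterating\n");
    }
    return 0;
}

Whether a window as long as 30 seconds is acceptable is exactly the
concern raised in (c) above.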