Date: Tue, 27 Sep 2016 10:54:58 +0100
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Stefan Hajnoczi <stefanha@gmail.com>
Cc: Felipe Franciosi, qemu-devel <qemu-devel@nongnu.org>, Mike Cui,
 Kevin Wolf, Paolo Bonzini, Juan Quintela
Subject: Re: [Qemu-devel] Live migration without bdrv_drain_all()
Message-ID: <20160927095458.GA2200@work-vm>
References: <03BF752A-0E6A-4AAD-A310-DFACDF0B8339@nutanix.com>
 <20160927092712.GA563@stefanha-x1.localdomain>
In-Reply-To: <20160927092712.GA563@stefanha-x1.localdomain>

* Stefan Hajnoczi (stefanha@gmail.com) wrote:
> On Mon, Aug 29, 2016 at 06:56:42PM +0000, Felipe Franciosi wrote:
> > Heya!
> >
> > > On 29 Aug 2016, at 08:06, Stefan Hajnoczi wrote:
> > >
> > > At KVM Forum an interesting idea was proposed to avoid
> > > bdrv_drain_all() during live migration. Mike Cui and Felipe
> > > Franciosi mentioned running at queue depth 1. It needs more thought
> > > to make it workable, but I want to capture it here for discussion
> > > and to archive it.
> > >
> > > bdrv_drain_all() is synchronous and can cause VM downtime if I/O
> > > requests hang. We should find a better way of quiescing I/O that is
> > > not synchronous. Up until now I thought we should simply add a
> > > timeout to bdrv_drain_all() so it can at least fail (and live
> > > migration would fail) if I/O is stuck, instead of hanging the VM.
> > > But the following approach is also interesting...
> > >
> > > During the iteration phase of live migration we could limit the
> > > queue depth so that points with no I/O requests in flight can be
> > > identified. At these points the migration algorithm has the
> > > opportunity to move to the next phase without requiring
> > > bdrv_drain_all(), since no requests are pending.
> >
> > I actually think that this "io quiesced state" is highly unlikely to
> > _just_ happen on a busy guest. The main idea behind running at QD1 is
> > to naturally throttle the guest and make it easier to "force quiesce"
> > the VQs.
> >
> > In other words, if the guest is busy and we run at QD1, I would
> > expect the rings to be quite full of pending (i.e. unprocessed)
> > requests. At the same time, I would expect a call to bdrv_drain_all()
> > (as part of do_vm_stop()) to complete much more quickly.
> >
> > Nevertheless, you mentioned that this is still problematic, as that
> > single outstanding IO could block, leaving the VM paused for longer.
> >
> > My suggestion is therefore that we leave the vCPUs running but stop
> > picking up requests from the VQs. Provided nothing blocks, you should
> > reach the "io quiesced state" fairly quickly. If you don't, then the
> > VM is at least still running (despite seeing no progress on its VQs).
> >
> > Thoughts on that?
>
> If the guest experiences a hung disk it may enter error recovery. QEMU
> should avoid this so the guest doesn't remount file systems read-only.
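
As a rough sketch of the "timeout on the drain" idea above, the loop
would look something like this; requests_in_flight() and
process_completions() are invented stand-ins for the block layer's
bookkeeping and aio_poll(), not QEMU functions:

/*
 * Rough sketch only: a drain that gives up after a deadline instead of
 * blocking forever.  The helpers below are invented for illustration.
 */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

static int pending = 3;                 /* pretend three requests are in flight */

static int requests_in_flight(void)
{
    return pending;
}

static void process_completions(void)   /* one completion per call, for the demo */
{
    if (pending > 0) {
        pending--;
    }
}

/* Returns true if everything drained before the deadline expired. */
static bool drain_all_with_timeout(double timeout_secs)
{
    time_t start = time(NULL);

    while (requests_in_flight() > 0) {
        if (difftime(time(NULL), start) > timeout_secs) {
            return false;               /* stuck I/O: fail the migration, not the VM */
        }
        process_completions();
    }
    return true;
}

int main(void)
{
    if (drain_all_with_timeout(30.0)) {
        printf("drained, safe to complete migration\n");
    } else {
        fprintf(stderr, "drain timed out, abort migration and keep the VM running\n");
    }
    return 0;
}

The point being that a stuck request fails the migration rather than
freezing the guest.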
>
> This can be solved by only quiescing the disk for, say, 30 seconds at a
> time. If we don't reach a point where live migration can proceed during
> those 30 seconds, then the disk will service requests again temporarily
> to avoid upsetting the guest.
>
> I wonder if Juan or David have any thoughts from the live migration
> perspective?

Throttling IO to reduce the time in the final drain makes sense to me;
however:

  a) It doesn't solve the problem if the IO device dies at just the
     wrong time, so you can still get that hang in bdrv_drain_all().
  b) Completely stopping guest IO sounds too drastic to me unless you
     can time it to be just at the point before the end of migration;
     that feels tricky to get right unless you can somehow tie it to an
     estimate of the remaining dirty RAM (and that never works that
     well).
  c) Something like a 30-second pause still feels too long; if that were
     a big hairy database workload, it would effectively be 30 seconds
     of downtime.

Dave

>
> Stefan

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
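
P.S. For concreteness, a toy model of the "quiesce for at most 30
seconds, then service requests again" scheme being discussed; all helper
names below are made up for illustration and are not QEMU's API:

/*
 * Toy model of the time-boxed quiesce idea: stop fetching new requests
 * from the virtqueues, wait for in-flight I/O to hit zero, and resume
 * servicing if that doesn't happen within the window, so the guest
 * never sees a disk hung long enough to trigger error recovery.
 */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define QUIESCE_WINDOW_SECS 30

static int outstanding = 5;                /* pretend five requests are in flight */

static void stop_fetching_requests(void)   { printf("virtqueues: stop fetching\n"); }
static void resume_fetching_requests(void) { printf("virtqueues: resume fetching\n"); }
static int  in_flight_requests(void)       { return outstanding; }
static void poll_completions(void)         { if (outstanding > 0) outstanding--; }

/*
 * Try to reach a point with no I/O in flight while the vCPUs keep
 * running.  Give up when the window expires so the guest is not upset
 * for too long.
 */
static bool try_reach_quiesced_state(void)
{
    time_t start = time(NULL);

    stop_fetching_requests();
    while (in_flight_requests() > 0) {
        if (difftime(time(NULL), start) > QUIESCE_WINDOW_SECS) {
            resume_fetching_requests();    /* avoid guest error recovery */
            return false;                  /* try again on a later iteration */
        }
        poll_completions();
    }
    return true;                           /* migration can move to the next phase */
}

int main(void)
{
    if (try_reach_quiesced_state()) {
        printf("quiesced: complete migration without a blocking bdrv_drain_all()\n");
    } else {
        printf("window expired: keep iterating\n");
    }
    return 0;
}

Whether a window as long as 30 seconds is acceptable is exactly the
concern raised in (c) above.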