* [Qemu-devel] Live migration without bdrv_drain_all()

From: Stefan Hajnoczi @ 2016-08-29 15:06 UTC
To: qemu-devel; +Cc: cui, felipe, Kevin Wolf, Paolo Bonzini

At KVM Forum an interesting idea was proposed to avoid bdrv_drain_all()
during live migration. Mike Cui and Felipe Franciosi mentioned running at
queue depth 1. It needs more thought to make it workable, but I want to
capture it here for discussion and to archive it.

bdrv_drain_all() is synchronous and can cause VM downtime if I/O requests
hang. We should find a better way of quiescing I/O that is not synchronous.
Up until now I thought we should simply add a timeout to bdrv_drain_all()
so it can at least fail (and live migration would fail) if I/O is stuck,
instead of hanging the VM. But the following approach is also
interesting...

During the iteration phase of live migration we could limit the queue depth
so that points with no I/O requests in flight are identified. At these
points the migration algorithm has the opportunity to move to the next
phase without requiring bdrv_drain_all(), since no requests are pending.

Unprocessed requests are left in the virtio-blk/virtio-scsi virtqueues so
that the destination QEMU can process them after migration completes.

Unfortunately this approach makes convergence harder because the VM might
also be dirtying memory pages during the iteration phase. Now we need to
reach a spot where no I/O is in flight *and* dirty memory is under the
threshold.

Thoughts?

Stefan
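A minimal sketch of the timeout idea above, written as standalone C rather
than QEMU code; timed_drain_all() and the two callbacks are hypothetical
stand-ins for whatever the block layer would actually expose:

    /*
     * Sketch only: drain with a deadline instead of blocking forever.
     * requests_pending() and poll_completions() are hypothetical callbacks,
     * not real QEMU block-layer functions.
     */
    #include <errno.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <time.h>

    static int64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    /* Returns 0 once no requests are in flight, -ETIMEDOUT otherwise. */
    static int timed_drain_all(int64_t timeout_ns,
                               bool (*requests_pending)(void),
                               void (*poll_completions)(void))
    {
        int64_t deadline = now_ns() + timeout_ns;

        while (requests_pending()) {
            if (now_ns() >= deadline) {
                return -ETIMEDOUT;  /* caller fails the migration */
            }
            poll_completions();     /* make progress on outstanding I/O */
        }
        return 0;
    }

On timeout the caller would fail the migration and leave the guest running,
instead of blocking with the VM paused.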
* Re: [Qemu-devel] Live migration without bdrv_drain_all()

From: Felipe Franciosi @ 2016-08-29 18:56 UTC
To: Stefan Hajnoczi; +Cc: qemu-devel, Mike Cui, Kevin Wolf, Paolo Bonzini

Heya!

> On 29 Aug 2016, at 08:06, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> [...]
>
> During the iteration phase of live migration we could limit the queue
> depth so points with no I/O requests in-flight are identified. At
> these points the migration algorithm has the opportunity to move to
> the next phase without requiring bdrv_drain_all() since no requests
> are pending.

I actually think that this "io quiesced state" is highly unlikely to _just_
happen on a busy guest. The main idea behind running at QD1 is to naturally
throttle the guest and make it easier to "force quiesce" the VQs.

In other words, if the guest is busy and we run at QD1, I would expect the
rings to be quite full of pending (i.e. unprocessed) requests. At the same
time, I would expect a call to bdrv_drain_all() (as part of do_vm_stop())
to complete much more quickly.

Nevertheless, you mentioned that this is still problematic, as that single
outstanding I/O could block, leaving the VM paused for longer.

My suggestion is therefore that we leave the vCPUs running but stop picking
up requests from the VQs. Provided nothing blocks, you should reach the
"io quiesced state" fairly quickly. If you don't, then the VM is at least
still running (despite seeing no progress on its VQs).

Thoughts on that?

Thanks for capturing the discussion and bringing it here,
Felipe
* Re: [Qemu-devel] Live migration without bdrv_drain_all()

From: Stefan Hajnoczi @ 2016-09-27 9:27 UTC
To: Felipe Franciosi; +Cc: qemu-devel, Mike Cui, Kevin Wolf, Paolo Bonzini,
    Juan Quintela, Dr. David Alan Gilbert

On Mon, Aug 29, 2016 at 06:56:42PM +0000, Felipe Franciosi wrote:
> [...]
>
> My suggestion is therefore that we leave the vCPUs running, but stop
> picking up requests from the VQs. Provided nothing blocks, you should
> reach the "io quiesced state" fairly quickly. If you don't, then the VM
> is at least still running (despite seeing no progress on its VQs).
>
> Thoughts on that?

If the guest experiences a hung disk it may enter error recovery. QEMU
should avoid this so the guest doesn't remount file systems read-only.

This can be solved by only quiescing the disk for, say, 30 seconds at a
time. If we don't reach a point where live migration can proceed during
those 30 seconds, then the disk will service requests again temporarily to
avoid upsetting the guest.

I wonder if Juan or David have any thoughts from the live migration
perspective?

Stefan
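A sketch of such a bounded quiesce window, with the same caveat: all of the
helpers passed in here (stop_dequeuing(), resume_dequeuing(),
inflight_requests(), poll_completions()) are assumed stand-ins for the real
virtqueue and block-layer hooks, not existing QEMU functions:

    /*
     * Sketch: quiesce for at most window_ns, then hand the disk back so
     * the guest never sees what looks like a hung device. All callbacks
     * are hypothetical.
     */
    #include <stdbool.h>
    #include <stdint.h>

    bool try_quiesce_window(int64_t window_ns,
                            int64_t (*now_ns)(void),
                            void (*stop_dequeuing)(void),
                            void (*resume_dequeuing)(void),
                            unsigned (*inflight_requests)(void),
                            void (*poll_completions)(void))
    {
        int64_t deadline = now_ns() + window_ns;

        stop_dequeuing();            /* vCPUs keep running; VQs fill up */
        while (inflight_requests() > 0) {
            if (now_ns() >= deadline) {
                resume_dequeuing();  /* window expired: serve the guest again */
                return false;        /* migration retries on a later pass */
            }
            poll_completions();
        }
        return true;                 /* quiesced: safe to switch phases */
    }

The migration code would call this whenever it thinks it is close to
convergence; a false return simply means "try again on a later iteration".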
* Re: [Qemu-devel] Live migration without bdrv_drain_all()

From: Daniel P. Berrange @ 2016-09-27 9:51 UTC
To: Stefan Hajnoczi; +Cc: Felipe Franciosi, Mike Cui, Kevin Wolf,
    Juan Quintela, qemu-devel, Dr. David Alan Gilbert, Paolo Bonzini

On Tue, Sep 27, 2016 at 10:27:12AM +0100, Stefan Hajnoczi wrote:
> [...]
>
> If the guest experiences a hung disk it may enter error recovery. QEMU
> should avoid this so the guest doesn't remount file systems read-only.
>
> This can be solved by only quiescing the disk for, say, 30 seconds at a
> time. If we don't reach a point where live migration can proceed during
> those 30 seconds then the disk will service requests again temporarily
> to avoid upsetting the guest.

What is the actual trigger for guest error recovery? If you have a
situation where bdrv_drain_all() could hang, then surely even if you start
processing requests again after 30 seconds, you might not actually be able
to complete those requests for a long time, because the drain still has
outstanding work blocking the new requests you just accepted from the
guest?

Regards,
Daniel
* Re: [Qemu-devel] Live migration without bdrv_drain_all()

From: Dr. David Alan Gilbert @ 2016-09-27 9:54 UTC
To: Stefan Hajnoczi; +Cc: Felipe Franciosi, qemu-devel, Mike Cui,
    Kevin Wolf, Paolo Bonzini, Juan Quintela

* Stefan Hajnoczi (stefanha@gmail.com) wrote:
> [...]
>
> This can be solved by only quiescing the disk for, say, 30 seconds at a
> time. If we don't reach a point where live migration can proceed during
> those 30 seconds then the disk will service requests again temporarily
> to avoid upsetting the guest.
>
> I wonder if Juan or David have any thoughts from the live migration
> perspective?

Throttling I/O to reduce the time in the final drain makes sense to me,
however:

a) It doesn't solve the problem if the I/O device dies at just the wrong
   time, so you can still get that hang in bdrv_drain_all().

b) Completely stopping guest I/O sounds too drastic to me unless you can
   time it to be just at the point before the end of migration; that feels
   tricky to get right unless you can somehow tie it to an estimate of
   remaining dirty RAM (and that never works that well).

c) Something like a 30-second pause still feels too long; if that was a
   big hairy database workload it would effectively be 30 seconds of
   downtime.

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* Re: [Qemu-devel] Live migration without bdrv_drain_all()

From: Juan Quintela @ 2016-09-28 9:03 UTC
To: Dr. David Alan Gilbert; +Cc: Stefan Hajnoczi, Felipe Franciosi,
    qemu-devel, Mike Cui, Kevin Wolf, Paolo Bonzini

"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> [...]
>
> Throttling IO to reduce the time in the final drain makes sense to me,
> however:
>
> a) It doesn't solve the problem if the IO device dies at just the wrong
>    time, so you can still get that hang in bdrv_drain_all
>
> b) Completely stopping guest IO sounds too drastic to me unless you can
>    time it to be just at the point before the end of migration; that
>    feels tricky to get right unless you can somehow tie it to an
>    estimate of remaining dirty RAM (that never works that well).
>
> c) Something like a 30 second pause still feels too long; if that was
>    a big hairy database workload it would effectively be 30 seconds
>    of downtime.

I think something like the proposed thing could work.

We can put queue depth = 1 or somesuch when we know we are near completion
of migration. What we need then is a way to call the equivalent of
bdrv_drain_all() that returns EAGAIN or EBUSY if it is a bad moment. In
that case, we just do another round over the whole memory, or retry in X
seconds. Anything is good for us; we just need a way to ask for the
operation without it blocking.

Notice that migration is the equivalent of:

    while (true) {
        write_some_dirty_pages();
        if (dirty_pages < threshold) {
            break;
        }
    }
    bdrv_drain_all();
    write_rest_of_dirty_pages();

(Lots and lots of details omitted.)

What we really want is to issue the equivalent of the bdrv_drain_all()
call inside the while loop, so if there is any problem we just do another
cycle, no problem.

Later, Juan.
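A sketch of how such a non-blocking drain could sit inside that loop;
bdrv_try_drain_all() does not exist in QEMU and is shown here, along with
the other helpers, purely as an assumed illustration of the control flow:

    /*
     * Sketch only: a hypothetical non-blocking drain inside the migration
     * iteration loop. None of these functions are real QEMU APIs; they
     * mirror the pseudocode above.
     */

    /* Returns 0 if no block I/O is in flight, -EAGAIN/-EBUSY otherwise. */
    int bdrv_try_drain_all(void);
    void write_some_dirty_pages(void);
    void write_rest_of_dirty_pages(void);
    unsigned long dirty_page_count(void);

    void migration_iterate(unsigned long threshold)
    {
        for (;;) {
            write_some_dirty_pages();
            if (dirty_page_count() >= threshold) {
                continue;            /* RAM not converged yet */
            }
            /* RAM converged: switch phases only if I/O is already quiet. */
            if (bdrv_try_drain_all() == 0) {
                break;               /* nothing pending, no blocking drain */
            }
            /* Bad moment for the block layer: do another RAM pass instead. */
        }
        write_rest_of_dirty_pages();
    }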
* Re: [Qemu-devel] Live migration without bdrv_drain_all()

From: Felipe Franciosi @ 2016-09-28 10:00 UTC
To: quintela@redhat.com, Dr. David Alan Gilbert, Stefan Hajnoczi,
    Daniel P. Berrange; Cc: qemu-devel, Mike Cui, Kevin Wolf, Paolo Bonzini

> On 28 Sep 2016, at 10:03, Juan Quintela <quintela@redhat.com> wrote:
>
> [...]
>
> Notice that migration is the equivalent of:
>
>     while (true) {
>         write_some_dirty_pages();
>         if (dirty_pages < threshold) {
>             break;
>         }
>     }
>     bdrv_drain_all();
>     write_rest_of_dirty_pages();
>
> (Lots and lots of details omitted.)
>
> What we really want is to issue the call of bdrv_drain_all() equivalent
> inside the while, so, if there is any problem, we just do another cycle,
> no problem.

Hi,

Actually, the way I perceive the problem is that QEMU is doing a vm_stop()
*after* the "break;" in the pseudocode above (but *before* the drain). That
means the VM could be stopped for a long time while you're doing
bdrv_drain_all().

I don't see a magic solution for this. All we can do is try to find a way
of doing this that improves the VM experience during the migration. It's
easy to argue that it's better to see your storage performance go down for
a short period of time than to see your CPUs not running for a long period
of time. After all, there's a reason "cpu downtime" is an actual hypervisor
metric.

What I'd propose is a simple improvement like this:

    while (true) {
        write_some_dirty_pages();
        if (dirty_pages < threshold_very_low) {
            break;
        } else if (dirty_pages < threshold_low) {
            bdrv_stop_picking_new_reqs();
        } else if (dirty_pages < threshold_med) {
            bdrv_run_at_qd1();
        }
    }
    vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
    bdrv_drain_all();
    write_rest_of_dirty_pages();

The idea is simple:

* When we're somewhere near, we pick only one request at a time.
* When we're really close, we stop picking up new requests. That still
  allows the block drivers to complete whatever is outstanding.
* When we're really, really close, we can break. At this point, we're very
  likely drained already.

Knowing that most OSs use 30s by default as a "this request is not
completing anymore" kind of timeout, we can even improve the above to
resume the block drivers (or abort the migration) if the time between
reaching "threshold_low" and "threshold_very_low" exceeds, say, 15s. That
can be combined with actually waiting for everything to complete before
stopping the CPUs.

A more complete version would look like this:

    while (true) {
        write_some_dirty_pages();
        if (dirty_pages < threshold_very_low) {
            if (bdrv_all_is_drained()) {
                break;
            } else if (bdrv_is_stopped() && (now() - ts_bdrv_stopped > 15s)) {
                bdrv_run_at_qd1();
                // or abort the migration and resume normally,
                // perhaps after a few retries
            }
        }
        if (dirty_pages < threshold_low) {
            bdrv_stop_picking_new_reqs();
            ts_bdrv_stopped = now();
        } else if (dirty_pages < threshold_med) {
            bdrv_run_at_qd1();
        }
    }
    vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
    bdrv_drain_all();
    write_rest_of_dirty_pages();

Note that this version (somewhat) copes with (dirty_pages <
threshold_very_low) being reached before we actually observed a
(dirty_pages < threshold_low).

There's still a race where requests are fired after bdrv_all_is_drained()
and before vm_stop_force_state(). But that can be easily addressed.

Thoughts?

Thanks,
Felipe
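One way to read the threshold logic above as a self-contained policy
function; the state names, fields, and the 15-second cap are illustrative
assumptions, not anything QEMU implements:

    /*
     * Sketch of the threshold-driven throttle policy described above.
     * All names and thresholds are illustrative; none of them exist in
     * QEMU.
     */
    #include <stdint.h>

    enum throttle_state {
        THROTTLE_NONE,     /* far from convergence: normal queue depth */
        THROTTLE_QD1,      /* somewhere near: one request at a time */
        THROTTLE_STOPPED,  /* really close: stop picking up new requests */
    };

    struct throttle_policy {
        uint64_t threshold_med;   /* dirty pages below this: run at QD1 */
        uint64_t threshold_low;   /* dirty pages below this: stop dequeuing */
        int64_t  max_stopped_ns;  /* e.g. 15s: give the disk back after this */
    };

    enum throttle_state throttle_decide(const struct throttle_policy *p,
                                        uint64_t dirty_pages,
                                        int64_t stopped_for_ns)
    {
        if (dirty_pages < p->threshold_low) {
            /* Fall back to QD1 if we've been stopped too long, so the
             * guest never sees a disk that appears hung. */
            return stopped_for_ns > p->max_stopped_ns ? THROTTLE_QD1
                                                      : THROTTLE_STOPPED;
        }
        if (dirty_pages < p->threshold_med) {
            return THROTTLE_QD1;
        }
        return THROTTLE_NONE;
    }

The migration loop would call throttle_decide() once per iteration and
apply the returned cap to the virtqueue handlers.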
* Re: [Qemu-devel] Live migration without bdrv_drain_all()

From: Daniel P. Berrange @ 2016-09-28 10:23 UTC
To: Juan Quintela; +Cc: Dr. David Alan Gilbert, Mike Cui, Kevin Wolf,
    Stefan Hajnoczi, qemu-devel, Felipe Franciosi, Paolo Bonzini

On Wed, Sep 28, 2016 at 11:03:15AM +0200, Juan Quintela wrote:
> [...]
>
> What we really want is to issue the call of bdrv_drain_all() equivalent
> inside the while, so, if there is any problem, we just do another cycle,
> no problem.

It seems that the main downside of this is that it makes normal pre-copy
live migration even less likely to complete successfully than it already
is. This increases the likelihood of needing to use post-copy live
migration, which has the same bdrv_drain_all() problem. This is hard to
solve because QEMU isn't in charge of when post-copy starts, so it can't
simply wait for a convenient moment to switch to post-copy if drain_all
is busy.

Regards,
Daniel
* Re: [Qemu-devel] Live migration without bdrv_drain_all()

From: Daniel P. Berrange @ 2016-09-27 9:48 UTC
To: Stefan Hajnoczi; +Cc: qemu-devel, cui, Kevin Wolf, Paolo Bonzini, felipe

On Mon, Aug 29, 2016 at 11:06:48AM -0400, Stefan Hajnoczi wrote:
> [...]
>
> bdrv_drain_all() is synchronous and can cause VM downtime if I/O
> requests hang. We should find a better way of quiescing I/O that is
> not synchronous. Up until now I thought we should simply add a
> timeout to bdrv_drain_all() so it can at least fail (and live
> migration would fail) if I/O is stuck instead of hanging the VM. But
> the following approach is also interesting...

How would you decide what an acceptable timeout is for the drain
operation? At what point does a stuck drain op cause the VM to stall? The
drain call happens from the migration thread, so it shouldn't impact vcpu
threads or the main event loop thread if it takes too long.

> During the iteration phase of live migration we could limit the queue
> depth so points with no I/O requests in-flight are identified. At
> these points the migration algorithm has the opportunity to move to
> the next phase without requiring bdrv_drain_all() since no requests
> are pending.
>
> [...]
>
> Unfortunately this approach makes convergence harder because the VM
> might also be dirtying memory pages during the iteration phase. Now
> we need to reach a spot where no I/O is in-flight *and* dirty memory
> is under the threshold.

It doesn't seem like this could easily fit in with post-copy. During the
switchover from pre-copy to post-copy, migration calls
vm_stop_force_state(), which will trigger bdrv_drain_all(). The point at
which you switch from pre- to post-copy mode is not controlled by QEMU;
instead it is an explicit admin action triggered via a QMP command. Now,
the actual switchover is not synchronous with completion of the QMP
command, so there is small scope for delaying it to a convenient time, but
not by a very significant amount, and certainly not anywhere near 30
seconds. Perhaps 1 second at the most.

Regards,
Daniel
* Re: [Qemu-devel] Live migration without bdrv_drain_all()

From: Stefan Hajnoczi @ 2016-10-12 13:09 UTC
To: Daniel P. Berrange; +Cc: qemu-devel, cui, Kevin Wolf, Paolo Bonzini, felipe

On Tue, Sep 27, 2016 at 10:48:48AM +0100, Daniel P. Berrange wrote:
> On Mon, Aug 29, 2016 at 11:06:48AM -0400, Stefan Hajnoczi wrote:
> > [...]
> >
> > bdrv_drain_all() is synchronous and can cause VM downtime if I/O
> > requests hang. We should find a better way of quiescing I/O that is
> > not synchronous. Up until now I thought we should simply add a
> > timeout to bdrv_drain_all() so it can at least fail (and live
> > migration would fail) if I/O is stuck instead of hanging the VM. But
> > the following approach is also interesting...
>
> How would you decide what an acceptable timeout is for the drain
> operation?

Same as most timeouts: an arbitrary number :(.

> At what point does a stuck drain op cause the VM to stall?

The drain call has acquired the QEMU global mutex. Any vmexit that
requires taking the QEMU global mutex will hang that thread (i.e. the
vcpu thread).

Stefan