* [Qemu-devel] Live migration without bdrv_drain_all()

From: Stefan Hajnoczi @ 2016-08-29 15:06 UTC
To: qemu-devel; +Cc: cui, felipe, Kevin Wolf, Paolo Bonzini

At KVM Forum an interesting idea was proposed to avoid bdrv_drain_all()
during live migration. Mike Cui and Felipe Franciosi mentioned running at
queue depth 1. It needs more thought to make it workable, but I want to
capture it here for discussion and to archive it.

bdrv_drain_all() is synchronous and can cause VM downtime if I/O requests
hang. We should find a better way of quiescing I/O that is not synchronous.
Up until now I thought we should simply add a timeout to bdrv_drain_all()
so it can at least fail (and live migration would fail) if I/O is stuck,
instead of hanging the VM. But the following approach is also
interesting...

During the iteration phase of live migration we could limit the queue depth
so that points with no I/O requests in flight are identified. At these
points the migration algorithm has the opportunity to move to the next
phase without requiring bdrv_drain_all(), since no requests are pending.

Unprocessed requests are left in the virtio-blk/virtio-scsi virtqueues so
that the destination QEMU can process them after migration completes.

Unfortunately this approach makes convergence harder because the VM might
also be dirtying memory pages during the iteration phase. Now we need to
reach a spot where no I/O is in flight *and* dirty memory is under the
threshold.

Thoughts?

Stefan
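A minimal sketch of the timeout idea above, written as standalone C rather
than QEMU code; timed_drain_all() and the two callbacks are hypothetical
stand-ins for whatever the block layer would actually expose:

    /*
     * Sketch only: drain with a deadline instead of blocking forever.
     * requests_pending() and poll_completions() are hypothetical callbacks,
     * not real QEMU block-layer functions.
     */
    #include <errno.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <time.h>

    static int64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    /* Returns 0 once no requests are in flight, -ETIMEDOUT otherwise. */
    static int timed_drain_all(int64_t timeout_ns,
                               bool (*requests_pending)(void),
                               void (*poll_completions)(void))
    {
        int64_t deadline = now_ns() + timeout_ns;

        while (requests_pending()) {
            if (now_ns() >= deadline) {
                return -ETIMEDOUT;  /* caller fails the migration */
            }
            poll_completions();     /* make progress on outstanding I/O */
        }
        return 0;
    }

On timeout the caller would fail the migration and leave the guest running,
instead of blocking with the VM paused.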
* Re: [Qemu-devel] Live migration without bdrv_drain_all()

From: Felipe Franciosi @ 2016-08-29 18:56 UTC
To: Stefan Hajnoczi; +Cc: qemu-devel, Mike Cui, Kevin Wolf, Paolo Bonzini

Heya!

> On 29 Aug 2016, at 08:06, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> [...]
>
> During the iteration phase of live migration we could limit the queue
> depth so points with no I/O requests in-flight are identified. At
> these points the migration algorithm has the opportunity to move to
> the next phase without requiring bdrv_drain_all() since no requests
> are pending.

I actually think that this "io quiesced state" is highly unlikely to _just_
happen on a busy guest. The main idea behind running at QD1 is to naturally
throttle the guest and make it easier to "force quiesce" the VQs.

In other words, if the guest is busy and we run at QD1, I would expect the
rings to be quite full of pending (i.e. unprocessed) requests. At the same
time, I would expect a call to bdrv_drain_all() (as part of do_vm_stop())
to complete much more quickly.

Nevertheless, you mentioned that this is still problematic, as that single
outstanding I/O could block, leaving the VM paused for longer.

My suggestion is therefore that we leave the vCPUs running but stop picking
up requests from the VQs. Provided nothing blocks, you should reach the
"io quiesced state" fairly quickly. If you don't, then the VM is at least
still running (despite seeing no progress on its VQs).

Thoughts on that?

Thanks for capturing the discussion and bringing it here,
Felipe
* Re: [Qemu-devel] Live migration without bdrv_drain_all()

From: Stefan Hajnoczi @ 2016-09-27 9:27 UTC
To: Felipe Franciosi; +Cc: qemu-devel, Mike Cui, Kevin Wolf, Paolo Bonzini,
    Juan Quintela, Dr. David Alan Gilbert

On Mon, Aug 29, 2016 at 06:56:42PM +0000, Felipe Franciosi wrote:
> [...]
>
> My suggestion is therefore that we leave the vCPUs running, but stop
> picking up requests from the VQs. Provided nothing blocks, you should
> reach the "io quiesced state" fairly quickly. If you don't, then the VM
> is at least still running (despite seeing no progress on its VQs).
>
> Thoughts on that?

If the guest experiences a hung disk it may enter error recovery. QEMU
should avoid this so the guest doesn't remount file systems read-only.

This can be solved by only quiescing the disk for, say, 30 seconds at a
time. If we don't reach a point where live migration can proceed during
those 30 seconds, then the disk will service requests again temporarily to
avoid upsetting the guest.

I wonder if Juan or David have any thoughts from the live migration
perspective?

Stefan
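A sketch of such a bounded quiesce window, with the same caveat: all of the
helpers passed in here (stop_dequeuing(), resume_dequeuing(),
inflight_requests(), poll_completions()) are assumed stand-ins for the real
virtqueue and block-layer hooks, not existing QEMU functions:

    /*
     * Sketch: quiesce for at most window_ns, then hand the disk back so
     * the guest never sees what looks like a hung device. All callbacks
     * are hypothetical.
     */
    #include <stdbool.h>
    #include <stdint.h>

    bool try_quiesce_window(int64_t window_ns,
                            int64_t (*now_ns)(void),
                            void (*stop_dequeuing)(void),
                            void (*resume_dequeuing)(void),
                            unsigned (*inflight_requests)(void),
                            void (*poll_completions)(void))
    {
        int64_t deadline = now_ns() + window_ns;

        stop_dequeuing();            /* vCPUs keep running; VQs fill up */
        while (inflight_requests() > 0) {
            if (now_ns() >= deadline) {
                resume_dequeuing();  /* window expired: serve the guest again */
                return false;        /* migration retries on a later pass */
            }
            poll_completions();
        }
        return true;                 /* quiesced: safe to switch phases */
    }

The migration code would call this whenever it thinks it is close to
convergence; a false return simply means "try again on a later iteration".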
* Re: [Qemu-devel] Live migration without bdrv_drain_all()

From: Daniel P. Berrange @ 2016-09-27 9:51 UTC
To: Stefan Hajnoczi; +Cc: Felipe Franciosi, Mike Cui, Kevin Wolf,
    Juan Quintela, qemu-devel, Dr. David Alan Gilbert, Paolo Bonzini

On Tue, Sep 27, 2016 at 10:27:12AM +0100, Stefan Hajnoczi wrote:
> [...]
>
> If the guest experiences a hung disk it may enter error recovery. QEMU
> should avoid this so the guest doesn't remount file systems read-only.
>
> This can be solved by only quiescing the disk for, say, 30 seconds at a
> time. If we don't reach a point where live migration can proceed during
> those 30 seconds then the disk will service requests again temporarily
> to avoid upsetting the guest.

What is the actual trigger for guest error recovery? If you have a
situation where bdrv_drain_all() could hang, then surely even if you start
processing requests again after 30 seconds, you might not actually be able
to complete those requests for a long time, because the drain still has
outstanding work blocking the new requests you just accepted from the
guest?

Regards,
Daniel
* Re: [Qemu-devel] Live migration without bdrv_drain_all()

From: Dr. David Alan Gilbert @ 2016-09-27 9:54 UTC
To: Stefan Hajnoczi; +Cc: Felipe Franciosi, qemu-devel, Mike Cui,
    Kevin Wolf, Paolo Bonzini, Juan Quintela

* Stefan Hajnoczi (stefanha@gmail.com) wrote:
> [...]
>
> This can be solved by only quiescing the disk for, say, 30 seconds at a
> time. If we don't reach a point where live migration can proceed during
> those 30 seconds then the disk will service requests again temporarily
> to avoid upsetting the guest.
>
> I wonder if Juan or David have any thoughts from the live migration
> perspective?

Throttling I/O to reduce the time in the final drain makes sense to me,
however:

a) It doesn't solve the problem if the I/O device dies at just the wrong
   time, so you can still get that hang in bdrv_drain_all().

b) Completely stopping guest I/O sounds too drastic to me unless you can
   time it to be just at the point before the end of migration; that feels
   tricky to get right unless you can somehow tie it to an estimate of
   remaining dirty RAM (and that never works that well).

c) Something like a 30-second pause still feels too long; if that was a
   big hairy database workload it would effectively be 30 seconds of
   downtime.

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* Re: [Qemu-devel] Live migration without bdrv_drain_all()

From: Juan Quintela @ 2016-09-28 9:03 UTC
To: Dr. David Alan Gilbert; +Cc: Stefan Hajnoczi, Felipe Franciosi,
    qemu-devel, Mike Cui, Kevin Wolf, Paolo Bonzini

"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> [...]
>
> Throttling IO to reduce the time in the final drain makes sense to me,
> however:
>
> a) It doesn't solve the problem if the IO device dies at just the wrong
>    time, so you can still get that hang in bdrv_drain_all
>
> b) Completely stopping guest IO sounds too drastic to me unless you can
>    time it to be just at the point before the end of migration; that
>    feels tricky to get right unless you can somehow tie it to an
>    estimate of remaining dirty RAM (that never works that well).
>
> c) Something like a 30 second pause still feels too long; if that was
>    a big hairy database workload it would effectively be 30 seconds
>    of downtime.

I think something like the proposed thing could work.

We can put queue depth = 1 or somesuch when we know we are near completion
of migration. What we need then is a way to call the equivalent of
bdrv_drain_all() that returns EAGAIN or EBUSY if it is a bad moment. In
that case, we just do another round over the whole memory, or retry in X
seconds. Anything is good for us; we just need a way to ask for the
operation without it blocking.

Notice that migration is the equivalent of:

    while (true) {
        write_some_dirty_pages();
        if (dirty_pages < threshold) {
            break;
        }
    }
    bdrv_drain_all();
    write_rest_of_dirty_pages();

(Lots and lots of details omitted.)

What we really want is to issue the equivalent of the bdrv_drain_all()
call inside the while loop, so if there is any problem we just do another
cycle, no problem.

Later, Juan.
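A sketch of how such a non-blocking drain could sit inside that loop;
bdrv_try_drain_all() does not exist in QEMU and is shown here, along with
the other helpers, purely as an assumed illustration of the control flow:

    /*
     * Sketch only: a hypothetical non-blocking drain inside the migration
     * iteration loop. None of these functions are real QEMU APIs; they
     * mirror the pseudocode above.
     */

    /* Returns 0 if no block I/O is in flight, -EAGAIN/-EBUSY otherwise. */
    int bdrv_try_drain_all(void);
    void write_some_dirty_pages(void);
    void write_rest_of_dirty_pages(void);
    unsigned long dirty_page_count(void);

    void migration_iterate(unsigned long threshold)
    {
        for (;;) {
            write_some_dirty_pages();
            if (dirty_page_count() >= threshold) {
                continue;            /* RAM not converged yet */
            }
            /* RAM converged: switch phases only if I/O is already quiet. */
            if (bdrv_try_drain_all() == 0) {
                break;               /* nothing pending, no blocking drain */
            }
            /* Bad moment for the block layer: do another RAM pass instead. */
        }
        write_rest_of_dirty_pages();
    }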
* Re: [Qemu-devel] Live migration without bdrv_drain_all()

From: Felipe Franciosi @ 2016-09-28 10:00 UTC
To: quintela@redhat.com, Dr. David Alan Gilbert, Stefan Hajnoczi,
    Daniel P. Berrange; Cc: qemu-devel, Mike Cui, Kevin Wolf, Paolo Bonzini

> On 28 Sep 2016, at 10:03, Juan Quintela <quintela@redhat.com> wrote:
>
> [...]
>
> Notice that migration is the equivalent of:
>
>     while (true) {
>         write_some_dirty_pages();
>         if (dirty_pages < threshold) {
>             break;
>         }
>     }
>     bdrv_drain_all();
>     write_rest_of_dirty_pages();
>
> (Lots and lots of details omitted.)
>
> What we really want is to issue the call of bdrv_drain_all() equivalent
> inside the while, so, if there is any problem, we just do another cycle,
> no problem.

Hi,

Actually, the way I perceive the problem is that QEMU is doing a vm_stop()
*after* the "break;" in the pseudocode above (but *before* the drain). That
means the VM could be stopped for a long time while you're doing
bdrv_drain_all().

I don't see a magic solution for this. All we can do is try to find a way
of doing this that improves the VM experience during the migration. It's
easy to argue that it's better to see your storage performance go down for
a short period of time than to see your CPUs not running for a long period
of time. After all, there's a reason "cpu downtime" is an actual hypervisor
metric.

What I'd propose is a simple improvement like this:

    while (true) {
        write_some_dirty_pages();
        if (dirty_pages < threshold_very_low) {
            break;
        } else if (dirty_pages < threshold_low) {
            bdrv_stop_picking_new_reqs();
        } else if (dirty_pages < threshold_med) {
            bdrv_run_at_qd1();
        }
    }
    vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
    bdrv_drain_all();
    write_rest_of_dirty_pages();

The idea is simple:

* When we're somewhere near, we pick only one request at a time.
* When we're really close, we stop picking up new requests. That still
  allows the block drivers to complete whatever is outstanding.
* When we're really, really close, we can break. At this point, we're very
  likely drained already.

Knowing that most OSs use 30s by default as a "this request is not
completing anymore" kind of timeout, we can even improve the above to
resume the block drivers (or abort the migration) if the time between
reaching "threshold_low" and "threshold_very_low" exceeds, say, 15s. That
can be combined with actually waiting for everything to complete before
stopping the CPUs.

A more complete version would look like this:

    while (true) {
        write_some_dirty_pages();
        if (dirty_pages < threshold_very_low) {
            if (bdrv_all_is_drained()) {
                break;
            } else if (bdrv_is_stopped() && (now() - ts_bdrv_stopped > 15s)) {
                bdrv_run_at_qd1();
                // or abort the migration and resume normally,
                // perhaps after a few retries
            }
        }
        if (dirty_pages < threshold_low) {
            bdrv_stop_picking_new_reqs();
            ts_bdrv_stopped = now();
        } else if (dirty_pages < threshold_med) {
            bdrv_run_at_qd1();
        }
    }
    vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
    bdrv_drain_all();
    write_rest_of_dirty_pages();

Note that this version (somewhat) copes with (dirty_pages <
threshold_very_low) being reached before we actually observed a
(dirty_pages < threshold_low).

There's still a race where requests are fired after bdrv_all_is_drained()
and before vm_stop_force_state(). But that can be easily addressed.

Thoughts?

Thanks,
Felipe
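One way to read the threshold logic above as a self-contained policy
function; the state names, fields, and the 15-second cap are illustrative
assumptions, not anything QEMU implements:

    /*
     * Sketch of the threshold-driven throttle policy described above.
     * All names and thresholds are illustrative; none of them exist in
     * QEMU.
     */
    #include <stdint.h>

    enum throttle_state {
        THROTTLE_NONE,     /* far from convergence: normal queue depth */
        THROTTLE_QD1,      /* somewhere near: one request at a time */
        THROTTLE_STOPPED,  /* really close: stop picking up new requests */
    };

    struct throttle_policy {
        uint64_t threshold_med;   /* dirty pages below this: run at QD1 */
        uint64_t threshold_low;   /* dirty pages below this: stop dequeuing */
        int64_t  max_stopped_ns;  /* e.g. 15s: give the disk back after this */
    };

    enum throttle_state throttle_decide(const struct throttle_policy *p,
                                        uint64_t dirty_pages,
                                        int64_t stopped_for_ns)
    {
        if (dirty_pages < p->threshold_low) {
            /* Fall back to QD1 if we've been stopped too long, so the
             * guest never sees a disk that appears hung. */
            return stopped_for_ns > p->max_stopped_ns ? THROTTLE_QD1
                                                      : THROTTLE_STOPPED;
        }
        if (dirty_pages < p->threshold_med) {
            return THROTTLE_QD1;
        }
        return THROTTLE_NONE;
    }

The migration loop would call throttle_decide() once per iteration and
apply the returned cap to the virtqueue handlers.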
* Re: [Qemu-devel] Live migration without bdrv_drain_all()

From: Daniel P. Berrange @ 2016-09-28 10:23 UTC
To: Juan Quintela; +Cc: Dr. David Alan Gilbert, Mike Cui, Kevin Wolf,
    Stefan Hajnoczi, qemu-devel, Felipe Franciosi, Paolo Bonzini

On Wed, Sep 28, 2016 at 11:03:15AM +0200, Juan Quintela wrote:
> [...]
>
> What we really want is to issue the call of bdrv_drain_all() equivalent
> inside the while, so, if there is any problem, we just do another cycle,
> no problem.

It seems that the main downside of this is that it makes normal pre-copy
live migration even less likely to complete successfully than it already
is. This increases the likelihood of needing to use post-copy live
migration, which has the same bdrv_drain_all() problem. This is hard to
solve because QEMU isn't in charge of when post-copy starts, so it can't
simply wait for a convenient moment to switch to post-copy if drain_all
is busy.

Regards,
Daniel
* Re: [Qemu-devel] Live migration without bdrv_drain_all()

From: Daniel P. Berrange @ 2016-09-27 9:48 UTC
To: Stefan Hajnoczi; +Cc: qemu-devel, cui, Kevin Wolf, Paolo Bonzini, felipe

On Mon, Aug 29, 2016 at 11:06:48AM -0400, Stefan Hajnoczi wrote:
> [...]
>
> bdrv_drain_all() is synchronous and can cause VM downtime if I/O
> requests hang. We should find a better way of quiescing I/O that is
> not synchronous. Up until now I thought we should simply add a
> timeout to bdrv_drain_all() so it can at least fail (and live
> migration would fail) if I/O is stuck instead of hanging the VM. But
> the following approach is also interesting...

How would you decide what an acceptable timeout is for the drain
operation? At what point does a stuck drain op cause the VM to stall? The
drain call happens from the migration thread, so it shouldn't impact vcpu
threads or the main event loop thread if it takes too long.

> During the iteration phase of live migration we could limit the queue
> depth so points with no I/O requests in-flight are identified. At
> these points the migration algorithm has the opportunity to move to
> the next phase without requiring bdrv_drain_all() since no requests
> are pending.
>
> [...]
>
> Unfortunately this approach makes convergence harder because the VM
> might also be dirtying memory pages during the iteration phase. Now
> we need to reach a spot where no I/O is in-flight *and* dirty memory
> is under the threshold.

It doesn't seem like this could easily fit in with post-copy. During the
switchover from pre-copy to post-copy, migration calls
vm_stop_force_state(), which will trigger bdrv_drain_all(). The point at
which you switch from pre- to post-copy mode is not controlled by QEMU;
instead it is an explicit admin action triggered via a QMP command. Now,
the actual switchover is not synchronous with completion of the QMP
command, so there is small scope for delaying it to a convenient time, but
not by a very significant amount, and certainly not anywhere near 30
seconds. Perhaps 1 second at the most.

Regards,
Daniel
* Re: [Qemu-devel] Live migration without bdrv_drain_all()

From: Stefan Hajnoczi @ 2016-10-12 13:09 UTC
To: Daniel P. Berrange; +Cc: qemu-devel, cui, Kevin Wolf, Paolo Bonzini, felipe

On Tue, Sep 27, 2016 at 10:48:48AM +0100, Daniel P. Berrange wrote:
> On Mon, Aug 29, 2016 at 11:06:48AM -0400, Stefan Hajnoczi wrote:
> > [...]
> >
> > bdrv_drain_all() is synchronous and can cause VM downtime if I/O
> > requests hang. We should find a better way of quiescing I/O that is
> > not synchronous. Up until now I thought we should simply add a
> > timeout to bdrv_drain_all() so it can at least fail (and live
> > migration would fail) if I/O is stuck instead of hanging the VM. But
> > the following approach is also interesting...
>
> How would you decide what an acceptable timeout is for the drain
> operation?

Same as most timeouts: an arbitrary number :(.

> At what point does a stuck drain op cause the VM to stall?

The drain call has acquired the QEMU global mutex. Any vmexit that
requires taking the QEMU global mutex will hang that thread (i.e. the
vcpu thread).

Stefan