* [Qemu-devel] [PATCH] migration: regain control of images when migration fails to complete
@ 2016-05-18 10:18 Greg Kurz
2016-05-18 11:13 ` Dr. David Alan Gilbert
0 siblings, 1 reply; 4+ messages in thread
From: Greg Kurz @ 2016-05-18 10:18 UTC (permalink / raw)
To: Amit Shah, Juan Quintela; +Cc: qemu-stable, Dr. David Alan Gilbert, qemu-devel
We currently have an error path during migration that can cause
the source QEMU to abort:
migration_thread()
migration_completion()
runstate_is_running() ----------------> true if guest is running
bdrv_inactivate_all() ----------------> inactivate images
qemu_savevm_state_complete_precopy()
... qemu_fflush()
socket_writev_buffer() --------> error because destination fails
qemu_fflush() -------------------> set error on migration stream
migration_completion() -----------------> set migrate state to FAILED
migration_thread() -----------------------> break migration loop
vm_start() -----------------------------> restart guest with inactive
images
and you get:
qemu-system-ppc64: socket_writev_buffer: Got err=104 for (32768/18446744073709551615)
qemu-system-ppc64: /home/greg/Work/qemu/qemu-master/block/io.c:1342:bdrv_co_do_pwritev: Assertion `!(bs->open_flags & 0x0800)' failed.
Aborted (core dumped)
If we try postcopy with a similar scenario, we also get the writev error
message but QEMU leaves the guest paused because entered_postcopy is true.
We could possibly do the same with precopy and leave the guest paused.
But since the historical default for migration errors is to restart the
source, this patch adds a call to bdrv_invalidate_cache_all() instead.
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
---
migration/migration.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/migration/migration.c b/migration/migration.c
index 991313a8629a..5726959ddfd9 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1568,8 +1568,17 @@ static void migration_completion(MigrationState *s, int current_active_state,
ret = bdrv_inactivate_all();
}
if (ret >= 0) {
+ Error *local_err = NULL;
+
qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
qemu_savevm_state_complete_precopy(s->to_dst_file, false);
+
+ if (qemu_file_get_error(s->to_dst_file)) {
+ bdrv_invalidate_cache_all(&local_err);
+ if (local_err) {
+ error_report_err(local_err);
+ }
+ }
}
}
qemu_mutex_unlock_iothread();
^ permalink raw reply related [flat|nested] 4+ messages in thread* Re: [Qemu-devel] [PATCH] migration: regain control of images when migration fails to complete 2016-05-18 10:18 [Qemu-devel] [PATCH] migration: regain control of images when migration fails to complete Greg Kurz @ 2016-05-18 11:13 ` Dr. David Alan Gilbert 2016-05-18 11:42 ` Kevin Wolf 0 siblings, 1 reply; 4+ messages in thread From: Dr. David Alan Gilbert @ 2016-05-18 11:13 UTC (permalink / raw) To: Greg Kurz, kwolf, mreitz, stefanha Cc: Amit Shah, Juan Quintela, qemu-stable, qemu-devel * Greg Kurz (gkurz@linux.vnet.ibm.com) wrote: > We currently have an error path during migration that can cause > the source QEMU to abort: Hmm, wasn't there something similar recently, sorry I can't remember the details, but cc'ing some block people who might remember. Dave > > migration_thread() > migration_completion() > runstate_is_running() ----------------> true if guest is running > bdrv_inactivate_all() ----------------> inactivate images > qemu_savevm_state_complete_precopy() > ... qemu_fflush() > socket_writev_buffer() --------> error because destination fails > qemu_fflush() -------------------> set error on migration stream > migration_completion() -----------------> set migrate state to FAILED > migration_thread() -----------------------> break migration loop > vm_start() -----------------------------> restart guest with inactive > images > > and you get: > > qemu-system-ppc64: socket_writev_buffer: Got err=104 for (32768/18446744073709551615) > qemu-system-ppc64: /home/greg/Work/qemu/qemu-master/block/io.c:1342:bdrv_co_do_pwritev: Assertion `!(bs->open_flags & 0x0800)' failed. > Aborted (core dumped) > > If we try postcopy with a similar scenario, we also get the writev error > message but QEMU leaves the guest paused because entered_postcopy is true. > > We could possibly do the same with precopy and leave the guest paused. > But since the historical default for migration errors is to restart the > source, this patch adds a call to bdrv_invalidate_cache_all() instead. > > Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com> > --- > migration/migration.c | 9 +++++++++ > 1 file changed, 9 insertions(+) > > diff --git a/migration/migration.c b/migration/migration.c > index 991313a8629a..5726959ddfd9 100644 > --- a/migration/migration.c > +++ b/migration/migration.c > @@ -1568,8 +1568,17 @@ static void migration_completion(MigrationState *s, int current_active_state, > ret = bdrv_inactivate_all(); > } > if (ret >= 0) { > + Error *local_err = NULL; > + > qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX); > qemu_savevm_state_complete_precopy(s->to_dst_file, false); > + > + if (qemu_file_get_error(s->to_dst_file)) { > + bdrv_invalidate_cache_all(&local_err); > + if (local_err) { > + error_report_err(local_err); > + } > + } > } > } > qemu_mutex_unlock_iothread(); > -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [Qemu-devel] [PATCH] migration: regain control of images when migration fails to complete 2016-05-18 11:13 ` Dr. David Alan Gilbert @ 2016-05-18 11:42 ` Kevin Wolf 2016-05-18 12:18 ` Greg Kurz 0 siblings, 1 reply; 4+ messages in thread From: Kevin Wolf @ 2016-05-18 11:42 UTC (permalink / raw) To: Dr. David Alan Gilbert Cc: Greg Kurz, mreitz, stefanha, Amit Shah, Juan Quintela, qemu-stable, qemu-devel Am 18.05.2016 um 13:13 hat Dr. David Alan Gilbert geschrieben: > * Greg Kurz (gkurz@linux.vnet.ibm.com) wrote: > > We currently have an error path during migration that can cause > > the source QEMU to abort: > > Hmm, wasn't there something similar recently, sorry I can't remember the details, but > cc'ing some block people who might remember. Not sure, but in any case calling bdrv_invalidate_cache_all() in the error case sounds right to me. I think I would have tried to write the patch a bit differently so that it uses the normal error handling patterns, but that doesn't make it wrong. Specifically, checking the error condition twice in two different places seems strange; normally you check it once and then goto fail, where the cleanup is done (which could include a call to bdrv_invalidate_cache_all()). Kevin > Dave > > > > migration_thread() > > migration_completion() > > runstate_is_running() ----------------> true if guest is running > > bdrv_inactivate_all() ----------------> inactivate images > > qemu_savevm_state_complete_precopy() > > ... qemu_fflush() > > socket_writev_buffer() --------> error because destination fails > > qemu_fflush() -------------------> set error on migration stream > > migration_completion() -----------------> set migrate state to FAILED > > migration_thread() -----------------------> break migration loop > > vm_start() -----------------------------> restart guest with inactive > > images > > > > and you get: > > > > qemu-system-ppc64: socket_writev_buffer: Got err=104 for (32768/18446744073709551615) > > qemu-system-ppc64: /home/greg/Work/qemu/qemu-master/block/io.c:1342:bdrv_co_do_pwritev: Assertion `!(bs->open_flags & 0x0800)' failed. > > Aborted (core dumped) > > > > If we try postcopy with a similar scenario, we also get the writev error > > message but QEMU leaves the guest paused because entered_postcopy is true. > > > > We could possibly do the same with precopy and leave the guest paused. > > But since the historical default for migration errors is to restart the > > source, this patch adds a call to bdrv_invalidate_cache_all() instead. > > > > Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com> > > --- > > migration/migration.c | 9 +++++++++ > > 1 file changed, 9 insertions(+) > > > > diff --git a/migration/migration.c b/migration/migration.c > > index 991313a8629a..5726959ddfd9 100644 > > --- a/migration/migration.c > > +++ b/migration/migration.c > > @@ -1568,8 +1568,17 @@ static void migration_completion(MigrationState *s, int current_active_state, > > ret = bdrv_inactivate_all(); > > } > > if (ret >= 0) { > > + Error *local_err = NULL; > > + > > qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX); > > qemu_savevm_state_complete_precopy(s->to_dst_file, false); > > + > > + if (qemu_file_get_error(s->to_dst_file)) { > > + bdrv_invalidate_cache_all(&local_err); > > + if (local_err) { > > + error_report_err(local_err); > > + } > > + } > > } > > } > > qemu_mutex_unlock_iothread(); > > > -- > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [Qemu-devel] [PATCH] migration: regain control of images when migration fails to complete 2016-05-18 11:42 ` Kevin Wolf @ 2016-05-18 12:18 ` Greg Kurz 0 siblings, 0 replies; 4+ messages in thread From: Greg Kurz @ 2016-05-18 12:18 UTC (permalink / raw) To: Kevin Wolf Cc: Dr. David Alan Gilbert, mreitz, stefanha, Amit Shah, Juan Quintela, qemu-stable, qemu-devel On Wed, 18 May 2016 13:42:08 +0200 Kevin Wolf <kwolf@redhat.com> wrote: > Am 18.05.2016 um 13:13 hat Dr. David Alan Gilbert geschrieben: > > * Greg Kurz (gkurz@linux.vnet.ibm.com) wrote: > > > We currently have an error path during migration that can cause > > > the source QEMU to abort: > > > > Hmm, wasn't there something similar recently, sorry I can't remember the details, but > > cc'ing some block people who might remember. > > Not sure, but in any case calling bdrv_invalidate_cache_all() in the > error case sounds right to me. > > I think I would have tried to write the patch a bit differently so that > it uses the normal error handling patterns, but that doesn't make it > wrong. Specifically, checking the error condition twice in two different > places seems strange; normally you check it once and then goto fail, > where the cleanup is done (which could include a call to > bdrv_invalidate_cache_all()). > I have coded it this way because it is obvious we're undoing the call to bdrv_inactivate_all() a few lines above. And also, in the postcopy case, migration_thread() doesn't call vm_start() when migration fails to complete: bdrv_invalidate_cache_all() will be called if the guest gets restarted via QMP. This being said, I agree that the resulting code looks weird. I guess it is possible to add a fail_invalidate: label to do the cleanup in the non postcopy case. I'll send a v2. > Kevin > > > Dave > > > > > > migration_thread() > > > migration_completion() > > > runstate_is_running() ----------------> true if guest is running > > > bdrv_inactivate_all() ----------------> inactivate images > > > qemu_savevm_state_complete_precopy() > > > ... qemu_fflush() > > > socket_writev_buffer() --------> error because destination fails > > > qemu_fflush() -------------------> set error on migration stream > > > migration_completion() -----------------> set migrate state to FAILED > > > migration_thread() -----------------------> break migration loop > > > vm_start() -----------------------------> restart guest with inactive > > > images > > > > > > and you get: > > > > > > qemu-system-ppc64: socket_writev_buffer: Got err=104 for (32768/18446744073709551615) > > > qemu-system-ppc64: /home/greg/Work/qemu/qemu-master/block/io.c:1342:bdrv_co_do_pwritev: Assertion `!(bs->open_flags & 0x0800)' failed. > > > Aborted (core dumped) > > > > > > If we try postcopy with a similar scenario, we also get the writev error > > > message but QEMU leaves the guest paused because entered_postcopy is true. > > > > > > We could possibly do the same with precopy and leave the guest paused. > > > But since the historical default for migration errors is to restart the > > > source, this patch adds a call to bdrv_invalidate_cache_all() instead. > > > > > > Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com> > > > --- > > > migration/migration.c | 9 +++++++++ > > > 1 file changed, 9 insertions(+) > > > > > > diff --git a/migration/migration.c b/migration/migration.c > > > index 991313a8629a..5726959ddfd9 100644 > > > --- a/migration/migration.c > > > +++ b/migration/migration.c > > > @@ -1568,8 +1568,17 @@ static void migration_completion(MigrationState *s, int current_active_state, > > > ret = bdrv_inactivate_all(); > > > } > > > if (ret >= 0) { > > > + Error *local_err = NULL; > > > + > > > qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX); > > > qemu_savevm_state_complete_precopy(s->to_dst_file, false); > > > + > > > + if (qemu_file_get_error(s->to_dst_file)) { > > > + bdrv_invalidate_cache_all(&local_err); > > > + if (local_err) { > > > + error_report_err(local_err); > > > + } > > > + } > > > } > > > } > > > qemu_mutex_unlock_iothread(); > > > > > -- > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK > ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2016-05-18 12:18 UTC | newest] Thread overview: 4+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2016-05-18 10:18 [Qemu-devel] [PATCH] migration: regain control of images when migration fails to complete Greg Kurz 2016-05-18 11:13 ` Dr. David Alan Gilbert 2016-05-18 11:42 ` Kevin Wolf 2016-05-18 12:18 ` Greg Kurz
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).