* [RFC 0/1] migration: Update error description whenever migration fails
From: tejus.gk @ 2023-05-03 20:31 UTC
To: qemu-devel; +Cc: quintela, peterx, leobras, tejus.gk

Hi everyone,

Currently, in QEMU, whenever a migration fails, its state is set to
MIGRATION_STATUS_FAILED via the function migrate_set_state. However, there
are places in the code where the migration is marked as failed but the
error description is never updated in the migration state object. This
causes problems when libvirt queries the status of the migration via
query-migrate: it never receives an error description and hence reports
the error as "unexpectedly failed". This gives us no information about
what actually went wrong with the migration.

One approach to solving this problem, which this patch explores, is to
update the migration error through migrate_set_error whenever the
migration state is updated to MIGRATION_STATUS_FAILED. However, the error
description available at that point can be too vague, since the same
failure state may be reached for various reasons.

An alternative approach is to update the error description at the point
where the error actually occurred. For instance, an error which occurs
while saving the vmstate in the function vmstate_save_state_v in
migration/vmstate.c results in a failed migration, so the error
description can be updated there, rather than in the function
migration_completion in migration/migration.c.

tejus.gk (1):
  migration: Update error description whenever migration fails

 migration/migration.c | 8 ++++++++
 1 file changed, 8 insertions(+)

-- 
2.22.3

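The reporting gap described in the cover letter can be sketched with a
small self-contained C model. Everything in it is a simplified stand-in
invented for illustration (DemoMigration, demo_set_state, demo_set_error,
demo_fail, demo_client_report); none of it is QEMU's or libvirt's actual
code, and the fallback string is only the message the cover letter
attributes to libvirt.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins for illustration only, NOT QEMU's real types. */
typedef enum { DEMO_SETUP, DEMO_ACTIVE, DEMO_FAILED } DemoStatus;

typedef struct {
    DemoStatus state;
    char *error_desc;           /* NULL until a failure reason is recorded */
} DemoMigration;

/* Models the problematic path: only the state changes. */
void demo_set_state(DemoMigration *m, DemoStatus new_state)
{
    m->state = new_state;
}

/* Records a private copy of the first failure reason; later ones are ignored. */
void demo_set_error(DemoMigration *m, const char *msg)
{
    assert(msg != NULL);
    if (m->error_desc == NULL) {
        size_t len = strlen(msg) + 1;
        m->error_desc = malloc(len);
        if (m->error_desc == NULL) {
            abort();
        }
        memcpy(m->error_desc, msg, len);
    }
}

/* Hypothetical helper: record the reason and mark the failure together. */
void demo_fail(DemoMigration *m, const char *msg)
{
    demo_set_error(m, msg);
    demo_set_state(m, DEMO_FAILED);
}

/* What a management client could report after querying the status. */
const char *demo_client_report(const DemoMigration *m)
{
    if (m->state != DEMO_FAILED) {
        return NULL;            /* nothing to report */
    }
    return m->error_desc ? m->error_desc : "unexpectedly failed";
}
```

Calling demo_set_state(&m, DEMO_FAILED) alone leaves the client with only
the generic fallback, while demo_fail(&m, "Unable to open return-path")
lets the recorded reason reach the client. The caller remains responsible
for freeing error_desc in this sketch.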
* [RFC 1/1] migration: Update error description whenever migration fails
From: tejus.gk @ 2023-05-03 20:31 UTC
To: qemu-devel; +Cc: quintela, peterx, leobras, tejus.gk

There are places in the code where the migration is marked failed with
MIGRATION_STATUS_FAILED, but the failure reason is never updated. Hence
libvirt doesn't know why the migration failed when it queries for it.

Signed-off-by: tejus.gk <tejus.gk@nutanix.com>
---
 migration/migration.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index feb5ab7493..0d7d34bf4d 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1665,8 +1665,11 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
         }
         error_setg(errp, QERR_INVALID_PARAMETER_VALUE, "uri",
                    "a valid migration protocol");
+        error_setg(&local_err, QERR_INVALID_PARAMETER_VALUE, "uri",
+                   "a valid migration protocol");
         migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
                           MIGRATION_STATUS_FAILED);
+        migrate_set_error(s, local_err);
         block_cleanup_parameters();
         return;
     }
@@ -2059,6 +2062,7 @@ static int postcopy_start(MigrationState *ms)
     int64_t bandwidth = migrate_max_postcopy_bandwidth();
     bool restart_block = false;
     int cur_state = MIGRATION_STATUS_ACTIVE;
+    Error *local_err = NULL;
 
     if (migrate_postcopy_preempt()) {
         migration_wait_main_channel(ms);
@@ -2203,8 +2207,10 @@ static int postcopy_start(MigrationState *ms)
     ret = qemu_file_get_error(ms->to_dst_file);
     if (ret) {
         error_report("postcopy_start: Migration stream errored");
+        error_setg(&local_err, "postcopy_start: Migration stream errored");
         migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
                               MIGRATION_STATUS_FAILED);
+        migrate_set_error(ms, local_err);
     }
 
     trace_postcopy_preempt_enabled(migrate_postcopy_preempt());
@@ -3233,7 +3239,9 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
     if (migrate_postcopy_ram() || migrate_return_path()) {
         if (open_return_path_on_source(s, !resume)) {
             error_report("Unable to open return-path for postcopy");
+            error_setg(&local_err, "Unable to open return-path");
             migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
+            migrate_set_error(s, local_err);
             migrate_fd_cleanup(s);
             return;
         }
-- 
2.22.3

* Re: [RFC 1/1] migration: Update error description whenever migration fails
From: Daniel P. Berrangé @ 2023-05-04 8:16 UTC
To: tejus.gk; +Cc: qemu-devel, quintela, peterx, leobras

On Wed, May 03, 2023 at 08:31:16PM +0000, tejus.gk wrote:
> There are places in the code where the migration is marked failed with
> MIGRATION_STATUS_FAILED, but the failure reason is never updated. Hence
> libvirt doesn't know why the migration failed when it queries for it.
>
> Signed-off-by: tejus.gk <tejus.gk@nutanix.com>
> ---
>  migration/migration.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index feb5ab7493..0d7d34bf4d 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1665,8 +1665,11 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
>          }
>          error_setg(errp, QERR_INVALID_PARAMETER_VALUE, "uri",
>                     "a valid migration protocol");
> +        error_setg(&local_err, QERR_INVALID_PARAMETER_VALUE, "uri",
> +                   "a valid migration protocol");
>          migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
>                            MIGRATION_STATUS_FAILED);
> +        migrate_set_error(s, local_err);
>          block_cleanup_parameters();
>          return;

Most of this "} else {" block is duplicating what is done in
the following "if (local_error)" block. As such I think this
should be deleted and replaced with merely

    } else {
        error_setg(&local_err, QERR_INVALID_PARAMETER_VALUE, "uri",
                   "a valid migration protocol");
        block_cleanup_parameters();
    }

...so we just fall through to the local_error cleanup block.

>      }
> @@ -2059,6 +2062,7 @@ static int postcopy_start(MigrationState *ms)
>      int64_t bandwidth = migrate_max_postcopy_bandwidth();
>      bool restart_block = false;
>      int cur_state = MIGRATION_STATUS_ACTIVE;
> +    Error *local_err = NULL;
>
>      if (migrate_postcopy_preempt()) {
>          migration_wait_main_channel(ms);
> @@ -2203,8 +2207,10 @@ static int postcopy_start(MigrationState *ms)
>      ret = qemu_file_get_error(ms->to_dst_file);
>      if (ret) {
>          error_report("postcopy_start: Migration stream errored");
> +        error_setg(&local_err, "postcopy_start: Migration stream errored");

There is an earlier place in this method which also calls
error_report which you've not changed to call migrate_set_error.

Even more crazy is that the caller of postcopy_start() also
calls error_report() but with a useless error message.

Also nothing is freeing the local_err object once set.

IMHO, the postcopy_start() method should be changed to accept
an "Error **errp" parameter, and then the caller should be
responsible for calling error_report_err and migrate_set_error.

>          migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
>                                MIGRATION_STATUS_FAILED);
> +        migrate_set_error(ms, local_err);
>      }
>
>      trace_postcopy_preempt_enabled(migrate_postcopy_preempt());
> @@ -3233,7 +3239,9 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
>      if (migrate_postcopy_ram() || migrate_return_path()) {
>          if (open_return_path_on_source(s, !resume)) {
>              error_report("Unable to open return-path for postcopy");
> +            error_setg(&local_err, "Unable to open return-path");

Having two different error messages is bad and again nothing frees
the local_err object. Remove the error_report call and have it call
error_report_err(&local_err) which does free the object.

>              migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
> +            migrate_set_error(s, local_err);
>              migrate_fd_cleanup(s);
>              return;
>          }
> --
> 2.22.3
>
>

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

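The ownership pattern suggested in the review (the callee only fills an
"Error **errp" style out-parameter, and the caller records, reports and
frees the error in one place) can be sketched as below. This is a minimal
model under stated assumptions: DemoError, DemoState and every demo_*
function are hypothetical stand-ins, not QEMU's Error API, and the sketch
assumes the state object stores its own copy of the error, which is what
makes the caller responsible for freeing the original.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for an error object; NOT QEMU's Error type. */
typedef struct { char *msg; } DemoError;

typedef struct {
    int failed;
    DemoError *error;           /* copy owned by the state object */
} DemoState;

DemoError *demo_error_new(const char *msg)
{
    size_t len = strlen(msg) + 1;
    DemoError *err = malloc(sizeof(*err));
    if (err == NULL || (err->msg = malloc(len)) == NULL) {
        abort();
    }
    memcpy(err->msg, msg, len);
    return err;
}

void demo_error_free(DemoError *err)
{
    if (err) {
        free(err->msg);
        free(err);
    }
}

/* Callee side: fill *errp once; ownership passes to the caller. */
void demo_error_set(DemoError **errp, const char *msg)
{
    if (errp) {
        assert(*errp == NULL);
        *errp = demo_error_new(msg);
    }
}

/* The state keeps a copy of the first error; err still belongs to the caller. */
void demo_state_set_error(DemoState *s, const DemoError *err)
{
    if (s->error == NULL) {
        s->error = demo_error_new(err->msg);
    }
}

/* Print the error, free it, and clear the caller's pointer. */
void demo_error_report_and_free(DemoError **errp)
{
    fprintf(stderr, "%s\n", (*errp)->msg);
    demo_error_free(*errp);
    *errp = NULL;
}

/* Callee: no reporting and no state change, just an error for the caller. */
int demo_postcopy_start(int stream_ok, DemoError **errp)
{
    if (!stream_ok) {
        demo_error_set(errp, "Migration stream errored");
        return -1;
    }
    return 0;
}

/* Caller: the single place that records, reports and frees the error. */
int demo_run(DemoState *s, int stream_ok)
{
    DemoError *local_err = NULL;

    if (demo_postcopy_start(stream_ok, &local_err) < 0) {
        s->failed = 1;
        demo_state_set_error(s, local_err);
        demo_error_report_and_free(&local_err);
        return -1;
    }
    return 0;
}
```

With this shape there is exactly one message per failure, and no code path
leaves a set error object unfreed, which are the two problems the review
points out in the patch.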
* Re: [RFC 1/1] migration: Update error description whenever migration fails
From: Tejus GK @ 2023-05-05 14:44 UTC
To: Daniel P. Berrangé
Cc: qemu-devel, quintela, peterx, leobras, shivam.kumar1

On 04/05/23 1:46 pm, Daniel P. Berrangé wrote:
> On Wed, May 03, 2023 at 08:31:16PM +0000, tejus.gk wrote:
>> There are places in the code where the migration is marked failed with
>> MIGRATION_STATUS_FAILED, but the failure reason is never updated. Hence
>> libvirt doesn't know why the migration failed when it queries for it.
>>
>> Signed-off-by: tejus.gk <tejus.gk@nutanix.com>
>> ---
>>  migration/migration.c | 8 ++++++++
>>  1 file changed, 8 insertions(+)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index feb5ab7493..0d7d34bf4d 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -1665,8 +1665,11 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
>>          }
>>          error_setg(errp, QERR_INVALID_PARAMETER_VALUE, "uri",
>>                     "a valid migration protocol");
>> +        error_setg(&local_err, QERR_INVALID_PARAMETER_VALUE, "uri",
>> +                   "a valid migration protocol");
>>          migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
>>                            MIGRATION_STATUS_FAILED);
>> +        migrate_set_error(s, local_err);
>>          block_cleanup_parameters();
>>          return;
>
> Most of this "} else {" block is duplicating what is done in
> the following "if (local_error)" block. As such I think this
> should be deleted and replaced with merely
>
>     } else {
>         error_setg(&local_err, QERR_INVALID_PARAMETER_VALUE, "uri",
>                    "a valid migration protocol");
>         block_cleanup_parameters();
>     }
>
> ...so we just fall through to the local_error cleanup block.

Ack. Will modify this in the next patch.

>
>>      }
>> @@ -2059,6 +2062,7 @@ static int postcopy_start(MigrationState *ms)
>>      int64_t bandwidth = migrate_max_postcopy_bandwidth();
>>      bool restart_block = false;
>>      int cur_state = MIGRATION_STATUS_ACTIVE;
>> +    Error *local_err = NULL;
>>
>>      if (migrate_postcopy_preempt()) {
>>          migration_wait_main_channel(ms);
>> @@ -2203,8 +2207,10 @@ static int postcopy_start(MigrationState *ms)
>>      ret = qemu_file_get_error(ms->to_dst_file);
>>      if (ret) {
>>          error_report("postcopy_start: Migration stream errored");
>> +        error_setg(&local_err, "postcopy_start: Migration stream errored");
>
> There is an earlier place in this method which also calls
> error_report which you've not changed to call migrate_set_error.
>

Ack, will fix this in the next patch.

> Even more crazy is that the caller of postcopy_start() also
> calls error_report() but with a useless error message.
>
> Also nothing is freeing the local_err object once set.
>
> IMHO, the postcopy_start() method should be changed to accept
> an "Error **errp" parameter, and then the caller should be
> responsible for calling error_report_err and migrate_set_error.

Ack, will modify this in the next patch.

>
>
>>          migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
>>                                MIGRATION_STATUS_FAILED);
>> +        migrate_set_error(ms, local_err);
>>      }
>>
>>      trace_postcopy_preempt_enabled(migrate_postcopy_preempt());
>> @@ -3233,7 +3239,9 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
>>      if (migrate_postcopy_ram() || migrate_return_path()) {
>>          if (open_return_path_on_source(s, !resume)) {
>>              error_report("Unable to open return-path for postcopy");
>> +            error_setg(&local_err, "Unable to open return-path");
>
> Having two different error messages is bad and again nothing frees
> the local_err object. Remove the error_report call and have it call
> error_report_err(&local_err) which does free the object.

My bad, missed this. Will fix this in the next patch.

>
>>              migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
>> +            migrate_set_error(s, local_err);
>>              migrate_fd_cleanup(s);
>>              return;
>>          }
>> --
>> 2.22.3
>>
>>
>
> With regards,
> Daniel

Hi,

Thanks for the reviews. I'll be sending a revision with the fixes shortly.
Meanwhile, I wanted to get something clarified. Apart from the places this
patch set covers, there are also places in the code where the migration is
marked as failed, yet the error_report() call either does not happen at
all or happens in a different file. An example of the latter can be seen
in the function migration_completion() in migration.c, where

            ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
                                                     s->block_inactive);
        }
    }
    qemu_mutex_unlock_iothread();

    if (ret < 0) {
        goto fail;
    }

and if we take a look at fail:

fail:
    migrate_set_state(&s->state, current_active_state,
                      MIGRATION_STATUS_FAILED);

In this instance, the error_report() call for a possible failure while
saving the vmstate is made in the file vmstate.c. I wanted to ask whether
calling migrate_set_error() in a different file (vmstate.c in this case)
is permissible.

regards,
tejus

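The closing question in the thread concerns a failure detected in one
file (vmstate.c) while the state change to failed happens in another
(migration.c). One possible shape, offered only as an illustration and
not as an answer given in the thread, is for the lower layer to fill an
error out-parameter and leave the recording to the top-level fail path.
All names below (DemoErr, DemoMig, demo_vmstate_save,
demo_savevm_complete, demo_migration_completion) and the error text are
hypothetical stand-ins, not QEMU functions or messages.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fixed-size error record; a stand-in, NOT QEMU's Error type. */
typedef struct {
    char msg[128];
    int set;
} DemoErr;

typedef struct {
    int failed;
    DemoErr error;              /* first recorded failure reason */
} DemoMig;

/* Lowest layer (stand-in for vmstate.c): knows which field failed. */
int demo_vmstate_save(const char *field, int ok, DemoErr *err)
{
    assert(err != NULL);
    if (!ok) {
        snprintf(err->msg, sizeof(err->msg), "Save of field %s failed", field);
        err->set = 1;
        return -1;
    }
    return 0;
}

/* Middle layer: passes the error through without touching migration state. */
int demo_savevm_complete(int ok, DemoErr *err)
{
    return demo_vmstate_save("demo-device/regs", ok, err);
}

/* Top layer (stand-in for migration_completion): records the error on fail. */
int demo_migration_completion(DemoMig *m, int ok)
{
    DemoErr err = { "", 0 };

    if (demo_savevm_complete(ok, &err) < 0) {
        goto fail;
    }
    return 0;

fail:
    m->failed = 1;
    if (err.set && !m->error.set) {
        m->error = err;         /* keep the precise reason from below */
    }
    return -1;
}
```

In this sketch the lower layers never reference the migration state, yet
the description stored at the fail label is the specific one produced
where the error occurred.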