* [PATCH 0/7] migration: Better error handling in return path thread
@ 2023-06-28 21:49 Peter Xu
2023-06-28 21:49 ` [PATCH 1/7] migration: Let migrate_set_error() take ownership Peter Xu
` (6 more replies)
0 siblings, 7 replies; 12+ messages in thread
From: Peter Xu @ 2023-06-28 21:49 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, Leonardo Bras Soares Passos, Fabiano Rosas, Juan Quintela,
Lukas Straub
This is a small series that reworks error handling of postcopy return path
threads.
We used to contain a bunch of error_report(), converting them into
error_setg() properly and deliver any of those errors to migration generic
error reports (via migrate_set_error()). Then these errors can also be
observed in query-migrate after postcopy is paused.
Dropped the return-path specific error reporting: mark_source_rp_bad(),
because it's a duplication if we can always use migrate_set_error().
Please have a look, thanks.
Peter Xu (7):
migration: Let migrate_set_error() take ownership
migration: Introduce migrate_has_error()
migration: Refactor error handling in source return path
migration: Deliver return path file error to migrate state too
migration: Display error in query-migrate irrelevant of status
qemufile: Always return a verbose error
migration: Provide explicit error message for file shutdowns
migration/migration.h | 8 +-
migration/ram.h | 5 +-
migration/channel.c | 1 -
migration/migration.c | 168 +++++++++++++++++++++++----------------
migration/multifd.c | 10 +--
migration/postcopy-ram.c | 1 -
migration/qemu-file.c | 20 ++++-
migration/ram.c | 42 +++++-----
migration/trace-events | 2 +-
9 files changed, 147 insertions(+), 110 deletions(-)
--
2.41.0
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH 1/7] migration: Let migrate_set_error() take ownership
2023-06-28 21:49 [PATCH 0/7] migration: Better error handling in return path thread Peter Xu
@ 2023-06-28 21:49 ` Peter Xu
2023-06-28 21:49 ` [PATCH 2/7] migration: Introduce migrate_has_error() Peter Xu
` (5 subsequent siblings)
6 siblings, 0 replies; 12+ messages in thread
From: Peter Xu @ 2023-06-28 21:49 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, Leonardo Bras Soares Passos, Fabiano Rosas, Juan Quintela,
Lukas Straub
migrate_set_error() used one error_copy() so it always copy an error.
However that's not the major use case - the major use case is one would
like to pass the error to migrate_set_error() without further touching the
error.
It can be proved if we see most of the callers are freeing the error
explicitly right afterwards. There're a few outliers (only if when the
caller) where we can use error_copy() explicitly there.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.h | 6 +++---
migration/channel.c | 1 -
migration/migration.c | 20 ++++++++++++++------
migration/multifd.c | 10 ++++------
migration/postcopy-ram.c | 1 -
migration/ram.c | 1 -
6 files changed, 21 insertions(+), 18 deletions(-)
diff --git a/migration/migration.h b/migration/migration.h
index 721b1c9473..32f87e3834 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -468,8 +468,8 @@ bool migration_has_all_channels(void);
uint64_t migrate_max_downtime(void);
-void migrate_set_error(MigrationState *s, const Error *error);
-void migrate_fd_error(MigrationState *s, const Error *error);
+void migrate_set_error(MigrationState *s, Error *error);
+void migrate_fd_error(MigrationState *s, Error *error);
void migrate_fd_connect(MigrationState *s, Error *error_in);
@@ -513,7 +513,7 @@ int foreach_not_ignored_block(RAMBlockIterFunc func, void *opaque);
void migration_make_urgent_request(void);
void migration_consume_urgent_request(void);
bool migration_rate_limit(void);
-void migration_cancel(const Error *error);
+void migration_cancel(Error *error);
void populate_vfio_info(MigrationInfo *info);
void postcopy_temp_page_reset(PostcopyTmpPage *tmp_page);
diff --git a/migration/channel.c b/migration/channel.c
index ca3319a309..48b3f6abd6 100644
--- a/migration/channel.c
+++ b/migration/channel.c
@@ -90,7 +90,6 @@ void migration_channel_connect(MigrationState *s,
}
}
migrate_fd_connect(s, error);
- error_free(error);
}
diff --git a/migration/migration.c b/migration/migration.c
index 203d40d4c9..13dccc4c12 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -160,7 +160,7 @@ void migration_object_init(void)
dirty_bitmap_mig_init();
}
-void migration_cancel(const Error *error)
+void migration_cancel(Error *error)
{
if (error) {
migrate_set_error(current_migration, error);
@@ -1197,11 +1197,20 @@ static void migrate_fd_cleanup_bh(void *opaque)
object_unref(OBJECT(s));
}
-void migrate_set_error(MigrationState *s, const Error *error)
+/*
+ * Set error for current migration state. The `error' ownership will be
+ * moved from the caller to MigrationState, so the caller doesn't need to
+ * free the error.
+ *
+ * If the caller still needs to reference the `error' passed in, one should
+ * use error_copy() explicitly.
+ */
+void migrate_set_error(MigrationState *s, Error *error)
{
QEMU_LOCK_GUARD(&s->error_mutex);
if (!s->error) {
- s->error = error_copy(error);
+ /* Record the first error triggered */
+ s->error = error;
}
}
@@ -1214,7 +1223,7 @@ static void migrate_error_free(MigrationState *s)
}
}
-void migrate_fd_error(MigrationState *s, const Error *error)
+void migrate_fd_error(MigrationState *s, Error *error)
{
trace_migrate_fd_error(error_get_pretty(error));
assert(s->to_dst_file == NULL);
@@ -1678,7 +1687,7 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
if (!(has_resume && resume)) {
yank_unregister_instance(MIGRATION_YANK_INSTANCE);
}
- migrate_fd_error(s, local_err);
+ migrate_fd_error(s, error_copy(local_err));
error_propagate(errp, local_err);
return;
}
@@ -2595,7 +2604,6 @@ static MigThrError migration_detect_error(MigrationState *s)
if (local_error) {
migrate_set_error(s, local_error);
- error_free(local_error);
}
if (state == MIGRATION_STATUS_POSTCOPY_ACTIVE && ret) {
diff --git a/migration/multifd.c b/migration/multifd.c
index 3387d8277f..62bc2dbf49 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -551,7 +551,6 @@ void multifd_save_cleanup(void)
multifd_send_state->ops->send_cleanup(p, &local_err);
if (local_err) {
migrate_set_error(migrate_get_current(), local_err);
- error_free(local_err);
}
}
qemu_sem_destroy(&multifd_send_state->channels_ready);
@@ -750,7 +749,6 @@ out:
if (local_err) {
trace_multifd_send_error(p->id);
multifd_send_terminate_threads(local_err);
- error_free(local_err);
}
/*
@@ -883,7 +881,6 @@ static void multifd_new_send_channel_cleanup(MultiFDSendParams *p,
*/
p->quit = true;
object_unref(OBJECT(ioc));
- error_free(err);
}
static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
@@ -1148,7 +1145,6 @@ static void *multifd_recv_thread(void *opaque)
if (local_err) {
multifd_recv_terminate_threads(local_err);
- error_free(local_err);
}
qemu_mutex_lock(&p->mutex);
p->running = false;
@@ -1240,7 +1236,8 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp)
id = multifd_recv_initial_packet(ioc, &local_err);
if (id < 0) {
- multifd_recv_terminate_threads(local_err);
+ /* Copy local error because we'll also return it to caller */
+ multifd_recv_terminate_threads(error_copy(local_err));
error_propagate_prepend(errp, local_err,
"failed to receive packet"
" via multifd channel %d: ",
@@ -1253,7 +1250,8 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp)
if (p->c != NULL) {
error_setg(&local_err, "multifd: received id '%d' already setup'",
id);
- multifd_recv_terminate_threads(local_err);
+ /* Copy local error because we'll also return it to caller */
+ multifd_recv_terminate_threads(error_copy(local_err));
error_propagate(errp, local_err);
return;
}
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 5615ec29eb..6f6fb52bf1 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -1594,7 +1594,6 @@ postcopy_preempt_send_channel_done(MigrationState *s,
{
if (local_err) {
migrate_set_error(s, local_err);
- error_free(local_err);
} else {
migration_ioc_register_yank(ioc);
s->postcopy_qemufile_src = qemu_file_new_output(ioc);
diff --git a/migration/ram.c b/migration/ram.c
index 5283a75f02..ba4890563d 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -4267,7 +4267,6 @@ static void ram_mig_ram_block_resized(RAMBlockNotifier *n, void *host,
*/
error_setg(&err, "RAM block '%s' resized during precopy.", rb->idstr);
migration_cancel(err);
- error_free(err);
}
switch (ps) {
--
2.41.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 2/7] migration: Introduce migrate_has_error()
2023-06-28 21:49 [PATCH 0/7] migration: Better error handling in return path thread Peter Xu
2023-06-28 21:49 ` [PATCH 1/7] migration: Let migrate_set_error() take ownership Peter Xu
@ 2023-06-28 21:49 ` Peter Xu
2023-06-28 21:49 ` [PATCH 3/7] migration: Refactor error handling in source return path Peter Xu
` (4 subsequent siblings)
6 siblings, 0 replies; 12+ messages in thread
From: Peter Xu @ 2023-06-28 21:49 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, Leonardo Bras Soares Passos, Fabiano Rosas, Juan Quintela,
Lukas Straub
Introduce a helper to detect whether MigrationState.error is set for
whatever reason. It is intended to not taking the error_mutex here because
neither do we reference the pointer, nor do we modify the pointer. State
why it's safe to do so.
This is preparation work for any thread (e.g. source return path thread) to
setup errors in an unified way to MigrationState, rather than relying on
its own way to set errors (mark_source_rp_bad()).
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.h | 1 +
migration/migration.c | 15 +++++++++++++++
2 files changed, 16 insertions(+)
diff --git a/migration/migration.h b/migration/migration.h
index 32f87e3834..e48317ca97 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -469,6 +469,7 @@ bool migration_has_all_channels(void);
uint64_t migrate_max_downtime(void);
void migrate_set_error(MigrationState *s, Error *error);
+bool migrate_has_error(MigrationState *s);
void migrate_fd_error(MigrationState *s, Error *error);
void migrate_fd_connect(MigrationState *s, Error *error_in);
diff --git a/migration/migration.c b/migration/migration.c
index 13dccc4c12..39674a3981 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1214,6 +1214,21 @@ void migrate_set_error(MigrationState *s, Error *error)
}
}
+/*
+ * Whether the migration state has error set?
+ *
+ * Note this function explicitly didn't use error_mutex, because it only
+ * reads the error pointer for a boolean status.
+ *
+ * As long as the Error* is set, it shouldn't be freed before migration
+ * cleanup, so any thread can use this helper to safely detect whether
+ * there's anything wrong happened already.
+ */
+bool migrate_has_error(MigrationState *s)
+{
+ return qatomic_read(&s->error);
+}
+
static void migrate_error_free(MigrationState *s)
{
QEMU_LOCK_GUARD(&s->error_mutex);
--
2.41.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 3/7] migration: Refactor error handling in source return path
2023-06-28 21:49 [PATCH 0/7] migration: Better error handling in return path thread Peter Xu
2023-06-28 21:49 ` [PATCH 1/7] migration: Let migrate_set_error() take ownership Peter Xu
2023-06-28 21:49 ` [PATCH 2/7] migration: Introduce migrate_has_error() Peter Xu
@ 2023-06-28 21:49 ` Peter Xu
2023-06-28 22:51 ` Fabiano Rosas
2023-06-28 21:49 ` [PATCH 4/7] migration: Deliver return path file error to migrate state too Peter Xu
` (3 subsequent siblings)
6 siblings, 1 reply; 12+ messages in thread
From: Peter Xu @ 2023-06-28 21:49 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, Leonardo Bras Soares Passos, Fabiano Rosas, Juan Quintela,
Lukas Straub
rp_state.error was a boolean used to show error happened in return path
thread. That's not only duplicating error reporting (migrate_set_error),
but also not good enough in that we only do error_report() and set it to
true, we never can keep a history of the exact error and show it in
query-migrate.
To make this better, a few things done:
- Use error_setg() rather than error_report() across the whole lifecycle
of return path thread, keeping the error in an Error*.
- Use migrate_set_error() to apply that captured error to the global
migration object when error occured in this thread.
- With above, no need to have mark_source_rp_bad(), remove it, alongside
with rp_state.error itself.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.h | 1 -
migration/ram.h | 5 +-
migration/migration.c | 118 ++++++++++++++++++++---------------------
migration/ram.c | 41 +++++++-------
migration/trace-events | 2 +-
5 files changed, 82 insertions(+), 85 deletions(-)
diff --git a/migration/migration.h b/migration/migration.h
index e48317ca97..d9c4b753b5 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -290,7 +290,6 @@ struct MigrationState {
/* Protected by qemu_file_lock */
QEMUFile *from_dst_file;
QemuThread rp_thread;
- bool error;
/*
* We can also check non-zero of rp_thread, but there's no "official"
* way to do this, so this bool makes it slightly more elegant.
diff --git a/migration/ram.h b/migration/ram.h
index ea1f3c25b5..02a4af2db6 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -52,7 +52,8 @@ uint64_t ram_bytes_total(void);
void mig_throttle_counter_reset(void);
uint64_t ram_pagesize_summary(void);
-int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len);
+int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len,
+ Error **errp);
void ram_postcopy_migrated_memory_release(MigrationState *ms);
/* For outgoing discard bitmap */
void ram_postcopy_send_discard_bitmap(MigrationState *ms);
@@ -72,7 +73,7 @@ void ramblock_recv_bitmap_set(RAMBlock *rb, void *host_addr);
void ramblock_recv_bitmap_set_range(RAMBlock *rb, void *host_addr, size_t nr);
int64_t ramblock_recv_bitmap_send(QEMUFile *file,
const char *block_name);
-int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *rb);
+int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *rb, Error **errp);
bool ramblock_page_is_discarded(RAMBlock *rb, ram_addr_t start);
void postcopy_preempt_shutdown_file(MigrationState *s);
void *postcopy_preempt_thread(void *opaque);
diff --git a/migration/migration.c b/migration/migration.c
index 39674a3981..f8c41c4d98 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1409,7 +1409,6 @@ void migrate_init(MigrationState *s)
s->to_dst_file = NULL;
s->state = MIGRATION_STATUS_NONE;
s->rp_state.from_dst_file = NULL;
- s->rp_state.error = false;
s->mbps = 0.0;
s->pages_per_second = 0.0;
s->downtime = 0;
@@ -1724,16 +1723,6 @@ void qmp_migrate_continue(MigrationStatus state, Error **errp)
qemu_sem_post(&s->pause_sem);
}
-/* migration thread support */
-/*
- * Something bad happened to the RP stream, mark an error
- * The caller shall print or trace something to indicate why
- */
-static void mark_source_rp_bad(MigrationState *s)
-{
- s->rp_state.error = true;
-}
-
static struct rp_cmd_args {
ssize_t len; /* -1 = variable */
const char *name;
@@ -1754,7 +1743,7 @@ static struct rp_cmd_args {
* and we don't need to send pages that have already been sent.
*/
static void migrate_handle_rp_req_pages(MigrationState *ms, const char* rbname,
- ram_addr_t start, size_t len)
+ ram_addr_t start, size_t len, Error **errp)
{
long our_host_ps = qemu_real_host_page_size();
@@ -1766,15 +1755,12 @@ static void migrate_handle_rp_req_pages(MigrationState *ms, const char* rbname,
*/
if (!QEMU_IS_ALIGNED(start, our_host_ps) ||
!QEMU_IS_ALIGNED(len, our_host_ps)) {
- error_report("%s: Misaligned page request, start: " RAM_ADDR_FMT
- " len: %zd", __func__, start, len);
- mark_source_rp_bad(ms);
+ error_setg(errp, "MIG_RP_MSG_REQ_PAGES: Misaligned page request, start:"
+ RAM_ADDR_FMT " len: %zd", start, len);
return;
}
- if (ram_save_queue_pages(rbname, start, len)) {
- mark_source_rp_bad(ms);
- }
+ ram_save_queue_pages(rbname, start, len, errp);
}
/* Return true to retry, false to quit */
@@ -1789,26 +1775,28 @@ static bool postcopy_pause_return_path_thread(MigrationState *s)
return true;
}
-static int migrate_handle_rp_recv_bitmap(MigrationState *s, char *block_name)
+static int migrate_handle_rp_recv_bitmap(MigrationState *s, char *block_name,
+ Error **errp)
{
RAMBlock *block = qemu_ram_block_by_name(block_name);
if (!block) {
- error_report("%s: invalid block name '%s'", __func__, block_name);
+ error_setg(errp, "MIG_RP_MSG_RECV_BITMAP has invalid block name '%s'",
+ block_name);
return -EINVAL;
}
/* Fetch the received bitmap and refresh the dirty bitmap */
- return ram_dirty_bitmap_reload(s, block);
+ return ram_dirty_bitmap_reload(s, block, errp);
}
-static int migrate_handle_rp_resume_ack(MigrationState *s, uint32_t value)
+static int migrate_handle_rp_resume_ack(MigrationState *s,
+ uint32_t value, Error **errp)
{
trace_source_return_path_thread_resume_ack(value);
if (value != MIGRATION_RESUME_ACK_VALUE) {
- error_report("%s: illegal resume_ack value %"PRIu32,
- __func__, value);
+ error_setg(errp, "illegal resume_ack value %"PRIu32, value);
return -1;
}
@@ -1867,49 +1855,47 @@ static void *source_return_path_thread(void *opaque)
uint32_t tmp32, sibling_error;
ram_addr_t start = 0; /* =0 to silence warning */
size_t len = 0, expected_len;
+ Error *err = NULL;
int res;
trace_source_return_path_thread_entry();
rcu_register_thread();
retry:
- while (!ms->rp_state.error && !qemu_file_get_error(rp) &&
+ while (!migrate_has_error(ms) && !qemu_file_get_error(rp) &&
migration_is_setup_or_active(ms->state)) {
trace_source_return_path_thread_loop_top();
+
header_type = qemu_get_be16(rp);
header_len = qemu_get_be16(rp);
if (qemu_file_get_error(rp)) {
- mark_source_rp_bad(ms);
goto out;
}
if (header_type >= MIG_RP_MSG_MAX ||
header_type == MIG_RP_MSG_INVALID) {
- error_report("RP: Received invalid message 0x%04x length 0x%04x",
- header_type, header_len);
- mark_source_rp_bad(ms);
+ error_setg(&err, "Received invalid message 0x%04x length 0x%04x",
+ header_type, header_len);
goto out;
}
if ((rp_cmd_args[header_type].len != -1 &&
header_len != rp_cmd_args[header_type].len) ||
header_len > sizeof(buf)) {
- error_report("RP: Received '%s' message (0x%04x) with"
- "incorrect length %d expecting %zu",
- rp_cmd_args[header_type].name, header_type, header_len,
- (size_t)rp_cmd_args[header_type].len);
- mark_source_rp_bad(ms);
+ error_setg(&err, "Received '%s' message (0x%04x) with"
+ "incorrect length %d expecting %zu",
+ rp_cmd_args[header_type].name, header_type, header_len,
+ (size_t)rp_cmd_args[header_type].len);
goto out;
}
/* We know we've got a valid header by this point */
res = qemu_get_buffer(rp, buf, header_len);
if (res != header_len) {
- error_report("RP: Failed reading data for message 0x%04x"
- " read %d expected %d",
- header_type, res, header_len);
- mark_source_rp_bad(ms);
+ error_setg(&err, "Failed reading data for message 0x%04x"
+ " read %d expected %d",
+ header_type, res, header_len);
goto out;
}
@@ -1919,8 +1905,7 @@ retry:
sibling_error = ldl_be_p(buf);
trace_source_return_path_thread_shut(sibling_error);
if (sibling_error) {
- error_report("RP: Sibling indicated error %d", sibling_error);
- mark_source_rp_bad(ms);
+ error_setg(&err, "Sibling indicated error %d", sibling_error);
}
/*
* We'll let the main thread deal with closing the RP
@@ -1938,7 +1923,10 @@ retry:
case MIG_RP_MSG_REQ_PAGES:
start = ldq_be_p(buf);
len = ldl_be_p(buf + 8);
- migrate_handle_rp_req_pages(ms, NULL, start, len);
+ migrate_handle_rp_req_pages(ms, NULL, start, len, &err);
+ if (err) {
+ goto out;
+ }
break;
case MIG_RP_MSG_REQ_PAGES_ID:
@@ -1953,32 +1941,32 @@ retry:
expected_len += tmp32;
}
if (header_len != expected_len) {
- error_report("RP: Req_Page_id with length %d expecting %zd",
- header_len, expected_len);
- mark_source_rp_bad(ms);
+ error_setg(&err, "Req_Page_id with length %d expecting %zd",
+ header_len, expected_len);
+ goto out;
+ }
+ migrate_handle_rp_req_pages(ms, (char *)&buf[13], start, len,
+ &err);
+ if (err) {
goto out;
}
- migrate_handle_rp_req_pages(ms, (char *)&buf[13], start, len);
break;
case MIG_RP_MSG_RECV_BITMAP:
if (header_len < 1) {
- error_report("%s: missing block name", __func__);
- mark_source_rp_bad(ms);
+ error_setg(&err, "MIG_RP_MSG_RECV_BITMAP missing block name");
goto out;
}
/* Format: len (1B) + idstr (<255B). This ends the idstr. */
buf[buf[0] + 1] = '\0';
- if (migrate_handle_rp_recv_bitmap(ms, (char *)(buf + 1))) {
- mark_source_rp_bad(ms);
+ if (migrate_handle_rp_recv_bitmap(ms, (char *)(buf + 1), &err)) {
goto out;
}
break;
case MIG_RP_MSG_RESUME_ACK:
tmp32 = ldl_be_p(buf);
- if (migrate_handle_rp_resume_ack(ms, tmp32)) {
- mark_source_rp_bad(ms);
+ if (migrate_handle_rp_resume_ack(ms, tmp32, &err)) {
goto out;
}
break;
@@ -1989,6 +1977,19 @@ retry:
}
out:
+ if (err) {
+ /*
+ * Collect any error in return-path thread and report it to the
+ * migration state object.
+ */
+ migrate_set_error(ms, err);
+ /*
+ * We lost ownership to Error*, clear it, prepared to capture the
+ * next error.
+ */
+ err = NULL;
+ }
+
res = qemu_file_get_error(rp);
if (res) {
if (res && migration_in_postcopy()) {
@@ -2004,13 +2005,11 @@ out:
* it's reset only by us above, or when migration completes
*/
rp = ms->rp_state.from_dst_file;
- ms->rp_state.error = false;
goto retry;
}
}
trace_source_return_path_thread_bad_end();
- mark_source_rp_bad(ms);
}
trace_source_return_path_thread_end();
@@ -2043,8 +2042,7 @@ static int open_return_path_on_source(MigrationState *ms,
return 0;
}
-/* Returns 0 if the RP was ok, otherwise there was an error on the RP */
-static int await_return_path_close_on_source(MigrationState *ms)
+static void await_return_path_close_on_source(MigrationState *ms)
{
/*
* If this is a normal exit then the destination will send a SHUT and the
@@ -2057,13 +2055,11 @@ static int await_return_path_close_on_source(MigrationState *ms)
* waiting for the destination.
*/
qemu_file_shutdown(ms->rp_state.from_dst_file);
- mark_source_rp_bad(ms);
}
trace_await_return_path_close_on_source_joining();
qemu_thread_join(&ms->rp_state.rp_thread);
ms->rp_state.rp_thread_created = false;
trace_await_return_path_close_on_source_close();
- return ms->rp_state.error;
}
static inline void
@@ -2366,11 +2362,11 @@ static void migration_completion(MigrationState *s)
* a SHUT command).
*/
if (s->rp_state.rp_thread_created) {
- int rp_error;
trace_migration_return_path_end_before();
- rp_error = await_return_path_close_on_source(s);
- trace_migration_return_path_end_after(rp_error);
- if (rp_error) {
+ await_return_path_close_on_source(s);
+ trace_migration_return_path_end_after();
+ /* If return path has error, should have been set here */
+ if (migrate_has_error(s)) {
goto fail;
}
}
diff --git a/migration/ram.c b/migration/ram.c
index ba4890563d..567b179fb9 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1922,7 +1922,8 @@ static void migration_page_queue_free(RAMState *rs)
* @start: starting address from the start of the RAMBlock
* @len: length (in bytes) to send
*/
-int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len)
+int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len,
+ Error **errp)
{
RAMBlock *ramblock;
RAMState *rs = ram_state;
@@ -1939,7 +1940,7 @@ int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len)
* Shouldn't happen, we can't reuse the last RAMBlock if
* it's the 1st request.
*/
- error_report("ram_save_queue_pages no previous block");
+ error_setg(errp, "MIG_RP_MSG_REQ_PAGES has no previous block");
return -1;
}
} else {
@@ -1947,16 +1948,17 @@ int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len)
if (!ramblock) {
/* We shouldn't be asked for a non-existent RAMBlock */
- error_report("ram_save_queue_pages no block '%s'", rbname);
+ error_setg(errp, "MIG_RP_MSG_REQ_PAGES has no block '%s'", rbname);
return -1;
}
rs->last_req_rb = ramblock;
}
trace_ram_save_queue_pages(ramblock->idstr, start, len);
if (!offset_in_ramblock(ramblock, start + len - 1)) {
- error_report("%s request overrun start=" RAM_ADDR_FMT " len="
- RAM_ADDR_FMT " blocklen=" RAM_ADDR_FMT,
- __func__, start, len, ramblock->used_length);
+ error_setg(errp, "MIG_RP_MSG_REQ_PAGES request overrun, "
+ "start=" RAM_ADDR_FMT " len="
+ RAM_ADDR_FMT " blocklen=" RAM_ADDR_FMT,
+ start, len, ramblock->used_length);
return -1;
}
@@ -1988,9 +1990,9 @@ int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len)
assert(len % page_size == 0);
while (len) {
if (ram_save_host_page_urgent(pss)) {
- error_report("%s: ram_save_host_page_urgent() failed: "
- "ramblock=%s, start_addr=0x"RAM_ADDR_FMT,
- __func__, ramblock->idstr, start);
+ error_setg(errp, "ram_save_host_page_urgent() failed: "
+ "ramblock=%s, start_addr=0x"RAM_ADDR_FMT,
+ ramblock->idstr, start);
ret = -1;
break;
}
@@ -4124,7 +4126,7 @@ static void ram_dirty_bitmap_reload_notify(MigrationState *s)
* This is only used when the postcopy migration is paused but wants
* to resume from a middle point.
*/
-int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block)
+int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block, Error **errp)
{
int ret = -EINVAL;
/* from_dst_file is always valid because we're within rp_thread */
@@ -4136,8 +4138,8 @@ int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block)
trace_ram_dirty_bitmap_reload_begin(block->idstr);
if (s->state != MIGRATION_STATUS_POSTCOPY_RECOVER) {
- error_report("%s: incorrect state %s", __func__,
- MigrationStatus_str(s->state));
+ error_setg(errp, "Reload bitmap in incorrect state %s",
+ MigrationStatus_str(s->state));
return -EINVAL;
}
@@ -4154,9 +4156,8 @@ int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block)
/* The size of the bitmap should match with our ramblock */
if (size != local_size) {
- error_report("%s: ramblock '%s' bitmap size mismatch "
- "(0x%"PRIx64" != 0x%"PRIx64")", __func__,
- block->idstr, size, local_size);
+ error_setg(errp, "ramblock '%s' bitmap size mismatch (0x%"PRIx64
+ " != 0x%"PRIx64")", block->idstr, size, local_size);
ret = -EINVAL;
goto out;
}
@@ -4166,16 +4167,16 @@ int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block)
ret = qemu_file_get_error(file);
if (ret || size != local_size) {
- error_report("%s: read bitmap failed for ramblock '%s': %d"
- " (size 0x%"PRIx64", got: 0x%"PRIx64")",
- __func__, block->idstr, ret, local_size, size);
+ error_setg(errp, "read bitmap failed for ramblock '%s': %d"
+ " (size 0x%"PRIx64", got: 0x%"PRIx64")",
+ block->idstr, ret, local_size, size);
ret = -EIO;
goto out;
}
if (end_mark != RAMBLOCK_RECV_BITMAP_ENDING) {
- error_report("%s: ramblock '%s' end mark incorrect: 0x%"PRIx64,
- __func__, block->idstr, end_mark);
+ error_setg(errp, "ramblock '%s' end mark incorrect: 0x%"PRIx64,
+ block->idstr, end_mark);
ret = -EINVAL;
goto out;
}
diff --git a/migration/trace-events b/migration/trace-events
index cdaef7a1ea..98aa46c780 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -161,7 +161,7 @@ migration_completion_postcopy_end_after_complete(void) ""
migration_rate_limit_pre(int ms) "%d ms"
migration_rate_limit_post(int urgent) "urgent: %d"
migration_return_path_end_before(void) ""
-migration_return_path_end_after(int rp_error) "%d"
+migration_return_path_end_after(void) ""
migration_thread_after_loop(void) ""
migration_thread_file_err(void) ""
migration_thread_setup_complete(void) ""
--
2.41.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 4/7] migration: Deliver return path file error to migrate state too
2023-06-28 21:49 [PATCH 0/7] migration: Better error handling in return path thread Peter Xu
` (2 preceding siblings ...)
2023-06-28 21:49 ` [PATCH 3/7] migration: Refactor error handling in source return path Peter Xu
@ 2023-06-28 21:49 ` Peter Xu
2023-06-28 22:52 ` Fabiano Rosas
2023-06-28 21:50 ` [PATCH 5/7] migration: Display error in query-migrate irrelevant of status Peter Xu
` (2 subsequent siblings)
6 siblings, 1 reply; 12+ messages in thread
From: Peter Xu @ 2023-06-28 21:49 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, Leonardo Bras Soares Passos, Fabiano Rosas, Juan Quintela,
Lukas Straub
We've already did this for most of the return path thread errors, but not
yet for the IO errors happened on the return path qemufile. Do that too.
Remember to reset "err" always, because the ownership is not us anymore,
otherwise we're prone to use-after-free later after recovered.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/migration/migration.c b/migration/migration.c
index f8c41c4d98..234dd3601d 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1992,6 +1992,13 @@ out:
res = qemu_file_get_error(rp);
if (res) {
+ /* We have forwarded any error in "err" already, reuse "error" */
+ assert(err == NULL);
+ /* Try to deliver this file error to migration state */
+ qemu_file_get_error_obj(rp, &err);
+ migrate_set_error(ms, err);
+ err = NULL;
+
if (res && migration_in_postcopy()) {
/*
* Maybe there is something we can do: it looks like a
--
2.41.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 5/7] migration: Display error in query-migrate irrelevant of status
2023-06-28 21:49 [PATCH 0/7] migration: Better error handling in return path thread Peter Xu
` (3 preceding siblings ...)
2023-06-28 21:49 ` [PATCH 4/7] migration: Deliver return path file error to migrate state too Peter Xu
@ 2023-06-28 21:50 ` Peter Xu
2023-06-28 23:01 ` Fabiano Rosas
2023-06-28 21:50 ` [PATCH 6/7] qemufile: Always return a verbose error Peter Xu
2023-06-28 21:50 ` [PATCH 7/7] migration: Provide explicit error message for file shutdowns Peter Xu
6 siblings, 1 reply; 12+ messages in thread
From: Peter Xu @ 2023-06-28 21:50 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, Leonardo Bras Soares Passos, Fabiano Rosas, Juan Quintela,
Lukas Straub
Display it as long as being set, irrelevant of FAILED status. E.g., it may
also be applicable to PAUSED stage of postcopy, to provide hint on what has
gone wrong.
The error_mutex seems to be overlooked when referencing the error, add it
to be very safe.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index 234dd3601d..7455353918 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1033,9 +1033,6 @@ static void fill_source_migration_info(MigrationInfo *info)
break;
case MIGRATION_STATUS_FAILED:
info->has_status = true;
- if (s->error) {
- info->error_desc = g_strdup(error_get_pretty(s->error));
- }
break;
case MIGRATION_STATUS_CANCELLED:
info->has_status = true;
@@ -1045,6 +1042,11 @@ static void fill_source_migration_info(MigrationInfo *info)
break;
}
info->status = state;
+
+ QEMU_LOCK_GUARD(&s->error_mutex);
+ if (s->error) {
+ info->error_desc = g_strdup(error_get_pretty(s->error));
+ }
}
static void fill_destination_migration_info(MigrationInfo *info)
--
2.41.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 6/7] qemufile: Always return a verbose error
2023-06-28 21:49 [PATCH 0/7] migration: Better error handling in return path thread Peter Xu
` (4 preceding siblings ...)
2023-06-28 21:50 ` [PATCH 5/7] migration: Display error in query-migrate irrelevant of status Peter Xu
@ 2023-06-28 21:50 ` Peter Xu
2023-06-28 21:50 ` [PATCH 7/7] migration: Provide explicit error message for file shutdowns Peter Xu
6 siblings, 0 replies; 12+ messages in thread
From: Peter Xu @ 2023-06-28 21:50 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, Leonardo Bras Soares Passos, Fabiano Rosas, Juan Quintela,
Lukas Straub
There're a lot of cases where we only have an errno set in last_error but
without a detailed error description. When this happens, try to generate
an error contains the errno as a descriptive error.
This will be helpful in cases where one relies on the Error*. E.g.,
migration state only caches Error* in MigrationState.error. With this,
we'll display correct error messages in e.g. query-migrate when the error
was only set by qemu_file_set_error().
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/qemu-file.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index acc282654a..419b4092e7 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -156,15 +156,24 @@ void qemu_file_set_hooks(QEMUFile *f, const QEMUFileHooks *hooks)
*
* Return negative error value if there has been an error on previous
* operations, return 0 if no error happened.
- * Optional, it returns Error* in errp, but it may be NULL even if return value
- * is not 0.
*
+ * If errp is specified, a verbose error message will be copied over.
*/
int qemu_file_get_error_obj(QEMUFile *f, Error **errp)
{
+ if (!f->last_error) {
+ return 0;
+ }
+
+ /* There is an error */
if (errp) {
- *errp = f->last_error_obj ? error_copy(f->last_error_obj) : NULL;
+ if (f->last_error_obj) {
+ *errp = error_copy(f->last_error_obj);
+ } else {
+ error_setg_errno(errp, -f->last_error, "Channel error");
+ }
}
+
return f->last_error;
}
--
2.41.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 7/7] migration: Provide explicit error message for file shutdowns
2023-06-28 21:49 [PATCH 0/7] migration: Better error handling in return path thread Peter Xu
` (5 preceding siblings ...)
2023-06-28 21:50 ` [PATCH 6/7] qemufile: Always return a verbose error Peter Xu
@ 2023-06-28 21:50 ` Peter Xu
6 siblings, 0 replies; 12+ messages in thread
From: Peter Xu @ 2023-06-28 21:50 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, Leonardo Bras Soares Passos, Fabiano Rosas, Juan Quintela,
Lukas Straub
Provide an explicit reason for qemu_file_shutdown()s, which can be
displayed in query-migrate when used.
This will make e.g. migrate-pause to display explicit error descriptions,
from:
"error-desc": "Channel error: Input/output error"
To:
"error-desc": "Channel is explicitly shutdown by the user"
in query-migrate.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/qemu-file.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 419b4092e7..ff605027de 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -87,7 +87,10 @@ int qemu_file_shutdown(QEMUFile *f)
* --> guest crash!
*/
if (!f->last_error) {
- qemu_file_set_error(f, -EIO);
+ Error *err = NULL;
+
+ error_setg(&err, "Channel is explicitly shutdown by the user");
+ qemu_file_set_error_obj(f, -EIO, err);
}
if (!qio_channel_has_feature(f->ioc,
--
2.41.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH 3/7] migration: Refactor error handling in source return path
2023-06-28 21:49 ` [PATCH 3/7] migration: Refactor error handling in source return path Peter Xu
@ 2023-06-28 22:51 ` Fabiano Rosas
0 siblings, 0 replies; 12+ messages in thread
From: Fabiano Rosas @ 2023-06-28 22:51 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: peterx, Leonardo Bras Soares Passos, Juan Quintela, Lukas Straub
Peter Xu <peterx@redhat.com> writes:
> rp_state.error was a boolean used to show error happened in return path
> thread. That's not only duplicating error reporting (migrate_set_error),
> but also not good enough in that we only do error_report() and set it to
> true, we never can keep a history of the exact error and show it in
> query-migrate.
>
> To make this better, a few things done:
>
> - Use error_setg() rather than error_report() across the whole lifecycle
> of return path thread, keeping the error in an Error*.
>
> - Use migrate_set_error() to apply that captured error to the global
> migration object when error occured in this thread.
>
> - With above, no need to have mark_source_rp_bad(), remove it, alongside
> with rp_state.error itself.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 4/7] migration: Deliver return path file error to migrate state too
2023-06-28 21:49 ` [PATCH 4/7] migration: Deliver return path file error to migrate state too Peter Xu
@ 2023-06-28 22:52 ` Fabiano Rosas
0 siblings, 0 replies; 12+ messages in thread
From: Fabiano Rosas @ 2023-06-28 22:52 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: peterx, Leonardo Bras Soares Passos, Juan Quintela, Lukas Straub
Peter Xu <peterx@redhat.com> writes:
> We've already did this for most of the return path thread errors, but not
> yet for the IO errors happened on the return path qemufile. Do that too.
>
> Remember to reset "err" always, because the ownership is not us anymore,
> otherwise we're prone to use-after-free later after recovered.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 5/7] migration: Display error in query-migrate irrelevant of status
2023-06-28 21:50 ` [PATCH 5/7] migration: Display error in query-migrate irrelevant of status Peter Xu
@ 2023-06-28 23:01 ` Fabiano Rosas
2023-06-29 19:56 ` Peter Xu
0 siblings, 1 reply; 12+ messages in thread
From: Fabiano Rosas @ 2023-06-28 23:01 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: peterx, Leonardo Bras Soares Passos, Juan Quintela, Lukas Straub
Peter Xu <peterx@redhat.com> writes:
> Display it as long as being set, irrelevant of FAILED status. E.g., it may
> also be applicable to PAUSED stage of postcopy, to provide hint on what has
> gone wrong.
This might have made the documentation slightly inaccurate:
# @error-desc: the human readable error description string, when
# @status is 'failed'. Clients should not attempt to parse the
# error strings. (Since 2.7)
But it's not wrong, so:
Reviewed-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 5/7] migration: Display error in query-migrate irrelevant of status
2023-06-28 23:01 ` Fabiano Rosas
@ 2023-06-29 19:56 ` Peter Xu
0 siblings, 0 replies; 12+ messages in thread
From: Peter Xu @ 2023-06-29 19:56 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, Leonardo Bras Soares Passos, Juan Quintela,
Lukas Straub
On Wed, Jun 28, 2023 at 08:01:22PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > Display it as long as being set, irrelevant of FAILED status. E.g., it may
> > also be applicable to PAUSED stage of postcopy, to provide hint on what has
> > gone wrong.
>
> This might have made the documentation slightly inaccurate:
Hmm yes, maybe I should touch that up so as to include "postcopy-paused",
or just remove the statement that it must be in a "failed" stage.
>
> # @error-desc: the human readable error description string, when
> # @status is 'failed'. Clients should not attempt to parse the
> # error strings. (Since 2.7)
>
> But it's not wrong, so:
>
> Reviewed-by: Fabiano Rosas <farosas@suse.de>
Thanks for taking a look.
--
Peter Xu
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2023-06-29 19:57 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-06-28 21:49 [PATCH 0/7] migration: Better error handling in return path thread Peter Xu
2023-06-28 21:49 ` [PATCH 1/7] migration: Let migrate_set_error() take ownership Peter Xu
2023-06-28 21:49 ` [PATCH 2/7] migration: Introduce migrate_has_error() Peter Xu
2023-06-28 21:49 ` [PATCH 3/7] migration: Refactor error handling in source return path Peter Xu
2023-06-28 22:51 ` Fabiano Rosas
2023-06-28 21:49 ` [PATCH 4/7] migration: Deliver return path file error to migrate state too Peter Xu
2023-06-28 22:52 ` Fabiano Rosas
2023-06-28 21:50 ` [PATCH 5/7] migration: Display error in query-migrate irrelevant of status Peter Xu
2023-06-28 23:01 ` Fabiano Rosas
2023-06-29 19:56 ` Peter Xu
2023-06-28 21:50 ` [PATCH 6/7] qemufile: Always return a verbose error Peter Xu
2023-06-28 21:50 ` [PATCH 7/7] migration: Provide explicit error message for file shutdowns Peter Xu
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).