* [PATCH v5 1/8] migration: Fix possible race when setting rp_state.error
2023-08-31 18:39 [PATCH v5 0/8] Fix segfault on migration return path Fabiano Rosas
@ 2023-08-31 18:39 ` Fabiano Rosas
2023-08-31 18:39 ` [PATCH v5 2/8] migration: Fix possible races when shutting down the return path Fabiano Rosas
` (6 subsequent siblings)
7 siblings, 0 replies; 13+ messages in thread
From: Fabiano Rosas @ 2023-08-31 18:39 UTC
To: qemu-devel; +Cc: Juan Quintela, Peter Xu, Wei Wang, Leonardo Bras
We don't need to set the rp_state.error right after a shutdown because
qemu_file_shutdown() always sets the QEMUFile error, so the return
path thread would have seen it and set the rp error itself.
Setting the error outside of the thread is also racy because the
thread could clear it after we set it.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
migration/migration.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/migration/migration.c b/migration/migration.c
index 5528acb65e..f88c86079c 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2062,7 +2062,6 @@ static int await_return_path_close_on_source(MigrationState *ms)
* waiting for the destination.
*/
qemu_file_shutdown(ms->rp_state.from_dst_file);
- mark_source_rp_bad(ms);
}
trace_await_return_path_close_on_source_joining();
qemu_thread_join(&ms->rp_state.rp_thread);
--
2.35.3
* [PATCH v5 2/8] migration: Fix possible races when shutting down the return path
2023-08-31 18:39 [PATCH v5 0/8] Fix segfault on migration return path Fabiano Rosas
2023-08-31 18:39 ` [PATCH v5 1/8] migration: Fix possible race when setting rp_state.error Fabiano Rosas
@ 2023-08-31 18:39 ` Fabiano Rosas
2023-08-31 18:39 ` [PATCH v5 3/8] migration: Fix possible race when shutting down to_dst_file Fabiano Rosas
` (5 subsequent siblings)
7 siblings, 0 replies; 13+ messages in thread
From: Fabiano Rosas @ 2023-08-31 18:39 UTC
To: qemu-devel; +Cc: Juan Quintela, Peter Xu, Wei Wang, Leonardo Bras
We cannot call qemu_file_shutdown() on the return path file without
taking the file lock. The return path thread could be running its
cleanup code and have just cleared the from_dst_file pointer.
Checking ms->to_dst_file for errors could also race with
migrate_fd_cleanup() which clears the to_dst_file pointer.
Protect both accesses by taking the file lock.
This was caught by inspection. The race should be rare, but the next
patches will start calling this code from other places, so let's do
the correct thing.
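As a rough illustration of the rule above (check and use the pointer
under the same lock the clearing side takes), here is a minimal sketch
using plain pthreads. DemoFile, DemoState and the demo_* functions are
hypothetical stand-ins invented for this example; they are not QEMU
types or APIs.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a QEMUFile; not the real QEMU type. */
typedef struct {
    int error;
    bool shut_down;
} DemoFile;

/* Hypothetical stand-in for the MigrationState fields involved. */
typedef struct {
    pthread_mutex_t file_lock;
    DemoFile *to_dst;    /* may be reset to NULL by a cleanup path */
    DemoFile *from_dst;  /* may be reset to NULL by another thread */
} DemoState;

/*
 * Shut down from_dst only if both files still exist and to_dst has an
 * error. The NULL checks and the dereferences happen under the lock
 * that the clearing side also takes, so neither pointer can be reset
 * between the check and the use. Returns true if a shutdown was issued.
 */
bool demo_shutdown_rp_if_needed(DemoState *s)
{
    bool done = false;

    pthread_mutex_lock(&s->file_lock);
    if (s->to_dst && s->from_dst && s->to_dst->error) {
        s->from_dst->shut_down = true;
        done = true;
    }
    pthread_mutex_unlock(&s->file_lock);

    return done;
}

/* The clearing side: reset the pointer under the same lock. */
DemoFile *demo_take_from_dst(DemoState *s)
{
    DemoFile *tmp;

    pthread_mutex_lock(&s->file_lock);
    tmp = s->from_dst;
    s->from_dst = NULL;
    pthread_mutex_unlock(&s->file_lock);

    return tmp;
}
```

Once the pointer has been taken by the clearing side, a later call to
the checking function sees NULL under the lock and does nothing.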
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
migration/migration.c | 19 ++++++++++---------
1 file changed, 10 insertions(+), 9 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index f88c86079c..85c171f32c 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2052,17 +2052,18 @@ static int open_return_path_on_source(MigrationState *ms,
static int await_return_path_close_on_source(MigrationState *ms)
{
/*
- * If this is a normal exit then the destination will send a SHUT and the
- * rp_thread will exit, however if there's an error we need to cause
- * it to exit.
+ * If this is a normal exit then the destination will send a SHUT
+ * and the rp_thread will exit, however if there's an error we
+ * need to cause it to exit. shutdown(2), if we have it, will
+ * cause it to unblock if it's stuck waiting for the destination.
*/
- if (qemu_file_get_error(ms->to_dst_file) && ms->rp_state.from_dst_file) {
- /*
- * shutdown(2), if we have it, will cause it to unblock if it's stuck
- * waiting for the destination.
- */
- qemu_file_shutdown(ms->rp_state.from_dst_file);
+ WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
+ if (ms->to_dst_file && ms->rp_state.from_dst_file &&
+ qemu_file_get_error(ms->to_dst_file)) {
+ qemu_file_shutdown(ms->rp_state.from_dst_file);
+ }
}
+
trace_await_return_path_close_on_source_joining();
qemu_thread_join(&ms->rp_state.rp_thread);
ms->rp_state.rp_thread_created = false;
--
2.35.3
* [PATCH v5 3/8] migration: Fix possible race when shutting down to_dst_file
2023-08-31 18:39 [PATCH v5 0/8] Fix segfault on migration return path Fabiano Rosas
2023-08-31 18:39 ` [PATCH v5 1/8] migration: Fix possible race when setting rp_state.error Fabiano Rosas
2023-08-31 18:39 ` [PATCH v5 2/8] migration: Fix possible races when shutting down the return path Fabiano Rosas
@ 2023-08-31 18:39 ` Fabiano Rosas
2023-08-31 18:39 ` [PATCH v5 4/8] migration: Remove redundant cleanup of postcopy_qemufile_src Fabiano Rosas
` (4 subsequent siblings)
7 siblings, 0 replies; 13+ messages in thread
From: Fabiano Rosas @ 2023-08-31 18:39 UTC
To: qemu-devel; +Cc: Juan Quintela, Peter Xu, Wei Wang, Leonardo Bras
It's not safe to call qemu_file_shutdown() on the to_dst_file without
first checking for the file's presence under the lock. The cleanup of
this file happens at postcopy_pause() and migrate_fd_cleanup() which
are not necessarily running in the same thread as migrate_fd_cancel().
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
migration/migration.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index 85c171f32c..5e6a766235 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1245,7 +1245,7 @@ static void migrate_fd_error(MigrationState *s, const Error *error)
static void migrate_fd_cancel(MigrationState *s)
{
int old_state ;
- QEMUFile *f = migrate_get_current()->to_dst_file;
+
trace_migrate_fd_cancel();
WITH_QEMU_LOCK_GUARD(&s->qemu_file_lock) {
@@ -1271,11 +1271,13 @@ static void migrate_fd_cancel(MigrationState *s)
* If we're unlucky the migration code might be stuck somewhere in a
* send/write while the network has failed and is waiting to timeout;
* if we've got shutdown(2) available then we can force it to quit.
- * The outgoing qemu file gets closed in migrate_fd_cleanup that is
- * called in a bh, so there is no race against this cancel.
*/
- if (s->state == MIGRATION_STATUS_CANCELLING && f) {
- qemu_file_shutdown(f);
+ if (s->state == MIGRATION_STATUS_CANCELLING) {
+ WITH_QEMU_LOCK_GUARD(&s->qemu_file_lock) {
+ if (s->to_dst_file) {
+ qemu_file_shutdown(s->to_dst_file);
+ }
+ }
}
if (s->state == MIGRATION_STATUS_CANCELLING && s->block_inactive) {
Error *local_err = NULL;
@@ -1519,12 +1521,14 @@ void qmp_migrate_pause(Error **errp)
{
MigrationState *ms = migrate_get_current();
MigrationIncomingState *mis = migration_incoming_get_current();
- int ret;
+ int ret = 0;
if (ms->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
/* Source side, during postcopy */
qemu_mutex_lock(&ms->qemu_file_lock);
- ret = qemu_file_shutdown(ms->to_dst_file);
+ if (ms->to_dst_file) {
+ ret = qemu_file_shutdown(ms->to_dst_file);
+ }
qemu_mutex_unlock(&ms->qemu_file_lock);
if (ret) {
error_setg(errp, "Failed to pause source migration");
--
2.35.3
* [PATCH v5 4/8] migration: Remove redundant cleanup of postcopy_qemufile_src
2023-08-31 18:39 [PATCH v5 0/8] Fix segfault on migration return path Fabiano Rosas
` (2 preceding siblings ...)
2023-08-31 18:39 ` [PATCH v5 3/8] migration: Fix possible race when shutting down to_dst_file Fabiano Rosas
@ 2023-08-31 18:39 ` Fabiano Rosas
2023-08-31 18:39 ` [PATCH v5 5/8] migration: Consolidate return path closing code Fabiano Rosas
` (3 subsequent siblings)
7 siblings, 0 replies; 13+ messages in thread
From: Fabiano Rosas @ 2023-08-31 18:39 UTC
To: qemu-devel; +Cc: Juan Quintela, Peter Xu, Wei Wang, Leonardo Bras
This file is owned by the return path thread which is already doing
cleanup.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
migration/migration.c | 6 ------
1 file changed, 6 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index 5e6a766235..195726eb4a 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1177,12 +1177,6 @@ static void migrate_fd_cleanup(MigrationState *s)
qemu_fclose(tmp);
}
- if (s->postcopy_qemufile_src) {
- migration_ioc_unregister_yank_from_file(s->postcopy_qemufile_src);
- qemu_fclose(s->postcopy_qemufile_src);
- s->postcopy_qemufile_src = NULL;
- }
-
assert(!migration_is_active(s));
if (s->state == MIGRATION_STATUS_CANCELLING) {
--
2.35.3
* [PATCH v5 5/8] migration: Consolidate return path closing code
2023-08-31 18:39 [PATCH v5 0/8] Fix segfault on migration return path Fabiano Rosas
` (3 preceding siblings ...)
2023-08-31 18:39 ` [PATCH v5 4/8] migration: Remove redundant cleanup of postcopy_qemufile_src Fabiano Rosas
@ 2023-08-31 18:39 ` Fabiano Rosas
2023-08-31 18:39 ` [PATCH v5 6/8] migration: Replace the return path retry logic Fabiano Rosas
` (2 subsequent siblings)
7 siblings, 0 replies; 13+ messages in thread
From: Fabiano Rosas @ 2023-08-31 18:39 UTC
To: qemu-devel; +Cc: Juan Quintela, Peter Xu, Wei Wang, Leonardo Bras
We'll start calling the await_return_path_close_on_source() function
from other parts of the code, so move all of the related checks and
tracepoints into it.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
migration/migration.c | 29 ++++++++++++++---------------
1 file changed, 14 insertions(+), 15 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index 195726eb4a..4edbee3a5d 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2049,6 +2049,14 @@ static int open_return_path_on_source(MigrationState *ms,
/* Returns 0 if the RP was ok, otherwise there was an error on the RP */
static int await_return_path_close_on_source(MigrationState *ms)
{
+ int ret;
+
+ if (!ms->rp_state.rp_thread_created) {
+ return 0;
+ }
+
+ trace_migration_return_path_end_before();
+
/*
* If this is a normal exit then the destination will send a SHUT
* and the rp_thread will exit, however if there's an error we
@@ -2066,7 +2074,10 @@ static int await_return_path_close_on_source(MigrationState *ms)
qemu_thread_join(&ms->rp_state.rp_thread);
ms->rp_state.rp_thread_created = false;
trace_await_return_path_close_on_source_close();
- return ms->rp_state.error;
+
+ ret = ms->rp_state.error;
+ trace_migration_return_path_end_after(ret);
+ return ret;
}
static inline void
@@ -2362,20 +2373,8 @@ static void migration_completion(MigrationState *s)
goto fail;
}
- /*
- * If rp was opened we must clean up the thread before
- * cleaning everything else up (since if there are no failures
- * it will wait for the destination to send it's status in
- * a SHUT command).
- */
- if (s->rp_state.rp_thread_created) {
- int rp_error;
- trace_migration_return_path_end_before();
- rp_error = await_return_path_close_on_source(s);
- trace_migration_return_path_end_after(rp_error);
- if (rp_error) {
- goto fail;
- }
+ if (await_return_path_close_on_source(s)) {
+ goto fail;
}
if (qemu_file_get_error(s->to_dst_file)) {
--
2.35.3
* [PATCH v5 6/8] migration: Replace the return path retry logic
2023-08-31 18:39 [PATCH v5 0/8] Fix segfault on migration return path Fabiano Rosas
` (4 preceding siblings ...)
2023-08-31 18:39 ` [PATCH v5 5/8] migration: Consolidate return path closing code Fabiano Rosas
@ 2023-08-31 18:39 ` Fabiano Rosas
2023-08-31 18:39 ` [PATCH v5 7/8] migration: Move return path cleanup to main migration thread Fabiano Rosas
2023-08-31 18:39 ` [PATCH v5 8/8] migration: Add a wrapper to cleanup migration files Fabiano Rosas
7 siblings, 0 replies; 13+ messages in thread
From: Fabiano Rosas @ 2023-08-31 18:39 UTC
To: qemu-devel; +Cc: Juan Quintela, Peter Xu, Wei Wang, Leonardo Bras
Replace the return path retry logic with finishing and restarting the
thread. This fixes a race when resuming the migration that leads to a
segfault.
Currently when doing postcopy we consider that an IO error on the
return path file could be due to a network intermittency. We then keep
the thread alive but have it do cleanup of the 'from_dst_file' and
wait on the 'postcopy_pause_rp' semaphore. When the user issues a
migrate resume, a new return path is opened and the thread is allowed
to continue.
There's a race condition in the above mechanism. It is possible for
the new return path file to be set up *before* the cleanup code in the
return path thread has had a chance to run, leading to the *new* file
being closed and the pointer set to NULL. When the thread is released
after the resume, it tries to dereference 'from_dst_file' and crashes:
Thread 7 "return path" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd1dbf700 (LWP 9611)]
0x00005555560e4893 in qemu_file_get_error_obj (f=0x0, errp=0x0) at ../migration/qemu-file.c:154
154 return f->last_error;
(gdb) bt
#0 0x00005555560e4893 in qemu_file_get_error_obj (f=0x0, errp=0x0) at ../migration/qemu-file.c:154
#1 0x00005555560e4983 in qemu_file_get_error (f=0x0) at ../migration/qemu-file.c:206
#2 0x0000555555b9a1df in source_return_path_thread (opaque=0x555556e06000) at ../migration/migration.c:1876
#3 0x000055555602e14f in qemu_thread_start (args=0x55555782e780) at ../util/qemu-thread-posix.c:541
#4 0x00007ffff38d76ea in start_thread (arg=0x7fffd1dbf700) at pthread_create.c:477
#5 0x00007ffff35efa6f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Here's the race (important bit is open_return_path happening before
migration_release_dst_files):
migration                 | qmp                         | return path
--------------------------+-----------------------------+---------------------------------
                            qmp_migrate_pause()
                             shutdown(ms->to_dst_file)
                              f->last_error = -EIO
migrate_detect_error()
 postcopy_pause()
  set_state(PAUSED)
  wait(postcopy_pause_sem)
                            qmp_migrate(resume)
                            migrate_fd_connect()
                             resume = state == PAUSED
                             open_return_path <-- TOO SOON!
                             set_state(RECOVER)
                             post(postcopy_pause_sem)
                                                          (incoming closes to_src_file)
                                                          res = qemu_file_get_error(rp)
                                                          migration_release_dst_files()
                                                          ms->rp_state.from_dst_file = NULL
  post(postcopy_pause_rp_sem)
                                                          postcopy_pause_return_path_thread()
                                                            wait(postcopy_pause_rp_sem)
                                                          rp = ms->rp_state.from_dst_file
                                                          goto retry
                                                          qemu_file_get_error(rp)
                                                          SIGSEGV
-------------------------------------------------------------------------------------------
We can keep the retry logic without having the thread alive and
waiting. The only piece of data used by it is the 'from_dst_file' and
it is only allowed to proceed after a migrate resume is issued and the
semaphore released at migrate_fd_connect().
Move the retry logic to outside the thread by waiting for the thread
to finish before pausing the migration.
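The idea of finishing and re-creating the thread, rather than parking
it on a semaphore, can be sketched in isolation as below. This is only
an illustration with plain pthreads: DemoChannel, DemoRp and the
demo_rp_* functions are made-up names, and the "channel" is just a
struct with a flag that models an I/O error.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical channel; 'broken' models an I/O error on the return path. */
typedef struct {
    bool broken;
    int messages_left;
} DemoChannel;

/* Hypothetical return path state owned by the controlling thread. */
typedef struct {
    bool thread_created;
    pthread_t thread;
    DemoChannel *chan;  /* changed only while no worker is running */
    bool error;
    int handled;
} DemoRp;

/* Worker: handle messages until done or an error is seen, then exit. */
void *demo_rp_thread(void *opaque)
{
    DemoRp *rp = opaque;
    DemoChannel *chan = rp->chan;

    while (!chan->broken && chan->messages_left > 0) {
        chan->messages_left--;
        rp->handled++;
    }
    if (chan->broken) {
        rp->error = true;
    }
    return NULL;
}

/* Attach a channel and start a fresh worker for it. */
int demo_rp_open(DemoRp *rp, DemoChannel *chan)
{
    rp->chan = chan;
    if (pthread_create(&rp->thread, NULL, demo_rp_thread, rp) != 0) {
        rp->chan = NULL;
        return -1;
    }
    rp->thread_created = true;
    return 0;
}

/*
 * Wait for the worker to finish, then reset the state so a later
 * demo_rp_open() starts clean. Returns whether the worker saw an error.
 */
bool demo_rp_close(DemoRp *rp)
{
    bool ret;

    if (!rp->thread_created) {
        return false;
    }
    pthread_join(rp->thread, NULL);
    rp->thread_created = false;

    ret = rp->error;
    rp->error = false;
    rp->chan = NULL;
    return ret;
}
```

Because the channel pointer is only changed after the join, no worker
can be left holding a stale or replaced pointer when a new channel is
attached on resume.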
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
migration/migration.c | 60 ++++++++-----------------------------------
migration/migration.h | 1 -
2 files changed, 11 insertions(+), 50 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index 4edbee3a5d..7dfcbc3634 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1775,18 +1775,6 @@ static void migrate_handle_rp_req_pages(MigrationState *ms, const char* rbname,
}
}
-/* Return true to retry, false to quit */
-static bool postcopy_pause_return_path_thread(MigrationState *s)
-{
- trace_postcopy_pause_return_path();
-
- qemu_sem_wait(&s->postcopy_pause_rp_sem);
-
- trace_postcopy_pause_return_path_continued();
-
- return true;
-}
-
static int migrate_handle_rp_recv_bitmap(MigrationState *s, char *block_name)
{
RAMBlock *block = qemu_ram_block_by_name(block_name);
@@ -1870,7 +1858,6 @@ static void *source_return_path_thread(void *opaque)
trace_source_return_path_thread_entry();
rcu_register_thread();
-retry:
while (!ms->rp_state.error && !qemu_file_get_error(rp) &&
migration_is_setup_or_active(ms->state)) {
trace_source_return_path_thread_loop_top();
@@ -1992,26 +1979,7 @@ retry:
}
out:
- res = qemu_file_get_error(rp);
- if (res) {
- if (res && migration_in_postcopy()) {
- /*
- * Maybe there is something we can do: it looks like a
- * network down issue, and we pause for a recovery.
- */
- migration_release_dst_files(ms);
- rp = NULL;
- if (postcopy_pause_return_path_thread(ms)) {
- /*
- * Reload rp, reset the rest. Referencing it is safe since
- * it's reset only by us above, or when migration completes
- */
- rp = ms->rp_state.from_dst_file;
- ms->rp_state.error = false;
- goto retry;
- }
- }
-
+ if (qemu_file_get_error(rp)) {
trace_source_return_path_thread_bad_end();
mark_source_rp_bad(ms);
}
@@ -2022,8 +1990,7 @@ out:
return NULL;
}
-static int open_return_path_on_source(MigrationState *ms,
- bool create_thread)
+static int open_return_path_on_source(MigrationState *ms)
{
ms->rp_state.from_dst_file = qemu_file_get_return_path(ms->to_dst_file);
if (!ms->rp_state.from_dst_file) {
@@ -2032,11 +1999,6 @@ static int open_return_path_on_source(MigrationState *ms,
trace_open_return_path_on_source();
- if (!create_thread) {
- /* We're done */
- return 0;
- }
-
qemu_thread_create(&ms->rp_state.rp_thread, "return path",
source_return_path_thread, ms, QEMU_THREAD_JOINABLE);
ms->rp_state.rp_thread_created = true;
@@ -2076,6 +2038,7 @@ static int await_return_path_close_on_source(MigrationState *ms)
trace_await_return_path_close_on_source_close();
ret = ms->rp_state.error;
+ ms->rp_state.error = false;
trace_migration_return_path_end_after(ret);
return ret;
}
@@ -2551,6 +2514,13 @@ static MigThrError postcopy_pause(MigrationState *s)
qemu_file_shutdown(file);
qemu_fclose(file);
+ /*
+ * We're already pausing, so ignore any errors on the return
+ * path and just wait for the thread to finish. It will be
+ * re-created when we resume.
+ */
+ await_return_path_close_on_source(s);
+
migrate_set_state(&s->state, s->state,
MIGRATION_STATUS_POSTCOPY_PAUSED);
@@ -2568,12 +2538,6 @@ static MigThrError postcopy_pause(MigrationState *s)
if (s->state == MIGRATION_STATUS_POSTCOPY_RECOVER) {
/* Woken up by a recover procedure. Give it a shot */
- /*
- * Firstly, let's wake up the return path now, with a new
- * return path channel.
- */
- qemu_sem_post(&s->postcopy_pause_rp_sem);
-
/* Do the resume logic */
if (postcopy_do_resume(s) == 0) {
/* Let's continue! */
@@ -3263,7 +3227,7 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
* QEMU uses the return path.
*/
if (migrate_postcopy_ram() || migrate_return_path()) {
- if (open_return_path_on_source(s, !resume)) {
+ if (open_return_path_on_source(s)) {
error_setg(&local_err, "Unable to open return-path for postcopy");
migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
migrate_set_error(s, local_err);
@@ -3327,7 +3291,6 @@ static void migration_instance_finalize(Object *obj)
qemu_sem_destroy(&ms->rate_limit_sem);
qemu_sem_destroy(&ms->pause_sem);
qemu_sem_destroy(&ms->postcopy_pause_sem);
- qemu_sem_destroy(&ms->postcopy_pause_rp_sem);
qemu_sem_destroy(&ms->rp_state.rp_sem);
qemu_sem_destroy(&ms->rp_state.rp_pong_acks);
qemu_sem_destroy(&ms->postcopy_qemufile_src_sem);
@@ -3347,7 +3310,6 @@ static void migration_instance_init(Object *obj)
migrate_params_init(&ms->parameters);
qemu_sem_init(&ms->postcopy_pause_sem, 0);
- qemu_sem_init(&ms->postcopy_pause_rp_sem, 0);
qemu_sem_init(&ms->rp_state.rp_sem, 0);
qemu_sem_init(&ms->rp_state.rp_pong_acks, 0);
qemu_sem_init(&ms->rate_limit_sem, 0);
diff --git a/migration/migration.h b/migration/migration.h
index 6eea18db36..36eb5ba70b 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -382,7 +382,6 @@ struct MigrationState {
/* Needed by postcopy-pause state */
QemuSemaphore postcopy_pause_sem;
- QemuSemaphore postcopy_pause_rp_sem;
/*
* Whether we abort the migration if decompression errors are
* detected at the destination. It is left at false for qemu
--
2.35.3
* [PATCH v5 7/8] migration: Move return path cleanup to main migration thread
2023-08-31 18:39 [PATCH v5 0/8] Fix segfault on migration return path Fabiano Rosas
` (5 preceding siblings ...)
2023-08-31 18:39 ` [PATCH v5 6/8] migration: Replace the return path retry logic Fabiano Rosas
@ 2023-08-31 18:39 ` Fabiano Rosas
2023-08-31 18:39 ` [PATCH v5 8/8] migration: Add a wrapper to cleanup migration files Fabiano Rosas
7 siblings, 0 replies; 13+ messages in thread
From: Fabiano Rosas @ 2023-08-31 18:39 UTC
To: qemu-devel; +Cc: Juan Quintela, Peter Xu, Wei Wang, Leonardo Bras
Now that the return path thread is allowed to finish during a paused
migration, we can move the cleanup of the QEMUFiles to the main
migration thread.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
migration/migration.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/migration/migration.c b/migration/migration.c
index 7dfcbc3634..7fec57ad7f 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -98,6 +98,7 @@ static int migration_maybe_pause(MigrationState *s,
int *current_active_state,
int new_state);
static void migrate_fd_cancel(MigrationState *s);
+static int await_return_path_close_on_source(MigrationState *s);
static bool migration_needs_multiple_sockets(void)
{
@@ -1177,6 +1178,12 @@ static void migrate_fd_cleanup(MigrationState *s)
qemu_fclose(tmp);
}
+ /*
+ * We already cleaned up to_dst_file, so errors from the return
+ * path might be due to that, ignore them.
+ */
+ await_return_path_close_on_source(s);
+
assert(!migration_is_active(s));
if (s->state == MIGRATION_STATUS_CANCELLING) {
@@ -1985,7 +1992,6 @@ out:
}
trace_source_return_path_thread_end();
- migration_release_dst_files(ms);
rcu_unregister_thread();
return NULL;
}
@@ -2039,6 +2045,9 @@ static int await_return_path_close_on_source(MigrationState *ms)
ret = ms->rp_state.error;
ms->rp_state.error = false;
+
+ migration_release_dst_files(ms);
+
trace_migration_return_path_end_after(ret);
return ret;
}
--
2.35.3
* [PATCH v5 8/8] migration: Add a wrapper to cleanup migration files
2023-08-31 18:39 [PATCH v5 0/8] Fix segfault on migration return path Fabiano Rosas
` (6 preceding siblings ...)
2023-08-31 18:39 ` [PATCH v5 7/8] migration: Move return path cleanup to main migration thread Fabiano Rosas
@ 2023-08-31 18:39 ` Fabiano Rosas
2023-09-01 16:05 ` Peter Xu
7 siblings, 1 reply; 13+ messages in thread
From: Fabiano Rosas @ 2023-08-31 18:39 UTC
To: qemu-devel; +Cc: Juan Quintela, Peter Xu, Wei Wang, Leonardo Bras, Lukas Straub
We currently have a pattern for cleaning up a migration QEMUFile:

  qemu_mutex_lock(&s->qemu_file_lock);
  file = s->file_name;
  s->file_name = NULL;
  qemu_mutex_unlock(&s->qemu_file_lock);

  migration_ioc_unregister_yank_from_file(file);
  qemu_file_shutdown(file);
  qemu_fclose(file);

This sequence requires some consideration about locking to avoid
TOC/TOU bugs and avoid passing NULL into the functions that don't
expect it.
There's no need to call shutdown() right before close(), and a
shutdown() issued in another thread as a means to unblock a file
should not collide with this close().
Create a wrapper function to make sure the locking is being done
properly. Remove the extra shutdown().
The yank is linked to the QIOChannel, so if more than one QEMUFile
share the same channel, care must be taken to (un)register only one
yank function.
Move the yank unregister before clearing the pointer, so we can avoid
locking and add a comment explaining we're only using the QEMUFile as
a way to access the channel.
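For readers who want to try the release pattern outside of QEMU, a
minimal standalone sketch with plain pthreads might look like the
following. DemoFile, demo_file_close() and demo_file_release() are
hypothetical names for this example only, and the lock is passed in
explicitly, which is an assumption of the sketch rather than something
the patch does.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical heap-allocated file object; not a QEMUFile. */
typedef struct {
    int fd;
} DemoFile;

/* Counts how many times a file was closed; for illustration only. */
static int demo_close_count;

/* Like many close functions, this one does not accept NULL. */
void demo_file_close(DemoFile *f)
{
    assert(f != NULL);
    demo_close_count++;
    free(f);
}

/*
 * Reset the shared pointer under the lock, then close the file outside
 * the critical section. Concurrent users that check the pointer under
 * the same lock see either the valid file or NULL, and a second call
 * is a no-op instead of a double close.
 */
void demo_file_release(pthread_mutex_t *lock, DemoFile **file)
{
    DemoFile *tmp;

    pthread_mutex_lock(lock);
    tmp = *file;
    *file = NULL;
    pthread_mutex_unlock(lock);

    if (tmp) {
        demo_file_close(tmp);
    }
}
```

Taking a pointer to the pointer lets one helper serve several file
fields, at the cost of trusting the caller to pass a field that is
really protected by the given lock.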
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
migration/migration.c | 93 ++++++++++++--------------------------
migration/yank_functions.c | 5 ++
2 files changed, 35 insertions(+), 63 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index 7fec57ad7f..99d21c3442 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -217,6 +217,25 @@ MigrationIncomingState *migration_incoming_get_current(void)
return current_incoming;
}
+static void migration_file_release(QEMUFile **file)
+{
+ MigrationState *ms = migrate_get_current();
+ QEMUFile *tmp;
+
+ /*
+ * Reset the pointer before releasing it to avoid holding the lock
+ * for too long.
+ */
+ WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
+ tmp = *file;
+ *file = NULL;
+ }
+
+ if (tmp) {
+ qemu_fclose(tmp);
+ }
+}
+
void migration_incoming_transport_cleanup(MigrationIncomingState *mis)
{
if (mis->socket_address_list) {
@@ -1155,8 +1174,6 @@ static void migrate_fd_cleanup(MigrationState *s)
qemu_savevm_state_cleanup();
if (s->to_dst_file) {
- QEMUFile *tmp;
-
trace_migrate_fd_cleanup();
qemu_mutex_unlock_iothread();
if (s->migration_thread_running) {
@@ -1166,16 +1183,9 @@ static void migrate_fd_cleanup(MigrationState *s)
qemu_mutex_lock_iothread();
multifd_save_cleanup();
- qemu_mutex_lock(&s->qemu_file_lock);
- tmp = s->to_dst_file;
- s->to_dst_file = NULL;
- qemu_mutex_unlock(&s->qemu_file_lock);
- /*
- * Close the file handle without the lock to make sure the
- * critical section won't block for long.
- */
- migration_ioc_unregister_yank_from_file(tmp);
- qemu_fclose(tmp);
+
+ migration_ioc_unregister_yank_from_file(s->to_dst_file);
+ migration_file_release(&s->to_dst_file);
}
/*
@@ -1815,38 +1825,6 @@ static int migrate_handle_rp_resume_ack(MigrationState *s, uint32_t value)
return 0;
}
-/*
- * Release ms->rp_state.from_dst_file (and postcopy_qemufile_src if
- * existed) in a safe way.
- */
-static void migration_release_dst_files(MigrationState *ms)
-{
- QEMUFile *file;
-
- WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
- /*
- * Reset the from_dst_file pointer first before releasing it, as we
- * can't block within lock section
- */
- file = ms->rp_state.from_dst_file;
- ms->rp_state.from_dst_file = NULL;
- }
-
- /*
- * Do the same to postcopy fast path socket too if there is. No
- * locking needed because this qemufile should only be managed by
- * return path thread.
- */
- if (ms->postcopy_qemufile_src) {
- migration_ioc_unregister_yank_from_file(ms->postcopy_qemufile_src);
- qemu_file_shutdown(ms->postcopy_qemufile_src);
- qemu_fclose(ms->postcopy_qemufile_src);
- ms->postcopy_qemufile_src = NULL;
- }
-
- qemu_fclose(file);
-}
-
/*
* Handles messages sent on the return path towards the source VM
*
@@ -2046,7 +2024,12 @@ static int await_return_path_close_on_source(MigrationState *ms)
ret = ms->rp_state.error;
ms->rp_state.error = false;
- migration_release_dst_files(ms);
+ migration_file_release(&ms->rp_state.from_dst_file);
+
+ if (ms->postcopy_qemufile_src) {
+ migration_ioc_unregister_yank_from_file(ms->postcopy_qemufile_src);
+ }
+ migration_file_release(&ms->postcopy_qemufile_src);
trace_migration_return_path_end_after(ret);
return ret;
@@ -2502,26 +2485,10 @@ static MigThrError postcopy_pause(MigrationState *s)
assert(s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
while (true) {
- QEMUFile *file;
-
- /*
- * Current channel is possibly broken. Release it. Note that this is
- * guaranteed even without lock because to_dst_file should only be
- * modified by the migration thread. That also guarantees that the
- * unregister of yank is safe too without the lock. It should be safe
- * even to be within the qemu_file_lock, but we didn't do that to avoid
- * taking more mutex (yank_lock) within qemu_file_lock. TL;DR: we make
- * the qemu_file_lock critical section as small as possible.
- */
+ /* Current channel is possibly broken. Release it. */
assert(s->to_dst_file);
migration_ioc_unregister_yank_from_file(s->to_dst_file);
- qemu_mutex_lock(&s->qemu_file_lock);
- file = s->to_dst_file;
- s->to_dst_file = NULL;
- qemu_mutex_unlock(&s->qemu_file_lock);
-
- qemu_file_shutdown(file);
- qemu_fclose(file);
+ migration_file_release(&s->to_dst_file);
/*
* We're already pausing, so ignore any errors on the return
diff --git a/migration/yank_functions.c b/migration/yank_functions.c
index d5a710a3f2..31b0d790e2 100644
--- a/migration/yank_functions.c
+++ b/migration/yank_functions.c
@@ -48,6 +48,11 @@ void migration_ioc_unregister_yank(QIOChannel *ioc)
}
}
+/*
+ * There's no direct relationship between the QEMUFile and the
+ * yank. This is just a convenience helper because the QIOChannel and
+ * the QEMUFile lifecycles happen to match.
+ */
void migration_ioc_unregister_yank_from_file(QEMUFile *file)
{
QIOChannel *ioc = qemu_file_get_ioc(file);
--
2.35.3
* Re: [PATCH v5 8/8] migration: Add a wrapper to cleanup migration files
2023-08-31 18:39 ` [PATCH v5 8/8] migration: Add a wrapper to cleanup migration files Fabiano Rosas
@ 2023-09-01 16:05 ` Peter Xu
2023-09-01 18:29 ` Fabiano Rosas
0 siblings, 1 reply; 13+ messages in thread
From: Peter Xu @ 2023-09-01 16:05 UTC
To: Fabiano Rosas
Cc: qemu-devel, Juan Quintela, Wei Wang, Leonardo Bras, Lukas Straub
On Thu, Aug 31, 2023 at 03:39:16PM -0300, Fabiano Rosas wrote:
> @@ -1166,16 +1183,9 @@ static void migrate_fd_cleanup(MigrationState *s)
> qemu_mutex_lock_iothread();
>
> multifd_save_cleanup();
> - qemu_mutex_lock(&s->qemu_file_lock);
> - tmp = s->to_dst_file;
> - s->to_dst_file = NULL;
> - qemu_mutex_unlock(&s->qemu_file_lock);
> - /*
> - * Close the file handle without the lock to make sure the
> - * critical section won't block for long.
> - */
> - migration_ioc_unregister_yank_from_file(tmp);
> - qemu_fclose(tmp);
> +
> + migration_ioc_unregister_yank_from_file(s->to_dst_file);
I think you suggested that we should always take the file lock when
operating on them, so this is slightly going backwards to not hold any lock
when doing it. But doing so in migrate_fd_cleanup() is probably fine (as it
serializes with bql on all the rest qmp commands, neither should migration
thread exist at this point). Your call; it's still much cleaner.
Reviewed-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
* Re: [PATCH v5 8/8] migration: Add a wrapper to cleanup migration files
2023-09-01 16:05 ` Peter Xu
@ 2023-09-01 18:29 ` Fabiano Rosas
2023-09-05 15:34 ` Peter Xu
0 siblings, 1 reply; 13+ messages in thread
From: Fabiano Rosas @ 2023-09-01 18:29 UTC
To: Peter Xu; +Cc: qemu-devel, Juan Quintela, Wei Wang, Leonardo Bras, Lukas Straub
Peter Xu <peterx@redhat.com> writes:
> On Thu, Aug 31, 2023 at 03:39:16PM -0300, Fabiano Rosas wrote:
>> @@ -1166,16 +1183,9 @@ static void migrate_fd_cleanup(MigrationState *s)
>> qemu_mutex_lock_iothread();
>>
>> multifd_save_cleanup();
>> - qemu_mutex_lock(&s->qemu_file_lock);
>> - tmp = s->to_dst_file;
>> - s->to_dst_file = NULL;
>> - qemu_mutex_unlock(&s->qemu_file_lock);
>> - /*
>> - * Close the file handle without the lock to make sure the
>> - * critical section won't block for long.
>> - */
>> - migration_ioc_unregister_yank_from_file(tmp);
>> - qemu_fclose(tmp);
>> +
>> + migration_ioc_unregister_yank_from_file(s->to_dst_file);
>
> I think you suggested that we should always take the file lock when
> operating on them, so this is slightly going backwards to not hold any lock
> when doing it. But doing so in migrate_fd_cleanup() is probably fine (as it
> serializes with bql on all the rest qmp commands, neither should migration
> thread exist at this point). Your call; it's still much cleaner.
I think I was mistaken. We need the lock on the thread that clears the
pointer so that we can safely dereference it on another thread under the
lock.
Here we're accessing it from the same thread that later does the
clearing. So that's a slightly different problem.
* Re: [PATCH v5 8/8] migration: Add a wrapper to cleanup migration files
2023-09-01 18:29 ` Fabiano Rosas
@ 2023-09-05 15:34 ` Peter Xu
2023-09-05 17:25 ` Fabiano Rosas
0 siblings, 1 reply; 13+ messages in thread
From: Peter Xu @ 2023-09-05 15:34 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, Juan Quintela, Wei Wang, Leonardo Bras, Lukas Straub
On Fri, Sep 01, 2023 at 03:29:51PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > On Thu, Aug 31, 2023 at 03:39:16PM -0300, Fabiano Rosas wrote:
> >> @@ -1166,16 +1183,9 @@ static void migrate_fd_cleanup(MigrationState *s)
> >> qemu_mutex_lock_iothread();
> >>
> >> multifd_save_cleanup();
> >> - qemu_mutex_lock(&s->qemu_file_lock);
> >> - tmp = s->to_dst_file;
> >> - s->to_dst_file = NULL;
> >> - qemu_mutex_unlock(&s->qemu_file_lock);
> >> - /*
> >> - * Close the file handle without the lock to make sure the
> >> - * critical section won't block for long.
> >> - */
> >> - migration_ioc_unregister_yank_from_file(tmp);
> >> - qemu_fclose(tmp);
> >> +
> >> + migration_ioc_unregister_yank_from_file(s->to_dst_file);
> >
> > I think you suggested that we should always take the file lock when
> > operating on them, so this is slightly going backwards to not hold any lock
> > when doing it. But doing so in migrate_fd_cleanup() is probably fine (as it
> > serializes with bql on all the rest qmp commands, neither should migration
> > thread exist at this point). Your call; it's still much cleaner.
>
> I think I was mistaken. We need the lock on the thread that clears the
> pointer so that we can safely dereference it on another thread under the
> lock.
>
> Here we're accessing it from the same thread that later does the
> clearing. So that's a slightly different problem.
But this is not the only place to clear it, so you still need to justify
why the other call sites (e.g., postcopy_pause()) won't happen in parallel
with this call site.
The good thing about your proposal (of always taking that lock) is we avoid
those justifications, as you said before. :)
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v5 8/8] migration: Add a wrapper to cleanup migration files
2023-09-05 15:34 ` Peter Xu
@ 2023-09-05 17:25 ` Fabiano Rosas
0 siblings, 0 replies; 13+ messages in thread
From: Fabiano Rosas @ 2023-09-05 17:25 UTC (permalink / raw)
To: Peter Xu; +Cc: qemu-devel, Juan Quintela, Wei Wang, Leonardo Bras, Lukas Straub
Peter Xu <peterx@redhat.com> writes:
> On Fri, Sep 01, 2023 at 03:29:51PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > On Thu, Aug 31, 2023 at 03:39:16PM -0300, Fabiano Rosas wrote:
>> >> @@ -1166,16 +1183,9 @@ static void migrate_fd_cleanup(MigrationState *s)
>> >> qemu_mutex_lock_iothread();
>> >>
>> >> multifd_save_cleanup();
>> >> - qemu_mutex_lock(&s->qemu_file_lock);
>> >> - tmp = s->to_dst_file;
>> >> - s->to_dst_file = NULL;
>> >> - qemu_mutex_unlock(&s->qemu_file_lock);
>> >> - /*
>> >> - * Close the file handle without the lock to make sure the
>> >> - * critical section won't block for long.
>> >> - */
>> >> - migration_ioc_unregister_yank_from_file(tmp);
>> >> - qemu_fclose(tmp);
>> >> +
>> >> + migration_ioc_unregister_yank_from_file(s->to_dst_file);
>> >
>> > I think you suggested that we should always take the file lock when
>> > operating on them, so this is slightly going backwards to not hold any lock
>> > when doing it. But doing so in migrate_fd_cleanup() is probably fine (as it
>> > serializes with bql on all the rest qmp commands, neither should migration
>> > thread exist at this point). Your call; it's still much cleaner.
>>
>> I think I was mistaken. We need the lock on the thread that clears the
>> pointer so that we can safely dereference it on another thread under the
>> lock.
>>
>> Here we're accessing it from the same thread that later does the
>> clearing. So that's a slightly different problem.
>
> But this is not the only place to clear it, so you still need to justify
> why the other call sites (e.g., postcopy_pause() won't happen in parallel
> with this call site.
>
> The good thing about your proposal (of always taking that lock) is we avoid
> those justifications, as you said before. :)
>
Yes, I should probably try harder to keep it under the lock.
The issue is that, without using the QIOChannel reference count or
keeping a flag, there's no way to pair the register/unregister of the
yank, because 1) we can never be sure whether the yank was previously
registered when calling the unregister, and 2) we don't store the ioc,
so we need to access it from the QEMUFile, but several QEMUFiles can
share the same ioc.
The easiest way to keep it under the lock would be to add a flag:
migration_file_release(QEMUFile **file, bool unregister_yank);
... and only set it when we're sure the yank has been registered. It is
still a bit hand-wavy though.
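
One possible reading of the flag idea, as a rough sketch only: RelFile,
demo_file_lock, demo_unregister_yank() and demo_migration_file_release()
are hypothetical names invented here, and the stub merely counts calls
instead of touching a real yank instance. The wrapper detaches the
pointer under the lock and unregisters the yank only when the caller
says it was registered.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for QEMUFile; not the real type. */
typedef struct RelFile {
    int unused;
} RelFile;

/* Plays the role of qemu_file_lock in this sketch. */
pthread_mutex_t demo_file_lock = PTHREAD_MUTEX_INITIALIZER;

/* Counts how often the yank stub below ran (illustration only). */
int demo_yank_unregistered;

/* Stub standing in for the real yank unregistration. */
void demo_unregister_yank(RelFile *f)
{
    (void)f;
    demo_yank_unregistered++;
}

/*
 * Assumed shape of the proposed wrapper: the caller passes
 * unregister_yank = true only when it knows the yank was registered
 * for this file.
 */
void demo_migration_file_release(RelFile **file, bool unregister_yank)
{
    RelFile *tmp;

    assert(file != NULL);

    /* Detach under the lock so no other thread can see a stale pointer. */
    pthread_mutex_lock(&demo_file_lock);
    tmp = *file;
    *file = NULL;
    pthread_mutex_unlock(&demo_file_lock);

    if (!tmp) {
        return; /* already released elsewhere */
    }

    if (unregister_yank) {
        demo_unregister_yank(tmp);
    }
    free(tmp);
}
```

The caller still has to know whether the yank was registered, so this
moves the pairing problem into the flag rather than solving it.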
^ permalink raw reply [flat|nested] 13+ messages in thread