* [PATCH v2 1/6] migration/multifd: Join the TLS thread
2024-02-05 19:49 [PATCH v2 0/6] migration/multifd: Fix channel creation vs. cleanup races Fabiano Rosas
@ 2024-02-05 19:49 ` Fabiano Rosas
2024-02-06 8:53 ` Daniel P. Berrangé
2024-02-05 19:49 ` [PATCH v2 2/6] migration/multifd: Remove p->running Fabiano Rosas
` (5 subsequent siblings)
6 siblings, 1 reply; 16+ messages in thread
From: Fabiano Rosas @ 2024-02-05 19:49 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Xu, Avihai Horon, Daniel P . Berrangé, qemu-stable
We're currently leaking the resources of the TLS thread by not joining
it and also overwriting the p->thread pointer altogether.
Fixes: a1af605bd5 ("migration/multifd: fix hangup with TLS-Multifd due to blocking handshake")
Cc: qemu-stable <qemu-stable@nongnu.org>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
migration/multifd.c | 8 +++++++-
migration/multifd.h | 2 ++
2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/migration/multifd.c b/migration/multifd.c
index ef13e2e781..8195c1daf3 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -630,6 +630,10 @@ static void multifd_send_terminate_threads(void)
for (i = 0; i < migrate_multifd_channels(); i++) {
MultiFDSendParams *p = &multifd_send_state->params[i];
+ if (p->tls_thread_created) {
+ qemu_thread_join(&p->tls_thread);
+ }
+
if (p->running) {
qemu_thread_join(&p->thread);
}
@@ -921,7 +925,9 @@ static bool multifd_tls_channel_connect(MultiFDSendParams *p,
trace_multifd_tls_outgoing_handshake_start(ioc, tioc, hostname);
qio_channel_set_name(QIO_CHANNEL(tioc), "multifd-tls-outgoing");
p->c = QIO_CHANNEL(tioc);
- qemu_thread_create(&p->thread, "multifd-tls-handshake-worker",
+
+ p->tls_thread_created = true;
+ qemu_thread_create(&p->tls_thread, "multifd-tls-handshake-worker",
multifd_tls_handshake_thread, p,
QEMU_THREAD_JOINABLE);
return true;
diff --git a/migration/multifd.h b/migration/multifd.h
index 78a2317263..720c9d50db 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -73,6 +73,8 @@ typedef struct {
char *name;
/* channel thread id */
QemuThread thread;
+ QemuThread tls_thread;
+ bool tls_thread_created;
/* communication channel */
QIOChannel *c;
/* is the yank function registered */
--
2.35.3
* Re: [PATCH v2 1/6] migration/multifd: Join the TLS thread
2024-02-05 19:49 ` [PATCH v2 1/6] migration/multifd: Join the TLS thread Fabiano Rosas
@ 2024-02-06 8:53 ` Daniel P. Berrangé
2024-02-06 9:15 ` Peter Xu
From: Daniel P. Berrangé @ 2024-02-06 8:53 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: qemu-devel, Peter Xu, Avihai Horon, qemu-stable
On Mon, Feb 05, 2024 at 04:49:24PM -0300, Fabiano Rosas wrote:
> We're currently leaking the resources of the TLS thread by not joining
> it and also overwriting the p->thread pointer altogether.
AFAICS, it is not overwriting 'p->thread' because at the time when the
TLS thread is created, the main 'send thread' has not yet been
created. The TLS thread and send thread execution times are mutually
exclusive.
The 'p->running' flag is already set to true when the TLS thread is
created, so the existing cleanup should be working too, so I'm not
seeing a bug that needs fixing here.
>
> Fixes: a1af605bd5 ("migration/multifd: fix hangup with TLS-Multifd due to blocking handshake")
> Cc: qemu-stable <qemu-stable@nongnu.org>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
> migration/multifd.c | 8 +++++++-
> migration/multifd.h | 2 ++
> 2 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index ef13e2e781..8195c1daf3 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -630,6 +630,10 @@ static void multifd_send_terminate_threads(void)
> for (i = 0; i < migrate_multifd_channels(); i++) {
> MultiFDSendParams *p = &multifd_send_state->params[i];
>
> + if (p->tls_thread_created) {
> + qemu_thread_join(&p->tls_thread);
> + }
> +
> if (p->running) {
> qemu_thread_join(&p->thread);
> }
> @@ -921,7 +925,9 @@ static bool multifd_tls_channel_connect(MultiFDSendParams *p,
> trace_multifd_tls_outgoing_handshake_start(ioc, tioc, hostname);
> qio_channel_set_name(QIO_CHANNEL(tioc), "multifd-tls-outgoing");
> p->c = QIO_CHANNEL(tioc);
> - qemu_thread_create(&p->thread, "multifd-tls-handshake-worker",
> +
> + p->tls_thread_created = true;
> + qemu_thread_create(&p->tls_thread, "multifd-tls-handshake-worker",
> multifd_tls_handshake_thread, p,
> QEMU_THREAD_JOINABLE);
> return true;
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 78a2317263..720c9d50db 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -73,6 +73,8 @@ typedef struct {
> char *name;
> /* channel thread id */
> QemuThread thread;
> + QemuThread tls_thread;
> + bool tls_thread_created;
> /* communication channel */
> QIOChannel *c;
> /* is the yank function registered */
> --
> 2.35.3
>
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* Re: [PATCH v2 1/6] migration/multifd: Join the TLS thread
2024-02-06 8:53 ` Daniel P. Berrangé
@ 2024-02-06 9:15 ` Peter Xu
2024-02-06 10:06 ` Daniel P. Berrangé
From: Peter Xu @ 2024-02-06 9:15 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: Fabiano Rosas, qemu-devel, Avihai Horon, qemu-stable
On Tue, Feb 06, 2024 at 08:53:45AM +0000, Daniel P. Berrangé wrote:
> AFAICS, it is not overwriting 'p->thread' because at the time when the
> TLS thread is created, the main 'send thread' has not yet been
> created. The TLS thread and send thread execution times are mutually
> exclusive.
IIUC it'll be overwritten after the TLS handshake, where the TLS thread
uses multifd_channel_connect() to create the ultimate multifd thread with
the same p->thread variable:
qemu_thread_create(&p->thread, p->name, multifd_send_thread, p,
QEMU_THREAD_JOINABLE);
There it'll overwrite the old value stored in p->thread, hence the TLS
thread's resources will be leaked until QEMU quits, since both threads
are created JOINABLE.
Thanks,
--
Peter Xu
* Re: [PATCH v2 1/6] migration/multifd: Join the TLS thread
2024-02-06 9:15 ` Peter Xu
@ 2024-02-06 10:06 ` Daniel P. Berrangé
From: Daniel P. Berrangé @ 2024-02-06 10:06 UTC (permalink / raw)
To: Peter Xu; +Cc: Fabiano Rosas, qemu-devel, Avihai Horon, qemu-stable
On Tue, Feb 06, 2024 at 05:15:07PM +0800, Peter Xu wrote:
> On Tue, Feb 06, 2024 at 08:53:45AM +0000, Daniel P. Berrangé wrote:
> > AFAICS, it is not overwriting 'p->thread' because at the time when the
> > TLS thread is created, the main 'send thread' has not yet been
> > created. The TLS thread and send thread execution times are mutually
> > exclusive.
>
> IIUC it'll be overwritten after the TLS handshake, where the TLS thread
> uses multifd_channel_connect() to create the ultimate multifd thread with
> the same p->thread variable:
>
> qemu_thread_create(&p->thread, p->name, multifd_send_thread, p,
> QEMU_THREAD_JOINABLE);
>
> There it'll overwrite the old value stored in p->thread, hence the TLS
> thread's resources will be leaked until QEMU quits, since both threads
> are created JOINABLE.
Ah yes, missed that, you're right.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* [PATCH v2 2/6] migration/multifd: Remove p->running
2024-02-05 19:49 [PATCH v2 0/6] migration/multifd: Fix channel creation vs. cleanup races Fabiano Rosas
2024-02-05 19:49 ` [PATCH v2 1/6] migration/multifd: Join the TLS thread Fabiano Rosas
@ 2024-02-05 19:49 ` Fabiano Rosas
2024-02-05 19:49 ` [PATCH v2 3/6] migration/multifd: Move multifd_send_setup error handling in to the function Fabiano Rosas
` (4 subsequent siblings)
From: Fabiano Rosas @ 2024-02-05 19:49 UTC (permalink / raw)
To: qemu-devel
Cc: Peter Xu, Avihai Horon, Daniel P . Berrangé, qemu-stable,
chenyuhui5
We currently only need p->running to avoid calling qemu_thread_join()
on a thread that has never been created.
However, there are at least two bugs in this logic:
1) On the sending side, p->running is set too early and
qemu_thread_create() can be skipped due to an error during TLS
handshake, leaving the flag set and leading to a crash when
multifd_send_cleanup() calls qemu_thread_join().
2) During exit, the multifd thread clears the flag while holding the
channel lock. The counterpart at multifd_send_cleanup() reads the flag
outside of the lock and might free the mutex while the multifd thread
still has it locked.
Fix the first issue by setting the flag right before creating the
thread. Rename it from p->running to p->thread_created to clarify its
usage.
Fix the second issue by not clearing the flag at the multifd thread
exit. We don't have any use for that.
Note that these bugs are straightforward logic issues and not race
conditions. There is still a gap for races to affect this code due to
multifd_send_cleanup() being allowed to run concurrently with the
thread creation loop. This issue is solved in the next patches.
Cc: qemu-stable <qemu-stable@nongnu.org>
Fixes: 29647140157a ("migration/tls: add support for multifd tls-handshake")
Reported-by: Avihai Horon <avihaih@nvidia.com>
Reported-by: <chenyuhui5@huawei.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
migration/multifd.c | 27 ++++++++++++---------------
migration/multifd.h | 7 ++-----
2 files changed, 14 insertions(+), 20 deletions(-)
diff --git a/migration/multifd.c b/migration/multifd.c
index 8195c1daf3..515d88e04b 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -634,7 +634,7 @@ static void multifd_send_terminate_threads(void)
qemu_thread_join(&p->tls_thread);
}
- if (p->running) {
+ if (p->thread_created) {
qemu_thread_join(&p->thread);
}
}
@@ -862,7 +862,6 @@ out:
error_free(local_err);
}
- p->running = false;
rcu_unregister_thread();
migration_threads_remove(thread);
trace_multifd_send_thread_end(p->id, p->packets_sent, p->total_normal_pages);
@@ -953,6 +952,8 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
migration_ioc_register_yank(ioc);
p->registered_yank = true;
p->c = ioc;
+
+ p->thread_created = true;
qemu_thread_create(&p->thread, p->name, multifd_send_thread, p,
QEMU_THREAD_JOINABLE);
return true;
@@ -967,7 +968,6 @@ static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
trace_multifd_new_send_channel_async(p->id);
if (!qio_task_propagate_error(task, &local_err)) {
qio_channel_set_delay(ioc, false);
- p->running = true;
if (multifd_channel_connect(p, ioc, &local_err)) {
return;
}
@@ -1128,15 +1128,15 @@ void multifd_recv_cleanup(void)
for (i = 0; i < migrate_multifd_channels(); i++) {
MultiFDRecvParams *p = &multifd_recv_state->params[i];
- if (p->running) {
- /*
- * multifd_recv_thread may hung at MULTIFD_FLAG_SYNC handle code,
- * however try to wakeup it without harm in cleanup phase.
- */
- qemu_sem_post(&p->sem_sync);
- }
+ /*
+ * multifd_recv_thread may hung at MULTIFD_FLAG_SYNC handle code,
+ * however try to wakeup it without harm in cleanup phase.
+ */
+ qemu_sem_post(&p->sem_sync);
- qemu_thread_join(&p->thread);
+ if (p->thread_created) {
+ qemu_thread_join(&p->thread);
+ }
}
for (i = 0; i < migrate_multifd_channels(); i++) {
multifd_recv_cleanup_channel(&multifd_recv_state->params[i]);
@@ -1222,9 +1222,6 @@ static void *multifd_recv_thread(void *opaque)
multifd_recv_terminate_threads(local_err);
error_free(local_err);
}
- qemu_mutex_lock(&p->mutex);
- p->running = false;
- qemu_mutex_unlock(&p->mutex);
rcu_unregister_thread();
trace_multifd_recv_thread_end(p->id, p->packets_recved, p->total_normal_pages);
@@ -1330,7 +1327,7 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp)
p->c = ioc;
object_ref(OBJECT(ioc));
- p->running = true;
+ p->thread_created = true;
qemu_thread_create(&p->thread, p->name, multifd_recv_thread, p,
QEMU_THREAD_JOINABLE);
qatomic_inc(&multifd_recv_state->count);
diff --git a/migration/multifd.h b/migration/multifd.h
index 720c9d50db..7881980ee6 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -73,6 +73,7 @@ typedef struct {
char *name;
/* channel thread id */
QemuThread thread;
+ bool thread_created;
QemuThread tls_thread;
bool tls_thread_created;
/* communication channel */
@@ -93,8 +94,6 @@ typedef struct {
/* syncs main thread and channels */
QemuSemaphore sem_sync;
- /* is this channel thread running */
- bool running;
/* multifd flags for each packet */
uint32_t flags;
/*
@@ -143,6 +142,7 @@ typedef struct {
char *name;
/* channel thread id */
QemuThread thread;
+ bool thread_created;
/* communication channel */
QIOChannel *c;
/* packet allocated len */
@@ -157,8 +157,6 @@ typedef struct {
/* this mutex protects the following parameters */
QemuMutex mutex;
- /* is this channel thread running */
- bool running;
/* should this thread finish */
bool quit;
/* multifd flags for each packet */
@@ -217,4 +215,3 @@ static inline void multifd_send_prepare_header(MultiFDSendParams *p)
#endif
-
--
2.35.3
* [PATCH v2 3/6] migration/multifd: Move multifd_send_setup error handling in to the function
2024-02-05 19:49 [PATCH v2 0/6] migration/multifd: Fix channel creation vs. cleanup races Fabiano Rosas
2024-02-05 19:49 ` [PATCH v2 1/6] migration/multifd: Join the TLS thread Fabiano Rosas
2024-02-05 19:49 ` [PATCH v2 2/6] migration/multifd: Remove p->running Fabiano Rosas
@ 2024-02-05 19:49 ` Fabiano Rosas
2024-02-05 19:49 ` [PATCH v2 4/6] migration/multifd: Move multifd_send_setup into migration thread Fabiano Rosas
` (3 subsequent siblings)
From: Fabiano Rosas @ 2024-02-05 19:49 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Xu, Avihai Horon, Daniel P . Berrangé
Hide the error handling inside multifd_send_setup to make it cleaner
for the next patch to move the function around.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
migration/migration.c | 6 +-----
migration/multifd.c | 24 +++++++++++++++++-------
migration/multifd.h | 2 +-
3 files changed, 19 insertions(+), 13 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index ba99772e76..2942f8cf42 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3623,11 +3623,7 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
return;
}
- if (multifd_send_setup(&local_err) != 0) {
- migrate_set_error(s, local_err);
- error_report_err(local_err);
- migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
- MIGRATION_STATUS_FAILED);
+ if (!multifd_send_setup()) {
migrate_fd_cleanup(s);
return;
}
diff --git a/migration/multifd.c b/migration/multifd.c
index 515d88e04b..cc10be2c3f 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -985,14 +985,16 @@ static void multifd_new_send_channel_create(gpointer opaque)
socket_send_channel_create(multifd_new_send_channel_async, opaque);
}
-int multifd_send_setup(Error **errp)
+bool multifd_send_setup(void)
{
- int thread_count;
+ MigrationState *s = migrate_get_current();
+ Error *local_err = NULL;
+ int thread_count, ret = 0;
uint32_t page_count = MULTIFD_PACKET_SIZE / qemu_target_page_size();
uint8_t i;
if (!migrate_multifd()) {
- return 0;
+ return true;
}
thread_count = migrate_multifd_channels();
@@ -1026,14 +1028,22 @@ int multifd_send_setup(Error **errp)
for (i = 0; i < thread_count; i++) {
MultiFDSendParams *p = &multifd_send_state->params[i];
- int ret;
- ret = multifd_send_state->ops->send_setup(p, errp);
+ ret = multifd_send_state->ops->send_setup(p, &local_err);
if (ret) {
- return ret;
+ break;
}
}
- return 0;
+
+ if (ret) {
+ migrate_set_error(s, local_err);
+ error_report_err(local_err);
+ migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
+ MIGRATION_STATUS_FAILED);
+ return false;
+ }
+
+ return true;
}
struct {
diff --git a/migration/multifd.h b/migration/multifd.h
index 7881980ee6..8a1cad0996 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -13,7 +13,7 @@
#ifndef QEMU_MIGRATION_MULTIFD_H
#define QEMU_MIGRATION_MULTIFD_H
-int multifd_send_setup(Error **errp);
+bool multifd_send_setup(void);
void multifd_send_shutdown(void);
int multifd_recv_setup(Error **errp);
void multifd_recv_cleanup(void);
--
2.35.3
* [PATCH v2 4/6] migration/multifd: Move multifd_send_setup into migration thread
2024-02-05 19:49 [PATCH v2 0/6] migration/multifd: Fix channel creation vs. cleanup races Fabiano Rosas
` (2 preceding siblings ...)
2024-02-05 19:49 ` [PATCH v2 3/6] migration/multifd: Move multifd_send_setup error handling in to the function Fabiano Rosas
@ 2024-02-05 19:49 ` Fabiano Rosas
2024-02-05 19:49 ` [PATCH v2 5/6] migration/multifd: Unify multifd and TLS connection paths Fabiano Rosas
` (2 subsequent siblings)
From: Fabiano Rosas @ 2024-02-05 19:49 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Xu, Avihai Horon, Daniel P . Berrangé
We currently have an unfavorable situation around multifd channel
creation and the migration thread's execution.
We create the multifd channels with qio_channel_socket_connect_async
-> qio_task_run_in_thread, but only connect them at the
multifd_new_send_channel_async callback, called from
qio_task_complete, which is registered as a glib event.
So at multifd_send_setup() we create the channels, but they will only
be actually usable after the whole multifd_send_setup() calling stack
returns back to the main loop. Which means that the migration thread
is already up and running without any possibility for the multifd
channels to be ready on time.
We currently rely on the channels-ready semaphore blocking
multifd_send_sync_main() until channels start to come up and release
it. However, bugs have recently been found where a channel's
creation fails and multifd_send_cleanup() is allowed to run while
other channels are still being created.
Let's start to organize this situation by moving the
multifd_send_setup() call into the migration thread. That way we
unblock the main-loop to dispatch the completion callbacks and
actually have a chance of getting the multifd channels ready for when
the migration thread needs them.
The next patches will deal with the synchronization aspects.
Note that this takes multifd_send_setup() out of the BQL.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
migration/migration.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index 2942f8cf42..0675e12c64 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3315,6 +3315,10 @@ static void *migration_thread(void *opaque)
object_ref(OBJECT(s));
update_iteration_initial_status(s);
+ if (!multifd_send_setup()) {
+ goto out;
+ }
+
bql_lock();
qemu_savevm_state_header(s->to_dst_file);
bql_unlock();
@@ -3386,6 +3390,7 @@ static void *migration_thread(void *opaque)
urgent = migration_rate_limit();
}
+out:
trace_migration_thread_after_loop();
migration_iteration_finish(s);
object_unref(OBJECT(s));
@@ -3623,11 +3628,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
return;
}
- if (!multifd_send_setup()) {
- migrate_fd_cleanup(s);
- return;
- }
-
if (migrate_background_snapshot()) {
qemu_thread_create(&s->thread, "bg_snapshot",
bg_migration_thread, s, QEMU_THREAD_JOINABLE);
--
2.35.3
* [PATCH v2 5/6] migration/multifd: Unify multifd and TLS connection paths
2024-02-05 19:49 [PATCH v2 0/6] migration/multifd: Fix channel creation vs. cleanup races Fabiano Rosas
` (3 preceding siblings ...)
2024-02-05 19:49 ` [PATCH v2 4/6] migration/multifd: Move multifd_send_setup into migration thread Fabiano Rosas
@ 2024-02-05 19:49 ` Fabiano Rosas
2024-02-06 3:33 ` Peter Xu
2024-02-06 12:44 ` Avihai Horon
2024-02-05 19:49 ` [PATCH v2 6/6] migration/multifd: Add a synchronization point for channel creation Fabiano Rosas
2024-02-06 3:42 ` [PATCH v2 0/6] migration/multifd: Fix channel creation vs. cleanup races Peter Xu
From: Fabiano Rosas @ 2024-02-05 19:49 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Xu, Avihai Horon, Daniel P . Berrangé
During multifd channel creation (multifd_new_send_channel_async) when
TLS is enabled, the multifd_channel_connect function is called twice,
once to create the TLS handshake thread and another time after the
asynchronous TLS handshake has finished.
This creates a slightly confusing call stack where
multifd_channel_connect() is called more times than the number of
channels. It also splits error handling between the two callers of
multifd_channel_connect() causing some code duplication. Lastly, it
gets in the way of having a single point to determine whether all
channel creation tasks have been initiated.
Refactor the code to move the reentrancy one level up at the
multifd_new_send_channel_async() level, de-duplicating the error
handling and allowing for the next patch to introduce a
synchronization point common to all the multifd channel creation,
regardless of TLS.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
migration/multifd.c | 73 +++++++++++++++++++--------------------------
1 file changed, 30 insertions(+), 43 deletions(-)
diff --git a/migration/multifd.c b/migration/multifd.c
index cc10be2c3f..89d39fa67c 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -869,30 +869,7 @@ out:
return NULL;
}
-static bool multifd_channel_connect(MultiFDSendParams *p,
- QIOChannel *ioc,
- Error **errp);
-
-static void multifd_tls_outgoing_handshake(QIOTask *task,
- gpointer opaque)
-{
- MultiFDSendParams *p = opaque;
- QIOChannel *ioc = QIO_CHANNEL(qio_task_get_source(task));
- Error *err = NULL;
-
- if (!qio_task_propagate_error(task, &err)) {
- trace_multifd_tls_outgoing_handshake_complete(ioc);
- if (multifd_channel_connect(p, ioc, &err)) {
- return;
- }
- }
-
- trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err));
-
- multifd_send_set_error(err);
- multifd_send_kick_main(p);
- error_free(err);
-}
+static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque);
static void *multifd_tls_handshake_thread(void *opaque)
{
@@ -900,7 +877,7 @@ static void *multifd_tls_handshake_thread(void *opaque)
QIOChannelTLS *tioc = QIO_CHANNEL_TLS(p->c);
qio_channel_tls_handshake(tioc,
- multifd_tls_outgoing_handshake,
+ multifd_new_send_channel_async,
p,
NULL,
NULL);
@@ -936,19 +913,6 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
QIOChannel *ioc,
Error **errp)
{
- trace_multifd_set_outgoing_channel(
- ioc, object_get_typename(OBJECT(ioc)),
- migrate_get_current()->hostname);
-
- if (migrate_channel_requires_tls_upgrade(ioc)) {
- /*
- * tls_channel_connect will call back to this
- * function after the TLS handshake,
- * so we mustn't call multifd_send_thread until then
- */
- return multifd_tls_channel_connect(p, ioc, errp);
- }
-
migration_ioc_register_yank(ioc);
p->registered_yank = true;
p->c = ioc;
@@ -959,20 +923,43 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
return true;
}
+/*
+ * When TLS is enabled this function is called once to establish the
+ * TLS connection and a second time after the TLS handshake to create
+ * the multifd channel. Without TLS it goes straight into the channel
+ * creation.
+ */
static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
{
MultiFDSendParams *p = opaque;
QIOChannel *ioc = QIO_CHANNEL(qio_task_get_source(task));
Error *local_err = NULL;
+ bool ret;
+
trace_multifd_new_send_channel_async(p->id);
- if (!qio_task_propagate_error(task, &local_err)) {
- qio_channel_set_delay(ioc, false);
- if (multifd_channel_connect(p, ioc, &local_err)) {
- return;
- }
+
+ if (qio_task_propagate_error(task, &local_err)) {
+ ret = false;
+ goto out;
+ }
+
+ qio_channel_set_delay(ioc, false);
+
+ trace_multifd_set_outgoing_channel(ioc, object_get_typename(OBJECT(ioc)),
+ migrate_get_current()->hostname);
+
+ if (migrate_channel_requires_tls_upgrade(ioc)) {
+ ret = multifd_tls_channel_connect(p, ioc, &local_err);
+ } else {
+ ret = multifd_channel_connect(p, ioc, &local_err);
+ }
+
+ if (ret) {
+ return;
}
+out:
trace_multifd_new_send_channel_async_error(p->id, local_err);
multifd_send_set_error(local_err);
multifd_send_kick_main(p);
--
2.35.3
* Re: [PATCH v2 5/6] migration/multifd: Unify multifd and TLS connection paths
2024-02-05 19:49 ` [PATCH v2 5/6] migration/multifd: Unify multifd and TLS connection paths Fabiano Rosas
@ 2024-02-06 3:33 ` Peter Xu
2024-02-06 12:44 ` Avihai Horon
From: Peter Xu @ 2024-02-06 3:33 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: qemu-devel, Avihai Horon, Daniel P . Berrangé
On Mon, Feb 05, 2024 at 04:49:28PM -0300, Fabiano Rosas wrote:
> During multifd channel creation (multifd_new_send_channel_async) when
> TLS is enabled, the multifd_channel_connect function is called twice,
> once to create the TLS handshake thread and another time after the
> asynchronous TLS handshake has finished.
>
> This creates a slightly confusing call stack where
> multifd_channel_connect() is called more times than the number of
> channels. It also splits error handling between the two callers of
> multifd_channel_connect() causing some code duplication. Lastly, it
> gets in the way of having a single point to determine whether all
> channel creation tasks have been initiated.
>
> Refactor the code to move the reentrancy one level up at the
> multifd_new_send_channel_async() level, de-duplicating the error
> handling and allowing for the next patch to introduce a
> synchronization point common to all the multifd channel creation,
> regardless of TLS.
>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
* Re: [PATCH v2 5/6] migration/multifd: Unify multifd and TLS connection paths
2024-02-05 19:49 ` [PATCH v2 5/6] migration/multifd: Unify multifd and TLS connection paths Fabiano Rosas
2024-02-06 3:33 ` Peter Xu
@ 2024-02-06 12:44 ` Avihai Horon
2024-02-06 14:30 ` Fabiano Rosas
From: Avihai Horon @ 2024-02-06 12:44 UTC (permalink / raw)
To: Fabiano Rosas, qemu-devel; +Cc: Peter Xu, Daniel P . Berrangé
On 05/02/2024 21:49, Fabiano Rosas wrote:
>
> During multifd channel creation (multifd_new_send_channel_async) when
> TLS is enabled, the multifd_channel_connect function is called twice,
> once to create the TLS handshake thread and another time after the
> asynchronous TLS handshake has finished.
>
> This creates a slightly confusing call stack where
> multifd_channel_connect() is called more times than the number of
> channels. It also splits error handling between the two callers of
> multifd_channel_connect() causing some code duplication. Lastly, it
> gets in the way of having a single point to determine whether all
> channel creation tasks have been initiated.
>
> Refactor the code to move the reentrancy one level up at the
> multifd_new_send_channel_async() level, de-duplicating the error
> handling and allowing for the next patch to introduce a
> synchronization point common to all the multifd channel creation,
> regardless of TLS.
>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
> migration/multifd.c | 73 +++++++++++++++++++--------------------------
> 1 file changed, 30 insertions(+), 43 deletions(-)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index cc10be2c3f..89d39fa67c 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -869,30 +869,7 @@ out:
> return NULL;
> }
>
> -static bool multifd_channel_connect(MultiFDSendParams *p,
> - QIOChannel *ioc,
> - Error **errp);
> -
> -static void multifd_tls_outgoing_handshake(QIOTask *task,
> - gpointer opaque)
> -{
> - MultiFDSendParams *p = opaque;
> - QIOChannel *ioc = QIO_CHANNEL(qio_task_get_source(task));
> - Error *err = NULL;
> -
> - if (!qio_task_propagate_error(task, &err)) {
> - trace_multifd_tls_outgoing_handshake_complete(ioc);
> - if (multifd_channel_connect(p, ioc, &err)) {
> - return;
> - }
> - }
> -
> - trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err));
> -
> - multifd_send_set_error(err);
> - multifd_send_kick_main(p);
> - error_free(err);
> -}
> +static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque);
>
> static void *multifd_tls_handshake_thread(void *opaque)
> {
> @@ -900,7 +877,7 @@ static void *multifd_tls_handshake_thread(void *opaque)
> QIOChannelTLS *tioc = QIO_CHANNEL_TLS(p->c);
>
> qio_channel_tls_handshake(tioc,
> - multifd_tls_outgoing_handshake,
> + multifd_new_send_channel_async,
> p,
> NULL,
> NULL);
> @@ -936,19 +913,6 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
> QIOChannel *ioc,
> Error **errp)
> {
> - trace_multifd_set_outgoing_channel(
> - ioc, object_get_typename(OBJECT(ioc)),
> - migrate_get_current()->hostname);
> -
> - if (migrate_channel_requires_tls_upgrade(ioc)) {
> - /*
> - * tls_channel_connect will call back to this
> - * function after the TLS handshake,
> - * so we mustn't call multifd_send_thread until then
> - */
> - return multifd_tls_channel_connect(p, ioc, errp);
> - }
> -
> migration_ioc_register_yank(ioc);
> p->registered_yank = true;
> p->c = ioc;
> @@ -959,20 +923,43 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
> return true;
> }
>
> +/*
> + * When TLS is enabled this function is called once to establish the
> + * TLS connection and a second time after the TLS handshake to create
> + * the multifd channel. Without TLS it goes straight into the channel
> + * creation.
> + */
> static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
> {
> MultiFDSendParams *p = opaque;
> QIOChannel *ioc = QIO_CHANNEL(qio_task_get_source(task));
> Error *local_err = NULL;
>
> + bool ret;
> +
> trace_multifd_new_send_channel_async(p->id);
> - if (!qio_task_propagate_error(task, &local_err)) {
> - qio_channel_set_delay(ioc, false);
> - if (multifd_channel_connect(p, ioc, &local_err)) {
> - return;
> - }
> +
> + if (qio_task_propagate_error(task, &local_err)) {
> + ret = false;
> + goto out;
> + }
I think this common error handling for both TLS/non-TLS is a bit
problematic if there is an error in the TLS handshake:

1. multifd_tls_channel_connect() sets p->c = QIO_CHANNEL(tioc).
2. The TLS handshake fails.
3. multifd_new_send_channel_async() errors and calls
   object_unref(OBJECT(ioc)), which frees the IOC.
4. multifd_send_terminate_threads() then tries to access p->ioc because
   it's not NULL, causing a segfault.
> +
> + qio_channel_set_delay(ioc, false);
Maybe qio_channel_set_delay() should be moved inside
multifd_channel_connect()? It's called twice when TLS is used.
> +
> + trace_multifd_set_outgoing_channel(ioc, object_get_typename(OBJECT(ioc)),
> + migrate_get_current()->hostname);
> +
> + if (migrate_channel_requires_tls_upgrade(ioc)) {
> + ret = multifd_tls_channel_connect(p, ioc, &local_err);
> + } else {
> + ret = multifd_channel_connect(p, ioc, &local_err);
> + }
> +
> + if (ret) {
> + return;
> }
>
> +out:
> trace_multifd_new_send_channel_async_error(p->id, local_err);
> multifd_send_set_error(local_err);
> multifd_send_kick_main(p);
> --
> 2.35.3
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 5/6] migration/multifd: Unify multifd and TLS connection paths
2024-02-06 12:44 ` Avihai Horon
@ 2024-02-06 14:30 ` Fabiano Rosas
2024-02-06 14:44 ` Avihai Horon
0 siblings, 1 reply; 16+ messages in thread
From: Fabiano Rosas @ 2024-02-06 14:30 UTC (permalink / raw)
To: Avihai Horon, qemu-devel; +Cc: Peter Xu, Daniel P . Berrangé
Avihai Horon <avihaih@nvidia.com> writes:
> On 05/02/2024 21:49, Fabiano Rosas wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> During multifd channel creation (multifd_send_new_channel_async) when
>> TLS is enabled, the multifd_channel_connect function is called twice,
>> once to create the TLS handshake thread and another time after the
>> asynchronous TLS handshake has finished.
>>
>> This creates a slightly confusing call stack where
>> multifd_channel_connect() is called more times than the number of
>> channels. It also splits error handling between the two callers of
>> multifd_channel_connect() causing some code duplication. Lastly, it
>> gets in the way of having a single point to determine whether all
>> channel creation tasks have been initiated.
>>
>> Refactor the code to move the reentrancy one level up, into
>> multifd_new_send_channel_async(), de-duplicating the error
>> handling and allowing for the next patch to introduce a
>> synchronization point common to all the multifd channel creation,
>> regardless of TLS.
>>
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>> migration/multifd.c | 73 +++++++++++++++++++--------------------------
>> 1 file changed, 30 insertions(+), 43 deletions(-)
>>
>> diff --git a/migration/multifd.c b/migration/multifd.c
>> index cc10be2c3f..89d39fa67c 100644
>> --- a/migration/multifd.c
>> +++ b/migration/multifd.c
>> @@ -869,30 +869,7 @@ out:
>> return NULL;
>> }
>>
>> -static bool multifd_channel_connect(MultiFDSendParams *p,
>> - QIOChannel *ioc,
>> - Error **errp);
>> -
>> -static void multifd_tls_outgoing_handshake(QIOTask *task,
>> - gpointer opaque)
>> -{
>> - MultiFDSendParams *p = opaque;
>> - QIOChannel *ioc = QIO_CHANNEL(qio_task_get_source(task));
>> - Error *err = NULL;
>> -
>> - if (!qio_task_propagate_error(task, &err)) {
>> - trace_multifd_tls_outgoing_handshake_complete(ioc);
>> - if (multifd_channel_connect(p, ioc, &err)) {
>> - return;
>> - }
>> - }
>> -
>> - trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err));
>> -
>> - multifd_send_set_error(err);
>> - multifd_send_kick_main(p);
>> - error_free(err);
>> -}
>> +static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque);
>>
>> static void *multifd_tls_handshake_thread(void *opaque)
>> {
>> @@ -900,7 +877,7 @@ static void *multifd_tls_handshake_thread(void *opaque)
>> QIOChannelTLS *tioc = QIO_CHANNEL_TLS(p->c);
>>
>> qio_channel_tls_handshake(tioc,
>> - multifd_tls_outgoing_handshake,
>> + multifd_new_send_channel_async,
>> p,
>> NULL,
>> NULL);
>> @@ -936,19 +913,6 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
>> QIOChannel *ioc,
>> Error **errp)
>> {
>> - trace_multifd_set_outgoing_channel(
>> - ioc, object_get_typename(OBJECT(ioc)),
>> - migrate_get_current()->hostname);
>> -
>> - if (migrate_channel_requires_tls_upgrade(ioc)) {
>> - /*
>> - * tls_channel_connect will call back to this
>> - * function after the TLS handshake,
>> - * so we mustn't call multifd_send_thread until then
>> - */
>> - return multifd_tls_channel_connect(p, ioc, errp);
>> - }
>> -
>> migration_ioc_register_yank(ioc);
>> p->registered_yank = true;
>> p->c = ioc;
>> @@ -959,20 +923,43 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
>> return true;
>> }
>>
>> +/*
>> + * When TLS is enabled this function is called once to establish the
>> + * TLS connection and a second time after the TLS handshake to create
>> + * the multifd channel. Without TLS it goes straight into the channel
>> + * creation.
>> + */
>> static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
>> {
>> MultiFDSendParams *p = opaque;
>> QIOChannel *ioc = QIO_CHANNEL(qio_task_get_source(task));
>> Error *local_err = NULL;
>>
>> + bool ret;
>> +
>> trace_multifd_new_send_channel_async(p->id);
>> - if (!qio_task_propagate_error(task, &local_err)) {
>> - qio_channel_set_delay(ioc, false);
>> - if (multifd_channel_connect(p, ioc, &local_err)) {
>> - return;
>> - }
>> +
>> + if (qio_task_propagate_error(task, &local_err)) {
>> + ret = false;
>> + goto out;
>> + }
>
> I think this common error handling for both TLS/non-TLS is a bit
> problematic if there is an error in the TLS handshake:
> 1. multifd_tls_channel_connect() sets p->c = QIO_CHANNEL(tioc).
> 2. The TLS handshake fails.
> 3. multifd_new_send_channel_async() takes the error path and calls
> object_unref(OBJECT(ioc)), which frees the IOC.
> 4. multifd_send_terminate_threads() then tries to access p->c because
> it's not NULL, causing a segfault.
Good catch.
I'm not sure the current reference counting is even correct. AFAICS, the
refcount is 2 at new_send_channel_async due to the qio_task taking a
reference and that will be decremented after we return from the
completion callback, which is multifd_new_send_channel_async itself. The
last reference should be dropped when we clean up the channel.
So I don't really understand the need for that unref there. But no
asserts are being reached due to an extra decrement, so there might be
some extra increment hiding somewhere.
Anyway, I'll figure this out and update this patch. Thanks
>> +
>> + qio_channel_set_delay(ioc, false);
>
> Maybe qio_channel_set_delay() should be moved inside
> multifd_channel_connect()? It's called twice when TLS is used.
>
It looks like it could, I'll do that.
>> [...]
* Re: [PATCH v2 5/6] migration/multifd: Unify multifd and TLS connection paths
2024-02-06 14:30 ` Fabiano Rosas
@ 2024-02-06 14:44 ` Avihai Horon
0 siblings, 0 replies; 16+ messages in thread
From: Avihai Horon @ 2024-02-06 14:44 UTC (permalink / raw)
To: Fabiano Rosas, qemu-devel; +Cc: Peter Xu, Daniel P . Berrangé
On 06/02/2024 16:30, Fabiano Rosas wrote:
> Avihai Horon <avihaih@nvidia.com> writes:
>
>> On 05/02/2024 21:49, Fabiano Rosas wrote:
>>> [...]
>>> - if (!qio_task_propagate_error(task, &local_err)) {
>>> - qio_channel_set_delay(ioc, false);
>>> - if (multifd_channel_connect(p, ioc, &local_err)) {
>>> - return;
>>> - }
>>> +
>>> + if (qio_task_propagate_error(task, &local_err)) {
>>> + ret = false;
>>> + goto out;
>>> + }
>> I think this common error handling for both TLS/non-TLS is a bit
>> problematic if there is an error in the TLS handshake:
>> 1. multifd_tls_channel_connect() sets p->c = QIO_CHANNEL(tioc).
>> 2. The TLS handshake fails.
>> 3. multifd_new_send_channel_async() takes the error path and calls
>> object_unref(OBJECT(ioc)), which frees the IOC.
>> 4. multifd_send_terminate_threads() then tries to access p->c because
>> it's not NULL, causing a segfault.
> Good catch.
>
> I'm not sure the current reference counting is even correct. AFAICS, the
> refcount is 2 at new_send_channel_async due to the qio_task taking a
> reference and that will be decremented after we return from the
> completion callback, which is multifd_new_send_channel_async itself. The
> last reference should be dropped when we cleanup the channel.
>
> So I don't really understand the need for that unref there. But there's
> no asserts being reached due to an extra decrement, so there might be
> some extra increment hiding somewhere.
I think the ref counting is correct: in the non-TLS case we never set
p->c = ioc, so the cleanup will just skip destroying this p->c.
>
> Anyway, I'll figure this out and update this patch. Thanks
>
>>> +
>>> + qio_channel_set_delay(ioc, false);
>> Maybe qio_channel_set_delay() should be moved inside
>> multifd_channel_connect()? It's called twice when TLS is used.
>>
> It looks like it could, I'll do that.
>
>>> [...]
* [PATCH v2 6/6] migration/multifd: Add a synchronization point for channel creation
2024-02-05 19:49 [PATCH v2 0/6] migration/multifd: Fix channel creation vs. cleanup races Fabiano Rosas
` (4 preceding siblings ...)
2024-02-05 19:49 ` [PATCH v2 5/6] migration/multifd: Unify multifd and TLS connection paths Fabiano Rosas
@ 2024-02-05 19:49 ` Fabiano Rosas
2024-02-06 3:37 ` Peter Xu
2024-02-06 3:42 ` [PATCH v2 0/6] migration/multifd: Fix channel creation vs. cleanup races Peter Xu
6 siblings, 1 reply; 16+ messages in thread
From: Fabiano Rosas @ 2024-02-05 19:49 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Xu, Avihai Horon, Daniel P . Berrangé
It is possible that one of the multifd channels fails to be created at
multifd_new_send_channel_async() while the rest of the channel
creation tasks are still in flight.
This could lead to multifd_save_cleanup() executing the
qemu_thread_join() loop too early and not waiting for the threads
which haven't been created yet, leading to the freeing of resources
that the newly created threads will try to access and crash.
Add a synchronization point after which there will be no attempts at
thread creation and therefore calling multifd_save_cleanup() past that
point will ensure it properly waits for the threads.
A note about performance: Prior to this patch, if a channel took too
long to be established, other channels could finish connecting first
and already start taking load. Now we're bounded by the
slowest-connecting channel.
Reported-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
migration/multifd.c | 33 ++++++++++++++++++++++++++-------
1 file changed, 26 insertions(+), 7 deletions(-)
diff --git a/migration/multifd.c b/migration/multifd.c
index 89d39fa67c..a2b73c9946 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -62,6 +62,11 @@ struct {
* Make it easy for now.
*/
uintptr_t packet_num;
+ /*
+ * Synchronization point past which no more channels will be
+ * created.
+ */
+ QemuSemaphore channels_created;
/* send channels ready */
QemuSemaphore channels_ready;
/*
@@ -622,10 +627,6 @@ static void multifd_send_terminate_threads(void)
/*
* Finally recycle all the threads.
- *
- * TODO: p->running is still buggy, e.g. we can reach here without the
- * corresponding multifd_new_send_channel_async() get invoked yet,
- * then a new thread can even be created after this function returns.
*/
for (i = 0; i < migrate_multifd_channels(); i++) {
MultiFDSendParams *p = &multifd_send_state->params[i];
@@ -670,6 +671,7 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
static void multifd_send_cleanup_state(void)
{
+ qemu_sem_destroy(&multifd_send_state->channels_created);
qemu_sem_destroy(&multifd_send_state->channels_ready);
g_free(multifd_send_state->params);
multifd_send_state->params = NULL;
@@ -934,7 +936,6 @@ static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
MultiFDSendParams *p = opaque;
QIOChannel *ioc = QIO_CHANNEL(qio_task_get_source(task));
Error *local_err = NULL;
-
bool ret;
trace_multifd_new_send_channel_async(p->id);
@@ -951,18 +952,26 @@ static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
if (migrate_channel_requires_tls_upgrade(ioc)) {
ret = multifd_tls_channel_connect(p, ioc, &local_err);
+ if (ret) {
+ return;
+ }
} else {
ret = multifd_channel_connect(p, ioc, &local_err);
}
+out:
+ /*
+ * Here we're not interested whether creation succeeded, only that
+ * it happened at all.
+ */
+ qemu_sem_post(&multifd_send_state->channels_created);
+
if (ret) {
return;
}
-out:
trace_multifd_new_send_channel_async_error(p->id, local_err);
multifd_send_set_error(local_err);
- multifd_send_kick_main(p);
object_unref(OBJECT(ioc));
error_free(local_err);
}
@@ -988,6 +997,7 @@ bool multifd_send_setup(void)
multifd_send_state = g_malloc0(sizeof(*multifd_send_state));
multifd_send_state->params = g_new0(MultiFDSendParams, thread_count);
multifd_send_state->pages = multifd_pages_init(page_count);
+ qemu_sem_init(&multifd_send_state->channels_created, 0);
qemu_sem_init(&multifd_send_state->channels_ready, 0);
qatomic_set(&multifd_send_state->exiting, 0);
multifd_send_state->ops = multifd_ops[migrate_multifd_compression()];
@@ -1013,6 +1023,15 @@ bool multifd_send_setup(void)
multifd_new_send_channel_create(p);
}
+ /*
+ * Wait until channel creation has started for all channels. The
+ * creation can still fail, but no more channels will be created
+ * past this point.
+ */
+ for (i = 0; i < thread_count; i++) {
+ qemu_sem_wait(&multifd_send_state->channels_created);
+ }
+
for (i = 0; i < thread_count; i++) {
MultiFDSendParams *p = &multifd_send_state->params[i];
--
2.35.3
* Re: [PATCH v2 6/6] migration/multifd: Add a synchronization point for channel creation
2024-02-05 19:49 ` [PATCH v2 6/6] migration/multifd: Add a synchronization point for channel creation Fabiano Rosas
@ 2024-02-06 3:37 ` Peter Xu
0 siblings, 0 replies; 16+ messages in thread
From: Peter Xu @ 2024-02-06 3:37 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: qemu-devel, Avihai Horon, Daniel P . Berrangé
On Mon, Feb 05, 2024 at 04:49:29PM -0300, Fabiano Rosas wrote:
> It is possible that one of the multifd channels fails to be created at
> multifd_new_send_channel_async() while the rest of the channel
> creation tasks are still in flight.
>
> This could lead to multifd_save_cleanup() executing the
> qemu_thread_join() loop too early and not waiting for the threads
> which haven't been created yet, leading to the freeing of resources
> that the newly created threads will try to access and crash.
>
> Add a synchronization point after which there will be no attempts at
> thread creation and therefore calling multifd_save_cleanup() past that
> point will ensure it properly waits for the threads.
>
> A note about performance: Prior to this patch, if a channel took too
> long to be established, other channels could finish connecting first
> and already start taking load. Now we're bounded by the
> slowest-connecting channel.
>
> Reported-by: Avihai Horon <avihaih@nvidia.com>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
[...]
> @@ -934,7 +936,6 @@ static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
> MultiFDSendParams *p = opaque;
> QIOChannel *ioc = QIO_CHANNEL(qio_task_get_source(task));
> Error *local_err = NULL;
> -
This line removal should belong to the previous patch. I can touch that
up.
> bool ret;
>
> trace_multifd_new_send_channel_async(p->id);
Reviewed-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
* Re: [PATCH v2 0/6] migration/multifd: Fix channel creation vs. cleanup races
2024-02-05 19:49 [PATCH v2 0/6] migration/multifd: Fix channel creation vs. cleanup races Fabiano Rosas
` (5 preceding siblings ...)
2024-02-05 19:49 ` [PATCH v2 6/6] migration/multifd: Add a synchronization point for channel creation Fabiano Rosas
@ 2024-02-06 3:42 ` Peter Xu
6 siblings, 0 replies; 16+ messages in thread
From: Peter Xu @ 2024-02-06 3:42 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: qemu-devel, Avihai Horon, Daniel P . Berrangé
On Mon, Feb 05, 2024 at 04:49:23PM -0300, Fabiano Rosas wrote:
> Based-on: 20240202102857.110210-1-peterx@redhat.com
> [PATCH v2 00/23] migration/multifd: Refactor ->send_prepare() and cleanups
> https://lore.kernel.org/r/20240202102857.110210-1-peterx@redhat.com
>
> Hi,
>
> In this v2 I made sure NO channel is created after the semaphores are
> posted. Feel free to call me out if that's not the case.
Queued into -staging. Plan to send a pull only before I'll be out (Feb
9-19), so comments are still welcomed. Thanks.
--
Peter Xu