* [PATCH 0/3] migration/tls: Graceful shutdowns for main and postcopy channels
@ 2025-09-10 16:01 Peter Xu
2025-09-10 16:01 ` [PATCH 1/3] migration/tls: Gracefully shutdown main and preempt channels Peter Xu
` (3 more replies)
0 siblings, 4 replies; 14+ messages in thread
From: Peter Xu @ 2025-09-10 16:01 UTC (permalink / raw)
To: qemu-devel; +Cc: peterx, Juraj Marcin, Daniel P . Berrangé, Fabiano Rosas
Fabiano fixed graceful shutdowns for multifd channels previously:
https://lore.kernel.org/qemu-devel/20250206175824.22664-1-farosas@suse.de/
However we likely forgot the rest channels. Do it the same for the main
and postcopy channels. This fixes a warning message when running unit test
/ARCH/migration/postcopy/preempt/tls/psk.
Thanks,
Peter Xu (3):
migration/tls: Gracefully shutdown main and preempt channels
migration: Make migration_has_failed() work even for CANCELLING
migration/multifd: Use the new graceful termination helper
migration/channel.h | 3 +++
migration/migration.h | 2 ++
migration/tls.h | 1 -
migration/channel.c | 20 ++++++++++++++++++++
migration/migration.c | 27 +++++++++++++++++++++++++--
migration/multifd.c | 40 +++++++---------------------------------
migration/tls.c | 5 -----
7 files changed, 57 insertions(+), 41 deletions(-)
--
2.50.1
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 1/3] migration/tls: Gracefully shutdown main and preempt channels
2025-09-10 16:01 [PATCH 0/3] migration/tls: Graceful shutdowns for main and postcopy channels Peter Xu
@ 2025-09-10 16:01 ` Peter Xu
2025-09-17 20:22 ` Fabiano Rosas
2025-09-10 16:01 ` [PATCH 2/3] migration: Make migration_has_failed() work even for CANCELLING Peter Xu
` (2 subsequent siblings)
3 siblings, 1 reply; 14+ messages in thread
From: Peter Xu @ 2025-09-10 16:01 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, Juraj Marcin, Daniel P . Berrangé, Fabiano Rosas,
Xiaohui Li
QEMU supported graceful shutdowns for multifd channels starting from commit
48796f6b44 ("migration/multifd: Terminate the TLS connection"). Then error
check was enabled for premature TLS terminations.
Now if we run the preempt TLS unit test, the test would pass, but there
will be a warning reported:
qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
ok 1 /x86_64/migration/postcopy/preempt/tls/psk
To fix it, make the rest channels to be gracefully terminated too when it's
a TLS channel.
One note is that the qemufile helper needs to be in migration.c not
qemu-file.c, because qemu-file.c will be linked in unit tests, which will
not link channel.c unfortunately.
Reported-by: Xiaohui Li <xiaohli@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/channel.h | 3 +++
migration/migration.h | 2 ++
migration/channel.c | 13 +++++++++++++
migration/migration.c | 24 +++++++++++++++++++++++-
4 files changed, 41 insertions(+), 1 deletion(-)
diff --git a/migration/channel.h b/migration/channel.h
index 5bdb8208a7..0b25dd7c5b 100644
--- a/migration/channel.h
+++ b/migration/channel.h
@@ -29,4 +29,7 @@ int migration_channel_read_peek(QIOChannel *ioc,
const char *buf,
const size_t buflen,
Error **errp);
+
+bool migration_channel_shutdown_gracefully(QIOChannel *c, Error **errp);
+
#endif
diff --git a/migration/migration.h b/migration/migration.h
index 01329bf824..b5763af057 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -594,4 +594,6 @@ void migration_bitmap_sync_precopy(bool last_stage);
void dirty_bitmap_mig_init(void);
bool should_send_vmdesc(void);
+bool qemu_file_shutdown_gracefully(QEMUFile *f, Error **errp);
+
#endif
diff --git a/migration/channel.c b/migration/channel.c
index a547b1fbfe..1ae839e5fe 100644
--- a/migration/channel.c
+++ b/migration/channel.c
@@ -145,3 +145,16 @@ int migration_channel_read_peek(QIOChannel *ioc,
return 0;
}
+
+/*
+ * This is only needed for a successful migration, no-op for non-TLS
+ * channels. For unexpected interruptions, use qio_channel_shutdown().
+ */
+bool migration_channel_shutdown_gracefully(QIOChannel *c, Error **errp)
+{
+ if (object_dynamic_cast((Object *)c, TYPE_QIO_CHANNEL_TLS)) {
+ qio_channel_tls_bye(QIO_CHANNEL_TLS(c), errp);
+ }
+
+ return *errp == NULL;
+}
diff --git a/migration/migration.c b/migration/migration.c
index 10c216d25d..7015c2b5e0 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -113,6 +113,27 @@ static bool close_return_path_on_source(MigrationState *s);
static void migration_completion_end(MigrationState *s);
static void migrate_hup_delete(MigrationState *s);
+/*
+ * See migration_channel_shutdown_gracefully(). The "graceful" versions
+ * are only needed if migration succeeded.
+ */
+bool qemu_file_shutdown_gracefully(QEMUFile *f, Error **errp)
+{
+ int ret;
+
+ if (!migration_channel_shutdown_gracefully(qemu_file_get_ioc(f), errp)) {
+ return false;
+ }
+
+ ret = qemu_file_shutdown(f);
+ if (ret) {
+ error_setg_errno(errp, -ret, "qemu_file_shutdown() failed");
+ return false;
+ }
+
+ return true;
+}
+
static void migration_downtime_start(MigrationState *s)
{
trace_vmstate_downtime_checkpoint("src-downtime-start");
@@ -2473,11 +2494,12 @@ static void migration_release_dst_files(MigrationState *ms)
*/
if (ms->postcopy_qemufile_src) {
migration_ioc_unregister_yank_from_file(ms->postcopy_qemufile_src);
- qemu_file_shutdown(ms->postcopy_qemufile_src);
+ qemu_file_shutdown_gracefully(ms->postcopy_qemufile_src, &error_warn);
qemu_fclose(ms->postcopy_qemufile_src);
ms->postcopy_qemufile_src = NULL;
}
+ qemu_file_shutdown_gracefully(file, &error_warn);
qemu_fclose(file);
}
--
2.50.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 2/3] migration: Make migration_has_failed() work even for CANCELLING
2025-09-10 16:01 [PATCH 0/3] migration/tls: Graceful shutdowns for main and postcopy channels Peter Xu
2025-09-10 16:01 ` [PATCH 1/3] migration/tls: Gracefully shutdown main and preempt channels Peter Xu
@ 2025-09-10 16:01 ` Peter Xu
2025-09-17 20:52 ` Fabiano Rosas
2025-09-10 16:01 ` [PATCH 3/3] migration/multifd: Use the new graceful termination helper Peter Xu
2025-09-11 13:13 ` [PATCH 0/3] migration/tls: Graceful shutdowns for main and postcopy channels Peter Xu
3 siblings, 1 reply; 14+ messages in thread
From: Peter Xu @ 2025-09-10 16:01 UTC (permalink / raw)
To: qemu-devel; +Cc: peterx, Juraj Marcin, Daniel P . Berrangé, Fabiano Rosas
We set CANCELLED very late, it means migration_has_failed() may not work
correctly if it's invoked before updating CANCELLING to CANCELLED.
Allow that state will make migration_has_failed() working as expected even
if it's invoked slightly earlier.
One current user is the multifd code for the TLS graceful termination,
where it's before updating to CANCELLED.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/migration/migration.c b/migration/migration.c
index 7015c2b5e0..397917b1b3 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1723,7 +1723,8 @@ int migration_call_notifiers(MigrationState *s, MigrationEventType type,
bool migration_has_failed(MigrationState *s)
{
- return (s->state == MIGRATION_STATUS_CANCELLED ||
+ return (s->state == MIGRATION_STATUS_CANCELLING ||
+ s->state == MIGRATION_STATUS_CANCELLED ||
s->state == MIGRATION_STATUS_FAILED);
}
--
2.50.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 3/3] migration/multifd: Use the new graceful termination helper
2025-09-10 16:01 [PATCH 0/3] migration/tls: Graceful shutdowns for main and postcopy channels Peter Xu
2025-09-10 16:01 ` [PATCH 1/3] migration/tls: Gracefully shutdown main and preempt channels Peter Xu
2025-09-10 16:01 ` [PATCH 2/3] migration: Make migration_has_failed() work even for CANCELLING Peter Xu
@ 2025-09-10 16:01 ` Peter Xu
2025-09-17 21:07 ` Fabiano Rosas
2025-09-11 13:13 ` [PATCH 0/3] migration/tls: Graceful shutdowns for main and postcopy channels Peter Xu
3 siblings, 1 reply; 14+ messages in thread
From: Peter Xu @ 2025-09-10 16:01 UTC (permalink / raw)
To: qemu-devel; +Cc: peterx, Juraj Marcin, Daniel P . Berrangé, Fabiano Rosas
Multifd has a separate loop to do TLS terminations gracefully. Meanwhile,
it depends on two variables which records thread creations.
It works perfectly before, however relying on "whether some threads are
created" flag might be not as straightforward to decide a graceful
shutdown.
Since we'll need to dynamically identify TLS channels anyway with the new
helper (which is needed for main and postcopy channels), use the same
simple API for multifd channels too. Also, we only need graceful shutdown
on success of migrations.
With that, we can remove the loop and drop migration_tls_channel_end().
The comment there is still a good explanation, move it over to the new
helper instead.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/tls.h | 1 -
migration/channel.c | 7 +++++++
migration/multifd.c | 40 +++++++---------------------------------
migration/tls.c | 5 -----
4 files changed, 14 insertions(+), 39 deletions(-)
diff --git a/migration/tls.h b/migration/tls.h
index 58b25e1228..75c918e156 100644
--- a/migration/tls.h
+++ b/migration/tls.h
@@ -36,7 +36,6 @@ void migration_tls_channel_connect(MigrationState *s,
QIOChannel *ioc,
const char *hostname,
Error **errp);
-void migration_tls_channel_end(QIOChannel *ioc, Error **errp);
/* Whether the QIO channel requires further TLS handshake? */
bool migrate_channel_requires_tls_upgrade(QIOChannel *ioc);
diff --git a/migration/channel.c b/migration/channel.c
index 1ae839e5fe..a481b45eae 100644
--- a/migration/channel.c
+++ b/migration/channel.c
@@ -153,6 +153,13 @@ int migration_channel_read_peek(QIOChannel *ioc,
bool migration_channel_shutdown_gracefully(QIOChannel *c, Error **errp)
{
if (object_dynamic_cast((Object *)c, TYPE_QIO_CHANNEL_TLS)) {
+ /*
+ * The destination expects the TLS session to always be properly
+ * terminated. This helps to detect a premature termination in the
+ * middle of the stream. Note that older QEMUs always break the
+ * connection on the source and the destination always sees
+ * GNUTLS_E_PREMATURE_TERMINATION.
+ */
qio_channel_tls_bye(QIO_CHANNEL_TLS(c), errp);
}
diff --git a/migration/multifd.c b/migration/multifd.c
index b255778855..cb0262076b 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -439,7 +439,7 @@ static void multifd_send_set_error(Error *err)
}
}
-static void multifd_send_terminate_threads(void)
+static void multifd_send_terminate_threads(bool succeeded)
{
int i;
@@ -460,6 +460,9 @@ static void multifd_send_terminate_threads(void)
qemu_sem_post(&p->sem);
if (p->c) {
+ if (succeeded) {
+ migration_channel_shutdown_gracefully(p->c, &error_warn);
+ }
qio_channel_shutdown(p->c, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
}
}
@@ -541,50 +544,21 @@ static void multifd_send_cleanup_state(void)
void multifd_send_shutdown(void)
{
+ MigrationState *s = migrate_get_current();
int i;
if (!migrate_multifd()) {
return;
}
- for (i = 0; i < migrate_multifd_channels(); i++) {
- MultiFDSendParams *p = &multifd_send_state->params[i];
-
- /* thread_created implies the TLS handshake has succeeded */
- if (p->tls_thread_created && p->thread_created) {
- Error *local_err = NULL;
- /*
- * The destination expects the TLS session to always be
- * properly terminated. This helps to detect a premature
- * termination in the middle of the stream. Note that
- * older QEMUs always break the connection on the source
- * and the destination always sees
- * GNUTLS_E_PREMATURE_TERMINATION.
- */
- migration_tls_channel_end(p->c, &local_err);
-
- /*
- * The above can return an error in case the migration has
- * already failed. If the migration succeeded, errors are
- * not expected but there's no need to kill the source.
- */
- if (local_err && !migration_has_failed(migrate_get_current())) {
- warn_report(
- "multifd_send_%d: Failed to terminate TLS connection: %s",
- p->id, error_get_pretty(local_err));
- break;
- }
- }
- }
-
- multifd_send_terminate_threads();
+ multifd_send_terminate_threads(!migration_has_failed(s));
for (i = 0; i < migrate_multifd_channels(); i++) {
MultiFDSendParams *p = &multifd_send_state->params[i];
Error *local_err = NULL;
if (!multifd_send_cleanup_channel(p, &local_err)) {
- migrate_set_error(migrate_get_current(), local_err);
+ migrate_set_error(s, local_err);
error_free(local_err);
}
}
diff --git a/migration/tls.c b/migration/tls.c
index 284a6194b2..ca1595e05d 100644
--- a/migration/tls.c
+++ b/migration/tls.c
@@ -165,11 +165,6 @@ void migration_tls_channel_connect(MigrationState *s,
NULL);
}
-void migration_tls_channel_end(QIOChannel *ioc, Error **errp)
-{
- qio_channel_tls_bye(QIO_CHANNEL_TLS(ioc), errp);
-}
-
bool migrate_channel_requires_tls_upgrade(QIOChannel *ioc)
{
if (!migrate_tls()) {
--
2.50.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 0/3] migration/tls: Graceful shutdowns for main and postcopy channels
2025-09-10 16:01 [PATCH 0/3] migration/tls: Graceful shutdowns for main and postcopy channels Peter Xu
` (2 preceding siblings ...)
2025-09-10 16:01 ` [PATCH 3/3] migration/multifd: Use the new graceful termination helper Peter Xu
@ 2025-09-11 13:13 ` Peter Xu
2025-09-17 20:56 ` Fabiano Rosas
3 siblings, 1 reply; 14+ messages in thread
From: Peter Xu @ 2025-09-11 13:13 UTC (permalink / raw)
To: qemu-devel; +Cc: Juraj Marcin, Daniel P . Berrangé, Fabiano Rosas
On Wed, Sep 10, 2025 at 12:01:41PM -0400, Peter Xu wrote:
> Fabiano fixed graceful shutdowns for multifd channels previously:
>
> https://lore.kernel.org/qemu-devel/20250206175824.22664-1-farosas@suse.de/
>
> However we likely forgot the rest channels. Do it the same for the main
> and postcopy channels. This fixes a warning message when running unit test
> /ARCH/migration/postcopy/preempt/tls/psk.
>
> Thanks,
>
> Peter Xu (3):
> migration/tls: Gracefully shutdown main and preempt channels
> migration: Make migration_has_failed() work even for CANCELLING
> migration/multifd: Use the new graceful termination helper
Please hold off the review on this one. Juraj reported the issue wasn't
resolved by the changes, and I can also reproduce. I'll have a look and
repost..
--
Peter Xu
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/3] migration/tls: Gracefully shutdown main and preempt channels
2025-09-10 16:01 ` [PATCH 1/3] migration/tls: Gracefully shutdown main and preempt channels Peter Xu
@ 2025-09-17 20:22 ` Fabiano Rosas
0 siblings, 0 replies; 14+ messages in thread
From: Fabiano Rosas @ 2025-09-17 20:22 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: peterx, Juraj Marcin, Daniel P . Berrangé, Xiaohui Li
Peter Xu <peterx@redhat.com> writes:
> QEMU supported graceful shutdowns for multifd channels starting from commit
> 48796f6b44 ("migration/multifd: Terminate the TLS connection"). Then error
> check was enabled for premature TLS terminations.
>
> Now if we run the preempt TLS unit test, the test would pass, but there
> will be a warning reported:
>
> qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
> ok 1 /x86_64/migration/postcopy/preempt/tls/psk
>
> To fix it, make the rest channels to be gracefully terminated too when it's
> a TLS channel.
>
> One note is that the qemufile helper needs to be in migration.c not
> qemu-file.c, because qemu-file.c will be linked in unit tests, which will
> not link channel.c unfortunately.
>
> Reported-by: Xiaohui Li <xiaohli@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/channel.h | 3 +++
> migration/migration.h | 2 ++
> migration/channel.c | 13 +++++++++++++
> migration/migration.c | 24 +++++++++++++++++++++++-
> 4 files changed, 41 insertions(+), 1 deletion(-)
>
> diff --git a/migration/channel.h b/migration/channel.h
> index 5bdb8208a7..0b25dd7c5b 100644
> --- a/migration/channel.h
> +++ b/migration/channel.h
> @@ -29,4 +29,7 @@ int migration_channel_read_peek(QIOChannel *ioc,
> const char *buf,
> const size_t buflen,
> Error **errp);
> +
> +bool migration_channel_shutdown_gracefully(QIOChannel *c, Error **errp);
> +
> #endif
> diff --git a/migration/migration.h b/migration/migration.h
> index 01329bf824..b5763af057 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -594,4 +594,6 @@ void migration_bitmap_sync_precopy(bool last_stage);
> void dirty_bitmap_mig_init(void);
> bool should_send_vmdesc(void);
>
> +bool qemu_file_shutdown_gracefully(QEMUFile *f, Error **errp);
> +
> #endif
> diff --git a/migration/channel.c b/migration/channel.c
> index a547b1fbfe..1ae839e5fe 100644
> --- a/migration/channel.c
> +++ b/migration/channel.c
> @@ -145,3 +145,16 @@ int migration_channel_read_peek(QIOChannel *ioc,
>
> return 0;
> }
> +
> +/*
> + * This is only needed for a successful migration, no-op for non-TLS
> + * channels. For unexpected interruptions, use qio_channel_shutdown().
> + */
> +bool migration_channel_shutdown_gracefully(QIOChannel *c, Error **errp)
> +{
ERRP_GUARD();
due to dereferencing errp below
> + if (object_dynamic_cast((Object *)c, TYPE_QIO_CHANNEL_TLS)) {
> + qio_channel_tls_bye(QIO_CHANNEL_TLS(c), errp);
> + }
> +
> + return *errp == NULL;
> +}
> diff --git a/migration/migration.c b/migration/migration.c
> index 10c216d25d..7015c2b5e0 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -113,6 +113,27 @@ static bool close_return_path_on_source(MigrationState *s);
> static void migration_completion_end(MigrationState *s);
> static void migrate_hup_delete(MigrationState *s);
>
> +/*
> + * See migration_channel_shutdown_gracefully(). The "graceful" versions
> + * are only needed if migration succeeded.
> + */
> +bool qemu_file_shutdown_gracefully(QEMUFile *f, Error **errp)
> +{
> + int ret;
> +
> + if (!migration_channel_shutdown_gracefully(qemu_file_get_ioc(f), errp)) {
> + return false;
> + }
> +
> + ret = qemu_file_shutdown(f);
> + if (ret) {
> + error_setg_errno(errp, -ret, "qemu_file_shutdown() failed");
> + return false;
> + }
> +
> + return true;
> +}
> +
> static void migration_downtime_start(MigrationState *s)
> {
> trace_vmstate_downtime_checkpoint("src-downtime-start");
> @@ -2473,11 +2494,12 @@ static void migration_release_dst_files(MigrationState *ms)
> */
> if (ms->postcopy_qemufile_src) {
> migration_ioc_unregister_yank_from_file(ms->postcopy_qemufile_src);
> - qemu_file_shutdown(ms->postcopy_qemufile_src);
> + qemu_file_shutdown_gracefully(ms->postcopy_qemufile_src, &error_warn);
> qemu_fclose(ms->postcopy_qemufile_src);
> ms->postcopy_qemufile_src = NULL;
> }
>
> + qemu_file_shutdown_gracefully(file, &error_warn);
> qemu_fclose(file);
> }
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/3] migration: Make migration_has_failed() work even for CANCELLING
2025-09-10 16:01 ` [PATCH 2/3] migration: Make migration_has_failed() work even for CANCELLING Peter Xu
@ 2025-09-17 20:52 ` Fabiano Rosas
2025-09-17 22:00 ` Peter Xu
0 siblings, 1 reply; 14+ messages in thread
From: Fabiano Rosas @ 2025-09-17 20:52 UTC (permalink / raw)
To: Peter Xu, qemu-devel; +Cc: peterx, Juraj Marcin, Daniel P . Berrangé
Peter Xu <peterx@redhat.com> writes:
> We set CANCELLED very late, it means migration_has_failed() may not work
> correctly if it's invoked before updating CANCELLING to CANCELLED.
>
The prophecy is fulfilled.
https://wiki.qemu.org/ToDo/LiveMigration#Migration_cancel_concurrency
I'm not sure I'm convinced, for instance, CANCELLING is part of
migration_is_running(), while FAILED is not. This doesn't seem
right. Another point is that CANCELLING is not a final state, so we're
prone to later need a migration_has_finished_failing_now() helper. =)
My mental model is that CANCELLING is a transitional, ongoing state
where we shouldn't really be making assumptions. Once FAILED is reached,
then we're sure in which general state everything is.
How did you catch this? It was one of the cancel tests that failed? I
just noticed that multifd_send_shutdown() is called from
migration_cleanup() before it changes the state to CANCELLED. So current
code also has whatever issue you detected here.
> Allow that state will make migration_has_failed() working as expected even
> if it's invoked slightly earlier.
>
> One current user is the multifd code for the TLS graceful termination,
> where it's before updating to CANCELLED.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/migration.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 7015c2b5e0..397917b1b3 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1723,7 +1723,8 @@ int migration_call_notifiers(MigrationState *s, MigrationEventType type,
>
> bool migration_has_failed(MigrationState *s)
> {
> - return (s->state == MIGRATION_STATUS_CANCELLED ||
> + return (s->state == MIGRATION_STATUS_CANCELLING ||
> + s->state == MIGRATION_STATUS_CANCELLED ||
> s->state == MIGRATION_STATUS_FAILED);
> }
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 0/3] migration/tls: Graceful shutdowns for main and postcopy channels
2025-09-11 13:13 ` [PATCH 0/3] migration/tls: Graceful shutdowns for main and postcopy channels Peter Xu
@ 2025-09-17 20:56 ` Fabiano Rosas
2025-09-17 21:50 ` Peter Xu
0 siblings, 1 reply; 14+ messages in thread
From: Fabiano Rosas @ 2025-09-17 20:56 UTC (permalink / raw)
To: Peter Xu, qemu-devel; +Cc: Juraj Marcin, Daniel P . Berrangé
Peter Xu <peterx@redhat.com> writes:
> On Wed, Sep 10, 2025 at 12:01:41PM -0400, Peter Xu wrote:
>> Fabiano fixed graceful shutdowns for multifd channels previously:
>>
>> https://lore.kernel.org/qemu-devel/20250206175824.22664-1-farosas@suse.de/
>>
>> However we likely forgot the rest channels. Do it the same for the main
>> and postcopy channels. This fixes a warning message when running unit test
>> /ARCH/migration/postcopy/preempt/tls/psk.
>>
>> Thanks,
>>
>> Peter Xu (3):
>> migration/tls: Gracefully shutdown main and preempt channels
>> migration: Make migration_has_failed() work even for CANCELLING
>> migration/multifd: Use the new graceful termination helper
>
> Please hold off the review on this one. Juraj reported the issue wasn't
> resolved by the changes, and I can also reproduce. I'll have a look and
> repost..
I'm wondering if the assumption that only succeeded migrations should
gracefully exit is correct. My understanding is that we need to always
exit gracefully, but after failure, the channel might not be there, so
we ignore failures. But that does not seem to mean a failed migration
can simply not exit gracefully.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 3/3] migration/multifd: Use the new graceful termination helper
2025-09-10 16:01 ` [PATCH 3/3] migration/multifd: Use the new graceful termination helper Peter Xu
@ 2025-09-17 21:07 ` Fabiano Rosas
0 siblings, 0 replies; 14+ messages in thread
From: Fabiano Rosas @ 2025-09-17 21:07 UTC (permalink / raw)
To: Peter Xu, qemu-devel; +Cc: peterx, Juraj Marcin, Daniel P . Berrangé
Peter Xu <peterx@redhat.com> writes:
> Multifd has a separate loop to do TLS terminations gracefully. Meanwhile,
> it depends on two variables which records thread creations.
>
> It works perfectly before, however relying on "whether some threads are
> created" flag might be not as straightforward to decide a graceful
> shutdown.
>
> Since we'll need to dynamically identify TLS channels anyway with the new
> helper (which is needed for main and postcopy channels), use the same
> simple API for multifd channels too. Also, we only need graceful shutdown
> on success of migrations.
>
> With that, we can remove the loop and drop migration_tls_channel_end().
>
> The comment there is still a good explanation, move it over to the new
> helper instead.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/tls.h | 1 -
> migration/channel.c | 7 +++++++
> migration/multifd.c | 40 +++++++---------------------------------
> migration/tls.c | 5 -----
> 4 files changed, 14 insertions(+), 39 deletions(-)
>
> diff --git a/migration/tls.h b/migration/tls.h
> index 58b25e1228..75c918e156 100644
> --- a/migration/tls.h
> +++ b/migration/tls.h
> @@ -36,7 +36,6 @@ void migration_tls_channel_connect(MigrationState *s,
> QIOChannel *ioc,
> const char *hostname,
> Error **errp);
> -void migration_tls_channel_end(QIOChannel *ioc, Error **errp);
> /* Whether the QIO channel requires further TLS handshake? */
> bool migrate_channel_requires_tls_upgrade(QIOChannel *ioc);
>
> diff --git a/migration/channel.c b/migration/channel.c
> index 1ae839e5fe..a481b45eae 100644
> --- a/migration/channel.c
> +++ b/migration/channel.c
> @@ -153,6 +153,13 @@ int migration_channel_read_peek(QIOChannel *ioc,
> bool migration_channel_shutdown_gracefully(QIOChannel *c, Error **errp)
> {
> if (object_dynamic_cast((Object *)c, TYPE_QIO_CHANNEL_TLS)) {
> + /*
> + * The destination expects the TLS session to always be properly
> + * terminated. This helps to detect a premature termination in the
> + * middle of the stream. Note that older QEMUs always break the
> + * connection on the source and the destination always sees
> + * GNUTLS_E_PREMATURE_TERMINATION.
> + */
> qio_channel_tls_bye(QIO_CHANNEL_TLS(c), errp);
> }
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index b255778855..cb0262076b 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -439,7 +439,7 @@ static void multifd_send_set_error(Error *err)
> }
> }
>
> -static void multifd_send_terminate_threads(void)
> +static void multifd_send_terminate_threads(bool succeeded)
> {
> int i;
>
> @@ -460,6 +460,9 @@ static void multifd_send_terminate_threads(void)
>
> qemu_sem_post(&p->sem);
> if (p->c) {
> + if (succeeded) {
> + migration_channel_shutdown_gracefully(p->c, &error_warn);
Maybe keep the warning on failure to gracefully terminate?
> + }
> qio_channel_shutdown(p->c, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
This triggers my multifd-code-smell detector. Reusing the loop. Could
the destination perhaps see this shutdown and start cleanup of it's
channels, causing the subsequent bye to fail?
> }
> }
> @@ -541,50 +544,21 @@ static void multifd_send_cleanup_state(void)
>
> void multifd_send_shutdown(void)
> {
> + MigrationState *s = migrate_get_current();
> int i;
>
> if (!migrate_multifd()) {
> return;
> }
>
> - for (i = 0; i < migrate_multifd_channels(); i++) {
> - MultiFDSendParams *p = &multifd_send_state->params[i];
> -
> - /* thread_created implies the TLS handshake has succeeded */
> - if (p->tls_thread_created && p->thread_created) {
> - Error *local_err = NULL;
> - /*
> - * The destination expects the TLS session to always be
> - * properly terminated. This helps to detect a premature
> - * termination in the middle of the stream. Note that
> - * older QEMUs always break the connection on the source
> - * and the destination always sees
> - * GNUTLS_E_PREMATURE_TERMINATION.
> - */
> - migration_tls_channel_end(p->c, &local_err);
> -
> - /*
> - * The above can return an error in case the migration has
> - * already failed. If the migration succeeded, errors are
> - * not expected but there's no need to kill the source.
> - */
> - if (local_err && !migration_has_failed(migrate_get_current())) {
> - warn_report(
> - "multifd_send_%d: Failed to terminate TLS connection: %s",
> - p->id, error_get_pretty(local_err));
> - break;
> - }
> - }
> - }
> -
> - multifd_send_terminate_threads();
> + multifd_send_terminate_threads(!migration_has_failed(s));
Looks awkward to pass this value in instead of caching inside the
function. But no big deal.
>
> for (i = 0; i < migrate_multifd_channels(); i++) {
> MultiFDSendParams *p = &multifd_send_state->params[i];
> Error *local_err = NULL;
>
> if (!multifd_send_cleanup_channel(p, &local_err)) {
> - migrate_set_error(migrate_get_current(), local_err);
> + migrate_set_error(s, local_err);
> error_free(local_err);
> }
> }
> diff --git a/migration/tls.c b/migration/tls.c
> index 284a6194b2..ca1595e05d 100644
> --- a/migration/tls.c
> +++ b/migration/tls.c
> @@ -165,11 +165,6 @@ void migration_tls_channel_connect(MigrationState *s,
> NULL);
> }
>
> -void migration_tls_channel_end(QIOChannel *ioc, Error **errp)
> -{
> - qio_channel_tls_bye(QIO_CHANNEL_TLS(ioc), errp);
> -}
> -
> bool migrate_channel_requires_tls_upgrade(QIOChannel *ioc)
> {
> if (!migrate_tls()) {
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 0/3] migration/tls: Graceful shutdowns for main and postcopy channels
2025-09-17 20:56 ` Fabiano Rosas
@ 2025-09-17 21:50 ` Peter Xu
2025-09-18 13:47 ` Fabiano Rosas
0 siblings, 1 reply; 14+ messages in thread
From: Peter Xu @ 2025-09-17 21:50 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: qemu-devel, Juraj Marcin, Daniel P . Berrangé
On Wed, Sep 17, 2025 at 05:56:40PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > On Wed, Sep 10, 2025 at 12:01:41PM -0400, Peter Xu wrote:
> >> Fabiano fixed graceful shutdowns for multifd channels previously:
> >>
> >> https://lore.kernel.org/qemu-devel/20250206175824.22664-1-farosas@suse.de/
> >>
> >> However we likely forgot the rest channels. Do it the same for the main
> >> and postcopy channels. This fixes a warning message when running unit test
> >> /ARCH/migration/postcopy/preempt/tls/psk.
> >>
> >> Thanks,
> >>
> >> Peter Xu (3):
> >> migration/tls: Gracefully shutdown main and preempt channels
> >> migration: Make migration_has_failed() work even for CANCELLING
> >> migration/multifd: Use the new graceful termination helper
> >
> > Please hold off the review on this one. Juraj reported the issue wasn't
> > resolved by the changes, and I can also reproduce. I'll have a look and
> > repost..
>
> I'm wondering if the assumption that only succeeded migrations should
> gracefully exit is correct. My understanding is that we need to always
> exit gracefully, but after failure, the channel might not be there, so
> we ignore failures. But that does not seem to mean a failed migration
> can simply not exit gracefully.
Currently tls channels will ignore premature terminations whenever there's
a shutdown() on READ. So when failed / cancelled and whenever there's a
shutdown(), iochannel already does the bypass no matter what migration does.
Or do you mean we should remove that, and still try to do graceful
shutdowns even if the channel was shutdown()?
Feel free to have a look at v2 of this series, especially patch 1. v3 will
come soon, but just to say, v2 is hugely different from v1, and should be
fairly close to upcoming v3, at least on patch 1 which is the real fix.
I should have mentioned that earlier.. :(
--
Peter Xu
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/3] migration: Make migration_has_failed() work even for CANCELLING
2025-09-17 20:52 ` Fabiano Rosas
@ 2025-09-17 22:00 ` Peter Xu
2025-09-18 13:43 ` Fabiano Rosas
0 siblings, 1 reply; 14+ messages in thread
From: Peter Xu @ 2025-09-17 22:00 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: qemu-devel, Juraj Marcin, Daniel P . Berrangé
On Wed, Sep 17, 2025 at 05:52:54PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > We set CANCELLED very late, it means migration_has_failed() may not work
> > correctly if it's invoked before updating CANCELLING to CANCELLED.
> >
>
> The prophecy is fulfilled.
>
> https://wiki.qemu.org/ToDo/LiveMigration#Migration_cancel_concurrency
>
> I'm not sure I'm convinced, for instance, CANCELLING is part of
> migration_is_running(), while FAILED is not. This doesn't seem
> right. Another point is that CANCELLING is not a final state, so we're
> prone to later need a migration_has_finished_failing_now() helper. =)
Considering we only have two users so far, and the other user doesn't care
about CANCELLING (while the multifd shutdown cares?), then I assume it's ok
to treat CANCELLING to be "has failed"? :) I didn't try to interpret "has
failed" in English, but only for the sake of an universal helper that works
for both places.
Or maybe it can be is_failing() too? I don't have a strong feeling.
>
> My mental model is that CANCELLING is a transitional, ongoing state
> where we shouldn't really be making assumptions. Once FAILED is reached,
> then we're sure in which general state everything is.
>
> How did you catch this? It was one of the cancel tests that failed? I
> just noticed that multifd_send_shutdown() is called from
> migration_cleanup() before it changes the state to CANCELLED. So current
> code also has whatever issue you detected here.
No test failed, it was only by code observation, mentioned below [1],
exactly as you said.
I just think when cancelling the tls sessions, we shouldn't dump the error
messages anymore even if the bye failed. Or maybe we simply do not need to
invoke migration_tls_channel_end() when CANCELLING / FAILED? That's
relevant to your ask on the cover letter, we can discuss there.
This is very trivial. Let me know how you thinks. I can also drop this
patch when repost v3 but fix the postcopy warning first, which reliably
reproduce now with qtest.
>
> > Allow that state will make migration_has_failed() working as expected even
> > if it's invoked slightly earlier.
> >
> > One current user is the multifd code for the TLS graceful termination,
> > where it's before updating to CANCELLED.
[1]
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > migration/migration.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 7015c2b5e0..397917b1b3 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -1723,7 +1723,8 @@ int migration_call_notifiers(MigrationState *s, MigrationEventType type,
> >
> > bool migration_has_failed(MigrationState *s)
> > {
> > - return (s->state == MIGRATION_STATUS_CANCELLED ||
> > + return (s->state == MIGRATION_STATUS_CANCELLING ||
> > + s->state == MIGRATION_STATUS_CANCELLED ||
> > s->state == MIGRATION_STATUS_FAILED);
> > }
>
--
Peter Xu
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/3] migration: Make migration_has_failed() work even for CANCELLING
2025-09-17 22:00 ` Peter Xu
@ 2025-09-18 13:43 ` Fabiano Rosas
0 siblings, 0 replies; 14+ messages in thread
From: Fabiano Rosas @ 2025-09-18 13:43 UTC (permalink / raw)
To: Peter Xu; +Cc: qemu-devel, Juraj Marcin, Daniel P . Berrangé
Peter Xu <peterx@redhat.com> writes:
> On Wed, Sep 17, 2025 at 05:52:54PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > We set CANCELLED very late, it means migration_has_failed() may not work
>> > correctly if it's invoked before updating CANCELLING to CANCELLED.
>> >
>>
>> The prophecy is fulfilled.
>>
>> https://wiki.qemu.org/ToDo/LiveMigration#Migration_cancel_concurrency
>>
>> I'm not sure I'm convinced, for instance, CANCELLING is part of
>> migration_is_running(), while FAILED is not. This doesn't seem
>> right. Another point is that CANCELLING is not a final state, so we're
>> prone to later need a migration_has_finished_failing_now() helper. =)
>
> Considering we only have two users so far, and the other user doesn't care
> about CANCELLING (while the multifd shutdown cares?), then I assume it's ok
> to treat CANCELLING to be "has failed"? :) I didn't try to interpret "has
> failed" in English, but only for the sake of an universal helper that works
> for both places.
>
> Or maybe it can be is_failing() too? I don't have a strong feeling.
>
I'm not nitipicking on language. I'm pointing out that CANCELLING is a
transitory state, i.e. from migrate_cancel() until migrate_cleanup(),
while FAILED is a terminal state, nothing happens after it.
But fine, I guess it's really only *my* assumptions being broken and not
the ones in the code.
>>
>> My mental model is that CANCELLING is a transitional, ongoing state
>> where we shouldn't really be making assumptions. Once FAILED is reached,
>> then we're sure in which general state everything is.
>>
>> How did you catch this? It was one of the cancel tests that failed? I
>> just noticed that multifd_send_shutdown() is called from
>> migration_cleanup() before it changes the state to CANCELLED. So current
>> code also has whatever issue you detected here.
>
> No test failed, it was only by code observation, mentioned below [1],
> exactly as you said.
>
> I just think when cancelling the tls sessions, we shouldn't dump the error
> messages anymore even if the bye failed.
Ok
> Or maybe we simply do not need to
> invoke migration_tls_channel_end() when CANCELLING / FAILED? That's
> relevant to your ask on the cover letter, we can discuss there.
>
> This is very trivial.
Nah, let me review the patch properly, please.
> Let me know how you thinks. I can also drop this
> patch when repost v3 but fix the postcopy warning first, which reliably
> reproduce now with qtest.
>
>>
>> > Allow that state will make migration_has_failed() working as expected even
>> > if it's invoked slightly earlier.
>> >
>> > One current user is the multifd code for the TLS graceful termination,
>> > where it's before updating to CANCELLED.
>
> [1]
>
>> >
>> > Signed-off-by: Peter Xu <peterx@redhat.com>
>> > ---
>> > migration/migration.c | 3 ++-
>> > 1 file changed, 2 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/migration/migration.c b/migration/migration.c
>> > index 7015c2b5e0..397917b1b3 100644
>> > --- a/migration/migration.c
>> > +++ b/migration/migration.c
>> > @@ -1723,7 +1723,8 @@ int migration_call_notifiers(MigrationState *s, MigrationEventType type,
>> >
>> > bool migration_has_failed(MigrationState *s)
>> > {
>> > - return (s->state == MIGRATION_STATUS_CANCELLED ||
>> > + return (s->state == MIGRATION_STATUS_CANCELLING ||
>> > + s->state == MIGRATION_STATUS_CANCELLED ||
>> > s->state == MIGRATION_STATUS_FAILED);
>> > }
>>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 0/3] migration/tls: Graceful shutdowns for main and postcopy channels
2025-09-17 21:50 ` Peter Xu
@ 2025-09-18 13:47 ` Fabiano Rosas
2025-09-18 16:15 ` Peter Xu
0 siblings, 1 reply; 14+ messages in thread
From: Fabiano Rosas @ 2025-09-18 13:47 UTC (permalink / raw)
To: Peter Xu; +Cc: qemu-devel, Juraj Marcin, Daniel P . Berrangé
Peter Xu <peterx@redhat.com> writes:
> On Wed, Sep 17, 2025 at 05:56:40PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > On Wed, Sep 10, 2025 at 12:01:41PM -0400, Peter Xu wrote:
>> >> Fabiano fixed graceful shutdowns for multifd channels previously:
>> >>
>> >> https://lore.kernel.org/qemu-devel/20250206175824.22664-1-farosas@suse.de/
>> >>
>> >> However we likely forgot the rest channels. Do it the same for the main
>> >> and postcopy channels. This fixes a warning message when running unit test
>> >> /ARCH/migration/postcopy/preempt/tls/psk.
>> >>
>> >> Thanks,
>> >>
>> >> Peter Xu (3):
>> >> migration/tls: Gracefully shutdown main and preempt channels
>> >> migration: Make migration_has_failed() work even for CANCELLING
>> >> migration/multifd: Use the new graceful termination helper
>> >
>> > Please hold off the review on this one. Juraj reported the issue wasn't
>> > resolved by the changes, and I can also reproduce. I'll have a look and
>> > repost..
>>
>> I'm wondering if the assumption that only succeeded migrations should
>> gracefully exit is correct. My understanding is that we need to always
>> exit gracefully, but after failure, the channel might not be there, so
>> we ignore failures. But that does not seem to mean a failed migration
>> can simply not exit gracefully.
>
> Currently tls channels will ignore premature terminations whenever there's
> a shutdown() on READ. So when failed / cancelled and whenever there's a
> shutdown(), iochannel already does the bypass no matter what migration does.
>
I'm thinking if it's possible for a premature termination to be detected
by TLS before we did the shutdown(). So my suggestion was to always
bye() before shutdown(), not matter the state migration is in. But maybe
your way is ok, I'm not sure now. Let me read the other versions of the
series...
> Or do you mean we should remove that, and still try to do graceful
> shutdowns even if the channel was shutdown()?
>
> Feel free to have a look at v2 of this series, especially patch 1. v3 will
> come soon, but just to say, v2 is hugely different from v1, and should be
> fairly close to upcoming v3, at least on patch 1 which is the real fix.
>
Ah, sorry, I didn't see the v2 on my list. But it's there.
> I should have mentioned that earlier.. :(
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 0/3] migration/tls: Graceful shutdowns for main and postcopy channels
2025-09-18 13:47 ` Fabiano Rosas
@ 2025-09-18 16:15 ` Peter Xu
0 siblings, 0 replies; 14+ messages in thread
From: Peter Xu @ 2025-09-18 16:15 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: qemu-devel, Juraj Marcin, Daniel P . Berrangé
On Thu, Sep 18, 2025 at 10:47:48AM -0300, Fabiano Rosas wrote:
> I'm thinking if it's possible for a premature termination to be detected
> by TLS before we did the shutdown(). So my suggestion was to always
> bye() before shutdown(), not matter the state migration is in. But maybe
> your way is ok, I'm not sure now. Let me read the other versions of the
> series...
For failing / cancelling migrations, premature termination is likely fine.
IMHO we also shouldn't care too much on error reports on premature
terminations because it's failing anyway.
So maybe you're talking about shutdown()s when the migration is
successfully completed.
We can try to do that, but maybe it's not easily doable. E.g., we have the
preempt thread currently only be able to be kicked out by a shutdown() from
the dst main thread. While we can start to inject a bye() there before the
shutdown(), it'll be:
(1) a bye() sent concurrently while the preempt thread is still logically
owning and operating on the preempt channel (luckily, so far read-only),
and,
(2) we need to double check if this works if we send bye(WR) from dest to
src and whether it'll also gracefully shutdown the src side.
(3) currently, a shutdown() is synchronous. bye() is yet not. We may
then need similiar treatment (e.g. changing IO to block tempoararily?)
when doing explicit shutdown()s.
I very vaguely remember when working on this series I tried (2) and it
didn't really work, but I'm not very sure. Anyway, all these will add some
complexity, and we'd better justify it's worthwhile..
[...]
> Ah, sorry, I didn't see the v2 on my list. But it's there.
Nah, that's not your fault, maybe mine.
--
Peter Xu
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2025-09-18 16:17 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-10 16:01 [PATCH 0/3] migration/tls: Graceful shutdowns for main and postcopy channels Peter Xu
2025-09-10 16:01 ` [PATCH 1/3] migration/tls: Gracefully shutdown main and preempt channels Peter Xu
2025-09-17 20:22 ` Fabiano Rosas
2025-09-10 16:01 ` [PATCH 2/3] migration: Make migration_has_failed() work even for CANCELLING Peter Xu
2025-09-17 20:52 ` Fabiano Rosas
2025-09-17 22:00 ` Peter Xu
2025-09-18 13:43 ` Fabiano Rosas
2025-09-10 16:01 ` [PATCH 3/3] migration/multifd: Use the new graceful termination helper Peter Xu
2025-09-17 21:07 ` Fabiano Rosas
2025-09-11 13:13 ` [PATCH 0/3] migration/tls: Graceful shutdowns for main and postcopy channels Peter Xu
2025-09-17 20:56 ` Fabiano Rosas
2025-09-17 21:50 ` Peter Xu
2025-09-18 13:47 ` Fabiano Rosas
2025-09-18 16:15 ` Peter Xu
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).