* [PATCH] migration: Fix blocking in POSTCOPY_DEVICE during package load
@ 2026-04-21 5:22 Pranav Tyagi
2026-04-21 21:25 ` Peter Xu
0 siblings, 1 reply; 3+ messages in thread
From: Pranav Tyagi @ 2026-04-21 5:22 UTC (permalink / raw)
To: qemu-devel
Cc: Peter Xu, Fabiano Rosas, Juraj Marcin, Prasad Pandit,
Pranav Tyagi
The package_loaded event is not set in case MIG_RP_MSG_PONG does not
arrive on the source from the destination in the return path thread. The
migration thread would then be blocked waiting for package_loaded event
indefinitely in POSTCOPY_DEVICE state. Where as, in such a condition the
source VM can safely resume as the destination has not yet started. The
pong message can get lost in case of a network failure or destination
crash before sending the pong.
This patch uses the error detected in case of network failure or
destination crash to set the package_loaded event in the out path of the
return path thread. This will kick the migration thread out from
a condition of indefinitely waiting for the package_loaded event. The
migration thread then fails early and breaks from the migration loop to
resume the VM on the source side.
Fixes: 7b842fe354c6 ("migration: Introduce POSTCOPY_DEVICE state")
Signed-off-by: Pranav Tyagi <prtyagi@redhat.com>
---
migration/migration.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/migration/migration.c b/migration/migration.c
index 5c9aaa6e58..1656c1203c 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2386,6 +2386,15 @@ out:
if (err) {
migrate_error_propagate(ms, err);
trace_source_return_path_thread_bad_end();
+ if (ms->state == MIGRATION_STATUS_POSTCOPY_DEVICE) {
+ /*
+ * Kick the migration thread if it gets stuck in
+ * POSTCOPY_DEVICE state waiting for
+ * postcopy_package_loaded_event. The event will never be
+ * set as MIG_RP_MSG_PONG from the destination is lost.
+ */
+ qemu_event_set(&ms->postcopy_package_loaded_event);
+ }
}
if (ms->state == MIGRATION_STATUS_POSTCOPY_RECOVER) {
@@ -3232,6 +3241,17 @@ static MigIterateState migration_iteration_run(MigrationState *s)
* package before actually completing.
*/
qemu_event_wait(&s->postcopy_package_loaded_event);
+ /*
+ * Check for errors in case the migration thread was stuck in
+ * POSTCOPY_DEVICE state waiting for the
+ * postcopy_package_loaded_event which was never set.
+ * If so, fail now and break out of the iteration.
+ */
+ if (migrate_has_error(s)) {
+ migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_DEVICE,
+ MIGRATION_STATUS_FAILING);
+ return MIG_ITERATE_BREAK;
+ }
migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_DEVICE,
MIGRATION_STATUS_POSTCOPY_ACTIVE);
}
--
2.53.0
^ permalink raw reply related [flat|nested] 3+ messages in thread* Re: [PATCH] migration: Fix blocking in POSTCOPY_DEVICE during package load 2026-04-21 5:22 [PATCH] migration: Fix blocking in POSTCOPY_DEVICE during package load Pranav Tyagi @ 2026-04-21 21:25 ` Peter Xu 2026-04-22 8:49 ` Pranav Tyagi 0 siblings, 1 reply; 3+ messages in thread From: Peter Xu @ 2026-04-21 21:25 UTC (permalink / raw) To: Pranav Tyagi; +Cc: qemu-devel, Fabiano Rosas, Juraj Marcin, Prasad Pandit On Tue, Apr 21, 2026 at 10:52:27AM +0530, Pranav Tyagi wrote: > The package_loaded event is not set in case MIG_RP_MSG_PONG does not > arrive on the source from the destination in the return path thread. The > migration thread would then be blocked waiting for package_loaded event > indefinitely in POSTCOPY_DEVICE state. Where as, in such a condition the > source VM can safely resume as the destination has not yet started. The > pong message can get lost in case of a network failure or destination > crash before sending the pong. > > This patch uses the error detected in case of network failure or > destination crash to set the package_loaded event in the out path of the > return path thread. This will kick the migration thread out from > a condition of indefinitely waiting for the package_loaded event. The > migration thread then fails early and breaks from the migration loop to > resume the VM on the source side. > > Fixes: 7b842fe354c6 ("migration: Introduce POSTCOPY_DEVICE state") > Signed-off-by: Pranav Tyagi <prtyagi@redhat.com> Ah I see.. thanks for figuring this out. > --- > migration/migration.c | 20 ++++++++++++++++++++ > 1 file changed, 20 insertions(+) > > diff --git a/migration/migration.c b/migration/migration.c > index 5c9aaa6e58..1656c1203c 100644 > --- a/migration/migration.c > +++ b/migration/migration.c > @@ -2386,6 +2386,15 @@ out: > if (err) { > migrate_error_propagate(ms, err); > trace_source_return_path_thread_bad_end(); > + if (ms->state == MIGRATION_STATUS_POSTCOPY_DEVICE) { > + /* > + * Kick the migration thread if it gets stuck in > + * POSTCOPY_DEVICE state waiting for > + * postcopy_package_loaded_event. The event will never be > + * set as MIG_RP_MSG_PONG from the destination is lost. > + */ > + qemu_event_set(&ms->postcopy_package_loaded_event); > + } This makes sense. Said that, we have another similar case right below for the postcopy recover path: if (ms->state == MIGRATION_STATUS_POSTCOPY_RECOVER) { /* * this will be extremely unlikely: that we got yet another network * issue during recovering of the 1st network failure.. during this * period the main migration thread can be waiting on rp_sem for * this thread to sync with the other side. * * When this happens, explicitly kick the migration thread out of * RECOVER stage and back to PAUSED, so the admin can try * everything again. */ migration_rp_kick(ms); } It achieves some similar goal, where the migration thread is waiting for some event from return path thread, and when the channel is broken we want to kick it out. I'm thinking if we can reuse that, instead of kicking multiple events. Say, we can do migration_rp_kick() for both POSTCOPY_RECOVER and POSTCOPY_DEVICE states in the rp thread. Meanwhile at below.. > } > > if (ms->state == MIGRATION_STATUS_POSTCOPY_RECOVER) { > @@ -3232,6 +3241,17 @@ static MigIterateState migration_iteration_run(MigrationState *s) > * package before actually completing. > */ > qemu_event_wait(&s->postcopy_package_loaded_event); > + /* > + * Check for errors in case the migration thread was stuck in > + * POSTCOPY_DEVICE state waiting for the > + * postcopy_package_loaded_event which was never set. > + * If so, fail now and break out of the iteration. > + */ > + if (migrate_has_error(s)) { > + migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_DEVICE, > + MIGRATION_STATUS_FAILING); > + return MIG_ITERATE_BREAK; > + } ... instead of waiting on postcopy_package_loaded_event, we can do: while (!s->postcopy_package_loaded) { if (migration_rp_wait(s)) { /* Error happened */ migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_DEVICE, MIGRATION_STATUS_FAILING); return MIG_ITERATE_BREAK; } } /* Acknowledgement received from dest */ migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_DEVICE, MIGRATION_STATUS_POSTCOPY_ACTIVE); PS: I think event also works in this case, so likely either one shared sem or event should work; it's just that it was a sem before, and rp_sem has a name that is generic enough to apply directly to the postcopy_package_loaded_event use case. Thanks, > migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_DEVICE, > MIGRATION_STATUS_POSTCOPY_ACTIVE); > } > -- > 2.53.0 > -- Peter Xu ^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [PATCH] migration: Fix blocking in POSTCOPY_DEVICE during package load 2026-04-21 21:25 ` Peter Xu @ 2026-04-22 8:49 ` Pranav Tyagi 0 siblings, 0 replies; 3+ messages in thread From: Pranav Tyagi @ 2026-04-22 8:49 UTC (permalink / raw) To: Peter Xu; +Cc: qemu-devel, Fabiano Rosas, Juraj Marcin, Prasad Pandit [-- Attachment #1: Type: text/plain, Size: 5363 bytes --] On Wed, Apr 22, 2026 at 2:55 AM Peter Xu <peterx@redhat.com> wrote: > On Tue, Apr 21, 2026 at 10:52:27AM +0530, Pranav Tyagi wrote: > > The package_loaded event is not set in case MIG_RP_MSG_PONG does not > > arrive on the source from the destination in the return path thread. The > > migration thread would then be blocked waiting for package_loaded event > > indefinitely in POSTCOPY_DEVICE state. Where as, in such a condition the > > source VM can safely resume as the destination has not yet started. The > > pong message can get lost in case of a network failure or destination > > crash before sending the pong. > > > > This patch uses the error detected in case of network failure or > > destination crash to set the package_loaded event in the out path of the > > return path thread. This will kick the migration thread out from > > a condition of indefinitely waiting for the package_loaded event. The > > migration thread then fails early and breaks from the migration loop to > > resume the VM on the source side. > > > > Fixes: 7b842fe354c6 ("migration: Introduce POSTCOPY_DEVICE state") > > Signed-off-by: Pranav Tyagi <prtyagi@redhat.com> > > Ah I see.. thanks for figuring this out. > > > --- > > migration/migration.c | 20 ++++++++++++++++++++ > > 1 file changed, 20 insertions(+) > > > > diff --git a/migration/migration.c b/migration/migration.c > > index 5c9aaa6e58..1656c1203c 100644 > > --- a/migration/migration.c > > +++ b/migration/migration.c > > @@ -2386,6 +2386,15 @@ out: > > if (err) { > > migrate_error_propagate(ms, err); > > trace_source_return_path_thread_bad_end(); > > + if (ms->state == MIGRATION_STATUS_POSTCOPY_DEVICE) { > > + /* > > + * Kick the migration thread if it gets stuck in > > + * POSTCOPY_DEVICE state waiting for > > + * postcopy_package_loaded_event. The event will never be > > + * set as MIG_RP_MSG_PONG from the destination is lost. > > + */ > > + qemu_event_set(&ms->postcopy_package_loaded_event); > > + } > > This makes sense. Said that, we have another similar case right below for > the postcopy recover path: > > if (ms->state == MIGRATION_STATUS_POSTCOPY_RECOVER) { > /* > * this will be extremely unlikely: that we got yet another network > * issue during recovering of the 1st network failure.. during this > * period the main migration thread can be waiting on rp_sem for > * this thread to sync with the other side. > * > * When this happens, explicitly kick the migration thread out of > * RECOVER stage and back to PAUSED, so the admin can try > * everything again. > */ > migration_rp_kick(ms); > } > > It achieves some similar goal, where the migration thread is waiting for > some event from return path thread, and when the channel is broken we want > to kick it out. > > I'm thinking if we can reuse that, instead of kicking multiple events. > > Say, we can do migration_rp_kick() for both POSTCOPY_RECOVER and > POSTCOPY_DEVICE states in the rp thread. Meanwhile at below.. > > > } > > > > if (ms->state == MIGRATION_STATUS_POSTCOPY_RECOVER) { > > @@ -3232,6 +3241,17 @@ static MigIterateState > migration_iteration_run(MigrationState *s) > > * package before actually completing. > > */ > > qemu_event_wait(&s->postcopy_package_loaded_event); > > + /* > > + * Check for errors in case the migration thread was stuck > in > > + * POSTCOPY_DEVICE state waiting for the > > + * postcopy_package_loaded_event which was never set. > > + * If so, fail now and break out of the iteration. > > + */ > > + if (migrate_has_error(s)) { > > + migrate_set_state(&s->state, > MIGRATION_STATUS_POSTCOPY_DEVICE, > > + MIGRATION_STATUS_FAILING); > > + return MIG_ITERATE_BREAK; > > + } > > ... instead of waiting on postcopy_package_loaded_event, we can do: > > while (!s->postcopy_package_loaded) { > if (migration_rp_wait(s)) { > /* Error happened */ > migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_DEVICE, > MIGRATION_STATUS_FAILING); > return MIG_ITERATE_BREAK; > } > } > /* Acknowledgement received from dest */ > migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_DEVICE, > MIGRATION_STATUS_POSTCOPY_ACTIVE); > > PS: I think event also works in this case, so likely either one shared sem > or event should work; it's just that it was a sem before, and rp_sem has a > name that is generic enough to apply directly to the > postcopy_package_loaded_event use case. > > Thanks, > > > migrate_set_state(&s->state, > MIGRATION_STATUS_POSTCOPY_DEVICE, > > MIGRATION_STATUS_POSTCOPY_ACTIVE); > > } > > -- > > 2.53.0 > > > > -- > Peter Xu > > Thanks for the feedback Peter. I will update the patch accordingly and send it again. Regards Pranav Tyagi [-- Attachment #2: Type: text/html, Size: 6647 bytes --] ^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2026-04-22 8:50 UTC | newest] Thread overview: 3+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-04-21 5:22 [PATCH] migration: Fix blocking in POSTCOPY_DEVICE during package load Pranav Tyagi 2026-04-21 21:25 ` Peter Xu 2026-04-22 8:49 ` Pranav Tyagi
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.