* [PATCH v2] migration: Fix blocking in POSTCOPY_DEVICE during package load
@ 2026-04-23  9:44 Pranav Tyagi
  2026-04-23 19:12 ` Peter Xu
  2026-04-24 10:33 ` Juraj Marcin
  0 siblings, 2 replies; 6+ messages in thread
From: Pranav Tyagi @ 2026-04-23  9:44 UTC
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, Juraj Marcin, Prasad Pandit,
	Pranav Tyagi

The package_loaded event is never set if MIG_RP_MSG_PONG does not arrive
on the source from the destination in the return path thread. The
migration thread then blocks indefinitely in the POSTCOPY_DEVICE state,
waiting for the package_loaded event, even though the source VM can
safely resume in that situation because the destination has not yet
started. The pong message can be lost to a network failure or to a
destination crash before the pong is sent.

This patch removes the package_loaded event and uses rp_sem instead, so
that only one wake-up primitive needs to be kicked. A network failure or
destination crash is detected as an error in the return path thread,
whose out path now posts rp_sem. That kicks the migration thread out of
what would otherwise be an indefinite wait on rp_sem. The migration
thread then fails early and breaks out of the migration loop so that the
VM can resume on the source side.

Fixes: 7b842fe354c6 ("migration: Introduce POSTCOPY_DEVICE state")
Signed-off-by: Pranav Tyagi <prtyagi@redhat.com>
---
V1: https://lore.kernel.org/all/20260421052227.8278-1-prtyagi@redhat.com/

changed in v2:
- removed postcopy_package_loaded_event and using rp_sem to kick the
  migration thread
- using migration_rp_wait() in place of qemu_event_wait() in the
  migration thread

 migration/migration.c | 48 ++++++++++++++++++++++++++++---------------
 migration/migration.h |  1 -
 2 files changed, 31 insertions(+), 18 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 5c9aaa6e58..6e4988a590 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1661,7 +1661,6 @@ int migrate_init(MigrationState *s, Error **errp)
     migration_reset_vfio_bytes_transferred();
 
     s->postcopy_package_loaded = false;
-    qemu_event_reset(&s->postcopy_package_loaded_event);
 
     return 0;
 }
@@ -2317,7 +2316,7 @@ static void *source_return_path_thread(void *opaque)
             if (tmp32 == QEMU_VM_PING_PACKAGED_LOADED) {
                 trace_source_return_path_thread_postcopy_package_loaded();
                 ms->postcopy_package_loaded = true;
-                qemu_event_set(&ms->postcopy_package_loaded_event);
+                migration_rp_kick(ms);
             }
             break;
 
@@ -2388,16 +2387,21 @@ out:
         trace_source_return_path_thread_bad_end();
     }
 
-    if (ms->state == MIGRATION_STATUS_POSTCOPY_RECOVER) {
+    if (ms->state == MIGRATION_STATUS_POSTCOPY_RECOVER ||
+        ms->state == MIGRATION_STATUS_POSTCOPY_DEVICE) {
         /*
-         * this will be extremely unlikely: that we got yet another network
-         * issue during recovering of the 1st network failure.. during this
-         * period the main migration thread can be waiting on rp_sem for
-         * this thread to sync with the other side.
+         * The migration thread can get stuck waiting for rp_sem if the
+         * return path fails to sync with the destination. This handles
+         * two specific cases:
          *
-         * When this happens, explicitly kick the migration thread out of
-         * RECOVER stage and back to PAUSED, so the admin can try
-         * everything again.
+         * POSTCOPY_RECOVER: A failure occurs during a recovery attempt.
+         * We kick the migration thread back to PAUSED so the admin can
+         * retry.
+         *
+         * POSTCOPY_DEVICE: The MIG_RP_MSG_PONG is lost due to a
+         * network failure or destination crash. We kick the migration
+         * thread out of its wait so it can fail the migration and safely
+         * resume the VM on the source.
          */
         migration_rp_kick(ms);
     }
@@ -3226,12 +3230,24 @@ static MigIterateState migration_iteration_run(MigrationState *s)
         if (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE &&
             (s->postcopy_package_loaded || complete_ready)) {
             /*
-             * If package has been loaded, the event is set and we will
-             * immediatelly transition to POSTCOPY_ACTIVE. If we are ready for
-             * completion, we need to wait for destination to load the postcopy
-             * package before actually completing.
+             * We will immediately transition to POSTCOPY_ACTIVE.
+             * If we are ready for completion, we need to wait for
+             * destination to load the postcopy package before actually
+             * completing.
              */
-            qemu_event_wait(&s->postcopy_package_loaded_event);
+            while (!s->postcopy_package_loaded) {
+                if (migration_rp_wait(s)) {
+                    /*
+                     * Error happened. Migration thread was stuck waiting in
+                     * POSTCOPY_DEVICE for rp_sem which was never set.
+                     */
+                    migrate_set_state(&s->state,
+                                      MIGRATION_STATUS_POSTCOPY_DEVICE,
+                                      MIGRATION_STATUS_FAILING);
+                    return MIG_ITERATE_BREAK;
+                }
+            }
+            /* Acknowledgement received from the destination */
             migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_DEVICE,
                               MIGRATION_STATUS_POSTCOPY_ACTIVE);
         }
@@ -3863,7 +3879,6 @@ static void migration_instance_finalize(Object *obj)
     qemu_sem_destroy(&ms->rp_state.rp_pong_acks);
     qemu_sem_destroy(&ms->postcopy_qemufile_src_sem);
     error_free(ms->error);
-    qemu_event_destroy(&ms->postcopy_package_loaded_event);
 }
 
 static void migration_instance_init(Object *obj)
@@ -3885,7 +3900,6 @@ static void migration_instance_init(Object *obj)
     qemu_sem_init(&ms->wait_unplug_sem, 0);
     qemu_sem_init(&ms->postcopy_qemufile_src_sem, 0);
     qemu_mutex_init(&ms->qemu_file_lock);
-    qemu_event_init(&ms->postcopy_package_loaded_event, 0);
 }
 
 /*
diff --git a/migration/migration.h b/migration/migration.h
index b6888daced..9081e6a612 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -512,7 +512,6 @@ struct MigrationState {
     bool rdma_migration;
 
     bool postcopy_package_loaded;
-    QemuEvent postcopy_package_loaded_event;
 
     GSource *hup_source;
 
-- 
2.53.0
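
The wait-and-kick pattern described in the commit message can be sketched
outside QEMU as follows. This is only an illustration under stated
assumptions, not the QEMU implementation: ToyState and every toy_*
function are hypothetical names, POSIX semaphores stand in for QemuSemaphore,
and the behaviour of migration_rp_wait() (return nonzero once an error has
been recorded) is assumed from how the patch uses it. The point it shows is
that the return path kicks the same semaphore on both the success path and
the failure path, so the waiter cannot hang.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few MigrationState fields involved. */
typedef struct {
    sem_t rp_sem;               /* single wake-up channel (rp_sem analogue) */
    atomic_bool package_loaded; /* set when the "package loaded" pong arrives */
    atomic_bool has_error;      /* set when the return path fails */
    bool pong_arrives;          /* scenario selector for the toy return path */
} ToyState;

/* Analogue of migration_rp_kick(): wake whoever waits on rp_sem. */
static void toy_rp_kick(ToyState *s)
{
    sem_post(&s->rp_sem);
}

/*
 * Analogue of migration_rp_wait() (behaviour assumed): block on rp_sem,
 * then report -1 if an error was recorded, 0 otherwise.
 */
static int toy_rp_wait(ToyState *s)
{
    if (atomic_load(&s->has_error)) {
        return -1;
    }
    while (sem_wait(&s->rp_sem) == -1 && errno == EINTR) {
        /* retry if interrupted by a signal */
    }
    return atomic_load(&s->has_error) ? -1 : 0;
}

/* Toy return path thread: either deliver the pong or fail, then kick. */
static void *toy_return_path(void *opaque)
{
    ToyState *s = opaque;

    if (s->pong_arrives) {
        atomic_store(&s->package_loaded, true);
    } else {
        atomic_store(&s->has_error, true);  /* e.g. connection dropped */
    }
    toy_rp_kick(s);  /* kicked on both paths, so the waiter never hangs */
    return NULL;
}

/* Waiter side: loop until loaded, bail out with -1 on error. */
static int toy_wait_package_loaded(ToyState *s)
{
    while (!atomic_load(&s->package_loaded)) {
        if (toy_rp_wait(s)) {
            return -1;
        }
    }
    return 0;
}

/* Run one scenario end to end; returns 0 on success, -1 on failure. */
int toy_run(bool pong_arrives)
{
    ToyState s = { .pong_arrives = pong_arrives };
    pthread_t rp;
    int rc, ret;

    rc = sem_init(&s.rp_sem, 0, 0);
    assert(rc == 0);
    rc = pthread_create(&rp, NULL, toy_return_path, &s);
    assert(rc == 0);

    ret = toy_wait_package_loaded(&s);

    pthread_join(rp, NULL);
    sem_destroy(&s.rp_sem);
    return ret;
}
```

Calling toy_run(true) models the pong arriving and toy_run(false) models a
lost pong; in the second case the waiter returns an error instead of
blocking. On older toolchains the sketch may need to be built with -pthread.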




* Re: [PATCH v2] migration: Fix blocking in POSTCOPY_DEVICE during package load
  2026-04-23  9:44 [PATCH v2] migration: Fix blocking in POSTCOPY_DEVICE during package load Pranav Tyagi
@ 2026-04-23 19:12 ` Peter Xu
  2026-04-24  4:48   ` Pranav Tyagi
  2026-04-24 10:32   ` Juraj Marcin
  2026-04-24 10:33 ` Juraj Marcin
  1 sibling, 2 replies; 6+ messages in thread
From: Peter Xu @ 2026-04-23 19:12 UTC
  To: Pranav Tyagi; +Cc: qemu-devel, Fabiano Rosas, Juraj Marcin, Prasad Pandit

On Thu, Apr 23, 2026 at 03:14:38PM +0530, Pranav Tyagi wrote:
> The package_loaded event is not set in case MIG_RP_MSG_PONG does not
> arrive on the source from the destination in the return path thread. The
> migration thread would then be blocked waiting for package_loaded event
> indefinitely in POSTCOPY_DEVICE state. Where as, in such a condition the
> source VM can safely resume as the destination has not yet started. The
> pong message can get lost in case of a network failure or destination
> crash before sending the pong.
> 
> This patch removes the package_loaded event and uses rp_sem, instead of
> kicking multiple events. The error is detected in case of network
> failure or destination crash and rp_sem is set in the out path of the
> return path thread. This will kick the migration thread out from a
> condition of indefinitely waiting for rp_sem. The migration thread then
> fails early and breaks from the migration loop to resume the vm on the
> source side.
> 
> Fixes: 7b842fe354c6 ("migration: Introduce POSTCOPY_DEVICE state")
> Signed-off-by: Pranav Tyagi <prtyagi@redhat.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

I assume Juraj has already looked at this internally; in that case you can
always attach his R-b directly when posting or reposting.

If not, then it becomes a sincere request.. :-D

Thanks!

-- 
Peter Xu




* Re: [PATCH v2] migration: Fix blocking in POSTCOPY_DEVICE during package load
  2026-04-23 19:12 ` Peter Xu
@ 2026-04-24  4:48   ` Pranav Tyagi
  2026-04-24 10:32   ` Juraj Marcin
  1 sibling, 0 replies; 6+ messages in thread
From: Pranav Tyagi @ 2026-04-24  4:48 UTC
  To: Peter Xu; +Cc: qemu-devel, Fabiano Rosas, Juraj Marcin, Prasad Pandit


On Fri, Apr 24, 2026 at 12:43 AM Peter Xu <peterx@redhat.com> wrote:

> On Thu, Apr 23, 2026 at 03:14:38PM +0530, Pranav Tyagi wrote:
> > The package_loaded event is not set in case MIG_RP_MSG_PONG does not
> > arrive on the source from the destination in the return path thread. The
> > migration thread would then be blocked waiting for package_loaded event
> > indefinitely in POSTCOPY_DEVICE state. Where as, in such a condition the
> > source VM can safely resume as the destination has not yet started. The
> > pong message can get lost in case of a network failure or destination
> > crash before sending the pong.
> >
> > This patch removes the package_loaded event and uses rp_sem, instead of
> > kicking multiple events. The error is detected in case of network
> > failure or destination crash and rp_sem is set in the out path of the
> > return path thread. This will kick the migration thread out from a
> > condition of indefinitely waiting for rp_sem. The migration thread then
> > fails early and breaks from the migration loop to resume the vm on the
> > source side.
> >
> > Fixes: 7b842fe354c6 ("migration: Introduce POSTCOPY_DEVICE state")
> > Signed-off-by: Pranav Tyagi <prtyagi@redhat.com>
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
>
> I assume Juraj has looked at this already internally, in that case you can
> always attach his R-b directly when post / repost.
>
> If not, then it becomes a sincere request.. :-D
>
> Thanks!
>
> --
> Peter Xu
>

Hello Peter, thanks for the review. Juraj had already reviewed the patch,
so I'll remember to attach his R-b tag directly next time.

Regards
Pranav Tyagi



* Re: [PATCH v2] migration: Fix blocking in POSTCOPY_DEVICE during package load
  2026-04-23 19:12 ` Peter Xu
  2026-04-24  4:48   ` Pranav Tyagi
@ 2026-04-24 10:32   ` Juraj Marcin
  1 sibling, 0 replies; 6+ messages in thread
From: Juraj Marcin @ 2026-04-24 10:32 UTC
  To: Peter Xu; +Cc: Pranav Tyagi, qemu-devel, Fabiano Rosas, Prasad Pandit

Hi Peter,

On 2026-04-23 15:12, Peter Xu wrote:
> On Thu, Apr 23, 2026 at 03:14:38PM +0530, Pranav Tyagi wrote:
> > The package_loaded event is not set in case MIG_RP_MSG_PONG does not
> > arrive on the source from the destination in the return path thread. The
> > migration thread would then be blocked waiting for package_loaded event
> > indefinitely in POSTCOPY_DEVICE state. Where as, in such a condition the
> > source VM can safely resume as the destination has not yet started. The
> > pong message can get lost in case of a network failure or destination
> > crash before sending the pong.
> > 
> > This patch removes the package_loaded event and uses rp_sem, instead of
> > kicking multiple events. The error is detected in case of network
> > failure or destination crash and rp_sem is set in the out path of the
> > return path thread. This will kick the migration thread out from a
> > condition of indefinitely waiting for rp_sem. The migration thread then
> > fails early and breaks from the migration loop to resume the vm on the
> > source side.
> > 
> > Fixes: 7b842fe354c6 ("migration: Introduce POSTCOPY_DEVICE state")
> > Signed-off-by: Pranav Tyagi <prtyagi@redhat.com>
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> 
> I assume Juraj has looked at this already internally, in that case you can
> always attach his R-b directly when post / repost.

I did indeed check it, you can include my R-b!

> 
> If not, then it becomes a sincere request.. :-D
> 
> Thanks!
> 
> -- 
> Peter Xu
> 




* Re: [PATCH v2] migration: Fix blocking in POSTCOPY_DEVICE during package load
  2026-04-23  9:44 [PATCH v2] migration: Fix blocking in POSTCOPY_DEVICE during package load Pranav Tyagi
  2026-04-23 19:12 ` Peter Xu
@ 2026-04-24 10:33 ` Juraj Marcin
  2026-04-24 14:00   ` Peter Xu
  1 sibling, 1 reply; 6+ messages in thread
From: Juraj Marcin @ 2026-04-24 10:33 UTC
  To: Pranav Tyagi; +Cc: qemu-devel, Peter Xu, Fabiano Rosas, Prasad Pandit

On 2026-04-23 15:14, Pranav Tyagi wrote:
> The package_loaded event is not set in case MIG_RP_MSG_PONG does not
> arrive on the source from the destination in the return path thread. The
> migration thread would then be blocked waiting for package_loaded event
> indefinitely in POSTCOPY_DEVICE state. Where as, in such a condition the
> source VM can safely resume as the destination has not yet started. The
> pong message can get lost in case of a network failure or destination
> crash before sending the pong.
> 
> This patch removes the package_loaded event and uses rp_sem, instead of
> kicking multiple events. The error is detected in case of network
> failure or destination crash and rp_sem is set in the out path of the
> return path thread. This will kick the migration thread out from a
> condition of indefinitely waiting for rp_sem. The migration thread then
> fails early and breaks from the migration loop to resume the vm on the
> source side.
> 
> Fixes: 7b842fe354c6 ("migration: Introduce POSTCOPY_DEVICE state")
> Signed-off-by: Pranav Tyagi <prtyagi@redhat.com>
> ---
> V1: https://lore.kernel.org/all/20260421052227.8278-1-prtyagi@redhat.com/
> 
> changed in v2:
> - removed postcopy_package_loaded_event and using rp_sem to kick the
>   migration thread
> - using migration_rp_wait() in place of qemu_event_wait() in the
>   migration thread
> 
>  migration/migration.c | 48 ++++++++++++++++++++++++++++---------------
>  migration/migration.h |  1 -
>  2 files changed, 31 insertions(+), 18 deletions(-)

Reviewed-by: Juraj Marcin <jmarcin@redhat.com>




* Re: [PATCH v2] migration: Fix blocking in POSTCOPY_DEVICE during package load
  2026-04-24 10:33 ` Juraj Marcin
@ 2026-04-24 14:00   ` Peter Xu
  0 siblings, 0 replies; 6+ messages in thread
From: Peter Xu @ 2026-04-24 14:00 UTC
  To: Juraj Marcin; +Cc: Pranav Tyagi, qemu-devel, Fabiano Rosas, Prasad Pandit

On Fri, Apr 24, 2026 at 12:33:02PM +0200, Juraj Marcin wrote:
> Reviewed-by: Juraj Marcin <jmarcin@redhat.com>

Thank you!  I queued it for 11.1.

-- 
Peter Xu


