* [PATCH 0/4] migration: Introduce POSTCOPY_DEVICE state
@ 2025-09-15 11:59 Juraj Marcin
  2025-09-15 11:59 ` [PATCH 1/4] migration: Do not try to start VM if disk activation fails Juraj Marcin
                   ` (3 more replies)
  0 siblings, 4 replies; 33+ messages in thread
From: Juraj Marcin @ 2025-09-15 11:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juraj Marcin, Jiri Denemark, Peter Xu, Dr. David Alan Gilbert,
	Fabiano Rosas

This series is a continuation of the following RFC series and its
discussion [1].

[1]: https://lore.kernel.org/all/20250807114922.1013286-1-jmarcin@redhat.com/

This series takes a different approach to source-side recoverability
than the original RFC series: it uses the existing PING/PONG message
types. Although this approach has some theoretical race conditions, the
discussion concluded that in practice the chance of hitting them is
very slim, if it exists at all. On the other hand, this approach
requires no changes to the migration protocol, nor to the destination
side QEMU instance, in order to work.

In preparation for introducing the new state, this series contains a
few changes.

First, it includes a patch suggested by Peter, which adds a check to
block device activation when the source side tries to resume after a
failed migration.

Next, it refactors cleanup and error handling on the destination side.
This change is not strictly necessary for the feature to work. Without
this patch, if the device state load fails, the destination QEMU would
either exit with an error exit code from the listen thread, or it might
crash if the main thread does some cleanup before the listen thread
exits the process. However, the source side can recover regardless of
how the destination side fails.

Finally, the last patch contains the main feature, the POSTCOPY_DEVICE
state. Compared to the approach discussed in the RFC, it uses a new PING
message with a custom PING number. The reason is that the PING 3 message
is currently sent only when postcopy-ram is active, but there might be
postcopy scenarios where that isn't the case. The destination side can
respond to this new PING message without any changes required.
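
For reference, the existing destination-side handling (paraphrased from
memory from loadvm_process_command() in migration/savevm.c, so treat it
as a sketch rather than an exact quote) simply echoes any PING value
back as a PONG, which is why an unknown PING number needs no new code
on the destination:

    case MIG_CMD_PING:
        if (!mis->to_src_file) {
            error_report("CMD_PING (0x%x) received with no return path",
                         tmp32);
            return -1;
        }
        migrate_send_rp_pong(mis, tmp32);
        break;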

As this change introduces a new migration state, I have also tested it
with libvirt. Apart from a warning about an unknown migration state
received in an event, migration finishes without any issues.

Juraj Marcin (3):
  migration: Accept MigrationStatus in migration_has_failed()
  migration: Refactor incoming cleanup into migration_incoming_finish()
  migration: Introduce POSTCOPY_DEVICE state

Peter Xu (1):
  migration: Do not try to start VM if disk activation fails

 migration/migration.c                 | 124 +++++++++++++++++---------
 migration/migration.h                 |   3 +-
 migration/multifd.c                   |   2 +-
 migration/savevm.c                    |  48 ++++------
 migration/savevm.h                    |   2 +
 migration/trace-events                |   1 +
 qapi/migration.json                   |   8 +-
 tests/qtest/migration/precopy-tests.c |   3 +-
 8 files changed, 112 insertions(+), 79 deletions(-)

-- 
2.51.0




* [PATCH 1/4] migration: Do not try to start VM if disk activation fails
  2025-09-15 11:59 [PATCH 0/4] migration: Introduce POSTCOPY_DEVICE state Juraj Marcin
@ 2025-09-15 11:59 ` Juraj Marcin
  2025-09-19 16:12   ` Fabiano Rosas
  2025-09-15 11:59 ` [PATCH 2/4] migration: Accept MigrationStatus in migration_has_failed() Juraj Marcin
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 33+ messages in thread
From: Juraj Marcin @ 2025-09-15 11:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juraj Marcin, Jiri Denemark, Peter Xu, Dr. David Alan Gilbert,
	Fabiano Rosas

From: Peter Xu <peterx@redhat.com>

If a rare split brain happens (e.g. dest QEMU started running somehow,
taking shared drive locks), src QEMU may not be able to activate the
drives anymore.  In this case, src QEMU shouldn't start the VM or it might
crash the block layer later with something like:

bdrv_co_write_req_prepare: Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.

Meanwhile, src QEMU cannot try to continue either, even if dest QEMU can
release the drive locks (e.g. by QMP "stop"), because as long as dest QEMU
has started running, dest QEMU's RAM is the only version that is
consistent with the current state of the shared storage.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 29 ++++++++++++++++++++++++-----
 1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 10c216d25d..54dac3db88 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3502,6 +3502,8 @@ static MigIterateState migration_iteration_run(MigrationState *s)
 
 static void migration_iteration_finish(MigrationState *s)
 {
+    Error *local_err = NULL;
+
     bql_lock();
 
     /*
@@ -3525,11 +3527,28 @@ static void migration_iteration_finish(MigrationState *s)
     case MIGRATION_STATUS_FAILED:
     case MIGRATION_STATUS_CANCELLED:
     case MIGRATION_STATUS_CANCELLING:
-        /*
-         * Re-activate the block drives if they're inactivated.  Note, COLO
-         * shouldn't use block_active at all, so it should be no-op there.
-         */
-        migration_block_activate(NULL);
+        if (!migration_block_activate(&local_err)) {
+            /*
+            * Re-activate the block drives if they're inactivated.
+            *
+            * If it fails (e.g. in case of a split brain, where dest QEMU
+            * might have taken some of the drive locks and running!), do
+            * not start VM, instead wait for mgmt to decide the next step.
+            *
+            * If dest already started, it means dest QEMU should contain
+            * all the data it needs and it properly owns all the drive
+            * locks.  Then even if src QEMU got a FAILED in migration, it
+            * normally should mean we should treat the migration as
+            * COMPLETED.
+            *
+            * NOTE: it's not safe anymore to start VM on src now even if
+            * dest would release the drive locks.  It's because as long as
+            * dest started running then only dest QEMU's RAM is consistent
+            * with the shared storage.
+            */
+            error_free(local_err);
+            break;
+        }
         if (runstate_is_live(s->vm_old_state)) {
             if (!runstate_check(RUN_STATE_SHUTDOWN)) {
                 vm_start();
-- 
2.51.0




* [PATCH 2/4] migration: Accept MigrationStatus in migration_has_failed()
  2025-09-15 11:59 [PATCH 0/4] migration: Introduce POSTCOPY_DEVICE state Juraj Marcin
  2025-09-15 11:59 ` [PATCH 1/4] migration: Do not try to start VM if disk activation fails Juraj Marcin
@ 2025-09-15 11:59 ` Juraj Marcin
  2025-09-19 14:57   ` Peter Xu
  2025-09-15 11:59 ` [PATCH 3/4] migration: Refactor incoming cleanup into migration_incoming_finish() Juraj Marcin
  2025-09-15 11:59 ` [PATCH 4/4] migration: Introduce POSTCOPY_DEVICE state Juraj Marcin
  3 siblings, 1 reply; 33+ messages in thread
From: Juraj Marcin @ 2025-09-15 11:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juraj Marcin, Jiri Denemark, Peter Xu, Dr. David Alan Gilbert,
	Fabiano Rosas

From: Juraj Marcin <jmarcin@redhat.com>

This allows the helper to also be reused with MigrationIncomingState.

Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
---
 migration/migration.c | 8 ++++----
 migration/migration.h | 2 +-
 migration/multifd.c   | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 54dac3db88..2c0b3a7229 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1542,7 +1542,7 @@ static void migration_cleanup(MigrationState *s)
         /* It is used on info migrate.  We can't free it */
         error_report_err(error_copy(s->error));
     }
-    type = migration_has_failed(s) ? MIG_EVENT_PRECOPY_FAILED :
+    type = migration_has_failed(s->state) ? MIG_EVENT_PRECOPY_FAILED :
                                      MIG_EVENT_PRECOPY_DONE;
     migration_call_notifiers(s, type, NULL);
     yank_unregister_instance(MIGRATION_YANK_INSTANCE);
@@ -1700,10 +1700,10 @@ int migration_call_notifiers(MigrationState *s, MigrationEventType type,
     return ret;
 }
 
-bool migration_has_failed(MigrationState *s)
+bool migration_has_failed(MigrationStatus state)
 {
-    return (s->state == MIGRATION_STATUS_CANCELLED ||
-            s->state == MIGRATION_STATUS_FAILED);
+    return (state == MIGRATION_STATUS_CANCELLED ||
+            state == MIGRATION_STATUS_FAILED);
 }
 
 bool migration_in_postcopy(void)
diff --git a/migration/migration.h b/migration/migration.h
index 01329bf824..2c2331f40d 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -535,7 +535,7 @@ bool migration_is_blocked(Error **errp);
 bool migration_in_postcopy(void);
 bool migration_postcopy_is_alive(MigrationStatus state);
 MigrationState *migrate_get_current(void);
-bool migration_has_failed(MigrationState *);
+bool migration_has_failed(MigrationStatus state);
 bool migrate_mode_is_cpr(MigrationState *);
 
 uint64_t ram_get_total_transferred_pages(void);
diff --git a/migration/multifd.c b/migration/multifd.c
index b255778855..c569f91f2c 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -568,7 +568,7 @@ void multifd_send_shutdown(void)
              * already failed. If the migration succeeded, errors are
              * not expected but there's no need to kill the source.
              */
-            if (local_err && !migration_has_failed(migrate_get_current())) {
+            if (local_err && !migration_has_failed(migrate_get_current()->state)) {
                 warn_report(
                     "multifd_send_%d: Failed to terminate TLS connection: %s",
                     p->id, error_get_pretty(local_err));
-- 
2.51.0




* [PATCH 3/4] migration: Refactor incoming cleanup into migration_incoming_finish()
  2025-09-15 11:59 [PATCH 0/4] migration: Introduce POSTCOPY_DEVICE state Juraj Marcin
  2025-09-15 11:59 ` [PATCH 1/4] migration: Do not try to start VM if disk activation fails Juraj Marcin
  2025-09-15 11:59 ` [PATCH 2/4] migration: Accept MigrationStatus in migration_has_failed() Juraj Marcin
@ 2025-09-15 11:59 ` Juraj Marcin
  2025-09-19 15:53   ` Peter Xu
  2025-09-19 16:46   ` Fabiano Rosas
  2025-09-15 11:59 ` [PATCH 4/4] migration: Introduce POSTCOPY_DEVICE state Juraj Marcin
  3 siblings, 2 replies; 33+ messages in thread
From: Juraj Marcin @ 2025-09-15 11:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juraj Marcin, Jiri Denemark, Peter Xu, Dr. David Alan Gilbert,
	Fabiano Rosas

From: Juraj Marcin <jmarcin@redhat.com>

Currently, two functions are responsible for cleanup of the incoming
migration state: with successful precopy the cleanup runs in the main
thread, and with successful postcopy in the listen thread. However, if
postcopy fails during the device state load, both functions will try to
do the cleanup. Moreover, when the exit-on-error parameter was added, it
was applied only to precopy.

This patch refactors the common cleanup and exiting on error into a
helper function that can be invoked from either the precopy or the
postcopy path, reducing the duplication. If the listen thread has been
started (the postcopy
state is at least LISTENING), the listen thread is responsible for all
cleanup and exiting, otherwise it's the main thread's responsibility.

Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
---
 migration/migration.c | 64 ++++++++++++++++++++++++-------------------
 migration/migration.h |  1 +
 migration/savevm.c    | 48 +++++++++++---------------------
 3 files changed, 53 insertions(+), 60 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 2c0b3a7229..7222e3de13 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -442,9 +442,19 @@ void migration_incoming_transport_cleanup(MigrationIncomingState *mis)
 void migration_incoming_state_destroy(void)
 {
     struct MigrationIncomingState *mis = migration_incoming_get_current();
+    PostcopyState ps = postcopy_state_get();
 
     multifd_recv_cleanup();
 
+    if (mis->have_listen_thread) {
+        qemu_thread_join(&mis->listen_thread);
+        mis->have_listen_thread = false;
+    }
+
+    if (ps != POSTCOPY_INCOMING_NONE) {
+        postcopy_ram_incoming_cleanup(mis);
+    }
+
     /*
      * RAM state cleanup needs to happen after multifd cleanup, because
      * multifd threads can use some of its states (receivedmap).
@@ -809,6 +819,23 @@ static void qemu_start_incoming_migration(const char *uri, bool has_channels,
     cpr_state_close();
 }
 
+void migration_incoming_finish(void)
+{
+    MigrationState *s = migrate_get_current();
+    MigrationIncomingState *mis = migration_incoming_get_current();
+
+    migration_incoming_state_destroy();
+
+    if (migration_has_failed(mis->state) && mis->exit_on_error) {
+        WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
+            error_report_err(s->error);
+            s->error = NULL;
+        }
+
+        exit(EXIT_FAILURE);
+    }
+}
+
 static void process_incoming_migration_bh(void *opaque)
 {
     MigrationIncomingState *mis = opaque;
@@ -861,7 +888,7 @@ static void process_incoming_migration_bh(void *opaque)
      */
     migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
                       MIGRATION_STATUS_COMPLETED);
-    migration_incoming_state_destroy();
+    migration_incoming_finish();
 }
 
 static void coroutine_fn
@@ -888,23 +915,13 @@ process_incoming_migration_co(void *opaque)
 
     ps = postcopy_state_get();
     trace_process_incoming_migration_co_end(ret, ps);
-    if (ps != POSTCOPY_INCOMING_NONE) {
-        if (ps == POSTCOPY_INCOMING_ADVISE) {
-            /*
-             * Where a migration had postcopy enabled (and thus went to advise)
-             * but managed to complete within the precopy period, we can use
-             * the normal exit.
-             */
-            postcopy_ram_incoming_cleanup(mis);
-        } else if (ret >= 0) {
-            /*
-             * Postcopy was started, cleanup should happen at the end of the
-             * postcopy thread.
-             */
-            trace_process_incoming_migration_co_postcopy_end_main();
-            goto out;
-        }
-        /* Else if something went wrong then just fall out of the normal exit */
+    if (ps >= POSTCOPY_INCOMING_LISTENING) {
+        /*
+         * Postcopy was started, cleanup should happen at the end of the
+         * postcopy thread.
+         */
+        trace_process_incoming_migration_co_postcopy_end_main();
+        goto out;
     }
 
     if (ret < 0) {
@@ -926,16 +943,7 @@ fail:
     migrate_set_error(s, local_err);
     error_free(local_err);
 
-    migration_incoming_state_destroy();
-
-    if (mis->exit_on_error) {
-        WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
-            error_report_err(s->error);
-            s->error = NULL;
-        }
-
-        exit(EXIT_FAILURE);
-    }
+    migration_incoming_finish();
 out:
     /* Pairs with the refcount taken in qmp_migrate_incoming() */
     migrate_incoming_unref_outgoing_state();
diff --git a/migration/migration.h b/migration/migration.h
index 2c2331f40d..67e3318467 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -518,6 +518,7 @@ void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
 void migration_fd_process_incoming(QEMUFile *f);
 void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp);
 void migration_incoming_process(void);
+void migration_incoming_finish(void);
 
 bool  migration_has_all_channels(void);
 
diff --git a/migration/savevm.c b/migration/savevm.c
index fabbeb296a..d7eb416d48 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2069,6 +2069,11 @@ static int loadvm_postcopy_ram_handle_discard(MigrationIncomingState *mis,
     return 0;
 }
 
+static void postcopy_ram_listen_thread_bh(void *opaque)
+{
+    migration_incoming_finish();
+}
+
 /*
  * Triggered by a postcopy_listen command; this thread takes over reading
  * the input stream, leaving the main thread free to carry on loading the rest
@@ -2122,52 +2127,31 @@ static void *postcopy_ram_listen_thread(void *opaque)
                          "bitmaps may be lost, and present migrated dirty "
                          "bitmaps are correctly migrated and valid.",
                          __func__, load_res);
-            load_res = 0; /* prevent further exit() */
         } else {
             error_report("%s: loadvm failed: %d", __func__, load_res);
             migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
                                            MIGRATION_STATUS_FAILED);
+            goto out;
         }
     }
-    if (load_res >= 0) {
-        /*
-         * This looks good, but it's possible that the device loading in the
-         * main thread hasn't finished yet, and so we might not be in 'RUN'
-         * state yet; wait for the end of the main thread.
-         */
-        qemu_event_wait(&mis->main_thread_load_event);
-    }
-    postcopy_ram_incoming_cleanup(mis);
-
-    if (load_res < 0) {
-        /*
-         * If something went wrong then we have a bad state so exit;
-         * depending how far we got it might be possible at this point
-         * to leave the guest running and fire MCEs for pages that never
-         * arrived as a desperate recovery step.
-         */
-        rcu_unregister_thread();
-        exit(EXIT_FAILURE);
-    }
+    /*
+     * This looks good, but it's possible that the device loading in the
+     * main thread hasn't finished yet, and so we might not be in 'RUN'
+     * state yet; wait for the end of the main thread.
+     */
+    qemu_event_wait(&mis->main_thread_load_event);
 
     migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
                                    MIGRATION_STATUS_COMPLETED);
-    /*
-     * If everything has worked fine, then the main thread has waited
-     * for us to start, and we're the last use of the mis.
-     * (If something broke then qemu will have to exit anyway since it's
-     * got a bad migration state).
-     */
-    bql_lock();
-    migration_incoming_state_destroy();
-    bql_unlock();
 
+out:
     rcu_unregister_thread();
-    mis->have_listen_thread = false;
     postcopy_state_set(POSTCOPY_INCOMING_END);
 
     object_unref(OBJECT(migr));
 
+    migration_bh_schedule(postcopy_ram_listen_thread_bh, NULL);
+
     return NULL;
 }
 
@@ -2217,7 +2201,7 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
     mis->have_listen_thread = true;
     postcopy_thread_create(mis, &mis->listen_thread,
                            MIGRATION_THREAD_DST_LISTEN,
-                           postcopy_ram_listen_thread, QEMU_THREAD_DETACHED);
+                           postcopy_ram_listen_thread, QEMU_THREAD_JOINABLE);
     trace_loadvm_postcopy_handle_listen("return");
 
     return 0;
-- 
2.51.0




* [PATCH 4/4] migration: Introduce POSTCOPY_DEVICE state
  2025-09-15 11:59 [PATCH 0/4] migration: Introduce POSTCOPY_DEVICE state Juraj Marcin
                   ` (2 preceding siblings ...)
  2025-09-15 11:59 ` [PATCH 3/4] migration: Refactor incoming cleanup into migration_incoming_finish() Juraj Marcin
@ 2025-09-15 11:59 ` Juraj Marcin
  2025-09-19 16:58   ` Peter Xu
  2025-09-25 11:54   ` Jiří Denemark
  3 siblings, 2 replies; 33+ messages in thread
From: Juraj Marcin @ 2025-09-15 11:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juraj Marcin, Jiri Denemark, Peter Xu, Dr. David Alan Gilbert,
	Fabiano Rosas

From: Juraj Marcin <jmarcin@redhat.com>

Currently, when postcopy starts, the source VM starts switchover and
sends a package containing the state of all non-postcopiable devices.
When the destination loads this package, the switchover is complete and
the destination VM starts. However, if the device state load fails or
the destination side crashes, the source side is already in the
POSTCOPY_ACTIVE state and cannot be recovered, even though it still has
the most up-to-date machine state, as the destination has not yet
started.

This patch introduces a new POSTCOPY_DEVICE state, which is active
while the destination machine is loading the device state and is not
yet running; while in this state, the source side can still be resumed
in case of a migration failure.

To transition from POSTCOPY_DEVICE to POSTCOPY_ACTIVE, the source
side uses a PONG message that is a response to a PING message processed
just before the POSTCOPY_RUN command that starts the destination VM.
Thus, this change does not require any changes on the destination side
and is effective even with older destination versions.

Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
---
 migration/migration.c                 | 23 ++++++++++++++++++-----
 migration/savevm.h                    |  2 ++
 migration/trace-events                |  1 +
 qapi/migration.json                   |  8 ++++++--
 tests/qtest/migration/precopy-tests.c |  3 ++-
 5 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 7222e3de13..e63a7487be 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1223,6 +1223,7 @@ bool migration_is_running(void)
 
     switch (s->state) {
     case MIGRATION_STATUS_ACTIVE:
+    case MIGRATION_STATUS_POSTCOPY_DEVICE:
     case MIGRATION_STATUS_POSTCOPY_ACTIVE:
     case MIGRATION_STATUS_POSTCOPY_PAUSED:
     case MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP:
@@ -1244,6 +1245,7 @@ static bool migration_is_active(void)
     MigrationState *s = current_migration;
 
     return (s->state == MIGRATION_STATUS_ACTIVE ||
+            s->state == MIGRATION_STATUS_POSTCOPY_DEVICE ||
             s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
 }
 
@@ -1366,6 +1368,7 @@ static void fill_source_migration_info(MigrationInfo *info)
         break;
     case MIGRATION_STATUS_ACTIVE:
     case MIGRATION_STATUS_CANCELLING:
+    case MIGRATION_STATUS_POSTCOPY_DEVICE:
     case MIGRATION_STATUS_POSTCOPY_ACTIVE:
     case MIGRATION_STATUS_PRE_SWITCHOVER:
     case MIGRATION_STATUS_DEVICE:
@@ -1419,6 +1422,7 @@ static void fill_destination_migration_info(MigrationInfo *info)
     case MIGRATION_STATUS_CANCELLING:
     case MIGRATION_STATUS_CANCELLED:
     case MIGRATION_STATUS_ACTIVE:
+    case MIGRATION_STATUS_POSTCOPY_DEVICE:
     case MIGRATION_STATUS_POSTCOPY_ACTIVE:
     case MIGRATION_STATUS_POSTCOPY_PAUSED:
     case MIGRATION_STATUS_POSTCOPY_RECOVER:
@@ -1719,6 +1723,7 @@ bool migration_in_postcopy(void)
     MigrationState *s = migrate_get_current();
 
     switch (s->state) {
+    case MIGRATION_STATUS_POSTCOPY_DEVICE:
     case MIGRATION_STATUS_POSTCOPY_ACTIVE:
     case MIGRATION_STATUS_POSTCOPY_PAUSED:
     case MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP:
@@ -2564,6 +2569,11 @@ static void *source_return_path_thread(void *opaque)
             tmp32 = ldl_be_p(buf);
             trace_source_return_path_thread_pong(tmp32);
             qemu_sem_post(&ms->rp_state.rp_pong_acks);
+            if (tmp32 == QEMU_VM_PING_PACKAGED_LOADED) {
+                trace_source_return_path_thread_dst_started();
+                migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_DEVICE,
+                                  MIGRATION_STATUS_POSTCOPY_ACTIVE);
+            }
             break;
 
         case MIG_RP_MSG_REQ_PAGES:
@@ -2814,6 +2824,7 @@ static int postcopy_start(MigrationState *ms, Error **errp)
     if (migrate_postcopy_ram()) {
         qemu_savevm_send_ping(fb, 3);
     }
+    qemu_savevm_send_ping(fb, QEMU_VM_PING_PACKAGED_LOADED);
 
     qemu_savevm_send_postcopy_run(fb);
 
@@ -2871,7 +2882,7 @@ static int postcopy_start(MigrationState *ms, Error **errp)
 
     /* Now, switchover looks all fine, switching to postcopy-active */
     migrate_set_state(&ms->state, MIGRATION_STATUS_DEVICE,
-                      MIGRATION_STATUS_POSTCOPY_ACTIVE);
+                      MIGRATION_STATUS_POSTCOPY_DEVICE);
 
     bql_unlock();
 
@@ -3035,7 +3046,8 @@ static void migration_completion(MigrationState *s)
 
     if (s->state == MIGRATION_STATUS_ACTIVE) {
         ret = migration_completion_precopy(s);
-    } else if (s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
+    } else if (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE ||
+               s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
         migration_completion_postcopy(s);
     } else {
         ret = -1;
@@ -3311,8 +3323,8 @@ static MigThrError migration_detect_error(MigrationState *s)
         return postcopy_pause(s);
     } else {
         /*
-         * For precopy (or postcopy with error outside IO), we fail
-         * with no time.
+         * For precopy (or postcopy with error outside IO, or before dest
+         * starts), we fail with no time.
          */
         migrate_set_state(&s->state, state, MIGRATION_STATUS_FAILED);
         trace_migration_thread_file_err();
@@ -3447,7 +3459,8 @@ static MigIterateState migration_iteration_run(MigrationState *s)
 {
     uint64_t must_precopy, can_postcopy, pending_size;
     Error *local_err = NULL;
-    bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
+    bool in_postcopy = (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE ||
+                        s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
     bool can_switchover = migration_can_switchover(s);
     bool complete_ready;
 
diff --git a/migration/savevm.h b/migration/savevm.h
index 2d5e9c7166..c4de0325eb 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -29,6 +29,8 @@
 #define QEMU_VM_COMMAND              0x08
 #define QEMU_VM_SECTION_FOOTER       0x7e
 
+#define QEMU_VM_PING_PACKAGED_LOADED 0x42
+
 bool qemu_savevm_state_blocked(Error **errp);
 void qemu_savevm_non_migratable_list(strList **reasons);
 int qemu_savevm_state_prepare(Error **errp);
diff --git a/migration/trace-events b/migration/trace-events
index 706db97def..007b5c407e 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -191,6 +191,7 @@ source_return_path_thread_pong(uint32_t val) "0x%x"
 source_return_path_thread_shut(uint32_t val) "0x%x"
 source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
 source_return_path_thread_switchover_acked(void) ""
+source_return_path_thread_dst_started(void) ""
 migration_thread_low_pending(uint64_t pending) "%" PRIu64
 migrate_transferred(uint64_t transferred, uint64_t time_spent, uint64_t bandwidth, uint64_t avail_bw, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " switchover_bw %" PRIu64 " max_size %" PRId64
 process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
diff --git a/qapi/migration.json b/qapi/migration.json
index 2387c21e9c..89a20d858d 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -142,6 +142,10 @@
 # @postcopy-active: like active, but now in postcopy mode.
 #     (since 2.5)
 #
+# @postcopy-device: like postcopy-active, but the destination is still
+#     loading device state and is not running yet.  If migration fails
+#     during this state, the source side will resume.  (since 10.2)
+#
 # @postcopy-paused: during postcopy but paused.  (since 3.0)
 #
 # @postcopy-recover-setup: setup phase for a postcopy recovery
@@ -173,8 +177,8 @@
 ##
 { 'enum': 'MigrationStatus',
   'data': [ 'none', 'setup', 'cancelling', 'cancelled',
-            'active', 'postcopy-active', 'postcopy-paused',
-            'postcopy-recover-setup',
+            'active', 'postcopy-device', 'postcopy-active',
+            'postcopy-paused', 'postcopy-recover-setup',
             'postcopy-recover', 'completed', 'failed', 'colo',
             'pre-switchover', 'device', 'wait-unplug' ] }
 ##
diff --git a/tests/qtest/migration/precopy-tests.c b/tests/qtest/migration/precopy-tests.c
index bb38292550..57ca623de5 100644
--- a/tests/qtest/migration/precopy-tests.c
+++ b/tests/qtest/migration/precopy-tests.c
@@ -1316,13 +1316,14 @@ void migration_test_add_precopy(MigrationTestEnv *env)
     }
 
     /* ensure new status don't go unnoticed */
-    assert(MIGRATION_STATUS__MAX == 15);
+    assert(MIGRATION_STATUS__MAX == 16);
 
     for (int i = MIGRATION_STATUS_NONE; i < MIGRATION_STATUS__MAX; i++) {
         switch (i) {
         case MIGRATION_STATUS_DEVICE: /* happens too fast */
         case MIGRATION_STATUS_WAIT_UNPLUG: /* no support in tests */
         case MIGRATION_STATUS_COLO: /* no support in tests */
+        case MIGRATION_STATUS_POSTCOPY_DEVICE: /* postcopy can't be cancelled */
         case MIGRATION_STATUS_POSTCOPY_ACTIVE: /* postcopy can't be cancelled */
         case MIGRATION_STATUS_POSTCOPY_PAUSED:
         case MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP:
-- 
2.51.0




* Re: [PATCH 2/4] migration: Accept MigrationStatus in migration_has_failed()
  2025-09-15 11:59 ` [PATCH 2/4] migration: Accept MigrationStatus in migration_has_failed() Juraj Marcin
@ 2025-09-19 14:57   ` Peter Xu
  2025-09-22 11:26     ` Juraj Marcin
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Xu @ 2025-09-19 14:57 UTC (permalink / raw)
  To: Juraj Marcin
  Cc: qemu-devel, Jiri Denemark, Dr. David Alan Gilbert, Fabiano Rosas

On Mon, Sep 15, 2025 at 01:59:13PM +0200, Juraj Marcin wrote:
> From: Juraj Marcin <jmarcin@redhat.com>
> 
> This allows to reuse the helper also with MigrationIncomingState.

I get the point, but just to mention that this helper doesn't really change
much on the incoming side, either in simplifying the code or function-wise,
because we don't have the CANCELLING/CANCELLED states on dest QEMU.. which is
definitely not obvious.. :(

So:

  migration_has_failed(incoming->state)

Is exactly the same as:

  incoming->state == MIGRATION_STATUS_FAILED

Except it will make src start to pass in s->state.. which is slightly more
awkward.

Maybe we keep the MIGRATION_STATUS_FAILED check in your next patch, and
drop this one for now until it grows more than FAILED on dest?

> 
> Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
> ---
>  migration/migration.c | 8 ++++----
>  migration/migration.h | 2 +-
>  migration/multifd.c   | 2 +-
>  3 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 54dac3db88..2c0b3a7229 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1542,7 +1542,7 @@ static void migration_cleanup(MigrationState *s)
>          /* It is used on info migrate.  We can't free it */
>          error_report_err(error_copy(s->error));
>      }
> -    type = migration_has_failed(s) ? MIG_EVENT_PRECOPY_FAILED :
> +    type = migration_has_failed(s->state) ? MIG_EVENT_PRECOPY_FAILED :
>                                       MIG_EVENT_PRECOPY_DONE;
>      migration_call_notifiers(s, type, NULL);
>      yank_unregister_instance(MIGRATION_YANK_INSTANCE);
> @@ -1700,10 +1700,10 @@ int migration_call_notifiers(MigrationState *s, MigrationEventType type,
>      return ret;
>  }
>  
> -bool migration_has_failed(MigrationState *s)
> +bool migration_has_failed(MigrationStatus state)
>  {
> -    return (s->state == MIGRATION_STATUS_CANCELLED ||
> -            s->state == MIGRATION_STATUS_FAILED);
> +    return (state == MIGRATION_STATUS_CANCELLED ||
> +            state == MIGRATION_STATUS_FAILED);
>  }
>  
>  bool migration_in_postcopy(void)
> diff --git a/migration/migration.h b/migration/migration.h
> index 01329bf824..2c2331f40d 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -535,7 +535,7 @@ bool migration_is_blocked(Error **errp);
>  bool migration_in_postcopy(void);
>  bool migration_postcopy_is_alive(MigrationStatus state);
>  MigrationState *migrate_get_current(void);
> -bool migration_has_failed(MigrationState *);
> +bool migration_has_failed(MigrationStatus state);
>  bool migrate_mode_is_cpr(MigrationState *);
>  
>  uint64_t ram_get_total_transferred_pages(void);
> diff --git a/migration/multifd.c b/migration/multifd.c
> index b255778855..c569f91f2c 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -568,7 +568,7 @@ void multifd_send_shutdown(void)
>               * already failed. If the migration succeeded, errors are
>               * not expected but there's no need to kill the source.
>               */
> -            if (local_err && !migration_has_failed(migrate_get_current())) {
> +            if (local_err && !migration_has_failed(migrate_get_current()->state)) {
>                  warn_report(
>                      "multifd_send_%d: Failed to terminate TLS connection: %s",
>                      p->id, error_get_pretty(local_err));
> -- 
> 2.51.0
> 

-- 
Peter Xu




* Re: [PATCH 3/4] migration: Refactor incoming cleanup into migration_incoming_finish()
  2025-09-15 11:59 ` [PATCH 3/4] migration: Refactor incoming cleanup into migration_incoming_finish() Juraj Marcin
@ 2025-09-19 15:53   ` Peter Xu
  2025-09-19 16:46   ` Fabiano Rosas
  1 sibling, 0 replies; 33+ messages in thread
From: Peter Xu @ 2025-09-19 15:53 UTC (permalink / raw)
  To: Juraj Marcin
  Cc: qemu-devel, Jiri Denemark, Dr. David Alan Gilbert, Fabiano Rosas

On Mon, Sep 15, 2025 at 01:59:14PM +0200, Juraj Marcin wrote:
> From: Juraj Marcin <jmarcin@redhat.com>
> 
> Currently, there are two functions that are responsible for cleanup of
> the incoming migration state. With successful precopy, it's the main
> thread and with successful postcopy it's the listen thread. However, if
> postcopy fails during in the device load, both functions will try to do
> the cleanup. Moreover, when exit-on-error parameter was added, it was
> applied only to precopy.
> 
> This patch refactors common cleanup and exiting on error into a helper
> function that can be started either from precopy or postcopy, reducing
> the duplication. If the listen thread has been started (the postcopy
> state is at least LISTENING), the listen thread is responsible for all
> cleanup and exiting, otherwise it's the main thread's responsibility.

Looks almost good, thanks!  Only nitpicks below.

> 
> Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
> ---
>  migration/migration.c | 64 ++++++++++++++++++++++++-------------------
>  migration/migration.h |  1 +
>  migration/savevm.c    | 48 +++++++++++---------------------
>  3 files changed, 53 insertions(+), 60 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 2c0b3a7229..7222e3de13 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -442,9 +442,19 @@ void migration_incoming_transport_cleanup(MigrationIncomingState *mis)
>  void migration_incoming_state_destroy(void)
>  {
>      struct MigrationIncomingState *mis = migration_incoming_get_current();
> +    PostcopyState ps = postcopy_state_get();
>  
>      multifd_recv_cleanup();
>  
> +    if (mis->have_listen_thread) {
> +        qemu_thread_join(&mis->listen_thread);
> +        mis->have_listen_thread = false;
> +    }

Maybe this fits more to be in postcopy_ram_incoming_cleanup() below?
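
Something like this, maybe (just a sketch; signature quoted from memory):

    /* migration/postcopy-ram.c */
    int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
    {
        if (mis->have_listen_thread) {
            qemu_thread_join(&mis->listen_thread);
            mis->have_listen_thread = false;
        }
        ...
    }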

> +
> +    if (ps != POSTCOPY_INCOMING_NONE) {
> +        postcopy_ram_incoming_cleanup(mis);
> +    }
> +
>      /*
>       * RAM state cleanup needs to happen after multifd cleanup, because
>       * multifd threads can use some of its states (receivedmap).
> @@ -809,6 +819,23 @@ static void qemu_start_incoming_migration(const char *uri, bool has_channels,
>      cpr_state_close();
>  }
>  
> +void migration_incoming_finish(void)
> +{
> +    MigrationState *s = migrate_get_current();
> +    MigrationIncomingState *mis = migration_incoming_get_current();
> +
> +    migration_incoming_state_destroy();
> +
> +    if (migration_has_failed(mis->state) && mis->exit_on_error) {

If you agree on my comment in patch 2, we can keep checking against FAILED.

> +        WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
> +            error_report_err(s->error);
> +            s->error = NULL;
> +        }
> +
> +        exit(EXIT_FAILURE);
> +    }
> +}
> +
>  static void process_incoming_migration_bh(void *opaque)
>  {
>      MigrationIncomingState *mis = opaque;
> @@ -861,7 +888,7 @@ static void process_incoming_migration_bh(void *opaque)
>       */
>      migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
>                        MIGRATION_STATUS_COMPLETED);
> -    migration_incoming_state_destroy();
> +    migration_incoming_finish();

This is exactly the same as before when we know it's succeeding, but I
think I get your point, always using migration_incoming_finish() should be
fine.

>  }
>  
>  static void coroutine_fn
> @@ -888,23 +915,13 @@ process_incoming_migration_co(void *opaque)
>  
>      ps = postcopy_state_get();
>      trace_process_incoming_migration_co_end(ret, ps);
> -    if (ps != POSTCOPY_INCOMING_NONE) {
> -        if (ps == POSTCOPY_INCOMING_ADVISE) {
> -            /*
> -             * Where a migration had postcopy enabled (and thus went to advise)
> -             * but managed to complete within the precopy period, we can use
> -             * the normal exit.
> -             */
> -            postcopy_ram_incoming_cleanup(mis);
> -        } else if (ret >= 0) {
> -            /*
> -             * Postcopy was started, cleanup should happen at the end of the
> -             * postcopy thread.
> -             */
> -            trace_process_incoming_migration_co_postcopy_end_main();
> -            goto out;
> -        }
> -        /* Else if something went wrong then just fall out of the normal exit */
> +    if (ps >= POSTCOPY_INCOMING_LISTENING) {
> +        /*
> +         * Postcopy was started, cleanup should happen at the end of the
> +         * postcopy thread.

Postcopy has >1 threads, better mention "at the end of postcopy ram listen
thread", that helps explain why checking >= POSTCOPY_INCOMING_LISTENING,
because that event creates the ram listen thread.

> +         */
> +        trace_process_incoming_migration_co_postcopy_end_main();
> +        goto out;
>      }
>  
>      if (ret < 0) {
> @@ -926,16 +943,7 @@ fail:
>      migrate_set_error(s, local_err);
>      error_free(local_err);
>  
> -    migration_incoming_state_destroy();
> -
> -    if (mis->exit_on_error) {
> -        WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
> -            error_report_err(s->error);
> -            s->error = NULL;
> -        }
> -
> -        exit(EXIT_FAILURE);
> -    }
> +    migration_incoming_finish();
>  out:
>      /* Pairs with the refcount taken in qmp_migrate_incoming() */
>      migrate_incoming_unref_outgoing_state();
> diff --git a/migration/migration.h b/migration/migration.h
> index 2c2331f40d..67e3318467 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -518,6 +518,7 @@ void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
>  void migration_fd_process_incoming(QEMUFile *f);
>  void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp);
>  void migration_incoming_process(void);
> +void migration_incoming_finish(void);
>  
>  bool  migration_has_all_channels(void);
>  
> diff --git a/migration/savevm.c b/migration/savevm.c
> index fabbeb296a..d7eb416d48 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2069,6 +2069,11 @@ static int loadvm_postcopy_ram_handle_discard(MigrationIncomingState *mis,
>      return 0;
>  }
>  
> +static void postcopy_ram_listen_thread_bh(void *opaque)
> +{
> +    migration_incoming_finish();
> +}
> +
>  /*
>   * Triggered by a postcopy_listen command; this thread takes over reading
>   * the input stream, leaving the main thread free to carry on loading the rest
> @@ -2122,52 +2127,31 @@ static void *postcopy_ram_listen_thread(void *opaque)
>                           "bitmaps may be lost, and present migrated dirty "
>                           "bitmaps are correctly migrated and valid.",
>                           __func__, load_res);
> -            load_res = 0; /* prevent further exit() */
>          } else {
>              error_report("%s: loadvm failed: %d", __func__, load_res);
>              migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
>                                             MIGRATION_STATUS_FAILED);
> +            goto out;
>          }
>      }
> -    if (load_res >= 0) {
> -        /*
> -         * This looks good, but it's possible that the device loading in the
> -         * main thread hasn't finished yet, and so we might not be in 'RUN'
> -         * state yet; wait for the end of the main thread.
> -         */
> -        qemu_event_wait(&mis->main_thread_load_event);
> -    }
> -    postcopy_ram_incoming_cleanup(mis);
> -
> -    if (load_res < 0) {
> -        /*
> -         * If something went wrong then we have a bad state so exit;
> -         * depending how far we got it might be possible at this point
> -         * to leave the guest running and fire MCEs for pages that never
> -         * arrived as a desperate recovery step.
> -         */
> -        rcu_unregister_thread();
> -        exit(EXIT_FAILURE);
> -    }
> +    /*
> +     * This looks good, but it's possible that the device loading in the
> +     * main thread hasn't finished yet, and so we might not be in 'RUN'
> +     * state yet; wait for the end of the main thread.
> +     */
> +    qemu_event_wait(&mis->main_thread_load_event);
>  
>      migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
>                                     MIGRATION_STATUS_COMPLETED);
> -    /*
> -     * If everything has worked fine, then the main thread has waited
> -     * for us to start, and we're the last use of the mis.
> -     * (If something broke then qemu will have to exit anyway since it's
> -     * got a bad migration state).
> -     */
> -    bql_lock();
> -    migration_incoming_state_destroy();
> -    bql_unlock();
>  
> +out:
>      rcu_unregister_thread();
> -    mis->have_listen_thread = false;
>      postcopy_state_set(POSTCOPY_INCOMING_END);
>  
>      object_unref(OBJECT(migr));
>  
> +    migration_bh_schedule(postcopy_ram_listen_thread_bh, NULL);
> +
>      return NULL;
>  }
>  
> @@ -2217,7 +2201,7 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
>      mis->have_listen_thread = true;
>      postcopy_thread_create(mis, &mis->listen_thread,
>                             MIGRATION_THREAD_DST_LISTEN,
> -                           postcopy_ram_listen_thread, QEMU_THREAD_DETACHED);
> +                           postcopy_ram_listen_thread, QEMU_THREAD_JOINABLE);

Nit once more: better mention this change in the commit message.

Thanks!

>      trace_loadvm_postcopy_handle_listen("return");
>  
>      return 0;
> -- 
> 2.51.0
> 

-- 
Peter Xu




* Re: [PATCH 1/4] migration: Do not try to start VM if disk activation fails
  2025-09-15 11:59 ` [PATCH 1/4] migration: Do not try to start VM if disk activation fails Juraj Marcin
@ 2025-09-19 16:12   ` Fabiano Rosas
  0 siblings, 0 replies; 33+ messages in thread
From: Fabiano Rosas @ 2025-09-19 16:12 UTC (permalink / raw)
  To: Juraj Marcin, qemu-devel
  Cc: Juraj Marcin, Jiri Denemark, Peter Xu, Dr. David Alan Gilbert

Juraj Marcin <jmarcin@redhat.com> writes:

> From: Peter Xu <peterx@redhat.com>
>
> If a rare split brain happens (e.g. dest QEMU started running somehow,
> taking shared drive locks), src QEMU may not be able to activate the
> drives anymore.  In this case, src QEMU shouldn't start the VM or it might
> crash the block layer later with something like:
>
> bdrv_co_write_req_prepare: Assertion `!(bs->open_flags &
> BDRV_O_INACTIVE)' failed.

This patch doesn't fix the assert, so I think it's best not to mention
it at all. We've had a few instances of this assert (or similar) and
there's a fair chance someone still looks at git log searching for a
backport.

>
> Meanwhile, src QEMU cannot try to continue either even if dest QEMU can
> release the drive locks (e.g. by QMP "stop").  Because as long as dest QEMU
> started running, it means dest QEMU's RAM is the only version that is
> consistent with current status of the shared storage.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/migration.c | 29 ++++++++++++++++++++++++-----
>  1 file changed, 24 insertions(+), 5 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 10c216d25d..54dac3db88 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3502,6 +3502,8 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>  
>  static void migration_iteration_finish(MigrationState *s)
>  {
> +    Error *local_err = NULL;
> +
>      bql_lock();
>  
>      /*
> @@ -3525,11 +3527,28 @@ static void migration_iteration_finish(MigrationState *s)
>      case MIGRATION_STATUS_FAILED:
>      case MIGRATION_STATUS_CANCELLED:
>      case MIGRATION_STATUS_CANCELLING:
> -        /*
> -         * Re-activate the block drives if they're inactivated.  Note, COLO
> -         * shouldn't use block_active at all, so it should be no-op there.
> -         */
> -        migration_block_activate(NULL);
> +        if (!migration_block_activate(&local_err)) {
> +            /*
> +            * Re-activate the block drives if they're inactivated.

Comment formatting is wrong.

> +            *
> +            * If it fails (e.g. in case of a split brain, where dest QEMU
> +            * might have taken some of the drive locks and running!), do
> +            * not start VM, instead wait for mgmt to decide the next step.
> +            *
> +            * If dest already started, it means dest QEMU should contain
> +            * all the data it needs and it properly owns all the drive
> +            * locks.  Then even if src QEMU got a FAILED in migration, it
> +            * normally should mean we should treat the migration as
> +            * COMPLETED.
> +            *
> +            * NOTE: it's not safe anymore to start VM on src now even if
> +            * dest would release the drive locks.  It's because as long as
> +            * dest started running then only dest QEMU's RAM is consistent
> +            * with the shared storage.
> +            */
> +            error_free(local_err);
> +            break;
> +        }
>          if (runstate_is_live(s->vm_old_state)) {
>              if (!runstate_check(RUN_STATE_SHUTDOWN)) {
>                  vm_start();

Reviewed-by: Fabiano Rosas <farosas@suse.de>



* Re: [PATCH 3/4] migration: Refactor incoming cleanup into migration_incoming_finish()
  2025-09-15 11:59 ` [PATCH 3/4] migration: Refactor incoming cleanup into migration_incoming_finish() Juraj Marcin
  2025-09-19 15:53   ` Peter Xu
@ 2025-09-19 16:46   ` Fabiano Rosas
  2025-09-22 12:58     ` Juraj Marcin
  1 sibling, 1 reply; 33+ messages in thread
From: Fabiano Rosas @ 2025-09-19 16:46 UTC (permalink / raw)
  To: Juraj Marcin, qemu-devel
  Cc: Juraj Marcin, Jiri Denemark, Peter Xu, Dr. David Alan Gilbert

Juraj Marcin <jmarcin@redhat.com> writes:

Hi Juraj,

Good patch, nice use of migration_has_failed()

> From: Juraj Marcin <jmarcin@redhat.com>
>
> Currently, there are two functions that are responsible for cleanup of
> the incoming migration state. With successful precopy, it's the main
> thread and with successful postcopy it's the listen thread. However, if
> postcopy fails during in the device load, both functions will try to do
> the cleanup. Moreover, when exit-on-error parameter was added, it was
> applied only to precopy.
>

Someone could be relying on postcopy always exiting on error while
explicitly setting exit-on-error=false for precopy, and this patch would
change that behavior incompatibly. Is this an issue? I'm willing to
ignore it, but you guys know more about postcopy.

> This patch refactors common cleanup and exiting on error into a helper
> function that can be started either from precopy or postcopy, reducing
> the duplication. If the listen thread has been started (the postcopy
> state is at least LISTENING), the listen thread is responsible for all
> cleanup and exiting, otherwise it's the main thread's responsibility.

Don't the BHs also run in the main thread? I'm not sure this sentence is
accurate.

>
> Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
> ---
>  migration/migration.c | 64 ++++++++++++++++++++++++-------------------
>  migration/migration.h |  1 +
>  migration/savevm.c    | 48 +++++++++++---------------------

Could someone act on the TODOs and move postcopy code into postcopy-ram?
It's never too late to make things consistent.

>  3 files changed, 53 insertions(+), 60 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 2c0b3a7229..7222e3de13 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -442,9 +442,19 @@ void migration_incoming_transport_cleanup(MigrationIncomingState *mis)
>  void migration_incoming_state_destroy(void)
>  {
>      struct MigrationIncomingState *mis = migration_incoming_get_current();
> +    PostcopyState ps = postcopy_state_get();
>  
>      multifd_recv_cleanup();
>  
> +    if (mis->have_listen_thread) {
> +        qemu_thread_join(&mis->listen_thread);
> +        mis->have_listen_thread = false;
> +    }
> +
> +    if (ps != POSTCOPY_INCOMING_NONE) {
> +        postcopy_ram_incoming_cleanup(mis);
> +    }
> +
>      /*
>       * RAM state cleanup needs to happen after multifd cleanup, because
>       * multifd threads can use some of its states (receivedmap).
> @@ -809,6 +819,23 @@ static void qemu_start_incoming_migration(const char *uri, bool has_channels,
>      cpr_state_close();
>  }
>  
> +void migration_incoming_finish(void)
> +{
> +    MigrationState *s = migrate_get_current();
> +    MigrationIncomingState *mis = migration_incoming_get_current();
> +
> +    migration_incoming_state_destroy();
> +
> +    if (migration_has_failed(mis->state) && mis->exit_on_error) {
> +        WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
> +            error_report_err(s->error);
> +            s->error = NULL;
> +        }
> +
> +        exit(EXIT_FAILURE);
> +    }
> +}
> +
>  static void process_incoming_migration_bh(void *opaque)
>  {
>      MigrationIncomingState *mis = opaque;
> @@ -861,7 +888,7 @@ static void process_incoming_migration_bh(void *opaque)
>       */
>      migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
>                        MIGRATION_STATUS_COMPLETED);
> -    migration_incoming_state_destroy();
> +    migration_incoming_finish();
>  }
>  
>  static void coroutine_fn
> @@ -888,23 +915,13 @@ process_incoming_migration_co(void *opaque)
>  
>      ps = postcopy_state_get();
>      trace_process_incoming_migration_co_end(ret, ps);
> -    if (ps != POSTCOPY_INCOMING_NONE) {
> -        if (ps == POSTCOPY_INCOMING_ADVISE) {
> -            /*
> -             * Where a migration had postcopy enabled (and thus went to advise)
> -             * but managed to complete within the precopy period, we can use
> -             * the normal exit.
> -             */
> -            postcopy_ram_incoming_cleanup(mis);
> -        } else if (ret >= 0) {
> -            /*
> -             * Postcopy was started, cleanup should happen at the end of the
> -             * postcopy thread.
> -             */
> -            trace_process_incoming_migration_co_postcopy_end_main();
> -            goto out;
> -        }
> -        /* Else if something went wrong then just fall out of the normal exit */
> +    if (ps >= POSTCOPY_INCOMING_LISTENING) {
> +        /*
> +         * Postcopy was started, cleanup should happen at the end of the
> +         * postcopy thread.
> +         */
> +        trace_process_incoming_migration_co_postcopy_end_main();
> +        goto out;
>      }
>  
>      if (ret < 0) {
> @@ -926,16 +943,7 @@ fail:
>      migrate_set_error(s, local_err);
>      error_free(local_err);
>  
> -    migration_incoming_state_destroy();
> -
> -    if (mis->exit_on_error) {
> -        WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
> -            error_report_err(s->error);
> -            s->error = NULL;
> -        }
> -
> -        exit(EXIT_FAILURE);
> -    }
> +    migration_incoming_finish();
>  out:
>      /* Pairs with the refcount taken in qmp_migrate_incoming() */
>      migrate_incoming_unref_outgoing_state();
> diff --git a/migration/migration.h b/migration/migration.h
> index 2c2331f40d..67e3318467 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -518,6 +518,7 @@ void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
>  void migration_fd_process_incoming(QEMUFile *f);
>  void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp);
>  void migration_incoming_process(void);
> +void migration_incoming_finish(void);

What about merging migration_incoming_state_destroy and
migration_incoming_finish and pairing all of this with migration_cleanup?

static void migration_cleanup_bh(void *opaque)
{
    migration_cleanup(opaque);
}

static void migration_incoming_cleanup_bh(void *opaque)
{
    migration_incoming_cleanup(opaque);
}

>  
>  bool  migration_has_all_channels(void);
>  
> diff --git a/migration/savevm.c b/migration/savevm.c
> index fabbeb296a..d7eb416d48 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2069,6 +2069,11 @@ static int loadvm_postcopy_ram_handle_discard(MigrationIncomingState *mis,
>      return 0;
>  }
>  
> +static void postcopy_ram_listen_thread_bh(void *opaque)
> +{
> +    migration_incoming_finish();
> +}
> +
>  /*
>   * Triggered by a postcopy_listen command; this thread takes over reading
>   * the input stream, leaving the main thread free to carry on loading the rest
> @@ -2122,52 +2127,31 @@ static void *postcopy_ram_listen_thread(void *opaque)
>                           "bitmaps may be lost, and present migrated dirty "
>                           "bitmaps are correctly migrated and valid.",
>                           __func__, load_res);
> -            load_res = 0; /* prevent further exit() */
>          } else {
>              error_report("%s: loadvm failed: %d", __func__, load_res);
>              migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
>                                             MIGRATION_STATUS_FAILED);
> +            goto out;
>          }
>      }
> -    if (load_res >= 0) {
> -        /*
> -         * This looks good, but it's possible that the device loading in the
> -         * main thread hasn't finished yet, and so we might not be in 'RUN'
> -         * state yet; wait for the end of the main thread.
> -         */
> -        qemu_event_wait(&mis->main_thread_load_event);
> -    }
> -    postcopy_ram_incoming_cleanup(mis);
> -
> -    if (load_res < 0) {
> -        /*
> -         * If something went wrong then we have a bad state so exit;
> -         * depending how far we got it might be possible at this point
> -         * to leave the guest running and fire MCEs for pages that never
> -         * arrived as a desperate recovery step.
> -         */
> -        rcu_unregister_thread();
> -        exit(EXIT_FAILURE);
> -    }
> +    /*
> +     * This looks good, but it's possible that the device loading in the
> +     * main thread hasn't finished yet, and so we might not be in 'RUN'
> +     * state yet; wait for the end of the main thread.
> +     */
> +    qemu_event_wait(&mis->main_thread_load_event);
>  
>      migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
>                                     MIGRATION_STATUS_COMPLETED);
> -    /*
> -     * If everything has worked fine, then the main thread has waited
> -     * for us to start, and we're the last use of the mis.
> -     * (If something broke then qemu will have to exit anyway since it's
> -     * got a bad migration state).
> -     */
> -    bql_lock();
> -    migration_incoming_state_destroy();
> -    bql_unlock();
>  
> +out:
>      rcu_unregister_thread();
> -    mis->have_listen_thread = false;
>      postcopy_state_set(POSTCOPY_INCOMING_END);
>  
>      object_unref(OBJECT(migr));
>  
> +    migration_bh_schedule(postcopy_ram_listen_thread_bh, NULL);

Better to schedule before the object_unref to ensure there's always
someone holding a reference?
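
E.g. just swapping the two calls (sketch):

    migration_bh_schedule(postcopy_ram_listen_thread_bh, NULL);

    object_unref(OBJECT(migr));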

> +
>      return NULL;
>  }
>  
> @@ -2217,7 +2201,7 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
>      mis->have_listen_thread = true;
>      postcopy_thread_create(mis, &mis->listen_thread,
>                             MIGRATION_THREAD_DST_LISTEN,
> -                           postcopy_ram_listen_thread, QEMU_THREAD_DETACHED);
> +                           postcopy_ram_listen_thread, QEMU_THREAD_JOINABLE);
>      trace_loadvm_postcopy_handle_listen("return");
>  
>      return 0;


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/4] migration: Introduce POSTCOPY_DEVICE state
  2025-09-15 11:59 ` [PATCH 4/4] migration: Introduce POSTCOPY_DEVICE state Juraj Marcin
@ 2025-09-19 16:58   ` Peter Xu
  2025-09-19 17:50     ` Peter Xu
  2025-09-25 11:54   ` Jiří Denemark
  1 sibling, 1 reply; 33+ messages in thread
From: Peter Xu @ 2025-09-19 16:58 UTC (permalink / raw)
  To: Juraj Marcin
  Cc: qemu-devel, Jiri Denemark, Dr. David Alan Gilbert, Fabiano Rosas

On Mon, Sep 15, 2025 at 01:59:15PM +0200, Juraj Marcin wrote:
> From: Juraj Marcin <jmarcin@redhat.com>
> 
> Currently, when postcopy starts, the source VM starts switchover and
> sends a package containing the state of all non-postcopiable devices.
> When the destination loads this package, the switchover is complete and
> the destination VM starts. However, if the device state load fails or
> the destination side crashes, the source side is already in
> POSTCOPY_ACTIVE state and cannot be recovered, even when it has the most
> up-to-date machine state as the destination has not yet started.
> 
> This patch introduces a new POSTCOPY_DEVICE state which is active
> while the destination machine is loading the device state, is not yet
> running, and the source side can be resumed in case of a migration
> failure.
> 
> To transition from POSTCOPY_DEVICE to POSTCOPY_ACTIVE, the source
> side uses a PONG message that is a response to a PING message processed
> just before the POSTCOPY_RUN command that starts the destination VM.
> Thus, this change does not require any changes on the destination side
> and is effective even with older destination versions.
> 
> Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
> ---
>  migration/migration.c                 | 23 ++++++++++++++++++-----
>  migration/savevm.h                    |  2 ++
>  migration/trace-events                |  1 +
>  qapi/migration.json                   |  8 ++++++--
>  tests/qtest/migration/precopy-tests.c |  3 ++-
>  5 files changed, 29 insertions(+), 8 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 7222e3de13..e63a7487be 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1223,6 +1223,7 @@ bool migration_is_running(void)
>  
>      switch (s->state) {
>      case MIGRATION_STATUS_ACTIVE:
> +    case MIGRATION_STATUS_POSTCOPY_DEVICE:
>      case MIGRATION_STATUS_POSTCOPY_ACTIVE:
>      case MIGRATION_STATUS_POSTCOPY_PAUSED:
>      case MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP:
> @@ -1244,6 +1245,7 @@ static bool migration_is_active(void)
>      MigrationState *s = current_migration;
>  
>      return (s->state == MIGRATION_STATUS_ACTIVE ||
> +            s->state == MIGRATION_STATUS_POSTCOPY_DEVICE ||
>              s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
>  }
>  
> @@ -1366,6 +1368,7 @@ static void fill_source_migration_info(MigrationInfo *info)
>          break;
>      case MIGRATION_STATUS_ACTIVE:
>      case MIGRATION_STATUS_CANCELLING:
> +    case MIGRATION_STATUS_POSTCOPY_DEVICE:
>      case MIGRATION_STATUS_POSTCOPY_ACTIVE:
>      case MIGRATION_STATUS_PRE_SWITCHOVER:
>      case MIGRATION_STATUS_DEVICE:
> @@ -1419,6 +1422,7 @@ static void fill_destination_migration_info(MigrationInfo *info)
>      case MIGRATION_STATUS_CANCELLING:
>      case MIGRATION_STATUS_CANCELLED:
>      case MIGRATION_STATUS_ACTIVE:
> +    case MIGRATION_STATUS_POSTCOPY_DEVICE:
>      case MIGRATION_STATUS_POSTCOPY_ACTIVE:
>      case MIGRATION_STATUS_POSTCOPY_PAUSED:
>      case MIGRATION_STATUS_POSTCOPY_RECOVER:
> @@ -1719,6 +1723,7 @@ bool migration_in_postcopy(void)
>      MigrationState *s = migrate_get_current();
>  
>      switch (s->state) {
> +    case MIGRATION_STATUS_POSTCOPY_DEVICE:
>      case MIGRATION_STATUS_POSTCOPY_ACTIVE:
>      case MIGRATION_STATUS_POSTCOPY_PAUSED:
>      case MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP:
> @@ -2564,6 +2569,11 @@ static void *source_return_path_thread(void *opaque)
>              tmp32 = ldl_be_p(buf);
>              trace_source_return_path_thread_pong(tmp32);
>              qemu_sem_post(&ms->rp_state.rp_pong_acks);
> +            if (tmp32 == QEMU_VM_PING_PACKAGED_LOADED) {
> +                trace_source_return_path_thread_dst_started();
> +                migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_DEVICE,
> +                                  MIGRATION_STATUS_POSTCOPY_ACTIVE);

Could this race with the migration thread modifying the state concurrently?

To avoid it, we could have a bool, set it here once, and in the iterations
do something like:

diff --git a/migration/migration.c b/migration/migration.c
index 10c216d25d..55230e10ee 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3449,6 +3449,16 @@ static MigIterateState migration_iteration_run(MigrationState *s)
     trace_migrate_pending_estimate(pending_size, must_precopy, can_postcopy);
 
     if (in_postcopy) {
+        if (s->postcopy_package_loaded) {
+            assert(s->state == MIGRATION_STATUS_POSTCOPY_DEVICE);
+            migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_DEVICE,
+                              MIGRATION_STATUS_POSTCOPY_ACTIVE);
+            /*
+             * Since postcopy cannot be re-initiated, this flag will only
+             * be set at most once for QEMU's whole lifecycle.
+             */
+            s->postcopy_package_loaded = false;
+        }
         /*
          * Iterate in postcopy until all pending data flushed.  Note that
          * postcopy completion doesn't rely on can_switchover, because when

> +            }
>              break;
>  
>          case MIG_RP_MSG_REQ_PAGES:
> @@ -2814,6 +2824,7 @@ static int postcopy_start(MigrationState *ms, Error **errp)
>      if (migrate_postcopy_ram()) {
>          qemu_savevm_send_ping(fb, 3);
>      }
> +    qemu_savevm_send_ping(fb, QEMU_VM_PING_PACKAGED_LOADED);

Nitpick: some comment would be nice here describing this ping, especially
when it becomes functional.  The name does provide some info, but it's not
entirely clear yet what PACKAGED_LOADED means when it arrives in a PONG.
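Something like this maybe (only a sketch of the kind of comment I mean,
wording is up to you):

    /*
     * The destination replies to this PING with a PONG on the return path
     * only after it has loaded the packaged device state, right before it
     * processes the POSTCOPY_RUN command, so the PONG tells the source it
     * can move from POSTCOPY_DEVICE to POSTCOPY_ACTIVE.
     */
    qemu_savevm_send_ping(fb, QEMU_VM_PING_PACKAGED_LOADED);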

>  
>      qemu_savevm_send_postcopy_run(fb);
>  
> @@ -2871,7 +2882,7 @@ static int postcopy_start(MigrationState *ms, Error **errp)
>  
>      /* Now, switchover looks all fine, switching to postcopy-active */
>      migrate_set_state(&ms->state, MIGRATION_STATUS_DEVICE,
> -                      MIGRATION_STATUS_POSTCOPY_ACTIVE);
> +                      MIGRATION_STATUS_POSTCOPY_DEVICE);
>  
>      bql_unlock();
>  
> @@ -3035,7 +3046,8 @@ static void migration_completion(MigrationState *s)
>  
>      if (s->state == MIGRATION_STATUS_ACTIVE) {
>          ret = migration_completion_precopy(s);
> -    } else if (s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
> +    } else if (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE ||
> +               s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {

Exactly.  We need to be prepared for the src sending too fast, so that we
reach completion while the dest is still loading the device state.

>          migration_completion_postcopy(s);
>      } else {
>          ret = -1;
> @@ -3311,8 +3323,8 @@ static MigThrError migration_detect_error(MigrationState *s)
>          return postcopy_pause(s);
>      } else {
>          /*
> -         * For precopy (or postcopy with error outside IO), we fail
> -         * with no time.
> +         * For precopy (or postcopy with error outside IO, or before dest
> +         * starts), we fail with no time.
>           */
>          migrate_set_state(&s->state, state, MIGRATION_STATUS_FAILED);
>          trace_migration_thread_file_err();
> @@ -3447,7 +3459,8 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>  {
>      uint64_t must_precopy, can_postcopy, pending_size;
>      Error *local_err = NULL;
> -    bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
> +    bool in_postcopy = (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE ||
> +                        s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
>      bool can_switchover = migration_can_switchover(s);
>      bool complete_ready;
>  
> diff --git a/migration/savevm.h b/migration/savevm.h
> index 2d5e9c7166..c4de0325eb 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -29,6 +29,8 @@
>  #define QEMU_VM_COMMAND              0x08
>  #define QEMU_VM_SECTION_FOOTER       0x7e
>  
> +#define QEMU_VM_PING_PACKAGED_LOADED 0x42

Only curious how you picked it. :)

It's fine, it can also be 5, as we know exactly what we used to use (1-4).

> +
>  bool qemu_savevm_state_blocked(Error **errp);
>  void qemu_savevm_non_migratable_list(strList **reasons);
>  int qemu_savevm_state_prepare(Error **errp);
> diff --git a/migration/trace-events b/migration/trace-events
> index 706db97def..007b5c407e 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -191,6 +191,7 @@ source_return_path_thread_pong(uint32_t val) "0x%x"
>  source_return_path_thread_shut(uint32_t val) "0x%x"
>  source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
>  source_return_path_thread_switchover_acked(void) ""
> +source_return_path_thread_dst_started(void) ""
>  migration_thread_low_pending(uint64_t pending) "%" PRIu64
>  migrate_transferred(uint64_t transferred, uint64_t time_spent, uint64_t bandwidth, uint64_t avail_bw, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " switchover_bw %" PRIu64 " max_size %" PRId64
>  process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 2387c21e9c..89a20d858d 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -142,6 +142,10 @@
>  # @postcopy-active: like active, but now in postcopy mode.
>  #     (since 2.5)
>  #
> +# @postcopy-device: like postcopy-active, but the destination is still
> +#     loading device state and is not running yet.  If migration fails
> +#     during this state, the source side will resume.  (since 10.2)

Is it worthwhile to mention that we could jump from postcopy-device straight
to completed?  I also wonder whether libvirt would get confused if that
happens.  Worth checking with Jiri.

Thanks,

> +#
>  # @postcopy-paused: during postcopy but paused.  (since 3.0)
>  #
>  # @postcopy-recover-setup: setup phase for a postcopy recovery
> @@ -173,8 +177,8 @@
>  ##
>  { 'enum': 'MigrationStatus',
>    'data': [ 'none', 'setup', 'cancelling', 'cancelled',
> -            'active', 'postcopy-active', 'postcopy-paused',
> -            'postcopy-recover-setup',
> +            'active', 'postcopy-device', 'postcopy-active',
> +            'postcopy-paused', 'postcopy-recover-setup',
>              'postcopy-recover', 'completed', 'failed', 'colo',
>              'pre-switchover', 'device', 'wait-unplug' ] }
>  ##
> diff --git a/tests/qtest/migration/precopy-tests.c b/tests/qtest/migration/precopy-tests.c
> index bb38292550..57ca623de5 100644
> --- a/tests/qtest/migration/precopy-tests.c
> +++ b/tests/qtest/migration/precopy-tests.c
> @@ -1316,13 +1316,14 @@ void migration_test_add_precopy(MigrationTestEnv *env)
>      }
>  
>      /* ensure new status don't go unnoticed */
> -    assert(MIGRATION_STATUS__MAX == 15);
> +    assert(MIGRATION_STATUS__MAX == 16);
>  
>      for (int i = MIGRATION_STATUS_NONE; i < MIGRATION_STATUS__MAX; i++) {
>          switch (i) {
>          case MIGRATION_STATUS_DEVICE: /* happens too fast */
>          case MIGRATION_STATUS_WAIT_UNPLUG: /* no support in tests */
>          case MIGRATION_STATUS_COLO: /* no support in tests */
> +        case MIGRATION_STATUS_POSTCOPY_DEVICE: /* postcopy can't be cancelled */
>          case MIGRATION_STATUS_POSTCOPY_ACTIVE: /* postcopy can't be cancelled */
>          case MIGRATION_STATUS_POSTCOPY_PAUSED:
>          case MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP:
> -- 
> 2.51.0
> 

-- 
Peter Xu



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/4] migration: Introduce POSTCOPY_DEVICE state
  2025-09-19 16:58   ` Peter Xu
@ 2025-09-19 17:50     ` Peter Xu
  2025-09-22 13:34       ` Juraj Marcin
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Xu @ 2025-09-19 17:50 UTC (permalink / raw)
  To: Juraj Marcin
  Cc: qemu-devel, Jiri Denemark, Dr. David Alan Gilbert, Fabiano Rosas

On Fri, Sep 19, 2025 at 12:58:08PM -0400, Peter Xu wrote:
> > @@ -2564,6 +2569,11 @@ static void *source_return_path_thread(void *opaque)
> >              tmp32 = ldl_be_p(buf);
> >              trace_source_return_path_thread_pong(tmp32);
> >              qemu_sem_post(&ms->rp_state.rp_pong_acks);
> > +            if (tmp32 == QEMU_VM_PING_PACKAGED_LOADED) {
> > +                trace_source_return_path_thread_dst_started();
> > +                migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_DEVICE,
> > +                                  MIGRATION_STATUS_POSTCOPY_ACTIVE);
> 
> Could this race with the migration thread modifying the state concurrently?
> 
> To avoid it, we could have a bool, set it here once, and in the iterations
> do something like:
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 10c216d25d..55230e10ee 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3449,6 +3449,16 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>      trace_migrate_pending_estimate(pending_size, must_precopy, can_postcopy);
>  
>      if (in_postcopy) {
> +        if (s->postcopy_package_loaded) {
> +            assert(s->state == MIGRATION_STATUS_POSTCOPY_DEVICE);
> +            migrate_set_state(s->state, MIGRATION_STATUS_POSTCOPY_DEVICE,
> +                              MIGRATION_STATUS_POSTCOPY_ACTIVE);
> +            /*
> +             * Since postcopy cannot be re-initiated, this flag will only
> +             * be set at most once for QEMU's whole lifecyce.
> +             */
> +            s->postcopy_package_loaded = false;
> +        }
>          /*
>           * Iterate in postcopy until all pending data flushed.  Note that
>           * postcopy completion doesn't rely on can_switchover, because when

[...]

> > @@ -2871,7 +2882,7 @@ static int postcopy_start(MigrationState *ms, Error **errp)
> >  
> >      /* Now, switchover looks all fine, switching to postcopy-active */
> >      migrate_set_state(&ms->state, MIGRATION_STATUS_DEVICE,
> > -                      MIGRATION_STATUS_POSTCOPY_ACTIVE);
> > +                      MIGRATION_STATUS_POSTCOPY_DEVICE);
> >  
> >      bql_unlock();
> >  
> > @@ -3035,7 +3046,8 @@ static void migration_completion(MigrationState *s)
> >  
> >      if (s->state == MIGRATION_STATUS_ACTIVE) {
> >          ret = migration_completion_precopy(s);
> > -    } else if (s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
> > +    } else if (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE ||
> > +               s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
> 
> Exactly.  We need to be prepared that src sending too fast so when device
> loading on dest we finished.

One thing more to mention here.. which may void some previous comments I
left.  Let's discuss.

I think it may also make sense to only allow a COMPLETE after
POSTCOPY_ACTIVE.

That is, if the src sends fast enough to finish sending everything and reach
COMPLETE while still in POSTCOPY_DEVICE, i.e. before it receives the new PONG
you defined, then.. I _think_ it is better to wait for that PONG.

If it finally arrives, then it's perfect, we switch to POSTCOPY_ACTIVE,
then continue the completion.

If the channel is broken before its arrival, logically we should handle
this case as a FAILURE and restart the VM on src.

It's only relevant in a corner case, but does that sound better?

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/4] migration: Accept MigrationStatus in migration_has_failed()
  2025-09-19 14:57   ` Peter Xu
@ 2025-09-22 11:26     ` Juraj Marcin
  0 siblings, 0 replies; 33+ messages in thread
From: Juraj Marcin @ 2025-09-22 11:26 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, Jiri Denemark, Dr. David Alan Gilbert, Fabiano Rosas

Hi Peter,

On 2025-09-19 10:57, Peter Xu wrote:
> On Mon, Sep 15, 2025 at 01:59:13PM +0200, Juraj Marcin wrote:
> > From: Juraj Marcin <jmarcin@redhat.com>
> > 
> > This allows to reuse the helper also with MigrationIncomingState.
> 
> I get the point, but just to mention that this helper doesn't really change
> much on incoming side on simplifying the code or function-wise, because we
> don't have CANCELLING/CANCELLED state on deste QEMU.. which is definitely
> not obvious.. :(
> 
> So:
> 
>   migration_has_failed(incoming->state)
> 
> Is exactly the same as:
> 
>   incoming->state == MIGRATION_STATUS_FAILED
> 
> Except it will make src start to pass in s->state.. which is slightly more
> awkward.
> 
> Maybe we keep the MIGRATION_STATUS_FAILED check in your next patch, and
> drop this one for now until it grows more than FAILED on dest?

Sure, that sounds like a good solution.

> 
> > 
> > Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
> > ---
> >  migration/migration.c | 8 ++++----
> >  migration/migration.h | 2 +-
> >  migration/multifd.c   | 2 +-
> >  3 files changed, 6 insertions(+), 6 deletions(-)
> > 
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 54dac3db88..2c0b3a7229 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -1542,7 +1542,7 @@ static void migration_cleanup(MigrationState *s)
> >          /* It is used on info migrate.  We can't free it */
> >          error_report_err(error_copy(s->error));
> >      }
> > -    type = migration_has_failed(s) ? MIG_EVENT_PRECOPY_FAILED :
> > +    type = migration_has_failed(s->state) ? MIG_EVENT_PRECOPY_FAILED :
> >                                       MIG_EVENT_PRECOPY_DONE;
> >      migration_call_notifiers(s, type, NULL);
> >      yank_unregister_instance(MIGRATION_YANK_INSTANCE);
> > @@ -1700,10 +1700,10 @@ int migration_call_notifiers(MigrationState *s, MigrationEventType type,
> >      return ret;
> >  }
> >  
> > -bool migration_has_failed(MigrationState *s)
> > +bool migration_has_failed(MigrationStatus state)
> >  {
> > -    return (s->state == MIGRATION_STATUS_CANCELLED ||
> > -            s->state == MIGRATION_STATUS_FAILED);
> > +    return (state == MIGRATION_STATUS_CANCELLED ||
> > +            state == MIGRATION_STATUS_FAILED);
> >  }
> >  
> >  bool migration_in_postcopy(void)
> > diff --git a/migration/migration.h b/migration/migration.h
> > index 01329bf824..2c2331f40d 100644
> > --- a/migration/migration.h
> > +++ b/migration/migration.h
> > @@ -535,7 +535,7 @@ bool migration_is_blocked(Error **errp);
> >  bool migration_in_postcopy(void);
> >  bool migration_postcopy_is_alive(MigrationStatus state);
> >  MigrationState *migrate_get_current(void);
> > -bool migration_has_failed(MigrationState *);
> > +bool migration_has_failed(MigrationStatus state);
> >  bool migrate_mode_is_cpr(MigrationState *);
> >  
> >  uint64_t ram_get_total_transferred_pages(void);
> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index b255778855..c569f91f2c 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -568,7 +568,7 @@ void multifd_send_shutdown(void)
> >               * already failed. If the migration succeeded, errors are
> >               * not expected but there's no need to kill the source.
> >               */
> > -            if (local_err && !migration_has_failed(migrate_get_current())) {
> > +            if (local_err && !migration_has_failed(migrate_get_current()->state)) {
> >                  warn_report(
> >                      "multifd_send_%d: Failed to terminate TLS connection: %s",
> >                      p->id, error_get_pretty(local_err));
> > -- 
> > 2.51.0
> > 
> 
> -- 
> Peter Xu
> 



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/4] migration: Refactor incoming cleanup into migration_incoming_finish()
  2025-09-19 16:46   ` Fabiano Rosas
@ 2025-09-22 12:58     ` Juraj Marcin
  2025-09-22 15:51       ` Peter Xu
  0 siblings, 1 reply; 33+ messages in thread
From: Juraj Marcin @ 2025-09-22 12:58 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: qemu-devel, Jiri Denemark, Peter Xu, Dr. David Alan Gilbert

Hi Fabiano,

On 2025-09-19 13:46, Fabiano Rosas wrote:
> Juraj Marcin <jmarcin@redhat.com> writes:
> 
> Hi Juraj,
> 
> Good patch, nice use of migrate_has_failed()

Thanks!

> 
> > From: Juraj Marcin <jmarcin@redhat.com>
> >
> > Currently, there are two functions that are responsible for cleanup of
> > the incoming migration state. With successful precopy, it's the main
> > thread and with successful postcopy it's the listen thread. However, if
> > postcopy fails during in the device load, both functions will try to do
> > the cleanup. Moreover, when exit-on-error parameter was added, it was
> > applied only to precopy.
> >
> 
> Someone could be relying in postcopy always exiting on error while
> explicitly setting exit-on-error=false for precopy and this patch would
> change the behavior incompatibly. Is this an issue? I'm willing to
> ignore it, but you guys know more about postcopy.

Good question. When going through the older patches where the postcopy listen
thread and, later, exit-on-error were implemented, it seemed more like an
oversight than an intentional omission. However, it might be better not to
break any potential users of this; we could add another option,
"exit-on-postcopy-error", that would allow such handling if postcopy
failed unrecoverably. I've already talked about such an option with
@jdenemar and he agreed with it.
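Just to illustrate what I mean (a rough sketch only, the exact QAPI naming
and placement is still to be decided), it could sit next to the existing
exit-on-error argument of migrate-incoming:

    # hypothetical sketch, not a final interface
    { 'command': 'migrate-incoming',
      'data': { '*uri': 'str',
                '*channels': [ 'MigrationChannel' ],
                '*exit-on-error': 'bool',
                '*exit-on-postcopy-error': 'bool' } }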

> 
> > This patch refactors common cleanup and exiting on error into a helper
> > function that can be started either from precopy or postcopy, reducing
> > the duplication. If the listen thread has been started (the postcopy
> > state is at least LISTENING), the listen thread is responsible for all
> > cleanup and exiting, otherwise it's the main thread's responsibility.
> 
> Don't the BHs also run in the main thread? I'm not sure this sentence is
> accurate.

Yeah, it is a bit inaccurate now that you mention it; BHs are indeed run
in the main thread. It would be more accurate to replace "the main thread"
with the process incoming migration coroutine, which would precisely describe
who is responsible for calling the cleanup.

> 
> >
> > Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
> > ---
> >  migration/migration.c | 64 ++++++++++++++++++++++++-------------------
> >  migration/migration.h |  1 +
> >  migration/savevm.c    | 48 +++++++++++---------------------
> 
> Could someone act on the TODOs and move postcopy code into postcopy-ram?
> It's never too late to make things consistent.

I can take a look.

> 
> >  3 files changed, 53 insertions(+), 60 deletions(-)
> >
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 2c0b3a7229..7222e3de13 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -442,9 +442,19 @@ void migration_incoming_transport_cleanup(MigrationIncomingState *mis)
> >  void migration_incoming_state_destroy(void)
> >  {
> >      struct MigrationIncomingState *mis = migration_incoming_get_current();
> > +    PostcopyState ps = postcopy_state_get();
> >  
> >      multifd_recv_cleanup();
> >  
> > +    if (mis->have_listen_thread) {
> > +        qemu_thread_join(&mis->listen_thread);
> > +        mis->have_listen_thread = false;
> > +    }
> > +
> > +    if (ps != POSTCOPY_INCOMING_NONE) {
> > +        postcopy_ram_incoming_cleanup(mis);
> > +    }
> > +
> >      /*
> >       * RAM state cleanup needs to happen after multifd cleanup, because
> >       * multifd threads can use some of its states (receivedmap).
> > @@ -809,6 +819,23 @@ static void qemu_start_incoming_migration(const char *uri, bool has_channels,
> >      cpr_state_close();
> >  }
> >  
> > +void migration_incoming_finish(void)
> > +{
> > +    MigrationState *s = migrate_get_current();
> > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > +
> > +    migration_incoming_state_destroy();
> > +
> > +    if (migration_has_failed(mis->state) && mis->exit_on_error) {
> > +        WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
> > +            error_report_err(s->error);
> > +            s->error = NULL;
> > +        }
> > +
> > +        exit(EXIT_FAILURE);
> > +    }
> > +}
> > +
> >  static void process_incoming_migration_bh(void *opaque)
> >  {
> >      MigrationIncomingState *mis = opaque;
> > @@ -861,7 +888,7 @@ static void process_incoming_migration_bh(void *opaque)
> >       */
> >      migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
> >                        MIGRATION_STATUS_COMPLETED);
> > -    migration_incoming_state_destroy();
> > +    migration_incoming_finish();
> >  }
> >  
> >  static void coroutine_fn
> > @@ -888,23 +915,13 @@ process_incoming_migration_co(void *opaque)
> >  
> >      ps = postcopy_state_get();
> >      trace_process_incoming_migration_co_end(ret, ps);
> > -    if (ps != POSTCOPY_INCOMING_NONE) {
> > -        if (ps == POSTCOPY_INCOMING_ADVISE) {
> > -            /*
> > -             * Where a migration had postcopy enabled (and thus went to advise)
> > -             * but managed to complete within the precopy period, we can use
> > -             * the normal exit.
> > -             */
> > -            postcopy_ram_incoming_cleanup(mis);
> > -        } else if (ret >= 0) {
> > -            /*
> > -             * Postcopy was started, cleanup should happen at the end of the
> > -             * postcopy thread.
> > -             */
> > -            trace_process_incoming_migration_co_postcopy_end_main();
> > -            goto out;
> > -        }
> > -        /* Else if something went wrong then just fall out of the normal exit */
> > +    if (ps >= POSTCOPY_INCOMING_LISTENING) {
> > +        /*
> > +         * Postcopy was started, cleanup should happen at the end of the
> > +         * postcopy thread.
> > +         */
> > +        trace_process_incoming_migration_co_postcopy_end_main();
> > +        goto out;
> >      }
> >  
> >      if (ret < 0) {
> > @@ -926,16 +943,7 @@ fail:
> >      migrate_set_error(s, local_err);
> >      error_free(local_err);
> >  
> > -    migration_incoming_state_destroy();
> > -
> > -    if (mis->exit_on_error) {
> > -        WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
> > -            error_report_err(s->error);
> > -            s->error = NULL;
> > -        }
> > -
> > -        exit(EXIT_FAILURE);
> > -    }
> > +    migration_incoming_finish();
> >  out:
> >      /* Pairs with the refcount taken in qmp_migrate_incoming() */
> >      migrate_incoming_unref_outgoing_state();
> > diff --git a/migration/migration.h b/migration/migration.h
> > index 2c2331f40d..67e3318467 100644
> > --- a/migration/migration.h
> > +++ b/migration/migration.h
> > @@ -518,6 +518,7 @@ void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
> >  void migration_fd_process_incoming(QEMUFile *f);
> >  void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp);
> >  void migration_incoming_process(void);
> > +void migration_incoming_finish(void);
> 
> What about merging migration_incoming_state_destroy and
> migration_incoming_finish and pair all of this with migration_cleanup?
> 
> static void migration_cleanup_bh(void *opaque)
> {
>     migration_cleanup(opaque);
> }
> 
> static void migration_incoming_cleanup_bh(void *opaque)
> {
>     migration_incoming_cleanup(opaque);
> }

Yes, it would pair well. I guess it would also solve Peter's nitpick
about the change in process_incoming_migration_bh() above.

> >  
> >  bool  migration_has_all_channels(void);
> >  
> > diff --git a/migration/savevm.c b/migration/savevm.c
> > index fabbeb296a..d7eb416d48 100644
> > --- a/migration/savevm.c
> > +++ b/migration/savevm.c
> > @@ -2069,6 +2069,11 @@ static int loadvm_postcopy_ram_handle_discard(MigrationIncomingState *mis,
> >      return 0;
> >  }
> >  
> > +static void postcopy_ram_listen_thread_bh(void *opaque)
> > +{
> > +    migration_incoming_finish();
> > +}
> > +
> >  /*
> >   * Triggered by a postcopy_listen command; this thread takes over reading
> >   * the input stream, leaving the main thread free to carry on loading the rest
> > @@ -2122,52 +2127,31 @@ static void *postcopy_ram_listen_thread(void *opaque)
> >                           "bitmaps may be lost, and present migrated dirty "
> >                           "bitmaps are correctly migrated and valid.",
> >                           __func__, load_res);
> > -            load_res = 0; /* prevent further exit() */
> >          } else {
> >              error_report("%s: loadvm failed: %d", __func__, load_res);
> >              migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
> >                                             MIGRATION_STATUS_FAILED);
> > +            goto out;
> >          }
> >      }
> > -    if (load_res >= 0) {
> > -        /*
> > -         * This looks good, but it's possible that the device loading in the
> > -         * main thread hasn't finished yet, and so we might not be in 'RUN'
> > -         * state yet; wait for the end of the main thread.
> > -         */
> > -        qemu_event_wait(&mis->main_thread_load_event);
> > -    }
> > -    postcopy_ram_incoming_cleanup(mis);
> > -
> > -    if (load_res < 0) {
> > -        /*
> > -         * If something went wrong then we have a bad state so exit;
> > -         * depending how far we got it might be possible at this point
> > -         * to leave the guest running and fire MCEs for pages that never
> > -         * arrived as a desperate recovery step.
> > -         */
> > -        rcu_unregister_thread();
> > -        exit(EXIT_FAILURE);
> > -    }
> > +    /*
> > +     * This looks good, but it's possible that the device loading in the
> > +     * main thread hasn't finished yet, and so we might not be in 'RUN'
> > +     * state yet; wait for the end of the main thread.
> > +     */
> > +    qemu_event_wait(&mis->main_thread_load_event);
> >  
> >      migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
> >                                     MIGRATION_STATUS_COMPLETED);
> > -    /*
> > -     * If everything has worked fine, then the main thread has waited
> > -     * for us to start, and we're the last use of the mis.
> > -     * (If something broke then qemu will have to exit anyway since it's
> > -     * got a bad migration state).
> > -     */
> > -    bql_lock();
> > -    migration_incoming_state_destroy();
> > -    bql_unlock();
> >  
> > +out:
> >      rcu_unregister_thread();
> > -    mis->have_listen_thread = false;
> >      postcopy_state_set(POSTCOPY_INCOMING_END);
> >  
> >      object_unref(OBJECT(migr));
> >  
> > +    migration_bh_schedule(postcopy_ram_listen_thread_bh, NULL);
> 
> Better to schedule before the object_unref to ensure there's always
> someone holding a reference?

True, I'll move it.

> 
> > +
> >      return NULL;
> >  }
> >  
> > @@ -2217,7 +2201,7 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
> >      mis->have_listen_thread = true;
> >      postcopy_thread_create(mis, &mis->listen_thread,
> >                             MIGRATION_THREAD_DST_LISTEN,
> > -                           postcopy_ram_listen_thread, QEMU_THREAD_DETACHED);
> > +                           postcopy_ram_listen_thread, QEMU_THREAD_JOINABLE);
> >      trace_loadvm_postcopy_handle_listen("return");
> >  
> >      return 0;
> 



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/4] migration: Introduce POSTCOPY_DEVICE state
  2025-09-19 17:50     ` Peter Xu
@ 2025-09-22 13:34       ` Juraj Marcin
  2025-09-22 16:16         ` Peter Xu
  0 siblings, 1 reply; 33+ messages in thread
From: Juraj Marcin @ 2025-09-22 13:34 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, Jiri Denemark, Dr. David Alan Gilbert, Fabiano Rosas

Hi Peter,

On 2025-09-19 13:50, Peter Xu wrote:
> On Fri, Sep 19, 2025 at 12:58:08PM -0400, Peter Xu wrote:
> > > @@ -2564,6 +2569,11 @@ static void *source_return_path_thread(void *opaque)
> > >              tmp32 = ldl_be_p(buf);
> > >              trace_source_return_path_thread_pong(tmp32);
> > >              qemu_sem_post(&ms->rp_state.rp_pong_acks);
> > > +            if (tmp32 == QEMU_VM_PING_PACKAGED_LOADED) {
> > > +                trace_source_return_path_thread_dst_started();
> > > +                migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_DEVICE,
> > > +                                  MIGRATION_STATUS_POSTCOPY_ACTIVE);
> > 
> > Could this race with the migration thread modifying the state concurrently?
> > 
> > To avoid it, we could have a bool, set it here once, and in the iterations
> > do something like:
> > 
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 10c216d25d..55230e10ee 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -3449,6 +3449,16 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> >      trace_migrate_pending_estimate(pending_size, must_precopy, can_postcopy);
> >  
> >      if (in_postcopy) {
> > +        if (s->postcopy_package_loaded) {
> > +            assert(s->state == MIGRATION_STATUS_POSTCOPY_DEVICE);
> > +            migrate_set_state(s->state, MIGRATION_STATUS_POSTCOPY_DEVICE,
> > +                              MIGRATION_STATUS_POSTCOPY_ACTIVE);
> > +            /*
> > +             * Since postcopy cannot be re-initiated, this flag will only
> > +             * be set at most once for QEMU's whole lifecyce.
> > +             */
> > +            s->postcopy_package_loaded = false;
> > +        }
> >          /*
> >           * Iterate in postcopy until all pending data flushed.  Note that
> >           * postcopy completion doesn't rely on can_switchover, because when

It was there in the RFC version, when there was a mutual handshake before
the dst started, but I thought cmp&exchange would be enough. I can add it
again; however, there is no need to set it to false afterwards, we can
simply check whether this condition is true:
s->postcopy_package_loaded && s->state == MIGRATION_STATUS_POSTCOPY_DEVICE.
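I.e. roughly this in migration_iteration_run() (just a sketch, assuming the
postcopy_package_loaded field from your diff):

    if (s->postcopy_package_loaded &&
        s->state == MIGRATION_STATUS_POSTCOPY_DEVICE) {
        /* no need to clear the flag, the state check makes this a one-off */
        migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_DEVICE,
                          MIGRATION_STATUS_POSTCOPY_ACTIVE);
    }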

> 
> [...]
> 
> > > @@ -2871,7 +2882,7 @@ static int postcopy_start(MigrationState *ms, Error **errp)
> > >  
> > >      /* Now, switchover looks all fine, switching to postcopy-active */
> > >      migrate_set_state(&ms->state, MIGRATION_STATUS_DEVICE,
> > > -                      MIGRATION_STATUS_POSTCOPY_ACTIVE);
> > > +                      MIGRATION_STATUS_POSTCOPY_DEVICE);
> > >  
> > >      bql_unlock();
> > >  
> > > @@ -3035,7 +3046,8 @@ static void migration_completion(MigrationState *s)
> > >  
> > >      if (s->state == MIGRATION_STATUS_ACTIVE) {
> > >          ret = migration_completion_precopy(s);
> > > -    } else if (s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
> > > +    } else if (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE ||
> > > +               s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
> > 
> > Exactly.  We need to be prepared that src sending too fast so when device
> > loading on dest we finished.
> 
> One thing more to mention here.. which may void some previous comments I
> left.  Let's discuss.
> 
> I think it may also make sense to only allow a COMPLETE after
> POSTCOPY_ACTIVE.
> 
> That is, if src sends too fast to have finished sending everything,
> reaching COMPLETE during POSTCOPY_DEVICE, that is, while before it receives
> the new PONG you defined, then.. I _think_ it is better to wait for that.
> 
> If it finally arrives, then it's perfect, we switch to POSTCOPY_ACTIVE,
> then continue the completion.
> 
> If the channel is broken before its arrival, logically we should handle
> this case as a FAILURE and restart the VM on src.
> 
> It's only relevant in a corner case, but does that sound better?

Yes, it does make sense to wait for POSTCOPY_ACTIVE, as the src could finish
before the dst has finished the package load, which could still fail. We
could use qemu_event_wait() to wait, and set the event in the src return
path thread when the PONG is received.

> 
> -- 
> Peter Xu
> 



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/4] migration: Refactor incoming cleanup into migration_incoming_finish()
  2025-09-22 12:58     ` Juraj Marcin
@ 2025-09-22 15:51       ` Peter Xu
  2025-09-22 17:40         ` Fabiano Rosas
  2025-09-23 14:58         ` Juraj Marcin
  0 siblings, 2 replies; 33+ messages in thread
From: Peter Xu @ 2025-09-22 15:51 UTC (permalink / raw)
  To: Juraj Marcin
  Cc: Fabiano Rosas, qemu-devel, Jiri Denemark, Dr. David Alan Gilbert

On Mon, Sep 22, 2025 at 02:58:38PM +0200, Juraj Marcin wrote:
> Hi Fabiano,
> 
> On 2025-09-19 13:46, Fabiano Rosas wrote:
> > Juraj Marcin <jmarcin@redhat.com> writes:
> > 
> > Hi Juraj,
> > 
> > Good patch, nice use of migrate_has_failed()
> 
> Thanks!
> 
> > 
> > > From: Juraj Marcin <jmarcin@redhat.com>
> > >
> > > Currently, there are two functions that are responsible for cleanup of
> > > the incoming migration state. With successful precopy, it's the main
> > > thread and with successful postcopy it's the listen thread. However, if
> > > postcopy fails during in the device load, both functions will try to do
> > > the cleanup. Moreover, when exit-on-error parameter was added, it was
> > > applied only to precopy.
> > >
> > 
> > Someone could be relying in postcopy always exiting on error while
> > explicitly setting exit-on-error=false for precopy and this patch would
> > change the behavior incompatibly. Is this an issue? I'm willing to
> > ignore it, but you guys know more about postcopy.
> 
> Good question. When going through older patches where postcopy listen
> thread and then where exit-on-error were implemented, it seemed more
> like an overlook than intentional omission. However, it might be better
> to not break any potential users of this, we could add another option,
> "exit-on-postcopy-error" that would allow such handling if postscopy
> failed unrecoverably. I've already talked about such option with
> @jdenemar and he agreed with it.

The idea for postcopy-ram is that it should never fail.. as failing should
never be better than a pause.  Block dirty bitmap might be different,
though, when enabled separately.

For postcopy-ram, qemu_loadvm_state_main() will in reality only receive RAM
updates. It'll almost always trigger the postcopy_pause_incoming() path
when anything fails.

For pure block-dirty-bitmap-only styled postcopy: for this exit-on-error, I
also don't think we should really "exit on errors", even if the flag is
set.  IIUC, it's not fatal to the VM if that failed, as described in:

commit ee64722514fabcad2430982ade86180208f5be4f
Author: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Date:   Mon Jul 27 22:42:32 2020 +0300

    migration/savevm: don't worry if bitmap migration postcopy failed

    ...

    And anyway, bitmaps postcopy is not prepared to be somehow recovered.
    The original idea instead is that if bitmaps postcopy failed, we just
    lose some bitmaps, which is not critical. So, on failure we just need
    to remove unfinished bitmaps and guest should continue execution on
    destination.

Hence, exiting here might be overkill.. need block developers to double
check, though..

> 
> > 
> > > This patch refactors common cleanup and exiting on error into a helper
> > > function that can be started either from precopy or postcopy, reducing
> > > the duplication. If the listen thread has been started (the postcopy
> > > state is at least LISTENING), the listen thread is responsible for all
> > > cleanup and exiting, otherwise it's the main thread's responsibility.
> > 
> > Don't the BHs also run in the main thread? I'm not sure this sentence is
> > accurate.
> 
> Yeah, it is a bit inaccurate now that you mention it, BHs are indeed run
> in the main thread. More accurate would to replace the main thread with
> process incoming migration coroutine, that would precisely describe who
> has the responsibility to call the cleanup.
> 
> > 
> > >
> > > Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
> > > ---
> > >  migration/migration.c | 64 ++++++++++++++++++++++++-------------------
> > >  migration/migration.h |  1 +
> > >  migration/savevm.c    | 48 +++++++++++---------------------
> > 
> > Could someone act on the TODOs and move postcopy code into postcopy-ram?
> > It's never too late to make things consistent.
> 
> I can take a look.
> 
> > 
> > >  3 files changed, 53 insertions(+), 60 deletions(-)
> > >
> > > diff --git a/migration/migration.c b/migration/migration.c
> > > index 2c0b3a7229..7222e3de13 100644
> > > --- a/migration/migration.c
> > > +++ b/migration/migration.c
> > > @@ -442,9 +442,19 @@ void migration_incoming_transport_cleanup(MigrationIncomingState *mis)
> > >  void migration_incoming_state_destroy(void)
> > >  {
> > >      struct MigrationIncomingState *mis = migration_incoming_get_current();
> > > +    PostcopyState ps = postcopy_state_get();
> > >  
> > >      multifd_recv_cleanup();
> > >  
> > > +    if (mis->have_listen_thread) {
> > > +        qemu_thread_join(&mis->listen_thread);
> > > +        mis->have_listen_thread = false;
> > > +    }
> > > +
> > > +    if (ps != POSTCOPY_INCOMING_NONE) {
> > > +        postcopy_ram_incoming_cleanup(mis);
> > > +    }
> > > +
> > >      /*
> > >       * RAM state cleanup needs to happen after multifd cleanup, because
> > >       * multifd threads can use some of its states (receivedmap).
> > > @@ -809,6 +819,23 @@ static void qemu_start_incoming_migration(const char *uri, bool has_channels,
> > >      cpr_state_close();
> > >  }
> > >  
> > > +void migration_incoming_finish(void)
> > > +{
> > > +    MigrationState *s = migrate_get_current();
> > > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > > +
> > > +    migration_incoming_state_destroy();
> > > +
> > > +    if (migration_has_failed(mis->state) && mis->exit_on_error) {
> > > +        WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
> > > +            error_report_err(s->error);
> > > +            s->error = NULL;
> > > +        }
> > > +
> > > +        exit(EXIT_FAILURE);
> > > +    }
> > > +}
> > > +
> > >  static void process_incoming_migration_bh(void *opaque)
> > >  {
> > >      MigrationIncomingState *mis = opaque;
> > > @@ -861,7 +888,7 @@ static void process_incoming_migration_bh(void *opaque)
> > >       */
> > >      migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
> > >                        MIGRATION_STATUS_COMPLETED);
> > > -    migration_incoming_state_destroy();
> > > +    migration_incoming_finish();
> > >  }
> > >  
> > >  static void coroutine_fn
> > > @@ -888,23 +915,13 @@ process_incoming_migration_co(void *opaque)
> > >  
> > >      ps = postcopy_state_get();
> > >      trace_process_incoming_migration_co_end(ret, ps);
> > > -    if (ps != POSTCOPY_INCOMING_NONE) {
> > > -        if (ps == POSTCOPY_INCOMING_ADVISE) {
> > > -            /*
> > > -             * Where a migration had postcopy enabled (and thus went to advise)
> > > -             * but managed to complete within the precopy period, we can use
> > > -             * the normal exit.
> > > -             */
> > > -            postcopy_ram_incoming_cleanup(mis);
> > > -        } else if (ret >= 0) {
> > > -            /*
> > > -             * Postcopy was started, cleanup should happen at the end of the
> > > -             * postcopy thread.
> > > -             */
> > > -            trace_process_incoming_migration_co_postcopy_end_main();
> > > -            goto out;
> > > -        }
> > > -        /* Else if something went wrong then just fall out of the normal exit */
> > > +    if (ps >= POSTCOPY_INCOMING_LISTENING) {
> > > +        /*
> > > +         * Postcopy was started, cleanup should happen at the end of the
> > > +         * postcopy thread.
> > > +         */
> > > +        trace_process_incoming_migration_co_postcopy_end_main();
> > > +        goto out;
> > >      }
> > >  
> > >      if (ret < 0) {
> > > @@ -926,16 +943,7 @@ fail:
> > >      migrate_set_error(s, local_err);
> > >      error_free(local_err);
> > >  
> > > -    migration_incoming_state_destroy();
> > > -
> > > -    if (mis->exit_on_error) {
> > > -        WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
> > > -            error_report_err(s->error);
> > > -            s->error = NULL;
> > > -        }
> > > -
> > > -        exit(EXIT_FAILURE);
> > > -    }
> > > +    migration_incoming_finish();
> > >  out:
> > >      /* Pairs with the refcount taken in qmp_migrate_incoming() */
> > >      migrate_incoming_unref_outgoing_state();
> > > diff --git a/migration/migration.h b/migration/migration.h
> > > index 2c2331f40d..67e3318467 100644
> > > --- a/migration/migration.h
> > > +++ b/migration/migration.h
> > > @@ -518,6 +518,7 @@ void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
> > >  void migration_fd_process_incoming(QEMUFile *f);
> > >  void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp);
> > >  void migration_incoming_process(void);
> > > +void migration_incoming_finish(void);
> > 
> > What about merging migration_incoming_state_destroy and
> > migration_incoming_finish and pair all of this with migration_cleanup?
> > 
> > static void migration_cleanup_bh(void *opaque)
> > {
> >     migration_cleanup(opaque);
> > }
> > 
> > static void migration_incoming_cleanup_bh(void *opaque)
> > {
> >     migration_incoming_cleanup(opaque);
> > }
> 
> Yes, it would pair well. I guess it would also solve Peter's nitpick
> about the change in process_incoming_migration_bh() above.
> 
> > >  
> > >  bool  migration_has_all_channels(void);
> > >  
> > > diff --git a/migration/savevm.c b/migration/savevm.c
> > > index fabbeb296a..d7eb416d48 100644
> > > --- a/migration/savevm.c
> > > +++ b/migration/savevm.c
> > > @@ -2069,6 +2069,11 @@ static int loadvm_postcopy_ram_handle_discard(MigrationIncomingState *mis,
> > >      return 0;
> > >  }
> > >  
> > > +static void postcopy_ram_listen_thread_bh(void *opaque)
> > > +{
> > > +    migration_incoming_finish();
> > > +}
> > > +
> > >  /*
> > >   * Triggered by a postcopy_listen command; this thread takes over reading
> > >   * the input stream, leaving the main thread free to carry on loading the rest
> > > @@ -2122,52 +2127,31 @@ static void *postcopy_ram_listen_thread(void *opaque)
> > >                           "bitmaps may be lost, and present migrated dirty "
> > >                           "bitmaps are correctly migrated and valid.",
> > >                           __func__, load_res);
> > > -            load_res = 0; /* prevent further exit() */
> > >          } else {
> > >              error_report("%s: loadvm failed: %d", __func__, load_res);
> > >              migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
> > >                                             MIGRATION_STATUS_FAILED);
> > > +            goto out;
> > >          }
> > >      }
> > > -    if (load_res >= 0) {
> > > -        /*
> > > -         * This looks good, but it's possible that the device loading in the
> > > -         * main thread hasn't finished yet, and so we might not be in 'RUN'
> > > -         * state yet; wait for the end of the main thread.
> > > -         */
> > > -        qemu_event_wait(&mis->main_thread_load_event);
> > > -    }
> > > -    postcopy_ram_incoming_cleanup(mis);
> > > -
> > > -    if (load_res < 0) {
> > > -        /*
> > > -         * If something went wrong then we have a bad state so exit;
> > > -         * depending how far we got it might be possible at this point
> > > -         * to leave the guest running and fire MCEs for pages that never
> > > -         * arrived as a desperate recovery step.
> > > -         */
> > > -        rcu_unregister_thread();
> > > -        exit(EXIT_FAILURE);
> > > -    }
> > > +    /*
> > > +     * This looks good, but it's possible that the device loading in the
> > > +     * main thread hasn't finished yet, and so we might not be in 'RUN'
> > > +     * state yet; wait for the end of the main thread.
> > > +     */
> > > +    qemu_event_wait(&mis->main_thread_load_event);

PS: I didn't notice this change; it looks like moving this out of the if may
be better as a separate patch.  Meanwhile, I don't think we set it quite
right either, in qemu_loadvm_state():

    qemu_event_set(&mis->main_thread_load_event);

The problem is that e.g. load_snapshot() / qmp_xen_load_devices_state() also
set that event, even though there'll be no one to consume it.. not a huge
deal, but maybe while moving the wait out of the if, we can also clean up
the set() side, by moving the set() up into process_incoming_migration_co().
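I.e. something like this in process_incoming_migration_co() (sketch only):

    ret = qemu_loadvm_state(mis->from_src_file);
    /* only incoming migration has a listen thread waiting on this event */
    qemu_event_set(&mis->main_thread_load_event);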

> > >  
> > >      migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
> > >                                     MIGRATION_STATUS_COMPLETED);
> > > -    /*
> > > -     * If everything has worked fine, then the main thread has waited
> > > -     * for us to start, and we're the last use of the mis.
> > > -     * (If something broke then qemu will have to exit anyway since it's
> > > -     * got a bad migration state).
> > > -     */
> > > -    bql_lock();
> > > -    migration_incoming_state_destroy();
> > > -    bql_unlock();
> > >  
> > > +out:
> > >      rcu_unregister_thread();
> > > -    mis->have_listen_thread = false;
> > >      postcopy_state_set(POSTCOPY_INCOMING_END);
> > >  
> > >      object_unref(OBJECT(migr));
> > >  
> > > +    migration_bh_schedule(postcopy_ram_listen_thread_bh, NULL);
> > 
> > Better to schedule before the object_unref to ensure there's always
> > someone holding a reference?
> 
> True, I'll move it.

Good point.  Though I'm not sure moving it up would help, because it'll
be the BH that references the MigrationState*..  So maybe we could unref at
the end of postcopy_ram_listen_thread_bh().  If so, we should add a comment
on ref() / unref() saying how they're paired.
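Something like this, roughly:

    static void postcopy_ram_listen_thread_bh(void *opaque)
    {
        migration_incoming_finish();
        /* Pairs with the object_ref() taken by postcopy_ram_listen_thread() */
        object_unref(OBJECT(migrate_get_current()));
    }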

> 
> > 
> > > +
> > >      return NULL;
> > >  }
> > >  
> > > @@ -2217,7 +2201,7 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
> > >      mis->have_listen_thread = true;
> > >      postcopy_thread_create(mis, &mis->listen_thread,
> > >                             MIGRATION_THREAD_DST_LISTEN,
> > > -                           postcopy_ram_listen_thread, QEMU_THREAD_DETACHED);
> > > +                           postcopy_ram_listen_thread, QEMU_THREAD_JOINABLE);
> > >      trace_loadvm_postcopy_handle_listen("return");
> > >  
> > >      return 0;
> > 
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/4] migration: Introduce POSTCOPY_DEVICE state
  2025-09-22 13:34       ` Juraj Marcin
@ 2025-09-22 16:16         ` Peter Xu
  2025-09-23 14:23           ` Juraj Marcin
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Xu @ 2025-09-22 16:16 UTC (permalink / raw)
  To: Juraj Marcin
  Cc: qemu-devel, Jiri Denemark, Dr. David Alan Gilbert, Fabiano Rosas

On Mon, Sep 22, 2025 at 03:34:19PM +0200, Juraj Marcin wrote:
> Hi Peter,
> 
> On 2025-09-19 13:50, Peter Xu wrote:
> > On Fri, Sep 19, 2025 at 12:58:08PM -0400, Peter Xu wrote:
> > > > @@ -2564,6 +2569,11 @@ static void *source_return_path_thread(void *opaque)
> > > >              tmp32 = ldl_be_p(buf);
> > > >              trace_source_return_path_thread_pong(tmp32);
> > > >              qemu_sem_post(&ms->rp_state.rp_pong_acks);
> > > > +            if (tmp32 == QEMU_VM_PING_PACKAGED_LOADED) {
> > > > +                trace_source_return_path_thread_dst_started();
> > > > +                migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_DEVICE,
> > > > +                                  MIGRATION_STATUS_POSTCOPY_ACTIVE);
> > > 
> > > Could this race with the migration thread modifying the state concurrently?
> > > 
> > > To avoid it, we could have a bool, set it here once, and in the iterations
> > > do something like:
> > > 
> > > diff --git a/migration/migration.c b/migration/migration.c
> > > index 10c216d25d..55230e10ee 100644
> > > --- a/migration/migration.c
> > > +++ b/migration/migration.c
> > > @@ -3449,6 +3449,16 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> > >      trace_migrate_pending_estimate(pending_size, must_precopy, can_postcopy);
> > >  
> > >      if (in_postcopy) {
> > > +        if (s->postcopy_package_loaded) {
> > > +            assert(s->state == MIGRATION_STATUS_POSTCOPY_DEVICE);
> > > +            migrate_set_state(s->state, MIGRATION_STATUS_POSTCOPY_DEVICE,
> > > +                              MIGRATION_STATUS_POSTCOPY_ACTIVE);
> > > +            /*
> > > +             * Since postcopy cannot be re-initiated, this flag will only
> > > +             * be set at most once for QEMU's whole lifecyce.
> > > +             */
> > > +            s->postcopy_package_loaded = false;
> > > +        }
> > >          /*
> > >           * Iterate in postcopy until all pending data flushed.  Note that
> > >           * postcopy completion doesn't rely on can_switchover, because when
> 
> It was there in the RFC version, when there was mutual handshake before
> dst starting, but I thought cmp&exchange would be enough. I can add it
> again, however, there is no need to set it to false afterwards, we can
> simply check if this condition is true:
> s->postcopy_package_loaded && s->state == MIGRATION_STATUS_POSTCOPY_DEVICE.

Setting it to false was a safety measure.  Indeed it's not needed, but then
we need to be extremely careful to always check both of the above conditions
to avoid it triggering repeatedly.  So I thought clearing it would be much
easier to read.  I can wait and read the new version, whichever you prefer.

> 
> > 
> > [...]
> > 
> > > > @@ -2871,7 +2882,7 @@ static int postcopy_start(MigrationState *ms, Error **errp)
> > > >  
> > > >      /* Now, switchover looks all fine, switching to postcopy-active */
> > > >      migrate_set_state(&ms->state, MIGRATION_STATUS_DEVICE,
> > > > -                      MIGRATION_STATUS_POSTCOPY_ACTIVE);
> > > > +                      MIGRATION_STATUS_POSTCOPY_DEVICE);
> > > >  
> > > >      bql_unlock();
> > > >  
> > > > @@ -3035,7 +3046,8 @@ static void migration_completion(MigrationState *s)
> > > >  
> > > >      if (s->state == MIGRATION_STATUS_ACTIVE) {
> > > >          ret = migration_completion_precopy(s);
> > > > -    } else if (s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
> > > > +    } else if (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE ||
> > > > +               s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
> > > 
> > > Exactly.  We need to be prepared that src sending too fast so when device
> > > loading on dest we finished.
> > 
> > One thing more to mention here.. which may void some previous comments I
> > left.  Let's discuss.
> > 
> > I think it may also make sense to only allow a COMPLETE after
> > POSTCOPY_ACTIVE.
> > 
> > That is, if src sends too fast to have finished sending everything,
> > reaching COMPLETE during POSTCOPY_DEVICE, that is, while before it receives
> > the new PONG you defined, then.. I _think_ it is better to wait for that.
> > 
> > If it finally arrives, then it's perfect, we switch to POSTCOPY_ACTIVE,
> > then continue the completion.
> > 
> > If the channel is broken before its arrival, logically we should handle
> > this case as a FAILURE and restart the VM on src.
> > 
> > It's only relevant in a corner case, but does that sound better?
> 
> Yes, it does make sense to wait for POSTCOPY_ACTIVE as src could finish
> before dst finished package load and could still fail. We could use a
> qemu_event_wait() to wait and set the event in src return path thread
> when the PONG is received.

Right.

Though since we want to move to POSTCOPY_ACTIVE asap when receiving the
corresponding PONG, we may want to have something like qemu_event_read()
just return "ev->value == EV_SET", as we can't event_wait() in the
migration thread while pushing RAMs.

Or, to avoid touching library code, we can also introduce yet another bool;
then, when receiving the PONG, we set both (1) the bool and (2) the event.
We can check the bool in the iterations (when it's set, wait() on the event,
which should have happened or be just about to happen), and wait() on the
event when completing (once that returns, the bool must have been set, hence
we can reset it).
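As a rough sketch (field names made up):

    /* return path thread, on receiving the new PONG */
    ms->postcopy_package_loaded = true;
    qemu_event_set(&ms->postcopy_package_loaded_event);

    /* migration_iteration_run(), while in postcopy */
    if (s->postcopy_package_loaded) {
        qemu_event_wait(&s->postcopy_package_loaded_event);
        migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_DEVICE,
                          MIGRATION_STATUS_POSTCOPY_ACTIVE);
        s->postcopy_package_loaded = false;
    }

    /* migration_completion_postcopy(), before declaring COMPLETED */
    qemu_event_wait(&s->postcopy_package_loaded_event);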

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/4] migration: Refactor incoming cleanup into migration_incoming_finish()
  2025-09-22 15:51       ` Peter Xu
@ 2025-09-22 17:40         ` Fabiano Rosas
  2025-09-22 17:48           ` Peter Xu
  2025-09-23 14:58         ` Juraj Marcin
  1 sibling, 1 reply; 33+ messages in thread
From: Fabiano Rosas @ 2025-09-22 17:40 UTC (permalink / raw)
  To: Peter Xu, Juraj Marcin; +Cc: qemu-devel, Jiri Denemark, Dr. David Alan Gilbert

Peter Xu <peterx@redhat.com> writes:

> On Mon, Sep 22, 2025 at 02:58:38PM +0200, Juraj Marcin wrote:
>> Hi Fabiano,
>> 
>> On 2025-09-19 13:46, Fabiano Rosas wrote:
>> > Juraj Marcin <jmarcin@redhat.com> writes:
>> > 
>> > Hi Juraj,
>> > 
>> > Good patch, nice use of migrate_has_failed()
>> 
>> Thanks!
>> 
>> > 
>> > > From: Juraj Marcin <jmarcin@redhat.com>
>> > >
>> > > Currently, there are two functions that are responsible for cleanup of
>> > > the incoming migration state. With successful precopy, it's the main
>> > > thread and with successful postcopy it's the listen thread. However, if
>> > > postcopy fails during in the device load, both functions will try to do
>> > > the cleanup. Moreover, when exit-on-error parameter was added, it was
>> > > applied only to precopy.
>> > >
>> > 
>> > Someone could be relying in postcopy always exiting on error while
>> > explicitly setting exit-on-error=false for precopy and this patch would
>> > change the behavior incompatibly. Is this an issue? I'm willing to
>> > ignore it, but you guys know more about postcopy.
>> 
>> Good question. When going through older patches where postcopy listen
>> thread and then where exit-on-error were implemented, it seemed more
>> like an overlook than intentional omission. However, it might be better
>> to not break any potential users of this, we could add another option,
>> "exit-on-postcopy-error" that would allow such handling if postscopy
>> failed unrecoverably. I've already talked about such option with
>> @jdenemar and he agreed with it.
>
> The idea for postcopy ram is, it should never fail.. as failing should
> never be better than a pause.  Block dirty bitmap might be different,
> though, when enabled separately.
>
> For postcopy-ram, qemu_loadvm_state_main() will in reality only receive RAM
> updates. It'll almost always trigger the postcopy_pause_incoming() path
> when anything fails.
>
> For pure block-dirty-bitmap-only styled postcopy: for this exit-on-error, I
> also don't think we should really "exit on errors", even if the flag is
> set.  IIUC, it's not fatal to the VM if that failed, as described in:
>
> commit ee64722514fabcad2430982ade86180208f5be4f
> Author: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> Date:   Mon Jul 27 22:42:32 2020 +0300
>
>     migration/savevm: don't worry if bitmap migration postcopy failed
>
>     ...
>
>     And anyway, bitmaps postcopy is not prepared to be somehow recovered.
>     The original idea instead is that if bitmaps postcopy failed, we just
>     lose some bitmaps, which is not critical. So, on failure we just need
>     to remove unfinished bitmaps and guest should continue execution on
>     destination.
>
> Hence, exit here might be an overkill.. need block developers to double
> check, though..
>
>> 
>> > 
>> > > This patch refactors common cleanup and exiting on error into a helper
>> > > function that can be started either from precopy or postcopy, reducing
>> > > the duplication. If the listen thread has been started (the postcopy
>> > > state is at least LISTENING), the listen thread is responsible for all
>> > > cleanup and exiting, otherwise it's the main thread's responsibility.
>> > 
>> > Don't the BHs also run in the main thread? I'm not sure this sentence is
>> > accurate.
>> 
>> Yeah, it is a bit inaccurate now that you mention it, BHs are indeed run
>> in the main thread. More accurate would to replace the main thread with
>> process incoming migration coroutine, that would precisely describe who
>> has the responsibility to call the cleanup.
>> 
>> > 
>> > >
>> > > Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
>> > > ---
>> > >  migration/migration.c | 64 ++++++++++++++++++++++++-------------------
>> > >  migration/migration.h |  1 +
>> > >  migration/savevm.c    | 48 +++++++++++---------------------
>> > 
>> > Could someone act on the TODOs and move postcopy code into postcopy-ram?
>> > It's never too late to make things consistent.
>> 
>> I can take a look.
>> 
>> > 
>> > >  3 files changed, 53 insertions(+), 60 deletions(-)
>> > >
>> > > diff --git a/migration/migration.c b/migration/migration.c
>> > > index 2c0b3a7229..7222e3de13 100644
>> > > --- a/migration/migration.c
>> > > +++ b/migration/migration.c
>> > > @@ -442,9 +442,19 @@ void migration_incoming_transport_cleanup(MigrationIncomingState *mis)
>> > >  void migration_incoming_state_destroy(void)
>> > >  {
>> > >      struct MigrationIncomingState *mis = migration_incoming_get_current();
>> > > +    PostcopyState ps = postcopy_state_get();
>> > >  
>> > >      multifd_recv_cleanup();
>> > >  
>> > > +    if (mis->have_listen_thread) {
>> > > +        qemu_thread_join(&mis->listen_thread);
>> > > +        mis->have_listen_thread = false;
>> > > +    }
>> > > +
>> > > +    if (ps != POSTCOPY_INCOMING_NONE) {
>> > > +        postcopy_ram_incoming_cleanup(mis);
>> > > +    }
>> > > +
>> > >      /*
>> > >       * RAM state cleanup needs to happen after multifd cleanup, because
>> > >       * multifd threads can use some of its states (receivedmap).
>> > > @@ -809,6 +819,23 @@ static void qemu_start_incoming_migration(const char *uri, bool has_channels,
>> > >      cpr_state_close();
>> > >  }
>> > >  
>> > > +void migration_incoming_finish(void)
>> > > +{
>> > > +    MigrationState *s = migrate_get_current();
>> > > +    MigrationIncomingState *mis = migration_incoming_get_current();
>> > > +
>> > > +    migration_incoming_state_destroy();
>> > > +
>> > > +    if (migration_has_failed(mis->state) && mis->exit_on_error) {
>> > > +        WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
>> > > +            error_report_err(s->error);
>> > > +            s->error = NULL;
>> > > +        }
>> > > +
>> > > +        exit(EXIT_FAILURE);
>> > > +    }
>> > > +}
>> > > +
>> > >  static void process_incoming_migration_bh(void *opaque)
>> > >  {
>> > >      MigrationIncomingState *mis = opaque;
>> > > @@ -861,7 +888,7 @@ static void process_incoming_migration_bh(void *opaque)
>> > >       */
>> > >      migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
>> > >                        MIGRATION_STATUS_COMPLETED);
>> > > -    migration_incoming_state_destroy();
>> > > +    migration_incoming_finish();
>> > >  }
>> > >  
>> > >  static void coroutine_fn
>> > > @@ -888,23 +915,13 @@ process_incoming_migration_co(void *opaque)
>> > >  
>> > >      ps = postcopy_state_get();
>> > >      trace_process_incoming_migration_co_end(ret, ps);
>> > > -    if (ps != POSTCOPY_INCOMING_NONE) {
>> > > -        if (ps == POSTCOPY_INCOMING_ADVISE) {
>> > > -            /*
>> > > -             * Where a migration had postcopy enabled (and thus went to advise)
>> > > -             * but managed to complete within the precopy period, we can use
>> > > -             * the normal exit.
>> > > -             */
>> > > -            postcopy_ram_incoming_cleanup(mis);
>> > > -        } else if (ret >= 0) {
>> > > -            /*
>> > > -             * Postcopy was started, cleanup should happen at the end of the
>> > > -             * postcopy thread.
>> > > -             */
>> > > -            trace_process_incoming_migration_co_postcopy_end_main();
>> > > -            goto out;
>> > > -        }
>> > > -        /* Else if something went wrong then just fall out of the normal exit */
>> > > +    if (ps >= POSTCOPY_INCOMING_LISTENING) {
>> > > +        /*
>> > > +         * Postcopy was started, cleanup should happen at the end of the
>> > > +         * postcopy thread.
>> > > +         */
>> > > +        trace_process_incoming_migration_co_postcopy_end_main();
>> > > +        goto out;
>> > >      }
>> > >  
>> > >      if (ret < 0) {
>> > > @@ -926,16 +943,7 @@ fail:
>> > >      migrate_set_error(s, local_err);
>> > >      error_free(local_err);
>> > >  
>> > > -    migration_incoming_state_destroy();
>> > > -
>> > > -    if (mis->exit_on_error) {
>> > > -        WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
>> > > -            error_report_err(s->error);
>> > > -            s->error = NULL;
>> > > -        }
>> > > -
>> > > -        exit(EXIT_FAILURE);
>> > > -    }
>> > > +    migration_incoming_finish();
>> > >  out:
>> > >      /* Pairs with the refcount taken in qmp_migrate_incoming() */
>> > >      migrate_incoming_unref_outgoing_state();
>> > > diff --git a/migration/migration.h b/migration/migration.h
>> > > index 2c2331f40d..67e3318467 100644
>> > > --- a/migration/migration.h
>> > > +++ b/migration/migration.h
>> > > @@ -518,6 +518,7 @@ void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
>> > >  void migration_fd_process_incoming(QEMUFile *f);
>> > >  void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp);
>> > >  void migration_incoming_process(void);
>> > > +void migration_incoming_finish(void);
>> > 
>> > What about merging migration_incoming_state_destroy and
>> > migration_incoming_finish and pair all of this with migration_cleanup?
>> > 
>> > static void migration_cleanup_bh(void *opaque)
>> > {
>> >     migration_cleanup(opaque);
>> > }
>> > 
>> > static void migration_incoming_cleanup_bh(void *opaque)
>> > {
>> >     migration_incoming_cleanup(opaque);
>> > }
>> 
>> Yes, it would pair well. I guess it would also solve Peter's nitpick
>> about the change in process_incoming_migration_bh() above.
>> 
>> > >  
>> > >  bool  migration_has_all_channels(void);
>> > >  
>> > > diff --git a/migration/savevm.c b/migration/savevm.c
>> > > index fabbeb296a..d7eb416d48 100644
>> > > --- a/migration/savevm.c
>> > > +++ b/migration/savevm.c
>> > > @@ -2069,6 +2069,11 @@ static int loadvm_postcopy_ram_handle_discard(MigrationIncomingState *mis,
>> > >      return 0;
>> > >  }
>> > >  
>> > > +static void postcopy_ram_listen_thread_bh(void *opaque)
>> > > +{
>> > > +    migration_incoming_finish();
>> > > +}
>> > > +
>> > >  /*
>> > >   * Triggered by a postcopy_listen command; this thread takes over reading
>> > >   * the input stream, leaving the main thread free to carry on loading the rest
>> > > @@ -2122,52 +2127,31 @@ static void *postcopy_ram_listen_thread(void *opaque)
>> > >                           "bitmaps may be lost, and present migrated dirty "
>> > >                           "bitmaps are correctly migrated and valid.",
>> > >                           __func__, load_res);
>> > > -            load_res = 0; /* prevent further exit() */
>> > >          } else {
>> > >              error_report("%s: loadvm failed: %d", __func__, load_res);
>> > >              migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
>> > >                                             MIGRATION_STATUS_FAILED);
>> > > +            goto out;
>> > >          }
>> > >      }
>> > > -    if (load_res >= 0) {
>> > > -        /*
>> > > -         * This looks good, but it's possible that the device loading in the
>> > > -         * main thread hasn't finished yet, and so we might not be in 'RUN'
>> > > -         * state yet; wait for the end of the main thread.
>> > > -         */
>> > > -        qemu_event_wait(&mis->main_thread_load_event);
>> > > -    }
>> > > -    postcopy_ram_incoming_cleanup(mis);
>> > > -
>> > > -    if (load_res < 0) {
>> > > -        /*
>> > > -         * If something went wrong then we have a bad state so exit;
>> > > -         * depending how far we got it might be possible at this point
>> > > -         * to leave the guest running and fire MCEs for pages that never
>> > > -         * arrived as a desperate recovery step.
>> > > -         */
>> > > -        rcu_unregister_thread();
>> > > -        exit(EXIT_FAILURE);
>> > > -    }
>> > > +    /*
>> > > +     * This looks good, but it's possible that the device loading in the
>> > > +     * main thread hasn't finished yet, and so we might not be in 'RUN'
>> > > +     * state yet; wait for the end of the main thread.
>> > > +     */
>> > > +    qemu_event_wait(&mis->main_thread_load_event);
>
> PS: I didn't notice this change, looks like this may be better to be a
> separate patch when moving out of the if.  Meanwhile, I don't think we set
> it right either, in qemu_loadvm_state():
>
>     qemu_event_set(&mis->main_thread_load_event);
>
> The problem is e.g. load_snapshot / qmp_xen_load_devices_state also set
> that event, even if there'll be no one to consume it.. not a huge deal, but
> maybe while moving it out of the if, we can also cleanup the set() side
> too, by moving the set() upper into process_incoming_migration_co().
>
>> > >  
>> > >      migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
>> > >                                     MIGRATION_STATUS_COMPLETED);
>> > > -    /*
>> > > -     * If everything has worked fine, then the main thread has waited
>> > > -     * for us to start, and we're the last use of the mis.
>> > > -     * (If something broke then qemu will have to exit anyway since it's
>> > > -     * got a bad migration state).
>> > > -     */
>> > > -    bql_lock();
>> > > -    migration_incoming_state_destroy();
>> > > -    bql_unlock();
>> > >  
>> > > +out:
>> > >      rcu_unregister_thread();
>> > > -    mis->have_listen_thread = false;
>> > >      postcopy_state_set(POSTCOPY_INCOMING_END);
>> > >  
>> > >      object_unref(OBJECT(migr));
>> > >  
>> > > +    migration_bh_schedule(postcopy_ram_listen_thread_bh, NULL);
>> > 
>> > Better to schedule before the object_unref to ensure there's always
>> > someone holding a reference?
>> 
>> True, I'll move it.
>
> Good point.  Though I'm not sure moving it upper would help, because it'll
> be the BH that references the MigrationState*..

A while ago we wrapped QEMU BHs with migration_bh_dispatch_bh. The BHs
don't need to take or drop the MigrationState reference themselves anymore
because the dispatch callback does it:

void migration_bh_schedule(QEMUBHFunc *cb, void *opaque)
{
    ...
    object_ref(OBJECT(s));
    qemu_bh_schedule(bh);
}

static void migration_bh_dispatch_bh(void *opaque)
{
    ...        
    migbh->cb(migbh->opaque);
    object_unref(OBJECT(s));
    ...
}
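
Just to illustrate the point (this is not QEMU code, only a stand-alone
sketch with made-up names): because the scheduling helper takes its own
reference and the dispatcher drops it, the thread can release its
reference right after scheduling and the callback still runs against a
live object:

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct Object {
    atomic_int refcount;
} Object;

static void object_ref(Object *o)
{
    atomic_fetch_add(&o->refcount, 1);
}

static void object_unref(Object *o)
{
    if (atomic_fetch_sub(&o->refcount, 1) == 1) {
        printf("object freed\n");
        free(o);
    }
}

typedef void (*BHFunc)(void *opaque);

typedef struct {
    Object *owner;
    BHFunc cb;
    void *opaque;
} WrappedBH;

/* "schedule": take the reference the callback will rely on */
static WrappedBH *bh_schedule(Object *owner, BHFunc cb, void *opaque)
{
    WrappedBH *bh = malloc(sizeof(*bh));

    bh->owner = owner;
    bh->cb = cb;
    bh->opaque = opaque;
    object_ref(owner);
    return bh;
}

/* "dispatch": run the callback, then drop the reference taken above */
static void bh_dispatch(WrappedBH *bh)
{
    bh->cb(bh->opaque);
    object_unref(bh->owner);
    free(bh);
}

static void listen_thread_bh(void *opaque)
{
    (void)opaque;
    printf("cleanup runs while the owner is still referenced\n");
}

int main(void)
{
    Object *migr = malloc(sizeof(*migr));
    WrappedBH *bh;

    atomic_init(&migr->refcount, 1);   /* the thread's own reference */
    bh = bh_schedule(migr, listen_thread_bh, NULL);
    object_unref(migr);                /* thread drops its reference */
    bh_dispatch(bh);                   /* callback still sees a live object */
    return 0;
}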

>  So maybe we could unref at
> the end of postcopy_ram_listen_thread_bh().  If so, we should add a comment
> on ref() / unref() saying how they're paired.
>

>> 
>> > 
>> > > +
>> > >      return NULL;
>> > >  }
>> > >  
>> > > @@ -2217,7 +2201,7 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
>> > >      mis->have_listen_thread = true;
>> > >      postcopy_thread_create(mis, &mis->listen_thread,
>> > >                             MIGRATION_THREAD_DST_LISTEN,
>> > > -                           postcopy_ram_listen_thread, QEMU_THREAD_DETACHED);
>> > > +                           postcopy_ram_listen_thread, QEMU_THREAD_JOINABLE);
>> > >      trace_loadvm_postcopy_handle_listen("return");
>> > >  
>> > >      return 0;
>> > 
>> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/4] migration: Refactor incoming cleanup into migration_incoming_finish()
  2025-09-22 17:40         ` Fabiano Rosas
@ 2025-09-22 17:48           ` Peter Xu
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Xu @ 2025-09-22 17:48 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Juraj Marcin, qemu-devel, Jiri Denemark, Dr. David Alan Gilbert

On Mon, Sep 22, 2025 at 02:40:37PM -0300, Fabiano Rosas wrote:
> A while ago we wrapped QEMU BHs with migration_bh_dispatch_bh. The BHs
> don't need to hold or drop MigrationState reference anymore because the
> dispatch callback does it:
> 
> void migration_bh_schedule(QEMUBHFunc *cb, void *opaque)
> {
>     ...
>     object_ref(OBJECT(s));
>     qemu_bh_schedule(bh);
> }
> 
> static void migration_bh_dispatch_bh(void *opaque)
> {
>     ...        
>     migbh->cb(migbh->opaque);
>     object_unref(OBJECT(s));
>     ...
> }

Indeed!  Ignore my comment.. :)

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/4] migration: Introduce POSTCOPY_DEVICE state
  2025-09-22 16:16         ` Peter Xu
@ 2025-09-23 14:23           ` Juraj Marcin
  0 siblings, 0 replies; 33+ messages in thread
From: Juraj Marcin @ 2025-09-23 14:23 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, Jiri Denemark, Dr. David Alan Gilbert, Fabiano Rosas

On 2025-09-22 12:16, Peter Xu wrote:
> On Mon, Sep 22, 2025 at 03:34:19PM +0200, Juraj Marcin wrote:
> > Hi Peter,
> > 
> > On 2025-09-19 13:50, Peter Xu wrote:
> > > On Fri, Sep 19, 2025 at 12:58:08PM -0400, Peter Xu wrote:
> > > > > @@ -2564,6 +2569,11 @@ static void *source_return_path_thread(void *opaque)
> > > > >              tmp32 = ldl_be_p(buf);
> > > > >              trace_source_return_path_thread_pong(tmp32);
> > > > >              qemu_sem_post(&ms->rp_state.rp_pong_acks);
> > > > > +            if (tmp32 == QEMU_VM_PING_PACKAGED_LOADED) {
> > > > > +                trace_source_return_path_thread_dst_started();
> > > > > +                migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_DEVICE,
> > > > > +                                  MIGRATION_STATUS_POSTCOPY_ACTIVE);
> > > > 
> > > > Could this race with the migration thread modifying the state concurrently?
> > > > 
> > > > To avoid it, we could have a bool, set it here once, and in the iterations
> > > > do something like:
> > > > 
> > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > index 10c216d25d..55230e10ee 100644
> > > > --- a/migration/migration.c
> > > > +++ b/migration/migration.c
> > > > @@ -3449,6 +3449,16 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> > > >      trace_migrate_pending_estimate(pending_size, must_precopy, can_postcopy);
> > > >  
> > > >      if (in_postcopy) {
> > > > +        if (s->postcopy_package_loaded) {
> > > > +            assert(s->state == MIGRATION_STATUS_POSTCOPY_DEVICE);
> > > > +            migrate_set_state(s->state, MIGRATION_STATUS_POSTCOPY_DEVICE,
> > > > +                              MIGRATION_STATUS_POSTCOPY_ACTIVE);
> > > > +            /*
> > > > +             * Since postcopy cannot be re-initiated, this flag will only
> > > > +             * be set at most once for QEMU's whole lifecyce.
> > > > +             */
> > > > +            s->postcopy_package_loaded = false;
> > > > +        }
> > > >          /*
> > > >           * Iterate in postcopy until all pending data flushed.  Note that
> > > >           * postcopy completion doesn't rely on can_switchover, because when
> > 
> > It was there in the RFC version, when there was mutual handshake before
> > dst starting, but I thought cmp&exchange would be enough. I can add it
> > again, however, there is no need to set it to false afterwards, we can
> > simply check if this condition is true:
> > s->postcopy_package_loaded && s->state == MIGRATION_STATUS_POSTCOPY_DEVICE.
> 
> Setting it to false was a safety measure.  Indeed not needed, but when so
> we need to be extremely careful to always check with above two conditions
> to avoid it frequently triggers.  So I thought clearing it would be much
> easier to read.  I can wait and read the new version whatever you prefer.
> 
> > 
> > > 
> > > [...]
> > > 
> > > > > @@ -2871,7 +2882,7 @@ static int postcopy_start(MigrationState *ms, Error **errp)
> > > > >  
> > > > >      /* Now, switchover looks all fine, switching to postcopy-active */
> > > > >      migrate_set_state(&ms->state, MIGRATION_STATUS_DEVICE,
> > > > > -                      MIGRATION_STATUS_POSTCOPY_ACTIVE);
> > > > > +                      MIGRATION_STATUS_POSTCOPY_DEVICE);
> > > > >  
> > > > >      bql_unlock();
> > > > >  
> > > > > @@ -3035,7 +3046,8 @@ static void migration_completion(MigrationState *s)
> > > > >  
> > > > >      if (s->state == MIGRATION_STATUS_ACTIVE) {
> > > > >          ret = migration_completion_precopy(s);
> > > > > -    } else if (s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
> > > > > +    } else if (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE ||
> > > > > +               s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
> > > > 
> > > > Exactly.  We need to be prepared that src sending too fast so when device
> > > > loading on dest we finished.
> > > 
> > > One thing more to mention here.. which may void some previous comments I
> > > left.  Let's discuss.
> > > 
> > > I think it may also make sense to only allow a COMPLETE after
> > > POSTCOPY_ACTIVE.
> > > 
> > > That is, if src sends too fast to have finished sending everything,
> > > reaching COMPLETE during POSTCOPY_DEVICE, that is, while before it receives
> > > the new PONG you defined, then.. I _think_ it is better to wait for that.
> > > 
> > > If it finally arrives, then it's perfect, we switch to POSTCOPY_ACTIVE,
> > > then continue the completion.
> > > 
> > > If the channel is broken before its arrival, logically we should handle
> > > this case as a FAILURE and restart the VM on src.
> > > 
> > > It's only relevant in a corner case, but does that sound better?
> > 
> > Yes, it does make sense to wait for POSTCOPY_ACTIVE as src could finish
> > before dst finished package load and could still fail. We could use a
> > qemu_event_wait() to wait and set the event in src return path thread
> > when the PONG is received.
> 
> Right.
> 
> Though since we want to move to POSTCOPY_ACTIVE asap when receiving the
> corresponding PONG, we may want to have something like qemu_event_read()
> just return "ev->value == EV_SET", as we can't event_wait() in the
> migration thread while pushing RAMs.
> 
> Or, to avoid touching library code, we can also introduce yet another bool,
> then when receiving the PONG we set both (1) the bool, and (2) the event.
> We can check the bool in the iterations (when set, wait() to consume the
> event; the event should have happened or just to happen), and wait() on the
> event when completing (when return, bool must have been set, hence can
> reset bool).

Yes, my initial idea was the latter, with two new attributes, a bool and
an event, and definitely not waiting while there are still pages to send.
But I can also evaluate the first option before sending another version.

Thanks!

> 
> -- 
> Peter Xu
> 



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/4] migration: Refactor incoming cleanup into migration_incoming_finish()
  2025-09-22 15:51       ` Peter Xu
  2025-09-22 17:40         ` Fabiano Rosas
@ 2025-09-23 14:58         ` Juraj Marcin
  2025-09-23 16:17           ` Peter Xu
  1 sibling, 1 reply; 33+ messages in thread
From: Juraj Marcin @ 2025-09-23 14:58 UTC (permalink / raw)
  To: Peter Xu; +Cc: Fabiano Rosas, qemu-devel, Jiri Denemark, Dr. David Alan Gilbert

On 2025-09-22 11:51, Peter Xu wrote:
> On Mon, Sep 22, 2025 at 02:58:38PM +0200, Juraj Marcin wrote:
> > Hi Fabiano,
> > 
> > On 2025-09-19 13:46, Fabiano Rosas wrote:
> > > Juraj Marcin <jmarcin@redhat.com> writes:
> > > 
> > > Hi Juraj,
> > > 
> > > Good patch, nice use of migrate_has_failed()
> > 
> > Thanks!
> > 
> > > 
> > > > From: Juraj Marcin <jmarcin@redhat.com>
> > > >
> > > > Currently, there are two functions that are responsible for cleanup of
> > > > the incoming migration state. With successful precopy, it's the main
> > > > thread and with successful postcopy it's the listen thread. However, if
> > > > postcopy fails during in the device load, both functions will try to do
> > > > the cleanup. Moreover, when exit-on-error parameter was added, it was
> > > > applied only to precopy.
> > > >
> > > 
> > > Someone could be relying in postcopy always exiting on error while
> > > explicitly setting exit-on-error=false for precopy and this patch would
> > > change the behavior incompatibly. Is this an issue? I'm willing to
> > > ignore it, but you guys know more about postcopy.
> > 
> > Good question. When going through older patches where postcopy listen
> > thread and then where exit-on-error were implemented, it seemed more
> > like an overlook than intentional omission. However, it might be better
> > to not break any potential users of this, we could add another option,
> > "exit-on-postcopy-error" that would allow such handling if postscopy
> > failed unrecoverably. I've already talked about such option with
> > @jdenemar and he agreed with it.
> 
> The idea for postcopy ram is, it should never fail.. as failing should
> never be better than a pause.  Block dirty bitmap might be different,
> though, when enabled separately.
> 
> For postcopy-ram, qemu_loadvm_state_main() will in reality only receive RAM
> updates. It'll almost always trigger the postcopy_pause_incoming() path
> when anything fails.
> 
> For pure block-dirty-bitmap-only styled postcopy: for this exit-on-error, I
> also don't think we should really "exit on errors", even if the flag is
> set.  IIUC, it's not fatal to the VM if that failed, as described in:

I agree, however, this patch doesn't add any new cases in which the
destination QEMU would exit. If there is an error in block dirty bitmaps,
it is only reported to the console, and then it continues by waiting for
main_thread_load_event, marks the migration as COMPLETED and does the
cleanup, same as before. [1] I can add a comment similar to "prevent
further exit" as there was before.

However, if there is some other error in which postcopy cannot pause
(for example, a failure in the main thread loading the device state before
the machine started), the migration status changes to FAILED and execution
jumps right to cleanup, which then checks exit-on-error and optionally
exits QEMU; previously it would always exit in such a case [2]:

[1]: https://gitlab.com/qemu-project/qemu/-/blob/ab8008b231e758e03c87c1c483c03afdd9c02e19/migration/savevm.c#L2120
[2]: https://gitlab.com/qemu-project/qemu/-/blob/ab8008b231e758e03c87c1c483c03afdd9c02e19/migration/savevm.c#L2150

> 
> commit ee64722514fabcad2430982ade86180208f5be4f
> Author: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> Date:   Mon Jul 27 22:42:32 2020 +0300
> 
>     migration/savevm: don't worry if bitmap migration postcopy failed
> 
>     ...
> 
>     And anyway, bitmaps postcopy is not prepared to be somehow recovered.
>     The original idea instead is that if bitmaps postcopy failed, we just
>     lose some bitmaps, which is not critical. So, on failure we just need
>     to remove unfinished bitmaps and guest should continue execution on
>     destination.
> 
> Hence, exit here might be an overkill.. need block developers to double
> check, though..
> 

/* snip */

> > > > diff --git a/migration/savevm.c b/migration/savevm.c
> > > > index fabbeb296a..d7eb416d48 100644
> > > > --- a/migration/savevm.c
> > > > +++ b/migration/savevm.c
> > > > @@ -2069,6 +2069,11 @@ static int loadvm_postcopy_ram_handle_discard(MigrationIncomingState *mis,
> > > >      return 0;
> > > >  }
> > > >  
> > > > +static void postcopy_ram_listen_thread_bh(void *opaque)
> > > > +{
> > > > +    migration_incoming_finish();
> > > > +}
> > > > +
> > > >  /*
> > > >   * Triggered by a postcopy_listen command; this thread takes over reading
> > > >   * the input stream, leaving the main thread free to carry on loading the rest
> > > > @@ -2122,52 +2127,31 @@ static void *postcopy_ram_listen_thread(void *opaque)
> > > >                           "bitmaps may be lost, and present migrated dirty "
> > > >                           "bitmaps are correctly migrated and valid.",
> > > >                           __func__, load_res);
> > > > -            load_res = 0; /* prevent further exit() */
> > > >          } else {
> > > >              error_report("%s: loadvm failed: %d", __func__, load_res);
> > > >              migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
> > > >                                             MIGRATION_STATUS_FAILED);
> > > > +            goto out;
> > > >          }
> > > >      }
> > > > -    if (load_res >= 0) {
> > > > -        /*
> > > > -         * This looks good, but it's possible that the device loading in the
> > > > -         * main thread hasn't finished yet, and so we might not be in 'RUN'
> > > > -         * state yet; wait for the end of the main thread.
> > > > -         */
> > > > -        qemu_event_wait(&mis->main_thread_load_event);
> > > > -    }
> > > > -    postcopy_ram_incoming_cleanup(mis);
> > > > -
> > > > -    if (load_res < 0) {
> > > > -        /*
> > > > -         * If something went wrong then we have a bad state so exit;
> > > > -         * depending how far we got it might be possible at this point
> > > > -         * to leave the guest running and fire MCEs for pages that never
> > > > -         * arrived as a desperate recovery step.
> > > > -         */
> > > > -        rcu_unregister_thread();
> > > > -        exit(EXIT_FAILURE);
> > > > -    }
> > > > +    /*
> > > > +     * This looks good, but it's possible that the device loading in the
> > > > +     * main thread hasn't finished yet, and so we might not be in 'RUN'
> > > > +     * state yet; wait for the end of the main thread.
> > > > +     */
> > > > +    qemu_event_wait(&mis->main_thread_load_event);
> 
> PS: I didn't notice this change, looks like this may be better to be a
> separate patch when moving out of the if.  Meanwhile, I don't think we set
> it right either, in qemu_loadvm_state():
> 
>     qemu_event_set(&mis->main_thread_load_event);
> 
> The problem is e.g. load_snapshot / qmp_xen_load_devices_state also set
> that event, even if there'll be no one to consume it.. not a huge deal, but
> maybe while moving it out of the if, we can also cleanup the set() side
> too, by moving the set() upper into process_incoming_migration_co().

While I have moved it out of the condition, it is still only waited on in
the success path: if there is an error that would previously have made the
condition false, execution now jumps directly to the cleanup section (the
out label), skipping this wait and the transition of the migration state to
COMPLETED (it is set to FAILED before the jump). But I can still look
into moving the set() up.

> 
> > > >  
> > > >      migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
> > > >                                     MIGRATION_STATUS_COMPLETED);
> > > > -    /*
> > > > -     * If everything has worked fine, then the main thread has waited
> > > > -     * for us to start, and we're the last use of the mis.
> > > > -     * (If something broke then qemu will have to exit anyway since it's
> > > > -     * got a bad migration state).
> > > > -     */
> > > > -    bql_lock();
> > > > -    migration_incoming_state_destroy();
> > > > -    bql_unlock();
> > > >  
> > > > +out:
> > > >      rcu_unregister_thread();
> > > > -    mis->have_listen_thread = false;
> > > >      postcopy_state_set(POSTCOPY_INCOMING_END);
> > > >  
> > > >      object_unref(OBJECT(migr));
> > > >  
> > > > +    migration_bh_schedule(postcopy_ram_listen_thread_bh, NULL);
> > > 
> > > Better to schedule before the object_unref to ensure there's always
> > > someone holding a reference?
> > 
> > True, I'll move it.
> 
> Good point.  Though I'm not sure moving it upper would help, because it'll
> be the BH that references the MigrationState*..  So maybe we could unref at
> the end of postcopy_ram_listen_thread_bh().  If so, we should add a comment
> on ref() / unref() saying how they're paired.
> 
> > 
> > > 
> > > > +
> > > >      return NULL;
> > > >  }
> > > >  
> > > > @@ -2217,7 +2201,7 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
> > > >      mis->have_listen_thread = true;
> > > >      postcopy_thread_create(mis, &mis->listen_thread,
> > > >                             MIGRATION_THREAD_DST_LISTEN,
> > > > -                           postcopy_ram_listen_thread, QEMU_THREAD_DETACHED);
> > > > +                           postcopy_ram_listen_thread, QEMU_THREAD_JOINABLE);
> > > >      trace_loadvm_postcopy_handle_listen("return");
> > > >  
> > > >      return 0;
> > > 
> > 
> 
> -- 
> Peter Xu
> 



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/4] migration: Refactor incoming cleanup into migration_incoming_finish()
  2025-09-23 14:58         ` Juraj Marcin
@ 2025-09-23 16:17           ` Peter Xu
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Xu @ 2025-09-23 16:17 UTC (permalink / raw)
  To: Juraj Marcin
  Cc: Fabiano Rosas, qemu-devel, Jiri Denemark, Dr. David Alan Gilbert

On Tue, Sep 23, 2025 at 04:58:15PM +0200, Juraj Marcin wrote:
> On 2025-09-22 11:51, Peter Xu wrote:
> > On Mon, Sep 22, 2025 at 02:58:38PM +0200, Juraj Marcin wrote:
> > > Hi Fabiano,
> > > 
> > > On 2025-09-19 13:46, Fabiano Rosas wrote:
> > > > Juraj Marcin <jmarcin@redhat.com> writes:
> > > > 
> > > > Hi Juraj,
> > > > 
> > > > Good patch, nice use of migrate_has_failed()
> > > 
> > > Thanks!
> > > 
> > > > 
> > > > > From: Juraj Marcin <jmarcin@redhat.com>
> > > > >
> > > > > Currently, there are two functions that are responsible for cleanup of
> > > > > the incoming migration state. With successful precopy, it's the main
> > > > > thread and with successful postcopy it's the listen thread. However, if
> > > > > postcopy fails during in the device load, both functions will try to do
> > > > > the cleanup. Moreover, when exit-on-error parameter was added, it was
> > > > > applied only to precopy.
> > > > >
> > > > 
> > > > Someone could be relying in postcopy always exiting on error while
> > > > explicitly setting exit-on-error=false for precopy and this patch would
> > > > change the behavior incompatibly. Is this an issue? I'm willing to
> > > > ignore it, but you guys know more about postcopy.
> > > 
> > > Good question. When going through older patches where postcopy listen
> > > thread and then where exit-on-error were implemented, it seemed more
> > > like an overlook than intentional omission. However, it might be better
> > > to not break any potential users of this, we could add another option,
> > > "exit-on-postcopy-error" that would allow such handling if postscopy
> > > failed unrecoverably. I've already talked about such option with
> > > @jdenemar and he agreed with it.
> > 
> > The idea for postcopy ram is, it should never fail.. as failing should
> > never be better than a pause.  Block dirty bitmap might be different,
> > though, when enabled separately.
> > 
> > For postcopy-ram, qemu_loadvm_state_main() will in reality only receive RAM
> > updates. It'll almost always trigger the postcopy_pause_incoming() path
> > when anything fails.
> > 
> > For pure block-dirty-bitmap-only styled postcopy: for this exit-on-error, I
> > also don't think we should really "exit on errors", even if the flag is
> > set.  IIUC, it's not fatal to the VM if that failed, as described in:
> 
> I agree, however, this patch doesn't add any new cases in which the
> destination QEMU would exit. If there is an error in block dirty bitmaps
> it is only reported to the console, and then it continues to waiting for
> main_thread_load_event, marks the migration as COMPLETED and does the
> cleanup, same as before. [1] I can add a comment similar to "prevent
> further exit" as there was before.
> 
> However, if there is other error, in which the postcopy cannot pause
> (for example there was a failure in the main thread loading the device
> state before the machine started), the migration status changes to
> FAILED and jumps right to cleanup which then checks exit-on-error and
> optionally exits the QEMU, before it would always exit in such case [2]:
> 
> [1]: https://gitlab.com/qemu-project/qemu/-/blob/ab8008b231e758e03c87c1c483c03afdd9c02e19/migration/savevm.c#L2120
> [2]: https://gitlab.com/qemu-project/qemu/-/blob/ab8008b231e758e03c87c1c483c03afdd9c02e19/migration/savevm.c#L2150

Ah, so you're saying a failure within qemu_loadvm_state_main() but during
POSTCOPY_INCOMING_LISTENING..  I think you're right.  It's possible.

In that case, exit-on-postcopy-error still sounds like overkill to me.  I
vote that we go ahead and reuse exit-on-error for postcopy too. IIUC that's
what the current patch does as-is.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/4] migration: Introduce POSTCOPY_DEVICE state
  2025-09-15 11:59 ` [PATCH 4/4] migration: Introduce POSTCOPY_DEVICE state Juraj Marcin
  2025-09-19 16:58   ` Peter Xu
@ 2025-09-25 11:54   ` Jiří Denemark
  2025-09-25 18:22     ` Peter Xu
  1 sibling, 1 reply; 33+ messages in thread
From: Jiří Denemark @ 2025-09-25 11:54 UTC (permalink / raw)
  To: Juraj Marcin; +Cc: qemu-devel, Peter Xu, Dr. David Alan Gilbert, Fabiano Rosas

On Mon, Sep 15, 2025 at 13:59:15 +0200, Juraj Marcin wrote:
> From: Juraj Marcin <jmarcin@redhat.com>
> 
> Currently, when postcopy starts, the source VM starts switchover and
> sends a package containing the state of all non-postcopiable devices.
> When the destination loads this package, the switchover is complete and
> the destination VM starts. However, if the device state load fails or
> the destination side crashes, the source side is already in
> POSTCOPY_ACTIVE state and cannot be recovered, even when it has the most
> up-to-date machine state as the destination has not yet started.
> 
> This patch introduces a new POSTCOPY_DEVICE state which is active
> while the destination machine is loading the device state, is not yet
> running, and the source side can be resumed in case of a migration
> failure.
> 
> To transition from POSTCOPY_DEVICE to POSTCOPY_ACTIVE, the source
> side uses a PONG message that is a response to a PING message processed
> just before the POSTCOPY_RUN command that starts the destination VM.
> Thus, this change does not require any changes on the destination side
> and is effective even with older destination versions.

Thanks, this will help libvirt as we think that the migration can be
safely aborted unless we successfully called "cont" and thus we just
kill QEMU on the destination. But since QEMU on the source already
entered postcopy-active, we can't cancel the migration and the result is
a paused VM with no way of recovering it.

This series will make the situation better as the source will stay in
postcopy-device until the destination successfully loads device data.
There's still room for some enhancement though. Depending on how fast
this loading is, libvirt may issue cont before device data is loaded (the
destination is already in postcopy-active at this point), which always
succeeds as it only marks the domain to be autostarted, but the actual
start may fail later. When discussing this with Juraj we agreed on
introducing the new postcopy-device state on the destination as well to
make sure libvirt will only call cont once device data was successfully
loaded so that we always get a proper result when running cont. But it
may still fail when locking disks fails (not sure if this is the only
way cont may fail). In this case we cannot cancel the migration on the
source as it is already in postcopy-active and we can't recover
migration either as the CPUs are not running on the destination. Ideally
we'd have a way of canceling the migration in postcopy-active if we are
sure CPUs were not started yet. Alternatively, a possibility to recover
migration would work as well.

Jirka



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/4] migration: Introduce POSTCOPY_DEVICE state
  2025-09-25 11:54   ` Jiří Denemark
@ 2025-09-25 18:22     ` Peter Xu
  2025-09-30  7:53       ` Jiří Denemark
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Xu @ 2025-09-25 18:22 UTC (permalink / raw)
  To: Jiří Denemark
  Cc: Juraj Marcin, qemu-devel, Dr. David Alan Gilbert, Fabiano Rosas

On Thu, Sep 25, 2025 at 01:54:40PM +0200, Jiří Denemark wrote:
> On Mon, Sep 15, 2025 at 13:59:15 +0200, Juraj Marcin wrote:
> > From: Juraj Marcin <jmarcin@redhat.com>
> > 
> > Currently, when postcopy starts, the source VM starts switchover and
> > sends a package containing the state of all non-postcopiable devices.
> > When the destination loads this package, the switchover is complete and
> > the destination VM starts. However, if the device state load fails or
> > the destination side crashes, the source side is already in
> > POSTCOPY_ACTIVE state and cannot be recovered, even when it has the most
> > up-to-date machine state as the destination has not yet started.
> > 
> > This patch introduces a new POSTCOPY_DEVICE state which is active
> > while the destination machine is loading the device state, is not yet
> > running, and the source side can be resumed in case of a migration
> > failure.
> > 
> > To transition from POSTCOPY_DEVICE to POSTCOPY_ACTIVE, the source
> > side uses a PONG message that is a response to a PING message processed
> > just before the POSTCOPY_RUN command that starts the destination VM.
> > Thus, this change does not require any changes on the destination side
> > and is effective even with older destination versions.
> 
> Thanks, this will help libvirt as we think that the migration can be
> safely aborted unless we successfully called "cont" and thus we just
> kill QEMU on the destination. But since QEMU on the source already
> entered postcopy-active, we can't cancel the migration and the result is
> a paused VM with no way of recovering it.
> 
> This series will make the situation better as the source will stay in
> postcopy-device until the destination successfully loads device data.
> There's still room for some enhancement though. Depending on how fast
> this loading is libvirt may issue cont before device data is loaded (the
> destination is already in postcopy-active at this point), which always
> succeeds as it only marks the domain to be autostarted, but the actual
> start may fail later. When discussing this with Juraj we agreed on
> introducing the new postcopy-device state on the destination as well to

I used to think of and define postcopy-active as the state we should never
be able to cancel anymore, implying that the real postcopy process is in
progress, and also implying the state where we need to start assuming the
latest VM pages are spread across both sides, not just one.  Cancelling or
killing either side then means crashing the VM.

It could be a good thing indeed to have postcopy-device on dest too from
that regard, because having postcopy-device on dest can mark out the small
initial window when dest qemu hasn't yet started to generate new data but
is only applying old data (device data, of which src also owns a copy).
From that POV, that window indeed does not belong to postcopy-active if we
define postcopy-active as above.

IOW, also with such a definition, setting postcopy-active on dest QEMU right
at the entry of the ram load thread (what we do right now..) is too early.

> make sure libvirt will only call cont once device data was successfully
> loaded so that we always get a proper result when running cont. But it

Do we know an estimate of how much extra downtime this would introduce?

We should have discussed this in a previous thread: the issue is that if we
cont only after the device state is loaded, then dest QEMU may need to wait
a while until it receives the cont from libvirt, and that will contribute to
the downtime.  It would best be avoided, or if it's negligible then it's
fine too, but I'm not sure whether it's guaranteed to be negligible..

If the goal is to make sure libvirt knows what is happening, can it still
rely on the event emitted, in this case, RESUME?  We can also reorg how
the postcopy-device and postcopy-active states will be reported on dest,
then they'll help if RESUME is too coarse grained.

So far, dest QEMU will try to resume the VM after getting the RUN command;
that is what loadvm_postcopy_handle_run_bh() does, and it will (when
autostart=1 is set): (1) first try to activate all block devices, and iff
that succeeded, (2) do vm_start(), at the end of which the RESUME event is
generated.  So RESUME currently implies both that disk activation succeeded
and that vm start worked.
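
To make the ordering easier to see, here is a rough stand-alone sketch
(not the actual QEMU code; the helper names below are stand-ins) of what
the run BH does when autostart is set:

#include <stdbool.h>
#include <stdio.h>

static bool activate_block_devices(void)
{
    return true;                      /* stand-in: grab disk locks etc. */
}

static void vm_start_and_emit_resume(void)
{
    printf("RESUME\n");               /* event emitted at the end of vm start */
}

static void handle_run_bh(bool autostart)
{
    if (!autostart) {
        return;                       /* started with -S: wait for 'cont' */
    }
    if (!activate_block_devices()) {
        return;                       /* no RESUME: VM stays paused */
    }
    vm_start_and_emit_resume();       /* RESUME implies both steps succeeded */
}

int main(void)
{
    handle_run_bh(true);
    return 0;
}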

> may still fail when locking disks fails (not sure if this is the only
> way cont may fail). In this case we cannot cancel the migration on the

Is there any known issue with locking disks that would make dest fail?  This
really sounds like something we should have the admin take a look at.

I recall Juraj mentioned off list that drive lock sometimes will stop
working.  Is that relevant here?

> source as it is already in postcopy-active and we can't recover
> migration either as the CPUs are not running on the destination. Ideally
> we'd have a way of canceling the migration in postocpy-active if we are
> sure CPUs were not started yet. Alternatively a possibility to recover
> migration would work as well.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/4] migration: Introduce POSTCOPY_DEVICE state
  2025-09-25 18:22     ` Peter Xu
@ 2025-09-30  7:53       ` Jiří Denemark
  2025-09-30 20:04         ` Peter Xu
  0 siblings, 1 reply; 33+ messages in thread
From: Jiří Denemark @ 2025-09-30  7:53 UTC (permalink / raw)
  To: Peter Xu; +Cc: Juraj Marcin, qemu-devel, Dr. David Alan Gilbert, Fabiano Rosas

On Thu, Sep 25, 2025 at 14:22:06 -0400, Peter Xu wrote:
> On Thu, Sep 25, 2025 at 01:54:40PM +0200, Jiří Denemark wrote:
> > On Mon, Sep 15, 2025 at 13:59:15 +0200, Juraj Marcin wrote:
> > > From: Juraj Marcin <jmarcin@redhat.com>
> > > 
> > > Currently, when postcopy starts, the source VM starts switchover and
> > > sends a package containing the state of all non-postcopiable devices.
> > > When the destination loads this package, the switchover is complete and
> > > the destination VM starts. However, if the device state load fails or
> > > the destination side crashes, the source side is already in
> > > POSTCOPY_ACTIVE state and cannot be recovered, even when it has the most
> > > up-to-date machine state as the destination has not yet started.
> > > 
> > > This patch introduces a new POSTCOPY_DEVICE state which is active
> > > while the destination machine is loading the device state, is not yet
> > > running, and the source side can be resumed in case of a migration
> > > failure.
> > > 
> > > To transition from POSTCOPY_DEVICE to POSTCOPY_ACTIVE, the source
> > > side uses a PONG message that is a response to a PING message processed
> > > just before the POSTCOPY_RUN command that starts the destination VM.
> > > Thus, this change does not require any changes on the destination side
> > > and is effective even with older destination versions.
> > 
> > Thanks, this will help libvirt as we think that the migration can be
> > safely aborted unless we successfully called "cont" and thus we just
> > kill QEMU on the destination. But since QEMU on the source already
> > entered postcopy-active, we can't cancel the migration and the result is
> > a paused VM with no way of recovering it.
> > 
> > This series will make the situation better as the source will stay in
> > postcopy-device until the destination successfully loads device data.
> > There's still room for some enhancement though. Depending on how fast
> > this loading is libvirt may issue cont before device data is loaded (the
> > destination is already in postcopy-active at this point), which always
> > succeeds as it only marks the domain to be autostarted, but the actual
> > start may fail later. When discussing this with Juraj we agreed on
> > introducing the new postcopy-device state on the destination as well to
> 
> I used to think and define postcopy-active be the state we should never be
> able to cancel it anymore, implying that the real postcopy process is in
> progress, and also implying the state where we need to start assume the
> latest VM pages are spread on both sides, not one anymore.  Cancellation or
> killing either side means crashing VM then.

Right, although it's unfortunately not the case now as the source is in
postcopy-active even though the complete state is still on the source.

> It could be a good thing indeed to have postcopy-device on dest too from
> that regard, because having postcopy-device on dest can mark out the small
> initial window when dest qemu hasn't yet start to generate new data but
> only applying old data (device data, which src also owns a copy).  From
> that POV, that indeed does not belong to the point if we define
> postcopy-active as above.
> 
> IOW, also with such definition, setting postcopy-active on dest QEMU right
> at the entry of ram load thread (what we do right now..) is too early.
> 
> > make sure libvirt will only call cont once device data was successfully
> > loaded so that we always get a proper result when running cont. But it
> 
> Do we know an estimate of how much extra downtime this would introduce?
> 
> We should have discussed this in a previous thread, the issue is if we cont
> only after device loaded, then dest QEMU may need to wait a while until it
> receives the cont from libvirt, that will contribute to the downtime.  It
> would best be avoided, or if it's negligible then it's fine too but I'm not
> sure whether it's guaranteed to be negligible..

We start QEMU with -S so it always needs to wait for cont from libvirt.
We wait for postcopy-active on the destination before sending cont. So
currently it can arrive while QEMU is still loading device state or when
this is already done. I was just suggesting to always wait for the
device state to be loaded before sending cont. So in some cases it would
arrive a bit later while in other cases nothing would change. It's just
a matter of waking up a thread waiting for postcopy-active and sending
the command back to QEMU. There's no communication with the other host
at this point so I'd expect the difference to be negligible. And as I
said, depending on how fast device state loading is compared to transferring
migration control from libvirt on the source to the destination, we may
already be sending cont by the time QEMU is done.

But anyway, this would only be helpful if there's a way to actually
cancel migration on the source when cont fails.

> If the goal is to make sure libvirt knows what is happening, can it still
> relies on the event emitted, in this case, RESUME?  We can also reorg how
> postcopy-device and postcopy-active states will be reported on dest, then
> they'll help if RESUME is too coarse grained.

The main goal is to make sure we don't end up with vCPUs paused on both
sides during a postcopy migration that can't be recovered nor canceled
thus effectively crashing the VM.

> So far, dest QEMU will try to resume the VM after getting RUN command, that
> is what loadvm_postcopy_handle_run_bh() does, and it will (when autostart=1
> set): (1) firstly try to activate all block devices, iff it succeeded, (2)
> do vm_start(), at the end of which RESUME event will be generated.  So
> RESUME currently implies both disk activation success, and vm start worked.
> 
> > may still fail when locking disks fails (not sure if this is the only
> > way cont may fail). In this case we cannot cancel the migration on the
> 
> Is there any known issue with locking disks that dest would fail?  This
> really sound like we should have the admin taking a look.

Oh definitely, it would be some kind of a storage access issue on the
destination. But we'd like to give the admin an option to actually do
something other than just killing the VM :-) Either by automatically
canceling the migration or by allowing recovery once the storage issues are
solved.

> I recall Juraj mentioned off list that drive lock sometimes will stop
> working.  Is that relevant here?

I think this was more a case of the drive lock succeeding on both sides at
the end of a precopy migration, if we're thinking about the same thing :-)
That is a separate issue that needs to be handled on the libvirt side.

Jirka



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/4] migration: Introduce POSTCOPY_DEVICE state
  2025-09-30  7:53       ` Jiří Denemark
@ 2025-09-30 20:04         ` Peter Xu
  2025-10-01  8:43           ` Jiří Denemark
  2025-10-01 10:09           ` Juraj Marcin
  0 siblings, 2 replies; 33+ messages in thread
From: Peter Xu @ 2025-09-30 20:04 UTC (permalink / raw)
  To: Jiří Denemark
  Cc: Juraj Marcin, qemu-devel, Dr. David Alan Gilbert, Fabiano Rosas

On Tue, Sep 30, 2025 at 09:53:31AM +0200, Jiří Denemark wrote:
> On Thu, Sep 25, 2025 at 14:22:06 -0400, Peter Xu wrote:
> > On Thu, Sep 25, 2025 at 01:54:40PM +0200, Jiří Denemark wrote:
> > > On Mon, Sep 15, 2025 at 13:59:15 +0200, Juraj Marcin wrote:
> > > > From: Juraj Marcin <jmarcin@redhat.com>
> > > > 
> > > > Currently, when postcopy starts, the source VM starts switchover and
> > > > sends a package containing the state of all non-postcopiable devices.
> > > > When the destination loads this package, the switchover is complete and
> > > > the destination VM starts. However, if the device state load fails or
> > > > the destination side crashes, the source side is already in
> > > > POSTCOPY_ACTIVE state and cannot be recovered, even when it has the most
> > > > up-to-date machine state as the destination has not yet started.
> > > > 
> > > > This patch introduces a new POSTCOPY_DEVICE state which is active
> > > > while the destination machine is loading the device state, is not yet
> > > > running, and the source side can be resumed in case of a migration
> > > > failure.
> > > > 
> > > > To transition from POSTCOPY_DEVICE to POSTCOPY_ACTIVE, the source
> > > > side uses a PONG message that is a response to a PING message processed
> > > > just before the POSTCOPY_RUN command that starts the destination VM.
> > > > Thus, this change does not require any changes on the destination side
> > > > and is effective even with older destination versions.
> > > 
> > > Thanks, this will help libvirt as we think that the migration can be
> > > safely aborted unless we successfully called "cont" and thus we just
> > > kill QEMU on the destination. But since QEMU on the source already
> > > entered postcopy-active, we can't cancel the migration and the result is
> > > a paused VM with no way of recovering it.
> > > 
> > > This series will make the situation better as the source will stay in
> > > postcopy-device until the destination successfully loads device data.
> > > There's still room for some enhancement though. Depending on how fast
> > > this loading is libvirt may issue cont before device data is loaded (the
> > > destination is already in postcopy-active at this point), which always
> > > succeeds as it only marks the domain to be autostarted, but the actual
> > > start may fail later. When discussing this with Juraj we agreed on
> > > introducing the new postcopy-device state on the destination as well to
> > 
> > I used to think and define postcopy-active be the state we should never be
> > able to cancel it anymore, implying that the real postcopy process is in
> > progress, and also implying the state where we need to start assume the
> > latest VM pages are spread on both sides, not one anymore.  Cancellation or
> > killing either side means crashing VM then.
> 
> Right, although it's unfortunately not the case now as the source is in
> postcopy-active even though the complete state is still on the source.
> 
> > It could be a good thing indeed to have postcopy-device on dest too from
> > that regard, because having postcopy-device on dest can mark out the small
> > initial window when dest qemu hasn't yet start to generate new data but
> > only applying old data (device data, which src also owns a copy).  From
> > that POV, that indeed does not belong to the point if we define
> > postcopy-active as above.
> > 
> > IOW, also with such definition, setting postcopy-active on dest QEMU right
> > at the entry of ram load thread (what we do right now..) is too early.
> > 
> > > make sure libvirt will only call cont once device data was successfully
> > > loaded so that we always get a proper result when running cont. But it
> > 
> > Do we know an estimate of how much extra downtime this would introduce?
> > 
> > We should have discussed this in a previous thread, the issue is if we cont
> > only after device loaded, then dest QEMU may need to wait a while until it
> > receives the cont from libvirt, that will contribute to the downtime.  It
> > would best be avoided, or if it's negligible then it's fine too but I'm not
> > sure whether it's guaranteed to be negligible..
> 
> We start QEMU with -S so it always needs to wait for cont from libvirt.
> We wait for postcopy-active on the destination before sending cont. So
> currently it can arrive while QEMU is still loading device state or when
> this is already done. I was just suggesting to always wait for the
> device state to be loaded before sending cont. So in some cases it would
> arrive a bit later while in other cases nothing would change. It's just
> a matter of waking up a thread waiting for postcopy-active and sending
> the command back to QEMU. There's no communication with the other host
> at this point so I'd expect the difference to be negligible. And as I
> said depending on how fast device state loading vs transferring
> migration control from libvirt on the source to the destination we may
> already be sending cont when QEMU is done.

Ah OK, I think this is not a major concern until it is shown to be one.

> 
> But anyway, this would only be helpful if there's a way to actually
> cancel migration on the source when cont fails.
> 
> > If the goal is to make sure libvirt knows what is happening, can it still
> > relies on the event emitted, in this case, RESUME?  We can also reorg how
> > postcopy-device and postcopy-active states will be reported on dest, then
> > they'll help if RESUME is too coarse grained.
> 
> The main goal is to make sure we don't end up with vCPUs paused on both
> sides during a postcopy migration that can't be recovered nor canceled
> thus effectively crashing the VM.

Right, I assume that's what Juraj's series is trying to fix.  After this
series lands, I don't see why it would happen.  But indeed I'm still
expecting the block drives (including their locks) to all behave fine.

> 
> > So far, dest QEMU will try to resume the VM after getting RUN command, that
> > is what loadvm_postcopy_handle_run_bh() does, and it will (when autostart=1
> > set): (1) firstly try to activate all block devices, iff it succeeded, (2)
> > do vm_start(), at the end of which RESUME event will be generated.  So
> > RESUME currently implies both disk activation success, and vm start worked.
> > 
> > > may still fail when locking disks fails (not sure if this is the only
> > > way cont may fail). In this case we cannot cancel the migration on the
> > 
> > Is there any known issue with locking disks that dest would fail?  This
> > really sound like we should have the admin taking a look.
> 
> Oh definitely, it would be some kind of an storage access issue on the
> destination. But we'd like to give the admin an option to actually do
> anything else than just killing the VM :-) Either by automatically
> canceling the migration or allowing recovery once storage issues are
> solved.

The problem is, if the storage locking stopped working properly, how can
we guarantee the shared storage itself is working properly?

When I was replying previously, I was expecting the admin to take a look
and fix the storage; I didn't expect the VM could still be recovered if
there's no confidence that the block devices will all work fine.  To me
the locking errors may imply block corruption already, or should I not
see it like that?

Fundamentally, "crashing the VM" doesn't lose anything from the block POV
because the storage is persistent once synced.  What is lost is almost
only RAM, along with task status, service availability, and the part of
storage that was not yet flushed to the backends.

Do we really want to add anything more complex for the case where shared
storage has locking issues?  Maybe there are known locking issues where
we're 100% sure the storage is fine and only the locking went wrong?

IIUC, the hope is that after this series lands we close the gap for
almost all the remaining paths that may cause both sides to HALT during a
postcopy, except for a storage issue with locking.  But I'm not sure if I
missed something.

Thanks,

-- 
Peter Xu




* Re: [PATCH 4/4] migration: Introduce POSTCOPY_DEVICE state
  2025-09-30 20:04         ` Peter Xu
@ 2025-10-01  8:43           ` Jiří Denemark
  2025-10-01 11:05             ` Dr. David Alan Gilbert
  2025-10-01 10:09           ` Juraj Marcin
  1 sibling, 1 reply; 33+ messages in thread
From: Jiří Denemark @ 2025-10-01  8:43 UTC (permalink / raw)
  To: Peter Xu; +Cc: Juraj Marcin, qemu-devel, Dr. David Alan Gilbert, Fabiano Rosas

On Tue, Sep 30, 2025 at 16:04:54 -0400, Peter Xu wrote:
> On Tue, Sep 30, 2025 at 09:53:31AM +0200, Jiří Denemark wrote:
> > On Thu, Sep 25, 2025 at 14:22:06 -0400, Peter Xu wrote:
> > > On Thu, Sep 25, 2025 at 01:54:40PM +0200, Jiří Denemark wrote:
> > > > On Mon, Sep 15, 2025 at 13:59:15 +0200, Juraj Marcin wrote:
> > > So far, dest QEMU will try to resume the VM after getting RUN command, that
> > > is what loadvm_postcopy_handle_run_bh() does, and it will (when autostart=1
> > > set): (1) firstly try to activate all block devices, iff it succeeded, (2)
> > > do vm_start(), at the end of which RESUME event will be generated.  So
> > > RESUME currently implies both disk activation success, and vm start worked.
> > > 
> > > > may still fail when locking disks fails (not sure if this is the only
> > > > way cont may fail). In this case we cannot cancel the migration on the
> > > 
> > > Is there any known issue with locking disks that dest would fail?  This
> > > really sound like we should have the admin taking a look.
> > 
> > Oh definitely, it would be some kind of an storage access issue on the
> > destination. But we'd like to give the admin an option to actually do
> > anything else than just killing the VM :-) Either by automatically
> > canceling the migration or allowing recovery once storage issues are
> > solved.
> 
> The problem is, if the storage locking stopped working properly, then how
> to guarantee the shared storage itself is working properly?
> 
> When I was replying previously, I was expecting the admin taking a look to
> fix the storage, I didn't expect the VM can still be recovered anymore if
> there's no confidence that the block devices will work all fine.  The
> locking errors to me may imply a block corruption already, or should I not
> see it like that?

If the storage itself is broken, there's clearly nothing we can do. But
the thing is we're accessing it from two distinct hosts. So while it may
work on the source, it can be broken on the destination. For example,
the connection between the destination host and the storage may be broken.
Not sure how often this can happen in real life, but we have a bug
report that (artificially) breaking storage access on the destination
results in a paused VM on the source which can only be killed.

So I believe we should do better if reasonably possible. People don't
like losing their VMs just because they tried to migrate and something
failed.

Jirka




* Re: [PATCH 4/4] migration: Introduce POSTCOPY_DEVICE state
  2025-09-30 20:04         ` Peter Xu
  2025-10-01  8:43           ` Jiří Denemark
@ 2025-10-01 10:09           ` Juraj Marcin
  1 sibling, 0 replies; 33+ messages in thread
From: Juraj Marcin @ 2025-10-01 10:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: Jiří Denemark, qemu-devel, Dr. David Alan Gilbert,
	Fabiano Rosas

On 2025-09-30 16:04, Peter Xu wrote:
> On Tue, Sep 30, 2025 at 09:53:31AM +0200, Jiří Denemark wrote:
> > On Thu, Sep 25, 2025 at 14:22:06 -0400, Peter Xu wrote:
> > > On Thu, Sep 25, 2025 at 01:54:40PM +0200, Jiří Denemark wrote:
> > > > On Mon, Sep 15, 2025 at 13:59:15 +0200, Juraj Marcin wrote:
> > > > > From: Juraj Marcin <jmarcin@redhat.com>
> > > > > 
> > > > > Currently, when postcopy starts, the source VM starts switchover and
> > > > > sends a package containing the state of all non-postcopiable devices.
> > > > > When the destination loads this package, the switchover is complete and
> > > > > the destination VM starts. However, if the device state load fails or
> > > > > the destination side crashes, the source side is already in
> > > > > POSTCOPY_ACTIVE state and cannot be recovered, even when it has the most
> > > > > up-to-date machine state as the destination has not yet started.
> > > > > 
> > > > > This patch introduces a new POSTCOPY_DEVICE state which is active
> > > > > while the destination machine is loading the device state, is not yet
> > > > > running, and the source side can be resumed in case of a migration
> > > > > failure.
> > > > > 
> > > > > To transition from POSTCOPY_DEVICE to POSTCOPY_ACTIVE, the source
> > > > > side uses a PONG message that is a response to a PING message processed
> > > > > just before the POSTCOPY_RUN command that starts the destination VM.
> > > > > Thus, this change does not require any changes on the destination side
> > > > > and is effective even with older destination versions.
> > > > 
> > > > Thanks, this will help libvirt as we think that the migration can be
> > > > safely aborted unless we successfully called "cont" and thus we just
> > > > kill QEMU on the destination. But since QEMU on the source already
> > > > entered postcopy-active, we can't cancel the migration and the result is
> > > > a paused VM with no way of recovering it.
> > > > 
> > > > This series will make the situation better as the source will stay in
> > > > postcopy-device until the destination successfully loads device data.
> > > > There's still room for some enhancement though. Depending on how fast
> > > > this loading is libvirt may issue cont before device data is loaded (the
> > > > destination is already in postcopy-active at this point), which always
> > > > succeeds as it only marks the domain to be autostarted, but the actual
> > > > start may fail later. When discussing this with Juraj we agreed on
> > > > introducing the new postcopy-device state on the destination as well to
> > > 
> > > I used to think and define postcopy-active be the state we should never be
> > > able to cancel it anymore, implying that the real postcopy process is in
> > > progress, and also implying the state where we need to start assume the
> > > latest VM pages are spread on both sides, not one anymore.  Cancellation or
> > > killing either side means crashing VM then.
> > 
> > Right, although it's unfortunately not the case now as the source is in
> > postcopy-active even though the complete state is still on the source.
> > 
> > > It could be a good thing indeed to have postcopy-device on dest too from
> > > that regard, because having postcopy-device on dest can mark out the small
> > > initial window when dest qemu hasn't yet start to generate new data but
> > > only applying old data (device data, which src also owns a copy).  From
> > > that POV, that indeed does not belong to the point if we define
> > > postcopy-active as above.
> > > 
> > > IOW, also with such definition, setting postcopy-active on dest QEMU right
> > > at the entry of ram load thread (what we do right now..) is too early.
> > > 
> > > > make sure libvirt will only call cont once device data was successfully
> > > > loaded so that we always get a proper result when running cont. But it
> > > 
> > > Do we know an estimate of how much extra downtime this would introduce?
> > > 
> > > We should have discussed this in a previous thread, the issue is if we cont
> > > only after device loaded, then dest QEMU may need to wait a while until it
> > > receives the cont from libvirt, that will contribute to the downtime.  It
> > > would best be avoided, or if it's negligible then it's fine too but I'm not
> > > sure whether it's guaranteed to be negligible..
> > 
> > We start QEMU with -S so it always needs to wait for cont from libvirt.
> > We wait for postcopy-active on the destination before sending cont. So
> > currently it can arrive while QEMU is still loading device state or when
> > this is already done. I was just suggesting to always wait for the
> > device state to be loaded before sending cont. So in some cases it would
> > arrive a bit later while in other cases nothing would change. It's just
> > a matter of waking up a thread waiting for postcopy-active and sending
> > the command back to QEMU. There's no communication with the other host
> > at this point so I'd expect the difference to be negligible. And as I
> > said depending on how fast device state loading vs transferring
> > migration control from libvirt on the source to the destination we may
> > already be sending cont when QEMU is done.
> 
> Ah OK, I think this is not a major concern, until it is justified to.
> 
> > 
> > But anyway, this would only be helpful if there's a way to actually
> > cancel migration on the source when cont fails.
> > 
> > > If the goal is to make sure libvirt knows what is happening, can it still
> > > relies on the event emitted, in this case, RESUME?  We can also reorg how
> > > postcopy-device and postcopy-active states will be reported on dest, then
> > > they'll help if RESUME is too coarse grained.
> > 
> > The main goal is to make sure we don't end up with vCPUs paused on both
> > sides during a postcopy migration that can't be recovered nor canceled
> > thus effectively crashing the VM.
> 
> Right, I assume that's what Juraj's series is trying to fix.  After this
> series lands, I don't see why it would happen.  But indeed I'm still
> expecting the block drive (including their locks) to behave all fine.

My POSTCOPY_DEVICE series resolves failures during the device state
load, that is, up to the MIG_CMD_PING sent just before the
MIG_CMD_POSTCOPY_RUN command. However, if the destination fails during
resume (POSTCOPY_RUN), the source machine cannot be recovered even with
this series, as it has already received the PONG and switched to
postcopy-active.

It is also not possible to move the PING after POSTCOPY_RUN without
changing how the destination interprets the migration stream, and the
source would also need to know whether the destination supports the
feature.
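
To illustrate the ordering (a toy model only, not code from this
series; the PING number and helper names are made up):

    /* Toy model, not QEMU code: the source stays recoverable only
     * until it sees the PONG for the PING queued just before
     * POSTCOPY_RUN. */
    #include <stdio.h>

    enum src_state { POSTCOPY_DEVICE, POSTCOPY_ACTIVE };

    /* Hypothetical PING number reserved for this handshake. */
    #define PING_BEFORE_RUN 4

    static enum src_state on_pong(enum src_state s, unsigned int value)
    {
        /* The destination answers this PING only after it has loaded
         * the whole device-state package, so the reply marks the point
         * where the source must stop assuming it can be resumed. */
        if (s == POSTCOPY_DEVICE && value == PING_BEFORE_RUN) {
            return POSTCOPY_ACTIVE;
        }
        return s;
    }

    int main(void)
    {
        enum src_state s = POSTCOPY_DEVICE; /* device state + PING sent */

        /* A destination failure before this point leaves the source in
         * POSTCOPY_DEVICE and it can simply be resumed. */
        s = on_pong(s, PING_BEFORE_RUN);

        printf("source is now %s\n",
               s == POSTCOPY_ACTIVE ? "postcopy-active"
                                    : "postcopy-device");
        return 0;
    }

A failure during POSTCOPY_RUN itself only happens after on_pong() above
has fired, which is exactly the window this series cannot cover.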

> 
> > 
> > > So far, dest QEMU will try to resume the VM after getting RUN command, that
> > > is what loadvm_postcopy_handle_run_bh() does, and it will (when autostart=1
> > > set): (1) firstly try to activate all block devices, iff it succeeded, (2)
> > > do vm_start(), at the end of which RESUME event will be generated.  So
> > > RESUME currently implies both disk activation success, and vm start worked.
> > > 
> > > > may still fail when locking disks fails (not sure if this is the only
> > > > way cont may fail). In this case we cannot cancel the migration on the
> > > 
> > > Is there any known issue with locking disks that dest would fail?  This
> > > really sound like we should have the admin taking a look.
> > 
> > Oh definitely, it would be some kind of an storage access issue on the
> > destination. But we'd like to give the admin an option to actually do
> > anything else than just killing the VM :-) Either by automatically
> > canceling the migration or allowing recovery once storage issues are
> > solved.
> 
> The problem is, if the storage locking stopped working properly, then how
> to guarantee the shared storage itself is working properly?
> 
> When I was replying previously, I was expecting the admin taking a look to
> fix the storage, I didn't expect the VM can still be recovered anymore if
> there's no confidence that the block devices will work all fine.  The
> locking errors to me may imply a block corruption already, or should I not
> see it like that?
> 
> Fundamentally, "crashing the VM" doesn't lose anything from the block POV
> because it's always persistent when synced.  It's almost only about RAM
> that is getting lost, alongside it's about task status, service
> availability, and the part of storage that was not flushed to backends.
> 
> Do we really want to add anything more complex when shared storage has
> locking issues?  Maybe there's known issues on locking that we're 100% sure
> the storage is fine, but only the locking went wrong?
> 
> IIUC, the hope is after this series lands we should close the gap for
> almost all the rest paths that may cause both sides to HALT for a postcopy,
> except for a storage issue with lockings.  But I'm not sure if I missed
> something.
> 
> Thanks,
> 
> -- 
> Peter Xu
> 




* Re: [PATCH 4/4] migration: Introduce POSTCOPY_DEVICE state
  2025-10-01  8:43           ` Jiří Denemark
@ 2025-10-01 11:05             ` Dr. David Alan Gilbert
  2025-10-01 14:26               ` Jiří Denemark
  2025-10-01 15:10               ` Daniel P. Berrangé
  0 siblings, 2 replies; 33+ messages in thread
From: Dr. David Alan Gilbert @ 2025-10-01 11:05 UTC (permalink / raw)
  To: Jiří Denemark; +Cc: Peter Xu, Juraj Marcin, qemu-devel, Fabiano Rosas

* Jiří Denemark (jdenemar@redhat.com) wrote:
> On Tue, Sep 30, 2025 at 16:04:54 -0400, Peter Xu wrote:
> > On Tue, Sep 30, 2025 at 09:53:31AM +0200, Jiří Denemark wrote:
> > > On Thu, Sep 25, 2025 at 14:22:06 -0400, Peter Xu wrote:
> > > > On Thu, Sep 25, 2025 at 01:54:40PM +0200, Jiří Denemark wrote:
> > > > > On Mon, Sep 15, 2025 at 13:59:15 +0200, Juraj Marcin wrote:
> > > > So far, dest QEMU will try to resume the VM after getting RUN command, that
> > > > is what loadvm_postcopy_handle_run_bh() does, and it will (when autostart=1
> > > > set): (1) firstly try to activate all block devices, iff it succeeded, (2)
> > > > do vm_start(), at the end of which RESUME event will be generated.  So
> > > > RESUME currently implies both disk activation success, and vm start worked.
> > > > 
> > > > > may still fail when locking disks fails (not sure if this is the only
> > > > > way cont may fail). In this case we cannot cancel the migration on the
> > > > 
> > > > Is there any known issue with locking disks that dest would fail?  This
> > > > really sound like we should have the admin taking a look.
> > > 
> > > Oh definitely, it would be some kind of an storage access issue on the
> > > destination. But we'd like to give the admin an option to actually do
> > > anything else than just killing the VM :-) Either by automatically
> > > canceling the migration or allowing recovery once storage issues are
> > > solved.
> > 
> > The problem is, if the storage locking stopped working properly, then how
> > to guarantee the shared storage itself is working properly?
> > 
> > When I was replying previously, I was expecting the admin taking a look to
> > fix the storage, I didn't expect the VM can still be recovered anymore if
> > there's no confidence that the block devices will work all fine.  The
> > locking errors to me may imply a block corruption already, or should I not
> > see it like that?
> 
> If the storage itself is broken, there's clearly nothing we can do. But
> the thing is we're accessing it from two distinct hosts. So while it may
> work on the source, it can be broken on the destination. For example,
> connection between the destination host and the storage may be broken.
> Not sure how often this can happen in real life, but we have a bug
> report that (artificially) breaking storage access on the destination
> results in paused VM on the source which can only be killed.

I've got a vague memory that a tricky case is when some of your storage
devices are broken on the destination, but not all.
So you tell the block layer you want to take them on the destination:
some take their lock, one fails; now what state are you in?
I'm not sure if the block layer had a way of telling you what state
you were in when I was last involved in that.

> So I believe we should do better if reasonably possible. People don't
> like losing their VMs just because they tried to migrate and something
> failed.

Nod.

Dave

> Jirka
> 
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/



* Re: [PATCH 4/4] migration: Introduce POSTCOPY_DEVICE state
  2025-10-01 11:05             ` Dr. David Alan Gilbert
@ 2025-10-01 14:26               ` Jiří Denemark
  2025-10-01 15:53                 ` Dr. David Alan Gilbert
  2025-10-01 15:10               ` Daniel P. Berrangé
  1 sibling, 1 reply; 33+ messages in thread
From: Jiří Denemark @ 2025-10-01 14:26 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Peter Xu, Juraj Marcin, qemu-devel, Fabiano Rosas

On Wed, Oct 01, 2025 at 11:05:59 +0000, Dr. David Alan Gilbert wrote:
> * Jiří Denemark (jdenemar@redhat.com) wrote:
> > On Tue, Sep 30, 2025 at 16:04:54 -0400, Peter Xu wrote:
> > > On Tue, Sep 30, 2025 at 09:53:31AM +0200, Jiří Denemark wrote:
> > > > On Thu, Sep 25, 2025 at 14:22:06 -0400, Peter Xu wrote:
> > > > > On Thu, Sep 25, 2025 at 01:54:40PM +0200, Jiří Denemark wrote:
> > > > > > On Mon, Sep 15, 2025 at 13:59:15 +0200, Juraj Marcin wrote:
> > > > > So far, dest QEMU will try to resume the VM after getting RUN command, that
> > > > > is what loadvm_postcopy_handle_run_bh() does, and it will (when autostart=1
> > > > > set): (1) firstly try to activate all block devices, iff it succeeded, (2)
> > > > > do vm_start(), at the end of which RESUME event will be generated.  So
> > > > > RESUME currently implies both disk activation success, and vm start worked.
> > > > > 
> > > > > > may still fail when locking disks fails (not sure if this is the only
> > > > > > way cont may fail). In this case we cannot cancel the migration on the
> > > > > 
> > > > > Is there any known issue with locking disks that dest would fail?  This
> > > > > really sound like we should have the admin taking a look.
> > > > 
> > > > Oh definitely, it would be some kind of an storage access issue on the
> > > > destination. But we'd like to give the admin an option to actually do
> > > > anything else than just killing the VM :-) Either by automatically
> > > > canceling the migration or allowing recovery once storage issues are
> > > > solved.
> > > 
> > > The problem is, if the storage locking stopped working properly, then how
> > > to guarantee the shared storage itself is working properly?
> > > 
> > > When I was replying previously, I was expecting the admin taking a look to
> > > fix the storage, I didn't expect the VM can still be recovered anymore if
> > > there's no confidence that the block devices will work all fine.  The
> > > locking errors to me may imply a block corruption already, or should I not
> > > see it like that?
> > 
> > If the storage itself is broken, there's clearly nothing we can do. But
> > the thing is we're accessing it from two distinct hosts. So while it may
> > work on the source, it can be broken on the destination. For example,
> > connection between the destination host and the storage may be broken.
> > Not sure how often this can happen in real life, but we have a bug
> > report that (artificially) breaking storage access on the destination
> > results in paused VM on the source which can only be killed.
> 
> I've got a vague memory that a tricky case is when some of your storage
> devices are broken on the destination, but not all.
> So you tell the block layer you want to take them on the destination
> some take their lock, one fails;  now what state are you in?
> I'm not sure if the block layer had a way of telling you what state
> you were in when I was last involved in that.

Wouldn't those locks be automatically released when we kill QEMU on the
destination as a reaction to a failure to start vCPUs?

Jirka




* Re: [PATCH 4/4] migration: Introduce POSTCOPY_DEVICE state
  2025-10-01 11:05             ` Dr. David Alan Gilbert
  2025-10-01 14:26               ` Jiří Denemark
@ 2025-10-01 15:10               ` Daniel P. Berrangé
  2025-10-02 12:17                 ` Jiří Denemark
  1 sibling, 1 reply; 33+ messages in thread
From: Daniel P. Berrangé @ 2025-10-01 15:10 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Jiří Denemark, Peter Xu, Juraj Marcin, qemu-devel,
	Fabiano Rosas

On Wed, Oct 01, 2025 at 11:05:59AM +0000, Dr. David Alan Gilbert wrote:
> * Jiří Denemark (jdenemar@redhat.com) wrote:
> > On Tue, Sep 30, 2025 at 16:04:54 -0400, Peter Xu wrote:
> > > On Tue, Sep 30, 2025 at 09:53:31AM +0200, Jiří Denemark wrote:
> > > > On Thu, Sep 25, 2025 at 14:22:06 -0400, Peter Xu wrote:
> > > > > On Thu, Sep 25, 2025 at 01:54:40PM +0200, Jiří Denemark wrote:
> > > > > > On Mon, Sep 15, 2025 at 13:59:15 +0200, Juraj Marcin wrote:
> > > > > So far, dest QEMU will try to resume the VM after getting RUN command, that
> > > > > is what loadvm_postcopy_handle_run_bh() does, and it will (when autostart=1
> > > > > set): (1) firstly try to activate all block devices, iff it succeeded, (2)
> > > > > do vm_start(), at the end of which RESUME event will be generated.  So
> > > > > RESUME currently implies both disk activation success, and vm start worked.
> > > > > 
> > > > > > may still fail when locking disks fails (not sure if this is the only
> > > > > > way cont may fail). In this case we cannot cancel the migration on the
> > > > > 
> > > > > Is there any known issue with locking disks that dest would fail?  This
> > > > > really sound like we should have the admin taking a look.
> > > > 
> > > > Oh definitely, it would be some kind of an storage access issue on the
> > > > destination. But we'd like to give the admin an option to actually do
> > > > anything else than just killing the VM :-) Either by automatically
> > > > canceling the migration or allowing recovery once storage issues are
> > > > solved.
> > > 
> > > The problem is, if the storage locking stopped working properly, then how
> > > to guarantee the shared storage itself is working properly?
> > > 
> > > When I was replying previously, I was expecting the admin taking a look to
> > > fix the storage, I didn't expect the VM can still be recovered anymore if
> > > there's no confidence that the block devices will work all fine.  The
> > > locking errors to me may imply a block corruption already, or should I not
> > > see it like that?
> > 
> > If the storage itself is broken, there's clearly nothing we can do. But
> > the thing is we're accessing it from two distinct hosts. So while it may
> > work on the source, it can be broken on the destination. For example,
> > connection between the destination host and the storage may be broken.
> > Not sure how often this can happen in real life, but we have a bug
> > report that (artificially) breaking storage access on the destination
> > results in paused VM on the source which can only be killed.
> 
> I've got a vague memory that a tricky case is when some of your storage
> devices are broken on the destination, but not all.
> So you tell the block layer you want to take them on the destination
> some take their lock, one fails;  now what state are you in?
> I'm not sure if the block layer had a way of telling you what state
> you were in when I was last involved in that.

As long as the target QEMU CPUs have NOT started running, no I/O writes
should have been sent to the storage, so the storage should still be in a
consistent state, and thus we can still try to fail back to the source
QEMU.

The "fun" problem here is that just because we /try/ to fail back to the
source QEMU does not mean the source QEMU will now succeed in
re-acquiring the locks it released only a short time ago.

Consider the classic dead NFS server problem. The target may have
acquired one lock and failed on another because of a service interruption
to the NFS server. The target now can't necessarily release the lock that
it did successfully acquire. So if we fail back to the source, it'll be
unable to reacquire that lock as the target still holds it.

This doesn't mean we shouldn't try to fail back, but there will always
be some failure scenarios we'll struggle to recover from.

The "migration paused" state is our last chance, as that leaves both
QEMUs present while the admin tries to fix the underlying problems.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH 4/4] migration: Introduce POSTCOPY_DEVICE state
  2025-10-01 14:26               ` Jiří Denemark
@ 2025-10-01 15:53                 ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 33+ messages in thread
From: Dr. David Alan Gilbert @ 2025-10-01 15:53 UTC (permalink / raw)
  To: Jiří Denemark; +Cc: Peter Xu, Juraj Marcin, qemu-devel, Fabiano Rosas

* Jiří Denemark (jdenemar@redhat.com) wrote:
> On Wed, Oct 01, 2025 at 11:05:59 +0000, Dr. David Alan Gilbert wrote:
> > * Jiří Denemark (jdenemar@redhat.com) wrote:
> > > On Tue, Sep 30, 2025 at 16:04:54 -0400, Peter Xu wrote:
> > > > On Tue, Sep 30, 2025 at 09:53:31AM +0200, Jiří Denemark wrote:
> > > > > On Thu, Sep 25, 2025 at 14:22:06 -0400, Peter Xu wrote:
> > > > > > On Thu, Sep 25, 2025 at 01:54:40PM +0200, Jiří Denemark wrote:
> > > > > > > On Mon, Sep 15, 2025 at 13:59:15 +0200, Juraj Marcin wrote:
> > > > > > So far, dest QEMU will try to resume the VM after getting RUN command, that
> > > > > > is what loadvm_postcopy_handle_run_bh() does, and it will (when autostart=1
> > > > > > set): (1) firstly try to activate all block devices, iff it succeeded, (2)
> > > > > > do vm_start(), at the end of which RESUME event will be generated.  So
> > > > > > RESUME currently implies both disk activation success, and vm start worked.
> > > > > > 
> > > > > > > may still fail when locking disks fails (not sure if this is the only
> > > > > > > way cont may fail). In this case we cannot cancel the migration on the
> > > > > > 
> > > > > > Is there any known issue with locking disks that dest would fail?  This
> > > > > > really sound like we should have the admin taking a look.
> > > > > 
> > > > > Oh definitely, it would be some kind of an storage access issue on the
> > > > > destination. But we'd like to give the admin an option to actually do
> > > > > anything else than just killing the VM :-) Either by automatically
> > > > > canceling the migration or allowing recovery once storage issues are
> > > > > solved.
> > > > 
> > > > The problem is, if the storage locking stopped working properly, then how
> > > > to guarantee the shared storage itself is working properly?
> > > > 
> > > > When I was replying previously, I was expecting the admin taking a look to
> > > > fix the storage, I didn't expect the VM can still be recovered anymore if
> > > > there's no confidence that the block devices will work all fine.  The
> > > > locking errors to me may imply a block corruption already, or should I not
> > > > see it like that?
> > > 
> > > If the storage itself is broken, there's clearly nothing we can do. But
> > > the thing is we're accessing it from two distinct hosts. So while it may
> > > work on the source, it can be broken on the destination. For example,
> > > connection between the destination host and the storage may be broken.
> > > Not sure how often this can happen in real life, but we have a bug
> > > report that (artificially) breaking storage access on the destination
> > > results in paused VM on the source which can only be killed.
> > 
> > I've got a vague memory that a tricky case is when some of your storage
> > devices are broken on the destination, but not all.
> > So you tell the block layer you want to take them on the destination
> > some take their lock, one fails;  now what state are you in?
> > I'm not sure if the block layer had a way of telling you what state
> > you were in when I was last involved in that.
> 
> Wouldn't those locks be automatically released when we kill QEMU on the
> destination as a reaction to a failure to start vCPUs?

Oh hmm, yeah, that might work OK.

Dave

> Jirka
> 
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/



* Re: [PATCH 4/4] migration: Introduce POSTCOPY_DEVICE state
  2025-10-01 15:10               ` Daniel P. Berrangé
@ 2025-10-02 12:17                 ` Jiří Denemark
  2025-10-02 13:12                   ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 33+ messages in thread
From: Jiří Denemark @ 2025-10-02 12:17 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Dr. David Alan Gilbert, Peter Xu, Juraj Marcin, qemu-devel,
	Fabiano Rosas

On Wed, Oct 01, 2025 at 16:10:44 +0100, Daniel P. Berrangé wrote:
> On Wed, Oct 01, 2025 at 11:05:59AM +0000, Dr. David Alan Gilbert wrote:
> > * Jiří Denemark (jdenemar@redhat.com) wrote:
> > > On Tue, Sep 30, 2025 at 16:04:54 -0400, Peter Xu wrote:
> > > > On Tue, Sep 30, 2025 at 09:53:31AM +0200, Jiří Denemark wrote:
> > > > > On Thu, Sep 25, 2025 at 14:22:06 -0400, Peter Xu wrote:
> > > > > > On Thu, Sep 25, 2025 at 01:54:40PM +0200, Jiří Denemark wrote:
> > > > > > > On Mon, Sep 15, 2025 at 13:59:15 +0200, Juraj Marcin wrote:
> > > > > > So far, dest QEMU will try to resume the VM after getting RUN command, that
> > > > > > is what loadvm_postcopy_handle_run_bh() does, and it will (when autostart=1
> > > > > > set): (1) firstly try to activate all block devices, iff it succeeded, (2)
> > > > > > do vm_start(), at the end of which RESUME event will be generated.  So
> > > > > > RESUME currently implies both disk activation success, and vm start worked.
> > > > > > 
> > > > > > > may still fail when locking disks fails (not sure if this is the only
> > > > > > > way cont may fail). In this case we cannot cancel the migration on the
> > > > > > 
> > > > > > Is there any known issue with locking disks that dest would fail?  This
> > > > > > really sound like we should have the admin taking a look.
> > > > > 
> > > > > Oh definitely, it would be some kind of an storage access issue on the
> > > > > destination. But we'd like to give the admin an option to actually do
> > > > > anything else than just killing the VM :-) Either by automatically
> > > > > canceling the migration or allowing recovery once storage issues are
> > > > > solved.
> > > > 
> > > > The problem is, if the storage locking stopped working properly, then how
> > > > to guarantee the shared storage itself is working properly?
> > > > 
> > > > When I was replying previously, I was expecting the admin taking a look to
> > > > fix the storage, I didn't expect the VM can still be recovered anymore if
> > > > there's no confidence that the block devices will work all fine.  The
> > > > locking errors to me may imply a block corruption already, or should I not
> > > > see it like that?
> > > 
> > > If the storage itself is broken, there's clearly nothing we can do. But
> > > the thing is we're accessing it from two distinct hosts. So while it may
> > > work on the source, it can be broken on the destination. For example,
> > > connection between the destination host and the storage may be broken.
> > > Not sure how often this can happen in real life, but we have a bug
> > > report that (artificially) breaking storage access on the destination
> > > results in paused VM on the source which can only be killed.
> > 
> > I've got a vague memory that a tricky case is when some of your storage
> > devices are broken on the destination, but not all.
> > So you tell the block layer you want to take them on the destination
> > some take their lock, one fails;  now what state are you in?
> > I'm not sure if the block layer had a way of telling you what state
> > you were in when I was last involved in that.
> 
> As long as the target QEMU CPUs have NOT started running, then
> no I/O writes should have been sent to the storage, so the storage
> should still be in a consistent state, and thus we can still try
> to fail back to the source QEMU.
> 
> The "fun" problem here is that just because we /try/ to fail back
> to the source QEMU, does not imply the source QEMU will now succeed
> in re-acquiring the locks it just released a short time ago.

Right, but if we manage to get the source QEMU out of the postcopy
migration in such a scenario, the VM may be manually resumed once the
storage issues are solved. So yes, we can't always roll back, but we
should at least allow some kind of manual recovery.

> Consider the classic dead NFS server problem. The target may have
> acquired 1 lock and failed on another lock because of a service
> interruption to the NFS server. Well the target can't neccessarily
> release the lock that it did successfully acquire now. So if we
> fail back to the source, it'll be unable to reacquire the lock as
> the target still holds it.
> 
> This doesn't mean we shouldn't try to fail back, but there will
> always be some failures scenarios we'll struggle to recover from.
> 
> The "migration paused" state is our last chance, as that leaves
> both QEMU's present while the admin tries to fix the underlying
> problems.

IIUC from my conversation with Juraj, switching to postcopy-paused can
only happen once the CPUs have already been started.

Jirka




* Re: [PATCH 4/4] migration: Introduce POSTCOPY_DEVICE state
  2025-10-02 12:17                 ` Jiří Denemark
@ 2025-10-02 13:12                   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 33+ messages in thread
From: Dr. David Alan Gilbert @ 2025-10-02 13:12 UTC (permalink / raw)
  To: Jiří Denemark
  Cc: Daniel P. Berrangé, Peter Xu, Juraj Marcin, qemu-devel,
	Fabiano Rosas

* Jiří Denemark (jdenemar@redhat.com) wrote:
> On Wed, Oct 01, 2025 at 16:10:44 +0100, Daniel P. Berrangé wrote:
> > On Wed, Oct 01, 2025 at 11:05:59AM +0000, Dr. David Alan Gilbert wrote:
> > > * Jiří Denemark (jdenemar@redhat.com) wrote:
> > > > On Tue, Sep 30, 2025 at 16:04:54 -0400, Peter Xu wrote:
> > > > > On Tue, Sep 30, 2025 at 09:53:31AM +0200, Jiří Denemark wrote:
> > > > > > On Thu, Sep 25, 2025 at 14:22:06 -0400, Peter Xu wrote:
> > > > > > > On Thu, Sep 25, 2025 at 01:54:40PM +0200, Jiří Denemark wrote:
> > > > > > > > On Mon, Sep 15, 2025 at 13:59:15 +0200, Juraj Marcin wrote:
> > > > > > > So far, dest QEMU will try to resume the VM after getting RUN command, that
> > > > > > > is what loadvm_postcopy_handle_run_bh() does, and it will (when autostart=1
> > > > > > > set): (1) firstly try to activate all block devices, iff it succeeded, (2)
> > > > > > > do vm_start(), at the end of which RESUME event will be generated.  So
> > > > > > > RESUME currently implies both disk activation success, and vm start worked.
> > > > > > > 
> > > > > > > > may still fail when locking disks fails (not sure if this is the only
> > > > > > > > way cont may fail). In this case we cannot cancel the migration on the
> > > > > > > 
> > > > > > > Is there any known issue with locking disks that dest would fail?  This
> > > > > > > really sound like we should have the admin taking a look.
> > > > > > 
> > > > > > Oh definitely, it would be some kind of an storage access issue on the
> > > > > > destination. But we'd like to give the admin an option to actually do
> > > > > > anything else than just killing the VM :-) Either by automatically
> > > > > > canceling the migration or allowing recovery once storage issues are
> > > > > > solved.
> > > > > 
> > > > > The problem is, if the storage locking stopped working properly, then how
> > > > > to guarantee the shared storage itself is working properly?
> > > > > 
> > > > > When I was replying previously, I was expecting the admin taking a look to
> > > > > fix the storage, I didn't expect the VM can still be recovered anymore if
> > > > > there's no confidence that the block devices will work all fine.  The
> > > > > locking errors to me may imply a block corruption already, or should I not
> > > > > see it like that?
> > > > 
> > > > If the storage itself is broken, there's clearly nothing we can do. But
> > > > the thing is we're accessing it from two distinct hosts. So while it may
> > > > work on the source, it can be broken on the destination. For example,
> > > > connection between the destination host and the storage may be broken.
> > > > Not sure how often this can happen in real life, but we have a bug
> > > > report that (artificially) breaking storage access on the destination
> > > > results in paused VM on the source which can only be killed.
> > > 
> > > I've got a vague memory that a tricky case is when some of your storage
> > > devices are broken on the destination, but not all.
> > > So you tell the block layer you want to take them on the destination
> > > some take their lock, one fails;  now what state are you in?
> > > I'm not sure if the block layer had a way of telling you what state
> > > you were in when I was last involved in that.
> > 
> > As long as the target QEMU CPUs have NOT started running, then
> > no I/O writes should have been sent to the storage, so the storage
> > should still be in a consistent state, and thus we can still try
> > to fail back to the source QEMU.
> > 
> > The "fun" problem here is that just because we /try/ to fail back
> > to the source QEMU, does not imply the source QEMU will now succeed
> > in re-acquiring the locks it just released a short time ago.
> 
> Right, but if we manage to get the source QEMU out of postcopy migration
> is such scenario, the VM may be manually resumed once storage issues are
> solved. So yes, we can't always rollback, but we should at least allow
> some kind of manual recovery.
> 
> > Consider the classic dead NFS server problem. The target may have
> > acquired 1 lock and failed on another lock because of a service
> > interruption to the NFS server. Well the target can't neccessarily
> > release the lock that it did successfully acquire now. So if we
> > fail back to the source, it'll be unable to reacquire the lock as
> > the target still holds it.
> > 
> > This doesn't mean we shouldn't try to fail back, but there will
> > always be some failures scenarios we'll struggle to recover from.
> > 
> > The "migration paused" state is our last chance, as that leaves
> > both QEMU's present while the admin tries to fix the underlying
> > problems.
> 
> IIUC from my conversation with Juraj switching to postcopy-paused can
> only happen when CPUs were already started.

But if you:
  a) Kill the destination
  b) Try to restart the source
  c) The source IO fails to restart

you could take the source to a normal 'paused' state, like you can do if
storage fails on the source (as in -drive werror=stop?).
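
For example (the file path and other drive options here are only
illustrative), starting the source with something like

  -drive file=/images/guest.qcow2,format=qcow2,if=virtio,werror=stop,rerror=stop

pauses the guest on write/read errors instead of reporting them to it,
which is the behaviour referred to above.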

Dave

> Jirka
> 
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/



end of thread

Thread overview: 33+ messages
2025-09-15 11:59 [PATCH 0/4] migration: Introduce POSTCOPY_DEVICE state Juraj Marcin
2025-09-15 11:59 ` [PATCH 1/4] migration: Do not try to start VM if disk activation fails Juraj Marcin
2025-09-19 16:12   ` Fabiano Rosas
2025-09-15 11:59 ` [PATCH 2/4] migration: Accept MigrationStatus in migration_has_failed() Juraj Marcin
2025-09-19 14:57   ` Peter Xu
2025-09-22 11:26     ` Juraj Marcin
2025-09-15 11:59 ` [PATCH 3/4] migration: Refactor incoming cleanup into migration_incoming_finish() Juraj Marcin
2025-09-19 15:53   ` Peter Xu
2025-09-19 16:46   ` Fabiano Rosas
2025-09-22 12:58     ` Juraj Marcin
2025-09-22 15:51       ` Peter Xu
2025-09-22 17:40         ` Fabiano Rosas
2025-09-22 17:48           ` Peter Xu
2025-09-23 14:58         ` Juraj Marcin
2025-09-23 16:17           ` Peter Xu
2025-09-15 11:59 ` [PATCH 4/4] migration: Introduce POSTCOPY_DEVICE state Juraj Marcin
2025-09-19 16:58   ` Peter Xu
2025-09-19 17:50     ` Peter Xu
2025-09-22 13:34       ` Juraj Marcin
2025-09-22 16:16         ` Peter Xu
2025-09-23 14:23           ` Juraj Marcin
2025-09-25 11:54   ` Jiří Denemark
2025-09-25 18:22     ` Peter Xu
2025-09-30  7:53       ` Jiří Denemark
2025-09-30 20:04         ` Peter Xu
2025-10-01  8:43           ` Jiří Denemark
2025-10-01 11:05             ` Dr. David Alan Gilbert
2025-10-01 14:26               ` Jiří Denemark
2025-10-01 15:53                 ` Dr. David Alan Gilbert
2025-10-01 15:10               ` Daniel P. Berrangé
2025-10-02 12:17                 ` Jiří Denemark
2025-10-02 13:12                   ` Dr. David Alan Gilbert
2025-10-01 10:09           ` Juraj Marcin
