* [RFC PATCH 0/4] migration: Introduce postcopy-setup capability and state
@ 2025-08-07 11:49 Juraj Marcin
2025-08-07 11:49 ` [RFC PATCH 1/4] qemu-thread: Introduce qemu_thread_detach() Juraj Marcin
` (4 more replies)
0 siblings, 5 replies; 26+ messages in thread
From: Juraj Marcin @ 2025-08-07 11:49 UTC (permalink / raw)
To: qemu-devel
Cc: Juraj Marcin, Jiri Denemark, Stefan Weil, Paolo Bonzini, Peter Xu,
Fabiano Rosas
When postcopy migration starts, the source side sends all
non-postcopiable device data in one package command and immediately
transitions to the "postcopy-active" state. However, if the destination
side fails to load the device data or crashes while doing so, the
source side stays paused indefinitely with no way to recover.
This series introduces a new "postcopy-setup" state during which the
destination side is guaranteed to not have been started yet, so the
source side can recover and resume, and the destination side can exit
gracefully.
The key element of this feature is isolating the postcopy-run command
from the non-postcopiable data and sending it only after the destination
side acknowledges that it has loaded all devices and is ready to be
started.
This is necessary because once the postcopy-run command is sent, the
source side cannot know whether the destination is running, and
therefore whether it can safely resume in case of a failure.
Reusing the existing ping/pong messages was also considered (PING 3 is
sent right before the postcopy-run command), but there are two reasons
why the PING 3 reply might not be delivered to the source side:
1. the destination machine failed and is not running, so the source
side can resume,
2. there is a network failure, so the PING 3 delivery fails, but until
TCP or another transport times out, the destination could process the
postcopy-run command and start, in which case the source side cannot
resume.
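The ambiguity above can be condensed into a small sketch (illustrative C only, not QEMU code; the names source_can_resume and DstReport are invented for this example). The point is that what the source has heard back from the destination is irrelevant: only whether the postcopy-run command has been put on the wire decides whether resuming is safe.

```c
#include <assert.h>
#include <stdbool.h>

/* What the source side has observed on the return path. */
typedef enum {
    DST_UNKNOWN, /* no reply seen: could be a crash or just a slow network */
    DST_ACKED,   /* destination confirmed it loaded the device state */
} DstReport;

/*
 * The source may resume the guest only if the destination cannot
 * possibly be running.  Once postcopy-run has been sent, the
 * destination may start at any time, so resuming would risk two
 * running machines -- regardless of any (missing) PING reply.
 */
static bool source_can_resume(bool run_cmd_sent, DstReport report)
{
    (void)report; /* deliberately ignored: it cannot make resume safe */
    return !run_cmd_sent;
}
```

With the current protocol the run command travels inside the package, so run_cmd_sent is true as soon as switchover begins; with postcopy-setup it stays false until the ACK arrives.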
Furthermore, this series contains two more patches required for the
implementation of this feature, which make the listen thread joinable
for graceful cleanup and detach it explicitly otherwise, and one patch
fixing a state transition inside postcopy_start().
Such a (or similar) feature could also be useful for normal
(precopy-only) migration with a return path, to prevent issues when a
network failure happens just as the destination side shuts down the
return path. When I tested such a scenario (by filtering out the SHUT
command), the destination started and reported a successful migration,
while the source side reported a failed migration and tried to resume,
but exited because it failed to acquire the disk image file lock.
Another suggestion from Peter that I would like to discuss is that
instead of introducing a new state, we could move the boundary between
the "device" and "postcopy-active" states to the point when the
postcopy-run command is actually sent (in this series, the boundary
between "postcopy-setup" and "postcopy-active"). However, I am not sure
whether such a change would have unwanted implications.
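The proposed source-side state sequence can be sketched as a tiny transition checker (illustrative C, not the QEMU MigrationStatus machinery; valid_next is an invented helper). The new state sits between "device" and "postcopy-active", and failure is a legal exit only while the run command has not been sent:

```c
#include <assert.h>
#include <stdbool.h>

/* Subset of states relevant to this series (names mirror the cover letter). */
typedef enum {
    S_DEVICE,
    S_POSTCOPY_SETUP,   /* new: run command not sent yet, resume possible */
    S_POSTCOPY_ACTIVE,  /* run command sent, no way back */
    S_FAILED,
} MigState;

static bool valid_next(MigState from, MigState to, bool postcopy_setup_cap)
{
    switch (from) {
    case S_DEVICE:
        /* With the capability off, the old direct transition is kept. */
        return postcopy_setup_cap ? to == S_POSTCOPY_SETUP
                                  : to == S_POSTCOPY_ACTIVE;
    case S_POSTCOPY_SETUP:
        /* The destination has not started; failing here is recoverable. */
        return to == S_POSTCOPY_ACTIVE || to == S_FAILED;
    default:
        return false;
    }
}
```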
Juraj Marcin (4):
qemu-thread: Introduce qemu_thread_detach()
migration: Fix state transition in postcopy_start() error handling
migration: Make listen thread joinable
migration: Introduce postcopy-setup capability and state
include/qemu/thread.h | 1 +
migration/migration.c | 77 +++++++++++++++++++++++---
migration/migration.h | 7 +++
migration/options.c | 16 ++++++
migration/options.h | 1 +
migration/postcopy-ram.c | 7 +++
migration/savevm.c | 53 ++++++++++++++++--
qapi/migration.json | 19 ++++++-
tests/qtest/migration/postcopy-tests.c | 55 ++++++++++++++++++
tests/qtest/migration/precopy-tests.c | 3 +-
util/qemu-thread-posix.c | 8 +++
util/qemu-thread-win32.c | 10 ++++
12 files changed, 241 insertions(+), 16 deletions(-)
--
2.50.1
^ permalink raw reply [flat|nested] 26+ messages in thread
* [RFC PATCH 1/4] qemu-thread: Introduce qemu_thread_detach()
2025-08-07 11:49 [RFC PATCH 0/4] migration: Introduce postcopy-setup capability and state Juraj Marcin
@ 2025-08-07 11:49 ` Juraj Marcin
2025-08-19 10:37 ` Daniel P. Berrangé
2025-08-07 11:49 ` [RFC PATCH 2/4] migration: Fix state transition in postcopy_start() error handling Juraj Marcin
` (3 subsequent siblings)
4 siblings, 1 reply; 26+ messages in thread
From: Juraj Marcin @ 2025-08-07 11:49 UTC (permalink / raw)
To: qemu-devel
Cc: Juraj Marcin, Jiri Denemark, Stefan Weil, Paolo Bonzini, Peter Xu,
Fabiano Rosas
From: Juraj Marcin <jmarcin@redhat.com>
Currently, the QEMU thread abstraction supports both joinable and
detached threads, but once a thread is marked as joinable, it must be
joined using qemu_thread_join() and cannot be detached later.
For the POSIX implementation, pthread_detach() is used. For Windows,
marking the thread as detached and releasing the critical section is
enough, as the thread handle is already released by qemu_thread_create().
Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
---
include/qemu/thread.h | 1 +
util/qemu-thread-posix.c | 8 ++++++++
util/qemu-thread-win32.c | 10 ++++++++++
3 files changed, 19 insertions(+)
diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index f0302ed01f..8a6d1ba98e 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -212,6 +212,7 @@ int qemu_thread_set_affinity(QemuThread *thread, unsigned long *host_cpus,
int qemu_thread_get_affinity(QemuThread *thread, unsigned long **host_cpus,
unsigned long *nbits);
void *qemu_thread_join(QemuThread *thread);
+void qemu_thread_detach(QemuThread *thread);
void qemu_thread_get_self(QemuThread *thread);
bool qemu_thread_is_self(QemuThread *thread);
G_NORETURN void qemu_thread_exit(void *retval);
diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
index ba725444ba..20442456b5 100644
--- a/util/qemu-thread-posix.c
+++ b/util/qemu-thread-posix.c
@@ -536,3 +536,11 @@ void *qemu_thread_join(QemuThread *thread)
}
return ret;
}
+
+void qemu_thread_detach(QemuThread *thread)
+{
+ int err = pthread_detach(thread->thread);
+ if (err) {
+ error_exit(err, __func__);
+ }
+}
diff --git a/util/qemu-thread-win32.c b/util/qemu-thread-win32.c
index ca2e0b512e..bdfb7b4aee 100644
--- a/util/qemu-thread-win32.c
+++ b/util/qemu-thread-win32.c
@@ -328,6 +328,16 @@ void *qemu_thread_join(QemuThread *thread)
return ret;
}
+void qemu_thread_detach(QemuThread *thread)
+{
+ QemuThreadData *data = thread->data;
+
+ if (data->mode == QEMU_THREAD_JOINABLE) {
+ data->mode = QEMU_THREAD_DETACHED;
+ DeleteCriticalSection(&data->cs);
+ }
+}
+
static bool set_thread_description(HANDLE h, const char *name)
{
HRESULT hr;
--
2.50.1
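For reference, the POSIX semantics the new helper relies on can be exercised with plain pthreads: a thread created joinable may later be detached exactly once, after which pthread_join() is no longer legal on it. A minimal sketch (run_worker and worker are hypothetical, not QEMU code):

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static void *worker(void *arg)
{
    if (arg) {
        *(int *)arg = 42; /* trivial payload for the joinable path */
    }
    return NULL;
}

/*
 * Create a joinable thread, then either join it (graceful cleanup) or
 * detach it (fire and forget) -- mirroring what the series does with
 * the postcopy listen thread.
 */
int run_worker(int join_it)
{
    pthread_t t;
    int value = 0;

    if (join_it) {
        if (pthread_create(&t, NULL, worker, &value)) {
            return -1;
        }
        pthread_join(t, NULL); /* worker has finished once this returns */
        return value;
    }
    /* Detached path: pass no stack storage, the thread outlives us. */
    if (pthread_create(&t, NULL, worker, NULL)) {
        return -1;
    }
    pthread_detach(t); /* resources reclaimed automatically on exit */
    return 0;
}
```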
* [RFC PATCH 2/4] migration: Fix state transition in postcopy_start() error handling
2025-08-07 11:49 [RFC PATCH 0/4] migration: Introduce postcopy-setup capability and state Juraj Marcin
2025-08-07 11:49 ` [RFC PATCH 1/4] qemu-thread: Introduce qemu_thread_detach() Juraj Marcin
@ 2025-08-07 11:49 ` Juraj Marcin
2025-08-07 20:54 ` Peter Xu
2025-08-07 11:49 ` [RFC PATCH 3/4] migration: Make listen thread joinable Juraj Marcin
` (2 subsequent siblings)
4 siblings, 1 reply; 26+ messages in thread
From: Juraj Marcin @ 2025-08-07 11:49 UTC (permalink / raw)
To: qemu-devel
Cc: Juraj Marcin, Jiri Denemark, Stefan Weil, Paolo Bonzini, Peter Xu,
Fabiano Rosas
From: Juraj Marcin <jmarcin@redhat.com>
Depending on where an error happens during postcopy_start(), the state
can be either "active", "device", or "cancelling", but never
"postcopy-active". The migration state is transitioned to
"postcopy-active" only just before a successful return from the
function.
Accept any state except "cancelling" when transitioning to the "failed"
state.
Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
---
migration/migration.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index 10c216d25d..e5ce2940d5 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2872,8 +2872,9 @@ static int postcopy_start(MigrationState *ms, Error **errp)
fail_closefb:
qemu_fclose(fb);
fail:
- migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
- MIGRATION_STATUS_FAILED);
+ if (ms->state != MIGRATION_STATUS_CANCELLING) {
+ migrate_set_state(&ms->state, ms->state, MIGRATION_STATUS_FAILED);
+ }
migration_block_activate(NULL);
migration_call_notifiers(ms, MIG_EVENT_PRECOPY_FAILED, NULL);
bql_unlock();
--
2.50.1
* [RFC PATCH 3/4] migration: Make listen thread joinable
2025-08-07 11:49 [RFC PATCH 0/4] migration: Introduce postcopy-setup capability and state Juraj Marcin
2025-08-07 11:49 ` [RFC PATCH 1/4] qemu-thread: Introduce qemu_thread_detach() Juraj Marcin
2025-08-07 11:49 ` [RFC PATCH 2/4] migration: Fix state transition in postcopy_start() error handling Juraj Marcin
@ 2025-08-07 11:49 ` Juraj Marcin
2025-08-07 20:57 ` Peter Xu
2025-08-07 11:49 ` [RFC PATCH 4/4] migration: Introduce postcopy-setup capability and state Juraj Marcin
2025-08-11 14:54 ` [RFC PATCH 0/4] " Peter Xu
4 siblings, 1 reply; 26+ messages in thread
From: Juraj Marcin @ 2025-08-07 11:49 UTC (permalink / raw)
To: qemu-devel
Cc: Juraj Marcin, Jiri Denemark, Stefan Weil, Paolo Bonzini, Peter Xu,
Fabiano Rosas
From: Juraj Marcin <jmarcin@redhat.com>
This patch allows joining the migration listen thread. This is done in
preparation for the introduction of the "postcopy-setup" state at the
beginning of a postcopy migration, during which the destination can
fail gracefully and the source side can then resume to a running state.
In case of such a failure, to gracefully perform all cleanup in the
main migration thread, we need to wait for the listen thread to exit,
which can be done by joining it.
Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
---
migration/migration.c | 1 +
migration/savevm.c | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/migration/migration.c b/migration/migration.c
index e5ce2940d5..8174e811eb 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -901,6 +901,7 @@ process_incoming_migration_co(void *opaque)
* Postcopy was started, cleanup should happen at the end of the
* postcopy thread.
*/
+ qemu_thread_detach(&mis->listen_thread);
trace_process_incoming_migration_co_postcopy_end_main();
goto out;
}
diff --git a/migration/savevm.c b/migration/savevm.c
index fabbeb296a..d2360be53c 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2217,7 +2217,7 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
mis->have_listen_thread = true;
postcopy_thread_create(mis, &mis->listen_thread,
MIGRATION_THREAD_DST_LISTEN,
- postcopy_ram_listen_thread, QEMU_THREAD_DETACHED);
+ postcopy_ram_listen_thread, QEMU_THREAD_JOINABLE);
trace_loadvm_postcopy_handle_listen("return");
return 0;
--
2.50.1
* [RFC PATCH 4/4] migration: Introduce postcopy-setup capability and state
2025-08-07 11:49 [RFC PATCH 0/4] migration: Introduce postcopy-setup capability and state Juraj Marcin
` (2 preceding siblings ...)
2025-08-07 11:49 ` [RFC PATCH 3/4] migration: Make listen thread joinable Juraj Marcin
@ 2025-08-07 11:49 ` Juraj Marcin
2025-08-11 14:54 ` [RFC PATCH 0/4] " Peter Xu
4 siblings, 0 replies; 26+ messages in thread
From: Juraj Marcin @ 2025-08-07 11:49 UTC (permalink / raw)
To: qemu-devel
Cc: Juraj Marcin, Jiri Denemark, Stefan Weil, Paolo Bonzini, Peter Xu,
Fabiano Rosas
From: Juraj Marcin <jmarcin@redhat.com>
During switchover at the start of postcopy, the source side sends a
package containing all non-postcopiable device state together with a
postcopy-run command, and transitions to the "postcopy-active" state.
However, if the destination side fails to load the device state or
crashes during this process, there is currently no way of recovering
the source side.
To resolve this problem, this patch adds a new feature that introduces
a "postcopy-setup" state between the "device" and "postcopy-active"
states and removes the postcopy-run command from the package containing
all non-postcopiable device data. During this state, it is guaranteed
that the destination side has not yet been started, and in case of an
error, the source side can be resumed without losing any data.
The destination transitions to the "postcopy-active" state only after
it successfully loads all non-postcopiable data included in the package
command and sends a POSTCOPY_RUN_ACK message signalling that it can be
started. When the source side receives this ACK message, it finally
transitions to the "postcopy-active" state and sends the run command,
which is processed by the listen thread on the destination side, and
the destination is started. Postcopy migration then continues as usual.
This feature needs to be enabled with the "postcopy-setup" capability
on both sides before the migration starts; this ensures backwards
compatibility in both directions (migrating from a QEMU without this
feature or migrating to a QEMU without this feature).
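The handshake described above can be condensed into a sketch (illustrative C, not QEMU code; Handshake and the helper names are invented for this example). The source withholds POSTCOPY_RUN until it sees the ACK on the return path, so a failed package load leaves the destination stopped and the source free to resume:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    bool run_ack_sent; /* dst -> src: POSTCOPY_RUN_ACK after loading ok */
    bool run_sent;     /* src -> dst: POSTCOPY_RUN */
    bool dst_running;
} Handshake;

/* Destination: load CMD_PACKAGED contents, ACK only on success. */
static void dst_load_package(Handshake *h, bool load_ok)
{
    h->run_ack_sent = load_ok;
}

/* Source return-path thread: send the run command once ACKed. */
static void src_on_return_path(Handshake *h)
{
    if (h->run_ack_sent) {
        h->run_sent = true;
        h->dst_running = true; /* destination processes POSTCOPY_RUN */
    }
}

/* Resuming is safe exactly while the run command was never sent. */
static bool src_can_resume(const Handshake *h)
{
    return !h->run_sent;
}
```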
Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
---
migration/migration.c | 71 +++++++++++++++++++++++---
migration/migration.h | 7 +++
migration/options.c | 16 ++++++
migration/options.h | 1 +
migration/postcopy-ram.c | 7 +++
migration/savevm.c | 51 ++++++++++++++++--
qapi/migration.json | 19 ++++++-
tests/qtest/migration/postcopy-tests.c | 55 ++++++++++++++++++++
tests/qtest/migration/precopy-tests.c | 3 +-
9 files changed, 217 insertions(+), 13 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index 8174e811eb..5b3cf57712 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -91,6 +91,7 @@ enum mig_rp_message_type {
MIG_RP_MSG_RECV_BITMAP, /* send recved_bitmap back to source */
MIG_RP_MSG_RESUME_ACK, /* tell source that we are ready to resume */
MIG_RP_MSG_SWITCHOVER_ACK, /* Tell source it's OK to do switchover */
+ MIG_RP_MSG_POSTCOPY_RUN_ACK, /* tell source it's OK to send postcopy run */
MIG_RP_MSG_MAX
};
@@ -896,6 +897,15 @@ process_incoming_migration_co(void *opaque)
* the normal exit.
*/
postcopy_ram_incoming_cleanup(mis);
+ } else if (ret < 0 && ps == POSTCOPY_INCOMING_LISTENING &&
+ mis->state == MIGRATION_STATUS_ACTIVE) {
+ /*
+ * An error happened during postcopy start while we have not yet
+ * acknowledged to be started, wait for listen thread to exit and
+ * then do a normal cleanup.
+ */
+ qemu_thread_join(&mis->listen_thread);
+ postcopy_ram_incoming_cleanup(mis);
} else if (ret >= 0) {
/*
* Postcopy was started, cleanup should happen at the end of the
@@ -1206,6 +1216,11 @@ void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t value)
migrate_send_rp_message(mis, MIG_RP_MSG_RESUME_ACK, sizeof(buf), &buf);
}
+int migrate_send_rp_postcopy_run_ack(MigrationIncomingState *mis)
+{
+ return migrate_send_rp_message(mis, MIG_RP_MSG_POSTCOPY_RUN_ACK, 0, NULL);
+}
+
bool migration_is_running(void)
{
MigrationState *s = current_migration;
@@ -1216,6 +1231,7 @@ bool migration_is_running(void)
switch (s->state) {
case MIGRATION_STATUS_ACTIVE:
+ case MIGRATION_STATUS_POSTCOPY_SETUP:
case MIGRATION_STATUS_POSTCOPY_ACTIVE:
case MIGRATION_STATUS_POSTCOPY_PAUSED:
case MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP:
@@ -1237,6 +1253,7 @@ static bool migration_is_active(void)
MigrationState *s = current_migration;
return (s->state == MIGRATION_STATUS_ACTIVE ||
+ s->state == MIGRATION_STATUS_POSTCOPY_SETUP ||
s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
}
@@ -1359,6 +1376,7 @@ static void fill_source_migration_info(MigrationInfo *info)
break;
case MIGRATION_STATUS_ACTIVE:
case MIGRATION_STATUS_CANCELLING:
+ case MIGRATION_STATUS_POSTCOPY_SETUP:
case MIGRATION_STATUS_POSTCOPY_ACTIVE:
case MIGRATION_STATUS_PRE_SWITCHOVER:
case MIGRATION_STATUS_DEVICE:
@@ -1412,6 +1430,7 @@ static void fill_destination_migration_info(MigrationInfo *info)
case MIGRATION_STATUS_CANCELLING:
case MIGRATION_STATUS_CANCELLED:
case MIGRATION_STATUS_ACTIVE:
+ case MIGRATION_STATUS_POSTCOPY_SETUP:
case MIGRATION_STATUS_POSTCOPY_ACTIVE:
case MIGRATION_STATUS_POSTCOPY_PAUSED:
case MIGRATION_STATUS_POSTCOPY_RECOVER:
@@ -1712,6 +1731,7 @@ bool migration_in_postcopy(void)
MigrationState *s = migrate_get_current();
switch (s->state) {
+ case MIGRATION_STATUS_POSTCOPY_SETUP:
case MIGRATION_STATUS_POSTCOPY_ACTIVE:
case MIGRATION_STATUS_POSTCOPY_PAUSED:
case MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP:
@@ -1725,6 +1745,7 @@ bool migration_in_postcopy(void)
bool migration_postcopy_is_alive(MigrationStatus state)
{
switch (state) {
+ case MIGRATION_STATUS_POSTCOPY_SETUP:
case MIGRATION_STATUS_POSTCOPY_ACTIVE:
case MIGRATION_STATUS_POSTCOPY_RECOVER:
return true;
@@ -1806,6 +1827,7 @@ int migrate_init(MigrationState *s, Error **errp)
s->threshold_size = 0;
s->switchover_acked = false;
s->rdma_migration = false;
+ s->postcopy_run_acked = false;
/*
* set mig_stats memory to zero for a new migration
*/
@@ -2615,6 +2637,10 @@ static void *source_return_path_thread(void *opaque)
trace_source_return_path_thread_switchover_acked();
break;
+ case MIG_RP_MSG_POSTCOPY_RUN_ACK:
+ ms->postcopy_run_acked = true;
+ break;
+
default:
break;
}
@@ -2808,7 +2834,16 @@ static int postcopy_start(MigrationState *ms, Error **errp)
qemu_savevm_send_ping(fb, 3);
}
- qemu_savevm_send_postcopy_run(fb);
+ if (migrate_postcopy_setup()) {
+ /*
+ * EOF mark is necessary for the receiving side to stop reading contents
+ * of the CMD_PACKAGED buffer.
+ */
+ qemu_put_byte(fb, QEMU_VM_EOF);
+ qemu_fflush(fb);
+ } else {
+ qemu_savevm_send_postcopy_run(fb);
+ }
/* <><> end of stuff going into the package */
@@ -2835,7 +2870,13 @@ static int postcopy_start(MigrationState *ms, Error **errp)
*/
migration_call_notifiers(ms, MIG_EVENT_PRECOPY_DONE, NULL);
- migration_downtime_end(ms);
+ if (!migrate_postcopy_setup()) {
+ /*
+ * With postcopy-setup enabled, POSTCOPY_RUN command is not present in
+ * the package. Downtime will end when the command is actually sent.
+ */
+ migration_downtime_end(ms);
+ }
if (migrate_postcopy_ram()) {
/*
@@ -2862,8 +2903,13 @@ static int postcopy_start(MigrationState *ms, Error **errp)
*/
migration_rate_set(migrate_max_postcopy_bandwidth());
- /* Now, switchover looks all fine, switching to postcopy-active */
+ /*
+ * Now, switchover looks all fine, switching to either postcopy-setup or
+ * directly postcopy-active depending on capabilities.
+ */
migrate_set_state(&ms->state, MIGRATION_STATUS_DEVICE,
+ migrate_postcopy_setup() ?
+ MIGRATION_STATUS_POSTCOPY_SETUP :
MIGRATION_STATUS_POSTCOPY_ACTIVE);
bql_unlock();
@@ -3305,8 +3351,8 @@ static MigThrError migration_detect_error(MigrationState *s)
return postcopy_pause(s);
} else {
/*
- * For precopy (or postcopy with error outside IO), we fail
- * with no time.
+ * For precopy (or postcopy with error outside IO, or postcopy before
+ * the destination has been started), we fail with no time.
*/
migrate_set_state(&s->state, state, MIGRATION_STATUS_FAILED);
trace_migration_thread_file_err();
@@ -3441,7 +3487,8 @@ static MigIterateState migration_iteration_run(MigrationState *s)
{
uint64_t must_precopy, can_postcopy, pending_size;
Error *local_err = NULL;
- bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
+ bool in_postcopy = (s->state == MIGRATION_STATUS_POSTCOPY_SETUP ||
+ s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
bool can_switchover = migration_can_switchover(s);
bool complete_ready;
@@ -3491,6 +3538,18 @@ static MigIterateState migration_iteration_run(MigrationState *s)
complete_ready = can_switchover && (pending_size <= s->threshold_size);
}
+ /*
+ * The destination ACKed that POSTCOPY_RUN can be sent, or the migration
+ * is going to end in this iteration, so the destination must be started.
+ */
+ if (s->state == MIGRATION_STATUS_POSTCOPY_SETUP &&
+ (s->postcopy_run_acked || complete_ready)) {
+ migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_SETUP,
+ MIGRATION_STATUS_POSTCOPY_ACTIVE);
+ qemu_savevm_send_postcopy_run(s->to_dst_file);
+ migration_downtime_end(s);
+ }
+
if (complete_ready) {
trace_migration_thread_low_pending(pending_size);
migration_completion(s);
diff --git a/migration/migration.h b/migration/migration.h
index 01329bf824..fff79f0647 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -509,6 +509,12 @@ struct MigrationState {
/* Is this a rdma migration */
bool rdma_migration;
+ /*
+ * Indicates whether the destination ACKed loading the device state and is
+ * ready to receive POSTCOPY_RUN command.
+ */
+ bool postcopy_run_acked;
+
GSource *hup_source;
};
@@ -553,6 +559,7 @@ void migrate_send_rp_recv_bitmap(MigrationIncomingState *mis,
char *block_name);
void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t value);
int migrate_send_rp_switchover_ack(MigrationIncomingState *mis);
+int migrate_send_rp_postcopy_run_ack(MigrationIncomingState *mis);
void dirty_bitmap_mig_before_vm_start(void);
void dirty_bitmap_mig_cancel_outgoing(void);
diff --git a/migration/options.c b/migration/options.c
index 4e923a2e07..e4541ced6f 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -203,6 +203,7 @@ const Property migration_properties[] = {
MIGRATION_CAPABILITY_SWITCHOVER_ACK),
DEFINE_PROP_MIG_CAP("x-dirty-limit", MIGRATION_CAPABILITY_DIRTY_LIMIT),
DEFINE_PROP_MIG_CAP("mapped-ram", MIGRATION_CAPABILITY_MAPPED_RAM),
+ DEFINE_PROP_MIG_CAP("postcopy-setup", MIGRATION_CAPABILITY_POSTCOPY_SETUP),
};
const size_t migration_properties_count = ARRAY_SIZE(migration_properties);
@@ -360,6 +361,13 @@ bool migrate_zero_copy_send(void)
return s->capabilities[MIGRATION_CAPABILITY_ZERO_COPY_SEND];
}
+bool migrate_postcopy_setup(void)
+{
+ MigrationState *s = migrate_get_current();
+
+ return s->capabilities[MIGRATION_CAPABILITY_POSTCOPY_SETUP];
+}
+
/* pseudo capabilities */
bool migrate_multifd_flush_after_each_section(void)
@@ -626,6 +634,14 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, Error **errp)
}
}
+ if (new_caps[MIGRATION_CAPABILITY_POSTCOPY_SETUP]) {
+ if (!migrate_postcopy_setup() && migration_is_running()) {
+ error_setg(errp,
+ "Postcopy-setup cannot be enabled during migration");
+ return false;
+ }
+ }
+
/*
* On destination side, check the cases that capability is being set
* after incoming thread has started.
diff --git a/migration/options.h b/migration/options.h
index 82d839709e..2c5d232d43 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -42,6 +42,7 @@ bool migrate_return_path(void);
bool migrate_validate_uuid(void);
bool migrate_xbzrle(void);
bool migrate_zero_copy_send(void);
+bool migrate_postcopy_setup(void);
/*
* pseudo capabilities
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 45af9a361e..6bc16fd2dc 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -1393,6 +1393,13 @@ retry:
msg.arg.pagefault.address,
msg.arg.pagefault.feat.ptid);
if (ret) {
+ if (mis->state == MIGRATION_STATUS_ACTIVE) {
+ /*
+ * Not postcopy-active yet, there is no recovery from that,
+ * exit the thread.
+ */
+ break;
+ }
/* May be network failure, try to wait for recovery */
postcopy_pause_fault_thread(mis);
goto retry;
diff --git a/migration/savevm.c b/migration/savevm.c
index d2360be53c..3efe499ff1 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -279,6 +279,7 @@ static bool should_validate_capability(int capability)
switch (capability) {
case MIGRATION_CAPABILITY_X_IGNORE_SHARED:
case MIGRATION_CAPABILITY_MAPPED_RAM:
+ case MIGRATION_CAPABILITY_POSTCOPY_SETUP:
return true;
default:
return false;
@@ -2085,8 +2086,15 @@ static void *postcopy_ram_listen_thread(void *opaque)
object_ref(OBJECT(migr));
- migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
- MIGRATION_STATUS_POSTCOPY_ACTIVE);
+ if (!migrate_postcopy_setup()) {
+ /*
+ * If postcopy-setup is enabled, we transition to postcopy-active after
+ * the machine finishes loading CMD_PACKAGED and ACKs to be started.
+ */
+ migrate_set_state(&mis->state,
+ MIGRATION_STATUS_ACTIVE,
+ MIGRATION_STATUS_POSTCOPY_ACTIVE);
+ }
qemu_event_set(&mis->thread_sync_event);
trace_postcopy_ram_listen_thread_start();
@@ -2099,6 +2107,14 @@ static void *postcopy_ram_listen_thread(void *opaque)
/* TODO: sanity check that only postcopiable data will be loaded here */
load_res = qemu_loadvm_state_main(f, mis);
+ if (load_res < 0 && mis->state != MIGRATION_STATUS_POSTCOPY_ACTIVE) {
+ /*
+ * Something happened during device load in the main thread, as we are
+ * not running yet. Don't force exit, the main thread will handle
+ * incoming_state and postcopy_ram_incoming cleanups.
+ */
+ goto out;
+ }
/*
* This is tricky, but, mis->from_src_file can change after it
@@ -2150,6 +2166,10 @@ static void *postcopy_ram_listen_thread(void *opaque)
exit(EXIT_FAILURE);
}
+ if (mis->state != MIGRATION_STATUS_POSTCOPY_ACTIVE) {
+ error_report("%s: at this point the destination should have been started already",
+ __func__);
+ }
migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
MIGRATION_STATUS_COMPLETED);
/*
@@ -2162,6 +2182,7 @@ static void *postcopy_ram_listen_thread(void *opaque)
migration_incoming_state_destroy();
bql_unlock();
+out:
rcu_unregister_thread();
mis->have_listen_thread = false;
postcopy_state_set(POSTCOPY_INCOMING_END);
@@ -2276,6 +2297,13 @@ static int loadvm_postcopy_handle_run(MigrationIncomingState *mis)
postcopy_state_set(POSTCOPY_INCOMING_RUNNING);
migration_bh_schedule(loadvm_postcopy_handle_run_bh, mis);
+ if (migrate_postcopy_setup()) {
+ /*
+ * With postcopy-setup enabled, the POSTCOPY_RUN command is processed
+ * by the listen thread instead of the main thread.
+ */
+ return 0;
+ }
/* We need to finish reading the stream from the package
* and also stop reading anything more from the stream that loaded the
* package (since it's now being read by the listener thread).
@@ -2453,8 +2481,23 @@ static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis)
trace_loadvm_handle_cmd_packaged_main(ret);
qemu_fclose(packf);
object_unref(OBJECT(bioc));
-
- return ret;
+ if (ret < 0) {
+ return ret;
+ }
+ if (migrate_postcopy_setup()) {
+ ret = migrate_send_rp_postcopy_run_ack(mis);
+ if (ret < 0) {
+ return ret;
+ }
+ migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
+ MIGRATION_STATUS_POSTCOPY_ACTIVE);
+ }
+ /* We need to finish reading the stream from the package
+ * and also stop reading anything more from the stream that loaded the
+ * package (since it's now being read by the listener thread).
+ * LOADVM_QUIT will quit all the layers of nested loadvm loops.
+ */
+ return LOADVM_QUIT;
}
/*
diff --git a/qapi/migration.json b/qapi/migration.json
index 2387c21e9c..f1f434613a 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -139,6 +139,13 @@
#
# @active: in the process of doing migration.
#
+# @postcopy-setup: Postcopy migration has been initiated, but the
+# destination is not running yet, as it is processing the contents
+# of CMD_PACKAGED. After the destination acknowledges that it can
+# be started, the source transitions to the postcopy-active state
+# and sends the POSTCOPY_RUN command. Only present if the
+# postcopy-setup capability is enabled. (Since 10.2)
+#
# @postcopy-active: like active, but now in postcopy mode.
# (since 2.5)
#
@@ -173,7 +180,7 @@
##
{ 'enum': 'MigrationStatus',
'data': [ 'none', 'setup', 'cancelling', 'cancelled',
- 'active', 'postcopy-active', 'postcopy-paused',
+ 'active', 'postcopy-setup', 'postcopy-active', 'postcopy-paused',
'postcopy-recover-setup',
'postcopy-recover', 'completed', 'failed', 'colo',
'pre-switchover', 'device', 'wait-unplug' ] }
@@ -517,6 +524,14 @@
# each RAM page. Requires a migration URI that supports seeking,
# such as a file. (since 9.0)
#
+# @postcopy-setup: If enabled, the POSTCOPY_RUN command is not sent
+# at the end of the device state during postcopy switchover, but
+# rather after the destination side acknowledges that it has
+# successfully loaded the device state and can be started. Between
+# these two events, the machines are in the postcopy-setup state,
+# which allows the source machine to fully recover and resume
+# operation in case of errors. (Since 10.2)
+#
# Features:
#
# @unstable: Members @x-colo and @x-ignore-shared are experimental.
@@ -536,7 +551,7 @@
{ 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
'validate-uuid', 'background-snapshot',
'zero-copy-send', 'postcopy-preempt', 'switchover-ack',
- 'dirty-limit', 'mapped-ram'] }
+ 'dirty-limit', 'mapped-ram', 'postcopy-setup'] }
##
# @MigrationCapabilityStatus:
diff --git a/tests/qtest/migration/postcopy-tests.c b/tests/qtest/migration/postcopy-tests.c
index 3773525843..80ce6320d9 100644
--- a/tests/qtest/migration/postcopy-tests.c
+++ b/tests/qtest/migration/postcopy-tests.c
@@ -27,6 +27,17 @@ static void test_postcopy(void)
test_postcopy_common(&args);
}
+static void test_postcopy_setup(void)
+{
+ MigrateCommon args = {
+ .start = {
+ .caps[MIGRATION_CAPABILITY_POSTCOPY_SETUP] = true,
+ }
+ };
+
+ test_postcopy_common(&args);
+}
+
static void test_postcopy_suspend(void)
{
MigrateCommon args = {
@@ -47,6 +58,18 @@ static void test_postcopy_preempt(void)
test_postcopy_common(&args);
}
+static void test_postcopy_preempt_setup(void)
+{
+ MigrateCommon args = {
+ .start = {
+ .caps[MIGRATION_CAPABILITY_POSTCOPY_PREEMPT] = true,
+ .caps[MIGRATION_CAPABILITY_POSTCOPY_SETUP] = true,
+ },
+ };
+
+ test_postcopy_common(&args);
+}
+
static void test_postcopy_recovery(void)
{
MigrateCommon args = { };
@@ -87,10 +110,13 @@ static void migration_test_add_postcopy_smoke(MigrationTestEnv *env)
{
if (env->has_uffd) {
migration_test_add("/migration/postcopy/plain", test_postcopy);
+ migration_test_add("/migration/postcopy/setup", test_postcopy_setup);
migration_test_add("/migration/postcopy/recovery/plain",
test_postcopy_recovery);
migration_test_add("/migration/postcopy/preempt/plain",
test_postcopy_preempt);
+ migration_test_add("/migration/postcopy/preempt/setup",
+ test_postcopy_preempt_setup);
}
}
@@ -105,6 +131,18 @@ static void test_multifd_postcopy(void)
test_postcopy_common(&args);
}
+static void test_multifd_postcopy_setup(void)
+{
+ MigrateCommon args = {
+ .start = {
+ .caps[MIGRATION_CAPABILITY_MULTIFD] = true,
+ .caps[MIGRATION_CAPABILITY_POSTCOPY_SETUP] = true,
+ },
+ };
+
+ test_postcopy_common(&args);
+}
+
static void test_multifd_postcopy_preempt(void)
{
MigrateCommon args = {
@@ -117,6 +155,19 @@ static void test_multifd_postcopy_preempt(void)
test_postcopy_common(&args);
}
+static void test_multifd_postcopy_preempt_setup(void)
+{
+ MigrateCommon args = {
+ .start = {
+ .caps[MIGRATION_CAPABILITY_MULTIFD] = true,
+ .caps[MIGRATION_CAPABILITY_POSTCOPY_PREEMPT] = true,
+ .caps[MIGRATION_CAPABILITY_POSTCOPY_SETUP] = true,
+ },
+ };
+
+ test_postcopy_common(&args);
+}
+
void migration_test_add_postcopy(MigrationTestEnv *env)
{
migration_test_add_postcopy_smoke(env);
@@ -139,8 +190,12 @@ void migration_test_add_postcopy(MigrationTestEnv *env)
migration_test_add("/migration/multifd+postcopy/plain",
test_multifd_postcopy);
+ migration_test_add("/migration/multifd+postcopy/setup",
+ test_multifd_postcopy_setup);
migration_test_add("/migration/multifd+postcopy/preempt/plain",
test_multifd_postcopy_preempt);
+ migration_test_add("/migration/multifd+postcopy/preempt/setup",
+ test_multifd_postcopy_preempt_setup);
if (env->is_x86) {
migration_test_add("/migration/postcopy/suspend",
test_postcopy_suspend);
diff --git a/tests/qtest/migration/precopy-tests.c b/tests/qtest/migration/precopy-tests.c
index bb38292550..0e5e949cc3 100644
--- a/tests/qtest/migration/precopy-tests.c
+++ b/tests/qtest/migration/precopy-tests.c
@@ -1316,13 +1316,14 @@ void migration_test_add_precopy(MigrationTestEnv *env)
}
/* ensure new status don't go unnoticed */
- assert(MIGRATION_STATUS__MAX == 15);
+ assert(MIGRATION_STATUS__MAX == 16);
for (int i = MIGRATION_STATUS_NONE; i < MIGRATION_STATUS__MAX; i++) {
switch (i) {
case MIGRATION_STATUS_DEVICE: /* happens too fast */
case MIGRATION_STATUS_WAIT_UNPLUG: /* no support in tests */
case MIGRATION_STATUS_COLO: /* no support in tests */
+ case MIGRATION_STATUS_POSTCOPY_SETUP: /* happens too fast */
case MIGRATION_STATUS_POSTCOPY_ACTIVE: /* postcopy can't be cancelled */
case MIGRATION_STATUS_POSTCOPY_PAUSED:
case MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP:
--
2.50.1
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [RFC PATCH 2/4] migration: Fix state transition in postcopy_start() error handling
2025-08-07 11:49 ` [RFC PATCH 2/4] migration: Fix state transition in postcopy_start() error handling Juraj Marcin
@ 2025-08-07 20:54 ` Peter Xu
2025-08-08 9:44 ` Juraj Marcin
2025-08-08 19:08 ` Fabiano Rosas
0 siblings, 2 replies; 26+ messages in thread
From: Peter Xu @ 2025-08-07 20:54 UTC (permalink / raw)
To: Juraj Marcin
Cc: qemu-devel, Jiri Denemark, Stefan Weil, Paolo Bonzini,
Fabiano Rosas
On Thu, Aug 07, 2025 at 01:49:10PM +0200, Juraj Marcin wrote:
> From: Juraj Marcin <jmarcin@redhat.com>
>
> Depending on where an error during postcopy_start() happens, the state
> can be either "active", "device" or "cancelling", but never
> "postcopy-active". Migration state is transitioned to "postcopy-active"
> only just before a successful return from the function.
>
> Accept any state except "cancelling" when transitioning to "failed"
> state.
>
> Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
> ---
> migration/migration.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 10c216d25d..e5ce2940d5 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2872,8 +2872,9 @@ static int postcopy_start(MigrationState *ms, Error **errp)
> fail_closefb:
> qemu_fclose(fb);
> fail:
> - migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
> - MIGRATION_STATUS_FAILED);
> + if ( ms->state != MIGRATION_STATUS_CANCELLING) {
> + migrate_set_state(&ms->state, ms->state, MIGRATION_STATUS_FAILED);
> + }
Hmm, this might have been overlooked from my commit 48814111366b. Maybe
worth a Fixes and copy stable?
For example, I would expect the old code (prior of 48814111366b) still be
able to fail postcopy and resume src QEMU if qemu_savevm_send_packaged()
failed. Now, looks like it'll be stuck at "device" state..
The other thing is it also looks like a common pattern to set FAILED
meanwhile not messing with a CANCELLING stage. It's not easy to always
remember this, so maybe we should consider having a helper function?
migrate_set_failure(MigrationState *, Error *err);
Which could set err with migrate_set_error() (likely we could also
error_report() the error), and update FAILED iff it's not CANCELLING.
I saw three such occurrences where such a helper may apply, but worth double
check:
postcopy_start[2725] if (ms->state != MIGRATION_STATUS_CANCELLING) {
migration_completion[3069] if (s->state != MIGRATION_STATUS_CANCELLING) {
migration_connect[4064] if (s->state != MIGRATION_STATUS_CANCELLING) {
If the cleanup looks worthwhile, and if the Fixes apply, we could have the
cleanup patch on top of the fixes patch so patch 1 is easier to backport.
Thanks,
> migration_block_activate(NULL);
> migration_call_notifiers(ms, MIG_EVENT_PRECOPY_FAILED, NULL);
> bql_unlock();
> --
> 2.50.1
>
--
Peter Xu
* Re: [RFC PATCH 3/4] migration: Make listen thread joinable
2025-08-07 11:49 ` [RFC PATCH 3/4] migration: Make listen thread joinable Juraj Marcin
@ 2025-08-07 20:57 ` Peter Xu
2025-08-08 11:08 ` Juraj Marcin
0 siblings, 1 reply; 26+ messages in thread
From: Peter Xu @ 2025-08-07 20:57 UTC (permalink / raw)
To: Juraj Marcin
Cc: qemu-devel, Jiri Denemark, Stefan Weil, Paolo Bonzini,
Fabiano Rosas
On Thu, Aug 07, 2025 at 01:49:11PM +0200, Juraj Marcin wrote:
> From: Juraj Marcin <jmarcin@redhat.com>
>
> This patch allows joining the migration listen thread. This is done in
> preparation for the introduction of the "postcopy-setup" state at the
> beginning of a postcopy migration, during which the destination can fail
> gracefully and the source side can then resume to a running state.
>
> In case of such failure, to gracefully perform all cleanup in the main
> migration thread, we need to wait for the listen thread to exit, which
> can be done by joining it.
>
> Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
> ---
> migration/migration.c | 1 +
> migration/savevm.c | 2 +-
> 2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index e5ce2940d5..8174e811eb 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -901,6 +901,7 @@ process_incoming_migration_co(void *opaque)
> * Postcopy was started, cleanup should happen at the end of the
> * postcopy thread.
> */
> + qemu_thread_detach(&mis->listen_thread);
> trace_process_incoming_migration_co_postcopy_end_main();
> goto out;
> }
> diff --git a/migration/savevm.c b/migration/savevm.c
> index fabbeb296a..d2360be53c 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2217,7 +2217,7 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
> mis->have_listen_thread = true;
> postcopy_thread_create(mis, &mis->listen_thread,
> MIGRATION_THREAD_DST_LISTEN,
> - postcopy_ram_listen_thread, QEMU_THREAD_DETACHED);
> + postcopy_ram_listen_thread, QEMU_THREAD_JOINABLE);
This is good; I actually forgot it used to be detached..
Instead of proactively detach it above, could we always properly join it
(and hopefully every migration thread)? Then we could drop patch 1 too.
> trace_loadvm_postcopy_handle_listen("return");
>
> return 0;
> --
> 2.50.1
>
--
Peter Xu
* Re: [RFC PATCH 2/4] migration: Fix state transition in postcopy_start() error handling
2025-08-07 20:54 ` Peter Xu
@ 2025-08-08 9:44 ` Juraj Marcin
2025-08-08 16:00 ` Peter Xu
2025-08-08 19:08 ` Fabiano Rosas
1 sibling, 1 reply; 26+ messages in thread
From: Juraj Marcin @ 2025-08-08 9:44 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Jiri Denemark, Stefan Weil, Paolo Bonzini,
Fabiano Rosas
Hi Peter,
On 2025-08-07 16:54, Peter Xu wrote:
> On Thu, Aug 07, 2025 at 01:49:10PM +0200, Juraj Marcin wrote:
> > From: Juraj Marcin <jmarcin@redhat.com>
> >
> > Depending on where an error during postcopy_start() happens, the state
> > can be either "active", "device" or "cancelling", but never
> > "postcopy-active". Migration state is transitioned to "postcopy-active"
> > only just before a successful return from the function.
> >
> > Accept any state except "cancelling" when transitioning to "failed"
> > state.
> >
> > Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
> > ---
> > migration/migration.c | 5 +++--
> > 1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 10c216d25d..e5ce2940d5 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -2872,8 +2872,9 @@ static int postcopy_start(MigrationState *ms, Error **errp)
> > fail_closefb:
> > qemu_fclose(fb);
> > fail:
> > - migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
> > - MIGRATION_STATUS_FAILED);
> > + if ( ms->state != MIGRATION_STATUS_CANCELLING) {
> > + migrate_set_state(&ms->state, ms->state, MIGRATION_STATUS_FAILED);
> > + }
>
> Hmm, this might have been overlooked from my commit 48814111366b. Maybe
> worth a Fixes and copy stable?
yeah, it looks like it. POSTCOPY_ACTIVE state used to be set way sooner
before. I'll add Fixes tag to the patch.
>
> For example, I would expect the old code (prior of 48814111366b) still be
> able to fail postcopy and resume src QEMU if qemu_savevm_send_packaged()
> failed. Now, looks like it'll be stuck at "device" state..
>
> The other thing is it also looks like a common pattern to set FAILED
> meanwhile not messing with a CANCELLING stage. It's not easy to always
> remember this, so maybe we should consider having a helper function?
>
> migrate_set_failure(MigrationState *, Error *err);
>
> Which could set err with migrate_set_error() (likely we could also
> error_report() the error), and update FAILED iff it's not CANCELLING.
>
> I saw three such occurrences where such a helper may apply, but worth double
> check:
>
> postcopy_start[2725] if (ms->state != MIGRATION_STATUS_CANCELLING) {
> migration_completion[3069] if (s->state != MIGRATION_STATUS_CANCELLING) {
> migration_connect[4064] if (s->state != MIGRATION_STATUS_CANCELLING) {
>
> If the cleanup looks worthwhile, and if the Fixes apply, we could have the
> cleanup patch on top of the fixes patch so patch 1 is easier to backport.
Such a function could be useful. I could also send it with the above fix
together as a separate patchset, and send it also to stable.
Best regards
Juraj Marcin
>
> Thanks,
>
> > migration_block_activate(NULL);
> > migration_call_notifiers(ms, MIG_EVENT_PRECOPY_FAILED, NULL);
> > bql_unlock();
> > --
> > 2.50.1
> >
>
> --
> Peter Xu
>
* Re: [RFC PATCH 3/4] migration: Make listen thread joinable
2025-08-07 20:57 ` Peter Xu
@ 2025-08-08 11:08 ` Juraj Marcin
2025-08-08 17:05 ` Peter Xu
0 siblings, 1 reply; 26+ messages in thread
From: Juraj Marcin @ 2025-08-08 11:08 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Jiri Denemark, Stefan Weil, Paolo Bonzini,
Fabiano Rosas
Hi Peter,
On 2025-08-07 16:57, Peter Xu wrote:
> On Thu, Aug 07, 2025 at 01:49:11PM +0200, Juraj Marcin wrote:
> > From: Juraj Marcin <jmarcin@redhat.com>
> >
> > This patch allows joining the migration listen thread. This is done in
> > preparation for the introduction of the "postcopy-setup" state at the
> > beginning of a postcopy migration, during which the destination can fail
> > gracefully and the source side can then resume to a running state.
> >
> > In case of such failure, to gracefully perform all cleanup in the main
> > migration thread, we need to wait for the listen thread to exit, which
> > can be done by joining it.
> >
> > Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
> > ---
> > migration/migration.c | 1 +
> > migration/savevm.c | 2 +-
> > 2 files changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/migration/migration.c b/migration/migration.c
> > index e5ce2940d5..8174e811eb 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -901,6 +901,7 @@ process_incoming_migration_co(void *opaque)
> > * Postcopy was started, cleanup should happen at the end of the
> > * postcopy thread.
> > */
> > + qemu_thread_detach(&mis->listen_thread);
> > trace_process_incoming_migration_co_postcopy_end_main();
> > goto out;
> > }
> > diff --git a/migration/savevm.c b/migration/savevm.c
> > index fabbeb296a..d2360be53c 100644
> > --- a/migration/savevm.c
> > +++ b/migration/savevm.c
> > @@ -2217,7 +2217,7 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
> > mis->have_listen_thread = true;
> > postcopy_thread_create(mis, &mis->listen_thread,
> > MIGRATION_THREAD_DST_LISTEN,
> > - postcopy_ram_listen_thread, QEMU_THREAD_DETACHED);
> > + postcopy_ram_listen_thread, QEMU_THREAD_JOINABLE);
>
> This is good; I actually forgot it used to be detached..
>
> Instead of proactively detach it above, could we always properly join it
However, after the main thread finishes loading device state from the
package, process_incoming_migration_co() exits, and IIUC main thread is
then no longer occupied with migration. So, if we should instead join
the listen thread, we probably should yield the coroutine until the
listen thread can be joined, so we are not blocking the main thread?
> (and hopefully every migration thread)? Then we could drop patch 1 too.
If I haven't missed any, there are no detached migration threads except
listen and get dirty rate threads.
Thanks,
Juraj Marcin
>
> > trace_loadvm_postcopy_handle_listen("return");
> >
> > return 0;
> > --
> > 2.50.1
> >
>
> --
> Peter Xu
>
* Re: [RFC PATCH 2/4] migration: Fix state transition in postcopy_start() error handling
2025-08-08 9:44 ` Juraj Marcin
@ 2025-08-08 16:00 ` Peter Xu
0 siblings, 0 replies; 26+ messages in thread
From: Peter Xu @ 2025-08-08 16:00 UTC (permalink / raw)
To: Juraj Marcin
Cc: qemu-devel, Jiri Denemark, Stefan Weil, Paolo Bonzini,
Fabiano Rosas
On Fri, Aug 08, 2025 at 11:44:36AM +0200, Juraj Marcin wrote:
> Such function could be useful. I could also send it with the above fix
> together as a separate patchset, and send it also to stable.
Yep that works. The 2nd patch as a cleanup doesn't need to copy stable.
Thanks,
--
Peter Xu
* Re: [RFC PATCH 3/4] migration: Make listen thread joinable
2025-08-08 11:08 ` Juraj Marcin
@ 2025-08-08 17:05 ` Peter Xu
2025-08-11 13:02 ` Juraj Marcin
0 siblings, 1 reply; 26+ messages in thread
From: Peter Xu @ 2025-08-08 17:05 UTC (permalink / raw)
To: Juraj Marcin
Cc: qemu-devel, Jiri Denemark, Stefan Weil, Paolo Bonzini,
Fabiano Rosas
On Fri, Aug 08, 2025 at 01:08:39PM +0200, Juraj Marcin wrote:
> Hi Peter,
>
> On 2025-08-07 16:57, Peter Xu wrote:
> > On Thu, Aug 07, 2025 at 01:49:11PM +0200, Juraj Marcin wrote:
> > > From: Juraj Marcin <jmarcin@redhat.com>
> > >
> > > This patch allows joining the migration listen thread. This is done in
> > > preparation for the introduction of the "postcopy-setup" state at the
> > > beginning of a postcopy migration, during which the destination can fail
> > > gracefully and the source side can then resume to a running state.
> > >
> > > In case of such failure, to gracefully perform all cleanup in the main
> > > migration thread, we need to wait for the listen thread to exit, which
> > > can be done by joining it.
> > >
> > > Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
> > > ---
> > > migration/migration.c | 1 +
> > > migration/savevm.c | 2 +-
> > > 2 files changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/migration/migration.c b/migration/migration.c
> > > index e5ce2940d5..8174e811eb 100644
> > > --- a/migration/migration.c
> > > +++ b/migration/migration.c
> > > @@ -901,6 +901,7 @@ process_incoming_migration_co(void *opaque)
> > > * Postcopy was started, cleanup should happen at the end of the
> > > * postcopy thread.
> > > */
> > > + qemu_thread_detach(&mis->listen_thread);
> > > trace_process_incoming_migration_co_postcopy_end_main();
> > > goto out;
> > > }
> > > diff --git a/migration/savevm.c b/migration/savevm.c
> > > index fabbeb296a..d2360be53c 100644
> > > --- a/migration/savevm.c
> > > +++ b/migration/savevm.c
> > > @@ -2217,7 +2217,7 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
> > > mis->have_listen_thread = true;
> > > postcopy_thread_create(mis, &mis->listen_thread,
> > > MIGRATION_THREAD_DST_LISTEN,
> > > - postcopy_ram_listen_thread, QEMU_THREAD_DETACHED);
> > > + postcopy_ram_listen_thread, QEMU_THREAD_JOINABLE);
> >
> > This is good; I actually forgot it used to be detached..
> >
> > Instead of proactively detach it above, could we always properly join it
>
> However, after the main thread finishes loading device state from the
> package, process_incoming_migration_co() exits, and IIUC main thread is
> then no longer occupied with migration. So, if we should instead join
> the listen thread, we probably should yield the coroutine until the
> listen thread can be joined, so we are not blocking the main thread?
Or could we schedule a bottom half at the end of
postcopy_ram_listen_thread() to join itself? We could move something over
into the BH:
... join() ...
mis->have_listen_thread = false;
migration_incoming_state_destroy();
object_unref(OBJECT(migr));
>
> > (and hopefully every migration thread)? Then we could drop patch 1 too.
>
> If I haven't missed any, there are no detached migration threads except
> listen and get dirty rate threads.
Yep.
From mgmt pov, IMHO it's always good we create joinable threads. But we
can leave the calc_dirty_rate thread until necessary.
--
Peter Xu
* Re: [RFC PATCH 2/4] migration: Fix state transition in postcopy_start() error handling
2025-08-07 20:54 ` Peter Xu
2025-08-08 9:44 ` Juraj Marcin
@ 2025-08-08 19:08 ` Fabiano Rosas
2025-08-11 13:00 ` Juraj Marcin
1 sibling, 1 reply; 26+ messages in thread
From: Fabiano Rosas @ 2025-08-08 19:08 UTC (permalink / raw)
To: Peter Xu, Juraj Marcin
Cc: qemu-devel, Jiri Denemark, Stefan Weil, Paolo Bonzini
Peter Xu <peterx@redhat.com> writes:
> On Thu, Aug 07, 2025 at 01:49:10PM +0200, Juraj Marcin wrote:
>> From: Juraj Marcin <jmarcin@redhat.com>
>>
>> Depending on where an error during postcopy_start() happens, the state
>> can be either "active", "device" or "cancelling", but never
>> "postcopy-active". Migration state is transitioned to "postcopy-active"
>> only just before a successful return from the function.
>>
>> Accept any state except "cancelling" when transitioning to "failed"
>> state.
>>
>> Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
>> ---
>> migration/migration.c | 5 +++--
>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 10c216d25d..e5ce2940d5 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -2872,8 +2872,9 @@ static int postcopy_start(MigrationState *ms, Error **errp)
>> fail_closefb:
>> qemu_fclose(fb);
>> fail:
>> - migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
>> - MIGRATION_STATUS_FAILED);
>> + if ( ms->state != MIGRATION_STATUS_CANCELLING) {
>> + migrate_set_state(&ms->state, ms->state, MIGRATION_STATUS_FAILED);
>> + }
>
> Hmm, this might have been overlooked from my commit 48814111366b. Maybe
> worth a Fixes and copy stable?
>
> For example, I would expect the old code (prior of 48814111366b) still be
> able to fail postcopy and resume src QEMU if qemu_savevm_send_packaged()
> failed. Now, looks like it'll be stuck at "device" state..
>
> The other thing is it also looks like a common pattern to set FAILED
> meanwhile not messing with a CANCELLING stage. It's not easy to always
> remember this, so maybe we should consider having a helper function?
>
> migrate_set_failure(MigrationState *, Error *err);
>
We didn't do it back then because there would be some logical
conflict with this series:
https://lore.kernel.org/r/20250110100707.4805-1-shivam.kumar1@nutanix.com
But I don't remember the details. If it works this time I'm all for it.
> Which could set err with migrate_set_error() (likely we could also
> error_report() the error), and update FAILED iff it's not CANCELLING.
>
> I saw three such occurrences where such a helper may apply, but worth double
> check:
>
> postcopy_start[2725] if (ms->state != MIGRATION_STATUS_CANCELLING) {
> migration_completion[3069] if (s->state != MIGRATION_STATUS_CANCELLING) {
> migration_connect[4064] if (s->state != MIGRATION_STATUS_CANCELLING) {
>
> If the cleanup looks worthwhile, and if the Fixes apply, we could have the
> cleanup patch on top of the fixes patch so patch 1 is easier to backport.
>
> Thanks,
>
>> migration_block_activate(NULL);
>> migration_call_notifiers(ms, MIG_EVENT_PRECOPY_FAILED, NULL);
>> bql_unlock();
>> --
>> 2.50.1
>>
* Re: [RFC PATCH 2/4] migration: Fix state transition in postcopy_start() error handling
2025-08-08 19:08 ` Fabiano Rosas
@ 2025-08-11 13:00 ` Juraj Marcin
0 siblings, 0 replies; 26+ messages in thread
From: Juraj Marcin @ 2025-08-11 13:00 UTC (permalink / raw)
To: Fabiano Rosas
Cc: Peter Xu, qemu-devel, Jiri Denemark, Stefan Weil, Paolo Bonzini
Hi Fabiano
On 2025-08-08 16:08, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > On Thu, Aug 07, 2025 at 01:49:10PM +0200, Juraj Marcin wrote:
> >> From: Juraj Marcin <jmarcin@redhat.com>
> >>
> >> Depending on where an error during postcopy_start() happens, the state
> >> can be either "active", "device" or "cancelling", but never
> >> "postcopy-active". Migration state is transitioned to "postcopy-active"
> >> only just before a successful return from the function.
> >>
> >> Accept any state except "cancelling" when transitioning to "failed"
> >> state.
> >>
> >> Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
> >> ---
> >> migration/migration.c | 5 +++--
> >> 1 file changed, 3 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/migration/migration.c b/migration/migration.c
> >> index 10c216d25d..e5ce2940d5 100644
> >> --- a/migration/migration.c
> >> +++ b/migration/migration.c
> >> @@ -2872,8 +2872,9 @@ static int postcopy_start(MigrationState *ms, Error **errp)
> >> fail_closefb:
> >> qemu_fclose(fb);
> >> fail:
> >> - migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
> >> - MIGRATION_STATUS_FAILED);
> >> + if ( ms->state != MIGRATION_STATUS_CANCELLING) {
> >> + migrate_set_state(&ms->state, ms->state, MIGRATION_STATUS_FAILED);
> >> + }
> >
> > Hmm, this might have been overlooked from my commit 48814111366b. Maybe
> > worth a Fixes and copy stable?
> >
> > For example, I would expect the old code (prior of 48814111366b) still be
> > able to fail postcopy and resume src QEMU if qemu_savevm_send_packaged()
> > failed. Now, looks like it'll be stuck at "device" state..
> >
> > The other thing is it also looks like a common pattern to set FAILED
> > meanwhile not messing with a CANCELLING stage. It's not easy to always
> > remember this, so maybe we should consider having a helper function?
> >
> > migrate_set_failure(MigrationState *, Error *err);
> >
>
> We didn't do it back then because at there would be some logical
> conflict with this series:
>
> https://lore.kernel.org/r/20250110100707.4805-1-shivam.kumar1@nutanix.com
>
> But I don't remember the details. If it works this time I'm all for it.
Thanks! I will look into that.
Best regards,
Juraj Marcin
>
> > Which could set err with migrate_set_error() (likely we could also
> > error_report() the error), and update FAILED iff it's not CANCELLING.
> >
> > I saw three such occurrences where such a helper may apply, but worth double
> > check:
> >
> > postcopy_start[2725] if (ms->state != MIGRATION_STATUS_CANCELLING) {
> > migration_completion[3069] if (s->state != MIGRATION_STATUS_CANCELLING) {
> > migration_connect[4064] if (s->state != MIGRATION_STATUS_CANCELLING) {
> >
> > If the cleanup looks worthwhile, and if the Fixes apply, we could have the
> > cleanup patch on top of the fixes patch so patch 1 is easier to backport.
> >
> > Thanks,
> >
> >> migration_block_activate(NULL);
> >> migration_call_notifiers(ms, MIG_EVENT_PRECOPY_FAILED, NULL);
> >> bql_unlock();
> >> --
> >> 2.50.1
> >>
>
* Re: [RFC PATCH 3/4] migration: Make listen thread joinable
2025-08-08 17:05 ` Peter Xu
@ 2025-08-11 13:02 ` Juraj Marcin
0 siblings, 0 replies; 26+ messages in thread
From: Juraj Marcin @ 2025-08-11 13:02 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Jiri Denemark, Stefan Weil, Paolo Bonzini,
Fabiano Rosas
Hi Peter
On 2025-08-08 13:05, Peter Xu wrote:
> On Fri, Aug 08, 2025 at 01:08:39PM +0200, Juraj Marcin wrote:
> > Hi Peter,
> >
> > On 2025-08-07 16:57, Peter Xu wrote:
> > > On Thu, Aug 07, 2025 at 01:49:11PM +0200, Juraj Marcin wrote:
> > > > From: Juraj Marcin <jmarcin@redhat.com>
> > > >
> > > > This patch allows joining the migration listen thread. This is done in
> > > > preparation for the introduction of the "postcopy-setup" state at the
> > > > beginning of a postcopy migration, during which the destination can fail
> > > > gracefully and the source side can then resume to a running state.
> > > >
> > > > In case of such failure, to gracefully perform all cleanup in the main
> > > > migration thread, we need to wait for the listen thread to exit, which
> > > > can be done by joining it.
> > > >
> > > > Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
> > > > ---
> > > > migration/migration.c | 1 +
> > > > migration/savevm.c | 2 +-
> > > > 2 files changed, 2 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > index e5ce2940d5..8174e811eb 100644
> > > > --- a/migration/migration.c
> > > > +++ b/migration/migration.c
> > > > @@ -901,6 +901,7 @@ process_incoming_migration_co(void *opaque)
> > > > * Postcopy was started, cleanup should happen at the end of the
> > > > * postcopy thread.
> > > > */
> > > > + qemu_thread_detach(&mis->listen_thread);
> > > > trace_process_incoming_migration_co_postcopy_end_main();
> > > > goto out;
> > > > }
> > > > diff --git a/migration/savevm.c b/migration/savevm.c
> > > > index fabbeb296a..d2360be53c 100644
> > > > --- a/migration/savevm.c
> > > > +++ b/migration/savevm.c
> > > > @@ -2217,7 +2217,7 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
> > > > mis->have_listen_thread = true;
> > > > postcopy_thread_create(mis, &mis->listen_thread,
> > > > MIGRATION_THREAD_DST_LISTEN,
> > > > - postcopy_ram_listen_thread, QEMU_THREAD_DETACHED);
> > > > + postcopy_ram_listen_thread, QEMU_THREAD_JOINABLE);
> > >
> > > This is good; I actually forgot it used to be detached..
> > >
> > > Instead of proactively detach it above, could we always properly join it
> >
> > However, after the main thread finishes loading device state from the
> > package, process_incoming_migration_co() exits, and IIUC main thread is
> > then no longer occupied with migration. So, if we should instead join
> > the listen thread, we probably should yield the coroutine until the
> > listen thread can be joined, so we are not blocking the main thread?
>
> Or could we schedule a bottom half at the end of
> postcopy_ram_listen_thread() to join itself? We could move something over
> into the BH:
>
> ... join() ...
> mis->have_listen_thread = false;
> migration_incoming_state_destroy();
> object_unref(OBJECT(migr));
That sounds like a good idea, I will certainly try that, thank you.
Best regards,
Juraj Marcin
>
> >
> > > (and hopefully every migration thread)? Then we could drop patch 1 too.
> >
> > If I haven't missed any, there are no detached migration threads except
> > listen and get dirty rate threads.
>
> Yep.
>
> From mgmt pov, IMHO it's always good we create joinable threads. But we
> can leave the calc_dirty_rate thread until necessary.
>
> --
> Peter Xu
>
* Re: [RFC PATCH 0/4] migration: Introduce postcopy-setup capability and state
2025-08-07 11:49 [RFC PATCH 0/4] migration: Introduce postcopy-setup capability and state Juraj Marcin
` (3 preceding siblings ...)
2025-08-07 11:49 ` [RFC PATCH 4/4] migration: Introduce postcopy-setup capability and state Juraj Marcin
@ 2025-08-11 14:54 ` Peter Xu
2025-08-12 13:34 ` Juraj Marcin
4 siblings, 1 reply; 26+ messages in thread
From: Peter Xu @ 2025-08-11 14:54 UTC (permalink / raw)
To: Juraj Marcin
Cc: qemu-devel, Jiri Denemark, Stefan Weil, Paolo Bonzini,
Fabiano Rosas
[Sorry to respond late on the real meat of this series..]
On Thu, Aug 07, 2025 at 01:49:08PM +0200, Juraj Marcin wrote:
> When postcopy migration starts, the source side sends all
> non-postcopiable device data in one package command and immediately
> transitions to a "postcopy-active" state. However, if the destination
> side fails to load the device data or crashes during it, the source side
> stays paused indefinitely with no way of recovery.
>
> This series introduces a new "postcopy-setup" state during which the
> destination side is guaranteed to not have been started yet, so the source
> side can recover and resume, and the destination side can gracefully exit.
>
> Key element of this feature is isolating the postcopy-run command from
> non-postcopiable data and sending it only after the destination side
> acknowledges that it has loaded all devices and is ready to be started.
> This is necessary, as once the postcopy-run command is sent, the source
> side cannot be sure if the destination is running or not and if it can
> safely resume in case of a failure.
>
> Reusing existing ping/pong messages was also considered; PING 3 is right
> before the postcopy-run command, but there are two reasons why the PING
> 3 message might not be delivered to the source side:
>
> 1. destination machine failed, it is not running, and the source side
> can resume,
> 2. there is a network failure, so PING 3 delivery fails, but until
> TCP or other transport times out, the destination could process the
> postcopy-run command and start, in which case the source side cannot
> resume.
>
> Furthermore, this series contains two more patches required for the
> implementation of this feature, which make the listen thread joinable for
> graceful cleanup and detach it explicitly otherwise, and one patch
> fixing state transitions inside postcopy_start().
>
> Such (or similar) feature could be potentially useful also for normal
> (only precopy) migration with return-path, to prevent issues when
> network failure happens just as the destination side shuts the
> return-path. When I tested such scenario (by filtering out the SHUT
> command), the destination started and reported successful migration,
> while the source side reported failed migration and tried to resume, but
> exited as it failed to gain disk image file lock.
>
> Another suggestion from Peter, that I would like to discuss, is that
> instead of introducing a new state, we could move the boundary between
> "device" and "postcopy-active" states to when the postcopy-run command
> is actually sent (in this series boundary of "postcopy-setup" and
> "postcopy-active"), however, I am not sure if such change would not have
> any unwanted implications.
Yeah, after reading patch 4, I still want to check with you on whether this
is possible, on a simpler version of such a solution.
As we discussed offlist, looks like there's no perfect solution for
synchronizing between src <-> dst on this matter. No matter what is the
last message to be sent, either precopy has RP_SHUT, or relying on 3rd/4th
PONG, or RUN_ACK, it might get lost in a network failure.
IIUC it means we need to face the situation of split brain. The link can
simply be broken at any time. The ultimate result is still better when two
VMs will be halted when split brain, but then IMHO we'll need to justify
whether that complexity would be worthwhile for changing "both sides
active" -> "both sides halted" when it happened.
If we go back to the original request of why we decided to work on this: it
was more or less a feature parity request on postcopy against precopy, so
that when device states loading failed during switchover, postcopy can also
properly get cancelled rather than hanging. Precopy can do that, we wished
postcopy can do at least the same.
Could we still explore the simpler idea and better understand the gap
between the two? E.g. relying on the 3rd/4th PONG returned from the dest
QEMU to be the ACK message.
Something like:
- Start postcopy...
- Send the postcopy wholesale package (which includes e.g. whole device
states dump, PING-3, RUN), as before.
- Instead of going directly to POSTCOPY_ACTIVE, we stay in DEVICE, but we
start to allow iterations to resolve page faults while keep moving
pages.
- If...
- we received the 3rd PONG, we _assume_ the device states are loaded
successfully and the RUN must be processed, src QEMU moves to
POSTCOPY_ACTIVE.
- we noticed network failure before 3rd PONG, we _assume_ destination
failed to load or crashed, src QEMU fails the migration (DEVICE ->
FAILED) and try to restart VM on src.
This might be a much smaller change, and it might not need any change from
dest qemu or stream protocol.
It means, if it works (even if imperfect) it'll start to work for old VMs
too as long as they got migrated to the new QEMU, and we get this postcopy
parity feature asap instead of requesting user to cold-restart the VM with
a newer machine type.
Would this be a better possible trade-off?
Thanks,
--
Peter Xu
* Re: [RFC PATCH 0/4] migration: Introduce postcopy-setup capability and state
2025-08-11 14:54 ` [RFC PATCH 0/4] " Peter Xu
@ 2025-08-12 13:34 ` Juraj Marcin
2025-08-13 17:42 ` Peter Xu
0 siblings, 1 reply; 26+ messages in thread
From: Juraj Marcin @ 2025-08-12 13:34 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Jiri Denemark, Stefan Weil, Paolo Bonzini,
Fabiano Rosas
Hi Peter,
On 2025-08-11 10:54, Peter Xu wrote:
> [Sorry to respond late on the real meat of this series..]
>
> On Thu, Aug 07, 2025 at 01:49:08PM +0200, Juraj Marcin wrote:
> > When postcopy migration starts, the source side sends all
> > non-postcopiable device data in one package command and immediately
> > transitions to a "postcopy-active" state. However, if the destination
> > side fails to load the device data or crashes during it, the source side
> > stays paused indefinitely with no way of recovery.
> >
> > This series introduces a new "postcopy-setup" state during which the
> > destination side is guaranteed to not been started yet and, the source
> > side can recover and resume and the destination side gracefully exit.
> >
> > Key element of this feature is isolating the postcopy-run command from
> > non-postcopiable data and sending it only after the destination side
> > acknowledges, that it has loaded all devices and is ready to be started.
> > This is necessary, as once the postcopy-run command is sent, the source
> > side cannot be sure if the destination is running or not and if it can
> > safely resume in case of a failure.
> >
> > Reusing existing ping/pong messages was also considered, PING 3 is right
> > before the postcopy-run command, but there are two reasons why the PING
> > 3 message might not be delivered to the source side:
> >
> > 1. destination machine failed, it is not running, and the source side
> > can resume,
> > 2. there is a network failure, so PING 3 delivery fails, but until until
> > TCP or other transport times out, the destination could process the
> > postcopy-run command and start, in which case the source side cannot
> > resume.
> >
> > Furthermore, this series contains two more patches required for the
> > implementation of this feature, that make the listen thread joinable for
> > graceful cleanup and detach it explicitly otherwise, and one patch
> > fixing state transitions inside postcopy_start().
> >
> > Such (or similar) feature could be potentially useful also for normal
> > (only precopy) migration with return-path, to prevent issues when
> > network failure happens just as the destination side shuts the
> > return-path. When I tested such scenario (by filtering out the SHUT
> > command), the destination started and reported successful migration,
> > while the source side reported failed migration and tried to resume, but
> > exited as it failed to gain disk image file lock.
> >
> > Another suggestion from Peter, that I would like to discuss, is that
> > instead of introducing a new state, we could move the boundary between
> > "device" and "postcopy-active" states to when the postcopy-run command
> > is actually sent (in this series boundary of "postcopy-setup" and
> > "postcopy-active"), however, I am not sure if such change would not have
> > any unwanted implications.
>
> Yeah, after reading patch 4, I still want to check with you on whether this
> is possible, on a simpler version of such soluion.
>
> As we discussed offlist, looks like there's no perfect solution for
> synchronizing between src <-> dst on this matter. No matter what is the
> last message to be sent, either precopy has RP_SHUT, or relying on 3rd/4th
> PONG, or RUN_ACK, it might get lost in a network failure.
>
> IIUC it means we need to face the situation of split brain. The link can
> simply be broken at any time. The ultimate result is still better when two
> VMs will be halted when split brain, but then IMHO we'll need to justify
> whether that complexity would be worthwhile for changing "both sides
> active" -> "both sides halted" when it happened.
Yes, a network failure can indeed happen at any time, but that's the
decision we need to make: whether we can allow the possibility of two
machines running at the same time. And depending on that decision, one
solution is more complex than the other.
Right now, if the network fails during the device load and the destination
reaches the 3rd ping and postcopy-run, the destination machine would
start, but the source wouldn't. So to me, it looks a bit like a regression
if we introduce the possibility of two machines trying to start.
>
> If we go back to the original request of why we decided to work on this: it
> was more or less a feature parity request on postcopy against precopy, so
> that when device states loading failed during switchover, postcopy can also
> properly get cancelled rather than hanging. Precopy can do that, we wished
> postcopy can do at least the same.
>
> Could we still explore the simpler idea and understand better on the gap
> between the two? E.g. relying on the 3rd/4th PONG returned from the dest
> QEMU to be the ACK message.
>
> Something like:
>
> - Start postcopy...
>
> - Send the postcopy wholesale package (which includes e.g. whole device
> states dump, PING-3, RUN), as before.
>
> - Instead of going directly POSTCOPY_ACTIVE, we stay in DEVICE, but we
> start to allow iterations to resolve page faults while keep moving
> pages.
>
> - If...
>
> - we received the 3rd PONG, we _assume_ the device states are loaded
> successfully and the RUN must be processed, src QEMU moves to
> POSTCOPY_ACTIVE.
>
> - we noticed network failure before 3rd PONG, we _assume_ destination
> failed to load or crashed, src QEMU fails the migration (DEVICE ->
> FAILED) and try to restart VM on src.
>
> This might be a much smaller change, and it might not need any change from
> dest qemu or stream protocol.
I can test this idea; I think it should work and there should be no
problems as long as there are no network issues. However, there's also the
question of whether we want the destination side to exit gracefully if
there is some issue during device load that doesn't cause an immediate
crash. IIUC it would switch to POSTCOPY_PAUSED and then the management
application would need to kill it and restart the migration.
>
> It means, if it works (even if imperfect) it'll start to work for old VMs
> too as long as they got migrated to the new QEMU, and we get this postcopy
> parity feature asap instead of requesting user to cold-restart the VM with
> a newer machine type.
But are migration capabilities limited by machine types?

My understanding is that once a VM is migrated to the new QEMU it can
start using the capability even if it uses an older machine type. Then we
would be in the same situation: the feature is usable once we are
migrating from a newer QEMU instance.
>
> Would this be a better possible trade-off?
So, while such a solution would require fewer changes, to me it feels like
introducing a known regression if the network fails before the destination
reaches the 3rd ping message while processing the packaged command. But if
the probability of such a failure is so slim that it's not worth having
the more complex solution, I can move on with the simpler one.
Thanks,
Juraj Marcin
>
> Thanks,
>
> --
> Peter Xu
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC PATCH 0/4] migration: Introduce postcopy-setup capability and state
2025-08-12 13:34 ` Juraj Marcin
@ 2025-08-13 17:42 ` Peter Xu
2025-08-14 15:42 ` Juraj Marcin
0 siblings, 1 reply; 26+ messages in thread
From: Peter Xu @ 2025-08-13 17:42 UTC (permalink / raw)
To: Juraj Marcin
Cc: qemu-devel, Jiri Denemark, Stefan Weil, Paolo Bonzini,
Fabiano Rosas
On Tue, Aug 12, 2025 at 03:34:26PM +0200, Juraj Marcin wrote:
> Hi Peter
>
> On 2025-08-11 10:54, Peter Xu wrote:
> > [Sorry to respond late on the real meat of this series..]
> >
> > On Thu, Aug 07, 2025 at 01:49:08PM +0200, Juraj Marcin wrote:
> > > When postcopy migration starts, the source side sends all
> > > non-postcopiable device data in one package command and immediately
> > > transitions to a "postcopy-active" state. However, if the destination
> > > side fails to load the device data or crashes during it, the source side
> > > stays paused indefinitely with no way of recovery.
> > >
> > > This series introduces a new "postcopy-setup" state during which the
> > > destination side is guaranteed to not been started yet and, the source
> > > side can recover and resume and the destination side gracefully exit.
> > >
> > > Key element of this feature is isolating the postcopy-run command from
> > > non-postcopiable data and sending it only after the destination side
> > > acknowledges, that it has loaded all devices and is ready to be started.
> > > This is necessary, as once the postcopy-run command is sent, the source
> > > side cannot be sure if the destination is running or not and if it can
> > > safely resume in case of a failure.
> > >
> > > Reusing existing ping/pong messages was also considered, PING 3 is right
> > > before the postcopy-run command, but there are two reasons why the PING
> > > 3 message might not be delivered to the source side:
> > >
> > > 1. destination machine failed, it is not running, and the source side
> > > can resume,
> > > 2. there is a network failure, so PING 3 delivery fails, but until until
> > > TCP or other transport times out, the destination could process the
> > > postcopy-run command and start, in which case the source side cannot
> > > resume.
> > >
> > > Furthermore, this series contains two more patches required for the
> > > implementation of this feature, that make the listen thread joinable for
> > > graceful cleanup and detach it explicitly otherwise, and one patch
> > > fixing state transitions inside postcopy_start().
> > >
> > > Such (or similar) feature could be potentially useful also for normal
> > > (only precopy) migration with return-path, to prevent issues when
> > > network failure happens just as the destination side shuts the
> > > return-path. When I tested such scenario (by filtering out the SHUT
> > > command), the destination started and reported successful migration,
> > > while the source side reported failed migration and tried to resume, but
> > > exited as it failed to gain disk image file lock.
> > >
> > > Another suggestion from Peter, that I would like to discuss, is that
> > > instead of introducing a new state, we could move the boundary between
> > > "device" and "postcopy-active" states to when the postcopy-run command
> > > is actually sent (in this series boundary of "postcopy-setup" and
> > > "postcopy-active"), however, I am not sure if such change would not have
> > > any unwanted implications.
> >
> > Yeah, after reading patch 4, I still want to check with you on whether this
> > is possible, on a simpler version of such soluion.
> >
> > As we discussed offlist, looks like there's no perfect solution for
> > synchronizing between src <-> dst on this matter. No matter what is the
> > last message to be sent, either precopy has RP_SHUT, or relying on 3rd/4th
> > PONG, or RUN_ACK, it might get lost in a network failure.
> >
> > IIUC it means we need to face the situation of split brain. The link can
> > simply be broken at any time. The ultimate result is still better when two
> > VMs will be halted when split brain, but then IMHO we'll need to justify
> > whether that complexity would be worthwhile for changing "both sides
> > active" -> "both sides halted" when it happened.
>
> Yes, a network failure can indeed happen at any time, but that's the
> decision we need to make if we can allow the possibility of two machines
> running at the same time. And depending on that, one solution is more
> complex than the other.
>
> Right now if the network fails during the device load and the
> destination reaches 3rd ping and postcopy-run, the machine would start,
> but the source wouldn't. So to me, it looks a bit like a regression, if
> we introduce a possibility of two machines trying to start.
That's a fair point.

That said, I think trying to start the VMs on both sides is fine when we
have proper drive locks. That may help us to serialize things.

So.. we're talking about an extremely rare case of last-phase split brain
in migration, where we lost the one last ACK message, no matter what that
message is.
IMHO we have two scenarios:
(a) VM has no shared storage, all storage needs to be migrated using
block-mirror

In this case, starting both VMs in such an extremely rare case should,
IIUC, succeed on both sides, because the image files for the drives are
separate, hence the drive locks are separate too.

On src, it would see migration FAILED because of the network failure on
the last-phase ACK message. On dest, it sees SUCCEEDED (if it saw FAILED,
it would be a common "migration failure" case).
Then the mgmt (seeing src migration FAILED) should IMHO fall back to src.
After all, migration is almost always driven on the source.

Even if the dest QEMU is running, src contains all guest data, so IMHO
it's fine to kill the dest QEMU, and the real ownership should never have
switched over, because src QEMU's migration status was either ACTIVE or
FAILED, never COMPLETED (which should be the final mark for mgmt to switch
over the real ownership).

I'm not sure whether libvirt would do so already, but it sounds like the
right thing to do.
(b) With shared storage, either some drives or all drives are shared

In this case, src QEMU still sees migration FAILED, rolling back and
trying to restart the VM. Both sides will contend on the drive lock of
whatever drives are shared. Here, whoever takes the lock will be able to
run the VM. I assume both QEMUs take all locks in the same order, so an
ABBA deadlock can never happen. Hence there is no chance that both fail;
one must win on taking all the locks.
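The "whoever takes the lock wins" behaviour can be tried out with plain flock(): two open file descriptions on the same image file contend for the exclusive lock, and only one can hold it at a time. This is a stand-alone illustration under a simplifying assumption: QEMU's actual image locking uses fcntl() OFD locks, not flock().

```c
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Try to take an exclusive, non-blocking lock on the image file.
 * Returns the open fd (lock held) on success, or -1 if another open
 * file description already holds the lock. Illustration only: QEMU's
 * real image locking uses fcntl() OFD locks, not flock(). */
int try_take_drive_lock(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) {
        return -1;
    }
    if (flock(fd, LOCK_EX | LOCK_NB) != 0) {
        close(fd);   /* lost the contention */
        return -1;
    }
    return fd;       /* keep fd open to hold the lock */
}
```

Since flock() locks are tied to the open file description, even two open() calls in the same process contend with each other, which is enough to see the win/lose behaviour described above.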
If it's src that wins, it's perfect, because dest QEMU will fail the
migration_block_activate() in the last BH (process_incoming_migration_bh).
mgmt should see dest QEMU halted and src QEMU running with migration
FAILED. Killing dest would work.
If it's dest that wins, this can start to become more complicated.. Two
sub-conditions:

(b.1) precopy: it should be ok too, as long as src QEMU notices the lock
contention failure. Then, instead of trying to start the VM, it
should wait for mgmt to get involved [*]. In this case, we need to
assume migration completed, because only dest QEMU's RAM now matches
the shared storage.

(b.2) postcopy: it will similarly fail the drive activation on src.
Since for the same reason as above we must switch ownership to dest
QEMU now (dest QEMU's RAM is the sole source of truth that matches
the shared storage), we may need some way to force src QEMU back to
postcopy-active once more, so that src QEMU is able to resolve page
requests from dest. It's not easy because the stream is broken now,
so it needs to recover the postcopy first.
[*] NOTE: I think this part is still trivially broken.. as we don't detect
drive activation failure so far on src QEMU when migration fails.. We may
need some patch like this to make "trying to start two VMs on both sides"
safe:
From 8b69239e6a7980c34c20880b55f9dd5fd96779fd Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Wed, 13 Aug 2025 12:52:38 -0400
Subject: [PATCH] migration: Do not try to start VM if disk activation fails
If a rare split brain happens (e.g. dest QEMU started running somehow,
taking shared drive locks), src QEMU may not be able to activate the
drives anymore. In this case, src QEMU shouldn't start the VM or it might
crash the block layer later with something like:
bdrv_co_write_req_prepare: Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.
Meanwhile, src QEMU cannot try to continue either even if dest QEMU can
release the drive locks (e.g. by QMP "stop"). Because as long as dest QEMU
started running, it means dest QEMU's RAM is the only version that is
consistent with current status of the shared storage.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index 10c216d25d..3c01e78182 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3502,6 +3502,8 @@ static MigIterateState migration_iteration_run(MigrationState *s)
static void migration_iteration_finish(MigrationState *s)
{
+ Error *err = NULL;
+
bql_lock();
/*
@@ -3525,11 +3527,28 @@ static void migration_iteration_finish(MigrationState *s)
case MIGRATION_STATUS_FAILED:
case MIGRATION_STATUS_CANCELLED:
case MIGRATION_STATUS_CANCELLING:
- /*
- * Re-activate the block drives if they're inactivated. Note, COLO
- * shouldn't use block_active at all, so it should be no-op there.
- */
- migration_block_activate(NULL);
+ if (!migration_block_activate(&err)) {
+ /*
+ * Re-activate the block drives if they're inactivated.
+ *
+ * If it fails (e.g. in case of a split brain, where dest QEMU
+ * might have taken some of the drive locks and running!), do
+ * not start VM, instead wait for mgmt to decide the next step.
+ *
+ * If dest already started, it means dest QEMU should contain
+ * all the data it needs and it properly owns all the drive
+ * locks. Then even if src QEMU got a FAILED in migration, it
+ * normally should mean we should treat the migration as
+ * COMPLETED.
+ *
+ * NOTE: it's not safe anymore to start VM on src now even if
+ * dest would release the drive locks. It's because as long as
+ * dest started running then only dest QEMU's RAM is consistent
+ * with the shared storage.
+ */
+ error_free(err);
+ break;
+ }
if (runstate_is_live(s->vm_old_state)) {
if (!runstate_check(RUN_STATE_SHUTDOWN)) {
vm_start();
--
2.50.1
>
> >
> > If we go back to the original request of why we decided to work on this: it
> > was more or less a feature parity request on postcopy against precopy, so
> > that when device states loading failed during switchover, postcopy can also
> > properly get cancelled rather than hanging. Precopy can do that, we wished
> > postcopy can do at least the same.
> >
> > Could we still explore the simpler idea and understand better on the gap
> > between the two? E.g. relying on the 3rd/4th PONG returned from the dest
> > QEMU to be the ACK message.
> >
> > Something like:
> >
> > - Start postcopy...
> >
> > - Send the postcopy wholesale package (which includes e.g. whole device
> > states dump, PING-3, RUN), as before.
> >
> > - Instead of going directly POSTCOPY_ACTIVE, we stay in DEVICE, but we
> > start to allow iterations to resolve page faults while keep moving
> > pages.
> >
> > - If...
> >
> > - we received the 3rd PONG, we _assume_ the device states are loaded
> > successfully and the RUN must be processed, src QEMU moves to
> > POSTCOPY_ACTIVE.
> >
> > - we noticed network failure before 3rd PONG, we _assume_ destination
> > failed to load or crashed, src QEMU fails the migration (DEVICE ->
> > FAILED) and try to restart VM on src.
> >
> > This might be a much smaller change, and it might not need any change from
> > dest qemu or stream protocol.
>
> I can test this idea, but I think it should be working and there should
> be no problems if there are no network issues. However, then there's
> also a question if we want the destination side to exit gracefully if
> there is some issue during device load that doesn't cause immediate
> crash.
Maybe there's no perfect solution for cleanly shutting down the dest?

For example, even with the RUN_ACK that this series provides, if the
RUN_ACK is lost, dest QEMU is running and showing migration SUCCEEDED,
while src QEMU shows migration FAILED (as it didn't see the RUN_ACK). In
this case we'll still need mgmt involvement, right? Logically, we'll also
need special care to treat the migration as a SUCCESS even if on src the
migration shows FAILED.
> IUUC it would switch to POSTCOPY_PAUSED and then the management
> application would need to kill it and restart the migration.
It doesn't hugely matter to our discussion here.. but just to mention,
currently POSTCOPY_PAUSED is very special. In such a condition, we should
not kill either QEMU, because it means there is unique data on both sides.
We should suggest mgmt not kill any QEMU instance that is in the
POSTCOPY_PAUSED state.
>
> >
> > It means, if it works (even if imperfect) it'll start to work for old VMs
> > too as long as they got migrated to the new QEMU, and we get this postcopy
> > parity feature asap instead of requesting user to cold-restart the VM with
> > a newer machine type.
>
> But are migration capabilities limited to machine types?
>
> My understanding is that once VM is migrated to the new QEMU it can
> start using the capability even if it uses older machine type. Then we
> would be in the same situation, that the feature is usable once we are
> migrating from a newer QEMU instance.
You're right.
Though it also means that, when it's a cap, we need libvirt to be changed
too to enable it. That's not the best fit for this feature.

For this one, we definitely want it enabled whenever it can be. Likely it
can be a migration parameter, set OFF on old QEMUs and ON on new QEMUs.
Then it'll start to matter.

The best would be if we can enable it silently on the src QEMU by
default.. if possible. We can finish the discussion on the rest first.
>
> >
> > Would this be a better possible trade-off?
>
> So, while yes, such solution would require fewer changes, but to me, it
> feels like introducing a known regression if the network would fail
> before the destination reaches 3rd ping message while processing the
> packaged command. But in case the probability of such failure is so
> slim, that it's not worth to have the more complex solution, I can move
> on with the simpler one.
Right, I think that's the whole point; at least over the past years we
have never hit such a split-brain problem at the last message. I believe
we have much more severe issues than this one..
So yes, we have two issues, one rare issue ([Issue 1]), one extremely rare
issue ([Issue 2]):
[Issue 1] start postcopy fails when device load fails on dest
[Issue 2] split brain on the last ACK message, for either precopy or
postcopy, no matter what's the last ACK message
Let's say the possibility of Issue 1 is 1%. NOTE: this may not be as rare
as we thought: sometimes the dest kernel can have or miss some kernel
features, causing a KVM or vhost or ... feature diff, causing the VM load
on dest to fail during device load. Wherever we hit such migration failure
cases in precopy, they could also happen if we start to always enable
postcopy. It's just that we have rarely enabled postcopy in production, at
least until now, but I suspect that's not true anymore. Hence this request
to fix it.
The possibility of Issue 2 is... I would say, 0.001%? Or less..

I'm not sure whether we would fix Issue 2 at all. The fix might also need
to involve libvirt. But Issue 1 likely always has higher priority.
Thanks,
--
Peter Xu
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [RFC PATCH 0/4] migration: Introduce postcopy-setup capability and state
2025-08-13 17:42 ` Peter Xu
@ 2025-08-14 15:42 ` Juraj Marcin
2025-08-14 19:24 ` Peter Xu
0 siblings, 1 reply; 26+ messages in thread
From: Juraj Marcin @ 2025-08-14 15:42 UTC (permalink / raw)
To: Peter Xu, Jiri Denemark
Cc: qemu-devel, Stefan Weil, Paolo Bonzini, Fabiano Rosas
Hi Peter,
thank you very much for your answer!
On 2025-08-13 13:42, Peter Xu wrote:
> On Tue, Aug 12, 2025 at 03:34:26PM +0200, Juraj Marcin wrote:
> > Hi Peter
> >
> > On 2025-08-11 10:54, Peter Xu wrote:
> > > [Sorry to respond late on the real meat of this series..]
> > >
> > > On Thu, Aug 07, 2025 at 01:49:08PM +0200, Juraj Marcin wrote:
> > > > When postcopy migration starts, the source side sends all
> > > > non-postcopiable device data in one package command and immediately
> > > > transitions to a "postcopy-active" state. However, if the destination
> > > > side fails to load the device data or crashes during it, the source side
> > > > stays paused indefinitely with no way of recovery.
> > > >
> > > > This series introduces a new "postcopy-setup" state during which the
> > > > destination side is guaranteed to not been started yet and, the source
> > > > side can recover and resume and the destination side gracefully exit.
> > > >
> > > > Key element of this feature is isolating the postcopy-run command from
> > > > non-postcopiable data and sending it only after the destination side
> > > > acknowledges, that it has loaded all devices and is ready to be started.
> > > > This is necessary, as once the postcopy-run command is sent, the source
> > > > side cannot be sure if the destination is running or not and if it can
> > > > safely resume in case of a failure.
> > > >
> > > > Reusing existing ping/pong messages was also considered, PING 3 is right
> > > > before the postcopy-run command, but there are two reasons why the PING
> > > > 3 message might not be delivered to the source side:
> > > >
> > > > 1. destination machine failed, it is not running, and the source side
> > > > can resume,
> > > > 2. there is a network failure, so PING 3 delivery fails, but until until
> > > > TCP or other transport times out, the destination could process the
> > > > postcopy-run command and start, in which case the source side cannot
> > > > resume.
> > > >
> > > > Furthermore, this series contains two more patches required for the
> > > > implementation of this feature, that make the listen thread joinable for
> > > > graceful cleanup and detach it explicitly otherwise, and one patch
> > > > fixing state transitions inside postcopy_start().
> > > >
> > > > Such (or similar) feature could be potentially useful also for normal
> > > > (only precopy) migration with return-path, to prevent issues when
> > > > network failure happens just as the destination side shuts the
> > > > return-path. When I tested such scenario (by filtering out the SHUT
> > > > command), the destination started and reported successful migration,
> > > > while the source side reported failed migration and tried to resume, but
> > > > exited as it failed to gain disk image file lock.
> > > >
> > > > Another suggestion from Peter, that I would like to discuss, is that
> > > > instead of introducing a new state, we could move the boundary between
> > > > "device" and "postcopy-active" states to when the postcopy-run command
> > > > is actually sent (in this series boundary of "postcopy-setup" and
> > > > "postcopy-active"), however, I am not sure if such change would not have
> > > > any unwanted implications.
> > >
> > > Yeah, after reading patch 4, I still want to check with you on whether this
> > > is possible, on a simpler version of such soluion.
> > >
> > > As we discussed offlist, looks like there's no perfect solution for
> > > synchronizing between src <-> dst on this matter. No matter what is the
> > > last message to be sent, either precopy has RP_SHUT, or relying on 3rd/4th
> > > PONG, or RUN_ACK, it might get lost in a network failure.
> > >
> > > IIUC it means we need to face the situation of split brain. The link can
> > > simply be broken at any time. The ultimate result is still better when two
> > > VMs will be halted when split brain, but then IMHO we'll need to justify
> > > whether that complexity would be worthwhile for changing "both sides
> > > active" -> "both sides halted" when it happened.
> >
> > Yes, a network failure can indeed happen at any time, but that's the
> > decision we need to make if we can allow the possibility of two machines
> > running at the same time. And depending on that, one solution is more
> > complex than the other.
> >
> > Right now if the network fails during the device load and the
> > destination reaches 3rd ping and postcopy-run, the machine would start,
> > but the source wouldn't. So to me, it looks a bit like a regression, if
> > we introduce a possibility of two machines trying to start.
>
> That's a fair point.
>
> Said that, I think trying to start two VMs on both sides are fine when we
> have proper drive locks. That may help us to serialize things.
>
> So.. we're talking about an extremely rare case of last-phase split brain
> of migration, that we lost one last ACK message, no matter what is that
> message.
>
> IMHO we have two scenarios:
>
> (a) VM has no shared storage, all storages need to be migrated using
> block-mirror
>
> In this case, starting both VMs in such extremely rare case should, IIUC,
> succeed on both sides, because file locks for the drives are separate,
> hence the drive locks are separate too.
>
> On src, it would see migration FAILED because of the network failure on
> last-phase ACK message. On dest, it sees SUCCEEDED (as if it sees
> FAILED it'll be a common "migration failure" case).
>
> Then, the mgmt (while seeing src migration FAILED), should IMHO fallback
> to src. After all, migration is almost always driven on the source.
>
> Even if the dest QEMU is running, src contains all guest data so IMHO
> it's fine killing the dest QEMU, and the real ownership should have never
> switched over due to src QEMU migration status was either ACTIVE or
> FAILED, never COMPLETED (which should be the final mark for mgmt to
> switchover the real ownership).
>
> Even though I'm not sure whether libvirt would do so already, but it
> sounds the right thing to do.
Yeah, that sounds right. Maybe @jdenemar could enlighten us on whether and
how libvirt handles such a situation now, or what changes it would
require.
>
> (b) With shared storage, either some drives or all drives are shared
>
> In this case, src QEMU still sees migration FAILED, rolling back and
> trying to restart the VM. Both sides will contend on the drive lock of
> whatever drives are shared. Here, whoever takes the lock will be able to
> run the VM. Here I assumed both QEMUs take all locks in order so no ABBA
> deadlock might never happen. Hence, no chance that both fails, one must
> win on taking all the locks.
>
> If it's src that wins, it's perfect, because dest QEMU will fail the
> migration_block_activate() in last BH (process_incoming_migration_bh).
> mgmt should see dest QEMU halted, src QEMU started running with migration
> FAILED. Killing dest would work.
>
> If it's dest that wins, this can start to become more complicated.. Two
> sub-conditions:
>
> (b.1) precopy: it should be ok too as long as src QEMU will notice lock
> contention failed. Then instead of trying to start the VM, it
> should wait for mgmt to involve [*]. In this case, we need to
> assume migration completed because only dest QEMU's RAM now matches
> with the shared storages.
>
> (b.2) postcopy: it will similarly fail the drive activation on src.
> Since due to the same reason above we must switch ownership to dest
> QEMU now (dest QEMU's RAM is the solo source of truth that matches
> with the shared storage), we may need some way to enforce setting
> src QEMU to postcopy-active once more so src QEMU need to be able
> to resolve page requests from dest. It's not easy because the
> stream is broken now, so it needs to recover the postcopy first.
>
> [*] NOTE: I think this part is still trivially broken.. as we don't detect
> drive activation failure so far on src QEMU when migration fails.. We may
> need some patch like this to allow "trying to start two VMs on both sides"
> safe:
Maybe, in case disk activation fails on the source and postcopy was
already started, the source could fall back to postcopy-paused instead of
going into failure, so the postcopy can be recovered. This could solve the
problem of recovering the migration in case (b.2).
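Condensed, the suggested src-side behaviour on wind-down could look like the helper below. This is a hypothetical sketch under the assumptions discussed (precopy split brain means dest owns the disks; postcopy should stay recoverable); the names are made up and this is not QEMU's actual code.

```c
/* Hypothetical sketch of what src could do when a FAILED migration is
 * being wound down; illustrative names, not QEMU's real identifiers. */
typedef enum {
    SRC_RESTART_VM,      /* normal rollback: drives reactivated, run VM */
    SRC_WAIT_FOR_MGMT,   /* precopy split brain: dest owns the disks */
    SRC_POSTCOPY_PAUSED, /* postcopy: keep recovery possible (b.2) */
} SrcAction;

SrcAction src_action_after_failed_migration(int activation_ok, int in_postcopy)
{
    if (activation_ok) {
        /* Drive locks reacquired: safe to roll back and run on src. */
        return SRC_RESTART_VM;
    }
    /* Activation failed: dest likely holds the drive locks. In postcopy
     * keep the stream recoverable; in precopy defer to mgmt. */
    return in_postcopy ? SRC_POSTCOPY_PAUSED : SRC_WAIT_FOR_MGMT;
}
```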
>
> From 8b69239e6a7980c34c20880b55f9dd5fd96779fd Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Wed, 13 Aug 2025 12:52:38 -0400
> Subject: [PATCH] migration: Do not try to start VM if disk activation fails
>
> If a rare split brain happens (e.g. dest QEMU started running somehow,
> taking shared drive locks), src QEMU may not be able to activate the
> drives anymore. In this case, src QEMU shouldn't start the VM or it might
> crash the block layer later with something like:
>
> bdrv_co_write_req_prepare: Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.
>
> Meanwhile, src QEMU cannot try to continue either even if dest QEMU can
> release the drive locks (e.g. by QMP "stop"). Because as long as dest QEMU
> started running, it means dest QEMU's RAM is the only version that is
> consistent with current status of the shared storage.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/migration.c | 29 ++++++++++++++++++++++++-----
> 1 file changed, 24 insertions(+), 5 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 10c216d25d..3c01e78182 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3502,6 +3502,8 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>
> static void migration_iteration_finish(MigrationState *s)
> {
> + Error *err = NULL;
> +
> bql_lock();
>
> /*
> @@ -3525,11 +3527,28 @@ static void migration_iteration_finish(MigrationState *s)
> case MIGRATION_STATUS_FAILED:
> case MIGRATION_STATUS_CANCELLED:
> case MIGRATION_STATUS_CANCELLING:
> - /*
> - * Re-activate the block drives if they're inactivated. Note, COLO
> - * shouldn't use block_active at all, so it should be no-op there.
> - */
> - migration_block_activate(NULL);
> + if (!migration_block_activate(&err)) {
> + /*
> + * Re-activate the block drives if they're inactivated.
> + *
> + * If it fails (e.g. in case of a split brain, where dest QEMU
> + * might have taken some of the drive locks and running!), do
> + * not start VM, instead wait for mgmt to decide the next step.
> + *
> + * If dest already started, it means dest QEMU should contain
> + * all the data it needs and it properly owns all the drive
> + * locks. Then even if src QEMU got a FAILED in migration, it
> + * normally should mean we should treat the migration as
> + * COMPLETED.
> + *
> + * NOTE: it's not safe anymore to start VM on src now even if
> + * dest would release the drive locks. It's because as long as
> + * dest started running then only dest QEMU's RAM is consistent
> + * with the shared storage.
> + */
> + error_free(err);
> + break;
> + }
> if (runstate_is_live(s->vm_old_state)) {
> if (!runstate_check(RUN_STATE_SHUTDOWN)) {
> vm_start();
> --
> 2.50.1
>
>
> >
> > >
> > > If we go back to the original request of why we decided to work on this: it
> > > was more or less a feature parity request on postcopy against precopy, so
> > > that when device states loading failed during switchover, postcopy can also
> > > properly get cancelled rather than hanging. Precopy can do that, we wished
> > > postcopy can do at least the same.
> > >
> > > Could we still explore the simpler idea and understand better on the gap
> > > between the two? E.g. relying on the 3rd/4th PONG returned from the dest
> > > QEMU to be the ACK message.
> > >
> > > Something like:
> > >
> > > - Start postcopy...
> > >
> > > - Send the postcopy wholesale package (which includes e.g. whole device
> > > states dump, PING-3, RUN), as before.
> > >
> > > - Instead of going directly POSTCOPY_ACTIVE, we stay in DEVICE, but we
> > > start to allow iterations to resolve page faults while keep moving
> > > pages.
> > >
> > > - If...
> > >
> > > - we received the 3rd PONG, we _assume_ the device states are loaded
> > > successfully and the RUN must be processed, src QEMU moves to
> > > POSTCOPY_ACTIVE.
> > >
> > > - we noticed network failure before 3rd PONG, we _assume_ destination
> > > failed to load or crashed, src QEMU fails the migration (DEVICE ->
> > > FAILED) and try to restart VM on src.
> > >
> > > This might be a much smaller change, and it might not need any change from
> > > dest qemu or stream protocol.
> >
> > I can test this idea, but I think it should be working and there should
> > be no problems if there are no network issues. However, then there's
> > also a question if we want the destination side to exit gracefully if
> > there is some issue during device load that doesn't cause immediate
> > crash.
>
> Maybe there's no perfect solution for cleanly shutting down the dest?
>
> For example, even with RUN_ACK that this series provided, if RUN_ACK is
> lost, dest QEMU is also running on dest showing migration SUCCEED on dest,
> and src QEMU migration FAILED (as it didn't see the RUN_ACK). In this case
> we'll still need mgmt involvement, right? Logically, we'll also need
> special care to treat migration SUCCESS even if on src migration shows
> FAILED.
>
> > IUUC it would switch to POSTCOPY_PAUSED and then the management
> > application would need to kill it and restart the migration.
>
> Not hugely matters to our discussion here.. but just to mention, currently
> POSTCOPY_PAUSED is very special. When in such condition, we should not
> kill either side of QEMU because it means there're unique data on both
> sides. We should suggest mgmt not killing any qemu instance that is in
> POSTCOPY_PAUSED state.
I tested it, and if some device state load function returns an error
code, the destination goes to FAILED instead. It will not pause unless
the postcopy state is POSTCOPY_INCOMING_RUNNING, to which it transitions
when the destination starts.
>
> >
> > >
> > > It means, if it works (even if imperfect) it'll start to work for old VMs
> > > too as long as they got migrated to the new QEMU, and we get this postcopy
> > > parity feature asap instead of requesting user to cold-restart the VM with
> > > a newer machine type.
> >
> > But are migration capabilities limited to machine types?
> >
> > My understanding is that once VM is migrated to the new QEMU it can
> > start using the capability even if it uses older machine type. Then we
> > would be in the same situation, that the feature is usable once we are
> > migrating from a newer QEMU instance.
>
> You're right.
>
> Though it also means when it's a cap, we need to have libvirt changed too
> to enable it. It's not the best way to do for this feature.
>
> For this one, we definitely want it to be enabled whenever it can. Likely
> it can be a migration parameter and setting OFF on old qemus, ON on new
> QEMUs. Then it'll start to matter.
>
> The best is if we can enable it silently on src QEMU by default.. if
> possible. We can finish the discussion first on the rest.
>
> >
> > >
> > > Would this be a better possible trade-off?
> >
> > So, while yes, such solution would require fewer changes, but to me, it
> > feels like introducing a known regression if the network would fail
> > before the destination reaches 3rd ping message while processing the
> > packaged command. But in case the probability of such failure is so
> > slim, that it's not worth to have the more complex solution, I can move
> > on with the simpler one.
>
> Right, I think that's the whole point, at least for the past years we never
> hit such problem yet so far on split brain at the last message. I believe
> we have much more severe issues than this one..
>
> So yes, we have two issues, one rare issue ([Issue 1]), one extremely rare
> issue ([Issue 2]):
>
> [Issue 1] start postcopy fails when device load fails on dest
>
> [Issue 2] split brain on the last ACK message, for either precopy or
> postcopy, no matter what's the last ACK message
>
> Let's say, possibility of Issue 1 is 1%. NOTE: this may not be as rare as
> we thought: sometimes dest kernel can start or miss some kernel features,
> causing either KVM feature or vhost or ... feature diff, causing VM load on
> dest fail during device load. Whenever we hit such migration failure cases
> in precopy, it could happen if we start to always enable postcopy. It's
> just that we rarely enable postcopy at least until now in production, but I
> suspect it's not true anymore. Hence this request to fix it.
>
> Possibility of Issue 2 is... I would say, 0.001%? Or less..
>
> I'm not sure whether we would fix Issue 2 at all. The fix might need to
> also involve libvirt. But likely Issue 1 always has higher priority.
Fair point, I'll continue with the PING/PONG solution then; the first
implementation I have seems to resolve Issue 1.
For the rarer split brain, we'll rely on block device locks/mgmt to
resolve it, and change the failure handling so it registers errors from
disk activation.
As tested, there should be no problems with the destination
transitioning to POSTCOPY_PAUSED, since the VM has not been started yet.
However, to prevent the source side from transitioning to
POSTCOPY_PAUSED, I think adding a new state is still the best option.
I tried keeping the migration states as they are now and just relying
on a MigrationState attribute recording whether the 3rd PONG was
received; however, this collides with (at least) the migrate_pause
tests, which wait for POSTCOPY_ACTIVE and then pause the migration,
triggering the source to resume. We could maybe work around it by
waiting for the 3rd PONG instead, but I am not sure that is possible
from the tests, or by not resuming when the migrate_pause command is
executed.
I also tried extending the span of the DEVICE state, but some functions
behave differently depending on whether they are in postcopy or not,
using the migration_in_postcopy() function, and adding DEVICE there
isn't working either. Treating the DEVICE state sometimes as postcopy
and sometimes as not seems just too messy, if it is even possible.
Thank you!
Juraj Marcin
>
> Thanks,
>
> --
> Peter Xu
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC PATCH 0/4] migration: Introduce postcopy-setup capability and state
2025-08-14 15:42 ` Juraj Marcin
@ 2025-08-14 19:24 ` Peter Xu
2025-08-15 6:35 ` Juraj Marcin
2025-09-01 17:57 ` Dr. David Alan Gilbert
0 siblings, 2 replies; 26+ messages in thread
From: Peter Xu @ 2025-08-14 19:24 UTC (permalink / raw)
To: Juraj Marcin
Cc: Jiri Denemark, qemu-devel, Stefan Weil, Paolo Bonzini,
Fabiano Rosas
On Thu, Aug 14, 2025 at 05:42:23PM +0200, Juraj Marcin wrote:
> Fair point, I'll then continue with the PING/PONG solution, the first
> implementation I have seems to be working to resolve Issue 1.
>
> For rarer split brain, we'll rely on block device locks/mgmt to resolve
> and change the failure handling, so it registers errors from disk
> activation.
>
> As tested, there should be no problems with the destination
> transitioning to POSTCOPY_PAUSED, since the VM was not started yet.
>
> However, to prevent the source side from transitioning to
> POSTCOPY_PAUSED, I think adding a new state is still the best option.
>
> I tried keeping the migration states as they are now and just rely on an
> attribute of MigrationState if 3rd PONG was received, however, this
> collides with (at least) migrate_pause tests, that are waiting for
> POSTCOPY_ACTIVE, and then pause the migration triggering the source to
> resume. We could maybe work around it by waiting for the 3rd pong
> instead, but I am not sure if it is possible from tests, or by not
> resuming if migrate_pause command is executed?
>
> I also tried extending the span of the DEVICE state, but some functions
> behave differently depending on if they are in postcopy or not, using
> the migration_in_postcopy() function, but adding the DEVICE there isn't
> working either. And treating the DEVICE state sometimes as postcopy and
> sometimes as not seems just too messy, if it would even be possible.
Yeah, it might indeed be a bit messy.
Is it possible to find a middle ground? E.g. add a postcopy-setup
status, but without any new knob to enable it? Just to describe the
period of time where dest QEMU hasn't started running but has started
loading device states.
The hope is libvirt (which, AFAIU, always enables the "events" capability)
can ignore the new postcopy-setup status transition; then maybe we can
introduce postcopy-setup and make it always appear.
Thanks,
--
Peter Xu
* Re: [RFC PATCH 0/4] migration: Introduce postcopy-setup capability and state
2025-08-14 19:24 ` Peter Xu
@ 2025-08-15 6:35 ` Juraj Marcin
2025-09-01 17:57 ` Dr. David Alan Gilbert
1 sibling, 0 replies; 26+ messages in thread
From: Juraj Marcin @ 2025-08-15 6:35 UTC (permalink / raw)
To: Peter Xu
Cc: Jiri Denemark, qemu-devel, Stefan Weil, Paolo Bonzini,
Fabiano Rosas
On 2025-08-14 15:24, Peter Xu wrote:
> On Thu, Aug 14, 2025 at 05:42:23PM +0200, Juraj Marcin wrote:
> > Fair point, I'll then continue with the PING/PONG solution, the first
> > implementation I have seems to be working to resolve Issue 1.
> >
> > For rarer split brain, we'll rely on block device locks/mgmt to resolve
> > and change the failure handling, so it registers errors from disk
> > activation.
> >
> > As tested, there should be no problems with the destination
> > transitioning to POSTCOPY_PAUSED, since the VM was not started yet.
> >
> > However, to prevent the source side from transitioning to
> > POSTCOPY_PAUSED, I think adding a new state is still the best option.
> >
> > I tried keeping the migration states as they are now and just rely on an
> > attribute of MigrationState if 3rd PONG was received, however, this
> > collides with (at least) migrate_pause tests, that are waiting for
> > POSTCOPY_ACTIVE, and then pause the migration triggering the source to
> > resume. We could maybe work around it by waiting for the 3rd pong
> > instead, but I am not sure if it is possible from tests, or by not
> > resuming if migrate_pause command is executed?
> >
> > I also tried extending the span of the DEVICE state, but some functions
> > behave differently depending on if they are in postcopy or not, using
> > the migration_in_postcopy() function, but adding the DEVICE there isn't
> > working either. And treating the DEVICE state sometimes as postcopy and
> > sometimes as not seems just too messy, if it would even be possible.
>
> Yeah, it might indeed be a bit messy.
>
> Is it possible to find a middle ground? E.g. add postcopy-setup status,
> but without any new knob to enable it? Just to describe the period of time
> where dest QEMU haven't started running but started loading device states.
Yes, as the ping/pong solution doesn't require any changes in the
protocol, there's no need for a new capability and the new state can
always be used.
>
> The hope is libvirt (which, AFAIU, always enables the "events" capability)
> can ignore the new postcopy-setup status transition, then maybe we can also
> introduce the postcopy-setup and make it always appear.
>
> Thanks,
>
> --
> Peter Xu
>
* Re: [RFC PATCH 1/4] qemu-thread: Introduce qemu_thread_detach()
2025-08-07 11:49 ` [RFC PATCH 1/4] qemu-thread: Introduce qemu_thread_detach() Juraj Marcin
@ 2025-08-19 10:37 ` Daniel P. Berrangé
0 siblings, 0 replies; 26+ messages in thread
From: Daniel P. Berrangé @ 2025-08-19 10:37 UTC (permalink / raw)
To: Juraj Marcin
Cc: qemu-devel, Jiri Denemark, Stefan Weil, Paolo Bonzini, Peter Xu,
Fabiano Rosas
On Thu, Aug 07, 2025 at 01:49:09PM +0200, Juraj Marcin wrote:
> From: Juraj Marcin <jmarcin@redhat.com>
>
> Currently, QEMU threads abstraction supports both joinable and detached
> threads, but once a thread is marked as joinable it must be joined using
> qemu_thread_join() and cannot be detached later.
IMHO it is a good thing to avoid the concept of turning a joinable
thread into a detached thread at runtime. Such a change makes it
harder to reason about the correctness of the code, as you need to
fully understand the global picture of what code runs at each
phase of the thread's life, to decide if you need to join or not.
So I'd really encourage looking at whether the migration code can
be made to *always* join, rather than mixing joinable/detached for
the same thread.
>
> For POSIX implementation, pthread_detach() is used. For Windows, marking
> the thread as detached and releasing critical section is enough as
> thread handle is released by qemu_thread_create().
>
> Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
> ---
> include/qemu/thread.h | 1 +
> util/qemu-thread-posix.c | 8 ++++++++
> util/qemu-thread-win32.c | 10 ++++++++++
> 3 files changed, 19 insertions(+)
>
> diff --git a/include/qemu/thread.h b/include/qemu/thread.h
> index f0302ed01f..8a6d1ba98e 100644
> --- a/include/qemu/thread.h
> +++ b/include/qemu/thread.h
> @@ -212,6 +212,7 @@ int qemu_thread_set_affinity(QemuThread *thread, unsigned long *host_cpus,
> int qemu_thread_get_affinity(QemuThread *thread, unsigned long **host_cpus,
> unsigned long *nbits);
> void *qemu_thread_join(QemuThread *thread);
> +void qemu_thread_detach(QemuThread *thread);
> void qemu_thread_get_self(QemuThread *thread);
> bool qemu_thread_is_self(QemuThread *thread);
> G_NORETURN void qemu_thread_exit(void *retval);
> diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
> index ba725444ba..20442456b5 100644
> --- a/util/qemu-thread-posix.c
> +++ b/util/qemu-thread-posix.c
> @@ -536,3 +536,11 @@ void *qemu_thread_join(QemuThread *thread)
> }
> return ret;
> }
> +
> +void qemu_thread_detach(QemuThread *thread)
> +{
> + int err = pthread_detach(thread->thread);
> + if (err) {
> + error_exit(err, __func__);
> + }
> +}
> diff --git a/util/qemu-thread-win32.c b/util/qemu-thread-win32.c
> index ca2e0b512e..bdfb7b4aee 100644
> --- a/util/qemu-thread-win32.c
> +++ b/util/qemu-thread-win32.c
> @@ -328,6 +328,16 @@ void *qemu_thread_join(QemuThread *thread)
> return ret;
> }
>
> +void qemu_thread_detach(QemuThread *thread)
> +{
> + QemuThreadData *data = thread->data;
> +
> + if (data->mode == QEMU_THREAD_JOINABLE) {
> + data->mode = QEMU_THREAD_DETACHED;
> + DeleteCriticalSection(&data->cs);
> + }
> +}
> +
> static bool set_thread_description(HANDLE h, const char *name)
> {
> HRESULT hr;
> --
> 2.50.1
>
>
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* Re: [RFC PATCH 0/4] migration: Introduce postcopy-setup capability and state
2025-08-14 19:24 ` Peter Xu
2025-08-15 6:35 ` Juraj Marcin
@ 2025-09-01 17:57 ` Dr. David Alan Gilbert
2025-09-02 8:30 ` Juraj Marcin
1 sibling, 1 reply; 26+ messages in thread
From: Dr. David Alan Gilbert @ 2025-09-01 17:57 UTC (permalink / raw)
To: Peter Xu
Cc: Juraj Marcin, Jiri Denemark, qemu-devel, Stefan Weil,
Paolo Bonzini, Fabiano Rosas
* Peter Xu (peterx@redhat.com) wrote:
> On Thu, Aug 14, 2025 at 05:42:23PM +0200, Juraj Marcin wrote:
> > Fair point, I'll then continue with the PING/PONG solution, the first
> > implementation I have seems to be working to resolve Issue 1.
> >
> > For rarer split brain, we'll rely on block device locks/mgmt to resolve
> > and change the failure handling, so it registers errors from disk
> > activation.
> >
> > As tested, there should be no problems with the destination
> > transitioning to POSTCOPY_PAUSED, since the VM was not started yet.
> >
> > However, to prevent the source side from transitioning to
> > POSTCOPY_PAUSED, I think adding a new state is still the best option.
> >
> > I tried keeping the migration states as they are now and just rely on an
> > attribute of MigrationState if 3rd PONG was received, however, this
> > collides with (at least) migrate_pause tests, that are waiting for
> > POSTCOPY_ACTIVE, and then pause the migration triggering the source to
> > resume. We could maybe work around it by waiting for the 3rd pong
> > instead, but I am not sure if it is possible from tests, or by not
> > resuming if migrate_pause command is executed?
> >
> > I also tried extending the span of the DEVICE state, but some functions
> > behave differently depending on if they are in postcopy or not, using
> > the migration_in_postcopy() function, but adding the DEVICE there isn't
> > working either. And treating the DEVICE state sometimes as postcopy and
> > sometimes as not seems just too messy, if it would even be possible.
>
> Yeah, it might indeed be a bit messy.
>
> Is it possible to find a middle ground? E.g. add postcopy-setup status,
> but without any new knob to enable it? Just to describe the period of time
> where dest QEMU haven't started running but started loading device states.
>
> The hope is libvirt (which, AFAIU, always enables the "events" capability)
> can ignore the new postcopy-setup status transition, then maybe we can also
> introduce the postcopy-setup and make it always appear.
When the destination is started with '-S' (autostart=false), which is what
I think libvirt does, doesn't management only start the destination
after a certain useful event?
In other words, is there an event we already emit to say that the destination
has finished loading the postcopy devices, or could we just add that
event, so that management could just wait for that before issuing
the continue?
Dave
> Thanks,
>
> --
> Peter Xu
>
>
--
-----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert | Running GNU/Linux | Happy \
\ dave @ treblig.org | | In Hex /
\ _________________________|_____ http://www.treblig.org |_______/
* Re: [RFC PATCH 0/4] migration: Introduce postcopy-setup capability and state
2025-09-01 17:57 ` Dr. David Alan Gilbert
@ 2025-09-02 8:30 ` Juraj Marcin
2025-09-03 12:00 ` Dr. David Alan Gilbert
0 siblings, 1 reply; 26+ messages in thread
From: Juraj Marcin @ 2025-09-02 8:30 UTC (permalink / raw)
To: Dr. David Alan Gilbert
Cc: Peter Xu, Jiri Denemark, qemu-devel, Stefan Weil, Paolo Bonzini,
Fabiano Rosas
Hi Dave,
On 2025-09-01 17:57, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Thu, Aug 14, 2025 at 05:42:23PM +0200, Juraj Marcin wrote:
> > > Fair point, I'll then continue with the PING/PONG solution, the first
> > > implementation I have seems to be working to resolve Issue 1.
> > >
> > > For rarer split brain, we'll rely on block device locks/mgmt to resolve
> > > and change the failure handling, so it registers errors from disk
> > > activation.
> > >
> > > As tested, there should be no problems with the destination
> > > transitioning to POSTCOPY_PAUSED, since the VM was not started yet.
> > >
> > > However, to prevent the source side from transitioning to
> > > POSTCOPY_PAUSED, I think adding a new state is still the best option.
> > >
> > > I tried keeping the migration states as they are now and just rely on an
> > > attribute of MigrationState if 3rd PONG was received, however, this
> > > collides with (at least) migrate_pause tests, that are waiting for
> > > POSTCOPY_ACTIVE, and then pause the migration triggering the source to
> > > resume. We could maybe work around it by waiting for the 3rd pong
> > > instead, but I am not sure if it is possible from tests, or by not
> > > resuming if migrate_pause command is executed?
> > >
> > > I also tried extending the span of the DEVICE state, but some functions
> > > behave differently depending on if they are in postcopy or not, using
> > > the migration_in_postcopy() function, but adding the DEVICE there isn't
> > > working either. And treating the DEVICE state sometimes as postcopy and
> > > sometimes as not seems just too messy, if it would even be possible.
> >
> > Yeah, it might indeed be a bit messy.
> >
> > Is it possible to find a middle ground? E.g. add postcopy-setup status,
> > but without any new knob to enable it? Just to describe the period of time
> > where dest QEMU haven't started running but started loading device states.
> >
> > The hope is libvirt (which, AFAIU, always enables the "events" capability)
> > can ignore the new postcopy-setup status transition, then maybe we can also
> > introduce the postcopy-setup and make it always appear.
>
> When the destination is started with '-S' (autostart=false), which is what
> I think libvirt does, doesn't management only start the destination
> after a certain useful event?
> In other words, is there an event we already emit to say that the destination
> has finished loading the postcopy devices, or could we just add that
> event, so that management could just wait for that before issuing
> the continue?
I am not aware of any such event on the destination side. When postcopy
(and its switchover) starts, the destination transitions from ACTIVE
directly to POSTCOPY_ACTIVE in the listen thread, while devices are
loaded concurrently by the main thread.
There is a DEVICE state, but that is used only on the source side while
the device state is being collected. When the device state is being
loaded on the destination, the source side is already in the
POSTCOPY_ACTIVE state.
Best regards,
Juraj Marcin
>
> Dave
>
> > Thanks,
> >
> > --
> > Peter Xu
> >
> >
> --
> -----Open up your eyes, open up your mind, open up your code -------
> / Dr. David Alan Gilbert | Running GNU/Linux | Happy \
> \ dave @ treblig.org | | In Hex /
> \ _________________________|_____ http://www.treblig.org |_______/
>
* Re: [RFC PATCH 0/4] migration: Introduce postcopy-setup capability and state
2025-09-02 8:30 ` Juraj Marcin
@ 2025-09-03 12:00 ` Dr. David Alan Gilbert
2025-09-03 13:07 ` Peter Xu
2025-09-04 16:11 ` Juraj Marcin
0 siblings, 2 replies; 26+ messages in thread
From: Dr. David Alan Gilbert @ 2025-09-03 12:00 UTC (permalink / raw)
To: Juraj Marcin
Cc: Peter Xu, Jiri Denemark, qemu-devel, Stefan Weil, Paolo Bonzini,
Fabiano Rosas
* Juraj Marcin (jmarcin@redhat.com) wrote:
> Hi Dave,
>
> On 2025-09-01 17:57, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > On Thu, Aug 14, 2025 at 05:42:23PM +0200, Juraj Marcin wrote:
> > > > Fair point, I'll then continue with the PING/PONG solution, the first
> > > > implementation I have seems to be working to resolve Issue 1.
> > > >
> > > > For rarer split brain, we'll rely on block device locks/mgmt to resolve
> > > > and change the failure handling, so it registers errors from disk
> > > > activation.
> > > >
> > > > As tested, there should be no problems with the destination
> > > > transitioning to POSTCOPY_PAUSED, since the VM was not started yet.
> > > >
> > > > However, to prevent the source side from transitioning to
> > > > POSTCOPY_PAUSED, I think adding a new state is still the best option.
> > > >
> > > > I tried keeping the migration states as they are now and just rely on an
> > > > attribute of MigrationState if 3rd PONG was received, however, this
> > > > collides with (at least) migrate_pause tests, that are waiting for
> > > > POSTCOPY_ACTIVE, and then pause the migration triggering the source to
> > > > resume. We could maybe work around it by waiting for the 3rd pong
> > > > instead, but I am not sure if it is possible from tests, or by not
> > > > resuming if migrate_pause command is executed?
> > > >
> > > > I also tried extending the span of the DEVICE state, but some functions
> > > > behave differently depending on if they are in postcopy or not, using
> > > > the migration_in_postcopy() function, but adding the DEVICE there isn't
> > > > working either. And treating the DEVICE state sometimes as postcopy and
> > > > sometimes as not seems just too messy, if it would even be possible.
> > >
> > > Yeah, it might indeed be a bit messy.
> > >
> > > Is it possible to find a middle ground? E.g. add postcopy-setup status,
> > > but without any new knob to enable it? Just to describe the period of time
> > > where dest QEMU haven't started running but started loading device states.
> > >
> > > The hope is libvirt (which, AFAIU, always enables the "events" capability)
> > > can ignore the new postcopy-setup status transition, then maybe we can also
> > > introduce the postcopy-setup and make it always appear.
> >
> > When the destination is started with '-S' (autostart=false), which is what
> > I think libvirt does, doesn't management only start the destination
> > after a certain useful event?
> > In other words, is there an event we already emit to say that the destination
> > has finished loading the postcopy devices, or could we just add that
> > event, so that management could just wait for that before issuing
> > the continue?
>
> I am not aware of any such event on the destination side. When postcopy
> (and its switchover) starts, the destination transitions from ACTIVE
> directly to POSTCOPY_ACTIVE in the listen thread while devices are
> loaded concurrently by the main thread.
>
> There is DEVICE state on the source side, but that is used only on the
> source side when device state is being collected. When device state is
> being loaded on the destination, the source side is also already in
> POSTCOPY_ACTIVE state.
So I wonder what libvirt uses to trigger starting the destination in
the postcopy case? It has to be after the device state has loaded.
Dave
> Best regards,
>
> Juraj Marcin
>
> >
> > Dave
> >
> > > Thanks,
> > >
> > > --
> > > Peter Xu
> > >
> > >
> > --
> > -----Open up your eyes, open up your mind, open up your code -------
> > / Dr. David Alan Gilbert | Running GNU/Linux | Happy \
> > \ dave @ treblig.org | | In Hex /
> > \ _________________________|_____ http://www.treblig.org |_______/
> >
>
>
--
-----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert | Running GNU/Linux | Happy \
\ dave @ treblig.org | | In Hex /
\ _________________________|_____ http://www.treblig.org |_______/
* Re: [RFC PATCH 0/4] migration: Introduce postcopy-setup capability and state
2025-09-03 12:00 ` Dr. David Alan Gilbert
@ 2025-09-03 13:07 ` Peter Xu
2025-09-04 16:11 ` Juraj Marcin
1 sibling, 0 replies; 26+ messages in thread
From: Peter Xu @ 2025-09-03 13:07 UTC (permalink / raw)
To: Dr. David Alan Gilbert
Cc: Juraj Marcin, Jiri Denemark, qemu-devel, Stefan Weil,
Paolo Bonzini, Fabiano Rosas
On Wed, Sep 03, 2025 at 12:00:11PM +0000, Dr. David Alan Gilbert wrote:
> So I wonder what libvirt uses to trigger it starting the destination in
> the postcopy case? It's got to be after the device state has loaded.
qmp_cont() supports the "autostart" variable:
if (runstate_check(RUN_STATE_INMIGRATE)) {
autostart = 1;
} else {
That's since commit 1e9981465f ("qmp: handle stop/cont in INMIGRATE
state"). The commit message also mentions libvirt used to use a loop
somehow.. and I'm surprised to learn it wasn't trying to fix the libvirt
problem but something else..
That makes sense, as any delay on cont would be accounted as downtime
(even if trivially) if it were only executed after a loading-complete
event.
--
Peter Xu
* Re: [RFC PATCH 0/4] migration: Introduce postcopy-setup capability and state
2025-09-03 12:00 ` Dr. David Alan Gilbert
2025-09-03 13:07 ` Peter Xu
@ 2025-09-04 16:11 ` Juraj Marcin
1 sibling, 0 replies; 26+ messages in thread
From: Juraj Marcin @ 2025-09-04 16:11 UTC (permalink / raw)
To: Dr. David Alan Gilbert
Cc: Peter Xu, Jiri Denemark, qemu-devel, Stefan Weil, Paolo Bonzini,
Fabiano Rosas
Hello Dave,
On 2025-09-03 12:00, Dr. David Alan Gilbert wrote:
> * Juraj Marcin (jmarcin@redhat.com) wrote:
> > Hi Dave,
> >
> > On 2025-09-01 17:57, Dr. David Alan Gilbert wrote:
> > > * Peter Xu (peterx@redhat.com) wrote:
> > > > On Thu, Aug 14, 2025 at 05:42:23PM +0200, Juraj Marcin wrote:
> > > > > Fair point, I'll then continue with the PING/PONG solution, the first
> > > > > implementation I have seems to be working to resolve Issue 1.
> > > > >
> > > > > For rarer split brain, we'll rely on block device locks/mgmt to resolve
> > > > > and change the failure handling, so it registers errors from disk
> > > > > activation.
> > > > >
> > > > > As tested, there should be no problems with the destination
> > > > > transitioning to POSTCOPY_PAUSED, since the VM was not started yet.
> > > > >
> > > > > However, to prevent the source side from transitioning to
> > > > > POSTCOPY_PAUSED, I think adding a new state is still the best option.
> > > > >
> > > > > I tried keeping the migration states as they are now and just relying
> > > > > on an attribute of MigrationState recording whether the 3rd PONG was
> > > > > received. However, this collides with (at least) the migrate_pause
> > > > > tests, which wait for POSTCOPY_ACTIVE and then pause the migration,
> > > > > triggering the source to resume. We could maybe work around it by
> > > > > waiting for the 3rd PONG instead, but I am not sure that is possible
> > > > > from tests, or by not resuming when the migrate_pause command is
> > > > > executed.
> > > > >
> > > > > I also tried extending the span of the DEVICE state, but some functions
> > > > > behave differently depending on whether they are in postcopy or not
> > > > > (using the migration_in_postcopy() function), so adding DEVICE there
> > > > > doesn't work either. And treating the DEVICE state sometimes as
> > > > > postcopy and sometimes as not seems just too messy, if it is even
> > > > > possible.
> > > >
> > > > Yeah, it might indeed be a bit messy.
> > > >
> > > > Is it possible to find a middle ground? E.g. add a postcopy-setup
> > > > status, but without any new knob to enable it? Just to describe the
> > > > period of time where the dest QEMU hasn't started running but has
> > > > started loading device states.
> > > >
> > > > The hope is that libvirt (which, AFAIU, always enables the "events"
> > > > capability) can ignore the new postcopy-setup status transition; then
> > > > maybe we can introduce postcopy-setup and make it always appear.
> > >
> > > When the destination is started with '-S' (autostart=false), which is what
> > > I think libvirt does, doesn't management only start the destination
> > > after a certain useful event?
> > > In other words, is there an event we already emit to say that the destination
> > > has finished loading the postcopy devices, or could we just add that
> > > event, so that management could just wait for that before issuing
> > > the continue?
> >
> > I am not aware of any such event on the destination side. When postcopy
> > (and its switchover) starts, the destination transitions from ACTIVE
> > directly to POSTCOPY_ACTIVE in the listen thread, while devices are
> > loaded concurrently by the main thread.
> >
> > There is the DEVICE state, but that is used only on the source side
> > while device state is being collected. By the time device state is being
> > loaded on the destination, the source side is already in the
> > POSTCOPY_ACTIVE state.
>
> So I wonder what libvirt uses to trigger it starting the destination in
> the postcopy case? It's got to be after the device state has loaded.
I checked the libvirt code and IIUC it waits for POSTCOPY_ACTIVE and
then issues the 'cont' command, so that happens concurrently with the
device state load. But as Peter mentioned, it doesn't actually start the
VM if the device load hasn't finished yet; it only sets the autostart
variable, and the VM is started when the destination processes
CMD_POSTCOPY_RUN after the device state load.
Best regards,
Juraj Marcin
>
> Dave
>
> > Best regards,
> >
> > Juraj Marcin
> >
> > >
> > > Dave
> > >
> > > > Thanks,
> > > >
> > > > --
> > > > Peter Xu
> > > >
> > > >
> > > --
> > > -----Open up your eyes, open up your mind, open up your code -------
> > > / Dr. David Alan Gilbert | Running GNU/Linux | Happy \
> > > \ dave @ treblig.org | | In Hex /
> > > \ _________________________|_____ http://www.treblig.org |_______/
> > >
> >
> >
> --
> -----Open up your eyes, open up your mind, open up your code -------
> / Dr. David Alan Gilbert | Running GNU/Linux | Happy \
> \ dave @ treblig.org | | In Hex /
> \ _________________________|_____ http://www.treblig.org |_______/
>
Thread overview: 26+ messages
2025-08-07 11:49 [RFC PATCH 0/4] migration: Introduce postcopy-setup capability and state Juraj Marcin
2025-08-07 11:49 ` [RFC PATCH 1/4] qemu-thread: Introduce qemu_thread_detach() Juraj Marcin
2025-08-19 10:37 ` Daniel P. Berrangé
2025-08-07 11:49 ` [RFC PATCH 2/4] migration: Fix state transition in postcopy_start() error handling Juraj Marcin
2025-08-07 20:54 ` Peter Xu
2025-08-08 9:44 ` Juraj Marcin
2025-08-08 16:00 ` Peter Xu
2025-08-08 19:08 ` Fabiano Rosas
2025-08-11 13:00 ` Juraj Marcin
2025-08-07 11:49 ` [RFC PATCH 3/4] migration: Make listen thread joinable Juraj Marcin
2025-08-07 20:57 ` Peter Xu
2025-08-08 11:08 ` Juraj Marcin
2025-08-08 17:05 ` Peter Xu
2025-08-11 13:02 ` Juraj Marcin
2025-08-07 11:49 ` [RFC PATCH 4/4] migration: Introduce postcopy-setup capability and state Juraj Marcin
2025-08-11 14:54 ` [RFC PATCH 0/4] " Peter Xu
2025-08-12 13:34 ` Juraj Marcin
2025-08-13 17:42 ` Peter Xu
2025-08-14 15:42 ` Juraj Marcin
2025-08-14 19:24 ` Peter Xu
2025-08-15 6:35 ` Juraj Marcin
2025-09-01 17:57 ` Dr. David Alan Gilbert
2025-09-02 8:30 ` Juraj Marcin
2025-09-03 12:00 ` Dr. David Alan Gilbert
2025-09-03 13:07 ` Peter Xu
2025-09-04 16:11 ` Juraj Marcin