[PATCH 0/6] migration: Fix issues during qmp_migrate

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/6] migration: Fix issues during qmp_migrate_cancel
@ 2024-12-02 22:01 Fabiano Rosas
  2024-12-02 22:01 ` [PATCH 1/6] tests/qtest/migration: Introduce migration_test_add_suffix Fabiano Rosas
                   ` (5 more replies)
  0 siblings, 6 replies; 24+ messages in thread
From: Fabiano Rosas @ 2024-12-02 22:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Xu, Daniel P . Berrangé

Hi,

While working on downstream issues with postcopy, I ended up writing a
set of tests for issuing qmp_migrate_cancel() at various points during
the migration. That exposed some bugs, which this series attempts to
fix.

There is also a fix for the issue Daniel found:

https://gitlab.com/qemu-project/qemu/-/issues/2633

I'm also sending the test code. It creates one test per
MIGRATION_STATUS_ state. Each test starts a migration, waits for that
specific state to be reached, issues qmp_migrate_cancel() and checks
that the migration state changes to cancelled (for now only cancelling
migration from the source side).

I was initially worried that this would be too racy, but so far each
test has survived 1000 iterations. I'm thinking it's worth merging,
specially because even after working on this I haven't been able to
clear the questions we have in our todo list [1], so we'll probably
need more work around this area in the future.

1- https://wiki.qemu.org/ToDo/LiveMigration#Migration_cancel_concurrency

CI run: https://gitlab.com/farosas/qemu/-/pipelines/1569870481

Thanks

Fabiano Rosas (6):
  tests/qtest/migration: Introduce migration_test_add_suffix
  migration: Kick postcopy threads on cancel
  migration: Fix postcopy listen thread exit
  migration: Make sure postcopy recovery doesn't hang when cancelling
  migration: Fix hang after error in destination setup phase
  tests/qtest/migration: Add a cancel test

 migration/channel.c             |  11 +-
 migration/migration.c           |  58 +++++---
 migration/migration.h           |   2 +-
 migration/postcopy-ram.c        |  14 +-
 migration/savevm.c              |  60 ++++----
 tests/qtest/migration-helpers.c |  24 ++++
 tests/qtest/migration-helpers.h |   2 +
 tests/qtest/migration-test.c    | 243 ++++++++++++++++++++++++++++++++
 8 files changed, 365 insertions(+), 49 deletions(-)

-- 
2.35.3

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/6] tests/qtest/migration: Introduce migration_test_add_suffix
  2024-12-02 22:01 [PATCH 0/6] migration: Fix issues during qmp_migrate_cancel Fabiano Rosas
@ 2024-12-02 22:01 ` Fabiano Rosas
  2024-12-04 18:44   ` Peter Xu
  2024-12-02 22:01 ` [PATCH 2/6] migration: Kick postcopy threads on cancel Fabiano Rosas
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Fabiano Rosas @ 2024-12-02 22:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Xu, Daniel P . Berrangé

Introduce a new migration_test_add_suffix to allow programmatic
creation of tests based on a suffix. Pass the test name into the test
so it can know which variant to run.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 tests/qtest/migration-helpers.c | 24 ++++++++++++++++++++++++
 tests/qtest/migration-helpers.h |  2 ++
 2 files changed, 26 insertions(+)

diff --git a/tests/qtest/migration-helpers.c b/tests/qtest/migration-helpers.c
index 3f8ba7fa8e..905f4c583d 100644
--- a/tests/qtest/migration-helpers.c
+++ b/tests/qtest/migration-helpers.c
@@ -437,6 +437,7 @@ char *resolve_machine_version(const char *alias, const char *var1,
 typedef struct {
     char *name;
     void (*func)(void);
+    void (*func_full)(void *);
 } MigrationTest;
 
 static void migration_test_destroy(gpointer data)
@@ -466,6 +467,29 @@ void migration_test_add(const char *path, void (*fn)(void))
                              migration_test_destroy);
 }
 
+static void migration_test_wrapper_full(const void *data)
+{
+    MigrationTest *test = (MigrationTest *)data;
+
+    g_test_message("Running /%s%s", qtest_get_arch(), test->name);
+    test->func_full(test->name);
+}
+
+void migration_test_add_suffix(const char *path, const char *suffix,
+                               void (*fn)(void *))
+{
+    MigrationTest *test = g_new0(MigrationTest, 1);
+
+    g_assert(g_str_has_suffix(path, "/"));
+    g_assert(!g_str_has_prefix(suffix, "/"));
+
+    test->func_full = fn;
+    test->name = g_strconcat(path, suffix, NULL);
+
+    qtest_add_data_func_full(test->name, test, migration_test_wrapper_full,
+                             migration_test_destroy);
+}
+
 #ifdef O_DIRECT
 /*
  * Probe for O_DIRECT support on the filesystem. Since this is used
diff --git a/tests/qtest/migration-helpers.h b/tests/qtest/migration-helpers.h
index 72dba369fb..391038f59b 100644
--- a/tests/qtest/migration-helpers.h
+++ b/tests/qtest/migration-helpers.h
@@ -63,6 +63,8 @@ static inline bool probe_o_direct_support(const char *tmpfs)
 }
 #endif
 void migration_test_add(const char *path, void (*fn)(void));
+void migration_test_add_suffix(const char *path, const char *suffix,
+                               void (*fn)(void *));
 void migration_event_wait(QTestState *s, const char *target);
 
 #endif /* MIGRATION_HELPERS_H */
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/6] migration: Kick postcopy threads on cancel
  2024-12-02 22:01 [PATCH 0/6] migration: Fix issues during qmp_migrate_cancel Fabiano Rosas
  2024-12-02 22:01 ` [PATCH 1/6] tests/qtest/migration: Introduce migration_test_add_suffix Fabiano Rosas
@ 2024-12-02 22:01 ` Fabiano Rosas
  2024-12-04 18:39   ` Peter Xu
  2024-12-02 22:01 ` [PATCH 3/6] migration: Fix postcopy listen thread exit Fabiano Rosas
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Fabiano Rosas @ 2024-12-02 22:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Xu, Daniel P . Berrangé

Make sure postcopy threads are released when migrate_cancel is
issued. Kick the postcopy_pause semaphore and have the fault thread
read 'fault_thread_quit' when joining.

While here fix the comment mentioning userfault_event_fd.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/migration.c    | 14 +++++++++++---
 migration/postcopy-ram.c | 14 ++++++++++++--
 2 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 8c5bd0a75c..07fbb5c9f4 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -105,7 +105,7 @@ static bool migration_object_check(MigrationState *ms, Error **errp);
 static int migration_maybe_pause(MigrationState *s,
                                  int *current_active_state,
                                  int new_state);
-static void migrate_fd_cancel(MigrationState *s);
+static void migrate_fd_cancel(MigrationState *s, MigrationIncomingState *mis);
 static bool close_return_path_on_source(MigrationState *s);
 static void migration_completion_end(MigrationState *s);
 
@@ -317,7 +317,7 @@ void migration_cancel(const Error *error)
     if (migrate_dirty_limit()) {
         qmp_cancel_vcpu_dirty_limit(false, -1, NULL);
     }
-    migrate_fd_cancel(current_migration);
+    migrate_fd_cancel(current_migration, current_incoming);
 }
 
 void migration_shutdown(void)
@@ -1502,7 +1502,7 @@ static void migrate_fd_error(MigrationState *s, const Error *error)
     migrate_set_error(s, error);
 }
 
-static void migrate_fd_cancel(MigrationState *s)
+static void migrate_fd_cancel(MigrationState *s, MigrationIncomingState *mis)
 {
     int old_state ;
 
@@ -1515,6 +1515,12 @@ static void migrate_fd_cancel(MigrationState *s)
         }
     }
 
+    if (mis->state == MIGRATION_STATUS_POSTCOPY_PAUSED) {
+        migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_PAUSED,
+                          MIGRATION_STATUS_CANCELLING);
+        qemu_sem_post(&mis->postcopy_pause_sem_dst);
+    }
+
     do {
         old_state = s->state;
         if (!migration_is_running()) {
@@ -1523,6 +1529,8 @@ static void migrate_fd_cancel(MigrationState *s)
         /* If the migration is paused, kick it out of the pause */
         if (old_state == MIGRATION_STATUS_PRE_SWITCHOVER) {
             qemu_sem_post(&s->pause_sem);
+        } else if (old_state == MIGRATION_STATUS_POSTCOPY_PAUSED) {
+            qemu_sem_post(&s->postcopy_pause_sem);
         }
         migrate_set_state(&s->state, old_state, MIGRATION_STATUS_CANCELLING);
     } while (s->state != MIGRATION_STATUS_CANCELLING);
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index a535fd2e30..6882ef977d 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -634,6 +634,7 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
         qatomic_set(&mis->fault_thread_quit, 1);
         postcopy_fault_thread_notify(mis);
         trace_postcopy_ram_incoming_cleanup_join();
+        qemu_sem_post(&mis->postcopy_pause_sem_fault);
         qemu_thread_join(&mis->fault_thread);
 
         if (postcopy_notify(POSTCOPY_NOTIFY_INBOUND_END, &local_err)) {
@@ -991,8 +992,7 @@ static void *postcopy_ram_fault_thread(void *opaque)
 
         /*
          * We're mainly waiting for the kernel to give us a faulting HVA,
-         * however we can be told to quit via userfault_quit_fd which is
-         * an eventfd
+         * however we can be told to quit via userfault_event_fd.
          */
 
         poll_result = poll(pfd, pfd_len, -1 /* Wait forever */);
@@ -1008,6 +1008,11 @@ static void *postcopy_ram_fault_thread(void *opaque)
              * the channel is rebuilt.
              */
             postcopy_pause_fault_thread(mis);
+
+            if (qatomic_read(&mis->fault_thread_quit)) {
+                trace_postcopy_ram_fault_thread_quit();
+                break;
+            }
         }
 
         if (pfd[1].revents) {
@@ -1082,6 +1087,11 @@ retry:
             if (ret) {
                 /* May be network failure, try to wait for recovery */
                 postcopy_pause_fault_thread(mis);
+
+                if (qatomic_read(&mis->fault_thread_quit)) {
+                    trace_postcopy_ram_fault_thread_quit();
+                    break;
+                }
                 goto retry;
             }
         }
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/6] migration: Fix postcopy listen thread exit
  2024-12-02 22:01 [PATCH 0/6] migration: Fix issues during qmp_migrate_cancel Fabiano Rosas
  2024-12-02 22:01 ` [PATCH 1/6] tests/qtest/migration: Introduce migration_test_add_suffix Fabiano Rosas
  2024-12-02 22:01 ` [PATCH 2/6] migration: Kick postcopy threads on cancel Fabiano Rosas
@ 2024-12-02 22:01 ` Fabiano Rosas
  2024-12-02 22:01 ` [PATCH 4/6] migration: Make sure postcopy recovery doesn't hang when cancelling Fabiano Rosas
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Fabiano Rosas @ 2024-12-02 22:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Xu, Daniel P . Berrangé

There are a couple of problems with exiting the postcopy listen
thread. It does not honor the exit-on-error flag and always exits QEMU
upon error. It also does not behave well if a qmp_migrate_cancel() is
issued while postcopy is paused, it either hangs during retry or
crashes during access of a non-recovered QEMUFile (i.e. NULL).

Fix it by adding support for exit-on-error and avoiding accessing the
NULL file pointer.

While here, move the end tracepoint to a later part of the function.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/savevm.c | 60 +++++++++++++++++++++++++++-------------------
 1 file changed, 35 insertions(+), 25 deletions(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index 98821c8120..44b7f883f7 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2004,11 +2004,14 @@ static void *postcopy_ram_listen_thread(void *opaque)
      * want a wrapper for the QEMUFile handle.
      */
     f = mis->from_src_file;
+    if (!f) {
+        /* postcopy pause never got recovered */
+        goto out;
+    }
 
     /* And non-blocking again so we don't block in any cleanup */
     qemu_file_set_blocking(f, false);
 
-    trace_postcopy_ram_listen_thread_exit();
     if (load_res < 0) {
         qemu_file_set_error(f, load_res);
         dirty_bitmap_mig_cancel_incoming();
@@ -2021,10 +2024,6 @@ static void *postcopy_ram_listen_thread(void *opaque)
                          "bitmaps are correctly migrated and valid.",
                          __func__, load_res);
             load_res = 0; /* prevent further exit() */
-        } else {
-            error_report("%s: loadvm failed: %d", __func__, load_res);
-            migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
-                                           MIGRATION_STATUS_FAILED);
         }
     }
     if (load_res >= 0) {
@@ -2034,31 +2033,40 @@ static void *postcopy_ram_listen_thread(void *opaque)
          * state yet; wait for the end of the main thread.
          */
         qemu_event_wait(&mis->main_thread_load_event);
-    }
-    postcopy_ram_incoming_cleanup(mis);
 
-    if (load_res < 0) {
+        postcopy_ram_incoming_cleanup(mis);
+
+        migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
+                          MIGRATION_STATUS_COMPLETED);
+
         /*
-         * If something went wrong then we have a bad state so exit;
-         * depending how far we got it might be possible at this point
-         * to leave the guest running and fire MCEs for pages that never
-         * arrived as a desperate recovery step.
+         * If everything has worked fine, then the main thread has waited
+         * for us to start, and we're the last use of the mis.
          */
-        rcu_unregister_thread();
-        exit(EXIT_FAILURE);
+        migration_incoming_state_destroy();
     }
 
-    migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
-                                   MIGRATION_STATUS_COMPLETED);
-    /*
-     * If everything has worked fine, then the main thread has waited
-     * for us to start, and we're the last use of the mis.
-     * (If something broke then qemu will have to exit anyway since it's
-     * got a bad migration state).
-     */
-    migration_incoming_state_destroy();
-
+out:
+    trace_postcopy_ram_listen_thread_exit();
     rcu_unregister_thread();
+
+    if (load_res < 0) {
+        postcopy_ram_incoming_cleanup(mis);
+
+        error_report("%s: loadvm failed: %d", __func__, load_res);
+        migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
+                          MIGRATION_STATUS_FAILED);
+        if (mis->exit_on_error) {
+            /*
+             * If something went wrong then we have a bad state so exit;
+             * depending how far we got it might be possible at this point
+             * to leave the guest running and fire MCEs for pages that never
+             * arrived as a desperate recovery step.
+             */
+            exit(EXIT_FAILURE);
+        }
+    }
+
     mis->have_listen_thread = false;
     postcopy_state_set(POSTCOPY_INCOMING_END);
 
@@ -2921,7 +2929,9 @@ out:
             migrate_postcopy_ram() && postcopy_pause_incoming(mis)) {
             /* Reset f to point to the newly created channel */
             f = mis->from_src_file;
-            goto retry;
+            if (f) {
+                goto retry;
+            }
         }
     }
     return ret;
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 4/6] migration: Make sure postcopy recovery doesn't hang when cancelling
  2024-12-02 22:01 [PATCH 0/6] migration: Fix issues during qmp_migrate_cancel Fabiano Rosas
                   ` (2 preceding siblings ...)
  2024-12-02 22:01 ` [PATCH 3/6] migration: Fix postcopy listen thread exit Fabiano Rosas
@ 2024-12-02 22:01 ` Fabiano Rosas
  2024-12-02 22:01 ` [PATCH 5/6] migration: Fix hang after error in destination setup phase Fabiano Rosas
  2024-12-02 22:01 ` [PATCH 6/6] tests/qtest/migration: Add a cancel test Fabiano Rosas
  5 siblings, 0 replies; 24+ messages in thread
From: Fabiano Rosas @ 2024-12-02 22:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Xu, Daniel P . Berrangé

Make sure postcopy recovery doesn't hang when calling
qmp_migrate_cancel() and also doesn't pause the migration once more.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/migration.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index 07fbb5c9f4..8a61cc26d7 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1531,6 +1531,8 @@ static void migrate_fd_cancel(MigrationState *s, MigrationIncomingState *mis)
             qemu_sem_post(&s->pause_sem);
         } else if (old_state == MIGRATION_STATUS_POSTCOPY_PAUSED) {
             qemu_sem_post(&s->postcopy_pause_sem);
+        } else if (old_state == MIGRATION_STATUS_POSTCOPY_RECOVER) {
+            qemu_sem_post(&s->rp_state.rp_sem);
         }
         migrate_set_state(&s->state, old_state, MIGRATION_STATUS_CANCELLING);
     } while (s->state != MIGRATION_STATUS_CANCELLING);
@@ -2148,6 +2150,10 @@ void qmp_migrate_continue(MigrationStatus state, Error **errp)
 
 int migration_rp_wait(MigrationState *s)
 {
+    if (qatomic_read(&s->state) == MIGRATION_STATUS_CANCELLING) {
+        return -1;
+    }
+
     /* If migration has failure already, ignore the wait */
     if (migrate_has_error(s)) {
         return -1;
@@ -2160,6 +2166,10 @@ int migration_rp_wait(MigrationState *s)
         return -1;
     }
 
+    if (qatomic_read(&s->state) == MIGRATION_STATUS_CANCELLING) {
+        return -1;
+    }
+
     return 0;
 }
 
@@ -3023,6 +3033,9 @@ static MigThrError postcopy_pause(MigrationState *s)
                 trace_postcopy_pause_continued();
                 return MIG_THR_ERR_RECOVERED;
             } else {
+                if (s->state == MIGRATION_STATUS_CANCELLING) {
+                    return MIG_THR_ERR_FATAL;
+                }
                 /*
                  * Something wrong happened during the recovery, let's
                  * pause again. Pause is always better than throwing
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 5/6] migration: Fix hang after error in destination setup phase
  2024-12-02 22:01 [PATCH 0/6] migration: Fix issues during qmp_migrate_cancel Fabiano Rosas
                   ` (3 preceding siblings ...)
  2024-12-02 22:01 ` [PATCH 4/6] migration: Make sure postcopy recovery doesn't hang when cancelling Fabiano Rosas
@ 2024-12-02 22:01 ` Fabiano Rosas
  2024-12-04 18:44   ` Peter Xu
  2024-12-02 22:01 ` [PATCH 6/6] tests/qtest/migration: Add a cancel test Fabiano Rosas
  5 siblings, 1 reply; 24+ messages in thread
From: Fabiano Rosas @ 2024-12-02 22:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Xu, Daniel P . Berrangé, Thomas Huth

If the destination side fails at migration_ioc_process_incoming()
before starting the coroutine, it will report the error but QEMU will
not exit.

Set the migration state to FAILED and exit the process if
exit-on-error allows.

CC: Thomas Huth <thuth@redhat.com>
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2633
Reported-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/channel.c   | 11 ++++++-----
 migration/migration.c | 31 ++++++++++++++++++-------------
 migration/migration.h |  2 +-
 3 files changed, 25 insertions(+), 19 deletions(-)

diff --git a/migration/channel.c b/migration/channel.c
index f9de064f3b..6d7f9172d8 100644
--- a/migration/channel.c
+++ b/migration/channel.c
@@ -40,13 +40,14 @@ void migration_channel_process_incoming(QIOChannel *ioc)
 
     if (migrate_channel_requires_tls_upgrade(ioc)) {
         migration_tls_channel_process_incoming(s, ioc, &local_err);
+
+        if (local_err) {
+            error_report_err(local_err);
+        }
+
     } else {
         migration_ioc_register_yank(ioc);
-        migration_ioc_process_incoming(ioc, &local_err);
-    }
-
-    if (local_err) {
-        error_report_err(local_err);
+        migration_ioc_process_incoming(ioc);
     }
 }
 
diff --git a/migration/migration.c b/migration/migration.c
index 8a61cc26d7..cd88ebc875 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -943,7 +943,7 @@ static bool migration_should_start_incoming(bool main_channel)
     return true;
 }
 
-void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
+void migration_ioc_process_incoming(QIOChannel *ioc)
 {
     MigrationIncomingState *mis = migration_incoming_get_current();
     Error *local_err = NULL;
@@ -966,10 +966,9 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
          * issue is not possible.
          */
         ret = migration_channel_read_peek(ioc, (void *)&channel_magic,
-                                          sizeof(channel_magic), errp);
-
+                                          sizeof(channel_magic), &local_err);
         if (ret != 0) {
-            return;
+            goto err;
         }
 
         default_channel = (channel_magic == cpu_to_be32(QEMU_VM_FILE_MAGIC));
@@ -977,8 +976,8 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
         default_channel = !mis->from_src_file;
     }
 
-    if (multifd_recv_setup(errp) != 0) {
-        return;
+    if (multifd_recv_setup(&local_err) != 0) {
+        goto err;
     }
 
     if (default_channel) {
@@ -995,18 +994,24 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
             postcopy_preempt_new_channel(mis, f);
         }
         if (local_err) {
-            error_propagate(errp, local_err);
-            return;
+            goto err;
         }
     }
 
-    if (migration_should_start_incoming(default_channel)) {
-        /* If it's a recovery, we're done */
-        if (postcopy_try_recover()) {
-            return;
-        }
+    if (migration_should_start_incoming(default_channel) &&
+        !postcopy_try_recover()) {
         migration_incoming_process();
     }
+
+    return;
+
+err:
+    error_report_err(local_err);
+    migrate_set_state(&mis->state, MIGRATION_STATUS_SETUP,
+                      MIGRATION_STATUS_FAILED);
+    if (mis->exit_on_error) {
+        exit(EXIT_FAILURE);
+    }
 }
 
 /**
diff --git a/migration/migration.h b/migration/migration.h
index 0956e9274b..c367e5ea40 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -477,7 +477,7 @@ void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
                        MigrationStatus new_state);
 
 void migration_fd_process_incoming(QEMUFile *f);
-void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp);
+void migration_ioc_process_incoming(QIOChannel *ioc);
 void migration_incoming_process(void);
 
 bool  migration_has_all_channels(void);
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 6/6] tests/qtest/migration: Add a cancel test
  2024-12-02 22:01 [PATCH 0/6] migration: Fix issues during qmp_migrate_cancel Fabiano Rosas
                   ` (4 preceding siblings ...)
  2024-12-02 22:01 ` [PATCH 5/6] migration: Fix hang after error in destination setup phase Fabiano Rosas
@ 2024-12-02 22:01 ` Fabiano Rosas
  5 siblings, 0 replies; 24+ messages in thread
From: Fabiano Rosas @ 2024-12-02 22:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Xu, Daniel P . Berrangé

The qmp_migrate_cancel() command is poorly tested and code inspection
reveals that there might be concurrency issues with its usage. Add a
test that runs a migration and calls qmp_migrate_cancel() at specific
moments.

In order to make the test more deterministic, instead of calling
qmp_migrate_cancel() at random moments during migration, do it after
the migration status change events are seen.

The expected result is that qmp_migrate_cancel() on the source ends
migration on the source with the "cancelled" state and ends migration
on the destination with the "failed" state.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 tests/qtest/migration-test.c | 243 +++++++++++++++++++++++++++++++++++
 1 file changed, 243 insertions(+)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 74d3000198..48942289bf 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -23,6 +23,8 @@
 #include "qapi/qmp/qlist.h"
 #include "ppc-util.h"
 
+#include "qapi-types-migration.h"
+
 #include "migration-helpers.h"
 #include "tests/migration/migration-test.h"
 #ifdef CONFIG_GNUTLS
@@ -3774,6 +3776,234 @@ static bool kvm_dirty_ring_supported(void)
 #endif
 }
 
+static void test_cancel_src_after_failed(QTestState *from, QTestState *to,
+                                         const char *uri, const char *phase)
+{
+    /*
+     * No migrate_incoming_qmp() at the start to force source into
+     * failed state during migrate_qmp().
+     */
+
+    wait_for_serial("src_serial");
+    migrate_ensure_converge(from);
+
+    migrate_qmp(from, to, uri, NULL, "{}");
+
+    migration_event_wait(from, phase);
+    migrate_cancel(from);
+
+    /* cancelling will not move the migration out of 'failed' */
+
+    wait_for_migration_status(from, "failed",
+                              (const char * []) { "completed", NULL });
+
+    /*
+     * Not waiting for the destination because it never started
+     * migration.
+     */
+}
+
+static void test_cancel_src_after_cancelled(QTestState *from, QTestState *to,
+                                            const char *uri, const char *phase)
+{
+    migrate_incoming_qmp(to, uri, "{ 'exit-on-error': false }");
+
+    wait_for_serial("src_serial");
+    migrate_ensure_converge(from);
+
+    migrate_qmp(from, to, uri, NULL, "{}");
+
+    /* To move to cancelled/cancelling */
+    migrate_cancel(from);
+    migration_event_wait(from, phase);
+
+    /* The migrate_cancel under test */
+    migrate_cancel(from);
+
+    wait_for_migration_status(from, "cancelled",
+                              (const char * []) { "completed", NULL });
+
+    wait_for_migration_status(to, "failed",
+                              (const char * []) { "completed", NULL });
+}
+
+static void test_cancel_src_after_complete(QTestState *from, QTestState *to,
+                                           const char *uri, const char *phase)
+{
+    migrate_incoming_qmp(to, uri, "{ 'exit-on-error': false }");
+
+    wait_for_serial("src_serial");
+    migrate_ensure_converge(from);
+
+    migrate_qmp(from, to, uri, NULL, "{}");
+
+    migration_event_wait(from, phase);
+    migrate_cancel(from);
+
+    /*
+     * qmp_migrate_cancel() exits early if migration is not running
+     * anymore, the status will not change to cancelled.
+     */
+    wait_for_migration_complete(from);
+    wait_for_migration_complete(to);
+}
+
+static void test_cancel_src_after_none(QTestState *from, QTestState *to,
+                                       const char *uri, const char *phase)
+{
+    /*
+     * Test that cancelling without a migration happening does not
+     * affect subsequent migrations
+     */
+    migrate_cancel(to);
+
+    wait_for_serial("src_serial");
+    migrate_cancel(from);
+
+    migrate_incoming_qmp(to, uri, "{ 'exit-on-error': false }");
+
+    migrate_ensure_converge(from);
+    migrate_qmp(from, to, uri, NULL, "{}");
+
+    wait_for_migration_complete(from);
+    wait_for_migration_complete(to);
+}
+
+static void test_cancel_src_after_postcopy(QTestState *from, QTestState *to,
+                                           const char *uri, const char *phase)
+{
+    bool postcopy_active = g_str_equal(phase, "postcopy-active");
+    bool postcopy_paused = g_str_equal(phase, "postcopy-paused");
+    bool postcopy_recover = g_str_equal(phase, "postcopy-recover-setup") ||
+        g_str_equal(phase, "postcopy-recover");
+
+    migrate_set_capability(from, "postcopy-ram", true);
+    migrate_set_capability(to, "postcopy-ram", true);
+
+    migrate_incoming_qmp(to, uri, "{ 'exit-on-error': false }");
+
+    wait_for_serial("src_serial");
+    migrate_ensure_non_converge(from);
+
+    migrate_qmp(from, to, uri, NULL, "{}");
+
+    migration_event_wait(from, "active");
+
+    /* Turn postcopy speed down, 4K/s is slow enough on any machines */
+    migrate_set_parameter_int(from, "max-postcopy-bandwidth", 4096);
+    qtest_qmp_assert_success(from, "{ 'execute': 'migrate-start-postcopy' }");
+
+    if (!postcopy_active) {
+        migration_event_wait(from, "postcopy-active");
+        migration_event_wait(to, "postcopy-active");
+
+        wait_for_stop(from, &src_state);
+        qtest_qmp_eventwait(to, "RESUME");
+
+        migrate_pause(from);
+    }
+
+    if (postcopy_recover) {
+        migration_event_wait(to, "postcopy-paused");
+        migration_event_wait(from, "postcopy-paused");
+
+        migrate_recover(to, uri);
+        migrate_qmp(from, to, uri, NULL, "{'resume': true}");
+    }
+
+    migration_event_wait(from, phase);
+    migrate_cancel(from);
+    migration_event_wait(from, "cancelling");
+
+    wait_for_migration_status(from, "cancelled",
+                              (const char * []) { "completed", NULL });
+
+    if (postcopy_paused || postcopy_recover) {
+        /*
+         * Cancelling on source is not detectable by the destination,
+         * so it remains paused.
+         */
+        migration_event_wait(to, "postcopy-paused");
+    } else {
+        wait_for_migration_status(to, "failed",
+                                  (const char * []) { "completed", NULL });
+    }
+}
+
+static void test_cancel_src_pre_switchover(QTestState *from, QTestState *to,
+                                           const char *uri, const char *phase)
+{
+    migrate_set_capability(from, "pause-before-switchover", true);
+    migrate_set_capability(to, "pause-before-switchover", true);
+
+    migrate_set_parameter_int(from, "multifd-channels", 2);
+    migrate_set_parameter_int(to, "multifd-channels", 2);
+    migrate_set_capability(from, "multifd", true);
+    migrate_set_capability(to, "multifd", true);
+
+    migrate_incoming_qmp(to, uri, "{ 'exit-on-error': false }");
+
+    wait_for_serial("src_serial");
+    migrate_ensure_converge(from);
+
+    migrate_qmp(from, to, uri, NULL, "{}");
+
+    migration_event_wait(from, phase);
+    migrate_cancel(from);
+    migration_event_wait(from, "cancelling");
+
+    wait_for_migration_status(from, "cancelled",
+                              (const char * []) { "completed", NULL });
+
+    wait_for_migration_status(to, "failed",
+                              (const char * []) { "completed", NULL });
+}
+
+static void test_cancel_src_after_status(void *opaque)
+{
+    const char *test_path = opaque;
+    g_autofree char *phase = g_path_get_basename(test_path);
+    g_autofree char *uri = g_strdup_printf("unix:%s/migsocket", tmpfs);
+    QTestState *from, *to;
+    MigrateStart args = {
+        .hide_stderr = true,
+    };
+
+    if (test_migrate_start(&from, &to, "defer", &args)) {
+        return;
+    }
+
+    if (g_str_has_prefix(phase, "postcopy")) {
+        if (!ufd_version_check()) {
+            g_test_skip("No postcopy support. "
+                        "Cannot test migrate_cancel in postcopy states");
+            goto out;
+        }
+
+        test_cancel_src_after_postcopy(from, to, uri, phase);
+
+    } else if (g_str_equal(phase, "cancelling") ||
+               g_str_equal(phase, "cancelled")) {
+        test_cancel_src_after_cancelled(from, to, uri, phase);
+
+    } else if (g_str_equal(phase, "completed")) {
+        test_cancel_src_after_complete(from, to, uri, phase);
+
+    } else if (g_str_equal(phase, "failed")) {
+        test_cancel_src_after_failed(from, to, uri, phase);
+
+    } else if (g_str_equal(phase, "none")) {
+        test_cancel_src_after_none(from, to, uri, phase);
+
+    } else {
+        /* any state that comes before pre-switchover */
+        test_cancel_src_pre_switchover(from, to, uri, phase);
+    }
+
+out:
+    test_migrate_end(from, to, false);
+}
+
 int main(int argc, char **argv)
 {
     bool has_kvm, has_tcg;
@@ -4034,6 +4264,19 @@ int main(int argc, char **argv)
         }
     }
 
+    for (int i = MIGRATION_STATUS_NONE; i < MIGRATION_STATUS__MAX; i++) {
+        switch (i) {
+        case MIGRATION_STATUS_DEVICE: /* happens too fast */
+        case MIGRATION_STATUS_WAIT_UNPLUG: /* no support in tests */
+        case MIGRATION_STATUS_COLO: /* no support in tests */
+            continue;
+        default:
+            migration_test_add_suffix("/migration/cancel/src/after/",
+                                      MigrationStatus_str(i),
+                                      test_cancel_src_after_status);
+        }
+    }
+
     ret = g_test_run();
 
     g_assert_cmpint(ret, ==, 0);
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/6] migration: Kick postcopy threads on cancel
  2024-12-02 22:01 ` [PATCH 2/6] migration: Kick postcopy threads on cancel Fabiano Rosas
@ 2024-12-04 18:39   ` Peter Xu
  2024-12-04 19:02     ` Fabiano Rosas
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2024-12-04 18:39 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: qemu-devel, Daniel P . Berrangé

On Mon, Dec 02, 2024 at 07:01:33PM -0300, Fabiano Rosas wrote:
> Make sure postcopy threads are released when migrate_cancel is
> issued. Kick the postcopy_pause semaphore and have the fault thread
> read 'fault_thread_quit' when joining.
> 
> While here fix the comment mentioning userfault_event_fd.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>

I remember when working on postcopy, I thought about failing migrate-cancel
for postcopy in general, rejecting such request.  And when working on the
recover feature, there's no concern on having it being cancelled, because
the user really shouldn't do that..

The problem is migrate-cancel means crashing the VM on both sides when QEMU
already goes into postcopy stage.

If the user wants to crash the VM anyway, an easier way to do is killing on
both sides.

If the user wished to cancel, we should tell them "postcopy cannot be
cancelled, until complete".  That's probably the major reason why people
think postcopy is dangerous to use..

Or do we have any use case this could be a valid scenario?

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/6] migration: Fix hang after error in destination setup phase
  2024-12-02 22:01 ` [PATCH 5/6] migration: Fix hang after error in destination setup phase Fabiano Rosas
@ 2024-12-04 18:44   ` Peter Xu
  2024-12-04 19:05     ` Fabiano Rosas
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2024-12-04 18:44 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: qemu-devel, Daniel P . Berrangé, Thomas Huth

On Mon, Dec 02, 2024 at 07:01:36PM -0300, Fabiano Rosas wrote:
> If the destination side fails at migration_ioc_process_incoming()
> before starting the coroutine, it will report the error but QEMU will
> not exit.
> 
> Set the migration state to FAILED and exit the process if
> exit-on-error allows.
> 
> CC: Thomas Huth <thuth@redhat.com>
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2633
> Reported-by: Daniel P. Berrangé <berrange@redhat.com>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>

(I skipped the postcopy patches as of now, until we finish the discussion
 in patch 2)

> ---
>  migration/channel.c   | 11 ++++++-----
>  migration/migration.c | 31 ++++++++++++++++++-------------
>  migration/migration.h |  2 +-
>  3 files changed, 25 insertions(+), 19 deletions(-)
> 
> diff --git a/migration/channel.c b/migration/channel.c
> index f9de064f3b..6d7f9172d8 100644
> --- a/migration/channel.c
> +++ b/migration/channel.c
> @@ -40,13 +40,14 @@ void migration_channel_process_incoming(QIOChannel *ioc)
>  
>      if (migrate_channel_requires_tls_upgrade(ioc)) {
>          migration_tls_channel_process_incoming(s, ioc, &local_err);
> +
> +        if (local_err) {
> +            error_report_err(local_err);
> +        }

What if tls processing failed here, do we have similar issue that qemu will
stall?  Do we want to cover that too?

> +
>      } else {
>          migration_ioc_register_yank(ioc);
> -        migration_ioc_process_incoming(ioc, &local_err);
> -    }
> -
> -    if (local_err) {
> -        error_report_err(local_err);
> +        migration_ioc_process_incoming(ioc);
>      }
>  }
>  
> diff --git a/migration/migration.c b/migration/migration.c
> index 8a61cc26d7..cd88ebc875 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -943,7 +943,7 @@ static bool migration_should_start_incoming(bool main_channel)
>      return true;
>  }
>  
> -void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
> +void migration_ioc_process_incoming(QIOChannel *ioc)
>  {
>      MigrationIncomingState *mis = migration_incoming_get_current();
>      Error *local_err = NULL;
> @@ -966,10 +966,9 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
>           * issue is not possible.
>           */
>          ret = migration_channel_read_peek(ioc, (void *)&channel_magic,
> -                                          sizeof(channel_magic), errp);
> -
> +                                          sizeof(channel_magic), &local_err);
>          if (ret != 0) {
> -            return;
> +            goto err;
>          }
>  
>          default_channel = (channel_magic == cpu_to_be32(QEMU_VM_FILE_MAGIC));
> @@ -977,8 +976,8 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
>          default_channel = !mis->from_src_file;
>      }
>  
> -    if (multifd_recv_setup(errp) != 0) {
> -        return;
> +    if (multifd_recv_setup(&local_err) != 0) {
> +        goto err;
>      }
>  
>      if (default_channel) {
> @@ -995,18 +994,24 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
>              postcopy_preempt_new_channel(mis, f);
>          }
>          if (local_err) {
> -            error_propagate(errp, local_err);
> -            return;
> +            goto err;
>          }
>      }
>  
> -    if (migration_should_start_incoming(default_channel)) {
> -        /* If it's a recovery, we're done */
> -        if (postcopy_try_recover()) {
> -            return;
> -        }
> +    if (migration_should_start_incoming(default_channel) &&
> +        !postcopy_try_recover()) {
>          migration_incoming_process();
>      }
> +
> +    return;
> +
> +err:
> +    error_report_err(local_err);
> +    migrate_set_state(&mis->state, MIGRATION_STATUS_SETUP,
> +                      MIGRATION_STATUS_FAILED);
> +    if (mis->exit_on_error) {
> +        exit(EXIT_FAILURE);
> +    }
>  }
>  
>  /**
> diff --git a/migration/migration.h b/migration/migration.h
> index 0956e9274b..c367e5ea40 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -477,7 +477,7 @@ void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
>                         MigrationStatus new_state);
>  
>  void migration_fd_process_incoming(QEMUFile *f);
> -void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp);
> +void migration_ioc_process_incoming(QIOChannel *ioc);
>  void migration_incoming_process(void);
>  
>  bool  migration_has_all_channels(void);
> -- 
> 2.35.3
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/6] tests/qtest/migration: Introduce migration_test_add_suffix
  2024-12-02 22:01 ` [PATCH 1/6] tests/qtest/migration: Introduce migration_test_add_suffix Fabiano Rosas
@ 2024-12-04 18:44   ` Peter Xu
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Xu @ 2024-12-04 18:44 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: qemu-devel, Daniel P . Berrangé

On Mon, Dec 02, 2024 at 07:01:32PM -0300, Fabiano Rosas wrote:
> Introduce a new migration_test_add_suffix to allow programmatic
> creation of tests based on a suffix. Pass the test name into the test
> so it can know which variant to run.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/6] migration: Kick postcopy threads on cancel
  2024-12-04 18:39   ` Peter Xu
@ 2024-12-04 19:02     ` Fabiano Rosas
  2024-12-04 19:39       ` Peter Xu
  0 siblings, 1 reply; 24+ messages in thread
From: Fabiano Rosas @ 2024-12-04 19:02 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, Daniel P . Berrangé

Peter Xu <peterx@redhat.com> writes:

> On Mon, Dec 02, 2024 at 07:01:33PM -0300, Fabiano Rosas wrote:
>> Make sure postcopy threads are released when migrate_cancel is
>> issued. Kick the postcopy_pause semaphore and have the fault thread
>> read 'fault_thread_quit' when joining.
>> 
>> While here fix the comment mentioning userfault_event_fd.
>> 
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>
> I remember when working on postcopy, I thought about failing migrate-cancel
> for postcopy in general, rejecting such request.  And when working on the
> recover feature, there's no concern on having it being cancelled, because
> the user really shouldn't do that..
>
> The problem is migrate-cancel means crashing the VM on both sides when QEMU
> already goes into postcopy stage.

Well, that's the sillyness of having a cancel command, you can never
know what "cancel" means. The "documentation" says: "Cancel the current
executing migration process", it doesn't mention anything about the
consequences of such action.

>
> If the user wants to crash the VM anyway, an easier way to do is killing on
> both sides.

I don't think this is fair. We expose a "cancel" command, we better do
some cancelling or instead reject the command appropriately, not expect
the user to "know better".

>
> If the user wished to cancel, we should tell them "postcopy cannot be
> cancelled, until complete".  That's probably the major reason why people
> think postcopy is dangerous to use..

We could certainly add that restriction, I don't see a problem with
it. That said, what is the actual use case for migrate_cancel? And how
does that compare with yank? I feel like we've been kind of relying on
nobody using those commands really.

One truth that we have (because it's tested) is that the multifd
migration allows migrate_cancel on the source and another migration to
start from it.

(btw, that reminds me that multifd+postcopy will probably break that
test).

>
> Or do we have any use case this could be a valid scenario?

Not that I know of. But you're the postcopy expert =)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/6] migration: Fix hang after error in destination setup phase
  2024-12-04 18:44   ` Peter Xu
@ 2024-12-04 19:05     ` Fabiano Rosas
  0 siblings, 0 replies; 24+ messages in thread
From: Fabiano Rosas @ 2024-12-04 19:05 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, Daniel P . Berrangé, Thomas Huth

Peter Xu <peterx@redhat.com> writes:

> On Mon, Dec 02, 2024 at 07:01:36PM -0300, Fabiano Rosas wrote:
>> If the destination side fails at migration_ioc_process_incoming()
>> before starting the coroutine, it will report the error but QEMU will
>> not exit.
>> 
>> Set the migration state to FAILED and exit the process if
>> exit-on-error allows.
>> 
>> CC: Thomas Huth <thuth@redhat.com>
>> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2633
>> Reported-by: Daniel P. Berrangé <berrange@redhat.com>
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>
> (I skipped the postcopy patches as of now, until we finish the discussion
>  in patch 2)
>
>> ---
>>  migration/channel.c   | 11 ++++++-----
>>  migration/migration.c | 31 ++++++++++++++++++-------------
>>  migration/migration.h |  2 +-
>>  3 files changed, 25 insertions(+), 19 deletions(-)
>> 
>> diff --git a/migration/channel.c b/migration/channel.c
>> index f9de064f3b..6d7f9172d8 100644
>> --- a/migration/channel.c
>> +++ b/migration/channel.c
>> @@ -40,13 +40,14 @@ void migration_channel_process_incoming(QIOChannel *ioc)
>>  
>>      if (migrate_channel_requires_tls_upgrade(ioc)) {
>>          migration_tls_channel_process_incoming(s, ioc, &local_err);
>> +
>> +        if (local_err) {
>> +            error_report_err(local_err);
>> +        }
>
> What if tls processing failed here, do we have similar issue that qemu will
> stall?  Do we want to cover that too?
>

Hmm, I think I confused this with the multifd_channel_connect() stuff
where the code is reentrant and we take the "regular" path when called a
second time. I need to look closer into this part.

>> +
>>      } else {
>>          migration_ioc_register_yank(ioc);
>> -        migration_ioc_process_incoming(ioc, &local_err);
>> -    }
>> -
>> -    if (local_err) {
>> -        error_report_err(local_err);
>> +        migration_ioc_process_incoming(ioc);
>>      }
>>  }
>>  
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 8a61cc26d7..cd88ebc875 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -943,7 +943,7 @@ static bool migration_should_start_incoming(bool main_channel)
>>      return true;
>>  }
>>  
>> -void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
>> +void migration_ioc_process_incoming(QIOChannel *ioc)
>>  {
>>      MigrationIncomingState *mis = migration_incoming_get_current();
>>      Error *local_err = NULL;
>> @@ -966,10 +966,9 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
>>           * issue is not possible.
>>           */
>>          ret = migration_channel_read_peek(ioc, (void *)&channel_magic,
>> -                                          sizeof(channel_magic), errp);
>> -
>> +                                          sizeof(channel_magic), &local_err);
>>          if (ret != 0) {
>> -            return;
>> +            goto err;
>>          }
>>  
>>          default_channel = (channel_magic == cpu_to_be32(QEMU_VM_FILE_MAGIC));
>> @@ -977,8 +976,8 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
>>          default_channel = !mis->from_src_file;
>>      }
>>  
>> -    if (multifd_recv_setup(errp) != 0) {
>> -        return;
>> +    if (multifd_recv_setup(&local_err) != 0) {
>> +        goto err;
>>      }
>>  
>>      if (default_channel) {
>> @@ -995,18 +994,24 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
>>              postcopy_preempt_new_channel(mis, f);
>>          }
>>          if (local_err) {
>> -            error_propagate(errp, local_err);
>> -            return;
>> +            goto err;
>>          }
>>      }
>>  
>> -    if (migration_should_start_incoming(default_channel)) {
>> -        /* If it's a recovery, we're done */
>> -        if (postcopy_try_recover()) {
>> -            return;
>> -        }
>> +    if (migration_should_start_incoming(default_channel) &&
>> +        !postcopy_try_recover()) {
>>          migration_incoming_process();
>>      }
>> +
>> +    return;
>> +
>> +err:
>> +    error_report_err(local_err);
>> +    migrate_set_state(&mis->state, MIGRATION_STATUS_SETUP,
>> +                      MIGRATION_STATUS_FAILED);
>> +    if (mis->exit_on_error) {
>> +        exit(EXIT_FAILURE);
>> +    }
>>  }
>>  
>>  /**
>> diff --git a/migration/migration.h b/migration/migration.h
>> index 0956e9274b..c367e5ea40 100644
>> --- a/migration/migration.h
>> +++ b/migration/migration.h
>> @@ -477,7 +477,7 @@ void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
>>                         MigrationStatus new_state);
>>  
>>  void migration_fd_process_incoming(QEMUFile *f);
>> -void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp);
>> +void migration_ioc_process_incoming(QIOChannel *ioc);
>>  void migration_incoming_process(void);
>>  
>>  bool  migration_has_all_channels(void);
>> -- 
>> 2.35.3
>> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/6] migration: Kick postcopy threads on cancel
  2024-12-04 19:02     ` Fabiano Rosas
@ 2024-12-04 19:39       ` Peter Xu
  2024-12-04 20:02         ` Daniel P. Berrangé
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2024-12-04 19:39 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: qemu-devel, Daniel P . Berrangé

On Wed, Dec 04, 2024 at 04:02:36PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Mon, Dec 02, 2024 at 07:01:33PM -0300, Fabiano Rosas wrote:
> >> Make sure postcopy threads are released when migrate_cancel is
> >> issued. Kick the postcopy_pause semaphore and have the fault thread
> >> read 'fault_thread_quit' when joining.
> >> 
> >> While here fix the comment mentioning userfault_event_fd.
> >> 
> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> >
> > I remember when working on postcopy, I thought about failing migrate-cancel
> > for postcopy in general, rejecting such request.  And when working on the
> > recover feature, there's no concern on having it being cancelled, because
> > the user really shouldn't do that..
> >
> > The problem is migrate-cancel means crashing the VM on both sides when QEMU
> > already goes into postcopy stage.
> 
> Well, that's the sillyness of having a cancel command, you can never
> know what "cancel" means. The "documentation" says: "Cancel the current
> executing migration process", it doesn't mention anything about the
> consequences of such action.

We definitely need cancel.  It was always used in precopy, and people use
it a lot!

> 
> >
> > If the user wants to crash the VM anyway, an easier way to do is killing on
> > both sides.
> 
> I don't think this is fair. We expose a "cancel" command, we better do
> some cancelling or instead reject the command appropriately, not expect
> the user to "know better".

That's exactly why we should fail it with a proper error message, IMHO.
Because the user may not really understand the impact of postcopy.

I remember I also tried to reuse migrate-cancel to work as what
migrate-pause does currently (so as to not introduce yet another new QMP
command).  I don't remember why we came up with the new cmd, but with
migrate-pause being there, I think the only sane way to respond to cancel
request during postcopy is to fail it properly..

> 
> >
> > If the user wished to cancel, we should tell them "postcopy cannot be
> > cancelled, until complete".  That's probably the major reason why people
> > think postcopy is dangerous to use..
> 
> We could certainly add that restriction, I don't see a problem with
> it. That said, what is the actual use case for migrate_cancel? And how
> does that compare with yank? I feel like we've been kind of relying on
> nobody using those commands really.

We had "cancel" first, then "yank".  I didn't remember who merged yank,
especially for migration, and why it was ever needed.

What I still remember is yank crash qemu migration quite frequently, though
I can't blame it as it does help us to find quite a few mismatch on
iochannel setup/cleanup or stuff like that.  So it's sometimes "helpful" as
kind of a program checker..

Migration users should have stick with "cancel" rather than "yank" - qmp
"yank" would "FAIL" the migration instead of showing CANCELLED, definitely
should avoid.  I am not aware of anybody that uses "yank" for migration at
all.

So yeah, both commands are slightly duplicated, and if we want to throw
one, it needs to be yank, not cancel.  I'm fine keeping both..

> 
> One truth that we have (because it's tested) is that the multifd
> migration allows migrate_cancel on the source and another migration to
> start from it.
> 
> (btw, that reminds me that multifd+postcopy will probably break that
> test).

When postcopy==true (even if multifd==true as well!  Or not, it doesn't
matter), we should reject the cancel request, IMHO.

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/6] migration: Kick postcopy threads on cancel
  2024-12-04 19:39       ` Peter Xu
@ 2024-12-04 20:02         ` Daniel P. Berrangé
  2024-12-04 20:40           ` Fabiano Rosas
  2024-12-04 20:51           ` Peter Xu
  0 siblings, 2 replies; 24+ messages in thread
From: Daniel P. Berrangé @ 2024-12-04 20:02 UTC (permalink / raw)
  To: Peter Xu; +Cc: Fabiano Rosas, qemu-devel

On Wed, Dec 04, 2024 at 02:39:12PM -0500, Peter Xu wrote:
> On Wed, Dec 04, 2024 at 04:02:36PM -0300, Fabiano Rosas wrote:
> > Peter Xu <peterx@redhat.com> writes:
> > 
> > > On Mon, Dec 02, 2024 at 07:01:33PM -0300, Fabiano Rosas wrote:
> > >> Make sure postcopy threads are released when migrate_cancel is
> > >> issued. Kick the postcopy_pause semaphore and have the fault thread
> > >> read 'fault_thread_quit' when joining.
> > >> 
> > >> While here fix the comment mentioning userfault_event_fd.
> > >> 
> > >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> > >
> > > I remember when working on postcopy, I thought about failing migrate-cancel
> > > for postcopy in general, rejecting such request.  And when working on the
> > > recover feature, there's no concern on having it being cancelled, because
> > > the user really shouldn't do that..
> > >
> > > The problem is migrate-cancel means crashing the VM on both sides when QEMU
> > > already goes into postcopy stage.
> > 
> > Well, that's the sillyness of having a cancel command, you can never
> > know what "cancel" means. The "documentation" says: "Cancel the current
> > executing migration process", it doesn't mention anything about the
> > consequences of such action.
> 
> We definitely need cancel.  It was always used in precopy, and people use
> it a lot!

Not a fair benchmark though.

People use cancel alot because 'precopy' cannot complete in a
predictable amount of time, any time guesstime can suddenly
get much worse if the guest dirties memory differently. So
people give up and cancel it after waiting ridiculously long
and never being sure if it is nearing the finish.

Once a migrate has been switched to 'postcopy' phase, however,
we have what should be a highly predictable completion time,
directly related to the amount of untransferred pages. That
time should not get worse. The amount of time spent in the
'postcopy' phase will depend on how long you let the migrate
run in 'precopy' before flipping to 'postcopy'

IOW, I think there's a reasonable case to be made that NOT
having the ability to cancel while in 'postcopy' phase would
be mostly acceptable. You give up the ability to cancel for
a while, in exchange for a clearly determined completion
time. 

> > > If the user wants to crash the VM anyway, an easier way to do is killing on
> > > both sides.
> > 
> > I don't think this is fair. We expose a "cancel" command, we better do
> > some cancelling or instead reject the command appropriately, not expect
> > the user to "know better".
> 
> That's exactly why we should fail it with a proper error message, IMHO.
> Because the user may not really understand the impact of postcopy.

Yep, I think users/apps expect "cancel" to be safe. So if it can't be
safe at certain times, we should reject it in those time windows.

> > > If the user wished to cancel, we should tell them "postcopy cannot be
> > > cancelled, until complete".  That's probably the major reason why people
> > > think postcopy is dangerous to use..
> > 
> > We could certainly add that restriction, I don't see a problem with
> > it. That said, what is the actual use case for migrate_cancel? And how
> > does that compare with yank? I feel like we've been kind of relying on
> > nobody using those commands really.
> 
> We had "cancel" first, then "yank".  I didn't remember who merged yank,
> especially for migration, and why it was ever needed.

yank is for the case where the network connections are completely stuck,
causing QEMU to potentially get stalled in I/O operations until a TCP
timeout is reached. yank force unwedges any stuck I/O by aggresively
closing the connections. It is most useful in the non-migration use
cases though.

> Migration users should have stick with "cancel" rather than "yank" - qmp
> "yank" would "FAIL" the migration instead of showing CANCELLED, definitely
> should avoid.  I am not aware of anybody that uses "yank" for migration at
> all.
> 
> So yeah, both commands are slightly duplicated, and if we want to throw
> one, it needs to be yank, not cancel.  I'm fine keeping both..

I would say the difference is like a graceful shutdown vs pulling the
power plug in a bare metal machine

'cancel' is intended to be graceful. It should leave you with a functional
QEMU (or refuse to run if unsafe).

'yank' is intended to be forceful, letting you get out of bad situations
that would otherwise require you to kill the entire QEMU process, but
still with possible associated risk data loss to the QEMU backends.

They have overlap, but are none the less different.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/6] migration: Kick postcopy threads on cancel
  2024-12-04 20:02         ` Daniel P. Berrangé
@ 2024-12-04 20:40           ` Fabiano Rosas
  2024-12-04 20:59             ` Peter Xu
  2024-12-04 20:51           ` Peter Xu
  1 sibling, 1 reply; 24+ messages in thread
From: Fabiano Rosas @ 2024-12-04 20:40 UTC (permalink / raw)
  To: Daniel P. Berrangé, Peter Xu; +Cc: qemu-devel

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Wed, Dec 04, 2024 at 02:39:12PM -0500, Peter Xu wrote:
>> On Wed, Dec 04, 2024 at 04:02:36PM -0300, Fabiano Rosas wrote:
>> > Peter Xu <peterx@redhat.com> writes:
>> > 
>> > > On Mon, Dec 02, 2024 at 07:01:33PM -0300, Fabiano Rosas wrote:
>> > >> Make sure postcopy threads are released when migrate_cancel is
>> > >> issued. Kick the postcopy_pause semaphore and have the fault thread
>> > >> read 'fault_thread_quit' when joining.
>> > >> 
>> > >> While here fix the comment mentioning userfault_event_fd.
>> > >> 
>> > >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> > >
>> > > I remember when working on postcopy, I thought about failing migrate-cancel
>> > > for postcopy in general, rejecting such request.  And when working on the
>> > > recover feature, there's no concern on having it being cancelled, because
>> > > the user really shouldn't do that..
>> > >
>> > > The problem is migrate-cancel means crashing the VM on both sides when QEMU
>> > > already goes into postcopy stage.
>> > 
>> > Well, that's the sillyness of having a cancel command, you can never
>> > know what "cancel" means. The "documentation" says: "Cancel the current
>> > executing migration process", it doesn't mention anything about the
>> > consequences of such action.
>> 
>> We definitely need cancel.  It was always used in precopy, and people use
>> it a lot!

To be clear, I'm not arguing against cancel. I'm just pointing out that
it's silly because it's just like pressing C-c in the shell in the
middle of something. What's the expected end state? Completely
unspecified. I don't find it at all "elegant" that we treat cancel like
error and just let the code carry on stumbling and exit
eventually. Because then we have this C-c arriving at random moments in
the middle of stuff. The way we do "exiting" in multifd is way more
maintainable. If that flag is set, then let's exit, otherwise everything
should work.

Note that in some parts of this series I'm checking the CANCELLING
state, which aside from being prone to be racy, is a step in a different
direction than the way we've implemented cancel so far (shutdown and let
the code exit vs. explicitly check against the CANCELLING state).

Hope I'm being coherent, I'm not sure.

>
> Not a fair benchmark though.
>
> People use cancel alot because 'precopy' cannot complete in a
> predictable amount of time, any time guesstime can suddenly
> get much worse if the guest dirties memory differently. So
> people give up and cancel it after waiting ridiculously long
> and never being sure if it is nearing the finish.
>
> Once a migrate has been switched to 'postcopy' phase, however,
> we have what should be a highly predictable completion time,
> directly related to the amount of untransferred pages. That
> time should not get worse. The amount of time spent in the
> 'postcopy' phase will depend on how long you let the migrate
> run in 'precopy' before flipping to 'postcopy'
>
> IOW, I think there's a reasonable case to be made that NOT
> having the ability to cancel while in 'postcopy' phase would
> be mostly acceptable. You give up the ability to cancel for
> a while, in exchange for a clearly determined completion
> time. 

Some of these words could be in the documentation for cancel to make it
clear what the expectations are.

>
>> > > If the user wants to crash the VM anyway, an easier way to do is killing on
>> > > both sides.
>> > 
>> > I don't think this is fair. We expose a "cancel" command, we better do
>> > some cancelling or instead reject the command appropriately, not expect
>> > the user to "know better".
>> 
>> That's exactly why we should fail it with a proper error message, IMHO.
>> Because the user may not really understand the impact of postcopy.
>
> Yep, I think users/apps expect "cancel" to be safe. So if it can't be
> safe at certain times, we should reject it in those time windows.

Which kind of implies we should test this by migrate->cancel->migrate,
like the multifd test does, and not like this series does. I've been
focusing more on catching crashes/hangs.

>
>> > > If the user wished to cancel, we should tell them "postcopy cannot be
>> > > cancelled, until complete".  That's probably the major reason why people
>> > > think postcopy is dangerous to use..
>> > 
>> > We could certainly add that restriction, I don't see a problem with
>> > it. That said, what is the actual use case for migrate_cancel? And how
>> > does that compare with yank? I feel like we've been kind of relying on
>> > nobody using those commands really.
>> 
>> We had "cancel" first, then "yank".  I didn't remember who merged yank,
>> especially for migration, and why it was ever needed.
>
> yank is for the case where the network connections are completely stuck,
> causing QEMU to potentially get stalled in I/O operations until a TCP
> timeout is reached. yank force unwedges any stuck I/O by aggresively
> closing the connections. It is most useful in the non-migration use
> cases though.
>

I asked because for socket at least yank and cancel do the same:
shutdown(). It might be more trouble than it's worth it, but I wonder if
we could have everything that can be stuck implement a yank routine and
just have cancel call those. So for instance, when this series does
sem_post on a bunch of semaphores explicitly, the cancel command would
instead call whatever yank routine was registered by the code that can
wait on the semaphore. With this there should be no surprises of a
cancel arriving at a weird moment, because we'd require "code that
sleeps" to implement a yank.

>> Migration users should have stick with "cancel" rather than "yank" - qmp
>> "yank" would "FAIL" the migration instead of showing CANCELLED, definitely
>> should avoid.  I am not aware of anybody that uses "yank" for migration at
>> all.
>> 
>> So yeah, both commands are slightly duplicated, and if we want to throw
>> one, it needs to be yank, not cancel.  I'm fine keeping both..
>
> I would say the difference is like a graceful shutdown vs pulling the
> power plug in a bare metal machine
>
> 'cancel' is intended to be graceful. It should leave you with a functional
> QEMU (or refuse to run if unsafe).
>
> 'yank' is intended to be forceful, letting you get out of bad situations
> that would otherwise require you to kill the entire QEMU process, but
> still with possible associated risk data loss to the QEMU backends.

For migration specifically I don't see much difference. Yank must leave
QEMU in a usable state enough for a second migration to succeed,
otherwise it's useless.

>
> They have overlap, but are none the less different.
>
> With regards,
> Daniel


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/6] migration: Kick postcopy threads on cancel
  2024-12-04 20:02         ` Daniel P. Berrangé
  2024-12-04 20:40           ` Fabiano Rosas
@ 2024-12-04 20:51           ` Peter Xu
  2024-12-04 21:01             ` Fabiano Rosas
  2024-12-05 10:52             ` Daniel P. Berrangé
  1 sibling, 2 replies; 24+ messages in thread
From: Peter Xu @ 2024-12-04 20:51 UTC (permalink / raw)
  To: Daniel P. Berrangé; +Cc: Fabiano Rosas, qemu-devel

On Wed, Dec 04, 2024 at 08:02:31PM +0000, Daniel P. Berrangé wrote:
> On Wed, Dec 04, 2024 at 02:39:12PM -0500, Peter Xu wrote:
> > On Wed, Dec 04, 2024 at 04:02:36PM -0300, Fabiano Rosas wrote:
> > > Peter Xu <peterx@redhat.com> writes:
> > > 
> > > > On Mon, Dec 02, 2024 at 07:01:33PM -0300, Fabiano Rosas wrote:
> > > >> Make sure postcopy threads are released when migrate_cancel is
> > > >> issued. Kick the postcopy_pause semaphore and have the fault thread
> > > >> read 'fault_thread_quit' when joining.
> > > >> 
> > > >> While here fix the comment mentioning userfault_event_fd.
> > > >> 
> > > >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> > > >
> > > > I remember when working on postcopy, I thought about failing migrate-cancel
> > > > for postcopy in general, rejecting such request.  And when working on the
> > > > recover feature, there's no concern on having it being cancelled, because
> > > > the user really shouldn't do that..
> > > >
> > > > The problem is migrate-cancel means crashing the VM on both sides when QEMU
> > > > already goes into postcopy stage.
> > > 
> > > Well, that's the sillyness of having a cancel command, you can never
> > > know what "cancel" means. The "documentation" says: "Cancel the current
> > > executing migration process", it doesn't mention anything about the
> > > consequences of such action.
> > 
> > We definitely need cancel.  It was always used in precopy, and people use
> > it a lot!
> 
> Not a fair benchmark though.
> 
> People use cancel alot because 'precopy' cannot complete in a
> predictable amount of time, any time guesstime can suddenly
> get much worse if the guest dirties memory differently. So
> people give up and cancel it after waiting ridiculously long
> and never being sure if it is nearing the finish.
> 
> Once a migrate has been switched to 'postcopy' phase, however,
> we have what should be a highly predictable completion time,
> directly related to the amount of untransferred pages. That
> time should not get worse. The amount of time spent in the
> 'postcopy' phase will depend on how long you let the migrate
> run in 'precopy' before flipping to 'postcopy'

Right.  And if to be precise - the time doesn't matter, but "how many dirty
page left".  For extremely busy guests, longer precopy doesn't help, for
example.. so time isn't very relevant, IMHO.  It's just that postcopy
always have an upper time limit, which is guest mem size / bw.

> 
> IOW, I think there's a reasonable case to be made that NOT
> having the ability to cancel while in 'postcopy' phase would
> be mostly acceptable. You give up the ability to cancel for
> a while, in exchange for a clearly determined completion
> time. 
> 
> > > > If the user wants to crash the VM anyway, an easier way to do is killing on
> > > > both sides.
> > > 
> > > I don't think this is fair. We expose a "cancel" command, we better do
> > > some cancelling or instead reject the command appropriately, not expect
> > > the user to "know better".
> > 
> > That's exactly why we should fail it with a proper error message, IMHO.
> > Because the user may not really understand the impact of postcopy.
> 
> Yep, I think users/apps expect "cancel" to be safe. So if it can't be
> safe at certain times, we should reject it in those time windows.
> 
> > > > If the user wished to cancel, we should tell them "postcopy cannot be
> > > > cancelled, until complete".  That's probably the major reason why people
> > > > think postcopy is dangerous to use..
> > > 
> > > We could certainly add that restriction, I don't see a problem with
> > > it. That said, what is the actual use case for migrate_cancel? And how
> > > does that compare with yank? I feel like we've been kind of relying on
> > > nobody using those commands really.
> > 
> > We had "cancel" first, then "yank".  I didn't remember who merged yank,
> > especially for migration, and why it was ever needed.
> 
> yank is for the case where the network connections are completely stuck,
> causing QEMU to potentially get stalled in I/O operations until a TCP
> timeout is reached. yank force unwedges any stuck I/O by aggresively
> closing the connections. It is most useful in the non-migration use
> cases though.
> 
> > Migration users should have stick with "cancel" rather than "yank" - qmp
> > "yank" would "FAIL" the migration instead of showing CANCELLED, definitely
> > should avoid.  I am not aware of anybody that uses "yank" for migration at
> > all.
> > 
> > So yeah, both commands are slightly duplicated, and if we want to throw
> > one, it needs to be yank, not cancel.  I'm fine keeping both..
> 
> I would say the difference is like a graceful shutdown vs pulling the
> power plug in a bare metal machine
> 
> 'cancel' is intended to be graceful. It should leave you with a functional
> QEMU (or refuse to run if unsafe).
> 
> 'yank' is intended to be forceful, letting you get out of bad situations
> that would otherwise require you to kill the entire QEMU process, but
> still with possible associated risk data loss to the QEMU backends.
> 
> They have overlap, but are none the less different.

The question is more about whether yank should be used at all for
migration only, not about the rest instances.

My answer is yank should never be used for migration, because
"migrate_cancel" also unplugs the power plug.. It's not anything more
enforced.  It's only doing less always.

E.g. migration_yank_iochannel() is exactly what we do with
qmp_migrate_cancel() in the first place, only that migrate_cancel only does
it on the main channel (on both qemufiles even if ioc is one), however it
should be suffice, and behave the same way, as strong as "yank".

Meanwhile, "yank" definitely lacks the graceful side of thing, not only on
"cancelled", but also anything extra in migration_cancel() that wasn't
about shutdown().  Examples, qmp_migrate_cancel has special care on:

  - migrate_dirty_limit(), where it'll shutdown the limit threads too
    alongside if migration cancels,

  - process the possible hang case with PRE_SWITCHOVER.

  - if the cancel happened right during switchover, and if at that time the
    disks are inactivated, it'll re-activate the disks

In general, there's no reason to use yank on migration, IMHO, because it's
not stronger either on shutting the NIC.

Considering it's confusing to mostly everyone, and tons of people asked me
about this.. maybe I should send a patch to remove yank from migration?

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/6] migration: Kick postcopy threads on cancel
  2024-12-04 20:40           ` Fabiano Rosas
@ 2024-12-04 20:59             ` Peter Xu
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Xu @ 2024-12-04 20:59 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: Daniel P. Berrangé, qemu-devel

On Wed, Dec 04, 2024 at 05:40:17PM -0300, Fabiano Rosas wrote:
> To be clear, I'm not arguing against cancel. I'm just pointing out that
> it's silly because it's just like pressing C-c in the shell in the
> middle of something. What's the expected end state? Completely
> unspecified. I don't find it at all "elegant" that we treat cancel like
> error and just let the code carry on stumbling and exit
> eventually. Because then we have this C-c arriving at random moments in
> the middle of stuff. The way we do "exiting" in multifd is way more
> maintainable. If that flag is set, then let's exit, otherwise everything
> should work.

If taking the example of C-c, then "migration during postcopy" is exactly
"TASK_UNINTERRUPTIBLE".. :) IOW, "hanging death" the C-c is the correct and
expected behavior for UNINTERRUPTIBLE tasks.

[...]

> > 'yank' is intended to be forceful, letting you get out of bad situations
> > that would otherwise require you to kill the entire QEMU process, but
> > still with possible associated risk data loss to the QEMU backends.
> 
> For migration specifically I don't see much difference. Yank must leave
> QEMU in a usable state enough for a second migration to succeed,
> otherwise it's useless.

Side note: when I said (in the other reply) that we should remove yank
support on migration, I meant, we should probably deprecate that (and then
remove it).

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/6] migration: Kick postcopy threads on cancel
  2024-12-04 20:51           ` Peter Xu
@ 2024-12-04 21:01             ` Fabiano Rosas
  2024-12-04 21:11               ` Peter Xu
  2024-12-05 10:52             ` Daniel P. Berrangé
  1 sibling, 1 reply; 24+ messages in thread
From: Fabiano Rosas @ 2024-12-04 21:01 UTC (permalink / raw)
  To: Peter Xu, Daniel P. Berrangé; +Cc: qemu-devel

Peter Xu <peterx@redhat.com> writes:

> On Wed, Dec 04, 2024 at 08:02:31PM +0000, Daniel P. Berrangé wrote:
>> On Wed, Dec 04, 2024 at 02:39:12PM -0500, Peter Xu wrote:
>> > On Wed, Dec 04, 2024 at 04:02:36PM -0300, Fabiano Rosas wrote:
>> > > Peter Xu <peterx@redhat.com> writes:
>> > > 
>> > > > On Mon, Dec 02, 2024 at 07:01:33PM -0300, Fabiano Rosas wrote:
>> > > >> Make sure postcopy threads are released when migrate_cancel is
>> > > >> issued. Kick the postcopy_pause semaphore and have the fault thread
>> > > >> read 'fault_thread_quit' when joining.
>> > > >> 
>> > > >> While here fix the comment mentioning userfault_event_fd.
>> > > >> 
>> > > >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> > > >
>> > > > I remember when working on postcopy, I thought about failing migrate-cancel
>> > > > for postcopy in general, rejecting such request.  And when working on the
>> > > > recover feature, there's no concern on having it being cancelled, because
>> > > > the user really shouldn't do that..
>> > > >
>> > > > The problem is migrate-cancel means crashing the VM on both sides when QEMU
>> > > > already goes into postcopy stage.
>> > > 
>> > > Well, that's the sillyness of having a cancel command, you can never
>> > > know what "cancel" means. The "documentation" says: "Cancel the current
>> > > executing migration process", it doesn't mention anything about the
>> > > consequences of such action.
>> > 
>> > We definitely need cancel.  It was always used in precopy, and people use
>> > it a lot!
>> 
>> Not a fair benchmark though.
>> 
>> People use cancel alot because 'precopy' cannot complete in a
>> predictable amount of time, any time guesstime can suddenly
>> get much worse if the guest dirties memory differently. So
>> people give up and cancel it after waiting ridiculously long
>> and never being sure if it is nearing the finish.
>> 
>> Once a migrate has been switched to 'postcopy' phase, however,
>> we have what should be a highly predictable completion time,
>> directly related to the amount of untransferred pages. That
>> time should not get worse. The amount of time spent in the
>> 'postcopy' phase will depend on how long you let the migrate
>> run in 'precopy' before flipping to 'postcopy'
>
> Right.  And if to be precise - the time doesn't matter, but "how many dirty
> page left".  For extremely busy guests, longer precopy doesn't help, for
> example.. so time isn't very relevant, IMHO.  It's just that postcopy
> always have an upper time limit, which is guest mem size / bw.
>
>> 
>> IOW, I think there's a reasonable case to be made that NOT
>> having the ability to cancel while in 'postcopy' phase would
>> be mostly acceptable. You give up the ability to cancel for
>> a while, in exchange for a clearly determined completion
>> time. 
>> 
>> > > > If the user wants to crash the VM anyway, an easier way to do is killing on
>> > > > both sides.
>> > > 
>> > > I don't think this is fair. We expose a "cancel" command, we better do
>> > > some cancelling or instead reject the command appropriately, not expect
>> > > the user to "know better".
>> > 
>> > That's exactly why we should fail it with a proper error message, IMHO.
>> > Because the user may not really understand the impact of postcopy.
>> 
>> Yep, I think users/apps expect "cancel" to be safe. So if it can't be
>> safe at certain times, we should reject it in those time windows.
>> 
>> > > > If the user wished to cancel, we should tell them "postcopy cannot be
>> > > > cancelled, until complete".  That's probably the major reason why people
>> > > > think postcopy is dangerous to use..
>> > > 
>> > > We could certainly add that restriction, I don't see a problem with
>> > > it. That said, what is the actual use case for migrate_cancel? And how
>> > > does that compare with yank? I feel like we've been kind of relying on
>> > > nobody using those commands really.
>> > 
>> > We had "cancel" first, then "yank".  I didn't remember who merged yank,
>> > especially for migration, and why it was ever needed.
>> 
>> yank is for the case where the network connections are completely stuck,
>> causing QEMU to potentially get stalled in I/O operations until a TCP
>> timeout is reached. yank force unwedges any stuck I/O by aggresively
>> closing the connections. It is most useful in the non-migration use
>> cases though.
>> 
>> > Migration users should have stick with "cancel" rather than "yank" - qmp
>> > "yank" would "FAIL" the migration instead of showing CANCELLED, definitely
>> > should avoid.  I am not aware of anybody that uses "yank" for migration at
>> > all.
>> > 
>> > So yeah, both commands are slightly duplicated, and if we want to throw
>> > one, it needs to be yank, not cancel.  I'm fine keeping both..
>> 
>> I would say the difference is like a graceful shutdown vs pulling the
>> power plug in a bare metal machine
>> 
>> 'cancel' is intended to be graceful. It should leave you with a functional
>> QEMU (or refuse to run if unsafe).
>> 
>> 'yank' is intended to be forceful, letting you get out of bad situations
>> that would otherwise require you to kill the entire QEMU process, but
>> still with possible associated risk data loss to the QEMU backends.
>> 
>> They have overlap, but are none the less different.
>
> The question is more about whether yank should be used at all for
> migration only, not about the rest instances.
>
> My answer is yank should never be used for migration, because
> "migrate_cancel" also unplugs the power plug.. It's not anything more
> enforced.  It's only doing less always.
>
> E.g. migration_yank_iochannel() is exactly what we do with
> qmp_migrate_cancel() in the first place, only that migrate_cancel only does
> it on the main channel (on both qemufiles even if ioc is one), however it
> should be suffice, and behave the same way, as strong as "yank".
>
> Meanwhile, "yank" definitely lacks the graceful side of thing, not only on
> "cancelled", but also anything extra in migration_cancel() that wasn't
> about shutdown().  Examples, qmp_migrate_cancel has special care on:
>
>   - migrate_dirty_limit(), where it'll shutdown the limit threads too
>     alongside if migration cancels,
>
>   - process the possible hang case with PRE_SWITCHOVER.
>
>   - if the cancel happened right during switchover, and if at that time the
>     disks are inactivated, it'll re-activate the disks
>
> In general, there's no reason to use yank on migration, IMHO, because it's
> not stronger either on shutting the NIC.
>
> Considering it's confusing to mostly everyone, and tons of people asked me
> about this.. maybe I should send a patch to remove yank from migration?

Take a look at my suggestion in the other thread, it might make yank
make sense for migration after all. But I'm not against the removal,
it's better than the current state IMO.

>
> Thanks,


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/6] migration: Kick postcopy threads on cancel
  2024-12-04 21:01             ` Fabiano Rosas
@ 2024-12-04 21:11               ` Peter Xu
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Xu @ 2024-12-04 21:11 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: Daniel P. Berrangé, qemu-devel

On Wed, Dec 04, 2024 at 06:01:39PM -0300, Fabiano Rosas wrote:
> > Considering it's confusing to mostly everyone, and tons of people asked me
> > about this.. maybe I should send a patch to remove yank from migration?
> 
> Take a look at my suggestion in the other thread, it might make yank
> make sense for migration after all. But I'm not against the removal,
> it's better than the current state IMO.

I missed that.  Copying that part over:

> I asked because for socket at least yank and cancel do the same:
> shutdown(). It might be more trouble than it's worth it, but I wonder if
> we could have everything that can be stuck implement a yank routine and
> just have cancel call those. So for instance, when this series does
> sem_post on a bunch of semaphores explicitly, the cancel command would
> instead call whatever yank routine was registered by the code that can
> wait on the semaphore. With this there should be no surprises of a
> cancel arriving at a weird moment, because we'd require "code that
> sleeps" to implement a yank.

If so, it means migration_cancel will still do the same as what yank would
do (but it could cover more now, like processing sleeping semaphores).

Actually, since it'll invoke a yank API it'll be the same from that regard.

However as long as there's still something extra migrate_cancel would do
after that yank API, then we should still suggest nobody using QMP "yank"
on migration instance..

Unless we want to precisely define migrate_cancel exactly the same as the
yank API would do.  But then it means the two APIs are identical, then we
should probably keep the one that people use the most, which is still, the
migrate_cancel API..

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/6] migration: Kick postcopy threads on cancel
  2024-12-04 20:51           ` Peter Xu
  2024-12-04 21:01             ` Fabiano Rosas
@ 2024-12-05 10:52             ` Daniel P. Berrangé
  2024-12-05 13:18               ` Fabiano Rosas
  1 sibling, 1 reply; 24+ messages in thread
From: Daniel P. Berrangé @ 2024-12-05 10:52 UTC (permalink / raw)
  To: Peter Xu; +Cc: Fabiano Rosas, qemu-devel

On Wed, Dec 04, 2024 at 03:51:27PM -0500, Peter Xu wrote:
> On Wed, Dec 04, 2024 at 08:02:31PM +0000, Daniel P. Berrangé wrote:
> > I would say the difference is like a graceful shutdown vs pulling the
> > power plug in a bare metal machine
> > 
> > 'cancel' is intended to be graceful. It should leave you with a functional
> > QEMU (or refuse to run if unsafe).
> > 
> > 'yank' is intended to be forceful, letting you get out of bad situations
> > that would otherwise require you to kill the entire QEMU process, but
> > still with possible associated risk data loss to the QEMU backends.
> > 
> > They have overlap, but are none the less different.
> 
> The question is more about whether yank should be used at all for
> migration only, not about the rest instances.
> 
> My answer is yank should never be used for migration, because
> "migrate_cancel" also unplugs the power plug.. It's not anything more
> enforced.  It's only doing less always.
> 
> E.g. migration_yank_iochannel() is exactly what we do with
> qmp_migrate_cancel() in the first place, only that migrate_cancel only does
> it on the main channel (on both qemufiles even if ioc is one), however it
> should be suffice, and behave the same way, as strong as "yank".

I recall at the time the yank stuff was introduced, one of the scenarios
they were concerned about was related to locks held by QEMU code. eg that
there are scenarios where migrate_cancel may not be processed promptly
enough due to being stalled on mutexes held by other concurrently running
threads. Now I would expect any such long duration stalls on migration
mutexes to be bugs, but the intent of yank is to give a recovery mechanism
that can workaround such bugs.  The yank QMP command only interacts with
its own local mutexes.

So even though both migrate-cancel and yank end up calling the same
qio_channel_shutdown() API, there can be practical differences in how
they reach that qio_channel_shutdown() call.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/6] migration: Kick postcopy threads on cancel
  2024-12-05 10:52             ` Daniel P. Berrangé
@ 2024-12-05 13:18               ` Fabiano Rosas
  2024-12-05 15:03                 ` Peter Xu
  2024-12-05 15:40                 ` Daniel P. Berrangé
  0 siblings, 2 replies; 24+ messages in thread
From: Fabiano Rosas @ 2024-12-05 13:18 UTC (permalink / raw)
  To: Daniel P. Berrangé, Peter Xu; +Cc: qemu-devel

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Wed, Dec 04, 2024 at 03:51:27PM -0500, Peter Xu wrote:
>> On Wed, Dec 04, 2024 at 08:02:31PM +0000, Daniel P. Berrangé wrote:
>> > I would say the difference is like a graceful shutdown vs pulling the
>> > power plug in a bare metal machine
>> > 
>> > 'cancel' is intended to be graceful. It should leave you with a functional
>> > QEMU (or refuse to run if unsafe).
>> > 
>> > 'yank' is intended to be forceful, letting you get out of bad situations
>> > that would otherwise require you to kill the entire QEMU process, but
>> > still with possible associated risk data loss to the QEMU backends.
>> > 
>> > They have overlap, but are none the less different.
>> 
>> The question is more about whether yank should be used at all for
>> migration only, not about the rest instances.
>> 
>> My answer is yank should never be used for migration, because
>> "migrate_cancel" also unplugs the power plug.. It's not anything more
>> enforced.  It's only doing less always.
>> 
>> E.g. migration_yank_iochannel() is exactly what we do with
>> qmp_migrate_cancel() in the first place, only that migrate_cancel only does
>> it on the main channel (on both qemufiles even if ioc is one), however it
>> should be suffice, and behave the same way, as strong as "yank".
>
> I recall at the time the yank stuff was introduced, one of the scenarios
> they were concerned about was related to locks held by QEMU code. eg that
> there are scenarios where migrate_cancel may not be processed promptly
> enough due to being stalled on mutexes held by other concurrently running
> threads. Now I would expect any such long duration stalls on migration
> mutexes to be bugs, but the intent of yank is to give a recovery mechanism
> that can workaround such bugs.  The yank QMP command only interacts with
> its own local mutexes.

Ok, so that could only mean a thread stuck in recv() while holding the
BQL. I don't think we have any other locks which would stop
migrate_cancel from making progress or other stall situations that could
be helped by a shutdown(). Note that most of locks around qemu_file were
a late addition. I don't think that scenario is possible today. I'll
have to do some tests.

On that note, how is yank supposed to be accessed? I don't see support
in libvirt. Is there a way to hook into QMP after the fact somehow?


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/6] migration: Kick postcopy threads on cancel
  2024-12-05 13:18               ` Fabiano Rosas
@ 2024-12-05 15:03                 ` Peter Xu
  2024-12-05 15:44                   ` Daniel P. Berrangé
  2024-12-05 15:40                 ` Daniel P. Berrangé
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Xu @ 2024-12-05 15:03 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: Daniel P. Berrangé, qemu-devel

On Thu, Dec 05, 2024 at 10:18:53AM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Wed, Dec 04, 2024 at 03:51:27PM -0500, Peter Xu wrote:
> >> On Wed, Dec 04, 2024 at 08:02:31PM +0000, Daniel P. Berrangé wrote:
> >> > I would say the difference is like a graceful shutdown vs pulling the
> >> > power plug in a bare metal machine
> >> > 
> >> > 'cancel' is intended to be graceful. It should leave you with a functional
> >> > QEMU (or refuse to run if unsafe).
> >> > 
> >> > 'yank' is intended to be forceful, letting you get out of bad situations
> >> > that would otherwise require you to kill the entire QEMU process, but
> >> > still with possible associated risk data loss to the QEMU backends.
> >> > 
> >> > They have overlap, but are none the less different.
> >> 
> >> The question is more about whether yank should be used at all for
> >> migration only, not about the rest instances.
> >> 
> >> My answer is yank should never be used for migration, because
> >> "migrate_cancel" also unplugs the power plug.. It's not anything more
> >> enforced.  It's only doing less always.
> >> 
> >> E.g. migration_yank_iochannel() is exactly what we do with
> >> qmp_migrate_cancel() in the first place, only that migrate_cancel only does
> >> it on the main channel (on both qemufiles even if ioc is one), however it
> >> should be suffice, and behave the same way, as strong as "yank".
> >
> > I recall at the time the yank stuff was introduced, one of the scenarios
> > they were concerned about was related to locks held by QEMU code. eg that
> > there are scenarios where migrate_cancel may not be processed promptly
> > enough due to being stalled on mutexes held by other concurrently running
> > threads. Now I would expect any such long duration stalls on migration
> > mutexes to be bugs, but the intent of yank is to give a recovery mechanism
> > that can workaround such bugs.  The yank QMP command only interacts with
> > its own local mutexes.
> 
> Ok, so that could only mean a thread stuck in recv() while holding the
> BQL. I don't think we have any other locks which would stop
> migrate_cancel from making progress or other stall situations that could
> be helped by a shutdown(). Note that most of locks around qemu_file were
> a late addition. I don't think that scenario is possible today. I'll
> have to do some tests.

And if that is a real difference, I'd think whether we should simply make
migrate_cancel be oob-capable too.. IOW, I still think it'll be good to
stick with always one API to cancel a migration, no matter which it is.  If
we want to move over to yank then I think we should move all migrate_cancel
operations into yank and deprecate "migrate_cancel', but that sounds
overkill.

There's only one thing that might not be oob-compatible there so far, which
is bdrv_activate_all().  But I plan to remove it very soon (so that disks
will be activated in the migration thread instead, just like failure cases).

> 
> On that note, how is yank supposed to be accessed? I don't see support
> in libvirt. Is there a way to hook into QMP after the fact somehow?
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/6] migration: Kick postcopy threads on cancel
  2024-12-05 13:18               ` Fabiano Rosas
  2024-12-05 15:03                 ` Peter Xu
@ 2024-12-05 15:40                 ` Daniel P. Berrangé
  1 sibling, 0 replies; 24+ messages in thread
From: Daniel P. Berrangé @ 2024-12-05 15:40 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: Peter Xu, qemu-devel

On Thu, Dec 05, 2024 at 10:18:53AM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Wed, Dec 04, 2024 at 03:51:27PM -0500, Peter Xu wrote:
> >> On Wed, Dec 04, 2024 at 08:02:31PM +0000, Daniel P. Berrangé wrote:
> >> > I would say the difference is like a graceful shutdown vs pulling the
> >> > power plug in a bare metal machine
> >> > 
> >> > 'cancel' is intended to be graceful. It should leave you with a functional
> >> > QEMU (or refuse to run if unsafe).
> >> > 
> >> > 'yank' is intended to be forceful, letting you get out of bad situations
> >> > that would otherwise require you to kill the entire QEMU process, but
> >> > still with possible associated risk data loss to the QEMU backends.
> >> > 
> >> > They have overlap, but are none the less different.
> >> 
> >> The question is more about whether yank should be used at all for
> >> migration only, not about the rest instances.
> >> 
> >> My answer is yank should never be used for migration, because
> >> "migrate_cancel" also unplugs the power plug.. It's not anything more
> >> enforced.  It's only doing less always.
> >> 
> >> E.g. migration_yank_iochannel() is exactly what we do with
> >> qmp_migrate_cancel() in the first place, only that migrate_cancel only does
> >> it on the main channel (on both qemufiles even if ioc is one), however it
> >> should be suffice, and behave the same way, as strong as "yank".
> >
> > I recall at the time the yank stuff was introduced, one of the scenarios
> > they were concerned about was related to locks held by QEMU code. eg that
> > there are scenarios where migrate_cancel may not be processed promptly
> > enough due to being stalled on mutexes held by other concurrently running
> > threads. Now I would expect any such long duration stalls on migration
> > mutexes to be bugs, but the intent of yank is to give a recovery mechanism
> > that can workaround such bugs.  The yank QMP command only interacts with
> > its own local mutexes.
> 
> Ok, so that could only mean a thread stuck in recv() while holding the
> BQL. I don't think we have any other locks which would stop
> migrate_cancel from making progress or other stall situations that could
> be helped by a shutdown(). Note that most of locks around qemu_file were
> a late addition. I don't think that scenario is possible today. I'll
> have to do some tests.

Yes, in general there should never be a for "yank", *if* QMEU is implemented
correctly. yank is there in case something unexpected happens.

IOW, even if we think migration is perfect today, yank is still worth
having there as a safety net.

> On that note, how is yank supposed to be accessed? I don't see support
> in libvirt. Is there a way to hook into QMP after the fact somehow?

We've not wired up any API for this libvirt. I can be issued via libvirt's
QMP passthrough API if desired though.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/6] migration: Kick postcopy threads on cancel
  2024-12-05 15:03                 ` Peter Xu
@ 2024-12-05 15:44                   ` Daniel P. Berrangé
  0 siblings, 0 replies; 24+ messages in thread
From: Daniel P. Berrangé @ 2024-12-05 15:44 UTC (permalink / raw)
  To: Peter Xu; +Cc: Fabiano Rosas, qemu-devel

On Thu, Dec 05, 2024 at 10:03:58AM -0500, Peter Xu wrote:
> On Thu, Dec 05, 2024 at 10:18:53AM -0300, Fabiano Rosas wrote:
> > Daniel P. Berrangé <berrange@redhat.com> writes:
> > 
> > > On Wed, Dec 04, 2024 at 03:51:27PM -0500, Peter Xu wrote:
> > >> On Wed, Dec 04, 2024 at 08:02:31PM +0000, Daniel P. Berrangé wrote:
> > >> > I would say the difference is like a graceful shutdown vs pulling the
> > >> > power plug in a bare metal machine
> > >> > 
> > >> > 'cancel' is intended to be graceful. It should leave you with a functional
> > >> > QEMU (or refuse to run if unsafe).
> > >> > 
> > >> > 'yank' is intended to be forceful, letting you get out of bad situations
> > >> > that would otherwise require you to kill the entire QEMU process, but
> > >> > still with possible associated risk data loss to the QEMU backends.
> > >> > 
> > >> > They have overlap, but are none the less different.
> > >> 
> > >> The question is more about whether yank should be used at all for
> > >> migration only, not about the rest instances.
> > >> 
> > >> My answer is yank should never be used for migration, because
> > >> "migrate_cancel" also unplugs the power plug.. It's not anything more
> > >> enforced.  It's only doing less always.
> > >> 
> > >> E.g. migration_yank_iochannel() is exactly what we do with
> > >> qmp_migrate_cancel() in the first place, only that migrate_cancel only does
> > >> it on the main channel (on both qemufiles even if ioc is one), however it
> > >> should be suffice, and behave the same way, as strong as "yank".
> > >
> > > I recall at the time the yank stuff was introduced, one of the scenarios
> > > they were concerned about was related to locks held by QEMU code. eg that
> > > there are scenarios where migrate_cancel may not be processed promptly
> > > enough due to being stalled on mutexes held by other concurrently running
> > > threads. Now I would expect any such long duration stalls on migration
> > > mutexes to be bugs, but the intent of yank is to give a recovery mechanism
> > > that can workaround such bugs.  The yank QMP command only interacts with
> > > its own local mutexes.
> > 
> > Ok, so that could only mean a thread stuck in recv() while holding the
> > BQL. I don't think we have any other locks which would stop
> > migrate_cancel from making progress or other stall situations that could
> > be helped by a shutdown(). Note that most of locks around qemu_file were
> > a late addition. I don't think that scenario is possible today. I'll
> > have to do some tests.
> 
> And if that is a real difference, I'd think whether we should simply make
> migrate_cancel be oob-capable too.. IOW, I still think it'll be good to
> stick with always one API to cancel a migration, no matter which it is.  If
> we want to move over to yank then I think we should move all migrate_cancel
> operations into yank and deprecate "migrate_cancel', but that sounds
> overkill.

Well migrate_cancel ought to be safer than yank. eg migrate_cancel (sh|c)ould
refuse to run if issued during post-copy phase. Or even in precopy, if
in the final vmstate copy & switchover phase we shouldn't need to cancel.
yank meanwhile will always run, no matter what, because by design, it has
no interaction with the migration code beyond knowing that a socket exists.

I don't  think we should combine them. They have alot of common, but there
are subtle differences that are relevant to the scenarios in which thye
are intended to be used.

> There's only one thing that might not be oob-compatible there so far, which
> is bdrv_activate_all().  But I plan to remove it very soon (so that disks
> will be activated in the migration thread instead, just like failure cases).
> 
> > 
> > On that note, how is yank supposed to be accessed? I don't see support
> > in libvirt. Is there a way to hook into QMP after the fact somehow?
> > 
> 
> -- 
> Peter Xu
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2024-12-05 15:44 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-02 22:01 [PATCH 0/6] migration: Fix issues during qmp_migrate_cancel Fabiano Rosas
2024-12-02 22:01 ` [PATCH 1/6] tests/qtest/migration: Introduce migration_test_add_suffix Fabiano Rosas
2024-12-04 18:44   ` Peter Xu
2024-12-02 22:01 ` [PATCH 2/6] migration: Kick postcopy threads on cancel Fabiano Rosas
2024-12-04 18:39   ` Peter Xu
2024-12-04 19:02     ` Fabiano Rosas
2024-12-04 19:39       ` Peter Xu
2024-12-04 20:02         ` Daniel P. Berrangé
2024-12-04 20:40           ` Fabiano Rosas
2024-12-04 20:59             ` Peter Xu
2024-12-04 20:51           ` Peter Xu
2024-12-04 21:01             ` Fabiano Rosas
2024-12-04 21:11               ` Peter Xu
2024-12-05 10:52             ` Daniel P. Berrangé
2024-12-05 13:18               ` Fabiano Rosas
2024-12-05 15:03                 ` Peter Xu
2024-12-05 15:44                   ` Daniel P. Berrangé
2024-12-05 15:40                 ` Daniel P. Berrangé
2024-12-02 22:01 ` [PATCH 3/6] migration: Fix postcopy listen thread exit Fabiano Rosas
2024-12-02 22:01 ` [PATCH 4/6] migration: Make sure postcopy recovery doesn't hang when cancelling Fabiano Rosas
2024-12-02 22:01 ` [PATCH 5/6] migration: Fix hang after error in destination setup phase Fabiano Rosas
2024-12-04 18:44   ` Peter Xu
2024-12-04 19:05     ` Fabiano Rosas
2024-12-02 22:01 ` [PATCH 6/6] tests/qtest/migration: Add a cancel test Fabiano Rosas

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.