All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: qemu-devel@nongnu.org
Cc: "Fabiano Rosas" <farosas@suse.de>, "Peter Xu" <peterx@redhat.com>,
	"Marc-André Lureau" <marcandre.lureau@redhat.com>
Subject: [PULL 01/29] migration: Fix crash on second migration when cancel early
Date: Wed, 20 May 2026 17:33:29 -0400	[thread overview]
Message-ID: <20260520213357.40646-2-peterx@redhat.com> (raw)
In-Reply-To: <20260520213357.40646-1-peterx@redhat.com>

Marc-André reported an issue on QEMU crash when retrying a cancelled
migration during early setup phase, see "Link:" for more information, and
also easy way to reproduce.

This patch is a replacement of the prior fix proposed by not only switching
to migration_cleanup(), but also fixing it from CPR side, so that we track
hup_source properly to know if src QEMU is waiting or the HUP signal.

To put it simple: this chunk of special casing in migration_cancel() should
not affect normal migration, but only cpr-transfer migration to cover the
small window when the src QEMU is waiting for a HUP signal on cpr
channel (so that src QEMU can continue the migration on the main channel).

To achieve that, we'll also need to remember to detach the hup_source
whenenver invoked: after that point, we should always be able to cleanup
the migration.

It's not a generic operation to explicitly detach a gsource from its
context while in its dispatch() function.  But it should be safe, because
gsource disptch() will only happen with a boosted refcount for the
dispatcher so that the gsource will not be freed until the callback
completes. It's also safe to return G_SOURCE_REMOVE after the gsource is
detached, as glib will simply ignore the G_SOURCE_REMOVE.

One can refer to latest 2.86.5 glib code in g_main_dispatch() for that:

https://github.com/GNOME/glib/blob/2.86.5/glib/gmain.c#L3592

When at this, add a bunch of assertions to make sure nothing surprises us.

After this patch applied, the 2nd migration will not crash QEMU, instead
it'll be in CANCELLING until the socket connection times out (it will take
~2min on my Fedora default kernel).  During this process no 2nd migration
will be allowed, and after it timed out migration can be restarted.

It's because so far we don't have control over socket_connect_outgoing(),
or anything yet managed by a task executed in qio_task_run_in_thread().
Speeding up the cancellation to be left for future.

I also tested cpr-transfer by only providing cpr channel not the main
channel (with -incoming defer), kickoff migration on source, then cancel it
on source directly without providing the main channel.  It keeps working.

I wanted to add an unit test for that but it'll need to refactor current
cpr-transfer tests first; let's leave it for later.

Link: https://lore.kernel.org/r/20260417184742.293061-1-marcandre.lureau@redhat.com
Reported-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Tested-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20260421175820.302795-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/migration/cpr.h  |  1 +
 migration/migration.h    |  5 +++++
 migration/cpr-transfer.c | 10 ++++++++++
 migration/migration.c    | 31 +++++++++++++++++++++++--------
 4 files changed, 39 insertions(+), 8 deletions(-)

diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index 96ce26e711..56fb67e6b4 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -57,6 +57,7 @@ QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
 void cpr_transfer_add_hup_watch(MigrationState *s, QIOChannelFunc func,
                                 void *opaque);
 void cpr_transfer_source_destroy(MigrationState *s);
+bool cpr_transfer_source_active(MigrationState *s);
 
 void cpr_exec_init(void);
 QEMUFile *cpr_exec_output(Error **errp);
diff --git a/migration/migration.h b/migration/migration.h
index a5e064a1ac..841f49b215 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -512,6 +512,11 @@ struct MigrationState {
 
     bool postcopy_package_loaded;
 
+    /*
+     * When set, it means cpr-transfer is waiting for the HUP signal from
+     * destination to continue the 2nd step of migration via the main
+     * channel.
+     */
     GSource *hup_source;
 
     /*
diff --git a/migration/cpr-transfer.c b/migration/cpr-transfer.c
index 61d5c9dce2..9defe7bad7 100644
--- a/migration/cpr-transfer.c
+++ b/migration/cpr-transfer.c
@@ -6,6 +6,7 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/main-loop.h"
 #include "qapi/clone-visitor.h"
 #include "qapi/error.h"
 #include "qapi/qapi-visit-migration.h"
@@ -79,6 +80,7 @@ QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp)
 void cpr_transfer_add_hup_watch(MigrationState *s, QIOChannelFunc func,
                                 void *opaque)
 {
+    assert(bql_locked());
     s->hup_source = qio_channel_create_watch(cpr_state_ioc(), G_IO_HUP);
     g_source_set_callback(s->hup_source,
                           (GSourceFunc)func,
@@ -89,9 +91,17 @@ void cpr_transfer_add_hup_watch(MigrationState *s, QIOChannelFunc func,
 
 void cpr_transfer_source_destroy(MigrationState *s)
 {
+    assert(bql_locked());
     if (s->hup_source) {
         g_source_destroy(s->hup_source);
         g_source_unref(s->hup_source);
         s->hup_source = NULL;
     }
 }
+
+bool cpr_transfer_source_active(MigrationState *s)
+{
+    /* Whenever the HUP gsource is available, it's active. */
+    assert(bql_locked());
+    return s->hup_source;
+}
diff --git a/migration/migration.c b/migration/migration.c
index ecc69dc4d2..b6f78eb3ac 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1502,14 +1502,19 @@ void migration_cancel(void)
     }
 
     /*
-     * If migration_connect_outgoing has not been called, then there
-     * is no path that will complete the cancellation. Do it now.
-     */
-    if (setup && !s->to_dst_file) {
-        migrate_set_state(&s->state, MIGRATION_STATUS_CANCELLING,
-                          MIGRATION_STATUS_CANCELLED);
-        cpr_state_close();
-        cpr_transfer_source_destroy(s);
+     * This is cpr-transfer specific processing.
+     *
+     * If this is true, it means cpr-transfer migration is waiting for the
+     * destination to send HUP event on CPR channel to continue the next
+     * phase.  If so, do the cleanup proactively to avoid get stuck in
+     * CANCELLING state.
+     */
+    if (cpr_transfer_source_active(s)) {
+        assert(migrate_mode() == MIG_MODE_CPR_TRANSFER);
+        assert(setup && !s->to_dst_file);
+        migration_cleanup(s);
+        /* Now all things should have been released */
+        assert(!cpr_transfer_source_active(s));
     }
 }
 
@@ -2045,12 +2050,22 @@ static gboolean migration_connect_outgoing_cb(QIOChannel *channel,
     MigrationState *s = migrate_get_current();
     Error *local_err = NULL;
 
+    /*
+     * Detach and release the GSource right after use.  We rely on this to
+     * detect this small cpr-transfer window of "waiting for HUP event".
+     */
+    cpr_transfer_source_destroy(s);
+
     migration_connect_outgoing(s, opaque, &local_err);
 
     if (local_err) {
         migration_connect_error_propagate(s, local_err);
     }
 
+    /*
+     * This is redundant as we do cpr_transfer_source_destroy() at the
+     * entry, but it's benign; glib will just skip the detach.
+     */
     return G_SOURCE_REMOVE;
 }
 
-- 
2.53.0



  reply	other threads:[~2026-05-20 21:35 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-20 21:33 [PULL 00/29] Next patches Peter Xu
2026-05-20 21:33 ` Peter Xu [this message]
2026-05-20 21:33 ` [PULL 02/29] migration: Remove VMS_MULTIPLY_ELEMENTS and VMSTATE_VARRAY_MULTIPLY() Peter Xu
2026-05-20 21:33 ` [PULL 03/29] migration: Fix possible division by zero on calc expected downtime Peter Xu
2026-05-20 21:33 ` [PULL 04/29] tests/qtest/migration: Fix auto-converge test Peter Xu
2026-05-20 21:33 ` [PULL 05/29] migration: Replace current_migration with migrate_get_current() Peter Xu
2026-05-20 21:33 ` [PULL 06/29] MAINTAINERS: Make Maciej CPR maintainer Peter Xu
2026-05-20 21:33 ` [PULL 07/29] tests/qtest/migration: Move cpr transfer logic into cpr-tests.c Peter Xu
2026-05-20 21:33 ` [PULL 08/29] tests/qtest/migration: Make file-tests defer by default Peter Xu
2026-05-20 21:33 ` [PULL 09/29] tests/qtest/migration: Set file URI " Peter Xu
2026-05-20 21:33 ` [PULL 10/29] tests/qtest/migration: Group unix migration tests Peter Xu
2026-05-20 21:33 ` [PULL 11/29] tests/qtest/migration: Use precopy_unix_common for ignore-shared test Peter Xu
2026-05-20 21:33 ` [PULL 12/29] tests/qtest/migration: Use a default TCP URI for precopy Peter Xu
2026-05-20 21:33 ` [PULL 13/29] tests/qtest/migration: Defer by default in precopy_common Peter Xu
2026-05-20 21:33 ` [PULL 14/29] tests/qtest/migration: Set compression method in compression-tests Peter Xu
2026-05-20 21:33 ` [PULL 15/29] tests/qtest/migration: Remove multifd compression hook Peter Xu
2026-05-20 21:33 ` [PULL 16/29] tests/qtest/migration: Use defer for all tests Peter Xu
2026-05-20 21:33 ` [PULL 17/29] tests/qtest/migration: Use defer for cpr-tests Peter Xu
2026-05-20 21:33 ` [PULL 18/29] tests/qtest/migration: Use defer for auto-converge Peter Xu
2026-05-20 21:33 ` [PULL 19/29] tests/qtest/migration: Use defer in dirty_limit test Peter Xu
2026-05-20 21:33 ` [PULL 20/29] tests/qtest/migration: Stop passing URI into migrate_start Peter Xu
2026-05-20 21:33 ` [PULL 21/29] tests/qtest/migration: Unify URIs Peter Xu
2026-05-20 21:33 ` [PULL 22/29] migration/global_state: replace strcpy("") with explicit NUL termination Peter Xu
2026-05-20 21:33 ` [PULL 23/29] migration/vmstate: avoid per-element heap churn in vmsd ptr marker field Peter Xu
2026-05-20 21:33 ` [PULL 24/29] migration/savevm: use stack-allocated bitmap in configuration_validate_capabilities Peter Xu
2026-05-20 21:33 ` [PULL 25/29] migration/multifd: fix off-by-one in recv channel ID validation Peter Xu
2026-05-20 21:33 ` [PULL 26/29] migration/multifd: cache migrate_multifd_channels() in send/recv hot paths Peter Xu
2026-05-20 21:33 ` [PULL 27/29] migration/multifd: cache channel count in multifd_send_sync_main Peter Xu
2026-05-20 21:33 ` [PULL 28/29] migration/cpr: use hashtable for cpr fds Peter Xu
2026-05-20 21:33 ` [PULL 29/29] MAINTAINERS: Update email of Yong Huang Peter Xu
2026-05-21 19:20 ` [PULL 00/29] Next patches Stefan Hajnoczi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260520213357.40646-2-peterx@redhat.com \
    --to=peterx@redhat.com \
    --cc=farosas@suse.de \
    --cc=marcandre.lureau@redhat.com \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.