From: Peter Xu <peterx@redhat.com>
To: marcandre.lureau@redhat.com
Cc: qemu-devel@nongnu.org, steven.sistare@oracle.com,
Fabiano Rosas <farosas@suse.de>
Subject: Re: [PATCH] migration: fix abort when re-migrating after cancel during SETUP
Date: Mon, 20 Apr 2026 17:46:23 -0400 [thread overview]
Message-ID: <aeaer6F0Q9XLt8Zt@x1.local> (raw)
In-Reply-To: <aeaBKzauFijGRhWr@x1.local>
On Mon, Apr 20, 2026 at 03:40:27PM -0400, Peter Xu wrote:
> On Fri, Apr 17, 2026 at 10:47:42PM +0400, marcandre.lureau@redhat.com wrote:
> > From: Marc-André Lureau <marcandre.lureau@redhat.com>
> >
> > When a migration is cancelled during the early SETUP phase (before
> > migration_connect_outgoing() has set s->to_dst_file), migration_cancel()
> > takes a fast path that transitions the state from CANCELLING to
> > Cancelled without calling migration_cleanup(). This leaves the migration
> > yank instance registered.
> >
> > A subsequent qmp_migrate() call passes the migration_is_running() guard
> > (since CANCELLED is a terminal state) and then calls
> > yank_register_instance(MIGRATION_YANK_INSTANCE, &error_abort), which
> > finds the stale entry and aborts with "duplicate yank instance".
> >
> > #6 0x000056028b5e17fc in error_setg_internal
> > (errp=errp@entry=0x56028cc8cb18 <error_abort>, src=src@entry=0x56028ba87fa5 "../util/yank.c", line=line@entry=87, func=func@entry=0x56028bb77140 <__func__.5> "yank_register_instance", fmt=fmt@entry=0x56028ba87f8d "duplicate yank instance") at ../util/error.c:100
> > #7 0x000056028b601b2a in yank_register_instance (instance=instance@entry=0x7ffea0cf36b0, errp=0x56028cc8cb18 <error_abort>) at ../util/yank.c:87
> > #8 0x000056028b25221e in migrate_prepare (s=0x5602b0a7db90, resume=<optimized out>, errp=0x7ffea0cf3718) at ../migration/migration.c:2001
> > #9 qmp_migrate (uri=<optimized out>, has_channels=<optimized out>, channels=<optimized out>, has_resume=<optimized out>, resume=<optimized out>, errp=errp@entry=0x7ffea0cf3718) at ../migration/migration.c:2039
> > #10 0x000056028b5891be in qmp_marshal_migrate (args=<optimized out>, ret=<optimized out>, errp=0x7f63392e0ee0) at qapi/qapi-commands-migration.c:459
> >
> > Add missing yank_unregister_instance. Alternatively, it seems
> > migration_cleanup() should be safe in this context too.
> >
> > Fixes: 624e6e654e11 ("migration: cpr-transfer mode")
> > Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
> > ---
> > migration/migration.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 5c9aaa6e58f..bbdd91ee7ee 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -1477,6 +1477,7 @@ void migration_cancel(void)
> > MIGRATION_STATUS_CANCELLED);
> > cpr_state_close();
> > cpr_transfer_source_destroy(s);
> > + yank_unregister_instance(MIGRATION_YANK_INSTANCE);
> > }
> > }
>
> Thanks for the report.
>
> I had a feeling that this is not enough.. and I had a feeling that this
> special casing was something fast merged for cpr-transfer work to be able
> to cancel the HUP source when waiting, however did it wrong. IOW, at least
> this chunk:
>
> if (setup && !s->to_dst_file) {
> migrate_set_state(&s->state, MIGRATION_STATUS_CANCELLING,
> MIGRATION_STATUS_CANCELLED);
> cpr_state_close();
> cpr_transfer_source_destroy(s);
> }
>
> Should only apply to cpr-transfer, not normal migrations..
>
> E.g. consider one qmp_migrate() on top of tcp socket:
>
> - qmp_migrate
> - migration_connect_outgoing
> - socket_connect_outgoing
> - ... waiting for worker to invoke socket_outgoing_migration()
> - ... some time later
> - qmp_migrate_cancel
> - above check hits, update CANCELLING->CANCELLED, unregister yank etc.
> - ... some time later
> - socket_outgoing_migration() invoked with an error
> - migration_connect_error_propagate() should see CANCELLED state, which
> is illegal.
So the 1st thing I did here is I tried to reproduce this issue with
firewall dropping packets over some port:
$ sudo iptables -A OUTPUT -p tcp --dport 44444 -j DROP
Then I can reproduce this by migrating to port 44444, and this patch does
fix it, but then I verified I can hit the above illegal CANCELLED state
error:
2026-04-20T21:30:21.669131Z migrate_set_state new state setup
2026-04-20T21:30:27.281824Z migrate_set_state new state cancelling
2026-04-20T21:30:27.281862Z migrate_set_state new state cancelled
...
2026-04-20T21:32:37.638382Z qemu-system-x86_64: migration_connect_error_propagate: Illegal migration status (cancelled) detected
After that if I retry with a correct connection migration will actually
succeed. I believe it's because migration_connect_error_propagate() was
careful on skipping the extra migration_cleanup() when hitting something
weird:
/*
* This really shouldn't happen. Just be careful to not crash a VM
* just for this. Instead, dump something.
*/
error_report("%s: Illegal migration status (%s) detected",
__func__, MigrationStatus_str(current));
return;
I'll try out the other approach and share the patch if it will pass all
tests.
>
> So IMHO for non-cpr cases we should always get ourselves covered with
> migration_connect_error_propagate(), except CPR where it may have the HUP
> source registered via migration_connect_error_propagate().
>
> IIUC, maybe we could change this check to be an explicit one, checking
> whether cpr_transfer_add_hup_watch() has the source attached but not yet
> executed (if executed, migration_connect_outgoing_cb() will also properly
> invoke migration_connect_error_propagate()).
>
> We could check against both "mode==CPR_TRANSFER && s->hup_source", but then
> we need early release of hup_resource too (currently it is deferred until
> migration_cleanup(), which is slightly hackish; after all the source
> returns always with G_SOURCE_REMOVE for migration_connect_outgoing_cb, so
> it was removed long time ago, rather than until the end of migration).
>
> Thanks,
>
> --
> Peter Xu
--
Peter Xu
prev parent reply other threads:[~2026-04-20 21:46 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-17 18:47 [PATCH] migration: fix abort when re-migrating after cancel during SETUP marcandre.lureau
2026-04-20 19:40 ` Peter Xu
2026-04-20 19:50 ` Marc-André Lureau
2026-04-20 20:42 ` Peter Xu
2026-04-20 21:46 ` Peter Xu [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aeaer6F0Q9XLt8Zt@x1.local \
--to=peterx@redhat.com \
--cc=farosas@suse.de \
--cc=marcandre.lureau@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=steven.sistare@oracle.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.