All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: marcandre.lureau@redhat.com
Cc: qemu-devel@nongnu.org, steven.sistare@oracle.com,
	Fabiano Rosas <farosas@suse.de>
Subject: Re: [PATCH] migration: fix abort when re-migrating after cancel during SETUP
Date: Mon, 20 Apr 2026 15:40:27 -0400	[thread overview]
Message-ID: <aeaBKzauFijGRhWr@x1.local> (raw)
In-Reply-To: <20260417184742.293061-1-marcandre.lureau@redhat.com>

On Fri, Apr 17, 2026 at 10:47:42PM +0400, marcandre.lureau@redhat.com wrote:
> From: Marc-André Lureau <marcandre.lureau@redhat.com>
> 
> When a migration is cancelled during the early SETUP phase (before
> migration_connect_outgoing() has set s->to_dst_file), migration_cancel()
> takes a fast path that transitions the state from CANCELLING to
> Cancelled without calling migration_cleanup(). This leaves the migration
> yank instance registered.
> 
> A subsequent qmp_migrate() call passes the migration_is_running() guard
> (since CANCELLED is a terminal state) and then calls
> yank_register_instance(MIGRATION_YANK_INSTANCE, &error_abort), which
> finds the stale entry and aborts with "duplicate yank instance".
> 
>   #6  0x000056028b5e17fc in error_setg_internal
>       (errp=errp@entry=0x56028cc8cb18 <error_abort>, src=src@entry=0x56028ba87fa5 "../util/yank.c", line=line@entry=87, func=func@entry=0x56028bb77140 <__func__.5> "yank_register_instance", fmt=fmt@entry=0x56028ba87f8d "duplicate yank instance") at ../util/error.c:100
>   #7  0x000056028b601b2a in yank_register_instance (instance=instance@entry=0x7ffea0cf36b0, errp=0x56028cc8cb18 <error_abort>) at ../util/yank.c:87
>   #8  0x000056028b25221e in migrate_prepare (s=0x5602b0a7db90, resume=<optimized out>, errp=0x7ffea0cf3718) at ../migration/migration.c:2001
>   #9  qmp_migrate (uri=<optimized out>, has_channels=<optimized out>, channels=<optimized out>, has_resume=<optimized out>, resume=<optimized out>, errp=errp@entry=0x7ffea0cf3718) at ../migration/migration.c:2039
>   #10 0x000056028b5891be in qmp_marshal_migrate (args=<optimized out>, ret=<optimized out>, errp=0x7f63392e0ee0) at qapi/qapi-commands-migration.c:459
> 
> Add missing yank_unregister_instance. Alternatively, it seems
> migration_cleanup() should be safe in this context too.
> 
> Fixes: 624e6e654e11 ("migration: cpr-transfer mode")
> Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
> ---
>  migration/migration.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 5c9aaa6e58f..bbdd91ee7ee 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1477,6 +1477,7 @@ void migration_cancel(void)
>                            MIGRATION_STATUS_CANCELLED);
>          cpr_state_close();
>          cpr_transfer_source_destroy(s);
> +        yank_unregister_instance(MIGRATION_YANK_INSTANCE);
>      }
>  }

Thanks for the report.

I had a feeling that this is not enough.. and I had a feeling that this
special casing was something fast merged for cpr-transfer work to be able
to cancel the HUP source when waiting, however did it wrong.  IOW, at least
this chunk:

    if (setup && !s->to_dst_file) {
        migrate_set_state(&s->state, MIGRATION_STATUS_CANCELLING,
                          MIGRATION_STATUS_CANCELLED);
        cpr_state_close();
        cpr_transfer_source_destroy(s);
    }

Should only apply to cpr-transfer, not normal migrations..

E.g. consider one qmp_migrate() on top of tcp socket:

- qmp_migrate
  - migration_connect_outgoing
    - socket_connect_outgoing
      - ... waiting for worker to invoke socket_outgoing_migration()
- ... some time later
- qmp_migrate_cancel
  - above check hits, update CANCELLING->CANCELLED, unregister yank etc.
- ... some time later
- socket_outgoing_migration() invoked with an error
  - migration_connect_error_propagate() should see CANCELLED state, which
    is illegal.

So IMHO for non-cpr cases we should always get ourselves covered with
migration_connect_error_propagate(), except CPR where it may have the HUP
source registered via migration_connect_error_propagate().

IIUC, maybe we could change this check to be an explicit one, checking
whether cpr_transfer_add_hup_watch() has the source attached but not yet
executed (if executed, migration_connect_outgoing_cb() will also properly
invoke migration_connect_error_propagate()).

We could check against both "mode==CPR_TRANSFER && s->hup_source", but then
we need early release of hup_resource too (currently it is deferred until
migration_cleanup(), which is slightly hackish; after all the source
returns always with G_SOURCE_REMOVE for migration_connect_outgoing_cb, so
it was removed long time ago, rather than until the end of migration).

Thanks,

-- 
Peter Xu



  reply	other threads:[~2026-04-20 19:40 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-17 18:47 [PATCH] migration: fix abort when re-migrating after cancel during SETUP marcandre.lureau
2026-04-20 19:40 ` Peter Xu [this message]
2026-04-20 19:50   ` Marc-André Lureau
2026-04-20 20:42     ` Peter Xu
2026-04-20 21:46   ` Peter Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aeaBKzauFijGRhWr@x1.local \
    --to=peterx@redhat.com \
    --cc=farosas@suse.de \
    --cc=marcandre.lureau@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=steven.sistare@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.