From: Peter Xu <peterx@redhat.com>
To: Fabiano Rosas <farosas@suse.de>
Cc: Shivam Kumar <shivam.kumar1@nutanix.com>, qemu-devel@nongnu.org
Subject: Re: [RFC PATCH] Fix race in live migration failure path
Date: Mon, 13 Jan 2025 11:29:46 -0500 [thread overview]
Message-ID: <Z4U_emPVDfTb1VmF@x1n> (raw)
In-Reply-To: <87frlqerxp.fsf@suse.de>
On Fri, Jan 10, 2025 at 10:09:38AM -0300, Fabiano Rosas wrote:
> Shivam Kumar <shivam.kumar1@nutanix.com> writes:
>
> > Even if a live migration fails due to some reason, migration status
> > should not be set to MIGRATION_STATUS_FAILED until migrate fd cleanup
> > is done, else the client can trigger another instance of migration
> > before the cleanup is complete (as it would assume no migration is
> > active) or reset migration capabilities affecting old migration's
> > cleanup. Hence, set the status to 'failing' when a migration failure
> > happens and once the cleanup is complete, set the migration status to
> > MIGRATION_STATUS_FAILED.
> >
> > Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
> > ---
> > migration/migration.c | 49 +++++++++++++++++++++----------------------
> > migration/migration.h | 9 ++++++++
> > migration/multifd.c | 6 ++----
> > migration/savevm.c | 7 +++----
> > 4 files changed, 38 insertions(+), 33 deletions(-)
> >
> > diff --git a/migration/migration.c b/migration/migration.c
> > index df61ca4e93..f084f54f6b 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -1143,8 +1143,9 @@ static bool migration_is_active(void)
>
> migration_is_running() is the one that gates qmp_migrate() and
> qmp_migrate_set_capabilities().
>
> > {
> > MigrationState *s = current_migration;
> >
> > - return (s->state == MIGRATION_STATUS_ACTIVE ||
> > - s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
> > + return ((s->state == MIGRATION_STATUS_ACTIVE ||
> > + s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) &&
> > + !qatomic_read(&s->failing));
> > }
> >
> > static bool migrate_show_downtime(MigrationState *s)
> > @@ -1439,6 +1440,11 @@ static void migrate_fd_cleanup(MigrationState *s)
> > MIGRATION_STATUS_CANCELLED);
> > }
> >
> > + if (qatomic_xchg(&s->failing, 0)) {
> > + migrate_set_state(&s->state, s->state,
> > + MIGRATION_STATUS_FAILED);
> > + }
>
> I hope you've verified that sure every place that used to set FAILED
> will also reach migrate_fd_cleanup() eventually.
>
> Also, we probably still need the FAILING state. Otherwise, this will
> trip code that expects a state change on failure. Anything that does:
>
> if (state != MIGRATION_STATUS_FOO) {
> ...
> }
>
> So I think the change overall should be
>
> -migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
> +migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILING);
>
> void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
> MigrationStatus new_state)
> {
> assert(new_state < MIGRATION_STATUS__MAX);
> if (qatomic_cmpxchg(state, old_state, new_state) == old_state) {
> trace_migrate_set_state(MigrationStatus_str(new_state));
>
> + if (new_state == MIGRATION_STATUS_FAILING) {
> + qatomic_set(&s->failing, 1);
> + }
> migrate_generate_event(new_state);
> }
> }
>
> And we should proably do the same for CANCELLING actually, but there the
> (preexisting) issue is actual concurrency, while here it's just
> inconsistency in the state.
Yes something like FAILING sounds reasonable. Though since we have
s->error, I wonder whether that's a better place to represent a migration
as "failing" in one place, because otherwise we need to set two places
(both FAILING state, and the s->error) - whenever something fails, we'd
better always update s->error so as to remember what failed, then reported
via query-migrate.
From that POV, s->failing is probably never gonna be needed (due to
s->error being present anyway)? So far, such Error* looks like the best
single point to say that the migration is failing - it also enforces the
Error to be provided whoever wants to set it to failing state.
--
Peter Xu
next prev parent reply other threads:[~2025-01-13 16:30 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-01-10 10:07 [RFC PATCH] Fix race in live migration failure path Shivam Kumar
2025-01-10 13:09 ` Fabiano Rosas
2025-01-13 16:29 ` Peter Xu [this message]
2025-01-22 10:54 ` Shivam Kumar
2025-01-22 16:40 ` Peter Xu
2025-01-23 9:53 ` Shivam Kumar
2025-01-23 16:27 ` Peter Xu
2025-01-23 18:05 ` Shivam Kumar
2025-01-23 20:40 ` Peter Xu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z4U_emPVDfTb1VmF@x1n \
--to=peterx@redhat.com \
--cc=farosas@suse.de \
--cc=qemu-devel@nongnu.org \
--cc=shivam.kumar1@nutanix.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.