qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: qemu-devel@nongnu.org, Avihai Horon <avihaih@nvidia.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Leonardo Bras Soares Passos <lsoaresp@redhat.com>,
	Juan Quintela <quintela@redhat.com>,
	Laurent Vivier <lvivier@redhat.com>,
	Thomas Huth <thuth@redhat.com>
Subject: Re: [PATCH RFC 1/3] migration: switchover-hold parameter
Date: Fri, 2 Jun 2023 09:20:38 -0400	[thread overview]
Message-ID: <ZHnspj8pVNt3YGGR@x1n> (raw)
In-Reply-To: <ZHnPDHnUUY/SRszl@redhat.com>

On Fri, Jun 02, 2023 at 12:14:20PM +0100, Daniel P. Berrangé wrote:
> On Thu, Jun 01, 2023 at 09:16:24PM -0400, Peter Xu wrote:
> > Add a new migration parameter switchover-hold which can block src qemu
> > migration from switching over to dest from running.
> > 
> > One can set this flag to true so src qemu will keep iterating the VM data,
> > not switching over to dest even if it can.
> > 
> > It means now live migration works somehow like COLO; we keep syncing data
> > from src to dst without stopping.
> > 
> > When the user is ready for the switchover, one can set the parameter from
> > true->false.  That'll contain a implicit kick to migration thread to be
> > alive and re-evaluate the switchover decision.
> > 
> > This can be used in two cases so far in my mind:
> > 
> >   (1) One can use this parameter to start pre-heating migration (but not
> >       really migrating, so a migrate-cancel will cancel the preheat).  When
> >       the user wants to really migrate, just clear the flag.  It'll in most
> >       cases migrate immediately because most pages are already synced.
> > 
> >   (2) Can also be used as a clean way to do qtest, in many of the precopy
> >       tests we have requirement to run after 1 iteration without completing
> >       the precopy migration.  Before that we have either set bandwidth to
> >       ridiculous low value, or tricks on detecting guest memory change over
> >       some adhoc guest memory position.  Now we can simply set this flag
> >       then we know precopy won't complete and will just keep going.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  qapi/migration.json            | 25 +++++++++++++--
> >  migration/migration.h          |  7 +++++
> >  migration/migration-hmp-cmds.c |  3 ++
> >  migration/migration.c          | 56 ++++++++++++++++++++++++++++++++--
> >  migration/options.c            | 17 +++++++++++
> >  5 files changed, 102 insertions(+), 6 deletions(-)
> > 
> > diff --git a/qapi/migration.json b/qapi/migration.json
> > index 179af0c4d8..1d0059d125 100644
> > --- a/qapi/migration.json
> > +++ b/qapi/migration.json
> > @@ -779,6 +779,15 @@
> >  #     Nodes are mapped to their block device name if there is one, and
> >  #     to their node name otherwise.  (Since 5.2)
> >  #
> > +# @switchover-hold: Whether we should hold-off precopy switchover from src
> > +#     to dest QEMU, even if we can finish migration in the downtime
> > +#     specified.  By default off, so precopy migration will complete as
> > +#     soon as possible.  One can set it to explicitly keep iterating during
> > +#     precopy migration until set the flag to false again to kick off the
> > +#     final switchover.  Note, this does not affect postcopy switchover,
> > +#     because the user can control that using "migrate-start-postcopy"
> > +#     command explicitly. (Since 8.1)
> > +#
> >  # Features:
> >  #
> >  # @unstable: Member @x-checkpoint-delay is experimental.
> > @@ -800,7 +809,7 @@
> >             'xbzrle-cache-size', 'max-postcopy-bandwidth',
> >             'max-cpu-throttle', 'multifd-compression',
> >             'multifd-zlib-level' ,'multifd-zstd-level',
> > -           'block-bitmap-mapping' ] }
> > +           'block-bitmap-mapping', 'switchover-hold' ] }
> >  
> >  ##
> >  # @MigrateSetParameters:
> > @@ -935,6 +944,10 @@
> >  #     Nodes are mapped to their block device name if there is one, and
> >  #     to their node name otherwise.  (Since 5.2)
> >  #
> > +# @switchover-hold: Whether we should hold-off precopy switchover from src
> > +#     to dest QEMU.  For more details, please refer to MigrationParameter
> > +#     entry of the same field. (Since 8.1)
> > +#
> >  # Features:
> >  #
> >  # @unstable: Member @x-checkpoint-delay is experimental.
> > @@ -972,7 +985,8 @@
> >              '*multifd-compression': 'MultiFDCompression',
> >              '*multifd-zlib-level': 'uint8',
> >              '*multifd-zstd-level': 'uint8',
> > -            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
> > +            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
> > +            '*switchover-hold': 'bool' } }
> >  
> >  ##
> >  # @migrate-set-parameters:
> > @@ -1127,6 +1141,10 @@
> >  #     Nodes are mapped to their block device name if there is one, and
> >  #     to their node name otherwise.  (Since 5.2)
> >  #
> > +# @switchover-hold: Whether we should hold-off precopy switchover from src
> > +#     to dest QEMU.  For more details, please refer to MigrationParameter
> > +#     entry of the same field. (Since 8.1)
> > +#
> >  # Features:
> >  #
> >  # @unstable: Member @x-checkpoint-delay is experimental.
> > @@ -1161,7 +1179,8 @@
> >              '*multifd-compression': 'MultiFDCompression',
> >              '*multifd-zlib-level': 'uint8',
> >              '*multifd-zstd-level': 'uint8',
> > -            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
> > +            '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
> > +            '*switchover-hold': 'bool' } }
> >  
> >  ##
> >  # @query-migrate-parameters:
> > diff --git a/migration/migration.h b/migration/migration.h
> > index 48a46123a0..086ceec754 100644
> > --- a/migration/migration.h
> > +++ b/migration/migration.h
> > @@ -437,6 +437,13 @@ struct MigrationState {
> >  
> >      /* QEMU_VM_VMDESCRIPTION content filled for all non-iterable devices. */
> >      JSONWriter *vmdesc;
> > +    /*
> > +     * Only migration thread will wait on it when switchover_hold==true.
> > +     *
> > +     * Only qmp set param will kick it when switching switchover_hold from
> > +     * true->false.
> > +     */
> > +    QemuSemaphore switchover_hold_sem;
> >  };
> >  
> >  void migrate_set_state(int *state, int old_state, int new_state);
> > diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
> > index 9885d7c9f7..63a2c8a4a3 100644
> > --- a/migration/migration-hmp-cmds.c
> > +++ b/migration/migration-hmp-cmds.c
> > @@ -338,6 +338,9 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
> >          monitor_printf(mon, "%s: '%s'\n",
> >              MigrationParameter_str(MIGRATION_PARAMETER_TLS_AUTHZ),
> >              params->tls_authz);
> > +        monitor_printf(mon, "%s: %s\n",
> > +            MigrationParameter_str(MIGRATION_PARAMETER_SWITCHOVER_HOLD),
> > +            params->switchover_hold ? "on" : "off");
> >  
> >          if (params->has_block_bitmap_mapping) {
> >              const BitmapMigrationNodeAliasList *bmnal;
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 5de7f734b9..30f165527b 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -2700,6 +2700,55 @@ static void migration_update_counters(MigrationState *s,
> >                                bandwidth, s->threshold_size);
> >  }
> >  
> > +static bool
> > +migration_should_complete(MigrationState *s, uint64_t pending_size)
> > +{
> > +    /* We still have large pending data to send? */
> > +    if (pending_size && (pending_size >= s->threshold_size)) {
> > +        return false;
> > +    }
> > +
> > +    /* The user doesn't want us to switchover yet */
> > +    if (s->parameters.switchover_hold) {
> > +        /*
> > +         * Note: when reaching here it probably means we've migrated almost
> > +         * everything and ready to switchover.  If user asked not to switch
> > +         * wait for a short period and respond to kicks immediately.
> > +         *
> > +         * If we wait too long, there can be a lot of dirty data generated,
> > +         * while we could have done something to sync data between src/dst.
> > +         *
> > +         * If we wait too short, migration thread can eat most/all cpu
> > +         * resource looping over switchover_hold.
> > +         *
> > +         * Make it 10ms which seems to be a good intermediate value.
> > +         */
> > +        qemu_sem_timedwait(&s->switchover_hold_sem, 10);
> > +
> > +        /*
> > +         * Return false here always even if user changed it, because we'd
> > +         * like to re-evaluate everything (e.g. pending_size).
> > +         */
> > +        return false;
> > +    }
> > +
> > +    return true;
> > +}
> > +
> > +static bool
> > +migration_should_start_postcopy(MigrationState *s, uint64_t must_precopy)
> > +{
> > +    /* If we're already in postcopy phase, don't bother */
> > +    if (migration_in_postcopy()) {
> > +        return false;
> > +    }
> > +    /* We still have lots of thing that must be migrated in precopy */
> > +    if (must_precopy > s->threshold_size) {
> > +        return false;
> > +    }
> > +    return qatomic_read(&s->start_postcopy);
> > +}
> > +
> >  /* Migration thread iteration status */
> >  typedef enum {
> >      MIG_ITERATE_RESUME,         /* Resume current iteration */
> > @@ -2727,15 +2776,14 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> >          trace_migrate_pending_exact(pending_size, must_precopy, can_postcopy);
> >      }
> >  
> > -    if (!pending_size || pending_size < s->threshold_size) {
> > +    if (migration_should_complete(s, pending_size)) {
> >          trace_migration_thread_low_pending(pending_size);
> >          migration_completion(s);
> >          return MIG_ITERATE_BREAK;
> >      }
> >  
> >      /* Still a significant amount to transfer */
> > -    if (!in_postcopy && must_precopy <= s->threshold_size &&
> > -        qatomic_read(&s->start_postcopy)) {
> > +    if (migration_should_start_postcopy(s, must_precopy)) {
> >          if (postcopy_start(s)) {
> >              error_report("%s: postcopy failed to start", __func__);
> >          }
> > @@ -3287,6 +3335,7 @@ static void migration_instance_finalize(Object *obj)
> >      qemu_sem_destroy(&ms->rp_state.rp_sem);
> >      qemu_sem_destroy(&ms->rp_state.rp_pong_acks);
> >      qemu_sem_destroy(&ms->postcopy_qemufile_src_sem);
> > +    qemu_sem_destroy(&ms->switchover_hold_sem);
> >      error_free(ms->error);
> >  }
> >  
> > @@ -3309,6 +3358,7 @@ static void migration_instance_init(Object *obj)
> >      qemu_sem_init(&ms->rate_limit_sem, 0);
> >      qemu_sem_init(&ms->wait_unplug_sem, 0);
> >      qemu_sem_init(&ms->postcopy_qemufile_src_sem, 0);
> > +    qemu_sem_init(&ms->switchover_hold_sem, 0);
> >      qemu_mutex_init(&ms->qemu_file_lock);
> >  }
> >  
> > diff --git a/migration/options.c b/migration/options.c
> > index b62ab30cd5..2d6b138352 100644
> > --- a/migration/options.c
> > +++ b/migration/options.c
> > @@ -163,6 +163,8 @@ Property migration_properties[] = {
> >      DEFINE_PROP_STRING("tls-creds", MigrationState, parameters.tls_creds),
> >      DEFINE_PROP_STRING("tls-hostname", MigrationState, parameters.tls_hostname),
> >      DEFINE_PROP_STRING("tls-authz", MigrationState, parameters.tls_authz),
> > +    DEFINE_PROP_BOOL("switchover-hold", MigrationState,
> > +                     parameters.switchover_hold, false),
> >  
> >      /* Migration capabilities */
> >      DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
> > @@ -883,6 +885,8 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
> >      params->announce_rounds = s->parameters.announce_rounds;
> >      params->has_announce_step = true;
> >      params->announce_step = s->parameters.announce_step;
> > +    params->has_switchover_hold = true;
> > +    params->switchover_hold = s->parameters.switchover_hold;
> >  
> >      if (s->parameters.has_block_bitmap_mapping) {
> >          params->has_block_bitmap_mapping = true;
> > @@ -923,6 +927,7 @@ void migrate_params_init(MigrationParameters *params)
> >      params->has_announce_max = true;
> >      params->has_announce_rounds = true;
> >      params->has_announce_step = true;
> > +    params->has_switchover_hold = true;
> >  }
> >  
> >  /*
> > @@ -1177,6 +1182,9 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
> >      if (params->has_announce_step) {
> >          dest->announce_step = params->announce_step;
> >      }
> > +    if (params->has_switchover_hold) {
> > +        dest->switchover_hold = params->switchover_hold;
> > +    }
> >  
> >      if (params->has_block_bitmap_mapping) {
> >          dest->has_block_bitmap_mapping = true;
> > @@ -1290,6 +1298,15 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
> >      if (params->has_announce_step) {
> >          s->parameters.announce_step = params->announce_step;
> >      }
> > +    if (params->has_switchover_hold) {
> > +        bool old = s->parameters.switchover_hold;
> > +        bool new = params->switchover_hold;
> > +
> > +        s->parameters.switchover_hold = params->switchover_hold;
> > +        if (old && !new) {
> > +            qemu_sem_post(&s->switchover_hold_sem);
> > +        }
> > +    }
> 
> IIUC, MigrationState is global to QEMU and gets reused on every
> migration.
> 
> If we qemu_sem_post() here, and the migration fails, and the
> user starts a new migration on the same QEMU, the switchover_hold_sem
> needs resetting somewhere, as there's a (tiny) risk that the
> qemu_sem_timedwait() won't have run before the migration fail.

Yes I can do that, but may not be necessary - I think it's fine to have
this sem accumulate outdated events.

The sem may go not "accurate" easily, e.g. one can set this flag
true->false->true->false... so it keeps accumulate.  I considered checking
that too, but decided to didn't bother because each outdated post will only
cause migration_thread (only when reaching end of migration and hitting the
sem_timedwait()) to spin another round - the migration thread will "eat"
that outdated sem count and spin one more time, reaching timedwait() again,
then do the real sleep.

How about I add some comment on explaining outdated counts are safe,
instead?  Then we don't need to worry on anything happen to this sem, since
the only goal it services is to kick the migration thread faster, meanwhile
don't let migration thread eat 100% cpu looping there.

Thanks,

-- 
Peter Xu



  reply	other threads:[~2023-06-02 13:21 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-06-02  1:16 [PATCH RFC 0/3] migration: switchover-hold flag Peter Xu
2023-06-02  1:16 ` [PATCH RFC 1/3] migration: switchover-hold parameter Peter Xu
2023-06-02 11:14   ` Daniel P. Berrangé
2023-06-02 13:20     ` Peter Xu [this message]
2023-06-02  1:16 ` [PATCH RFC 2/3] Revert "tests/qtest: massively speed up migration-test" Peter Xu
2023-06-02  1:16 ` [PATCH RFC 3/3] qtest/migration: Use switchover-hold to speedup Peter Xu
2023-06-02 11:03   ` Daniel P. Berrangé
2023-06-02 13:23     ` Peter Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZHnspj8pVNt3YGGR@x1n \
    --to=peterx@redhat.com \
    --cc=avihaih@nvidia.com \
    --cc=berrange@redhat.com \
    --cc=lsoaresp@redhat.com \
    --cc=lvivier@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=quintela@redhat.com \
    --cc=thuth@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).