From: Avihai Horon <avihaih@nvidia.com>
To: Peter Xu <peterx@redhat.com>, qemu-devel@nongnu.org
Cc: "Daniel P . Berrangé" <berrange@redhat.com>,
"Laurent Vivier" <lvivier@redhat.com>,
"Thomas Huth" <thuth@redhat.com>,
"Paolo Bonzini" <pbonzini@redhat.com>,
"Juan Quintela" <quintela@redhat.com>,
"Leonardo Bras Soares Passos" <lsoaresp@redhat.com>
Subject: Re: [PATCH 1/2] migration: switchover-hold parameter
Date: Mon, 5 Jun 2023 15:27:53 +0300 [thread overview]
Message-ID: <0c459f0c-1a04-f460-2b51-e8c3f997370d@nvidia.com> (raw)
In-Reply-To: <20230602144715.249002-2-peterx@redhat.com>
Hi Peter,
On 02/06/2023 17:47, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> Add a new migration parameter switchover-hold which can block src qemu
> migration from switching over to dest from running.
>
> One can set this flag to true so src qemu will keep iterating the VM data,
> not switching over to dest even if it can.
>
> It means now live migration works somehow like COLO; we keep syncing data
> from src to dst without stopping.
>
> When the user is ready for the switchover, one can set the parameter from
> true->false. That'll contain a implicit kick to migration thread to be
> alive and re-evaluate the switchover decision.
>
> This can be used in two cases so far in my mind:
>
> (1) One can use this parameter to start pre-heating migration (but not
> really migrating, so a migrate-cancel will cancel the preheat). When
> the user wants to really migrate, just clear the flag. It'll in most
> cases migrate immediately because most pages are already synced.
>
> (2) Can also be used as a clean way to do qtest, in many of the precopy
> tests we have requirement to run after 1 iteration without completing
> the precopy migration. Before that we have either set bandwidth to
> ridiculous low value, or tricks on detecting guest memory change over
> some adhoc guest memory position. Now we can simply set this flag
> then we know precopy won't complete and will just keep going.
>
> Here we leveraged a sem to make sure migration thread won't busy spin on a
> physical cpu, meanwhile provide a timedwait() of 10ms so it can still try
> its best to sync with dest QEMU from time to time. Note that the sem is
> prone to outdated counts but it's benign, please refer to the comment above
> the semaphore definition for more information.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> qapi/migration.json | 25 +++++++++++++--
> migration/migration.h | 16 ++++++++++
> migration/migration-hmp-cmds.c | 3 ++
> migration/migration.c | 56 ++++++++++++++++++++++++++++++++--
> migration/options.c | 17 +++++++++++
> 5 files changed, 111 insertions(+), 6 deletions(-)
>
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 179af0c4d8..1d0059d125 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -779,6 +779,15 @@
> # Nodes are mapped to their block device name if there is one, and
> # to their node name otherwise. (Since 5.2)
> #
> +# @switchover-hold: Whether we should hold-off precopy switchover from src
> +# to dest QEMU, even if we can finish migration in the downtime
> +# specified. By default off, so precopy migration will complete as
> +# soon as possible. One can set it to explicitly keep iterating during
> +# precopy migration until set the flag to false again to kick off the
> +# final switchover. Note, this does not affect postcopy switchover,
> +# because the user can control that using "migrate-start-postcopy"
> +# command explicitly. (Since 8.1)
> +#
I think this should wrap at col 70, and need double space before (Since
8.1).
Also in other places below.
> # Features:
> #
> # @unstable: Member @x-checkpoint-delay is experimental.
> @@ -800,7 +809,7 @@
> 'xbzrle-cache-size', 'max-postcopy-bandwidth',
> 'max-cpu-throttle', 'multifd-compression',
> 'multifd-zlib-level' ,'multifd-zstd-level',
> - 'block-bitmap-mapping' ] }
> + 'block-bitmap-mapping', 'switchover-hold' ] }
>
> ##
> # @MigrateSetParameters:
> @@ -935,6 +944,10 @@
> # Nodes are mapped to their block device name if there is one, and
> # to their node name otherwise. (Since 5.2)
> #
> +# @switchover-hold: Whether we should hold-off precopy switchover from src
> +# to dest QEMU. For more details, please refer to MigrationParameter
> +# entry of the same field. (Since 8.1)
> +#
> # Features:
> #
> # @unstable: Member @x-checkpoint-delay is experimental.
> @@ -972,7 +985,8 @@
> '*multifd-compression': 'MultiFDCompression',
> '*multifd-zlib-level': 'uint8',
> '*multifd-zstd-level': 'uint8',
> - '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
> + '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
> + '*switchover-hold': 'bool' } }
>
> ##
> # @migrate-set-parameters:
> @@ -1127,6 +1141,10 @@
> # Nodes are mapped to their block device name if there is one, and
> # to their node name otherwise. (Since 5.2)
> #
> +# @switchover-hold: Whether we should hold-off precopy switchover from src
> +# to dest QEMU. For more details, please refer to MigrationParameter
> +# entry of the same field. (Since 8.1)
> +#
> # Features:
> #
> # @unstable: Member @x-checkpoint-delay is experimental.
> @@ -1161,7 +1179,8 @@
> '*multifd-compression': 'MultiFDCompression',
> '*multifd-zlib-level': 'uint8',
> '*multifd-zstd-level': 'uint8',
> - '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
> + '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
> + '*switchover-hold': 'bool' } }
>
> ##
> # @query-migrate-parameters:
> diff --git a/migration/migration.h b/migration/migration.h
> index 30c3e97635..721b1c9473 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -440,6 +440,22 @@ struct MigrationState {
>
> /* QEMU_VM_VMDESCRIPTION content filled for all non-iterable devices. */
> JSONWriter *vmdesc;
> + /*
> + * Only migration thread will wait on it when switchover_hold==true.
> + *
> + * Only qmp set param will kick it when switching switchover_hold from
> + * true->false.
> + *
> + * NOTE: outdated sem count here is benign. E.g., when this is posted,
> + * the 1st migration got cancelled, then start the 2nd migration, or
> + * when someone sets the flag from true->false->true->false.. because
> + * any outdated sem count will only let the migration thread to run one
> + * more loop (timedwait() will eat the outdated count) when reaching
> + * the completion phase, then in the next loop it'll sleep again. The
> + * important thing here OTOH is when the migration thread is sleeping
> + * we can always kick it out of the sleep, which we will always do.
> + */
> + QemuSemaphore switchover_hold_sem;
> };
>
> void migrate_set_state(int *state, int old_state, int new_state);
> diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
> index 9885d7c9f7..63a2c8a4a3 100644
> --- a/migration/migration-hmp-cmds.c
> +++ b/migration/migration-hmp-cmds.c
> @@ -338,6 +338,9 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
> monitor_printf(mon, "%s: '%s'\n",
> MigrationParameter_str(MIGRATION_PARAMETER_TLS_AUTHZ),
> params->tls_authz);
> + monitor_printf(mon, "%s: %s\n",
> + MigrationParameter_str(MIGRATION_PARAMETER_SWITCHOVER_HOLD),
> + params->switchover_hold ? "on" : "off");
>
> if (params->has_block_bitmap_mapping) {
> const BitmapMigrationNodeAliasList *bmnal;
> diff --git a/migration/migration.c b/migration/migration.c
> index dc05c6f6ea..076c9f1372 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2693,6 +2693,55 @@ static void migration_update_counters(MigrationState *s,
> bandwidth, s->threshold_size);
> }
>
> +static bool
> +migration_should_complete(MigrationState *s, uint64_t pending_size)
> +{
> + /* We still have large pending data to send? */
> + if (pending_size && (pending_size >= s->threshold_size)) {
The parenthesis here are redundant.
> + return false;
> + }
> +
> + /* The user doesn't want us to switchover yet */
> + if (s->parameters.switchover_hold) {
Should we check this only if we are in precopy? I.e., skip this check if
postcopy is active?
I thought that this parameter is all about precopy switchover, and the
fact that it prevents stopcopy from reaching migration_completion() is
just an undesired side effect.
As I see it, if this parameter is set then precopy switchover is hold
off, and if on top of that a user starts postcopy then this parameter is
not relevant anymore - we should start postcopy and ignore it for the
rest of migration.
Does it make sense?
Thanks.
> + /*
> + * Note: when reaching here it probably means we've migrated almost
> + * everything and ready to switchover. If user asked not to switch
> + * wait for a short period and respond to kicks immediately.
> + *
> + * If we wait too long, there can be a lot of dirty data generated,
> + * while we could have done something to sync data between src/dst.
> + *
> + * If we wait too short, migration thread can eat most/all cpu
> + * resource looping over switchover_hold.
> + *
> + * Make it 10ms which seems to be a good intermediate value.
> + */
> + qemu_sem_timedwait(&s->switchover_hold_sem, 10);
> +
> + /*
> + * Return false here always even if user changed it, because we'd
> + * like to re-evaluate everything (e.g. pending_size).
> + */
> + return false;
> + }
> +
> + return true;
> +}
> +
> +static bool
> +migration_should_start_postcopy(MigrationState *s, uint64_t must_precopy)
> +{
> + /* If we're already in postcopy phase, don't bother */
> + if (migration_in_postcopy()) {
> + return false;
> + }
> + /* We still have lots of thing that must be migrated in precopy */
> + if (must_precopy > s->threshold_size) {
> + return false;
> + }
> + return qatomic_read(&s->start_postcopy);
> +}
> +
> /* Migration thread iteration status */
> typedef enum {
> MIG_ITERATE_RESUME, /* Resume current iteration */
> @@ -2720,15 +2769,14 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> trace_migrate_pending_exact(pending_size, must_precopy, can_postcopy);
> }
>
> - if (!pending_size || pending_size < s->threshold_size) {
> + if (migration_should_complete(s, pending_size)) {
> trace_migration_thread_low_pending(pending_size);
> migration_completion(s);
> return MIG_ITERATE_BREAK;
> }
>
> /* Still a significant amount to transfer */
> - if (!in_postcopy && must_precopy <= s->threshold_size &&
> - qatomic_read(&s->start_postcopy)) {
> + if (migration_should_start_postcopy(s, must_precopy)) {
> if (postcopy_start(s)) {
> error_report("%s: postcopy failed to start", __func__);
> }
> @@ -3285,6 +3333,7 @@ static void migration_instance_finalize(Object *obj)
> qemu_sem_destroy(&ms->rp_state.rp_sem);
> qemu_sem_destroy(&ms->rp_state.rp_pong_acks);
> qemu_sem_destroy(&ms->postcopy_qemufile_src_sem);
> + qemu_sem_destroy(&ms->switchover_hold_sem);
> error_free(ms->error);
> }
>
> @@ -3307,6 +3356,7 @@ static void migration_instance_init(Object *obj)
> qemu_sem_init(&ms->rate_limit_sem, 0);
> qemu_sem_init(&ms->wait_unplug_sem, 0);
> qemu_sem_init(&ms->postcopy_qemufile_src_sem, 0);
> + qemu_sem_init(&ms->switchover_hold_sem, 0);
> qemu_mutex_init(&ms->qemu_file_lock);
> }
>
> diff --git a/migration/options.c b/migration/options.c
> index b62ab30cd5..2d6b138352 100644
> --- a/migration/options.c
> +++ b/migration/options.c
> @@ -163,6 +163,8 @@ Property migration_properties[] = {
> DEFINE_PROP_STRING("tls-creds", MigrationState, parameters.tls_creds),
> DEFINE_PROP_STRING("tls-hostname", MigrationState, parameters.tls_hostname),
> DEFINE_PROP_STRING("tls-authz", MigrationState, parameters.tls_authz),
> + DEFINE_PROP_BOOL("switchover-hold", MigrationState,
> + parameters.switchover_hold, false),
>
> /* Migration capabilities */
> DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
> @@ -883,6 +885,8 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
> params->announce_rounds = s->parameters.announce_rounds;
> params->has_announce_step = true;
> params->announce_step = s->parameters.announce_step;
> + params->has_switchover_hold = true;
> + params->switchover_hold = s->parameters.switchover_hold;
>
> if (s->parameters.has_block_bitmap_mapping) {
> params->has_block_bitmap_mapping = true;
> @@ -923,6 +927,7 @@ void migrate_params_init(MigrationParameters *params)
> params->has_announce_max = true;
> params->has_announce_rounds = true;
> params->has_announce_step = true;
> + params->has_switchover_hold = true;
> }
>
> /*
> @@ -1177,6 +1182,9 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
> if (params->has_announce_step) {
> dest->announce_step = params->announce_step;
> }
> + if (params->has_switchover_hold) {
> + dest->switchover_hold = params->switchover_hold;
> + }
>
> if (params->has_block_bitmap_mapping) {
> dest->has_block_bitmap_mapping = true;
> @@ -1290,6 +1298,15 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
> if (params->has_announce_step) {
> s->parameters.announce_step = params->announce_step;
> }
> + if (params->has_switchover_hold) {
> + bool old = s->parameters.switchover_hold;
> + bool new = params->switchover_hold;
> +
> + s->parameters.switchover_hold = params->switchover_hold;
> + if (old && !new) {
> + qemu_sem_post(&s->switchover_hold_sem);
> + }
> + }
>
> if (params->has_block_bitmap_mapping) {
> qapi_free_BitmapMigrationNodeAliasList(
> --
> 2.40.1
>
next prev parent reply other threads:[~2023-06-05 12:33 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-06-02 14:47 [PATCH 0/2] migration: switchover-hold flag Peter Xu
2023-06-02 14:47 ` [PATCH 1/2] migration: switchover-hold parameter Peter Xu
2023-06-05 12:27 ` Avihai Horon [this message]
2023-06-05 16:06 ` Peter Xu
2023-06-06 12:34 ` Avihai Horon
2023-06-06 15:58 ` Peter Xu
2023-06-02 14:47 ` [PATCH 2/2] qtest/migration: Use switchover-hold to speedup Peter Xu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=0c459f0c-1a04-f460-2b51-e8c3f997370d@nvidia.com \
--to=avihaih@nvidia.com \
--cc=berrange@redhat.com \
--cc=lsoaresp@redhat.com \
--cc=lvivier@redhat.com \
--cc=pbonzini@redhat.com \
--cc=peterx@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=quintela@redhat.com \
--cc=thuth@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).