* [PATCH 00/16] migration: Switchover phase refactoring
@ 2025-01-14 23:07 Peter Xu
2025-01-14 23:07 ` [PATCH 01/16] migration: Remove postcopy implications in should_send_vmdesc() Peter Xu
` (18 more replies)
0 siblings, 19 replies; 21+ messages in thread
From: Peter Xu @ 2025-01-14 23:07 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, Juraj Marcin, Julia Suvorova, Prasad Pandit,
Fabiano Rosas
CI: https://gitlab.com/peterx/qemu/-/pipelines/1625266692
(note: a warning shows up in the Rust bits, but it shouldn't be relevant)
This series refactors the migration switchover path quite a bit. I started
this work initially to measure the JSON writer overhead, but then decided to
clean up the switchover path in general while at it, something I had wanted
to do for a long time.
A few major things I tried to do:
- About the JSON writer
Currently, precopy migration always dumps a chunk of data called the VM
description (QEMU_VM_VMDESCRIPTION) for debugging purposes. That is a
JSON blob describing all the vmstates dumped in the migration stream.
QEMU has a machine property, suppress-vmdesc, that decides whether the
migration stream includes that JSON chunk.
Postcopy has no such JSON dump, because postcopy is a live session and
cannot normally be debugged at the stream level (e.g. as a streamed
file).
A tiny problem is that we do not yet know how many CPU cycles it takes
to construct and dump these JSONs, even though they are only for
debugging; worse, even with suppress-vmdesc=on QEMU will still
construct them (and also for postcopy).
This series has a few patches just to make sure the JSON blob won't be
constructed when it is not needed (either postcopy, or
suppress-vmdesc=on). I tried to measure the downtime difference with
and without these changes; the time QEMU takes to construct / dump the
JSON blob is still not measurable. So I suppose keeping it
unconditional would also be ok. That said, let's still have these
changes around so we avoid JSON operations when not needed (see the
framing sketch below).
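For reference, the vmdesc chunk on the wire is just a one-byte section
marker, a big-endian 32-bit length, then the JSON text itself. Below is
a minimal standalone sketch of that framing; the 0x06 marker value and
the layout follow savevm.c, while put_vmdesc() and the sample JSON keys
are purely illustrative:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define QEMU_VM_VMDESCRIPTION 0x06  /* section marker, as in savevm.c */

    /* Frame a JSON blob the way the migration stream carries it:
     * one marker byte, a be32 length, then the JSON text. */
    static void put_vmdesc(FILE *f, const char *json)
    {
        uint32_t len = strlen(json);
        uint8_t be[4] = {
            (uint8_t)(len >> 24), (uint8_t)(len >> 16),
            (uint8_t)(len >> 8), (uint8_t)len,
        };

        fputc(QEMU_VM_VMDESCRIPTION, f);
        fwrite(be, 1, sizeof(be), f);
        fwrite(json, 1, len, f);
    }

    int main(void)
    {
        /* "page_size" / "devices" mirror what the real writer emits */
        put_vmdesc(stdout, "{\"page_size\": 4096, \"devices\": []}");
        return 0;
    }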
- DEVICE migration state
QEMU has a very special DEVICE migration state that only occurs with
precopy, and only when the pause-before-switchover capability is
enabled. Due to that specialty we can't merge the precopy and postcopy
code where switchover starts, because the state machines differ.
However, after checking the history and talking with libvirt
developers, this special-casing seems unnecessary. So one patch makes
the DEVICE state the "switchover" phase for precopy and postcopy
unconditionally. That makes the state machine much simpler for both
modes, meanwhile nothing is expected to break (but please still shoot
if anyone knows, or suspects, something will or could break..). A
condensed sketch of the resulting entry point follows.
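As a condensed sketch (not verbatim) of where this lands, patch 14
below turns the old migration_maybe_pause() into a single switchover
entry shared by both modes; the semaphore drain loop and the
migration_is_active() assertion are trimmed here for brevity:

    static bool migration_switchover_start(MigrationState *s)
    {
        /* Quit early on a concurrent cancellation */
        if (s->state == MIGRATION_STATUS_CANCELLING) {
            return false;
        }

        /* Without pause-before-switchover, go straight to DEVICE */
        if (!migrate_pause_before_switchover()) {
            migrate_set_state(&s->state, s->state, MIGRATION_STATUS_DEVICE);
            return true;
        }

        /* PRE_SWITCHOVER: wait for the user's ack, then enter DEVICE */
        migrate_set_state(&s->state, s->state, MIGRATION_STATUS_PRE_SWITCHOVER);
        bql_unlock();
        qemu_sem_wait(&s->pause_sem);
        bql_lock();
        migrate_set_state(&s->state, MIGRATION_STATUS_PRE_SWITCHOVER,
                          MIGRATION_STATUS_DEVICE);

        return s->state == MIGRATION_STATUS_DEVICE;
    }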
- General cleanups and fixes
Most of the remaining changes are assorted cleanups and fixes in the
switchover path.
E.g., postcopy_start() has some code that isn't easy to read due to
special flags here and there, mostly around the two calls of
qemu_savevm_state_complete_precopy(). This series removes most of
those special treatments.
We could also end up doing some things twice during postcopy switchover
(e.g. I believe we synced the CPUs twice.. though that only happens
with postcopy; see the sketch below); those should all be sorted out
now.
Quite a few other things can hopefully be discussed and justified
separately in each patch. After these cleanups, we will be able to
have a unified entrance for precopy/postcopy on switchover.
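On the double CPU sync specifically: before this series,
postcopy_start() finished the switchover in two steps, and both went
through qemu_savevm_state_complete_precopy(), which unconditionally
called cpu_synchronize_all_states() at its top. The two calls, as they
appear in the pre-series code:

    /* Step 1: flush the non-postcopiable iterables (syncs all vCPUs) */
    qemu_savevm_state_complete_precopy(ms->to_dst_file, true, false);

    /* Step 2: dump the non-iterable device states into the buffered
     * channel (syncs all vCPUs again, redundantly) */
    qemu_savevm_state_complete_precopy(fb, false, false);

Patch 06 moves the sync into
qemu_savevm_state_complete_precopy_non_iterable(), so it runs exactly
once per switchover.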
Initially I thought this could slightly optimize the downtime, but after
some tests it turns out there's no measurable difference, at least in my
current setup... So let's take this as a cleanup series for now, and I
hope it still makes sense. Comments welcome.
Thanks,
Peter Xu (16):
migration: Remove postcopy implications in should_send_vmdesc()
migration: Do not construct JSON description if suppressed
migration: Optimize postcopy on downtime by avoiding JSON writer
migration: Avoid two src-downtime-end tracepoints for postcopy
migration: Drop inactivate_disk param in qemu_savevm_state_complete*
migration: Synchronize all CPU states only for non-iterable dump
migration: Adjust postcopy bandwidth during switchover
migration: Adjust locking in migration_maybe_pause()
migration: Drop cached migration state in migration_maybe_pause()
migration: Take BQL slightly longer in postcopy_start()
migration: Notify COMPLETE once for postcopy
migration: Unwrap qemu_savevm_state_complete_precopy() in postcopy
migration: Cleanup qemu_savevm_state_complete_precopy()
migration: Always set DEVICE state
migration: Merge precopy/postcopy on switchover start
migration: Trivial cleanup on JSON writer of vmstate_save()
qapi/migration.json | 7 +-
migration/migration.h | 1 +
migration/savevm.h | 6 +-
migration/migration.c | 209 +++++++++++++++++++++++-------------
migration/savevm.c | 116 ++++++++------------
migration/vmstate.c | 6 +-
tests/qtest/libqos/libqos.c | 3 +-
migration/trace-events | 2 +-
tests/qemu-iotests/194.out | 1 +
tests/qemu-iotests/203.out | 1 +
tests/qemu-iotests/234.out | 2 +
tests/qemu-iotests/262.out | 1 +
tests/qemu-iotests/280.out | 1 +
13 files changed, 200 insertions(+), 156 deletions(-)
--
2.47.0
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 01/16] migration: Remove postcopy implications in should_send_vmdesc()
2025-01-14 23:07 [PATCH 00/16] migration: Switchover phase refactoring Peter Xu
@ 2025-01-14 23:07 ` Peter Xu
2025-01-14 23:07 ` [PATCH 02/16] migration: Do not construct JSON description if suppressed Peter Xu
` (17 subsequent siblings)
18 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2025-01-14 23:07 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, Juraj Marcin, Julia Suvorova, Prasad Pandit,
Fabiano Rosas
should_send_vmdesc() has a hack inside (not reflected in the function
name): it tries to detect the global postcopy state, which affects the
value returned.
It's easier to keep the helper simple by only checking the
suppress-vmdesc property. Then:
- On the sender side of its usage, there's already an in_postcopy
variable we can use: postcopy doesn't send the vmdesc at all, so
directly skip everything for postcopy.
- On the recv side, by the time we reach vmdesc processing it must be
precopy code already, hence that hack never had any effect there anyway.
No functional change intended, except a trivial side effect: the QEMU
source will stop running some JSON helpers in the postcopy path, which
can only shrink the postcopy blackout window a bit rather than cause
any other bad side effect.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/savevm.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/migration/savevm.c b/migration/savevm.c
index c929da1ca5..0c4df27177 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1427,8 +1427,8 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy)
static bool should_send_vmdesc(void)
{
MachineState *machine = MACHINE(qdev_get_machine());
- bool in_postcopy = migration_in_postcopy();
- return !machine->suppress_vmdesc && !in_postcopy;
+
+ return !machine->suppress_vmdesc;
}
/*
@@ -1563,16 +1563,16 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
if (!in_postcopy) {
/* Postcopy stream will still be going */
qemu_put_byte(f, QEMU_VM_EOF);
- }
- json_writer_end_array(vmdesc);
- json_writer_end_object(vmdesc);
- vmdesc_len = strlen(json_writer_get(vmdesc));
+ json_writer_end_array(vmdesc);
+ json_writer_end_object(vmdesc);
+ vmdesc_len = strlen(json_writer_get(vmdesc));
- if (should_send_vmdesc()) {
- qemu_put_byte(f, QEMU_VM_VMDESCRIPTION);
- qemu_put_be32(f, vmdesc_len);
- qemu_put_buffer(f, (uint8_t *)json_writer_get(vmdesc), vmdesc_len);
+ if (should_send_vmdesc()) {
+ qemu_put_byte(f, QEMU_VM_VMDESCRIPTION);
+ qemu_put_be32(f, vmdesc_len);
+ qemu_put_buffer(f, (uint8_t *)json_writer_get(vmdesc), vmdesc_len);
+ }
}
/* Free it now to detect any inconsistencies. */
@@ -2965,6 +2965,7 @@ int qemu_loadvm_state(QEMUFile *f)
return ret;
}
+ /* When reaching here, it must be precopy */
if (ret == 0) {
ret = qemu_file_get_error(f);
}
--
2.47.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 02/16] migration: Do not construct JSON description if suppressed
2025-01-14 23:07 [PATCH 00/16] migration: Switchover phase refactoring Peter Xu
2025-01-14 23:07 ` [PATCH 01/16] migration: Remove postcopy implications in should_send_vmdesc() Peter Xu
@ 2025-01-14 23:07 ` Peter Xu
2025-01-14 23:07 ` [PATCH 03/16] migration: Optimize postcopy on downtime by avoiding JSON writer Peter Xu
` (16 subsequent siblings)
18 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2025-01-14 23:07 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, Juraj Marcin, Julia Suvorova, Prasad Pandit,
Fabiano Rosas
The QEMU machine has a property "suppress-vmdesc". When it is enabled,
QEMU stops attaching the JSON VM description at the end of the precopy
migration stream (postcopy is never affected, because postcopy never
attaches it).
However, even if it's suppressed by the user, the source QEMU will
still construct the JSON description, which is a complete waste of CPU
and memory resources.
To avoid that, only create the JSON writer object if suppress-vmdesc is
not specified.
Luckily, vmstate_save() already supports vmdesc==NULL, so only a few
remaining spots need to be prepared for vmdesc possibly being NULL now.
While at it, move the init / destroy of the JSON writer object to the
start / end of migration - the JSON writer object is a sub-struct of
the migration state, and it looks like the only object dynamically
allocated / destroyed within the migration process. Make it consistent
with the rest of the objects migration uses.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.h | 1 +
migration/migration.c | 9 +++++---
migration/savevm.c | 49 +++++++++++++++++++++++--------------------
3 files changed, 33 insertions(+), 26 deletions(-)
diff --git a/migration/migration.h b/migration/migration.h
index 0df2a187af..3f1927f79c 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -552,6 +552,7 @@ void migration_bitmap_sync_precopy(bool last_stage);
/* migration/block-dirty-bitmap.c */
void dirty_bitmap_mig_init(void);
+bool should_send_vmdesc(void);
/* migration/block-active.c */
void migration_block_active_setup(bool active);
diff --git a/migration/migration.c b/migration/migration.c
index 2d1da917c7..bad7b39293 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1406,8 +1406,8 @@ static void migrate_fd_cleanup(MigrationState *s)
g_free(s->hostname);
s->hostname = NULL;
- json_writer_free(s->vmdesc);
- s->vmdesc = NULL;
+
+ g_clear_pointer(&s->vmdesc, json_writer_free);
qemu_savevm_state_cleanup();
@@ -1681,7 +1681,10 @@ int migrate_init(MigrationState *s, Error **errp)
s->migration_thread_running = false;
error_free(s->error);
s->error = NULL;
- s->vmdesc = NULL;
+
+ if (should_send_vmdesc()) {
+ s->vmdesc = json_writer_new(false);
+ }
migrate_set_state(&s->state, MIGRATION_STATUS_NONE, MIGRATION_STATUS_SETUP);
diff --git a/migration/savevm.c b/migration/savevm.c
index 0c4df27177..fa03a0a264 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1231,8 +1231,7 @@ void qemu_savevm_non_migratable_list(strList **reasons)
void qemu_savevm_state_header(QEMUFile *f)
{
MigrationState *s = migrate_get_current();
-
- s->vmdesc = json_writer_new(false);
+ JSONWriter *vmdesc = s->vmdesc;
trace_savevm_state_header();
qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
@@ -1241,16 +1240,21 @@ void qemu_savevm_state_header(QEMUFile *f)
if (s->send_configuration) {
qemu_put_byte(f, QEMU_VM_CONFIGURATION);
- /*
- * This starts the main json object and is paired with the
- * json_writer_end_object in
- * qemu_savevm_state_complete_precopy_non_iterable
- */
- json_writer_start_object(s->vmdesc, NULL);
+ if (vmdesc) {
+ /*
+ * This starts the main json object and is paired with the
+ * json_writer_end_object in
+ * qemu_savevm_state_complete_precopy_non_iterable
+ */
+ json_writer_start_object(vmdesc, NULL);
+ json_writer_start_object(vmdesc, "configuration");
+ }
+
+ vmstate_save_state(f, &vmstate_configuration, &savevm_state, vmdesc);
- json_writer_start_object(s->vmdesc, "configuration");
- vmstate_save_state(f, &vmstate_configuration, &savevm_state, s->vmdesc);
- json_writer_end_object(s->vmdesc);
+ if (vmdesc) {
+ json_writer_end_object(vmdesc);
+ }
}
}
@@ -1296,16 +1300,19 @@ int qemu_savevm_state_setup(QEMUFile *f, Error **errp)
{
ERRP_GUARD();
MigrationState *ms = migrate_get_current();
+ JSONWriter *vmdesc = ms->vmdesc;
SaveStateEntry *se;
int ret = 0;
- json_writer_int64(ms->vmdesc, "page_size", qemu_target_page_size());
- json_writer_start_array(ms->vmdesc, "devices");
+ if (vmdesc) {
+ json_writer_int64(vmdesc, "page_size", qemu_target_page_size());
+ json_writer_start_array(vmdesc, "devices");
+ }
trace_savevm_state_setup();
QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
if (se->vmsd && se->vmsd->early_setup) {
- ret = vmstate_save(f, se, ms->vmdesc, errp);
+ ret = vmstate_save(f, se, vmdesc, errp);
if (ret) {
migrate_set_error(ms, *errp);
qemu_file_set_error(f, ret);
@@ -1424,7 +1431,7 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy)
return all_finished;
}
-static bool should_send_vmdesc(void)
+bool should_send_vmdesc(void)
{
MachineState *machine = MACHINE(qdev_get_machine());
@@ -1564,21 +1571,17 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
/* Postcopy stream will still be going */
qemu_put_byte(f, QEMU_VM_EOF);
- json_writer_end_array(vmdesc);
- json_writer_end_object(vmdesc);
- vmdesc_len = strlen(json_writer_get(vmdesc));
+ if (vmdesc) {
+ json_writer_end_array(vmdesc);
+ json_writer_end_object(vmdesc);
+ vmdesc_len = strlen(json_writer_get(vmdesc));
- if (should_send_vmdesc()) {
qemu_put_byte(f, QEMU_VM_VMDESCRIPTION);
qemu_put_be32(f, vmdesc_len);
qemu_put_buffer(f, (uint8_t *)json_writer_get(vmdesc), vmdesc_len);
}
}
- /* Free it now to detect any inconsistencies. */
- json_writer_free(vmdesc);
- ms->vmdesc = NULL;
-
trace_vmstate_downtime_checkpoint("src-non-iterable-saved");
return 0;
--
2.47.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 03/16] migration: Optimize postcopy on downtime by avoiding JSON writer
2025-01-14 23:07 [PATCH 00/16] migration: Switchover phase refactoring Peter Xu
2025-01-14 23:07 ` [PATCH 01/16] migration: Remove postcopy implications in should_send_vmdesc() Peter Xu
2025-01-14 23:07 ` [PATCH 02/16] migration: Do not construct JSON description if suppressed Peter Xu
@ 2025-01-14 23:07 ` Peter Xu
2025-01-14 23:07 ` [PATCH 04/16] migration: Avoid two src-downtime-end tracepoints for postcopy Peter Xu
` (15 subsequent siblings)
18 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2025-01-14 23:07 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, Juraj Marcin, Julia Suvorova, Prasad Pandit,
Fabiano Rosas
postcopy_start() is the entry function where postcopy is destined to
start. Reaching it also means the source QEMU will not dump the VM
description, aka, the JSON writer is garbage from now on.
We could leave it to be cleaned up when migration completes, however
while the JSON writer object is present, vmstate_save() will still try
to construct the JSON objects for the VM description, even though they
will never be used later in postcopy.
To save those cycles, release the JSON writer earlier for postcopy.
Then vmstate_save() will be smart enough to skip the JSON object
construction completely. This can logically reduce downtime, because
all such JSON construction happens during the postcopy blackout.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.c | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index bad7b39293..0eb28e850d 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1397,6 +1397,11 @@ void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
}
}
+static void migration_cleanup_json_writer(MigrationState *s)
+{
+ g_clear_pointer(&s->vmdesc, json_writer_free);
+}
+
static void migrate_fd_cleanup(MigrationState *s)
{
MigrationEventType type;
@@ -1404,11 +1409,11 @@ static void migrate_fd_cleanup(MigrationState *s)
trace_migrate_fd_cleanup();
+ migration_cleanup_json_writer(s);
+
g_free(s->hostname);
s->hostname = NULL;
- g_clear_pointer(&s->vmdesc, json_writer_free);
-
qemu_savevm_state_cleanup();
close_return_path_on_source(s);
@@ -2501,6 +2506,14 @@ static int postcopy_start(MigrationState *ms, Error **errp)
uint64_t bandwidth = migrate_max_postcopy_bandwidth();
int cur_state = MIGRATION_STATUS_ACTIVE;
+ /*
+ * Now we're 100% sure to switch to postcopy, so JSON writer won't be
+ * useful anymore. Free the resources early if it is there. Clearing
+ * the vmdesc also means any follow up vmstate_save()s will start to
+ * skip all JSON operations, which can shrink postcopy downtime.
+ */
+ migration_cleanup_json_writer(ms);
+
if (migrate_postcopy_preempt()) {
migration_wait_main_channel(ms);
if (postcopy_preempt_establish_channel(ms)) {
--
2.47.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 04/16] migration: Avoid two src-downtime-end tracepoints for postcopy
2025-01-14 23:07 [PATCH 00/16] migration: Switchover phase refactoring Peter Xu
` (2 preceding siblings ...)
2025-01-14 23:07 ` [PATCH 03/16] migration: Optimize postcopy on downtime by avoiding JSON writer Peter Xu
@ 2025-01-14 23:07 ` Peter Xu
2025-01-14 23:07 ` [PATCH 05/16] migration: Drop inactivate_disk param in qemu_savevm_state_complete* Peter Xu
` (14 subsequent siblings)
18 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2025-01-14 23:07 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, Juraj Marcin, Julia Suvorova, Prasad Pandit,
Fabiano Rosas
Postcopy can trigger this tracepoint twice, while only the 1st one is
valid. Avoid triggering the 2nd tracepoint, just like what we do when
recording the total downtime.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index 0eb28e850d..e1fc1a7fdc 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -125,9 +125,8 @@ static void migration_downtime_end(MigrationState *s)
*/
if (!s->downtime) {
s->downtime = now - s->downtime_start;
+ trace_vmstate_downtime_checkpoint("src-downtime-end");
}
-
- trace_vmstate_downtime_checkpoint("src-downtime-end");
}
static bool migration_needs_multiple_sockets(void)
--
2.47.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 05/16] migration: Drop inactivate_disk param in qemu_savevm_state_complete*
2025-01-14 23:07 [PATCH 00/16] migration: Switchover phase refactoring Peter Xu
` (3 preceding siblings ...)
2025-01-14 23:07 ` [PATCH 04/16] migration: Avoid two src-downtime-end tracepoints for postcopy Peter Xu
@ 2025-01-14 23:07 ` Peter Xu
2025-01-14 23:07 ` [PATCH 06/16] migration: Synchronize all CPU states only for non-iterable dump Peter Xu
` (13 subsequent siblings)
18 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2025-01-14 23:07 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, Juraj Marcin, Julia Suvorova, Prasad Pandit,
Fabiano Rosas
This parameter is only used by one caller, which is the genuine precopy
completion path (migration_completion_precopy).
The parameter was introduced in a1fbe750fd ("migration: Fix race of
image locking between src and dst") to make sure the inactivation
happens before EOF, so that the destination will always be able to
activate the disks properly. However, there's no restriction on how
early we can inactivate the disks: for the precopy completion path, we
can always do that as long as the VM is stopped.
Move the disk inactivation there, then remove the inactivate_disk
parameter from the whole call stack, because all the remaining users
always pass in false.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/savevm.h | 5 ++---
migration/migration.c | 22 ++++++++++++++++------
migration/savevm.c | 27 +++++----------------------
3 files changed, 23 insertions(+), 31 deletions(-)
diff --git a/migration/savevm.h b/migration/savevm.h
index 9ec96a995c..c48a53e95e 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -39,8 +39,7 @@ void qemu_savevm_state_header(QEMUFile *f);
int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
void qemu_savevm_state_cleanup(void);
void qemu_savevm_state_complete_postcopy(QEMUFile *f);
-int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only,
- bool inactivate_disks);
+int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only);
void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
uint64_t *can_postcopy);
void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
@@ -68,6 +67,6 @@ int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
int qemu_load_device_state(QEMUFile *f);
int qemu_loadvm_approve_switchover(void);
int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
- bool in_postcopy, bool inactivate_disks);
+ bool in_postcopy);
#endif
diff --git a/migration/migration.c b/migration/migration.c
index e1fc1a7fdc..b33baab950 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2555,7 +2555,7 @@ static int postcopy_start(MigrationState *ms, Error **errp)
* Cause any non-postcopiable, but iterative devices to
* send out their final data.
*/
- qemu_savevm_state_complete_precopy(ms->to_dst_file, true, false);
+ qemu_savevm_state_complete_precopy(ms->to_dst_file, true);
/*
* in Finish migrate and with the io-lock held everything should
@@ -2600,7 +2600,7 @@ static int postcopy_start(MigrationState *ms, Error **errp)
*/
qemu_savevm_send_postcopy_listen(fb);
- qemu_savevm_state_complete_precopy(fb, false, false);
+ qemu_savevm_state_complete_precopy(fb, false);
if (migrate_postcopy_ram()) {
qemu_savevm_send_ping(fb, 3);
}
@@ -2732,11 +2732,21 @@ static int migration_completion_precopy(MigrationState *s,
goto out_unlock;
}
+ /* Inactivate disks except in COLO */
+ if (!migrate_colo()) {
+ /*
+ * Inactivate before sending QEMU_VM_EOF so that the
+ * bdrv_activate_all() on the other end won't fail.
+ */
+ if (!migration_block_inactivate()) {
+ ret = -EFAULT;
+ goto out_unlock;
+ }
+ }
+
migration_rate_set(RATE_LIMIT_DISABLED);
- /* Inactivate disks except in COLO */
- ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
- !migrate_colo());
+ ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false);
out_unlock:
bql_unlock();
return ret;
@@ -3617,7 +3627,7 @@ static void *bg_migration_thread(void *opaque)
* save their state to channel-buffer along with devices.
*/
cpu_synchronize_all_states();
- if (qemu_savevm_state_complete_precopy_non_iterable(fb, false, false)) {
+ if (qemu_savevm_state_complete_precopy_non_iterable(fb, false)) {
goto fail;
}
/*
diff --git a/migration/savevm.c b/migration/savevm.c
index fa03a0a264..5e56a5d9fc 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1521,8 +1521,7 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
}
int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
- bool in_postcopy,
- bool inactivate_disks)
+ bool in_postcopy)
{
MigrationState *ms = migrate_get_current();
int64_t start_ts_each, end_ts_each;
@@ -1553,20 +1552,6 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
end_ts_each - start_ts_each);
}
- if (inactivate_disks) {
- /*
- * Inactivate before sending QEMU_VM_EOF so that the
- * bdrv_activate_all() on the other end won't fail.
- */
- if (!migration_block_inactivate()) {
- error_setg(&local_err, "%s: bdrv_inactivate_all() failed",
- __func__);
- migrate_set_error(ms, local_err);
- error_report_err(local_err);
- qemu_file_set_error(f, -EFAULT);
- return ret;
- }
- }
if (!in_postcopy) {
/* Postcopy stream will still be going */
qemu_put_byte(f, QEMU_VM_EOF);
@@ -1587,8 +1572,7 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
return 0;
}
-int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only,
- bool inactivate_disks)
+int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only)
{
int ret;
Error *local_err = NULL;
@@ -1613,8 +1597,7 @@ int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only,
goto flush;
}
- ret = qemu_savevm_state_complete_precopy_non_iterable(f, in_postcopy,
- inactivate_disks);
+ ret = qemu_savevm_state_complete_precopy_non_iterable(f, in_postcopy);
if (ret) {
return ret;
}
@@ -1717,7 +1700,7 @@ static int qemu_savevm_state(QEMUFile *f, Error **errp)
ret = qemu_file_get_error(f);
if (ret == 0) {
- qemu_savevm_state_complete_precopy(f, false, false);
+ qemu_savevm_state_complete_precopy(f, false);
ret = qemu_file_get_error(f);
}
if (ret != 0) {
@@ -1743,7 +1726,7 @@ cleanup:
void qemu_savevm_live_state(QEMUFile *f)
{
/* save QEMU_VM_SECTION_END section */
- qemu_savevm_state_complete_precopy(f, true, false);
+ qemu_savevm_state_complete_precopy(f, true);
qemu_put_byte(f, QEMU_VM_EOF);
}
--
2.47.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 06/16] migration: Synchronize all CPU states only for non-iterable dump
2025-01-14 23:07 [PATCH 00/16] migration: Switchover phase refactoring Peter Xu
` (4 preceding siblings ...)
2025-01-14 23:07 ` [PATCH 05/16] migration: Drop inactivate_disk param in qemu_savevm_state_complete* Peter Xu
@ 2025-01-14 23:07 ` Peter Xu
2025-01-14 23:07 ` [PATCH 07/16] migration: Adjust postcopy bandwidth during switchover Peter Xu
` (12 subsequent siblings)
18 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2025-01-14 23:07 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, Juraj Marcin, Julia Suvorova, Prasad Pandit,
Fabiano Rosas
Do a one-shot CPU sync in qemu_savevm_state_complete_precopy_non_iterable(),
instead of coding it separately in two places.
Note that in the context of qemu_savevm_state_complete_precopy(), this
patch is also an optimization for the postcopy path, in that we can
avoid syncing the CPUs twice during switchover: before this patch,
postcopy_start() invoked qemu_savevm_state_complete_precopy() twice,
and each invocation would sync the CPU state. In reality, one of them
is enough.
For background snapshots, no functional change is intended.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.c | 6 +-----
migration/savevm.c | 5 +++--
2 files changed, 4 insertions(+), 7 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index b33baab950..33c4824b68 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3622,11 +3622,7 @@ static void *bg_migration_thread(void *opaque)
if (migration_stop_vm(s, RUN_STATE_PAUSED)) {
goto fail;
}
- /*
- * Put vCPUs in sync with shadow context structures, then
- * save their state to channel-buffer along with devices.
- */
- cpu_synchronize_all_states();
+
if (qemu_savevm_state_complete_precopy_non_iterable(fb, false)) {
goto fail;
}
diff --git a/migration/savevm.c b/migration/savevm.c
index 5e56a5d9fc..92e77ca92b 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1531,6 +1531,9 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
Error *local_err = NULL;
int ret;
+ /* Making sure cpu states are synchronized before saving non-iterable */
+ cpu_synchronize_all_states();
+
QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
if (se->vmsd && se->vmsd->early_setup) {
/* Already saved during qemu_savevm_state_setup(). */
@@ -1584,8 +1587,6 @@ int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only)
trace_savevm_state_complete_precopy();
- cpu_synchronize_all_states();
-
if (!in_postcopy || iterable_only) {
ret = qemu_savevm_state_complete_precopy_iterable(f, in_postcopy);
if (ret) {
--
2.47.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 07/16] migration: Adjust postcopy bandwidth during switchover
2025-01-14 23:07 [PATCH 00/16] migration: Switchover phase refactoring Peter Xu
` (5 preceding siblings ...)
2025-01-14 23:07 ` [PATCH 06/16] migration: Synchronize all CPU states only for non-iterable dump Peter Xu
@ 2025-01-14 23:07 ` Peter Xu
2025-01-14 23:07 ` [PATCH 08/16] migration: Adjust locking in migration_maybe_pause() Peter Xu
` (11 subsequent siblings)
18 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2025-01-14 23:07 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, Juraj Marcin, Julia Suvorova, Prasad Pandit,
Fabiano Rosas
Precopy always uses unlimited bandwidth during switchover; that makes
sense, because this phase is so critical that no one would want to
throttle bandwidth during the VM blackout.
OTOH, postcopy surprisingly didn't do that. There's one line in the
middle of the postcopy switchover that switches to the user-specified
max-postcopy-bandwidth, but applying it midway through the switchover
is strange.
This patch makes the two modes always use unlimited bandwidth for
switchover, and applies the postcopy max bandwidth only after the
switchover is completed.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index 33c4824b68..13b7df0d5b 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2502,7 +2502,6 @@ static int postcopy_start(MigrationState *ms, Error **errp)
int ret;
QIOChannelBuffer *bioc;
QEMUFile *fb;
- uint64_t bandwidth = migrate_max_postcopy_bandwidth();
int cur_state = MIGRATION_STATUS_ACTIVE;
/*
@@ -2551,6 +2550,9 @@ static int postcopy_start(MigrationState *ms, Error **errp)
goto fail;
}
+ /* Switchover phase, switch to unlimited */
+ migration_rate_set(RATE_LIMIT_DISABLED);
+
/*
* Cause any non-postcopiable, but iterative devices to
* send out their final data.
@@ -2567,12 +2569,6 @@ static int postcopy_start(MigrationState *ms, Error **errp)
ram_postcopy_send_discard_bitmap(ms);
}
- /*
- * send rest of state - note things that are doing postcopy
- * will notice we're in POSTCOPY_ACTIVE and not actually
- * wrap their state up here
- */
- migration_rate_set(bandwidth);
if (migrate_postcopy_ram()) {
/* Ping just for debugging, helps line traces up */
qemu_savevm_send_ping(ms->to_dst_file, 2);
@@ -2656,6 +2652,12 @@ static int postcopy_start(MigrationState *ms, Error **errp)
}
trace_postcopy_preempt_enabled(migrate_postcopy_preempt());
+ /*
+ * Now postcopy officially started, switch to postcopy bandwidth that
+ * user specified.
+ */
+ migration_rate_set(migrate_max_postcopy_bandwidth());
+
return ret;
fail_closefb:
--
2.47.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 08/16] migration: Adjust locking in migration_maybe_pause()
2025-01-14 23:07 [PATCH 00/16] migration: Switchover phase refactoring Peter Xu
` (6 preceding siblings ...)
2025-01-14 23:07 ` [PATCH 07/16] migration: Adjust postcopy bandwidth during switchover Peter Xu
@ 2025-01-14 23:07 ` Peter Xu
2025-01-14 23:07 ` [PATCH 09/16] migration: Drop cached migration state " Peter Xu
` (10 subsequent siblings)
18 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2025-01-14 23:07 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, Juraj Marcin, Julia Suvorova, Prasad Pandit,
Fabiano Rosas
In migration_maybe_pause() QEMU may yield the BQL before waiting for a
semaphore. However, it yields the BQL too early, which logically gives
the main thread a chance to quickly take the BQL and modify the state
to CANCELLING.
To prevent such a race condition from happening at all, always update
the migration state within the BQL. That makes sure no concurrent
cancellation can ever happen.
With that, IIUC there's a chance we can remove the extra parameter of
migration_maybe_pause() that updates the active state, but that'll be
done separately later.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index 13b7df0d5b..5c688059de 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2701,14 +2701,14 @@ static int migration_maybe_pause(MigrationState *s,
* wait for the 'pause_sem' semaphore.
*/
if (s->state != MIGRATION_STATUS_CANCELLING) {
- bql_unlock();
migrate_set_state(&s->state, *current_active_state,
MIGRATION_STATUS_PRE_SWITCHOVER);
+ bql_unlock();
qemu_sem_wait(&s->pause_sem);
+ bql_lock();
migrate_set_state(&s->state, MIGRATION_STATUS_PRE_SWITCHOVER,
new_state);
*current_active_state = new_state;
- bql_lock();
}
return s->state == new_state ? 0 : -EINVAL;
--
2.47.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 09/16] migration: Drop cached migration state in migration_maybe_pause()
2025-01-14 23:07 [PATCH 00/16] migration: Switchover phase refactoring Peter Xu
` (7 preceding siblings ...)
2025-01-14 23:07 ` [PATCH 08/16] migration: Adjust locking in migration_maybe_pause() Peter Xu
@ 2025-01-14 23:07 ` Peter Xu
2025-01-14 23:07 ` [PATCH 10/16] migration: Take BQL slightly longer in postcopy_start() Peter Xu
` (9 subsequent siblings)
18 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2025-01-14 23:07 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, Juraj Marcin, Julia Suvorova, Prasad Pandit,
Fabiano Rosas
I can't see why we must cache the state now that the possible CANCEL
race is avoided: that race is the only thing I can think of that could
modify the migration state concurrently with the migration thread
itself. Make all the state updates happen unconditionally; then we
don't need to cache the state anymore.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.c | 27 ++++++++-------------------
1 file changed, 8 insertions(+), 19 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index 5c688059de..5850c90c9f 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -102,9 +102,7 @@ static MigrationIncomingState *current_incoming;
static GSList *migration_blockers[MIG_MODE__MAX];
static bool migration_object_check(MigrationState *ms, Error **errp);
-static int migration_maybe_pause(MigrationState *s,
- int *current_active_state,
- int new_state);
+static int migration_maybe_pause(MigrationState *s, int new_state);
static void migrate_fd_cancel(MigrationState *s);
static bool close_return_path_on_source(MigrationState *s);
static void migration_completion_end(MigrationState *s);
@@ -2502,7 +2500,6 @@ static int postcopy_start(MigrationState *ms, Error **errp)
int ret;
QIOChannelBuffer *bioc;
QEMUFile *fb;
- int cur_state = MIGRATION_STATUS_ACTIVE;
/*
* Now we're 100% sure to switch to postcopy, so JSON writer won't be
@@ -2537,8 +2534,7 @@ static int postcopy_start(MigrationState *ms, Error **errp)
goto fail;
}
- ret = migration_maybe_pause(ms, &cur_state,
- MIGRATION_STATUS_POSTCOPY_ACTIVE);
+ ret = migration_maybe_pause(ms, MIGRATION_STATUS_POSTCOPY_ACTIVE);
if (ret < 0) {
error_setg_errno(errp, -ret, "%s: Failed in migration_maybe_pause()",
__func__);
@@ -2676,9 +2672,7 @@ fail:
* migrate_pause_before_switchover called with the BQL locked
* Returns: 0 on success
*/
-static int migration_maybe_pause(MigrationState *s,
- int *current_active_state,
- int new_state)
+static int migration_maybe_pause(MigrationState *s, int new_state)
{
if (!migrate_pause_before_switchover()) {
return 0;
@@ -2701,21 +2695,19 @@ static int migration_maybe_pause(MigrationState *s,
* wait for the 'pause_sem' semaphore.
*/
if (s->state != MIGRATION_STATUS_CANCELLING) {
- migrate_set_state(&s->state, *current_active_state,
+ migrate_set_state(&s->state, s->state,
MIGRATION_STATUS_PRE_SWITCHOVER);
bql_unlock();
qemu_sem_wait(&s->pause_sem);
bql_lock();
migrate_set_state(&s->state, MIGRATION_STATUS_PRE_SWITCHOVER,
new_state);
- *current_active_state = new_state;
}
return s->state == new_state ? 0 : -EINVAL;
}
-static int migration_completion_precopy(MigrationState *s,
- int *current_active_state)
+static int migration_completion_precopy(MigrationState *s)
{
int ret;
@@ -2728,8 +2720,7 @@ static int migration_completion_precopy(MigrationState *s,
}
}
- ret = migration_maybe_pause(s, current_active_state,
- MIGRATION_STATUS_DEVICE);
+ ret = migration_maybe_pause(s, MIGRATION_STATUS_DEVICE);
if (ret < 0) {
goto out_unlock;
}
@@ -2782,11 +2773,10 @@ static void migration_completion_postcopy(MigrationState *s)
static void migration_completion(MigrationState *s)
{
int ret = 0;
- int current_active_state = s->state;
Error *local_err = NULL;
if (s->state == MIGRATION_STATUS_ACTIVE) {
- ret = migration_completion_precopy(s, &current_active_state);
+ ret = migration_completion_precopy(s);
} else if (s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
migration_completion_postcopy(s);
} else {
@@ -2826,8 +2816,7 @@ fail:
error_free(local_err);
}
- migrate_set_state(&s->state, current_active_state,
- MIGRATION_STATUS_FAILED);
+ migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
}
/**
--
2.47.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 10/16] migration: Take BQL slightly longer in postcopy_start()
2025-01-14 23:07 [PATCH 00/16] migration: Switchover phase refactoring Peter Xu
` (8 preceding siblings ...)
2025-01-14 23:07 ` [PATCH 09/16] migration: Drop cached migration state " Peter Xu
@ 2025-01-14 23:07 ` Peter Xu
2025-01-14 23:07 ` [PATCH 11/16] migration: Notify COMPLETE once for postcopy Peter Xu
` (8 subsequent siblings)
18 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2025-01-14 23:07 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, Juraj Marcin, Julia Suvorova, Prasad Pandit,
Fabiano Rosas
This paves the way for a follow-up patch to modify migration states at
the end of postcopy_start(), which had better happen under the BQL so
that there's no way for a concurrent cancellation.
So we'll do slightly more work with the BQL held, but it's really
trivial; hopefully nothing will really change with this.
A side benefit is that we can drop another explicit lock() in the
failure path.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index 5850c90c9f..f644a6306b 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2626,8 +2626,6 @@ static int postcopy_start(MigrationState *ms, Error **errp)
migration_downtime_end(ms);
- bql_unlock();
-
if (migrate_postcopy_ram()) {
/*
* Although this ping is just for debug, it could potentially be
@@ -2643,7 +2641,6 @@ static int postcopy_start(MigrationState *ms, Error **errp)
ret = qemu_file_get_error(ms->to_dst_file);
if (ret) {
error_setg_errno(errp, -ret, "postcopy_start: Migration stream error");
- bql_lock();
goto fail;
}
trace_postcopy_preempt_enabled(migrate_postcopy_preempt());
@@ -2654,6 +2651,8 @@ static int postcopy_start(MigrationState *ms, Error **errp)
*/
migration_rate_set(migrate_max_postcopy_bandwidth());
+ bql_unlock();
+
return ret;
fail_closefb:
--
2.47.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 11/16] migration: Notify COMPLETE once for postcopy
2025-01-14 23:07 [PATCH 00/16] migration: Switchover phase refactoring Peter Xu
` (9 preceding siblings ...)
2025-01-14 23:07 ` [PATCH 10/16] migration: Take BQL slightly longer in postcopy_start() Peter Xu
@ 2025-01-14 23:07 ` Peter Xu
2025-01-14 23:07 ` [PATCH 12/16] migration: Unwrap qemu_savevm_state_complete_precopy() in postcopy Peter Xu
` (7 subsequent siblings)
18 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2025-01-14 23:07 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, Juraj Marcin, Julia Suvorova, Prasad Pandit,
Fabiano Rosas
Postcopy invokes qemu_savevm_state_complete_precopy() twice, which
means it invokes the COMPLETE notifier twice.. and likewise fires the
tracepoint marking precopy completion twice.
Move that notification (along with the tracepoint) out to the caller,
so that postcopy will only notify once, right at the start of the
switchover phase from precopy. While at it, rename it to suit the file
it now lives in.
For precopy, there should be no functional change except that the
tracepoint is renamed.
For the other two users of qemu_savevm_state_complete_precopy(), namely
qemu_savevm_state() and qemu_savevm_live_state(): the notifier
shouldn't matter, because they're not precopy at all. In these two
contexts (aka, "savevm" and "colo") the precopy notifiers will
sometimes still be invoked, but that's outside the scope of this patch.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.c | 15 +++++++++++++++
migration/savevm.c | 7 -------
migration/trace-events | 2 +-
3 files changed, 16 insertions(+), 8 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index f644a6306b..2c5674c2ae 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -127,6 +127,17 @@ static void migration_downtime_end(MigrationState *s)
}
}
+static void precopy_notify_complete(void)
+{
+ Error *local_err = NULL;
+
+ if (precopy_notify(PRECOPY_NOTIFY_COMPLETE, &local_err)) {
+ error_report_err(local_err);
+ }
+
+ trace_migration_precopy_complete();
+}
+
static bool migration_needs_multiple_sockets(void)
{
return migrate_multifd() || migrate_postcopy_preempt();
@@ -2549,6 +2560,8 @@ static int postcopy_start(MigrationState *ms, Error **errp)
/* Switchover phase, switch to unlimited */
migration_rate_set(RATE_LIMIT_DISABLED);
+ precopy_notify_complete();
+
/*
* Cause any non-postcopiable, but iterative devices to
* send out their final data.
@@ -2738,6 +2751,8 @@ static int migration_completion_precopy(MigrationState *s)
migration_rate_set(RATE_LIMIT_DISABLED);
+ precopy_notify_complete();
+
ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false);
out_unlock:
bql_unlock();
diff --git a/migration/savevm.c b/migration/savevm.c
index 92e77ca92b..9aef2fa3c9 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1578,15 +1578,8 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only)
{
int ret;
- Error *local_err = NULL;
bool in_postcopy = migration_in_postcopy();
- if (precopy_notify(PRECOPY_NOTIFY_COMPLETE, &local_err)) {
- error_report_err(local_err);
- }
-
- trace_savevm_state_complete_precopy();
-
if (!in_postcopy || iterable_only) {
ret = qemu_savevm_state_complete_precopy_iterable(f, in_postcopy);
if (ret) {
diff --git a/migration/trace-events b/migration/trace-events
index b82a1c5e40..ec3c9d0ffe 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -44,7 +44,6 @@ savevm_state_resume_prepare(void) ""
savevm_state_header(void) ""
savevm_state_iterate(void) ""
savevm_state_cleanup(void) ""
-savevm_state_complete_precopy(void) ""
vmstate_save(const char *idstr, const char *vmsd_name) "%s, %s"
vmstate_load(const char *idstr, const char *vmsd_name) "%s, %s"
vmstate_downtime_save(const char *type, const char *idstr, uint32_t instance_id, int64_t downtime) "type=%s idstr=%s instance_id=%d downtime=%"PRIi64
@@ -193,6 +192,7 @@ migrate_transferred(uint64_t transferred, uint64_t time_spent, uint64_t bandwidt
process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
process_incoming_migration_co_postcopy_end_main(void) ""
postcopy_preempt_enabled(bool value) "%d"
+migration_precopy_complete(void) ""
# migration-stats
migration_transferred_bytes(uint64_t qemu_file, uint64_t multifd, uint64_t rdma) "qemu_file %" PRIu64 " multifd %" PRIu64 " RDMA %" PRIu64
--
2.47.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 12/16] migration: Unwrap qemu_savevm_state_complete_precopy() in postcopy
2025-01-14 23:07 [PATCH 00/16] migration: Switchover phase refactoring Peter Xu
` (10 preceding siblings ...)
2025-01-14 23:07 ` [PATCH 11/16] migration: Notify COMPLETE once for postcopy Peter Xu
@ 2025-01-14 23:07 ` Peter Xu
2025-01-14 23:07 ` [PATCH 13/16] migration: Cleanup qemu_savevm_state_complete_precopy() Peter Xu
` (6 subsequent siblings)
18 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2025-01-14 23:07 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, Juraj Marcin, Julia Suvorova, Prasad Pandit,
Fabiano Rosas
Postcopy has invoked qemu_savevm_state_complete_precopy() twice for a
long time, and that caused way too much confusion. Let's clean this up
and make postcopy easier to read.
It's actually fairly straightforward: postcopy starts with saving the
non-postcopiable iterables, then later saves again with non-iterables
only. Moving these two calls out makes everything much easier to
follow; otherwise it's very unclear what
qemu_savevm_state_complete_precopy() did in either of the calls.
No functional change intended.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/savevm.h | 1 +
migration/migration.c | 13 +++++++++++--
migration/savevm.c | 1 -
3 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/migration/savevm.h b/migration/savevm.h
index c48a53e95e..7957460062 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -44,6 +44,7 @@ void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
uint64_t *can_postcopy);
void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
uint64_t *can_postcopy);
+int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy);
void qemu_savevm_send_ping(QEMUFile *f, uint32_t value);
void qemu_savevm_send_open_return_path(QEMUFile *f);
int qemu_savevm_send_packaged(QEMUFile *f, const uint8_t *buf, size_t len);
diff --git a/migration/migration.c b/migration/migration.c
index 2c5674c2ae..a8fe423a2b 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2566,7 +2566,11 @@ static int postcopy_start(MigrationState *ms, Error **errp)
* Cause any non-postcopiable, but iterative devices to
* send out their final data.
*/
- qemu_savevm_state_complete_precopy(ms->to_dst_file, true);
+ ret = qemu_savevm_state_complete_precopy_iterable(ms->to_dst_file, true);
+ if (ret) {
+ error_setg(errp, "Postcopy save non-postcopiable iterables failed");
+ goto fail;
+ }
/*
* in Finish migrate and with the io-lock held everything should
@@ -2605,7 +2609,12 @@ static int postcopy_start(MigrationState *ms, Error **errp)
*/
qemu_savevm_send_postcopy_listen(fb);
- qemu_savevm_state_complete_precopy(fb, false);
+ ret = qemu_savevm_state_complete_precopy_non_iterable(fb, true);
+ if (ret) {
+ error_setg(errp, "Postcopy save non-iterable device states failed");
+ goto fail_closefb;
+ }
+
if (migrate_postcopy_ram()) {
qemu_savevm_send_ping(fb, 3);
}
diff --git a/migration/savevm.c b/migration/savevm.c
index 9aef2fa3c9..0ddc4c8eb5 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1477,7 +1477,6 @@ void qemu_savevm_state_complete_postcopy(QEMUFile *f)
qemu_fflush(f);
}
-static
int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
{
int64_t start_ts_each, end_ts_each;
--
2.47.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 13/16] migration: Cleanup qemu_savevm_state_complete_precopy()
2025-01-14 23:07 [PATCH 00/16] migration: Switchover phase refactoring Peter Xu
` (11 preceding siblings ...)
2025-01-14 23:07 ` [PATCH 12/16] migration: Unwrap qemu_savevm_state_complete_precopy() in postcopy Peter Xu
@ 2025-01-14 23:07 ` Peter Xu
2025-01-14 23:07 ` [PATCH 14/16] migration: Always set DEVICE state Peter Xu
` (5 subsequent siblings)
18 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2025-01-14 23:07 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, Juraj Marcin, Julia Suvorova, Prasad Pandit,
Fabiano Rosas
Now that qemu_savevm_state_complete_precopy() is never used in
postcopy, clean it up: in_postcopy==false holds unconditionally.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/savevm.c | 20 +++++++-------------
1 file changed, 7 insertions(+), 13 deletions(-)
diff --git a/migration/savevm.c b/migration/savevm.c
index 0ddc4c8eb5..bc375db282 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1577,25 +1577,19 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only)
{
int ret;
- bool in_postcopy = migration_in_postcopy();
- if (!in_postcopy || iterable_only) {
- ret = qemu_savevm_state_complete_precopy_iterable(f, in_postcopy);
+ ret = qemu_savevm_state_complete_precopy_iterable(f, false);
+ if (ret) {
+ return ret;
+ }
+
+ if (!iterable_only) {
+ ret = qemu_savevm_state_complete_precopy_non_iterable(f, false);
if (ret) {
return ret;
}
}
- if (iterable_only) {
- goto flush;
- }
-
- ret = qemu_savevm_state_complete_precopy_non_iterable(f, in_postcopy);
- if (ret) {
- return ret;
- }
-
-flush:
return qemu_fflush(f);
}
--
2.47.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 14/16] migration: Always set DEVICE state
2025-01-14 23:07 [PATCH 00/16] migration: Switchover phase refactoring Peter Xu
` (12 preceding siblings ...)
2025-01-14 23:07 ` [PATCH 13/16] migration: Cleanup qemu_savevm_state_complete_precopy() Peter Xu
@ 2025-01-14 23:07 ` Peter Xu
2025-01-14 23:07 ` [PATCH 15/16] migration: Merge precopy/postcopy on switchover start Peter Xu
` (4 subsequent siblings)
18 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2025-01-14 23:07 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, Juraj Marcin, Julia Suvorova, Prasad Pandit,
Fabiano Rosas, Jiri Denemark, Daniel P. Berrangé,
Dr. David Alan Gilbert
DEVICE state was introduced back in 2017:
https://lore.kernel.org/qemu-devel/20171020090556.18631-1-dgilbert@redhat.com/
Quoting Dave's cover letter, when the pre-switchover phase is enabled
the state transitions look like this:
The precopy flow is:
active->pre-switchover->device->completed
The postcopy flow is:
active->pre-switchover->postcopy-active->completed
To supplement the above, when the cap is not enabled:
The precopy flow is:
active->completed
The postcopy flow is:
active->postcopy-active->completed
It works for us, though we carry some code just to special-case these
state transitions, so the DEVICE state is currently special only to
precopy, and only conditionally so.
I had a quick discussion with the libvirt developers, and it turns out
this may not be necessary. IOW, it seems okay to make the DEVICE state
generic, so that we don't have over-complicated state machines. It not
only aligns all the migration state machines and helps clean up the
code path, especially the pre-switchover handling (see the patch
itself); another side benefit is that we unconditionally have a
specific state marking the switchover phase, which might be helpful
for debugging too.
This patch makes the DEVICE state always present, marking that the
source QEMU is switching over. Then the state machine will always be
as simple as:
active -> [pre-switchover ->] device -> [postcopy-active ->] completed
After the change, no matter whether pre-switchover or postcopy is
enabled or not, we always have the DEVICE state showing the switchover
phase. With pre-switchover enabled, we'll have an extra stage before
it. With postcopy enabled, we'll have an extra stage after it.
A few qtests need touching up in the QEMU tree for this change:
- A few iotest outputs (194, 203, 234, 262, 280)
- Teach libqos's migrate() about the "device" state
Cc: Jiri Denemark <jdenemar@redhat.com>
Cc: Daniel P. Berrangé <berrange@redhat.com>
Cc: Dr. David Alan Gilbert <dave@treblig.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
qapi/migration.json | 7 +++-
migration/migration.c | 82 +++++++++++++++++++++++--------------
tests/qtest/libqos/libqos.c | 3 +-
tests/qemu-iotests/194.out | 1 +
tests/qemu-iotests/203.out | 1 +
tests/qemu-iotests/234.out | 2 +
tests/qemu-iotests/262.out | 1 +
tests/qemu-iotests/280.out | 1 +
8 files changed, 64 insertions(+), 34 deletions(-)
diff --git a/qapi/migration.json b/qapi/migration.json
index a605dc26db..86d2342e1f 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -158,8 +158,11 @@
#
# @pre-switchover: Paused before device serialisation. (since 2.11)
#
-# @device: During device serialisation when pause-before-switchover is
-# enabled (since 2.11)
+# @device: During device serialisation (also known as switchover phase).
+# Before 9.2, this is only used when (1) in precopy, and (2) when
+# pre-switchover capability is enabled. After 10.0, this state will
+# always be present for every migration procedure as the switchover
+# phase. (since 2.11)
#
# @wait-unplug: wait for device unplug request by guest OS to be
# completed. (since 4.2)
diff --git a/migration/migration.c b/migration/migration.c
index a8fe423a2b..2a95081b6c 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -102,7 +102,7 @@ static MigrationIncomingState *current_incoming;
static GSList *migration_blockers[MIG_MODE__MAX];
static bool migration_object_check(MigrationState *ms, Error **errp);
-static int migration_maybe_pause(MigrationState *s, int new_state);
+static bool migration_switchover_start(MigrationState *s);
static void migrate_fd_cancel(MigrationState *s);
static bool close_return_path_on_source(MigrationState *s);
static void migration_completion_end(MigrationState *s);
@@ -2530,11 +2530,6 @@ static int postcopy_start(MigrationState *ms, Error **errp)
}
}
- if (!migrate_pause_before_switchover()) {
- migrate_set_state(&ms->state, MIGRATION_STATUS_ACTIVE,
- MIGRATION_STATUS_POSTCOPY_ACTIVE);
- }
-
trace_postcopy_start();
bql_lock();
trace_postcopy_start_set_run();
@@ -2545,10 +2540,8 @@ static int postcopy_start(MigrationState *ms, Error **errp)
goto fail;
}
- ret = migration_maybe_pause(ms, MIGRATION_STATUS_POSTCOPY_ACTIVE);
- if (ret < 0) {
- error_setg_errno(errp, -ret, "%s: Failed in migration_maybe_pause()",
- __func__);
+ if (!migration_switchover_start(ms)) {
+ error_setg(errp, "migration_switchover_start() failed");
goto fail;
}
@@ -2673,6 +2666,10 @@ static int postcopy_start(MigrationState *ms, Error **errp)
*/
migration_rate_set(migrate_max_postcopy_bandwidth());
+ /* Now, switchover looks all fine, switching to postcopy-active */
+ migrate_set_state(&ms->state, MIGRATION_STATUS_DEVICE,
+ MIGRATION_STATUS_POSTCOPY_ACTIVE);
+
bql_unlock();
return ret;
@@ -2689,14 +2686,39 @@ fail:
}
/**
- * migration_maybe_pause: Pause if required to by
- * migrate_pause_before_switchover called with the BQL locked
- * Returns: 0 on success
+ * @migration_switchover_start: Start VM switchover procedure
+ *
+ * @s: The migration state object pointer
+ *
+ * Prepares for the switchover, depending on "pause-before-switchover"
+ * capability.
+ *
+ * If cap set, state machine goes like:
+ * [postcopy-]active -> pre-switchover -> device
+ *
+ * If cap not set:
+ * [postcopy-]active -> device
+ *
+ * Returns: true on success, false on interruptions.
*/
-static int migration_maybe_pause(MigrationState *s, int new_state)
+static bool migration_switchover_start(MigrationState *s)
{
+ /* Concurrent cancellation? Quit */
+ if (s->state == MIGRATION_STATUS_CANCELLING) {
+ return false;
+ }
+
+ /*
+ * No matter precopy or postcopy, since we still hold BQL it must not
+ * change concurrently to CANCELLING, so it must be either ACTIVE or
+ * POSTCOPY_ACTIVE.
+ */
+ assert(migration_is_active());
+
+ /* If the pre stage not requested, directly switch to DEVICE */
if (!migrate_pause_before_switchover()) {
- return 0;
+ migrate_set_state(&s->state, s->state, MIGRATION_STATUS_DEVICE);
+ return true;
}
/* Since leaving this state is not atomic with posting the semaphore
@@ -2709,23 +2731,22 @@ static int migration_maybe_pause(MigrationState *s, int new_state)
/* This block intentionally left blank */
}
+ /* Update [POSTCOPY_]ACTIVE to PRE_SWITCHOVER */
+ migrate_set_state(&s->state, s->state, MIGRATION_STATUS_PRE_SWITCHOVER);
+ bql_unlock();
+
+ qemu_sem_wait(&s->pause_sem);
+
+ bql_lock();
/*
- * If the migration is cancelled when it is in the completion phase,
- * the migration state is set to MIGRATION_STATUS_CANCELLING.
- * So we don't need to wait a semaphore, otherwise we would always
- * wait for the 'pause_sem' semaphore.
+ * After BQL released and retaken, the state can be CANCELLING if it
+ * happened during sem_wait().. Only change the state if it's still
+ * pre-switchover.
*/
- if (s->state != MIGRATION_STATUS_CANCELLING) {
- migrate_set_state(&s->state, s->state,
- MIGRATION_STATUS_PRE_SWITCHOVER);
- bql_unlock();
- qemu_sem_wait(&s->pause_sem);
- bql_lock();
- migrate_set_state(&s->state, MIGRATION_STATUS_PRE_SWITCHOVER,
- new_state);
- }
+ migrate_set_state(&s->state, MIGRATION_STATUS_PRE_SWITCHOVER,
+ MIGRATION_STATUS_DEVICE);
- return s->state == new_state ? 0 : -EINVAL;
+ return s->state == MIGRATION_STATUS_DEVICE;
}
static int migration_completion_precopy(MigrationState *s)
@@ -2741,8 +2762,7 @@ static int migration_completion_precopy(MigrationState *s)
}
}
- ret = migration_maybe_pause(s, MIGRATION_STATUS_DEVICE);
- if (ret < 0) {
+ if (!migration_switchover_start(s)) {
goto out_unlock;
}
diff --git a/tests/qtest/libqos/libqos.c b/tests/qtest/libqos/libqos.c
index 5c0fa1f7c5..28a0901a0a 100644
--- a/tests/qtest/libqos/libqos.c
+++ b/tests/qtest/libqos/libqos.c
@@ -117,13 +117,14 @@ void migrate(QOSState *from, QOSState *to, const char *uri)
g_assert(qdict_haskey(sub, "status"));
st = qdict_get_str(sub, "status");
- /* "setup", "active", "completed", "failed", "cancelled" */
+ /* "setup", "active", "device", "completed", "failed", "cancelled" */
if (strcmp(st, "completed") == 0) {
qobject_unref(rsp);
break;
}
if ((strcmp(st, "setup") == 0) || (strcmp(st, "active") == 0)
+ || (strcmp(st, "device") == 0)
|| (strcmp(st, "wait-unplug") == 0)) {
qobject_unref(rsp);
g_usleep(5000);
diff --git a/tests/qemu-iotests/194.out b/tests/qemu-iotests/194.out
index 376ed1d2e6..6940e809cd 100644
--- a/tests/qemu-iotests/194.out
+++ b/tests/qemu-iotests/194.out
@@ -14,6 +14,7 @@ Starting migration...
{"return": {}}
{"data": {"status": "setup"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
{"data": {"status": "active"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
+{"data": {"status": "device"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
{"data": {"status": "completed"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
Gracefully ending the `drive-mirror` job on source...
{"return": {}}
diff --git a/tests/qemu-iotests/203.out b/tests/qemu-iotests/203.out
index 9d4abba8c5..8e58705e51 100644
--- a/tests/qemu-iotests/203.out
+++ b/tests/qemu-iotests/203.out
@@ -8,4 +8,5 @@ Starting migration...
{"return": {}}
{"data": {"status": "setup"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
{"data": {"status": "active"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
+{"data": {"status": "device"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
{"data": {"status": "completed"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
diff --git a/tests/qemu-iotests/234.out b/tests/qemu-iotests/234.out
index ac8b64350c..be3e138b58 100644
--- a/tests/qemu-iotests/234.out
+++ b/tests/qemu-iotests/234.out
@@ -10,6 +10,7 @@ Starting migration to B...
{"return": {}}
{"data": {"status": "setup"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
{"data": {"status": "active"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
+{"data": {"status": "device"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
{"data": {"status": "completed"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
{"data": {"status": "active"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
{"data": {"status": "completed"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
@@ -27,6 +28,7 @@ Starting migration back to A...
{"return": {}}
{"data": {"status": "setup"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
{"data": {"status": "active"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
+{"data": {"status": "device"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
{"data": {"status": "completed"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
{"data": {"status": "active"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
{"data": {"status": "completed"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
diff --git a/tests/qemu-iotests/262.out b/tests/qemu-iotests/262.out
index b8a2d3598d..bd7706b84b 100644
--- a/tests/qemu-iotests/262.out
+++ b/tests/qemu-iotests/262.out
@@ -8,6 +8,7 @@ Starting migration to B...
{"return": {}}
{"data": {"status": "setup"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
{"data": {"status": "active"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
+{"data": {"status": "device"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
{"data": {"status": "completed"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
{"data": {"status": "active"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
{"data": {"status": "completed"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
diff --git a/tests/qemu-iotests/280.out b/tests/qemu-iotests/280.out
index 546dbb4a68..37411144ca 100644
--- a/tests/qemu-iotests/280.out
+++ b/tests/qemu-iotests/280.out
@@ -7,6 +7,7 @@ Enabling migration QMP events on VM...
{"return": {}}
{"data": {"status": "setup"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
{"data": {"status": "active"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
+{"data": {"status": "device"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
{"data": {"status": "completed"}, "event": "MIGRATION", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
VM is now stopped:
--
2.47.0
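To make the new state machine concrete: with pause-before-switchover enabled, the source now moves [postcopy-]active -> pre-switchover -> device in both modes, and management unblocks the pre-switchover wait with migrate-continue. Below is a minimal libqtest-style sketch of driving that handshake; the QMP commands (migrate-set-capabilities, query-migrate, migrate-continue) are real, but the surrounding flow and the `src` handle are illustrative, not code from this series:

    /* Enable the handshake before starting the migration */
    qtest_qmp_assert_success(src,
        "{ 'execute': 'migrate-set-capabilities', 'arguments': { 'capabilities':"
        " [ { 'capability': 'pause-before-switchover', 'state': true } ] } }");

    /* ... kick off the migration, then poll until it pauses ... */
    for (;;) {
        QDict *rsp = qtest_qmp(src, "{ 'execute': 'query-migrate' }");
        const char *st = qdict_get_str(qdict_get_qdict(rsp, "return"), "status");
        bool paused = !strcmp(st, "pre-switchover");

        qobject_unref(rsp);
        if (paused) {
            break;
        }
        g_usleep(5000);
    }

    /* This posts pause_sem on the source; the state then moves to "device" */
    qtest_qmp_assert_success(src,
        "{ 'execute': 'migrate-continue',"
        " 'arguments': { 'state': 'pre-switchover' } }");

Without the capability, migration_switchover_start() jumps straight to "device", which is why the iotests outputs above now gain a "device" event unconditionally.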
* [PATCH 15/16] migration: Merge precopy/postcopy on switchover start
2025-01-14 23:07 [PATCH 00/16] migration: Switchover phase refactoring Peter Xu
` (13 preceding siblings ...)
2025-01-14 23:07 ` [PATCH 14/16] migration: Always set DEVICE state Peter Xu
@ 2025-01-14 23:07 ` Peter Xu
2025-01-14 23:07 ` [PATCH 16/16] migration: Trivial cleanup on JSON writer of vmstate_save() Peter Xu
` (3 subsequent siblings)
18 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2025-01-14 23:07 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, Juraj Marcin, Julia Suvorova, Prasad Pandit,
Fabiano Rosas
Now, after all the cleanups, we can finally merge the switchover startup
phase into a single function shared by precopy and postcopy.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.c | 62 ++++++++++++++++++++++---------------------
1 file changed, 32 insertions(+), 30 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index 2a95081b6c..f6694e7e94 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -102,7 +102,7 @@ static MigrationIncomingState *current_incoming;
static GSList *migration_blockers[MIG_MODE__MAX];
static bool migration_object_check(MigrationState *ms, Error **errp);
-static bool migration_switchover_start(MigrationState *s);
+static bool migration_switchover_start(MigrationState *s, Error **errp);
static void migrate_fd_cancel(MigrationState *s);
static bool close_return_path_on_source(MigrationState *s);
static void migration_completion_end(MigrationState *s);
@@ -2540,21 +2540,10 @@ static int postcopy_start(MigrationState *ms, Error **errp)
goto fail;
}
- if (!migration_switchover_start(ms)) {
- error_setg(errp, "migration_switchover_start() failed");
+ if (!migration_switchover_start(ms, errp)) {
goto fail;
}
- if (!migration_block_inactivate()) {
- error_setg(errp, "%s: Failed in bdrv_inactivate_all()", __func__);
- goto fail;
- }
-
- /* Switchover phase, switch to unlimited */
- migration_rate_set(RATE_LIMIT_DISABLED);
-
- precopy_notify_complete();
-
/*
* Cause any non-postcopiable, but iterative devices to
* send out their final data.
@@ -2686,7 +2675,7 @@ fail:
}
/**
- * @migration_switchover_start: Start VM switchover procedure
+ * @migration_switchover_prepare: Start VM switchover procedure
*
* @s: The migration state object pointer
*
@@ -2701,7 +2690,7 @@ fail:
*
* Returns: true on success, false on interruptions.
*/
-static bool migration_switchover_start(MigrationState *s)
+static bool migration_switchover_prepare(MigrationState *s)
{
/* Concurrent cancellation? Quit */
if (s->state == MIGRATION_STATUS_CANCELLING) {
@@ -2749,21 +2738,13 @@ static bool migration_switchover_start(MigrationState *s)
return s->state == MIGRATION_STATUS_DEVICE;
}
-static int migration_completion_precopy(MigrationState *s)
+static bool migration_switchover_start(MigrationState *s, Error **errp)
{
- int ret;
-
- bql_lock();
-
- if (!migrate_mode_is_cpr(s)) {
- ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
- if (ret < 0) {
- goto out_unlock;
- }
- }
+ ERRP_GUARD();
- if (!migration_switchover_start(s)) {
- goto out_unlock;
+ if (!migration_switchover_prepare(s)) {
+ error_setg(errp, "Switchover is interrupted");
+ return false;
}
/* Inactivate disks except in COLO */
@@ -2773,8 +2754,8 @@ static int migration_completion_precopy(MigrationState *s)
* bdrv_activate_all() on the other end won't fail.
*/
if (!migration_block_inactivate()) {
- ret = -EFAULT;
- goto out_unlock;
+ error_setg(errp, "Block inactivate failed during switchover");
+ return false;
}
}
@@ -2782,6 +2763,27 @@ static int migration_completion_precopy(MigrationState *s)
precopy_notify_complete();
+ return true;
+}
+
+static int migration_completion_precopy(MigrationState *s)
+{
+ int ret;
+
+ bql_lock();
+
+ if (!migrate_mode_is_cpr(s)) {
+ ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
+ if (ret < 0) {
+ goto out_unlock;
+ }
+ }
+
+ if (!migration_switchover_start(s, NULL)) {
+ ret = -EFAULT;
+ goto out_unlock;
+ }
+
ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false);
out_unlock:
bql_unlock();
--
2.47.0
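A side note on the ERRP_GUARD() added above: it is the standard QEMU idiom (from qapi/error.h) for a function that may be called with errp == NULL, as migration_completion_precopy() now does with migration_switchover_start(s, NULL), but still wants a valid Error ** to fill in and pass down. A generic sketch of the idiom, with sub_step()/whole_step() being hypothetical names used only for illustration:

    #include "qapi/error.h"

    static bool sub_step(Error **errp);    /* hypothetical callee */

    static bool whole_step(Error **errp)
    {
        ERRP_GUARD();    /* errp is guaranteed non-NULL from here on */

        if (!sub_step(errp)) {
            /* Safe to touch *errp even if our caller passed NULL */
            error_prepend(errp, "whole step failed: ");
            return false;
        }
        return true;
    }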
* [PATCH 16/16] migration: Trivial cleanup on JSON writer of vmstate_save()
2025-01-14 23:07 [PATCH 00/16] migration: Switchover phase refactoring Peter Xu
` (14 preceding siblings ...)
2025-01-14 23:07 ` [PATCH 15/16] migration: Merge precopy/postcopy on switchover start Peter Xu
@ 2025-01-14 23:07 ` Peter Xu
2025-01-15 9:12 ` [PATCH 00/16] migration: Switchover phase refactoring Jiri Denemark
` (2 subsequent siblings)
18 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2025-01-14 23:07 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, Juraj Marcin, Julia Suvorova, Prasad Pandit,
Fabiano Rosas
Two small cleanups in the same section of vmstate_save():
- Check vmdesc before the "mixed null/non-null data in array" logic, to
make it crystal clear that the logic only concerns the JSON writer, not
the vmstate itself in the migration stream.
- Since we now have the is_null variable, use it to replace a check.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/vmstate.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/migration/vmstate.c b/migration/vmstate.c
index 82bd005a83..047a52af89 100644
--- a/migration/vmstate.c
+++ b/migration/vmstate.c
@@ -459,6 +459,8 @@ int vmstate_save_state_v(QEMUFile *f, const VMStateDescription *vmsd,
}
/*
+ * This logic only matters when dumping VM Desc.
+ *
* Due to the fake nullptr handling above, if there's mixed
* null/non-null data, it doesn't make sense to emit a
* compressed array representation spanning the entire array
@@ -466,7 +468,7 @@ int vmstate_save_state_v(QEMUFile *f, const VMStateDescription *vmsd,
* vs. nullptr). Search ahead for the next null/non-null element
* and start a new compressed array if found.
*/
- if (field->flags & VMS_ARRAY_OF_POINTER &&
+ if (vmdesc && (field->flags & VMS_ARRAY_OF_POINTER) &&
is_null != is_prev_null) {
is_prev_null = is_null;
@@ -504,7 +506,7 @@ int vmstate_save_state_v(QEMUFile *f, const VMStateDescription *vmsd,
written_bytes);
/* If we used a fake temp field.. free it now */
- if (inner_field != field) {
+ if (is_null) {
g_clear_pointer((gpointer *)&inner_field, g_free);
}
--
2.47.0
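The invariant this last patch reinforces is that every JSON-writer call in the save path is gated on vmdesc being non-NULL, so suppress-vmdesc=on (and, after this series, postcopy) constructs no JSON at all. A condensed sketch of the pattern, using the real json-writer API with illustrative field values:

    if (vmdesc) {
        json_writer_start_object(vmdesc, NULL);
        json_writer_str(vmdesc, "name", field->name);
        json_writer_int64(vmdesc, "size", written_bytes);
        json_writer_end_object(vmdesc);
    }

With the hunk above, the compressed-array bookkeeping for VMS_ARRAY_OF_POINTER fields follows the same rule, since it only affects the JSON output, never the migration stream itself.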
* Re: [PATCH 00/16] migration: Switchover phase refactoring
2025-01-14 23:07 [PATCH 00/16] migration: Switchover phase refactoring Peter Xu
` (15 preceding siblings ...)
2025-01-14 23:07 ` [PATCH 16/16] migration: Trivial cleanup on JSON writer of vmstate_save() Peter Xu
@ 2025-01-15 9:12 ` Jiri Denemark
2025-01-15 12:55 ` Peter Xu
2025-01-15 16:13 ` Juraj Marcin
2025-01-15 16:49 ` Fabiano Rosas
18 siblings, 1 reply; 21+ messages in thread
From: Jiri Denemark @ 2025-01-15 9:12 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Juraj Marcin, Julia Suvorova, Prasad Pandit,
Fabiano Rosas
On Tue, Jan 14, 2025 at 18:07:30 -0500, Peter Xu wrote:
> [...]
> - DEVICE migration state
>
> QEMU has a very special DEVICE migration state, that only happens with
> precopy, and only when pause-before-switchover capability is enabled.
> Due to that specialty we can't merge precopy and postcopy code on
> switchover starts, because the state machine will be different.
>
> However after I checked the history and also with libvirt developers,
> this seems unnecessary. So I had one patch making DEVICE state to be
> the "switchover" phase for precopy/postcopy unconditionally. That will
> make the state machine much easier for both modes, meanwhile nothing is
> expected to break with it (but please still shoot if anyone knows /
> suspect something will, or could, break..).
No problem from libvirt side...
Tested-by: Jiri Denemark <jdenemar@redhat.com>
* Re: [PATCH 00/16] migration: Switchover phase refactoring
2025-01-15 9:12 ` [PATCH 00/16] migration: Switchover phase refactoring Jiri Denemark
@ 2025-01-15 12:55 ` Peter Xu
0 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2025-01-15 12:55 UTC (permalink / raw)
To: Jiri Denemark
Cc: qemu-devel, Juraj Marcin, Julia Suvorova, Prasad Pandit,
Fabiano Rosas
On Wed, Jan 15, 2025 at 10:12:45AM +0100, Jiri Denemark wrote:
> On Tue, Jan 14, 2025 at 18:07:30 -0500, Peter Xu wrote:
> > [...]
>
> No problem from libvirt side...
>
> Tested-by: Jiri Denemark <jdenemar@redhat.com>
This is definitely reassuring.. thanks a lot, Jiri!
--
Peter Xu
* Re: [PATCH 00/16] migration: Switchover phase refactoring
2025-01-14 23:07 [PATCH 00/16] migration: Switchover phase refactoring Peter Xu
` (16 preceding siblings ...)
2025-01-15 9:12 ` [PATCH 00/16] migration: Switchover phase refactoring Jiri Denemark
@ 2025-01-15 16:13 ` Juraj Marcin
2025-01-15 16:49 ` Fabiano Rosas
18 siblings, 0 replies; 21+ messages in thread
From: Juraj Marcin @ 2025-01-15 16:13 UTC (permalink / raw)
To: Peter Xu; +Cc: qemu-devel, Julia Suvorova, Prasad Pandit, Fabiano Rosas
On 2025-01-14 18:07, Peter Xu wrote:
> [...]
All patches look good to me, nice refactor!
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
--
Juraj Marcin
* Re: [PATCH 00/16] migration: Switchover phase refactoring
2025-01-14 23:07 [PATCH 00/16] migration: Switchover phase refactoring Peter Xu
` (17 preceding siblings ...)
2025-01-15 16:13 ` Juraj Marcin
@ 2025-01-15 16:49 ` Fabiano Rosas
18 siblings, 0 replies; 21+ messages in thread
From: Fabiano Rosas @ 2025-01-15 16:49 UTC (permalink / raw)
To: Peter Xu, qemu-devel; +Cc: peterx, Juraj Marcin, Julia Suvorova, Prasad Pandit
Peter Xu <peterx@redhat.com> writes:
> [...]
Queued, thanks!
end of thread
Thread overview: 21+ messages
2025-01-14 23:07 [PATCH 00/16] migration: Switchover phase refactoring Peter Xu
2025-01-14 23:07 ` [PATCH 01/16] migration: Remove postcopy implications in should_send_vmdesc() Peter Xu
2025-01-14 23:07 ` [PATCH 02/16] migration: Do not construct JSON description if suppressed Peter Xu
2025-01-14 23:07 ` [PATCH 03/16] migration: Optimize postcopy on downtime by avoiding JSON writer Peter Xu
2025-01-14 23:07 ` [PATCH 04/16] migration: Avoid two src-downtime-end tracepoints for postcopy Peter Xu
2025-01-14 23:07 ` [PATCH 05/16] migration: Drop inactivate_disk param in qemu_savevm_state_complete* Peter Xu
2025-01-14 23:07 ` [PATCH 06/16] migration: Synchronize all CPU states only for non-iterable dump Peter Xu
2025-01-14 23:07 ` [PATCH 07/16] migration: Adjust postcopy bandwidth during switchover Peter Xu
2025-01-14 23:07 ` [PATCH 08/16] migration: Adjust locking in migration_maybe_pause() Peter Xu
2025-01-14 23:07 ` [PATCH 09/16] migration: Drop cached migration state " Peter Xu
2025-01-14 23:07 ` [PATCH 10/16] migration: Take BQL slightly longer in postcopy_start() Peter Xu
2025-01-14 23:07 ` [PATCH 11/16] migration: Notify COMPLETE once for postcopy Peter Xu
2025-01-14 23:07 ` [PATCH 12/16] migration: Unwrap qemu_savevm_state_complete_precopy() in postcopy Peter Xu
2025-01-14 23:07 ` [PATCH 13/16] migration: Cleanup qemu_savevm_state_complete_precopy() Peter Xu
2025-01-14 23:07 ` [PATCH 14/16] migration: Always set DEVICE state Peter Xu
2025-01-14 23:07 ` [PATCH 15/16] migration: Merge precopy/postcopy on switchover start Peter Xu
2025-01-14 23:07 ` [PATCH 16/16] migration: Trivial cleanup on JSON writer of vmstate_save() Peter Xu
2025-01-15 9:12 ` [PATCH 00/16] migration: Switchover phase refactoring Jiri Denemark
2025-01-15 12:55 ` Peter Xu
2025-01-15 16:13 ` Juraj Marcin
2025-01-15 16:49 ` Fabiano Rosas