From: Peter Xu <peterx@redhat.com>
To: qemu-devel@nongnu.org
Cc: peterx@redhat.com, Juraj Marcin <jmarcin@redhat.com>,
Mario Casquero <mcasquer@redhat.com>,
Fabiano Rosas <farosas@suse.de>,
"Dr . David Alan Gilbert" <dave@treblig.org>
Subject: [PATCH v3 10/11] migration: Rewrite the migration complete detect logic
Date: Fri, 13 Jun 2025 10:08:00 -0400
Message-ID: <20250613140801.474264-11-peterx@redhat.com>
In-Reply-To: <20250613140801.474264-1-peterx@redhat.com>
There are a few things off in this logic; rewrite it. While at it, add
rich comments to explain each of the decisions.

Since this is a very sensitive path for migration, below is the list of
things changed, with the reasoning for each.
(1) Exact pending size is only needed for precopy, not postcopy

    Fundamentally, this is because the "exact" version only does one
    more deep sync to fetch the pending results, while in postcopy's
    case it will never sync anything beyond the estimate, as the VM on
    the source is stopped.
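
For context, below is a minimal sketch (not part of this patch) of how
an iterable handler might wire up the two pending hooks. Only the
SaveVMHandlers hook names and signatures are real QEMU API; "foo", its
fields, and foo_sync_dirty_info() are hypothetical:

    /*
     * A minimal sketch, assuming a hypothetical device "foo".  The
     * estimate hook must stay cheap, while the exact hook may do one
     * deep sync (e.g. a dirty bitmap sync in RAM's case) to refresh
     * the numbers before reporting.
     */
    static void foo_state_pending_estimate(void *opaque,
                                           uint64_t *must_precopy,
                                           uint64_t *can_postcopy)
    {
        FooState *fs = opaque;

        /* Report cached numbers only; no expensive synchronization */
        *can_postcopy += fs->cached_dirty_bytes;
    }

    static void foo_state_pending_exact(void *opaque,
                                        uint64_t *must_precopy,
                                        uint64_t *can_postcopy)
    {
        FooState *fs = opaque;

        /* One deep sync to refresh the dirty info, then report it */
        foo_sync_dirty_info(fs);
        *can_postcopy += fs->cached_dirty_bytes;
    }

    static SaveVMHandlers savevm_foo_handlers = {
        .state_pending_estimate = foo_state_pending_estimate,
        .state_pending_exact = foo_state_pending_exact,
        /* ... save_live_iterate, save_complete, and friends ... */
    };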
(2) Do _not_ rely on threshold_size anymore to decide whether postcopy
    should complete

    threshold_size was calculated from the expected downtime and
    bandwidth only during precopy, as an efficient way to decide when
    to switch over. It's not sensible to rely on threshold_size in
    postcopy.

    For precopy, once switchover is decided the migration will complete
    soon. That's not true for postcopy. Logically speaking, postcopy
    should only complete the migration once all pending data has been
    flushed.

    Here it used to work because save_complete() would implicitly do
    the work of save_live_iterate() whenever there was pending data.
    Even if that looks benign, having RAM migrated in postcopy's
    save_complete() has other bad side effects:

    (a) Since save_complete() runs handlers one at a time, moving RAM
        there means nothing else can move in parallel (as opposed to
        round-robin iterating over the vmstate handlers like we do in
        the ITERABLE phase; see the sketch after this list). Not an
        immediate concern, but it may stop working in the future when
        there is more than one iterable (e.g. vfio postcopy).

    (b) Postcopy recovery, unfortunately, only works during the
        ITERABLE phase. IOW, if the src QEMU moves RAM during
        postcopy's save_complete() and the network fails, it'll crash
        both QEMUs... OTOH, if it fails during iteration, it is still
        recoverable. IOW, this change should further reduce the window
        in which QEMU can split-brain and crash in extreme cases.
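
For illustration, here is a simplified sketch of the round-robin
iteration the ITERABLE phase performs, loosely modeled on QEMU's
qemu_savevm_state_iterate(). Error handling and completion accounting
are omitted, so treat it as pseudocode for the real function:

    /*
     * Simplified sketch of the ITERABLE phase: handlers are visited
     * round-robin, each sending a bounded chunk per call, so multiple
     * iterables (e.g. RAM plus a future vfio postcopy) can interleave.
     * save_complete(), by contrast, runs each handler to completion
     * one at a time, blocking everything else behind it.
     */
    static int savevm_iterate_sketch(QEMUFile *f, bool in_postcopy)
    {
        SaveStateEntry *se;

        QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
            if (!se->ops || !se->ops->save_live_iterate) {
                continue;
            }
            /* Send one bounded chunk, then yield to the next handler */
            if (se->ops->save_live_iterate(f, se->opaque) < 0) {
                return -1;
            }
        }

        return 0;
    }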
If we enable the ram_save_complete() tracepoints, we'll see this before
this patch:

    1267959@1748381938.294066:ram_save_complete dirty=9627, done=0
    1267959@1748381938.308884:ram_save_complete dirty=0, done=1

It means that in this migration, 9627 pages were still migrated at
complete() of the postcopy phase.
After this change, all the postcopy RAM should be migrated in the
iterable phase rather than in save_complete():

    1267959@1748381938.294066:ram_save_complete dirty=0, done=0
    1267959@1748381938.308884:ram_save_complete dirty=0, done=1
(3) Adjust when to decide to switch to postcopy

    This shouldn't be super important; the movement makes sure there's
    only one in_postcopy check, so it is clear what we do in the two
    completely different use cases (precopy vs. postcopy).
(4) Trivial touch-up on the threshold_size comparison

    Which changes:

      "(!pending_size || pending_size < s->threshold_size)"

    into:

      "(pending_size <= s->threshold_size)"

    The two predicates only differ when pending_size equals a nonzero
    threshold_size: the old form would keep iterating there, while the
    new one completes.
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.c | 57 +++++++++++++++++++++++++++++++------------
1 file changed, 42 insertions(+), 15 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index e33e39ac74..923400f801 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3436,33 +3436,60 @@ static MigIterateState migration_iteration_run(MigrationState *s)
     Error *local_err = NULL;
     bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
     bool can_switchover = migration_can_switchover(s);
+    bool complete_ready;
 
+    /* Fast path - get the estimated amount of pending data */
     qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy);
     pending_size = must_precopy + can_postcopy;
     trace_migrate_pending_estimate(pending_size, must_precopy, can_postcopy);
 
-    if (pending_size < s->threshold_size) {
-        qemu_savevm_state_pending_exact(&must_precopy, &can_postcopy);
-        pending_size = must_precopy + can_postcopy;
-        trace_migrate_pending_exact(pending_size, must_precopy, can_postcopy);
+    if (in_postcopy) {
+        /*
+         * Iterate in postcopy until all pending data flushed. Note that
+         * postcopy completion doesn't rely on can_switchover, because when
+         * POSTCOPY_ACTIVE it means switchover already happened.
+         */
+        complete_ready = !pending_size;
+    } else {
+        /*
+         * Exact pending reporting is only needed for precopy. Taking RAM
+         * as example, there'll be no extra dirty information after
+         * postcopy started, so ESTIMATE should always match with EXACT
+         * during postcopy phase.
+         */
+        if (pending_size < s->threshold_size) {
+            qemu_savevm_state_pending_exact(&must_precopy, &can_postcopy);
+            pending_size = must_precopy + can_postcopy;
+            trace_migrate_pending_exact(pending_size, must_precopy,
+                                        can_postcopy);
+        }
+
+        /* Should we switch to postcopy now? */
+        if (must_precopy <= s->threshold_size &&
+            can_switchover && qatomic_read(&s->start_postcopy)) {
+            if (postcopy_start(s, &local_err)) {
+                migrate_set_error(s, local_err);
+                error_report_err(local_err);
+            }
+            return MIG_ITERATE_SKIP;
+        }
+
+        /*
+         * For precopy, migration can complete only if:
+         *
+         * (1) Switchover is acknowledged by destination
+         * (2) Pending size is no more than the threshold specified
+         *     (which was calculated from expected downtime)
+         */
+        complete_ready = can_switchover && (pending_size <= s->threshold_size);
     }
 
-    if ((!pending_size || pending_size < s->threshold_size) && can_switchover) {
+    if (complete_ready) {
         trace_migration_thread_low_pending(pending_size);
         migration_completion(s);
         return MIG_ITERATE_BREAK;
     }
 
-    /* Still a significant amount to transfer */
-    if (!in_postcopy && must_precopy <= s->threshold_size && can_switchover &&
-        qatomic_read(&s->start_postcopy)) {
-        if (postcopy_start(s, &local_err)) {
-            migrate_set_error(s, local_err);
-            error_report_err(local_err);
-        }
-        return MIG_ITERATE_SKIP;
-    }
-
     /* Just another iteration step */
     qemu_savevm_state_iterate(s->to_dst_file, in_postcopy);
     return MIG_ITERATE_RESUME;
--
2.49.0