[PATCH v2 00/13] Make switchover-ack re-usable and add VFIO precopy REINIT feature

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2 00/13] Make switchover-ack re-usable and add VFIO precopy REINIT feature
@ 2026-06-02  9:26 Avihai Horon
  2026-06-02  9:26 ` [PATCH v2 01/13] migration: Propagate errors in migration_completion_precopy() Avihai Horon
                   ` (12 more replies)
  0 siblings, 13 replies; 38+ messages in thread
From: Avihai Horon @ 2026-06-02  9:26 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb, Avihai Horon

Hello,

Changes from v1 [2]:
* Rebased on top of master.
* Added a new patch that runs a final save_query_pending at switchover
  (replace RAM specific code). (Peter)
* Dropped patches #1 and #2 as they were already taken by Cedric (in
  some way or another).
* Patch #3: Used Error variable of migration_completion() instead of
  adding a new one in migration_completion_precopy(). (Cedric, Peter)
* Patch #6: Dropped renaming of qemu_loadvm_approve_switchover to legacy
  since in next patch it's renamed back again. (Peter)
* Patch #7: 
  - Rephrased switchover-ack QAPI documentation to be more general and
    not mention number of ACKs sent. (Peter)
  - Dropped the new request_switchover_ack SaveVMHandler and instead
    requested switchover ACKs via save_query_pending SaveVMHandler.
    (Peter)
  - Added documentation comment for legacy/new switchover-ack. (Peter)
* Patch #8: Adjusted according to changes in patch #7.
* Dropped patch #9 since the scenario that it comes to solve can't
  really happen (kept only the part that extracts sending of
  INIT_DATA_SENT flag to a helper). (Peter, Cedric)
* Patch #10: Used error_prepend() to add error prefix in
  vfio_migration_init(). (Cedric)
* Patch #11-13: Adjusted according to changes in patch #7.
* Patch #12: Added explanation about VFIO_PRECOPY_INFO_REINIT and an
  example to vfio migration documentation. (Peter)
* Collected R-bs.

===

This series adds support for the kernel VFIO_PRECOPY_INFO_REINIT feature
[1], which can reduce migration downtime in some VFIO precopy scenarios
(see the Tests section below). Supporting it requires refactoring
switchover-ack and making it re-usable; that work comes first.

=== Background ===

Switchover-ack is a mechanism to synchronize between source and
destination QEMU during migration to prevent the source from switching
over prematurely.

VFIO uses switchover-ack to ensure switchover happens only after
destination side has loaded the precopy initial bytes. This is important
for VFIO, as otherwise downtime could be impacted and be higher.

In its current state, switchover-ack is a one-time mechanism, meaning
that switchover is acked only once and past that another ACK cannot be
requested again. This was sufficient until now, as VFIO precopy initial
bytes was defined to be monotonically decreasing. Thus, when precopy
initial bytes reached zero for all VFIO devices, a single ACK would be
sent and its validity would hold.

=== Problem ===

Now the new VFIO_PRECOPY_INFO_REINIT feature allows precopy initial
bytes to be re-initialized during precopy. Specifically, it means that
initial bytes can grow after reaching zero, which would invalidate a
previously sent switchover ACK.

=== Solution ===

To support VFIO_PRECOPY_INFO_REINIT, this series makes switchover-ack
re-usable and allows devices to request another switchover ACK when
needed.

In the new mechansim, switchover ACK can be requested for a specific
device and in different times, so switchover ACK is changed to be
per-device (instead of a single ACK for all devices) and source side is
the one who does the pending ACKs accounting (instead of destination
side).

The old (legacy) switchover-ack mechanism is kept for backward
compatibility and is turned on by a compatibility property for older
machines.

With that infrastructure in place, VFIO uses the new switchover-ack path
and implements VFIO_PRECOPY_INFO_REINIT.

=== Tests ===

Functional and performance tests were run.

Performance tests were done by migrating a single VM with:
* 8 GB RAM
* 4 mlx5 VFIO devices:
  - One device with 1GB of device data (stopcopy data) that runs
    workload during precopy so VFIO_PRECOPY_INFO_REINIT is exercised
    (generate new initial_bytes chunks during precopy).
  - The other 3 devices are idle.

In this setup, VFIO_PRECOPY_INFO_REINIT reduced total migration downtime by
about 43%, and downtime attributed to the busy VFIO device by about 67%:

With VFIO_PRECOPY_INFO_REINIT:
  1335ms total (~520ms from the VFIO device running the workload).

Without VFIO_PRECOPY_INFO_REINIT:
  2352ms total (~1600ms from the VFIO device running the workload).

Functional tests covered the main code paths and combinations, including
legacy and new switchover-ack across versions, for example:
* Migration between QEMU 11.0 (old binary) and 11.1 (new binary).
* Migration between two 11.1 binaries with different machine versions.
* Migration when the VFIO device has no VFIO_PRECOPY_INFO_REINIT.
* Migration when the VFIO device has no VFIO precopy support.

=== Patch Breakdown ===

* Patches 1-7,13: Migration cleanups and the new switchover-ack mechanism
* Patches 8-12: VFIO cleanups and VFIO_PRECOPY_INFO_REINIT

Thanks.

[1] https://lore.kernel.org/all/20260317161753.18964-1-yishaih@nvidia.com/
[2] https://lore.kernel.org/qemu-devel/20260505081423.28326-1-avihaih@nvidia.com/

Avihai Horon (13):
  migration: Propagate errors in migration_completion_precopy()
  migration: Run final save_query_pending at switchover
  migration: Log the approver in qemu_loadvm_approve_switchover()
  migration: Replace switchover_ack_needed SaveVMHandler
  migration: Rename switchover-ack code to legacy
  migration: Make switchover-ack re-usable
  migration: Fail migration if switchover-ack is requested after
    switchover decision
  vfio/migration: Extract VFIO_MIG_FLAG_DEV_INIT_DATA_SENT sending to
    helper
  vfio/migration: Add Error ** parameter to vfio_migration_init()
  vfio/migration: Add new switchover-ack mechanism
  vfio/migration: Implement VFIO_PRECOPY_INFO_REINIT feature
  vfio/migration: Check VFIO_PRECOPY_INFO_REINIT during switchover
  migration: Enable new switchover-ack

 docs/devel/migration/vfio.rst      |  17 ++-
 qapi/migration.json                |  16 ++-
 hw/vfio/vfio-migration-internal.h  |   2 +
 include/migration/client-options.h |   1 +
 include/migration/misc.h           |   2 +
 include/migration/register.h       |  56 ++++-----
 migration/migration.h              |  34 +++++-
 migration/savevm.h                 |   7 +-
 hw/core/machine.c                  |   1 +
 hw/s390x/s390-stattrib.c           |   9 +-
 hw/vfio/migration.c                | 189 +++++++++++++++++++++++------
 migration/block-dirty-bitmap.c     |   6 +-
 migration/migration.c              |  92 ++++++++++++--
 migration/options.c                |   9 ++
 migration/ram.c                    |  37 +++---
 migration/savevm.c                 | 134 +++++++++++++-------
 hw/vfio/trace-events               |   7 +-
 migration/trace-events             |   9 +-
 18 files changed, 461 insertions(+), 167 deletions(-)

-- 
2.40.1

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2 01/13] migration: Propagate errors in migration_completion_precopy()
  2026-06-02  9:26 [PATCH v2 00/13] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
@ 2026-06-02  9:26 ` Avihai Horon
  2026-06-03 19:47   ` Peter Xu
  2026-06-02  9:26 ` [PATCH v2 02/13] migration: Run final save_query_pending at switchover Avihai Horon
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Avihai Horon @ 2026-06-02  9:26 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb, Avihai Horon

migration_completion_precopy() doesn't propagate errors to migration
core which leads to error information loss. Fix that.

This prepares for a follow-up where migration_switchover_start() can
fail on switchover-ack and still report a useful error. Errors from
qemu_savevm_state_complete_precopy() are not propagated yet as it
requires more plumbing.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 migration/migration.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 074d3f2c69..7488a94206 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2814,7 +2814,7 @@ static bool migration_switchover_start(MigrationState *s, Error **errp)
     return true;
 }
 
-static int migration_completion_precopy(MigrationState *s)
+static int migration_completion_precopy(MigrationState *s, Error **errp)
 {
     int ret;
 
@@ -2823,11 +2823,12 @@ static int migration_completion_precopy(MigrationState *s)
     if (!migrate_mode_is_cpr()) {
         ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
         if (ret < 0) {
+            error_setg_errno(errp, -ret, "Failed to stop the VM");
             goto out_unlock;
         }
     }
 
-    if (!migration_switchover_start(s, NULL)) {
+    if (!migration_switchover_start(s, errp)) {
         ret = -EFAULT;
         goto out_unlock;
     }
@@ -2869,7 +2870,7 @@ static void migration_completion(MigrationState *s)
     Error *local_err = NULL;
 
     if (s->state == MIGRATION_STATUS_ACTIVE) {
-        ret = migration_completion_precopy(s);
+        ret = migration_completion_precopy(s, &local_err);
     } else if (s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
         migration_completion_postcopy(s);
     } else {
@@ -2900,7 +2901,9 @@ static void migration_completion(MigrationState *s)
     return;
 
 fail:
-    if (qemu_file_get_error_obj(s->to_dst_file, &local_err)) {
+    if (local_err) {
+        migrate_error_propagate(s, local_err);
+    } else if (qemu_file_get_error_obj(s->to_dst_file, &local_err)) {
         migrate_error_propagate(s, local_err);
     } else if (ret) {
         error_setg_errno(&local_err, -ret, "Error in migration completion");
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 02/13] migration: Run final save_query_pending at switchover
  2026-06-02  9:26 [PATCH v2 00/13] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
  2026-06-02  9:26 ` [PATCH v2 01/13] migration: Propagate errors in migration_completion_precopy() Avihai Horon
@ 2026-06-02  9:26 ` Avihai Horon
  2026-06-03 20:05   ` Peter Xu
  2026-06-02  9:26 ` [PATCH v2 03/13] migration: Log the approver in qemu_loadvm_approve_switchover() Avihai Horon
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Avihai Horon @ 2026-06-02  9:26 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb, Avihai Horon

Before switchover, the source needs one last exact pending query so
modules can flush dirty state. This is currently done ad hoc in modules
handlers. For example, RAM syncs its dirty bitmap in its save_complete
handler.

This should be a general concept relevant for any module, so extract it
to migration core instead by running a final save_query_pending before
switchover.

The final query requires special handling by modules (e.g., it's called
with BQL locked, during VM stop), so extend save_query_pending
SaveVMHandlers callback and qemu_savevm_query_pending() with a "final"
flag so migration modules can tell the last pending query during
switchover from periodic iteration queries.

Not all modules need the final query; skip it for those who don't need.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 include/migration/register.h   | 41 ++++++++++++++++++----------------
 migration/savevm.h             |  3 ++-
 hw/s390x/s390-stattrib.c       |  9 ++++++--
 hw/vfio/migration.c            |  6 ++++-
 migration/block-dirty-bitmap.c |  6 ++++-
 migration/migration.c          |  7 ++++--
 migration/ram.c                | 37 ++++++++++++++++++------------
 migration/savevm.c             | 27 +++++++++++++++++-----
 migration/trace-events         |  2 +-
 9 files changed, 91 insertions(+), 47 deletions(-)

diff --git a/include/migration/register.h b/include/migration/register.h
index 5e5e0ee432..6f632123f1 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -171,6 +171,28 @@ typedef struct SaveVMHandlers {
      */
     bool (*is_active_iterate)(void *opaque);
 
+    /**
+     * @save_query_pending
+     *
+     * This estimates the remaining data to transfer on the source side.
+     *
+     * When @exact is true, a module must report accurate results.  When
+     * @exact is false, a module may report estimates.
+     *
+     * It's highly recommended that modules implement a faster version of
+     * the query path (for example, by proper caching on the counters) if
+     * an accurate query will be time-consuming.
+     *
+     * @opaque: data pointer passed to register_savevm_live()
+     * @pending: pointer to a MigPendingData struct
+     * @exact: set to true for an accurate (slow) query
+     * @final: set to true for the final query during switchover. When final is
+     * true, the query is called with BQL locked. Otherwise, it's called with
+     * BQL unlocked.
+     */
+    void (*save_query_pending)(void *opaque, MigPendingData *pending,
+                               bool exact, bool final);
+
     /* This runs outside the BQL in the migration case, and
      * within the lock in the savevm case.  The callback had better only
      * use data that is local to the migration thread or protected
@@ -210,25 +232,6 @@ typedef struct SaveVMHandlers {
      */
     bool (*save_postcopy_prepare)(QEMUFile *f, void *opaque, Error **errp);
 
-    /**
-     * @save_query_pending
-     *
-     * This estimates the remaining data to transfer on the source side.
-     *
-     * When @exact is true, a module must report accurate results.  When
-     * @exact is false, a module may report estimates.
-     *
-     * It's highly recommended that modules implement a faster version of
-     * the query path (for example, by proper caching on the counters) if
-     * an accurate query will be time-consuming.
-     *
-     * @opaque: data pointer passed to register_savevm_live()
-     * @pending: pointer to a MigPendingData struct
-     * @exact: set to true for an accurate (slow) query
-     */
-    void (*save_query_pending)(void *opaque, MigPendingData *pending,
-                               bool exact);
-
     /**
      * @load_state
      *
diff --git a/migration/savevm.h b/migration/savevm.h
index 96fdf96d4e..10b3d78a5f 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -45,7 +45,8 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
 void qemu_savevm_state_cleanup(void);
 void qemu_savevm_state_complete_postcopy(QEMUFile *f);
 int qemu_savevm_state_complete_precopy(MigrationState *s);
-void qemu_savevm_query_pending(MigPendingData *pending, bool exact);
+void qemu_savevm_query_pending_iter(MigPendingData *pending, bool exact);
+void qemu_savevm_query_pending_final(MigPendingData *pending);
 int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy);
 bool qemu_savevm_state_postcopy_prepare(QEMUFile *f, Error **errp);
 void qemu_savevm_state_end(QEMUFile *f);
diff --git a/hw/s390x/s390-stattrib.c b/hw/s390x/s390-stattrib.c
index c334714b31..aed0c6844d 100644
--- a/hw/s390x/s390-stattrib.c
+++ b/hw/s390x/s390-stattrib.c
@@ -190,12 +190,17 @@ static int cmma_save_setup(QEMUFile *f, void *opaque, Error **errp)
 }
 
 static void cmma_state_pending(void *opaque, MigPendingData *pending,
-                               bool exact)
+                               bool exact, bool final)
 {
     S390StAttribState *sas = S390_STATTRIB(opaque);
     S390StAttribClass *sac = S390_STATTRIB_GET_CLASS(sas);
-    long long res = sac->get_dirtycount(sas);
+    long long res;
 
+    if (final) {
+        return;
+    }
+
+    res = sac->get_dirtycount(sas);
     if (res >= 0) {
         pending->precopy_bytes += res;
     }
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index fb12b9717f..95072d6664 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -622,12 +622,16 @@ static void vfio_state_pending_sync(VFIODevice *vbasedev)
 }
 
 static void vfio_state_pending(void *opaque, MigPendingData *pending,
-                               bool exact)
+                               bool exact, bool final)
 {
     VFIODevice *vbasedev = opaque;
     VFIOMigration *migration = vbasedev->migration;
     uint64_t precopy_size, stopcopy_size;
 
+    if (final) {
+        return;
+    }
+
     if (exact) {
         vfio_state_pending_sync(vbasedev);
     }
diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
index 7ef3759e53..66ed91de03 100644
--- a/migration/block-dirty-bitmap.c
+++ b/migration/block-dirty-bitmap.c
@@ -767,12 +767,16 @@ static int dirty_bitmap_save_complete(QEMUFile *f, void *opaque)
 }
 
 static void dirty_bitmap_state_pending(void *opaque, MigPendingData *data,
-                                       bool exact)
+                                       bool exact, bool final)
 {
     DBMSaveState *s = &((DBMState *)opaque)->save;
     SaveBitmapState *dbms;
     uint64_t pending = 0;
 
+    if (final) {
+        return;
+    }
+
     bql_lock();
 
     QSIMPLEQ_FOREACH(dbms, &s->dbms_list, entry) {
diff --git a/migration/migration.c b/migration/migration.c
index 7488a94206..df8ed3a9fd 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2787,12 +2787,15 @@ static bool migration_switchover_prepare(MigrationState *s)
 static bool migration_switchover_start(MigrationState *s, Error **errp)
 {
     ERRP_GUARD();
+    MigPendingData pending = {};
 
     if (!migration_switchover_prepare(s)) {
         error_setg(errp, "Switchover is interrupted");
         return false;
     }
 
+    qemu_savevm_query_pending_final(&pending);
+
     /* Inactivate disks except in COLO */
     if (!migrate_colo()) {
         /*
@@ -3285,7 +3288,7 @@ static void migration_iteration_go_next(MigPendingData *pending)
     /*
      * Do a slow sync first before boosting the iteration count.
      */
-    qemu_savevm_query_pending(pending, true);
+    qemu_savevm_query_pending_iter(pending, true);
 
     /*
      * Update the dirty information for the whole system for this
@@ -3336,7 +3339,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
     bool complete_ready;
 
     /* Fast path - get the estimated amount of pending data */
-    qemu_savevm_query_pending(&pending, false);
+    qemu_savevm_query_pending_iter(&pending, false);
 
     if (in_postcopy) {
         /*
diff --git a/migration/ram.c b/migration/ram.c
index fc38ffbf8a..4e3f442593 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3376,10 +3376,6 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
     rs->last_stage = !migration_in_colo_state();
 
     WITH_RCU_READ_LOCK_GUARD() {
-        if (!migration_in_postcopy()) {
-            migration_bitmap_sync_precopy(true);
-        }
-
         ret = rdma_registration_start(f, RAM_CONTROL_FINISH);
         if (ret < 0) {
             qemu_file_set_error(f, ret);
@@ -3442,25 +3438,38 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
     return qemu_fflush(f);
 }
 
-static void ram_state_pending(void *opaque, MigPendingData *pending,
-                              bool exact)
+static void ram_state_pending_sync(bool exact, bool final)
 {
-    RAMState **temp = opaque;
-    RAMState *rs = *temp;
-    uint64_t remaining_size;
-
     /*
      * Sync is not needed either with: (1) a fast query, or (2) after
      * postcopy has started (no new dirty will generate anymore).
      */
-    if (exact && !migration_in_postcopy()) {
+    if (!exact || migration_in_postcopy()) {
+        return;
+    }
+
+    /* Final pending query is called with BQL locked */
+    if (!final) {
         bql_lock();
-        WITH_RCU_READ_LOCK_GUARD() {
-            migration_bitmap_sync_precopy(false);
-        }
+    }
+
+    WITH_RCU_READ_LOCK_GUARD() {
+        migration_bitmap_sync_precopy(final);
+    }
+
+    if (!final) {
         bql_unlock();
     }
+}
+
+static void ram_state_pending(void *opaque, MigPendingData *pending,
+                              bool exact, bool final)
+{
+    RAMState **temp = opaque;
+    RAMState *rs = *temp;
+    uint64_t remaining_size;
 
+    ram_state_pending_sync(exact, final);
     remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
 
     if (migrate_postcopy_ram()) {
diff --git a/migration/savevm.c b/migration/savevm.c
index 23adaf9dd9..22cb6a15c6 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1795,7 +1795,8 @@ int qemu_savevm_state_complete_precopy(MigrationState *s)
     return qemu_fflush(f);
 }
 
-void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
+static void qemu_savevm_query_pending(MigPendingData *pending, bool exact,
+                                      bool final)
 {
     SaveStateEntry *se;
 
@@ -1808,7 +1809,7 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
         if (!qemu_savevm_state_active(se)) {
             continue;
         }
-        se->ops->save_query_pending(se->opaque, pending, exact);
+        se->ops->save_query_pending(se->opaque, pending, exact, final);
     }
 
     pending->total_bytes = pending->precopy_bytes +
@@ -1817,16 +1818,30 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
     /*
      * Update system remaining dirty bytes whenever QEMU queries.  It will
      * make the value to be not as accurate, but should still be pretty
-     * close to reality when this got invoked frequently while iterating.
+     * close to reality when this got invoked frequently while iterating.  Don't
+     * update in final query as some modules may skip it if not needed.
      */
-    mig_stats.dirty_bytes_total = pending->total_bytes;
-
-    trace_qemu_savevm_query_pending(exact, pending->precopy_bytes,
+    if (!final) {
+        mig_stats.dirty_bytes_total = pending->total_bytes;
+    }
+    trace_qemu_savevm_query_pending(exact, final, pending->precopy_bytes,
                                     pending->stopcopy_bytes,
                                     pending->postcopy_bytes,
                                     pending->total_bytes);
 }
 
+void qemu_savevm_query_pending_iter(MigPendingData *pending, bool exact)
+{
+    qemu_savevm_query_pending(pending, exact, false);
+}
+
+void qemu_savevm_query_pending_final(MigPendingData *pending)
+{
+    g_assert(bql_locked());
+
+    qemu_savevm_query_pending(pending, true, true);
+}
+
 void qemu_savevm_state_cleanup(void)
 {
     SaveStateEntry *se;
diff --git a/migration/trace-events b/migration/trace-events
index de99d976ab..1c9212d3e2 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -7,7 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
 qemu_loadvm_state_post_main(int ret) "%d"
 qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
 qemu_savevm_send_packaged(void) ""
-qemu_savevm_query_pending(bool exact, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy, uint64_t total) "exact=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64", total=%"PRIu64
+qemu_savevm_query_pending(bool exact, bool final, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy, uint64_t total) "exact=%d, final=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64", total=%"PRIu64
 loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
 loadvm_state_setup(void) ""
 loadvm_state_cleanup(void) ""
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 03/13] migration: Log the approver in qemu_loadvm_approve_switchover()
  2026-06-02  9:26 [PATCH v2 00/13] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
  2026-06-02  9:26 ` [PATCH v2 01/13] migration: Propagate errors in migration_completion_precopy() Avihai Horon
  2026-06-02  9:26 ` [PATCH v2 02/13] migration: Run final save_query_pending at switchover Avihai Horon
@ 2026-06-02  9:26 ` Avihai Horon
  2026-06-02  9:26 ` [PATCH v2 04/13] migration: Replace switchover_ack_needed SaveVMHandler Avihai Horon
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Avihai Horon @ 2026-06-02  9:26 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb, Avihai Horon

Pass the device name that approved switchover to
qemu_loadvm_approve_switchover() and log it in the trace for debugging
purposes.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 migration/savevm.h     | 2 +-
 hw/vfio/migration.c    | 2 +-
 migration/savevm.c     | 4 ++--
 migration/trace-events | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/migration/savevm.h b/migration/savevm.h
index 10b3d78a5f..0d732eb0f7 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -71,7 +71,7 @@ void qemu_loadvm_state_cleanup(MigrationIncomingState *mis);
 int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis,
                            Error **errp);
 int qemu_load_device_state(QEMUFile *f, Error **errp);
-int qemu_loadvm_approve_switchover(void);
+int qemu_loadvm_approve_switchover(const char *approver);
 int qemu_savevm_state_non_iterable(QEMUFile *f, Error **errp);
 int qemu_savevm_state_non_iterable_early(QEMUFile *f,
                                          JSONWriter *vmdesc,
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 95072d6664..49266cbd76 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -846,7 +846,7 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
                 return -EINVAL;
             }
 
-            ret = qemu_loadvm_approve_switchover();
+            ret = qemu_loadvm_approve_switchover(vbasedev->name);
             if (ret) {
                 error_report(
                     "%s: qemu_loadvm_approve_switchover failed, err=%d (%s)",
diff --git a/migration/savevm.c b/migration/savevm.c
index 22cb6a15c6..fd870345b4 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -3172,7 +3172,7 @@ int qemu_load_device_state(QEMUFile *f, Error **errp)
     return 0;
 }
 
-int qemu_loadvm_approve_switchover(void)
+int qemu_loadvm_approve_switchover(const char *approver)
 {
     MigrationIncomingState *mis = migration_incoming_get_current();
 
@@ -3181,7 +3181,7 @@ int qemu_loadvm_approve_switchover(void)
     }
 
     mis->switchover_ack_pending_num--;
-    trace_loadvm_approve_switchover(mis->switchover_ack_pending_num);
+    trace_loadvm_approve_switchover(approver, mis->switchover_ack_pending_num);
 
     if (mis->switchover_ack_pending_num) {
         return 0;
diff --git a/migration/trace-events b/migration/trace-events
index 1c9212d3e2..c0c433744c 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -24,7 +24,7 @@ loadvm_postcopy_ram_handle_discard_end(void) ""
 loadvm_postcopy_ram_handle_discard_header(const char *ramid, uint16_t len) "%s: %ud"
 loadvm_process_command(const char *s, uint16_t len) "com=%s len=%d"
 loadvm_process_command_ping(uint32_t val) "0x%x"
-loadvm_approve_switchover(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
+loadvm_approve_switchover(const char *approver, unsigned int switchover_ack_pending_num) "Approver %s, switchover_ack_pending_num %u"
 postcopy_ram_listen_thread_exit(void) ""
 postcopy_ram_listen_thread_start(void) ""
 qemu_savevm_send_postcopy_advise(void) ""
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 04/13] migration: Replace switchover_ack_needed SaveVMHandler
  2026-06-02  9:26 [PATCH v2 00/13] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
                   ` (2 preceding siblings ...)
  2026-06-02  9:26 ` [PATCH v2 03/13] migration: Log the approver in qemu_loadvm_approve_switchover() Avihai Horon
@ 2026-06-02  9:26 ` Avihai Horon
  2026-06-02  9:26 ` [PATCH v2 05/13] migration: Rename switchover-ack code to legacy Avihai Horon
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Avihai Horon @ 2026-06-02  9:26 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb, Avihai Horon

A new switchover-ack mechanism that will replace the existing one will
be added in the following patches. The new mechanism will not use
switchover_ack_needed SaveVMHandler, however, the old mechanism must
still be kept for backward compatibility.

To keep things clear and decrease API surface of old code, replace
switchover_ack_needed SaveVMHandler with a regular function
migration_request_switchover_ack().

No functional changes intended.

Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 docs/devel/migration/vfio.rst |  3 ---
 include/migration/misc.h      |  2 ++
 include/migration/register.h  | 13 -------------
 hw/vfio/migration.c           | 18 ++++++++++--------
 migration/migration.c         | 15 +++++++++++++++
 migration/savevm.c            | 21 ---------------------
 migration/trace-events        |  2 +-
 7 files changed, 28 insertions(+), 46 deletions(-)

diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
index 691061d182..854277b11c 100644
--- a/docs/devel/migration/vfio.rst
+++ b/docs/devel/migration/vfio.rst
@@ -59,9 +59,6 @@ VFIO implements the device hooks for the iterative approach as follows:
 * A ``save_live_iterate`` function that reads the VFIO device's data from the
   vendor driver during iterative pre-copy phase.
 
-* A ``switchover_ack_needed`` function that checks if the VFIO device uses
-  "switchover-ack" migration capability when this capability is enabled.
-
 * A ``switchover_start`` function that in the multifd mode starts a thread that
   reassembles the multifd received data and loads it in-order into the device.
   In the non-multifd mode this function is a NOP.
diff --git a/include/migration/misc.h b/include/migration/misc.h
index 3159a5e53c..a2219c981b 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -156,4 +156,6 @@ bool multifd_device_state_save_thread_should_exit(void);
 void multifd_abort_device_state_save_threads(void);
 bool multifd_join_device_state_save_threads(void);
 
+void migration_request_switchover_ack(const char *requester);
+
 #endif
diff --git a/include/migration/register.h b/include/migration/register.h
index 6f632123f1..a61c4236d2 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -302,19 +302,6 @@ typedef struct SaveVMHandlers {
      */
     int (*resume_prepare)(MigrationState *s, void *opaque);
 
-    /**
-     * @switchover_ack_needed
-     *
-     * Checks if switchover ack should be used. Called only on
-     * destination.
-     *
-     * @opaque: data pointer passed to register_savevm_live()
-     *
-     * Returns true if switchover ack should be used and false
-     * otherwise
-     */
-    bool (*switchover_ack_needed)(void *opaque);
-
     /**
      * @switchover_start
      *
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 49266cbd76..6303f006a2 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -487,6 +487,14 @@ static bool vfio_precopy_supported(VFIODevice *vbasedev)
     return migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
 }
 
+static void vfio_request_switchover_ack(VFIODevice *vbasedev)
+{
+    if (vfio_precopy_supported(vbasedev)) {
+        /* Precopy support implies switchover-ack is needed */
+        migration_request_switchover_ack(vbasedev->name);
+    }
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_prepare(void *opaque, Error **errp)
@@ -775,6 +783,8 @@ static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
         return ret;
     }
 
+    vfio_request_switchover_ack(vbasedev);
+
     return 0;
 }
 
@@ -873,13 +883,6 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
     return ret;
 }
 
-static bool vfio_switchover_ack_needed(void *opaque)
-{
-    VFIODevice *vbasedev = opaque;
-
-    return vfio_precopy_supported(vbasedev);
-}
-
 static int vfio_switchover_start(void *opaque)
 {
     VFIODevice *vbasedev = opaque;
@@ -903,7 +906,6 @@ static const SaveVMHandlers savevm_vfio_handlers = {
     .load_setup = vfio_load_setup,
     .load_cleanup = vfio_load_cleanup,
     .load_state = vfio_load_state,
-    .switchover_ack_needed = vfio_switchover_ack_needed,
     /*
      * Multifd support
      */
diff --git a/migration/migration.c b/migration/migration.c
index df8ed3a9fd..957b794e91 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2196,6 +2196,21 @@ void migration_rp_kick(MigrationState *s)
     qemu_sem_post(&s->rp_state.rp_sem);
 }
 
+/* This is called only on destination side */
+void migration_request_switchover_ack(const char *requester)
+{
+    MigrationIncomingState *mis = migration_incoming_get_current();
+
+    if (!migrate_switchover_ack()) {
+        return;
+    }
+
+    mis->switchover_ack_pending_num++;
+
+    trace_migration_request_switchover_ack(requester,
+                                           mis->switchover_ack_pending_num);
+}
+
 static struct rp_cmd_args {
     ssize_t     len; /* -1 = variable */
     const char *name;
diff --git a/migration/savevm.c b/migration/savevm.c
index fd870345b4..1a60a178af 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2798,23 +2798,6 @@ static int qemu_loadvm_state_header(QEMUFile *f, Error **errp)
     return 0;
 }
 
-static void qemu_loadvm_state_switchover_ack_needed(MigrationIncomingState *mis)
-{
-    SaveStateEntry *se;
-
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
-        if (!se->ops || !se->ops->switchover_ack_needed) {
-            continue;
-        }
-
-        if (se->ops->switchover_ack_needed(se->opaque)) {
-            mis->switchover_ack_pending_num++;
-        }
-    }
-
-    trace_loadvm_state_switchover_ack_needed(mis->switchover_ack_pending_num);
-}
-
 static int qemu_loadvm_state_setup(QEMUFile *f, Error **errp)
 {
     ERRP_GUARD();
@@ -3076,10 +3059,6 @@ int qemu_loadvm_state(QEMUFile *f, Error **errp)
         return -EINVAL;
     }
 
-    if (migrate_switchover_ack()) {
-        qemu_loadvm_state_switchover_ack_needed(mis);
-    }
-
     cpu_synchronize_all_pre_loadvm();
 
     ret = qemu_loadvm_state_main(f, mis, errp);
diff --git a/migration/trace-events b/migration/trace-events
index c0c433744c..5955befcc6 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -8,7 +8,6 @@ qemu_loadvm_state_post_main(int ret) "%d"
 qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
 qemu_savevm_send_packaged(void) ""
 qemu_savevm_query_pending(bool exact, bool final, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy, uint64_t total) "exact=%d, final=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64", total=%"PRIu64
-loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
 loadvm_state_setup(void) ""
 loadvm_state_cleanup(void) ""
 loadvm_handle_cmd_packaged(unsigned int length) "%u"
@@ -199,6 +198,7 @@ process_incoming_migration_co_postcopy_end_main(void) ""
 postcopy_preempt_enabled(bool value) "%d"
 migration_precopy_complete(void) ""
 migration_call_notifiers(int type) "type=%d"
+migration_request_switchover_ack(const char *requester, unsigned int switchover_ack_pending_num) "Requester %s, switchover_ack_pending_num %u"
 
 # migration-stats
 migration_transferred_bytes(uint64_t qemu_file, uint64_t multifd, uint64_t rdma) "qemu_file %" PRIu64 " multifd %" PRIu64 " RDMA %" PRIu64
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 05/13] migration: Rename switchover-ack code to legacy
  2026-06-02  9:26 [PATCH v2 00/13] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
                   ` (3 preceding siblings ...)
  2026-06-02  9:26 ` [PATCH v2 04/13] migration: Replace switchover_ack_needed SaveVMHandler Avihai Horon
@ 2026-06-02  9:26 ` Avihai Horon
  2026-06-03 20:21   ` Peter Xu
  2026-06-02  9:26 ` [PATCH v2 06/13] migration: Make switchover-ack re-usable Avihai Horon
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Avihai Horon @ 2026-06-02  9:26 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb, Avihai Horon

A new switchover-ack mechanism will be added in the following patches.
However, the old mechanism must still be kept for backward
compatibility.

Rename existing code that will be used only for old switchover-ack
mechanism as legacy. This will help to distinguish legacy code from new
code and make it more readable and easier for removal later when no
longer needed.

No functional change intended.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 include/migration/misc.h |  2 +-
 migration/migration.h    |  2 +-
 hw/vfio/migration.c      |  6 ++---
 migration/migration.c    |  8 +++----
 migration/savevm.c       | 49 +++++++++++++++++++++++++++-------------
 migration/trace-events   |  4 ++--
 6 files changed, 44 insertions(+), 27 deletions(-)

diff --git a/include/migration/misc.h b/include/migration/misc.h
index a2219c981b..4b43413aee 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -156,6 +156,6 @@ bool multifd_device_state_save_thread_should_exit(void);
 void multifd_abort_device_state_save_threads(void);
 bool multifd_join_device_state_save_threads(void);
 
-void migration_request_switchover_ack(const char *requester);
+void migration_request_switchover_ack_legacy(const char *requester);
 
 #endif
diff --git a/migration/migration.h b/migration/migration.h
index 841f49b215..da45444f7b 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -246,7 +246,7 @@ struct MigrationIncomingState {
      * zero an ACK that it's OK to do switchover is sent to the source. No lock
      * is needed as this field is updated serially.
      */
-    unsigned int switchover_ack_pending_num;
+    unsigned int switchover_ack_pending_num_legacy;
 
     /* Do exit on incoming migration failure */
     bool exit_on_error;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 6303f006a2..d5be135584 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -487,11 +487,11 @@ static bool vfio_precopy_supported(VFIODevice *vbasedev)
     return migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
 }
 
-static void vfio_request_switchover_ack(VFIODevice *vbasedev)
+static void vfio_request_switchover_ack_legacy(VFIODevice *vbasedev)
 {
     if (vfio_precopy_supported(vbasedev)) {
         /* Precopy support implies switchover-ack is needed */
-        migration_request_switchover_ack(vbasedev->name);
+        migration_request_switchover_ack_legacy(vbasedev->name);
     }
 }
 
@@ -783,7 +783,7 @@ static int vfio_load_setup(QEMUFile *f, void *opaque, Error **errp)
         return ret;
     }
 
-    vfio_request_switchover_ack(vbasedev);
+    vfio_request_switchover_ack_legacy(vbasedev);
 
     return 0;
 }
diff --git a/migration/migration.c b/migration/migration.c
index 957b794e91..59aff50d68 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2197,7 +2197,7 @@ void migration_rp_kick(MigrationState *s)
 }
 
 /* This is called only on destination side */
-void migration_request_switchover_ack(const char *requester)
+void migration_request_switchover_ack_legacy(const char *requester)
 {
     MigrationIncomingState *mis = migration_incoming_get_current();
 
@@ -2205,10 +2205,10 @@ void migration_request_switchover_ack(const char *requester)
         return;
     }
 
-    mis->switchover_ack_pending_num++;
+    mis->switchover_ack_pending_num_legacy++;
 
-    trace_migration_request_switchover_ack(requester,
-                                           mis->switchover_ack_pending_num);
+    trace_migration_request_switchover_ack_legacy(
+        requester, mis->switchover_ack_pending_num_legacy);
 }
 
 static struct rp_cmd_args {
diff --git a/migration/savevm.c b/migration/savevm.c
index 1a60a178af..fa188dd34f 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2476,6 +2476,31 @@ static int loadvm_postcopy_handle_switchover_start(Error **errp)
     return 0;
 }
 
+/*
+ * If legacy switchover-ack is enabled but no device uses it, need to send an
+ * ACK to source that it's OK to switchover.
+ */
+static int loadvm_switchover_ack_no_users_legacy(MigrationIncomingState *mis,
+                                                 Error **errp)
+{
+    int ret;
+
+    if (!migrate_switchover_ack()) {
+        return 0;
+    }
+
+    if (!mis->switchover_ack_pending_num_legacy) {
+        ret = migrate_send_rp_switchover_ack(mis);
+        if (ret) {
+            error_setg_errno(errp, -ret,
+                             "Could not send switchover ack RP MSG");
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
 /*
  * Process an incoming 'QEMU_VM_COMMAND'
  * 0           just a normal return
@@ -2525,18 +2550,9 @@ static int loadvm_process_command(QEMUFile *f, Error **errp)
         }
         mis->to_src_file = qemu_file_get_return_path(f);
 
-        /*
-         * Switchover ack is enabled but no device uses it, so send an ACK to
-         * source that it's OK to switchover. Do it here, after return path has
-         * been created.
-         */
-        if (migrate_switchover_ack() && !mis->switchover_ack_pending_num) {
-            ret = migrate_send_rp_switchover_ack(mis);
-            if (ret) {
-                error_setg_errno(errp, -ret,
-                                 "Could not send switchover ack RP MSG");
-                return ret;
-            }
+        ret = loadvm_switchover_ack_no_users_legacy(mis, errp);
+        if (ret) {
+            return ret;
         }
         return 0;
 
@@ -3155,14 +3171,15 @@ int qemu_loadvm_approve_switchover(const char *approver)
 {
     MigrationIncomingState *mis = migration_incoming_get_current();
 
-    if (!mis->switchover_ack_pending_num) {
+    if (!mis->switchover_ack_pending_num_legacy) {
         return -EINVAL;
     }
 
-    mis->switchover_ack_pending_num--;
-    trace_loadvm_approve_switchover(approver, mis->switchover_ack_pending_num);
+    mis->switchover_ack_pending_num_legacy--;
+    trace_loadvm_approve_switchover_legacy(
+        approver, mis->switchover_ack_pending_num_legacy);
 
-    if (mis->switchover_ack_pending_num) {
+    if (mis->switchover_ack_pending_num_legacy) {
         return 0;
     }
 
diff --git a/migration/trace-events b/migration/trace-events
index 5955befcc6..a6b8c31ee1 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -23,7 +23,7 @@ loadvm_postcopy_ram_handle_discard_end(void) ""
 loadvm_postcopy_ram_handle_discard_header(const char *ramid, uint16_t len) "%s: %ud"
 loadvm_process_command(const char *s, uint16_t len) "com=%s len=%d"
 loadvm_process_command_ping(uint32_t val) "0x%x"
-loadvm_approve_switchover(const char *approver, unsigned int switchover_ack_pending_num) "Approver %s, switchover_ack_pending_num %u"
+loadvm_approve_switchover_legacy(const char *approver, unsigned int switchover_ack_pending_num_legacy) "Approver %s, switchover_ack_pending_num_legacy %u"
 postcopy_ram_listen_thread_exit(void) ""
 postcopy_ram_listen_thread_start(void) ""
 qemu_savevm_send_postcopy_advise(void) ""
@@ -198,7 +198,7 @@ process_incoming_migration_co_postcopy_end_main(void) ""
 postcopy_preempt_enabled(bool value) "%d"
 migration_precopy_complete(void) ""
 migration_call_notifiers(int type) "type=%d"
-migration_request_switchover_ack(const char *requester, unsigned int switchover_ack_pending_num) "Requester %s, switchover_ack_pending_num %u"
+migration_request_switchover_ack_legacy(const char *requester, unsigned int switchover_ack_pending_num_legacy) "Requester %s, switchover_ack_pending_num_legacy %u"
 
 # migration-stats
 migration_transferred_bytes(uint64_t qemu_file, uint64_t multifd, uint64_t rdma) "qemu_file %" PRIu64 " multifd %" PRIu64 " RDMA %" PRIu64
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 06/13] migration: Make switchover-ack re-usable
  2026-06-02  9:26 [PATCH v2 00/13] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
                   ` (4 preceding siblings ...)
  2026-06-02  9:26 ` [PATCH v2 05/13] migration: Rename switchover-ack code to legacy Avihai Horon
@ 2026-06-02  9:26 ` Avihai Horon
  2026-06-03  7:17   ` Markus Armbruster
  2026-06-02  9:26 ` [PATCH v2 07/13] migration: Fail migration if switchover-ack is requested after switchover decision Avihai Horon
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Avihai Horon @ 2026-06-02  9:26 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb, Avihai Horon

Switchover-ack is a mechanism to synchronize between source and
destination QEMU during migration to prevent the source from switching
over prematurely.

VFIO uses switchover-ack to ensure switchover happens only after
destination side has loaded the precopy initial bytes. This is important
for VFIO, as otherwise downtime could be impacted and be higher.

In its current state, switchover-ack is a one-time mechanism, meaning
that switchover is acked only once and past that another ACK cannot be
requested again. This was sufficient until now, as VFIO precopy initial
bytes was defined to be monotonically decreasing. Thus, when precopy
initial bytes reached zero for all VFIO devices, a single ACK would be
sent and its validity would hold.

However, now the new VFIO_PRECOPY_INFO_REINIT feature allows precopy
initial bytes to be re-initialized during precopy. Specifically, it
means that initial bytes can grow after reaching zero, which would
invalidate a previously sent switchover ACK.

To solve this, make switchover-ack reusable and allow devices to request
switchover ACKs when needed via the save_query_pending SaveVMHandler.

Since now switchover ACK can be requested for a specific device and in
different times, make switchover ACK per-device (instead of a single ACK
for all devices) and let source side do the pending ACKs accounting.

Keep the legacy switchover-ack mechanism for backward compatibility and
turn it on by a compatibility property for older machines. Enable the
property until VFIO implements the new switchover-ack.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 qapi/migration.json                | 16 ++++-----
 include/migration/client-options.h |  1 +
 include/migration/register.h       |  2 ++
 migration/migration.h              | 32 ++++++++++++++++--
 migration/savevm.h                 |  6 ++--
 hw/core/machine.c                  |  1 +
 migration/migration.c              | 37 ++++++++++++++-------
 migration/options.c                | 10 ++++++
 migration/savevm.c                 | 53 +++++++++++++++++++++++-------
 migration/trace-events             |  5 +--
 10 files changed, 125 insertions(+), 38 deletions(-)

diff --git a/qapi/migration.json b/qapi/migration.json
index 27a7970556..550cb77762 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -507,15 +507,13 @@
 #     and should not affect the correctness of postcopy migration.
 #     (since 7.1)
 #
-# @switchover-ack: If enabled, migration will not stop the source VM
-#     and complete the migration until an ACK is received from the
-#     destination that it's OK to do so.  Exactly when this ACK is
-#     sent depends on the migrated devices that use this feature.  For
-#     example, a device can use it to make sure some of its data is
-#     sent and loaded in the destination before doing switchover.
-#     This can reduce downtime if devices that support this capability
-#     are present.  'return-path' capability must be enabled to use
-#     it.  (since 8.1)
+# @switchover-ack: If enabled, migration will rely on destination side
+#     to acknowledge the source's switchover decision.  The
+#     acknowledgement may depend, for example, on some device's data
+#     being loaded in the destination before doing switchover.  This
+#     can reduce downtime if devices that support this capability are
+#     present.  'return-path' capability must be enabled to use it.
+#     (since 8.1)
 #
 # @dirty-limit: If enabled, migration will throttle vCPUs as needed to
 #     keep their dirty page rate within @vcpu-dirty-limit.  This can
diff --git a/include/migration/client-options.h b/include/migration/client-options.h
index 289c9d7762..78b1daa1a6 100644
--- a/include/migration/client-options.h
+++ b/include/migration/client-options.h
@@ -13,6 +13,7 @@
 
 /* properties */
 bool migrate_send_switchover_start(void);
+bool migrate_switchover_ack_legacy(void);
 
 /* capabilities */
 
diff --git a/include/migration/register.h b/include/migration/register.h
index a61c4236d2..5825eb30cb 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -23,6 +23,8 @@ typedef struct MigPendingData {
     uint64_t postcopy_bytes;
     /* Amount of pending bytes can be transferred only in stopcopy */
     uint64_t stopcopy_bytes;
+    /* Number of new pending switchover ACKs */
+    uint32_t switchover_ack_pending;
     /*
      * Total pending data, modules do not need to update this field, it
      * will be automatically calculated by migration core API.
diff --git a/migration/migration.h b/migration/migration.h
index da45444f7b..086eb9a15d 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -494,6 +494,29 @@ struct MigrationState {
      */
     uint8_t clear_bitmap_shift;
 
+    /*
+     * This decides whether to use legacy switchover-ack or new switchover-ack.
+     * The main difference between them is that the former allows acknowledging
+     * switchover only once while the latter multiple times.
+     *
+     * In legacy, the destination keeps track of a pending ACKs counter. As
+     * migration progresses, the devices on the destination acknowledge
+     * switchover, decreasing the counter. When the counter reaches zero, a
+     * single ACK message is sent to the source via the return path, indicating
+     * that it's OK to switchover.
+     *
+     * In new switchover-ack, the source is the one that keeps track of a
+     * pending ACKs counter. As migration progresses, the destination sends ACK
+     * message per-device via the return path, which decrements the source
+     * counter. When the counter reaches zero, it's OK to switchover. During
+     * precopy, source-side devices may request additional ACKs, which increment
+     * the counter again.
+     *
+     * In both legacy and new schemes, we rely on per-device protocol to request
+     * switchover ACK from the destination-side counterpart.
+     */
+    bool switchover_ack_legacy;
+
     /*
      * This save hostname when out-going migration starts
      */
@@ -503,10 +526,13 @@ struct MigrationState {
     JSONWriter *vmdesc;
 
     /*
-     * Indicates whether an ACK from the destination that it's OK to do
-     * switchover has been received.
+     * Indicates the number of pending ACKs from the destination. The value may
+     * increase or decrease during precopy as new ACKs are requested or
+     * received. When zero is reached, it's OK to switchover. In legacy
+     * switchover-ack, it's initialized to 1 and decreased to zero upon ACK.
      */
-    bool switchover_acked;
+    uint32_t switchover_ack_pending_num;
+
     /* Is this a rdma migration */
     bool rdma_migration;
 
diff --git a/migration/savevm.h b/migration/savevm.h
index 0d732eb0f7..a049d695c9 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -45,8 +45,10 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
 void qemu_savevm_state_cleanup(void);
 void qemu_savevm_state_complete_postcopy(QEMUFile *f);
 int qemu_savevm_state_complete_precopy(MigrationState *s);
-void qemu_savevm_query_pending_iter(MigPendingData *pending, bool exact);
-void qemu_savevm_query_pending_final(MigPendingData *pending);
+void qemu_savevm_query_pending_iter(MigrationState *s, MigPendingData *pending,
+                                    bool exact);
+void qemu_savevm_query_pending_final(MigrationState *s,
+                                     MigPendingData *pending);
 int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy);
 bool qemu_savevm_state_postcopy_prepare(QEMUFile *f, Error **errp);
 void qemu_savevm_state_end(QEMUFile *f);
diff --git a/hw/core/machine.c b/hw/core/machine.c
index 63baff859f..8a39943d94 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -41,6 +41,7 @@
 
 GlobalProperty hw_compat_11_0[] = {
     { "chardev-vc", "encoding", "cp437" },
+    { "migration", "switchover-ack-legacy", "on" },
 };
 const size_t hw_compat_11_0_len = G_N_ELEMENTS(hw_compat_11_0);
 
diff --git a/migration/migration.c b/migration/migration.c
index 59aff50d68..4bb649a467 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1707,7 +1707,9 @@ int migrate_init(MigrationState *s, Error **errp)
     s->vm_old_state = -1;
     s->iteration_initial_bytes = 0;
     s->threshold_size = 0;
-    s->switchover_acked = false;
+    /* Legacy switchover-ack sends a single ACK for all devices */
+    qatomic_set(&s->switchover_ack_pending_num,
+                migrate_switchover_ack_legacy() ? 1 : 0);
     s->rdma_migration = false;
 
     /*
@@ -2201,7 +2203,7 @@ void migration_request_switchover_ack_legacy(const char *requester)
 {
     MigrationIncomingState *mis = migration_incoming_get_current();
 
-    if (!migrate_switchover_ack()) {
+    if (!migrate_switchover_ack() || !migrate_switchover_ack_legacy()) {
         return;
     }
 
@@ -2457,9 +2459,18 @@ static void *source_return_path_thread(void *opaque)
             break;
 
         case MIG_RP_MSG_SWITCHOVER_ACK:
-            ms->switchover_acked = true;
-            trace_source_return_path_thread_switchover_acked();
+        {
+            uint32_t pending_num;
+
+            pending_num = qatomic_dec_fetch(&ms->switchover_ack_pending_num);
+            trace_source_return_path_thread_switchover_acked(pending_num);
+            if (pending_num == UINT32_MAX) {
+                error_setg(&err, "Switchover ack pending num underflowed");
+                goto out;
+            }
+
             break;
+        }
 
         default:
             break;
@@ -2809,7 +2820,7 @@ static bool migration_switchover_start(MigrationState *s, Error **errp)
         return false;
     }
 
-    qemu_savevm_query_pending_final(&pending);
+    qemu_savevm_query_pending_final(s, &pending);
 
     /* Inactivate disks except in COLO */
     if (!migrate_colo()) {
@@ -3259,7 +3270,7 @@ static bool migration_can_switchover(MigrationState *s)
         return true;
     }
 
-    return s->switchover_acked;
+    return qatomic_read(&s->switchover_ack_pending_num) == 0;
 }
 
 /* Migration thread iteration status */
@@ -3298,12 +3309,13 @@ static bool migration_iteration_next_ready(MigrationState *s,
     return false;
 }
 
-static void migration_iteration_go_next(MigPendingData *pending)
+static void migration_iteration_go_next(MigrationState *s,
+                                        MigPendingData *pending)
 {
     /*
      * Do a slow sync first before boosting the iteration count.
      */
-    qemu_savevm_query_pending_iter(pending, true);
+    qemu_savevm_query_pending_iter(s, pending, true);
 
     /*
      * Update the dirty information for the whole system for this
@@ -3349,12 +3361,12 @@ static MigIterateState migration_iteration_run(MigrationState *s)
     Error *local_err = NULL;
     bool in_postcopy = (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE ||
                         s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
-    bool can_switchover = migration_can_switchover(s);
+    bool can_switchover;
     MigPendingData pending = { };
     bool complete_ready;
 
     /* Fast path - get the estimated amount of pending data */
-    qemu_savevm_query_pending_iter(&pending, false);
+    qemu_savevm_query_pending_iter(s, &pending, false);
 
     if (in_postcopy) {
         /*
@@ -3395,9 +3407,12 @@ static MigIterateState migration_iteration_run(MigrationState *s)
          * during postcopy phase.
          */
         if (migration_iteration_next_ready(s, &pending)) {
-            migration_iteration_go_next(&pending);
+            migration_iteration_go_next(s, &pending);
         }
 
+        /* Check can switchover after qemu_savevm_query_pending() */
+        can_switchover = migration_can_switchover(s);
+
         /* Should we switch to postcopy now? */
         if (can_switchover && postcopy_should_start(s, &pending)) {
             if (postcopy_start(s, &local_err)) {
diff --git a/migration/options.c b/migration/options.c
index 5cbfd29099..4c9b25372e 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -110,6 +110,9 @@ const Property migration_properties[] = {
                      preempt_pre_7_2, false),
     DEFINE_PROP_BOOL("multifd-clean-tls-termination", MigrationState,
                      multifd_clean_tls_termination, true),
+    /* Use legacy until VFIO implements new switchover-ack */
+    DEFINE_PROP_BOOL("switchover-ack-legacy", MigrationState,
+                     switchover_ack_legacy, true),
 
     /* Migration parameters */
     DEFINE_PROP_UINT8("x-throttle-trigger-threshold", MigrationState,
@@ -467,6 +470,13 @@ bool migrate_rdma(void)
     return s->rdma_migration;
 }
 
+bool migrate_switchover_ack_legacy(void)
+{
+    MigrationState *s = migrate_get_current();
+
+    return s->switchover_ack_legacy;
+}
+
 typedef enum WriteTrackingSupport {
     WT_SUPPORT_UNKNOWN = 0,
     WT_SUPPORT_ABSENT,
diff --git a/migration/savevm.c b/migration/savevm.c
index fa188dd34f..0e487ea8ab 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1795,7 +1795,8 @@ int qemu_savevm_state_complete_precopy(MigrationState *s)
     return qemu_fflush(f);
 }
 
-static void qemu_savevm_query_pending(MigPendingData *pending, bool exact,
+static void qemu_savevm_query_pending(MigrationState *s,
+                                      MigPendingData *pending, bool exact,
                                       bool final)
 {
     SaveStateEntry *se;
@@ -1824,22 +1825,35 @@ static void qemu_savevm_query_pending(MigPendingData *pending, bool exact,
     if (!final) {
         mig_stats.dirty_bytes_total = pending->total_bytes;
     }
-    trace_qemu_savevm_query_pending(exact, final, pending->precopy_bytes,
-                                    pending->stopcopy_bytes,
-                                    pending->postcopy_bytes,
-                                    pending->total_bytes);
+
+    if (migrate_switchover_ack() && !migrate_switchover_ack_legacy() &&
+        pending->switchover_ack_pending) {
+        /*
+         * NOTE: Currently we rely on per-device protocol to request switchover
+         * ACK from the device on the destination side.
+         */
+        qatomic_add(&s->switchover_ack_pending_num,
+                    pending->switchover_ack_pending);
+    }
+
+    trace_qemu_savevm_query_pending(
+        exact, final, pending->precopy_bytes, pending->stopcopy_bytes,
+        pending->postcopy_bytes, pending->total_bytes,
+        pending->switchover_ack_pending,
+        qatomic_read(&s->switchover_ack_pending_num));
 }
 
-void qemu_savevm_query_pending_iter(MigPendingData *pending, bool exact)
+void qemu_savevm_query_pending_iter(MigrationState *s, MigPendingData *pending,
+                                    bool exact)
 {
-    qemu_savevm_query_pending(pending, exact, false);
+    qemu_savevm_query_pending(s, pending, exact, false);
 }
 
-void qemu_savevm_query_pending_final(MigPendingData *pending)
+void qemu_savevm_query_pending_final(MigrationState *s, MigPendingData *pending)
 {
     g_assert(bql_locked());
 
-    qemu_savevm_query_pending(pending, true, true);
+    qemu_savevm_query_pending(s, pending, true, true);
 }
 
 void qemu_savevm_state_cleanup(void)
@@ -2485,7 +2499,7 @@ static int loadvm_switchover_ack_no_users_legacy(MigrationIncomingState *mis,
 {
     int ret;
 
-    if (!migrate_switchover_ack()) {
+    if (!migrate_switchover_ack() || !migrate_switchover_ack_legacy()) {
         return 0;
     }
 
@@ -3167,7 +3181,7 @@ int qemu_load_device_state(QEMUFile *f, Error **errp)
     return 0;
 }
 
-int qemu_loadvm_approve_switchover(const char *approver)
+static int qemu_loadvm_approve_switchover_legacy(const char *approver)
 {
     MigrationIncomingState *mis = migration_incoming_get_current();
 
@@ -3186,6 +3200,23 @@ int qemu_loadvm_approve_switchover(const char *approver)
     return migrate_send_rp_switchover_ack(mis);
 }
 
+int qemu_loadvm_approve_switchover(const char *approver)
+{
+    MigrationIncomingState *mis = migration_incoming_get_current();
+
+    if (!migrate_switchover_ack()) {
+        return 0;
+    }
+
+    if (migrate_switchover_ack_legacy()) {
+        return qemu_loadvm_approve_switchover_legacy(approver);
+    }
+
+    trace_loadvm_approve_switchover(approver);
+
+    return migrate_send_rp_switchover_ack(mis);
+}
+
 bool qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
                                    char *buf, size_t len, Error **errp)
 {
diff --git a/migration/trace-events b/migration/trace-events
index a6b8c31ee1..f5339f4193 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -7,7 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
 qemu_loadvm_state_post_main(int ret) "%d"
 qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
 qemu_savevm_send_packaged(void) ""
-qemu_savevm_query_pending(bool exact, bool final, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy, uint64_t total) "exact=%d, final=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64", total=%"PRIu64
+qemu_savevm_query_pending(bool exact, bool final, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy, uint64_t total, uint32_t switchover_ack_pending, uint32_t total_switchover_ack_pending) "exact=%d, final=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64", total=%"PRIu64", collected switchover ack pending=%"PRIu32", total switchover ack pending=%"PRIu32
 loadvm_state_setup(void) ""
 loadvm_state_cleanup(void) ""
 loadvm_handle_cmd_packaged(unsigned int length) "%u"
@@ -24,6 +24,7 @@ loadvm_postcopy_ram_handle_discard_header(const char *ramid, uint16_t len) "%s:
 loadvm_process_command(const char *s, uint16_t len) "com=%s len=%d"
 loadvm_process_command_ping(uint32_t val) "0x%x"
 loadvm_approve_switchover_legacy(const char *approver, unsigned int switchover_ack_pending_num_legacy) "Approver %s, switchover_ack_pending_num_legacy %u"
+loadvm_approve_switchover(const char *approver) "Approver %s"
 postcopy_ram_listen_thread_exit(void) ""
 postcopy_ram_listen_thread_start(void) ""
 qemu_savevm_send_postcopy_advise(void) ""
@@ -189,7 +190,7 @@ source_return_path_thread_loop_top(void) ""
 source_return_path_thread_pong(uint32_t val) "0x%x"
 source_return_path_thread_shut(uint32_t val) "0x%x"
 source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
-source_return_path_thread_switchover_acked(void) ""
+source_return_path_thread_switchover_acked(uint32_t pending_num) "switchover_ack_pending_num %" PRIu32
 source_return_path_thread_postcopy_package_loaded(void) ""
 migration_thread_low_pending(uint64_t pending) "%" PRIu64
 migrate_transferred(uint64_t transferred, uint64_t time_spent, uint64_t bandwidth, uint64_t avail_bw, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " switchover_bw %" PRIu64 " max_size %" PRId64
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 07/13] migration: Fail migration if switchover-ack is requested after switchover decision
  2026-06-02  9:26 [PATCH v2 00/13] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
                   ` (5 preceding siblings ...)
  2026-06-02  9:26 ` [PATCH v2 06/13] migration: Make switchover-ack re-usable Avihai Horon
@ 2026-06-02  9:26 ` Avihai Horon
  2026-06-03 20:42   ` Peter Xu
  2026-06-02  9:26 ` [PATCH v2 08/13] vfio/migration: Extract VFIO_MIG_FLAG_DEV_INIT_DATA_SENT sending to helper Avihai Horon
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Avihai Horon @ 2026-06-02  9:26 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb, Avihai Horon

Switchover ACK is checked only during precopy while the guest is still
running. The last migration_can_switchover() decision and guest stop are
not atomic, so a device may want to request another switchover ACK in
the gap after switchover decision has been made but before the guest is
stopped. Migration would then miss that request, which can increase
downtime.

Cover this case by failing the migration if a switchover-ack was
requested during that time.

Ideally, precopy iterations should be resumed in this case, however,
VFIO doesn't support going back to precopy after being stopped, so
implementing such logic would require non-trivial changes to the guest
start/stop flow. Given the above and that this case should be rare,
failing the migration seems reasonable.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 migration/migration.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index 4bb649a467..6ee1c795ff 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2810,6 +2810,27 @@ static bool migration_switchover_prepare(MigrationState *s)
     return s->state == MIGRATION_STATUS_DEVICE;
 }
 
+static bool migration_switchover_check_switchover_ack_pending(MigrationState *s,
+                                                              Error **errp)
+{
+    uint32_t pending_num;
+
+    if (!migrate_switchover_ack() || migrate_switchover_ack_legacy()) {
+        return true;
+    }
+
+    pending_num = qatomic_read(&s->switchover_ack_pending_num);
+    if (pending_num > 0) {
+        error_setg(errp,
+                   "Switchover ACK was requested by %" PRIu32
+                   " devices during switchover",
+                   pending_num);
+        return false;
+    }
+
+    return true;
+}
+
 static bool migration_switchover_start(MigrationState *s, Error **errp)
 {
     ERRP_GUARD();
@@ -2822,6 +2843,15 @@ static bool migration_switchover_start(MigrationState *s, Error **errp)
 
     qemu_savevm_query_pending_final(s, &pending);
 
+    /*
+     * Switchover-ack requests done after switchover decision, are not allowed.
+     * Fail the migration in this case since we currently don't support going
+     * back to precopy.
+     */
+    if (!migration_switchover_check_switchover_ack_pending(s, errp)) {
+        return false;
+    }
+
     /* Inactivate disks except in COLO */
     if (!migrate_colo()) {
         /*
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 08/13] vfio/migration: Extract VFIO_MIG_FLAG_DEV_INIT_DATA_SENT sending to helper
  2026-06-02  9:26 [PATCH v2 00/13] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
                   ` (6 preceding siblings ...)
  2026-06-02  9:26 ` [PATCH v2 07/13] migration: Fail migration if switchover-ack is requested after switchover decision Avihai Horon
@ 2026-06-02  9:26 ` Avihai Horon
  2026-06-03 20:43   ` Peter Xu
  2026-06-02  9:26 ` [PATCH v2 09/13] vfio/migration: Add Error ** parameter to vfio_migration_init() Avihai Horon
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Avihai Horon @ 2026-06-02  9:26 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb, Avihai Horon

Extract the VFIO_MIG_FLAG_DEV_INIT_DATA_SENT flag sending logic from
vfio_save_iterate() into vfio_send_init_data_flag() for clarity. Also
add a trace while at it.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/migration.c  | 26 +++++++++++++++++++++-----
 hw/vfio/trace-events |  1 +
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index d5be135584..a15e877fd3 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -480,6 +480,26 @@ static void vfio_update_estimated_pending_data(VFIOMigration *migration,
                                          data_size);
 }
 
+/* Returns true if the init data flag was sent, false otherwise */
+static bool vfio_send_init_data_flag(QEMUFile *f, VFIOMigration *migration)
+{
+    VFIODevice *vbasedev = migration->vbasedev;
+
+    if (!migrate_switchover_ack()) {
+        return false;
+    }
+
+    if (migration->precopy_init_size || migration->initial_data_sent) {
+        return false;
+    }
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT);
+    migration->initial_data_sent = true;
+    trace_vfio_send_init_data_flag(vbasedev->name);
+
+    return true;
+}
+
 static bool vfio_precopy_supported(VFIODevice *vbasedev)
 {
     VFIOMigration *migration = vbasedev->migration;
@@ -692,11 +712,7 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
 
     vfio_update_estimated_pending_data(migration, data_size);
 
-    if (migrate_switchover_ack() && !migration->precopy_init_size &&
-        !migration->initial_data_sent) {
-        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT);
-        migration->initial_data_sent = true;
-    } else {
+    if (!vfio_send_init_data_flag(f, migration)) {
         qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
     }
 
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 2049159015..e99ee2ee8a 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -175,6 +175,7 @@ vfio_save_device_config_state(const char *name) " (%s)"
 vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size %"PRIu64" precopy dirty size %"PRIu64
 vfio_save_iterate_start(const char *name) " (%s)"
 vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size %"PRIu64
+vfio_send_init_data_flag(const char *name) " (%s)"
 vfio_state_pending(const char *name, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size, bool exact) " (%s) stopcopy size %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64 " exact %d"
 vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
 vfio_vmstate_change_prepare(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 09/13] vfio/migration: Add Error ** parameter to vfio_migration_init()
  2026-06-02  9:26 [PATCH v2 00/13] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
                   ` (7 preceding siblings ...)
  2026-06-02  9:26 ` [PATCH v2 08/13] vfio/migration: Extract VFIO_MIG_FLAG_DEV_INIT_DATA_SENT sending to helper Avihai Horon
@ 2026-06-02  9:26 ` Avihai Horon
  2026-06-03 11:38   ` Philippe Mathieu-Daudé
  2026-06-02  9:26 ` [PATCH v2 10/13] vfio/migration: Add new switchover-ack mechanism Avihai Horon
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Avihai Horon @ 2026-06-02  9:26 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb, Avihai Horon

vfio_migration_init() already has many failure points and a new one will
be added in next patch.

Add Error ** parameter to vfio_migration_init() to report a detailed
error message through it.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/migration.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index a15e877fd3..02ef216712 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -1055,7 +1055,7 @@ static bool vfio_dma_logging_supported(VFIODevice *vbasedev)
     return !ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature);
 }
 
-static int vfio_migration_init(VFIODevice *vbasedev)
+static int vfio_migration_init(VFIODevice *vbasedev, Error **errp)
 {
     int ret;
     Object *obj;
@@ -1066,21 +1066,31 @@ static int vfio_migration_init(VFIODevice *vbasedev)
     VMChangeStateHandler *prepare_cb;
 
     if (!vbasedev->ops->vfio_get_object) {
+        error_setg(errp, "no vfio_get_object handler");
         return -EINVAL;
     }
 
     obj = vbasedev->ops->vfio_get_object(vbasedev);
     if (!obj) {
+        error_setg(errp, "failed to get object");
         return -EINVAL;
     }
 
     ret = vfio_migration_query_flags(vbasedev, &mig_flags);
     if (ret) {
+        if (ret == -ENOTTY) {
+            error_setg_errno(errp, -ret,
+                             "migration is not supported in kernel");
+        } else {
+            error_setg_errno(errp, -ret, "failed to query migration flags");
+        }
+
         return ret;
     }
 
     /* Basic migration functionality must be supported */
     if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) {
+        error_setg(errp, "VFIO_MIGRATION_STOP_COPY is not supported");
         return -EOPNOTSUPP;
     }
 
@@ -1278,18 +1288,9 @@ bool vfio_migration_realize(VFIODevice *vbasedev, Error **errp)
         return !vfio_block_migration(vbasedev, err, errp);
     }
 
-    ret = vfio_migration_init(vbasedev);
+    ret = vfio_migration_init(vbasedev, &err);
     if (ret) {
-        if (ret == -ENOTTY) {
-            error_setg(&err, "%s: VFIO migration is not supported in kernel",
-                       vbasedev->name);
-        } else {
-            error_setg(&err,
-                       "%s: Migration couldn't be initialized for VFIO device, "
-                       "err: %d (%s)",
-                       vbasedev->name, ret, strerror(-ret));
-        }
-
+        error_prepend(&err, "%s: VFIO migration init failed: ", vbasedev->name);
         return !vfio_block_migration(vbasedev, err, errp);
     }
 
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 10/13] vfio/migration: Add new switchover-ack mechanism
  2026-06-02  9:26 [PATCH v2 00/13] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
                   ` (8 preceding siblings ...)
  2026-06-02  9:26 ` [PATCH v2 09/13] vfio/migration: Add Error ** parameter to vfio_migration_init() Avihai Horon
@ 2026-06-02  9:26 ` Avihai Horon
  2026-06-03 20:47   ` Peter Xu
  2026-06-02  9:26 ` [PATCH v2 11/13] vfio/migration: Implement VFIO_PRECOPY_INFO_REINIT feature Avihai Horon
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Avihai Horon @ 2026-06-02  9:26 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb, Avihai Horon

Add support for the new switchover-ack mechanism. This includes
requesting a switchover ACK on the first save_query_pending call (with
exact=false) if VFIO precopy is supported.

This achieves the same functionality of legacy switchover-ack but with
the new switchover-ack mechanism.

Keep legacy switchover-ack functionality for backward compatibility.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/vfio-migration-internal.h |  1 +
 hw/vfio/migration.c               | 13 ++++++++++++-
 hw/vfio/trace-events              |  2 +-
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/vfio-migration-internal.h b/hw/vfio/vfio-migration-internal.h
index a15fc74703..dc741e5142 100644
--- a/hw/vfio/vfio-migration-internal.h
+++ b/hw/vfio/vfio-migration-internal.h
@@ -58,6 +58,7 @@ typedef struct VFIOMigration {
     bool multifd_transfer;
     VFIOMultifd *multifd;
     bool initial_data_sent;
+    bool request_switchover_ack;
 
     bool event_save_iterate_started;
     bool event_precopy_empty_hit;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 02ef216712..2296d0d44b 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -582,6 +582,9 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
             }
 
             vfio_query_precopy_size(migration);
+            if (migrate_switchover_ack() && !migrate_switchover_ack_legacy()) {
+                migration->request_switchover_ack = true;
+            }
 
             break;
         case VFIO_DEVICE_STATE_STOP:
@@ -634,6 +637,7 @@ static void vfio_save_cleanup(void *opaque)
     migration->precopy_init_size = 0;
     migration->precopy_dirty_size = 0;
     migration->initial_data_sent = false;
+    migration->request_switchover_ack = false;
     vfio_migration_cleanup(vbasedev);
     trace_vfio_save_cleanup(vbasedev->name);
 }
@@ -655,6 +659,7 @@ static void vfio_state_pending(void *opaque, MigPendingData *pending,
     VFIODevice *vbasedev = opaque;
     VFIOMigration *migration = vbasedev->migration;
     uint64_t precopy_size, stopcopy_size;
+    bool request_switchover_ack = false;
 
     if (final) {
         return;
@@ -675,10 +680,16 @@ static void vfio_state_pending(void *opaque, MigPendingData *pending,
 
     pending->precopy_bytes += precopy_size;
     pending->stopcopy_bytes += stopcopy_size;
+    if (migration->request_switchover_ack) {
+        pending->switchover_ack_pending++;
+        request_switchover_ack = true;
+        migration->request_switchover_ack = false;
+    }
 
     trace_vfio_state_pending(vbasedev->name, migration->stopcopy_size,
                              migration->precopy_init_size,
-                             migration->precopy_dirty_size, exact);
+                             migration->precopy_dirty_size,
+                             request_switchover_ack, exact);
 }
 
 static bool vfio_is_active_iterate(void *opaque)
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index e99ee2ee8a..50722eb717 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -176,7 +176,7 @@ vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy
 vfio_save_iterate_start(const char *name) " (%s)"
 vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size %"PRIu64
 vfio_send_init_data_flag(const char *name) " (%s)"
-vfio_state_pending(const char *name, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size, bool exact) " (%s) stopcopy size %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64 " exact %d"
+vfio_state_pending(const char *name, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size, bool request_switchover_ack, bool exact) " (%s) stopcopy size %"PRIu64", precopy initial size %"PRIu64", precopy dirty size %"PRIu64 ", request switchover ack %d, exact %d"
 vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
 vfio_vmstate_change_prepare(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
 
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 11/13] vfio/migration: Implement VFIO_PRECOPY_INFO_REINIT feature
  2026-06-02  9:26 [PATCH v2 00/13] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
                   ` (9 preceding siblings ...)
  2026-06-02  9:26 ` [PATCH v2 10/13] vfio/migration: Add new switchover-ack mechanism Avihai Horon
@ 2026-06-02  9:26 ` Avihai Horon
  2026-06-03 20:49   ` Peter Xu
  2026-06-02  9:26 ` [PATCH v2 12/13] vfio/migration: Check VFIO_PRECOPY_INFO_REINIT during switchover Avihai Horon
  2026-06-02  9:26 ` [PATCH v2 13/13] migration: Enable new switchover-ack Avihai Horon
  12 siblings, 1 reply; 38+ messages in thread
From: Avihai Horon @ 2026-06-02  9:26 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb, Avihai Horon

According to VFIO uAPI, precopy initial_bytes is considered as critical
data that should be transferred and loaded prior to moving to STOP_COPY
state to ensure precopy phase would be effective.

As currently defined, initial_bytes can only decrease as it's being read
from the data fd. However, there are cases where a new chunk of
initial_bytes should be transferred during precopy.

The new VFIO_PRECOPY_INFO_REINIT feature addresses this and allows
reporting a new value for initial_bytes regardless of any previously
reported values.

Implement VFIO_PRECOPY_INFO_REINIT feature:
1. Opt-in for VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2 to make
   VFIO_PRECOPY_INFO_REINIT available.
2. Request a new switchover ACK if initial_bytes increases post of a
   previous switchover ACK. This ensures the device is not moved to
   STOP_COPY before initial_bytes has reached zero again.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 docs/devel/migration/vfio.rst     | 14 +++++++
 hw/vfio/vfio-migration-internal.h |  1 +
 hw/vfio/migration.c               | 68 ++++++++++++++++++++++++++++---
 hw/vfio/trace-events              |  4 +-
 4 files changed, 80 insertions(+), 7 deletions(-)

diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
index 854277b11c..f235c2d4f9 100644
--- a/docs/devel/migration/vfio.rst
+++ b/docs/devel/migration/vfio.rst
@@ -23,6 +23,20 @@ and recommends that the initial bytes are sent and loaded in the destination
 before stopping the source VM. Enabling this migration capability will
 guarantee that and thus, can potentially reduce downtime even further.
 
+For example, in mlx5 devices, the initial bytes hold metadata used for time
+consuming pre-allocations of resources on the destination. Although init bytes
+may be small in size and sending them may take little time, loading them in the
+destination can take a significant amount of time. Switchover-ack guarantees
+that this pre-allocation doesn't happen during downtime.
+
+Initial bytes was originally defined to be monotonically decreasing, however
+there are cases where a new chunk of initial bytes should be transferred during
+precopy, e.g., due to a device reconfiguration, etc. The
+VFIO_PRECOPY_INFO_REINIT feature addresses this and when supported, allows to
+report a new initial bytes value regardless of any previously reported values.
+In this case, a new switchover ACK will be requested to make sure the new
+initial bytes are loaded in the destination before switching over.
+
 To support migration of multiple devices that might do P2P transactions between
 themselves, VFIO migration uAPI defines an intermediate P2P quiescent state.
 While in the P2P quiescent state, P2P DMA transactions cannot be initiated by
diff --git a/hw/vfio/vfio-migration-internal.h b/hw/vfio/vfio-migration-internal.h
index dc741e5142..a1c58b1126 100644
--- a/hw/vfio/vfio-migration-internal.h
+++ b/hw/vfio/vfio-migration-internal.h
@@ -45,6 +45,7 @@ typedef struct VFIOMigration {
     void *data_buffer;
     size_t data_buffer_size;
     uint64_t mig_flags;
+    bool precopy_info_v2_used;
     /*
      * NOTE: all three sizes cached are reported from VFIO's uAPI, which
      * are defined as estimate only.  QEMU should not trust these values
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 2296d0d44b..caf4d5e19f 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -373,9 +373,11 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev)
 
 static int vfio_query_precopy_size(VFIOMigration *migration)
 {
+    VFIODevice *vbasedev = migration->vbasedev;
     struct vfio_precopy_info precopy = {
         .argsz = sizeof(precopy),
     };
+    bool reinit = false;
     int ret = 0;
 
     if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
@@ -383,25 +385,43 @@ static int vfio_query_precopy_size(VFIOMigration *migration)
         migration->precopy_dirty_size = 0;
         ret = -errno;
         warn_report_once("VFIO device %s ioctl(VFIO_MIG_GET_PRECOPY_INFO) "
-                         "failed (%d)", migration->vbasedev->name, ret);
+                         "failed (%d)", vbasedev->name, ret);
     } else {
         bool overflow;
 
         migration->precopy_init_size = precopy.initial_bytes;
         migration->precopy_dirty_size = precopy.dirty_bytes;
+        /*
+         * struct vfio_precopy_info.flags is valid only if
+         * VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2 is used.
+         */
+         if (migration->precopy_info_v2_used) {
+            reinit = precopy.flags & VFIO_PRECOPY_INFO_REINIT;
+        }
 
-        overflow  = vfio_migration_check_overflow(migration->vbasedev,
+        overflow  = vfio_migration_check_overflow(vbasedev,
                          migration->precopy_init_size,  "precopy init size");
-        overflow |= vfio_migration_check_overflow(migration->vbasedev,
+        overflow |= vfio_migration_check_overflow(vbasedev,
                          migration->precopy_dirty_size, "precopy dirty size");
         if (overflow) {
             ret = -ERANGE;
         }
     }
 
-    trace_vfio_query_precopy_size(migration->vbasedev->name,
-                                  migration->precopy_init_size,
-                                  migration->precopy_dirty_size, ret);
+    trace_vfio_query_precopy_size(vbasedev->name, migration->precopy_init_size,
+                                  migration->precopy_dirty_size, reinit, ret);
+
+    /*
+     * If we got new initial_bytes after previous initial_bytes were
+     * transferred, request a new switchover ACK. Don't request if legacy
+     * switchover-ack is used.
+     */
+    if (reinit && migration->initial_data_sent &&
+        !migrate_switchover_ack_legacy()) {
+        migration->initial_data_sent = false;
+        migration->request_switchover_ack = true;
+        trace_vfio_query_precopy_size_request_switchover_ack(vbasedev->name);
+    }
 
     return ret;
 }
@@ -1053,6 +1073,27 @@ static int vfio_migration_query_flags(VFIODevice *vbasedev, uint64_t *mig_flags)
     return 0;
 }
 
+/* Returns 1 on success, 0 if not supported and negative errno on failure */
+static int vfio_migration_set_precopy_info_v2(VFIODevice *vbasedev)
+{
+    uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature),
+                              sizeof(uint64_t))] = {};
+    struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
+
+    feature->argsz = sizeof(buf);
+    feature->flags =
+        VFIO_DEVICE_FEATURE_SET | VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2;
+    if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
+        if (errno == ENOTTY) {
+            return 0;
+        }
+
+        return -errno;
+    }
+
+    return 1;
+}
+
 static bool vfio_dma_logging_supported(VFIODevice *vbasedev)
 {
     uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature),
@@ -1074,6 +1115,7 @@ static int vfio_migration_init(VFIODevice *vbasedev, Error **errp)
     char id[256] = "";
     g_autofree char *path = NULL, *oid = NULL;
     uint64_t mig_flags = 0;
+    bool precopy_info_v2_used = false;
     VMChangeStateHandler *prepare_cb;
 
     if (!vbasedev->ops->vfio_get_object) {
@@ -1105,12 +1147,22 @@ static int vfio_migration_init(VFIODevice *vbasedev, Error **errp)
         return -EOPNOTSUPP;
     }
 
+    if (mig_flags & VFIO_MIGRATION_PRE_COPY) {
+        ret = vfio_migration_set_precopy_info_v2(vbasedev);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "failed to set precopy info v2");
+            return ret;
+        }
+        precopy_info_v2_used = ret;
+    }
+
     vbasedev->migration = g_new0(VFIOMigration, 1);
     migration = vbasedev->migration;
     migration->vbasedev = vbasedev;
     migration->device_state = VFIO_DEVICE_STATE_RUNNING;
     migration->data_fd = -1;
     migration->mig_flags = mig_flags;
+    migration->precopy_info_v2_used = precopy_info_v2_used;
 
     vbasedev->dirty_pages_supported = vfio_dma_logging_supported(vbasedev);
 
@@ -1133,6 +1185,10 @@ static int vfio_migration_init(VFIODevice *vbasedev, Error **errp)
     migration_add_notifier(&migration->migration_state,
                            vfio_migration_state_notifier);
 
+    trace_vfio_migration_init(vbasedev->name, migration->mig_flags,
+                              migration->precopy_info_v2_used,
+                              vbasedev->dirty_pages_supported);
+
     return 0;
 }
 
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 50722eb717..464c28c860 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -158,11 +158,13 @@ vfio_load_state_device_buffer_starved(const char *name, uint32_t idx) " (%s) idx
 vfio_load_state_device_buffer_load_start(const char *name, uint32_t idx) " (%s) idx %"PRIu32
 vfio_load_state_device_buffer_load_end(const char *name, uint32_t idx) " (%s) idx %"PRIu32
 vfio_load_state_device_buffer_end(const char *name) " (%s)"
+vfio_migration_init(const char *name, uint64_t mig_flags, bool precopy_info_v2_used, bool dirty_pages_supported) " (%s) mig_flags 0x%"PRIx64", precopy_info_v2_used %d, dirty_pages_supported %d"
 vfio_migration_realize(const char *name) " (%s)"
 vfio_migration_set_device_state(const char *name, const char *state) " (%s) state %s"
 vfio_migration_set_state(const char *name, const char *new_state, const char *recover_state) " (%s) new state %s, recover state %s"
 vfio_migration_state_notifier(const char *name, int state) " (%s) state %d"
-vfio_query_precopy_size(const char *name, uint64_t init_size, uint64_t dirty_size, int ret) " (%s) init %"PRIu64" dirty %"PRIu64" ret %d"
+vfio_query_precopy_size(const char *name, uint64_t init_size, uint64_t dirty_size, bool reinit, int ret) " (%s) init %"PRIu64", dirty %"PRIu64", reinit %d, ret %d"
+vfio_query_precopy_size_request_switchover_ack(const char *name) " (%s)"
 vfio_query_stop_copy_size(const char *name, uint64_t size, int ret) " (%s) stopcopy size %"PRIu64" ret %d"
 vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
 vfio_save_block_precopy_empty_hit(const char *name) " (%s)"
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 12/13] vfio/migration: Check VFIO_PRECOPY_INFO_REINIT during switchover
  2026-06-02  9:26 [PATCH v2 00/13] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
                   ` (10 preceding siblings ...)
  2026-06-02  9:26 ` [PATCH v2 11/13] vfio/migration: Implement VFIO_PRECOPY_INFO_REINIT feature Avihai Horon
@ 2026-06-02  9:26 ` Avihai Horon
  2026-06-02  9:26 ` [PATCH v2 13/13] migration: Enable new switchover-ack Avihai Horon
  12 siblings, 0 replies; 38+ messages in thread
From: Avihai Horon @ 2026-06-02  9:26 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb, Avihai Horon

VFIO_REPCOPY_INFO_REINIT is checked only during precopy, before the
switchover decision. However, the switchover decision and guest stop are
not atomic, so a VFIO device may want to set VFIO_PRECOPY_INFO_REINIT
and request another switchover ACK in the gap after switchover decision
has been made but before the guest is stopped. This would be missed and
may increase downtime.

Solve this by checking if VFIO_PRECOPY_INFO_REINIT was set during that
gap, and request a new switchover-ack in the final save_state_pending
call. Query precopy info after vCPUs are stopped but before
transitioning from PRE_COPY state, when its valid to call the ioctl.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/migration.c  | 41 +++++++++++++++++++++++++++++++++++------
 hw/vfio/trace-events |  2 +-
 2 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index caf4d5e19f..f1480fc4cc 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -681,11 +681,11 @@ static void vfio_state_pending(void *opaque, MigPendingData *pending,
     uint64_t precopy_size, stopcopy_size;
     bool request_switchover_ack = false;
 
-    if (final) {
-        return;
-    }
-
-    if (exact) {
+    /*
+     * Skip sync in final query as sync for precopy (which is needed for
+     * switchover-ack) was already done during guest stop.
+     */
+    if (exact && !final) {
         vfio_state_pending_sync(vbasedev);
     }
 
@@ -709,7 +709,7 @@ static void vfio_state_pending(void *opaque, MigPendingData *pending,
     trace_vfio_state_pending(vbasedev->name, migration->stopcopy_size,
                              migration->precopy_init_size,
                              migration->precopy_dirty_size,
-                             request_switchover_ack, exact);
+                             request_switchover_ack, exact, final);
 }
 
 static bool vfio_is_active_iterate(void *opaque)
@@ -963,6 +963,26 @@ static const SaveVMHandlers savevm_vfio_handlers = {
 
 /* ---------------------------------------------------------------------- */
 
+static void vfio_final_precopy_reinit_check(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    int ret;
+
+    if (!migration->precopy_info_v2_used || !migrate_switchover_ack() ||
+        migrate_switchover_ack_legacy()) {
+        return;
+    }
+
+    ret = vfio_query_precopy_size(migration);
+    if (ret) {
+        error_report("%s: Final precopy reinit check failed (err: %d)",
+                     vbasedev->name, ret);
+        /* If query failed, assume reinit and request switchover-ack */
+        migration->request_switchover_ack = true;
+        migration->initial_data_sent = false;
+    }
+}
+
 static void vfio_vmstate_change_prepare(void *opaque, bool running,
                                         RunState state)
 {
@@ -976,6 +996,15 @@ static void vfio_vmstate_change_prepare(void *opaque, bool running,
                     VFIO_DEVICE_STATE_PRE_COPY_P2P :
                     VFIO_DEVICE_STATE_RUNNING_P2P;
 
+    if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
+        /*
+         * Now that vCPUs are stopped, check if new init_bytes are available
+         * since switchover decision, to be reported in the final
+         * save_query_pending.
+         */
+        vfio_final_precopy_reinit_check(vbasedev);
+    }
+
     ret = vfio_migration_set_state_or_reset(vbasedev, new_state, &local_err);
     if (ret) {
         /*
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 464c28c860..3fbc32cb3a 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -178,7 +178,7 @@ vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy
 vfio_save_iterate_start(const char *name) " (%s)"
 vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size %"PRIu64
 vfio_send_init_data_flag(const char *name) " (%s)"
-vfio_state_pending(const char *name, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size, bool request_switchover_ack, bool exact) " (%s) stopcopy size %"PRIu64", precopy initial size %"PRIu64", precopy dirty size %"PRIu64 ", request switchover ack %d, exact %d"
+vfio_state_pending(const char *name, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size, bool request_switchover_ack, bool exact, bool final) " (%s) stopcopy size %"PRIu64", precopy initial size %"PRIu64", precopy dirty size %"PRIu64 ", request switchover ack %d, exact %d, final %d"
 vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
 vfio_vmstate_change_prepare(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
 
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 13/13] migration: Enable new switchover-ack
  2026-06-02  9:26 [PATCH v2 00/13] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
                   ` (11 preceding siblings ...)
  2026-06-02  9:26 ` [PATCH v2 12/13] vfio/migration: Check VFIO_PRECOPY_INFO_REINIT during switchover Avihai Horon
@ 2026-06-02  9:26 ` Avihai Horon
  12 siblings, 0 replies; 38+ messages in thread
From: Avihai Horon @ 2026-06-02  9:26 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb, Avihai Horon

Now that VFIO has implemented new switchover-ack, enable it.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 migration/options.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/migration/options.c b/migration/options.c
index 4c9b25372e..dfce19405d 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -110,9 +110,8 @@ const Property migration_properties[] = {
                      preempt_pre_7_2, false),
     DEFINE_PROP_BOOL("multifd-clean-tls-termination", MigrationState,
                      multifd_clean_tls_termination, true),
-    /* Use legacy until VFIO implements new switchover-ack */
     DEFINE_PROP_BOOL("switchover-ack-legacy", MigrationState,
-                     switchover_ack_legacy, true),
+                     switchover_ack_legacy, false),
 
     /* Migration parameters */
     DEFINE_PROP_UINT8("x-throttle-trigger-threshold", MigrationState,
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 06/13] migration: Make switchover-ack re-usable
  2026-06-02  9:26 ` [PATCH v2 06/13] migration: Make switchover-ack re-usable Avihai Horon
@ 2026-06-03  7:17   ` Markus Armbruster
  2026-06-04 15:05     ` Avihai Horon
  0 siblings, 1 reply; 38+ messages in thread
From: Markus Armbruster @ 2026-06-03  7:17 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Peter Xu,
	Fabiano Rosas, Pierrick Bouvier, Philippe Mathieu-Daudé,
	Zhao Liu, Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Maor Gottlieb

Avihai Horon <avihaih@nvidia.com> writes:

> Switchover-ack is a mechanism to synchronize between source and
> destination QEMU during migration to prevent the source from switching
> over prematurely.
>
> VFIO uses switchover-ack to ensure switchover happens only after
> destination side has loaded the precopy initial bytes. This is important
> for VFIO, as otherwise downtime could be impacted and be higher.
>
> In its current state, switchover-ack is a one-time mechanism, meaning
> that switchover is acked only once and past that another ACK cannot be
> requested again. This was sufficient until now, as VFIO precopy initial
> bytes was defined to be monotonically decreasing. Thus, when precopy
> initial bytes reached zero for all VFIO devices, a single ACK would be
> sent and its validity would hold.
>
> However, now the new VFIO_PRECOPY_INFO_REINIT feature allows precopy
> initial bytes to be re-initialized during precopy. Specifically, it
> means that initial bytes can grow after reaching zero, which would
> invalidate a previously sent switchover ACK.
>
> To solve this, make switchover-ack reusable and allow devices to request
> switchover ACKs when needed via the save_query_pending SaveVMHandler.
>
> Since now switchover ACK can be requested for a specific device and in
> different times, make switchover ACK per-device (instead of a single ACK
> for all devices) and let source side do the pending ACKs accounting.
>
> Keep the legacy switchover-ack mechanism for backward compatibility and
> turn it on by a compatibility property for older machines. Enable the
> property until VFIO implements the new switchover-ack.

I figure this is about the transmission of ACKs via the return path.  If
both source and destination understand the new per-device ACK, they use
it.  Else, they use old one.  Correct?

This is not visible to the management application.  It may use migration
capability @switchover-ack to enable just as before.  Correct?

Can you think of a reason why management applications might need to know
whether a certain qemu-system-FOO supports per-device ACKs?

> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  qapi/migration.json                | 16 ++++-----
>  include/migration/client-options.h |  1 +
>  include/migration/register.h       |  2 ++
>  migration/migration.h              | 32 ++++++++++++++++--
>  migration/savevm.h                 |  6 ++--
>  hw/core/machine.c                  |  1 +
>  migration/migration.c              | 37 ++++++++++++++-------
>  migration/options.c                | 10 ++++++
>  migration/savevm.c                 | 53 +++++++++++++++++++++++-------
>  migration/trace-events             |  5 +--
>  10 files changed, 125 insertions(+), 38 deletions(-)
>
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 27a7970556..550cb77762 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -507,15 +507,13 @@
>  #     and should not affect the correctness of postcopy migration.
>  #     (since 7.1)
>  #
> -# @switchover-ack: If enabled, migration will not stop the source VM
> -#     and complete the migration until an ACK is received from the
> -#     destination that it's OK to do so.  Exactly when this ACK is
> -#     sent depends on the migrated devices that use this feature.  For
> -#     example, a device can use it to make sure some of its data is
> -#     sent and loaded in the destination before doing switchover.
> -#     This can reduce downtime if devices that support this capability
> -#     are present.  'return-path' capability must be enabled to use
> -#     it.  (since 8.1)
> +# @switchover-ack: If enabled, migration will rely on destination side
> +#     to acknowledge the source's switchover decision.  The
> +#     acknowledgement may depend, for example, on some device's data
> +#     being loaded in the destination before doing switchover.  This
> +#     can reduce downtime if devices that support this capability are
> +#     present.  'return-path' capability must be enabled to use it.

Please use the opportunity to fix markup: @return-path.
docs/devel/qapi-code-gen.rst:

    Use @foo to reference a member description within the current
    definition.  This is an rST extension.  It is currently rendered the
    same way as ````foo````, but carries additional meaning.

Suggest "Capability @return-path must be ..."

The old text is concrete: "will not stop the source VM ... until".  The
new text is vague "will rely on destination side to acknowledge".  What
does that mean exactly?  How does it impact behavior the user /
management application cares about?

> +#     (since 8.1)
>  #
>  # @dirty-limit: If enabled, migration will throttle vCPUs as needed to
>  #     keep their dirty page rate within @vcpu-dirty-limit.  This can

[...]



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 09/13] vfio/migration: Add Error ** parameter to vfio_migration_init()
  2026-06-02  9:26 ` [PATCH v2 09/13] vfio/migration: Add Error ** parameter to vfio_migration_init() Avihai Horon
@ 2026-06-03 11:38   ` Philippe Mathieu-Daudé
  2026-06-04 15:06     ` Avihai Horon
  0 siblings, 1 reply; 38+ messages in thread
From: Philippe Mathieu-Daudé @ 2026-06-03 11:38 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb

Hi Avihai,

On 2/6/26 11:26, Avihai Horon wrote:
> vfio_migration_init() already has many failure points and a new one will
> be added in next patch.
> 
> Add Error ** parameter to vfio_migration_init() to report a detailed
> error message through it.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>   hw/vfio/migration.c | 25 +++++++++++++------------
>   1 file changed, 13 insertions(+), 12 deletions(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index a15e877fd3..02ef216712 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -1055,7 +1055,7 @@ static bool vfio_dma_logging_supported(VFIODevice *vbasedev)
>       return !ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature);
>   }
>   
> -static int vfio_migration_init(VFIODevice *vbasedev)
> +static int vfio_migration_init(VFIODevice *vbasedev, Error **errp)
>   {
>       int ret;
>       Object *obj;
> @@ -1066,21 +1066,31 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>       VMChangeStateHandler *prepare_cb;
>   
>       if (!vbasedev->ops->vfio_get_object) {
> +        error_setg(errp, "no vfio_get_object handler");
>           return -EINVAL;
>       }
>   
>       obj = vbasedev->ops->vfio_get_object(vbasedev);
>       if (!obj) {
> +        error_setg(errp, "failed to get object");
>           return -EINVAL;
>       }
>   
>       ret = vfio_migration_query_flags(vbasedev, &mig_flags);
>       if (ret) {
> +        if (ret == -ENOTTY) {
> +            error_setg_errno(errp, -ret,
> +                             "migration is not supported in kernel");
> +        } else {
> +            error_setg_errno(errp, -ret, "failed to query migration flags");
> +        }
> +
>           return ret;
>       }
>   
>       /* Basic migration functionality must be supported */
>       if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) {
> +        error_setg(errp, "VFIO_MIGRATION_STOP_COPY is not supported");
>           return -EOPNOTSUPP;
>       }
>   
> @@ -1278,18 +1288,9 @@ bool vfio_migration_realize(VFIODevice *vbasedev, Error **errp)
>           return !vfio_block_migration(vbasedev, err, errp);
>       }
>   
> -    ret = vfio_migration_init(vbasedev);
> +    ret = vfio_migration_init(vbasedev, &err);
>       if (ret) {

Only 'ret' boolean part is used now, have the method return a boolean
instead, like other methods with Error **errp last param?

> -        if (ret == -ENOTTY) {
> -            error_setg(&err, "%s: VFIO migration is not supported in kernel",
> -                       vbasedev->name);
> -        } else {
> -            error_setg(&err,
> -                       "%s: Migration couldn't be initialized for VFIO device, "
> -                       "err: %d (%s)",
> -                       vbasedev->name, ret, strerror(-ret));
> -        }
> -
> +        error_prepend(&err, "%s: VFIO migration init failed: ", vbasedev->name);
>           return !vfio_block_migration(vbasedev, err, errp);
>       }
>   




^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 01/13] migration: Propagate errors in migration_completion_precopy()
  2026-06-02  9:26 ` [PATCH v2 01/13] migration: Propagate errors in migration_completion_precopy() Avihai Horon
@ 2026-06-03 19:47   ` Peter Xu
  2026-06-04 15:12     ` Avihai Horon
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Xu @ 2026-06-03 19:47 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb

On Tue, Jun 02, 2026 at 12:26:09PM +0300, Avihai Horon wrote:
> migration_completion_precopy() doesn't propagate errors to migration
> core which leads to error information loss. Fix that.
> 
> This prepares for a follow-up where migration_switchover_start() can
> fail on switchover-ack and still report a useful error. Errors from
> qemu_savevm_state_complete_precopy() are not propagated yet as it
> requires more plumbing.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  migration/migration.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 074d3f2c69..7488a94206 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2814,7 +2814,7 @@ static bool migration_switchover_start(MigrationState *s, Error **errp)
>      return true;
>  }
>  
> -static int migration_completion_precopy(MigrationState *s)
> +static int migration_completion_precopy(MigrationState *s, Error **errp)
>  {
>      int ret;
>  
> @@ -2823,11 +2823,12 @@ static int migration_completion_precopy(MigrationState *s)
>      if (!migrate_mode_is_cpr()) {
>          ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
>          if (ret < 0) {
> +            error_setg_errno(errp, -ret, "Failed to stop the VM");
>              goto out_unlock;
>          }
>      }
>  
> -    if (!migration_switchover_start(s, NULL)) {
> +    if (!migration_switchover_start(s, errp)) {
>          ret = -EFAULT;
>          goto out_unlock;
>      }

IIUC this patch overlooked the follow up call:

    ret = qemu_savevm_state_complete_precopy(s);

We should make sure ret!=0 will always set Error*.  Better pass Error**
into qemu_savevm_state_complete_precopy() too.

Thanks,

> @@ -2869,7 +2870,7 @@ static void migration_completion(MigrationState *s)
>      Error *local_err = NULL;
>  
>      if (s->state == MIGRATION_STATUS_ACTIVE) {
> -        ret = migration_completion_precopy(s);
> +        ret = migration_completion_precopy(s, &local_err);
>      } else if (s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
>          migration_completion_postcopy(s);
>      } else {
> @@ -2900,7 +2901,9 @@ static void migration_completion(MigrationState *s)
>      return;
>  
>  fail:
> -    if (qemu_file_get_error_obj(s->to_dst_file, &local_err)) {
> +    if (local_err) {
> +        migrate_error_propagate(s, local_err);
> +    } else if (qemu_file_get_error_obj(s->to_dst_file, &local_err)) {
>          migrate_error_propagate(s, local_err);
>      } else if (ret) {
>          error_setg_errno(&local_err, -ret, "Error in migration completion");
> -- 
> 2.40.1
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 02/13] migration: Run final save_query_pending at switchover
  2026-06-02  9:26 ` [PATCH v2 02/13] migration: Run final save_query_pending at switchover Avihai Horon
@ 2026-06-03 20:05   ` Peter Xu
  2026-06-03 21:04     ` Peter Xu
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Xu @ 2026-06-03 20:05 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb

On Tue, Jun 02, 2026 at 12:26:10PM +0300, Avihai Horon wrote:
> Before switchover, the source needs one last exact pending query so
> modules can flush dirty state. This is currently done ad hoc in modules
> handlers. For example, RAM syncs its dirty bitmap in its save_complete
> handler.
> 
> This should be a general concept relevant for any module, so extract it
> to migration core instead by running a final save_query_pending before
> switchover.
> 
> The final query requires special handling by modules (e.g., it's called
> with BQL locked, during VM stop), so extend save_query_pending
> SaveVMHandlers callback and qemu_savevm_query_pending() with a "final"
> flag so migration modules can tell the last pending query during
> switchover from periodic iteration queries.
> 
> Not all modules need the final query; skip it for those who don't need.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  include/migration/register.h   | 41 ++++++++++++++++++----------------
>  migration/savevm.h             |  3 ++-
>  hw/s390x/s390-stattrib.c       |  9 ++++++--
>  hw/vfio/migration.c            |  6 ++++-
>  migration/block-dirty-bitmap.c |  6 ++++-
>  migration/migration.c          |  7 ++++--
>  migration/ram.c                | 37 ++++++++++++++++++------------
>  migration/savevm.c             | 27 +++++++++++++++++-----
>  migration/trace-events         |  2 +-
>  9 files changed, 91 insertions(+), 47 deletions(-)
> 
> diff --git a/include/migration/register.h b/include/migration/register.h
> index 5e5e0ee432..6f632123f1 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -171,6 +171,28 @@ typedef struct SaveVMHandlers {
>       */
>      bool (*is_active_iterate)(void *opaque);
>  
> +    /**
> +     * @save_query_pending
> +     *
> +     * This estimates the remaining data to transfer on the source side.
> +     *
> +     * When @exact is true, a module must report accurate results.  When
> +     * @exact is false, a module may report estimates.
> +     *
> +     * It's highly recommended that modules implement a faster version of
> +     * the query path (for example, by proper caching on the counters) if
> +     * an accurate query will be time-consuming.
> +     *
> +     * @opaque: data pointer passed to register_savevm_live()
> +     * @pending: pointer to a MigPendingData struct
> +     * @exact: set to true for an accurate (slow) query
> +     * @final: set to true for the final query during switchover. When final is
> +     * true, the query is called with BQL locked. Otherwise, it's called with
> +     * BQL unlocked.
> +     */
> +    void (*save_query_pending)(void *opaque, MigPendingData *pending,
> +                               bool exact, bool final);
> +
>      /* This runs outside the BQL in the migration case, and
>       * within the lock in the savevm case.  The callback had better only
>       * use data that is local to the migration thread or protected
> @@ -210,25 +232,6 @@ typedef struct SaveVMHandlers {
>       */
>      bool (*save_postcopy_prepare)(QEMUFile *f, void *opaque, Error **errp);
>  
> -    /**
> -     * @save_query_pending
> -     *
> -     * This estimates the remaining data to transfer on the source side.
> -     *
> -     * When @exact is true, a module must report accurate results.  When
> -     * @exact is false, a module may report estimates.
> -     *
> -     * It's highly recommended that modules implement a faster version of
> -     * the query path (for example, by proper caching on the counters) if
> -     * an accurate query will be time-consuming.
> -     *
> -     * @opaque: data pointer passed to register_savevm_live()
> -     * @pending: pointer to a MigPendingData struct
> -     * @exact: set to true for an accurate (slow) query
> -     */
> -    void (*save_query_pending)(void *opaque, MigPendingData *pending,
> -                               bool exact);
> -
>      /**
>       * @load_state
>       *
> diff --git a/migration/savevm.h b/migration/savevm.h
> index 96fdf96d4e..10b3d78a5f 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -45,7 +45,8 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
>  void qemu_savevm_state_cleanup(void);
>  void qemu_savevm_state_complete_postcopy(QEMUFile *f);
>  int qemu_savevm_state_complete_precopy(MigrationState *s);
> -void qemu_savevm_query_pending(MigPendingData *pending, bool exact);
> +void qemu_savevm_query_pending_iter(MigPendingData *pending, bool exact);
> +void qemu_savevm_query_pending_final(MigPendingData *pending);
>  int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy);
>  bool qemu_savevm_state_postcopy_prepare(QEMUFile *f, Error **errp);
>  void qemu_savevm_state_end(QEMUFile *f);
> diff --git a/hw/s390x/s390-stattrib.c b/hw/s390x/s390-stattrib.c
> index c334714b31..aed0c6844d 100644
> --- a/hw/s390x/s390-stattrib.c
> +++ b/hw/s390x/s390-stattrib.c
> @@ -190,12 +190,17 @@ static int cmma_save_setup(QEMUFile *f, void *opaque, Error **errp)
>  }
>  
>  static void cmma_state_pending(void *opaque, MigPendingData *pending,
> -                               bool exact)
> +                               bool exact, bool final)
>  {
>      S390StAttribState *sas = S390_STATTRIB(opaque);
>      S390StAttribClass *sac = S390_STATTRIB_GET_CLASS(sas);
> -    long long res = sac->get_dirtycount(sas);
> +    long long res;
>  
> +    if (final) {
> +        return;
> +    }
> +
> +    res = sac->get_dirtycount(sas);
>      if (res >= 0) {
>          pending->precopy_bytes += res;
>      }
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index fb12b9717f..95072d6664 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -622,12 +622,16 @@ static void vfio_state_pending_sync(VFIODevice *vbasedev)
>  }
>  
>  static void vfio_state_pending(void *opaque, MigPendingData *pending,
> -                               bool exact)
> +                               bool exact, bool final)
>  {
>      VFIODevice *vbasedev = opaque;
>      VFIOMigration *migration = vbasedev->migration;
>      uint64_t precopy_size, stopcopy_size;
>  
> +    if (final) {
> +        return;
> +    }
> +
>      if (exact) {
>          vfio_state_pending_sync(vbasedev);
>      }
> diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
> index 7ef3759e53..66ed91de03 100644
> --- a/migration/block-dirty-bitmap.c
> +++ b/migration/block-dirty-bitmap.c
> @@ -767,12 +767,16 @@ static int dirty_bitmap_save_complete(QEMUFile *f, void *opaque)
>  }
>  
>  static void dirty_bitmap_state_pending(void *opaque, MigPendingData *data,
> -                                       bool exact)
> +                                       bool exact, bool final)
>  {
>      DBMSaveState *s = &((DBMState *)opaque)->save;
>      SaveBitmapState *dbms;
>      uint64_t pending = 0;
>  
> +    if (final) {
> +        return;
> +    }
> +
>      bql_lock();
>  
>      QSIMPLEQ_FOREACH(dbms, &s->dbms_list, entry) {
> diff --git a/migration/migration.c b/migration/migration.c
> index 7488a94206..df8ed3a9fd 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2787,12 +2787,15 @@ static bool migration_switchover_prepare(MigrationState *s)
>  static bool migration_switchover_start(MigrationState *s, Error **errp)
>  {
>      ERRP_GUARD();
> +    MigPendingData pending = {};
>  
>      if (!migration_switchover_prepare(s)) {
>          error_setg(errp, "Switchover is interrupted");
>          return false;
>      }
>  
> +    qemu_savevm_query_pending_final(&pending);

Not needed for postcopy, we can move it to migration_completion_precopy().

I'd also appreciate a short comment explaining why the final query is
needed.

> +
>      /* Inactivate disks except in COLO */
>      if (!migrate_colo()) {
>          /*
> @@ -3285,7 +3288,7 @@ static void migration_iteration_go_next(MigPendingData *pending)
>      /*
>       * Do a slow sync first before boosting the iteration count.
>       */
> -    qemu_savevm_query_pending(pending, true);
> +    qemu_savevm_query_pending_iter(pending, true);
>  
>      /*
>       * Update the dirty information for the whole system for this
> @@ -3336,7 +3339,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>      bool complete_ready;
>  
>      /* Fast path - get the estimated amount of pending data */
> -    qemu_savevm_query_pending(&pending, false);
> +    qemu_savevm_query_pending_iter(&pending, false);
>  
>      if (in_postcopy) {
>          /*
> diff --git a/migration/ram.c b/migration/ram.c
> index fc38ffbf8a..4e3f442593 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3376,10 +3376,6 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>      rs->last_stage = !migration_in_colo_state();
>  
>      WITH_RCU_READ_LOCK_GUARD() {
> -        if (!migration_in_postcopy()) {
> -            migration_bitmap_sync_precopy(true);
> -        }
> -
>          ret = rdma_registration_start(f, RAM_CONTROL_FINISH);
>          if (ret < 0) {
>              qemu_file_set_error(f, ret);
> @@ -3442,25 +3438,38 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>      return qemu_fflush(f);
>  }
>  
> -static void ram_state_pending(void *opaque, MigPendingData *pending,
> -                              bool exact)
> +static void ram_state_pending_sync(bool exact, bool final)
>  {
> -    RAMState **temp = opaque;
> -    RAMState *rs = *temp;
> -    uint64_t remaining_size;
> -
>      /*
>       * Sync is not needed either with: (1) a fast query, or (2) after
>       * postcopy has started (no new dirty will generate anymore).
>       */
> -    if (exact && !migration_in_postcopy()) {
> +    if (!exact || migration_in_postcopy()) {
> +        return;
> +    }
> +
> +    /* Final pending query is called with BQL locked */
> +    if (!final) {
>          bql_lock();
> -        WITH_RCU_READ_LOCK_GUARD() {
> -            migration_bitmap_sync_precopy(false);
> -        }
> +    }
> +
> +    WITH_RCU_READ_LOCK_GUARD() {
> +        migration_bitmap_sync_precopy(final);
> +    }
> +
> +    if (!final) {
>          bql_unlock();
>      }
> +}
> +
> +static void ram_state_pending(void *opaque, MigPendingData *pending,
> +                              bool exact, bool final)
> +{
> +    RAMState **temp = opaque;
> +    RAMState *rs = *temp;
> +    uint64_t remaining_size;
>  
> +    ram_state_pending_sync(exact, final);
>      remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
>  
>      if (migrate_postcopy_ram()) {
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 23adaf9dd9..22cb6a15c6 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1795,7 +1795,8 @@ int qemu_savevm_state_complete_precopy(MigrationState *s)
>      return qemu_fflush(f);
>  }
>  
> -void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
> +static void qemu_savevm_query_pending(MigPendingData *pending, bool exact,
> +                                      bool final)
>  {
>      SaveStateEntry *se;
>  
> @@ -1808,7 +1809,7 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
>          if (!qemu_savevm_state_active(se)) {
>              continue;
>          }
> -        se->ops->save_query_pending(se->opaque, pending, exact);
> +        se->ops->save_query_pending(se->opaque, pending, exact, final);
>      }
>  
>      pending->total_bytes = pending->precopy_bytes +
> @@ -1817,16 +1818,30 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
>      /*
>       * Update system remaining dirty bytes whenever QEMU queries.  It will
>       * make the value to be not as accurate, but should still be pretty
> -     * close to reality when this got invoked frequently while iterating.
> +     * close to reality when this got invoked frequently while iterating.  Don't
> +     * update in final query as some modules may skip it if not needed.

IMHO this specialty isn't needed.  We can just make all modules report and
making sure we don't deadlock on bql.  AFAIU, CMMA never takes BQL, while
dirty-bitmap should conditionally take it now.

>       */
> -    mig_stats.dirty_bytes_total = pending->total_bytes;
> -
> -    trace_qemu_savevm_query_pending(exact, pending->precopy_bytes,
> +    if (!final) {
> +        mig_stats.dirty_bytes_total = pending->total_bytes;
> +    }

Then we remove this too which is counter-intuitive..

The rest looks all reasonable.

Thanks,

> +    trace_qemu_savevm_query_pending(exact, final, pending->precopy_bytes,
>                                      pending->stopcopy_bytes,
>                                      pending->postcopy_bytes,
>                                      pending->total_bytes);
>  }
>  
> +void qemu_savevm_query_pending_iter(MigPendingData *pending, bool exact)
> +{
> +    qemu_savevm_query_pending(pending, exact, false);
> +}
> +
> +void qemu_savevm_query_pending_final(MigPendingData *pending)
> +{
> +    g_assert(bql_locked());
> +
> +    qemu_savevm_query_pending(pending, true, true);
> +}
> +
>  void qemu_savevm_state_cleanup(void)
>  {
>      SaveStateEntry *se;
> diff --git a/migration/trace-events b/migration/trace-events
> index de99d976ab..1c9212d3e2 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -7,7 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
>  qemu_loadvm_state_post_main(int ret) "%d"
>  qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
>  qemu_savevm_send_packaged(void) ""
> -qemu_savevm_query_pending(bool exact, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy, uint64_t total) "exact=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64", total=%"PRIu64
> +qemu_savevm_query_pending(bool exact, bool final, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy, uint64_t total) "exact=%d, final=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64", total=%"PRIu64
>  loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
>  loadvm_state_setup(void) ""
>  loadvm_state_cleanup(void) ""
> -- 
> 2.40.1
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 05/13] migration: Rename switchover-ack code to legacy
  2026-06-02  9:26 ` [PATCH v2 05/13] migration: Rename switchover-ack code to legacy Avihai Horon
@ 2026-06-03 20:21   ` Peter Xu
  0 siblings, 0 replies; 38+ messages in thread
From: Peter Xu @ 2026-06-03 20:21 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb

On Tue, Jun 02, 2026 at 12:26:13PM +0300, Avihai Horon wrote:
> A new switchover-ack mechanism will be added in the following patches.
> However, the old mechanism must still be kept for backward
> compatibility.
> 
> Rename existing code that will be used only for old switchover-ack
> mechanism as legacy. This will help to distinguish legacy code from new
> code and make it more readable and easier for removal later when no
> longer needed.
> 
> No functional change intended.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 07/13] migration: Fail migration if switchover-ack is requested after switchover decision
  2026-06-02  9:26 ` [PATCH v2 07/13] migration: Fail migration if switchover-ack is requested after switchover decision Avihai Horon
@ 2026-06-03 20:42   ` Peter Xu
  2026-06-04 15:36     ` Avihai Horon
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Xu @ 2026-06-03 20:42 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb

On Tue, Jun 02, 2026 at 12:26:15PM +0300, Avihai Horon wrote:
> Switchover ACK is checked only during precopy while the guest is still
> running. The last migration_can_switchover() decision and guest stop are
> not atomic, so a device may want to request another switchover ACK in
> the gap after switchover decision has been made but before the guest is
> stopped. Migration would then miss that request, which can increase
> downtime.
> 
> Cover this case by failing the migration if a switchover-ack was
> requested during that time.
> 
> Ideally, precopy iterations should be resumed in this case, however,
> VFIO doesn't support going back to precopy after being stopped, so
> implementing such logic would require non-trivial changes to the guest
> start/stop flow. Given the above and that this case should be rare,
> failing the migration seems reasonable.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

One nit:

> ---
>  migration/migration.c | 30 ++++++++++++++++++++++++++++++
>  1 file changed, 30 insertions(+)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 4bb649a467..6ee1c795ff 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2810,6 +2810,27 @@ static bool migration_switchover_prepare(MigrationState *s)
>      return s->state == MIGRATION_STATUS_DEVICE;
>  }
>  
> +static bool migration_switchover_check_switchover_ack_pending(MigrationState *s,

The name is slightly confusing.  It says "check if there is pending ack"
but then it returns true if there's no pending ACK..

Maybe migration_switchover_is_acknowledged()?

> +                                                              Error **errp)
> +{
> +    uint32_t pending_num;
> +
> +    if (!migrate_switchover_ack() || migrate_switchover_ack_legacy()) {
> +        return true;
> +    }
> +
> +    pending_num = qatomic_read(&s->switchover_ack_pending_num);
> +    if (pending_num > 0) {
> +        error_setg(errp,
> +                   "Switchover ACK was requested by %" PRIu32
> +                   " devices during switchover",
> +                   pending_num);
> +        return false;
> +    }
> +
> +    return true;
> +}
> +
>  static bool migration_switchover_start(MigrationState *s, Error **errp)
>  {
>      ERRP_GUARD();
> @@ -2822,6 +2843,15 @@ static bool migration_switchover_start(MigrationState *s, Error **errp)
>  
>      qemu_savevm_query_pending_final(s, &pending);
>  
> +    /*
> +     * Switchover-ack requests done after switchover decision, are not allowed.
> +     * Fail the migration in this case since we currently don't support going
> +     * back to precopy.
> +     */
> +    if (!migration_switchover_check_switchover_ack_pending(s, errp)) {
> +        return false;
> +    }
> +
>      /* Inactivate disks except in COLO */
>      if (!migrate_colo()) {
>          /*
> -- 
> 2.40.1
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 08/13] vfio/migration: Extract VFIO_MIG_FLAG_DEV_INIT_DATA_SENT sending to helper
  2026-06-02  9:26 ` [PATCH v2 08/13] vfio/migration: Extract VFIO_MIG_FLAG_DEV_INIT_DATA_SENT sending to helper Avihai Horon
@ 2026-06-03 20:43   ` Peter Xu
  0 siblings, 0 replies; 38+ messages in thread
From: Peter Xu @ 2026-06-03 20:43 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb

On Tue, Jun 02, 2026 at 12:26:16PM +0300, Avihai Horon wrote:
> Extract the VFIO_MIG_FLAG_DEV_INIT_DATA_SENT flag sending logic from
> vfio_save_iterate() into vfio_send_init_data_flag() for clarity. Also
> add a trace while at it.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 10/13] vfio/migration: Add new switchover-ack mechanism
  2026-06-02  9:26 ` [PATCH v2 10/13] vfio/migration: Add new switchover-ack mechanism Avihai Horon
@ 2026-06-03 20:47   ` Peter Xu
  0 siblings, 0 replies; 38+ messages in thread
From: Peter Xu @ 2026-06-03 20:47 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb

On Tue, Jun 02, 2026 at 12:26:18PM +0300, Avihai Horon wrote:
> Add support for the new switchover-ack mechanism. This includes
> requesting a switchover ACK on the first save_query_pending call (with
> exact=false) if VFIO precopy is supported.
> 
> This achieves the same functionality of legacy switchover-ack but with
> the new switchover-ack mechanism.
> 
> Keep legacy switchover-ack functionality for backward compatibility.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 11/13] vfio/migration: Implement VFIO_PRECOPY_INFO_REINIT feature
  2026-06-02  9:26 ` [PATCH v2 11/13] vfio/migration: Implement VFIO_PRECOPY_INFO_REINIT feature Avihai Horon
@ 2026-06-03 20:49   ` Peter Xu
  0 siblings, 0 replies; 38+ messages in thread
From: Peter Xu @ 2026-06-03 20:49 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb

On Tue, Jun 02, 2026 at 12:26:19PM +0300, Avihai Horon wrote:
> According to VFIO uAPI, precopy initial_bytes is considered as critical
> data that should be transferred and loaded prior to moving to STOP_COPY
> state to ensure precopy phase would be effective.
> 
> As currently defined, initial_bytes can only decrease as it's being read
> from the data fd. However, there are cases where a new chunk of
> initial_bytes should be transferred during precopy.
> 
> The new VFIO_PRECOPY_INFO_REINIT feature addresses this and allows
> reporting a new value for initial_bytes regardless of any previously
> reported values.
> 
> Implement VFIO_PRECOPY_INFO_REINIT feature:
> 1. Opt-in for VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2 to make
>    VFIO_PRECOPY_INFO_REINIT available.
> 2. Request a new switchover ACK if initial_bytes increases post of a
>    previous switchover ACK. This ensures the device is not moved to
>    STOP_COPY before initial_bytes has reached zero again.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 02/13] migration: Run final save_query_pending at switchover
  2026-06-03 20:05   ` Peter Xu
@ 2026-06-03 21:04     ` Peter Xu
  2026-06-04 15:29       ` Avihai Horon
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Xu @ 2026-06-03 21:04 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb

On Wed, Jun 03, 2026 at 04:05:49PM -0400, Peter Xu wrote:
> On Tue, Jun 02, 2026 at 12:26:10PM +0300, Avihai Horon wrote:
> > Before switchover, the source needs one last exact pending query so
> > modules can flush dirty state. This is currently done ad hoc in modules
> > handlers. For example, RAM syncs its dirty bitmap in its save_complete
> > handler.
> > 
> > This should be a general concept relevant for any module, so extract it
> > to migration core instead by running a final save_query_pending before
> > switchover.
> > 
> > The final query requires special handling by modules (e.g., it's called
> > with BQL locked, during VM stop), so extend save_query_pending
> > SaveVMHandlers callback and qemu_savevm_query_pending() with a "final"
> > flag so migration modules can tell the last pending query during
> > switchover from periodic iteration queries.
> > 
> > Not all modules need the final query; skip it for those who don't need.
> > 
> > Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> > ---
> >  include/migration/register.h   | 41 ++++++++++++++++++----------------
> >  migration/savevm.h             |  3 ++-
> >  hw/s390x/s390-stattrib.c       |  9 ++++++--
> >  hw/vfio/migration.c            |  6 ++++-
> >  migration/block-dirty-bitmap.c |  6 ++++-
> >  migration/migration.c          |  7 ++++--
> >  migration/ram.c                | 37 ++++++++++++++++++------------
> >  migration/savevm.c             | 27 +++++++++++++++++-----
> >  migration/trace-events         |  2 +-
> >  9 files changed, 91 insertions(+), 47 deletions(-)
> > 
> > diff --git a/include/migration/register.h b/include/migration/register.h
> > index 5e5e0ee432..6f632123f1 100644
> > --- a/include/migration/register.h
> > +++ b/include/migration/register.h
> > @@ -171,6 +171,28 @@ typedef struct SaveVMHandlers {
> >       */
> >      bool (*is_active_iterate)(void *opaque);
> >  
> > +    /**
> > +     * @save_query_pending
> > +     *
> > +     * This estimates the remaining data to transfer on the source side.
> > +     *
> > +     * When @exact is true, a module must report accurate results.  When
> > +     * @exact is false, a module may report estimates.
> > +     *
> > +     * It's highly recommended that modules implement a faster version of
> > +     * the query path (for example, by proper caching on the counters) if
> > +     * an accurate query will be time-consuming.
> > +     *
> > +     * @opaque: data pointer passed to register_savevm_live()
> > +     * @pending: pointer to a MigPendingData struct
> > +     * @exact: set to true for an accurate (slow) query
> > +     * @final: set to true for the final query during switchover. When final is
> > +     * true, the query is called with BQL locked. Otherwise, it's called with
> > +     * BQL unlocked.
> > +     */
> > +    void (*save_query_pending)(void *opaque, MigPendingData *pending,
> > +                               bool exact, bool final);
> > +
> >      /* This runs outside the BQL in the migration case, and
> >       * within the lock in the savevm case.  The callback had better only
> >       * use data that is local to the migration thread or protected
> > @@ -210,25 +232,6 @@ typedef struct SaveVMHandlers {
> >       */
> >      bool (*save_postcopy_prepare)(QEMUFile *f, void *opaque, Error **errp);
> >  
> > -    /**
> > -     * @save_query_pending
> > -     *
> > -     * This estimates the remaining data to transfer on the source side.
> > -     *
> > -     * When @exact is true, a module must report accurate results.  When
> > -     * @exact is false, a module may report estimates.
> > -     *
> > -     * It's highly recommended that modules implement a faster version of
> > -     * the query path (for example, by proper caching on the counters) if
> > -     * an accurate query will be time-consuming.
> > -     *
> > -     * @opaque: data pointer passed to register_savevm_live()
> > -     * @pending: pointer to a MigPendingData struct
> > -     * @exact: set to true for an accurate (slow) query
> > -     */
> > -    void (*save_query_pending)(void *opaque, MigPendingData *pending,
> > -                               bool exact);
> > -
> >      /**
> >       * @load_state
> >       *
> > diff --git a/migration/savevm.h b/migration/savevm.h
> > index 96fdf96d4e..10b3d78a5f 100644
> > --- a/migration/savevm.h
> > +++ b/migration/savevm.h
> > @@ -45,7 +45,8 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
> >  void qemu_savevm_state_cleanup(void);
> >  void qemu_savevm_state_complete_postcopy(QEMUFile *f);
> >  int qemu_savevm_state_complete_precopy(MigrationState *s);
> > -void qemu_savevm_query_pending(MigPendingData *pending, bool exact);
> > +void qemu_savevm_query_pending_iter(MigPendingData *pending, bool exact);
> > +void qemu_savevm_query_pending_final(MigPendingData *pending);
> >  int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy);
> >  bool qemu_savevm_state_postcopy_prepare(QEMUFile *f, Error **errp);
> >  void qemu_savevm_state_end(QEMUFile *f);
> > diff --git a/hw/s390x/s390-stattrib.c b/hw/s390x/s390-stattrib.c
> > index c334714b31..aed0c6844d 100644
> > --- a/hw/s390x/s390-stattrib.c
> > +++ b/hw/s390x/s390-stattrib.c
> > @@ -190,12 +190,17 @@ static int cmma_save_setup(QEMUFile *f, void *opaque, Error **errp)
> >  }
> >  
> >  static void cmma_state_pending(void *opaque, MigPendingData *pending,
> > -                               bool exact)
> > +                               bool exact, bool final)
> >  {
> >      S390StAttribState *sas = S390_STATTRIB(opaque);
> >      S390StAttribClass *sac = S390_STATTRIB_GET_CLASS(sas);
> > -    long long res = sac->get_dirtycount(sas);
> > +    long long res;
> >  
> > +    if (final) {
> > +        return;
> > +    }
> > +
> > +    res = sac->get_dirtycount(sas);
> >      if (res >= 0) {
> >          pending->precopy_bytes += res;
> >      }
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index fb12b9717f..95072d6664 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -622,12 +622,16 @@ static void vfio_state_pending_sync(VFIODevice *vbasedev)
> >  }
> >  
> >  static void vfio_state_pending(void *opaque, MigPendingData *pending,
> > -                               bool exact)
> > +                               bool exact, bool final)
> >  {
> >      VFIODevice *vbasedev = opaque;
> >      VFIOMigration *migration = vbasedev->migration;
> >      uint64_t precopy_size, stopcopy_size;
> >  
> > +    if (final) {
> > +        return;
> > +    }
> > +
> >      if (exact) {
> >          vfio_state_pending_sync(vbasedev);
> >      }
> > diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
> > index 7ef3759e53..66ed91de03 100644
> > --- a/migration/block-dirty-bitmap.c
> > +++ b/migration/block-dirty-bitmap.c
> > @@ -767,12 +767,16 @@ static int dirty_bitmap_save_complete(QEMUFile *f, void *opaque)
> >  }
> >  
> >  static void dirty_bitmap_state_pending(void *opaque, MigPendingData *data,
> > -                                       bool exact)
> > +                                       bool exact, bool final)
> >  {
> >      DBMSaveState *s = &((DBMState *)opaque)->save;
> >      SaveBitmapState *dbms;
> >      uint64_t pending = 0;
> >  
> > +    if (final) {
> > +        return;
> > +    }
> > +
> >      bql_lock();
> >  
> >      QSIMPLEQ_FOREACH(dbms, &s->dbms_list, entry) {
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 7488a94206..df8ed3a9fd 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -2787,12 +2787,15 @@ static bool migration_switchover_prepare(MigrationState *s)
> >  static bool migration_switchover_start(MigrationState *s, Error **errp)
> >  {
> >      ERRP_GUARD();
> > +    MigPendingData pending = {};
> >  
> >      if (!migration_switchover_prepare(s)) {
> >          error_setg(errp, "Switchover is interrupted");
> >          return false;
> >      }
> >  
> > +    qemu_savevm_query_pending_final(&pending);
> 
> Not needed for postcopy, we can move it to migration_completion_precopy().

A better idea is.. we can drop migration_bitmap_sync() in
ram_postcopy_send_discard_bitmap() too, then we can keep this one.

But then, let's properly document this, both for precopy and postcopy
cases.  One example:

    /*
     * The final query to the whole system on dirty data to make sure we
     * collect the latest status of the VM.  For precopy, source QEMU will
     * dump all the dirty data during switchover.  For postcopy, this will
     * properly update all the dirty bitmaps to finally generate the
     * correct discard bitmaps; see ram_postcopy_send_discard_bitmap().
     */

> 
> I'd also appreciate a short comment explaining why the final query is
> needed.
> 
> > +
> >      /* Inactivate disks except in COLO */
> >      if (!migrate_colo()) {
> >          /*
> > @@ -3285,7 +3288,7 @@ static void migration_iteration_go_next(MigPendingData *pending)
> >      /*
> >       * Do a slow sync first before boosting the iteration count.
> >       */
> > -    qemu_savevm_query_pending(pending, true);
> > +    qemu_savevm_query_pending_iter(pending, true);
> >  
> >      /*
> >       * Update the dirty information for the whole system for this
> > @@ -3336,7 +3339,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> >      bool complete_ready;
> >  
> >      /* Fast path - get the estimated amount of pending data */
> > -    qemu_savevm_query_pending(&pending, false);
> > +    qemu_savevm_query_pending_iter(&pending, false);
> >  
> >      if (in_postcopy) {
> >          /*
> > diff --git a/migration/ram.c b/migration/ram.c
> > index fc38ffbf8a..4e3f442593 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -3376,10 +3376,6 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
> >      rs->last_stage = !migration_in_colo_state();
> >  
> >      WITH_RCU_READ_LOCK_GUARD() {
> > -        if (!migration_in_postcopy()) {
> > -            migration_bitmap_sync_precopy(true);
> > -        }
> > -
> >          ret = rdma_registration_start(f, RAM_CONTROL_FINISH);
> >          if (ret < 0) {
> >              qemu_file_set_error(f, ret);
> > @@ -3442,25 +3438,38 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
> >      return qemu_fflush(f);
> >  }
> >  
> > -static void ram_state_pending(void *opaque, MigPendingData *pending,
> > -                              bool exact)
> > +static void ram_state_pending_sync(bool exact, bool final)
> >  {
> > -    RAMState **temp = opaque;
> > -    RAMState *rs = *temp;
> > -    uint64_t remaining_size;
> > -
> >      /*
> >       * Sync is not needed either with: (1) a fast query, or (2) after
> >       * postcopy has started (no new dirty will generate anymore).
> >       */
> > -    if (exact && !migration_in_postcopy()) {
> > +    if (!exact || migration_in_postcopy()) {
> > +        return;
> > +    }
> > +
> > +    /* Final pending query is called with BQL locked */
> > +    if (!final) {
> >          bql_lock();
> > -        WITH_RCU_READ_LOCK_GUARD() {
> > -            migration_bitmap_sync_precopy(false);
> > -        }
> > +    }
> > +
> > +    WITH_RCU_READ_LOCK_GUARD() {
> > +        migration_bitmap_sync_precopy(final);
> > +    }
> > +
> > +    if (!final) {
> >          bql_unlock();
> >      }
> > +}
> > +
> > +static void ram_state_pending(void *opaque, MigPendingData *pending,
> > +                              bool exact, bool final)
> > +{
> > +    RAMState **temp = opaque;
> > +    RAMState *rs = *temp;
> > +    uint64_t remaining_size;
> >  
> > +    ram_state_pending_sync(exact, final);
> >      remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
> >  
> >      if (migrate_postcopy_ram()) {
> > diff --git a/migration/savevm.c b/migration/savevm.c
> > index 23adaf9dd9..22cb6a15c6 100644
> > --- a/migration/savevm.c
> > +++ b/migration/savevm.c
> > @@ -1795,7 +1795,8 @@ int qemu_savevm_state_complete_precopy(MigrationState *s)
> >      return qemu_fflush(f);
> >  }
> >  
> > -void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
> > +static void qemu_savevm_query_pending(MigPendingData *pending, bool exact,
> > +                                      bool final)
> >  {
> >      SaveStateEntry *se;
> >  
> > @@ -1808,7 +1809,7 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
> >          if (!qemu_savevm_state_active(se)) {
> >              continue;
> >          }
> > -        se->ops->save_query_pending(se->opaque, pending, exact);
> > +        se->ops->save_query_pending(se->opaque, pending, exact, final);
> >      }
> >  
> >      pending->total_bytes = pending->precopy_bytes +
> > @@ -1817,16 +1818,30 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
> >      /*
> >       * Update system remaining dirty bytes whenever QEMU queries.  It will
> >       * make the value to be not as accurate, but should still be pretty
> > -     * close to reality when this got invoked frequently while iterating.
> > +     * close to reality when this got invoked frequently while iterating.  Don't
> > +     * update in final query as some modules may skip it if not needed.
> 
> IMHO this specialty isn't needed.  We can just make all modules report and
> making sure we don't deadlock on bql.  AFAIU, CMMA never takes BQL, while
> dirty-bitmap should conditionally take it now.
> 
> >       */
> > -    mig_stats.dirty_bytes_total = pending->total_bytes;
> > -
> > -    trace_qemu_savevm_query_pending(exact, pending->precopy_bytes,
> > +    if (!final) {
> > +        mig_stats.dirty_bytes_total = pending->total_bytes;
> > +    }
> 
> Then we remove this too which is counter-intuitive..
> 
> The rest looks all reasonable.
> 
> Thanks,
> 
> > +    trace_qemu_savevm_query_pending(exact, final, pending->precopy_bytes,
> >                                      pending->stopcopy_bytes,
> >                                      pending->postcopy_bytes,
> >                                      pending->total_bytes);
> >  }
> >  
> > +void qemu_savevm_query_pending_iter(MigPendingData *pending, bool exact)
> > +{
> > +    qemu_savevm_query_pending(pending, exact, false);
> > +}
> > +
> > +void qemu_savevm_query_pending_final(MigPendingData *pending)
> > +{
> > +    g_assert(bql_locked());
> > +
> > +    qemu_savevm_query_pending(pending, true, true);
> > +}
> > +
> >  void qemu_savevm_state_cleanup(void)
> >  {
> >      SaveStateEntry *se;
> > diff --git a/migration/trace-events b/migration/trace-events
> > index de99d976ab..1c9212d3e2 100644
> > --- a/migration/trace-events
> > +++ b/migration/trace-events
> > @@ -7,7 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
> >  qemu_loadvm_state_post_main(int ret) "%d"
> >  qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
> >  qemu_savevm_send_packaged(void) ""
> > -qemu_savevm_query_pending(bool exact, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy, uint64_t total) "exact=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64", total=%"PRIu64
> > +qemu_savevm_query_pending(bool exact, bool final, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy, uint64_t total) "exact=%d, final=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64", total=%"PRIu64
> >  loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
> >  loadvm_state_setup(void) ""
> >  loadvm_state_cleanup(void) ""
> > -- 
> > 2.40.1
> > 
> 
> -- 
> Peter Xu

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 06/13] migration: Make switchover-ack re-usable
  2026-06-03  7:17   ` Markus Armbruster
@ 2026-06-04 15:05     ` Avihai Horon
  2026-06-08 10:23       ` Markus Armbruster
  0 siblings, 1 reply; 38+ messages in thread
From: Avihai Horon @ 2026-06-04 15:05 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Peter Xu,
	Fabiano Rosas, Pierrick Bouvier, Philippe Mathieu-Daudé,
	Zhao Liu, Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Maor Gottlieb


On 6/3/2026 10:17 AM, Markus Armbruster wrote:
> External email: Use caution opening links or attachments
>
>
> Avihai Horon <avihaih@nvidia.com> writes:
>
>> Switchover-ack is a mechanism to synchronize between source and
>> destination QEMU during migration to prevent the source from switching
>> over prematurely.
>>
>> VFIO uses switchover-ack to ensure switchover happens only after
>> destination side has loaded the precopy initial bytes. This is important
>> for VFIO, as otherwise downtime could be impacted and be higher.
>>
>> In its current state, switchover-ack is a one-time mechanism, meaning
>> that switchover is acked only once and past that another ACK cannot be
>> requested again. This was sufficient until now, as VFIO precopy initial
>> bytes was defined to be monotonically decreasing. Thus, when precopy
>> initial bytes reached zero for all VFIO devices, a single ACK would be
>> sent and its validity would hold.
>>
>> However, now the new VFIO_PRECOPY_INFO_REINIT feature allows precopy
>> initial bytes to be re-initialized during precopy. Specifically, it
>> means that initial bytes can grow after reaching zero, which would
>> invalidate a previously sent switchover ACK.
>>
>> To solve this, make switchover-ack reusable and allow devices to request
>> switchover ACKs when needed via the save_query_pending SaveVMHandler.
>>
>> Since now switchover ACK can be requested for a specific device and in
>> different times, make switchover ACK per-device (instead of a single ACK
>> for all devices) and let source side do the pending ACKs accounting.
>>
>> Keep the legacy switchover-ack mechanism for backward compatibility and
>> turn it on by a compatibility property for older machines. Enable the
>> property until VFIO implements the new switchover-ack.
> I figure this is about the transmission of ACKs via the return path.  If
> both source and destination understand the new per-device ACK, they use
> it.  Else, they use old one.  Correct?

Correct.

>
> This is not visible to the management application.  It may use migration
> capability @switchover-ack to enable just as before.  Correct?

Correct.

>
> Can you think of a reason why management applications might need to know
> whether a certain qemu-system-FOO supports per-device ACKs?

Not that I can think of. It should be an implementation detail internal 
to QEMU.

>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   qapi/migration.json                | 16 ++++-----
>>   include/migration/client-options.h |  1 +
>>   include/migration/register.h       |  2 ++
>>   migration/migration.h              | 32 ++++++++++++++++--
>>   migration/savevm.h                 |  6 ++--
>>   hw/core/machine.c                  |  1 +
>>   migration/migration.c              | 37 ++++++++++++++-------
>>   migration/options.c                | 10 ++++++
>>   migration/savevm.c                 | 53 +++++++++++++++++++++++-------
>>   migration/trace-events             |  5 +--
>>   10 files changed, 125 insertions(+), 38 deletions(-)
>>
>> diff --git a/qapi/migration.json b/qapi/migration.json
>> index 27a7970556..550cb77762 100644
>> --- a/qapi/migration.json
>> +++ b/qapi/migration.json
>> @@ -507,15 +507,13 @@
>>   #     and should not affect the correctness of postcopy migration.
>>   #     (since 7.1)
>>   #
>> -# @switchover-ack: If enabled, migration will not stop the source VM
>> -#     and complete the migration until an ACK is received from the
>> -#     destination that it's OK to do so.  Exactly when this ACK is
>> -#     sent depends on the migrated devices that use this feature.  For
>> -#     example, a device can use it to make sure some of its data is
>> -#     sent and loaded in the destination before doing switchover.
>> -#     This can reduce downtime if devices that support this capability
>> -#     are present.  'return-path' capability must be enabled to use
>> -#     it.  (since 8.1)
>> +# @switchover-ack: If enabled, migration will rely on destination side
>> +#     to acknowledge the source's switchover decision.  The
>> +#     acknowledgement may depend, for example, on some device's data
>> +#     being loaded in the destination before doing switchover.  This
>> +#     can reduce downtime if devices that support this capability are
>> +#     present.  'return-path' capability must be enabled to use it.
> Please use the opportunity to fix markup: @return-path.
> docs/devel/qapi-code-gen.rst:
>
>      Use @foo to reference a member description within the current
>      definition.  This is an rST extension.  It is currently rendered the
>      same way as ````foo````, but carries additional meaning.
>
> Suggest "Capability @return-path must be ..."

Sure, will change it.

>
> The old text is concrete: "will not stop the source VM ... until".  The
> new text is vague "will rely on destination side to acknowledge".  What
> does that mean exactly?  How does it impact behavior the user /
> management application cares about?

How about the suggestion below? It's both concrete and doesn't mention 
the number of ACKs (as requested by Peter).

# @switchover-ack: If enabled, migration will not stop the source VM
#     and complete the migration until the destination has acknowledged
#     that switchover is safe.  The acknowledgement may depend...

Thanks.

>
>> +#     (since 8.1)
>>   #
>>   # @dirty-limit: If enabled, migration will throttle vCPUs as needed to
>>   #     keep their dirty page rate within @vcpu-dirty-limit.  This can
> [...]
>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 09/13] vfio/migration: Add Error ** parameter to vfio_migration_init()
  2026-06-03 11:38   ` Philippe Mathieu-Daudé
@ 2026-06-04 15:06     ` Avihai Horon
  0 siblings, 0 replies; 38+ messages in thread
From: Avihai Horon @ 2026-06-04 15:06 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Peter Xu, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb


On 6/3/2026 2:38 PM, Philippe Mathieu-Daudé wrote:
> External email: Use caution opening links or attachments
>
>
> Hi Avihai,
>
> On 2/6/26 11:26, Avihai Horon wrote:
>> vfio_migration_init() already has many failure points and a new one will
>> be added in next patch.
>>
>> Add Error ** parameter to vfio_migration_init() to report a detailed
>> error message through it.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   hw/vfio/migration.c | 25 +++++++++++++------------
>>   1 file changed, 13 insertions(+), 12 deletions(-)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index a15e877fd3..02ef216712 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -1055,7 +1055,7 @@ static bool 
>> vfio_dma_logging_supported(VFIODevice *vbasedev)
>>       return !ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature);
>>   }
>>
>> -static int vfio_migration_init(VFIODevice *vbasedev)
>> +static int vfio_migration_init(VFIODevice *vbasedev, Error **errp)
>>   {
>>       int ret;
>>       Object *obj;
>> @@ -1066,21 +1066,31 @@ static int vfio_migration_init(VFIODevice 
>> *vbasedev)
>>       VMChangeStateHandler *prepare_cb;
>>
>>       if (!vbasedev->ops->vfio_get_object) {
>> +        error_setg(errp, "no vfio_get_object handler");
>>           return -EINVAL;
>>       }
>>
>>       obj = vbasedev->ops->vfio_get_object(vbasedev);
>>       if (!obj) {
>> +        error_setg(errp, "failed to get object");
>>           return -EINVAL;
>>       }
>>
>>       ret = vfio_migration_query_flags(vbasedev, &mig_flags);
>>       if (ret) {
>> +        if (ret == -ENOTTY) {
>> +            error_setg_errno(errp, -ret,
>> +                             "migration is not supported in kernel");
>> +        } else {
>> +            error_setg_errno(errp, -ret, "failed to query migration 
>> flags");
>> +        }
>> +
>>           return ret;
>>       }
>>
>>       /* Basic migration functionality must be supported */
>>       if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) {
>> +        error_setg(errp, "VFIO_MIGRATION_STOP_COPY is not supported");
>>           return -EOPNOTSUPP;
>>       }
>>
>> @@ -1278,18 +1288,9 @@ bool vfio_migration_realize(VFIODevice 
>> *vbasedev, Error **errp)
>>           return !vfio_block_migration(vbasedev, err, errp);
>>       }
>>
>> -    ret = vfio_migration_init(vbasedev);
>> +    ret = vfio_migration_init(vbasedev, &err);
>>       if (ret) {
>
> Only 'ret' boolean part is used now, have the method return a boolean
> instead, like other methods with Error **errp last param?

Sure, I will change it.

Thanks.

>
>> -        if (ret == -ENOTTY) {
>> -            error_setg(&err, "%s: VFIO migration is not supported in 
>> kernel",
>> -                       vbasedev->name);
>> -        } else {
>> -            error_setg(&err,
>> -                       "%s: Migration couldn't be initialized for 
>> VFIO device, "
>> -                       "err: %d (%s)",
>> -                       vbasedev->name, ret, strerror(-ret));
>> -        }
>> -
>> +        error_prepend(&err, "%s: VFIO migration init failed: ", 
>> vbasedev->name);
>>           return !vfio_block_migration(vbasedev, err, errp);
>>       }
>>
>
>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 01/13] migration: Propagate errors in migration_completion_precopy()
  2026-06-03 19:47   ` Peter Xu
@ 2026-06-04 15:12     ` Avihai Horon
  2026-06-04 15:21       ` Peter Xu
  0 siblings, 1 reply; 38+ messages in thread
From: Avihai Horon @ 2026-06-04 15:12 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb


On 6/3/2026 10:47 PM, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> On Tue, Jun 02, 2026 at 12:26:09PM +0300, Avihai Horon wrote:
>> migration_completion_precopy() doesn't propagate errors to migration
>> core which leads to error information loss. Fix that.
>>
>> This prepares for a follow-up where migration_switchover_start() can
>> fail on switchover-ack and still report a useful error. Errors from
>> qemu_savevm_state_complete_precopy() are not propagated yet as it
>> requires more plumbing.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   migration/migration.c | 11 +++++++----
>>   1 file changed, 7 insertions(+), 4 deletions(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 074d3f2c69..7488a94206 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -2814,7 +2814,7 @@ static bool migration_switchover_start(MigrationState *s, Error **errp)
>>       return true;
>>   }
>>
>> -static int migration_completion_precopy(MigrationState *s)
>> +static int migration_completion_precopy(MigrationState *s, Error **errp)
>>   {
>>       int ret;
>>
>> @@ -2823,11 +2823,12 @@ static int migration_completion_precopy(MigrationState *s)
>>       if (!migrate_mode_is_cpr()) {
>>           ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
>>           if (ret < 0) {
>> +            error_setg_errno(errp, -ret, "Failed to stop the VM");
>>               goto out_unlock;
>>           }
>>       }
>>
>> -    if (!migration_switchover_start(s, NULL)) {
>> +    if (!migration_switchover_start(s, errp)) {
>>           ret = -EFAULT;
>>           goto out_unlock;
>>       }
> IIUC this patch overlooked the follow up call:
>
>      ret = qemu_savevm_state_complete_precopy(s);
>
> We should make sure ret!=0 will always set Error*.  Better pass Error**
> into qemu_savevm_state_complete_precopy() too.

Yes, that was intentional, because this requires more plumbing in 
qemu_savevm_state_complete_precopy() (as mentioned in the commit message).
Although not clean, I thought it was OK to do this one-time exception 
since we explicitly check for local_err in migration_completion().

I can give it a shot and try adding Error **errp to 
qemu_savevm_state_complete_precopy(), but I think it'll be opening a can 
of worms.
Do you have any thoughts?

Thanks.

>
> Thanks,
>
>> @@ -2869,7 +2870,7 @@ static void migration_completion(MigrationState *s)
>>       Error *local_err = NULL;
>>
>>       if (s->state == MIGRATION_STATUS_ACTIVE) {
>> -        ret = migration_completion_precopy(s);
>> +        ret = migration_completion_precopy(s, &local_err);
>>       } else if (s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
>>           migration_completion_postcopy(s);
>>       } else {
>> @@ -2900,7 +2901,9 @@ static void migration_completion(MigrationState *s)
>>       return;
>>
>>   fail:
>> -    if (qemu_file_get_error_obj(s->to_dst_file, &local_err)) {
>> +    if (local_err) {
>> +        migrate_error_propagate(s, local_err);
>> +    } else if (qemu_file_get_error_obj(s->to_dst_file, &local_err)) {
>>           migrate_error_propagate(s, local_err);
>>       } else if (ret) {
>>           error_setg_errno(&local_err, -ret, "Error in migration completion");
>> --
>> 2.40.1
>>
> --
> Peter Xu
>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 01/13] migration: Propagate errors in migration_completion_precopy()
  2026-06-04 15:12     ` Avihai Horon
@ 2026-06-04 15:21       ` Peter Xu
  2026-06-04 15:38         ` Avihai Horon
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Xu @ 2026-06-04 15:21 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb

On Thu, Jun 04, 2026 at 06:12:11PM +0300, Avihai Horon wrote:
> 
> On 6/3/2026 10:47 PM, Peter Xu wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > On Tue, Jun 02, 2026 at 12:26:09PM +0300, Avihai Horon wrote:
> > > migration_completion_precopy() doesn't propagate errors to migration
> > > core which leads to error information loss. Fix that.
> > > 
> > > This prepares for a follow-up where migration_switchover_start() can
> > > fail on switchover-ack and still report a useful error. Errors from
> > > qemu_savevm_state_complete_precopy() are not propagated yet as it
> > > requires more plumbing.
> > > 
> > > Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> > > ---
> > >   migration/migration.c | 11 +++++++----
> > >   1 file changed, 7 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/migration/migration.c b/migration/migration.c
> > > index 074d3f2c69..7488a94206 100644
> > > --- a/migration/migration.c
> > > +++ b/migration/migration.c
> > > @@ -2814,7 +2814,7 @@ static bool migration_switchover_start(MigrationState *s, Error **errp)
> > >       return true;
> > >   }
> > > 
> > > -static int migration_completion_precopy(MigrationState *s)
> > > +static int migration_completion_precopy(MigrationState *s, Error **errp)
> > >   {
> > >       int ret;
> > > 
> > > @@ -2823,11 +2823,12 @@ static int migration_completion_precopy(MigrationState *s)
> > >       if (!migrate_mode_is_cpr()) {
> > >           ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
> > >           if (ret < 0) {
> > > +            error_setg_errno(errp, -ret, "Failed to stop the VM");
> > >               goto out_unlock;
> > >           }
> > >       }
> > > 
> > > -    if (!migration_switchover_start(s, NULL)) {
> > > +    if (!migration_switchover_start(s, errp)) {
> > >           ret = -EFAULT;
> > >           goto out_unlock;
> > >       }
> > IIUC this patch overlooked the follow up call:
> > 
> >      ret = qemu_savevm_state_complete_precopy(s);
> > 
> > We should make sure ret!=0 will always set Error*.  Better pass Error**
> > into qemu_savevm_state_complete_precopy() too.
> 
> Yes, that was intentional, because this requires more plumbing in
> qemu_savevm_state_complete_precopy() (as mentioned in the commit message).
> Although not clean, I thought it was OK to do this one-time exception since
> we explicitly check for local_err in migration_completion().
> 
> I can give it a shot and try adding Error **errp to
> qemu_savevm_state_complete_precopy(), but I think it'll be opening a can of
> worms.
> Do you have any thoughts?

If we want to keep the min change, we can consider setting errp at least
when qemu_savevm_state_complete_precopy() returned non-zero in the path of
migration_completion_precopy(), to make sure errp is always set when error
happened.  Otherwise it's error prone.

Thanks,

> 
> Thanks.
> 
> > 
> > Thanks,
> > 
> > > @@ -2869,7 +2870,7 @@ static void migration_completion(MigrationState *s)
> > >       Error *local_err = NULL;
> > > 
> > >       if (s->state == MIGRATION_STATUS_ACTIVE) {
> > > -        ret = migration_completion_precopy(s);
> > > +        ret = migration_completion_precopy(s, &local_err);
> > >       } else if (s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
> > >           migration_completion_postcopy(s);
> > >       } else {
> > > @@ -2900,7 +2901,9 @@ static void migration_completion(MigrationState *s)
> > >       return;
> > > 
> > >   fail:
> > > -    if (qemu_file_get_error_obj(s->to_dst_file, &local_err)) {
> > > +    if (local_err) {
> > > +        migrate_error_propagate(s, local_err);
> > > +    } else if (qemu_file_get_error_obj(s->to_dst_file, &local_err)) {
> > >           migrate_error_propagate(s, local_err);
> > >       } else if (ret) {
> > >           error_setg_errno(&local_err, -ret, "Error in migration completion");
> > > --
> > > 2.40.1
> > > 
> > --
> > Peter Xu
> > 
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 02/13] migration: Run final save_query_pending at switchover
  2026-06-03 21:04     ` Peter Xu
@ 2026-06-04 15:29       ` Avihai Horon
  2026-06-04 17:18         ` Peter Xu
  0 siblings, 1 reply; 38+ messages in thread
From: Avihai Horon @ 2026-06-04 15:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb


On 6/4/2026 12:04 AM, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> On Wed, Jun 03, 2026 at 04:05:49PM -0400, Peter Xu wrote:
>> On Tue, Jun 02, 2026 at 12:26:10PM +0300, Avihai Horon wrote:
>>> Before switchover, the source needs one last exact pending query so
>>> modules can flush dirty state. This is currently done ad hoc in modules
>>> handlers. For example, RAM syncs its dirty bitmap in its save_complete
>>> handler.
>>>
>>> This should be a general concept relevant for any module, so extract it
>>> to migration core instead by running a final save_query_pending before
>>> switchover.
>>>
>>> The final query requires special handling by modules (e.g., it's called
>>> with BQL locked, during VM stop), so extend save_query_pending
>>> SaveVMHandlers callback and qemu_savevm_query_pending() with a "final"
>>> flag so migration modules can tell the last pending query during
>>> switchover from periodic iteration queries.
>>>
>>> Not all modules need the final query; skip it for those who don't need.
>>>
>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>> ---
>>>   include/migration/register.h   | 41 ++++++++++++++++++----------------
>>>   migration/savevm.h             |  3 ++-
>>>   hw/s390x/s390-stattrib.c       |  9 ++++++--
>>>   hw/vfio/migration.c            |  6 ++++-
>>>   migration/block-dirty-bitmap.c |  6 ++++-
>>>   migration/migration.c          |  7 ++++--
>>>   migration/ram.c                | 37 ++++++++++++++++++------------
>>>   migration/savevm.c             | 27 +++++++++++++++++-----
>>>   migration/trace-events         |  2 +-
>>>   9 files changed, 91 insertions(+), 47 deletions(-)
>>>
>>> diff --git a/include/migration/register.h b/include/migration/register.h
>>> index 5e5e0ee432..6f632123f1 100644
>>> --- a/include/migration/register.h
>>> +++ b/include/migration/register.h
>>> @@ -171,6 +171,28 @@ typedef struct SaveVMHandlers {
>>>        */
>>>       bool (*is_active_iterate)(void *opaque);
>>>
>>> +    /**
>>> +     * @save_query_pending
>>> +     *
>>> +     * This estimates the remaining data to transfer on the source side.
>>> +     *
>>> +     * When @exact is true, a module must report accurate results.  When
>>> +     * @exact is false, a module may report estimates.
>>> +     *
>>> +     * It's highly recommended that modules implement a faster version of
>>> +     * the query path (for example, by proper caching on the counters) if
>>> +     * an accurate query will be time-consuming.
>>> +     *
>>> +     * @opaque: data pointer passed to register_savevm_live()
>>> +     * @pending: pointer to a MigPendingData struct
>>> +     * @exact: set to true for an accurate (slow) query
>>> +     * @final: set to true for the final query during switchover. When final is
>>> +     * true, the query is called with BQL locked. Otherwise, it's called with
>>> +     * BQL unlocked.
>>> +     */
>>> +    void (*save_query_pending)(void *opaque, MigPendingData *pending,
>>> +                               bool exact, bool final);
>>> +
>>>       /* This runs outside the BQL in the migration case, and
>>>        * within the lock in the savevm case.  The callback had better only
>>>        * use data that is local to the migration thread or protected
>>> @@ -210,25 +232,6 @@ typedef struct SaveVMHandlers {
>>>        */
>>>       bool (*save_postcopy_prepare)(QEMUFile *f, void *opaque, Error **errp);
>>>
>>> -    /**
>>> -     * @save_query_pending
>>> -     *
>>> -     * This estimates the remaining data to transfer on the source side.
>>> -     *
>>> -     * When @exact is true, a module must report accurate results.  When
>>> -     * @exact is false, a module may report estimates.
>>> -     *
>>> -     * It's highly recommended that modules implement a faster version of
>>> -     * the query path (for example, by proper caching on the counters) if
>>> -     * an accurate query will be time-consuming.
>>> -     *
>>> -     * @opaque: data pointer passed to register_savevm_live()
>>> -     * @pending: pointer to a MigPendingData struct
>>> -     * @exact: set to true for an accurate (slow) query
>>> -     */
>>> -    void (*save_query_pending)(void *opaque, MigPendingData *pending,
>>> -                               bool exact);
>>> -
>>>       /**
>>>        * @load_state
>>>        *
>>> diff --git a/migration/savevm.h b/migration/savevm.h
>>> index 96fdf96d4e..10b3d78a5f 100644
>>> --- a/migration/savevm.h
>>> +++ b/migration/savevm.h
>>> @@ -45,7 +45,8 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
>>>   void qemu_savevm_state_cleanup(void);
>>>   void qemu_savevm_state_complete_postcopy(QEMUFile *f);
>>>   int qemu_savevm_state_complete_precopy(MigrationState *s);
>>> -void qemu_savevm_query_pending(MigPendingData *pending, bool exact);
>>> +void qemu_savevm_query_pending_iter(MigPendingData *pending, bool exact);
>>> +void qemu_savevm_query_pending_final(MigPendingData *pending);
>>>   int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy);
>>>   bool qemu_savevm_state_postcopy_prepare(QEMUFile *f, Error **errp);
>>>   void qemu_savevm_state_end(QEMUFile *f);
>>> diff --git a/hw/s390x/s390-stattrib.c b/hw/s390x/s390-stattrib.c
>>> index c334714b31..aed0c6844d 100644
>>> --- a/hw/s390x/s390-stattrib.c
>>> +++ b/hw/s390x/s390-stattrib.c
>>> @@ -190,12 +190,17 @@ static int cmma_save_setup(QEMUFile *f, void *opaque, Error **errp)
>>>   }
>>>
>>>   static void cmma_state_pending(void *opaque, MigPendingData *pending,
>>> -                               bool exact)
>>> +                               bool exact, bool final)
>>>   {
>>>       S390StAttribState *sas = S390_STATTRIB(opaque);
>>>       S390StAttribClass *sac = S390_STATTRIB_GET_CLASS(sas);
>>> -    long long res = sac->get_dirtycount(sas);
>>> +    long long res;
>>>
>>> +    if (final) {
>>> +        return;
>>> +    }
>>> +
>>> +    res = sac->get_dirtycount(sas);
>>>       if (res >= 0) {
>>>           pending->precopy_bytes += res;
>>>       }
>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>> index fb12b9717f..95072d6664 100644
>>> --- a/hw/vfio/migration.c
>>> +++ b/hw/vfio/migration.c
>>> @@ -622,12 +622,16 @@ static void vfio_state_pending_sync(VFIODevice *vbasedev)
>>>   }
>>>
>>>   static void vfio_state_pending(void *opaque, MigPendingData *pending,
>>> -                               bool exact)
>>> +                               bool exact, bool final)
>>>   {
>>>       VFIODevice *vbasedev = opaque;
>>>       VFIOMigration *migration = vbasedev->migration;
>>>       uint64_t precopy_size, stopcopy_size;
>>>
>>> +    if (final) {
>>> +        return;
>>> +    }
>>> +
>>>       if (exact) {
>>>           vfio_state_pending_sync(vbasedev);
>>>       }
>>> diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
>>> index 7ef3759e53..66ed91de03 100644
>>> --- a/migration/block-dirty-bitmap.c
>>> +++ b/migration/block-dirty-bitmap.c
>>> @@ -767,12 +767,16 @@ static int dirty_bitmap_save_complete(QEMUFile *f, void *opaque)
>>>   }
>>>
>>>   static void dirty_bitmap_state_pending(void *opaque, MigPendingData *data,
>>> -                                       bool exact)
>>> +                                       bool exact, bool final)
>>>   {
>>>       DBMSaveState *s = &((DBMState *)opaque)->save;
>>>       SaveBitmapState *dbms;
>>>       uint64_t pending = 0;
>>>
>>> +    if (final) {
>>> +        return;
>>> +    }
>>> +
>>>       bql_lock();
>>>
>>>       QSIMPLEQ_FOREACH(dbms, &s->dbms_list, entry) {
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index 7488a94206..df8ed3a9fd 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -2787,12 +2787,15 @@ static bool migration_switchover_prepare(MigrationState *s)
>>>   static bool migration_switchover_start(MigrationState *s, Error **errp)
>>>   {
>>>       ERRP_GUARD();
>>> +    MigPendingData pending = {};
>>>
>>>       if (!migration_switchover_prepare(s)) {
>>>           error_setg(errp, "Switchover is interrupted");
>>>           return false;
>>>       }
>>>
>>> +    qemu_savevm_query_pending_final(&pending);
>> Not needed for postcopy, we can move it to migration_completion_precopy().

While currently not needed (because VFIO isn't compatible with 
postcopy), AFAIU, generally it is needed because we have here a 
switchover similar to precopy case -- the switchover decision and VM 
stop at stopcopy_start() are not atomic, so theoretically a device may 
want to request an ACK in this gap.

But that's all theoretical and not necessary currently.

> A better idea is.. we can drop migration_bitmap_sync() in
> ram_postcopy_send_discard_bitmap() too, then we can keep this one.
>
> But then, let's properly document this, both for precopy and postcopy
> cases.  One example:
>
>      /*
>       * The final query to the whole system on dirty data to make sure we
>       * collect the latest status of the VM.  For precopy, source QEMU will
>       * dump all the dirty data during switchover.  For postcopy, this will
>       * properly update all the dirty bitmaps to finally generate the
>       * correct discard bitmaps; see ram_postcopy_send_discard_bitmap().
>       */

The last dirty sync is done a bit differently for postcopy case:
   postcopy_start()
     ram_postcopy_send_discard_bitmap()
       migration_bitmap_sync(rs, false);

1. The last_stage parameter of migration_bitmap_sync() is false 
(shouldn't it be true here as well?).
2. It calls migration_bitmap_sync and not migration_bitmap_sync_precopy, 
i.e., without precopy_notify calls.

If we need to keep the above differences for postcopy then I guess we'd 
need to introduce different final query flow for it. It may be cumbersome.
Do you think we should do this extra effort to support it, or, as you 
suggested, we can drop the final switchover ack check for postcopy for now?

>> I'd also appreciate a short comment explaining why the final query is
>> needed.
>>
>>> +
>>>       /* Inactivate disks except in COLO */
>>>       if (!migrate_colo()) {
>>>           /*
>>> @@ -3285,7 +3288,7 @@ static void migration_iteration_go_next(MigPendingData *pending)
>>>       /*
>>>        * Do a slow sync first before boosting the iteration count.
>>>        */
>>> -    qemu_savevm_query_pending(pending, true);
>>> +    qemu_savevm_query_pending_iter(pending, true);
>>>
>>>       /*
>>>        * Update the dirty information for the whole system for this
>>> @@ -3336,7 +3339,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>>>       bool complete_ready;
>>>
>>>       /* Fast path - get the estimated amount of pending data */
>>> -    qemu_savevm_query_pending(&pending, false);
>>> +    qemu_savevm_query_pending_iter(&pending, false);
>>>
>>>       if (in_postcopy) {
>>>           /*
>>> diff --git a/migration/ram.c b/migration/ram.c
>>> index fc38ffbf8a..4e3f442593 100644
>>> --- a/migration/ram.c
>>> +++ b/migration/ram.c
>>> @@ -3376,10 +3376,6 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>>>       rs->last_stage = !migration_in_colo_state();
>>>
>>>       WITH_RCU_READ_LOCK_GUARD() {
>>> -        if (!migration_in_postcopy()) {
>>> -            migration_bitmap_sync_precopy(true);
>>> -        }
>>> -
>>>           ret = rdma_registration_start(f, RAM_CONTROL_FINISH);
>>>           if (ret < 0) {
>>>               qemu_file_set_error(f, ret);
>>> @@ -3442,25 +3438,38 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>>>       return qemu_fflush(f);
>>>   }
>>>
>>> -static void ram_state_pending(void *opaque, MigPendingData *pending,
>>> -                              bool exact)
>>> +static void ram_state_pending_sync(bool exact, bool final)
>>>   {
>>> -    RAMState **temp = opaque;
>>> -    RAMState *rs = *temp;
>>> -    uint64_t remaining_size;
>>> -
>>>       /*
>>>        * Sync is not needed either with: (1) a fast query, or (2) after
>>>        * postcopy has started (no new dirty will generate anymore).
>>>        */
>>> -    if (exact && !migration_in_postcopy()) {
>>> +    if (!exact || migration_in_postcopy()) {
>>> +        return;
>>> +    }
>>> +
>>> +    /* Final pending query is called with BQL locked */
>>> +    if (!final) {
>>>           bql_lock();
>>> -        WITH_RCU_READ_LOCK_GUARD() {
>>> -            migration_bitmap_sync_precopy(false);
>>> -        }
>>> +    }
>>> +
>>> +    WITH_RCU_READ_LOCK_GUARD() {
>>> +        migration_bitmap_sync_precopy(final);
>>> +    }
>>> +
>>> +    if (!final) {
>>>           bql_unlock();
>>>       }
>>> +}
>>> +
>>> +static void ram_state_pending(void *opaque, MigPendingData *pending,
>>> +                              bool exact, bool final)
>>> +{
>>> +    RAMState **temp = opaque;
>>> +    RAMState *rs = *temp;
>>> +    uint64_t remaining_size;
>>>
>>> +    ram_state_pending_sync(exact, final);
>>>       remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
>>>
>>>       if (migrate_postcopy_ram()) {
>>> diff --git a/migration/savevm.c b/migration/savevm.c
>>> index 23adaf9dd9..22cb6a15c6 100644
>>> --- a/migration/savevm.c
>>> +++ b/migration/savevm.c
>>> @@ -1795,7 +1795,8 @@ int qemu_savevm_state_complete_precopy(MigrationState *s)
>>>       return qemu_fflush(f);
>>>   }
>>>
>>> -void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
>>> +static void qemu_savevm_query_pending(MigPendingData *pending, bool exact,
>>> +                                      bool final)
>>>   {
>>>       SaveStateEntry *se;
>>>
>>> @@ -1808,7 +1809,7 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
>>>           if (!qemu_savevm_state_active(se)) {
>>>               continue;
>>>           }
>>> -        se->ops->save_query_pending(se->opaque, pending, exact);
>>> +        se->ops->save_query_pending(se->opaque, pending, exact, final);
>>>       }
>>>
>>>       pending->total_bytes = pending->precopy_bytes +
>>> @@ -1817,16 +1818,30 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
>>>       /*
>>>        * Update system remaining dirty bytes whenever QEMU queries.  It will
>>>        * make the value to be not as accurate, but should still be pretty
>>> -     * close to reality when this got invoked frequently while iterating.
>>> +     * close to reality when this got invoked frequently while iterating.  Don't
>>> +     * update in final query as some modules may skip it if not needed.
>> IMHO this specialty isn't needed.  We can just make all modules report and
>> making sure we don't deadlock on bql.  AFAIU, CMMA never takes BQL, while
>> dirty-bitmap should conditionally take it now.

I didn't want to add unnecessary work for the critical downtime path, 
i.e., unnecessary exact queries.
I am not familiar with the other devices, but if you think it's 
negligible for them then we can drop this specialty (in VFIO I'd still 
like to skip the stopcopy size query; using the cached value is good 
enough IMHO).

Thanks.

>>
>>>        */
>>> -    mig_stats.dirty_bytes_total = pending->total_bytes;
>>> -
>>> -    trace_qemu_savevm_query_pending(exact, pending->precopy_bytes,
>>> +    if (!final) {
>>> +        mig_stats.dirty_bytes_total = pending->total_bytes;
>>> +    }
>> Then we remove this too which is counter-intuitive..
>>
>> The rest looks all reasonable.
>>
>> Thanks,
>>
>>> +    trace_qemu_savevm_query_pending(exact, final, pending->precopy_bytes,
>>>                                       pending->stopcopy_bytes,
>>>                                       pending->postcopy_bytes,
>>>                                       pending->total_bytes);
>>>   }
>>>
>>> +void qemu_savevm_query_pending_iter(MigPendingData *pending, bool exact)
>>> +{
>>> +    qemu_savevm_query_pending(pending, exact, false);
>>> +}
>>> +
>>> +void qemu_savevm_query_pending_final(MigPendingData *pending)
>>> +{
>>> +    g_assert(bql_locked());
>>> +
>>> +    qemu_savevm_query_pending(pending, true, true);
>>> +}
>>> +
>>>   void qemu_savevm_state_cleanup(void)
>>>   {
>>>       SaveStateEntry *se;
>>> diff --git a/migration/trace-events b/migration/trace-events
>>> index de99d976ab..1c9212d3e2 100644
>>> --- a/migration/trace-events
>>> +++ b/migration/trace-events
>>> @@ -7,7 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
>>>   qemu_loadvm_state_post_main(int ret) "%d"
>>>   qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
>>>   qemu_savevm_send_packaged(void) ""
>>> -qemu_savevm_query_pending(bool exact, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy, uint64_t total) "exact=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64", total=%"PRIu64
>>> +qemu_savevm_query_pending(bool exact, bool final, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy, uint64_t total) "exact=%d, final=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64", total=%"PRIu64
>>>   loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
>>>   loadvm_state_setup(void) ""
>>>   loadvm_state_cleanup(void) ""
>>> --
>>> 2.40.1
>>>
>> --
>> Peter Xu
> --
> Peter Xu
>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 07/13] migration: Fail migration if switchover-ack is requested after switchover decision
  2026-06-03 20:42   ` Peter Xu
@ 2026-06-04 15:36     ` Avihai Horon
  2026-06-04 17:48       ` Peter Xu
  0 siblings, 1 reply; 38+ messages in thread
From: Avihai Horon @ 2026-06-04 15:36 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb


On 6/3/2026 11:42 PM, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> On Tue, Jun 02, 2026 at 12:26:15PM +0300, Avihai Horon wrote:
>> Switchover ACK is checked only during precopy while the guest is still
>> running. The last migration_can_switchover() decision and guest stop are
>> not atomic, so a device may want to request another switchover ACK in
>> the gap after switchover decision has been made but before the guest is
>> stopped. Migration would then miss that request, which can increase
>> downtime.
>>
>> Cover this case by failing the migration if a switchover-ack was
>> requested during that time.
>>
>> Ideally, precopy iterations should be resumed in this case, however,
>> VFIO doesn't support going back to precopy after being stopped, so
>> implementing such logic would require non-trivial changes to the guest
>> start/stop flow. Given the above and that this case should be rare,
>> failing the migration seems reasonable.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>
>
> One nit:
>
>> ---
>>   migration/migration.c | 30 ++++++++++++++++++++++++++++++
>>   1 file changed, 30 insertions(+)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 4bb649a467..6ee1c795ff 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -2810,6 +2810,27 @@ static bool migration_switchover_prepare(MigrationState *s)
>>       return s->state == MIGRATION_STATUS_DEVICE;
>>   }
>>
>> +static bool migration_switchover_check_switchover_ack_pending(MigrationState *s,
> The name is slightly confusing.  It says "check if there is pending ack"
> but then it returns true if there's no pending ACK..
>
> Maybe migration_switchover_is_acknowledged()?

Sure, or maybe migration_have_new_switchover_ack_pending()?

I just noticed that I overlooked one corner case:
If guest is stopped during precopy and source has pending acks > 0 
(e.g., virsh migrate with --timeout param), then we will fail in this 
check and unnecessarily abort migration.

I thought about doing the change below. It's not as clean as I'd want it 
to be, but I have no other ideas.

===========8<===========
diff --git a/migration/migration.c b/migration/migration.c
index 6ee1c795ff..6054453b18 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2810,37 +2810,47 @@ static bool 
migration_switchover_prepare(MigrationState *s)
      return s->state == MIGRATION_STATUS_DEVICE;
  }

-static bool 
migration_switchover_check_switchover_ack_pending(MigrationState *s,
- Error **errp)
+static bool migration_have_new_switchover_ack_pending(
+    MigrationState *s, uint32_t old_switchover_ack_pending_num, Error 
**errp)
  {
      uint32_t pending_num;

      if (!migrate_switchover_ack() || migrate_switchover_ack_legacy()) {
-        return true;
+        return false;
      }

      pending_num = qatomic_read(&s->switchover_ack_pending_num);
-    if (pending_num > 0) {
+    if (pending_num > old_switchover_ack_pending_num) {
          error_setg(errp,
                     "Switchover ACK was requested by %" PRIu32
                     " devices during switchover",
-                   pending_num);
-        return false;
+                   pending_num - old_switchover_ack_pending_num);
+        return true;
      }

-    return true;
+    return false;
  }

  static bool migration_switchover_start(MigrationState *s, Error **errp)
  {
      ERRP_GUARD();
      MigPendingData pending = {};
+    uint32_t old_switchover_ack_pending_num;

      if (!migration_switchover_prepare(s)) {
          error_setg(errp, "Switchover is interrupted");
          return false;
      }

+    /*
+     * If the guest was stopped during precopy (e.g., by management), 
we may get
+     * here with switchover_ack_pending_num > 0. Store the current
+     * switchover_ack_pending_num value before the final pending query to
+     * distinguish between existing pending ACKs and new ones.
+     */
+    old_switchover_ack_pending_num =
+        qatomic_read(&s->switchover_ack_pending_num);
+
      qemu_savevm_query_pending_final(s, &pending);

      /*
@@ -2848,7 +2858,8 @@ static bool 
migration_switchover_start(MigrationState *s, Error **errp)
       * Fail the migration in this case since we currently don't 
support going
       * back to precopy.
       */
-    if (!migration_switchover_check_switchover_ack_pending(s, errp)) {
+    if (migration_have_new_switchover_ack_pending(
+            s, old_switchover_ack_pending_num, errp)) {
          return false;
      }

===========8<===========

Thanks.

>
>> +                                                              Error **errp)
>> +{
>> +    uint32_t pending_num;
>> +
>> +    if (!migrate_switchover_ack() || migrate_switchover_ack_legacy()) {
>> +        return true;
>> +    }
>> +
>> +    pending_num = qatomic_read(&s->switchover_ack_pending_num);
>> +    if (pending_num > 0) {
>> +        error_setg(errp,
>> +                   "Switchover ACK was requested by %" PRIu32
>> +                   " devices during switchover",
>> +                   pending_num);
>> +        return false;
>> +    }
>> +
>> +    return true;
>> +}
>> +
>>   static bool migration_switchover_start(MigrationState *s, Error **errp)
>>   {
>>       ERRP_GUARD();
>> @@ -2822,6 +2843,15 @@ static bool migration_switchover_start(MigrationState *s, Error **errp)
>>
>>       qemu_savevm_query_pending_final(s, &pending);
>>
>> +    /*
>> +     * Switchover-ack requests done after switchover decision, are not allowed.
>> +     * Fail the migration in this case since we currently don't support going
>> +     * back to precopy.
>> +     */
>> +    if (!migration_switchover_check_switchover_ack_pending(s, errp)) {
>> +        return false;
>> +    }
>> +
>>       /* Inactivate disks except in COLO */
>>       if (!migrate_colo()) {
>>           /*
>> --
>> 2.40.1
>>
> --
> Peter Xu
>


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 01/13] migration: Propagate errors in migration_completion_precopy()
  2026-06-04 15:21       ` Peter Xu
@ 2026-06-04 15:38         ` Avihai Horon
  0 siblings, 0 replies; 38+ messages in thread
From: Avihai Horon @ 2026-06-04 15:38 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb


On 6/4/2026 6:21 PM, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> On Thu, Jun 04, 2026 at 06:12:11PM +0300, Avihai Horon wrote:
>> On 6/3/2026 10:47 PM, Peter Xu wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> On Tue, Jun 02, 2026 at 12:26:09PM +0300, Avihai Horon wrote:
>>>> migration_completion_precopy() doesn't propagate errors to migration
>>>> core which leads to error information loss. Fix that.
>>>>
>>>> This prepares for a follow-up where migration_switchover_start() can
>>>> fail on switchover-ack and still report a useful error. Errors from
>>>> qemu_savevm_state_complete_precopy() are not propagated yet as it
>>>> requires more plumbing.
>>>>
>>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>>> ---
>>>>    migration/migration.c | 11 +++++++----
>>>>    1 file changed, 7 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>> index 074d3f2c69..7488a94206 100644
>>>> --- a/migration/migration.c
>>>> +++ b/migration/migration.c
>>>> @@ -2814,7 +2814,7 @@ static bool migration_switchover_start(MigrationState *s, Error **errp)
>>>>        return true;
>>>>    }
>>>>
>>>> -static int migration_completion_precopy(MigrationState *s)
>>>> +static int migration_completion_precopy(MigrationState *s, Error **errp)
>>>>    {
>>>>        int ret;
>>>>
>>>> @@ -2823,11 +2823,12 @@ static int migration_completion_precopy(MigrationState *s)
>>>>        if (!migrate_mode_is_cpr()) {
>>>>            ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
>>>>            if (ret < 0) {
>>>> +            error_setg_errno(errp, -ret, "Failed to stop the VM");
>>>>                goto out_unlock;
>>>>            }
>>>>        }
>>>>
>>>> -    if (!migration_switchover_start(s, NULL)) {
>>>> +    if (!migration_switchover_start(s, errp)) {
>>>>            ret = -EFAULT;
>>>>            goto out_unlock;
>>>>        }
>>> IIUC this patch overlooked the follow up call:
>>>
>>>       ret = qemu_savevm_state_complete_precopy(s);
>>>
>>> We should make sure ret!=0 will always set Error*.  Better pass Error**
>>> into qemu_savevm_state_complete_precopy() too.
>> Yes, that was intentional, because this requires more plumbing in
>> qemu_savevm_state_complete_precopy() (as mentioned in the commit message).
>> Although not clean, I thought it was OK to do this one-time exception since
>> we explicitly check for local_err in migration_completion().
>>
>> I can give it a shot and try adding Error **errp to
>> qemu_savevm_state_complete_precopy(), but I think it'll be opening a can of
>> worms.
>> Do you have any thoughts?
> If we want to keep the min change, we can consider setting errp at least
> when qemu_savevm_state_complete_precopy() returned non-zero in the path of
> migration_completion_precopy(), to make sure errp is always set when error
> happened.  Otherwise it's error prone.

OK, I'll see how that works.

Thanks.

>
> Thanks,
>
>> Thanks.
>>
>>> Thanks,
>>>
>>>> @@ -2869,7 +2870,7 @@ static void migration_completion(MigrationState *s)
>>>>        Error *local_err = NULL;
>>>>
>>>>        if (s->state == MIGRATION_STATUS_ACTIVE) {
>>>> -        ret = migration_completion_precopy(s);
>>>> +        ret = migration_completion_precopy(s, &local_err);
>>>>        } else if (s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
>>>>            migration_completion_postcopy(s);
>>>>        } else {
>>>> @@ -2900,7 +2901,9 @@ static void migration_completion(MigrationState *s)
>>>>        return;
>>>>
>>>>    fail:
>>>> -    if (qemu_file_get_error_obj(s->to_dst_file, &local_err)) {
>>>> +    if (local_err) {
>>>> +        migrate_error_propagate(s, local_err);
>>>> +    } else if (qemu_file_get_error_obj(s->to_dst_file, &local_err)) {
>>>>            migrate_error_propagate(s, local_err);
>>>>        } else if (ret) {
>>>>            error_setg_errno(&local_err, -ret, "Error in migration completion");
>>>> --
>>>> 2.40.1
>>>>
>>> --
>>> Peter Xu
>>>
> --
> Peter Xu
>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 02/13] migration: Run final save_query_pending at switchover
  2026-06-04 15:29       ` Avihai Horon
@ 2026-06-04 17:18         ` Peter Xu
  2026-06-08 12:07           ` Avihai Horon
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Xu @ 2026-06-04 17:18 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb, Gavin Shan, Wei Wang

On Thu, Jun 04, 2026 at 06:29:01PM +0300, Avihai Horon wrote:
> 
> On 6/4/2026 12:04 AM, Peter Xu wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > On Wed, Jun 03, 2026 at 04:05:49PM -0400, Peter Xu wrote:
> > > On Tue, Jun 02, 2026 at 12:26:10PM +0300, Avihai Horon wrote:
> > > > Before switchover, the source needs one last exact pending query so
> > > > modules can flush dirty state. This is currently done ad hoc in modules
> > > > handlers. For example, RAM syncs its dirty bitmap in its save_complete
> > > > handler.
> > > > 
> > > > This should be a general concept relevant for any module, so extract it
> > > > to migration core instead by running a final save_query_pending before
> > > > switchover.
> > > > 
> > > > The final query requires special handling by modules (e.g., it's called
> > > > with BQL locked, during VM stop), so extend save_query_pending
> > > > SaveVMHandlers callback and qemu_savevm_query_pending() with a "final"
> > > > flag so migration modules can tell the last pending query during
> > > > switchover from periodic iteration queries.
> > > > 
> > > > Not all modules need the final query; skip it for those who don't need.
> > > > 
> > > > Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> > > > ---
> > > >   include/migration/register.h   | 41 ++++++++++++++++++----------------
> > > >   migration/savevm.h             |  3 ++-
> > > >   hw/s390x/s390-stattrib.c       |  9 ++++++--
> > > >   hw/vfio/migration.c            |  6 ++++-
> > > >   migration/block-dirty-bitmap.c |  6 ++++-
> > > >   migration/migration.c          |  7 ++++--
> > > >   migration/ram.c                | 37 ++++++++++++++++++------------
> > > >   migration/savevm.c             | 27 +++++++++++++++++-----
> > > >   migration/trace-events         |  2 +-
> > > >   9 files changed, 91 insertions(+), 47 deletions(-)
> > > > 
> > > > diff --git a/include/migration/register.h b/include/migration/register.h
> > > > index 5e5e0ee432..6f632123f1 100644
> > > > --- a/include/migration/register.h
> > > > +++ b/include/migration/register.h
> > > > @@ -171,6 +171,28 @@ typedef struct SaveVMHandlers {
> > > >        */
> > > >       bool (*is_active_iterate)(void *opaque);
> > > > 
> > > > +    /**
> > > > +     * @save_query_pending
> > > > +     *
> > > > +     * This estimates the remaining data to transfer on the source side.
> > > > +     *
> > > > +     * When @exact is true, a module must report accurate results.  When
> > > > +     * @exact is false, a module may report estimates.
> > > > +     *
> > > > +     * It's highly recommended that modules implement a faster version of
> > > > +     * the query path (for example, by proper caching on the counters) if
> > > > +     * an accurate query will be time-consuming.
> > > > +     *
> > > > +     * @opaque: data pointer passed to register_savevm_live()
> > > > +     * @pending: pointer to a MigPendingData struct
> > > > +     * @exact: set to true for an accurate (slow) query
> > > > +     * @final: set to true for the final query during switchover. When final is
> > > > +     * true, the query is called with BQL locked. Otherwise, it's called with
> > > > +     * BQL unlocked.
> > > > +     */
> > > > +    void (*save_query_pending)(void *opaque, MigPendingData *pending,
> > > > +                               bool exact, bool final);
> > > > +
> > > >       /* This runs outside the BQL in the migration case, and
> > > >        * within the lock in the savevm case.  The callback had better only
> > > >        * use data that is local to the migration thread or protected
> > > > @@ -210,25 +232,6 @@ typedef struct SaveVMHandlers {
> > > >        */
> > > >       bool (*save_postcopy_prepare)(QEMUFile *f, void *opaque, Error **errp);
> > > > 
> > > > -    /**
> > > > -     * @save_query_pending
> > > > -     *
> > > > -     * This estimates the remaining data to transfer on the source side.
> > > > -     *
> > > > -     * When @exact is true, a module must report accurate results.  When
> > > > -     * @exact is false, a module may report estimates.
> > > > -     *
> > > > -     * It's highly recommended that modules implement a faster version of
> > > > -     * the query path (for example, by proper caching on the counters) if
> > > > -     * an accurate query will be time-consuming.
> > > > -     *
> > > > -     * @opaque: data pointer passed to register_savevm_live()
> > > > -     * @pending: pointer to a MigPendingData struct
> > > > -     * @exact: set to true for an accurate (slow) query
> > > > -     */
> > > > -    void (*save_query_pending)(void *opaque, MigPendingData *pending,
> > > > -                               bool exact);
> > > > -
> > > >       /**
> > > >        * @load_state
> > > >        *
> > > > diff --git a/migration/savevm.h b/migration/savevm.h
> > > > index 96fdf96d4e..10b3d78a5f 100644
> > > > --- a/migration/savevm.h
> > > > +++ b/migration/savevm.h
> > > > @@ -45,7 +45,8 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
> > > >   void qemu_savevm_state_cleanup(void);
> > > >   void qemu_savevm_state_complete_postcopy(QEMUFile *f);
> > > >   int qemu_savevm_state_complete_precopy(MigrationState *s);
> > > > -void qemu_savevm_query_pending(MigPendingData *pending, bool exact);
> > > > +void qemu_savevm_query_pending_iter(MigPendingData *pending, bool exact);
> > > > +void qemu_savevm_query_pending_final(MigPendingData *pending);
> > > >   int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy);
> > > >   bool qemu_savevm_state_postcopy_prepare(QEMUFile *f, Error **errp);
> > > >   void qemu_savevm_state_end(QEMUFile *f);
> > > > diff --git a/hw/s390x/s390-stattrib.c b/hw/s390x/s390-stattrib.c
> > > > index c334714b31..aed0c6844d 100644
> > > > --- a/hw/s390x/s390-stattrib.c
> > > > +++ b/hw/s390x/s390-stattrib.c
> > > > @@ -190,12 +190,17 @@ static int cmma_save_setup(QEMUFile *f, void *opaque, Error **errp)
> > > >   }
> > > > 
> > > >   static void cmma_state_pending(void *opaque, MigPendingData *pending,
> > > > -                               bool exact)
> > > > +                               bool exact, bool final)
> > > >   {
> > > >       S390StAttribState *sas = S390_STATTRIB(opaque);
> > > >       S390StAttribClass *sac = S390_STATTRIB_GET_CLASS(sas);
> > > > -    long long res = sac->get_dirtycount(sas);
> > > > +    long long res;
> > > > 
> > > > +    if (final) {
> > > > +        return;
> > > > +    }
> > > > +
> > > > +    res = sac->get_dirtycount(sas);
> > > >       if (res >= 0) {
> > > >           pending->precopy_bytes += res;
> > > >       }
> > > > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > > > index fb12b9717f..95072d6664 100644
> > > > --- a/hw/vfio/migration.c
> > > > +++ b/hw/vfio/migration.c
> > > > @@ -622,12 +622,16 @@ static void vfio_state_pending_sync(VFIODevice *vbasedev)
> > > >   }
> > > > 
> > > >   static void vfio_state_pending(void *opaque, MigPendingData *pending,
> > > > -                               bool exact)
> > > > +                               bool exact, bool final)
> > > >   {
> > > >       VFIODevice *vbasedev = opaque;
> > > >       VFIOMigration *migration = vbasedev->migration;
> > > >       uint64_t precopy_size, stopcopy_size;
> > > > 
> > > > +    if (final) {
> > > > +        return;
> > > > +    }
> > > > +
> > > >       if (exact) {
> > > >           vfio_state_pending_sync(vbasedev);
> > > >       }
> > > > diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
> > > > index 7ef3759e53..66ed91de03 100644
> > > > --- a/migration/block-dirty-bitmap.c
> > > > +++ b/migration/block-dirty-bitmap.c
> > > > @@ -767,12 +767,16 @@ static int dirty_bitmap_save_complete(QEMUFile *f, void *opaque)
> > > >   }
> > > > 
> > > >   static void dirty_bitmap_state_pending(void *opaque, MigPendingData *data,
> > > > -                                       bool exact)
> > > > +                                       bool exact, bool final)
> > > >   {
> > > >       DBMSaveState *s = &((DBMState *)opaque)->save;
> > > >       SaveBitmapState *dbms;
> > > >       uint64_t pending = 0;
> > > > 
> > > > +    if (final) {
> > > > +        return;
> > > > +    }
> > > > +
> > > >       bql_lock();
> > > > 
> > > >       QSIMPLEQ_FOREACH(dbms, &s->dbms_list, entry) {
> > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > index 7488a94206..df8ed3a9fd 100644
> > > > --- a/migration/migration.c
> > > > +++ b/migration/migration.c
> > > > @@ -2787,12 +2787,15 @@ static bool migration_switchover_prepare(MigrationState *s)
> > > >   static bool migration_switchover_start(MigrationState *s, Error **errp)
> > > >   {
> > > >       ERRP_GUARD();
> > > > +    MigPendingData pending = {};
> > > > 
> > > >       if (!migration_switchover_prepare(s)) {
> > > >           error_setg(errp, "Switchover is interrupted");
> > > >           return false;
> > > >       }
> > > > 
> > > > +    qemu_savevm_query_pending_final(&pending);
> > > Not needed for postcopy, we can move it to migration_completion_precopy().
> 
> While currently not needed (because VFIO isn't compatible with postcopy),
> AFAIU, generally it is needed because we have here a switchover similar to
> precopy case -- the switchover decision and VM stop at stopcopy_start() are
> not atomic, so theoretically a device may want to request an ACK in this
> gap.
> 
> But that's all theoretical and not necessary currently.

True, then it makes more sense to keep it here.  You can enhance the
comment below with VFIO implications for the future DMA faults when it's
ready.

> 
> > A better idea is.. we can drop migration_bitmap_sync() in
> > ram_postcopy_send_discard_bitmap() too, then we can keep this one.
> > 
> > But then, let's properly document this, both for precopy and postcopy
> > cases.  One example:
> > 
> >      /*
> >       * The final query to the whole system on dirty data to make sure we
> >       * collect the latest status of the VM.  For precopy, source QEMU will
> >       * dump all the dirty data during switchover.  For postcopy, this will
> >       * properly update all the dirty bitmaps to finally generate the
> >       * correct discard bitmaps; see ram_postcopy_send_discard_bitmap().
> >       */
> 
> The last dirty sync is done a bit differently for postcopy case:
>   postcopy_start()
>     ram_postcopy_send_discard_bitmap()
>       migration_bitmap_sync(rs, false);
> 
> 1. The last_stage parameter of migration_bitmap_sync() is false (shouldn't
> it be true here as well?).

Yes, I think it should.

The only thing uses that last_stage so far should be ARM64 where some dirty
data can be generated to bitmap not ring (only matters when dirty ring
enabled).  In that case AFAIU we need that too for a postcopy switchover.

> 2. It calls migration_bitmap_sync and not migration_bitmap_sync_precopy,
> i.e., without precopy_notify calls.

This is another thing I feel like got overlooked in the free page hint
feature.  The only notifiers registered is that balloon device, logically I
think we need these notifiers in postcopy too to make sure we properly stop
the free page hints seeing BEFORE_BITMAP_SYNC, then don't start it
(vm_running=false) in AFTER_BITMAP_SYNC:

virtio_balloon_free_page_hint_notify():

    case PRECOPY_NOTIFY_BEFORE_BITMAP_SYNC:
        virtio_balloon_free_page_stop(dev);
        break;
    case PRECOPY_NOTIFY_AFTER_BITMAP_SYNC:
        if (vdev->vm_running) {
            virtio_balloon_free_page_start(dev);
            break;
        }
> 
> If we need to keep the above differences for postcopy then I guess we'd need
> to introduce different final query flow for it. It may be cumbersome.
> Do you think we should do this extra effort to support it, or, as you
> suggested, we can drop the final switchover ack check for postcopy for now?

IMHO you can add a small patch for replacing the postcopy sync first with
migration_bitmap_sync_precopy(last_stage=true), copy Gavin and Wei in that
patch only:

Gavin Shan <gshan@redhat.com>
Wei Wang <wei.w.wang@intel.com>

Then this patch can be on top.

I'll loop both of them in while replying, to see if we can collect early
comments.

> 
> > > I'd also appreciate a short comment explaining why the final query is
> > > needed.
> > > 
> > > > +
> > > >       /* Inactivate disks except in COLO */
> > > >       if (!migrate_colo()) {
> > > >           /*
> > > > @@ -3285,7 +3288,7 @@ static void migration_iteration_go_next(MigPendingData *pending)
> > > >       /*
> > > >        * Do a slow sync first before boosting the iteration count.
> > > >        */
> > > > -    qemu_savevm_query_pending(pending, true);
> > > > +    qemu_savevm_query_pending_iter(pending, true);
> > > > 
> > > >       /*
> > > >        * Update the dirty information for the whole system for this
> > > > @@ -3336,7 +3339,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> > > >       bool complete_ready;
> > > > 
> > > >       /* Fast path - get the estimated amount of pending data */
> > > > -    qemu_savevm_query_pending(&pending, false);
> > > > +    qemu_savevm_query_pending_iter(&pending, false);
> > > > 
> > > >       if (in_postcopy) {
> > > >           /*
> > > > diff --git a/migration/ram.c b/migration/ram.c
> > > > index fc38ffbf8a..4e3f442593 100644
> > > > --- a/migration/ram.c
> > > > +++ b/migration/ram.c
> > > > @@ -3376,10 +3376,6 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
> > > >       rs->last_stage = !migration_in_colo_state();
> > > > 
> > > >       WITH_RCU_READ_LOCK_GUARD() {
> > > > -        if (!migration_in_postcopy()) {
> > > > -            migration_bitmap_sync_precopy(true);
> > > > -        }
> > > > -
> > > >           ret = rdma_registration_start(f, RAM_CONTROL_FINISH);
> > > >           if (ret < 0) {
> > > >               qemu_file_set_error(f, ret);
> > > > @@ -3442,25 +3438,38 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
> > > >       return qemu_fflush(f);
> > > >   }
> > > > 
> > > > -static void ram_state_pending(void *opaque, MigPendingData *pending,
> > > > -                              bool exact)
> > > > +static void ram_state_pending_sync(bool exact, bool final)
> > > >   {
> > > > -    RAMState **temp = opaque;
> > > > -    RAMState *rs = *temp;
> > > > -    uint64_t remaining_size;
> > > > -
> > > >       /*
> > > >        * Sync is not needed either with: (1) a fast query, or (2) after
> > > >        * postcopy has started (no new dirty will generate anymore).
> > > >        */
> > > > -    if (exact && !migration_in_postcopy()) {
> > > > +    if (!exact || migration_in_postcopy()) {
> > > > +        return;
> > > > +    }
> > > > +
> > > > +    /* Final pending query is called with BQL locked */
> > > > +    if (!final) {
> > > >           bql_lock();
> > > > -        WITH_RCU_READ_LOCK_GUARD() {
> > > > -            migration_bitmap_sync_precopy(false);
> > > > -        }
> > > > +    }
> > > > +
> > > > +    WITH_RCU_READ_LOCK_GUARD() {
> > > > +        migration_bitmap_sync_precopy(final);
> > > > +    }
> > > > +
> > > > +    if (!final) {
> > > >           bql_unlock();
> > > >       }
> > > > +}
> > > > +
> > > > +static void ram_state_pending(void *opaque, MigPendingData *pending,
> > > > +                              bool exact, bool final)
> > > > +{
> > > > +    RAMState **temp = opaque;
> > > > +    RAMState *rs = *temp;
> > > > +    uint64_t remaining_size;
> > > > 
> > > > +    ram_state_pending_sync(exact, final);
> > > >       remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
> > > > 
> > > >       if (migrate_postcopy_ram()) {
> > > > diff --git a/migration/savevm.c b/migration/savevm.c
> > > > index 23adaf9dd9..22cb6a15c6 100644
> > > > --- a/migration/savevm.c
> > > > +++ b/migration/savevm.c
> > > > @@ -1795,7 +1795,8 @@ int qemu_savevm_state_complete_precopy(MigrationState *s)
> > > >       return qemu_fflush(f);
> > > >   }
> > > > 
> > > > -void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
> > > > +static void qemu_savevm_query_pending(MigPendingData *pending, bool exact,
> > > > +                                      bool final)
> > > >   {
> > > >       SaveStateEntry *se;
> > > > 
> > > > @@ -1808,7 +1809,7 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
> > > >           if (!qemu_savevm_state_active(se)) {
> > > >               continue;
> > > >           }
> > > > -        se->ops->save_query_pending(se->opaque, pending, exact);
> > > > +        se->ops->save_query_pending(se->opaque, pending, exact, final);
> > > >       }
> > > > 
> > > >       pending->total_bytes = pending->precopy_bytes +
> > > > @@ -1817,16 +1818,30 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
> > > >       /*
> > > >        * Update system remaining dirty bytes whenever QEMU queries.  It will
> > > >        * make the value to be not as accurate, but should still be pretty
> > > > -     * close to reality when this got invoked frequently while iterating.
> > > > +     * close to reality when this got invoked frequently while iterating.  Don't
> > > > +     * update in final query as some modules may skip it if not needed.
> > > IMHO this specialty isn't needed.  We can just make all modules report and
> > > making sure we don't deadlock on bql.  AFAIU, CMMA never takes BQL, while
> > > dirty-bitmap should conditionally take it now.
> 
> I didn't want to add unnecessary work for the critical downtime path, i.e.,
> unnecessary exact queries.

If there's any real slow sync, then likely it needs to happen.. because of
the same reason why RAM do the sync, to make sure the data is latest in
QEMU migration core (say, bitmaps).  If it's a fast sync then it doesn't
matter.

query_pending() so far should achieve both functions: (1) synchronize dirty
info with whatever remote modules like KVM, enforced when exact=true, (2)
report.

VFIO is just special because VFIO dirty info accuracy never rely on
query_pending or whatever it collects, so it's purely a way to estimate for
size of dirty, hence (2) only.

> I am not familiar with the other devices, but if you think it's negligible
> for them then we can drop this specialty (in VFIO I'd still like to skip the
> stopcopy size query; using the cached value is good enough IMHO).

Yes VFIO is fine but it'll be good to comment.  Let's try to drop them, I
want to avoid below "if (!final)" if possible.

Thanks,

> 
> Thanks.
> 
> > > 
> > > >        */
> > > > -    mig_stats.dirty_bytes_total = pending->total_bytes;
> > > > -
> > > > -    trace_qemu_savevm_query_pending(exact, pending->precopy_bytes,
> > > > +    if (!final) {
> > > > +        mig_stats.dirty_bytes_total = pending->total_bytes;
> > > > +    }

[a]

> > > Then we remove this too which is counter-intuitive..
> > > 
> > > The rest looks all reasonable.
> > > 
> > > Thanks,
> > > 
> > > > +    trace_qemu_savevm_query_pending(exact, final, pending->precopy_bytes,
> > > >                                       pending->stopcopy_bytes,
> > > >                                       pending->postcopy_bytes,
> > > >                                       pending->total_bytes);
> > > >   }
> > > > 
> > > > +void qemu_savevm_query_pending_iter(MigPendingData *pending, bool exact)
> > > > +{
> > > > +    qemu_savevm_query_pending(pending, exact, false);
> > > > +}
> > > > +
> > > > +void qemu_savevm_query_pending_final(MigPendingData *pending)
> > > > +{
> > > > +    g_assert(bql_locked());
> > > > +
> > > > +    qemu_savevm_query_pending(pending, true, true);
> > > > +}
> > > > +
> > > >   void qemu_savevm_state_cleanup(void)
> > > >   {
> > > >       SaveStateEntry *se;
> > > > diff --git a/migration/trace-events b/migration/trace-events
> > > > index de99d976ab..1c9212d3e2 100644
> > > > --- a/migration/trace-events
> > > > +++ b/migration/trace-events
> > > > @@ -7,7 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
> > > >   qemu_loadvm_state_post_main(int ret) "%d"
> > > >   qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
> > > >   qemu_savevm_send_packaged(void) ""
> > > > -qemu_savevm_query_pending(bool exact, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy, uint64_t total) "exact=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64", total=%"PRIu64
> > > > +qemu_savevm_query_pending(bool exact, bool final, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy, uint64_t total) "exact=%d, final=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64", total=%"PRIu64
> > > >   loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
> > > >   loadvm_state_setup(void) ""
> > > >   loadvm_state_cleanup(void) ""
> > > > --
> > > > 2.40.1
> > > > 
> > > --
> > > Peter Xu
> > --
> > Peter Xu
> > 
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 07/13] migration: Fail migration if switchover-ack is requested after switchover decision
  2026-06-04 15:36     ` Avihai Horon
@ 2026-06-04 17:48       ` Peter Xu
  2026-06-08 12:49         ` Avihai Horon
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Xu @ 2026-06-04 17:48 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb

On Thu, Jun 04, 2026 at 06:36:14PM +0300, Avihai Horon wrote:
> 
> On 6/3/2026 11:42 PM, Peter Xu wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > On Tue, Jun 02, 2026 at 12:26:15PM +0300, Avihai Horon wrote:
> > > Switchover ACK is checked only during precopy while the guest is still
> > > running. The last migration_can_switchover() decision and guest stop are
> > > not atomic, so a device may want to request another switchover ACK in
> > > the gap after switchover decision has been made but before the guest is
> > > stopped. Migration would then miss that request, which can increase
> > > downtime.
> > > 
> > > Cover this case by failing the migration if a switchover-ack was
> > > requested during that time.
> > > 
> > > Ideally, precopy iterations should be resumed in this case, however,
> > > VFIO doesn't support going back to precopy after being stopped, so
> > > implementing such logic would require non-trivial changes to the guest
> > > start/stop flow. Given the above and that this case should be rare,
> > > failing the migration seems reasonable.
> > > 
> > > Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> > Reviewed-by: Peter Xu <peterx@redhat.com>
> > 
> > One nit:
> > 
> > > ---
> > >   migration/migration.c | 30 ++++++++++++++++++++++++++++++
> > >   1 file changed, 30 insertions(+)
> > > 
> > > diff --git a/migration/migration.c b/migration/migration.c
> > > index 4bb649a467..6ee1c795ff 100644
> > > --- a/migration/migration.c
> > > +++ b/migration/migration.c
> > > @@ -2810,6 +2810,27 @@ static bool migration_switchover_prepare(MigrationState *s)
> > >       return s->state == MIGRATION_STATUS_DEVICE;
> > >   }
> > > 
> > > +static bool migration_switchover_check_switchover_ack_pending(MigrationState *s,
> > The name is slightly confusing.  It says "check if there is pending ack"
> > but then it returns true if there's no pending ACK..
> > 
> > Maybe migration_switchover_is_acknowledged()?
> 
> Sure, or maybe migration_have_new_switchover_ack_pending()?

Sure.

> 
> I just noticed that I overlooked one corner case:
> If guest is stopped during precopy and source has pending acks > 0 (e.g.,
> virsh migrate with --timeout param), then we will fail in this check and
> unnecessarily abort migration.
> 
> I thought about doing the change below. It's not as clean as I'd want it to
> be, but I have no other ideas.

Yes, it's not good to bleed switchover_ack_pending_num all over the
places.. Maybe you can make qemu_savevm_query_pending_final() return an
error when seeing any newly popped ack request while querying?  Then the
new function is not needed.

Thanks,

> 
> ===========8<===========
> diff --git a/migration/migration.c b/migration/migration.c
> index 6ee1c795ff..6054453b18 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2810,37 +2810,47 @@ static bool
> migration_switchover_prepare(MigrationState *s)
>      return s->state == MIGRATION_STATUS_DEVICE;
>  }
> 
> -static bool
> migration_switchover_check_switchover_ack_pending(MigrationState *s,
> - Error **errp)
> +static bool migration_have_new_switchover_ack_pending(
> +    MigrationState *s, uint32_t old_switchover_ack_pending_num, Error
> **errp)
>  {
>      uint32_t pending_num;
> 
>      if (!migrate_switchover_ack() || migrate_switchover_ack_legacy()) {
> -        return true;
> +        return false;
>      }
> 
>      pending_num = qatomic_read(&s->switchover_ack_pending_num);
> -    if (pending_num > 0) {
> +    if (pending_num > old_switchover_ack_pending_num) {
>          error_setg(errp,
>                     "Switchover ACK was requested by %" PRIu32
>                     " devices during switchover",
> -                   pending_num);
> -        return false;
> +                   pending_num - old_switchover_ack_pending_num);
> +        return true;
>      }
> 
> -    return true;
> +    return false;
>  }
> 
>  static bool migration_switchover_start(MigrationState *s, Error **errp)
>  {
>      ERRP_GUARD();
>      MigPendingData pending = {};
> +    uint32_t old_switchover_ack_pending_num;
> 
>      if (!migration_switchover_prepare(s)) {
>          error_setg(errp, "Switchover is interrupted");
>          return false;
>      }
> 
> +    /*
> +     * If the guest was stopped during precopy (e.g., by management), we
> may get
> +     * here with switchover_ack_pending_num > 0. Store the current
> +     * switchover_ack_pending_num value before the final pending query to
> +     * distinguish between existing pending ACKs and new ones.
> +     */
> +    old_switchover_ack_pending_num =
> +        qatomic_read(&s->switchover_ack_pending_num);
> +
>      qemu_savevm_query_pending_final(s, &pending);
> 
>      /*
> @@ -2848,7 +2858,8 @@ static bool migration_switchover_start(MigrationState
> *s, Error **errp)
>       * Fail the migration in this case since we currently don't support
> going
>       * back to precopy.
>       */
> -    if (!migration_switchover_check_switchover_ack_pending(s, errp)) {
> +    if (migration_have_new_switchover_ack_pending(
> +            s, old_switchover_ack_pending_num, errp)) {
>          return false;
>      }
> 
> ===========8<===========
> 
> Thanks.
> 
> > 
> > > +                                                              Error **errp)
> > > +{
> > > +    uint32_t pending_num;
> > > +
> > > +    if (!migrate_switchover_ack() || migrate_switchover_ack_legacy()) {
> > > +        return true;
> > > +    }
> > > +
> > > +    pending_num = qatomic_read(&s->switchover_ack_pending_num);
> > > +    if (pending_num > 0) {
> > > +        error_setg(errp,
> > > +                   "Switchover ACK was requested by %" PRIu32
> > > +                   " devices during switchover",
> > > +                   pending_num);
> > > +        return false;
> > > +    }
> > > +
> > > +    return true;
> > > +}
> > > +
> > >   static bool migration_switchover_start(MigrationState *s, Error **errp)
> > >   {
> > >       ERRP_GUARD();
> > > @@ -2822,6 +2843,15 @@ static bool migration_switchover_start(MigrationState *s, Error **errp)
> > > 
> > >       qemu_savevm_query_pending_final(s, &pending);
> > > 
> > > +    /*
> > > +     * Switchover-ack requests done after switchover decision, are not allowed.
> > > +     * Fail the migration in this case since we currently don't support going
> > > +     * back to precopy.
> > > +     */
> > > +    if (!migration_switchover_check_switchover_ack_pending(s, errp)) {
> > > +        return false;
> > > +    }
> > > +
> > >       /* Inactivate disks except in COLO */
> > >       if (!migrate_colo()) {
> > >           /*
> > > --
> > > 2.40.1
> > > 
> > --
> > Peter Xu
> > 
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 06/13] migration: Make switchover-ack re-usable
  2026-06-04 15:05     ` Avihai Horon
@ 2026-06-08 10:23       ` Markus Armbruster
  0 siblings, 0 replies; 38+ messages in thread
From: Markus Armbruster @ 2026-06-08 10:23 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Peter Xu,
	Fabiano Rosas, Pierrick Bouvier, Philippe Mathieu-Daudé,
	Zhao Liu, Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Maor Gottlieb

Avihai Horon <avihaih@nvidia.com> writes:

> On 6/3/2026 10:17 AM, Markus Armbruster wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> Avihai Horon <avihaih@nvidia.com> writes:
>>
>>> Switchover-ack is a mechanism to synchronize between source and
>>> destination QEMU during migration to prevent the source from switching
>>> over prematurely.
>>>
>>> VFIO uses switchover-ack to ensure switchover happens only after
>>> destination side has loaded the precopy initial bytes. This is important
>>> for VFIO, as otherwise downtime could be impacted and be higher.
>>>
>>> In its current state, switchover-ack is a one-time mechanism, meaning
>>> that switchover is acked only once and past that another ACK cannot be
>>> requested again. This was sufficient until now, as VFIO precopy initial
>>> bytes was defined to be monotonically decreasing. Thus, when precopy
>>> initial bytes reached zero for all VFIO devices, a single ACK would be
>>> sent and its validity would hold.
>>>
>>> However, now the new VFIO_PRECOPY_INFO_REINIT feature allows precopy
>>> initial bytes to be re-initialized during precopy. Specifically, it
>>> means that initial bytes can grow after reaching zero, which would
>>> invalidate a previously sent switchover ACK.
>>>
>>> To solve this, make switchover-ack reusable and allow devices to request
>>> switchover ACKs when needed via the save_query_pending SaveVMHandler.
>>>
>>> Since now switchover ACK can be requested for a specific device and in
>>> different times, make switchover ACK per-device (instead of a single ACK
>>> for all devices) and let source side do the pending ACKs accounting.
>>>
>>> Keep the legacy switchover-ack mechanism for backward compatibility and
>>> turn it on by a compatibility property for older machines. Enable the
>>> property until VFIO implements the new switchover-ack.
>>
>> I figure this is about the transmission of ACKs via the return path.  If
>> both source and destination understand the new per-device ACK, they use
>> it.  Else, they use old one.  Correct?
>
> Correct.
>
>>
>> This is not visible to the management application.  It may use migration
>> capability @switchover-ack to enable just as before.  Correct?
>
> Correct.
>
>>
>> Can you think of a reason why management applications might need to know
>> whether a certain qemu-system-FOO supports per-device ACKs?
>
> Not that I can think of. It should be an implementation detail internal to QEMU.

Thank you.

>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>> ---
>>>  qapi/migration.json                | 16 ++++-----
>>>  include/migration/client-options.h |  1 +
>>>  include/migration/register.h       |  2 ++
>>>  migration/migration.h              | 32 ++++++++++++++++--
>>>  migration/savevm.h                 |  6 ++--
>>>  hw/core/machine.c                  |  1 +
>>>  migration/migration.c              | 37 ++++++++++++++-------
>>>  migration/options.c                | 10 ++++++
>>>  migration/savevm.c                 | 53 +++++++++++++++++++++++-------
>>>  migration/trace-events             |  5 +--
>>>  10 files changed, 125 insertions(+), 38 deletions(-)
>>>
>>> diff --git a/qapi/migration.json b/qapi/migration.json
>>> index 27a7970556..550cb77762 100644
>>> --- a/qapi/migration.json
>>> +++ b/qapi/migration.json
>>> @@ -507,15 +507,13 @@
>>>  #     and should not affect the correctness of postcopy migration.
>>>  #     (since 7.1)
>>>  #
>>> -# @switchover-ack: If enabled, migration will not stop the source VM
>>> -#     and complete the migration until an ACK is received from the
>>> -#     destination that it's OK to do so.  Exactly when this ACK is
>>> -#     sent depends on the migrated devices that use this feature.  For
>>> -#     example, a device can use it to make sure some of its data is
>>> -#     sent and loaded in the destination before doing switchover.
>>> -#     This can reduce downtime if devices that support this capability
>>> -#     are present.  'return-path' capability must be enabled to use
>>> -#     it.  (since 8.1)
>>> +# @switchover-ack: If enabled, migration will rely on destination side
>>> +#     to acknowledge the source's switchover decision.  The
>>> +#     acknowledgement may depend, for example, on some device's data
>>> +#     being loaded in the destination before doing switchover.  This
>>> +#     can reduce downtime if devices that support this capability are
>>> +#     present.  'return-path' capability must be enabled to use it.
>>
>> Please use the opportunity to fix markup: @return-path.
>> docs/devel/qapi-code-gen.rst:
>>
>>      Use @foo to reference a member description within the current
>>      definition.  This is an rST extension.  It is currently rendered the
>>      same way as ````foo````, but carries additional meaning.
>>
>> Suggest "Capability @return-path must be ..."
>
> Sure, will change it.
>
>>
>> The old text is concrete: "will not stop the source VM ... until".  The
>> new text is vague "will rely on destination side to acknowledge".  What
>> does that mean exactly?  How does it impact behavior the user /
>> management application cares about?
>
> How about the suggestion below? It's both concrete and doesn't mention the number of ACKs (as requested by Peter).
>
> # @switchover-ack: If enabled, migration will not stop the source VM
> #     and complete the migration until the destination has acknowledged
> #     that switchover is safe.  The acknowledgement may depend...
>
> Thanks.

Better!

With these improvements, QAPI schema
Acked-by: Markus Armbruster <armbru@redhat.com>


>>
>>> +#     (since 8.1)
>>>  #
>>>  # @dirty-limit: If enabled, migration will throttle vCPUs as needed to
>>>  #     keep their dirty page rate within @vcpu-dirty-limit.  This can
>> [...]
>>



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 02/13] migration: Run final save_query_pending at switchover
  2026-06-04 17:18         ` Peter Xu
@ 2026-06-08 12:07           ` Avihai Horon
  2026-06-08 14:11             ` Peter Xu
  0 siblings, 1 reply; 38+ messages in thread
From: Avihai Horon @ 2026-06-08 12:07 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb, Gavin Shan, Wei Wang


On 6/4/2026 8:18 PM, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> On Thu, Jun 04, 2026 at 06:29:01PM +0300, Avihai Horon wrote:
>> On 6/4/2026 12:04 AM, Peter Xu wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> On Wed, Jun 03, 2026 at 04:05:49PM -0400, Peter Xu wrote:
>>>> On Tue, Jun 02, 2026 at 12:26:10PM +0300, Avihai Horon wrote:
>>>>> Before switchover, the source needs one last exact pending query so
>>>>> modules can flush dirty state. This is currently done ad hoc in modules
>>>>> handlers. For example, RAM syncs its dirty bitmap in its save_complete
>>>>> handler.
>>>>>
>>>>> This should be a general concept relevant for any module, so extract it
>>>>> to migration core instead by running a final save_query_pending before
>>>>> switchover.
>>>>>
>>>>> The final query requires special handling by modules (e.g., it's called
>>>>> with BQL locked, during VM stop), so extend save_query_pending
>>>>> SaveVMHandlers callback and qemu_savevm_query_pending() with a "final"
>>>>> flag so migration modules can tell the last pending query during
>>>>> switchover from periodic iteration queries.
>>>>>
>>>>> Not all modules need the final query; skip it for those who don't need.
>>>>>
>>>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>>>> ---
>>>>>    include/migration/register.h   | 41 ++++++++++++++++++----------------
>>>>>    migration/savevm.h             |  3 ++-
>>>>>    hw/s390x/s390-stattrib.c       |  9 ++++++--
>>>>>    hw/vfio/migration.c            |  6 ++++-
>>>>>    migration/block-dirty-bitmap.c |  6 ++++-
>>>>>    migration/migration.c          |  7 ++++--
>>>>>    migration/ram.c                | 37 ++++++++++++++++++------------
>>>>>    migration/savevm.c             | 27 +++++++++++++++++-----
>>>>>    migration/trace-events         |  2 +-
>>>>>    9 files changed, 91 insertions(+), 47 deletions(-)
>>>>>
>>>>> diff --git a/include/migration/register.h b/include/migration/register.h
>>>>> index 5e5e0ee432..6f632123f1 100644
>>>>> --- a/include/migration/register.h
>>>>> +++ b/include/migration/register.h
>>>>> @@ -171,6 +171,28 @@ typedef struct SaveVMHandlers {
>>>>>         */
>>>>>        bool (*is_active_iterate)(void *opaque);
>>>>>
>>>>> +    /**
>>>>> +     * @save_query_pending
>>>>> +     *
>>>>> +     * This estimates the remaining data to transfer on the source side.
>>>>> +     *
>>>>> +     * When @exact is true, a module must report accurate results.  When
>>>>> +     * @exact is false, a module may report estimates.
>>>>> +     *
>>>>> +     * It's highly recommended that modules implement a faster version of
>>>>> +     * the query path (for example, by proper caching on the counters) if
>>>>> +     * an accurate query will be time-consuming.
>>>>> +     *
>>>>> +     * @opaque: data pointer passed to register_savevm_live()
>>>>> +     * @pending: pointer to a MigPendingData struct
>>>>> +     * @exact: set to true for an accurate (slow) query
>>>>> +     * @final: set to true for the final query during switchover. When final is
>>>>> +     * true, the query is called with BQL locked. Otherwise, it's called with
>>>>> +     * BQL unlocked.
>>>>> +     */
>>>>> +    void (*save_query_pending)(void *opaque, MigPendingData *pending,
>>>>> +                               bool exact, bool final);
>>>>> +
>>>>>        /* This runs outside the BQL in the migration case, and
>>>>>         * within the lock in the savevm case.  The callback had better only
>>>>>         * use data that is local to the migration thread or protected
>>>>> @@ -210,25 +232,6 @@ typedef struct SaveVMHandlers {
>>>>>         */
>>>>>        bool (*save_postcopy_prepare)(QEMUFile *f, void *opaque, Error **errp);
>>>>>
>>>>> -    /**
>>>>> -     * @save_query_pending
>>>>> -     *
>>>>> -     * This estimates the remaining data to transfer on the source side.
>>>>> -     *
>>>>> -     * When @exact is true, a module must report accurate results.  When
>>>>> -     * @exact is false, a module may report estimates.
>>>>> -     *
>>>>> -     * It's highly recommended that modules implement a faster version of
>>>>> -     * the query path (for example, by proper caching on the counters) if
>>>>> -     * an accurate query will be time-consuming.
>>>>> -     *
>>>>> -     * @opaque: data pointer passed to register_savevm_live()
>>>>> -     * @pending: pointer to a MigPendingData struct
>>>>> -     * @exact: set to true for an accurate (slow) query
>>>>> -     */
>>>>> -    void (*save_query_pending)(void *opaque, MigPendingData *pending,
>>>>> -                               bool exact);
>>>>> -
>>>>>        /**
>>>>>         * @load_state
>>>>>         *
>>>>> diff --git a/migration/savevm.h b/migration/savevm.h
>>>>> index 96fdf96d4e..10b3d78a5f 100644
>>>>> --- a/migration/savevm.h
>>>>> +++ b/migration/savevm.h
>>>>> @@ -45,7 +45,8 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
>>>>>    void qemu_savevm_state_cleanup(void);
>>>>>    void qemu_savevm_state_complete_postcopy(QEMUFile *f);
>>>>>    int qemu_savevm_state_complete_precopy(MigrationState *s);
>>>>> -void qemu_savevm_query_pending(MigPendingData *pending, bool exact);
>>>>> +void qemu_savevm_query_pending_iter(MigPendingData *pending, bool exact);
>>>>> +void qemu_savevm_query_pending_final(MigPendingData *pending);
>>>>>    int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy);
>>>>>    bool qemu_savevm_state_postcopy_prepare(QEMUFile *f, Error **errp);
>>>>>    void qemu_savevm_state_end(QEMUFile *f);
>>>>> diff --git a/hw/s390x/s390-stattrib.c b/hw/s390x/s390-stattrib.c
>>>>> index c334714b31..aed0c6844d 100644
>>>>> --- a/hw/s390x/s390-stattrib.c
>>>>> +++ b/hw/s390x/s390-stattrib.c
>>>>> @@ -190,12 +190,17 @@ static int cmma_save_setup(QEMUFile *f, void *opaque, Error **errp)
>>>>>    }
>>>>>
>>>>>    static void cmma_state_pending(void *opaque, MigPendingData *pending,
>>>>> -                               bool exact)
>>>>> +                               bool exact, bool final)
>>>>>    {
>>>>>        S390StAttribState *sas = S390_STATTRIB(opaque);
>>>>>        S390StAttribClass *sac = S390_STATTRIB_GET_CLASS(sas);
>>>>> -    long long res = sac->get_dirtycount(sas);
>>>>> +    long long res;
>>>>>
>>>>> +    if (final) {
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    res = sac->get_dirtycount(sas);
>>>>>        if (res >= 0) {
>>>>>            pending->precopy_bytes += res;
>>>>>        }
>>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>>> index fb12b9717f..95072d6664 100644
>>>>> --- a/hw/vfio/migration.c
>>>>> +++ b/hw/vfio/migration.c
>>>>> @@ -622,12 +622,16 @@ static void vfio_state_pending_sync(VFIODevice *vbasedev)
>>>>>    }
>>>>>
>>>>>    static void vfio_state_pending(void *opaque, MigPendingData *pending,
>>>>> -                               bool exact)
>>>>> +                               bool exact, bool final)
>>>>>    {
>>>>>        VFIODevice *vbasedev = opaque;
>>>>>        VFIOMigration *migration = vbasedev->migration;
>>>>>        uint64_t precopy_size, stopcopy_size;
>>>>>
>>>>> +    if (final) {
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>>        if (exact) {
>>>>>            vfio_state_pending_sync(vbasedev);
>>>>>        }
>>>>> diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
>>>>> index 7ef3759e53..66ed91de03 100644
>>>>> --- a/migration/block-dirty-bitmap.c
>>>>> +++ b/migration/block-dirty-bitmap.c
>>>>> @@ -767,12 +767,16 @@ static int dirty_bitmap_save_complete(QEMUFile *f, void *opaque)
>>>>>    }
>>>>>
>>>>>    static void dirty_bitmap_state_pending(void *opaque, MigPendingData *data,
>>>>> -                                       bool exact)
>>>>> +                                       bool exact, bool final)
>>>>>    {
>>>>>        DBMSaveState *s = &((DBMState *)opaque)->save;
>>>>>        SaveBitmapState *dbms;
>>>>>        uint64_t pending = 0;
>>>>>
>>>>> +    if (final) {
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>>        bql_lock();
>>>>>
>>>>>        QSIMPLEQ_FOREACH(dbms, &s->dbms_list, entry) {
>>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>>> index 7488a94206..df8ed3a9fd 100644
>>>>> --- a/migration/migration.c
>>>>> +++ b/migration/migration.c
>>>>> @@ -2787,12 +2787,15 @@ static bool migration_switchover_prepare(MigrationState *s)
>>>>>    static bool migration_switchover_start(MigrationState *s, Error **errp)
>>>>>    {
>>>>>        ERRP_GUARD();
>>>>> +    MigPendingData pending = {};
>>>>>
>>>>>        if (!migration_switchover_prepare(s)) {
>>>>>            error_setg(errp, "Switchover is interrupted");
>>>>>            return false;
>>>>>        }
>>>>>
>>>>> +    qemu_savevm_query_pending_final(&pending);
>>>> Not needed for postcopy, we can move it to migration_completion_precopy().
>> While currently not needed (because VFIO isn't compatible with postcopy),
>> AFAIU, generally it is needed because we have here a switchover similar to
>> precopy case -- the switchover decision and VM stop at stopcopy_start() are
>> not atomic, so theoretically a device may want to request an ACK in this
>> gap.
>>
>> But that's all theoretical and not necessary currently.
> True, then it makes more sense to keep it here.  You can enhance the
> comment below with VFIO implications for the future DMA faults when it's
> ready.
>
>>> A better idea is.. we can drop migration_bitmap_sync() in
>>> ram_postcopy_send_discard_bitmap() too, then we can keep this one.
>>>
>>> But then, let's properly document this, both for precopy and postcopy
>>> cases.  One example:
>>>
>>>       /*
>>>        * The final query to the whole system on dirty data to make sure we
>>>        * collect the latest status of the VM.  For precopy, source QEMU will
>>>        * dump all the dirty data during switchover.  For postcopy, this will
>>>        * properly update all the dirty bitmaps to finally generate the
>>>        * correct discard bitmaps; see ram_postcopy_send_discard_bitmap().
>>>        */
>> The last dirty sync is done a bit differently for postcopy case:
>>    postcopy_start()
>>      ram_postcopy_send_discard_bitmap()
>>        migration_bitmap_sync(rs, false);
>>
>> 1. The last_stage parameter of migration_bitmap_sync() is false (shouldn't
>> it be true here as well?).
> Yes, I think it should.
>
> The only thing uses that last_stage so far should be ARM64 where some dirty
> data can be generated to bitmap not ring (only matters when dirty ring
> enabled).  In that case AFAIU we need that too for a postcopy switchover.

Ack.

>
>> 2. It calls migration_bitmap_sync and not migration_bitmap_sync_precopy,
>> i.e., without precopy_notify calls.
> This is another thing I feel like got overlooked in the free page hint
> feature.  The only notifiers registered is that balloon device, logically I
> think we need these notifiers in postcopy too to make sure we properly stop
> the free page hints seeing BEFORE_BITMAP_SYNC, then don't start it
> (vm_running=false) in AFTER_BITMAP_SYNC:
>
> virtio_balloon_free_page_hint_notify():
>
>      case PRECOPY_NOTIFY_BEFORE_BITMAP_SYNC:
>          virtio_balloon_free_page_stop(dev);
>          break;
>      case PRECOPY_NOTIFY_AFTER_BITMAP_SYNC:
>          if (vdev->vm_running) {
>              virtio_balloon_free_page_start(dev);
>              break;
>          }

Actually, looking more closely in the code, I see that it's not used if 
postcopy is enabled [1].
So calling the precopy notifiers in postcopy switchover is basically a 
no-op.

Then it seems fine to call migration_bitmap_sync_precopy from postcopy 
switchover flow.

[1] See commit fd51e54fa102 ("virtio-balloon: don't start free page 
hinting if postcopy is possible")

>> If we need to keep the above differences for postcopy then I guess we'd need
>> to introduce different final query flow for it. It may be cumbersome.
>> Do you think we should do this extra effort to support it, or, as you
>> suggested, we can drop the final switchover ack check for postcopy for now?
> IMHO you can add a small patch for replacing the postcopy sync first with
> migration_bitmap_sync_precopy(last_stage=true), copy Gavin and Wei in that
> patch only:
>
> Gavin Shan <gshan@redhat.com>
> Wei Wang <wei.w.wang@intel.com>
>
> Then this patch can be on top.

Sure, will do.

>
> I'll loop both of them in while replying, to see if we can collect early
> comments.
>
>>>> I'd also appreciate a short comment explaining why the final query is
>>>> needed.
>>>>
>>>>> +
>>>>>        /* Inactivate disks except in COLO */
>>>>>        if (!migrate_colo()) {
>>>>>            /*
>>>>> @@ -3285,7 +3288,7 @@ static void migration_iteration_go_next(MigPendingData *pending)
>>>>>        /*
>>>>>         * Do a slow sync first before boosting the iteration count.
>>>>>         */
>>>>> -    qemu_savevm_query_pending(pending, true);
>>>>> +    qemu_savevm_query_pending_iter(pending, true);
>>>>>
>>>>>        /*
>>>>>         * Update the dirty information for the whole system for this
>>>>> @@ -3336,7 +3339,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>>>>>        bool complete_ready;
>>>>>
>>>>>        /* Fast path - get the estimated amount of pending data */
>>>>> -    qemu_savevm_query_pending(&pending, false);
>>>>> +    qemu_savevm_query_pending_iter(&pending, false);
>>>>>
>>>>>        if (in_postcopy) {
>>>>>            /*
>>>>> diff --git a/migration/ram.c b/migration/ram.c
>>>>> index fc38ffbf8a..4e3f442593 100644
>>>>> --- a/migration/ram.c
>>>>> +++ b/migration/ram.c
>>>>> @@ -3376,10 +3376,6 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>>>>>        rs->last_stage = !migration_in_colo_state();
>>>>>
>>>>>        WITH_RCU_READ_LOCK_GUARD() {
>>>>> -        if (!migration_in_postcopy()) {
>>>>> -            migration_bitmap_sync_precopy(true);
>>>>> -        }
>>>>> -
>>>>>            ret = rdma_registration_start(f, RAM_CONTROL_FINISH);
>>>>>            if (ret < 0) {
>>>>>                qemu_file_set_error(f, ret);
>>>>> @@ -3442,25 +3438,38 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>>>>>        return qemu_fflush(f);
>>>>>    }
>>>>>
>>>>> -static void ram_state_pending(void *opaque, MigPendingData *pending,
>>>>> -                              bool exact)
>>>>> +static void ram_state_pending_sync(bool exact, bool final)
>>>>>    {
>>>>> -    RAMState **temp = opaque;
>>>>> -    RAMState *rs = *temp;
>>>>> -    uint64_t remaining_size;
>>>>> -
>>>>>        /*
>>>>>         * Sync is not needed either with: (1) a fast query, or (2) after
>>>>>         * postcopy has started (no new dirty will generate anymore).
>>>>>         */
>>>>> -    if (exact && !migration_in_postcopy()) {
>>>>> +    if (!exact || migration_in_postcopy()) {
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    /* Final pending query is called with BQL locked */
>>>>> +    if (!final) {
>>>>>            bql_lock();
>>>>> -        WITH_RCU_READ_LOCK_GUARD() {
>>>>> -            migration_bitmap_sync_precopy(false);
>>>>> -        }
>>>>> +    }
>>>>> +
>>>>> +    WITH_RCU_READ_LOCK_GUARD() {
>>>>> +        migration_bitmap_sync_precopy(final);
>>>>> +    }
>>>>> +
>>>>> +    if (!final) {
>>>>>            bql_unlock();
>>>>>        }
>>>>> +}
>>>>> +
>>>>> +static void ram_state_pending(void *opaque, MigPendingData *pending,
>>>>> +                              bool exact, bool final)
>>>>> +{
>>>>> +    RAMState **temp = opaque;
>>>>> +    RAMState *rs = *temp;
>>>>> +    uint64_t remaining_size;
>>>>>
>>>>> +    ram_state_pending_sync(exact, final);
>>>>>        remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
>>>>>
>>>>>        if (migrate_postcopy_ram()) {
>>>>> diff --git a/migration/savevm.c b/migration/savevm.c
>>>>> index 23adaf9dd9..22cb6a15c6 100644
>>>>> --- a/migration/savevm.c
>>>>> +++ b/migration/savevm.c
>>>>> @@ -1795,7 +1795,8 @@ int qemu_savevm_state_complete_precopy(MigrationState *s)
>>>>>        return qemu_fflush(f);
>>>>>    }
>>>>>
>>>>> -void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
>>>>> +static void qemu_savevm_query_pending(MigPendingData *pending, bool exact,
>>>>> +                                      bool final)
>>>>>    {
>>>>>        SaveStateEntry *se;
>>>>>
>>>>> @@ -1808,7 +1809,7 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
>>>>>            if (!qemu_savevm_state_active(se)) {
>>>>>                continue;
>>>>>            }
>>>>> -        se->ops->save_query_pending(se->opaque, pending, exact);
>>>>> +        se->ops->save_query_pending(se->opaque, pending, exact, final);
>>>>>        }
>>>>>
>>>>>        pending->total_bytes = pending->precopy_bytes +
>>>>> @@ -1817,16 +1818,30 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
>>>>>        /*
>>>>>         * Update system remaining dirty bytes whenever QEMU queries.  It will
>>>>>         * make the value to be not as accurate, but should still be pretty
>>>>> -     * close to reality when this got invoked frequently while iterating.
>>>>> +     * close to reality when this got invoked frequently while iterating.  Don't
>>>>> +     * update in final query as some modules may skip it if not needed.
>>>> IMHO this specialty isn't needed.  We can just make all modules report and
>>>> making sure we don't deadlock on bql.  AFAIU, CMMA never takes BQL, while
>>>> dirty-bitmap should conditionally take it now.
>> I didn't want to add unnecessary work for the critical downtime path, i.e.,
>> unnecessary exact queries.
> If there's any real slow sync, then likely it needs to happen.. because of
> the same reason why RAM do the sync, to make sure the data is latest in
> QEMU migration core (say, bitmaps).  If it's a fast sync then it doesn't
> matter.

Yes, but what I meant is that currently cmma and block dirty bitmaps 
don't issue a final slow sync during switchover, only RAM does.

But it does seem harmless to add this final sync for them.

>
> query_pending() so far should achieve both functions: (1) synchronize dirty
> info with whatever remote modules like KVM, enforced when exact=true, (2)
> report.
>
> VFIO is just special because VFIO dirty info accuracy never rely on
> query_pending or whatever it collects, so it's purely a way to estimate for
> size of dirty, hence (2) only.
>
>> I am not familiar with the other devices, but if you think it's negligible
>> for them then we can drop this specialty (in VFIO I'd still like to skip the
>> stopcopy size query; using the cached value is good enough IMHO).
> Yes VFIO is fine but it'll be good to comment.  Let's try to drop them, I
> want to avoid below "if (!final)" if possible.

Alright, will change that.

Thanks.

>
> Thanks,
>
>> Thanks.
>>
>>>>>         */
>>>>> -    mig_stats.dirty_bytes_total = pending->total_bytes;
>>>>> -
>>>>> -    trace_qemu_savevm_query_pending(exact, pending->precopy_bytes,
>>>>> +    if (!final) {
>>>>> +        mig_stats.dirty_bytes_total = pending->total_bytes;
>>>>> +    }
> [a]
>
>>>> Then we remove this too which is counter-intuitive..
>>>>
>>>> The rest looks all reasonable.
>>>>
>>>> Thanks,
>>>>
>>>>> +    trace_qemu_savevm_query_pending(exact, final, pending->precopy_bytes,
>>>>>                                        pending->stopcopy_bytes,
>>>>>                                        pending->postcopy_bytes,
>>>>>                                        pending->total_bytes);
>>>>>    }
>>>>>
>>>>> +void qemu_savevm_query_pending_iter(MigPendingData *pending, bool exact)
>>>>> +{
>>>>> +    qemu_savevm_query_pending(pending, exact, false);
>>>>> +}
>>>>> +
>>>>> +void qemu_savevm_query_pending_final(MigPendingData *pending)
>>>>> +{
>>>>> +    g_assert(bql_locked());
>>>>> +
>>>>> +    qemu_savevm_query_pending(pending, true, true);
>>>>> +}
>>>>> +
>>>>>    void qemu_savevm_state_cleanup(void)
>>>>>    {
>>>>>        SaveStateEntry *se;
>>>>> diff --git a/migration/trace-events b/migration/trace-events
>>>>> index de99d976ab..1c9212d3e2 100644
>>>>> --- a/migration/trace-events
>>>>> +++ b/migration/trace-events
>>>>> @@ -7,7 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
>>>>>    qemu_loadvm_state_post_main(int ret) "%d"
>>>>>    qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
>>>>>    qemu_savevm_send_packaged(void) ""
>>>>> -qemu_savevm_query_pending(bool exact, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy, uint64_t total) "exact=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64", total=%"PRIu64
>>>>> +qemu_savevm_query_pending(bool exact, bool final, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy, uint64_t total) "exact=%d, final=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64", total=%"PRIu64
>>>>>    loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
>>>>>    loadvm_state_setup(void) ""
>>>>>    loadvm_state_cleanup(void) ""
>>>>> --
>>>>> 2.40.1
>>>>>
>>>> --
>>>> Peter Xu
>>> --
>>> Peter Xu
>>>
> --
> Peter Xu
>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 07/13] migration: Fail migration if switchover-ack is requested after switchover decision
  2026-06-04 17:48       ` Peter Xu
@ 2026-06-08 12:49         ` Avihai Horon
  0 siblings, 0 replies; 38+ messages in thread
From: Avihai Horon @ 2026-06-08 12:49 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb


On 6/4/2026 8:48 PM, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> On Thu, Jun 04, 2026 at 06:36:14PM +0300, Avihai Horon wrote:
>> On 6/3/2026 11:42 PM, Peter Xu wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> On Tue, Jun 02, 2026 at 12:26:15PM +0300, Avihai Horon wrote:
>>>> Switchover ACK is checked only during precopy while the guest is still
>>>> running. The last migration_can_switchover() decision and guest stop are
>>>> not atomic, so a device may want to request another switchover ACK in
>>>> the gap after switchover decision has been made but before the guest is
>>>> stopped. Migration would then miss that request, which can increase
>>>> downtime.
>>>>
>>>> Cover this case by failing the migration if a switchover-ack was
>>>> requested during that time.
>>>>
>>>> Ideally, precopy iterations should be resumed in this case, however,
>>>> VFIO doesn't support going back to precopy after being stopped, so
>>>> implementing such logic would require non-trivial changes to the guest
>>>> start/stop flow. Given the above and that this case should be rare,
>>>> failing the migration seems reasonable.
>>>>
>>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>> Reviewed-by: Peter Xu <peterx@redhat.com>
>>>
>>> One nit:
>>>
>>>> ---
>>>>    migration/migration.c | 30 ++++++++++++++++++++++++++++++
>>>>    1 file changed, 30 insertions(+)
>>>>
>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>> index 4bb649a467..6ee1c795ff 100644
>>>> --- a/migration/migration.c
>>>> +++ b/migration/migration.c
>>>> @@ -2810,6 +2810,27 @@ static bool migration_switchover_prepare(MigrationState *s)
>>>>        return s->state == MIGRATION_STATUS_DEVICE;
>>>>    }
>>>>
>>>> +static bool migration_switchover_check_switchover_ack_pending(MigrationState *s,
>>> The name is slightly confusing.  It says "check if there is pending ack"
>>> but then it returns true if there's no pending ACK..
>>>
>>> Maybe migration_switchover_is_acknowledged()?
>> Sure, or maybe migration_have_new_switchover_ack_pending()?
> Sure.
>
>> I just noticed that I overlooked one corner case:
>> If guest is stopped during precopy and source has pending acks > 0 (e.g.,
>> virsh migrate with --timeout param), then we will fail in this check and
>> unnecessarily abort migration.
>>
>> I thought about doing the change below. It's not as clean as I'd want it to
>> be, but I have no other ideas.
> Yes, it's not good to bleed switchover_ack_pending_num all over the
> places.. Maybe you can make qemu_savevm_query_pending_final() return an
> error when seeing any newly popped ack request while querying?  Then the
> new function is not needed.

Ah, I totally forgot that we have 
MigPendingData->switchover_ack_pending. So no need to check the 
global switchover_ack_pending_num counter at all.

I will move this check inside qemu_savevm_query_pending_final() and 
return error.

Thanks.

>
> Thanks,
>
>> ===========8<===========
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 6ee1c795ff..6054453b18 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -2810,37 +2810,47 @@ static bool
>> migration_switchover_prepare(MigrationState *s)
>>       return s->state == MIGRATION_STATUS_DEVICE;
>>   }
>>
>> -static bool
>> migration_switchover_check_switchover_ack_pending(MigrationState *s,
>> - Error **errp)
>> +static bool migration_have_new_switchover_ack_pending(
>> +    MigrationState *s, uint32_t old_switchover_ack_pending_num, Error
>> **errp)
>>   {
>>       uint32_t pending_num;
>>
>>       if (!migrate_switchover_ack() || migrate_switchover_ack_legacy()) {
>> -        return true;
>> +        return false;
>>       }
>>
>>       pending_num = qatomic_read(&s->switchover_ack_pending_num);
>> -    if (pending_num > 0) {
>> +    if (pending_num > old_switchover_ack_pending_num) {
>>           error_setg(errp,
>>                      "Switchover ACK was requested by %" PRIu32
>>                      " devices during switchover",
>> -                   pending_num);
>> -        return false;
>> +                   pending_num - old_switchover_ack_pending_num);
>> +        return true;
>>       }
>>
>> -    return true;
>> +    return false;
>>   }
>>
>>   static bool migration_switchover_start(MigrationState *s, Error **errp)
>>   {
>>       ERRP_GUARD();
>>       MigPendingData pending = {};
>> +    uint32_t old_switchover_ack_pending_num;
>>
>>       if (!migration_switchover_prepare(s)) {
>>           error_setg(errp, "Switchover is interrupted");
>>           return false;
>>       }
>>
>> +    /*
>> +     * If the guest was stopped during precopy (e.g., by management), we
>> may get
>> +     * here with switchover_ack_pending_num > 0. Store the current
>> +     * switchover_ack_pending_num value before the final pending query to
>> +     * distinguish between existing pending ACKs and new ones.
>> +     */
>> +    old_switchover_ack_pending_num =
>> +        qatomic_read(&s->switchover_ack_pending_num);
>> +
>>       qemu_savevm_query_pending_final(s, &pending);
>>
>>       /*
>> @@ -2848,7 +2858,8 @@ static bool migration_switchover_start(MigrationState
>> *s, Error **errp)
>>        * Fail the migration in this case since we currently don't support
>> going
>>        * back to precopy.
>>        */
>> -    if (!migration_switchover_check_switchover_ack_pending(s, errp)) {
>> +    if (migration_have_new_switchover_ack_pending(
>> +            s, old_switchover_ack_pending_num, errp)) {
>>           return false;
>>       }
>>
>> ===========8<===========
>>
>> Thanks.
>>
>>>> +                                                              Error **errp)
>>>> +{
>>>> +    uint32_t pending_num;
>>>> +
>>>> +    if (!migrate_switchover_ack() || migrate_switchover_ack_legacy()) {
>>>> +        return true;
>>>> +    }
>>>> +
>>>> +    pending_num = qatomic_read(&s->switchover_ack_pending_num);
>>>> +    if (pending_num > 0) {
>>>> +        error_setg(errp,
>>>> +                   "Switchover ACK was requested by %" PRIu32
>>>> +                   " devices during switchover",
>>>> +                   pending_num);
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    return true;
>>>> +}
>>>> +
>>>>    static bool migration_switchover_start(MigrationState *s, Error **errp)
>>>>    {
>>>>        ERRP_GUARD();
>>>> @@ -2822,6 +2843,15 @@ static bool migration_switchover_start(MigrationState *s, Error **errp)
>>>>
>>>>        qemu_savevm_query_pending_final(s, &pending);
>>>>
>>>> +    /*
>>>> +     * Switchover-ack requests done after switchover decision, are not allowed.
>>>> +     * Fail the migration in this case since we currently don't support going
>>>> +     * back to precopy.
>>>> +     */
>>>> +    if (!migration_switchover_check_switchover_ack_pending(s, errp)) {
>>>> +        return false;
>>>> +    }
>>>> +
>>>>        /* Inactivate disks except in COLO */
>>>>        if (!migrate_colo()) {
>>>>            /*
>>>> --
>>>> 2.40.1
>>>>
>>> --
>>> Peter Xu
>>>
> --
> Peter Xu
>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 02/13] migration: Run final save_query_pending at switchover
  2026-06-08 12:07           ` Avihai Horon
@ 2026-06-08 14:11             ` Peter Xu
  2026-06-08 14:32               ` Avihai Horon
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Xu @ 2026-06-08 14:11 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb, Gavin Shan, Wei Wang

On Mon, Jun 08, 2026 at 03:07:32PM +0300, Avihai Horon wrote:
> > > 2. It calls migration_bitmap_sync and not migration_bitmap_sync_precopy,
> > > i.e., without precopy_notify calls.
> > This is another thing I feel like got overlooked in the free page hint
> > feature.  The only notifiers registered is that balloon device, logically I
> > think we need these notifiers in postcopy too to make sure we properly stop
> > the free page hints seeing BEFORE_BITMAP_SYNC, then don't start it
> > (vm_running=false) in AFTER_BITMAP_SYNC:
> > 
> > virtio_balloon_free_page_hint_notify():
> > 
> >      case PRECOPY_NOTIFY_BEFORE_BITMAP_SYNC:
> >          virtio_balloon_free_page_stop(dev);
> >          break;
> >      case PRECOPY_NOTIFY_AFTER_BITMAP_SYNC:
> >          if (vdev->vm_running) {
> >              virtio_balloon_free_page_start(dev);
> >              break;
> >          }
> 
> Actually, looking more closely in the code, I see that it's not used if
> postcopy is enabled [1].
> So calling the precopy notifiers in postcopy switchover is basically a
> no-op.
> 
> Then it seems fine to call migration_bitmap_sync_precopy from postcopy
> switchover flow.
> 
> [1] See commit fd51e54fa102 ("virtio-balloon: don't start free page hinting
> if postcopy is possible")

Yes, thanks for the pointer.  So it relies on the DONE event rather than
stop() to really stop the reporting.. good to know it was already bypassed
for postcopy, I definitely forgot that change.  For now, we can add another
trivial comment if we want.  So what we really need is final=true for
postcopy.

From a notifier semantics POV, IIUC it's also correct to notify here in
postcopy_start(), because at this stage it is still precopy.  Source QEMU
will move to postcopy stage (starting from POSTCOPY_DEVICE state) only if
postcopy_start() succeeded.  So a precopy notifier, no matter the analysis
of balloon use case, should theoretically apply here as well.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 02/13] migration: Run final save_query_pending at switchover
  2026-06-08 14:11             ` Peter Xu
@ 2026-06-08 14:32               ` Avihai Horon
  0 siblings, 0 replies; 38+ messages in thread
From: Avihai Horon @ 2026-06-08 14:32 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Fabiano Rosas,
	Pierrick Bouvier, Philippe Mathieu-Daudé, Zhao Liu,
	Halil Pasic, Christian Borntraeger, Jason Herne,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Eric Farman, Matthew Rosato, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow, Markus Armbruster,
	Maor Gottlieb, Gavin Shan, Wei Wang


On 6/8/2026 5:11 PM, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> On Mon, Jun 08, 2026 at 03:07:32PM +0300, Avihai Horon wrote:
>>>> 2. It calls migration_bitmap_sync and not migration_bitmap_sync_precopy,
>>>> i.e., without precopy_notify calls.
>>> This is another thing I feel like got overlooked in the free page hint
>>> feature.  The only notifiers registered is that balloon device, logically I
>>> think we need these notifiers in postcopy too to make sure we properly stop
>>> the free page hints seeing BEFORE_BITMAP_SYNC, then don't start it
>>> (vm_running=false) in AFTER_BITMAP_SYNC:
>>>
>>> virtio_balloon_free_page_hint_notify():
>>>
>>>       case PRECOPY_NOTIFY_BEFORE_BITMAP_SYNC:
>>>           virtio_balloon_free_page_stop(dev);
>>>           break;
>>>       case PRECOPY_NOTIFY_AFTER_BITMAP_SYNC:
>>>           if (vdev->vm_running) {
>>>               virtio_balloon_free_page_start(dev);
>>>               break;
>>>           }
>> Actually, looking more closely in the code, I see that it's not used if
>> postcopy is enabled [1].
>> So calling the precopy notifiers in postcopy switchover is basically a
>> no-op.
>>
>> Then it seems fine to call migration_bitmap_sync_precopy from postcopy
>> switchover flow.
>>
>> [1] See commit fd51e54fa102 ("virtio-balloon: don't start free page hinting
>> if postcopy is possible")
> Yes, thanks for the pointer.  So it relies on the DONE event rather than
> stop() to really stop the reporting.. good to know it was already bypassed
> for postcopy, I definitely forgot that change.  For now, we can add another
> trivial comment if we want.  So what we really need is final=true for
> postcopy.
>
>  From a notifier semantics POV, IIUC it's also correct to notify here in
> postcopy_start(), because at this stage it is still precopy.  Source QEMU
> will move to postcopy stage (starting from POSTCOPY_DEVICE state) only if
> postcopy_start() succeeded.  So a precopy notifier, no matter the analysis
> of balloon use case, should theoretically apply here as well.

Agreed.

Thanks.



^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2026-06-08 14:33 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-02  9:26 [PATCH v2 00/13] Make switchover-ack re-usable and add VFIO precopy REINIT feature Avihai Horon
2026-06-02  9:26 ` [PATCH v2 01/13] migration: Propagate errors in migration_completion_precopy() Avihai Horon
2026-06-03 19:47   ` Peter Xu
2026-06-04 15:12     ` Avihai Horon
2026-06-04 15:21       ` Peter Xu
2026-06-04 15:38         ` Avihai Horon
2026-06-02  9:26 ` [PATCH v2 02/13] migration: Run final save_query_pending at switchover Avihai Horon
2026-06-03 20:05   ` Peter Xu
2026-06-03 21:04     ` Peter Xu
2026-06-04 15:29       ` Avihai Horon
2026-06-04 17:18         ` Peter Xu
2026-06-08 12:07           ` Avihai Horon
2026-06-08 14:11             ` Peter Xu
2026-06-08 14:32               ` Avihai Horon
2026-06-02  9:26 ` [PATCH v2 03/13] migration: Log the approver in qemu_loadvm_approve_switchover() Avihai Horon
2026-06-02  9:26 ` [PATCH v2 04/13] migration: Replace switchover_ack_needed SaveVMHandler Avihai Horon
2026-06-02  9:26 ` [PATCH v2 05/13] migration: Rename switchover-ack code to legacy Avihai Horon
2026-06-03 20:21   ` Peter Xu
2026-06-02  9:26 ` [PATCH v2 06/13] migration: Make switchover-ack re-usable Avihai Horon
2026-06-03  7:17   ` Markus Armbruster
2026-06-04 15:05     ` Avihai Horon
2026-06-08 10:23       ` Markus Armbruster
2026-06-02  9:26 ` [PATCH v2 07/13] migration: Fail migration if switchover-ack is requested after switchover decision Avihai Horon
2026-06-03 20:42   ` Peter Xu
2026-06-04 15:36     ` Avihai Horon
2026-06-04 17:48       ` Peter Xu
2026-06-08 12:49         ` Avihai Horon
2026-06-02  9:26 ` [PATCH v2 08/13] vfio/migration: Extract VFIO_MIG_FLAG_DEV_INIT_DATA_SENT sending to helper Avihai Horon
2026-06-03 20:43   ` Peter Xu
2026-06-02  9:26 ` [PATCH v2 09/13] vfio/migration: Add Error ** parameter to vfio_migration_init() Avihai Horon
2026-06-03 11:38   ` Philippe Mathieu-Daudé
2026-06-04 15:06     ` Avihai Horon
2026-06-02  9:26 ` [PATCH v2 10/13] vfio/migration: Add new switchover-ack mechanism Avihai Horon
2026-06-03 20:47   ` Peter Xu
2026-06-02  9:26 ` [PATCH v2 11/13] vfio/migration: Implement VFIO_PRECOPY_INFO_REINIT feature Avihai Horon
2026-06-03 20:49   ` Peter Xu
2026-06-02  9:26 ` [PATCH v2 12/13] vfio/migration: Check VFIO_PRECOPY_INFO_REINIT during switchover Avihai Horon
2026-06-02  9:26 ` [PATCH v2 13/13] migration: Enable new switchover-ack Avihai Horon

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.