public inbox for qemu-devel@nongnu.org
 help / color / mirror / Atom feed
* [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports
@ 2026-03-19 23:12 Peter Xu
  2026-03-19 23:12 ` [PATCH RFC 01/12] migration: Fix low possibility downtime violation Peter Xu
                   ` (11 more replies)
  0 siblings, 12 replies; 28+ messages in thread
From: Peter Xu @ 2026-03-19 23:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, peterx, Zhiyi Guo,
	Markus Armbruster, Avihai Horon, Cédric Le Goater

CI: https://gitlab.com/peterx/qemu/-/pipelines/2396755777

VFIO migration was merged quite a while, but we do still see things off
here and there.  This series tries to address some of them, but only based
on my limited understandings.

This is RFC series as I don't have VFIO devices to test.  So one can also
see this as a raw proposal on raising the issues first with a solution that
hasn't been well tested.  However I tested non-vfio side and so far no
issue I'm aware.

Two major issues I wanted to resolve here:

(1) VFIO reports state_pending_{exact|estimate}() differently

It reports stop-only sizes in exact() only (which includes both precopy and
stopcopy data), while in estimate() it only reports precopy data.  This is
violating the API.  The guess was it was done like it to trigger proper
sync on the VFIO ioctls only.  This series should fix it by introducing
stopcopy size reporting facility for vmstate handlers.

(2) expected_downtime doesn't take VFIO devices into account

When query migration, QEMU reports one field called "expected-downtime".
The document was phrasing this almost from RAM perspective, but ideally it
should be about an estimated blackout window (in milliseconds) if we
switchover anytime, based on known information.

This didn't yet took VFIO into account, especially in the case of VFIO
devices that may contain a large amount of device states (like GPUs).

For problem (2), the use case should be that an mgmt app when migrating a
VFIO GPU device needs to always adjust downtime for migration to converge,
because when it's involved normal downtime like 300ms will normally not
suffice.

Now the issue with that is the mgmt doesn't have a good way to know exactly
how well the precopy goes with the whole system and the GPU device.

The hope is fixing expected_downtime may at least provide one way for the
mgmt app so that it can monitor this field in query-migrate at the start of
each iteration (by enabling events) to guess the progress, and it also may
provide a relatively reasonable hint for downtime at least in case of GPUs.
For this part, I tested nothing, so it's only a guess for now, but that's
the wish it'll be something easier to use than before.

When without GPU, mgmt can also monitor this field (which so far is the
only global field that one can query) so as to know how iterations are
making progresses, so the mgmt should expect to see expected_downtime
shrinking over iterations, and when it stops shrinking for a few iterations
maybe it's wise to do something about it.

Tests or reviews will be very much welcomed.

Thanks,

Peter Xu (12):
  migration: Fix low possibility downtime violation
  migration/qapi: Rename MigrationStats to MigrationRAMStats
  vfio/migration: Throttle vfio_save_block() on data size to read
  vfio/migration: Cache stop size in VFIOMigration
  migration/treewide: Merge @state_pending_{exact|estimate} APIs
  migration: Use the new save_query_pending() API directly
  migration: Introduce stopcopy_bytes in save_query_pending()
  vfio/migration: Fix incorrect reporting for VFIO pending data
  migration: Make iteration counter out of RAM
  migration: Introduce a helper to return switchover bw estimate
  migration: Calculate expected downtime on demand
  migration: Fix calculation of expected_downtime to take VFIO info

 docs/about/removed-features.rst   |   2 +-
 docs/devel/migration/main.rst     |   5 +-
 docs/devel/migration/vfio.rst     |   9 +-
 qapi/migration.json               |   8 +-
 hw/vfio/vfio-migration-internal.h |   1 +
 include/migration/register.h      |  64 ++++++------
 migration/migration-stats.h       |  15 +--
 migration/migration.h             |   2 +-
 migration/savevm.h                |   7 +-
 hw/s390x/s390-stattrib.c          |   8 +-
 hw/vfio/migration.c               | 114 ++++++++++++---------
 migration/block-dirty-bitmap.c    |   9 +-
 migration/migration.c             | 159 +++++++++++++++++++++---------
 migration/ram.c                   |  39 ++------
 migration/savevm.c                |  37 ++-----
 hw/vfio/trace-events              |   3 +-
 migration/trace-events            |   1 +
 17 files changed, 254 insertions(+), 229 deletions(-)

-- 
2.50.1



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH RFC 01/12] migration: Fix low possibility downtime violation
  2026-03-19 23:12 [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
@ 2026-03-19 23:12 ` Peter Xu
  2026-03-20 12:26   ` Prasad Pandit
  2026-03-19 23:12 ` [PATCH RFC 02/12] migration/qapi: Rename MigrationStats to MigrationRAMStats Peter Xu
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 28+ messages in thread
From: Peter Xu @ 2026-03-19 23:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, peterx, Zhiyi Guo,
	Markus Armbruster, Avihai Horon, Cédric Le Goater,
	qemu-stable

When QEMU queried the estimated version of pending data and thinks it's
ready to converge, it'll send another accurate query to make sure of it.
It is needed to make sure we collect the latest reports and that equation
still holds true.

However we missed one tiny little difference here on "<" v.s. "<=" when
comparing pending_size (A) to threshold_size (B)..

QEMU src only re-query if A<B, but will kickoff switchover if A<=B.

I think it means it is possible to happen if A (as an estimate only so far)
accidentally equals to B, then re-query won't happen and switchover will
proceed without considering new dirtied data.

It turns out it was an accident in my commit 7aaa1fc072 when refactoring
the code around.  Fix this by using the same equation in both places.

Fixes: 7aaa1fc072 ("migration: Rewrite the migration complete detect logic")
Cc: qemu-stable@nongnu.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index 5c9aaa6e58..dfc60372cf 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3242,7 +3242,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
          * postcopy started, so ESTIMATE should always match with EXACT
          * during postcopy phase.
          */
-        if (pending_size < s->threshold_size) {
+        if (pending_size <= s->threshold_size) {
             qemu_savevm_state_pending_exact(&must_precopy, &can_postcopy);
             pending_size = must_precopy + can_postcopy;
             trace_migrate_pending_exact(pending_size, must_precopy,
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH RFC 02/12] migration/qapi: Rename MigrationStats to MigrationRAMStats
  2026-03-19 23:12 [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
  2026-03-19 23:12 ` [PATCH RFC 01/12] migration: Fix low possibility downtime violation Peter Xu
@ 2026-03-19 23:12 ` Peter Xu
  2026-03-19 23:26   ` Peter Xu
  2026-03-20  6:54   ` Markus Armbruster
  2026-03-19 23:12 ` [PATCH RFC 03/12] vfio/migration: Throttle vfio_save_block() on data size to read Peter Xu
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 28+ messages in thread
From: Peter Xu @ 2026-03-19 23:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, peterx, Zhiyi Guo,
	Markus Armbruster, Avihai Horon, Cédric Le Goater,
	Libvirt Mailing List

This stats is only about RAM, so make it accurate.  This paves way for
statistics for all devices.

Note that even if this is part of qapi/, this should not be part of ABI of
at least query-migrate, because the structure is not changed, and this
stats is always reported only under the "ram" section.

Cc: Daniel P. Berrangé <berrange@redhat.com>
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Libvirt Mailing List <libvir-list@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 docs/about/removed-features.rst | 2 +-
 qapi/migration.json             | 8 ++++----
 migration/migration-stats.h     | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/docs/about/removed-features.rst b/docs/about/removed-features.rst
index 6f4447993c..7c4f4325f7 100644
--- a/docs/about/removed-features.rst
+++ b/docs/about/removed-features.rst
@@ -699,7 +699,7 @@ was superseded by ``sections``.
 ``query-migrate`` return value member ``skipped`` (removed in 9.1)
 ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
 
-Member ``skipped`` of the ``MigrationStats`` struct hasn't been used
+Member ``skipped`` of the ``MigrationRAMStats`` struct hasn't been used
 for more than 10 years. Removed with no replacement.
 
 ``migrate`` command option ``inc`` (removed in 9.1)
diff --git a/qapi/migration.json b/qapi/migration.json
index 7134d4ce47..cfc6ccee26 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -12,7 +12,7 @@
 { 'include': 'sockets.json' }
 
 ##
-# @MigrationStats:
+# @MigrationRAMStats:
 #
 # Detailed migration status.
 #
@@ -64,7 +64,7 @@
 #
 # Since: 0.14
 ##
-{ 'struct': 'MigrationStats',
+{ 'struct': 'MigrationRAMStats',
   'data': {'transferred': 'int', 'remaining': 'int', 'total': 'int' ,
            'duplicate': 'int',
            'normal': 'int',
@@ -209,7 +209,7 @@
 #     If this field is not returned, no migration process has been
 #     initiated
 #
-# @ram: `MigrationStats` containing detailed migration status, only
+# @ram: `MigrationRAMStats` containing detailed migration status, only
 #     returned if status is 'active' or 'completed'(since 1.2)
 #
 # @xbzrle-cache: `XBZRLECacheStats` containing detailed XBZRLE
@@ -309,7 +309,7 @@
 # Since: 0.14
 ##
 { 'struct': 'MigrationInfo',
-  'data': {'*status': 'MigrationStatus', '*ram': 'MigrationStats',
+  'data': {'*status': 'MigrationStatus', '*ram': 'MigrationRAMStats',
            '*vfio': 'VfioStats',
            '*xbzrle-cache': 'XBZRLECacheStats',
            '*total-time': 'int',
diff --git a/migration/migration-stats.h b/migration/migration-stats.h
index c0f50144c9..1153520f7a 100644
--- a/migration/migration-stats.h
+++ b/migration/migration-stats.h
@@ -27,7 +27,7 @@
 
 /*
  * These are the ram migration statistic counters.  It is loosely
- * based on MigrationStats.
+ * based on MigrationRAMStats.
  */
 typedef struct {
     /*
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH RFC 03/12] vfio/migration: Throttle vfio_save_block() on data size to read
  2026-03-19 23:12 [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
  2026-03-19 23:12 ` [PATCH RFC 01/12] migration: Fix low possibility downtime violation Peter Xu
  2026-03-19 23:12 ` [PATCH RFC 02/12] migration/qapi: Rename MigrationStats to MigrationRAMStats Peter Xu
@ 2026-03-19 23:12 ` Peter Xu
  2026-03-25 14:10   ` Avihai Horon
  2026-03-19 23:12 ` [PATCH RFC 04/12] vfio/migration: Cache stop size in VFIOMigration Peter Xu
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 28+ messages in thread
From: Peter Xu @ 2026-03-19 23:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, peterx, Zhiyi Guo,
	Markus Armbruster, Avihai Horon, Cédric Le Goater

During precopy phase, VFIO maintains two counters for init/dirty data
tracking for query estimations.

VFIO fetches data during precopy by reading from the VFIO fd, after
fetching it'll deduct the read size.

Here since the fd's size can dynamically change, I think it means VFIO may
read more than what it "thought" were there for fetching.

I highly suspect it's also relevant to a weird case in the function of
vfio_update_estimated_pending_data(), where when VFIO reads 0 from the FD
it will _reset_ the two counters, instead of asserting both of them being
zeros, which looks pretty hackish.

Just guarantee it from userspace level that VFIO won't read more than what
it expects for now.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/vfio/migration.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 83327b6573..851ea783f3 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -357,12 +357,18 @@ static int vfio_query_precopy_size(VFIOMigration *migration)
 }
 
 /* Returns the size of saved data on success and -errno on error */
-static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
+static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration,
+                               bool precopy)
 {
-    ssize_t data_size;
+    ssize_t data_size = migration->data_buffer_size;
+
+    if (precopy) {
+        /* Limit the buffer size to make sure cached stats don't overflow */
+        data_size = MIN(data_size, migration->precopy_init_size +
+                        migration->precopy_dirty_size);
+    }
 
-    data_size = read(migration->data_fd, migration->data_buffer,
-                     migration->data_buffer_size);
+    data_size = read(migration->data_fd, migration->data_buffer, data_size);
     if (data_size < 0) {
         /*
          * Pre-copy emptied all the device state for now. For more information,
@@ -623,7 +629,7 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
         migration->event_save_iterate_started = true;
     }
 
-    data_size = vfio_save_block(f, migration);
+    data_size = vfio_save_block(f, migration, true);
     if (data_size < 0) {
         return data_size;
     }
@@ -667,7 +673,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
     }
 
     do {
-        data_size = vfio_save_block(f, vbasedev->migration);
+        data_size = vfio_save_block(f, vbasedev->migration, false);
         if (data_size < 0) {
             return data_size;
         }
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH RFC 04/12] vfio/migration: Cache stop size in VFIOMigration
  2026-03-19 23:12 [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
                   ` (2 preceding siblings ...)
  2026-03-19 23:12 ` [PATCH RFC 03/12] vfio/migration: Throttle vfio_save_block() on data size to read Peter Xu
@ 2026-03-19 23:12 ` Peter Xu
  2026-03-25 14:15   ` Avihai Horon
  2026-03-19 23:12 ` [PATCH RFC 05/12] migration/treewide: Merge @state_pending_{exact|estimate} APIs Peter Xu
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 28+ messages in thread
From: Peter Xu @ 2026-03-19 23:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, peterx, Zhiyi Guo,
	Markus Armbruster, Avihai Horon, Cédric Le Goater

Add a field to cache stop size.  Note that there's an initial value change
in vfio_save_setup for the stop size default, but it shouldn't matter if it
is followed with a math of MIN() against VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/vfio/vfio-migration-internal.h |  1 +
 hw/vfio/migration.c               | 43 +++++++++++++++++--------------
 2 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/hw/vfio/vfio-migration-internal.h b/hw/vfio/vfio-migration-internal.h
index 814fbd9eba..08df32c055 100644
--- a/hw/vfio/vfio-migration-internal.h
+++ b/hw/vfio/vfio-migration-internal.h
@@ -47,6 +47,7 @@ typedef struct VFIOMigration {
     uint64_t mig_flags;
     uint64_t precopy_init_size;
     uint64_t precopy_dirty_size;
+    uint64_t stopcopy_size;
     bool multifd_transfer;
     VFIOMultifd *multifd;
     bool initial_data_sent;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 851ea783f3..827d3ded63 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -41,6 +41,12 @@
  */
 #define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
 
+/*
+ * Migration size of VFIO devices can be as little as a few KBs or as big as
+ * many GBs. This value should be big enough to cover the worst case.
+ */
+#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
+
 static unsigned long bytes_transferred;
 
 static const char *mig_state_to_str(enum vfio_device_mig_state state)
@@ -314,8 +320,7 @@ static void vfio_migration_cleanup(VFIODevice *vbasedev)
     migration->data_fd = -1;
 }
 
-static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
-                                     uint64_t *stop_copy_size)
+static int vfio_query_stop_copy_size(VFIODevice *vbasedev)
 {
     uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
                               sizeof(struct vfio_device_feature_mig_data_size),
@@ -323,16 +328,22 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
     struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
     struct vfio_device_feature_mig_data_size *mig_data_size =
         (struct vfio_device_feature_mig_data_size *)feature->data;
+    VFIOMigration *migration = vbasedev->migration;
 
     feature->argsz = sizeof(buf);
     feature->flags =
         VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIG_DATA_SIZE;
 
     if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
+        /*
+         * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE
+         * is reported so downtime limit won't be violated.
+         */
+        migration->stopcopy_size = VFIO_MIG_STOP_COPY_SIZE;
         return -errno;
     }
 
-    *stop_copy_size = mig_data_size->stop_copy_length;
+    migration->stopcopy_size = mig_data_size->stop_copy_length;
 
     return 0;
 }
@@ -415,6 +426,9 @@ static void vfio_update_estimated_pending_data(VFIOMigration *migration,
         return;
     }
 
+    /* The total size remaining requires separate accounting */
+    migration->stopcopy_size -= data_size;
+
     if (migration->precopy_init_size) {
         uint64_t init_size = MIN(migration->precopy_init_size, data_size);
 
@@ -469,7 +483,6 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
 {
     VFIODevice *vbasedev = opaque;
     VFIOMigration *migration = vbasedev->migration;
-    uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
     int ret;
 
     if (!vfio_multifd_setup(vbasedev, false, errp)) {
@@ -478,9 +491,9 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
 
     qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
 
-    vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
+    vfio_query_stop_copy_size(vbasedev);
     migration->data_buffer_size = MIN(VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE,
-                                      stop_copy_size);
+                                      migration->stopcopy_size);
     migration->data_buffer = g_try_malloc0(migration->data_buffer_size);
     if (!migration->data_buffer) {
         error_setg(errp, "%s: Failed to allocate migration data buffer",
@@ -576,32 +589,22 @@ static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
                                       migration->precopy_dirty_size);
 }
 
-/*
- * Migration size of VFIO devices can be as little as a few KBs or as big as
- * many GBs. This value should be big enough to cover the worst case.
- */
-#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
-
 static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
                                      uint64_t *can_postcopy)
 {
     VFIODevice *vbasedev = opaque;
     VFIOMigration *migration = vbasedev->migration;
-    uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
 
-    /*
-     * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is
-     * reported so downtime limit won't be violated.
-     */
-    vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
-    *must_precopy += stop_copy_size;
+    vfio_query_stop_copy_size(vbasedev);
+    *must_precopy += migration->stopcopy_size;
 
     if (vfio_device_state_is_precopy(vbasedev)) {
         vfio_query_precopy_size(migration);
     }
 
     trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
-                                   stop_copy_size, migration->precopy_init_size,
+                                   migration->stopcopy_size,
+                                   migration->precopy_init_size,
                                    migration->precopy_dirty_size);
 }
 
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH RFC 05/12] migration/treewide: Merge @state_pending_{exact|estimate} APIs
  2026-03-19 23:12 [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
                   ` (3 preceding siblings ...)
  2026-03-19 23:12 ` [PATCH RFC 04/12] vfio/migration: Cache stop size in VFIOMigration Peter Xu
@ 2026-03-19 23:12 ` Peter Xu
  2026-03-24 10:35   ` Prasad Pandit
  2026-03-25 15:20   ` Avihai Horon
  2026-03-19 23:12 ` [PATCH RFC 06/12] migration: Use the new save_query_pending() API directly Peter Xu
                   ` (6 subsequent siblings)
  11 siblings, 2 replies; 28+ messages in thread
From: Peter Xu @ 2026-03-19 23:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, peterx, Zhiyi Guo,
	Markus Armbruster, Avihai Horon, Cédric Le Goater,
	Halil Pasic, Christian Borntraeger, Jason Herne, Eric Farman,
	Matthew Rosato, Richard Henderson, Ilya Leoshkevich,
	David Hildenbrand, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow

These two APIs are a slight duplication.  For example, there're a few users
that directly pass in the same function.

It might also be slightly error prone to provide two hooks, so that it's
easier to happen that one module report different things via the two
hooks. In reality they should always report the same thing, only about
whether we should use a fast-path when the slow path might be too slow, and
even if we need to pay with some less accuracy.

Let's just merge it into one API, but instead provide a bool showing if the
query is a fast query or not.

No functional change intended.

Export qemu_savevm_query_pending().  We should likely directly use the new
API here provided when there're new users to do the query.  This will
happen very soon.

Cc: Halil Pasic <pasic@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Jason Herne <jjherne@linux.ibm.com>
Cc: Eric Farman <farman@linux.ibm.com>
Cc: Matthew Rosato <mjrosato@linux.ibm.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: Eric Blake <eblake@redhat.com>
Cc: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Cc: John Snow <jsnow@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 docs/devel/migration/main.rst  |  5 ++-
 docs/devel/migration/vfio.rst  |  9 ++----
 include/migration/register.h   | 52 ++++++++++--------------------
 migration/savevm.h             |  3 ++
 hw/s390x/s390-stattrib.c       |  8 ++---
 hw/vfio/migration.c            | 58 +++++++++++++++++++---------------
 migration/block-dirty-bitmap.c |  9 ++----
 migration/ram.c                | 32 ++++++-------------
 migration/savevm.c             | 43 ++++++++++++-------------
 hw/vfio/trace-events           |  3 +-
 10 files changed, 93 insertions(+), 129 deletions(-)

diff --git a/docs/devel/migration/main.rst b/docs/devel/migration/main.rst
index 234d280249..22c5910d5c 100644
--- a/docs/devel/migration/main.rst
+++ b/docs/devel/migration/main.rst
@@ -519,9 +519,8 @@ An iterative device must provide:
     data we must save.  The core migration code will use this to
     determine when to pause the CPUs and complete the migration.
 
-  - A ``state_pending_estimate`` function that indicates how much more
-    data we must save.  When the estimated amount is smaller than the
-    threshold, we call ``state_pending_exact``.
+  - A ``save_query_pending`` function that indicates how much more
+    data we must save.
 
   - A ``save_live_iterate`` function should send a chunk of data until
     the point that stream bandwidth limits tell it to stop.  Each call
diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
index 0790e5031d..33768c877c 100644
--- a/docs/devel/migration/vfio.rst
+++ b/docs/devel/migration/vfio.rst
@@ -50,13 +50,8 @@ VFIO implements the device hooks for the iterative approach as follows:
 * A ``load_setup`` function that sets the VFIO device on the destination in
   _RESUMING state.
 
-* A ``state_pending_estimate`` function that reports an estimate of the
-  remaining pre-copy data that the vendor driver has yet to save for the VFIO
-  device.
-
-* A ``state_pending_exact`` function that reads pending_bytes from the vendor
-  driver, which indicates the amount of data that the vendor driver has yet to
-  save for the VFIO device.
+* A ``save_query_pending`` function that reports the remaining pre-copy
+  data that the vendor driver has yet to save for the VFIO device.
 
 * An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
   active only when the VFIO device is in pre-copy states.
diff --git a/include/migration/register.h b/include/migration/register.h
index d0f37f5f43..2320c3a981 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -16,6 +16,15 @@
 
 #include "hw/core/vmstate-if.h"
 
+typedef struct MigPendingData {
+    /* How many bytes are pending for precopy / stopcopy? */
+    uint64_t precopy_bytes;
+    /* How many bytes are pending that can be transferred in postcopy? */
+    uint64_t postcopy_bytes;
+    /* Is this a fastpath query (which can be inaccurate)? */
+    bool fastpath;
+} MigPendingData ;
+
 /**
  * struct SaveVMHandlers: handler structure to finely control
  * migration of complex subsystems and devices, such as RAM, block and
@@ -197,46 +206,17 @@ typedef struct SaveVMHandlers {
     bool (*save_postcopy_prepare)(QEMUFile *f, void *opaque, Error **errp);
 
     /**
-     * @state_pending_estimate
-     *
-     * This estimates the remaining data to transfer
-     *
-     * Sum of @can_postcopy and @must_postcopy is the whole amount of
-     * pending data.
-     *
-     * @opaque: data pointer passed to register_savevm_live()
-     * @must_precopy: amount of data that must be migrated in precopy
-     *                or in stopped state, i.e. that must be migrated
-     *                before target start.
-     * @can_postcopy: amount of data that can be migrated in postcopy
-     *                or in stopped state, i.e. after target start.
-     *                Some can also be migrated during precopy (RAM).
-     *                Some must be migrated after source stops
-     *                (block-dirty-bitmap)
-     */
-    void (*state_pending_estimate)(void *opaque, uint64_t *must_precopy,
-                                   uint64_t *can_postcopy);
-
-    /**
-     * @state_pending_exact
-     *
-     * This calculates the exact remaining data to transfer
+     * @save_query_pending
      *
-     * Sum of @can_postcopy and @must_postcopy is the whole amount of
-     * pending data.
+     * This estimates the remaining data to transfer on the source side.
+     * It's highly suggested that the module should implement both fastpath
+     * and slowpath version of it when it can be slow (for more information
+     * please check pending->fastpath field).
      *
      * @opaque: data pointer passed to register_savevm_live()
-     * @must_precopy: amount of data that must be migrated in precopy
-     *                or in stopped state, i.e. that must be migrated
-     *                before target start.
-     * @can_postcopy: amount of data that can be migrated in postcopy
-     *                or in stopped state, i.e. after target start.
-     *                Some can also be migrated during precopy (RAM).
-     *                Some must be migrated after source stops
-     *                (block-dirty-bitmap)
+     * @pending: pointer to a MigPendingData struct
      */
-    void (*state_pending_exact)(void *opaque, uint64_t *must_precopy,
-                                uint64_t *can_postcopy);
+    void (*save_query_pending)(void *opaque, MigPendingData *pending);
 
     /**
      * @load_state
diff --git a/migration/savevm.h b/migration/savevm.h
index b3d1e8a13c..b116933bce 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -14,6 +14,8 @@
 #ifndef MIGRATION_SAVEVM_H
 #define MIGRATION_SAVEVM_H
 
+#include "migration/register.h"
+
 #define QEMU_VM_FILE_MAGIC           0x5145564d
 #define QEMU_VM_FILE_VERSION_COMPAT  0x00000002
 #define QEMU_VM_FILE_VERSION         0x00000003
@@ -43,6 +45,7 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
 void qemu_savevm_state_cleanup(void);
 void qemu_savevm_state_complete_postcopy(QEMUFile *f);
 int qemu_savevm_state_complete_precopy(MigrationState *s);
+void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath);
 void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
                                      uint64_t *can_postcopy);
 void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
diff --git a/hw/s390x/s390-stattrib.c b/hw/s390x/s390-stattrib.c
index d808ece3b9..b1ec51c77a 100644
--- a/hw/s390x/s390-stattrib.c
+++ b/hw/s390x/s390-stattrib.c
@@ -187,15 +187,14 @@ static int cmma_save_setup(QEMUFile *f, void *opaque, Error **errp)
     return 0;
 }
 
-static void cmma_state_pending(void *opaque, uint64_t *must_precopy,
-                               uint64_t *can_postcopy)
+static void cmma_state_pending(void *opaque, MigPendingData *pending)
 {
     S390StAttribState *sas = S390_STATTRIB(opaque);
     S390StAttribClass *sac = S390_STATTRIB_GET_CLASS(sas);
     long long res = sac->get_dirtycount(sas);
 
     if (res >= 0) {
-        *must_precopy += res;
+        pending->precopy_bytes += res;
     }
 }
 
@@ -340,8 +339,7 @@ static SaveVMHandlers savevm_s390_stattrib_handlers = {
     .save_setup = cmma_save_setup,
     .save_live_iterate = cmma_save_iterate,
     .save_complete = cmma_save_complete,
-    .state_pending_exact = cmma_state_pending,
-    .state_pending_estimate = cmma_state_pending,
+    .save_query_pending = cmma_state_pending,
     .save_cleanup = cmma_save_cleanup,
     .load_state = cmma_load,
     .is_active = cmma_active,
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 827d3ded63..c054c749b0 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -570,42 +570,51 @@ static void vfio_save_cleanup(void *opaque)
     trace_vfio_save_cleanup(vbasedev->name);
 }
 
-static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
-                                        uint64_t *can_postcopy)
+static void vfio_state_pending_sync(VFIODevice *vbasedev)
 {
-    VFIODevice *vbasedev = opaque;
     VFIOMigration *migration = vbasedev->migration;
 
-    if (!vfio_device_state_is_precopy(vbasedev)) {
-        return;
-    }
+    vfio_query_stop_copy_size(vbasedev);
 
-    *must_precopy +=
-        migration->precopy_init_size + migration->precopy_dirty_size;
+    if (vfio_device_state_is_precopy(vbasedev)) {
+        vfio_query_precopy_size(migration);
+    }
 
-    trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
-                                      *can_postcopy,
-                                      migration->precopy_init_size,
-                                      migration->precopy_dirty_size);
+    /*
+     * In all cases, all PRECOPY data should be no more than STOPCOPY data.
+     * Otherwise we have a problem.  So far, only dump some errors.
+     */
+    if (migration->precopy_init_size + migration->precopy_dirty_size <
+        migration->stopcopy_size) {
+        error_report_once("%s: wrong pending data (init=%" PRIx64
+                          ", dirty=%"PRIx64", stop=%"PRIx64")",
+                          __func__, migration->precopy_init_size,
+                          migration->precopy_dirty_size,
+                          migration->stopcopy_size);
+    }
 }
 
-static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
-                                     uint64_t *can_postcopy)
+static void vfio_state_pending(void *opaque, MigPendingData *pending)
 {
     VFIODevice *vbasedev = opaque;
     VFIOMigration *migration = vbasedev->migration;
+    uint64_t remain;
 
-    vfio_query_stop_copy_size(vbasedev);
-    *must_precopy += migration->stopcopy_size;
-
-    if (vfio_device_state_is_precopy(vbasedev)) {
-        vfio_query_precopy_size(migration);
+    if (pending->fastpath) {
+        if (!vfio_device_state_is_precopy(vbasedev)) {
+            return;
+        }
+        remain = migration->precopy_init_size + migration->precopy_dirty_size;
+    } else {
+        vfio_state_pending_sync(vbasedev);
+        remain = migration->stopcopy_size;
     }
 
-    trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
-                                   migration->stopcopy_size,
-                                   migration->precopy_init_size,
-                                   migration->precopy_dirty_size);
+    pending->precopy_bytes += remain;
+
+    trace_vfio_state_pending(vbasedev->name, migration->stopcopy_size,
+                             migration->precopy_init_size,
+                             migration->precopy_dirty_size);
 }
 
 static bool vfio_is_active_iterate(void *opaque)
@@ -850,8 +859,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
     .save_prepare = vfio_save_prepare,
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
-    .state_pending_estimate = vfio_state_pending_estimate,
-    .state_pending_exact = vfio_state_pending_exact,
+    .save_query_pending = vfio_state_pending,
     .is_active_iterate = vfio_is_active_iterate,
     .save_live_iterate = vfio_save_iterate,
     .save_complete = vfio_save_complete_precopy,
diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
index a061aad817..376a9b43ac 100644
--- a/migration/block-dirty-bitmap.c
+++ b/migration/block-dirty-bitmap.c
@@ -766,9 +766,7 @@ static int dirty_bitmap_save_complete(QEMUFile *f, void *opaque)
     return 0;
 }
 
-static void dirty_bitmap_state_pending(void *opaque,
-                                       uint64_t *must_precopy,
-                                       uint64_t *can_postcopy)
+static void dirty_bitmap_state_pending(void *opaque, MigPendingData *data)
 {
     DBMSaveState *s = &((DBMState *)opaque)->save;
     SaveBitmapState *dbms;
@@ -788,7 +786,7 @@ static void dirty_bitmap_state_pending(void *opaque,
 
     trace_dirty_bitmap_state_pending(pending);
 
-    *can_postcopy += pending;
+    data->postcopy_bytes += pending;
 }
 
 /* First occurrence of this bitmap. It should be created if doesn't exist */
@@ -1250,8 +1248,7 @@ static SaveVMHandlers savevm_dirty_bitmap_handlers = {
     .save_setup = dirty_bitmap_save_setup,
     .save_complete = dirty_bitmap_save_complete,
     .has_postcopy = dirty_bitmap_has_postcopy,
-    .state_pending_exact = dirty_bitmap_state_pending,
-    .state_pending_estimate = dirty_bitmap_state_pending,
+    .save_query_pending = dirty_bitmap_state_pending,
     .save_live_iterate = dirty_bitmap_save_iterate,
     .is_active_iterate = dirty_bitmap_is_active_iterate,
     .load_state = dirty_bitmap_load,
diff --git a/migration/ram.c b/migration/ram.c
index 979751f61b..89f761a471 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3443,30 +3443,17 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
     return qemu_fflush(f);
 }
 
-static void ram_state_pending_estimate(void *opaque, uint64_t *must_precopy,
-                                       uint64_t *can_postcopy)
-{
-    RAMState **temp = opaque;
-    RAMState *rs = *temp;
-
-    uint64_t remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
-
-    if (migrate_postcopy_ram()) {
-        /* We can do postcopy, and all the data is postcopiable */
-        *can_postcopy += remaining_size;
-    } else {
-        *must_precopy += remaining_size;
-    }
-}
-
-static void ram_state_pending_exact(void *opaque, uint64_t *must_precopy,
-                                    uint64_t *can_postcopy)
+static void ram_state_pending(void *opaque, MigPendingData *pending)
 {
     RAMState **temp = opaque;
     RAMState *rs = *temp;
     uint64_t remaining_size;
 
-    if (!migration_in_postcopy()) {
+    /*
+     * Sync is only needed either with: (1) a fast query, or (2) postcopy
+     * as started (in which case no new dirty will generate anymore).
+     */
+    if (!pending->fastpath && !migration_in_postcopy()) {
         bql_lock();
         WITH_RCU_READ_LOCK_GUARD() {
             migration_bitmap_sync_precopy(false);
@@ -3478,9 +3465,9 @@ static void ram_state_pending_exact(void *opaque, uint64_t *must_precopy,
 
     if (migrate_postcopy_ram()) {
         /* We can do postcopy, and all the data is postcopiable */
-        *can_postcopy += remaining_size;
+        pending->postcopy_bytes += remaining_size;
     } else {
-        *must_precopy += remaining_size;
+        pending->precopy_bytes += remaining_size;
     }
 }
 
@@ -4703,8 +4690,7 @@ static SaveVMHandlers savevm_ram_handlers = {
     .save_live_iterate = ram_save_iterate,
     .save_complete = ram_save_complete,
     .has_postcopy = ram_has_postcopy,
-    .state_pending_exact = ram_state_pending_exact,
-    .state_pending_estimate = ram_state_pending_estimate,
+    .save_query_pending = ram_state_pending,
     .load_state = ram_load,
     .save_cleanup = ram_save_cleanup,
     .load_setup = ram_load_setup,
diff --git a/migration/savevm.c b/migration/savevm.c
index dd58f2a705..6268e68382 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1762,46 +1762,45 @@ int qemu_savevm_state_complete_precopy(MigrationState *s)
     return qemu_fflush(f);
 }
 
-/* Give an estimate of the amount left to be transferred,
- * the result is split into the amount for units that can and
- * for units that can't do postcopy.
- */
-void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
-                                        uint64_t *can_postcopy)
+void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath)
 {
     SaveStateEntry *se;
 
-    *must_precopy = 0;
-    *can_postcopy = 0;
+    pending->precopy_bytes = 0;
+    pending->postcopy_bytes = 0;
+    pending->fastpath = fastpath;
 
     QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
-        if (!se->ops || !se->ops->state_pending_estimate) {
+        if (!se->ops || !se->ops->save_query_pending) {
             continue;
         }
         if (!qemu_savevm_state_active(se)) {
             continue;
         }
-        se->ops->state_pending_estimate(se->opaque, must_precopy, can_postcopy);
+        se->ops->save_query_pending(se->opaque, pending);
     }
 }
 
+void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
+                                        uint64_t *can_postcopy)
+{
+    MigPendingData pending;
+
+    qemu_savevm_query_pending(&pending, true);
+
+    *must_precopy = pending.precopy_bytes;
+    *can_postcopy = pending.postcopy_bytes;
+}
+
 void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
                                      uint64_t *can_postcopy)
 {
-    SaveStateEntry *se;
+    MigPendingData pending;
 
-    *must_precopy = 0;
-    *can_postcopy = 0;
+    qemu_savevm_query_pending(&pending, false);
 
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
-        if (!se->ops || !se->ops->state_pending_exact) {
-            continue;
-        }
-        if (!qemu_savevm_state_active(se)) {
-            continue;
-        }
-        se->ops->state_pending_exact(se->opaque, must_precopy, can_postcopy);
-    }
+    *must_precopy = pending.precopy_bytes;
+    *can_postcopy = pending.postcopy_bytes;
 }
 
 void qemu_savevm_state_cleanup(void)
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 846e3625c5..7cf5a9eb2d 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -173,8 +173,7 @@ vfio_save_device_config_state(const char *name) " (%s)"
 vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size %"PRIu64" precopy dirty size %"PRIu64
 vfio_save_iterate_start(const char *name) " (%s)"
 vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size %"PRIu64
-vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy %"PRIu64" postcopy %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64
-vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy %"PRIu64" postcopy %"PRIu64" stopcopy size %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64
+vfio_state_pending(const char *name, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) stopcopy size %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64
 vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
 vfio_vmstate_change_prepare(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
 
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH RFC 06/12] migration: Use the new save_query_pending() API directly
  2026-03-19 23:12 [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
                   ` (4 preceding siblings ...)
  2026-03-19 23:12 ` [PATCH RFC 05/12] migration/treewide: Merge @state_pending_{exact|estimate} APIs Peter Xu
@ 2026-03-19 23:12 ` Peter Xu
  2026-03-24  9:35   ` Prasad Pandit
  2026-03-19 23:12 ` [PATCH RFC 07/12] migration: Introduce stopcopy_bytes in save_query_pending() Peter Xu
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 28+ messages in thread
From: Peter Xu @ 2026-03-19 23:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, peterx, Zhiyi Guo,
	Markus Armbruster, Avihai Horon, Cédric Le Goater

It's easier to use the new API directly in the migration iterations.  This
also paves way for follow up patches to add new data to report directly to
the iterator function.

When at it, merge the tracepoints too into one.

No functional change intended.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/savevm.h     |  4 ----
 migration/migration.c  | 16 +++++++---------
 migration/savevm.c     | 23 ++---------------------
 migration/trace-events |  1 +
 4 files changed, 10 insertions(+), 34 deletions(-)

diff --git a/migration/savevm.h b/migration/savevm.h
index b116933bce..3c39190fa4 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -46,10 +46,6 @@ void qemu_savevm_state_cleanup(void);
 void qemu_savevm_state_complete_postcopy(QEMUFile *f);
 int qemu_savevm_state_complete_precopy(MigrationState *s);
 void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath);
-void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
-                                     uint64_t *can_postcopy);
-void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
-                                        uint64_t *can_postcopy);
 int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy);
 bool qemu_savevm_state_postcopy_prepare(QEMUFile *f, Error **errp);
 void qemu_savevm_state_end(QEMUFile *f);
diff --git a/migration/migration.c b/migration/migration.c
index dfc60372cf..99c4d09000 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3204,17 +3204,17 @@ typedef enum {
  */
 static MigIterateState migration_iteration_run(MigrationState *s)
 {
-    uint64_t must_precopy, can_postcopy, pending_size;
     Error *local_err = NULL;
     bool in_postcopy = (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE ||
                         s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
     bool can_switchover = migration_can_switchover(s);
+    MigPendingData pending = { };
+    uint64_t pending_size;
     bool complete_ready;
 
     /* Fast path - get the estimated amount of pending data */
-    qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy);
-    pending_size = must_precopy + can_postcopy;
-    trace_migrate_pending_estimate(pending_size, must_precopy, can_postcopy);
+    qemu_savevm_query_pending(&pending, true);
+    pending_size = pending.precopy_bytes + pending.postcopy_bytes;
 
     if (in_postcopy) {
         /*
@@ -3243,14 +3243,12 @@ static MigIterateState migration_iteration_run(MigrationState *s)
          * during postcopy phase.
          */
         if (pending_size <= s->threshold_size) {
-            qemu_savevm_state_pending_exact(&must_precopy, &can_postcopy);
-            pending_size = must_precopy + can_postcopy;
-            trace_migrate_pending_exact(pending_size, must_precopy,
-                                        can_postcopy);
+            qemu_savevm_query_pending(&pending, false);
+            pending_size = pending.precopy_bytes + pending.postcopy_bytes;
         }
 
         /* Should we switch to postcopy now? */
-        if (must_precopy <= s->threshold_size &&
+        if (pending.precopy_bytes <= s->threshold_size &&
             can_switchover && qatomic_read(&s->start_postcopy)) {
             if (postcopy_start(s, &local_err)) {
                 migrate_error_propagate(s, error_copy(local_err));
diff --git a/migration/savevm.c b/migration/savevm.c
index 6268e68382..b3285d480f 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1779,28 +1779,9 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath)
         }
         se->ops->save_query_pending(se->opaque, pending);
     }
-}
-
-void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
-                                        uint64_t *can_postcopy)
-{
-    MigPendingData pending;
-
-    qemu_savevm_query_pending(&pending, true);
-
-    *must_precopy = pending.precopy_bytes;
-    *can_postcopy = pending.postcopy_bytes;
-}
-
-void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
-                                     uint64_t *can_postcopy)
-{
-    MigPendingData pending;
-
-    qemu_savevm_query_pending(&pending, false);
 
-    *must_precopy = pending.precopy_bytes;
-    *can_postcopy = pending.postcopy_bytes;
+    trace_qemu_savevm_query_pending(fastpath, pending->precopy_bytes,
+                                    pending->postcopy_bytes);
 }
 
 void qemu_savevm_state_cleanup(void)
diff --git a/migration/trace-events b/migration/trace-events
index 60e5087e38..5f836a8652 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -7,6 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
 qemu_loadvm_state_post_main(int ret) "%d"
 qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
 qemu_savevm_send_packaged(void) ""
+qemu_savevm_query_pending(bool fast, uint64_t precopy, uint64_t postcopy) "fast=%d, precopy=%"PRIu64", postcopy=%"PRIu64
 loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
 loadvm_state_setup(void) ""
 loadvm_state_cleanup(void) ""
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH RFC 07/12] migration: Introduce stopcopy_bytes in save_query_pending()
  2026-03-19 23:12 [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
                   ` (5 preceding siblings ...)
  2026-03-19 23:12 ` [PATCH RFC 06/12] migration: Use the new save_query_pending() API directly Peter Xu
@ 2026-03-19 23:12 ` Peter Xu
  2026-03-24 11:05   ` Prasad Pandit
  2026-03-25 16:54   ` Avihai Horon
  2026-03-19 23:12 ` [PATCH RFC 08/12] vfio/migration: Fix incorrect reporting for VFIO pending data Peter Xu
                   ` (4 subsequent siblings)
  11 siblings, 2 replies; 28+ messages in thread
From: Peter Xu @ 2026-03-19 23:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, peterx, Zhiyi Guo,
	Markus Armbruster, Avihai Horon, Cédric Le Goater

Allow modules to report data that can only be migrated after VM is stopped.

When this concept is introduced, we will need to account stopcopy size to
be part of pending_size as before.

One thing to mention is, when there can be stopcopy size, it means the old
"pending_size" may not always be able to reach low enough to kickoff an
slow version of query sync.  While it used to be almost guaranteed to
happen because if we keep iterating, normally pending_size can go to zero
for precopy-only because we assume everything reported can be migrated in
precopy phase.

So we need to make sure QEMU will kickoff a synchronized version of query
pending when all precopy data is migrated too.  This might be important to
VFIO to keep making progress even if the downtime cannot yet be satisfied.

So far, this patch should introduce no functional change, as no module yet
report stopcopy size.

This will pave way for VFIO to properly report its pending data sizes,
which was actually buggy today.  Will be done in follow up patches.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/migration/register.h | 12 +++++++++
 migration/migration.c        | 52 ++++++++++++++++++++++++++++++------
 migration/savevm.c           |  7 +++--
 migration/trace-events       |  2 +-
 4 files changed, 62 insertions(+), 11 deletions(-)

diff --git a/include/migration/register.h b/include/migration/register.h
index 2320c3a981..3824958ba5 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -17,12 +17,24 @@
 #include "hw/core/vmstate-if.h"
 
 typedef struct MigPendingData {
+    /*
+     * Modules can only update these fields in a query request via its
+     * save_query_pending() API.
+     */
     /* How many bytes are pending for precopy / stopcopy? */
     uint64_t precopy_bytes;
     /* How many bytes are pending that can be transferred in postcopy? */
     uint64_t postcopy_bytes;
+    /* How many bytes that can only be transferred when VM stopped? */
+    uint64_t stopcopy_bytes;
+
+    /*
+     * Modules should never update these fields.
+     */
     /* Is this a fastpath query (which can be inaccurate)? */
     bool fastpath;
+    /* Total pending data */
+    uint64_t total_bytes;
 } MigPendingData ;
 
 /**
diff --git a/migration/migration.c b/migration/migration.c
index 99c4d09000..42facb16d1 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3198,6 +3198,44 @@ typedef enum {
     MIG_ITERATE_BREAK,          /* Break the loop */
 } MigIterateState;
 
+/* Are we ready to move to the next iteration phase? */
+static bool migration_iteration_next_ready(MigrationState *s,
+                                           MigPendingData *pending)
+{
+    /*
+     * If the estimated values already suggest us to switchover, mark this
+     * iteration finished, time to do a slow sync.
+     */
+    if (pending->total_bytes <= s->threshold_size) {
+        return true;
+    }
+
+    /*
+     * Since we may have modules reporting stop-only data, we also want to
+     * re-query with slow mode if all precopy data is moved over.  This
+     * will also mark the current iteration done.
+     *
+     * This could happen when e.g. a module (like, VFIO) reports stopcopy
+     * size too large so it will never yet satisfy the downtime with the
+     * current setup (above check).  Here, slow version of re-query helps
+     * because we keep trying the best to move whatever we have.
+     */
+    if (pending->precopy_bytes == 0) {
+        return true;
+    }
+
+    return false;
+}
+
+static void migration_iteration_go_next(MigPendingData *pending)
+{
+    /*
+     * Do a slow sync will achieve this.  TODO: move RAM iteration code
+     * into the core layer.
+     */
+    qemu_savevm_query_pending(pending, false);
+}
+
 /*
  * Return true if continue to the next iteration directly, false
  * otherwise.
@@ -3209,12 +3247,10 @@ static MigIterateState migration_iteration_run(MigrationState *s)
                         s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
     bool can_switchover = migration_can_switchover(s);
     MigPendingData pending = { };
-    uint64_t pending_size;
     bool complete_ready;
 
     /* Fast path - get the estimated amount of pending data */
     qemu_savevm_query_pending(&pending, true);
-    pending_size = pending.precopy_bytes + pending.postcopy_bytes;
 
     if (in_postcopy) {
         /*
@@ -3222,7 +3258,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
          * postcopy completion doesn't rely on can_switchover, because when
          * POSTCOPY_ACTIVE it means switchover already happened.
          */
-        complete_ready = !pending_size;
+        complete_ready = !pending.total_bytes;
         if (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE &&
             (s->postcopy_package_loaded || complete_ready)) {
             /*
@@ -3242,9 +3278,8 @@ static MigIterateState migration_iteration_run(MigrationState *s)
          * postcopy started, so ESTIMATE should always match with EXACT
          * during postcopy phase.
          */
-        if (pending_size <= s->threshold_size) {
-            qemu_savevm_query_pending(&pending, false);
-            pending_size = pending.precopy_bytes + pending.postcopy_bytes;
+        if (migration_iteration_next_ready(s, &pending)) {
+            migration_iteration_go_next(&pending);
         }
 
         /* Should we switch to postcopy now? */
@@ -3264,11 +3299,12 @@ static MigIterateState migration_iteration_run(MigrationState *s)
          * (2) Pending size is no more than the threshold specified
          *     (which was calculated from expected downtime)
          */
-        complete_ready = can_switchover && (pending_size <= s->threshold_size);
+        complete_ready = can_switchover &&
+            (pending.total_bytes <= s->threshold_size);
     }
 
     if (complete_ready) {
-        trace_migration_thread_low_pending(pending_size);
+        trace_migration_thread_low_pending(pending.total_bytes);
         migration_completion(s);
         return MIG_ITERATE_BREAK;
     }
diff --git a/migration/savevm.c b/migration/savevm.c
index b3285d480f..812c72b3e5 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1766,8 +1766,7 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath)
 {
     SaveStateEntry *se;
 
-    pending->precopy_bytes = 0;
-    pending->postcopy_bytes = 0;
+    memset(pending, 0, sizeof(*pending));
     pending->fastpath = fastpath;
 
     QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
@@ -1780,7 +1779,11 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath)
         se->ops->save_query_pending(se->opaque, pending);
     }
 
+    pending->total_bytes = pending->precopy_bytes +
+        pending->stopcopy_bytes + pending->postcopy_bytes;
+
     trace_qemu_savevm_query_pending(fastpath, pending->precopy_bytes,
+                                    pending->stopcopy_bytes,
                                     pending->postcopy_bytes);
 }
 
diff --git a/migration/trace-events b/migration/trace-events
index 5f836a8652..175f09f8ad 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -7,7 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
 qemu_loadvm_state_post_main(int ret) "%d"
 qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
 qemu_savevm_send_packaged(void) ""
-qemu_savevm_query_pending(bool fast, uint64_t precopy, uint64_t postcopy) "fast=%d, precopy=%"PRIu64", postcopy=%"PRIu64
+qemu_savevm_query_pending(bool fast, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy) "fast=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64
 loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
 loadvm_state_setup(void) ""
 loadvm_state_cleanup(void) ""
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH RFC 08/12] vfio/migration: Fix incorrect reporting for VFIO pending data
  2026-03-19 23:12 [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
                   ` (6 preceding siblings ...)
  2026-03-19 23:12 ` [PATCH RFC 07/12] migration: Introduce stopcopy_bytes in save_query_pending() Peter Xu
@ 2026-03-19 23:12 ` Peter Xu
  2026-03-25 17:32   ` Avihai Horon
  2026-03-19 23:12 ` [PATCH RFC 09/12] migration: Make iteration counter out of RAM Peter Xu
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 28+ messages in thread
From: Peter Xu @ 2026-03-19 23:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, peterx, Zhiyi Guo,
	Markus Armbruster, Avihai Horon, Cédric Le Goater

VFIO used to report different things in its fast/slow version of query
pending results.  It was likely because it wants to make sure precopy data
can reach 0 hence trigger sync queries.

Now with stopcopy size reporting facility it doesn't need this hack
anymore.  Fix this.

Copy stable might be too much; just skip it and skip the Fixes.

Cc: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/vfio/migration.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index c054c749b0..9dbe5ad9e9 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -591,6 +591,10 @@ static void vfio_state_pending_sync(VFIODevice *vbasedev)
                           __func__, migration->precopy_init_size,
                           migration->precopy_dirty_size,
                           migration->stopcopy_size);
+        migration->stopcopy_size = 0;
+    } else {
+        migration->stopcopy_size -=
+            (migration->precopy_init_size + migration->precopy_dirty_size);
     }
 }
 
@@ -598,19 +602,18 @@ static void vfio_state_pending(void *opaque, MigPendingData *pending)
 {
     VFIODevice *vbasedev = opaque;
     VFIOMigration *migration = vbasedev->migration;
-    uint64_t remain;
 
     if (pending->fastpath) {
         if (!vfio_device_state_is_precopy(vbasedev)) {
             return;
         }
-        remain = migration->precopy_init_size + migration->precopy_dirty_size;
     } else {
         vfio_state_pending_sync(vbasedev);
-        remain = migration->stopcopy_size;
     }
 
-    pending->precopy_bytes += remain;
+    pending->precopy_bytes +=
+        migration->precopy_init_size + migration->precopy_dirty_size;
+    pending->stopcopy_bytes += migration->stopcopy_size;
 
     trace_vfio_state_pending(vbasedev->name, migration->stopcopy_size,
                              migration->precopy_init_size,
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH RFC 09/12] migration: Make iteration counter out of RAM
  2026-03-19 23:12 [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
                   ` (7 preceding siblings ...)
  2026-03-19 23:12 ` [PATCH RFC 08/12] vfio/migration: Fix incorrect reporting for VFIO pending data Peter Xu
@ 2026-03-19 23:12 ` Peter Xu
  2026-03-20  6:12   ` Yong Huang
  2026-03-20  9:49   ` Prasad Pandit
  2026-03-19 23:13 ` [PATCH RFC 10/12] migration: Introduce a helper to return switchover bw estimate Peter Xu
                   ` (2 subsequent siblings)
  11 siblings, 2 replies; 28+ messages in thread
From: Peter Xu @ 2026-03-19 23:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, peterx, Zhiyi Guo,
	Markus Armbruster, Avihai Horon, Cédric Le Goater,
	Yong Huang

It used to hide in RAM dirty sync path.  Now with more modules being able
to slow sync on dirty information, keeping it there may not be good anymore
because it's not RAM's own concept for iterations: all modules should
follow.

More importantly, mgmt may try to query dirty info (to make policy
decisions like adjusting downtime) by listening to iteration count changes
via QMP events.  So we must make sure the boost of iterations only happens
_after_ the dirty sync operations with whatever form (RAM's dirty bitmap
sync, or VFIO's different ioctls to fetch latest dirty info from kernel).

Move this to core migration path to manage, together with the event
generation, so that it can be well ordered with the sync operations for all
modules.

This brings a good side effect that we should have an old issue regarding
to cpu_throttle_dirty_sync_timer_tick() which can randomly boost iteration
counts (because it invokes sync ops).  Now it won't, which is actually the
right behavior.

Said that, we have code (not only QEMU, but likely mgmt too) assuming the
1st iteration will always shows dirty count to 1.  Make it initialized with
1 this time, because we'll miss the dirty sync for setup() on boosting this
counter now.

Cc: Yong Huang <yong.huang@smartx.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration-stats.h |  3 ++-
 migration/migration.c       | 29 ++++++++++++++++++++++++++---
 migration/ram.c             |  6 ------
 3 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/migration/migration-stats.h b/migration/migration-stats.h
index 1153520f7a..326ddb0088 100644
--- a/migration/migration-stats.h
+++ b/migration/migration-stats.h
@@ -43,7 +43,8 @@ typedef struct {
      */
     uint64_t dirty_pages_rate;
     /*
-     * Number of times we have synchronized guest bitmaps.
+     * Number of times we have synchronized guest bitmaps.  This always
+     * starts from 1 for the 1st iteration.
      */
     uint64_t dirty_sync_count;
     /*
diff --git a/migration/migration.c b/migration/migration.c
index 42facb16d1..ad8a824585 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1654,10 +1654,15 @@ int migrate_init(MigrationState *s, Error **errp)
     s->threshold_size = 0;
     s->switchover_acked = false;
     s->rdma_migration = false;
+
     /*
-     * set mig_stats memory to zero for a new migration
+     * set mig_stats memory to zero for a new migration.. except the
+     * iteration counter, which we want to make sure it returns 1 for the
+     * first iteration.
      */
     memset(&mig_stats, 0, sizeof(mig_stats));
+    mig_stats.dirty_sync_count = 1;
+
     migration_reset_vfio_bytes_transferred();
 
     s->postcopy_package_loaded = false;
@@ -3230,10 +3235,28 @@ static bool migration_iteration_next_ready(MigrationState *s,
 static void migration_iteration_go_next(MigPendingData *pending)
 {
     /*
-     * Do a slow sync will achieve this.  TODO: move RAM iteration code
-     * into the core layer.
+     * Do a slow sync first before boosting the iteration count.
      */
     qemu_savevm_query_pending(pending, false);
+
+    /*
+     * Boost dirty sync count to reflect we finished one iteration.
+     *
+     * NOTE: we need to make sure when this happens (together with the
+     * event sent below) all modules have slow-synced the pending data
+     * above.  That means a write mem barrier, but qatomic_add() should be
+     * enough.
+     *
+     * It's because a mgmt could wait on the iteration event to query again
+     * on pending data for policy changes (e.g. downtime adjustments).  The
+     * ordering will make sure the query will fetch the latest results from
+     * all the modules.
+     */
+    qatomic_add(&mig_stats.dirty_sync_count, 1);
+
+    if (migrate_events()) {
+        qapi_event_send_migration_pass(mig_stats.dirty_sync_count);
+    }
 }
 
 /*
diff --git a/migration/ram.c b/migration/ram.c
index 89f761a471..29e9608715 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1136,8 +1136,6 @@ static void migration_bitmap_sync(RAMState *rs, bool last_stage)
     RAMBlock *block;
     int64_t end_time;
 
-    qatomic_add(&mig_stats.dirty_sync_count, 1);
-
     if (!rs->time_last_bitmap_sync) {
         rs->time_last_bitmap_sync = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
     }
@@ -1172,10 +1170,6 @@ static void migration_bitmap_sync(RAMState *rs, bool last_stage)
         rs->num_dirty_pages_period = 0;
         rs->bytes_xfer_prev = migration_transferred_bytes();
     }
-    if (migrate_events()) {
-        uint64_t generation = qatomic_read(&mig_stats.dirty_sync_count);
-        qapi_event_send_migration_pass(generation);
-    }
 }
 
 void migration_bitmap_sync_precopy(bool last_stage)
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH RFC 10/12] migration: Introduce a helper to return switchover bw estimate
  2026-03-19 23:12 [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
                   ` (8 preceding siblings ...)
  2026-03-19 23:12 ` [PATCH RFC 09/12] migration: Make iteration counter out of RAM Peter Xu
@ 2026-03-19 23:13 ` Peter Xu
  2026-03-23 10:26   ` Prasad Pandit
  2026-03-19 23:13 ` [PATCH RFC 11/12] migration: Calculate expected downtime on demand Peter Xu
  2026-03-19 23:13 ` [PATCH RFC 12/12] migration: Fix calculation of expected_downtime to take VFIO info Peter Xu
  11 siblings, 1 reply; 28+ messages in thread
From: Peter Xu @ 2026-03-19 23:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, peterx, Zhiyi Guo,
	Markus Armbruster, Avihai Horon, Cédric Le Goater

Add a helper to be able to return an estimate of switchover bw, called
migration_get_switchover_bw(). Use it to sligitly simplify the current
code.

This will be used in new codes later to remove expected_downtime.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.h |  1 +
 migration/migration.c | 43 ++++++++++++++++++++++---------------------
 2 files changed, 23 insertions(+), 21 deletions(-)

diff --git a/migration/migration.h b/migration/migration.h
index b6888daced..bf3ee6cc07 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -586,6 +586,7 @@ void migration_cancel(void);
 void migration_populate_vfio_info(MigrationInfo *info);
 void migration_reset_vfio_bytes_transferred(void);
 void postcopy_temp_page_reset(PostcopyTmpPage *tmp_page);
+int64_t migration_downtime_calc_expected(MigrationState *s);
 
 /*
  * Migration thread waiting for return path thread.  Return non-zero if an
diff --git a/migration/migration.c b/migration/migration.c
index ad8a824585..56d605ede9 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -984,6 +984,21 @@ void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t value)
     migrate_send_rp_message(mis, MIG_RP_MSG_RESUME_ACK, sizeof(buf), &buf);
 }
 
+/*
+ * Returns the estimated switchover bandwidth (unit: bytes / seconds)
+ */
+static double migration_get_switchover_bw(MigrationState *s)
+{
+    uint64_t switchover_bw = migrate_avail_switchover_bandwidth();
+
+    if (switchover_bw) {
+        /* If user specified, prioritize this value and don't estimate */
+        return (double)switchover_bw;
+    }
+
+    return s->mbps / 8 * 1000 * 1000;
+}
+
 bool migration_is_running(void)
 {
     MigrationState *s = current_migration;
@@ -3126,37 +3141,22 @@ static void migration_update_counters(MigrationState *s,
 {
     uint64_t transferred, transferred_pages, time_spent;
     uint64_t current_bytes; /* bytes transferred since the beginning */
-    uint64_t switchover_bw;
-    /* Expected bandwidth when switching over to destination QEMU */
-    double expected_bw_per_ms;
-    double bandwidth;
+    double switchover_bw;
 
     if (current_time < s->iteration_start_time + BUFFER_DELAY) {
         return;
     }
 
-    switchover_bw = migrate_avail_switchover_bandwidth();
     current_bytes = migration_transferred_bytes();
     transferred = current_bytes - s->iteration_initial_bytes;
     time_spent = current_time - s->iteration_start_time;
-    bandwidth = (double)transferred / time_spent;
-
-    if (switchover_bw) {
-        /*
-         * If the user specified a switchover bandwidth, let's trust the
-         * user so that can be more accurate than what we estimated.
-         */
-        expected_bw_per_ms = (double)switchover_bw / 1000;
-    } else {
-        /* If the user doesn't specify bandwidth, we use the estimated */
-        expected_bw_per_ms = bandwidth;
-    }
-
-    s->threshold_size = expected_bw_per_ms * migrate_downtime_limit();
-
     s->mbps = (((double) transferred * 8.0) /
                ((double) time_spent / 1000.0)) / 1000.0 / 1000.0;
 
+    /* NOTE: only update this after bandwidth (s->mbps) updated */
+    switchover_bw = migration_get_switchover_bw(s) / 1000;
+    s->threshold_size = switchover_bw * migrate_downtime_limit();
+
     transferred_pages = ram_get_total_transferred_pages() -
                             s->iteration_initial_pages;
     s->pages_per_second = (double) transferred_pages /
@@ -3178,7 +3178,8 @@ static void migration_update_counters(MigrationState *s,
 
     trace_migrate_transferred(transferred, time_spent,
                               /* Both in unit bytes/ms */
-                              bandwidth, switchover_bw / 1000,
+                              (uint64_t)s->mbps,
+                              (uint64_t)switchover_bw,
                               s->threshold_size);
 }
 
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH RFC 11/12] migration: Calculate expected downtime on demand
  2026-03-19 23:12 [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
                   ` (9 preceding siblings ...)
  2026-03-19 23:13 ` [PATCH RFC 10/12] migration: Introduce a helper to return switchover bw estimate Peter Xu
@ 2026-03-19 23:13 ` Peter Xu
  2026-03-19 23:13 ` [PATCH RFC 12/12] migration: Fix calculation of expected_downtime to take VFIO info Peter Xu
  11 siblings, 0 replies; 28+ messages in thread
From: Peter Xu @ 2026-03-19 23:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, peterx, Zhiyi Guo,
	Markus Armbruster, Avihai Horon, Cédric Le Goater

This value does not need to be calculated as frequent.  Only calculate it
on demand when query-migrate happened.  With that we can remove the
variable in MigrationState.

This paves way for fixing this value to include all modules (not only RAM
but others too).

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.h |  1 -
 migration/migration.c | 26 ++++++++++++--------------
 2 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/migration/migration.h b/migration/migration.h
index bf3ee6cc07..ba0f9e0f9c 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -359,7 +359,6 @@ struct MigrationState {
     /* Timestamp when VM is down (ms) to migrate the last stuff */
     int64_t downtime_start;
     int64_t downtime;
-    int64_t expected_downtime;
     bool capabilities[MIGRATION_CAPABILITY__MAX];
     int64_t setup_time;
 
diff --git a/migration/migration.c b/migration/migration.c
index 56d605ede9..23c78b3a2c 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1041,6 +1041,17 @@ static bool migrate_show_downtime(MigrationState *s)
     return (s->state == MIGRATION_STATUS_COMPLETED) || migration_in_postcopy();
 }
 
+/* Return expected downtime (unit: milliseconds) */
+int64_t migration_downtime_calc_expected(MigrationState *s)
+{
+    if (mig_stats.dirty_sync_count <= 1) {
+        return migrate_downtime_limit();
+    }
+
+    return 1.0 * mig_stats.dirty_bytes_last_sync /
+        migration_get_switchover_bw(s) * 1000;
+}
+
 static void populate_time_info(MigrationInfo *info, MigrationState *s)
 {
     info->has_status = true;
@@ -1061,7 +1072,7 @@ static void populate_time_info(MigrationInfo *info, MigrationState *s)
         info->downtime = s->downtime;
     } else {
         info->has_expected_downtime = true;
-        info->expected_downtime = s->expected_downtime;
+        info->expected_downtime = migration_downtime_calc_expected(s);
     }
 }
 
@@ -1649,7 +1660,6 @@ int migrate_init(MigrationState *s, Error **errp)
     s->mbps = 0.0;
     s->pages_per_second = 0.0;
     s->downtime = 0;
-    s->expected_downtime = 0;
     s->setup_time = 0;
     s->start_postcopy = false;
     s->migration_thread_running = false;
@@ -3162,16 +3172,6 @@ static void migration_update_counters(MigrationState *s,
     s->pages_per_second = (double) transferred_pages /
                              (((double) time_spent / 1000.0));
 
-    /*
-     * if we haven't sent anything, we don't want to
-     * recalculate. 10000 is a small enough number for our purposes
-     */
-    if (qatomic_read(&mig_stats.dirty_pages_rate) &&
-        transferred > 10000) {
-        s->expected_downtime =
-            qatomic_read(&mig_stats.dirty_bytes_last_sync) / expected_bw_per_ms;
-    }
-
     migration_rate_reset();
 
     update_iteration_initial_status(s);
@@ -3817,8 +3817,6 @@ void migration_start_outgoing(MigrationState *s)
     bool resume = (s->state == MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP);
     int ret;
 
-    s->expected_downtime = migrate_downtime_limit();
-
     if (resume) {
         /* This is a resumed migration */
         rate_limit = migrate_max_postcopy_bandwidth();
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH RFC 12/12] migration: Fix calculation of expected_downtime to take VFIO info
  2026-03-19 23:12 [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
                   ` (10 preceding siblings ...)
  2026-03-19 23:13 ` [PATCH RFC 11/12] migration: Calculate expected downtime on demand Peter Xu
@ 2026-03-19 23:13 ` Peter Xu
  2026-03-23 12:05   ` Prasad Pandit
  11 siblings, 1 reply; 28+ messages in thread
From: Peter Xu @ 2026-03-19 23:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, peterx, Zhiyi Guo,
	Markus Armbruster, Avihai Horon, Cédric Le Goater

QEMU will provide an expected downtime for the whole system during
migration, by remembering the total dirty RAM that we synced the last time,
divides the estimated switchover bandwidth.

That was flawed when VFIO is taking into account: consider there is a VFIO
GPU device that contains GBs of data to migrate during stop phase.  Those
will not be accounted in this math.

Fix it by updating dirty_bytes_last_sync properly only when we go to the
next iteration, rather than hide this update in the RAM code.  Meanwhile,
fetch the total (rather than RAM-only) portion of dirty bytes, so as to
include GPU device states too.

Update the comment of the field to reflect its new meaning.

Now after this change, the expected-downtime to be read from query-migrate
should be very accurate even with VFIO devices involved.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration-stats.h | 10 +++++-----
 migration/migration.c       | 11 ++++++++---
 migration/ram.c             |  1 -
 3 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/migration/migration-stats.h b/migration/migration-stats.h
index 326ddb0088..14b2773beb 100644
--- a/migration/migration-stats.h
+++ b/migration/migration-stats.h
@@ -31,11 +31,11 @@
  */
 typedef struct {
     /*
-     * Number of bytes that were dirty last time that we synced with
-     * the guest memory.  We use that to calculate the downtime.  As
-     * the remaining dirty amounts to what we know that is still dirty
-     * since last iteration, not counting what the guest has dirtied
-     * since we synchronized bitmaps.
+     * Number of bytes that are still dirty after the last whole-system
+     * sync on dirty information.  We use that to calculate the expected
+     * downtime.  As the remaining dirty amounts to what we know that is
+     * still dirty since last iteration, not counting what the guest has
+     * dirtied since then on either RAM or device states.
      */
     uint64_t dirty_bytes_last_sync;
     /*
diff --git a/migration/migration.c b/migration/migration.c
index 23c78b3a2c..1c00572d14 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3240,18 +3240,23 @@ static void migration_iteration_go_next(MigPendingData *pending)
      */
     qemu_savevm_query_pending(pending, false);
 
+    /*
+     * Update the dirty information for the whole system for this
+     * iteration.  This value is used to calculate expected downtime.
+     */
+    qatomic_set(&mig_stats.dirty_bytes_last_sync, pending->total_bytes);
+
     /*
      * Boost dirty sync count to reflect we finished one iteration.
      *
      * NOTE: we need to make sure when this happens (together with the
      * event sent below) all modules have slow-synced the pending data
-     * above.  That means a write mem barrier, but qatomic_add() should be
-     * enough.
+     * above and updated corresponding fields (e.g. dirty_bytes_last_sync).
      *
      * It's because a mgmt could wait on the iteration event to query again
      * on pending data for policy changes (e.g. downtime adjustments).  The
      * ordering will make sure the query will fetch the latest results from
-     * all the modules.
+     * all the modules on everything.
      */
     qatomic_add(&mig_stats.dirty_sync_count, 1);
 
diff --git a/migration/ram.c b/migration/ram.c
index 29e9608715..1bdf121d16 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1148,7 +1148,6 @@ static void migration_bitmap_sync(RAMState *rs, bool last_stage)
             RAMBLOCK_FOREACH_NOT_IGNORED(block) {
                 ramblock_sync_dirty_bitmap(rs, block);
             }
-            qatomic_set(&mig_stats.dirty_bytes_last_sync, ram_bytes_remaining());
         }
     }
 
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC 02/12] migration/qapi: Rename MigrationStats to MigrationRAMStats
  2026-03-19 23:12 ` [PATCH RFC 02/12] migration/qapi: Rename MigrationStats to MigrationRAMStats Peter Xu
@ 2026-03-19 23:26   ` Peter Xu
  2026-03-20  6:54   ` Markus Armbruster
  1 sibling, 0 replies; 28+ messages in thread
From: Peter Xu @ 2026-03-19 23:26 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
	Markus Armbruster, Avihai Horon, Cédric Le Goater,
	Libvirt Mailing List

On Thu, Mar 19, 2026 at 07:12:52PM -0400, Peter Xu wrote:
> This stats is only about RAM, so make it accurate.  This paves way for
> statistics for all devices.
> 
> Note that even if this is part of qapi/, this should not be part of ABI of
> at least query-migrate, because the structure is not changed, and this
> stats is always reported only under the "ram" section.
> 
> Cc: Daniel P. Berrangé <berrange@redhat.com>
> Cc: Markus Armbruster <armbru@redhat.com>
> Cc: Libvirt Mailing List <libvir-list@redhat.com>

I got the address wrong..  Will update when I repost.  Copying the correct
one.

> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  docs/about/removed-features.rst | 2 +-
>  qapi/migration.json             | 8 ++++----
>  migration/migration-stats.h     | 2 +-
>  3 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/docs/about/removed-features.rst b/docs/about/removed-features.rst
> index 6f4447993c..7c4f4325f7 100644
> --- a/docs/about/removed-features.rst
> +++ b/docs/about/removed-features.rst
> @@ -699,7 +699,7 @@ was superseded by ``sections``.
>  ``query-migrate`` return value member ``skipped`` (removed in 9.1)
>  ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
>  
> -Member ``skipped`` of the ``MigrationStats`` struct hasn't been used
> +Member ``skipped`` of the ``MigrationRAMStats`` struct hasn't been used
>  for more than 10 years. Removed with no replacement.
>  
>  ``migrate`` command option ``inc`` (removed in 9.1)
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 7134d4ce47..cfc6ccee26 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -12,7 +12,7 @@
>  { 'include': 'sockets.json' }
>  
>  ##
> -# @MigrationStats:
> +# @MigrationRAMStats:
>  #
>  # Detailed migration status.
>  #
> @@ -64,7 +64,7 @@
>  #
>  # Since: 0.14
>  ##
> -{ 'struct': 'MigrationStats',
> +{ 'struct': 'MigrationRAMStats',
>    'data': {'transferred': 'int', 'remaining': 'int', 'total': 'int' ,
>             'duplicate': 'int',
>             'normal': 'int',
> @@ -209,7 +209,7 @@
>  #     If this field is not returned, no migration process has been
>  #     initiated
>  #
> -# @ram: `MigrationStats` containing detailed migration status, only
> +# @ram: `MigrationRAMStats` containing detailed migration status, only
>  #     returned if status is 'active' or 'completed'(since 1.2)
>  #
>  # @xbzrle-cache: `XBZRLECacheStats` containing detailed XBZRLE
> @@ -309,7 +309,7 @@
>  # Since: 0.14
>  ##
>  { 'struct': 'MigrationInfo',
> -  'data': {'*status': 'MigrationStatus', '*ram': 'MigrationStats',
> +  'data': {'*status': 'MigrationStatus', '*ram': 'MigrationRAMStats',
>             '*vfio': 'VfioStats',
>             '*xbzrle-cache': 'XBZRLECacheStats',
>             '*total-time': 'int',
> diff --git a/migration/migration-stats.h b/migration/migration-stats.h
> index c0f50144c9..1153520f7a 100644
> --- a/migration/migration-stats.h
> +++ b/migration/migration-stats.h
> @@ -27,7 +27,7 @@
>  
>  /*
>   * These are the ram migration statistic counters.  It is loosely
> - * based on MigrationStats.
> + * based on MigrationRAMStats.
>   */
>  typedef struct {
>      /*
> -- 
> 2.50.1
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC 09/12] migration: Make iteration counter out of RAM
  2026-03-19 23:12 ` [PATCH RFC 09/12] migration: Make iteration counter out of RAM Peter Xu
@ 2026-03-20  6:12   ` Yong Huang
  2026-03-20  9:49   ` Prasad Pandit
  1 sibling, 0 replies; 28+ messages in thread
From: Yong Huang @ 2026-03-20  6:12 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
	Markus Armbruster, Avihai Horon, Cédric Le Goater

[-- Attachment #1: Type: text/plain, Size: 5298 bytes --]

Thanks,

Reviewed-by: Hyman Huang <yong.huang@smartx.com>

On Fri, Mar 20, 2026 at 7:13 AM Peter Xu <peterx@redhat.com> wrote:

> It used to hide in RAM dirty sync path.  Now with more modules being able
> to slow sync on dirty information, keeping it there may not be good anymore
> because it's not RAM's own concept for iterations: all modules should
> follow.
>
> More importantly, mgmt may try to query dirty info (to make policy
> decisions like adjusting downtime) by listening to iteration count changes
> via QMP events.  So we must make sure the boost of iterations only happens
> _after_ the dirty sync operations with whatever form (RAM's dirty bitmap
> sync, or VFIO's different ioctls to fetch latest dirty info from kernel).
>
> Move this to core migration path to manage, together with the event
> generation, so that it can be well ordered with the sync operations for all
> modules.
>
> This brings a good side effect that we should have an old issue regarding
> to cpu_throttle_dirty_sync_timer_tick() which can randomly boost iteration
> counts (because it invokes sync ops).  Now it won't, which is actually the
> right behavior.
>
> Said that, we have code (not only QEMU, but likely mgmt too) assuming the
> 1st iteration will always shows dirty count to 1.  Make it initialized with
> 1 this time, because we'll miss the dirty sync for setup() on boosting this
> counter now.
>
> Cc: Yong Huang <yong.huang@smartx.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/migration-stats.h |  3 ++-
>  migration/migration.c       | 29 ++++++++++++++++++++++++++---
>  migration/ram.c             |  6 ------
>  3 files changed, 28 insertions(+), 10 deletions(-)
>
> diff --git a/migration/migration-stats.h b/migration/migration-stats.h
> index 1153520f7a..326ddb0088 100644
> --- a/migration/migration-stats.h
> +++ b/migration/migration-stats.h
> @@ -43,7 +43,8 @@ typedef struct {
>       */
>      uint64_t dirty_pages_rate;
>      /*
> -     * Number of times we have synchronized guest bitmaps.
> +     * Number of times we have synchronized guest bitmaps.  This always
> +     * starts from 1 for the 1st iteration.
>       */
>      uint64_t dirty_sync_count;
>      /*
> diff --git a/migration/migration.c b/migration/migration.c
> index 42facb16d1..ad8a824585 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1654,10 +1654,15 @@ int migrate_init(MigrationState *s, Error **errp)
>      s->threshold_size = 0;
>      s->switchover_acked = false;
>      s->rdma_migration = false;
> +
>      /*
> -     * set mig_stats memory to zero for a new migration
> +     * set mig_stats memory to zero for a new migration.. except the
> +     * iteration counter, which we want to make sure it returns 1 for the
> +     * first iteration.
>       */
>      memset(&mig_stats, 0, sizeof(mig_stats));
> +    mig_stats.dirty_sync_count = 1;
> +
>      migration_reset_vfio_bytes_transferred();
>
>      s->postcopy_package_loaded = false;
> @@ -3230,10 +3235,28 @@ static bool
> migration_iteration_next_ready(MigrationState *s,
>  static void migration_iteration_go_next(MigPendingData *pending)
>  {
>      /*
> -     * Do a slow sync will achieve this.  TODO: move RAM iteration code
> -     * into the core layer.
> +     * Do a slow sync first before boosting the iteration count.
>       */
>      qemu_savevm_query_pending(pending, false);
> +
> +    /*
> +     * Boost dirty sync count to reflect we finished one iteration.
> +     *
> +     * NOTE: we need to make sure when this happens (together with the
> +     * event sent below) all modules have slow-synced the pending data
> +     * above.  That means a write mem barrier, but qatomic_add() should be
> +     * enough.
> +     *
> +     * It's because a mgmt could wait on the iteration event to query
> again
> +     * on pending data for policy changes (e.g. downtime adjustments).
> The
> +     * ordering will make sure the query will fetch the latest results
> from
> +     * all the modules.
> +     */
> +    qatomic_add(&mig_stats.dirty_sync_count, 1);
> +
> +    if (migrate_events()) {
> +        qapi_event_send_migration_pass(mig_stats.dirty_sync_count);
> +    }
>  }
>
>  /*
> diff --git a/migration/ram.c b/migration/ram.c
> index 89f761a471..29e9608715 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1136,8 +1136,6 @@ static void migration_bitmap_sync(RAMState *rs, bool
> last_stage)
>      RAMBlock *block;
>      int64_t end_time;
>
> -    qatomic_add(&mig_stats.dirty_sync_count, 1);
> -
>      if (!rs->time_last_bitmap_sync) {
>          rs->time_last_bitmap_sync =
> qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>      }
> @@ -1172,10 +1170,6 @@ static void migration_bitmap_sync(RAMState *rs,
> bool last_stage)
>          rs->num_dirty_pages_period = 0;
>          rs->bytes_xfer_prev = migration_transferred_bytes();
>      }
> -    if (migrate_events()) {
> -        uint64_t generation = qatomic_read(&mig_stats.dirty_sync_count);
> -        qapi_event_send_migration_pass(generation);
> -    }
>  }
>
>  void migration_bitmap_sync_precopy(bool last_stage)
> --
> 2.50.1
>
>

-- 
Best regards

[-- Attachment #2: Type: text/html, Size: 6771 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC 02/12] migration/qapi: Rename MigrationStats to MigrationRAMStats
  2026-03-19 23:12 ` [PATCH RFC 02/12] migration/qapi: Rename MigrationStats to MigrationRAMStats Peter Xu
  2026-03-19 23:26   ` Peter Xu
@ 2026-03-20  6:54   ` Markus Armbruster
  1 sibling, 0 replies; 28+ messages in thread
From: Markus Armbruster @ 2026-03-20  6:54 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
	Avihai Horon, Cédric Le Goater, devel

Cc: correct libvirt list address

Peter Xu <peterx@redhat.com> writes:

> This stats is only about RAM, so make it accurate.  This paves way for
> statistics for all devices.
>
> Note that even if this is part of qapi/, this should not be part of ABI of

docs/devel/qapi-code-gen.rst section "Compatibility considerations":

    Since type names are not visible in the Client JSON Protocol, types
    may be freely renamed.  Even certain refactorings are invisible, such
    as splitting members from one type into a common base type.

So, s/should not be/is not/.

> at least query-migrate, because the structure is not changed, and this
> stats is always reported only under the "ram" section.

Why "at least query-migrate"?  Do you have other interfaces in mind, or
are you just hedging?

> Cc: Daniel P. Berrangé <berrange@redhat.com>
> Cc: Markus Armbruster <armbru@redhat.com>
> Cc: Libvirt Mailing List <libvir-list@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  docs/about/removed-features.rst | 2 +-
>  qapi/migration.json             | 8 ++++----
>  migration/migration-stats.h     | 2 +-
>  3 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/docs/about/removed-features.rst b/docs/about/removed-features.rst
> index 6f4447993c..7c4f4325f7 100644
> --- a/docs/about/removed-features.rst
> +++ b/docs/about/removed-features.rst
> @@ -699,7 +699,7 @@ was superseded by ``sections``.
>  ``query-migrate`` return value member ``skipped`` (removed in 9.1)
>  ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
>  
> -Member ``skipped`` of the ``MigrationStats`` struct hasn't been used
> +Member ``skipped`` of the ``MigrationRAMStats`` struct hasn't been used
>  for more than 10 years. Removed with no replacement.

Could simply use "of the return value".  But this is fine, too.

>  
>  ``migrate`` command option ``inc`` (removed in 9.1)
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 7134d4ce47..cfc6ccee26 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -12,7 +12,7 @@
>  { 'include': 'sockets.json' }
>  
>  ##
> -# @MigrationStats:
> +# @MigrationRAMStats:
>  #
>  # Detailed migration status.
>  #
> @@ -64,7 +64,7 @@
>  #
>  # Since: 0.14
>  ##
> -{ 'struct': 'MigrationStats',
> +{ 'struct': 'MigrationRAMStats',
>    'data': {'transferred': 'int', 'remaining': 'int', 'total': 'int' ,
>             'duplicate': 'int',
>             'normal': 'int',
> @@ -209,7 +209,7 @@
>  #     If this field is not returned, no migration process has been
>  #     initiated
>  #
> -# @ram: `MigrationStats` containing detailed migration status, only
> +# @ram: `MigrationRAMStats` containing detailed migration status, only
>  #     returned if status is 'active' or 'completed'(since 1.2)

This gets rendered like

      * ram (MigrationStats, *optional*) -- MigrationStats
        containing detailed migration status, only returned if status
        is 'active' or 'completed'(since 1.2)

Recommend to rephrase to avoid the repetition.  Maybe

   # @ram: Detailed migration RAM status, only returned if status is
   #     'active' or 'completed' (since 1.2)

Also consider replacing "status" by "statistics".

>  #
>  # @xbzrle-cache: `XBZRLECacheStats` containing detailed XBZRLE
> @@ -309,7 +309,7 @@
>  # Since: 0.14
>  ##
>  { 'struct': 'MigrationInfo',
> -  'data': {'*status': 'MigrationStatus', '*ram': 'MigrationStats',
> +  'data': {'*status': 'MigrationStatus', '*ram': 'MigrationRAMStats',
>             '*vfio': 'VfioStats',
>             '*xbzrle-cache': 'XBZRLECacheStats',
>             '*total-time': 'int',
> diff --git a/migration/migration-stats.h b/migration/migration-stats.h
> index c0f50144c9..1153520f7a 100644
> --- a/migration/migration-stats.h
> +++ b/migration/migration-stats.h
> @@ -27,7 +27,7 @@
>  
>  /*
>   * These are the ram migration statistic counters.  It is loosely
> - * based on MigrationStats.
> + * based on MigrationRAMStats.
>   */
>  typedef struct {
>      /*

Only minor issues, and none added by this patch, so
Acked-by: Markus Armbruster <armbru@redhat.com>



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC 09/12] migration: Make iteration counter out of RAM
  2026-03-19 23:12 ` [PATCH RFC 09/12] migration: Make iteration counter out of RAM Peter Xu
  2026-03-20  6:12   ` Yong Huang
@ 2026-03-20  9:49   ` Prasad Pandit
  1 sibling, 0 replies; 28+ messages in thread
From: Prasad Pandit @ 2026-03-20  9:49 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
	Markus Armbruster, Avihai Horon, Cédric Le Goater,
	Yong Huang

On Fri, 20 Mar 2026 at 04:44, Peter Xu <peterx@redhat.com> wrote:
> It used to hide in RAM dirty sync path.  Now with more modules being able
> to slow sync on dirty information, keeping it there may not be good anymore
> because it's not RAM's own concept for iterations: all modules should
> follow.
>
> More importantly, mgmt may try to query dirty info (to make policy
> decisions like adjusting downtime) by listening to iteration count changes
> via QMP events.  So we must make sure the boost of iterations only happens
> _after_ the dirty sync operations with whatever form (RAM's dirty bitmap
> sync, or VFIO's different ioctls to fetch latest dirty info from kernel).
>
> Move this to core migration path to manage, together with the event
> generation, so that it can be well ordered with the sync operations for all
> modules.
>
> This brings a good side effect that we should have an old issue regarding
> to cpu_throttle_dirty_sync_timer_tick() which can randomly boost iteration
> counts (because it invokes sync ops).  Now it won't, which is actually the
> right behavior.
>
> Said that, we have code (not only QEMU, but likely mgmt too) assuming the
> 1st iteration will always shows dirty count to 1.

* Where do we make this assumption? I mostly see 'dirty_sync_count'
read/used as is, only cpu_throttle_dirty_sync_timer_tick() seems to
skip one *_bitmap_sync_precopy() call when sync_cnt <= 1. This'd works
for zero(0) as well.

> Cc: Yong Huang <yong.huang@smartx.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/migration-stats.h |  3 ++-
>  migration/migration.c       | 29 ++++++++++++++++++++++++++---
>  migration/ram.c             |  6 ------
>  3 files changed, 28 insertions(+), 10 deletions(-)
>
> diff --git a/migration/migration-stats.h b/migration/migration-stats.h
> index 1153520f7a..326ddb0088 100644
> --- a/migration/migration-stats.h
> +++ b/migration/migration-stats.h
> @@ -43,7 +43,8 @@ typedef struct {
>       */
>      uint64_t dirty_pages_rate;
>      /*
> -     * Number of times we have synchronized guest bitmaps.
> +     * Number of times we have synchronized guest bitmaps.  This always
> +     * starts from 1 for the 1st iteration.
>       */
>      uint64_t dirty_sync_count;
>      /*
> diff --git a/migration/migration.c b/migration/migration.c
> index 42facb16d1..ad8a824585 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1654,10 +1654,15 @@ int migrate_init(MigrationState *s, Error **errp)
>      s->threshold_size = 0;
>      s->switchover_acked = false;
>      s->rdma_migration = false;
> +
>      /*
> -     * set mig_stats memory to zero for a new migration
> +     * set mig_stats memory to zero for a new migration.. except the
> +     * iteration counter, which we want to make sure it returns 1 for the
> +     * first iteration.
>       */
>      memset(&mig_stats, 0, sizeof(mig_stats));
> +    mig_stats.dirty_sync_count = 1;
> +
>      migration_reset_vfio_bytes_transferred();
>
>      s->postcopy_package_loaded = false;
> @@ -3230,10 +3235,28 @@ static bool migration_iteration_next_ready(MigrationState *s,
>  static void migration_iteration_go_next(MigPendingData *pending)
>  {
>      /*
> -     * Do a slow sync will achieve this.  TODO: move RAM iteration code
> -     * into the core layer.
> +     * Do a slow sync first before boosting the iteration count.
>       */
>      qemu_savevm_query_pending(pending, false);
> +
> +    /*
> +     * Boost dirty sync count to reflect we finished one iteration.
> +     *
> +     * NOTE: we need to make sure when this happens (together with the
> +     * event sent below) all modules have slow-synced the pending data
> +     * above.  That means a write mem barrier, but qatomic_add() should be
> +     * enough.
> +     *
> +     * It's because a mgmt could wait on the iteration event to query again
> +     * on pending data for policy changes (e.g. downtime adjustments).  The
> +     * ordering will make sure the query will fetch the latest results from
> +     * all the modules.
> +     */
> +    qatomic_add(&mig_stats.dirty_sync_count, 1);
> +
> +    if (migrate_events()) {
> +        qapi_event_send_migration_pass(mig_stats.dirty_sync_count);
> +    }
>  }
>
>  /*
> diff --git a/migration/ram.c b/migration/ram.c
> index 89f761a471..29e9608715 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1136,8 +1136,6 @@ static void migration_bitmap_sync(RAMState *rs, bool last_stage)
>      RAMBlock *block;
>      int64_t end_time;
>
> -    qatomic_add(&mig_stats.dirty_sync_count, 1);
> -
>      if (!rs->time_last_bitmap_sync) {
>          rs->time_last_bitmap_sync = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>      }
> @@ -1172,10 +1170,6 @@ static void migration_bitmap_sync(RAMState *rs, bool last_stage)
>          rs->num_dirty_pages_period = 0;
>          rs->bytes_xfer_prev = migration_transferred_bytes();
>      }
> -    if (migrate_events()) {
> -        uint64_t generation = qatomic_read(&mig_stats.dirty_sync_count);
> -        qapi_event_send_migration_pass(generation);
> -    }
>  }
>
>  void migration_bitmap_sync_precopy(bool last_stage)
> --

* Change looks okay. Setting dirty_sync_count = 1 seems like special
casing, we need not do it if it's not required.

Reviewed-by: Prasad Pandit <pjp@fedoraproject.org>

Thank you.
---
  - Prasad



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC 01/12] migration: Fix low possibility downtime violation
  2026-03-19 23:12 ` [PATCH RFC 01/12] migration: Fix low possibility downtime violation Peter Xu
@ 2026-03-20 12:26   ` Prasad Pandit
  0 siblings, 0 replies; 28+ messages in thread
From: Prasad Pandit @ 2026-03-20 12:26 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
	Markus Armbruster, Avihai Horon, Cédric Le Goater,
	qemu-stable

On Fri, 20 Mar 2026 at 04:46, Peter Xu <peterx@redhat.com> wrote:
> When QEMU queried the estimated version of pending data and thinks it's
> ready to converge, it'll send another accurate query to make sure of it.
> It is needed to make sure we collect the latest reports and that equation
> still holds true.
>
> However we missed one tiny little difference here on "<" v.s. "<=" when
> comparing pending_size (A) to threshold_size (B)..
>
> QEMU src only re-query if A<B, but will kickoff switchover if A<=B.
>
> I think it means it is possible to happen if A (as an estimate only so far)
> accidentally equals to B, then re-query won't happen and switchover will
> proceed without considering new dirtied data.
>
> It turns out it was an accident in my commit 7aaa1fc072 when refactoring
> the code around.  Fix this by using the same equation in both places.
>
> Fixes: 7aaa1fc072 ("migration: Rewrite the migration complete detect logic")
> Cc: qemu-stable@nongnu.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/migration.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 5c9aaa6e58..dfc60372cf 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3242,7 +3242,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>           * postcopy started, so ESTIMATE should always match with EXACT
>           * during postcopy phase.
>           */
> -        if (pending_size < s->threshold_size) {
> +        if (pending_size <= s->threshold_size) {
>              qemu_savevm_state_pending_exact(&must_precopy, &can_postcopy);
>              pending_size = must_precopy + can_postcopy;
>              trace_migrate_pending_exact(pending_size, must_precopy,

* What is the 'size' difference between < s->threshold_size  Vs  <=
s->threshold_size?  Going through the source IIUC
1) 'pending_size' is measured in Bytes.
     static void ram_state_pending_exact/_estimate()
         remaining_size = rs->migration_dirty_pages *
TARGET_PAGE_SIZE(=4096 bytes);
         100 dirty pages * 4096bytes  => 409600 dirty bytes => 409600
* 8 => 3,276,800 dirty bits

2) 's->threshold_size' is derived from bandwidth (100M bits/s) and
downtime(=300 ms)
        100,000,000 bits/s => 100,000 bits/ms
        100,000 bits/ms * 300ms => 30,000,000 bits in 300 ms
        30,000,000 bits / 8  =>  3,750,000 Bytes / 300 ms
        s->threshold_size = 30,000,000 bits (= 3.75MBytes) can be
transferred in 300ms downtime.

* Are we comparing pending_size(=409600 bytes)  <=
s->threshold_size(=30,000,000 bits)?

*  static void migration_update_counters()
        transferred = current_bytes - s->iteration_initial_bytes;
        bandwidth = (double)transferred / time_spent
        if (switchover_bw) {
            expected_bw_per_ms = (double)switchover_bw / 1000;
        } else {
            expected_bw_per_ms = bandwidth;
        }
=> ^^^^^^^  Should we divide 'bandwidth' by 1000 here (for bw_per_ms) ?

      s->threshold_size = expected_bw_per_ms * migrate_downtime_limit();

migration_iteration_run():
   /* Should we switch to postcopy now? */
   if (must_precopy <= s->threshold_size &&
      can_switchover && qatomic_read(&s->start_postcopy)) {
      if (postcopy_start(s, &local_err)) {
          migrate_error_propagate(s, error_copy(local_err));
          error_report_err(local_err);
      }
      return MIG_ITERATE_SKIP;
   }
* Here we should check pending_size <= s->threshold_size,  because
must_precopy is zero(0) when postcopy is enabled. And we switch to
postcopy mode even when pending_size > s->threshold_size.
  I wonder if we really need both 'must_precopy' and 'can_postcopy'
variables, they seem to complicate things.
===
# virsh migrate --verbose --live --auto-converge --postcopy
--postcopy-after-precopy f42vm
qemu+ssh://destination-machine.com/system
# less /var/log/libvirt/qemu/f42vm.log
...
migration_iteration_run: estimated pending_size: 50577408 bytes,
s->threshold_size: 36282361
migration_iteration_run: estimated pending_size: 43757568 bytes,
s->threshold_size: 36282361
migration_iteration_run: estimated pending_size: 36413440 bytes,
s->threshold_size: 34334680
migration_iteration_run: estimated pending_size: 29069312 bytes,
s->threshold_size: 34334680

migration_iteration_run: exact pending_size: 4339167232 bytes, 0,
4339167232              <== exact size is calculated once.
migration_iteration_run: estimated pending_size: 4332871680 bytes,
s->threshold_size: 35651363
migration_iteration_run: switching to postcopy: 4332871680, 0,
4332871680                    <== switch to postcopy with
must_precopy(=0) <= s->threshold_size

migration_iteration_run: estimated pending_size: 4332892160 bytes,
s->threshold_size: 35651363
migration_iteration_run: estimated pending_size: 4323188736 bytes,
s->threshold_size: 27243109
migration_iteration_run: estimated pending_size: 4315320320 bytes,
s->threshold_size: 27243109
migration_iteration_run: estimated pending_size: 4308221952 bytes,
s->threshold_size: 37695433
===
* Here, the exact pending_size is calculated only once, because we
switch to Postcopy mode even when pending_size is > s->threshold_size.

Thank you.
---
  - Prasad



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC 10/12] migration: Introduce a helper to return switchover bw estimate
  2026-03-19 23:13 ` [PATCH RFC 10/12] migration: Introduce a helper to return switchover bw estimate Peter Xu
@ 2026-03-23 10:26   ` Prasad Pandit
  0 siblings, 0 replies; 28+ messages in thread
From: Prasad Pandit @ 2026-03-23 10:26 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
	Markus Armbruster, Avihai Horon, Cédric Le Goater

On Fri, 20 Mar 2026 at 04:45, Peter Xu <peterx@redhat.com> wrote:
> Add a helper to be able to return an estimate of switchover bw, called

* Add a helper to return ...
* bw -> bandwidth

> migration_get_switchover_bw(). Use it to sligitly simplify the current
> code.

* sligitly -> slightly ;  (Better still: use it to simplify ...)

* This simplification is nice and required.

> This will be used in new codes later to remove expected_downtime.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/migration.h |  1 +
>  migration/migration.c | 43 ++++++++++++++++++++++---------------------
>  2 files changed, 23 insertions(+), 21 deletions(-)
>
> diff --git a/migration/migration.h b/migration/migration.h
> index b6888daced..bf3ee6cc07 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -586,6 +586,7 @@ void migration_cancel(void);
>  void migration_populate_vfio_info(MigrationInfo *info);
>  void migration_reset_vfio_bytes_transferred(void);
>  void postcopy_temp_page_reset(PostcopyTmpPage *tmp_page);
> +int64_t migration_downtime_calc_expected(MigrationState *s);

* Let's return unsigned int64 -> uint64_t  here? Downtime should be
positive OR does it return a negative error code?

* This function is not used in this patch, it is unrelated that way.
Let's add it in the patch where it is used?

> --- a/migration/migration.c
> +++ b/migration/migration.c
> +/*
> + * Returns the estimated switchover bandwidth (unit: bytes / seconds)
> + */
> +static double migration_get_switchover_bw(MigrationState *s)
> +{
> +    uint64_t switchover_bw = migrate_avail_switchover_bandwidth();
> +
> +    if (switchover_bw) {
> +        /* If user specified, prioritize this value and don't estimate */
> +        return (double)switchover_bw;
> +    }
> +
> +    return s->mbps / 8 * 1000 * 1000;
> +}
> +

* 100 mbps will return => 100 / 8 * 1000 * 1000 => 12,500,000 bytes/sec

* When user specifies bandwidth with
        virsh migrate --available-switchover-bandwidth <number>
bandwidth (in MiB/s) available for the final phase of migration

* Asking users to specify such large number for bytes/sec is not user
friendly. 100mbps OR 156mbps OR 400mbps would have been better, then
above function could do

     if (!switchover_bw) {
         switchover_bw = s->mbps;
     }
     return switchover_bw / 8 * 1000 * 1000

>  bool migration_is_running(void)
>  {
>      MigrationState *s = current_migration;
> @@ -3126,37 +3141,22 @@ static void migration_update_counters(MigrationState *s,
>  {
>      uint64_t transferred, transferred_pages, time_spent;
>      uint64_t current_bytes; /* bytes transferred since the beginning */
> -    uint64_t switchover_bw;
> -    /* Expected bandwidth when switching over to destination QEMU */
> -    double expected_bw_per_ms;
> -    double bandwidth;
> +    double switchover_bw;
>
>      if (current_time < s->iteration_start_time + BUFFER_DELAY) {
>          return;
>      }
>
> -    switchover_bw = migrate_avail_switchover_bandwidth();
>      current_bytes = migration_transferred_bytes();
>      transferred = current_bytes - s->iteration_initial_bytes;
>      time_spent = current_time - s->iteration_start_time;
> -    bandwidth = (double)transferred / time_spent;
> -
> -    if (switchover_bw) {
> -        /*
> -         * If the user specified a switchover bandwidth, let's trust the
> -         * user so that can be more accurate than what we estimated.
> -         */
> -        expected_bw_per_ms = (double)switchover_bw / 1000;
> -    } else {
> -        /* If the user doesn't specify bandwidth, we use the estimated */
> -        expected_bw_per_ms = bandwidth;
> -    }
> -
> -    s->threshold_size = expected_bw_per_ms * migrate_downtime_limit();
> -
>      s->mbps = (((double) transferred * 8.0) /
>                 ((double) time_spent / 1000.0)) / 1000.0 / 1000.0;
>
> +    /* NOTE: only update this after bandwidth (s->mbps) updated */
> +    switchover_bw = migration_get_switchover_bw(s) / 1000;

* We got 12.5 MB/sec return value, divide it by 1000 to get bytes/ms
     ie. 12500000 / 1000  ==> 12500 bytes/ms

* It'll help to describe an example calculation in comments around
this math. Instead of dividing by 1000 here, why not make
migration_get_switchover_bw() return a bytes/ms value? That way the
actual calculation and the sample calculation in comments can go in
that function.

> +    s->threshold_size = switchover_bw * migrate_downtime_limit();
> +
           s->threshold_size = 12500 bytes/ms * 300ms  =>  3750000 bytes / 300ms

* IIUC, when pending_size  <=  s->threshold_size(=3.75MB), we can
pause the source VM and switch to the Postcopy phase, right?

>      transferred_pages = ram_get_total_transferred_pages() -
>                              s->iteration_initial_pages;
>      s->pages_per_second = (double) transferred_pages /
> @@ -3178,7 +3178,8 @@ static void migration_update_counters(MigrationState *s,
>
>      trace_migrate_transferred(transferred, time_spent,
>                                /* Both in unit bytes/ms */
> -                              bandwidth, switchover_bw / 1000,
> +                              (uint64_t)s->mbps,
> +                              (uint64_t)switchover_bw,
>                                s->threshold_size);

* Here 's->mbps' is not in bytes/ms unit, is it?

Thank you.
---
  - Prasad



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC 12/12] migration: Fix calculation of expected_downtime to take VFIO info
  2026-03-19 23:13 ` [PATCH RFC 12/12] migration: Fix calculation of expected_downtime to take VFIO info Peter Xu
@ 2026-03-23 12:05   ` Prasad Pandit
  0 siblings, 0 replies; 28+ messages in thread
From: Prasad Pandit @ 2026-03-23 12:05 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
	Markus Armbruster, Avihai Horon, Cédric Le Goater

On Fri, 20 Mar 2026 at 04:46, Peter Xu <peterx@redhat.com> wrote:
> QEMU will provide an expected downtime for the whole system during
> migration, by remembering the total dirty RAM that we synced the last time,
> divides the estimated switchover bandwidth.

* ie. Total dirty RAM synced / estimated bandwidth? If so, divides -> divided by

> That was flawed when VFIO is taking into account: consider there is a VFIO
> GPU device that contains GBs of data to migrate during stop phase.  Those
> will not be accounted in this math.
>
> Fix it by updating dirty_bytes_last_sync properly only when we go to the
> next iteration, rather than hide this update in the RAM code.  Meanwhile,
> fetch the total (rather than RAM-only) portion of dirty bytes, so as to
> include GPU device states too.
>
> Update the comment of the field to reflect its new meaning.
>
> Now after this change, the expected-downtime to be read from query-migrate
> should be very accurate even with VFIO devices involved.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/migration-stats.h | 10 +++++-----
>  migration/migration.c       | 11 ++++++++---
>  migration/ram.c             |  1 -
>  3 files changed, 13 insertions(+), 9 deletions(-)
>
> diff --git a/migration/migration-stats.h b/migration/migration-stats.h
> index 326ddb0088..14b2773beb 100644
> --- a/migration/migration-stats.h
> +++ b/migration/migration-stats.h
> @@ -31,11 +31,11 @@
>   */
>  typedef struct {
>      /*
> -     * Number of bytes that were dirty last time that we synced with
> -     * the guest memory.  We use that to calculate the downtime.  As
> -     * the remaining dirty amounts to what we know that is still dirty
> -     * since last iteration, not counting what the guest has dirtied
> -     * since we synchronized bitmaps.
> +     * Number of bytes that are still dirty after the last whole-system
> +     * sync on dirty information.  We use that to calculate the expected
> +     * downtime.  As the remaining dirty amounts to what we know that is
> +     * still dirty since last iteration, not counting what the guest has
> +     * dirtied since then on either RAM or device states.
>       */

* Multiple uses of dirty is making it rather unclear. (any funny)

* May be:
     Number of bytes pending since the last full-system accounting
     of the dirty bytes' information. We use this number to calculate
     the expected downtime. The pending bytes do not account for
     the new bytes the guest may have written since the last update
     of the dirty bytes' information.

>      uint64_t dirty_bytes_last_sync;
>      /*
> diff --git a/migration/migration.c b/migration/migration.c
> index 23c78b3a2c..1c00572d14 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3240,18 +3240,23 @@ static void migration_iteration_go_next(MigPendingData *pending)
>       */
>      qemu_savevm_query_pending(pending, false);
>
> +    /*
> +     * Update the dirty information for the whole system for this

* dirty bytes' information

> +     * iteration.  This value is used to calculate expected downtime.
> +     */
> +    qatomic_set(&mig_stats.dirty_bytes_last_sync, pending->total_bytes);
> +
>      /*
>       * Boost dirty sync count to reflect we finished one iteration.
>       *
>       * NOTE: we need to make sure when this happens (together with the
>       * event sent below) all modules have slow-synced the pending data
> -     * above.  That means a write mem barrier, but qatomic_add() should be
> -     * enough.
> +     * above and updated corresponding fields (e.g. dirty_bytes_last_sync).
>       *
>       * It's because a mgmt could wait on the iteration event to query again
>       * on pending data for policy changes (e.g. downtime adjustments).  The
>       * ordering will make sure the query will fetch the latest results from
> -     * all the modules.
> +     * all the modules on everything.
>       */
>      qatomic_add(&mig_stats.dirty_sync_count, 1);
>
> diff --git a/migration/ram.c b/migration/ram.c
> index 29e9608715..1bdf121d16 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1148,7 +1148,6 @@ static void migration_bitmap_sync(RAMState *rs, bool last_stage)
>              RAMBLOCK_FOREACH_NOT_IGNORED(block) {
>                  ramblock_sync_dirty_bitmap(rs, block);
>              }
> -            qatomic_set(&mig_stats.dirty_bytes_last_sync, ram_bytes_remaining());

* Shouldn't the ram_bytes_remaining() go into pending->total_bytes
here? ie pending->total_bytes += ram_bytes_remaining()


Thank you.
---
  - Prasad



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC 06/12] migration: Use the new save_query_pending() API directly
  2026-03-19 23:12 ` [PATCH RFC 06/12] migration: Use the new save_query_pending() API directly Peter Xu
@ 2026-03-24  9:35   ` Prasad Pandit
  0 siblings, 0 replies; 28+ messages in thread
From: Prasad Pandit @ 2026-03-24  9:35 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
	Markus Armbruster, Avihai Horon, Cédric Le Goater

On Fri, 20 Mar 2026 at 04:44, Peter Xu <peterx@redhat.com> wrote:
> It's easier to use the new API directly in the migration iterations.  This
> also paves way for follow up patches to add new data to report directly to
> the iterator function.
>
> When at it, merge the tracepoints too into one.
>
> No functional change intended.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/savevm.h     |  4 ----
>  migration/migration.c  | 16 +++++++---------
>  migration/savevm.c     | 23 ++---------------------
>  migration/trace-events |  1 +
>  4 files changed, 10 insertions(+), 34 deletions(-)
>
> diff --git a/migration/savevm.h b/migration/savevm.h
> index b116933bce..3c39190fa4 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -46,10 +46,6 @@ void qemu_savevm_state_cleanup(void);
>  void qemu_savevm_state_complete_postcopy(QEMUFile *f);
>  int qemu_savevm_state_complete_precopy(MigrationState *s);
>  void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath);
> -void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
> -                                     uint64_t *can_postcopy);
> -void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> -                                        uint64_t *can_postcopy);

* +1.

>  int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy);
>  bool qemu_savevm_state_postcopy_prepare(QEMUFile *f, Error **errp);
>  void qemu_savevm_state_end(QEMUFile *f);
> diff --git a/migration/migration.c b/migration/migration.c
> index dfc60372cf..99c4d09000 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3204,17 +3204,17 @@ typedef enum {
>   */
>  static MigIterateState migration_iteration_run(MigrationState *s)
>  {
> -    uint64_t must_precopy, can_postcopy, pending_size;
>      Error *local_err = NULL;
>      bool in_postcopy = (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE ||
>                          s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
>      bool can_switchover = migration_can_switchover(s);
> +    MigPendingData pending = { };
> +    uint64_t pending_size;
>      bool complete_ready;
>
>      /* Fast path - get the estimated amount of pending data */
> -    qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy);
> -    pending_size = must_precopy + can_postcopy;
> -    trace_migrate_pending_estimate(pending_size, must_precopy, can_postcopy);
> +    qemu_savevm_query_pending(&pending, true);
> +    pending_size = pending.precopy_bytes + pending.postcopy_bytes;

* Why do we have 'precopy_bytes' & 'postcopy_bytes'? What if
qemu_savevm_query_pending() returned the pending data size?

>      if (in_postcopy) {
>          /*
> @@ -3243,14 +3243,12 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>           * during postcopy phase.
>           */
>          if (pending_size <= s->threshold_size) {
> -            qemu_savevm_state_pending_exact(&must_precopy, &can_postcopy);
> -            pending_size = must_precopy + can_postcopy;
> -            trace_migrate_pending_exact(pending_size, must_precopy,
> -                                        can_postcopy);
> +            qemu_savevm_query_pending(&pending, false);
> +            pending_size = pending.precopy_bytes + pending.postcopy_bytes;
>          }
>
>          /* Should we switch to postcopy now? */
> -        if (must_precopy <= s->threshold_size &&
> +        if (pending.precopy_bytes <= s->threshold_size &&
>              can_switchover && qatomic_read(&s->start_postcopy)) {
>              if (postcopy_start(s, &local_err)) {
>                  migrate_error_propagate(s, error_copy(local_err));
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 6268e68382..b3285d480f 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1779,28 +1779,9 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath)
>          }
>          se->ops->save_query_pending(se->opaque, pending);
>      }
> -}
> -
> -void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> -                                        uint64_t *can_postcopy)
> -{
> -    MigPendingData pending;
> -
> -    qemu_savevm_query_pending(&pending, true);
> -
> -    *must_precopy = pending.precopy_bytes;
> -    *can_postcopy = pending.postcopy_bytes;
> -}
> -
> -void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
> -                                     uint64_t *can_postcopy)
> -{
> -    MigPendingData pending;
> -
> -    qemu_savevm_query_pending(&pending, false);
>
> -    *must_precopy = pending.precopy_bytes;
> -    *can_postcopy = pending.postcopy_bytes;
> +    trace_qemu_savevm_query_pending(fastpath, pending->precopy_bytes,
> +                                    pending->postcopy_bytes);
>  }
>



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC 05/12] migration/treewide: Merge @state_pending_{exact|estimate} APIs
  2026-03-19 23:12 ` [PATCH RFC 05/12] migration/treewide: Merge @state_pending_{exact|estimate} APIs Peter Xu
@ 2026-03-24 10:35   ` Prasad Pandit
  2026-03-25 15:20   ` Avihai Horon
  1 sibling, 0 replies; 28+ messages in thread
From: Prasad Pandit @ 2026-03-24 10:35 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
	Markus Armbruster, Avihai Horon, Cédric Le Goater,
	Halil Pasic, Christian Borntraeger, Jason Herne, Eric Farman,
	Matthew Rosato, Richard Henderson, Ilya Leoshkevich,
	David Hildenbrand, Cornelia Huck, Eric Blake,
	Vladimir Sementsov-Ogievskiy, John Snow

On Fri, 20 Mar 2026 at 04:44, Peter Xu <peterx@redhat.com> wrote:
> These two APIs are a slight duplication.  For example, there're a few users
> that directly pass in the same function.
>
> It might also be slightly error prone to provide two hooks, so that it's
> easier to happen that one module report different things via the two
> hooks. In reality they should always report the same thing, only about
> whether we should use a fast-path when the slow path might be too slow, and
> even if we need to pay with some less accuracy.
>
> Let's just merge it into one API, but instead provide a bool showing if the
> query is a fast query or not.
>
> No functional change intended.
>
> Export qemu_savevm_query_pending().  We should likely directly use the new
> API here provided when there're new users to do the query.  This will
> happen very soon.
>
> Cc: Halil Pasic <pasic@linux.ibm.com>
> Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
> Cc: Jason Herne <jjherne@linux.ibm.com>
> Cc: Eric Farman <farman@linux.ibm.com>
> Cc: Matthew Rosato <mjrosato@linux.ibm.com>
> Cc: Richard Henderson <richard.henderson@linaro.org>
> Cc: Ilya Leoshkevich <iii@linux.ibm.com>
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Cornelia Huck <cohuck@redhat.com>
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
> Cc: John Snow <jsnow@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  docs/devel/migration/main.rst  |  5 ++-
>  docs/devel/migration/vfio.rst  |  9 ++----
>  include/migration/register.h   | 52 ++++++++++--------------------
>  migration/savevm.h             |  3 ++
>  hw/s390x/s390-stattrib.c       |  8 ++---
>  hw/vfio/migration.c            | 58 +++++++++++++++++++---------------
>  migration/block-dirty-bitmap.c |  9 ++----
>  migration/ram.c                | 32 ++++++-------------
>  migration/savevm.c             | 43 ++++++++++++-------------
>  hw/vfio/trace-events           |  3 +-
>  10 files changed, 93 insertions(+), 129 deletions(-)
>
> diff --git a/docs/devel/migration/main.rst b/docs/devel/migration/main.rst
> index 234d280249..22c5910d5c 100644
> --- a/docs/devel/migration/main.rst
> +++ b/docs/devel/migration/main.rst
> @@ -519,9 +519,8 @@ An iterative device must provide:
>      data we must save.  The core migration code will use this to
>      determine when to pause the CPUs and complete the migration.
>
> -  - A ``state_pending_estimate`` function that indicates how much more
> -    data we must save.  When the estimated amount is smaller than the
> -    threshold, we call ``state_pending_exact``.
> +  - A ``save_query_pending`` function that indicates how much more
> +    data we must save.
>
>    - A ``save_live_iterate`` function should send a chunk of data until
>      the point that stream bandwidth limits tell it to stop.  Each call
> diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
> index 0790e5031d..33768c877c 100644
> --- a/docs/devel/migration/vfio.rst
> +++ b/docs/devel/migration/vfio.rst
> @@ -50,13 +50,8 @@ VFIO implements the device hooks for the iterative approach as follows:
>  * A ``load_setup`` function that sets the VFIO device on the destination in
>    _RESUMING state.
>
> -* A ``state_pending_estimate`` function that reports an estimate of the
> -  remaining pre-copy data that the vendor driver has yet to save for the VFIO
> -  device.
> -
> -* A ``state_pending_exact`` function that reads pending_bytes from the vendor
> -  driver, which indicates the amount of data that the vendor driver has yet to
> -  save for the VFIO device.
> +* A ``save_query_pending`` function that reports the remaining pre-copy
> +  data that the vendor driver has yet to save for the VFIO device.
>
>  * An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
>    active only when the VFIO device is in pre-copy states.
> diff --git a/include/migration/register.h b/include/migration/register.h
> index d0f37f5f43..2320c3a981 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -16,6 +16,15 @@
>
>  #include "hw/core/vmstate-if.h"
>
> +typedef struct MigPendingData {
> +    /* How many bytes are pending for precopy / stopcopy? */
> +    uint64_t precopy_bytes;
> +    /* How many bytes are pending that can be transferred in postcopy? */
> +    uint64_t postcopy_bytes;
> +    /* Is this a fastpath query (which can be inaccurate)? */
> +    bool fastpath;
> +} MigPendingData ;
> +

* Pending precopy bytes are: bytes still pending to be sent + dirty
(updated) bytes; How do we measure/estimate pending postcopy bytes?

* Could we do away with separating pending bytes into Precopy and
Postcopy bytes? Instead go with just pending_bytes on the source side?

* Let's not add another term/variable called stopcopy. It'll only add
to the confusion.


>  /**
>   * struct SaveVMHandlers: handler structure to finely control
>   * migration of complex subsystems and devices, such as RAM, block and
> @@ -197,46 +206,17 @@ typedef struct SaveVMHandlers {
>      bool (*save_postcopy_prepare)(QEMUFile *f, void *opaque, Error **errp);
>
>      /**
> -     * @state_pending_estimate
> -     *
> -     * This estimates the remaining data to transfer
> -     *
> -     * Sum of @can_postcopy and @must_postcopy is the whole amount of
> -     * pending data.
> -     *
> -     * @opaque: data pointer passed to register_savevm_live()
> -     * @must_precopy: amount of data that must be migrated in precopy
> -     *                or in stopped state, i.e. that must be migrated
> -     *                before target start.
> -     * @can_postcopy: amount of data that can be migrated in postcopy
> -     *                or in stopped state, i.e. after target start.
> -     *                Some can also be migrated during precopy (RAM).
> -     *                Some must be migrated after source stops
> -     *                (block-dirty-bitmap)
> -     */
> -    void (*state_pending_estimate)(void *opaque, uint64_t *must_precopy,
> -                                   uint64_t *can_postcopy);
> -
> -    /**
> -     * @state_pending_exact
> -     *
> -     * This calculates the exact remaining data to transfer
> +     * @save_query_pending
>       *
> -     * Sum of @can_postcopy and @must_postcopy is the whole amount of
> -     * pending data.
> +     * This estimates the remaining data to transfer on the source side.
> +     * It's highly suggested that the module should implement both fastpath
> +     * and slowpath version of it when it can be slow (for more information
> +     * please check pending->fastpath field).
>       *
>       * @opaque: data pointer passed to register_savevm_live()
> -     * @must_precopy: amount of data that must be migrated in precopy
> -     *                or in stopped state, i.e. that must be migrated
> -     *                before target start.
> -     * @can_postcopy: amount of data that can be migrated in postcopy
> -     *                or in stopped state, i.e. after target start.
> -     *                Some can also be migrated during precopy (RAM).
> -     *                Some must be migrated after source stops
> -     *                (block-dirty-bitmap)
> +     * @pending: pointer to a MigPendingData struct
>       */
> -    void (*state_pending_exact)(void *opaque, uint64_t *must_precopy,
> -                                uint64_t *can_postcopy);
> +    void (*save_query_pending)(void *opaque, MigPendingData *pending);
>
>      /**
>       * @load_state
> diff --git a/migration/savevm.h b/migration/savevm.h
> index b3d1e8a13c..b116933bce 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -14,6 +14,8 @@
>  #ifndef MIGRATION_SAVEVM_H
>  #define MIGRATION_SAVEVM_H
>
> +#include "migration/register.h"
> +
>  #define QEMU_VM_FILE_MAGIC           0x5145564d
>  #define QEMU_VM_FILE_VERSION_COMPAT  0x00000002
>  #define QEMU_VM_FILE_VERSION         0x00000003
> @@ -43,6 +45,7 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
>  void qemu_savevm_state_cleanup(void);
>  void qemu_savevm_state_complete_postcopy(QEMUFile *f);
>  int qemu_savevm_state_complete_precopy(MigrationState *s);
> +void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath);
>  void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
>                                       uint64_t *can_postcopy);
>  void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> diff --git a/hw/s390x/s390-stattrib.c b/hw/s390x/s390-stattrib.c
> index d808ece3b9..b1ec51c77a 100644
> --- a/hw/s390x/s390-stattrib.c
> +++ b/hw/s390x/s390-stattrib.c
> @@ -187,15 +187,14 @@ static int cmma_save_setup(QEMUFile *f, void *opaque, Error **errp)
>      return 0;
>  }
>
> -static void cmma_state_pending(void *opaque, uint64_t *must_precopy,
> -                               uint64_t *can_postcopy)
> +static void cmma_state_pending(void *opaque, MigPendingData *pending)
>  {
>      S390StAttribState *sas = S390_STATTRIB(opaque);
>      S390StAttribClass *sac = S390_STATTRIB_GET_CLASS(sas);
>      long long res = sac->get_dirtycount(sas);
>
>      if (res >= 0) {
> -        *must_precopy += res;
> +        pending->precopy_bytes += res;
>      }
>  }
>
> @@ -340,8 +339,7 @@ static SaveVMHandlers savevm_s390_stattrib_handlers = {
>      .save_setup = cmma_save_setup,
>      .save_live_iterate = cmma_save_iterate,
>      .save_complete = cmma_save_complete,
> -    .state_pending_exact = cmma_state_pending,
> -    .state_pending_estimate = cmma_state_pending,
> +    .save_query_pending = cmma_state_pending,
>      .save_cleanup = cmma_save_cleanup,
>      .load_state = cmma_load,
>      .is_active = cmma_active,
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 827d3ded63..c054c749b0 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -570,42 +570,51 @@ static void vfio_save_cleanup(void *opaque)
>      trace_vfio_save_cleanup(vbasedev->name);
>  }
>
> -static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
> -                                        uint64_t *can_postcopy)
> +static void vfio_state_pending_sync(VFIODevice *vbasedev)
>  {
> -    VFIODevice *vbasedev = opaque;
>      VFIOMigration *migration = vbasedev->migration;
>
> -    if (!vfio_device_state_is_precopy(vbasedev)) {
> -        return;
> -    }
> +    vfio_query_stop_copy_size(vbasedev);
>
> -    *must_precopy +=
> -        migration->precopy_init_size + migration->precopy_dirty_size;
> +    if (vfio_device_state_is_precopy(vbasedev)) {
> +        vfio_query_precopy_size(migration);
> +    }
>
> -    trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
> -                                      *can_postcopy,
> -                                      migration->precopy_init_size,
> -                                      migration->precopy_dirty_size);
> +    /*
> +     * In all cases, all PRECOPY data should be no more than STOPCOPY data.
> +     * Otherwise we have a problem.  So far, only dump some errors.
> +     */
> +    if (migration->precopy_init_size + migration->precopy_dirty_size <
> +        migration->stopcopy_size) {
> +        error_report_once("%s: wrong pending data (init=%" PRIx64
> +                          ", dirty=%"PRIx64", stop=%"PRIx64")",
> +                          __func__, migration->precopy_init_size,
> +                          migration->precopy_dirty_size,
> +                          migration->stopcopy_size);
> +    }
>  }
>
> -static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
> -                                     uint64_t *can_postcopy)
> +static void vfio_state_pending(void *opaque, MigPendingData *pending)
>  {
>      VFIODevice *vbasedev = opaque;
>      VFIOMigration *migration = vbasedev->migration;
> +    uint64_t remain;
>
> -    vfio_query_stop_copy_size(vbasedev);
> -    *must_precopy += migration->stopcopy_size;
> -
> -    if (vfio_device_state_is_precopy(vbasedev)) {
> -        vfio_query_precopy_size(migration);
> +    if (pending->fastpath) {
> +        if (!vfio_device_state_is_precopy(vbasedev)) {
> +            return;
> +        }
> +        remain = migration->precopy_init_size + migration->precopy_dirty_size;
> +    } else {
> +        vfio_state_pending_sync(vbasedev);
> +        remain = migration->stopcopy_size;
>      }
>
> -    trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
> -                                   migration->stopcopy_size,
> -                                   migration->precopy_init_size,
> -                                   migration->precopy_dirty_size);
> +    pending->precopy_bytes += remain;
> +
> +    trace_vfio_state_pending(vbasedev->name, migration->stopcopy_size,
> +                             migration->precopy_init_size,
> +                             migration->precopy_dirty_size);
>  }
>
>  static bool vfio_is_active_iterate(void *opaque)
> @@ -850,8 +859,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>      .save_prepare = vfio_save_prepare,
>      .save_setup = vfio_save_setup,
>      .save_cleanup = vfio_save_cleanup,
> -    .state_pending_estimate = vfio_state_pending_estimate,
> -    .state_pending_exact = vfio_state_pending_exact,
> +    .save_query_pending = vfio_state_pending,
>      .is_active_iterate = vfio_is_active_iterate,
>      .save_live_iterate = vfio_save_iterate,
>      .save_complete = vfio_save_complete_precopy,
> diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
> index a061aad817..376a9b43ac 100644
> --- a/migration/block-dirty-bitmap.c
> +++ b/migration/block-dirty-bitmap.c
> @@ -766,9 +766,7 @@ static int dirty_bitmap_save_complete(QEMUFile *f, void *opaque)
>      return 0;
>  }
>
> -static void dirty_bitmap_state_pending(void *opaque,
> -                                       uint64_t *must_precopy,
> -                                       uint64_t *can_postcopy)
> +static void dirty_bitmap_state_pending(void *opaque, MigPendingData *data)
>  {
>      DBMSaveState *s = &((DBMState *)opaque)->save;
>      SaveBitmapState *dbms;
> @@ -788,7 +786,7 @@ static void dirty_bitmap_state_pending(void *opaque,
>
>      trace_dirty_bitmap_state_pending(pending);
>
> -    *can_postcopy += pending;
> +    data->postcopy_bytes += pending;
>  }
>
>  /* First occurrence of this bitmap. It should be created if doesn't exist */
> @@ -1250,8 +1248,7 @@ static SaveVMHandlers savevm_dirty_bitmap_handlers = {
>      .save_setup = dirty_bitmap_save_setup,
>      .save_complete = dirty_bitmap_save_complete,
>      .has_postcopy = dirty_bitmap_has_postcopy,
> -    .state_pending_exact = dirty_bitmap_state_pending,
> -    .state_pending_estimate = dirty_bitmap_state_pending,
> +    .save_query_pending = dirty_bitmap_state_pending,
>      .save_live_iterate = dirty_bitmap_save_iterate,
>      .is_active_iterate = dirty_bitmap_is_active_iterate,
>      .load_state = dirty_bitmap_load,
> diff --git a/migration/ram.c b/migration/ram.c
> index 979751f61b..89f761a471 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3443,30 +3443,17 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>      return qemu_fflush(f);
>  }
>
> -static void ram_state_pending_estimate(void *opaque, uint64_t *must_precopy,
> -                                       uint64_t *can_postcopy)
> -{
> -    RAMState **temp = opaque;
> -    RAMState *rs = *temp;
> -
> -    uint64_t remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
> -
> -    if (migrate_postcopy_ram()) {
> -        /* We can do postcopy, and all the data is postcopiable */
> -        *can_postcopy += remaining_size;
> -    } else {
> -        *must_precopy += remaining_size;
> -    }
> -}
> -
> -static void ram_state_pending_exact(void *opaque, uint64_t *must_precopy,
> -                                    uint64_t *can_postcopy)
> +static void ram_state_pending(void *opaque, MigPendingData *pending)
>  {
>      RAMState **temp = opaque;
>      RAMState *rs = *temp;
>      uint64_t remaining_size;
>
> -    if (!migration_in_postcopy()) {
> +    /*
> +     * Sync is only needed either with: (1) a fast query, or (2) postcopy
> +     * as started (in which case no new dirty will generate anymore).

* as -> has
* (... no new dirty will generate anymore)  - ...?

> +     */
> +    if (!pending->fastpath && !migration_in_postcopy()) {

* The comment above and 'if' conditionals don't seem to match. We are
saying a call to 'migration_bitmap_sync_precopy()' is needed either
with fast query ie. 'pending->fastpath = true'  OR Postcopy has
started ie. migration_in_postcopy() = true, right? What is the
right/intended behaviour here?

>          bql_lock();
>          WITH_RCU_READ_LOCK_GUARD() {
>              migration_bitmap_sync_precopy(false);
> @@ -3478,9 +3465,9 @@ static void ram_state_pending_exact(void *opaque, uint64_t *must_precopy,
>
>      if (migrate_postcopy_ram()) {
>          /* We can do postcopy, and all the data is postcopiable */
> -        *can_postcopy += remaining_size;
> +        pending->postcopy_bytes += remaining_size;
>      } else {
> -        *must_precopy += remaining_size;
> +        pending->precopy_bytes += remaining_size;
>      }
>  }
>
> @@ -4703,8 +4690,7 @@ static SaveVMHandlers savevm_ram_handlers = {
>      .save_live_iterate = ram_save_iterate,
>      .save_complete = ram_save_complete,
>      .has_postcopy = ram_has_postcopy,
> -    .state_pending_exact = ram_state_pending_exact,
> -    .state_pending_estimate = ram_state_pending_estimate,
> +    .save_query_pending = ram_state_pending,
>      .load_state = ram_load,
>      .save_cleanup = ram_save_cleanup,
>      .load_setup = ram_load_setup,
> diff --git a/migration/savevm.c b/migration/savevm.c
> index dd58f2a705..6268e68382 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1762,46 +1762,45 @@ int qemu_savevm_state_complete_precopy(MigrationState *s)
>      return qemu_fflush(f);
>  }
>
> -/* Give an estimate of the amount left to be transferred,
> - * the result is split into the amount for units that can and
> - * for units that can't do postcopy.
> - */
> -void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> -                                        uint64_t *can_postcopy)
> +void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath)
>  {
>      SaveStateEntry *se;
>
> -    *must_precopy = 0;
> -    *can_postcopy = 0;
> +    pending->precopy_bytes = 0;
> +    pending->postcopy_bytes = 0;
> +    pending->fastpath = fastpath;
>
>      QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> -        if (!se->ops || !se->ops->state_pending_estimate) {
> +        if (!se->ops || !se->ops->save_query_pending) {
>              continue;
>          }
>          if (!qemu_savevm_state_active(se)) {
>              continue;
>          }
> -        se->ops->state_pending_estimate(se->opaque, must_precopy, can_postcopy);
> +        se->ops->save_query_pending(se->opaque, pending);
>      }
>  }
>
> +void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> +                                        uint64_t *can_postcopy)
> +{
> +    MigPendingData pending;
> +
> +    qemu_savevm_query_pending(&pending, true);
> +
> +    *must_precopy = pending.precopy_bytes;
> +    *can_postcopy = pending.postcopy_bytes;
> +}
> +
>  void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
>                                       uint64_t *can_postcopy)
>  {
> -    SaveStateEntry *se;
> +    MigPendingData pending;
>
> -    *must_precopy = 0;
> -    *can_postcopy = 0;
> +    qemu_savevm_query_pending(&pending, false);
>
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> -        if (!se->ops || !se->ops->state_pending_exact) {
> -            continue;
> -        }
> -        if (!qemu_savevm_state_active(se)) {
> -            continue;
> -        }
> -        se->ops->state_pending_exact(se->opaque, must_precopy, can_postcopy);
> -    }
> +    *must_precopy = pending.precopy_bytes;
> +    *can_postcopy = pending.postcopy_bytes;
>  }

* IIUC, we send fastpath=true from _pending_estimate() because we want
to avoid call to migration_bitmap_sync_precopy(), as calling it won't
be fast. And we send fastpath=false from _pending_exact() function
because we want to trigger call to migration_bitmap_sync_precopy()
routine.

* Instead of calling the boolean parameter 'fastpath',  let's use
'bool use_precision'  OR 'bool sync_data'. And then in
qemu_savevm_query_pending()

    qemu_savevm_query_pending():
            if (use_precision || sync_data) {
                     migration_bitmap_sync_precopy()
            }

* When reading code, equating fastpath with 'estimation' and slowpath
with 'precision/accuracy' is unnatural.


Thank you.
---
  - Prasad



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC 07/12] migration: Introduce stopcopy_bytes in save_query_pending()
  2026-03-19 23:12 ` [PATCH RFC 07/12] migration: Introduce stopcopy_bytes in save_query_pending() Peter Xu
@ 2026-03-24 11:05   ` Prasad Pandit
  2026-03-25 16:54   ` Avihai Horon
  1 sibling, 0 replies; 28+ messages in thread
From: Prasad Pandit @ 2026-03-24 11:05 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
	Markus Armbruster, Avihai Horon, Cédric Le Goater

On Fri, 20 Mar 2026 at 04:44, Peter Xu <peterx@redhat.com> wrote:
> Allow modules to report data that can only be migrated after VM is stopped.
>
> When this concept is introduced, we will need to account stopcopy size to
> be part of pending_size as before.
>
> One thing to mention is, when there can be stopcopy size, it means the old
> "pending_size" may not always be able to reach low enough to kickoff an
> slow version of query sync.  While it used to be almost guaranteed to
> happen because if we keep iterating, normally pending_size can go to zero
> for precopy-only because we assume everything reported can be migrated in
> precopy phase.
>
> So we need to make sure QEMU will kickoff a synchronized version of query
> pending when all precopy data is migrated too.  This might be important to
> VFIO to keep making progress even if the downtime cannot yet be satisfied.
>
> So far, this patch should introduce no functional change, as no module yet
> report stopcopy size.
>
> This will pave way for VFIO to properly report its pending data sizes,
> which was actually buggy today.  Will be done in follow up patches.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/migration/register.h | 12 +++++++++
>  migration/migration.c        | 52 ++++++++++++++++++++++++++++++------
>  migration/savevm.c           |  7 +++--
>  migration/trace-events       |  2 +-
>  4 files changed, 62 insertions(+), 11 deletions(-)
>
> diff --git a/include/migration/register.h b/include/migration/register.h
> index 2320c3a981..3824958ba5 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -17,12 +17,24 @@
>  #include "hw/core/vmstate-if.h"
>
>  typedef struct MigPendingData {
> +    /*
> +     * Modules can only update these fields in a query request via its
> +     * save_query_pending() API.
> +     */
>      /* How many bytes are pending for precopy / stopcopy? */
>      uint64_t precopy_bytes;
>      /* How many bytes are pending that can be transferred in postcopy? */
>      uint64_t postcopy_bytes;
> +    /* How many bytes that can only be transferred when VM stopped? */
> +    uint64_t stopcopy_bytes;

* This differentiation of pending bytes into
precopy/postcopy/stopcopy/total could become confusing, because their
intention isn't readily clear. Pending bytes indicates bytes still
pending or waiting to be sent. So is not the case with postcopy_bytes
and stopcopy_bytes and total_bytes. Do we really need this separation?

> +    /*
> +     * Modules should never update these fields.
> +     */
>      /* Is this a fastpath query (which can be inaccurate)? */
>      bool fastpath;

* Fast & slow adjectives generally go with speed, rather than accuracy
or rough estimates. Here its usage is more to decide whether to return
an estimated value OR an accurate value.It'll help to rename it to
something suitable for accuracy/precision.

> +    /* Total pending data */
> +    uint64_t total_bytes;
>  } MigPendingData ;
>
>  /**
> diff --git a/migration/migration.c b/migration/migration.c
> index 99c4d09000..42facb16d1 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3198,6 +3198,44 @@ typedef enum {
>      MIG_ITERATE_BREAK,          /* Break the loop */
>  } MigIterateState;
>
> +/* Are we ready to move to the next iteration phase? */
> +static bool migration_iteration_next_ready(MigrationState *s,
> +                                           MigPendingData *pending)
> +{
> +    /*
> +     * If the estimated values already suggest us to switchover, mark this
> +     * iteration finished, time to do a slow sync.
> +     */
> +    if (pending->total_bytes <= s->threshold_size) {
> +        return true;
> +    }
> +
> +    /*
> +     * Since we may have modules reporting stop-only data, we also want to
> +     * re-query with slow mode if all precopy data is moved over.  This
> +     * will also mark the current iteration done.
> +     *
> +     * This could happen when e.g. a module (like, VFIO) reports stopcopy
> +     * size too large so it will never yet satisfy the downtime with the
> +     * current setup (above check).  Here, slow version of re-query helps
> +     * because we keep trying the best to move whatever we have.
> +     */
> +    if (pending->precopy_bytes == 0) {
> +        return true;
> +    }
> +
> +    return false;
> +}
> +
> +static void migration_iteration_go_next(MigPendingData *pending)
> +{
> +    /*
> +     * Do a slow sync will achieve this.  TODO: move RAM iteration code
> +     * into the core layer.
> +     */
> +    qemu_savevm_query_pending(pending, false);
> +}

* What is _go_next() function used for? To  trigger
migration_bitmap_sync_precopy() call via   _savevm_query_pending(,
false)?

>  /*
>   * Return true if continue to the next iteration directly, false
>   * otherwise.
> @@ -3209,12 +3247,10 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>                          s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
>      bool can_switchover = migration_can_switchover(s);
>      MigPendingData pending = { };
> -    uint64_t pending_size;
>      bool complete_ready;
>
>      /* Fast path - get the estimated amount of pending data */
>      qemu_savevm_query_pending(&pending, true);
> -    pending_size = pending.precopy_bytes + pending.postcopy_bytes;
>
>      if (in_postcopy) {
>          /*
> @@ -3222,7 +3258,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>           * postcopy completion doesn't rely on can_switchover, because when
>           * POSTCOPY_ACTIVE it means switchover already happened.
>           */
> -        complete_ready = !pending_size;
> +        complete_ready = !pending.total_bytes;
>          if (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE &&
>              (s->postcopy_package_loaded || complete_ready)) {
>              /*
> @@ -3242,9 +3278,8 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>           * postcopy started, so ESTIMATE should always match with EXACT
>           * during postcopy phase.
>           */
> -        if (pending_size <= s->threshold_size) {
> -            qemu_savevm_query_pending(&pending, false);
> -            pending_size = pending.precopy_bytes + pending.postcopy_bytes;
> +        if (migration_iteration_next_ready(s, &pending)) {
> +            migration_iteration_go_next(&pending);
>          }
>
>          /* Should we switch to postcopy now? */
> @@ -3264,11 +3299,12 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>           * (2) Pending size is no more than the threshold specified
>           *     (which was calculated from expected downtime)
>           */
> -        complete_ready = can_switchover && (pending_size <= s->threshold_size);
> +        complete_ready = can_switchover &&
> +            (pending.total_bytes <= s->threshold_size);
>      }
>
>      if (complete_ready) {
> -        trace_migration_thread_low_pending(pending_size);
> +        trace_migration_thread_low_pending(pending.total_bytes);
>          migration_completion(s);
>          return MIG_ITERATE_BREAK;
>      }
> diff --git a/migration/savevm.c b/migration/savevm.c
> index b3285d480f..812c72b3e5 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1766,8 +1766,7 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath)
>  {
>      SaveStateEntry *se;
>
> -    pending->precopy_bytes = 0;
> -    pending->postcopy_bytes = 0;
> +    memset(pending, 0, sizeof(*pending));
>      pending->fastpath = fastpath;
>
>      QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> @@ -1780,7 +1779,11 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath)
>          se->ops->save_query_pending(se->opaque, pending);
>      }
>
> +    pending->total_bytes = pending->precopy_bytes +
> +        pending->stopcopy_bytes + pending->postcopy_bytes;
> +
>      trace_qemu_savevm_query_pending(fastpath, pending->precopy_bytes,
> +                                    pending->stopcopy_bytes,
>                                      pending->postcopy_bytes);
>  }
>
> diff --git a/migration/trace-events b/migration/trace-events
> index 5f836a8652..175f09f8ad 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -7,7 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
>  qemu_loadvm_state_post_main(int ret) "%d"
>  qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
>  qemu_savevm_send_packaged(void) ""
> -qemu_savevm_query_pending(bool fast, uint64_t precopy, uint64_t postcopy) "fast=%d, precopy=%"PRIu64", postcopy=%"PRIu64
> +qemu_savevm_query_pending(bool fast, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy) "fast=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64
>  loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
>  loadvm_state_setup(void) ""
>  loadvm_state_cleanup(void) ""
> --
> 2.50.1
>
>



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC 03/12] vfio/migration: Throttle vfio_save_block() on data size to read
  2026-03-19 23:12 ` [PATCH RFC 03/12] vfio/migration: Throttle vfio_save_block() on data size to read Peter Xu
@ 2026-03-25 14:10   ` Avihai Horon
  0 siblings, 0 replies; 28+ messages in thread
From: Avihai Horon @ 2026-03-25 14:10 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
	Markus Armbruster, Cédric Le Goater

Hi Peter,

Thanks for sending this series.

On 3/20/2026 1:12 AM, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> During precopy phase, VFIO maintains two counters for init/dirty data
> tracking for query estimations.
>
> VFIO fetches data during precopy by reading from the VFIO fd, after
> fetching it'll deduct the read size.
>
> Here since the fd's size can dynamically change, I think it means VFIO may
> read more than what it "thought" were there for fetching.
>
> I highly suspect it's also relevant to a weird case in the function of
> vfio_update_estimated_pending_data(), where when VFIO reads 0 from the FD
> it will _reset_ the two counters, instead of asserting both of them being
> zeros, which looks pretty hackish.
>
> Just guarantee it from userspace level that VFIO won't read more than what
> it expects for now.

The VFIO_MIG_GET_PRECOPY_INFO ioctl returns an estimation of the data 
size currently available for reading. So, even if the ioctl returns X 
bytes, it may be that there are more than X bytes to read or less than X 
bytes.
The code was written in a flexible way to handle such discrepancies.

Because we are dealing with an estimation, I don't think we can assert 
that the counters are zero, and I don't think reading only up to the 
cached size gives us any benefit:
If the estimation is lower than actual available data, we are just 
deferring sending of the remaining data to a later stage.
If the estimation is higher than actual available data, we may still 
read() zero when the cached values are not zero.

I think we should keep the code as is.

Does that make sense?

Thanks.

>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   hw/vfio/migration.c | 18 ++++++++++++------
>   1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 83327b6573..851ea783f3 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -357,12 +357,18 @@ static int vfio_query_precopy_size(VFIOMigration *migration)
>   }
>
>   /* Returns the size of saved data on success and -errno on error */
> -static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
> +static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration,
> +                               bool precopy)
>   {
> -    ssize_t data_size;
> +    ssize_t data_size = migration->data_buffer_size;
> +
> +    if (precopy) {
> +        /* Limit the buffer size to make sure cached stats don't overflow */
> +        data_size = MIN(data_size, migration->precopy_init_size +
> +                        migration->precopy_dirty_size);
> +    }
>
> -    data_size = read(migration->data_fd, migration->data_buffer,
> -                     migration->data_buffer_size);
> +    data_size = read(migration->data_fd, migration->data_buffer, data_size);
>       if (data_size < 0) {
>           /*
>            * Pre-copy emptied all the device state for now. For more information,
> @@ -623,7 +629,7 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>           migration->event_save_iterate_started = true;
>       }
>
> -    data_size = vfio_save_block(f, migration);
> +    data_size = vfio_save_block(f, migration, true);
>       if (data_size < 0) {
>           return data_size;
>       }
> @@ -667,7 +673,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>       }
>
>       do {
> -        data_size = vfio_save_block(f, vbasedev->migration);
> +        data_size = vfio_save_block(f, vbasedev->migration, false);
>           if (data_size < 0) {
>               return data_size;
>           }
> --
> 2.50.1
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC 04/12] vfio/migration: Cache stop size in VFIOMigration
  2026-03-19 23:12 ` [PATCH RFC 04/12] vfio/migration: Cache stop size in VFIOMigration Peter Xu
@ 2026-03-25 14:15   ` Avihai Horon
  0 siblings, 0 replies; 28+ messages in thread
From: Avihai Horon @ 2026-03-25 14:15 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
	Markus Armbruster, Cédric Le Goater


On 3/20/2026 1:12 AM, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> Add a field to cache stop size.  Note that there's an initial value change
> in vfio_save_setup for the stop size default, but it shouldn't matter if it
> is followed with a math of MIN() against VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   hw/vfio/vfio-migration-internal.h |  1 +
>   hw/vfio/migration.c               | 43 +++++++++++++++++--------------
>   2 files changed, 24 insertions(+), 20 deletions(-)
>
> diff --git a/hw/vfio/vfio-migration-internal.h b/hw/vfio/vfio-migration-internal.h
> index 814fbd9eba..08df32c055 100644
> --- a/hw/vfio/vfio-migration-internal.h
> +++ b/hw/vfio/vfio-migration-internal.h
> @@ -47,6 +47,7 @@ typedef struct VFIOMigration {
>       uint64_t mig_flags;
>       uint64_t precopy_init_size;
>       uint64_t precopy_dirty_size;
> +    uint64_t stopcopy_size;
>       bool multifd_transfer;
>       VFIOMultifd *multifd;
>       bool initial_data_sent;
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 851ea783f3..827d3ded63 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -41,6 +41,12 @@
>    */
>   #define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
>
> +/*
> + * Migration size of VFIO devices can be as little as a few KBs or as big as
> + * many GBs. This value should be big enough to cover the worst case.
> + */
> +#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
> +
>   static unsigned long bytes_transferred;
>
>   static const char *mig_state_to_str(enum vfio_device_mig_state state)
> @@ -314,8 +320,7 @@ static void vfio_migration_cleanup(VFIODevice *vbasedev)
>       migration->data_fd = -1;
>   }
>
> -static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
> -                                     uint64_t *stop_copy_size)
> +static int vfio_query_stop_copy_size(VFIODevice *vbasedev)
>   {
>       uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
>                                 sizeof(struct vfio_device_feature_mig_data_size),
> @@ -323,16 +328,22 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
>       struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
>       struct vfio_device_feature_mig_data_size *mig_data_size =
>           (struct vfio_device_feature_mig_data_size *)feature->data;
> +    VFIOMigration *migration = vbasedev->migration;
>
>       feature->argsz = sizeof(buf);
>       feature->flags =
>           VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIG_DATA_SIZE;
>
>       if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
> +        /*
> +         * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE
> +         * is reported so downtime limit won't be violated.
> +         */
> +        migration->stopcopy_size = VFIO_MIG_STOP_COPY_SIZE;
>           return -errno;
>       }
>
> -    *stop_copy_size = mig_data_size->stop_copy_length;
> +    migration->stopcopy_size = mig_data_size->stop_copy_length;
>
>       return 0;
>   }
> @@ -415,6 +426,9 @@ static void vfio_update_estimated_pending_data(VFIOMigration *migration,
>           return;
>       }
>
> +    /* The total size remaining requires separate accounting */
> +    migration->stopcopy_size -= data_size;

stopcopy_size is also an estimation, so I think it's safer to have:

   migration->stopcopy_size -= MIN(migration->stopcopy_size, data_size);

Thanks.

> +
>       if (migration->precopy_init_size) {
>           uint64_t init_size = MIN(migration->precopy_init_size, data_size);
>
> @@ -469,7 +483,6 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
>   {
>       VFIODevice *vbasedev = opaque;
>       VFIOMigration *migration = vbasedev->migration;
> -    uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
>       int ret;
>
>       if (!vfio_multifd_setup(vbasedev, false, errp)) {
> @@ -478,9 +491,9 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
>
>       qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>
> -    vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
> +    vfio_query_stop_copy_size(vbasedev);
>       migration->data_buffer_size = MIN(VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE,
> -                                      stop_copy_size);
> +                                      migration->stopcopy_size);
>       migration->data_buffer = g_try_malloc0(migration->data_buffer_size);
>       if (!migration->data_buffer) {
>           error_setg(errp, "%s: Failed to allocate migration data buffer",
> @@ -576,32 +589,22 @@ static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
>                                         migration->precopy_dirty_size);
>   }
>
> -/*
> - * Migration size of VFIO devices can be as little as a few KBs or as big as
> - * many GBs. This value should be big enough to cover the worst case.
> - */
> -#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
> -
>   static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
>                                        uint64_t *can_postcopy)
>   {
>       VFIODevice *vbasedev = opaque;
>       VFIOMigration *migration = vbasedev->migration;
> -    uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
>
> -    /*
> -     * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is
> -     * reported so downtime limit won't be violated.
> -     */
> -    vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
> -    *must_precopy += stop_copy_size;
> +    vfio_query_stop_copy_size(vbasedev);
> +    *must_precopy += migration->stopcopy_size;
>
>       if (vfio_device_state_is_precopy(vbasedev)) {
>           vfio_query_precopy_size(migration);
>       }
>
>       trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
> -                                   stop_copy_size, migration->precopy_init_size,
> +                                   migration->stopcopy_size,
> +                                   migration->precopy_init_size,
>                                      migration->precopy_dirty_size);
>   }
>
> --
> 2.50.1
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC 05/12] migration/treewide: Merge @state_pending_{exact|estimate} APIs
  2026-03-19 23:12 ` [PATCH RFC 05/12] migration/treewide: Merge @state_pending_{exact|estimate} APIs Peter Xu
  2026-03-24 10:35   ` Prasad Pandit
@ 2026-03-25 15:20   ` Avihai Horon
  1 sibling, 0 replies; 28+ messages in thread
From: Avihai Horon @ 2026-03-25 15:20 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
	Markus Armbruster, Cédric Le Goater, Halil Pasic,
	Christian Borntraeger, Jason Herne, Eric Farman, Matthew Rosato,
	Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
	Cornelia Huck, Eric Blake, Vladimir Sementsov-Ogievskiy,
	John Snow


On 3/20/2026 1:12 AM, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> These two APIs are a slight duplication.  For example, there're a few users
> that directly pass in the same function.
>
> It might also be slightly error prone to provide two hooks, so that it's
> easier to happen that one module report different things via the two
> hooks. In reality they should always report the same thing, only about
> whether we should use a fast-path when the slow path might be too slow, and
> even if we need to pay with some less accuracy.
>
> Let's just merge it into one API, but instead provide a bool showing if the
> query is a fast query or not.
>
> No functional change intended.
>
> Export qemu_savevm_query_pending().  We should likely directly use the new
> API here provided when there're new users to do the query.  This will
> happen very soon.
>
> Cc: Halil Pasic <pasic@linux.ibm.com>
> Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
> Cc: Jason Herne <jjherne@linux.ibm.com>
> Cc: Eric Farman <farman@linux.ibm.com>
> Cc: Matthew Rosato <mjrosato@linux.ibm.com>
> Cc: Richard Henderson <richard.henderson@linaro.org>
> Cc: Ilya Leoshkevich <iii@linux.ibm.com>
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Cornelia Huck <cohuck@redhat.com>
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
> Cc: John Snow <jsnow@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   docs/devel/migration/main.rst  |  5 ++-
>   docs/devel/migration/vfio.rst  |  9 ++----
>   include/migration/register.h   | 52 ++++++++++--------------------
>   migration/savevm.h             |  3 ++
>   hw/s390x/s390-stattrib.c       |  8 ++---
>   hw/vfio/migration.c            | 58 +++++++++++++++++++---------------
>   migration/block-dirty-bitmap.c |  9 ++----
>   migration/ram.c                | 32 ++++++-------------
>   migration/savevm.c             | 43 ++++++++++++-------------
>   hw/vfio/trace-events           |  3 +-
>   10 files changed, 93 insertions(+), 129 deletions(-)
>
> diff --git a/docs/devel/migration/main.rst b/docs/devel/migration/main.rst
> index 234d280249..22c5910d5c 100644
> --- a/docs/devel/migration/main.rst
> +++ b/docs/devel/migration/main.rst
> @@ -519,9 +519,8 @@ An iterative device must provide:
>       data we must save.  The core migration code will use this to
>       determine when to pause the CPUs and complete the migration.

We should also remove the state_pending_exact section above.

>
> -  - A ``state_pending_estimate`` function that indicates how much more
> -    data we must save.  When the estimated amount is smaller than the
> -    threshold, we call ``state_pending_exact``.
> +  - A ``save_query_pending`` function that indicates how much more
> +    data we must save.
>
>     - A ``save_live_iterate`` function should send a chunk of data until
>       the point that stream bandwidth limits tell it to stop.  Each call
> diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
> index 0790e5031d..33768c877c 100644
> --- a/docs/devel/migration/vfio.rst
> +++ b/docs/devel/migration/vfio.rst
> @@ -50,13 +50,8 @@ VFIO implements the device hooks for the iterative approach as follows:
>   * A ``load_setup`` function that sets the VFIO device on the destination in
>     _RESUMING state.
>
> -* A ``state_pending_estimate`` function that reports an estimate of the
> -  remaining pre-copy data that the vendor driver has yet to save for the VFIO
> -  device.
> -
> -* A ``state_pending_exact`` function that reads pending_bytes from the vendor
> -  driver, which indicates the amount of data that the vendor driver has yet to
> -  save for the VFIO device.
> +* A ``save_query_pending`` function that reports the remaining pre-copy
> +  data that the vendor driver has yet to save for the VFIO device.
>
>   * An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
>     active only when the VFIO device is in pre-copy states.
> diff --git a/include/migration/register.h b/include/migration/register.h
> index d0f37f5f43..2320c3a981 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -16,6 +16,15 @@
>
>   #include "hw/core/vmstate-if.h"
>
> +typedef struct MigPendingData {
> +    /* How many bytes are pending for precopy / stopcopy? */
> +    uint64_t precopy_bytes;
> +    /* How many bytes are pending that can be transferred in postcopy? */
> +    uint64_t postcopy_bytes;
> +    /* Is this a fastpath query (which can be inaccurate)? */
> +    bool fastpath;
> +} MigPendingData ;

Extra space before semicolon.

> +
>   /**
>    * struct SaveVMHandlers: handler structure to finely control
>    * migration of complex subsystems and devices, such as RAM, block and
> @@ -197,46 +206,17 @@ typedef struct SaveVMHandlers {
>       bool (*save_postcopy_prepare)(QEMUFile *f, void *opaque, Error **errp);
>
>       /**
> -     * @state_pending_estimate
> -     *
> -     * This estimates the remaining data to transfer
> -     *
> -     * Sum of @can_postcopy and @must_postcopy is the whole amount of
> -     * pending data.
> -     *
> -     * @opaque: data pointer passed to register_savevm_live()
> -     * @must_precopy: amount of data that must be migrated in precopy
> -     *                or in stopped state, i.e. that must be migrated
> -     *                before target start.
> -     * @can_postcopy: amount of data that can be migrated in postcopy
> -     *                or in stopped state, i.e. after target start.
> -     *                Some can also be migrated during precopy (RAM).
> -     *                Some must be migrated after source stops
> -     *                (block-dirty-bitmap)
> -     */
> -    void (*state_pending_estimate)(void *opaque, uint64_t *must_precopy,
> -                                   uint64_t *can_postcopy);
> -
> -    /**
> -     * @state_pending_exact
> -     *
> -     * This calculates the exact remaining data to transfer
> +     * @save_query_pending
>        *
> -     * Sum of @can_postcopy and @must_postcopy is the whole amount of
> -     * pending data.
> +     * This estimates the remaining data to transfer on the source side.
> +     * It's highly suggested that the module should implement both fastpath
> +     * and slowpath version of it when it can be slow (for more information
> +     * please check pending->fastpath field).
>        *
>        * @opaque: data pointer passed to register_savevm_live()
> -     * @must_precopy: amount of data that must be migrated in precopy
> -     *                or in stopped state, i.e. that must be migrated
> -     *                before target start.
> -     * @can_postcopy: amount of data that can be migrated in postcopy
> -     *                or in stopped state, i.e. after target start.
> -     *                Some can also be migrated during precopy (RAM).
> -     *                Some must be migrated after source stops
> -     *                (block-dirty-bitmap)
> +     * @pending: pointer to a MigPendingData struct
>        */
> -    void (*state_pending_exact)(void *opaque, uint64_t *must_precopy,
> -                                uint64_t *can_postcopy);
> +    void (*save_query_pending)(void *opaque, MigPendingData *pending);
>
>       /**
>        * @load_state
> diff --git a/migration/savevm.h b/migration/savevm.h
> index b3d1e8a13c..b116933bce 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -14,6 +14,8 @@
>   #ifndef MIGRATION_SAVEVM_H
>   #define MIGRATION_SAVEVM_H
>
> +#include "migration/register.h"
> +
>   #define QEMU_VM_FILE_MAGIC           0x5145564d
>   #define QEMU_VM_FILE_VERSION_COMPAT  0x00000002
>   #define QEMU_VM_FILE_VERSION         0x00000003
> @@ -43,6 +45,7 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
>   void qemu_savevm_state_cleanup(void);
>   void qemu_savevm_state_complete_postcopy(QEMUFile *f);
>   int qemu_savevm_state_complete_precopy(MigrationState *s);
> +void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath);
>   void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
>                                        uint64_t *can_postcopy);
>   void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> diff --git a/hw/s390x/s390-stattrib.c b/hw/s390x/s390-stattrib.c
> index d808ece3b9..b1ec51c77a 100644
> --- a/hw/s390x/s390-stattrib.c
> +++ b/hw/s390x/s390-stattrib.c
> @@ -187,15 +187,14 @@ static int cmma_save_setup(QEMUFile *f, void *opaque, Error **errp)
>       return 0;
>   }
>
> -static void cmma_state_pending(void *opaque, uint64_t *must_precopy,
> -                               uint64_t *can_postcopy)
> +static void cmma_state_pending(void *opaque, MigPendingData *pending)
>   {
>       S390StAttribState *sas = S390_STATTRIB(opaque);
>       S390StAttribClass *sac = S390_STATTRIB_GET_CLASS(sas);
>       long long res = sac->get_dirtycount(sas);
>
>       if (res >= 0) {
> -        *must_precopy += res;
> +        pending->precopy_bytes += res;
>       }
>   }
>
> @@ -340,8 +339,7 @@ static SaveVMHandlers savevm_s390_stattrib_handlers = {
>       .save_setup = cmma_save_setup,
>       .save_live_iterate = cmma_save_iterate,
>       .save_complete = cmma_save_complete,
> -    .state_pending_exact = cmma_state_pending,
> -    .state_pending_estimate = cmma_state_pending,
> +    .save_query_pending = cmma_state_pending,
>       .save_cleanup = cmma_save_cleanup,
>       .load_state = cmma_load,
>       .is_active = cmma_active,
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 827d3ded63..c054c749b0 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -570,42 +570,51 @@ static void vfio_save_cleanup(void *opaque)
>       trace_vfio_save_cleanup(vbasedev->name);
>   }
>
> -static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
> -                                        uint64_t *can_postcopy)
> +static void vfio_state_pending_sync(VFIODevice *vbasedev)
>   {
> -    VFIODevice *vbasedev = opaque;
>       VFIOMigration *migration = vbasedev->migration;
>
> -    if (!vfio_device_state_is_precopy(vbasedev)) {
> -        return;
> -    }
> +    vfio_query_stop_copy_size(vbasedev);
>
> -    *must_precopy +=
> -        migration->precopy_init_size + migration->precopy_dirty_size;
> +    if (vfio_device_state_is_precopy(vbasedev)) {
> +        vfio_query_precopy_size(migration);
> +    }
>
> -    trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
> -                                      *can_postcopy,
> -                                      migration->precopy_init_size,
> -                                      migration->precopy_dirty_size);
> +    /*
> +     * In all cases, all PRECOPY data should be no more than STOPCOPY data.
> +     * Otherwise we have a problem.  So far, only dump some errors.
> +     */

I'm not sure that must be true, as stopcopy and precopy queries are not 
atomic, so device state size can change between the queries.

Thanks.

> +    if (migration->precopy_init_size + migration->precopy_dirty_size <
> +        migration->stopcopy_size) {
> +        error_report_once("%s: wrong pending data (init=%" PRIx64
> +                          ", dirty=%"PRIx64", stop=%"PRIx64")",
> +                          __func__, migration->precopy_init_size,
> +                          migration->precopy_dirty_size,
> +                          migration->stopcopy_size);
> +    }
>   }
>
> -static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
> -                                     uint64_t *can_postcopy)
> +static void vfio_state_pending(void *opaque, MigPendingData *pending)
>   {
>       VFIODevice *vbasedev = opaque;
>       VFIOMigration *migration = vbasedev->migration;
> +    uint64_t remain;
>
> -    vfio_query_stop_copy_size(vbasedev);
> -    *must_precopy += migration->stopcopy_size;
> -
> -    if (vfio_device_state_is_precopy(vbasedev)) {
> -        vfio_query_precopy_size(migration);
> +    if (pending->fastpath) {
> +        if (!vfio_device_state_is_precopy(vbasedev)) {
> +            return;
> +        }
> +        remain = migration->precopy_init_size + migration->precopy_dirty_size;
> +    } else {
> +        vfio_state_pending_sync(vbasedev);
> +        remain = migration->stopcopy_size;
>       }
>
> -    trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
> -                                   migration->stopcopy_size,
> -                                   migration->precopy_init_size,
> -                                   migration->precopy_dirty_size);
> +    pending->precopy_bytes += remain;
> +
> +    trace_vfio_state_pending(vbasedev->name, migration->stopcopy_size,
> +                             migration->precopy_init_size,
> +                             migration->precopy_dirty_size);
>   }
>
>   static bool vfio_is_active_iterate(void *opaque)
> @@ -850,8 +859,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>       .save_prepare = vfio_save_prepare,
>       .save_setup = vfio_save_setup,
>       .save_cleanup = vfio_save_cleanup,
> -    .state_pending_estimate = vfio_state_pending_estimate,
> -    .state_pending_exact = vfio_state_pending_exact,
> +    .save_query_pending = vfio_state_pending,
>       .is_active_iterate = vfio_is_active_iterate,
>       .save_live_iterate = vfio_save_iterate,
>       .save_complete = vfio_save_complete_precopy,
> diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
> index a061aad817..376a9b43ac 100644
> --- a/migration/block-dirty-bitmap.c
> +++ b/migration/block-dirty-bitmap.c
> @@ -766,9 +766,7 @@ static int dirty_bitmap_save_complete(QEMUFile *f, void *opaque)
>       return 0;
>   }
>
> -static void dirty_bitmap_state_pending(void *opaque,
> -                                       uint64_t *must_precopy,
> -                                       uint64_t *can_postcopy)
> +static void dirty_bitmap_state_pending(void *opaque, MigPendingData *data)
>   {
>       DBMSaveState *s = &((DBMState *)opaque)->save;
>       SaveBitmapState *dbms;
> @@ -788,7 +786,7 @@ static void dirty_bitmap_state_pending(void *opaque,
>
>       trace_dirty_bitmap_state_pending(pending);
>
> -    *can_postcopy += pending;
> +    data->postcopy_bytes += pending;
>   }
>
>   /* First occurrence of this bitmap. It should be created if doesn't exist */
> @@ -1250,8 +1248,7 @@ static SaveVMHandlers savevm_dirty_bitmap_handlers = {
>       .save_setup = dirty_bitmap_save_setup,
>       .save_complete = dirty_bitmap_save_complete,
>       .has_postcopy = dirty_bitmap_has_postcopy,
> -    .state_pending_exact = dirty_bitmap_state_pending,
> -    .state_pending_estimate = dirty_bitmap_state_pending,
> +    .save_query_pending = dirty_bitmap_state_pending,
>       .save_live_iterate = dirty_bitmap_save_iterate,
>       .is_active_iterate = dirty_bitmap_is_active_iterate,
>       .load_state = dirty_bitmap_load,
> diff --git a/migration/ram.c b/migration/ram.c
> index 979751f61b..89f761a471 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3443,30 +3443,17 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>       return qemu_fflush(f);
>   }
>
> -static void ram_state_pending_estimate(void *opaque, uint64_t *must_precopy,
> -                                       uint64_t *can_postcopy)
> -{
> -    RAMState **temp = opaque;
> -    RAMState *rs = *temp;
> -
> -    uint64_t remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
> -
> -    if (migrate_postcopy_ram()) {
> -        /* We can do postcopy, and all the data is postcopiable */
> -        *can_postcopy += remaining_size;
> -    } else {
> -        *must_precopy += remaining_size;
> -    }
> -}
> -
> -static void ram_state_pending_exact(void *opaque, uint64_t *must_precopy,
> -                                    uint64_t *can_postcopy)
> +static void ram_state_pending(void *opaque, MigPendingData *pending)
>   {
>       RAMState **temp = opaque;
>       RAMState *rs = *temp;
>       uint64_t remaining_size;
>
> -    if (!migration_in_postcopy()) {
> +    /*
> +     * Sync is only needed either with: (1) a fast query, or (2) postcopy
> +     * as started (in which case no new dirty will generate anymore).
> +     */
> +    if (!pending->fastpath && !migration_in_postcopy()) {
>           bql_lock();
>           WITH_RCU_READ_LOCK_GUARD() {
>               migration_bitmap_sync_precopy(false);
> @@ -3478,9 +3465,9 @@ static void ram_state_pending_exact(void *opaque, uint64_t *must_precopy,
>
>       if (migrate_postcopy_ram()) {
>           /* We can do postcopy, and all the data is postcopiable */
> -        *can_postcopy += remaining_size;
> +        pending->postcopy_bytes += remaining_size;
>       } else {
> -        *must_precopy += remaining_size;
> +        pending->precopy_bytes += remaining_size;
>       }
>   }
>
> @@ -4703,8 +4690,7 @@ static SaveVMHandlers savevm_ram_handlers = {
>       .save_live_iterate = ram_save_iterate,
>       .save_complete = ram_save_complete,
>       .has_postcopy = ram_has_postcopy,
> -    .state_pending_exact = ram_state_pending_exact,
> -    .state_pending_estimate = ram_state_pending_estimate,
> +    .save_query_pending = ram_state_pending,
>       .load_state = ram_load,
>       .save_cleanup = ram_save_cleanup,
>       .load_setup = ram_load_setup,
> diff --git a/migration/savevm.c b/migration/savevm.c
> index dd58f2a705..6268e68382 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1762,46 +1762,45 @@ int qemu_savevm_state_complete_precopy(MigrationState *s)
>       return qemu_fflush(f);
>   }
>
> -/* Give an estimate of the amount left to be transferred,
> - * the result is split into the amount for units that can and
> - * for units that can't do postcopy.
> - */
> -void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> -                                        uint64_t *can_postcopy)
> +void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath)
>   {
>       SaveStateEntry *se;
>
> -    *must_precopy = 0;
> -    *can_postcopy = 0;
> +    pending->precopy_bytes = 0;
> +    pending->postcopy_bytes = 0;
> +    pending->fastpath = fastpath;
>
>       QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> -        if (!se->ops || !se->ops->state_pending_estimate) {
> +        if (!se->ops || !se->ops->save_query_pending) {
>               continue;
>           }
>           if (!qemu_savevm_state_active(se)) {
>               continue;
>           }
> -        se->ops->state_pending_estimate(se->opaque, must_precopy, can_postcopy);
> +        se->ops->save_query_pending(se->opaque, pending);
>       }
>   }
>
> +void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> +                                        uint64_t *can_postcopy)
> +{
> +    MigPendingData pending;
> +
> +    qemu_savevm_query_pending(&pending, true);
> +
> +    *must_precopy = pending.precopy_bytes;
> +    *can_postcopy = pending.postcopy_bytes;
> +}
> +
>   void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
>                                        uint64_t *can_postcopy)
>   {
> -    SaveStateEntry *se;
> +    MigPendingData pending;
>
> -    *must_precopy = 0;
> -    *can_postcopy = 0;
> +    qemu_savevm_query_pending(&pending, false);
>
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> -        if (!se->ops || !se->ops->state_pending_exact) {
> -            continue;
> -        }
> -        if (!qemu_savevm_state_active(se)) {
> -            continue;
> -        }
> -        se->ops->state_pending_exact(se->opaque, must_precopy, can_postcopy);
> -    }
> +    *must_precopy = pending.precopy_bytes;
> +    *can_postcopy = pending.postcopy_bytes;
>   }
>
>   void qemu_savevm_state_cleanup(void)
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 846e3625c5..7cf5a9eb2d 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -173,8 +173,7 @@ vfio_save_device_config_state(const char *name) " (%s)"
>   vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size %"PRIu64" precopy dirty size %"PRIu64
>   vfio_save_iterate_start(const char *name) " (%s)"
>   vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size %"PRIu64
> -vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy %"PRIu64" postcopy %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64
> -vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy %"PRIu64" postcopy %"PRIu64" stopcopy size %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64
> +vfio_state_pending(const char *name, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) stopcopy size %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64
>   vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
>   vfio_vmstate_change_prepare(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
>
> --
> 2.50.1
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC 07/12] migration: Introduce stopcopy_bytes in save_query_pending()
  2026-03-19 23:12 ` [PATCH RFC 07/12] migration: Introduce stopcopy_bytes in save_query_pending() Peter Xu
  2026-03-24 11:05   ` Prasad Pandit
@ 2026-03-25 16:54   ` Avihai Horon
  1 sibling, 0 replies; 28+ messages in thread
From: Avihai Horon @ 2026-03-25 16:54 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
	Markus Armbruster, Cédric Le Goater


On 3/20/2026 1:12 AM, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> Allow modules to report data that can only be migrated after VM is stopped.
>
> When this concept is introduced, we will need to account stopcopy size to
> be part of pending_size as before.
>
> One thing to mention is, when there can be stopcopy size, it means the old
> "pending_size" may not always be able to reach low enough to kickoff an
> slow version of query sync.  While it used to be almost guaranteed to
> happen because if we keep iterating, normally pending_size can go to zero
> for precopy-only because we assume everything reported can be migrated in
> precopy phase.
>
> So we need to make sure QEMU will kickoff a synchronized version of query
> pending when all precopy data is migrated too.  This might be important to
> VFIO to keep making progress even if the downtime cannot yet be satisfied.
>
> So far, this patch should introduce no functional change, as no module yet
> report stopcopy size.
>
> This will pave way for VFIO to properly report its pending data sizes,
> which was actually buggy today.  Will be done in follow up patches.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   include/migration/register.h | 12 +++++++++
>   migration/migration.c        | 52 ++++++++++++++++++++++++++++++------
>   migration/savevm.c           |  7 +++--
>   migration/trace-events       |  2 +-
>   4 files changed, 62 insertions(+), 11 deletions(-)
>
> diff --git a/include/migration/register.h b/include/migration/register.h
> index 2320c3a981..3824958ba5 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -17,12 +17,24 @@
>   #include "hw/core/vmstate-if.h"
>
>   typedef struct MigPendingData {
> +    /*
> +     * Modules can only update these fields in a query request via its
> +     * save_query_pending() API.
> +     */

Move comment to patch #5?

>       /* How many bytes are pending for precopy / stopcopy? */
>       uint64_t precopy_bytes;
>       /* How many bytes are pending that can be transferred in postcopy? */
>       uint64_t postcopy_bytes;
> +    /* How many bytes that can only be transferred when VM stopped? */
> +    uint64_t stopcopy_bytes;

Keep consistent phrasing?

/* Amount of pending bytes that can be transferred either in precopy or 
stopcopy */
uint64_t precopy_bytes;
/* Amount of pending bytes that can be transferred in postcopy */
uint64_t postcopy_bytes;
/* Amount of pending bytes that can be transferred only in stopcopy */
uint64_t stopcopy_bytes;

> +
> +    /*
> +     * Modules should never update these fields.
> +     */

Move comment to patch #5?

>       /* Is this a fastpath query (which can be inaccurate)? */
>       bool fastpath;
> +    /* Total pending data */
> +    uint64_t total_bytes;
>   } MigPendingData ;
>
>   /**
> diff --git a/migration/migration.c b/migration/migration.c
> index 99c4d09000..42facb16d1 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3198,6 +3198,44 @@ typedef enum {
>       MIG_ITERATE_BREAK,          /* Break the loop */
>   } MigIterateState;
>
> +/* Are we ready to move to the next iteration phase? */
> +static bool migration_iteration_next_ready(MigrationState *s,
> +                                           MigPendingData *pending)
> +{
> +    /*
> +     * If the estimated values already suggest us to switchover, mark this
> +     * iteration finished, time to do a slow sync.
> +     */
> +    if (pending->total_bytes <= s->threshold_size) {
> +        return true;
> +    }
> +
> +    /*
> +     * Since we may have modules reporting stop-only data, we also want to
> +     * re-query with slow mode if all precopy data is moved over.  This
> +     * will also mark the current iteration done.
> +     *
> +     * This could happen when e.g. a module (like, VFIO) reports stopcopy
> +     * size too large so it will never yet satisfy the downtime with the
> +     * current setup (above check).  Here, slow version of re-query helps
> +     * because we keep trying the best to move whatever we have.
> +     */
> +    if (pending->precopy_bytes == 0) {
> +        return true;
> +    }
> +
> +    return false;
> +}
> +
> +static void migration_iteration_go_next(MigPendingData *pending)
> +{
> +    /*
> +     * Do a slow sync will achieve this.  TODO: move RAM iteration code
> +     * into the core layer.
> +     */
> +    qemu_savevm_query_pending(pending, false);
> +}

I think the iteration terminology here could be confusing, because these 
two functions are called from migration_iteration_run() and they don't 
refer to the same iteration concept.
How about migration_pass_next_ready/migration_pass_go_next?
Or migration_dirty_sync_ready/migration_dirty_sync?

Thanks.

> +
>   /*
>    * Return true if continue to the next iteration directly, false
>    * otherwise.
> @@ -3209,12 +3247,10 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>                           s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
>       bool can_switchover = migration_can_switchover(s);
>       MigPendingData pending = { };
> -    uint64_t pending_size;
>       bool complete_ready;
>
>       /* Fast path - get the estimated amount of pending data */
>       qemu_savevm_query_pending(&pending, true);
> -    pending_size = pending.precopy_bytes + pending.postcopy_bytes;
>
>       if (in_postcopy) {
>           /*
> @@ -3222,7 +3258,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>            * postcopy completion doesn't rely on can_switchover, because when
>            * POSTCOPY_ACTIVE it means switchover already happened.
>            */
> -        complete_ready = !pending_size;
> +        complete_ready = !pending.total_bytes;
>           if (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE &&
>               (s->postcopy_package_loaded || complete_ready)) {
>               /*
> @@ -3242,9 +3278,8 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>            * postcopy started, so ESTIMATE should always match with EXACT
>            * during postcopy phase.
>            */
> -        if (pending_size <= s->threshold_size) {
> -            qemu_savevm_query_pending(&pending, false);
> -            pending_size = pending.precopy_bytes + pending.postcopy_bytes;
> +        if (migration_iteration_next_ready(s, &pending)) {
> +            migration_iteration_go_next(&pending);
>           }
>
>           /* Should we switch to postcopy now? */
> @@ -3264,11 +3299,12 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>            * (2) Pending size is no more than the threshold specified
>            *     (which was calculated from expected downtime)
>            */
> -        complete_ready = can_switchover && (pending_size <= s->threshold_size);
> +        complete_ready = can_switchover &&
> +            (pending.total_bytes <= s->threshold_size);
>       }
>
>       if (complete_ready) {
> -        trace_migration_thread_low_pending(pending_size);
> +        trace_migration_thread_low_pending(pending.total_bytes);
>           migration_completion(s);
>           return MIG_ITERATE_BREAK;
>       }
> diff --git a/migration/savevm.c b/migration/savevm.c
> index b3285d480f..812c72b3e5 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1766,8 +1766,7 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath)
>   {
>       SaveStateEntry *se;
>
> -    pending->precopy_bytes = 0;
> -    pending->postcopy_bytes = 0;
> +    memset(pending, 0, sizeof(*pending));
>       pending->fastpath = fastpath;
>
>       QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> @@ -1780,7 +1779,11 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath)
>           se->ops->save_query_pending(se->opaque, pending);
>       }
>
> +    pending->total_bytes = pending->precopy_bytes +
> +        pending->stopcopy_bytes + pending->postcopy_bytes;
> +
>       trace_qemu_savevm_query_pending(fastpath, pending->precopy_bytes,
> +                                    pending->stopcopy_bytes,
>                                       pending->postcopy_bytes);
>   }
>
> diff --git a/migration/trace-events b/migration/trace-events
> index 5f836a8652..175f09f8ad 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -7,7 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
>   qemu_loadvm_state_post_main(int ret) "%d"
>   qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
>   qemu_savevm_send_packaged(void) ""
> -qemu_savevm_query_pending(bool fast, uint64_t precopy, uint64_t postcopy) "fast=%d, precopy=%"PRIu64", postcopy=%"PRIu64
> +qemu_savevm_query_pending(bool fast, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy) "fast=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64
>   loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
>   loadvm_state_setup(void) ""
>   loadvm_state_cleanup(void) ""
> --
> 2.50.1
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC 08/12] vfio/migration: Fix incorrect reporting for VFIO pending data
  2026-03-19 23:12 ` [PATCH RFC 08/12] vfio/migration: Fix incorrect reporting for VFIO pending data Peter Xu
@ 2026-03-25 17:32   ` Avihai Horon
  0 siblings, 0 replies; 28+ messages in thread
From: Avihai Horon @ 2026-03-25 17:32 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
	Daniel P . Berrangé, Joao Martins, Alex Williamson,
	Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
	Markus Armbruster, Cédric Le Goater


On 3/20/2026 1:12 AM, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> VFIO used to report different things in its fast/slow version of query
> pending results.  It was likely because it wants to make sure precopy data
> can reach 0 hence trigger sync queries.

It was to guarantee precopy data can reach 0 and trigger dirty sync 
queries, since not all VFIO device data may be precopy-able.

>
> Now with stopcopy size reporting facility it doesn't need this hack
> anymore.  Fix this.

Looks good, now the fast/slow path are consistent, plus we get the 
benefit that if VFIO device has much stopcopy data, migration will try 
to reduce RAM/other iterative devices to the minimum continuously 
instead of jumping from fast to slow path as it currently does.

I still want to test this later with a few workloads, just to make sure 
we don't miss anything here.

>
> Copy stable might be too much; just skip it and skip the Fixes.
>
> Cc: Avihai Horon <avihaih@nvidia.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   hw/vfio/migration.c | 11 +++++++----
>   1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index c054c749b0..9dbe5ad9e9 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -591,6 +591,10 @@ static void vfio_state_pending_sync(VFIODevice *vbasedev)
>                             __func__, migration->precopy_init_size,
>                             migration->precopy_dirty_size,
>                             migration->stopcopy_size);
> +        migration->stopcopy_size = 0;
> +    } else {
> +        migration->stopcopy_size -=
> +            (migration->precopy_init_size + migration->precopy_dirty_size);

Query of stopcopy and precopy are not atomic, so we better use MIN() here:

migration->stopcopy_size -= MIN(migration->stopcopy_size, 
migration->precopy_init_size + migration->precopy_dirty_size);

>       }
>   }
>
> @@ -598,19 +602,18 @@ static void vfio_state_pending(void *opaque, MigPendingData *pending)
>   {
>       VFIODevice *vbasedev = opaque;
>       VFIOMigration *migration = vbasedev->migration;
> -    uint64_t remain;
>
>       if (pending->fastpath) {
>           if (!vfio_device_state_is_precopy(vbasedev)) {
>               return;
>           }
> -        remain = migration->precopy_init_size + migration->precopy_dirty_size;
>       } else {
>           vfio_state_pending_sync(vbasedev);
> -        remain = migration->stopcopy_size;
>       }
>
> -    pending->precopy_bytes += remain;
> +    pending->precopy_bytes +=
> +        migration->precopy_init_size + migration->precopy_dirty_size;
> +    pending->stopcopy_bytes += migration->stopcopy_size;

Now that migration->stopcopy_size holds only the stopcopy size (and not 
stopcopy+precopy), we should remove these lines from 
vfio_update_estimated_pending_data():

     /* The total size remaining requires separate accounting */
     migration->stopcopy_size -= data_size;

Thanks.

>
>       trace_vfio_state_pending(vbasedev->name, migration->stopcopy_size,
>                                migration->precopy_init_size,
> --
> 2.50.1
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2026-03-25 17:32 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-19 23:12 [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
2026-03-19 23:12 ` [PATCH RFC 01/12] migration: Fix low possibility downtime violation Peter Xu
2026-03-20 12:26   ` Prasad Pandit
2026-03-19 23:12 ` [PATCH RFC 02/12] migration/qapi: Rename MigrationStats to MigrationRAMStats Peter Xu
2026-03-19 23:26   ` Peter Xu
2026-03-20  6:54   ` Markus Armbruster
2026-03-19 23:12 ` [PATCH RFC 03/12] vfio/migration: Throttle vfio_save_block() on data size to read Peter Xu
2026-03-25 14:10   ` Avihai Horon
2026-03-19 23:12 ` [PATCH RFC 04/12] vfio/migration: Cache stop size in VFIOMigration Peter Xu
2026-03-25 14:15   ` Avihai Horon
2026-03-19 23:12 ` [PATCH RFC 05/12] migration/treewide: Merge @state_pending_{exact|estimate} APIs Peter Xu
2026-03-24 10:35   ` Prasad Pandit
2026-03-25 15:20   ` Avihai Horon
2026-03-19 23:12 ` [PATCH RFC 06/12] migration: Use the new save_query_pending() API directly Peter Xu
2026-03-24  9:35   ` Prasad Pandit
2026-03-19 23:12 ` [PATCH RFC 07/12] migration: Introduce stopcopy_bytes in save_query_pending() Peter Xu
2026-03-24 11:05   ` Prasad Pandit
2026-03-25 16:54   ` Avihai Horon
2026-03-19 23:12 ` [PATCH RFC 08/12] vfio/migration: Fix incorrect reporting for VFIO pending data Peter Xu
2026-03-25 17:32   ` Avihai Horon
2026-03-19 23:12 ` [PATCH RFC 09/12] migration: Make iteration counter out of RAM Peter Xu
2026-03-20  6:12   ` Yong Huang
2026-03-20  9:49   ` Prasad Pandit
2026-03-19 23:13 ` [PATCH RFC 10/12] migration: Introduce a helper to return switchover bw estimate Peter Xu
2026-03-23 10:26   ` Prasad Pandit
2026-03-19 23:13 ` [PATCH RFC 11/12] migration: Calculate expected downtime on demand Peter Xu
2026-03-19 23:13 ` [PATCH RFC 12/12] migration: Fix calculation of expected_downtime to take VFIO info Peter Xu
2026-03-23 12:05   ` Prasad Pandit

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox